quantulum3

Name	quantulum3
Version	0.9.2
home_page	https://github.com/nielstron/quantulum3
Summary	Extract quantities from unstructured text.
upload_time	2024-06-25 14:23:11
maintainer	None
docs_url	None
author	Marco Lagi, nielstron, sohrabtowfighi, grhawk and Rodrigo Castro
requires_python	None
license	MIT
keywords	information extraction quantities units measurements nlp natural language processing text mining text processing
requirements	No requirements were recorded.

# quantulum3

 [![Travis master build state](https://app.travis-ci.com/nielstron/quantulum3.svg?branch=master "Travis master build state")](https://app.travis-ci.com/nielstron/quantulum3)
 [![Coverage Status](https://coveralls.io/repos/github/nielstron/quantulum3/badge.svg?branch=master)](https://coveralls.io/github/nielstron/quantulum3?branch=master)
 [![PyPI version](https://badge.fury.io/py/quantulum3.svg)](https://pypi.org/project/quantulum3/)
 ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/quantulum3.svg)
 [![PyPI - Status](https://img.shields.io/pypi/status/quantulum3.svg)](https://pypi.org/project/quantulum3/)

Python library for extracting quantities, measurements
and their units from unstructured text. It can disambiguate between similar-looking
units based on their *k-nearest neighbours* in their [GloVe](https://nlp.stanford.edu/projects/glove/) vector representation
and their [Wikipedia](https://en.wikipedia.org/) page.

This is the Python 3 compatible fork of [recastrodiaz's
fork](https://github.com/recastrodiaz/quantulum) of [grhawk's
fork](https://github.com/grhawk/quantulum) of [the original by Marco
Lagi](https://github.com/marcolagi/quantulum).
The compatibility with the newest version of sklearn is based on
the fork of [sohrabtowfighi](https://github.com/sohrabtowfighi/quantulum).

## User Guide

### Installation

```bash
pip install quantulum3
```

To install dependencies for using or training the disambiguation classifier, use

```bash
pip install quantulum3[classifier]
```

The disambiguation classifier is used when the parser finds two or more units that match the text.

### Usage

```pycon
>>> from quantulum3 import parser
>>> quants = parser.parse('I want 2 liters of wine')
>>> quants
[Quantity(2, 'litre')]
```

The *Quantity* class stores the surface of the original text it was
extracted from, as well as the (start, end) positions of the match:

```pycon
>>> quants[0].surface
'2 liters'
>>> quants[0].span
(7, 15)
```
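
The span can be used to recover the surface from the input. The sketch below assumes, as the sample values above suggest, that the positions are zero-based with an exclusive end, so they work directly as a slice. The `mark` helper is hypothetical and not part of quantulum3; the text and span are hard-coded from the example so the snippet runs without the library.

```python
def mark(text, span):
    """Wrap the characters covered by span in square brackets."""
    start, end = span
    return text[:start] + "[" + text[start:end] + "]" + text[end:]


text = "I want 2 liters of wine"
span = (7, 15)  # value shown above for quants[0].span

# Slicing the input with the span gives back the matched surface.
surface = text[span[0]:span[1]]
marked = mark(text, span)
print(surface)
print(marked)
```

This kind of slicing is handy when highlighting matches in a user interface or when checking that a match covers what you expect.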

The *value* attribute provides the parsed numeric value and the *unit.name*
attribute provides the name of the parsed unit:

```pycon
>>> quants[0].value
2.0
>>> quants[0].unit.name
'litre'
```

An inline parser that embeds the parsed quantities in the text is also
available (especially useful for debugging):

```pycon
>>> print(parser.inline_parse('I want 2 liters of wine'))
I want 2 liters {Quantity(2, "litre")} of wine
```

As the parser is also able to parse dimensionless numbers,
this library can also be used for simple number extraction.

```pycon
>>> print(parser.parse('I want two'))
[Quantity(2, 'dimensionless')]
```

### Units and entities

All units (e.g. *litre*) and the entities they are associated with (e.g.
*volume*) are reconciled against Wikipedia:

```pycon
>>> quants[0].unit
Unit(name="litre", entity=Entity("volume"), uri=https://en.wikipedia.org/wiki/Litre)

>>> quants[0].unit.entity
Entity(name="volume", uri=https://en.wikipedia.org/wiki/Volume)
```

This library includes more than 290 units and 75 entities. It also
parses spelled-out numbers, ranges and uncertainties:

```pycon
>>> parser.parse('I want a gallon of beer')
[Quantity(1, 'gallon')]

>>> parser.parse('The LHC smashes proton beams at 12.8–13.0 TeV')
[Quantity(12.8, "teraelectronvolt"), Quantity(13, "teraelectronvolt")]

>>> quant = parser.parse('The LHC smashes proton beams at 12.9±0.1 TeV')
>>> quant[0].uncertainty
0.1
```
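
A value together with its uncertainty describes an interval. The following is a minimal sketch with a hypothetical `bounds` helper (not part of quantulum3), using the numbers from the example above hard-coded. It shows that 12.9±0.1 covers the same range as the 12.8–13.0 example; the comparison uses `math.isclose` because the subtraction is done in floating point.

```python
import math


def bounds(value, uncertainty):
    """Return the (low, high) interval implied by value ± uncertainty."""
    return value - uncertainty, value + uncertainty


low, high = bounds(12.9, 0.1)  # numbers taken from the example above
print(math.isclose(low, 12.8), math.isclose(high, 13.0))
```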

Non-standard units usually don't have a Wikipedia page. The parser will
still try to guess their underlying entity based on their
dimensionality:

```pycon
>>> parser.parse('Sound travels at 0.34 km/s')[0].unit
Unit(name="kilometre per second", entity=Entity("speed"), uri=None)
```
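
The idea of guessing an entity from dimensionality can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `UNIT_ENTITY`, `ENTITY_DIMENSIONS` and `guess_entity` are hypothetical names, and the tables are toy data loosely modelled on the file format described in the Developer Guide.

```python
# Toy tables for illustration only; they are NOT the library's real data.
UNIT_ENTITY = {"kilometre": "length", "metre": "length", "second": "time"}
ENTITY_DIMENSIONS = {
    "speed": {"length": 1, "time": -1},
    "acceleration": {"length": 1, "time": -2},
}


def guess_entity(unit_dimensions):
    """Map unit-level dimensions to an entity name, or None if unknown."""
    combined = {}
    for dim in unit_dimensions:
        base_entity = UNIT_ENTITY[dim["base"]]
        combined[base_entity] = combined.get(base_entity, 0) + dim["power"]
    # Drop bases that cancelled out (power 0) before comparing.
    combined = {base: power for base, power in combined.items() if power}
    for name, dims in ENTITY_DIMENSIONS.items():
        if dims == combined:
            return name
    return None


km_per_s = [{"base": "kilometre", "power": 1}, {"base": "second", "power": -1}]
print(guess_entity(km_per_s))
```

Each unit is first replaced by the entity it measures, the powers are summed per entity, and the result is compared with the known entity dimensions.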

### Export/Import

Entities, Units and Quantities can be exported to dictionaries and JSON strings:

```pycon
>>> quant = parser.parse('I want 2 liters of wine')
>>> quant[0].to_dict()
{'value': 2.0, 'unit': 'litre', 'entity': 'volume', 'surface': '2 liters', 'span': (7, 15), 'uncertainty': None, 'lang': 'en_US'}
>>> quant[0].to_json()
'{"value": 2.0, "unit": "litre", "entity": "volume", "surface": "2 liters", "span": [7, 15], "uncertainty": null, "lang": "en_US"}'
```
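
Note that the span is a tuple in the dictionary but a list in the JSON string, and `None` becomes `null`. The sketch below reproduces this with the standard `json` module only, using a dictionary copied from the output above; it is not the library's own serializer, just a way to see what happens on a round trip.

```python
import json

# Dictionary copied from the to_dict() output shown above.
record = {
    "value": 2.0,
    "unit": "litre",
    "entity": "volume",
    "surface": "2 liters",
    "span": (7, 15),
    "uncertainty": None,
    "lang": "en_US",
}

encoded = json.dumps(record)
decoded = json.loads(encoded)

# JSON has no tuple type, so the span comes back as a list.
print(decoded["span"])

# Convert it back if the rest of the code expects a tuple.
decoded["span"] = tuple(decoded["span"])
print(decoded == record)
```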

By default, only the unit and entity names are included in the exported dictionary, but the full unit and entity dictionaries can be included as well:

```pycon
>>> quant = parser.parse('I want 2 liters of wine')
>>> quant[0].to_dict(include_unit_dict=True, include_entity_dict=True)  # same args apply to .to_json()
{'value': 2.0, 'unit': {'name': 'litre', 'surfaces': ['cubic decimetre', 'cubic decimeter', 'litre', 'liter'], 'entity': {'name': 'volume', 'dimensions': [{'base': 'length', 'power': 3}], 'uri': 'Volume'}, 'uri': 'Litre', 'symbols': ['l', 'L', 'ltr', 'ℓ'], 'dimensions': [{'base': 'decimetre', 'power': 3}], 'original_dimensions': [{'base': 'litre', 'power': 1, 'surface': 'liters'}], 'currency_code': None, 'lang': 'en_US'}, 'entity': 'volume', 'surface': '2 liters', 'span': (7, 15), 'uncertainty': None, 'lang': 'en_US'}
```

Similar export syntax applies to exporting Unit and Entity objects.

You can import Entity, Unit and Quantity objects from dictionaries and JSON. This requires that the object was exported with `include_unit_dict=True` and `include_entity_dict=True` (as appropriate):

```pycon
>>> from quantulum3.classes import Entity, Quantity
>>> quant_dict = quant[0].to_dict(include_unit_dict=True, include_entity_dict=True)
>>> quant = Quantity.from_dict(quant_dict)
>>> ent_json = '{"name": "volume", "dimensions": [{"base": "length", "power": 3}], "uri": "Volume"}'
>>> ent = Entity.from_json(ent_json)
```

### Disambiguation

If the parser detects an ambiguity, a classifier based on the Wikipedia
pages of the ambiguous units or entities tries to guess the right one:

```pycon
>>> parser.parse('I spent 20 pounds on this!')
[Quantity(20, "pound sterling")]

>>> parser.parse('It weighs no more than 20 pounds')
[Quantity(20, "pound-mass")]
```

or:

```pycon
>>> text = 'The average density of the Earth is about 5.5x10-3 kg/cm³'
>>> parser.parse(text)[0].unit.entity
Entity(name="density", uri=https://en.wikipedia.org/wiki/Density)

>>> text = 'The amount of O₂ is 2.98e-4 kg per liter of atmosphere'
>>> parser.parse(text)[0].unit.entity
Entity(name="concentration", uri=https://en.wikipedia.org/wiki/Concentration)
```

In addition, the classifier is trained on the words most similar to
all of the units' surfaces, according to their distance in [GloVe](https://nlp.stanford.edu/projects/glove/)
vector representation.

### Spoken version

Quantulum classes include methods to convert them to a speakable form.

```pycon
>>> parser.parse("Gimme 10e9 GW now!")[0].to_spoken()
'ten billion gigawatts'
>>> parser.inline_parse_and_expand("Gimme $1e10 now and also 1 TW and 0.5 J!")
'Gimme ten billion dollars now and also one terawatt and zero point five joules!'
```



### Manipulation

While quantities cannot be manipulated within this library, there are
many great options out there:

- [pint](https://pint.readthedocs.org/en/latest/)
- [natu](http://kdavies4.github.io/natu/)
- [quantities](http://python-quantities.readthedocs.org/en/latest/)

## Extension

### Training the classifier

If you want to train the classifier yourself, you will need the dependencies for the classifier (see installation).

Use `quantulum3-training` on the command line, the script `quantulum3/scripts/train.py` or the method `train_classifier` in `quantulum3.classifier` to train the classifier.

```bash
quantulum3-training --lang <language> --data <path/to/training/file.json> --output <path/to/output/file.joblib>
```

You can pass multiple training files to the training command. The output is in joblib format.

To use your custom model, pass the path to the trained model file to the
parser:

```python
from quantulum3 import parser
quants = parser.parse("<text>", classifier_path="path/to/model.joblib")
```

Example training files can be found in `quantulum3/_lang/<language>/train`.

If you want to create a new or different `similars.json`, install `pymagnitude`.

For the extraction of nearest neighbours from a vector word representation file, 
use `scripts/extract_vere.py`. It automatically extracts the `k` nearest neighbours
in vector space of the vector representation for each of the possible surfaces
of the ambiguous units. The resulting neighbours are stored in `quantulum3/similars.json`
and automatically included for training.
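
To make the nearest-neighbour step concrete, here is a toy sketch of a `k`-nearest-neighbour lookup by cosine similarity. It is not what `scripts/extract_vere.py` does internally; the vectors are made up, and `VECTORS`, `cosine` and `nearest` are hypothetical names used only for this illustration.

```python
import heapq
import math

# Made-up 3-dimensional vectors for illustration only; real embeddings
# such as GloVe have far more dimensions.
VECTORS = {
    "pound": [0.9, 0.8, 0.1],
    "sterling": [0.1, 0.9, 0.0],
    "kilogram": [0.9, 0.1, 0.2],
    "weigh": [0.8, 0.2, 0.3],
    "banana": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word, k):
    """Return the k words whose vectors are most similar to word's vector."""
    target = VECTORS[word]
    others = (w for w in VECTORS if w != word)
    return heapq.nlargest(k, others, key=lambda w: cosine(target, VECTORS[w]))


print(nearest("pound", 2))
```

Words that tend to appear in similar contexts end up close together in the vector space, which is why their neighbours are useful extra training data for an ambiguous surface.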

The file provided should be in `.magnitude` format, as other formats are first
converted to a `.magnitude` file on the fly. Check out the
[pre-converted Magnitude word embeddings](https://github.com/plasticityai/magnitude#pre-converted-magnitude-formats-of-popular-embeddings-models)
and [Magnitude](https://github.com/plasticityai/magnitude) for more information.

### Additional units

It is possible to add additional entities and units to be parsed by quantulum. These will be added to the default units and entities. The code below shows an example invocation:

```pycon
>>> from quantulum3.load import add_custom_unit, remove_custom_unit
>>> add_custom_unit(name="schlurp", surfaces=["slp"], entity="dimensionless")
>>> parser.parse("This extremely sharp tool is precise up to 0.5 slp")
[Quantity(0.5, "Unit(name="schlurp", entity=Entity("dimensionless"), uri=None)")]
```

The keyword arguments to the function `add_custom_unit` are directly translated
to the properties of the unit to be created.

### Custom Units and Entities

It is possible to load a completely custom set of units and entities. This can be done by passing a list of file paths to the `load_custom_units` and `load_custom_entities` functions. Loading custom units and entities will replace the default units and entities that are normally loaded.

The recommended way to load quantities is via a context manager:

```pycon
>>> from quantulum3 import load, parser
>>> with load.CustomQuantities(["path/to/units.json"], ["path/to/entities.json"]):
...     parser.parse("This extremely sharp tool is precise up to 0.5 slp")
...
[Quantity(0.5, "Unit(name="schlurp", entity=Entity("dimensionless"), uri=None)")]

>>> # default units and entities are loaded again
```
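
The reason a context manager is convenient here is that the defaults come back even if the body raises. The following is a generic sketch of that pattern with a toy registry. It is not how `load.CustomQuantities` is implemented; `DEFAULT_UNITS`, `active_units` and `custom_units` are hypothetical names.

```python
from contextlib import contextmanager

# Toy stand-in for a unit registry; not quantulum3's real state.
DEFAULT_UNITS = {"litre": "volume", "gallon": "volume"}
active_units = dict(DEFAULT_UNITS)


@contextmanager
def custom_units(replacement):
    """Temporarily replace the active units, restoring defaults on exit."""
    global active_units
    active_units = dict(replacement)
    try:
        yield active_units
    finally:
        # Runs even if the body raises, so the defaults always come back.
        active_units = dict(DEFAULT_UNITS)


with custom_units({"schlurp": "dimensionless"}) as units:
    inside = sorted(units)
after = sorted(active_units)
print(inside, after)
```

With manual loading, the equivalent guarantee requires wrapping the parsing code in your own `try`/`finally` and calling the reset function in the `finally` branch.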

But it is also possible to load custom units and entities manually:

```pycon
>>> from quantulum3 import load, parser

>>> load.load_custom_units(["path/to/units.json"])
>>> load.load_custom_entities(["path/to/entities.json"])
>>> parser.parse("This extremely sharp tool is precise up to 0.5 slp")

[Quantity(0.5, "Unit(name="schlurp", entity=Entity("dimensionless"), uri=None)")]

>>> # remove custom units and entities and load default units and entities
>>> load.reset_quantities()
```

See the Developer Guide below for more information about the format of units and entities files.

## Developer Guide

### Adding Units and Entities

See *units.json* for the complete list of units and *entities.json* for
the complete list of entities. The criteria for adding units have been:

- the unit has (or is redirected to) a Wikipedia page
- the unit is in common use (e.g. not the [premetric Swedish units of
    measurement](https://en.wikipedia.org/wiki/Swedish_units_of_measurement#Length)).

It's easy to extend these two files to the units/entities of interest.
Here is an example of an entry in *entities.json*:

```json
"speed": {
    "dimensions": [{"base": "length", "power": 1}, {"base": "time", "power": -1}],
    "URI": "Speed"
}
```

- The *name* of an entity is its key. Names are required to be unique.
- *URI* is the name of the Wikipedia page of the entity (e.g. `https://en.wikipedia.org/wiki/Speed` => `Speed`).
- *dimensions* is the dimensionality, a list of dictionaries each
    having a *base* (the name of another entity) and a *power* (an
    integer, can be negative).
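
The rules in the list above can be checked mechanically before a file is loaded. Below is a minimal sketch of such a check using only the standard library. The `check_entity` helper is hypothetical and not part of quantulum3, and the entry is wrapped in braces so that it parses as a standalone JSON object.

```python
import json

# The "speed" entry wrapped in braces so it parses as a JSON object.
# The URI holds the page name, following the rule stated in the list above.
ENTITIES_JSON = """
{
    "speed": {
        "dimensions": [{"base": "length", "power": 1}, {"base": "time", "power": -1}],
        "URI": "Speed"
    }
}
"""


def check_entity(name, entry):
    """Return a list of problems found in one entities.json entry."""
    problems = []
    if not isinstance(entry.get("URI"), str):
        problems.append(f"{name}: URI must be a string")
    dimensions = entry.get("dimensions")
    if not isinstance(dimensions, list):
        return problems + [f"{name}: dimensions must be a list"]
    for dim in dimensions:
        if not isinstance(dim, dict):
            problems.append(f"{name}: each dimension must be an object")
            continue
        if not isinstance(dim.get("base"), str):
            problems.append(f"{name}: base must be an entity name")
        # bool is a subclass of int, so exclude it explicitly.
        power = dim.get("power")
        if not isinstance(power, int) or isinstance(power, bool):
            problems.append(f"{name}: power must be an integer")
    return problems


entities = json.loads(ENTITIES_JSON)
for name, entry in entities.items():
    print(name, check_entity(name, entry))
```

Because entity names are the keys of a JSON object, uniqueness is worth checking separately: `json.loads` silently keeps only the last value for a repeated key.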

Here is an example of an entry in *units.json*:

```json
"metre per second": {
    "surfaces": ["metre per second", "meter per second"],
    "entity": "speed",
    "URI": "Metre_per_second",
    "dimensions": [{"base": "metre", "power": 1}, {"base": "second", "power": -1}],
    "symbols": ["mps"]
},
"year": {
    "surfaces": [ "year", "annum" ],
    "entity": "time",
    "URI": "Year",
    "dimensions": [],
    "symbols": [ "a", "y", "yr" ],
    "prefixes": [ "k", "M", "G", "T", "P", "E" ]
}
```

- The *name* of a unit is its key. Names are required to be unique.
- *URI* follows the same scheme as in *entities.json*.
- *surfaces* is a list of strings that refer to that unit. The library
    takes care of plurals, so there is no need to specify them.
- *entity* is the name of an entity in *entities.json*
- *dimensions* follows the same schema as in *entities.json*, but the
    *base* is the name of another unit, not of another entity.
- *symbols* is a list of possible symbols and abbreviations for that
    unit.
- *prefixes* is an optional list. It can contain [Metric](https://en.wikipedia.org/wiki/Metric_prefix) and [Binary prefixes](https://en.wikipedia.org/wiki/Binary_prefix); the
    corresponding prefixed units are generated automatically. If you want to
    add specifics (like different surfaces), you need to create a separate entry for that
    prefixed version.

All fields are case sensitive.
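
As an illustration of what the *prefixes* field implies, the sketch below expands the `year` entry into prefixed variants. How the library actually derives the names, surfaces and symbols of prefixed units is not described here, so the naming scheme, `PREFIX_NAMES` and `expand_prefixes` are assumptions made for this example only.

```python
# Names for the metric prefixes listed in the "year" entry above.
PREFIX_NAMES = {"k": "kilo", "M": "mega", "G": "giga", "T": "tera", "P": "peta", "E": "exa"}


def expand_prefixes(name, entry):
    """Build prefixed variants of a unit entry (illustrative only)."""
    expanded = {}
    for prefix in entry.get("prefixes", []):
        full = PREFIX_NAMES[prefix]
        expanded[full + name] = {
            "surfaces": [full + surface for surface in entry["surfaces"]],
            "symbols": [prefix + symbol for symbol in entry["symbols"]],
            "entity": entry["entity"],
        }
    return expanded


year = {
    "surfaces": ["year", "annum"],
    "entity": "time",
    "symbols": ["a", "y", "yr"],
    "prefixes": ["k", "M", "G", "T", "P", "E"],
}

variants = expand_prefixes("year", year)
print(sorted(variants))
print(variants["kiloyear"]["symbols"])
```

Case sensitivity matters for prefixes in particular: in the metric system `M` stands for mega while `m` stands for milli, so the two must not be confused when filling in the list.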

### Contributing

`dev` build: 

[![Travis dev build state](https://travis-ci.com/nielstron/quantulum3.svg?branch=dev "Travis dev build state")](https://travis-ci.com/nielstron/quantulum3)
[![Coverage Status](https://coveralls.io/repos/github/nielstron/quantulum3/badge.svg?branch=dev)](https://coveralls.io/github/nielstron/quantulum3?branch=dev)

If you'd like to contribute, follow these steps:

1. Clone a fork of this project into your workspace.
2. Run `pip install -e .` at the root of your development folder.
3. Run `pip install pipenv` and `pipenv shell`.
4. Inside the project folder, run `pipenv install --dev`.
5. Make your changes.
6. Run `scripts/format.sh` and `scripts/build.py` from the package root directory.
7. Test your changes with `python3 setup.py test` (optional; the tests also run automatically after pushing).
8. Create a pull request once you have committed and pushed your changes.

### Language support

[![Travis dev build state](https://travis-ci.com/nielstron/quantulum3.svg?branch=language_support "Travis dev build state")](https://travis-ci.com/nielstron/quantulum3)
[![Coverage Status](https://coveralls.io/repos/github/nielstron/quantulum3/badge.svg?branch=language_support)](https://coveralls.io/github/nielstron/quantulum3?branch=language_support)

There is a branch for language support, namely `language_support`.
By inspecting the `README` file in the `_lang` subdirectory and
the functions and values given in the new `_lang.en_US` submodule,
you should be able to create your own language submodules.
New language modules should be picked up automatically and be available
both through the `lang=` keyword argument in the parser functions and
in the automatic unit tests.

No changes outside your own language submodule folder (e.g. `_lang.de_DE`) should
be necessary. If you run into problems implementing a new language, don't hesitate to open an issue.
