texta-mlp

Name: texta-mlp
Version: 1.10.4
Home page: https://git.texta.ee/texta/texta-mlp-python
Summary: TEXTA Multilingual Processor (MLP)
Upload time: 2021-04-01 14:27:45
Author: TEXTA
License: GPLv3
# TEXTA MLP Python package

http://pypi.texta.ee/texta-mlp/

## Installation
### Requirements
`apt-get install python3-lxml`

##### From PyPI
`pip3 install texta-mlp`

##### From Git
`pip3 install git+https://git.texta.ee/texta/texta-mlp-python.git`

### Testing
`python3 -m pytest -v tests`

## Entities
MLP extracts several types of entities from text.

### Model-based Entities
MLP uses Stanza to extract:
* Persons (missing Estonian model)
* Organizations (missing Estonian model)
* Geopolitical entities (missing Estonian model)

### Regex-based Entities
MLP uses regular expressions to extract:
* Phone numbers (regex)
* Email addresses (regex)

### List-based Entities
MLP also supports entity extraction using lists of predefined entities. These lists come with MLP:
* Companies (Estonian)
* Addresses (Estonian and Russian)
* Currencies (Estonian, Russian, and English)

### Custom List-based Entities
MLP also supports defining custom entity lists. Custom lists must be placed in the **entity_mapper** directory residing in the **data** directory.
Entities are defined as JSON files:
```
{
  "MY_ENTITY": [
    "foo",
    "bar"
  ]
}
```
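As a minimal sketch, such a file can be generated with the standard library; the `data/entity_mapper` path below is an assumption about the resource layout and should be adjusted to your actual MLP resource directory:

```python
import json
from pathlib import Path

# Assumed resource layout: the entity_mapper directory inside the data
# directory of your MLP resources (adjust the path as needed).
entity_dir = Path("data") / "entity_mapper"
entity_dir.mkdir(parents=True, exist_ok=True)

# Each file maps a fact name to the list of surface forms to match.
custom_entities = {
    "MY_ENTITY": ["foo", "bar"]
}

with open(entity_dir / "my_entity.json", "w", encoding="utf8") as f:
    json.dump(custom_entities, f, ensure_ascii=False, indent=2)
```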

## Usage

### Load MLP
Supported languages: https://stanzanlp.github.io/stanzanlp/models.html
```
>>> from texta_mlp.mlp import MLP
>>> mlp = MLP(language_codes=["et","en","ru"])
```

### Process & Lemmatize Estonian
```
>>> mlp.process("Selle eestikeelse lausega võiks midagi ehk öelda.")
{'text': {'text': 'Selle eestikeelse lausega võiks midagi ehk öelda .', 'lang': 'et', 'lemmas': 'see eestikeelne lause võima miski ehk ütlema .', 'pos_tags': 'P A S V P J V Z'}, 'texta_facts': []}
>>>
>>> mlp.lemmatize("Selle eestikeelse lausega võiks midagi ehk öelda.")
'see eestikeelne lause võima miski ehk ütlema .'
```

You can use the "analyzers" argument to limit the analysis to the data you actually need, which speeds up processing.
Accepted options are: ["lemmas", "pos_tags", "transliteration", "ner", "contacts", "entity_mapper", "all"],
where "all" runs every analyzer (and takes the most time). By default, this value is "all".

```
>>> mlp.process("Selle eestikeelse lausega võiks midagi ehk öelda.", analyzers=["lemmas", "pos_tags"])
```

### Process & Lemmatize Russian
```
>>> mlp.process("Лукашенко заявил о договоренности Москвы и Минска по нефти.")
{'text': {'text': 'Лукашенко заявил о договоренности Москвы и Минска по нефти .', 'lang': 'ru', 'lemmas': 'лукашенко заявить о договоренность москва и минск по нефть .', 'pos_tags': 'X X X X X X X X X X', 'transliteration': 'Lukašenko zajavil o dogovorennosti Moskvõ i Minska po nefti .'}, 'texta_facts': []}
>>>
>>> mlp.lemmatize("Лукашенко заявил о договоренности Москвы и Минска по нефти.")
'лукашенко заявить о договоренность москва и минск по нефть .'
```

### Process & Lemmatize English
```
>>> mlp.process("Test sencences are rather difficult to come up with.")
{'text': {'text': 'Test sencences are rather difficult to come up with .', 'lang': 'en', 'lemmas': 'Test sencence be rather difficult to come up with .', 'pos_tags': 'NN NNS VBP RB JJ TO VB RB IN .'}, 'texta_facts': []}
>>>
>>> mlp.lemmatize("Test sencences are rather difficult to come up with.")
'Test sencence be rather difficult to come up with .'
```

### Make MLP Throw an Exception on Unknown Languages
By default, MLP falls back to Estonian when the detected language is not supported. To raise an exception instead, provide *use_default_language_code=False* when initializing MLP.
```
>>> mlp.process("المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق. وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضًا بروح الإخاء.")
{'text': {'text': 'المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق . وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضًا بروح الإخاء .', 'lang': 'et', 'lemmas': 'lee 1 يولد جميع الناس leele leele في leele leele . وقد وهبوا عقلاً leele lee أن يعامل بعضهم بعضًا بروح lee .', 'pos_tags': 'S N S S S S S S S S Z S S S S S S S S Y Y Y Z'}, 'texta_facts': []}
>>>
>>> mlp = MLP(language_codes=["et","en","ru"], use_default_language_code=False)
>>> mlp.process("المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق. وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضًا بروح الإخاء.")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rsirel/dev/texta-mlp-package/texta_mlp/mlp.py", line 150, in process
    document = self.generate_document(raw_text, loaded_analyzers)
  File "/home/rsirel/dev/texta-mlp-package/texta_mlp/mlp.py", line 96, in generate_document
    lang = self.detect_language(processed_text)
  File "/home/rsirel/dev/texta-mlp-package/texta_mlp/mlp.py", line 89, in detect_language
    raise LanguageNotSupported("Detected language is not supported: {}.".format(lang))
texta_mlp.exceptions.LanguageNotSupported: Detected language is not supported: ar.
```

### Change Default Language Code
To use some other language as the default, provide *default_language_code* when initializing MLP.
```
>>> mlp = MLP(language_codes=["et", "en", "ru"], default_language_code="en")
>>>
>>> mlp.process("المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق. وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضًا بروح الإخاء.")
{'text': {'text': 'المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق . وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضًا بروح الإخاء .', 'lang': 'en', 'lemmas': 'المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق . وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضًا بروح الإخاء .', 'pos_tags': 'NN CD , NN NN NN NN IN NN NN . UH NN NN NN NN NN NN NN NN NN NN .'}, 'texta_facts': []}
```

### Process Arabic (for real this time)
```
>>> mlp = MLP(language_codes=["et","en","ru", "ar"])
>>> mlp.process("المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق. وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضًا بروح الإخاء.")
{'text': {'text': 'المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق . وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضا بروح الإخاء .', 'lang': 'ar', 'lemmas': 'مَادَّة 1 وَلَّد جَمِيع إِنسَان حَرَر مُتَسَاوِي فِي كَرَامَة والحقوق . وَقَد وَ عَقَل وضميراً وعليهم أَنَّ يعامل بعضهم بَعض بروح إِخَاء .', 'pos_tags': 'N------S1D Q--------- VIIA-3MS-- N------S4R N------P2D N------P4I A-----MP4I P--------- N------S2D U--------- G--------- U--------- VP-A-3MP-- N------S4I A-----MS4I U--------- C--------- VISA-3MS-- U--------- N------S4I U--------- N------S2D G---------', 'transliteration': "AlmAdp 1 ywld jmyE AlnAs >HrArFA mtsAwyn fy AlkrAmp wAlHqwq . wqd whbwA EqlAF wDmyrFA wElyhm >n yEAml bEDhm bEDA brwH Al<xA' ."}, 'texta_facts': []}
>>>
>>> mlp.lemmatize("المادة 1 يولد جميع الناس أحرارًا متساوين في الكرامة والحقوق. وقد وهبوا عقلاً وضميرًا وعليهم أن يعامل بعضهم بعضا بروح الإخاء.")
'مَادَّة 1 وَلَّد جَمِيع إِنسَان حَرَر مُتَسَاوِي فِي كَرَامَة والحقوق . وَقَد وَ عَقَل وضميراً وعليهم أَنَّ يعامل بعضهم بَعض بروح إِخَاء .'
```

### Load MLP with Custom Resource Path
```
>>> mlp = MLP(language_codes=["et","en","ru"], resource_dir="/home/kalevipoeg/mlp_resources/")
```

### Different phone parsers

Texta MLP has three different phone parsers:

* 'phone_strict' - used by default. It parses only numbers that are validated by the [phonenumbers library](https://pypi.org/project/phonenumbers/). Numbers preceded by an area code are validated for any region; numbers without one are validated only as Estonian ("EE") or Russian ("RU") phone numbers. For example, in "Maksekorraldusele märkida viitenumber 2800049900 ning selgitus ...", the reference number "2800049900" happens to be a valid number in Great Britain ("GB"), but not in "EE" or "RU", so it is not extracted.

* 'phone_high_precision' - extracts phone numbers with a regex that deliberately excludes complicated formats.

* 'phone_high_recall' - originally written for e-mails, it catches most phone numbers (including complicated formats), but also produces a lot of noise. This **parser is also used by default** when concatenating close entities (read below). This means that during concatenation only the "PHONE_high_recall" fact is considered; the other parsers' results are excluded from it (avoiding overlapping matches). Those results are not lost and are still added to texta_facts, just not under the fact "BOUNDED".

You can choose the parsers like so:
```
>>> mlp.process(raw_text="My phone number is 12 34 56 77.", analyzers=["lemmas", "phone_high_precision"])
```
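The facts produced by the different parsers all land in the `texta_facts` list of the processed document. As a sketch (assuming each fact is a dict carrying its fact name under a "fact" key; the exact fields may differ between MLP versions), the results of one parser can be separated out like this:

```python
# Hypothetical processed document with two facts from different parsers.
doc = {
    "text": {"text": "My phone number is 12 34 56 77 ."},
    "texta_facts": [
        {"fact": "PHONE_high_recall", "str_val": "12 34 56 77"},
        {"fact": "EMAIL", "str_val": "info@texta.ee"},
    ],
}

def facts_by_name(document, fact_name):
    """Return all facts of a given type from a processed document."""
    return [f for f in document["texta_facts"] if f["fact"] == fact_name]

phones = facts_by_name(doc, "PHONE_high_recall")
```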