| Name | lucytok |
| --- | --- |
| Version | 0.1.9 |
| Summary | None |
| home_page | None |
| upload_time | 2024-11-10 14:58:44 |
| maintainer | None |
| docs_url | None |
| author | Doug Turnbull |
| requires_python | <4.0,>=3.10 |
| license | None |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
## Lucytok
Lucene's boring English tokenizers, recreated for Python. Compatible with [SearchArray](http://github.com/softwaredoug/searcharray).
Lets you configure a handful of common tokenization rules: ASCII folding, possessive removal, both versions of
Porter stemming, English stopwords, etc.
### Usage
Create a tokenizer close to Elasticsearch's default English analyzer:
```
from lucytok import english
es_english = english("Nsp->NNN->l->sNNN->1")
tokenized = es_english("The quick brown fox jumps over the lazy døg")
print(tokenized)
```
Outputs
```
['_', 'quick', 'brown', 'fox', 'jump', 'over', '_', 'lazi', 'døg']
```
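Note that the stopwords `The` and `the` are blanked to `_` placeholders rather than dropped, presumably so the remaining tokens keep their positions. If you want them gone entirely, filtering afterwards is plain Python (not a lucytok feature):
```
tokens = [t for t in tokenized if t != "_"]
print(tokens)  # ['quick', 'brown', 'fox', 'jump', 'over', 'lazi', 'døg']
```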
Make a tokenizer with ASCII folding...
```
from lucytok import english
es_english_folded = english("asp->NNN->l->sNNN->1")
print(es_english_folded("The quick brown fox jumps over the lazy døg"))
```
```
['_', 'quick', 'brown', 'fox', 'jump', 'over', '_', 'lazi', 'dog']
```
Split compounds and convert British to American spelling...
```
from lucytok import english
es_british = english("asp->NNN->l->scbN->1")
print(es_british("The watercolour fox jumps over the lazy døg"))
```
```
['_', 'water', 'color', 'fox', 'jump', 'over', '_', 'lazi', 'dog']
```
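Because the result of `english(...)` is just a callable from string to token list, it can be handed to SearchArray at indexing time. A minimal sketch, assuming `SearchArray.index` accepts a `tokenizer` callable as its README suggests; outputs not verified here:
```
import pandas as pd
from lucytok import english
from searcharray import SearchArray

es_english = english("Nsp->NNN->l->sNNN->1")

df = pd.DataFrame({"title": ["The quick brown fox", "A lazy døg jumps"]})
# Index the column with the lucytok tokenizer instead of SearchArray's default
df["title_indexed"] = SearchArray.index(df["title"], tokenizer=es_english)
# BM25 scores per row for the stemmed term "jump"
print(df["title_indexed"].array.score("jump"))
```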
### Spec
Create a tokenizer using the following settings (these concepts
correspond to their [Elasticsearch counterparts](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html)):
```
#  |- ASCII fold (a) or not (N)
#  ||- Standard (s) or WS tokenizer (w)
#  |||- Remove possessive suffixes (p) or not (N)
#  |||
# "NsN->NNN->N->NNNN->N"
#       |||  |  ||||  |
#       |||  |  ||||  |- Porter stemmer: version (1) or (2), N/0 for none
#       |||  |  ||||- Manually convert irregular plurals (p) or not (N)
#       |||  |  |||- Split compounds (c) or not (N)
#       |||  |  ||- Convert British to American spelling (b) or not (N)
#       |||  |  |- Blank out stopwords (s) or not (N)
#       |||  |- Lowercase (l) or not (N)
#       |||- Split on letter/number transitions (n) or not (N)
#       ||- Split on case changes (c) or not (N)
#       |- Split on punctuation (p) or not (N)
# "NsN->NNN->N->NNN->N"
# ---
# (tokenization)
# "NsN->NNN->N->NNNN->N"
# ---
# (word splitting on rules, like WordDelimeterFilter in Lucene)
# "NsN->NNN->N->NNNN->N"
# -
# (lowercasing or not)
# "NsN->NNN->N->NNNN->N"
# ----
# (dictionary based splitting stopwords -> compounds -> british/american English -> irregular plurals)
# "NsN->NNN->N->NNNN->N"
# - stemming (porter)
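The spec string is just five `->`-separated fields. As an illustration of how the flags compose (the `build_spec` helper below is hypothetical, not part of lucytok):
```
from lucytok import english

# Hypothetical helper: assemble a spec string from the flags documented above.
def build_spec(ascii_fold=False, whitespace=False, possessive=True,
               split_punct=False, split_case=False, split_num=False,
               lowercase=True, stopwords=True, compounds=False,
               british=False, plurals=False, stem="1"):
    tokenize = ("a" if ascii_fold else "N") \
             + ("w" if whitespace else "s") \
             + ("p" if possessive else "N")
    word_split = ("p" if split_punct else "N") \
               + ("c" if split_case else "N") \
               + ("n" if split_num else "N")
    case = "l" if lowercase else "N"
    dictionary = ("s" if stopwords else "N") \
               + ("c" if compounds else "N") \
               + ("b" if british else "N") \
               + ("p" if plurals else "N")
    return "->".join([tokenize, word_split, case, dictionary, stem])

assert build_spec() == "Nsp->NNN->l->sNNN->1"            # the ES-like default above
assert build_spec(ascii_fold=True) == "asp->NNN->l->sNNN->1"
assert build_spec(ascii_fold=True, compounds=True,
                  british=True) == "asp->NNN->l->scbN->1"

tokenizer = english(build_spec(ascii_fold=True, compounds=True, british=True))
```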