lucytok

Name: lucytok
Version: 0.1.9
Author: Doug Turnbull
Requires-Python: <4.0,>=3.10
Upload time: 2024-11-10 14:58:44
## Lucytok

Lucene's boring English tokenizers recreated for Python. Compatible with [SearchArray](http://github.com/softwaredoug/searcharray).

Lets you configure a handful of common tokenization rules: ASCII folding, possessive removal, both
versions of Porter stemming, English stopword blanking, etc.


### Usage

Creating a tokenizer close to Elasticsearch's default English analyzer:

```python
from lucytok import english
es_english = english("Nsp->NNN->l->sNNN->1")
tokenized = es_english("The quick brown fox jumps over the lazy døg")
print(tokenized)
```

Outputs:

```
['_', 'quick', 'brown', 'fox', 'jump', 'over', '_', 'lazi', 'døg']
```
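The `_` entries in the output come from stopword blanking: rather than dropping stopwords, the tokenizer replaces them with a placeholder so the surviving tokens keep their positions. A minimal stdlib sketch of the idea (not lucytok's actual implementation; the toy stopword set here is an assumption):

```python
# Toy stopword set for illustration; the real English stopword list is larger.
STOPWORDS = {"the"}

def blank_stopwords(tokens):
    # Replace stopwords with "_" so remaining tokens keep their positions.
    return [t if t.lower() not in STOPWORDS else "_" for t in tokens]

print(blank_stopwords(["The", "quick", "brown", "fox", "jumps",
                       "over", "the", "lazy", "dog"]))
# → ['_', 'quick', 'brown', 'fox', 'jumps', 'over', '_', 'lazy', 'dog']
```

Position-preserving placeholders matter for phrase queries, where token offsets must line up between query and document.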

Make a tokenizer with ASCII folding...

```python
from lucytok import english
es_english_folded = english("asp->NNN->l->sNNN->1")
print(es_english_folded("The quick brown fox jumps over the lazy døg"))
```

```
['_', 'quick', 'brown', 'fox', 'jump', 'over', '_', 'lazi', 'dog']
```

Split compounds and convert British to American spelling...

```python
from lucytok import english
es_british = english("asp->NNN->l->scbN->1")
print(es_british("The watercolour fox jumps over the lazy døg"))
```

```
['_', 'water', 'color', 'fox', 'jump', 'over', '_', 'lazi', 'dog']
```
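The `watercolour -> ['water', 'color']` result suggests two dictionary-driven steps applied in sequence: compound splitting, then British-to-American spelling normalization. A toy sketch of that pipeline (the dictionaries here are hypothetical one-entry stand-ins, not lucytok's data):

```python
# Hypothetical single-entry dictionaries for illustration only.
COMPOUNDS = {"watercolour": ["water", "colour"]}
BRITISH_TO_AMERICAN = {"colour": "color"}

def normalize(token):
    # Split compounds first, then normalize each part's spelling.
    parts = COMPOUNDS.get(token, [token])
    return [BRITISH_TO_AMERICAN.get(p, p) for p in parts]

print(normalize("watercolour"))  # → ['water', 'color']
print(normalize("fox"))          # → ['fox']
```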

### Spec

Create a tokenizer using the following settings (these concepts
correspond to their [Elasticsearch counterparts](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html)):

```

#  |- ASCII fold (a) or not (N)
#  ||- Standard (s) or WS tokenizer (w)
#  ||- Remove possessive suffixes (p) or not (N)
#  |||
# "NsN->NNN->N->NNNN->N"
#       |||  |  ||||  |
#       |||  |  ||||  |- Porter stemmer version (1) or (2), or N/0 for none
#       |||  |  ||||- Manually convert irregular plurals (p) or not (N)
#       |||  |  |||- Split Compounds (c) or not (N)
#       |||  |  ||- Convert british to american spelling (b) or not (N)
#       |||  |  |- Blank out stopwords (s) or not (N)
#       |||  |- Lowercase (l) or not (N)
#       |||- Split on letter/number transitions (n) or not (N)
#       ||- Split on case changes (c) or not (N)
#       |- Split on punctuation (p) or not (N)


# "NsN->NNN->N->NNNN->N"
#  ---
#  (tokenization)

# "NsN->NNN->N->NNNN->N"
#       ---
#       (word splitting rules, like WordDelimiterFilter in Lucene)

# "NsN->NNN->N->NNNN->N"
#            -
#            (lowercasing or not)

# "NsN->NNN->N->NNNN->N"
#               ----
#               (dictionary-based steps: stopwords -> compounds -> British/American spelling -> irregular plurals)

# "NsN->NNN->N->NNNN->N"
#                     - stemming (porter)

```
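Reading these specs by eye gets error-prone, so it can help to split one into its five `->`-separated groups programmatically. A hypothetical helper (the group names below are labels chosen for this sketch, not part of lucytok's API):

```python
def describe_spec(spec: str) -> dict:
    """Map each '->'-separated group of a lucytok spec to a descriptive label."""
    labels = ["tokenization", "word splitting", "lowercasing",
              "dictionary steps", "stemming"]
    return dict(zip(labels, spec.split("->")))

print(describe_spec("Nsp->NNN->l->sNNN->1"))
# → {'tokenization': 'Nsp', 'word splitting': 'NNN', 'lowercasing': 'l',
#    'dictionary steps': 'sNNN', 'stemming': '1'}
```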