# TiniestSegmenter
A port of [TinySegmenter](http://chasen.org/~taku/software/TinySegmenter/) written in pure, safe Rust with no dependencies. Bindings are available for both [Rust](https://github.com/jwnz/tiniestsegmenter/tree/master/tiniestsegmenter) and [Python](https://github.com/jwnz/tiniestsegmenter/tree/master/bindings/python/).
TinySegmenter is an n-gram word tokenizer for Japanese text originally built by [Taku Kudo](http://chasen.org/~taku/) (2008).
<b> Usage </b>
`tiniestsegmenter` can be installed from PyPI: `pip install tiniestsegmenter`
```Python
import tiniestsegmenter
tokens = tiniestsegmenter.tokenize("ジャガイモが好きです。")
```
Because the GIL is released on the Rust side, tokenization can also run across multiple threads.
```Python
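import functools
from concurrent.futures import ThreadPoolExecutor

import tiniestsegmenter

tokenizer = functools.partial(tiniestsegmenter.tokenize)

documents = ["ジャガイモが好きです。"] * 10_000
with ThreadPoolExecutor(4) as e:
    # Tokenize the documents across 4 worker threads; results keep input order.
    results = list(e.map(tokenizer, documents))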
```
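The threaded pattern above can be wrapped in a small reusable helper. This is a minimal sketch, not part of the package: `tokenize_all` and its `max_workers` parameter are hypothetical names, and the tokenizer is passed in as a callable so the helper makes no assumption about what `tiniestsegmenter.tokenize` returns.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def tokenize_all(
    tokenize: Callable[[str], T],
    documents: Iterable[str],
    max_workers: int = 4,
) -> List[T]:
    """Apply `tokenize` to every document using a thread pool.

    `tokenize_all` is a hypothetical helper, not part of tiniestsegmenter.
    Results are returned in the same order as the input documents.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, regardless of
        # which worker thread finishes first.
        return list(pool.map(tokenize, documents))
```

With the package installed, it would be called as `tokenize_all(tiniestsegmenter.tokenize, documents)`. Threads only help here because the tokenizer releases the GIL; a pure-Python callable would see little or no speedup.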
Raw data
{
"_id": null,
"home_page": null,
"name": "tiniestsegmenter",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "tokenizer, NLP, ngram, Japanese",
"author": "Teryn Jones",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/54/63/dca4442bf2fd8071ae3907ed25288111d4613dc7e4a1362fef201aeb077e/tiniestsegmenter-0.2.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Compact Japanese segmenter",
"version": "0.2.0",
"project_urls": {
"Homepage": "https://github.com/jwnz/tiniestsegmenter",
"Issues": "https://github.com/jwnz/tiniestsegmenter/issues",
"Repository": "https://github.com/jwnz/tiniestsegmenter"
},
"split_keywords": [
"tokenizer",
" nlp",
" ngram",
" japanese"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "68df17671439664c9e99525354850730a23ae6401a03516bf6d551fa52a99d8c",
"md5": "ec761cc3dbef375b68020273ac1af9d5",
"sha256": "205f2de3472f024f6fc90742d4236dbbb5ffc53e2b3290aee0a38c0b3a29d2d9"
},
"downloads": -1,
"filename": "tiniestsegmenter-0.2.0-cp311-cp311-macosx_11_0_arm64.whl",
"has_sig": false,
"md5_digest": "ec761cc3dbef375b68020273ac1af9d5",
"packagetype": "bdist_wheel",
"python_version": "cp311",
"requires_python": ">=3.8",
"size": 234738,
"upload_time": "2024-06-24T14:45:39",
"upload_time_iso_8601": "2024-06-24T14:45:39.402683Z",
"url": "https://files.pythonhosted.org/packages/68/df/17671439664c9e99525354850730a23ae6401a03516bf6d551fa52a99d8c/tiniestsegmenter-0.2.0-cp311-cp311-macosx_11_0_arm64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "5463dca4442bf2fd8071ae3907ed25288111d4613dc7e4a1362fef201aeb077e",
"md5": "297bba88ad6e735de19452a8a6656c86",
"sha256": "5b92f904af3f550b0fd42a1433c3d45bbe4d606452b889ea1fd51d214b0725ab"
},
"downloads": -1,
"filename": "tiniestsegmenter-0.2.0.tar.gz",
"has_sig": false,
"md5_digest": "297bba88ad6e735de19452a8a6656c86",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 19271,
"upload_time": "2024-06-24T14:45:40",
"upload_time_iso_8601": "2024-06-24T14:45:40.849406Z",
"url": "https://files.pythonhosted.org/packages/54/63/dca4442bf2fd8071ae3907ed25288111d4613dc7e4a1362fef201aeb077e/tiniestsegmenter-0.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-06-24 14:45:40",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "jwnz",
"github_project": "tiniestsegmenter",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "tiniestsegmenter"
}