# TiniestSegmenter
A port of [TinySegmenter](http://chasen.org/~taku/software/TinySegmenter/) written in pure, safe Rust with no dependencies. Bindings are available for both [Rust](https://github.com/jwnz/tiniestsegmenter/tree/master/tiniestsegmenter) and [Python](https://github.com/jwnz/tiniestsegmenter/tree/master/bindings/python/).
TinySegmenter is an n-gram word tokenizer for Japanese text originally built by [Taku Kudo](http://chasen.org/~taku/) (2008).
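To give a rough feel for what "n-gram word tokenizer" means, here is a minimal sketch of boundary scoring: for each gap between two characters, sum the weights of a few n-gram features and split when the total is positive. The weights, the feature names (`AFTER`, `BEFORE`, `ACROSS`), and the `segment` function are all invented for this illustration; the real model is trained and uses a much larger feature set, and this is not the library's implementation.

```python
# Toy weights invented for this example; NOT the trained TinySegmenter model.
BIAS = -50                                  # default: do not split
AFTER = {"が": 100}                         # unigram: character left of the gap
BEFORE = {"が": 100, "で": 100, "。": 100}  # unigram: character right of the gap
ACROSS = {"好き": -100, "です": -100}       # bigram spanning the gap


def segment(text):
    """Split text at every gap whose summed feature score is positive."""
    if not text:
        return []
    tokens = [text[0]]
    for left, right in zip(text, text[1:]):
        score = (
            BIAS
            + AFTER.get(left, 0)
            + BEFORE.get(right, 0)
            + ACROSS.get(left + right, 0)
        )
        if score > 0:
            tokens.append(right)   # positive score: start a new token
        else:
            tokens[-1] += right    # otherwise extend the current token
    return tokens


print(segment("ジャガイモが好きです。"))
# ['ジャガイモ', 'が', '好き', 'です', '。']
```

With these hand-picked weights the toy scorer happens to split the example sentence sensibly; it will not generalize to other text.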
<b> Usage </b>
`tiniestsegmenter` can be installed from PyPI: `pip install tiniestsegmenter`
```Python
import tiniestsegmenter
tokens = tiniestsegmenter.tokenize("ジャガイモが好きです。")
```
Because the GIL is released on the Rust side, tokenization can also run in parallel across multiple threads.
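```Python
import functools
from concurrent.futures import ThreadPoolExecutor

import tiniestsegmenter

# No arguments are bound here; partial simply wraps tokenize as a callable.
tokenizer = functools.partial(tiniestsegmenter.tokenize)

documents = ["ジャガイモが好きです。"] * 10_000
# Tokenize the documents on 4 worker threads.
with ThreadPoolExecutor(4) as e:
    list(e.map(tokenizer, documents))
```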
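If the same pattern is needed in several places, it could be wrapped in a small helper. The sketch below is an assumption, not part of the package: `tokenize_all` is a hypothetical name, and `str.split` stands in for the tokenizer so the example runs without `tiniestsegmenter` installed. In practice you would pass `tiniestsegmenter.tokenize` instead.

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize_all(documents, tokenize, workers=4):
    """Apply `tokenize` to every document on a thread pool.

    Executor.map yields results in input order, so the output list
    lines up with `documents`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(tokenize, documents))


# Stand-in tokenizer (str.split) used only to keep the sketch self-contained.
results = tokenize_all(["a b", "c d e"], str.split, workers=2)
print(results)  # [['a', 'b'], ['c', 'd', 'e']]
```

Taking the tokenizer as a parameter keeps the helper easy to test and leaves the choice of worker count to the caller.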
Raw data
{
"_id": null,
"home_page": null,
"name": "tiniestsegmenter",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "tokenizer, NLP, ngram, Japanese",
"author": "Teryn Jones",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/bf/bf/7aa3628a6fda12ee6be30e2b39e8706227d91c91d8348540c7e3ddb1d444/tiniestsegmenter-0.3.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Compact Japanese segmenter",
"version": "0.3.0",
"project_urls": {
"Homepage": "https://github.com/jwnz/tiniestsegmenter",
"Issues": "https://github.com/jwnz/tiniestsegmenter/issues",
"Repository": "https://github.com/jwnz/tiniestsegmenter"
},
"split_keywords": [
"tokenizer",
" nlp",
" ngram",
" japanese"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "7483539f252a29fd352662438a6914820566e397bb70dd8e2df06efdbaf3ce8a",
"md5": "790f3db999527d1f45bbe4f8d8fa3022",
"sha256": "73259212190adf51a0e0c3731f3b992fbf1e6369730b4c3733d943dae710d5c2"
},
"downloads": -1,
"filename": "tiniestsegmenter-0.3.0-cp310-cp310-macosx_11_0_arm64.whl",
"has_sig": false,
"md5_digest": "790f3db999527d1f45bbe4f8d8fa3022",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.8",
"size": 223562,
"upload_time": "2024-09-24T13:39:22",
"upload_time_iso_8601": "2024-09-24T13:39:22.706780Z",
"url": "https://files.pythonhosted.org/packages/74/83/539f252a29fd352662438a6914820566e397bb70dd8e2df06efdbaf3ce8a/tiniestsegmenter-0.3.0-cp310-cp310-macosx_11_0_arm64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "bfbf7aa3628a6fda12ee6be30e2b39e8706227d91c91d8348540c7e3ddb1d444",
"md5": "7b39fb0123a30509692952d84f9a2b1f",
"sha256": "91c15a10de4d68256b733df48d559ba208e1067ded78f8314861b9d2d3bf2502"
},
"downloads": -1,
"filename": "tiniestsegmenter-0.3.0.tar.gz",
"has_sig": false,
"md5_digest": "7b39fb0123a30509692952d84f9a2b1f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 19223,
"upload_time": "2024-09-24T13:39:24",
"upload_time_iso_8601": "2024-09-24T13:39:24.529908Z",
"url": "https://files.pythonhosted.org/packages/bf/bf/7aa3628a6fda12ee6be30e2b39e8706227d91c91d8348540c7e3ddb1d444/tiniestsegmenter-0.3.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-09-24 13:39:24",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "jwnz",
"github_project": "tiniestsegmenter",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "tiniestsegmenter"
}