tiniestsegmenter

Name: tiniestsegmenter
Version: 0.3.0
Home page: None
Summary: Compact Japanese segmenter
Upload time: 2024-09-24 13:39:24
Maintainer: None
Docs URL: None
Author: Teryn Jones
Requires Python: >=3.8
License: None
Keywords: tokenizer, nlp, ngram, japanese
Requirements: No requirements were recorded.
Travis-CI: No Travis.
Coveralls test coverage: No coveralls.
# TiniestSegmenter

A port of [TinySegmenter](http://chasen.org/~taku/software/TinySegmenter/) written in pure, safe Rust with no dependencies. You can find bindings for both [Rust](https://github.com/jwnz/tiniestsegmenter/tree/master/tiniestsegmenter) and [Python](https://github.com/jwnz/tiniestsegmenter/tree/master/bindings/python/).

TinySegmenter is an n-gram word tokenizer for Japanese text originally built by [Taku Kudo](http://chasen.org/~taku/) (2008). 


<b>Usage</b>

`tiniestsegmenter` can be installed from PyPI: `pip install tiniestsegmenter`

```Python
import tiniestsegmenter

tokens = tiniestsegmenter.tokenize("ジャガイモが好きです。")
```
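The call returns the segmented words as a list of strings. As an illustrative sketch only (the exact token boundaries come from TinySegmenter's n-gram model, so treat this output as an assumption, not a guarantee):

```Python
print(tokens)
# A plausible segmentation of the sentence above:
# ['ジャガイモ', 'が', '好き', 'です', '。']
```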

Because the GIL is released on the Rust side, multi-threading is also possible.

```Python
import functools
from concurrent.futures import ThreadPoolExecutor

import tiniestsegmenter

# functools.partial lets you pre-bind arguments if needed; here it
# simply wraps the tokenizer for use with executor.map.
tokenizer = functools.partial(tiniestsegmenter.tokenize)

documents = ["ジャガイモが好きです。"] * 10_000
with ThreadPoolExecutor(4) as e:
    tokens = list(e.map(tokenizer, documents))
```
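To see whether threading actually pays off on your machine, a minimal timing sketch (assuming the same `documents` list as above; the thread count of 4 is arbitrary):

```Python
import time
from concurrent.futures import ThreadPoolExecutor

import tiniestsegmenter

documents = ["ジャガイモが好きです。"] * 10_000

# Sequential baseline.
start = time.perf_counter()
sequential = [tiniestsegmenter.tokenize(doc) for doc in documents]
print(f"sequential: {time.perf_counter() - start:.3f}s")

# Threaded run; this only helps because tokenize releases the GIL
# while the Rust code runs.
start = time.perf_counter()
with ThreadPoolExecutor(4) as e:
    threaded = list(e.map(tiniestsegmenter.tokenize, documents))
print(f"threaded:   {time.perf_counter() - start:.3f}s")
```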


            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "tiniestsegmenter",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "tokenizer, NLP, ngram, Japanese",
    "author": "Teryn Jones",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/bf/bf/7aa3628a6fda12ee6be30e2b39e8706227d91c91d8348540c7e3ddb1d444/tiniestsegmenter-0.3.0.tar.gz",
    "platform": null,
    "description": "# TiniestSegmenter\n\nA port of [TinySegmenter](http://chasen.org/~taku/software/TinySegmenter/) written in pure, safe rust with no dependencies. You can find bindings for both [Rust](https://github.com/jwnz/tiniestsegmenter/tree/master/tiniestsegmenter) and [Python](https://github.com/jwnz/tiniestsegmenter/tree/master/bindings/python/).\n\nTinySegmenter is an n-gram word tokenizer for Japanese text originally built by [Taku Kudo](http://chasen.org/~taku/) (2008). \n\n\n<b> Usage </b>\n\n`tiniestsegmenter` can be installed from PyPI: `pip install tiniestsegmenter`\n\n```Python\nimport tiniestsegmenter\n\ntokens = tiniestsegmenter.tokenize(\"\u30b8\u30e3\u30ac\u30a4\u30e2\u304c\u597d\u304d\u3067\u3059\u3002\")\n```\n\nWith the GIL released on the rust side, multi-threading is also possible.\n\n```Python\nimport functools\nimport tiniestsegmenter\n\ntokenizer = functools.partial(tiniestsegmenter.tokenize)\n\ndocuments = [\"\u30b8\u30e3\u30ac\u30a4\u30e2\u304c\u597d\u304d\u3067\u3059\u3002\"] * 10_000\nwith ThreadPoolExecutor(4) as e:\n    list(e.map(encoder, documents))\n```\n\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "Compact Japanese segmenter",
    "version": "0.3.0",
    "project_urls": {
        "Homepage": "https://github.com/jwnz/tiniestsegmenter",
        "Issues": "https://github.com/jwnz/tiniestsegmenter/issues",
        "Repository": "https://github.com/jwnz/tiniestsegmenter"
    },
    "split_keywords": [
        "tokenizer",
        " nlp",
        " ngram",
        " japanese"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7483539f252a29fd352662438a6914820566e397bb70dd8e2df06efdbaf3ce8a",
                "md5": "790f3db999527d1f45bbe4f8d8fa3022",
                "sha256": "73259212190adf51a0e0c3731f3b992fbf1e6369730b4c3733d943dae710d5c2"
            },
            "downloads": -1,
            "filename": "tiniestsegmenter-0.3.0-cp310-cp310-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "790f3db999527d1f45bbe4f8d8fa3022",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.8",
            "size": 223562,
            "upload_time": "2024-09-24T13:39:22",
            "upload_time_iso_8601": "2024-09-24T13:39:22.706780Z",
            "url": "https://files.pythonhosted.org/packages/74/83/539f252a29fd352662438a6914820566e397bb70dd8e2df06efdbaf3ce8a/tiniestsegmenter-0.3.0-cp310-cp310-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "bfbf7aa3628a6fda12ee6be30e2b39e8706227d91c91d8348540c7e3ddb1d444",
                "md5": "7b39fb0123a30509692952d84f9a2b1f",
                "sha256": "91c15a10de4d68256b733df48d559ba208e1067ded78f8314861b9d2d3bf2502"
            },
            "downloads": -1,
            "filename": "tiniestsegmenter-0.3.0.tar.gz",
            "has_sig": false,
            "md5_digest": "7b39fb0123a30509692952d84f9a2b1f",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 19223,
            "upload_time": "2024-09-24T13:39:24",
            "upload_time_iso_8601": "2024-09-24T13:39:24.529908Z",
            "url": "https://files.pythonhosted.org/packages/bf/bf/7aa3628a6fda12ee6be30e2b39e8706227d91c91d8348540c7e3ddb1d444/tiniestsegmenter-0.3.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-24 13:39:24",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "jwnz",
    "github_project": "tiniestsegmenter",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "tiniestsegmenter"
}
        