tiniestsegmenter

Name: tiniestsegmenter
Version: 0.2.0
Summary: Compact Japanese segmenter
Author: Teryn Jones
Requires Python: >=3.8
Keywords: tokenizer, nlp, ngram, japanese
License: none recorded
Home page: none recorded
Maintainer: none recorded
Docs URL: none recorded
Upload time: 2024-06-24 14:45:40
Requirements: none recorded
# TiniestSegmenter

A port of [TinySegmenter](http://chasen.org/~taku/software/TinySegmenter/) written in pure, safe Rust with no dependencies. Bindings are available for both [Rust](https://github.com/jwnz/tiniestsegmenter/tree/master/tiniestsegmenter) and [Python](https://github.com/jwnz/tiniestsegmenter/tree/master/bindings/python/).

TinySegmenter is an n-gram word tokenizer for Japanese text, originally built by [Taku Kudo](http://chasen.org/~taku/) (2008).


## Usage

`tiniestsegmenter` can be installed from PyPI: `pip install tiniestsegmenter`

```Python
import tiniestsegmenter

# Segment a Japanese sentence into word tokens.
tokens = tiniestsegmenter.tokenize("ジャガイモが好きです。")
```
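
Building on that, here is a small sketch of downstream use, assuming `tokenize` returns a plain list of strings (as the variable name above suggests): tallying token frequencies across a corpus with `collections.Counter`.

```Python
import collections

import tiniestsegmenter

corpus = [
    "ジャガイモが好きです。",
    "ジャガイモは野菜です。",
]

# Tokenize each document and count token frequencies across the corpus.
counts = collections.Counter(
    token
    for doc in corpus
    for token in tiniestsegmenter.tokenize(doc)
)
print(counts.most_common(5))
```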

Because the Rust side releases the GIL, multi-threading is also possible:

```Python
import functools
from concurrent.futures import ThreadPoolExecutor

import tiniestsegmenter

tokenizer = functools.partial(tiniestsegmenter.tokenize)

documents = ["ジャガイモが好きです。"] * 10_000

# tokenize releases the GIL, so the worker threads run in parallel.
with ThreadPoolExecutor(4) as e:
    tokens = list(e.map(tokenizer, documents))
```
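
To check whether the released GIL actually buys parallelism on your machine, a rough benchmark sketch follows; the worker count and corpus size are arbitrary illustrations, not recommendations.

```Python
import time
from concurrent.futures import ThreadPoolExecutor

import tiniestsegmenter

documents = ["ジャガイモが好きです。"] * 10_000

# Single-threaded baseline.
t0 = time.perf_counter()
for doc in documents:
    tiniestsegmenter.tokenize(doc)
t1 = time.perf_counter()

# Four worker threads; this only scales if the GIL is released during tokenize.
with ThreadPoolExecutor(4) as e:
    list(e.map(tiniestsegmenter.tokenize, documents))
t2 = time.perf_counter()

print(f"serial:   {t1 - t0:.3f}s")
print(f"threaded: {t2 - t1:.3f}s")
```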


            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "tiniestsegmenter",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "tokenizer, NLP, ngram, Japanese",
    "author": "Teryn Jones",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/54/63/dca4442bf2fd8071ae3907ed25288111d4613dc7e4a1362fef201aeb077e/tiniestsegmenter-0.2.0.tar.gz",
    "platform": null,
    "description": "# TiniestSegmenter\n\nA port of [TinySegmenter](http://chasen.org/~taku/software/TinySegmenter/) written in pure, safe rust with no dependencies. You can find bindings for both [Rust](https://github.com/jwnz/tiniestsegmenter/tree/master/tiniestsegmenter) and [Python](https://github.com/jwnz/tiniestsegmenter/tree/master/bindings/python/).\n\nTinySegmenter is an n-gram word tokenizer for Japanese text originally built by [Taku Kudo](http://chasen.org/~taku/) (2008). \n\n\n<b> Usage </b>\n\n`tiniestsegmenter` can be installed from PyPI: `pip install tiniestsegmenter`\n\n```Python\nimport tiniestsegmenter\n\ntokens = tiniestsegmenter.tokenize(\"\u30b8\u30e3\u30ac\u30a4\u30e2\u304c\u597d\u304d\u3067\u3059\u3002\")\n```\n\nWith the GIL released on the rust side, multi-threading is also possible.\n\n```Python\nimport functools\nimport tiniestsegmenter\n\ntokenizer = functools.partial(tiniestsegmenter.tokenize)\n\ndocuments = [\"\u30b8\u30e3\u30ac\u30a4\u30e2\u304c\u597d\u304d\u3067\u3059\u3002\"] * 10_000\nwith ThreadPoolExecutor(4) as e:\n    list(e.map(encoder, documents))\n```\n\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "Compact Japanese segmenter",
    "version": "0.2.0",
    "project_urls": {
        "Homepage": "https://github.com/jwnz/tiniestsegmenter",
        "Issues": "https://github.com/jwnz/tiniestsegmenter/issues",
        "Repository": "https://github.com/jwnz/tiniestsegmenter"
    },
    "split_keywords": [
        "tokenizer",
        " nlp",
        " ngram",
        " japanese"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "68df17671439664c9e99525354850730a23ae6401a03516bf6d551fa52a99d8c",
                "md5": "ec761cc3dbef375b68020273ac1af9d5",
                "sha256": "205f2de3472f024f6fc90742d4236dbbb5ffc53e2b3290aee0a38c0b3a29d2d9"
            },
            "downloads": -1,
            "filename": "tiniestsegmenter-0.2.0-cp311-cp311-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "ec761cc3dbef375b68020273ac1af9d5",
            "packagetype": "bdist_wheel",
            "python_version": "cp311",
            "requires_python": ">=3.8",
            "size": 234738,
            "upload_time": "2024-06-24T14:45:39",
            "upload_time_iso_8601": "2024-06-24T14:45:39.402683Z",
            "url": "https://files.pythonhosted.org/packages/68/df/17671439664c9e99525354850730a23ae6401a03516bf6d551fa52a99d8c/tiniestsegmenter-0.2.0-cp311-cp311-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5463dca4442bf2fd8071ae3907ed25288111d4613dc7e4a1362fef201aeb077e",
                "md5": "297bba88ad6e735de19452a8a6656c86",
                "sha256": "5b92f904af3f550b0fd42a1433c3d45bbe4d606452b889ea1fd51d214b0725ab"
            },
            "downloads": -1,
            "filename": "tiniestsegmenter-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "297bba88ad6e735de19452a8a6656c86",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 19271,
            "upload_time": "2024-06-24T14:45:40",
            "upload_time_iso_8601": "2024-06-24T14:45:40.849406Z",
            "url": "https://files.pythonhosted.org/packages/54/63/dca4442bf2fd8071ae3907ed25288111d4613dc7e4a1362fef201aeb077e/tiniestsegmenter-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-06-24 14:45:40",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "jwnz",
    "github_project": "tiniestsegmenter",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "tiniestsegmenter"
}
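
The record above mirrors what PyPI's public JSON API serves for this project. A minimal sketch of retrieving it yourself, using only the standard library and PyPI's documented `/pypi/<project>/json` endpoint:

```Python
import json
import urllib.request

# PyPI's public JSON API: https://pypi.org/pypi/<project>/json
url = "https://pypi.org/pypi/tiniestsegmenter/json"
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

print(data["info"]["version"])           # "0.2.0" at the time of this record
print(data["info"]["requires_python"])   # ">=3.8"
```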
        