fast-sentence-tokenize


Name: fast-sentence-tokenize
Version: 0.1.15
home_page: https://github.com/craigtrim/fast-sentence-tokenize
Summary: Fast and Efficient Sentence Tokenization
upload_time: 2023-06-27 19:14:02
maintainer: Craig Trim
docs_url: None
author: Craig Trim
requires_python: >=3.8.5,<4.0.0
license: None
keywords: nlp, nlu, text, classify, classification
requirements: No requirements were recorded.
# Fast Sentence Tokenizer (fast-sentence-tokenize)
Best-in-class tokenizer

## Usage

### Import
```python
from fast_sentence_tokenize import fast_sentence_tokenize
```

### Call Tokenizer
```python
results = fast_sentence_tokenize("isn't a test great!!?")
```

### Results
```json
[
   "isn't",
   "a",
   "test",
   "great",
   "!",
   "!",
   "?"
]
```
Note that whitespace is not preserved in the output by default.

This generally results in a more accurate parse from downstream components, but may make the reassembly of the original sentence more challenging.
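
For instance, based on the example output above, a naive space-join of the whitespace-stripped tokens does not reproduce the original sentence (a minimal sketch, assuming the documented default behavior):
```python
from fast_sentence_tokenize import fast_sentence_tokenize

input_text = "isn't a test great!!?"
results = fast_sentence_tokenize(input_text)

# A space-join inserts spaces before the punctuation tokens,
# so the original spacing is not recovered exactly
assert ' '.join(results) == "isn't a test great ! ! ?"
assert ' '.join(results) != input_text
```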

### Preserve Whitespace
```python
results = fast_sentence_tokenize("isn't a test great!!?", eliminate_whitespace=False)
```
### Results
```json
[
   "isn't ",
   "a ",
   "test ",
   "great",
   "!",
   "!",
   "?"
]
```

This option preserves the trailing whitespace attached to each token.

This is useful if you want to re-assemble the original sentence from the tokens using the pre-existing spacing:
```python
# input_text is the original string passed to the tokenizer
assert ''.join(results) == input_text
```
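
Putting it together, a minimal round-trip sketch using the `eliminate_whitespace` keyword shown above:
```python
from fast_sentence_tokenize import fast_sentence_tokenize

input_text = "isn't a test great!!?"

# Keep the original spacing attached to each token
results = fast_sentence_tokenize(input_text, eliminate_whitespace=False)

# Concatenating the tokens reproduces the original sentence exactly
assert ''.join(results) == input_text
```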

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/craigtrim/fast-sentence-tokenize",
    "name": "fast-sentence-tokenize",
    "maintainer": "Craig Trim",
    "docs_url": null,
    "requires_python": ">=3.8.5,<4.0.0",
    "maintainer_email": "craigtrim@gmail.com",
    "keywords": "nlp,nlu,text,classify,classification",
    "author": "Craig Trim",
    "author_email": "craigtrim@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/36/59/1c68d48388ab9d7e6e77a2d6029d94317159bd1d6dadb6a533facd99cdf1/fast_sentence_tokenize-0.1.15.tar.gz",
    "platform": null,
    "description": "# Fast Sentence Tokenizer (fast-sentence-tokenize)\nBest in class tokenizer\n\n## Usage\n\n### Import\n```python\nfrom fast_sentence_tokenize import fast_sentence_tokenize\n```\n\n### Call Tokenizer\n```python\nresults = fast_sentence_tokenize(\"isn't a test great!!?\")\n```\n\n### Results\n```json\n[\n   \"isn't\",\n   \"a\",\n   \"test\",\n   \"great\",\n   \"!\",\n   \"!\",\n   \"?\"\n]\n```\nNote that whitespace is not preserved in the output by default.\n\nThis generally results in a more accurate parse from downstream components, but may make the reassembly of the original sentence more challenging.\n\n### Preserve Whitespace\n```python\nresults = fast_sentence_tokenize(\"isn't a test great!!?\", eliminate_whitespace=False)\n```\n### Results\n```json\n[\n   \"isn't \",\n   \"a \",\n   \"test \",\n   \"great\",\n   \"!\",\n   \"!\",\n   \"?\"\n]\n```\n\nThis option preserves whitespace.\n\nThis is useful if you want to re-assemble the tokens using the pre-existing spacing\n```python\nassert ''.join(tokens) == input_text\n```\n",
    "bugtrack_url": null,
    "license": "None",
    "summary": "Fast and Efficient Sentence Tokenization",
    "version": "0.1.15",
    "project_urls": {
        "Bug Tracker": "https://github.com/craigtrim/fast-sentence-tokenize/issues",
        "Homepage": "https://github.com/craigtrim/fast-sentence-tokenize",
        "Repository": "https://github.com/craigtrim/fast-sentence-tokenize"
    },
    "split_keywords": [
        "nlp",
        "nlu",
        "text",
        "classify",
        "classification"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0ebc4f5de44e36700aff3303c1def32e7d155146cd9070d7f92ca8904e9983c2",
                "md5": "6f255453224b8296ff8dab0677c56b88",
                "sha256": "85eed0ba762a6f919c7628b8c6951c5a09abf8f0544bfcf5add033c0e59e0b8d"
            },
            "downloads": -1,
            "filename": "fast_sentence_tokenize-0.1.15-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "6f255453224b8296ff8dab0677c56b88",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8.5,<4.0.0",
            "size": 13889,
            "upload_time": "2023-06-27T19:14:00",
            "upload_time_iso_8601": "2023-06-27T19:14:00.714537Z",
            "url": "https://files.pythonhosted.org/packages/0e/bc/4f5de44e36700aff3303c1def32e7d155146cd9070d7f92ca8904e9983c2/fast_sentence_tokenize-0.1.15-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "36591c68d48388ab9d7e6e77a2d6029d94317159bd1d6dadb6a533facd99cdf1",
                "md5": "c3ab532f89691946b53b66991e91a87b",
                "sha256": "0f5d8f5691f8dc41e321eac720ddaf1cb59fd33259e5482f78992e26162ac294"
            },
            "downloads": -1,
            "filename": "fast_sentence_tokenize-0.1.15.tar.gz",
            "has_sig": false,
            "md5_digest": "c3ab532f89691946b53b66991e91a87b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8.5,<4.0.0",
            "size": 9308,
            "upload_time": "2023-06-27T19:14:02",
            "upload_time_iso_8601": "2023-06-27T19:14:02.743296Z",
            "url": "https://files.pythonhosted.org/packages/36/59/1c68d48388ab9d7e6e77a2d6029d94317159bd1d6dadb6a533facd99cdf1/fast_sentence_tokenize-0.1.15.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-06-27 19:14:02",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "craigtrim",
    "github_project": "fast-sentence-tokenize",
    "github_not_found": true,
    "lcname": "fast-sentence-tokenize"
}
        