# Sentence segmenter
[![tests](https://github.com/santhoshtr/sentencex/actions/workflows/tests.yaml/badge.svg)](https://github.com/santhoshtr/sentencex/actions/workflows/tests.yaml)
A sentence segmentation library with wide language support optimized for speed and utility.
## Approach
- If it's a period, it ends a sentence.
- If the preceding token is in the hand-compiled list of abbreviations, then it doesn't end a sentence.
However, the sentence-ending punctuation is not a period in many languages. So we use a list of known punctuation marks that can cause a sentence break, covering as many languages as possible.
We also collect a list of known, popular abbreviations in as many languages as possible.
Sometimes it is very hard to get the segmentation correct. In such cases this library is opinionated and prefers not segmenting over segmenting wrongly. If two sentences accidentally stay together, that is acceptable; it is better than a sentence being split in the middle.
We avoid over-engineering in pursuit of 100% linguistic accuracy.
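The heuristic above can be sketched as follows. This is a simplified illustration, not the library's actual implementation; the abbreviation set here is a toy example, and the real library ships much larger, per-language punctuation and abbreviation lists.

```python
import re

# Toy abbreviation list for illustration only.
ABBREVIATIONS = {"Dr.", "Mr.", "U.S.", "etc."}

def naive_segment(text):
    """Split after . ! ? unless the preceding token is a known abbreviation.

    Non-destructive: joining the returned pieces reproduces the input.
    """
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]", text):
        end = match.end()
        tokens = text[start:end].split()
        if tokens and tokens[-1] in ABBREVIATIONS:
            continue  # opinionated: prefer under-segmentation over a wrong split
        sentences.append(text[start:end])
        start = end
    if start < len(text):
        sentences.append(text[start:])  # trailing text without final punctuation
    return sentences
```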
This approach is suitable for applications like text-to-speech and machine translation.
Consider this example: `We make a good team, you and I. Did you see Albert I. Jones yesterday?`
The accurate segmentation of this text is
`["We make a good team, you and I.", "Did you see Albert I. Jones yesterday?"]`
However, achieving this level of precision requires complex rules, and those rules can create side effects. Instead, if we simply don't segment between `I. Did`, the result is acceptable for most downstream applications.
The sentence segmentation in this library is **non-destructive**: if the sentences are concatenated, the original text can be reconstructed. Line breaks, punctuation, and whitespace are preserved in the output.
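The non-destructive property can be checked with a trivial splitter of one's own. The snippet below is not the library itself, just a regex sketch showing what "every character is preserved" means: splitting on a zero-width position keeps all delimiters and whitespace.

```python
import re

text = "First sentence.  Second one!\nThird?"
# Splitting at a zero-width lookbehind keeps every character of the input.
parts = re.split(r"(?<=[.!?])", text)
sentences = [p for p in parts if p]  # drop the empty trailing piece, if any
assert "".join(sentences) == text    # nothing was lost or normalized
```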
## Usage
Install the library using
```bash
pip install sentencex
```
Then, any text can be segmented as follows.
```python
from sentencex import segment
text = """
The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development.
"""
print(list(segment("en", text)))
```
The first argument is the language code; the second argument is the text to segment. The `segment` method returns an iterator over the identified sentences.
## Language support
The aim is to support every language that has a Wikipedia. Instead of falling back directly to English for languages not defined in the library, a fallback chain is used: the closest language that is defined in the library is chosen. Fallbacks are defined for ~244 languages.
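A fallback chain can be resolved as in the sketch below. The chains and supported set here are illustrative guesses, not the library's actual data.

```python
# Hypothetical fallback data: each language points at its next-closest language.
FALLBACKS = {
    "sco": "en",   # Scots -> English
    "pnb": "ur",   # Western Punjabi -> Urdu
    "ur": "en",
}
SUPPORTED = {"en", "ur"}  # languages with their own rules in this sketch

def resolve(lang):
    """Walk the fallback chain until a supported language is found."""
    seen = set()
    while lang not in SUPPORTED:
        if lang in seen or lang not in FALLBACKS:
            return "en"  # last-resort default
        seen.add(lang)
        lang = FALLBACKS[lang]
    return lang
```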
## Performance
Measured on the English Golden Rule Set (GRS). List items are exempted (e.g. `1. sentence 2. another sentence`).
The following libraries are used for benchmarking:
* mwtokenizer from https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools
* blingfire from https://github.com/microsoft/BlingFire
* nltk from https://pypi.org/project/nltk/
* pysbd from https://github.com/nipunsadvilkar/pySBD/
* spacy from https://github.com/explosion/spaCy
* stanza from https://github.com/stanfordnlp/stanza
* syntok from https://github.com/fnl/syntok
| Tokenizer library | English Golden Rule Set score | Speed (avg over 100 runs) in seconds |
|--------------------------|------------|-----------|
| sentencex_segment | 74.36 | 0.93 |
| mwtokenizer_tokenize | 30.77 | 1.54 |
| blingfire_tokenize | 89.74 | **0.27** |
| nltk_tokenize | 66.67 | 1.86 |
| pysbd_tokenize |**97.44** | 10.57 |
| spacy_tokenize | 61.54 | 2.45 |
| spacy_dep_tokenize | 74.36 | 138.93 |
| stanza_tokenize | 87.18 | 107.51 |
| syntok_tokenize | 79.49 | 4.72 |
## Thanks
* https://github.com/diasks2/pragmatic_segmenter for test cases. The English Golden Rule Set is also sourced from it.
## License
MIT license. See [LICENSE.txt](./LICENSE.txt).