![PySBD logo](artifacts/pysbd_logo.png?raw=true "pysbd logo")
# pySBD: Python Sentence Boundary Disambiguation (SBD)
![Python package](https://github.com/nipunsadvilkar/pySBD/workflows/Python%20package/badge.svg) [![codecov](https://codecov.io/gh/nipunsadvilkar/pySBD/branch/master/graph/badge.svg)](https://codecov.io/gh/nipunsadvilkar/pySBD) [![License](https://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](https://github.com/nipunsadvilkar/pySBD/blob/master/LICENSE) [![PyPi](https://img.shields.io/pypi/v/pysbd?color=blue&logo=pypi&logoColor=white)](https://pypi.python.org/pypi/pysbd) [![GitHub](https://img.shields.io/github/v/release/nipunsadvilkar/pySBD.svg?include_prereleases&logo=github&style=flat)](https://github.com/nipunsadvilkar/pySBD)
pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection module that works out of the box.
This project is a direct port of the Ruby gem [Pragmatic Segmenter](https://github.com/diasks2/pragmatic_segmenter), which provides rule-based sentence boundary detection.
![pysbd_code](artifacts/pysbd_code.png?raw=true "pysbd_code")
## Highlights
**'PySBD: Pragmatic Sentence Boundary Disambiguation'**, a short research paper, was accepted at the 2nd Workshop for Natural Language Processing Open Source Software (NLP-OSS) at EMNLP 2020. </br>
**Research Paper:**</br>
https://arxiv.org/abs/2010.09657</br>
**[Recorded Talk:](https://slideslive.com/38939754)**</br>
[![pysbd_talk](artifacts/pysbd_talk.png)](https://slideslive.com/38939754)</br>
**Poster:**</br>
[![name](artifacts/pysbd_poster.png)](artifacts/pysbd_poster.png)
## Install
**Python**

    pip install pysbd
## Usage
- Segment plain text with `pysbd.Segmenter`. The example below uses English (`language="en"`); the paper cited under Citation reports out-of-the-box support for 22 languages.
```python
import pysbd
text = "My name is Jonas E. Smith. Please turn to p. 55."
seg = pysbd.Segmenter(language="en", clean=False)
print(seg.segment(text))
# ['My name is Jonas E. Smith.', 'Please turn to p. 55.']
```
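To see why rules are needed here, it may help to compare against a naive splitter. The sketch below uses only the standard library; `naive_split`, `toy_rule_split`, and `ABBREVIATIONS` are hypothetical names made up for illustration and are not part of pySBD, whose actual rule set is far more extensive.

```python
import re

text = "My name is Jonas E. Smith. Please turn to p. 55."

# Hypothetical toy abbreviation list, for illustration only (not pySBD's rules).
ABBREVIATIONS = {"p", "pp", "dr", "mr", "mrs"}


def naive_split(text):
    # Split after '.', '!' or '?' whenever whitespace follows.
    return re.split(r"(?<=[.!?])\s+", text)


def toy_rule_split(text):
    # Re-join naive chunks whose final period belongs to an initial
    # (single letter) or to a known abbreviation.
    sentences, current = [], []
    for chunk in naive_split(text):
        if not chunk:
            continue
        current.append(chunk)
        last_word = chunk.split()[-1].rstrip(".").lower()
        is_initial = len(last_word) == 1 and last_word.isalpha()
        if chunk.endswith(".") and (is_initial or last_word in ABBREVIATIONS):
            continue  # not a real sentence boundary, keep accumulating
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


print(naive_split(text))
# ['My name is Jonas E.', 'Smith.', 'Please turn to p.', '55.']
print(toy_rule_split(text))
# ['My name is Jonas E. Smith.', 'Please turn to p. 55.']
```

The naive version breaks at every period, while even two small rules recover the expected segmentation for this input. Note that the toy version rejoins chunks with a single space, so it does not preserve the original whitespace.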
- Use `pysbd` as a [spaCy](https://spacy.io/usage/processing-pipelines) pipeline component (recommended).</br>See the example [pysbd\_as\_spacy\_component.py](https://github.com/nipunsadvilkar/pySBD/blob/master/examples/pysbd_as_spacy_component.py).
- Use `pysbd` through spaCy [entry points](https://spacy.io/usage/saving-loading#entry-points-components):
```python
import spacy
from pysbd.utils import PySBDFactory
nlp = spacy.blank('en')
# Explicitly add the component to the pipeline
# (recommended: makes it easier to see what is going on).
# Note: passing a component object to add_pipe is the spaCy 2.x API.
nlp.add_pipe(PySBDFactory(nlp))
# or you can use it implicitly with keyword
# pysbd = nlp.create_pipe('pysbd')
# nlp.add_pipe(pysbd)
doc = nlp('My name is Jonas E. Smith. Please turn to p. 55.')
print(list(doc.sents))
# [My name is Jonas E. Smith., Please turn to p. 55.]
```
## Contributing
If you want to contribute a new feature or language support, or have found text that pySBD segments incorrectly, please read [CONTRIBUTING.md](https://github.com/nipunsadvilkar/pySBD/blob/master/CONTRIBUTING.md) for details and then follow these steps:
1. Fork it ( https://github.com/nipunsadvilkar/pySBD/fork )
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create a new Pull Request
## Citation
If you use the `pysbd` package in your projects or research, please cite [PySBD: Pragmatic Sentence Boundary Disambiguation](https://www.aclweb.org/anthology/2020.nlposs-1.15).
```
@inproceedings{sadvilkar-neumann-2020-pysbd,
title = "{P}y{SBD}: Pragmatic Sentence Boundary Disambiguation",
author = "Sadvilkar, Nipun and
Neumann, Mark",
booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.nlposs-1.15",
pages = "110--114",
abstract = "We present a rule-based sentence boundary disambiguation Python package that works out-of-the-box for 22 languages. We aim to provide a realistic segmenter which can provide logical sentences even when the format and domain of the input text is unknown. In our work, we adapt the Golden Rules Set (a language specific set of sentence boundary exemplars) originally implemented as a ruby gem pragmatic segmenter which we ported to Python with additional improvements and functionality. PySBD passes 97.92{\%} of the Golden Rule Set examplars for English, an improvement of 25{\%} over the next best open source Python tool.",
}
```
## Credit
This project wouldn't be possible without the great work done by the [Pragmatic Segmenter](https://github.com/diasks2/pragmatic_segmenter) team.