| Field | Value |
| --- | --- |
| name | sentsplit |
| version | 1.0.8 |
| home_page | https://github.com/zaemyung/sentsplit |
| summary | A flexible sentence segmentation library using CRF model and regex rules |
| upload_time | 2024-02-26 17:20:50 |
| maintainer | |
| docs_url | None |
| author | Zae Myung Kim |
| requires_python | >=3.7 |
| license | |
| keywords | |
# sentsplit
A flexible sentence segmentation library using CRF model and regex rules
This library splits text paragraphs into sentences. It is built with the following desiderata:
- Be able to extend to new languages or "types" of sentences from data alone by learning a conditional random field (CRF) model.
- Also provide functionality to segment (or not to segment) lines based on regular expression rules (referred to as `segment_regexes` and `prevent_regexes`, respectively).
- Be able to reconstruct the exact original text paragraphs by joining the segmented sentences.
All in all, the library aims to benefit from the best of both worlds: data-driven and rule-based approaches.
You can try out the library [here](https://share.streamlit.io/zaemyung/sentsplit/main).
## Installation
Supports Python 3.7+
```bash
# stable
pip install sentsplit
# bleeding-edge
pip install git+https://github.com/zaemyung/sentsplit
```
Uses [python-crfsuite](https://github.com/scrapinghub/python-crfsuite), which, in turn, is built upon [CRFsuite](https://github.com/chokkan/crfsuite).
## Segmentation
### CLI
```bash
$ sentsplit segment -l lang_code -i /path/to/input_file # outputs to /path/to/input_file.segment
$ sentsplit segment -l lang_code -i /path/to/input_file -o /path/to/output_file
$ sentsplit segment -h # prints out the detailed usage
```
### Python Library
```python
from sentsplit.segment import SentSplit
# use default setting
sent_splitter = SentSplit(lang_code)
# override default setting - see "Features" for details
sent_splitter = SentSplit(lang_code, **overriding_kwargs)
# segment a single line
sentences = sent_splitter.segment(line)
# can also segment a list of lines
sentences = sent_splitter.segment([lines])
```
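For instance, a minimal run with the default English model might look like the hedged sketch below; the exact output, including leading spaces, depends on options such as `strip_spaces`:
```python
from sentsplit.segment import SentSplit

# default English model; output spacing preserves the original line
sent_splitter = SentSplit('en')
print(sent_splitter.segment('Hello there. How are you?'))
# plausible output (not verified): ['Hello there.', ' How are you?']
```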
## Features
The behavior of segmentation can be adjusted by the following arguments (a short usage sketch follows the list):
- `mincut`: a line is not segmented if its character-level length is smaller than `mincut`, preventing overly short sentences.
- `maxcut`: a line is ["heuristically"](https://github.com/zaemyung/sentsplit/blob/cce34e1ed372b6a79c739f42334c775581fc0de8/sentsplit/segment.py#L271) segmented if its character-level length is greater than or equal to `maxcut`, preventing overly long sentences.
- `strip_spaces`: trim any whitespace at the start and end of a sentence; does not guarantee exact reconstruction of the original passages.
- `handle_multiple_spaces`: substitute multiple spaces with a single space, perform segmentation, and recover the original spaces.
- `segment_regexes`: segment at either the `start` or `end` index of the matched group defined by the regex patterns.
- `prevent_regexes`: a line is not segmented at characters that fall within the matching group(s) captured by the regex patterns.
- `prevent_word_split`: a line is not segmented at characters inside a word, where word boundaries are marked by surrounding whitespace or punctuation;
may not be suitable for languages (e.g. Chinese, Japanese, Thai) that do not use spaces to separate words.
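These arguments can be overridden per instance, as shown in this minimal sketch; the values are illustrative, not the library defaults:
```python
from sentsplit.segment import SentSplit

# override a few of the documented arguments (illustrative values)
sent_splitter = SentSplit('en', mincut=10, maxcut=512, strip_spaces=True)
sentences = sent_splitter.segment('Hi.   This    line has extra spaces.')
```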
Segmentation is performed by first applying a trained CRF model to a line, where each character in the line is labelled as either `O` or `EOS`.
The `EOS` label indicates a position for segmentation.
Note that `prevent_regexes` is applied *after* `segment_regexes`, meaning that the segmentation positions captured by `segment_regexes` can be *overridden* by `prevent_regexes`.
### An Example
Suppose we want to segment sentences that end with a tilde (`~` or `〜`), which is often used in some East Asian languages to convey a sense of friendliness, silliness, whimsy, or flirtatiousness.
We can devise a regex such as `(?<=[다요])~+(?= )`, where `다` and `요` are the most common characters ending Korean sentences in the polite/formal form.
This regex can be added to `segment_regexes` to take effect:
```python
from copy import deepcopy
from sentsplit.config import ko_config
from sentsplit.segment import SentSplit
my_config = deepcopy(ko_config)
my_config['segment_regexes'].append({'name': 'tilde_ending', 'regex': r'(?<=[다요])~+(?= )', 'at': 'end'})
sent_splitter = SentSplit('ko', **my_config)
sent_splitter.segment('안녕하세요~ 만나서 정말 반갑습니다~~ 잘 부탁드립니다!')
# results with the regex: ['안녕하세요~', ' 만나서 정말 반갑습니다~~', ' 잘 부탁드립니다!']
# results without the regex: ['안녕하세요~ 만나서 정말 반갑습니다~~ 잘 부탁드립니다!']
```
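Conversely, `prevent_regexes` can suppress unwanted splits, and because it is applied after `segment_regexes`, it wins on conflicts. Below is a hedged sketch that guards periods in common English honorifics; it assumes an `en_config` is exported alongside `ko_config` and that `prevent_regexes` entries share the `{'name': ..., 'regex': ...}` shape:
```python
from copy import deepcopy
from sentsplit.config import en_config  # assumption: an English config exists like ko_config
from sentsplit.segment import SentSplit

my_config = deepcopy(en_config)
# prevent splitting at the period inside common honorifics (illustrative pattern)
my_config['prevent_regexes'].append({'name': 'honorifics', 'regex': r'\b(?:Dr|Mr|Mrs|Ms)\.'})
sent_splitter = SentSplit('en', **my_config)

sent_splitter.segment('Dr. Smith arrived. He was late.')
# plausible output (not verified): ['Dr. Smith arrived.', ' He was late.']
```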
To learn more about regular expressions, this [website](https://www.regular-expressions.info/tutorial.html) provides a good tutorial.
## Creating a New SentSplit Model
Creating a new model involves first training a CRF model on a dataset of clean sentences, followed by (optionally) adding or modifying the feature arguments for better performance.
### Training a CRF Model
First, prepare a corpus file where a single line corresponds to a single sentence.
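For example, such a corpus file can be produced with plain Python (a trivial sketch; the file name is illustrative):
```python
# write gold sentences one per line, as the trainer expects
gold_sentences = [
    'This is the first gold sentence.',
    'Is this the second one?',
]
with open('my_corpus.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(gold_sentences) + '\n')
```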
Then, a CRF model can be trained by running a command:
```bash
sentsplit train -l lang_code -c corpus_file_path # outputs to {corpus_file_path}.{lang_code}-{ngram}-gram-{YearMonthDate}.model
sentsplit train -h # prints out the detailed usage
```
The following arguments control the training setup:
- `ngram`: maximum length of the n-gram features used by the CRF model; default is `5`.
- `crf_max_iteration`: maximum number of CRF training iterations; default is `50`.
- `sample_min_length`: when preparing an input sample for the CRF model, gold sentences are concatenated to form a longer sample whose length exceeds `sample_min_length`; default is `450`.
- `depunctuation_ratio`: ratio of training samples with no punctuation in between the sentences.
May only be suitable for certain languages (e.g. "ko", "ja") that have specific sentence endings.
The top-`num_depunctuation_endings` most common endings are computed from `corpus`.
`1.0` means 100% of the training samples are depunctuated.
- `num_depunctuation_endings`: number of most common sentence endings to extract and use.
- `ending_length`: length of sentence endings, counted from the end and excluding any punctuation.
- `despace_ratio`: ratio of training samples without whitespace in between the sentences.
`1.0` means 100% of the training samples are despaced. For languages that rarely use whitespace, set this to a high value (close to `1.0`).
### Setting Configuration
Refer to the `base_config` in `config.py`. Append a new config to the file, adjusting the arguments as needed.
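For illustration, a new entry might be derived from `base_config` as in the hedged sketch below; the exact key set should be checked against `config.py`, and all names and values here are assumptions:
```python
from copy import deepcopy
from sentsplit.config import base_config  # assumption: base_config is importable

# hypothetical config for a new language; keys follow the documented
# feature arguments, but base_config may define more or different keys
xx_config = deepcopy(base_config)
xx_config.update({
    'model': 'path/to/my_corpus.xx-5-gram-20240226.model',  # illustrative model path
    'mincut': 10,
    'maxcut': 500,
})
```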
A newly created model can also be used directly in code by passing the kwargs accordingly:
```python
from sentsplit.segment import SentSplit
sent_splitter = SentSplit(lang_code, model='path/to/model', ...)
```
## Supported Languages
Currently supported languages are:
- English (`en`)
- French (`fr`)
- German (`de`)
- Italian (`it`)
- Japanese (`ja`)
- Korean (`ko`)
- Lithuanian (`lt`)
- Polish (`pl`)
- Portuguese (`pt`)
- Russian (`ru`)
- Simplified Chinese (`zh`)
- Turkish (`tr`)
Please note that the models for many of these languages are trained on openly available sentences gathered from bilingual corpora for machine translation.
The training sentences for European languages come mostly from the [Europarl](https://www.statmt.org/europarl/) corpora, so the default models may not handle colloquial sentences well.
If needed, we can either train a new CRF model with more gold sentences from the target domain or devise a set of domain-specific regex rules.
## License
`sentsplit` is licensed under the MIT license, as found in the [LICENSE](https://github.com/zaemyung/sentsplit/blob/main/LICENSE) file.