| Field | Value |
| --- | --- |
| name | sentsplit |
| version | 1.0.8 |
| home_page | https://github.com/zaemyung/sentsplit |
| summary | A flexible sentence segmentation library using CRF model and regex rules |
| upload_time | 2024-02-26 17:20:50 |
| maintainer | |
| docs_url | None |
| author | Zae Myung Kim |
| requires_python | >=3.7 |
| license | |
| keywords | |
# sentsplit
A flexible sentence segmentation library using CRF model and regex rules
This library splits text paragraphs into sentences. It is built with the following desiderata:
- Be able to extend to new languages or "types" of sentences from data alone by learning a conditional random field (CRF) model.
- Also provide functionality to segment (or not to segment) lines based on regular expression rules (referred to as `segment_regexes` and `prevent_regexes`, respectively).
- Be able to reconstruct the exact original text paragraphs by joining the segmented sentences.
All in all, the library aims to benefit from the best of both worlds: data-driven and rule-based approaches.
You can try out the library [here](https://share.streamlit.io/zaemyung/sentsplit/main).
## Installation
Supports Python 3.7+
```bash
# stable
pip install sentsplit
# bleeding-edge
pip install git+https://github.com/zaemyung/sentsplit
```
Uses [python-crfsuite](https://github.com/scrapinghub/python-crfsuite), which, in turn, is built upon [CRFsuite](https://github.com/chokkan/crfsuite).
## Segmentation
### CLI
```bash
$ sentsplit segment -l lang_code -i /path/to/input_file # outputs to /path/to/input_file.segment
$ sentsplit segment -l lang_code -i /path/to/input_file -o /path/to/output_file
$ sentsplit segment -h # prints out the detailed usage
```
### Python Library
```python
from sentsplit.segment import SentSplit
# use default setting
sent_splitter = SentSplit(lang_code)
# override default setting - see "Features" for details
sent_splitter = SentSplit(lang_code, **overriding_kwargs)
# segment a single line
sentences = sent_splitter.segment(line)
# can also segment a list of lines
sentences = sent_splitter.segment([lines])
```
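For instance, a minimal run with the default English model might look like the hedged sketch below; the exact output, including leading spaces, depends on options such as `strip_spaces`:
```python
from sentsplit.segment import SentSplit

# default English model; output spacing preserves the original line
sent_splitter = SentSplit('en')
print(sent_splitter.segment('Hello there. How are you?'))
# plausible output (not verified): ['Hello there.', ' How are you?']
```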
## Features
The behavior of segmentation can be adjusted by the following arguments (a short usage sketch follows the list):
- `mincut`: a line is not segmented if its character-level length is smaller than `mincut`, preventing overly short sentences.
- `maxcut`: a line is ["heuristically"](https://github.com/zaemyung/sentsplit/blob/cce34e1ed372b6a79c739f42334c775581fc0de8/sentsplit/segment.py#L271) segmented if its character-level length is greater than or equal to `maxcut`, preventing overly long sentences.
- `strip_spaces`: trim any whitespace at the start and end of a sentence; does not guarantee exact reconstruction of the original passages.
- `handle_multiple_spaces`: substitute multiple spaces with a single space, perform segmentation, and recover the original spaces.
- `segment_regexes`: segment at either the `start` or `end` index of the matched group defined by the regex patterns.
- `prevent_regexes`: a line is not segmented at characters that fall within the matching group(s) captured by the regex patterns.
- `prevent_word_split`: a line is not segmented at characters inside a word, where word boundaries are marked by surrounding whitespace or punctuation;
may not be suitable for languages (e.g. Chinese, Japanese, Thai) that do not use spaces to separate words.
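These arguments can be overridden per instance, as shown in this minimal sketch; the values are illustrative, not the library defaults:
```python
from sentsplit.segment import SentSplit

# override a few of the documented arguments (illustrative values)
sent_splitter = SentSplit('en', mincut=10, maxcut=512, strip_spaces=True)
sentences = sent_splitter.segment('Hi.   This    line has extra spaces.')
```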
Segmentation is performed by first applying a trained CRF model to a line, where each character in the line is labelled as either `O` or `EOS`.
The `EOS` label indicates a position for segmentation.
Note that `prevent_regexes` is applied *after* `segment_regexes`, meaning that the segmentation positions captured by `segment_regexes` can be *overridden* by `prevent_regexes`.
### An Example
Suppose we want to segment sentences that end with a tilde (`~` or `〜`), which is often used in some East Asian languages to convey a sense of friendliness, silliness, whimsy, or flirtatiousness.
We can devise a regex such as `(?<=[다요])~+(?= )`, where `다` and `요` are the most common characters ending Korean sentences in the polite/formal form.
This regex can be added to `segment_regexes` to take effect:
```python
from copy import deepcopy
from sentsplit.config import ko_config
from sentsplit.segment import SentSplit
my_config = deepcopy(ko_config)
my_config['segment_regexes'].append({'name': 'tilde_ending', 'regex': r'(?<=[다요])~+(?= )', 'at': 'end'})
sent_splitter = SentSplit('ko', **my_config)
sent_splitter.segment('안녕하세요~ 만나서 정말 반갑습니다~~ 잘 부탁드립니다!')
# results with the regex: ['안녕하세요~', ' 만나서 정말 반갑습니다~~', ' 잘 부탁드립니다!']
# results without the regex: ['안녕하세요~ 만나서 정말 반갑습니다~~ 잘 부탁드립니다!']
```
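Conversely, `prevent_regexes` can suppress unwanted splits, and because it is applied after `segment_regexes`, it wins on conflicts. Below is a hedged sketch that guards periods in common English honorifics; it assumes an `en_config` is exported alongside `ko_config` and that `prevent_regexes` entries share the `{'name': ..., 'regex': ...}` shape:
```python
from copy import deepcopy
from sentsplit.config import en_config  # assumption: an English config exists like ko_config
from sentsplit.segment import SentSplit

my_config = deepcopy(en_config)
# prevent splitting at the period inside common honorifics (illustrative pattern)
my_config['prevent_regexes'].append({'name': 'honorifics', 'regex': r'\b(?:Dr|Mr|Mrs|Ms)\.'})
sent_splitter = SentSplit('en', **my_config)

sent_splitter.segment('Dr. Smith arrived. He was late.')
# plausible output (not verified): ['Dr. Smith arrived.', ' He was late.']
```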
To learn more about regular expressions, this [website](https://www.regular-expressions.info/tutorial.html) provides a good tutorial.
## Creating a New SentSplit Model
Creating a new model involves first training a CRF model on a dataset of clean sentences, followed by (optionally) adding or modifying the feature arguments for better performance.
### Training a CRF Model
First, prepare a corpus file where a single line corresponds to a single sentence.
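For example, such a corpus file can be produced with plain Python (a trivial sketch; the file name is illustrative):
```python
# write gold sentences one per line, as the trainer expects
gold_sentences = [
    'This is the first gold sentence.',
    'Is this the second one?',
]
with open('my_corpus.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(gold_sentences) + '\n')
```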
Then, a CRF model can be trained by running a command:
```bash
sentsplit train -l lang_code -c corpus_file_path # outputs to {corpus_file_path}.{lang_code}-{ngram}-gram-{YearMonthDate}.model
sentsplit train -h # prints out the detailed usage
```
The following arguments control the training setup:
- `ngram`: maximum length of the n-gram features used by the CRF model; default is `5`.
- `crf_max_iteration`: maximum number of CRF training iterations; default is `50`.
- `sample_min_length`: when preparing an input sample for the CRF model, gold sentences are concatenated to form a longer sample whose length exceeds `sample_min_length`; default is `450`.
- `depunctuation_ratio`: ratio of training samples with no punctuation in between the sentences.
May only be suitable for certain languages (e.g. "ko", "ja") that have specific sentence endings.
The top-`num_depunctuation_endings` most common endings are computed from `corpus`.
`1.0` means 100% of the training samples are depunctuated.
- `num_depunctuation_endings`: number of most common sentence endings to extract and use.
- `ending_length`: length of sentence endings, counted from the end and excluding any punctuation.
- `despace_ratio`: ratio of training samples without whitespace in between the sentences.
`1.0` means 100% of the training samples are despaced. For languages that rarely use whitespace, set this to a high value (close to `1.0`).
### Setting Configuration
Refer to the `base_config` in `config.py`. Append a new config to the file, adjusting the arguments as needed.
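For illustration, a new entry might be derived from `base_config` as in the hedged sketch below; the exact key set should be checked against `config.py`, and all names and values here are assumptions:
```python
from copy import deepcopy
from sentsplit.config import base_config  # assumption: base_config is importable

# hypothetical config for a new language; keys follow the documented
# feature arguments, but base_config may define more or different keys
xx_config = deepcopy(base_config)
xx_config.update({
    'model': 'path/to/my_corpus.xx-5-gram-20240226.model',  # illustrative model path
    'mincut': 10,
    'maxcut': 500,
})
```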
A newly created model can also be used directly in code by passing the kwargs accordingly:
```python
from sentsplit.segment import SentSplit
sent_splitter = SentSplit(lang_code, model='path/to/model', ...)
```
## Supported Languages
Currently supported languages are:
- English (`en`)
- French (`fr`)
- German (`de`)
- Italian (`it`)
- Japanese (`ja`)
- Korean (`ko`)
- Lithuanian (`lt`)
- Polish (`pl`)
- Portuguese (`pt`)
- Russian (`ru`)
- Simplified Chinese (`zh`)
- Turkish (`tr`)
Please note that the models for many of these languages are trained on openly available sentences gathered from bilingual corpora for machine translation.
The training sentences for European languages come mostly from the [Europarl](https://www.statmt.org/europarl/) corpora, so the default models may not handle colloquial sentences well.
If needed, we can either train a new CRF model with more gold sentences from the target domain or devise a set of domain-specific regex rules.
## License
`sentsplit` is licensed under the MIT license, as found in the [LICENSE](https://github.com/zaemyung/sentsplit/blob/main/LICENSE) file.