<img src="https://github.com/natasha/natasha-logos/blob/master/razdel.svg">
![CI](https://github.com/natasha/razdel/workflows/CI/badge.svg) [![codecov](https://codecov.io/gh/natasha/razdel/branch/master/graph/badge.svg)](https://codecov.io/gh/natasha/razdel)
`razdel` — a rule-based system for Russian sentence and word tokenization.
## Usage
```python
>>> from razdel import tokenize
>>> tokens = list(tokenize('Кружка-термос на 0.5л (50/64 см³, 516;...)'))
>>> tokens
[Substring(0, 13, 'Кружка-термос'),
Substring(14, 16, 'на'),
Substring(17, 20, '0.5'),
Substring(20, 21, 'л'),
Substring(22, 23, '(')
...]
>>> [_.text for _ in tokens]
['Кружка-термос', 'на', '0.5', 'л', '(', '50/64', 'см³', ',', '516', ';', '...', ')']
```
```python
>>> from razdel import sentenize
>>> text = '''
... - "Так в чем же дело?" - "Не ра-ду-ют".
... И т. д. и т. п. В общем, вся газета
... '''
>>> list(sentenize(text))
[Substring(1, 23, '- "Так в чем же дело?"'),
Substring(24, 40, '- "Не ра-ду-ют".'),
Substring(41, 56, 'И т. д. и т. п.'),
Substring(57, 76, 'В общем, вся газета')]
```
## Installation
`razdel` supports Python 3.5+ and PyPy 3.
```bash
$ pip install razdel
```
## Quality, performance
<a name="evalualtion"></a>
Unfortunately, there is no single correct way to split text into sentences and tokens. For example, one may split `«Как же так?! Захар...» — воскликнут Пронин.` into three sentences `["«Как же так?!", "Захар...»", "— воскликнут Пронин."]`, while `razdel` splits it into two: `["«Как же так?!", "Захар...» — воскликнут Пронин."]`. What would be the correct way to tokenize `т.е.`? One may split it into `т.|е.`; `razdel` splits it into `т|.|е|.`.
`razdel` tries to mimic the segmentation of four datasets: <a href="https://github.com/natasha/corus#load_ud_syntag">SynTagRus</a>, <a href="https://github.com/natasha/corus#load_morphoru_corpora">OpenCorpora</a>, <a href="https://github.com/natasha/corus#load_morphoru_gicrya">GICRYA</a> and <a href="https://github.com/natasha/corus#load_morphoru_rnc">RNC</a>. These datasets mainly consist of news and fiction, and `razdel` rules are optimized for these kinds of texts. The library may perform worse on other domains, such as social media, scientific articles or legal documents.
We measure the absolute number of errors. The tokenization task contains many trivial cases. For example, the text `чуть-чуть?!` is non-trivial: one may split it into `чуть|-|чуть|?|!`, while the correct tokenization is `чуть-чуть|?!`. Such examples are rare. The vast majority of cases are trivial; for example, the text `в 5 часов ...` is tokenized correctly even by Python's built-in `str.split` into `в| |5| |часов| |...`. Because of the large number of trivial cases, the overall quality of all segmenters is high, and it is hard to differentiate between, for example, 99.33%, 99.95% and 99.88%, so we report the absolute number of errors instead.
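The contrast between trivial and non-trivial cases can be reproduced with the standard library; the pattern below is only a rough stand-in for the `re.findall` baseline used in the tables:

```python
import re

# Non-trivial case: a naive regex splits the hyphenated word
# and the punctuation run apart
naive = re.findall(r'\w+|[^\w\s]', 'чуть-чуть?!')
assert naive == ['чуть', '-', 'чуть', '?', '!']
# while the etalon keeps them together: ['чуть-чуть', '?!']

# Trivial case: plain whitespace splitting is already correct
assert 'в 5 часов ...'.split() == ['в', '5', 'часов', '...']
```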
`errors` — number of errors. For example, if the etalon segmentation is `что-то|?` and the prediction is `что|-|то?`, the number of errors is 3: 1 for the missing split `то?` plus 2 for the extra splits `что|-|то`.
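One way to compute this count (a sketch, not necessarily naeval's exact implementation) is to compare the sets of internal split positions, assuming the tokens are adjacent in the text: every boundary present in only one of the two segmentations is one error.

```python
def segmentation_errors(etalon, predicted):
    """Count boundaries present in exactly one of the two segmentations."""
    def boundaries(tokens):
        # character offsets of the splits between consecutive tokens
        positions, offset = set(), 0
        for token in tokens[:-1]:
            offset += len(token)
            positions.add(offset)
        return positions
    return len(boundaries(etalon) ^ boundaries(predicted))

# etalon что-то|?, prediction что|-|то?: 1 missing + 2 extra splits
assert segmentation_errors(['что-то', '?'], ['что', '-', 'то?']) == 3
```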
`time` — total seconds taken.
`spacy_tokenize`, `aatimofeev` and the others are defined in <a href="https://github.com/natasha/naeval/blob/master/naeval/segment/models.py">naeval/segment/models.py</a>. The tables are computed in <a href="https://github.com/natasha/naeval/blob/master/scripts/segment/main.ipynb">segment/main.ipynb</a>.
### Tokens
<!--- token --->
<table border="0" class="dataframe">
<thead>
<tr>
<th></th>
<th colspan="2" halign="left">corpora</th>
<th colspan="2" halign="left">syntag</th>
<th colspan="2" halign="left">gicrya</th>
<th colspan="2" halign="left">rnc</th>
</tr>
<tr>
<th></th>
<th>errors</th>
<th>time</th>
<th>errors</th>
<th>time</th>
<th>errors</th>
<th>time</th>
<th>errors</th>
<th>time</th>
</tr>
</thead>
<tbody>
<tr>
<th>re.findall(\w+|\d+|\p+)</th>
<td>4161</td>
<td>0.5</td>
<td>2660</td>
<td>0.5</td>
<td>2277</td>
<td>0.4</td>
<td>7606</td>
<td>0.4</td>
</tr>
<tr>
<th>spacy</th>
<td>4388</td>
<td>6.2</td>
<td>2103</td>
<td>5.8</td>
<td><b>1740</b></td>
<td>4.1</td>
<td>4057</td>
<td>3.9</td>
</tr>
<tr>
<th>nltk.word_tokenize</th>
<td>14245</td>
<td>3.4</td>
<td>60893</td>
<td>3.3</td>
<td>13496</td>
<td>2.7</td>
<td>41485</td>
<td>2.9</td>
</tr>
<tr>
<th>mystem</th>
<td>4514</td>
<td>5.0</td>
<td>3153</td>
<td>4.7</td>
<td>2497</td>
<td>3.7</td>
<td><b>2028</b></td>
<td>3.9</td>
</tr>
<tr>
<th>mosestokenizer</th>
<td><b>1886</b></td>
<td><b>2.1</b></td>
<td><b>1330</b></td>
<td><b>1.9</b></td>
<td>1796</td>
<td><b>1.6</b></td>
<td><b>2123</b></td>
<td><b>1.7</b></td>
</tr>
<tr>
<th>segtok.word_tokenize</th>
<td>2772</td>
<td><b>2.3</b></td>
<td><b>1288</b></td>
<td><b>2.3</b></td>
<td>1759</td>
<td><b>1.8</b></td>
<td><b>1229</b></td>
<td><b>1.8</b></td>
</tr>
<tr>
<th>aatimofeev/spacy_russian_tokenizer</th>
<td>2930</td>
<td>48.7</td>
<td><b>719</b></td>
<td>51.1</td>
<td><b>678</b></td>
<td>39.5</td>
<td>2681</td>
<td>52.2</td>
</tr>
<tr>
<th>koziev/rutokenizer</th>
<td><b>2627</b></td>
<td><b>1.1</b></td>
<td>1386</td>
<td><b>1.0</b></td>
<td>2893</td>
<td><b>0.8</b></td>
<td>9411</td>
<td><b>0.9</b></td>
</tr>
<tr>
<th>razdel.tokenize</th>
<td><b>1510</b></td>
<td>2.9</td>
<td>1483</td>
<td>2.8</td>
<td><b>322</b></td>
<td>2.0</td>
<td>2124</td>
<td>2.2</td>
</tr>
</tbody>
</table>
<!--- token --->
### Sentences
<!--- sent --->
<table border="0" class="dataframe">
<thead>
<tr>
<th></th>
<th colspan="2" halign="left">corpora</th>
<th colspan="2" halign="left">syntag</th>
<th colspan="2" halign="left">gicrya</th>
<th colspan="2" halign="left">rnc</th>
</tr>
<tr>
<th></th>
<th>errors</th>
<th>time</th>
<th>errors</th>
<th>time</th>
<th>errors</th>
<th>time</th>
<th>errors</th>
<th>time</th>
</tr>
</thead>
<tbody>
<tr>
<th>re.split([.?!…])</th>
<td>20456</td>
<td>0.9</td>
<td>6576</td>
<td>0.6</td>
<td>10084</td>
<td>0.7</td>
<td>23356</td>
<td>1.0</td>
</tr>
<tr>
<th>segtok.split_single</th>
<td>19008</td>
<td>17.8</td>
<td>4422</td>
<td>13.4</td>
<td>159738</td>
<td><b>1.1</b></td>
<td>164218</td>
<td><b>2.8</b></td>
</tr>
<tr>
<th>mosestokenizer</th>
<td>41666</td>
<td><b>8.9</b></td>
<td>22082</td>
<td><b>5.7</b></td>
<td>12663</td>
<td>6.4</td>
<td>50560</td>
<td><b>7.4</b></td>
</tr>
<tr>
<th>nltk.sent_tokenize</th>
<td><b>16420</b></td>
<td><b>10.1</b></td>
<td><b>4350</b></td>
<td><b>5.3</b></td>
<td><b>7074</b></td>
<td><b>5.6</b></td>
<td><b>32534</b></td>
<td>8.9</td>
</tr>
<tr>
<th>deeppavlov/rusenttokenize</th>
<td><b>10192</b></td>
<td>10.9</td>
<td><b>1210</b></td>
<td>7.9</td>
<td><b>8910</b></td>
<td>6.8</td>
<td><b>21410</b></td>
<td><b>7.0</b></td>
</tr>
<tr>
<th>razdel.sentenize</th>
<td><b>9274</b></td>
<td><b>6.1</b></td>
<td><b>824</b></td>
<td><b>3.9</b></td>
<td><b>11414</b></td>
<td><b>4.5</b></td>
<td><b>10594</b></td>
<td>7.5</td>
</tr>
</tbody>
</table>
<!--- sent --->
## Support
- Chat — https://telegram.me/natural_language_processing
- Issues — https://github.com/natasha/razdel/issues
## Development
Test:
```bash
pip install -e .
pip install -r requirements/ci.txt
make test
make int # 2000 integration tests
```
Package:
```bash
make version
git push
git push --tags
make clean wheel upload
```
`mystem` errors on `syntag`:
```bash
# see naeval/data
cat syntag_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl diff --show moses_tokenize | less
```
Non-trivial token tests:
```bash
pv data/*_tokens.txt | razdel-ctl gen --recall | razdel-ctl diff space_tokenize > tests.txt
pv data/*_tokens.txt | razdel-ctl gen --precision | razdel-ctl diff re_tokenize >> tests.txt
```
Update integration tests:
```bash
cd razdel/tests/data/
pv sents.txt | razdel-ctl up sentenize > t; mv t sents.txt
```
`razdel` and `moses` diff:
```bash
cat data/*_tokens.txt | razdel-ctl sample 1000 | razdel-ctl gen | razdel-ctl up tokenize | razdel-ctl diff moses_tokenize | less
```
`razdel` performance:
```bash
cat data/*_tokens.txt | razdel-ctl sample 10000 | pv -l | razdel-ctl gen | razdel-ctl diff tokenize | wc -l
```