segments
========
[![Build Status](https://github.com/cldf/segments/workflows/tests/badge.svg)](https://github.com/cldf/segments/actions?query=workflow%3Atests)
[![codecov](https://codecov.io/gh/cldf/segments/branch/master/graph/badge.svg)](https://codecov.io/gh/cldf/segments)
[![PyPI](https://img.shields.io/pypi/v/segments.svg)](https://pypi.org/project/segments)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1051157.svg)](https://doi.org/10.5281/zenodo.1051157)
The segments package provides Unicode Standard tokenization routines and orthography segmentation,
implementing the linear algorithm described in the orthography profile specification from
*The Unicode Cookbook* (Moran and Cysouw 2018 [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1296780.svg)](https://doi.org/10.5281/zenodo.1296780)).
Command line usage
------------------
Create a text file:
```
$ echo "aäaaöaaüaa" > text.txt
```
Now create an initial grapheme profile from the text:
```
$ cat text.txt | segments profile
Grapheme frequency mapping
a 7 a
ä 1 ä
ü 1 ü
ö 1 ö
```
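Under the hood, `segments profile` counts how often each grapheme cluster occurs in the input and proposes an identity mapping, most frequent first. A minimal pure-Python sketch of that counting step (using plain characters rather than full Unicode grapheme clusters, and space-separated columns for display; the real profile format is tab-separated):

```python
from collections import Counter

def build_profile(text):
    # Count each character and map it to itself, most frequent first.
    counts = Counter(text.strip())
    rows = ["Grapheme frequency mapping"]
    for grapheme, freq in counts.most_common():
        rows.append(f"{grapheme} {freq} {grapheme}")
    return "\n".join(rows)

print(build_profile("aäaaöaaüaa"))  # "a" occurs 7 times, the umlauts once each
```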
Write the profile to a file:
```
$ cat text.txt | segments profile > profile.prf
```
Edit the profile, adding a rule that maps the grapheme cluster `aa` to `x`:
```
$ more profile.prf
Grapheme frequency mapping
aa 0 x
a 7 a
ä 1 ä
ü 1 ü
ö 1 ö
```
Now tokenize the text without a profile:
```
$ cat text.txt | segments tokenize
a ä a a ö a a ü a a
```
And with the profile:
```
$ cat text.txt | segments --profile=profile.prf tokenize
a ä aa ö aa ü aa

$ cat text.txt | segments --mapping=mapping --profile=profile.prf tokenize
a ä x ö x ü x
```
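Profile-based tokenization is, in essence, greedy longest-match: at each position the longest grapheme listed in the profile wins, and unmatched characters pass through. A self-contained sketch of that idea (not the package's actual implementation; the profile here is simplified to a plain dict from grapheme to mapping):

```python
def tokenize(text, profile, mapped=False):
    # Try profile rules longest-first at each position (greedy longest match).
    rules = sorted(profile.items(), key=lambda kv: len(kv[0]), reverse=True)
    out, i = [], 0
    while i < len(text):
        for grapheme, mapping in rules:
            if text.startswith(grapheme, i):
                out.append(mapping if mapped else grapheme)
                i += len(grapheme)
                break
        else:
            out.append(text[i])  # unmatched character passes through as-is
            i += 1
    return " ".join(out)

profile = {"aa": "x", "a": "a", "ä": "ä", "ö": "ö", "ü": "ü"}
print(tokenize("aäaaöaaüaa", profile))               # a ä aa ö aa ü aa
print(tokenize("aäaaöaaüaa", profile, mapped=True))  # a ä x ö x ü x
```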
API
---
```python
>>> from segments import Profile, Tokenizer
>>> t = Tokenizer()
>>> t('abcd')
'a b c d'
>>> prf = Profile({'Grapheme': 'ab', 'mapping': 'x'}, {'Grapheme': 'cd', 'mapping': 'y'})
>>> print(prf)
Grapheme mapping
ab x
cd y
>>> t = Tokenizer(profile=prf)
>>> t('abcd')
'ab cd'
>>> t('abcd', column='mapping')
'x y'
```
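Without a profile, the tokenizer splits NFD-normalized text into Unicode grapheme clusters, which is why `ä` comes out as a single segment in the CLI example above. A rough sketch of that idea, keeping combining marks (Unicode category `Mn`) attached to their base character (the real grapheme-cluster rules, per UAX #29, cover many more cases):

```python
import unicodedata

def graphemes(text):
    # Decompose to NFD, then glue each combining mark (category Mn)
    # onto the preceding base character.
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        if clusters and unicodedata.category(ch) == "Mn":
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(" ".join(graphemes("aäa")))  # three clusters: a, a + combining diaeresis, a
```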