segments
========
[![Build Status](https://github.com/cldf/segments/workflows/tests/badge.svg)](https://github.com/cldf/segments/actions?query=workflow%3Atests)
[![codecov](https://codecov.io/gh/cldf/segments/branch/master/graph/badge.svg)](https://codecov.io/gh/cldf/segments)
[![PyPI](https://img.shields.io/pypi/v/segments.svg)](https://pypi.org/project/segments)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1051157.svg)](https://doi.org/10.5281/zenodo.1051157)
The segments package provides Unicode Standard tokenization routines and orthography segmentation,
implementing the linear algorithm described in the orthography profile specification from
*The Unicode Cookbook* (Moran and Cysouw 2018 [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1296780.svg)](https://doi.org/10.5281/zenodo.1296780)).
Command line usage
------------------
Create a text file:
```
$ echo "aäaaöaaüaa" > text.txt
```
Now create an initial grapheme profile from the text:
```
$ cat text.txt | segments profile
Grapheme frequency mapping
a 7 a
ä 1 ä
ü 1 ü
ö 1 ö
```
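Under the hood, `segments profile` counts how often each grapheme cluster occurs in the input and proposes an identity mapping, most frequent first. A minimal pure-Python sketch of that counting step (using plain characters rather than full Unicode grapheme clusters, and space-separated columns for display; the real profile format is tab-separated):

```python
from collections import Counter

def build_profile(text):
    # Count each character and map it to itself, most frequent first.
    counts = Counter(text.strip())
    rows = ["Grapheme frequency mapping"]
    for grapheme, freq in counts.most_common():
        rows.append(f"{grapheme} {freq} {grapheme}")
    return "\n".join(rows)

print(build_profile("aäaaöaaüaa"))  # "a" occurs 7 times, the umlauts once each
```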
Write the profile to a file:
```
$ cat text.txt | segments profile > profile.prf
```
Edit the profile, adding a rule that maps the grapheme cluster `aa` to `x`:
```
$ more profile.prf
Grapheme frequency mapping
aa 0 x
a 7 a
ä 1 ä
ü 1 ü
ö 1 ö
```
Now tokenize the text without a profile:
```
$ cat text.txt | segments tokenize
a ä a a ö a a ü a a
```
And with the profile:
```
$ cat text.txt | segments --profile=profile.prf tokenize
a ä aa ö aa ü aa

$ cat text.txt | segments --mapping=mapping --profile=profile.prf tokenize
a ä x ö x ü x
```
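Profile-based tokenization is, in essence, greedy longest-match: at each position the longest grapheme listed in the profile wins, and unmatched characters pass through. A self-contained sketch of that idea (not the package's actual implementation; the profile here is simplified to a plain dict from grapheme to mapping):

```python
def tokenize(text, profile, mapped=False):
    # Try profile rules longest-first at each position (greedy longest match).
    rules = sorted(profile.items(), key=lambda kv: len(kv[0]), reverse=True)
    out, i = [], 0
    while i < len(text):
        for grapheme, mapping in rules:
            if text.startswith(grapheme, i):
                out.append(mapping if mapped else grapheme)
                i += len(grapheme)
                break
        else:
            out.append(text[i])  # unmatched character passes through as-is
            i += 1
    return " ".join(out)

profile = {"aa": "x", "a": "a", "ä": "ä", "ö": "ö", "ü": "ü"}
print(tokenize("aäaaöaaüaa", profile))               # a ä aa ö aa ü aa
print(tokenize("aäaaöaaüaa", profile, mapped=True))  # a ä x ö x ü x
```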
API
---
```python
>>> from segments import Profile, Tokenizer
>>> t = Tokenizer()
>>> t('abcd')
'a b c d'
>>> prf = Profile({'Grapheme': 'ab', 'mapping': 'x'}, {'Grapheme': 'cd', 'mapping': 'y'})
>>> print(prf)
Grapheme mapping
ab x
cd y
>>> t = Tokenizer(profile=prf)
>>> t('abcd')
'ab cd'
>>> t('abcd', column='mapping')
'x y'
```
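Without a profile, the tokenizer splits NFD-normalized text into Unicode grapheme clusters, which is why `ä` comes out as a single segment in the CLI example above. A rough sketch of that idea, keeping combining marks (Unicode category `Mn`) attached to their base character (the real grapheme-cluster rules, per UAX #29, cover many more cases):

```python
import unicodedata

def graphemes(text):
    # Decompose to NFD, then glue each combining mark (category Mn)
    # onto the preceding base character.
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        if clusters and unicodedata.category(ch) == "Mn":
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(" ".join(graphemes("aäa")))  # three clusters: a, a + combining diaeresis, a
```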