# reldi-tokeniser
A tokeniser developed within the [ReLDI project](https://reldi.spur.uzh.ch). It currently supports five languages -- Slovene, Croatian, Serbian, Macedonian and Bulgarian -- and two modes -- standard and non-standard text.
## Usage
### Command line
```
$ echo 'kaj sad s tim.daj se nasmij ^_^.' | ./tokeniser.py hr -n
1.1.1.1-3 kaj
1.1.2.5-7 sad
1.1.3.9-9 s
1.1.4.11-13 tim
1.1.5.14-14 .

1.2.1.15-17 daj
1.2.2.19-20 se
1.2.3.22-27 nasmij
1.2.4.29-31 ^_^
1.2.5.32-32 .
```
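Each line of the default output pairs an index field with the token. Judging from the example above, the index appears to encode `paragraph.sentence.token.start-end`, with 1-based character offsets into the input text. A minimal sketch of unpacking such a line (the field interpretation is an assumption inferred from the example, not documented API):

```python
# Hedged sketch: unpack one line of the default output shown above.
# Assumption: the index encodes paragraph.sentence.token.start-end,
# with 1-based character offsets into the input text.
def parse_default_line(line):
    index, token = line.split(None, 1)       # index field, then the token
    par, sent, tok, span = index.split('.')
    start, end = span.split('-')
    return int(par), int(sent), int(tok), int(start), int(end), token

# e.g. parse_default_line('1.1.4.11-13\ttim') -> (1, 1, 4, 11, 13, 'tim')
```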
Language is a positional argument, while tokenisation of non-standard text, tagging and lemmatisation of symbols and punctuation, and different output formats are controlled by optional flags.
```
$ python tokeniser.py -h
usage: tokeniser.py [-h] [-c] [-b] [-d] [-n] [-t] {sl,hr,sr,mk,bg}
Tokeniser for (non-)standard Slovene, Croatian, Serbian, Macedonian and
Bulgarian
positional arguments:
{sl,hr,sr,mk,bg} language of the text
optional arguments:
-h, --help show this help message and exit
-c, --conllu generates CONLLU output
-b, --bert generates BERT-compatible output
-d, --document passes through ConLL-U-style document boundaries
-n, --nonstandard invokes the non-standard mode
-t, --tag adds tags and lemmas to punctuations and symbols
```
### Python module
```python
# string mode
import reldi_tokeniser
text = 'kaj sad s tim.daj se nasmij ^_^.'
output = reldi_tokeniser.run(text, 'hr', nonstandard=True, tag=True)
# object mode
from reldi_tokeniser.tokeniser import ReldiTokeniser
reldi = ReldiTokeniser('hr', conllu=True, nonstandard=True, tag=True)
list_of_lines = [el + '\n' for el in text.split('\n')]
test = reldi.run(list_of_lines, mode='object')
```
The Python module has two mandatory parameters -- text and language. The optional parameters are `conllu`, `bert`, `document`, `nonstandard` and `tag`.
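As a usage note, the string-mode call returns the tokenised output directly, so it can be written straight to a file. A minimal sketch, assuming `output` is the same text the command-line tool prints (this is an assumption, not stated above):

```python
# Minimal sketch, assuming string mode returns the same text that the
# command-line tool prints (an assumption, not documented here).
import reldi_tokeniser

text = 'kaj sad s tim.daj se nasmij ^_^.'
output = reldi_tokeniser.run(text, 'hr', conllu=True, nonstandard=True)

with open('tokenised.conllu', 'w', encoding='utf-8') as f:
    f.write(output)
```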
## CoNLL-U output
The tokeniser can also produce CoNLL-U output (flag `-c`/`--conllu`). If the additional `-d`/`--document` flag is given, the tokeniser passes through lines starting with `# newdoc id =` to preserve document structure.
```
$ echo '# newdoc id = prvi
kaj sad s tim.daj se nasmij ^_^.
haha
# newdoc id = gidru
štaš' | ./tokeniser.py hr -n -c -d
# newdoc id = prvi
# newpar id = 1
# sent_id = 1.1
# text = kaj sad s tim.
1 kaj _ _ _ _ _ _ _ _
2 sad _ _ _ _ _ _ _ _
3 s _ _ _ _ _ _ _ _
4 tim _ _ _ _ _ _ _ SpaceAfter=No
5 . _ _ _ _ _ _ _ SpaceAfter=No

# sent_id = 1.2
# text = daj se nasmij ^_^.
1 daj _ _ _ _ _ _ _ _
2 se _ _ _ _ _ _ _ _
3 nasmij _ _ _ _ _ _ _ _
4 ^_^ _ _ _ _ _ _ _ SpaceAfter=No
5 . _ _ _ _ _ _ _ _

# newpar id = 2
# sent_id = 2.1
# text = haha
1 haha _ _ _ _ _ _ _ _

# newdoc id = gidru
# newpar id = 1
# sent_id = 1.1
# text = štaš
1 štaš _ _ _ _ _ _ _ _
```
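Since CoNLL-U is plain tab-separated text with a blank line terminating each sentence, output like the above can be consumed without any extra dependency. A small sketch that recovers the token forms per sentence (written against the sample output, not against any API of this package):

```python
# Hedged sketch: group the FORM column (column 2) of CoNLL-U output into
# sentences; a blank line ends a sentence, '#' lines are comments.
# CoNLL-U columns are tab-separated.
def conllu_sentences(conllu_text):
    sentences, tokens = [], []
    for line in conllu_text.splitlines():
        if not line.strip():                 # blank line ends the sentence
            if tokens:
                sentences.append(tokens)
                tokens = []
        elif not line.startswith('#'):       # skip comment/metadata lines
            tokens.append(line.split('\t')[1])
    if tokens:                               # flush a trailing sentence
        sentences.append(tokens)
    return sentences
```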
## Pre-tagging
The tokeniser can also pre-annotate text on the part-of-speech (UPOS and XPOS) and lemma level (flag `-t`/`--tag`) where the tokenisation regexes provide sufficient evidence (punctuation, mentions, hashtags, URLs, e-mail addresses, emoticons, emojis). When pre-tagging, the default output format is CoNLL-U.
```
$ echo -e "kaj sad s tim.daj se nasmij ^_^.\nhaha" | python tokeniser.py hr -n -t
# newpar id = 1
# sent_id = 1.1
# text = kaj sad s tim.
1 kaj _ _ _ _ _ _ _ _
2 sad _ _ _ _ _ _ _ _
3 s _ _ _ _ _ _ _ _
4 tim _ _ _ _ _ _ _ SpaceAfter=No
5 . . PUNCT Z _ _ _ _ SpaceAfter=No

# sent_id = 1.2
# text = daj se nasmij ^_^.
1 daj _ _ _ _ _ _ _ _
2 se _ _ _ _ _ _ _ _
3 nasmij _ _ _ _ _ _ _ _
4 ^_^ ^_^ SYM Xe _ _ _ _ SpaceAfter=No
5 . . PUNCT Z _ _ _ _ _

# sent_id = 1.3
# text = haha
1 haha _ _ _ _ _ _ _ _
```
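A companion sketch for pulling out only the rows the tokeniser pre-tagged, i.e. those where the UPOS column is filled in rather than left as `_` (again written against the sample output, not a package API):

```python
# Hedged sketch: collect (form, lemma, UPOS, XPOS) for pre-tagged rows only,
# i.e. token lines whose UPOS column is not the placeholder '_'.
def pretagged_tokens(conllu_text):
    hits = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith('#'):
            continue                          # skip blanks and comments
        cols = line.split('\t')
        if len(cols) > 4 and cols[3] != '_':  # UPOS is the fourth column
            hits.append((cols[1], cols[2], cols[3], cols[4]))
    return hits

# On the sample above this yields ('.', '.', 'PUNCT', 'Z') twice
# and ('^_^', '^_^', 'SYM', 'Xe').
```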