reldi-tokeniser

Name: reldi-tokeniser
Version: 1.0.3
Home page: https://www.github.com/clarinsi/reldi-tokeniser
Summary: Sentence splitting and tokenization for South Slavic languages
Upload time: 2023-07-27 11:08:37
Author: CLARIN.SI
License: apache-2.0

# reldi-tokeniser

A tokeniser developed inside the [ReLDI project](https://reldi.spur.uzh.ch). It currently supports five languages -- Slovene, Croatian, Serbian, Macedonian and Bulgarian -- and two modes -- standard and non-standard text.

## Usage

### Command line
```
$ echo 'kaj sad s tim.daj se nasmij ^_^.' | ./tokeniser.py hr -n
1.1.1.1-3	kaj
1.1.2.5-7	sad
1.1.3.9-9	s
1.1.4.11-13	tim
1.1.5.14-14	.

1.2.1.15-17	daj
1.2.2.19-20	se
1.2.3.22-27	nasmij
1.2.4.29-31	^_^
1.2.5.32-32	.


```
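
Each line of the default output pairs a positional index with a token: the four dot-separated fields are the paragraph, sentence and token counters, followed by the token's 1-based start and end character offsets in the input (e.g. `kaj` spans characters 1-3). A minimal sketch of parsing this format into records, with the field layout inferred from the example above:

```python
# Sketch: parse the default "index<TAB>token" output shown above.
# The index layout (par.sent.token.start-end) is inferred from the example.
def parse_default_output(output):
    tokens = []
    for line in output.splitlines():
        if not line.strip():
            continue  # blank lines separate sentences
        index, form = line.split('\t', 1)
        par, sent, tok, offsets = index.split('.')
        start, end = offsets.split('-')
        tokens.append({
            'paragraph': int(par), 'sentence': int(sent), 'token': int(tok),
            'start': int(start), 'end': int(end), 'form': form,
        })
    return tokens
```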

The language is a positional argument, while tokenisation of non-standard text, tagging and lemmatization of symbols and punctuation, and different output formats are enabled through optional flags.

```
$ python tokeniser.py -h
usage: tokeniser.py [-h] [-c] [-b] [-d] [-n] [-t] {sl,hr,sr,mk,bg}

Tokeniser for (non-)standard Slovene, Croatian, Serbian, Macedonian and
Bulgarian

positional arguments:
  {sl,hr,sr,mk,bg}   language of the text

optional arguments:
  -h, --help         show this help message and exit
  -c, --conllu       generates CONLLU output
  -b, --bert         generates BERT-compatible output
  -d, --document     passes through ConLL-U-style document boundaries
  -n, --nonstandard  invokes the non-standard mode
  -t, --tag          adds tags and lemmas to punctuations and symbols
```

### Python module
```python
# string mode
import reldi_tokeniser

text = 'kaj sad s tim.daj se nasmij ^_^.'

output = reldi_tokeniser.run(text, 'hr', nonstandard=True, tag=True)

# object mode
from reldi_tokeniser.tokeniser import ReldiTokeniser

reldi = ReldiTokeniser('hr', conllu=True, nonstandard=True, tag=True)
list_of_lines = [el + '\n' for el in text.split('\n')]
test = reldi.run(list_of_lines, mode='object')
```

The Python module has two mandatory parameters -- the text and the language. The optional parameters are `conllu`, `bert`, `document`, `nonstandard` and `tag`, mirroring the command-line flags.
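
A minimal end-to-end sketch for processing a file in string mode, assuming `run()` returns the formatted output as a string, as suggested by the example above (`input.txt` and `output.conllu` are placeholder file names):

```python
import reldi_tokeniser

# Tokenise a file into CoNLL-U, equivalent to `./tokeniser.py hr -n -c`.
# Assumes run() returns the output as a string, as in the string-mode example.
with open('input.txt', encoding='utf-8') as f:
    text = f.read()

output = reldi_tokeniser.run(text, 'hr', conllu=True, nonstandard=True)

with open('output.conllu', 'w', encoding='utf-8') as f:
    f.write(output)
```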

## CoNLL-U output

The tokeniser can also produce output in the CoNLL-U format (flag `-c`/`--conllu`). If the additional `-d`/`--document` flag is given, the tokeniser passes through lines starting with `# newdoc id =` to preserve document structure.

```
$ echo '# newdoc id = prvi
kaj sad s tim.daj se nasmij ^_^.
haha
# newdoc id = gidru
štaš' | ./tokeniser.py hr -n -c -d
# newdoc id = prvi
# newpar id = 1
# sent_id = 1.1
# text = kaj sad s tim.
1	kaj	_	_	_	_	_	_	_	_
2	sad	_	_	_	_	_	_	_	_
3	s	_	_	_	_	_	_	_	_
4	tim	_	_	_	_	_	_	_	SpaceAfter=No
5	.	_	_	_	_	_	_	_	SpaceAfter=No

# sent_id = 1.2
# text = daj se nasmij ^_^.
1	daj	_	_	_	_	_	_	_	_
2	se	_	_	_	_	_	_	_	_
3	nasmij	_	_	_	_	_	_	_	_
4	^_^	_	_	_	_	_	_	_	SpaceAfter=No
5	.	_	_	_	_	_	_	_	_

# newpar id = 2
# sent_id = 2.1
# text = haha
1	haha	_	_	_	_	_	_	_	_

# newdoc id = gidru
# newpar id = 1
# sent_id = 1.1
# text = štaš
1	štaš	_	_	_	_	_	_	_	_

```
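
The `SpaceAfter=No` entries in the MISC column make the tokenisation reversible: the surface text of each sentence can be reconstructed from the token lines alone. A minimal, library-free sketch relying only on the 10-column layout shown above:

```python
def detokenise(conllu_text):
    """Rebuild the surface text of each sentence from CoNLL-U token lines."""
    sentences, current = [], []
    for line in conllu_text.splitlines():
        if line.startswith('#'):
            continue  # comment lines such as # sent_id and # text
        if not line.strip():
            if current:  # a blank line ends the current sentence
                sentences.append(''.join(current).rstrip())
                current = []
            continue
        cols = line.split('\t')
        current.append(cols[1])             # FORM column
        if 'SpaceAfter=No' not in cols[9]:  # MISC column
            current.append(' ')
    if current:
        sentences.append(''.join(current).rstrip())
    return sentences
```
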
## Pre-tagging

The tokeniser can also pre-annotate text on the part-of-speech (UPOS and XPOS) and lemma level (flag `-t` or `--tag`) where the tokenisation regexes provide sufficient evidence (punctuation, mentions, hashtags, URLs, e-mail addresses, emoticons, emojis). The default output format in case of pre-tagging is CoNLL-U.

```
$ echo -e "kaj sad s tim.daj se nasmij ^_^. haha" | python tokeniser.py hr -n -t
# newpar id = 1
# sent_id = 1.1
# text = kaj sad s tim.
1	kaj	_	_	_	_	_	_	_	_
2	sad	_	_	_	_	_	_	_	_
3	s	_	_	_	_	_	_	_	_
4	tim	_	_	_	_	_	_	_	SpaceAfter=No
5	.	.	PUNCT	Z	_	_	_	_	SpaceAfter=No

# sent_id = 1.2
# text = daj se nasmij ^_^.
1	daj	_	_	_	_	_	_	_	_
2	se	_	_	_	_	_	_	_	_
3	nasmij	_	_	_	_	_	_	_	_
4	^_^	^_^	SYM	Xe	_	_	_	_	SpaceAfter=No
5	.	.	PUNCT	Z	_	_	_	_	_

# sent_id = 1.3
# text = haha
1	haha	_	_	_	_	_	_	_	_

```
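
Only tokens that the regexes identify unambiguously receive a tag; everything else keeps `_` and is left to a downstream tagger. A small sketch that extracts just the pre-tagged tokens (FORM, LEMMA, UPOS and XPOS are the 2nd-5th CoNLL-U columns):

```python
def pretagged_tokens(conllu_text):
    """Yield (form, lemma, upos, xpos) for tokens the tokeniser pre-tagged."""
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith('#'):
            continue  # skip blank lines and comments
        cols = line.split('\t')
        if len(cols) == 10 and cols[3] != '_':  # UPOS filled in => pre-tagged
            yield cols[1], cols[2], cols[3], cols[4]

# Applied to the example above, this yields:
#   ('.', '.', 'PUNCT', 'Z'), ('^_^', '^_^', 'SYM', 'Xe'), ('.', '.', 'PUNCT', 'Z')
```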

            
