classla

- Name: classla
- Version: 2.1
- Home page: https://github.com/clarinsi/classla-stanfordnlp.git
- Summary: Adapted Stanford NLP Python Library with improvements for specific languages.
- Upload time: 2023-08-08 08:36:29
- Author: CLARIN.SI
- Requires Python: >=3.6
- License: Apache License 2.0
- Keywords: natural-language-processing nlp natural-language-understanding stanford-nlp deep-learning clarinsi

# A [CLASSLA](http://www.clarin.si/info/k-centre/) Fork of [Stanza](https://github.com/stanfordnlp/stanza) for Processing Slovenian, Croatian, Serbian, Macedonian and Bulgarian

## Description

This pipeline allows for processing of standard Slovenian, Croatian, Serbian and Bulgarian on the levels of

- tokenization and sentence splitting
- part-of-speech tagging
- lemmatization
- dependency parsing
- named entity recognition

It also allows for (alpha) processing of standard Macedonian on the levels of 

- tokenization and sentence splitting
- part-of-speech tagging
- lemmatization

Finally, it allows for processing of non-standard (Internet) Slovenian, Croatian and Serbian on the same levels as standard language (all models are tailored to non-standard language except for dependency parsing where the standard module is used).

## Differences to Stanza

This pipeline differs from the original Stanza pipeline in the following ways:

- usage of language-specific rule-based tokenizers and sentence splitters, [obeliks](https://pypi.org/project/obeliks/) for standard Slovenian and [reldi-tokeniser](https://pypi.org/project/reldi-tokeniser/) for the remaining varieties and languages (Stanza uses inferior machine-learning-based tokenization and sentence splitting trained on UD data)
- default pre-tagging and pre-lemmatization on the level of tokenizers for the following phenomena: punctuation, symbol, e-mail, URL, mention, hashtag, emoticon, emoji (usage documented [here](https://github.com/clarinsi/classla/blob/master/README.superuser.md#usage-of-tagging-control-via-the-tokenizer))
- optional control of the tagger for Slovenian via an inflectional lexicon on the levels of XPOS, UPOS, FEATS (usage documented [here](https://github.com/clarinsi/classla/blob/master/README.superuser.md#usage-of-inflectional-lexicon))
- closed class handling depending on the usage of the options described in the last two bullets, as documented [here](https://github.com/clarinsi/classla/blob/master/README.closed_classes.md)
- usage of external inflectional lexicons for lookup lemmatization, seq2seq being used very infrequently on OOVs only (Stanza uses only UD training data for lookup lemmatization)
- morphosyntactic tagging models based on larger quantities of training data than is available in UD (training data that are morphosyntactically tagged, but not UD-parsed)
- lemmatization models based on larger quantities of training data than is available in UD (training data that are lemmatized, but not UD-parsed)
- optional JOS-project-based parsing of Slovenian (usage documented [here](https://github.com/clarinsi/classla/blob/master/README.superuser.md#jos-dependency-parsing-system))
- named entity recognition models for all languages except Macedonian (Stanza does not cover named entity recognition for any of the languages supported by classla)
- Macedonian models (Macedonian is not available in UD yet)
- non-standard models for Croatian, Slovenian, Serbian (there is no UD data for these varieties)

The above modifications led to some important improvements in the tool's performance compared to the original Stanza. For standard Slovenian, for example, running the full classla pipeline increases the sentence segmentation F1 score to 99.52 (94.29% error reduction), lemmatization to 99.17 (68.8% error reduction), XPOS tagging to 97.38 (46.75% error reduction), UPOS tagging to 98.69 (23.4% error reduction), and LAS to 92.05 (23.56% error reduction). See the official [Stanza performance](https://stanfordnlp.github.io/stanza/performance.html) (evaluated on different data splits) for comparison.
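
The README does not define "error reduction". A minimal sketch, assuming it means the relative reduction of the error rate (100 minus the score), shows how the quoted percentages relate to the scores. The function names are illustrative, and the baseline printed here is only the value implied by that assumption, not a figure reported by the authors.

```python
def relative_error_reduction(baseline: float, improved: float) -> float:
    """Percentage of the baseline error (100 - score) removed by the improved score."""
    baseline_error = 100.0 - baseline
    improved_error = 100.0 - improved
    return 100.0 * (baseline_error - improved_error) / baseline_error


def implied_baseline(improved: float, reduction_pct: float) -> float:
    """Invert the formula: the baseline score implied by a score and its error reduction."""
    improved_error = 100.0 - improved
    baseline_error = improved_error / (1.0 - reduction_pct / 100.0)
    return 100.0 - baseline_error


# Sentence segmentation figures quoted above: F1 99.52 with 94.29% error reduction.
baseline = implied_baseline(99.52, 94.29)
print(round(baseline, 2))
print(round(relative_error_reduction(baseline, 99.52), 2))
```

Under this reading, a large error reduction on an already high score corresponds to a fairly small absolute gain, which is why both numbers are reported.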

## Installation
### pip
We recommend that you install CLASSLA via pip, the Python package manager. To install, run:
```bash
pip install classla
```
This will also resolve all dependencies.

__NOTE TO EXISTING USERS__: Once you install this classla version, you will HAVE TO re-download the models. Previously downloaded models will no longer be used, so we suggest you delete them. Their default location is `~/classla_resources`.
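
Cleaning up the old models can be scripted with the standard library alone. The sketch below assumes the models live in the default location named above; `remove_old_models` is a hypothetical helper, not part of classla, and it defaults to a dry run because `shutil.rmtree` deletes irreversibly.

```python
import shutil
from pathlib import Path

# Default model location named in the note above.
DEFAULT_RESOURCES = Path.home() / "classla_resources"


def remove_old_models(resources_dir: Path = DEFAULT_RESOURCES, dry_run: bool = True) -> bool:
    """Return True if the model directory exists; delete it unless dry_run is set."""
    if not resources_dir.is_dir():
        return False
    if not dry_run:
        shutil.rmtree(resources_dir)  # irreversible: removes all downloaded models
    return True


if __name__ == "__main__":
    # Dry run: only reports whether old models are present.
    print(remove_old_models())
```

Call `remove_old_models(dry_run=False)` once you are sure the directory holds nothing you want to keep, then download the models again.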

## Running CLASSLA

### Getting started

To run the CLASSLA pipeline for the first time on standard Slovenian, follow these steps:

```
>>> import classla
>>> classla.download('sl')                            # download standard models for Slovenian, use hr for Croatian, sr for Serbian, bg for Bulgarian, mk for Macedonian
>>> nlp = classla.Pipeline('sl')                      # initialize the default Slovenian pipeline, use hr for Croatian, sr for Serbian, bg for Bulgarian, mk for Macedonian
>>> doc = nlp("France Prešeren je rojen v Vrbi.")     # run the pipeline
>>> print(doc.to_conll())                             # print the output in CoNLL-U format
# newpar id = 1
# sent_id = 1.1
# text = France Prešeren je rojen v Vrbi.
1	France	France	PROPN	Npmsn	Case=Nom|Gender=Masc|Number=Sing	4	nsubj	_	NER=B-PER
2	Prešeren	Prešeren	PROPN	Npmsn	Case=Nom|Gender=Masc|Number=Sing	1	flat:name	_	NER=I-PER
3	je	biti	AUX	Va-r3s-n	Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin	4	cop	_	NER=O
4	rojen	rojen	ADJ	Appmsnn	Case=Nom|Definite=Ind|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part	0	root	_	NER=O
5	v	v	ADP	Sl	Case=Loc	6	case	_	NER=O
6	Vrbi	Vrba	PROPN	Npfsl	Case=Loc|Gender=Fem|Number=Sing	4	obl	_	NER=B-LOC|SpaceAfter=No
7	.	.	PUNCT	Z	_	4	punct	_	NER=O

```
You can find examples of standard language processing for [Croatian](#example-of-standard-croatian), [Serbian](#example-of-standard-serbian), [Macedonian](#example-of-standard-macedonian) and [Bulgarian](#example-of-standard-bulgarian) at the end of this document.
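
The CoNLL-U text printed above is tab-separated with ten columns per token, so it can be post-processed without any extra library. The following is a minimal sketch of such a parser; `parse_conllu` and `CONLLU_FIELDS` are illustrative names, and the sample is a two-token excerpt of the output shown above.

```python
from typing import Dict, List

# The ten CoNLL-U columns, in order.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dictionaries."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):
            # Comment lines (# sent_id, # text, ...) are skipped.
            current.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


sample = (
    "# sent_id = 1.1\n"
    "5\tv\tv\tADP\tSl\tCase=Loc\t6\tcase\t_\tNER=O\n"
    "6\tVrbi\tVrba\tPROPN\tNpfsl\tCase=Loc|Gender=Fem|Number=Sing\t4\tobl\t_\tNER=B-LOC|SpaceAfter=No\n"
    "\n"
)
for token in parse_conllu(sample)[0]:
    print(token["form"], token["lemma"], token["upos"])
```

In practice the string returned by `doc.to_conll()` would be passed in place of `sample`.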

### Processing non-standard language

Processing non-standard Slovenian differs from the standard example above only by the additional argument ```type="nonstandard"```:

```
>>> import classla
>>> classla.download('sl', type='nonstandard')        # download non-standard models for Slovenian, use hr for Croatian and sr for Serbian
>>> nlp = classla.Pipeline('sl', type='nonstandard')  # initialize the default non-standard Slovenian pipeline, use hr for Croatian and sr for Serbian
>>> doc = nlp("kva smo mi zurali zadnje leto v zagrebu...")     # run the pipeline
>>> print(doc.to_conll())                             # print the output in CoNLL-U format
# newpar id = 1
# sent_id = 1.1
# text = kva smo mi zurali zadnje leto v zagrebu...
1	kva	kaj	PRON	Pq-nsa	Case=Acc|Gender=Neut|Number=Sing|PronType=Int	4	obj	_	NER=O
2	smo	biti	AUX	Va-r1p-n	Mood=Ind|Number=Plur|Person=1|Polarity=Pos|Tense=Pres|VerbForm=Fin	4	aux	_	NER=O
3	mi	jaz	PRON	Pp1mpn	Case=Nom|Gender=Masc|Number=Plur|Person=1|PronType=Prs	4	nsubj	_	NER=O
4	zurali	zurati	VERB	Vmpp-pm	Aspect=Imp|Gender=Masc|Number=Plur|VerbForm=Part	0	root	_	NER=O
5	zadnje	zadnji	ADJ	Agpnsa	Case=Acc|Degree=Pos|Gender=Neut|Number=Sing	6	amod	_	NER=O
6	leto	leto	NOUN	Ncnsa	Case=Acc|Gender=Neut|Number=Sing	4	obl	_	NER=O
7	v	v	ADP	Sl	Case=Loc	8	case	_	NER=O
8	zagrebu	Zagreb	PROPN	Npmsl	Case=Loc|Gender=Masc|Number=Sing	4	obl	_	NER=B-LOC|SpaceAfter=No
9	...	...	PUNCT	Z	_	4	punct	_	NER=O

```

You can find examples of non-standard language processing for [Croatian](#example-of-non-standard-croatian) and [Serbian](#example-of-non-standard-serbian)  at the end of this document.

For additional usage examples you can also consult the ```pipeline_demo.py``` file.

### Processing online texts

A special web processing mode for processing texts obtained from the internet can be activated with the ```type="web"``` argument:

```
>>> import classla
>>> classla.download('sl', type='web')        # download web models for Slovenian, use hr for Croatian and sr for Serbian
>>> nlp = classla.Pipeline('sl', type='web')  # initialize the default Slovenian web pipeline, use hr for Croatian and sr for Serbian
>>> doc = nlp("Kdor hoce prenesti preko racunalnika http://t.co/LwWyzs0cA0")     # run the pipeline
>>> print(doc.to_conll())                             # print the output in CoNLL-U format
# newpar id = 1
# sent_id = 1.1
# text = Kdor hoce prenesti preko racunalnika http://t.co/LwWyzs0cA0
1	Kdor	kdor	PRON	Pr-msn	Case=Nom|Gender=Masc|Number=Sing|PronType=Rel	2	nsubj	_	NER=O
2	hoce	hoteti	VERB	Vmpr3s-n	Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin	0	root	_	NER=O
3	prenesti	prenesti	VERB	Vmen	Aspect=Perf|VerbForm=Inf	2	xcomp	_	NER=O
4	preko	preko	ADP	Sg	Case=Gen	5	case	_	NER=O
5	racunalnika	računalnik	NOUN	Ncmsg	Case=Gen|Gender=Masc|Number=Sing	3	obl	_	NER=O
6	http://t.co/LwWyzs0cA0	http://t.co/LwWyzs0cA0	SYM	Xw	_	5	nmod	_	NER=O
```
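
The three examples above differ only in the language code and the `type` argument. A small sketch of a helper that rejects combinations this README does not describe is given below; the support table is transcribed from the examples and their comments, and `pipeline_arguments` is a hypothetical name, not a classla function.

```python
from typing import Dict, Tuple

# Language/type combinations mentioned in this README (illustrative, not read from classla).
SUPPORTED_LANGS = {
    "standard": {"sl", "hr", "sr", "bg", "mk"},
    "nonstandard": {"sl", "hr", "sr"},
    "web": {"sl", "hr", "sr"},
}


def pipeline_arguments(lang: str, variety: str = "standard") -> Tuple[str, Dict[str, str]]:
    """Return the language code and keyword arguments for download() and Pipeline()."""
    if variety not in SUPPORTED_LANGS:
        raise ValueError(f"unknown type {variety!r}")
    if lang not in SUPPORTED_LANGS[variety]:
        raise ValueError(f"{variety!r} models are not described for language {lang!r}")
    # The standard examples pass no type argument, so none is added here either.
    kwargs = {} if variety == "standard" else {"type": variety}
    return lang, kwargs


lang, kwargs = pipeline_arguments("hr", "nonstandard")
print(lang, kwargs)
# Intended use, following the calls shown in the examples above:
# classla.download(lang, **kwargs)
# nlp = classla.Pipeline(lang, **kwargs)
```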

## Processors

The CLASSLA pipeline is built from multiple units. These units are called processors. By default CLASSLA runs the ```tokenize```, ```ner```, ```pos```, ```lemma``` and ```depparse``` processors.

You can specify which processors CLASSLA should run via the ```processors``` argument. The following example performs tokenization, named entity recognition, part-of-speech tagging and lemmatization.

```python
>>> nlp = classla.Pipeline('sl', processors='tokenize,ner,pos,lemma')
```

Another popular option might be to perform tokenization, part-of-speech tagging, lemmatization and dependency parsing.

```python
>>> nlp = classla.Pipeline('sl', processors='tokenize,pos,lemma,depparse')
```
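
Each processor depends on others, as described in the subsections below. A hedged sketch of checking a `processors` string before building a pipeline follows; the prerequisite table is transcribed from this README, and `missing_prerequisites` is an illustrative helper rather than a classla function.

```python
from typing import Dict, List

# Prerequisites as stated in the processor descriptions of this README.
PREREQUISITES = {
    "tokenize": set(),
    "ner": {"tokenize"},
    "pos": {"tokenize"},
    "lemma": {"tokenize", "pos"},
    "depparse": {"tokenize", "pos"},
}


def missing_prerequisites(processors: str) -> Dict[str, List[str]]:
    """Map each requested processor to the prerequisites absent from the request."""
    requested = [name.strip() for name in processors.split(",") if name.strip()]
    unknown = [name for name in requested if name not in PREREQUISITES]
    if unknown:
        raise ValueError(f"unknown processors: {unknown}")
    missing = {}
    for name in requested:
        absent = sorted(PREREQUISITES[name] - set(requested))
        if absent:
            missing[name] = absent
    return missing


print(missing_prerequisites("tokenize,ner,pos,lemma"))  # {}
print(missing_prerequisites("tokenize,lemma"))          # {'lemma': ['pos']}
```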

### Tokenization and sentence splitting

The tokenization and sentence splitting processor ```tokenize``` is the first processor and is required for any further processing.

If your text is already tokenized, separate the tokens with spaces and pass the argument ```tokenize_pretokenized=True```.
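
A minimal sketch of preparing such input from token lists is shown below. Separating sentences with newlines follows the Stanza convention and is an assumption here, since this README only specifies spaces between tokens; `to_pretokenized_text` is an illustrative helper.

```python
from typing import List


def to_pretokenized_text(sentences: List[List[str]]) -> str:
    """Join tokens with spaces and sentences with newlines (newline rule is assumed)."""
    return "\n".join(" ".join(tokens) for tokens in sentences)


text = to_pretokenized_text([["France", "Prešeren", "je", "rojen", "v", "Vrbi", "."]])
print(text)
# Assumed usage, mirroring the Pipeline calls in this README:
# nlp = classla.Pipeline('sl', tokenize_pretokenized=True)
# doc = nlp(text)
```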

By default, CLASSLA uses rule-based tokenizers: [obeliks](https://github.com/clarinsi/obeliks) for the standard Slovenian pipeline and [reldi-tokeniser](https://github.com/clarinsi/reldi-tokeniser) in all other cases.

<!--Most important attributes:
```
tokenize_pretokenized   - [boolean]     ignores tokenizer
```-->

### Part-of-speech tagging

The POS tagging processor ```pos``` generates output that contains morphosyntactic descriptions following the [MULTEXT-East standard](http://nl.ijs.si/ME/V6/msd/html/msd.lang-specific.html), as well as universal part-of-speech tags and universal features following the [Universal Dependencies standard](https://universaldependencies.org). This processor requires the ```tokenize``` processor.
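
In the CoNLL-U output, the universal features appear as a single string such as `Case=Loc|Gender=Fem|Number=Sing`, with `_` when a token has none. A small sketch of turning that string into a dictionary follows; `parse_feats` is an illustrative name.

```python
from typing import Dict


def parse_feats(feats: str) -> Dict[str, str]:
    """Convert a CoNLL-U FEATS string into a feature -> value dictionary."""
    if feats == "_":
        # An underscore marks a token without features (e.g. punctuation).
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# FEATS value of the token "Vrbi" in the standard Slovenian example.
print(parse_feats("Case=Loc|Gender=Fem|Number=Sing"))
```

Multi-valued features such as `PronType=Int,Rel` are kept as a single string; splitting them on commas is left to the caller.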

<!--Most important attributes:
```
pos_model_path          - [str]         alternative path to model file
pos_pretrain_path       - [str]         alternative path to pretrain file
```-->

### Lemmatization

The lemmatization processor ```lemma``` will produce lemmas (basic forms) for each token in the input. It requires the usage of both the ```tokenize``` and ```pos``` processors.

### Dependency parsing

The dependency parsing processor ```depparse``` performs syntactic dependency parsing of sentences following the [Universal Dependencies formalism](https://universaldependencies.org/introduction.html#:~:text=Universal%20Dependencies%20(UD)%20is%20a,from%20a%20language%20typology%20perspective.). It requires the ```tokenize``` and ```pos``` processors.
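
The parse is encoded in the HEAD and DEPREL columns of the CoNLL-U output: each token points to the ID of its head, and the root points to 0. A minimal sketch of regrouping tokens by head is given below, using the values from the standard Slovenian example; `children_by_head` is an illustrative helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, str, int, str]  # (id, form, head, deprel)


def children_by_head(rows: Iterable[Row]) -> Dict[int, List[Tuple[int, str, str]]]:
    """Group tokens by the ID of their head; key 0 holds the root."""
    tree = defaultdict(list)
    for token_id, form, head, deprel in rows:
        tree[head].append((token_id, form, deprel))
    return dict(tree)


# ID, FORM, HEAD and DEPREL copied from the standard Slovenian example.
rows = [
    (1, "France", 4, "nsubj"),
    (2, "Prešeren", 1, "flat:name"),
    (3, "je", 4, "cop"),
    (4, "rojen", 0, "root"),
    (5, "v", 6, "case"),
    (6, "Vrbi", 4, "obl"),
    (7, ".", 4, "punct"),
]
tree = children_by_head(rows)
root_id, root_form, _ = tree[0][0]
print(root_form, [form for _, form, _ in tree[root_id]])
```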

### Named entity recognition

The named entity recognition processor ```ner``` identifies named entities in text following the [IOB2](https://en.wikipedia.org/wiki/Inside–outside–beginning_(tagging)) format. It requires only the ```tokenize``` processor.
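
In IOB2, a `B-` tag opens an entity, following `I-` tags with the same label continue it, and `O` marks tokens outside any entity. The sketch below merges such tags into entity spans, using the tags from the standard Slovenian example; `iob2_spans` is an illustrative helper, and it treats an `I-` tag without a matching open entity as outside.

```python
from typing import List, Tuple


def iob2_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge IOB2 tags into (label, text) entity spans."""
    spans, label, words = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity still open.
            if words:
                spans.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            words.append(token)
        else:
            # "O" or an I- tag that does not continue the open entity.
            if words:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if words:
        spans.append((label, " ".join(words)))
    return spans


tokens = ["France", "Prešeren", "je", "rojen", "v", "Vrbi", "."]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]
print(iob2_spans(tokens, tags))  # [('PER', 'France Prešeren'), ('LOC', 'Vrbi')]
```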

## Citing

If you use this tool, please cite the following paper:

```
@inproceedings{ljubesic-dobrovoljc-2019-neural,
    title = "What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of {S}lovenian, {C}roatian and {S}erbian",
    author = "Ljube{\v{s}}i{\'c}, Nikola  and
      Dobrovoljc, Kaja",
    booktitle = "Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-3704",
    doi = "10.18653/v1/W19-3704",
    pages = "29--34"
    }
```

## Croatian examples

### Example of standard Croatian 

```
>>> import classla
>>> nlp = classla.Pipeline('hr') # run classla.download('hr') beforehand if necessary
>>> doc = nlp("Ante Starčević rođen je u Velikom Žitniku.")
>>> print(doc.to_conll())
# newpar id = 1
# sent_id = 1.1
# text = Ante Starčević rođen je u Velikom Žitniku.
1	Ante	Ante	PROPN	Npmsn	Case=Nom|Gender=Masc|Number=Sing	3	nsubj	_	NER=B-PER
2	Starčević	Starčević	PROPN	Npmsn	Case=Nom|Gender=Masc|Number=Sing	1	flat	_	NER=I-PER
3	rođen	roditi	ADJ	Appmsnn	Case=Nom|Definite=Ind|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass	0	root	_	NER=O
4	je	biti	AUX	Var3s	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	3	aux	_	NER=O
5	u	u	ADP	Sl	Case=Loc	7	case	_	NER=O
6	Velikom	velik	ADJ	Agpmsly	Case=Loc|Definite=Def|Degree=Pos|Gender=Masc|Number=Sing	7	amod	_	NER=B-LOC
7	Žitniku	Žitnik	PROPN	Npmsl	Case=Loc|Gender=Masc|Number=Sing	3	obl	_	NER=I-LOC|SpaceAfter=No
8	.	.	PUNCT	Z	_	3	punct	_	NER=O

```
### Example of non-standard Croatian

```
>>> import classla
>>> nlp = classla.Pipeline('hr', type='nonstandard') # run classla.download('hr', type='nonstandard') beforehand if necessary
>>> doc = nlp("kaj sam ja tulumaril jucer u ljubljani...")
>>> print(doc.to_conll())
# newpar id = 1
# sent_id = 1.1
# text = kaj sam ja tulumaril jucer u ljubljani...
1	kaj	što	PRON	Pq3n-a	Case=Acc|Gender=Neut|PronType=Int,Rel	4	obj	_	NER=O
2	sam	biti	AUX	Var1s	Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin	4	aux	_	NER=O
3	ja	ja	PRON	Pp1-sn	Case=Nom|Number=Sing|Person=1|PronType=Prs	4	nsubj	_	NER=O
4	tulumaril	tulumariti	VERB	Vmp-sm	Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act	0	root	_	NER=O
5	jucer	jučer	ADV	Rgp	Degree=Pos	4	advmod	_	NER=O
6	u	u	ADP	Sl	Case=Loc	7	case	_	NER=O
7	ljubljani	Ljubljana	PROPN	Npfsl	Case=Loc|Gender=Fem|Number=Sing	4	obl	_	NER=B-LOC|SpaceAfter=No
8	...	...	PUNCT	Z	_	4	punct	_	NER=O

```

## Serbian examples

### Example of standard Serbian

```
>>> import classla
>>> nlp = classla.Pipeline('sr') # run classla.download('sr') beforehand if necessary
>>> doc = nlp("Slobodan Jovanović rođen je u Novom Sadu.")
>>> print(doc.to_conll())
# newpar id = 1
# sent_id = 1.1
# text = Slobodan Jovanović rođen je u Novom Sadu.
1	Slobodan	Slobodan	PROPN	Npmsn	Case=Nom|Gender=Masc|Number=Sing	3	nsubj	_	NER=B-PER
2	Jovanović	Jovanović	PROPN	Npmsn	Case=Nom|Gender=Masc|Number=Sing	1	flat	_	NER=I-PER
3	rođen	roditi	ADJ	Appmsnn	Case=Nom|Definite=Ind|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass	0	root	_	NER=O
4	je	biti	AUX	Var3s	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	3	aux	_	NER=O
5	u	u	ADP	Sl	Case=Loc	7	case	_	NER=O
6	Novom	nov	ADJ	Agpmsly	Case=Loc|Definite=Def|Degree=Pos|Gender=Masc|Number=Sing	7	amod	_	NER=B-LOC
7	Sadu	Sad	PROPN	Npmsl	Case=Loc|Gender=Masc|Number=Sing	3	obl	_	NER=I-LOC|SpaceAfter=No
8	.	.	PUNCT	Z	_	3	punct	_	NER=O

```

### Example of non-standard Serbian

```
>>> import classla
>>> nlp = classla.Pipeline('sr', type='nonstandard') # run classla.download('sr', type='nonstandard') beforehand if necessary
>>> doc = nlp("ne mogu da verujem kakvo je zezanje bilo prosle godine u zagrebu...")
>>> print(doc.to_conll())
# newpar id = 1
# sent_id = 1.1
# text = ne mogu da verujem kakvo je zezanje bilo prosle godine u zagrebu...
1	ne	ne	PART	Qz	Polarity=Neg	2	advmod	_	NER=O
2	mogu	moći	VERB	Vmr1s	Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin	0	root	_	NER=O
3	da	da	SCONJ	Cs	_	4	mark	_	NER=O
4	verujem	verovati	VERB	Vmr1s	Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin	2	xcomp	_	NER=O
5	kakvo	kakav	DET	Pi-nsn	Case=Nom|Gender=Neut|Number=Sing|PronType=Int,Rel	4	ccomp	_	NER=O
6	je	biti	AUX	Var3s	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	5	aux	_	NER=O
7	zezanje	zezanje	NOUN	Ncnsn	Case=Nom|Gender=Neut|Number=Sing	8	nsubj	_	NER=O
8	bilo	biti	AUX	Vap-sn	Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act	5	cop	_	NER=O
9	prosle	prošli	ADJ	Agpfsgy	Case=Gen|Definite=Def|Degree=Pos|Gender=Fem|Number=Sing	10	amod	_	NER=O
10	godine	godina	NOUN	Ncfsg	Case=Gen|Gender=Fem|Number=Sing	8	obl	_	NER=O
11	u	u	ADP	Sl	Case=Loc	12	case	_	NER=O
12	zagrebu	Zagreb	PROPN	Npmsl	Case=Loc|Gender=Masc|Number=Sing	8	obl	_	NER=B-LOC|SpaceAfter=No
13	...	...	PUNCT	Z	_	2	punct	_	NER=O

```

## Bulgarian examples

### Example of standard Bulgarian

```
>>> import classla
>>> nlp = classla.Pipeline('bg') # run classla.download('bg') beforehand if necessary
>>> doc = nlp("Алеко Константинов е роден в Свищов.")
>>> print(doc.to_conll())
# newpar id = 1
# sent_id = 1.1
# text = Алеко Константинов е роден в Свищов.
1	Алеко	алеко	PROPN	Npmsi	Definite=Ind|Gender=Masc|Number=Sing	4	nsubj:pass	_	NER=B-PER
2	Константинов	константинов	PROPN	Hmsi	Definite=Ind|Gender=Masc|Number=Sing	1	flat	_	NER=I-PER
3	е	съм	AUX	Vxitf-r3s	Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	4	aux:pass	_	NER=O
4	роден	родя-(се)	VERB	Vpptcv--smi	Aspect=Perf|Definite=Ind|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass	0	root	_	NER=O
5	в	в	ADP	R	_	6	case	_	NER=O
6	Свищов	свищов	PROPN	Npmsi	Definite=Ind|Gender=Masc|Number=Sing	4	iobj	_	NER=B-LOC|SpaceAfter=No
7	.	.	PUNCT	punct	_	4	punct	_	NER=O

```

## Macedonian examples

### Example of standard Macedonian

```
>>> import classla
>>> nlp = classla.Pipeline('mk') # run classla.download('mk') beforehand if necessary
>>> doc = nlp('Крсте Петков Мисирков е роден во Постол.')
>>> print(doc.to_conll())
# newpar id = 1
# sent_id = 1.1
# text = Крсте Петков Мисирков е роден во Постол.
1	Крсте	Крсте	PROPN	Npmsnn	Case=Nom|Definite=Ind|Gender=Masc|Number=Sing	_	_	_	_
2	Петков	Петков	PROPN	Npmsnn	Case=Nom|Definite=Ind|Gender=Masc|Number=Sing	_	_	_	_
3	Мисирков	Мисирков	PROPN	Npmsnn	Case=Nom|Definite=Ind|Gender=Masc|Number=Sing	_	_	_	_
4	е	сум	AUX	Vapip3s-n	Aspect=Prog|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres	_	_	_	_
5	роден	роден	ADJ	Ap-ms-n	Definite=Ind|Gender=Masc|Number=Sing|VerbForm=Part	_	_	_	_
6	во	во	ADP	Sps	AdpType=Prep	_	_	_	_
7	Постол	Постол	PROPN	Npmsnn	Case=Nom|Definite=Ind|Gender=Masc|Number=Sing	_	_	_	SpaceAfter=No
8	.	.	PUNCT	Z	_	_	_	_	_

```

## Training instructions

[Training instructions](https://github.com/clarinsi/classla-stanfordnlp/blob/master/README.train.md)

## Superuser instructions

[Superuser instructions](https://github.com/clarinsi/classla-stanfordnlp/blob/master/README.superuser.md)

            

\u0435 \u0440\u043e\u0434\u0435\u043d \u0432 \u0421\u0432\u0438\u0449\u043e\u0432.\n1\t\u0410\u043b\u0435\u043a\u043e\t\u0430\u043b\u0435\u043a\u043e\tPROPN\tNpmsi\tDefinite=Ind|Gender=Masc|Number=Sing\t4\tnsubj:pass\t_\tNER=B-PER\n2\t\u041a\u043e\u043d\u0441\u0442\u0430\u043d\u0442\u0438\u043d\u043e\u0432\t\u043a\u043e\u043d\u0441\u0442\u0430\u043d\u0442\u0438\u043d\u043e\u0432\tPROPN\tHmsi\tDefinite=Ind|Gender=Masc|Number=Sing\t1\tflat\t_\tNER=I-PER\n3\t\u0435\t\u0441\u044a\u043c\tAUX\tVxitf-r3s\tAspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act\t4\taux:pass\t_\tNER=O\n4\t\u0440\u043e\u0434\u0435\u043d\t\u0440\u043e\u0434\u044f-(\u0441\u0435)\tVERB\tVpptcv--smi\tAspect=Perf|Definite=Ind|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass\t0\troot\t_\tNER=O\n5\t\u0432\t\u0432\tADP\tR\t_\t6\tcase\t_\tNER=O\n6\t\u0421\u0432\u0438\u0449\u043e\u0432\t\u0441\u0432\u0438\u0449\u043e\u0432\tPROPN\tNpmsi\tDefinite=Ind|Gender=Masc|Number=Sing\t4\tiobj\t_\tNER=B-LOC|SpaceAfter=No\n7\t.\t.\tPUNCT\tpunct\t_\t4\tpunct\t_\tNER=O\n\n```\n\n## Macedonian examples\n\n### Example of standard Macedonian\n\n```\n>>> import classla\n>>> nlp = classla.Pipeline('mk') # run classla.download('mk') beforehand if necessary\n>>> doc = nlp('\u041a\u0440\u0441\u0442\u0435 \u041f\u0435\u0442\u043a\u043e\u0432 \u041c\u0438\u0441\u0438\u0440\u043a\u043e\u0432 \u0435 \u0440\u043e\u0434\u0435\u043d \u0432\u043e \u041f\u043e\u0441\u0442\u043e\u043b.')\n>>> print(doc.to_conll())\n# newpar id = 1\n# sent_id = 1.1\n# text = \u041a\u0440\u0441\u0442\u0435 \u041f\u0435\u0442\u043a\u043e\u0432 \u041c\u0438\u0441\u0438\u0440\u043a\u043e\u0432 \u0435 \u0440\u043e\u0434\u0435\u043d \u0432\u043e 
\u041f\u043e\u0441\u0442\u043e\u043b.\n1\t\u041a\u0440\u0441\u0442\u0435\t\u041a\u0440\u0441\u0442\u0435\tPROPN\tNpmsnn\tCase=Nom|Definite=Ind|Gender=Masc|Number=Sing\t_\t_\t_\t_\n2\t\u041f\u0435\u0442\u043a\u043e\u0432\t\u041f\u0435\u0442\u043a\u043e\u0432\tPROPN\tNpmsnn\tCase=Nom|Definite=Ind|Gender=Masc|Number=Sing\t_\t_\t_\t_\n3\t\u041c\u0438\u0441\u0438\u0440\u043a\u043e\u0432\t\u041c\u0438\u0441\u0438\u0440\u043a\u043e\u0432\tPROPN\tNpmsnn\tCase=Nom|Definite=Ind|Gender=Masc|Number=Sing\t_\t_\t_\t_\n4\t\u0435\t\u0441\u0443\u043c\tAUX\tVapip3s-n\tAspect=Prog|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres\t_\t_\t_\t_\n5\t\u0440\u043e\u0434\u0435\u043d\t\u0440\u043e\u0434\u0435\u043d\tADJ\tAp-ms-n\tDefinite=Ind|Gender=Masc|Number=Sing|VerbForm=Part\t_\t_\t_\t_\n6\t\u0432\u043e\t\u0432\u043e\tADP\tSps\tAdpType=Prep\t_\t_\t_\t_\n7\t\u041f\u043e\u0441\u0442\u043e\u043b\t\u041f\u043e\u0441\u0442\u043e\u043b\tPROPN\tNpmsnn\tCase=Nom|Definite=Ind|Gender=Masc|Number=Sing\t_\t_\t_\tSpaceAfter=No\n8\t.\t.\tPUNCT\tZ\t_\t_\t_\t_\t_\n\n```\n\n## Training instructions\n\n[Training instructions](https://github.com/clarinsi/classla-stanfordnlp/blob/master/README.train.md)\n\n## Superuser instructions\n\n[Superuser instructions](https://github.com/clarinsi/classla-stanfordnlp/blob/master/README.superuser.md)\n",
    "bugtrack_url": null,
    "license": "Apache License 2.0",
    "summary": "Adapted Stanford NLP Python Library with improvements for specific languages.",
    "version": "2.1",
    "project_urls": {
        "Homepage": "https://github.com/clarinsi/classla-stanfordnlp.git"
    },
    "split_keywords": [
        "natural-language-processing",
        "nlp",
        "natural-language-understanding",
        "stanford-nlp",
        "deep-learning",
        "clarinsi"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "40cbada701cb682dcaef0dfb4374bb6248d38c7a8bc925170db6458ba422b437",
                "md5": "f63a1774450ab03fc9bad2d0098fb2d4",
                "sha256": "4a193a5e4c38add87626ef26aa86d395afedb9592e792dbf932e49ba04109e4a"
            },
            "downloads": -1,
            "filename": "classla-2.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f63a1774450ab03fc9bad2d0098fb2d4",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 249791,
            "upload_time": "2023-08-08T08:36:26",
            "upload_time_iso_8601": "2023-08-08T08:36:26.640176Z",
            "url": "https://files.pythonhosted.org/packages/40/cb/ada701cb682dcaef0dfb4374bb6248d38c7a8bc925170db6458ba422b437/classla-2.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "663184d3b08a173fc1dbabd7184d9d5d69ca19ba15485a417dd7a932f6bf23ee",
                "md5": "d0f2053206265da765eefd18dc6e925a",
                "sha256": "8337d6a271d14da6fc1ef4c8642265a1521a378f1a49f49ee671ce4d9887a66f"
            },
            "downloads": -1,
            "filename": "classla-2.1.tar.gz",
            "has_sig": false,
            "md5_digest": "d0f2053206265da765eefd18dc6e925a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 199120,
            "upload_time": "2023-08-08T08:36:29",
            "upload_time_iso_8601": "2023-08-08T08:36:29.091900Z",
            "url": "https://files.pythonhosted.org/packages/66/31/84d3b08a173fc1dbabd7184d9d5d69ca19ba15485a417dd7a932f6bf23ee/classla-2.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-08-08 08:36:29",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "clarinsi",
    "github_project": "classla-stanfordnlp",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "classla"
}
        