corus


- Name: corus
- Version: 0.10.0
- Home page: https://github.com/natasha/corus
- Summary: Links to russian corpora, functions for loading and parsing
- Upload time: 2023-07-24 08:54:26
- Author: Alexander Kukushkin
- License: MIT
- Keywords: corpora, russian, nlp, datasets
            
<img src="https://github.com/natasha/natasha-logos/blob/master/corus.svg">

![CI](https://github.com/natasha/corus/actions/workflows/test.yml/badge.svg)

Links to publicly available Russian corpora + code for loading and parsing. <a href="#reference">20+ datasets, 350Gb+ of text</a>.

## Usage

For example, let's use the <a href="https://github.com/yutkin/Lenta.Ru-News-Dataset">dump of lenta.ru by @yutkin</a>. Manually download the archive (the link is in the <a href="#reference">Reference</a> section):
```bash
wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz
```

Use `corus` to load the data:

```python
>>> from corus import load_lenta

>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)

LentaRecord(
    url='https://lenta.ru/news/2018/12/14/cancer/',
    title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
    text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
    topic='Россия',
    tags='Общество'
)
```

Iterate over texts:

```python
>>> records = load_lenta(path)
>>> for record in records:
...     text = record.text
...     ...

```
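Since the loader is consumed with `next` and a `for` loop, it can be treated like any other Python iterator, so a small sample can be inspected before committing to a full pass. The sketch below is an illustration rather than part of `corus`: `topic_counts` and `FakeRecord` are hypothetical names, and the only assumption about real records is that they expose a `topic` attribute, as `LentaRecord` does above.

```python
from collections import Counter, namedtuple
from itertools import islice


def topic_counts(records, limit=1000):
    """Count `topic` values over at most `limit` records (hypothetical helper)."""
    sample = islice(records, limit)  # stop early instead of exhausting the iterator
    return Counter(record.topic for record in sample)


# Stand-in records so the sketch runs without downloading the dump;
# with corus installed, pass load_lenta(path) instead of `demo`.
FakeRecord = namedtuple('FakeRecord', ['title', 'topic'])
demo = iter([
    FakeRecord('a', 'Россия'),
    FakeRecord('b', 'Мир'),
    FakeRecord('c', 'Россия'),
])
counts = topic_counts(demo, limit=2)
```

With the real dataset the call would presumably look like `topic_counts(load_lenta(path), limit=10000)`; the `limit` value is arbitrary and only bounds how many records are read.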

For links to other datasets and their loaders see the <a href="#reference">Reference</a> section.

## Documentation

Materials are in Russian:

* <a href="https://natasha.github.io/corus">Corus page on natasha.github.io</a> 
* <a href="https://youtu.be/-7XT_U6hVvk?t=2758">Corus section of Datafest 2020 talk</a>

## Install

`corus` supports Python 3.5+ and PyPy 3.

```bash
$ pip install corus
```

## Reference

<!--- metas --->
<table>
<tr>
<th>Dataset</th>
<th>API <code>from corus import</code></th>
<th>Tags</th>
<th>Texts</th>
<th>Uncompressed</th>
<th>Description</th>
</tr>
<tr>
<td>
<a href="https://github.com/yutkin/Lenta.Ru-News-Dataset">Lenta.ru</a>
</td>
<td colspan="5">
</td>
</tr>
<tr>
<td>
Lenta.ru v1.0
</td>
<td>
<a name="load_lenta"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_lenta">load_lenta</a></code>
<a href="#load_lenta"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
739&nbsp;351
</td>
<td align="right">
1.66 Gb
</td>
<td>
<code>wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz</code>
</td>
</tr>
<tr>
<td>
Lenta.ru v1.1+
</td>
<td>
<a name="load_lenta2"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_lenta2">load_lenta2</a></code>
<a href="#load_lenta2"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
800&nbsp;975
</td>
<td align="right">
1.94 Gb
</td>
<td>
<code>wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.1/lenta-ru-news.csv.bz2</code>
</td>
</tr>
<tr>
<td>
<a href="https://russe.nlpub.org/downloads/">Lib.rus.ec</a>
</td>
<td>
<a name="load_librusec"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_librusec">load_librusec</a></code>
<a href="#load_librusec"><code>#</code></a>
</td>
<td>
<code>fiction</code>
</td>
<td align="right">
301&nbsp;871
</td>
<td align="right">
144.92 Gb
</td>
<td>
Dump of lib.rus.ec prepared for RUSSE workshop
</br>
</br>
<code>wget http://panchenko.me/data/russe/librusec_fb2.plain.gz</code>
</td>
</tr>
<tr>
<td>
<a href="https://github.com/RossiyaSegodnya/ria_news_dataset">Rossiya Segodnya</a>
</td>
<td>
<a name="load_ria_raw"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ria_raw">load_ria_raw</a></code>
<a href="#load_ria_raw"><code>#</code></a>
</br>
<a name="load_ria"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ria">load_ria</a></code>
<a href="#load_ria"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
1&nbsp;003&nbsp;869
</td>
<td align="right">
3.70 Gb
</td>
<td>
<code>wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz</code>
</td>
</tr>
<tr>
<td>
<a href="http://study.mokoron.com/">Mokoron Russian Twitter Corpus</a>
</td>
<td>
<a name="load_mokoron"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_mokoron">load_mokoron</a></code>
<a href="#load_mokoron"><code>#</code></a>
</td>
<td>
<code>social</code>
<code>sentiment</code>
</td>
<td align="right">
17&nbsp;633&nbsp;417
</td>
<td align="right">
1.86 Gb
</td>
<td>
Russian Twitter sentiment markup
</br>
</br>
Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql
</td>
</tr>
<tr>
<td>
<a href="https://dumps.wikimedia.org/">Wikipedia</a>
</td>
<td>
<a name="load_wiki"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_wiki">load_wiki</a></code>
<a href="#load_wiki"><code>#</code></a>
</td>
<td>
</td>
<td align="right">
1&nbsp;541&nbsp;401
</td>
<td align="right">
12.94 Gb
</td>
<td>
Russian Wiki dump
</br>
</br>
<code>wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2</code>
</td>
</tr>
<tr>
<td>
<a href="https://github.com/dialogue-evaluation/GramEval2020">GramEval2020</a>
</td>
<td>
<a name="load_gramru"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_gramru">load_gramru</a></code>
<a href="#load_gramru"><code>#</code></a>
</td>
<td>
</td>
<td align="right">
162&nbsp;372
</td>
<td align="right">
30.04 Mb
</td>
<td>
<code>wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip</code>
</br>
<code>unzip master.zip</code>
</br>
<code>mv GramEval2020-master/dataTrain train</code>
</br>
<code>mv GramEval2020-master/dataOpenTest dev</code>
</br>
<code>rm -r master.zip GramEval2020-master</code>
</br>
<code>wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu</code>
</td>
</tr>
<tr>
<td>
<a href="http://opencorpora.org/">OpenCorpora</a>
</td>
<td>
<a name="load_corpora"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_corpora">load_corpora</a></code>
<a href="#load_corpora"><code>#</code></a>
</td>
<td>
<code>morph</code>
</td>
<td align="right">
4&nbsp;030
</td>
<td align="right">
20.21 Mb
</td>
<td>
<code>wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip</code>
</td>
</tr>
<tr>
<td>
RusVectores SimLex-965
</td>
<td>
<a name="load_simlex"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_simlex">load_simlex</a></code>
<a href="#load_simlex"><code>#</code></a>
</td>
<td>
<code>emb</code>
<code>sim</code>
</td>
<td align="right">
</td>
<td align="right">
</td>
<td>
<code>wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv</code>
</br>
<code>wget https://rusvectores.org/static/testsets/ru_simlex965.tsv</code>
</td>
</tr>
<tr>
<td>
<a href="https://omnia-russica.github.io/">Omnia Russica</a>
</td>
<td>
<a name="load_omnia"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_omnia">load_omnia</a></code>
<a href="#load_omnia"><code>#</code></a>
</td>
<td>
<code>morph</code>
<code>web</code>
<code>fiction</code>
</td>
<td align="right">
</td>
<td align="right">
489.62 Gb
</td>
<td>
Taiga + Wiki + Araneum. Read "Even larger Russian corpus" https://events.spbu.ru/eventsContent/events/2019/corpora/corp_sborn.pdf
</br>
</br>
Manually download http://bit.ly/2ZT4BY9
</td>
</tr>
<tr>
<td>
<a href="https://github.com/dialogue-evaluation/factRuEval-2016/">factRuEval-2016</a>
</td>
<td>
<a name="load_factru"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_factru">load_factru</a></code>
<a href="#load_factru"><code>#</code></a>
</td>
<td>
<code>ner</code>
<code>news</code>
</td>
<td align="right">
254
</td>
<td align="right">
969.27 Kb
</td>
<td>
Manual PER, LOC, ORG markup prepared for 2016 Dialog competition
</br>
</br>
<code>wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip</code>
</br>
<code>unzip master.zip</code>
</br>
<code>rm master.zip</code>
</td>
</tr>
<tr>
<td>
<a href="https://www.researchgate.net/publication/262203599_Introducing_Baselines_for_Russian_Named_Entity_Recognition">Gareev</a>
</td>
<td>
<a name="load_gareev"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_gareev">load_gareev</a></code>
<a href="#load_gareev"><code>#</code></a>
</td>
<td>
<code>ner</code>
<code>news</code>
</td>
<td align="right">
97
</td>
<td align="right">
455.02 Kb
</td>
<td>
Manual PER, ORG markup (no LOC)
</br>
</br>
Email Rinat Gareev (gareev-rm@yandex.ru) and ask for the dataset
</br>
<code>tar -xvf rus-ner-news-corpus.iob.tar.gz</code>
</br>
<code>rm rus-ner-news-corpus.iob.tar.gz</code>
</td>
</tr>
<tr>
<td>
<a href="http://www.labinform.ru/pub/named_entities/">Collection5</a>
</td>
<td>
<a name="load_ne5"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ne5">load_ne5</a></code>
<a href="#load_ne5"><code>#</code></a>
</td>
<td>
<code>ner</code>
<code>news</code>
</td>
<td align="right">
1&nbsp;000
</td>
<td align="right">
2.96 Mb
</td>
<td>
News articles with manual PER, LOC, ORG markup
</br>
</br>
<code>wget http://www.labinform.ru/pub/named_entities/collection5.zip</code>
</br>
<code>unzip collection5.zip</code>
</br>
<code>rm collection5.zip</code>
</td>
</tr>
<tr>
<td>
<a href="https://www.aclweb.org/anthology/I17-1042">WiNER</a>
</td>
<td>
<a name="load_wikiner"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_wikiner">load_wikiner</a></code>
<a href="#load_wikiner"><code>#</code></a>
</td>
<td>
<code>ner</code>
</td>
<td align="right">
203&nbsp;287
</td>
<td align="right">
36.15 Mb
</td>
<td>
Sentences from Wiki automatically annotated with PER, LOC, ORG tags
</br>
</br>
<code>wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2</code>
</td>
</tr>
<tr>
<td>
<a href="http://bsnlp.cs.helsinki.fi/shared_task.html">BSNLP-2019</a>
</td>
<td>
<a name="load_bsnlp"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_bsnlp">load_bsnlp</a></code>
<a href="#load_bsnlp"><code>#</code></a>
</td>
<td>
<code>ner</code>
</td>
<td align="right">
464
</td>
<td align="right">
1.16 Mb
</td>
<td>
Markup prepared for 2019 BSNLP Shared Task
</br>
</br>
<code>wget http://bsnlp.cs.helsinki.fi/TRAININGDATA_BSNLP_2019_shared_task.zip</code>
</br>
<code>wget http://bsnlp.cs.helsinki.fi/TESTDATA_BSNLP_2019_shared_task.zip</code>
</br>
<code>unzip TRAININGDATA_BSNLP_2019_shared_task.zip</code>
</br>
<code>unzip TESTDATA_BSNLP_2019_shared_task.zip -d test_pl_cs_ru_bg</code>
</br>
<code>rm TRAININGDATA_BSNLP_2019_shared_task.zip TESTDATA_BSNLP_2019_shared_task.zip</code>
</td>
</tr>
<tr>
<td>
<a href="http://ai-center.botik.ru/Airec/index.php/ru/collections/28-persons-1000">Persons-1000</a>
</td>
<td>
<a name="load_persons"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_persons">load_persons</a></code>
<a href="#load_persons"><code>#</code></a>
</td>
<td>
<code>ner</code>
<code>news</code>
</td>
<td align="right">
1&nbsp;000
</td>
<td align="right">
2.96 Mb
</td>
<td>
Same as Collection5, only PER markup + normalized names
</br>
</br>
<code>wget http://ai-center.botik.ru/Airec/ai-resources/Persons-1000.zip</code>
</td>
</tr>
<tr>
<td>
<a href="https://github.com/cimm-kzn/RuDReC">The Russian Drug Reaction Corpus (RuDReC)</a>
</td>
<td>
<a name="load_rudrec"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_rudrec">load_rudrec</a></code>
<a href="#load_rudrec"><code>#</code></a>
</td>
<td>
<code>ner</code>
</td>
<td align="right">
4&nbsp;809
</td>
<td align="right">
1.73 Kb
</td>
<td>
RuDReC is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products for the detection of health-related named entities and the effectiveness of pharmaceutical products. Here you can download and work with the annotated part, to get the raw part (1.4M reviews) please refer to https://github.com/cimm-kzn/RuDReC.
</br>
</br>
<code>wget https://github.com/cimm-kzn/RuDReC/raw/master/data/rudrec_annotated.json</code>
</td>
</tr>
<tr>
<td>
<a href="https://tatianashavrina.github.io/taiga_site/">Taiga</a>
</td>
<td colspan="5">
Large collection of Russian texts from various sources: news sites, magazines, literature, social networks
</br>
</br>
<code>wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz</code>
</br>
<code>tar -xzvf retagged_taiga.tar.gz</code>
</td>
</tr>
<tr>
<td>
Arzamas
</td>
<td>
<a name="load_taiga_arzamas"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_arzamas">load_taiga_arzamas</a></code>
<a href="#load_taiga_arzamas"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
311
</td>
<td align="right">
4.50 Mb
</td>
<td>
</td>
</tr>
<tr>
<td>
Fontanka
</td>
<td>
<a name="load_taiga_fontanka"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_fontanka">load_taiga_fontanka</a></code>
<a href="#load_taiga_fontanka"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
342&nbsp;683
</td>
<td align="right">
786.23 Mb
</td>
<td>
</td>
</tr>
<tr>
<td>
Interfax
</td>
<td>
<a name="load_taiga_interfax"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_interfax">load_taiga_interfax</a></code>
<a href="#load_taiga_interfax"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
46&nbsp;429
</td>
<td align="right">
77.55 Mb
</td>
<td>
</td>
</tr>
<tr>
<td>
KP
</td>
<td>
<a name="load_taiga_kp"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_kp">load_taiga_kp</a></code>
<a href="#load_taiga_kp"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
45&nbsp;503
</td>
<td align="right">
61.79 Mb
</td>
<td>
</td>
</tr>
<tr>
<td>
Lenta
</td>
<td>
<a name="load_taiga_lenta"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_lenta">load_taiga_lenta</a></code>
<a href="#load_taiga_lenta"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
36&nbsp;446
</td>
<td align="right">
95.15 Mb
</td>
<td>
</td>
</tr>
<tr>
<td>
Taiga/N+1
</td>
<td>
<a name="load_taiga_nplus1"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_nplus1">load_taiga_nplus1</a></code>
<a href="#load_taiga_nplus1"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
7&nbsp;696
</td>
<td align="right">
24.96 Mb
</td>
<td>
</td>
</tr>
<tr>
<td>
Magazines
</td>
<td>
<a name="load_taiga_magazines"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_magazines">load_taiga_magazines</a></code>
<a href="#load_taiga_magazines"><code>#</code></a>
</td>
<td>
</td>
<td align="right">
39&nbsp;890
</td>
<td align="right">
2.19 Gb
</td>
<td>
</td>
</tr>
<tr>
<td>
Subtitles
</td>
<td>
<a name="load_taiga_subtitles"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_subtitles">load_taiga_subtitles</a></code>
<a href="#load_taiga_subtitles"><code>#</code></a>
</td>
<td>
</td>
<td align="right">
19&nbsp;011
</td>
<td align="right">
909.08 Mb
</td>
<td>
</td>
</tr>
<tr>
<td>
Social
</td>
<td>
<a name="load_taiga_social"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_social">load_taiga_social</a></code>
<a href="#load_taiga_social"><code>#</code></a>
</td>
<td>
<code>social</code>
</td>
<td align="right">
1&nbsp;876&nbsp;442
</td>
<td align="right">
648.18 Mb
</td>
<td>
</td>
</tr>
<tr>
<td>
Proza
</td>
<td>
<a name="load_taiga_proza"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_proza">load_taiga_proza</a></code>
<a href="#load_taiga_proza"><code>#</code></a>
</td>
<td>
<code>fiction</code>
</td>
<td align="right">
1&nbsp;732&nbsp;434
</td>
<td align="right">
38.25 Gb
</td>
<td>
</td>
</tr>
<tr>
<td>
Stihi
</td>
<td>
<a name="load_taiga_stihi"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_stihi">load_taiga_stihi</a></code>
<a href="#load_taiga_stihi"><code>#</code></a>
</td>
<td>
</td>
<td align="right">
9&nbsp;157&nbsp;686
</td>
<td align="right">
12.80 Gb
</td>
<td>
</td>
</tr>
<tr>
<td>
<a href="https://github.com/buriy/russian-nlp-datasets/releases">Russian NLP Datasets</a>
</td>
<td colspan="5">
Several Russian news datasets from webhose.io, lenta.ru and other news sites.
</td>
</tr>
<tr>
<td>
News
</td>
<td>
<a name="load_buriy_news"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_buriy_news">load_buriy_news</a></code>
<a href="#load_buriy_news"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
2&nbsp;154&nbsp;801
</td>
<td align="right">
6.84 Gb
</td>
<td>
Dump of top 40 news + 20 fashion news sites.
</br>
</br>
<code>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2</code>
</br>
<code>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2</code>
</br>
<code>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2</code>
</td>
</tr>
<tr>
<td>
Webhose
</td>
<td>
<a name="load_buriy_webhose"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_buriy_webhose">load_buriy_webhose</a></code>
<a href="#load_buriy_webhose"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
285&nbsp;965
</td>
<td align="right">
859.32 Mb
</td>
<td>
Dump from webhose.io, 300 sources for one month.
</br>
</br>
<code>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/webhose-2016.tar.bz2</code>
</td>
</tr>
<tr>
<td>
<a href="https://github.com/ods-ai-ml4sg/proj_news_viz/releases/tag/data">ODS #proj_news_viz</a>
</td>
<td colspan="5">
Several news sites scraped by members of #proj_news_viz ODS project.
</td>
</tr>
<tr>
<td>
Interfax
</td>
<td>
<a name="load_ods_interfax"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_interfax">load_ods_interfax</a></code>
<a href="#load_ods_interfax"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
543&nbsp;961
</td>
<td align="right">
1.22 Gb
</td>
<td>
<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/interfax.csv.gz</code>
</td>
</tr>
<tr>
<td>
Gazeta
</td>
<td>
<a name="load_ods_gazeta"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_gazeta">load_ods_gazeta</a></code>
<a href="#load_ods_gazeta"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
865&nbsp;847
</td>
<td align="right">
1.63 Gb
</td>
<td>
<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/gazeta.csv.gz</code>
</td>
</tr>
<tr>
<td>
Izvestia
</td>
<td>
<a name="load_ods_izvestia"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_izvestia">load_ods_izvestia</a></code>
<a href="#load_ods_izvestia"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
86&nbsp;601
</td>
<td align="right">
307.19 Mb
</td>
<td>
<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/iz.csv.gz</code>
</td>
</tr>
<tr>
<td>
Meduza
</td>
<td>
<a name="load_ods_meduza"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_meduza">load_ods_meduza</a></code>
<a href="#load_ods_meduza"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
71&nbsp;806
</td>
<td align="right">
270.11 Mb
</td>
<td>
<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/meduza.csv.gz</code>
</td>
</tr>
<tr>
<td>
RIA
</td>
<td>
<a name="load_ods_ria"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_ria">load_ods_ria</a></code>
<a href="#load_ods_ria"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
101&nbsp;543
</td>
<td align="right">
233.88 Mb
</td>
<td>
<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/ria.csv.gz</code>
</td>
</tr>
<tr>
<td>
Russia Today
</td>
<td>
<a name="load_ods_rt"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_rt">load_ods_rt</a></code>
<a href="#load_ods_rt"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
106&nbsp;644
</td>
<td align="right">
187.12 Mb
</td>
<td>
<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/rt.csv.gz</code>
</td>
</tr>
<tr>
<td>
TASS
</td>
<td>
<a name="load_ods_tass"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_tass">load_ods_tass</a></code>
<a href="#load_ods_tass"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
1&nbsp;135&nbsp;635
</td>
<td align="right">
3.27 Gb
</td>
<td>
<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/tass-001.csv.gz</code>
</td>
</tr>
<tr>
<td>
<a href="https://universaldependencies.org/">Universal Dependencies</a>
</td>
<td colspan="5">
</td>
</tr>
<tr>
<td>
GSD
</td>
<td>
<a name="load_ud_gsd"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ud_gsd">load_ud_gsd</a></code>
<a href="#load_ud_gsd"><code>#</code></a>
</td>
<td>
<code>morph</code>
<code>syntax</code>
</td>
<td align="right">
5&nbsp;030
</td>
<td align="right">
1.01 Mb
</td>
<td>
<code>wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-dev.conllu</code>
</br>
<code>wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-test.conllu</code>
</br>
<code>wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-train.conllu</code>
</td>
</tr>
<tr>
<td>
Taiga
</td>
<td>
<a name="load_ud_taiga"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ud_taiga">load_ud_taiga</a></code>
<a href="#load_ud_taiga"><code>#</code></a>
</td>
<td>
<code>morph</code>
<code>syntax</code>
</td>
<td align="right">
3&nbsp;264
</td>
<td align="right">
353.80 Kb
</td>
<td>
<code>wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-dev.conllu</code>
</br>
<code>wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-test.conllu</code>
</br>
<code>wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-train.conllu</code>
</td>
</tr>
<tr>
<td>
PUD
</td>
<td>
<a name="load_ud_pud"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ud_pud">load_ud_pud</a></code>
<a href="#load_ud_pud"><code>#</code></a>
</td>
<td>
<code>morph</code>
<code>syntax</code>
</td>
<td align="right">
1&nbsp;000
</td>
<td align="right">
207.78 Kb
</td>
<td>
<code>wget https://github.com/UniversalDependencies/UD_Russian-PUD/raw/master/ru_pud-ud-test.conllu</code>
</td>
</tr>
<tr>
<td>
SynTagRus
</td>
<td>
<a name="load_ud_syntag"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ud_syntag">load_ud_syntag</a></code>
<a href="#load_ud_syntag"><code>#</code></a>
</td>
<td>
<code>morph</code>
<code>syntax</code>
</td>
<td align="right">
61&nbsp;889
</td>
<td align="right">
11.33 Mb
</td>
<td>
<code>wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-dev.conllu</code>
</br>
<code>wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-test.conllu</code>
</br>
<code>wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-train.conllu</code>
</td>
</tr>
<tr>
<td>
<a href="https://github.com/dialogue-evaluation/morphoRuEval-2017">morphoRuEval-2017</a>
</td>
<td colspan="5">
</td>
</tr>
<tr>
<td>
General Internet-Corpus
</td>
<td>
<a name="load_morphoru_gicrya"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_morphoru_gicrya">load_morphoru_gicrya</a></code>
<a href="#load_morphoru_gicrya"><code>#</code></a>
</td>
<td>
<code>morph</code>
</td>
<td align="right">
83&nbsp;148
</td>
<td align="right">
10.58 Mb
</td>
<td>
<code>wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/GIKRYA_texts_new.zip</code>
</br>
<code>unzip GIKRYA_texts_new.zip</code>
</br>
<code>rm GIKRYA_texts_new.zip</code>
</td>
</tr>
<tr>
<td>
Russian National Corpus
</td>
<td>
<a name="load_morphoru_rnc"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_morphoru_rnc">load_morphoru_rnc</a></code>
<a href="#load_morphoru_rnc"><code>#</code></a>
</td>
<td>
<code>morph</code>
</td>
<td align="right">
98&nbsp;892
</td>
<td align="right">
12.71 Mb
</td>
<td>
<code>wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/RNC_texts.rar</code>
</br>
<code>unrar x RNC_texts.rar</code>
</br>
<code>rm RNC_texts.rar</code>
</td>
</tr>
<tr>
<td>
OpenCorpora
</td>
<td>
<a name="load_morphoru_corpora"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_morphoru_corpora">load_morphoru_corpora</a></code>
<a href="#load_morphoru_corpora"><code>#</code></a>
</td>
<td>
<code>morph</code>
</td>
<td align="right">
38&nbsp;510
</td>
<td align="right">
4.80 Mb
</td>
<td>
<code>wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/OpenCorpora_Texts.rar</code>
</br>
<code>unrar x OpenCorpora_Texts.rar</code>
</br>
<code>rm OpenCorpora_Texts.rar</code>
</td>
</tr>
<tr>
<td>
<a href="https://russe.nlpub.org/downloads/">RUSSE Russian Semantic Relatedness</a>
</td>
<td colspan="5">
</td>
</tr>
<tr>
<td>
HJ: Human Judgements of Word Pairs
</td>
<td>
<a name="load_russe_hj"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_russe_hj">load_russe_hj</a></code>
<a href="#load_russe_hj"><code>#</code></a>
</td>
<td>
<code>emb</code>
<code>sim</code>
</td>
<td align="right">
</td>
<td align="right">
</td>
<td>
<code>wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/hj.csv</code>
</td>
</tr>
<tr>
<td>
RT: Synonyms and Hypernyms from the Thesaurus RuThes
</td>
<td>
<a name="load_russe_rt"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_russe_rt">load_russe_rt</a></code>
<a href="#load_russe_rt"><code>#</code></a>
</td>
<td>
<code>emb</code>
<code>sim</code>
</td>
<td align="right">
</td>
<td align="right">
</td>
<td>
<code>wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/rt.csv</code>
</td>
</tr>
<tr>
<td>
AE: Cognitive Associations from the Sociation.org Experiment
</td>
<td>
<a name="load_russe_ae"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_russe_ae">load_russe_ae</a></code>
<a href="#load_russe_ae"><code>#</code></a>
</td>
<td>
<code>emb</code>
<code>sim</code>
</td>
<td align="right">
</td>
<td align="right">
</td>
<td>
<code>wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-train.csv</code>
</br>
<code>wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-test.csv</code>
</br>
<code>wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/ae2.csv</code>
</td>
</tr>
<tr>
<td>
<a href="https://toloka.yandex.ru/datasets/">Toloka Datasets</a>
</td>
<td colspan="5">
</td>
</tr>
<tr>
<td>
Lexical Relations from the Wisdom of the Crowd (LRWC)
</td>
<td>
<a name="load_toloka_lrwc"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_toloka_lrwc">load_toloka_lrwc</a></code>
<a href="#load_toloka_lrwc"><code>#</code></a>
</td>
<td>
<code>emb</code>
<code>sim</code>
</td>
<td align="right">
</td>
<td align="right">
</td>
<td>
<code>wget https://tlk.s3.yandex.net/dataset/LRWC.zip</code>
</br>
<code>unzip LRWC.zip</code>
</br>
<code>rm LRWC.zip</code>
</td>
</tr>
<tr>
<td>
<a href="https://github.com/cimm-kzn/RuDReC">The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT)</a>
</td>
<td>
<a name="load_ruadrect"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ruadrect">load_ruadrect</a></code>
<a href="#load_ruadrect"><code>#</code></a>
</td>
<td>
<code>social</code>
</td>
<td align="right">
9&nbsp;515
</td>
<td align="right">
2.09 Mb
</td>
<td>
This corpus was developed for the Social Media Mining for Health Applications (#SMM4H) Shared Task 2020
</br>
</br>
<code>wget https://github.com/cimm-kzn/RuDReC/raw/master/data/RuADReCT.zip</code>
</br>
<code>unzip RuADReCT.zip</code>
</br>
<code>rm RuADReCT.zip</code>
</td>
</tr>
</table>
<!--- metas --->
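The `wget` commands in the table save each file under the last component of its URL by default, and that file name is what the loader receives as `path` in the Usage section. A minimal sketch of that mapping follows; `local_name` and `is_downloaded` are hypothetical helpers written for this illustration and are not provided by `corus`.

```python
import os
import posixpath
from urllib.parse import urlparse


def local_name(url):
    """Return the file name wget would use by default for `url` (hypothetical helper)."""
    return posixpath.basename(urlparse(url).path)


def is_downloaded(url, directory='.'):
    """Check whether the file for `url` is already present in `directory`."""
    return os.path.exists(os.path.join(directory, local_name(url)))


url = 'https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz'
path = local_name(url)  # the value to pass to load_lenta
```

Rows that unpack or rename files after download (for example the GramEval2020 and BSNLP-2019 entries) do not follow this simple mapping, so the helper only applies to single-file downloads.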

## Support

- Chat — https://t.me/natural_language_processing
- Issues — https://github.com/natasha/corus/issues
- Commercial support — https://lab.alexkuk.ru

## Add new source

1. Implement `corus/sources/<source>.py`
2. Add import into `corus/sources/__init__.py`
3. Add meta into `corus/sources/meta.py`
4. Add example into `docs.ipynb` (check meta table is correct)
5. Run tests (readme is updated)
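
The first step can be pictured with the sketch below. It is not the project's actual code: `ExampleRecord`, `load_example`, and the three-column CSV layout are all hypothetical, and the real modules under `corus/sources/` may rely on shared helpers that are not shown here. It only illustrates the pattern visible in the Usage section, where a loader takes a path and yields one record object at a time.

```python
import csv
import gzip
from collections import namedtuple

# Hypothetical record type; the field names are illustrative only.
ExampleRecord = namedtuple('ExampleRecord', ['url', 'title', 'text'])


def load_example(path):
    """Lazily yield ExampleRecord items from a gzip-compressed CSV with a header row."""
    with gzip.open(path, mode='rt', encoding='utf8', newline='') as file:
        rows = csv.reader(file)
        next(rows, None)  # skip the header row
        for row in rows:
            # Assumes every data row has exactly three columns.
            yield ExampleRecord(*row)
```

Writing the loader as a generator keeps memory use flat regardless of archive size, which matters for the multi-gigabyte datasets listed in the Reference table.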

## Development

Dev env

```bash
python -m venv ~/.venvs/natasha-corus
source ~/.venvs/natasha-corus/bin/activate

pip install -r requirements/dev.txt
pip install -e .

python -m ipykernel install --user --name natasha-corus
```

Lint + update docs

```bash
make lint
make exec-docs
```

Release

```bash
# Update setup.py version

git commit -am 'Up version'
git tag v0.10.0

git push
git push --tags
```

            

Raw data

    {
        "_id": null,
        "home_page": "https://github.com/natasha/corus",
        "name": "corus",
        "maintainer": "",
        "docs_url": null,
        "requires_python": "",
        "maintainer_email": "",
        "keywords": "corpora,russian,nlp,datasets",
        "author": "Alexander Kukushkin",
        "author_email": "alex@alexkuk.ru",
        "download_url": "https://files.pythonhosted.org/packages/79/7e/50769ae67af426bb53727fdfbf34e768edb14f5e4900f4110174588666e3/corus-0.10.0.tar.gz",
        "platform": null
    }
https://www.dropbox.com/s/9egqjszeicki4ho/db.sql\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://dumps.wikimedia.org/\">Wikipedia</a>\n</td>\n<td>\n<a name=\"load_wiki\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_wiki\">load_wiki</a></code>\n<a href=\"#load_wiki\"><code>#</code></a>\n</td>\n<td>\n</td>\n<td align=\"right\">\n1&nbsp;541&nbsp;401\n</td>\n<td align=\"right\">\n12.94 Gb\n</td>\n<td>\nRussian Wiki dump\n</br>\n</br>\n<code>wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://github.com/dialogue-evaluation/GramEval2020\">GramEval2020</a>\n</td>\n<td>\n<a name=\"load_gramru\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_gramru\">load_gramru</a></code>\n<a href=\"#load_gramru\"><code>#</code></a>\n</td>\n<td>\n</td>\n<td align=\"right\">\n162&nbsp;372\n</td>\n<td align=\"right\">\n30.04 Mb\n</td>\n<td>\n<code>wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip</code>\n</br>\n<code>unzip master.zip</code>\n</br>\n<code>mv GramEval2020-master/dataTrain train</code>\n</br>\n<code>mv GramEval2020-master/dataOpenTest dev</code>\n</br>\n<code>rm -r master.zip GramEval2020-master</code>\n</br>\n<code>wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"http://opencorpora.org/\">OpenCorpora</a>\n</td>\n<td>\n<a name=\"load_corpora\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_corpora\">load_corpora</a></code>\n<a href=\"#load_corpora\"><code>#</code></a>\n</td>\n<td>\n<code>morph</code>\n</td>\n<td align=\"right\">\n4&nbsp;030\n</td>\n<td align=\"right\">\n20.21 Mb\n</td>\n<td>\n<code>wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip</code>\n</td>\n</tr>\n<tr>\n<td>\nRusVectores 
SimLex-965\n</td>\n<td>\n<a name=\"load_simlex\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_simlex\">load_simlex</a></code>\n<a href=\"#load_simlex\"><code>#</code></a>\n</td>\n<td>\n<code>emb</code>\n<code>sim</code>\n</td>\n<td align=\"right\">\n</td>\n<td align=\"right\">\n</td>\n<td>\n<code>wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv</code>\n</br>\n<code>wget https://rusvectores.org/static/testsets/ru_simlex965.tsv</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://omnia-russica.github.io/\">Omnia Russica</a>\n</td>\n<td>\n<a name=\"load_omnia\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_omnia\">load_omnia</a></code>\n<a href=\"#load_omnia\"><code>#</code></a>\n</td>\n<td>\n<code>morph</code>\n<code>web</code>\n<code>fiction</code>\n</td>\n<td align=\"right\">\n</td>\n<td align=\"right\">\n489.62 Gb\n</td>\n<td>\nTaiga + Wiki + Araneum. Read \"Even larger Russian corpus\" https://events.spbu.ru/eventsContent/events/2019/corpora/corp_sborn.pdf\n</br>\n</br>\nManually download http://bit.ly/2ZT4BY9\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://github.com/dialogue-evaluation/factRuEval-2016/\">factRuEval-2016</a>\n</td>\n<td>\n<a name=\"load_factru\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_factru\">load_factru</a></code>\n<a href=\"#load_factru\"><code>#</code></a>\n</td>\n<td>\n<code>ner</code>\n<code>news</code>\n</td>\n<td align=\"right\">\n254\n</td>\n<td align=\"right\">\n969.27 Kb\n</td>\n<td>\nManual PER, LOC, ORG markup prepared for 2016 Dialog competition\n</br>\n</br>\n<code>wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip</code>\n</br>\n<code>unzip master.zip</code>\n</br>\n<code>rm master.zip</code>\n</td>\n</tr>\n<tr>\n<td>\n<a 
href=\"https://www.researchgate.net/publication/262203599_Introducing_Baselines_for_Russian_Named_Entity_Recognition\">Gareev</a>\n</td>\n<td>\n<a name=\"load_gareev\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_gareev\">load_gareev</a></code>\n<a href=\"#load_gareev\"><code>#</code></a>\n</td>\n<td>\n<code>ner</code>\n<code>news</code>\n</td>\n<td align=\"right\">\n97\n</td>\n<td align=\"right\">\n455.02 Kb\n</td>\n<td>\nManual PER, ORG markup (no LOC)\n</br>\n</br>\nEmail Rinat Gareev (gareev-rm@yandex.ru) and ask for the dataset\n</br>\n<code>tar -xvf rus-ner-news-corpus.iob.tar.gz</code>\n</br>\n<code>rm rus-ner-news-corpus.iob.tar.gz</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"http://www.labinform.ru/pub/named_entities/\">Collection5</a>\n</td>\n<td>\n<a name=\"load_ne5\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ne5\">load_ne5</a></code>\n<a href=\"#load_ne5\"><code>#</code></a>\n</td>\n<td>\n<code>ner</code>\n<code>news</code>\n</td>\n<td align=\"right\">\n1&nbsp;000\n</td>\n<td align=\"right\">\n2.96 Mb\n</td>\n<td>\nNews articles with manual PER, LOC, ORG markup\n</br>\n</br>\n<code>wget http://www.labinform.ru/pub/named_entities/collection5.zip</code>\n</br>\n<code>unzip collection5.zip</code>\n</br>\n<code>rm collection5.zip</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://www.aclweb.org/anthology/I17-1042\">WiNER</a>\n</td>\n<td>\n<a name=\"load_wikiner\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_wikiner\">load_wikiner</a></code>\n<a href=\"#load_wikiner\"><code>#</code></a>\n</td>\n<td>\n<code>ner</code>\n</td>\n<td align=\"right\">\n203&nbsp;287\n</td>\n<td align=\"right\">\n36.15 Mb\n</td>\n<td>\nSentences from Wiki auto-annotated with PER, LOC, ORG tags\n</br>\n</br>\n<code>wget 
https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"http://bsnlp.cs.helsinki.fi/shared_task.html\">BSNLP-2019</a>\n</td>\n<td>\n<a name=\"load_bsnlp\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_bsnlp\">load_bsnlp</a></code>\n<a href=\"#load_bsnlp\"><code>#</code></a>\n</td>\n<td>\n<code>ner</code>\n</td>\n<td align=\"right\">\n464\n</td>\n<td align=\"right\">\n1.16 Mb\n</td>\n<td>\nMarkup prepared for 2019 BSNLP Shared Task\n</br>\n</br>\n<code>wget http://bsnlp.cs.helsinki.fi/TRAININGDATA_BSNLP_2019_shared_task.zip</code>\n</br>\n<code>wget http://bsnlp.cs.helsinki.fi/TESTDATA_BSNLP_2019_shared_task.zip</code>\n</br>\n<code>unzip TRAININGDATA_BSNLP_2019_shared_task.zip</code>\n</br>\n<code>unzip TESTDATA_BSNLP_2019_shared_task.zip -d test_pl_cs_ru_bg</code>\n</br>\n<code>rm TRAININGDATA_BSNLP_2019_shared_task.zip TESTDATA_BSNLP_2019_shared_task.zip</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"http://ai-center.botik.ru/Airec/index.php/ru/collections/28-persons-1000\">Persons-1000</a>\n</td>\n<td>\n<a name=\"load_persons\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_persons\">load_persons</a></code>\n<a href=\"#load_persons\"><code>#</code></a>\n</td>\n<td>\n<code>ner</code>\n<code>news</code>\n</td>\n<td align=\"right\">\n1&nbsp;000\n</td>\n<td align=\"right\">\n2.96 Mb\n</td>\n<td>\nSame as Collection5, only PER markup + normalized names\n</br>\n</br>\n<code>wget http://ai-center.botik.ru/Airec/ai-resources/Persons-1000.zip</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://github.com/cimm-kzn/RuDReC\">The Russian Drug Reaction Corpus (RuDReC)</a>\n</td>\n<td>\n<a name=\"load_rudrec\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_rudrec\">load_rudrec</a></code>\n<a 
href=\"#load_rudrec\"><code>#</code></a>\n</td>\n<td>\n<code>ner</code>\n</td>\n<td align=\"right\">\n4&nbsp;809\n</td>\n<td align=\"right\">\n1.73 Kb\n</td>\n<td>\nRuDReC is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products for the detection of health-related named entities and the effectiveness of pharmaceutical products. Here you can download and work with the annotated part; to get the raw part (1.4M reviews), please refer to https://github.com/cimm-kzn/RuDReC.\n</br>\n</br>\n<code>wget https://github.com/cimm-kzn/RuDReC/raw/master/data/rudrec_annotated.json</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://tatianashavrina.github.io/taiga_site/\">Taiga</a>\n</td>\n<td colspan=\"5\">\nLarge collection of Russian texts from various sources: news sites, magazines, literature, social networks\n</br>\n</br>\n<code>wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz</code>\n</br>\n<code>tar -xzvf retagged_taiga.tar.gz</code>\n</td>\n</tr>\n<tr>\n<td>\nArzamas\n</td>\n<td>\n<a name=\"load_taiga_arzamas\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_arzamas\">load_taiga_arzamas</a></code>\n<a href=\"#load_taiga_arzamas\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n311\n</td>\n<td align=\"right\">\n4.50 Mb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\nFontanka\n</td>\n<td>\n<a name=\"load_taiga_fontanka\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_fontanka\">load_taiga_fontanka</a></code>\n<a href=\"#load_taiga_fontanka\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n342&nbsp;683\n</td>\n<td align=\"right\">\n786.23 Mb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\nInterfax\n</td>\n<td>\n<a name=\"load_taiga_interfax\"></a>\n<code><a 
href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_interfax\">load_taiga_interfax</a></code>\n<a href=\"#load_taiga_interfax\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n46&nbsp;429\n</td>\n<td align=\"right\">\n77.55 Mb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\nKP\n</td>\n<td>\n<a name=\"load_taiga_kp\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_kp\">load_taiga_kp</a></code>\n<a href=\"#load_taiga_kp\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n45&nbsp;503\n</td>\n<td align=\"right\">\n61.79 Mb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\nLenta\n</td>\n<td>\n<a name=\"load_taiga_lenta\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_lenta\">load_taiga_lenta</a></code>\n<a href=\"#load_taiga_lenta\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n36&nbsp;446\n</td>\n<td align=\"right\">\n95.15 Mb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\nTaiga/N+1\n</td>\n<td>\n<a name=\"load_taiga_nplus1\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_nplus1\">load_taiga_nplus1</a></code>\n<a href=\"#load_taiga_nplus1\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n7&nbsp;696\n</td>\n<td align=\"right\">\n24.96 Mb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\nMagazines\n</td>\n<td>\n<a name=\"load_taiga_magazines\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_magazines\">load_taiga_magazines</a></code>\n<a href=\"#load_taiga_magazines\"><code>#</code></a>\n</td>\n<td>\n</td>\n<td align=\"right\">\n39&nbsp;890\n</td>\n<td align=\"right\">\n2.19 Gb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\nSubtitles\n</td>\n<td>\n<a name=\"load_taiga_subtitles\"></a>\n<code><a 
href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_subtitles\">load_taiga_subtitles</a></code>\n<a href=\"#load_taiga_subtitles\"><code>#</code></a>\n</td>\n<td>\n</td>\n<td align=\"right\">\n19&nbsp;011\n</td>\n<td align=\"right\">\n909.08 Mb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\nSocial\n</td>\n<td>\n<a name=\"load_taiga_social\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_social\">load_taiga_social</a></code>\n<a href=\"#load_taiga_social\"><code>#</code></a>\n</td>\n<td>\n<code>social</code>\n</td>\n<td align=\"right\">\n1&nbsp;876&nbsp;442\n</td>\n<td align=\"right\">\n648.18 Mb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\nProza\n</td>\n<td>\n<a name=\"load_taiga_proza\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_proza\">load_taiga_proza</a></code>\n<a href=\"#load_taiga_proza\"><code>#</code></a>\n</td>\n<td>\n<code>fiction</code>\n</td>\n<td align=\"right\">\n1&nbsp;732&nbsp;434\n</td>\n<td align=\"right\">\n38.25 Gb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\nStihi\n</td>\n<td>\n<a name=\"load_taiga_stihi\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_stihi\">load_taiga_stihi</a></code>\n<a href=\"#load_taiga_stihi\"><code>#</code></a>\n</td>\n<td>\n</td>\n<td align=\"right\">\n9&nbsp;157&nbsp;686\n</td>\n<td align=\"right\">\n12.80 Gb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://github.com/buriy/russian-nlp-datasets/releases\">Russian NLP Datasets</a>\n</td>\n<td colspan=\"5\">\nSeveral Russian news datasets from webhose.io, lenta.ru and other news sites.\n</td>\n</tr>\n<tr>\n<td>\nNews\n</td>\n<td>\n<a name=\"load_buriy_news\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_buriy_news\">load_buriy_news</a></code>\n<a 
href=\"#load_buriy_news\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n2&nbsp;154&nbsp;801\n</td>\n<td align=\"right\">\n6.84 Gb\n</td>\n<td>\nDump of top 40 news + 20 fashion news sites.\n</br>\n</br>\n<code>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2</code>\n</br>\n<code>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2</code>\n</br>\n<code>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2</code>\n</td>\n</tr>\n<tr>\n<td>\nWebhose\n</td>\n<td>\n<a name=\"load_buriy_webhose\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_buriy_webhose\">load_buriy_webhose</a></code>\n<a href=\"#load_buriy_webhose\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n285&nbsp;965\n</td>\n<td align=\"right\">\n859.32 Mb\n</td>\n<td>\nDump from webhose.io, 300 sources for one month.\n</br>\n</br>\n<code>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/webhose-2016.tar.bz2</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://github.com/ods-ai-ml4sg/proj_news_viz/releases/tag/data\">ODS #proj_news_viz</a>\n</td>\n<td colspan=\"5\">\nSeveral news sites scraped by members of #proj_news_viz ODS project.\n</td>\n</tr>\n<tr>\n<td>\nInterfax\n</td>\n<td>\n<a name=\"load_ods_interfax\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_interfax\">load_ods_interfax</a></code>\n<a href=\"#load_ods_interfax\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n543&nbsp;961\n</td>\n<td align=\"right\">\n1.22 Gb\n</td>\n<td>\n<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/interfax.csv.gz</code>\n</td>\n</tr>\n<tr>\n<td>\nGazeta\n</td>\n<td>\n<a name=\"load_ods_gazeta\"></a>\n<code><a 
href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_gazeta\">load_ods_gazeta</a></code>\n<a href=\"#load_ods_gazeta\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n865&nbsp;847\n</td>\n<td align=\"right\">\n1.63 Gb\n</td>\n<td>\n<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/gazeta.csv.gz</code>\n</td>\n</tr>\n<tr>\n<td>\nIzvestia\n</td>\n<td>\n<a name=\"load_ods_izvestia\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_izvestia\">load_ods_izvestia</a></code>\n<a href=\"#load_ods_izvestia\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n86&nbsp;601\n</td>\n<td align=\"right\">\n307.19 Mb\n</td>\n<td>\n<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/iz.csv.gz</code>\n</td>\n</tr>\n<tr>\n<td>\nMeduza\n</td>\n<td>\n<a name=\"load_ods_meduza\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_meduza\">load_ods_meduza</a></code>\n<a href=\"#load_ods_meduza\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n71&nbsp;806\n</td>\n<td align=\"right\">\n270.11 Mb\n</td>\n<td>\n<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/meduza.csv.gz</code>\n</td>\n</tr>\n<tr>\n<td>\nRIA\n</td>\n<td>\n<a name=\"load_ods_ria\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_ria\">load_ods_ria</a></code>\n<a href=\"#load_ods_ria\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n101&nbsp;543\n</td>\n<td align=\"right\">\n233.88 Mb\n</td>\n<td>\n<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/ria.csv.gz</code>\n</td>\n</tr>\n<tr>\n<td>\nRussia Today\n</td>\n<td>\n<a name=\"load_ods_rt\"></a>\n<code><a 
href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_rt\">load_ods_rt</a></code>\n<a href=\"#load_ods_rt\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n106&nbsp;644\n</td>\n<td align=\"right\">\n187.12 Mb\n</td>\n<td>\n<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/rt.csv.gz</code>\n</td>\n</tr>\n<tr>\n<td>\nTASS\n</td>\n<td>\n<a name=\"load_ods_tass\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_tass\">load_ods_tass</a></code>\n<a href=\"#load_ods_tass\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n1&nbsp;135&nbsp;635\n</td>\n<td align=\"right\">\n3.27 Gb\n</td>\n<td>\n<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/tass-001.csv.gz</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://universaldependencies.org/\">Universal Dependencies</a>\n</td>\n<td colspan=\"5\">\n</td>\n</tr>\n<tr>\n<td>\nGSD\n</td>\n<td>\n<a name=\"load_ud_gsd\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ud_gsd\">load_ud_gsd</a></code>\n<a href=\"#load_ud_gsd\"><code>#</code></a>\n</td>\n<td>\n<code>morph</code>\n<code>syntax</code>\n</td>\n<td align=\"right\">\n5&nbsp;030\n</td>\n<td align=\"right\">\n1.01 Mb\n</td>\n<td>\n<code>wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-dev.conllu</code>\n</br>\n<code>wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-test.conllu</code>\n</br>\n<code>wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-train.conllu</code>\n</td>\n</tr>\n<tr>\n<td>\nTaiga\n</td>\n<td>\n<a name=\"load_ud_taiga\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ud_taiga\">load_ud_taiga</a></code>\n<a 
href=\"#load_ud_taiga\"><code>#</code></a>\n</td>\n<td>\n<code>morph</code>\n<code>syntax</code>\n</td>\n<td align=\"right\">\n3&nbsp;264\n</td>\n<td align=\"right\">\n353.80 Kb\n</td>\n<td>\n<code>wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-dev.conllu</code>\n</br>\n<code>wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-test.conllu</code>\n</br>\n<code>wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-train.conllu</code>\n</td>\n</tr>\n<tr>\n<td>\nPUD\n</td>\n<td>\n<a name=\"load_ud_pud\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ud_pud\">load_ud_pud</a></code>\n<a href=\"#load_ud_pud\"><code>#</code></a>\n</td>\n<td>\n<code>morph</code>\n<code>syntax</code>\n</td>\n<td align=\"right\">\n1&nbsp;000\n</td>\n<td align=\"right\">\n207.78 Kb\n</td>\n<td>\n<code>wget https://github.com/UniversalDependencies/UD_Russian-PUD/raw/master/ru_pud-ud-test.conllu</code>\n</td>\n</tr>\n<tr>\n<td>\nSynTagRus\n</td>\n<td>\n<a name=\"load_ud_syntag\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ud_syntag\">load_ud_syntag</a></code>\n<a href=\"#load_ud_syntag\"><code>#</code></a>\n</td>\n<td>\n<code>morph</code>\n<code>syntax</code>\n</td>\n<td align=\"right\">\n61&nbsp;889\n</td>\n<td align=\"right\">\n11.33 Mb\n</td>\n<td>\n<code>wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-dev.conllu</code>\n</br>\n<code>wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-test.conllu</code>\n</br>\n<code>wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-train.conllu</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://github.com/dialogue-evaluation/morphoRuEval-2017\">morphoRuEval-2017</a>\n</td>\n<td 
colspan=\"5\">\n</td>\n</tr>\n<tr>\n<td>\nGeneral Internet-Corpus\n</td>\n<td>\n<a name=\"load_morphoru_gicrya\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_morphoru_gicrya\">load_morphoru_gicrya</a></code>\n<a href=\"#load_morphoru_gicrya\"><code>#</code></a>\n</td>\n<td>\n<code>morph</code>\n</td>\n<td align=\"right\">\n83&nbsp;148\n</td>\n<td align=\"right\">\n10.58 Mb\n</td>\n<td>\n<code>wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/GIKRYA_texts_new.zip</code>\n</br>\n<code>unzip GIKRYA_texts_new.zip</code>\n</br>\n<code>rm GIKRYA_texts_new.zip</code>\n</td>\n</tr>\n<tr>\n<td>\nRussian National Corpus\n</td>\n<td>\n<a name=\"load_morphoru_rnc\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_morphoru_rnc\">load_morphoru_rnc</a></code>\n<a href=\"#load_morphoru_rnc\"><code>#</code></a>\n</td>\n<td>\n<code>morph</code>\n</td>\n<td align=\"right\">\n98&nbsp;892\n</td>\n<td align=\"right\">\n12.71 Mb\n</td>\n<td>\n<code>wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/RNC_texts.rar</code>\n</br>\n<code>unrar x RNC_texts.rar</code>\n</br>\n<code>rm RNC_texts.rar</code>\n</td>\n</tr>\n<tr>\n<td>\nOpenCorpora\n</td>\n<td>\n<a name=\"load_morphoru_corpora\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_morphoru_corpora\">load_morphoru_corpora</a></code>\n<a href=\"#load_morphoru_corpora\"><code>#</code></a>\n</td>\n<td>\n<code>morph</code>\n</td>\n<td align=\"right\">\n38&nbsp;510\n</td>\n<td align=\"right\">\n4.80 Mb\n</td>\n<td>\n<code>wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/OpenCorpora_Texts.rar</code>\n</br>\n<code>unrar x OpenCorpora_Texts.rar</code>\n</br>\n<code>rm OpenCorpora_Texts.rar</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://russe.nlpub.org/downloads/\">RUSSE Russian Semantic Relatedness</a>\n</td>\n<td 
colspan=\"5\">\n</td>\n</tr>\n<tr>\n<td>\nHJ: Human Judgements of Word Pairs\n</td>\n<td>\n<a name=\"load_russe_hj\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_russe_hj\">load_russe_hj</a></code>\n<a href=\"#load_russe_hj\"><code>#</code></a>\n</td>\n<td>\n<code>emb</code>\n<code>sim</code>\n</td>\n<td align=\"right\">\n</td>\n<td align=\"right\">\n</td>\n<td>\n<code>wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/hj.csv</code>\n</td>\n</tr>\n<tr>\n<td>\nRT: Synonyms and Hypernyms from the Thesaurus RuThes\n</td>\n<td>\n<a name=\"load_russe_rt\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_russe_rt\">load_russe_rt</a></code>\n<a href=\"#load_russe_rt\"><code>#</code></a>\n</td>\n<td>\n<code>emb</code>\n<code>sim</code>\n</td>\n<td align=\"right\">\n</td>\n<td align=\"right\">\n</td>\n<td>\n<code>wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/rt.csv</code>\n</td>\n</tr>\n<tr>\n<td>\nAE: Cognitive Associations from the Sociation.org Experiment\n</td>\n<td>\n<a name=\"load_russe_ae\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_russe_ae\">load_russe_ae</a></code>\n<a href=\"#load_russe_ae\"><code>#</code></a>\n</td>\n<td>\n<code>emb</code>\n<code>sim</code>\n</td>\n<td align=\"right\">\n</td>\n<td align=\"right\">\n</td>\n<td>\n<code>wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-train.csv</code>\n</br>\n<code>wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-test.csv</code>\n</br>\n<code>wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/ae2.csv</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://toloka.yandex.ru/datasets/\">Toloka Datasets</a>\n</td>\n<td colspan=\"5\">\n</td>\n</tr>\n<tr>\n<td>\nLexical Relations from the Wisdom of the Crowd 
(LRWC)\n</td>\n<td>\n<a name=\"load_toloka_lrwc\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_toloka_lrwc\">load_toloka_lrwc</a></code>\n<a href=\"#load_toloka_lrwc\"><code>#</code></a>\n</td>\n<td>\n<code>emb</code>\n<code>sim</code>\n</td>\n<td align=\"right\">\n</td>\n<td align=\"right\">\n</td>\n<td>\n<code>wget https://tlk.s3.yandex.net/dataset/LRWC.zip</code>\n</br>\n<code>unzip LRWC.zip</code>\n</br>\n<code>rm LRWC.zip</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://github.com/cimm-kzn/RuDReC\">The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT)</a>\n</td>\n<td>\n<a name=\"load_ruadrect\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ruadrect\">load_ruadrect</a></code>\n<a href=\"#load_ruadrect\"><code>#</code></a>\n</td>\n<td>\n<code>social</code>\n</td>\n<td align=\"right\">\n9&nbsp;515\n</td>\n<td align=\"right\">\n2.09 Mb\n</td>\n<td>\nThis corpus was developed for the Social Media Mining for Health Applications (#SMM4H) Shared Task 2020\n</br>\n</br>\n<code>wget https://github.com/cimm-kzn/RuDReC/raw/master/data/RuADReCT.zip</code>\n</br>\n<code>unzip RuADReCT.zip</code>\n</br>\n<code>rm RuADReCT.zip</code>\n</td>\n</tr>\n</table>\n<!--- metas --->\n\n## Support\n\n- Chat \u2014 https://t.me/natural_language_processing\n- Issues \u2014 https://github.com/natasha/corus/issues\n- Commercial support \u2014 https://lab.alexkuk.ru\n\n## Add new source\n\n1. Implement `corus/sources/<source>.py`\n2. Add import into `corus/sources/__init__.py`\n3. Add meta into `corus/sources/meta.py`\n4. Add example into `docs.ipynb` (check meta table is correct)\n5. 
Run tests (readme is updated)\n\n## Development\n\nDev env\n\n```bash\npython -m venv ~/.venvs/natasha-corus\nsource ~/.venvs/natasha-corus/bin/activate\n\npip install -r requirements/dev.txt\npip install -e .\n\npython -m ipykernel install --user --name natasha-corus\n```\n\nLint + update docs\n\n```bash\nmake lint\nmake exec-docs\n```\n\nRelease\n\n```bash\n# Update setup.py version\n\ngit commit -am 'Up version'\ngit tag v0.10.0\n\ngit push\ngit push --tags\n```\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Links to russian corpora, functions for loading and parsing",
    "version": "0.10.0",
    "project_urls": {
        "Homepage": "https://github.com/natasha/corus"
    },
    "split_keywords": [
        "corpora",
        "russian",
        "nlp",
        "datasets"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "26102c40454156b8bc65bdce019785aa508487b3b5cc07b35fd2c2da3d9b1418",
                "md5": "01619d7269db12d678cfc61e80962f4a",
                "sha256": "7b8da75d9fab0c3ee0d52a9fd575965dcd93fa1818da01a91bff178b3ad90bc7"
            },
            "downloads": -1,
            "filename": "corus-0.10.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "01619d7269db12d678cfc61e80962f4a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 83650,
            "upload_time": "2023-07-24T08:54:25",
            "upload_time_iso_8601": "2023-07-24T08:54:25.371235Z",
            "url": "https://files.pythonhosted.org/packages/26/10/2c40454156b8bc65bdce019785aa508487b3b5cc07b35fd2c2da3d9b1418/corus-0.10.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "797e50769ae67af426bb53727fdfbf34e768edb14f5e4900f4110174588666e3",
                "md5": "cdf056d3171481018d543e92b674436d",
                "sha256": "0e203f4fb96b841822ca34a79c2004564ec68a1bcf247ab09e08e49b0a7563e9"
            },
            "downloads": -1,
            "filename": "corus-0.10.0.tar.gz",
            "has_sig": false,
            "md5_digest": "cdf056d3171481018d543e92b674436d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 76494,
            "upload_time": "2023-07-24T08:54:26",
            "upload_time_iso_8601": "2023-07-24T08:54:26.618878Z",
            "url": "https://files.pythonhosted.org/packages/79/7e/50769ae67af426bb53727fdfbf34e768edb14f5e4900f4110174588666e3/corus-0.10.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-07-24 08:54:26",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "natasha",
    "github_project": "corus",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "corus"
}
        