<img src="https://github.com/natasha/natasha-logos/blob/master/corus.svg">
![CI](https://github.com/natasha/corus/actions/workflows/test.yml/badge.svg)
Links to publicly available Russian corpora + code for loading and parsing. <a href="#reference">20+ datasets, 350Gb+ of text</a>.
## Usage
For example, let's use the <a href="https://github.com/yutkin/Lenta.Ru-News-Dataset">dump of lenta.ru by @yutkin</a>. Manually download the archive (the link is in the <a href="#reference">Reference</a> section):
```bash
wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz
```
Use `corus` to load the data:
```python
>>> from corus import load_lenta
>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)
LentaRecord(
    url='https://lenta.ru/news/2018/12/14/cancer/',
    title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
    text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
    topic='Россия',
    tags='Общество'
)
```
Iterate over texts:
```python
>>> records = load_lenta(path)
>>> for record in records:
...     text = record.text
...     ...
```
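
Since `load_lenta` returns an iterator (the `next(records)` call above relies on that), a preview of the first few texts can be taken without reading the whole archive. The following is a minimal sketch, assuming only that each record has a `text` attribute; `head_texts` and the stand-in `Record` are hypothetical names used for illustration and are not part of `corus`:

```python
from collections import namedtuple
from itertools import islice


def head_texts(records, n=3):
    """Return the texts of the first n records without consuming the rest."""
    return [record.text for record in islice(records, n)]


# Stand-in records so the sketch runs without a downloaded archive;
# with corus installed, pass load_lenta(path) instead.
Record = namedtuple('Record', ['title', 'text'])
sample = iter([
    Record('a', 'first text'),
    Record('b', 'second text'),
    Record('c', 'third text'),
])

preview = head_texts(sample, n=2)
print(preview)
```

Because `islice` stops pulling items once `n` records have been read, the remaining records stay available on the same iterator.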
For links to other datasets and their loaders see the <a href="#reference">Reference</a> section.
## Documentation
Materials are in Russian:
* <a href="https://natasha.github.io/corus">Corus page on natasha.github.io</a>
* <a href="https://youtu.be/-7XT_U6hVvk?t=2758">Corus section of Datafest 2020 talk</a>
## Install
`corus` supports Python 3.5+ and PyPy 3.
```bash
$ pip install corus
```
## Reference
<!--- metas --->
<table>
<tr>
<th>Dataset</th>
<th>API <code>from corus import</code></th>
<th>Tags</th>
<th>Texts</th>
<th>Uncompressed</th>
<th>Description</th>
</tr>
<tr>
<td>
<a href="https://github.com/yutkin/Lenta.Ru-News-Dataset">Lenta.ru</a>
</td>
<td colspan="5">
</td>
</tr>
<tr>
<td>
Lenta.ru v1.0
</td>
<td>
<a name="load_lenta"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_lenta">load_lenta</a></code>
<a href="#load_lenta"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
739 351
</td>
<td align="right">
1.66 Gb
</td>
<td>
<code>wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz</code>
</td>
</tr>
<tr>
<td>
Lenta.ru v1.1+
</td>
<td>
<a name="load_lenta2"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_lenta2">load_lenta2</a></code>
<a href="#load_lenta2"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
800 975
</td>
<td align="right">
1.94 Gb
</td>
<td>
<code>wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.1/lenta-ru-news.csv.bz2</code>
</td>
</tr>
<tr>
<td>
<a href="https://russe.nlpub.org/downloads/">Lib.rus.ec</a>
</td>
<td>
<a name="load_librusec"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_librusec">load_librusec</a></code>
<a href="#load_librusec"><code>#</code></a>
</td>
<td>
<code>fiction</code>
</td>
<td align="right">
301 871
</td>
<td align="right">
144.92 Gb
</td>
<td>
Dump of lib.rus.ec prepared for RUSSE workshop
</br>
</br>
<code>wget http://panchenko.me/data/russe/librusec_fb2.plain.gz</code>
</td>
</tr>
<tr>
<td>
<a href="https://github.com/RossiyaSegodnya/ria_news_dataset">Rossiya Segodnya</a>
</td>
<td>
<a name="load_ria_raw"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ria_raw">load_ria_raw</a></code>
<a href="#load_ria_raw"><code>#</code></a>
</br>
<a name="load_ria"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ria">load_ria</a></code>
<a href="#load_ria"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
1 003 869
</td>
<td align="right">
3.70 Gb
</td>
<td>
<code>wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz</code>
</td>
</tr>
<tr>
<td>
<a href="http://study.mokoron.com/">Mokoron Russian Twitter Corpus</a>
</td>
<td>
<a name="load_mokoron"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_mokoron">load_mokoron</a></code>
<a href="#load_mokoron"><code>#</code></a>
</td>
<td>
<code>social</code>
<code>sentiment</code>
</td>
<td align="right">
17 633 417
</td>
<td align="right">
1.86 Gb
</td>
<td>
Russian Twitter sentiment markup
</br>
</br>
Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql
</td>
</tr>
<tr>
<td>
<a href="https://dumps.wikimedia.org/">Wikipedia</a>
</td>
<td>
<a name="load_wiki"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_wiki">load_wiki</a></code>
<a href="#load_wiki"><code>#</code></a>
</td>
<td>
</td>
<td align="right">
1 541 401
</td>
<td align="right">
12.94 Gb
</td>
<td>
Russian Wiki dump
</br>
</br>
<code>wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2</code>
</td>
</tr>
<tr>
<td>
<a href="https://github.com/dialogue-evaluation/GramEval2020">GramEval2020</a>
</td>
<td>
<a name="load_gramru"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_gramru">load_gramru</a></code>
<a href="#load_gramru"><code>#</code></a>
</td>
<td>
</td>
<td align="right">
162 372
</td>
<td align="right">
30.04 Mb
</td>
<td>
<code>wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip</code>
</br>
<code>unzip master.zip</code>
</br>
<code>mv GramEval2020-master/dataTrain train</code>
</br>
<code>mv GramEval2020-master/dataOpenTest dev</code>
</br>
<code>rm -r master.zip GramEval2020-master</code>
</br>
<code>wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu</code>
</td>
</tr>
<tr>
<td>
<a href="http://opencorpora.org/">OpenCorpora</a>
</td>
<td>
<a name="load_corpora"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_corpora">load_corpora</a></code>
<a href="#load_corpora"><code>#</code></a>
</td>
<td>
<code>morph</code>
</td>
<td align="right">
4 030
</td>
<td align="right">
20.21 Mb
</td>
<td>
<code>wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip</code>
</td>
</tr>
<tr>
<td>
RusVectores SimLex-965
</td>
<td>
<a name="load_simlex"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_simlex">load_simlex</a></code>
<a href="#load_simlex"><code>#</code></a>
</td>
<td>
<code>emb</code>
<code>sim</code>
</td>
<td align="right">
</td>
<td align="right">
</td>
<td>
<code>wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv</code>
</br>
<code>wget https://rusvectores.org/static/testsets/ru_simlex965.tsv</code>
</td>
</tr>
<tr>
<td>
<a href="https://omnia-russica.github.io/">Omnia Russica</a>
</td>
<td>
<a name="load_omnia"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_omnia">load_omnia</a></code>
<a href="#load_omnia"><code>#</code></a>
</td>
<td>
<code>morph</code>
<code>web</code>
<code>fiction</code>
</td>
<td align="right">
</td>
<td align="right">
489.62 Gb
</td>
<td>
Taiga + Wiki + Araneum. See the paper "Even larger Russian corpus": https://events.spbu.ru/eventsContent/events/2019/corpora/corp_sborn.pdf
</br>
</br>
Manually download http://bit.ly/2ZT4BY9
</td>
</tr>
<tr>
<td>
<a href="https://github.com/dialogue-evaluation/factRuEval-2016/">factRuEval-2016</a>
</td>
<td>
<a name="load_factru"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_factru">load_factru</a></code>
<a href="#load_factru"><code>#</code></a>
</td>
<td>
<code>ner</code>
<code>news</code>
</td>
<td align="right">
254
</td>
<td align="right">
969.27 Kb
</td>
<td>
Manual PER, LOC, ORG markup prepared for 2016 Dialog competition
</br>
</br>
<code>wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip</code>
</br>
<code>unzip master.zip</code>
</br>
<code>rm master.zip</code>
</td>
</tr>
<tr>
<td>
<a href="https://www.researchgate.net/publication/262203599_Introducing_Baselines_for_Russian_Named_Entity_Recognition">Gareev</a>
</td>
<td>
<a name="load_gareev"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_gareev">load_gareev</a></code>
<a href="#load_gareev"><code>#</code></a>
</td>
<td>
<code>ner</code>
<code>news</code>
</td>
<td align="right">
97
</td>
<td align="right">
455.02 Kb
</td>
<td>
Manual PER, ORG markup (no LOC)
</br>
</br>
Email Rinat Gareev (gareev-rm@yandex.ru) and ask for the dataset
</br>
<code>tar -xvf rus-ner-news-corpus.iob.tar.gz</code>
</br>
<code>rm rus-ner-news-corpus.iob.tar.gz</code>
</td>
</tr>
<tr>
<td>
<a href="http://www.labinform.ru/pub/named_entities/">Collection5</a>
</td>
<td>
<a name="load_ne5"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ne5">load_ne5</a></code>
<a href="#load_ne5"><code>#</code></a>
</td>
<td>
<code>ner</code>
<code>news</code>
</td>
<td align="right">
1 000
</td>
<td align="right">
2.96 Mb
</td>
<td>
News articles with manual PER, LOC, ORG markup
</br>
</br>
<code>wget http://www.labinform.ru/pub/named_entities/collection5.zip</code>
</br>
<code>unzip collection5.zip</code>
</br>
<code>rm collection5.zip</code>
</td>
</tr>
<tr>
<td>
<a href="https://www.aclweb.org/anthology/I17-1042">WiNER</a>
</td>
<td>
<a name="load_wikiner"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_wikiner">load_wikiner</a></code>
<a href="#load_wikiner"><code>#</code></a>
</td>
<td>
<code>ner</code>
</td>
<td align="right">
203 287
</td>
<td align="right">
36.15 Mb
</td>
<td>
Sentences from Wiki auto annotated with PER, LOC, ORG tags
</br>
</br>
<code>wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2</code>
</td>
</tr>
<tr>
<td>
<a href="http://bsnlp.cs.helsinki.fi/shared_task.html">BSNLP-2019</a>
</td>
<td>
<a name="load_bsnlp"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_bsnlp">load_bsnlp</a></code>
<a href="#load_bsnlp"><code>#</code></a>
</td>
<td>
<code>ner</code>
</td>
<td align="right">
464
</td>
<td align="right">
1.16 Mb
</td>
<td>
Markup prepared for 2019 BSNLP Shared Task
</br>
</br>
<code>wget http://bsnlp.cs.helsinki.fi/TRAININGDATA_BSNLP_2019_shared_task.zip</code>
</br>
<code>wget http://bsnlp.cs.helsinki.fi/TESTDATA_BSNLP_2019_shared_task.zip</code>
</br>
<code>unzip TRAININGDATA_BSNLP_2019_shared_task.zip</code>
</br>
<code>unzip TESTDATA_BSNLP_2019_shared_task.zip -d test_pl_cs_ru_bg</code>
</br>
<code>rm TRAININGDATA_BSNLP_2019_shared_task.zip TESTDATA_BSNLP_2019_shared_task.zip</code>
</td>
</tr>
<tr>
<td>
<a href="http://ai-center.botik.ru/Airec/index.php/ru/collections/28-persons-1000">Persons-1000</a>
</td>
<td>
<a name="load_persons"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_persons">load_persons</a></code>
<a href="#load_persons"><code>#</code></a>
</td>
<td>
<code>ner</code>
<code>news</code>
</td>
<td align="right">
1 000
</td>
<td align="right">
2.96 Mb
</td>
<td>
Same as Collection5, only PER markup + normalized names
</br>
</br>
<code>wget http://ai-center.botik.ru/Airec/ai-resources/Persons-1000.zip</code>
</td>
</tr>
<tr>
<td>
<a href="https://github.com/cimm-kzn/RuDReC">The Russian Drug Reaction Corpus (RuDReC)</a>
</td>
<td>
<a name="load_rudrec"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_rudrec">load_rudrec</a></code>
<a href="#load_rudrec"><code>#</code></a>
</td>
<td>
<code>ner</code>
</td>
<td align="right">
4 809
</td>
<td align="right">
1.73 Kb
</td>
<td>
RuDReC is a new partially annotated corpus of Russian consumer reviews of pharmaceutical products, built for detecting health-related named entities and the effectiveness of pharmaceutical products. This loader works with the annotated part; for the raw part (1.4M reviews), see https://github.com/cimm-kzn/RuDReC.
</br>
</br>
<code>wget https://github.com/cimm-kzn/RuDReC/raw/master/data/rudrec_annotated.json</code>
</td>
</tr>
<tr>
<td>
<a href="https://tatianashavrina.github.io/taiga_site/">Taiga</a>
</td>
<td colspan="5">
Large collection of Russian texts from various sources: news sites, magazines, literature, social networks
</br>
</br>
<code>wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz</code>
</br>
<code>tar -xzvf retagged_taiga.tar.gz</code>
</td>
</tr>
<tr>
<td>
Arzamas
</td>
<td>
<a name="load_taiga_arzamas"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_arzamas">load_taiga_arzamas</a></code>
<a href="#load_taiga_arzamas"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
311
</td>
<td align="right">
4.50 Mb
</td>
<td>
</td>
</tr>
<tr>
<td>
Fontanka
</td>
<td>
<a name="load_taiga_fontanka"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_fontanka">load_taiga_fontanka</a></code>
<a href="#load_taiga_fontanka"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
342 683
</td>
<td align="right">
786.23 Mb
</td>
<td>
</td>
</tr>
<tr>
<td>
Interfax
</td>
<td>
<a name="load_taiga_interfax"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_interfax">load_taiga_interfax</a></code>
<a href="#load_taiga_interfax"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
46 429
</td>
<td align="right">
77.55 Mb
</td>
<td>
</td>
</tr>
<tr>
<td>
KP
</td>
<td>
<a name="load_taiga_kp"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_kp">load_taiga_kp</a></code>
<a href="#load_taiga_kp"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
45 503
</td>
<td align="right">
61.79 Mb
</td>
<td>
</td>
</tr>
<tr>
<td>
Lenta
</td>
<td>
<a name="load_taiga_lenta"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_lenta">load_taiga_lenta</a></code>
<a href="#load_taiga_lenta"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
36 446
</td>
<td align="right">
95.15 Mb
</td>
<td>
</td>
</tr>
<tr>
<td>
N+1
</td>
<td>
<a name="load_taiga_nplus1"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_nplus1">load_taiga_nplus1</a></code>
<a href="#load_taiga_nplus1"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
7 696
</td>
<td align="right">
24.96 Mb
</td>
<td>
</td>
</tr>
<tr>
<td>
Magazines
</td>
<td>
<a name="load_taiga_magazines"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_magazines">load_taiga_magazines</a></code>
<a href="#load_taiga_magazines"><code>#</code></a>
</td>
<td>
</td>
<td align="right">
39 890
</td>
<td align="right">
2.19 Gb
</td>
<td>
</td>
</tr>
<tr>
<td>
Subtitles
</td>
<td>
<a name="load_taiga_subtitles"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_subtitles">load_taiga_subtitles</a></code>
<a href="#load_taiga_subtitles"><code>#</code></a>
</td>
<td>
</td>
<td align="right">
19 011
</td>
<td align="right">
909.08 Mb
</td>
<td>
</td>
</tr>
<tr>
<td>
Social
</td>
<td>
<a name="load_taiga_social"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_social">load_taiga_social</a></code>
<a href="#load_taiga_social"><code>#</code></a>
</td>
<td>
<code>social</code>
</td>
<td align="right">
1 876 442
</td>
<td align="right">
648.18 Mb
</td>
<td>
</td>
</tr>
<tr>
<td>
Proza
</td>
<td>
<a name="load_taiga_proza"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_proza">load_taiga_proza</a></code>
<a href="#load_taiga_proza"><code>#</code></a>
</td>
<td>
<code>fiction</code>
</td>
<td align="right">
1 732 434
</td>
<td align="right">
38.25 Gb
</td>
<td>
</td>
</tr>
<tr>
<td>
Stihi
</td>
<td>
<a name="load_taiga_stihi"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_stihi">load_taiga_stihi</a></code>
<a href="#load_taiga_stihi"><code>#</code></a>
</td>
<td>
</td>
<td align="right">
9 157 686
</td>
<td align="right">
12.80 Gb
</td>
<td>
</td>
</tr>
<tr>
<td>
<a href="https://github.com/buriy/russian-nlp-datasets/releases">Russian NLP Datasets</a>
</td>
<td colspan="5">
Several Russian news datasets from webhose.io, lenta.ru and other news sites.
</td>
</tr>
<tr>
<td>
News
</td>
<td>
<a name="load_buriy_news"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_buriy_news">load_buriy_news</a></code>
<a href="#load_buriy_news"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
2 154 801
</td>
<td align="right">
6.84 Gb
</td>
<td>
Dump of top 40 news + 20 fashion news sites.
</br>
</br>
<code>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2</code>
</br>
<code>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2</code>
</br>
<code>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2</code>
</td>
</tr>
<tr>
<td>
Webhose
</td>
<td>
<a name="load_buriy_webhose"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_buriy_webhose">load_buriy_webhose</a></code>
<a href="#load_buriy_webhose"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
285 965
</td>
<td align="right">
859.32 Mb
</td>
<td>
Dump from webhose.io, 300 sources for one month.
</br>
</br>
<code>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/webhose-2016.tar.bz2</code>
</td>
</tr>
<tr>
<td>
<a href="https://github.com/ods-ai-ml4sg/proj_news_viz/releases/tag/data">ODS #proj_news_viz</a>
</td>
<td colspan="5">
Several news sites scraped by members of #proj_news_viz ODS project.
</td>
</tr>
<tr>
<td>
Interfax
</td>
<td>
<a name="load_ods_interfax"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_interfax">load_ods_interfax</a></code>
<a href="#load_ods_interfax"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
543 961
</td>
<td align="right">
1.22 Gb
</td>
<td>
<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/interfax.csv.gz</code>
</td>
</tr>
<tr>
<td>
Gazeta
</td>
<td>
<a name="load_ods_gazeta"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_gazeta">load_ods_gazeta</a></code>
<a href="#load_ods_gazeta"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
865 847
</td>
<td align="right">
1.63 Gb
</td>
<td>
<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/gazeta.csv.gz</code>
</td>
</tr>
<tr>
<td>
Izvestia
</td>
<td>
<a name="load_ods_izvestia"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_izvestia">load_ods_izvestia</a></code>
<a href="#load_ods_izvestia"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
86 601
</td>
<td align="right">
307.19 Mb
</td>
<td>
<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/iz.csv.gz</code>
</td>
</tr>
<tr>
<td>
Meduza
</td>
<td>
<a name="load_ods_meduza"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_meduza">load_ods_meduza</a></code>
<a href="#load_ods_meduza"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
71 806
</td>
<td align="right">
270.11 Mb
</td>
<td>
<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/meduza.csv.gz</code>
</td>
</tr>
<tr>
<td>
RIA
</td>
<td>
<a name="load_ods_ria"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_ria">load_ods_ria</a></code>
<a href="#load_ods_ria"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
101 543
</td>
<td align="right">
233.88 Mb
</td>
<td>
<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/ria.csv.gz</code>
</td>
</tr>
<tr>
<td>
Russia Today
</td>
<td>
<a name="load_ods_rt"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_rt">load_ods_rt</a></code>
<a href="#load_ods_rt"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
106 644
</td>
<td align="right">
187.12 Mb
</td>
<td>
<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/rt.csv.gz</code>
</td>
</tr>
<tr>
<td>
TASS
</td>
<td>
<a name="load_ods_tass"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_tass">load_ods_tass</a></code>
<a href="#load_ods_tass"><code>#</code></a>
</td>
<td>
<code>news</code>
</td>
<td align="right">
1 135 635
</td>
<td align="right">
3.27 Gb
</td>
<td>
<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/tass-001.csv.gz</code>
</td>
</tr>
<tr>
<td>
<a href="https://universaldependencies.org/">Universal Dependencies</a>
</td>
<td colspan="5">
</td>
</tr>
<tr>
<td>
GSD
</td>
<td>
<a name="load_ud_gsd"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ud_gsd">load_ud_gsd</a></code>
<a href="#load_ud_gsd"><code>#</code></a>
</td>
<td>
<code>morph</code>
<code>syntax</code>
</td>
<td align="right">
5 030
</td>
<td align="right">
1.01 Mb
</td>
<td>
<code>wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-dev.conllu</code>
</br>
<code>wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-test.conllu</code>
</br>
<code>wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-train.conllu</code>
</td>
</tr>
<tr>
<td>
Taiga
</td>
<td>
<a name="load_ud_taiga"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ud_taiga">load_ud_taiga</a></code>
<a href="#load_ud_taiga"><code>#</code></a>
</td>
<td>
<code>morph</code>
<code>syntax</code>
</td>
<td align="right">
3 264
</td>
<td align="right">
353.80 Kb
</td>
<td>
<code>wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-dev.conllu</code>
</br>
<code>wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-test.conllu</code>
</br>
<code>wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-train.conllu</code>
</td>
</tr>
<tr>
<td>
PUD
</td>
<td>
<a name="load_ud_pud"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ud_pud">load_ud_pud</a></code>
<a href="#load_ud_pud"><code>#</code></a>
</td>
<td>
<code>morph</code>
<code>syntax</code>
</td>
<td align="right">
1 000
</td>
<td align="right">
207.78 Kb
</td>
<td>
<code>wget https://github.com/UniversalDependencies/UD_Russian-PUD/raw/master/ru_pud-ud-test.conllu</code>
</td>
</tr>
<tr>
<td>
SynTagRus
</td>
<td>
<a name="load_ud_syntag"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ud_syntag">load_ud_syntag</a></code>
<a href="#load_ud_syntag"><code>#</code></a>
</td>
<td>
<code>morph</code>
<code>syntax</code>
</td>
<td align="right">
61 889
</td>
<td align="right">
11.33 Mb
</td>
<td>
<code>wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-dev.conllu</code>
</br>
<code>wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-test.conllu</code>
</br>
<code>wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-train.conllu</code>
</td>
</tr>
<tr>
<td>
<a href="https://github.com/dialogue-evaluation/morphoRuEval-2017">morphoRuEval-2017</a>
</td>
<td colspan="5">
</td>
</tr>
<tr>
<td>
General Internet-Corpus
</td>
<td>
<a name="load_morphoru_gicrya"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_morphoru_gicrya">load_morphoru_gicrya</a></code>
<a href="#load_morphoru_gicrya"><code>#</code></a>
</td>
<td>
<code>morph</code>
</td>
<td align="right">
83 148
</td>
<td align="right">
10.58 Mb
</td>
<td>
<code>wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/GIKRYA_texts_new.zip</code>
</br>
<code>unzip GIKRYA_texts_new.zip</code>
</br>
<code>rm GIKRYA_texts_new.zip</code>
</td>
</tr>
<tr>
<td>
Russian National Corpus
</td>
<td>
<a name="load_morphoru_rnc"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_morphoru_rnc">load_morphoru_rnc</a></code>
<a href="#load_morphoru_rnc"><code>#</code></a>
</td>
<td>
<code>morph</code>
</td>
<td align="right">
98 892
</td>
<td align="right">
12.71 Mb
</td>
<td>
<code>wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/RNC_texts.rar</code>
</br>
<code>unrar x RNC_texts.rar</code>
</br>
<code>rm RNC_texts.rar</code>
</td>
</tr>
<tr>
<td>
OpenCorpora
</td>
<td>
<a name="load_morphoru_corpora"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_morphoru_corpora">load_morphoru_corpora</a></code>
<a href="#load_morphoru_corpora"><code>#</code></a>
</td>
<td>
<code>morph</code>
</td>
<td align="right">
38 510
</td>
<td align="right">
4.80 Mb
</td>
<td>
<code>wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/OpenCorpora_Texts.rar</code>
</br>
<code>unrar x OpenCorpora_Texts.rar</code>
</br>
<code>rm OpenCorpora_Texts.rar</code>
</td>
</tr>
<tr>
<td>
<a href="https://russe.nlpub.org/downloads/">RUSSE Russian Semantic Relatedness</a>
</td>
<td colspan="5">
</td>
</tr>
<tr>
<td>
HJ: Human Judgements of Word Pairs
</td>
<td>
<a name="load_russe_hj"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_russe_hj">load_russe_hj</a></code>
<a href="#load_russe_hj"><code>#</code></a>
</td>
<td>
<code>emb</code>
<code>sim</code>
</td>
<td align="right">
</td>
<td align="right">
</td>
<td>
<code>wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/hj.csv</code>
</td>
</tr>
<tr>
<td>
RT: Synonyms and Hypernyms from the Thesaurus RuThes
</td>
<td>
<a name="load_russe_rt"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_russe_rt">load_russe_rt</a></code>
<a href="#load_russe_rt"><code>#</code></a>
</td>
<td>
<code>emb</code>
<code>sim</code>
</td>
<td align="right">
</td>
<td align="right">
</td>
<td>
<code>wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/rt.csv</code>
</td>
</tr>
<tr>
<td>
AE: Cognitive Associations from the Sociation.org Experiment
</td>
<td>
<a name="load_russe_ae"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_russe_ae">load_russe_ae</a></code>
<a href="#load_russe_ae"><code>#</code></a>
</td>
<td>
<code>emb</code>
<code>sim</code>
</td>
<td align="right">
</td>
<td align="right">
</td>
<td>
<code>wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-train.csv</code>
</br>
<code>wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-test.csv</code>
</br>
<code>wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/ae2.csv</code>
</td>
</tr>
<tr>
<td>
<a href="https://toloka.yandex.ru/datasets/">Toloka Datasets</a>
</td>
<td colspan="5">
</td>
</tr>
<tr>
<td>
Lexical Relations from the Wisdom of the Crowd (LRWC)
</td>
<td>
<a name="load_toloka_lrwc"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_toloka_lrwc">load_toloka_lrwc</a></code>
<a href="#load_toloka_lrwc"><code>#</code></a>
</td>
<td>
<code>emb</code>
<code>sim</code>
</td>
<td align="right">
</td>
<td align="right">
</td>
<td>
<code>wget https://tlk.s3.yandex.net/dataset/LRWC.zip</code>
</br>
<code>unzip LRWC.zip</code>
</br>
<code>rm LRWC.zip</code>
</td>
</tr>
<tr>
<td>
<a href="https://github.com/cimm-kzn/RuDReC">The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT)</a>
</td>
<td>
<a name="load_ruadrect"></a>
<code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ruadrect">load_ruadrect</a></code>
<a href="#load_ruadrect"><code>#</code></a>
</td>
<td>
<code>social</code>
</td>
<td align="right">
9 515
</td>
<td align="right">
2.09 Mb
</td>
<td>
This corpus was developed for the Social Media Mining for Health Applications (#SMM4H) Shared Task 2020
</br>
</br>
<code>wget https://github.com/cimm-kzn/RuDReC/raw/master/data/RuADReCT.zip</code>
</br>
<code>unzip RuADReCT.zip</code>
</br>
<code>rm RuADReCT.zip</code>
</td>
</tr>
</table>
<!--- metas --->
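
Every loader in the table is imported from the top-level package (`from corus import load_…`). A small sketch for listing such loaders by name follows. `list_loaders` is a hypothetical helper, and the `fake_corus` namespace is a stand-in so the example runs without the package installed. With `corus` installed, passing the imported module should behave the same way, assuming all loaders are exposed as top-level `load_*` callables as the table suggests.

```python
from types import SimpleNamespace


def list_loaders(module):
    """Return sorted names of callables on `module` that start with 'load_'."""
    return sorted(
        name for name in dir(module)
        if name.startswith('load_') and callable(getattr(module, name))
    )


# Stand-in for the real package (hypothetical contents).
fake_corus = SimpleNamespace(
    load_lenta=lambda path: iter(()),
    load_wiki=lambda path: iter(()),
    load_count=3,         # not callable, so it is skipped
    helper=lambda: None,  # wrong prefix, so it is skipped
)

loaders = list_loaders(fake_corus)
print(loaders)
```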
## Support
- Chat — https://t.me/natural_language_processing
- Issues — https://github.com/natasha/corus/issues
- Commercial support — https://lab.alexkuk.ru
## Add new source
1. Implement `corus/sources/<source>.py`
2. Add import into `corus/sources/__init__.py`
3. Add meta into `corus/sources/meta.py`
4. Add an example to `docs.ipynb` (check that the meta table is correct)
5. Run tests (readme is updated)
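
The first step above can be sketched as follows. This is an illustration under assumptions, not the project's actual code: the `example` source, the `ExampleRecord` fields, and the gzip-compressed CSV layout are all hypothetical, and the sketch uses only the standard library rather than any internal `corus` helpers. It mirrors the pattern seen in the Usage section, where a `load_<source>(path)` function lazily yields record objects.

```python
import csv
import gzip
import os
import tempfile
from collections import namedtuple

# Hypothetical record type; real sources define their own fields.
ExampleRecord = namedtuple('ExampleRecord', ['url', 'title', 'text'])


def load_example(path):
    """Lazily yield ExampleRecord items from a gzip-compressed CSV file."""
    with gzip.open(path, mode='rt', encoding='utf8', newline='') as file:
        reader = csv.DictReader(file)
        for row in reader:
            yield ExampleRecord(row['url'], row['title'], row['text'])


# Build a tiny archive to show the loader in use.
path = os.path.join(tempfile.mkdtemp(), 'example.csv.gz')
with gzip.open(path, mode='wt', encoding='utf8', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['url', 'title', 'text'])
    writer.writerow(['https://example.org/1', 'Title', 'Some text'])

records = load_example(path)
first = next(records)
print(first)
```

Writing the loader as a generator keeps memory use flat for large archives, which matters for the multi-gigabyte datasets listed in the Reference section.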
## Development
Dev env
```bash
python -m venv ~/.venvs/natasha-corus
source ~/.venvs/natasha-corus/bin/activate
pip install -r requirements/dev.txt
pip install -e .
python -m ipykernel install --user --name natasha-corus
```
Lint + update docs
```bash
make lint
make exec-docs
```
Release
```bash
# Update setup.py version
git commit -am 'Up version'
git tag v0.10.0
git push
git push --tags
```
Raw data
{
"_id": null,
"home_page": "https://github.com/natasha/corus",
"name": "corus",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "corpora,russian,nlp,datasets",
"author": "Alexander Kukushkin",
"author_email": "alex@alexkuk.ru",
"download_url": "https://files.pythonhosted.org/packages/79/7e/50769ae67af426bb53727fdfbf34e768edb14f5e4900f4110174588666e3/corus-0.10.0.tar.gz",
"platform": null,
"description": "\n<img src=\"https://github.com/natasha/natasha-logos/blob/master/corus.svg\">\n\n![CI](https://github.com/natasha/corus/actions/workflows/test.yml/badge.svg)\n\nLinks to publicly available Russian corpora + code for loading and parsing. <a href=\"#reference\">20+ datasets, 350Gb+ of text</a>.\n\n## Usage\n\nFor example lets use <a href=\"https://github.com/yutkin/Lenta.Ru-News-Dataset\">dump of lenta.ru by @yutkin</a>. Manually download the archive (link in the <a href=\"#reference\">Reference</a> section):\n```bash\nwget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz\n```\n\nUse `corus` to load the data:\n\n```python\n>>> from corus import load_lenta\n\n>>> path = 'lenta-ru-news.csv.gz'\n>>> records = load_lenta(path)\n>>> next(records)\n\nLentaRecord(\n url='https://lenta.ru/news/2018/12/14/cancer/',\n title='\u041d\u0430\u0437\u0432\u0430\u043d\u044b \u0440\u0435\u0433\u0438\u043e\u043d\u044b \u0420\u043e\u0441\u0441\u0438\u0438 \u0441\\xa0\u0441\u0430\u043c\u043e\u0439 \u0432\u044b\u0441\u043e\u043a\u043e\u0439 \u0441\u043c\u0435\u0440\u0442\u043d\u043e\u0441\u0442\u044c\u044e \u043e\u0442\\xa0\u0440\u0430\u043a\u0430',\n text='\u0412\u0438\u0446\u0435-\u043f\u0440\u0435\u043c\u044c\u0435\u0440 \u043f\u043e \u0441\u043e\u0446\u0438\u0430\u043b\u044c\u043d\u044b\u043c \u0432\u043e\u043f\u0440\u043e\u0441\u0430\u043c \u0422\u0430\u0442\u044c\u044f\u043d\u0430 \u0413\u043e\u043b\u0438\u043a\u043e\u0432\u0430 \u0440\u0430\u0441\u0441\u043a\u0430\u0437\u0430\u043b\u0430, \u0432 \u043a\u0430\u043a\u0438\u0445 \u0440\u0435\u0433\u0438\u043e\u043d\u0430\u0445 \u0420\u043e\u0441\u0441\u0438\u0438 \u0437\u0430\u0444\u0438\u043a\u0441\u0438\u0440\u043e\u0432\u0430\u043d\u0430 \u043d\u0430\u0438\u0431\u043e\u043b\u0435\u0435 \u0432\u044b\u0441\u043e\u043a\u0430\u044f \u0441\u043c\u0435\u0440\u0442\u043d\u043e\u0441\u0442\u044c \u043e\u0442 \u0440\u0430\u043a\u0430, \u0441\u043e\u043e\u0431...',\n 
topic='\u0420\u043e\u0441\u0441\u0438\u044f',\n tags='\u041e\u0431\u0449\u0435\u0441\u0442\u0432\u043e'\n)\n```\n\nIterate over texts:\n\n```python\n>>> records = load_lenta(path)\n>>> for record in records:\n... text = record.text\n... ...\n\n```\n\nFor links to other datasets and their loaders see the <a href=\"#reference\">Reference</a> section.\n\n## Documentation\n\nMaterials are in Russian:\n\n* <a href=\"https://natasha.github.io/corus\">Corus page on natasha.github.io</a> \n* <a href=\"https://youtu.be/-7XT_U6hVvk?t=2758\">Corus section of Datafest 2020 talk</a>\n\n## Install\n\n`corus` supports Python 3.5+, PyPy 3.\n\n```bash\n$ pip install corus\n```\n\n## Reference\n\n<!--- metas --->\n<table>\n<tr>\n<th>Dataset</th>\n<th>API <code>from corus import</code></th>\n<th>Tags</th>\n<th>Texts</th>\n<th>Uncompressed</th>\n<th>Description</th>\n</tr>\n<tr>\n<td>\n<a href=\"https://github.com/yutkin/Lenta.Ru-News-Dataset\">Lenta.ru</a>\n</td>\n<td colspan=\"5\">\n</td>\n</tr>\n<tr>\n<td>\nLenta.ru v1.0\n</td>\n<td>\n<a name=\"load_lenta\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_lenta\">load_lenta</a></code>\n<a href=\"#load_lenta\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n739 351\n</td>\n<td align=\"right\">\n1.66 Gb\n</td>\n<td>\n<code>wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz</code>\n</td>\n</tr>\n<tr>\n<td>\nLenta.ru v1.1+\n</td>\n<td>\n<a name=\"load_lenta2\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_lenta2\">load_lenta2</a></code>\n<a href=\"#load_lenta2\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n800 975\n</td>\n<td align=\"right\">\n1.94 Gb\n</td>\n<td>\n<code>wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.1/lenta-ru-news.csv.bz2</code>\n</td>\n</tr>\n<tr>\n<td>\n<a 
href=\"https://russe.nlpub.org/downloads/\">Lib.rus.ec</a>\n</td>\n<td>\n<a name=\"load_librusec\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_librusec\">load_librusec</a></code>\n<a href=\"#load_librusec\"><code>#</code></a>\n</td>\n<td>\n<code>fiction</code>\n</td>\n<td align=\"right\">\n301 871\n</td>\n<td align=\"right\">\n144.92 Gb\n</td>\n<td>\nDump of lib.rus.ec prepared for RUSSE workshop\n</br>\n</br>\n<code>wget http://panchenko.me/data/russe/librusec_fb2.plain.gz</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://github.com/RossiyaSegodnya/ria_news_dataset\">Rossiya Segodnya</a>\n</td>\n<td>\n<a name=\"load_ria_raw\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ria_raw\">load_ria_raw</a></code>\n<a href=\"#load_ria_raw\"><code>#</code></a>\n</br>\n<a name=\"load_ria\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ria\">load_ria</a></code>\n<a href=\"#load_ria\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n1 003 869\n</td>\n<td align=\"right\">\n3.70 Gb\n</td>\n<td>\n<code>wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"http://study.mokoron.com/\">Mokoron Russian Twitter Corpus</a>\n</td>\n<td>\n<a name=\"load_mokoron\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_mokoron\">load_mokoron</a></code>\n<a href=\"#load_mokoron\"><code>#</code></a>\n</td>\n<td>\n<code>social</code>\n<code>sentiment</code>\n</td>\n<td align=\"right\">\n17 633 417\n</td>\n<td align=\"right\">\n1.86 Gb\n</td>\n<td>\nRussian Twitter sentiment markup\n</br>\n</br>\nManually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://dumps.wikimedia.org/\">Wikipedia</a>\n</td>\n<td>\n<a 
name=\"load_wiki\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_wiki\">load_wiki</a></code>\n<a href=\"#load_wiki\"><code>#</code></a>\n</td>\n<td>\n</td>\n<td align=\"right\">\n1 541 401\n</td>\n<td align=\"right\">\n12.94 Gb\n</td>\n<td>\nRussian Wiki dump\n</br>\n</br>\n<code>wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://github.com/dialogue-evaluation/GramEval2020\">GramEval2020</a>\n</td>\n<td>\n<a name=\"load_gramru\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_gramru\">load_gramru</a></code>\n<a href=\"#load_gramru\"><code>#</code></a>\n</td>\n<td>\n</td>\n<td align=\"right\">\n162 372\n</td>\n<td align=\"right\">\n30.04 Mb\n</td>\n<td>\n<code>wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip</code>\n</br>\n<code>unzip master.zip</code>\n</br>\n<code>mv GramEval2020-master/dataTrain train</code>\n</br>\n<code>mv GramEval2020-master/dataOpenTest dev</code>\n</br>\n<code>rm -r master.zip GramEval2020-master</code>\n</br>\n<code>wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"http://opencorpora.org/\">OpenCorpora</a>\n</td>\n<td>\n<a name=\"load_corpora\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_corpora\">load_corpora</a></code>\n<a href=\"#load_corpora\"><code>#</code></a>\n</td>\n<td>\n<code>morph</code>\n</td>\n<td align=\"right\">\n4 030\n</td>\n<td align=\"right\">\n20.21 Mb\n</td>\n<td>\n<code>wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip</code>\n</td>\n</tr>\n<tr>\n<td>\nRusVectores SimLex-965\n</td>\n<td>\n<a name=\"load_simlex\"></a>\n<code><a 
href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_simlex\">load_simlex</a></code>\n<a href=\"#load_simlex\"><code>#</code></a>\n</td>\n<td>\n<code>emb</code>\n<code>sim</code>\n</td>\n<td align=\"right\">\n</td>\n<td align=\"right\">\n</td>\n<td>\n<code>wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv</code>\n</br>\n<code>wget https://rusvectores.org/static/testsets/ru_simlex965.tsv</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://omnia-russica.github.io/\">Omnia Russica</a>\n</td>\n<td>\n<a name=\"load_omnia\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_omnia\">load_omnia</a></code>\n<a href=\"#load_omnia\"><code>#</code></a>\n</td>\n<td>\n<code>morph</code>\n<code>web</code>\n<code>fiction</code>\n</td>\n<td align=\"right\">\n</td>\n<td align=\"right\">\n489.62 Gb\n</td>\n<td>\nTaiga + Wiki + Araneum. Read \"Even larger Russian corpus\" https://events.spbu.ru/eventsContent/events/2019/corpora/corp_sborn.pdf\n</br>\n</br>\nManually download http://bit.ly/2ZT4BY9\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://github.com/dialogue-evaluation/factRuEval-2016/\">factRuEval-2016</a>\n</td>\n<td>\n<a name=\"load_factru\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_factru\">load_factru</a></code>\n<a href=\"#load_factru\"><code>#</code></a>\n</td>\n<td>\n<code>ner</code>\n<code>news</code>\n</td>\n<td align=\"right\">\n254\n</td>\n<td align=\"right\">\n969.27 Kb\n</td>\n<td>\nManual PER, LOC, ORG markup prepared for 2016 Dialog competition\n</br>\n</br>\n<code>wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip</code>\n</br>\n<code>unzip master.zip</code>\n</br>\n<code>rm master.zip</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://www.researchgate.net/publication/262203599_Introducing_Baselines_for_Russian_Named_Entity_Recognition\">Gareev</a>\n</td>\n<td>\n<a 
name=\"load_gareev\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_gareev\">load_gareev</a></code>\n<a href=\"#load_gareev\"><code>#</code></a>\n</td>\n<td>\n<code>ner</code>\n<code>news</code>\n</td>\n<td align=\"right\">\n97\n</td>\n<td align=\"right\">\n455.02 Kb\n</td>\n<td>\nManual PER, ORG markup (no LOC)\n</br>\n</br>\nEmail Rinat Gareev (gareev-rm@yandex.ru) and ask for the dataset\n</br>\n<code>tar -xvf rus-ner-news-corpus.iob.tar.gz</code>\n</br>\n<code>rm rus-ner-news-corpus.iob.tar.gz</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"http://www.labinform.ru/pub/named_entities/\">Collection5</a>\n</td>\n<td>\n<a name=\"load_ne5\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ne5\">load_ne5</a></code>\n<a href=\"#load_ne5\"><code>#</code></a>\n</td>\n<td>\n<code>ner</code>\n<code>news</code>\n</td>\n<td align=\"right\">\n1 000\n</td>\n<td align=\"right\">\n2.96 Mb\n</td>\n<td>\nNews articles with manual PER, LOC, ORG markup\n</br>\n</br>\n<code>wget http://www.labinform.ru/pub/named_entities/collection5.zip</code>\n</br>\n<code>unzip collection5.zip</code>\n</br>\n<code>rm collection5.zip</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://www.aclweb.org/anthology/I17-1042\">WiNER</a>\n</td>\n<td>\n<a name=\"load_wikiner\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_wikiner\">load_wikiner</a></code>\n<a href=\"#load_wikiner\"><code>#</code></a>\n</td>\n<td>\n<code>ner</code>\n</td>\n<td align=\"right\">\n203 287\n</td>\n<td align=\"right\">\n36.15 Mb\n</td>\n<td>\nSentences from Wiki automatically annotated with PER, LOC, ORG tags\n</br>\n</br>\n<code>wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"http://bsnlp.cs.helsinki.fi/shared_task.html\">BSNLP-2019</a>\n</td>\n<td>\n<a name=\"load_bsnlp\"></a>\n<code><a 
href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_bsnlp\">load_bsnlp</a></code>\n<a href=\"#load_bsnlp\"><code>#</code></a>\n</td>\n<td>\n<code>ner</code>\n</td>\n<td align=\"right\">\n464\n</td>\n<td align=\"right\">\n1.16 Mb\n</td>\n<td>\nMarkup prepared for 2019 BSNLP Shared Task\n</br>\n</br>\n<code>wget http://bsnlp.cs.helsinki.fi/TRAININGDATA_BSNLP_2019_shared_task.zip</code>\n</br>\n<code>wget http://bsnlp.cs.helsinki.fi/TESTDATA_BSNLP_2019_shared_task.zip</code>\n</br>\n<code>unzip TRAININGDATA_BSNLP_2019_shared_task.zip</code>\n</br>\n<code>unzip TESTDATA_BSNLP_2019_shared_task.zip -d test_pl_cs_ru_bg</code>\n</br>\n<code>rm TRAININGDATA_BSNLP_2019_shared_task.zip TESTDATA_BSNLP_2019_shared_task.zip</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"http://ai-center.botik.ru/Airec/index.php/ru/collections/28-persons-1000\">Persons-1000</a>\n</td>\n<td>\n<a name=\"load_persons\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_persons\">load_persons</a></code>\n<a href=\"#load_persons\"><code>#</code></a>\n</td>\n<td>\n<code>ner</code>\n<code>news</code>\n</td>\n<td align=\"right\">\n1 000\n</td>\n<td align=\"right\">\n2.96 Mb\n</td>\n<td>\nSame as Collection5, only PER markup + normalized names\n</br>\n</br>\n<code>wget http://ai-center.botik.ru/Airec/ai-resources/Persons-1000.zip</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://github.com/cimm-kzn/RuDReC\">The Russian Drug Reaction Corpus (RuDReC)</a>\n</td>\n<td>\n<a name=\"load_rudrec\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_rudrec\">load_rudrec</a></code>\n<a href=\"#load_rudrec\"><code>#</code></a>\n</td>\n<td>\n<code>ner</code>\n</td>\n<td align=\"right\">\n4 809\n</td>\n<td align=\"right\">\n1.73 Kb\n</td>\n<td>\nRuDReC is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products for the detection of health-related named 
entities and the effectiveness of pharmaceutical products. Here you can download and work with the annotated part; to get the raw part (1.4M reviews), please refer to https://github.com/cimm-kzn/RuDReC.\n</br>\n</br>\n<code>wget https://github.com/cimm-kzn/RuDReC/raw/master/data/rudrec_annotated.json</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://tatianashavrina.github.io/taiga_site/\">Taiga</a>\n</td>\n<td colspan=\"5\">\nLarge collection of Russian texts from various sources: news sites, magazines, literature, social networks\n</br>\n</br>\n<code>wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz</code>\n</br>\n<code>tar -xzvf retagged_taiga.tar.gz</code>\n</td>\n</tr>\n<tr>\n<td>\nArzamas\n</td>\n<td>\n<a name=\"load_taiga_arzamas\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_arzamas\">load_taiga_arzamas</a></code>\n<a href=\"#load_taiga_arzamas\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n311\n</td>\n<td align=\"right\">\n4.50 Mb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\nFontanka\n</td>\n<td>\n<a name=\"load_taiga_fontanka\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_fontanka\">load_taiga_fontanka</a></code>\n<a href=\"#load_taiga_fontanka\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n342 683\n</td>\n<td align=\"right\">\n786.23 Mb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\nInterfax\n</td>\n<td>\n<a name=\"load_taiga_interfax\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_interfax\">load_taiga_interfax</a></code>\n<a href=\"#load_taiga_interfax\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n46 429\n</td>\n<td align=\"right\">\n77.55 Mb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\nKP\n</td>\n<td>\n<a name=\"load_taiga_kp\"></a>\n<code><a 
href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_kp\">load_taiga_kp</a></code>\n<a href=\"#load_taiga_kp\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n45 503\n</td>\n<td align=\"right\">\n61.79 Mb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\nLenta\n</td>\n<td>\n<a name=\"load_taiga_lenta\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_lenta\">load_taiga_lenta</a></code>\n<a href=\"#load_taiga_lenta\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n36 446\n</td>\n<td align=\"right\">\n95.15 Mb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\nTaiga/N+1\n</td>\n<td>\n<a name=\"load_taiga_nplus1\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_nplus1\">load_taiga_nplus1</a></code>\n<a href=\"#load_taiga_nplus1\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n7 696\n</td>\n<td align=\"right\">\n24.96 Mb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\nMagazines\n</td>\n<td>\n<a name=\"load_taiga_magazines\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_magazines\">load_taiga_magazines</a></code>\n<a href=\"#load_taiga_magazines\"><code>#</code></a>\n</td>\n<td>\n</td>\n<td align=\"right\">\n39 890\n</td>\n<td align=\"right\">\n2.19 Gb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\nSubtitles\n</td>\n<td>\n<a name=\"load_taiga_subtitles\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_subtitles\">load_taiga_subtitles</a></code>\n<a href=\"#load_taiga_subtitles\"><code>#</code></a>\n</td>\n<td>\n</td>\n<td align=\"right\">\n19 011\n</td>\n<td align=\"right\">\n909.08 Mb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\nSocial\n</td>\n<td>\n<a name=\"load_taiga_social\"></a>\n<code><a 
href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_social\">load_taiga_social</a></code>\n<a href=\"#load_taiga_social\"><code>#</code></a>\n</td>\n<td>\n<code>social</code>\n</td>\n<td align=\"right\">\n1 876 442\n</td>\n<td align=\"right\">\n648.18 Mb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\nProza\n</td>\n<td>\n<a name=\"load_taiga_proza\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_proza\">load_taiga_proza</a></code>\n<a href=\"#load_taiga_proza\"><code>#</code></a>\n</td>\n<td>\n<code>fiction</code>\n</td>\n<td align=\"right\">\n1 732 434\n</td>\n<td align=\"right\">\n38.25 Gb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\nStihi\n</td>\n<td>\n<a name=\"load_taiga_stihi\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_taiga_stihi\">load_taiga_stihi</a></code>\n<a href=\"#load_taiga_stihi\"><code>#</code></a>\n</td>\n<td>\n</td>\n<td align=\"right\">\n9 157 686\n</td>\n<td align=\"right\">\n12.80 Gb\n</td>\n<td>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://github.com/buriy/russian-nlp-datasets/releases\">Russian NLP Datasets</a>\n</td>\n<td colspan=\"5\">\nSeveral Russian news datasets from webhose.io, lenta.ru and other news sites.\n</td>\n</tr>\n<tr>\n<td>\nNews\n</td>\n<td>\n<a name=\"load_buriy_news\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_buriy_news\">load_buriy_news</a></code>\n<a href=\"#load_buriy_news\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n2 154 801\n</td>\n<td align=\"right\">\n6.84 Gb\n</td>\n<td>\nDump of top 40 news + 20 fashion news sites.\n</br>\n</br>\n<code>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2</code>\n</br>\n<code>wget 
https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2</code>\n</br>\n<code>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2</code>\n</td>\n</tr>\n<tr>\n<td>\nWebhose\n</td>\n<td>\n<a name=\"load_buriy_webhose\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_buriy_webhose\">load_buriy_webhose</a></code>\n<a href=\"#load_buriy_webhose\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n285 965\n</td>\n<td align=\"right\">\n859.32 Mb\n</td>\n<td>\nDump from webhose.io, 300 sources for one month.\n</br>\n</br>\n<code>wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/webhose-2016.tar.bz2</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://github.com/ods-ai-ml4sg/proj_news_viz/releases/tag/data\">ODS #proj_news_viz</a>\n</td>\n<td colspan=\"5\">\nSeveral news sites scraped by members of #proj_news_viz ODS project.\n</td>\n</tr>\n<tr>\n<td>\nInterfax\n</td>\n<td>\n<a name=\"load_ods_interfax\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_interfax\">load_ods_interfax</a></code>\n<a href=\"#load_ods_interfax\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n543 961\n</td>\n<td align=\"right\">\n1.22 Gb\n</td>\n<td>\n<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/interfax.csv.gz</code>\n</td>\n</tr>\n<tr>\n<td>\nGazeta\n</td>\n<td>\n<a name=\"load_ods_gazeta\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_gazeta\">load_ods_gazeta</a></code>\n<a href=\"#load_ods_gazeta\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n865 847\n</td>\n<td align=\"right\">\n1.63 Gb\n</td>\n<td>\n<code>wget 
https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/gazeta.csv.gz</code>\n</td>\n</tr>\n<tr>\n<td>\nIzvestia\n</td>\n<td>\n<a name=\"load_ods_izvestia\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_izvestia\">load_ods_izvestia</a></code>\n<a href=\"#load_ods_izvestia\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n86 601\n</td>\n<td align=\"right\">\n307.19 Mb\n</td>\n<td>\n<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/iz.csv.gz</code>\n</td>\n</tr>\n<tr>\n<td>\nMeduza\n</td>\n<td>\n<a name=\"load_ods_meduza\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_meduza\">load_ods_meduza</a></code>\n<a href=\"#load_ods_meduza\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n71 806\n</td>\n<td align=\"right\">\n270.11 Mb\n</td>\n<td>\n<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/meduza.csv.gz</code>\n</td>\n</tr>\n<tr>\n<td>\nRIA\n</td>\n<td>\n<a name=\"load_ods_ria\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_ria\">load_ods_ria</a></code>\n<a href=\"#load_ods_ria\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n101 543\n</td>\n<td align=\"right\">\n233.88 Mb\n</td>\n<td>\n<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/ria.csv.gz</code>\n</td>\n</tr>\n<tr>\n<td>\nRussia Today\n</td>\n<td>\n<a name=\"load_ods_rt\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_rt\">load_ods_rt</a></code>\n<a href=\"#load_ods_rt\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n106 644\n</td>\n<td align=\"right\">\n187.12 Mb\n</td>\n<td>\n<code>wget 
https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/rt.csv.gz</code>\n</td>\n</tr>\n<tr>\n<td>\nTASS\n</td>\n<td>\n<a name=\"load_ods_tass\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ods_tass\">load_ods_tass</a></code>\n<a href=\"#load_ods_tass\"><code>#</code></a>\n</td>\n<td>\n<code>news</code>\n</td>\n<td align=\"right\">\n1 135 635\n</td>\n<td align=\"right\">\n3.27 Gb\n</td>\n<td>\n<code>wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/tass-001.csv.gz</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://universaldependencies.org/\">Universal Dependencies</a>\n</td>\n<td colspan=\"5\">\n</td>\n</tr>\n<tr>\n<td>\nGSD\n</td>\n<td>\n<a name=\"load_ud_gsd\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ud_gsd\">load_ud_gsd</a></code>\n<a href=\"#load_ud_gsd\"><code>#</code></a>\n</td>\n<td>\n<code>morph</code>\n<code>syntax</code>\n</td>\n<td align=\"right\">\n5 030\n</td>\n<td align=\"right\">\n1.01 Mb\n</td>\n<td>\n<code>wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-dev.conllu</code>\n</br>\n<code>wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-test.conllu</code>\n</br>\n<code>wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-train.conllu</code>\n</td>\n</tr>\n<tr>\n<td>\nTaiga\n</td>\n<td>\n<a name=\"load_ud_taiga\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ud_taiga\">load_ud_taiga</a></code>\n<a href=\"#load_ud_taiga\"><code>#</code></a>\n</td>\n<td>\n<code>morph</code>\n<code>syntax</code>\n</td>\n<td align=\"right\">\n3 264\n</td>\n<td align=\"right\">\n353.80 Kb\n</td>\n<td>\n<code>wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-dev.conllu</code>\n</br>\n<code>wget 
https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-test.conllu</code>\n</br>\n<code>wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-train.conllu</code>\n</td>\n</tr>\n<tr>\n<td>\nPUD\n</td>\n<td>\n<a name=\"load_ud_pud\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ud_pud\">load_ud_pud</a></code>\n<a href=\"#load_ud_pud\"><code>#</code></a>\n</td>\n<td>\n<code>morph</code>\n<code>syntax</code>\n</td>\n<td align=\"right\">\n1 000\n</td>\n<td align=\"right\">\n207.78 Kb\n</td>\n<td>\n<code>wget https://github.com/UniversalDependencies/UD_Russian-PUD/raw/master/ru_pud-ud-test.conllu</code>\n</td>\n</tr>\n<tr>\n<td>\nSynTagRus\n</td>\n<td>\n<a name=\"load_ud_syntag\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ud_syntag\">load_ud_syntag</a></code>\n<a href=\"#load_ud_syntag\"><code>#</code></a>\n</td>\n<td>\n<code>morph</code>\n<code>syntax</code>\n</td>\n<td align=\"right\">\n61 889\n</td>\n<td align=\"right\">\n11.33 Mb\n</td>\n<td>\n<code>wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-dev.conllu</code>\n</br>\n<code>wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-test.conllu</code>\n</br>\n<code>wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-train.conllu</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://github.com/dialogue-evaluation/morphoRuEval-2017\">morphoRuEval-2017</a>\n</td>\n<td colspan=\"5\">\n</td>\n</tr>\n<tr>\n<td>\nGeneral Internet-Corpus\n</td>\n<td>\n<a name=\"load_morphoru_gicrya\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_morphoru_gicrya\">load_morphoru_gicrya</a></code>\n<a href=\"#load_morphoru_gicrya\"><code>#</code></a>\n</td>\n<td>\n<code>morph</code>\n</td>\n<td 
align=\"right\">\n83 148\n</td>\n<td align=\"right\">\n10.58 Mb\n</td>\n<td>\n<code>wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/GIKRYA_texts_new.zip</code>\n</br>\n<code>unzip GIKRYA_texts_new.zip</code>\n</br>\n<code>rm GIKRYA_texts_new.zip</code>\n</td>\n</tr>\n<tr>\n<td>\nRussian National Corpus\n</td>\n<td>\n<a name=\"load_morphoru_rnc\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_morphoru_rnc\">load_morphoru_rnc</a></code>\n<a href=\"#load_morphoru_rnc\"><code>#</code></a>\n</td>\n<td>\n<code>morph</code>\n</td>\n<td align=\"right\">\n98 892\n</td>\n<td align=\"right\">\n12.71 Mb\n</td>\n<td>\n<code>wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/RNC_texts.rar</code>\n</br>\n<code>unrar x RNC_texts.rar</code>\n</br>\n<code>rm RNC_texts.rar</code>\n</td>\n</tr>\n<tr>\n<td>\nOpenCorpora\n</td>\n<td>\n<a name=\"load_morphoru_corpora\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_morphoru_corpora\">load_morphoru_corpora</a></code>\n<a href=\"#load_morphoru_corpora\"><code>#</code></a>\n</td>\n<td>\n<code>morph</code>\n</td>\n<td align=\"right\">\n38 510\n</td>\n<td align=\"right\">\n4.80 Mb\n</td>\n<td>\n<code>wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/OpenCorpora_Texts.rar</code>\n</br>\n<code>unrar x OpenCorpora_Texts.rar</code>\n</br>\n<code>rm OpenCorpora_Texts.rar</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://russe.nlpub.org/downloads/\">RUSSE Russian Semantic Relatedness</a>\n</td>\n<td colspan=\"5\">\n</td>\n</tr>\n<tr>\n<td>\nHJ: Human Judgements of Word Pairs\n</td>\n<td>\n<a name=\"load_russe_hj\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_russe_hj\">load_russe_hj</a></code>\n<a href=\"#load_russe_hj\"><code>#</code></a>\n</td>\n<td>\n<code>emb</code>\n<code>sim</code>\n</td>\n<td 
align=\"right\">\n</td>\n<td align=\"right\">\n</td>\n<td>\n<code>wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/hj.csv</code>\n</td>\n</tr>\n<tr>\n<td>\nRT: Synonyms and Hypernyms from the Thesaurus RuThes\n</td>\n<td>\n<a name=\"load_russe_rt\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_russe_rt\">load_russe_rt</a></code>\n<a href=\"#load_russe_rt\"><code>#</code></a>\n</td>\n<td>\n<code>emb</code>\n<code>sim</code>\n</td>\n<td align=\"right\">\n</td>\n<td align=\"right\">\n</td>\n<td>\n<code>wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/rt.csv</code>\n</td>\n</tr>\n<tr>\n<td>\nAE: Cognitive Associations from the Sociation.org Experiment\n</td>\n<td>\n<a name=\"load_russe_ae\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_russe_ae\">load_russe_ae</a></code>\n<a href=\"#load_russe_ae\"><code>#</code></a>\n</td>\n<td>\n<code>emb</code>\n<code>sim</code>\n</td>\n<td align=\"right\">\n</td>\n<td align=\"right\">\n</td>\n<td>\n<code>wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-train.csv</code>\n</br>\n<code>wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-test.csv</code>\n</br>\n<code>wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/ae2.csv</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://toloka.yandex.ru/datasets/\">Toloka Datasets</a>\n</td>\n<td colspan=\"5\">\n</td>\n</tr>\n<tr>\n<td>\nLexical Relations from the Wisdom of the Crowd (LRWC)\n</td>\n<td>\n<a name=\"load_toloka_lrwc\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_toloka_lrwc\">load_toloka_lrwc</a></code>\n<a href=\"#load_toloka_lrwc\"><code>#</code></a>\n</td>\n<td>\n<code>emb</code>\n<code>sim</code>\n</td>\n<td align=\"right\">\n</td>\n<td 
align=\"right\">\n</td>\n<td>\n<code>wget https://tlk.s3.yandex.net/dataset/LRWC.zip</code>\n</br>\n<code>unzip LRWC.zip</code>\n</br>\n<code>rm LRWC.zip</code>\n</td>\n</tr>\n<tr>\n<td>\n<a href=\"https://github.com/cimm-kzn/RuDReC\">The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT)</a>\n</td>\n<td>\n<a name=\"load_ruadrect\"></a>\n<code><a href=\"https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ruadrect\">load_ruadrect</a></code>\n<a href=\"#load_ruadrect\"><code>#</code></a>\n</td>\n<td>\n<code>social</code>\n</td>\n<td align=\"right\">\n9 515\n</td>\n<td align=\"right\">\n2.09 Mb\n</td>\n<td>\nThis corpus was developed for the Social Media Mining for Health Applications (#SMM4H) Shared Task 2020\n</br>\n</br>\n<code>wget https://github.com/cimm-kzn/RuDReC/raw/master/data/RuADReCT.zip</code>\n</br>\n<code>unzip RuADReCT.zip</code>\n</br>\n<code>rm RuADReCT.zip</code>\n</td>\n</tr>\n</table>\n<!--- metas --->\n\n## Support\n\n- Chat \u2014 https://t.me/natural_language_processing\n- Issues \u2014 https://github.com/natasha/corus/issues\n- Commercial support \u2014 https://lab.alexkuk.ru\n\n## Add new source\n\n1. Implement `corus/sources/<source>.py`\n2. Add import to `corus/sources/__init__.py`\n3. Add meta to `corus/sources/meta.py`\n4. Add example to `docs.ipynb` (check meta table is correct)\n5. Run tests (readme is updated)\n\n## Development\n\nDev env\n\n```bash\npython -m venv ~/.venvs/natasha-corus\nsource ~/.venvs/natasha-corus/bin/activate\n\npip install -r requirements/dev.txt\npip install -e .\n\npython -m ipykernel install --user --name natasha-corus\n```\n\nLint + update docs\n\n```bash\nmake lint\nmake exec-docs\n```\n\nRelease\n\n```bash\n# Update setup.py version\n\ngit commit -am 'Up version'\ngit tag v0.10.0\n\ngit push\ngit push --tags\n```\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "Links to russian corpora, functions for loading and parsing",
"version": "0.10.0",
"project_urls": {
"Homepage": "https://github.com/natasha/corus"
},
"split_keywords": [
"corpora",
"russian",
"nlp",
"datasets"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "26102c40454156b8bc65bdce019785aa508487b3b5cc07b35fd2c2da3d9b1418",
"md5": "01619d7269db12d678cfc61e80962f4a",
"sha256": "7b8da75d9fab0c3ee0d52a9fd575965dcd93fa1818da01a91bff178b3ad90bc7"
},
"downloads": -1,
"filename": "corus-0.10.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "01619d7269db12d678cfc61e80962f4a",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 83650,
"upload_time": "2023-07-24T08:54:25",
"upload_time_iso_8601": "2023-07-24T08:54:25.371235Z",
"url": "https://files.pythonhosted.org/packages/26/10/2c40454156b8bc65bdce019785aa508487b3b5cc07b35fd2c2da3d9b1418/corus-0.10.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "797e50769ae67af426bb53727fdfbf34e768edb14f5e4900f4110174588666e3",
"md5": "cdf056d3171481018d543e92b674436d",
"sha256": "0e203f4fb96b841822ca34a79c2004564ec68a1bcf247ab09e08e49b0a7563e9"
},
"downloads": -1,
"filename": "corus-0.10.0.tar.gz",
"has_sig": false,
"md5_digest": "cdf056d3171481018d543e92b674436d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 76494,
"upload_time": "2023-07-24T08:54:26",
"upload_time_iso_8601": "2023-07-24T08:54:26.618878Z",
"url": "https://files.pythonhosted.org/packages/79/7e/50769ae67af426bb53727fdfbf34e768edb14f5e4900f4110174588666e3/corus-0.10.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-07-24 08:54:26",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "natasha",
"github_project": "corus",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "corus"
}