# MTData
[![image](http://img.shields.io/pypi/v/mtdata.svg)](https://pypi.python.org/pypi/mtdata/)
![Travis (.com)](https://img.shields.io/travis/com/thammegowda/mtdata?style=plastic)
MTData automates the collection and preparation of machine translation (MT) datasets.
It provides a CLI and Python APIs, which can be used for preparing MT experiments.
* [Quickstart Example](#quickstart--example)
* [Docs](https://thammegowda.github.io/mtdata/)
* [Search datasets](https://thammegowda.github.io/mtdata/search.html)
This tool knows:
- From where to download data sets: WMT News Translation tests and devs for Paracrawl,
Europarl, News Commentary, WikiTitles, Tilde Model corpus, OPUS ...
- How to extract files : .tar, .tar.gz, .tgz, .zip, ...
- How to parse .tmx, .sgm and such XMLs, or .tsv ... Checks if they have same number of segments.
- Whether parallel data is in one .tsv file or two sgm files.
- Whether data is compressed in gz, xz or none at all.
- Whether the source-target pair is in the same order or swapped to target-source order.
- How to map codes to ISO language codes, using ISO 639-3, which has space for 7000+ languages of our planet.
  - New in v0.3: BCP-47 like language ID: (language, script, region)
- Download only once and keep the files in local cache.
- (And more of such tiny details over the time.)
[MTData](https://github.com/thammegowda/mtdata) is here to:
- Automate machine translation training data creation by taking out human intervention. This is inspired by [SacreBLEU](https://github.com/mjpost/sacreBLEU), which takes out human intervention at the evaluation stage.
- Provide a reusable tool instead of dozens of use-once shell scripts spread across multiple repos.
## Installation
```bash
# Option 1: from pypi
pip install -I mtdata
# To install a specific version, get version number from https://pypi.org/project/mtdata/#history
pip install mtdata==[version]
# Option 2: install from latest master branch
pip install -I git+https://github.com/thammegowda/mtdata
# Option 3: for development/editable mode
git clone https://github.com/thammegowda/mtdata
cd mtdata
pip install --editable .
```
## Current Status:
We have added some commonly used datasets - you are welcome to add more!
This is a summary of datasets from various sources (Updated: Feb 2022).
| Source | Dataset Count |
|-------------:|--------------:|
| OPUS | 151,753|
| Flores | 51,714|
| Microsoft | 8,128|
| Leipzig | 5,893|
| Neulab | 4,455|
| Statmt | 1,784|
| Facebook | 1,617|
| AllenAi | 1,611|
| ELRC | 1,575|
| EU | 1,178|
| Tilde | 519|
| LinguaTools | 253|
| Anuvaad | 196|
| AI4Bharath | 192|
| ParaCrawl | 127|
| Lindat | 56|
| UN | 30|
| JoshuaDec | 29|
| StanfordNLP | 15|
| ParIce | 8|
| LangUk | 5|
| Phontron | 4|
| NRC_CA | 4|
| KECL | 3|
| IITB | 3|
| WAT | 3|
| Masakhane | 2|
| **Total** | **231,157** |
## Usecases
* WMT 2023 General (News) Translation Task: https://www.statmt.org/wmt23/mtdata/
* WMT 2022 General (News) Translation Task: https://www.statmt.org/wmt22/mtdata/
* USC ISI's 500-to-English MT: ~http://rtg.isi.edu/many-eng/~ http://gowda.ai/006-many-to-eng/
* Meta AI's 200-to-200 MT: [Whitepaper](https://research.facebook.com/file/585831413174038/No-Language-Left-Behind--Scaling-Human-Centered-Machine-Translation.pdf)
## CLI Usage
- After pip installation, the CLI can be invoked with the `mtdata` command or `python -m mtdata`.
- The two main subcommands are `list`, for listing the datasets, and `get`, for getting them.
### `mtdata list`
Lists datasets that are known to this tool.
```bash
mtdata list -h
usage: __main__.py list [-h] [-l L1-L2] [-n [NAME ...]] [-nn [NAME ...]] [-f] [-o OUT]

optional arguments:
  -h, --help            show this help message and exit
  -l L1-L2, --langs L1-L2
                        Language pairs; e.g.: deu-eng (default: None)
  -n [NAME ...], --names [NAME ...]
                        Name of dataset set; eg europarl_v9. (default: None)
  -nn [NAME ...], --not-names [NAME ...]
                        Exclude these names (default: None)
  -f, --full            Show Full Citation (default: False)
```
```bash
# List everything ; add | cut -f1 to see ID column only
mtdata list | cut -f1
# List a lang pair
mtdata list -l deu-eng
# List a dataset by name(s)
mtdata list -n europarl
mtdata list -n europarl news_commentary
# list by both language pair and dataset name
mtdata list -l deu-eng -n europarl news_commentary newstest_deen | cut -f1
Statmt-europarl-9-deu-eng
Statmt-europarl-7-deu-eng
Statmt-news_commentary-14-deu-eng
Statmt-news_commentary-15-deu-eng
Statmt-news_commentary-16-deu-eng
Statmt-newstest_deen-2014-deu-eng
Statmt-newstest_deen-2015-deu-eng
Statmt-newstest_deen-2016-deu-eng
Statmt-newstest_deen-2017-deu-eng
Statmt-newstest_deen-2018-deu-eng
Statmt-newstest_deen-2019-deu-eng
Statmt-newstest_deen-2020-deu-eng
Statmt-europarl-10-deu-eng
OPUS-europarl-8-deu-eng
# get citation of a dataset (if available in index.py)
mtdata list -l deu-eng -n newstest_deen --full
```
### Dataset ID
Dataset IDs are standardized to this format:
`<Group>-<name>-<version>-<lang1>-<lang2>`
* `Group`: source or the website where we are obtaining this dataset
* `name`: name of the dataset
* `version`: version name
* `lang1` and `lang2`: BCP47-like codes. In the simple case they are ISO 639-3 codes; however, they may also carry script and region tags separated by underscores (`_`).
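
The ID format above can be taken apart with a plain split. The following is a minimal sketch, not part of mtdata's API: `DatasetID` and `parse_dataset_id` are hypothetical names, and it assumes that no field contains a hyphen (names use underscores, as in `news_commentary`).

```python
from typing import NamedTuple


class DatasetID(NamedTuple):
    """Hypothetical container for the five documented fields of a dataset ID."""
    group: str
    name: str
    version: str
    lang1: str
    lang2: str


def parse_dataset_id(did: str) -> DatasetID:
    """Split '<Group>-<name>-<version>-<lang1>-<lang2>' into its fields."""
    # Assumption: fields never contain '-', so a plain split yields five parts.
    parts = did.split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 hyphen-separated fields, got {len(parts)}: {did!r}")
    return DatasetID(*parts)


did = parse_dataset_id("Statmt-news_commentary-16-deu-eng")
print(did.group, did.name, did.version, did.lang1, did.lang2)
```

IDs that break the no-hyphen assumption raise a `ValueError` rather than being split incorrectly.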
### `mtdata get`
This command downloads the datasets specified by name, for the given language pair, to a directory.
You must explicitly choose the datasets for the `--train` and `--test` arguments.
```
mtdata get -h
python -m mtdata get -h
usage: __main__.py get [-h] -l L1-L2 [-tr [ID ...]] [-ts [ID ...]] [-dv ID] [--merge | --no-merge] [--compress] -o OUT_DIR

optional arguments:
  -h, --help            show this help message and exit
  -l L1-L2, --langs L1-L2
                        Language pairs; e.g.: deu-eng (default: None)
  -tr [ID ...], --train [ID ...]
                        Names of datasets separated by space, to be used for *training*.
                          e.g. -tr Statmt-news_commentary-16-deu-eng europarl_v9 .
                          To concatenate all these into a single train file, set --merge flag. (default: None)
  -ts [ID ...], --test [ID ...]
                        Names of datasets separated by space, to be used for *testing*.
                          e.g. "-ts Statmt-newstest_deen-2019-deu-eng Statmt-newstest_deen-2020-deu-eng ".
                          You may also use shell expansion if your shell supports it.
                          e.g. "-ts Statmt-newstest_deen-20{19,20}-deu-eng" (default: None)
  -dv ID, --dev ID      Dataset to be used for development (aka validation).
                          e.g. "-dv Statmt-newstest_deen-2017-deu-eng" (default: None)
  --merge               Merge train into a single file (default: False)
  --no-merge            Do not Merge train into a single file (default: True)
  --compress            Keep the files compressed (default: False)
  -o OUT_DIR, --out OUT_DIR
                        Output directory name (default: None)
```
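
When `mtdata get` is driven from a script, it can help to assemble the argument list programmatically. The sketch below only builds the command from the flags listed in the help text above; `build_get_command` is a hypothetical helper, not something mtdata provides, and it does not run anything by itself.

```python
import shlex
from typing import List, Optional, Sequence


def build_get_command(langs: str, out_dir: str,
                      train: Sequence[str] = (), test: Sequence[str] = (),
                      dev: Optional[str] = None, merge: bool = False) -> List[str]:
    """Assemble an argv list for `mtdata get` using only documented flags."""
    cmd = ["mtdata", "get", "-l", langs]
    if train:
        cmd += ["-tr", *train]
    if test:
        cmd += ["-ts", *test]
    if dev:
        cmd += ["-dv", dev]
    if merge:
        cmd.append("--merge")
    cmd += ["-o", out_dir]
    return cmd


cmd = build_get_command(
    "deu-eng", "data/deu-eng",
    train=["Statmt-europarl-10-deu-eng", "Statmt-news_commentary-16-deu-eng"],
    test=["Statmt-newstest_deen-2019-deu-eng"],
    dev="Statmt-newstest_deen-2017-deu-eng",
    merge=True,
)
# Print a shell-quoted form; to execute it, pass `cmd` to subprocess.run(cmd, check=True).
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the list form to `subprocess.run` avoids shell quoting issues; shell brace expansion such as `20{18,19,20}` is not available that way, so each test set has to be listed explicitly.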
## Quickstart / Example
See what datasets are available for `deu-eng`
```bash
$ mtdata list -l deu-eng | cut -f1 # see available datasets
Statmt-commoncrawl_wmt13-1-deu-eng
Statmt-europarl_wmt13-7-deu-eng
Statmt-news_commentary_wmt18-13-deu-eng
Statmt-europarl-9-deu-eng
Statmt-europarl-7-deu-eng
Statmt-news_commentary-14-deu-eng
Statmt-news_commentary-15-deu-eng
Statmt-news_commentary-16-deu-eng
Statmt-wiki_titles-1-deu-eng
Statmt-wiki_titles-2-deu-eng
Statmt-newstest_deen-2014-deu-eng
....[truncated]
```
Get these datasets and store under dir `data/deu-eng`
```bash
$ mtdata get -l deu-eng --out data/deu-eng --merge \
--train Statmt-europarl-10-deu-eng Statmt-news_commentary-16-deu-eng \
--dev Statmt-newstest_deen-2017-deu-eng --test Statmt-newstest_deen-20{18,19,20}-deu-eng
# ...[truncated]
INFO:root:Train stats:
{
"total": 2206240,
"parts": {
"Statmt-news_commentary-16-deu-eng": 388482,
"Statmt-europarl-10-deu-eng": 1817758
}
}
INFO:root:Dataset is ready at deu-eng
```
To reproduce this dataset in the future, or to let others reproduce it, refer to `<out-dir>/mtdata.signature.txt`:
```bash
$ cat deu-eng/mtdata.signature.txt
mtdata get -l deu-eng -tr Statmt-europarl-10-deu-eng Statmt-news_commentary-16-deu-eng \
-ts Statmt-newstest_deen-2018-deu-eng Statmt-newstest_deen-2019-deu-eng Statmt-newstest_deen-2020-deu-eng \
-dv Statmt-newstest_deen-2017-deu-eng --merge -o <out-dir>
mtdata version 0.3.0-dev
```
See what the above command has accomplished:
```bash
$ tree data/deu-eng/
├── dev.deu -> tests/Statmt-newstest_deen-2017-deu-eng.deu
├── dev.eng -> tests/Statmt-newstest_deen-2017-deu-eng.eng
├── mtdata.signature.txt
├── test1.deu -> tests/Statmt-newstest_deen-2020-deu-eng.deu
├── test1.eng -> tests/Statmt-newstest_deen-2020-deu-eng.eng
├── test2.deu -> tests/Statmt-newstest_deen-2018-deu-eng.deu
├── test2.eng -> tests/Statmt-newstest_deen-2018-deu-eng.eng
├── test3.deu -> tests/Statmt-newstest_deen-2019-deu-eng.deu
├── test3.eng -> tests/Statmt-newstest_deen-2019-deu-eng.eng
├── tests
│ ├── Statmt-newstest_deen-2017-deu-eng.deu
│ ├── Statmt-newstest_deen-2017-deu-eng.eng
│ ├── Statmt-newstest_deen-2018-deu-eng.deu
│ ├── Statmt-newstest_deen-2018-deu-eng.eng
│ ├── Statmt-newstest_deen-2019-deu-eng.deu
│ ├── Statmt-newstest_deen-2019-deu-eng.eng
│ ├── Statmt-newstest_deen-2020-deu-eng.deu
│ └── Statmt-newstest_deen-2020-deu-eng.eng
├── train-parts
│ ├── Statmt-europarl-10-deu-eng.deu
│ ├── Statmt-europarl-10-deu-eng.eng
│ ├── Statmt-news_commentary-16-deu-eng.deu
│ └── Statmt-news_commentary-16-deu-eng.eng
├── train.deu
├── train.eng
├── train.meta.gz
└── train.stats.json
```
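
A quick sanity check on the prepared directory is to confirm that both sides of a split have the same number of lines. This is a minimal sketch under two assumptions: the files are uncompressed (no `--compress`), and each line holds one segment. `count_segment_pairs` is a hypothetical helper written for this README, not an mtdata function.

```python
from pathlib import Path


def count_lines(path: Path) -> int:
    """Count lines without loading the whole file into memory."""
    with path.open("rb") as f:
        return sum(1 for _ in f)


def count_segment_pairs(out_dir, src: str = "deu", tgt: str = "eng",
                        split: str = "train") -> int:
    """Return the number of segment pairs, or raise if the two sides differ."""
    out_dir = Path(out_dir)
    n_src = count_lines(out_dir / f"{split}.{src}")
    n_tgt = count_lines(out_dir / f"{split}.{tgt}")
    if n_src != n_tgt:
        raise ValueError(f"{split}: {n_src} {src} lines vs {n_tgt} {tgt} lines")
    return n_src


if __name__ == "__main__":
    out = Path("data/deu-eng")
    # Only report when the directory from the example above actually exists.
    if (out / "train.deu").is_file() and (out / "train.eng").is_file():
        print("train pairs:", count_segment_pairs(out))
```

The same check works for `dev` and `test1`, `test2`, ... by changing the `split` argument, since the symlinks resolve to the files under `tests/`.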
## Recipes
> Since v0.3.1
A recipe is a set of datasets nominated for train, dev, and test; recipes are meant to improve the reproducibility of experiments.
Recipes are loaded from:
1. Default: [`mtdata/recipe/recipes.yml`](mtdata/recipe/recipes.yml) from source code
2. Cache dir: `$MTDATA/mtdata.recipes.yml`, where `$MTDATA` defaults to `~/.mtdata`
3. Current dir: all files matching the glob `$PWD/mtdata.recipes*.yml`
   * If the current dir is not preferred, `export MTDATA_RECIPES=/path/to/dir`
   * Alternatively, `MTDATA_RECIPES=/path/to/dir mtdata list-recipe`
See [`mtdata/recipe/recipes.yml`](mtdata/recipe/recipes.yml) for the format and examples.
```bash
mtdata list-recipe # see all recipes
mtdata get-recipe -ri <recipe_id> -o <out_dir> # get recipe, recreate dataset
```
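
To see which user-supplied recipe files would be visible from a given shell, the locations listed above can be mirrored in a few lines. This is a sketch that follows the description in this section only; it is not mtdata's actual loader, `candidate_recipe_files` is a hypothetical name, and the packaged default `recipes.yml` is deliberately left out.

```python
import os
from pathlib import Path
from typing import List


def candidate_recipe_files() -> List[Path]:
    """List recipe files in the cache dir and the recipes dir, as described above."""
    # Cache dir: $MTDATA, falling back to ~/.mtdata
    cache_dir = Path(os.environ.get("MTDATA", Path.home() / ".mtdata"))
    # Recipes dir: $MTDATA_RECIPES, falling back to the current directory
    recipes_dir = Path(os.environ.get("MTDATA_RECIPES", Path.cwd()))

    found: List[Path] = []
    cache_file = cache_dir / "mtdata.recipes.yml"
    if cache_file.is_file():
        found.append(cache_file)
    found.extend(sorted(recipes_dir.glob("mtdata.recipes*.yml")))
    return found


for path in candidate_recipe_files():
    print(path)
```

The authoritative list of recipes is still whatever `mtdata list-recipe` prints.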
## Language Name Standardization
### ISO 639-3
Internally, all language codes are mapped to ISO 639-3 codes.
The mapping can be inspected with `python -m mtdata.iso` or `mtdata-iso`:
```bash
$ mtdata-iso -h
usage: python -m mtdata.iso [-h] [-b] [langs [langs ...]]
ISO 639-3 lookup tool
positional arguments:
langs Language code or name that needs to be looked up. When no
language code is given, all languages are listed.
optional arguments:
-h, --help show this help message and exit
-b, --brief be brief; do crash on error inputs
# list all 7000+ languages and their 3 letter codes
$ mtdata-iso # python -m mtdata.iso
...
# lookup codes for some languages
$ mtdata-iso ka kn en de xx english german
Input ISO639_3 Name
ka kat Georgian
kn kan Kannada
en eng English
de deu German
xx -none- -none-
english eng English
german deu German
# Print no header, and crash on error;
$ mtdata-iso xx -b
Exception: Unable to find ISO 639-3 code for 'xx'. Please run
python -m mtdata.iso | grep -i <name>
to know the 3 letter ISO code for the language.
```
To use the Python API:
```python
from mtdata.iso import iso3_code
print(iso3_code('en', fail_error=True))
print(iso3_code('eNgLIsH', fail_error=True)) # case doesn't matter
```
### BCP-47
> Since v0.3.0
We used ISO 639-3 from the beginning; however, we soon faced the limitation that ISO 639-3 cannot distinguish script and region variants of a language. So we upgraded to BCP-47 like language tags in `v0.3.0`.
* BCP47 uses two-letter codes for some languages and three-letter codes for the rest; we use three-letter codes for all languages.
* BCP47 uses `-` hyphens; we use `_` underscores, since hyphens are used by the MT community to separate bitext pairs (e.g. en-de or eng-deu).
Our tags are of the form `xxx_Yyyy_ZZ`, where:
| Pattern | Purpose | Standard | Length | Case | Required |
|---------|----------|------------|---------------|-----------|-----------|
| `xxx` | Language | ISO 639-3 | three-letters | lowercase | mandatory |
| `Yyyy` | Script | ISO 15924 | four-letters | Titlecase | optional |
| `ZZ` | Region | ISO 3166-1 | two-letters | CAPITALS | optional |
Notes:
* Region is preserved when available and left blank when unavailable.
* Script `Yyyy` is forcibly suppressed in obvious cases. E.g. `eng` is written using the `Latn` script; writing `eng-Latn` is just awkward to read, so, as `Latn` is the default, we suppress the `Latn` script for English. On the other hand, a language like Kannada is usually written using the `Knda` script (so `kan-Knda` -> `kan`), but is occasionally written using the `Latn` script, so `kan-Latn` is not suppressed.
* The information about which script is the default is obtained from the IANA Language Subtag Registry.
* Language code `mul` stands for _multiple languages_ and is used as a placeholder for multilingual datasets (see `mul-eng`, which represents many-to-English dataset recipes in [mtdata/recipe/recipes.yml](mtdata/recipe/recipes.yml)).
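
The shape of an already standardized tag, as laid out in the table above, can be checked with a regular expression. This is a minimal sketch: `TAG_PATTERN` and `split_tag` are hypothetical names, and the check covers the surface form only. It does not validate codes against the ISO lists, and it does not normalize input or suppress default scripts; for that, use `mtdata.iso.bcp47`.

```python
import re
from typing import Optional, Tuple

# xxx = three lowercase letters, Yyyy = Titlecase script (optional),
# ZZ = two uppercase letters for the region (optional).
TAG_PATTERN = re.compile(r"([a-z]{3})(?:_([A-Z][a-z]{3}))?(?:_([A-Z]{2}))?")


def split_tag(tag: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Split a tag of the form xxx_Yyyy_ZZ into (language, script, region)."""
    m = TAG_PATTERN.fullmatch(tag)
    if m is None:
        raise ValueError(f"not of the form xxx_Yyyy_ZZ: {tag!r}")
    return m.group(1), m.group(2), m.group(3)


for tag in ["eng", "eng_US", "kan_Latn", "kan_Knda_IN"]:
    print(tag, "->", split_tag(tag))
```

Inputs such as `en-US` or `hin_deva` are rejected here on purpose, because they are not yet in the standardized form.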
#### Example:
To inspect parsing/mapping, use `python -m mtdata.iso.bcp47 <args>`
```bash
mtdata-bcp47 eng English en-US en-GB eng-Latn kan Kannada-Deva hin-Deva kan-Latn kan-in kn-knda-in
```
| INPUT | STD | LANG | SCRIPT | REGION |
|---------------|-----------|-------|---------|--------|
| eng | eng | eng | None | None |
| English | eng | eng | None | None |
| en-US | eng_US | eng | None | US |
| en-GB | eng_GB | eng | None | GB |
| eng-Latn | eng | eng | None | None |
| kan | kan | kan | None | None |
| Kannada-Deva | kan_Deva | kan | Deva | None |
| hin-Deva | hin | hin | None | None |
| kan-Latn | kan_Latn | kan | Latn | None |
| kan-in | kan_IN | kan | None | IN |
| kn-knda-in | kan_IN | kan | None | IN |
__Pipe Mode__
```bash
# --pipe/-p : maps stdin -> stdout
# -s express : expresses scripts (unlike BCP47, which suppresses the default script)
$ echo -e "en\neng\nfr\nfra\nara\nkan\ntel\neng_Latn\nhin_deva"| mtdata-bcp47 -p -s express
eng_Latn
eng_Latn
fra_Latn
fra_Latn
ara_Arab
kan_Knda
tel_Telu
eng_Latn
hin_Deva
```
**Python API for BCP47 Mapping**
```python
from mtdata.iso.bcp47 import bcp47
tag = bcp47("en_US")
print(*tag) # tag is a tuple
print(f"{tag}") # str(tag) gets standardized string
```
## How to Contribute:
* Please help grow the collection by adding any missing or new datasets to the [`index`](mtdata/index/__init__.py) module.
* Please create issues and/or pull requests at https://github.com/thammegowda/mtdata/
## Change Cache Directory:
The default cache directory is `$HOME/.mtdata`.
It can grow large when you download a lot of datasets using this tool.
To change it:
* Set the following environment variable:
  `export MTDATA=/path/to/new-cache-dir`
* Alternatively, move `$HOME/.mtdata` to the desired place and create a symbolic link:
```bash
mv $HOME/.mtdata /path/to/new/place
ln -s /path/to/new/place $HOME/.mtdata
```
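
Before moving the cache, it can be useful to know where it currently resolves to and how large it is. The sketch below relies only on the `MTDATA` variable and the `$HOME/.mtdata` default described above; `cache_dir` and `dir_size_bytes` are hypothetical helpers, not mtdata functions.

```python
import os
from pathlib import Path


def cache_dir() -> Path:
    """Resolve the cache directory: $MTDATA if set, else ~/.mtdata."""
    return Path(os.environ.get("MTDATA", Path.home() / ".mtdata"))


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root (0 if root is missing)."""
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


root = cache_dir()
print(f"{root}: {dir_size_bytes(root) / 1e9:.2f} GB")
```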
## Performance Optimization Tips
* Use `mtdata cache -j <jobs> ...` to download many datasets in parallel using the specified number of jobs.
* Use the `--compress` flag with `mtdata get` or `mtdata get-recipe` to keep the datasets compressed.
* mtdata uses `pigz` by default to handle compressed files (installing `pigz` is highly recommended). To disable pigz, `export USE_PIGZ=0`.
## Run tests
Tests are located in the [tests/](tests) directory. To run all the tests:

    python -m pytest

## Developers and Contributors:
See - https://github.com/thammegowda/mtdata/graphs/contributors
## Citation
https://aclanthology.org/2021.acl-demo.37/
```
@inproceedings{gowda-etal-2021-many,
title = "Many-to-{E}nglish Machine Translation Tools, Data, and Pretrained Models",
author = "Gowda, Thamme and
Zhang, Zhao and
Mattmann, Chris and
May, Jonathan",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-demo.37",
doi = "10.18653/v1/2021.acl-demo.37",
pages = "306--316",
}
```
---
## Disclaimer on Datasets
This tool downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or make any claims regarding license to use these datasets. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.
We request all users of this tool to cite the original creators of the datasets; citations may be obtained from `mtdata list -n <NAME> -l <L1-L2> --full`.
If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!
Raw data
{
"_id": null,
"home_page": "https://github.com/thammegowda/mtdata",
"name": "mtdata",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": "machine translation, datasets, NLP, natural language processing, computational linguistics",
"author": "Thamme Gowda",
"author_email": "tgowdan@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/7b/97/2b84d291a5cb277ebef3ab62af7d77edbb1baba9a287833c67a449d4aa57/mtdata-0.4.2.tar.gz",
"platform": "any",
"description": "# MTData\n[![image](http://img.shields.io/pypi/v/mtdata.svg)](https://pypi.python.org/pypi/mtdata/)\n![Travis (.com)](https://img.shields.io/travis/com/thammegowda/mtdata?style=plastic)\n\nMTData automates the collection and preparation of machine translation (MT) datasets.\nIt provides CLI and python APIs, which can be used for preparing MT experiments.\n\n* [Quickstart Example](#quickstart--example)\n* [Docs](https://thammegowda.github.io/mtdata/)\n* [Search datasets](https://thammegowda.github.io/mtdata/search.html)\n\n\nThis tool knows:\n- From where to download data sets: WMT News Translation tests and devs for Paracrawl,\n Europarl, News Commentary, WikiTitles, Tilde Model corpus, OPUS ...\n- How to extract files : .tar, .tar.gz, .tgz, .zip, ...\n- How to parse .tmx, .sgm and such XMLs, or .tsv ... Checks if they have same number of segments.\n- Whether parallel data is in one .tsv file or two sgm files.\n- Whether data is compressed in gz, xz or none at all.\n- Whether the source-target is in the same order or is it swapped as target-source order.\n- How to map code to ISO language codes! Using ISO 639_3 that has space for 7000+ languages of our planet.\n - New in v0.3: BCP-47 like language ID: (language, script, region)\n- Download only once and keep the files in local cache.\n- (And more of such tiny details over the time.)\n\n[MTData](https://github.com/thammegowda/mtdata) is here to:\n- Automate machinbe translation training data creation by taking out human intervention. 
This is inspired by [SacreBLEU](https://github.com/mjpost/sacreBLEU) that takes out human intervention at the evaluation stage.\n- A reusable tool instead of dozens of use-once shell scripts spread across multiple repos.\n\n\n## Installation\n```bash\n# Option 1: from pypi\npip install -I mtdata\n# To install a specific version, get version number from https://pypi.org/project/mtdata/#history\npip install mtdata==[version]\n\n# Option 2: install from latest master branch\npip install -I git+https://github.com/thammegowda/mtdata\n\n\n# Option 3: for development/editable mode\ngit clone https://github.com/thammegowda/mtdata\ncd mtdata\npip install --editable .\n```\n\n\n## Current Status:\n\nWe have added some commonly used datasets - you are welcome to add more! \nThese are the summary of datasets from various sources (Updated: Feb 2022).\n\n| Source | Dataset Count |\n|-------------:|--------------:|\n| OPUS | 151,753|\n| Flores | 51,714|\n| Microsoft | 8,128|\n| Leipzig | 5,893|\n| Neulab | 4,455|\n| Statmt | 1,784|\n| Facebook | 1,617|\n| AllenAi | 1,611|\n| ELRC | 1,575|\n| EU | 1,178|\n| Tilde | 519|\n| LinguaTools | 253|\n| Anuvaad | 196|\n| AI4Bharath | 192|\n| ParaCrawl | 127|\n| Lindat | 56|\n| UN | 30|\n| JoshuaDec | 29|\n| StanfordNLP | 15|\n| ParIce | 8|\n| LangUk | 5|\n| Phontron | 4|\n| NRC_CA | 4|\n| KECL | 3|\n| IITB | 3|\n| WAT | 3|\n| Masakhane | 2|\n| **Total** | **231,157** |\n\n\n## Usecases\n* WMT 2023 General (News) Translation Task: https://www.statmt.org/wmt23/mtdata/ \n* WMT 2022 General (News) Translation Task: https://www.statmt.org/wmt22/mtdata/ \n* USC ISI's 500-to-English MT: ~http://rtg.isi.edu/many-eng/~ http://gowda.ai/006-many-to-eng/)\n* Meta AI's 200-to-200 MT: [Whitepaper](https://research.facebook.com/file/585831413174038/No-Language-Left-Behind--Scaling-Human-Centered-Machine-Translation.pdf)\n\n## CLI Usage\n- After pip installation, the CLI can be called using `mtdata` command or `python -m mtdata`\n- There are two sub 
commands: `list` for listing the datasets, and `get` for getting them\n\n### `mtdata list`\nLists datasets that are known to this tool.\n```bash\nmtdata list -h\nusage: __main__.py list [-h] [-l L1-L2] [-n [NAME ...]] [-nn [NAME ...]] [-f] [-o OUT]\n\noptional arguments:\n -h, --help show this help message and exit\n -l L1-L2, --langs L1-L2\n Language pairs; e.g.: deu-eng (default: None)\n -n [NAME ...], --names [NAME ...]\n Name of dataset set; eg europarl_v9. (default: None)\n -nn [NAME ...], --not-names [NAME ...]\n Exclude these names (default: None)\n -f, --full Show Full Citation (default: False)\n``` \n\n```bash\n# List everything ; add | cut -f1 to see ID column only\nmtdata list | cut -f1\n\n# List a lang pair \nmtdata list -l deu-eng \n\n# List a dataset by name(s)\nmtdata list -n europarl\nmtdata list -n europarl news_commentary\n\n# list by both language pair and dataset name\n mtdata list -l deu-eng -n europarl news_commentary newstest_deen | cut -f1\n Statmt-europarl-9-deu-eng\n Statmt-europarl-7-deu-eng\n Statmt-news_commentary-14-deu-eng\n Statmt-news_commentary-15-deu-eng\n Statmt-news_commentary-16-deu-eng\n Statmt-newstest_deen-2014-deu-eng\n Statmt-newstest_deen-2015-deu-eng\n Statmt-newstest_deen-2016-deu-eng\n Statmt-newstest_deen-2017-deu-eng\n Statmt-newstest_deen-2018-deu-eng\n Statmt-newstest_deen-2019-deu-eng\n Statmt-newstest_deen-2020-deu-eng\n Statmt-europarl-10-deu-eng\n OPUS-europarl-8-deu-eng\n\n# get citation of a dataset (if available in index.py)\nmtdata list -l deu-eng -n newstest_deen --full\n```\n\n### Dataset ID\nDataset IDs are standardized to this format: \n`<Group>-<name>-<version>-<lang1>-<lang2>`\n\n* `Group`: source or the website where we are obtaining this dataset\n* `name`: name of the dataset\n* `version`: version name\n* `lang1` and `lang2` are BCP47-like codes. In simple case, they are ISO-639-3 codes, however, they might have script and language tags separated by underscores (`_`). 
\n\n\n### `mtdata get`\nThis command downloads datasets specified by names for languages to a directory.\nYou will have to make definite choice for `--train` and `--test` arguments \n\n```\nmtdata get -h\npython -m mtdata get -h\nusage: __main__.py get [-h] -l L1-L2 [-tr [ID ...]] [-ts [ID ...]] [-dv ID] [--merge | --no-merge] [--compress] -o OUT_DIR\n\noptional arguments:\n -h, --help show this help message and exit\n -l L1-L2, --langs L1-L2\n Language pairs; e.g.: deu-eng (default: None)\n -tr [ID ...], --train [ID ...]\n Names of datasets separated by space, to be used for *training*.\n e.g. -tr Statmt-news_commentary-16-deu-eng europarl_v9 .\n To concatenate all these into a single train file, set --merge flag. (default: None)\n -ts [ID ...], --test [ID ...]\n Names of datasets separated by space, to be used for *testing*.\n e.g. \"-ts Statmt-newstest_deen-2019-deu-eng Statmt-newstest_deen-2020-deu-eng \".\n You may also use shell expansion if your shell supports it.\n e.g. \"-ts Statmt-newstest_deen-20{19,20}-deu-eng\" (default: None)\n -dv ID, --dev ID Dataset to be used for development (aka validation).\n e.g. 
\"-dv Statmt-newstest_deen-2017-deu-eng\" (default: None)\n --merge Merge train into a single file (default: False)\n --no-merge Do not Merge train into a single file (default: True)\n --compress Keep the files compressed (default: False)\n -o OUT_DIR, --out OUT_DIR\n Output directory name (default: None)\n```\n\n## Quickstart / Example \nSee what datasets are available for `deu-eng`\n```bash\n$ mtdata list -l deu-eng | cut -f1 # see available datasets\n Statmt-commoncrawl_wmt13-1-deu-eng\n Statmt-europarl_wmt13-7-deu-eng\n Statmt-news_commentary_wmt18-13-deu-eng\n Statmt-europarl-9-deu-eng\n Statmt-europarl-7-deu-eng\n Statmt-news_commentary-14-deu-eng\n Statmt-news_commentary-15-deu-eng\n Statmt-news_commentary-16-deu-eng\n Statmt-wiki_titles-1-deu-eng\n Statmt-wiki_titles-2-deu-eng\n Statmt-newstest_deen-2014-deu-eng\n ....[truncated]\n```\nGet these datasets and store under dir `data/deu-eng`\n```bash\n $ mtdata get -l deu-eng --out data/deu-eng --merge \\\n --train Statmt-europarl-10-deu-eng Statmt-news_commentary-16-deu-eng \\\n --dev Statmt-newstest_deen-2017-deu-eng --test Statmt-newstest_deen-20{18,19,20}-deu-eng\n # ...[truncated] \n INFO:root:Train stats:\n {\n \"total\": 2206240,\n \"parts\": {\n \"Statmt-news_commentary-16-deu-eng\": 388482,\n \"Statmt-europarl-10-deu-eng\": 1817758\n }\n }\n INFO:root:Dataset is ready at deu-eng\n```\nTo reproduce this dataset again in the future or by others, please refer to `<out-dir>/mtdata.signature.txt`:\n```bash\n$ cat deu-eng/mtdata.signature.txt\nmtdata get -l deu-eng -tr Statmt-europarl-10-deu-eng Statmt-news_commentary-16-deu-eng \\\n -ts Statmt-newstest_deen-2018-deu-eng Statmt-newstest_deen-2019-deu-eng Statmt-newstest_deen-2020-deu-eng \\\n -dv Statmt-newstest_deen-2017-deu-eng --merge -o <out-dir>\nmtdata version 0.3.0-dev\n```\n\nSee what the above command has accomplished:\n```bash \n$ tree data/deu-eng/\n\u251c\u2500\u2500 dev.deu -> tests/Statmt-newstest_deen-2017-deu-eng.deu\n\u251c\u2500\u2500 
dev.eng -> tests/Statmt-newstest_deen-2017-deu-eng.eng\n\u251c\u2500\u2500 mtdata.signature.txt\n\u251c\u2500\u2500 test1.deu -> tests/Statmt-newstest_deen-2020-deu-eng.deu\n\u251c\u2500\u2500 test1.eng -> tests/Statmt-newstest_deen-2020-deu-eng.eng\n\u251c\u2500\u2500 test2.deu -> tests/Statmt-newstest_deen-2018-deu-eng.deu\n\u251c\u2500\u2500 test2.eng -> tests/Statmt-newstest_deen-2018-deu-eng.eng\n\u251c\u2500\u2500 test3.deu -> tests/Statmt-newstest_deen-2019-deu-eng.deu\n\u251c\u2500\u2500 test3.eng -> tests/Statmt-newstest_deen-2019-deu-eng.eng\n\u251c\u2500\u2500 tests\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 Statmt-newstest_deen-2017-deu-eng.deu\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 Statmt-newstest_deen-2017-deu-eng.eng\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 Statmt-newstest_deen-2018-deu-eng.deu\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 Statmt-newstest_deen-2018-deu-eng.eng\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 Statmt-newstest_deen-2019-deu-eng.deu\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 Statmt-newstest_deen-2019-deu-eng.eng\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 Statmt-newstest_deen-2020-deu-eng.deu\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 Statmt-newstest_deen-2020-deu-eng.eng\n\u251c\u2500\u2500 train-parts\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 Statmt-europarl-10-deu-eng.deu\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 Statmt-europarl-10-deu-eng.eng\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 Statmt-news_commentary-16-deu-eng.deu\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 Statmt-news_commentary-16-deu-eng.eng\n\u251c\u2500\u2500 train.deu\n\u251c\u2500\u2500 train.eng\n\u251c\u2500\u2500 train.meta.gz\n\u2514\u2500\u2500 train.stats.json\n```\n\n## Recipes\n\n> Since v0.3.1\n\nRecipe is a set of datasets nominated for train, dev, and tests, and are meant to improve reproducibility of experiments.\nRecipes are loaded from \n1. Default: [`mtdata/recipe/recipes.yml`](mtdata/recipe/recipes.yml) from source code\n2. 
Cache dir: `$MTDATA/mtdata.recipes.yml` where `$MTDATA` has default of `~/.mtdata`\n3. Current dir: All files matching the glob: `$PWD/mtdata.recipes*.yml` \n * If current dir is not preferred, `export MTDATA_RECIPES=/path/to/dir`\n * Alternatively, `MTDATA_RECIPES=/path/to/dir mtdata list-recipe` \n\nSee [`mtdata/recipe/recipes.yml`](mtdata/recipe/recipes.yml) for the format and examples.\n\n```bash\nmtdata list-recipe # see all recipes\nmtdata get-recipe -ri <recipe_id> -o <out_dir> # get recipe, recreate dataset\n```\n\n## Language Name Standardization\n### ISO 639 3 \nInternally, all language codes are mapped to ISO-639 3 codes.\nThe mapping can be inspected with `python -m mtdata.iso ` or `mtdata-iso`\n```bash\n$ mtdata-iso -h\nusage: python -m mtdata.iso [-h] [-b] [langs [langs ...]]\n\nISO 639-3 lookup tool\n\npositional arguments:\n langs Language code or name that needs to be looked up. When no\n language code is given, all languages are listed.\n\noptional arguments:\n -h, --help show this help message and exit\n -b, --brief be brief; do crash on error inputs\n\n# list all 7000+ languages and their 3 letter codes\n$ mtdata-iso # python -m mtdata.iso \n...\n\n# lookup codes for some languages\n$ mtdata-iso ka kn en de xx english german\nInput ISO639_3 Name\nka kat Georgian\nkn kan Kannada\nen eng English\nde deu German\nxx -none- -none-\nenglish eng English\ngerman deu German\n\n# Print no header, and crash on error; \n$ mtdata-iso xx -b\nException: Unable to find ISO 639-3 code for 'xx'. Please run\npython -m mtdata.iso | grep -i <name>\nto know the 3 letter ISO code for the language.\n```\nTo use Python API\n```python\nfrom mtdata.iso import iso3_code\nprint(iso3_code('en', fail_error=True))\nprint(iso3_code('eNgLIsH', fail_error=True)) # case doesnt matter\n```\n\n### BCP-47 \n\n> Since v0.3.0\n\nWe used ISO 639-3 from the beginning, however, we soon faced the limitation that ISO 639-3 cannot distinguish script and region variants of language. 
So we have upgraded to BCP-47 like language tags in `v0.3.0`.\n\n* BCP47 uses two-letter codes to some and three-letter codes to the rest, we use three-letter codes to all languages.\n* BCP47 uses `-` hyphens we use `_` underscores, since hyphens are used by MT community to separate bitext pairs (e.g. en-de or eng-deu)\n\n\nOur tags are of form `xxx_Yyyy_ZZ` where \n \n| Pattern | Purpose | Standard | Length | Case | Required | \n|---------|----------|------------|---------------|-----------|-----------|\n| `xxx` | Language | ISO 639-3 | three-letters | lowercase | mandatory |\n| `Yyyy` | Script | ISO 15924 | four-letters | Titlecase | optional |\n| `ZZ` | Region | ISO 3166-1 | two-letters | CAPITALS | optional |\n\n\nNotes:\n* Region is preserved when available and left blank when unavailable\n* Script `Yyyy` is forcibly suppressed in obvious cases. E.g. `eng` is written using `Latn` script, writing `eng-Latn` is just awkward to read as `Latn` is default we suppress `Latn` script for English. On the other hand a language like `Kannada` is written using `Knda` script (`kan-Knda` -> `kan`), but occasionally written using `Latn` script, so `kan-Latn` is not suppressed. 
\n* The information about what is default script is obtained from IANA language code registry\n* Language code `mul` stands for _multiple languages, and is used as a placeholder for multilingual datasets (See `mul-eng` to represent many-to-English dataset recipes in [(mtdata/recipe/recipes.yml](mtdata/recipe/recipes.yml))\n\n#### Example:\nTo inspect parsing/mapping, use `python -m mtdata.iso.bcp47 <args>` \n\n```bash\nmtdata-bcp47 eng English en-US en-GB eng-Latn kan Kannada-Deva hin-Deva kan-Latn\n```\n\n| INPUT\t | STD\t | LANG\t | SCRIPT\t | REGION |\n|---------------|-----------|-------|---------|--------|\n| eng\t | eng\t | eng\t | None\t | None |\n| English\t | eng\t | eng\t | None\t | None |\n| en-US\t | eng_US\t | eng\t | None\t | US |\n| en-GB\t | eng_GB\t | eng\t | None\t | GB |\n| eng-Latn\t | eng\t | eng\t | None\t | None |\n| kan\t | kan\t | kan\t | None\t | None |\n| Kannada-Deva\t | kan_Deva\t | kan\t | Deva\t | None |\n| hin-Deva\t | hin\t | hin\t | None\t | None |\n| kan-Latn\t | kan_Latn\t | kan\t | Latn\t | None |\n| kan-in\t | kan_IN\t | kan\t | None\t | IN |\n| kn-knda-in\t | kan_IN\t | kan\t | None\t | IN |\n\n__Pipe Mode__\n```bash\n# --pipe/-p : maps stdin -> stdout \n# -s express : expresses scripts (unlike BCP47, which supresses default script\n$ echo -e \"en\\neng\\nfr\\nfra\\nara\\nkan\\ntel\\neng_Latn\\nhin_deva\"| mtdata-bcp47 -p -s express\neng_Latn\neng_Latn\nfra_Latn\nfra_Latn\nara_Arab\nkan_Knda\ntel_Telu\neng_Latn\nhin_Deva\n```\n\n**Python API for BCP47 Mapping**\n```python\nfrom mtdata.iso.bcp47 import bcp47\ntag = bcp47(\"en_US\")\nprint(*tag) # tag is a tuple\nprint(f\"{tag}\") # str(tag) gets standardized string\n```\n\n## How to Contribute:\n* Please help grow the datasets by adding any missing and new datasets to [`index`](mtdata/index/__init__.py) module.\n* Please create issues and/or pull requests at https://github.com/thammegowda/mtdata/ \n\n## Change Cache Directory:\n\nThe default cache directory is 
`$HOME/.mtdata`.\nIt can grow to a large size when you download a lot of datasets using this tool.\n\nTo change it:\n* Set the following environment variable:\n`export MTDATA=/path/to/new-cache-dir`\n* Alternatively, move `$HOME/.mtdata` to the desired place and create a symbolic link:\n```bash\nmv $HOME/.mtdata /path/to/new/place\nln -s /path/to/new/place $HOME/.mtdata\n```\n\n## Performance Optimization Tips\n* Use `mtdata cache -j <jobs> ...` to download many datasets in parallel using the specified number of jobs.\n* Use the `--compress` flag with `mtdata get|get-recipe` to keep the datasets compressed.\n* mtdata uses `pigz` by default to handle compressed files (installing `pigz` is highly recommended). To disable pigz, run `export USE_PIGZ=0`.\n \n\n## Run tests\nTests are located in the [tests/](tests) directory. To run all the tests:\n\n    python -m pytest\n\n\n\n## Developers and Contributors:\nSee - https://github.com/thammegowda/mtdata/graphs/contributors\n\n## Citation\n\nhttps://aclanthology.org/2021.acl-demo.37/ \n\n\n```\n@inproceedings{gowda-etal-2021-many,\n    title = \"Many-to-{E}nglish Machine Translation Tools, Data, and Pretrained Models\",\n    author = \"Gowda, Thamme and\n      Zhang, Zhao and\n      Mattmann, Chris and\n      May, Jonathan\",\n    booktitle = \"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations\",\n    month = aug,\n    year = \"2021\",\n    address = \"Online\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2021.acl-demo.37\",\n    doi = \"10.18653/v1/2021.acl-demo.37\",\n    pages = \"306--316\",\n}\n```\n\n---\n## Disclaimer on Datasets\n\nThis tool downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or make any claims regarding license to use these datasets. 
It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.\nWe ask all users of this tool to cite the original creators of the datasets; the citations may be obtained from `mtdata list -n <NAME> -l <L1-L2> -full`.\n\nIf you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!\n",
"bugtrack_url": null,
"license": null,
"summary": "mtdata is a tool to download datasets for machine translation",
"version": "0.4.2",
"project_urls": {
"Download": "https://github.com/thammegowda/mtdata",
"Homepage": "https://github.com/thammegowda/mtdata"
},
"split_keywords": [
"machine translation",
" datasets",
" nlp",
" natural language processing",
" computational linguistics"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "38c85f7a91d97bd43f9ca07ed096791890dbcc7a9ab698007444a08e341d065b",
"md5": "63fd55981cece9d6f0c189e782a4efa0",
"sha256": "5a54e92929341752a11b071908ea3745fab6c7e48f8ac8ed50c92ba1230400a7"
},
"downloads": -1,
"filename": "mtdata-0.4.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "63fd55981cece9d6f0c189e782a4efa0",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 819651,
"upload_time": "2024-05-25T03:24:12",
"upload_time_iso_8601": "2024-05-25T03:24:12.148113Z",
"url": "https://files.pythonhosted.org/packages/38/c8/5f7a91d97bd43f9ca07ed096791890dbcc7a9ab698007444a08e341d065b/mtdata-0.4.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "7b972b84d291a5cb277ebef3ab62af7d77edbb1baba9a287833c67a449d4aa57",
"md5": "ae3145f864346663a59c79d8dd802836",
"sha256": "4e276b48134224dfa70f6e57fb925a4cb97f7908f06e802ca90e6eeeba9b2501"
},
"downloads": -1,
"filename": "mtdata-0.4.2.tar.gz",
"has_sig": false,
"md5_digest": "ae3145f864346663a59c79d8dd802836",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 801288,
"upload_time": "2024-05-25T03:24:14",
"upload_time_iso_8601": "2024-05-25T03:24:14.861926Z",
"url": "https://files.pythonhosted.org/packages/7b/97/2b84d291a5cb277ebef3ab62af7d77edbb1baba9a287833c67a449d4aa57/mtdata-0.4.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-05-25 03:24:14",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "thammegowda",
"github_project": "mtdata",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "mtdata"
}