| Field | Value |
| --- | --- |
| Name | stopes |
| Version | 2.2.1 |
| Summary | Large-Scale Translation Data Mining. |
| Author | Facebook AI Research |
| Requires Python | >=3.9 |
| Upload time | 2025-01-23 09:18:45 |

# `stopes`: A library for preparing data for machine translation research
As part of the FAIR No Language Left Behind (NLLB) ([Paper](https://research.facebook.com/publications/no-language-left-behind/), [Website](https://ai.facebook.com/research/no-language-left-behind/), [Blog](https://ai.facebook.com/blog/nllb-200-high-quality-machine-translation/))
project to drive inclusion through machine translation, a large amount of data was processed to create training data. We provide the libraries and tools we used to:
1. create clean monolingual data from web data
2. mine bitext
3. easily write scalable pipelines for processing data for machine translation
Full documentation on https://facebookresearch.github.io/stopes
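As a flavor of step 1, a minimal monolingual cleaning pass might normalize whitespace, drop very short lines, and deduplicate. This is a toy sketch only, not the actual stopes pipeline (which also does language identification, script filtering, and more):

```python
def clean_monolingual(lines, min_chars=3):
    """Toy cleaning pass: normalize whitespace, drop short lines, dedup.

    Illustrative only; the real stopes monolingual pipeline does much
    more than this (language ID, script filtering, splitting, ...).
    """
    seen = set()
    out = []
    for line in lines:
        line = " ".join(line.split())  # collapse runs of whitespace
        if len(line) < min_chars or line in seen:
            continue
        seen.add(line)
        out.append(line)
    return out
```

For example, `clean_monolingual(["  hello   world ", "hi", "hello world"])` keeps only one normalized copy of `"hello world"`.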
## Examples
Check out the `demo` directory for example usage with the [WMT22 Shared Task: Large-Scale Machine Translation Evaluation for African
Languages](https://statmt.org/wmt22/large-scale-multilingual-translation-task.html) data.
## Requirements
`stopes` relies on:
- submitit to schedule jobs when run on clusters
- hydra-core version >= 1.2.0 for configuration
- fairseq to use LASER encoders
- PyTorch version >= 1.5.0
- Python version >= 3.9
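Since the PyPI metadata for release 2.2.1 declares `requires_python >= 3.9`, a quick pre-install sanity check (a trivial, stopes-free sketch) can save a confusing failure later:

```python
import sys

def python_ok(min_version=(3, 9)):
    """Return True if the running interpreter meets the floor declared
    in the stopes 2.2.1 package metadata (requires_python >= 3.9)."""
    return sys.version_info[:2] >= min_version

if not python_ok():
    raise SystemExit("stopes 2.2.1 requires Python >= 3.9")
```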
## Installing stopes
stopes uses [flit](https://flit.pypa.io/) to manage its setup, so you will need a recent version of
pip for the install to work. We recommend that you first upgrade pip:
```
python -m pip install --upgrade pip
```
Warning: recent versions of pip (around `23.3`) no longer support the legacy way dependencies are declared in `fasttext`,
and possibly in other packages without a `pyproject.toml`. To work around the resulting problems, consider installing the `wheel` package first.
The mining pipeline relies on fairseq to run the LASER encoders. Because of competing dependency versions, you will have to install fairseq with pip separately first:
```
pip install fairseq==0.12.2
```
You can then install stopes with pip:
```
git clone https://github.com/facebookresearch/stopes.git
cd stopes
pip install -e '.[dev,mono,mining]'
```
You can choose what to install. If you are only interested in `mining`, you do not need to install `dev` or `mono`. If you are interested in the distillation pipeline, you will need to install at least `mono`. `mining` installs the CPU version of the mining dependencies; if you want to mine on GPU and your system is compatible, install `[mining,mining-gpu]` instead.
Currently `fairseq` and `stopes` require different versions of hydra, so `pip` may output some warnings; do not worry about them, as we want hydra>=1.1.
If you plan to train many NMT models, you will also want to set up apex for faster training.
```
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
--global-option="--deprecated_fused_adam" --global-option="--xentropy" \
--global-option="--fast_multihead_attn" ./
```
### Speech package installing
Some speech packages, like MMS Text-to-Speech (TTS), require additional libraries; see [here](stopes/speech/README.md) for more details.
In addition, the [UST library](ust/README.md) has its own set of extra dependencies.
## How `stopes` works
`stopes` is made of a few different parts:
1. `core` provides a library to write readable pipelines
2. `modules` provides a set of modules using the core library and implementing
common steps in our mining and evaluation pipelines
3. `pipelines` provides pipeline implementation for the data pipelines we use in
NLLB:
- `monolingual` to preprocess and clean single language data
- `bitext` to run the "global mining" pipeline and extract aligned sentences
from two monolingual datasets. (inspired by
[CCMatrix](https://ai.facebook.com/blog/ccmatrix-a-billion-scale-bitext-data-set-for-training-translation-models/))
- `distillation` to run our sequence-level knowledge distillation pipeline, which trains a small student model from a pre-trained large teacher model (approach based on https://arxiv.org/abs/1606.07947)
4. `eval` provides a set of evaluation tools, including ALTI+ and BLASER for text-free speech translation evaluation.
5. `demo` contains applications of stopes, including a quickstart mining demo that you can run at home, as well as an example usage of ALTI+ for toxicity and hallucination analysis.
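The core/module split above can be pictured with a small, stopes-free sketch. The class and function names here are purely illustrative; the real API lives in `stopes.core` and is covered in the documentation:

```python
class Module:
    """Illustrative stand-in for a stopes pipeline step."""
    def run(self, data):
        raise NotImplementedError

class Normalize(Module):
    def run(self, data):
        # collapse whitespace and lowercase each line
        return [" ".join(line.split()).lower() for line in data]

class LengthFilter(Module):
    def __init__(self, min_chars=3):
        self.min_chars = min_chars

    def run(self, data):
        return [line for line in data if len(line) >= self.min_chars]

def run_pipeline(modules, data):
    # a real launcher would schedule each step, e.g. via submitit on a cluster
    for module in modules:
        data = module.run(data)
    return data
```

Running `run_pipeline([Normalize(), LengthFilter()], [" Hello  World ", "a"])` returns `["hello world"]`; composing small, reusable steps like this is the pattern the `core` and `modules` layers provide.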
**Full documentation**: see https://facebookresearch.github.io/stopes
or the `websites/docs` folder.
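For intuition on the "global mining" step, here is a toy version of the margin-based scoring used in CCMatrix-style mining. The vectors below are made up for illustration; the real pipeline embeds sentences with LASER and searches FAISS indexes:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def margin_score(x, y, x_neighbors, y_neighbors, k=2):
    """Ratio margin: cos(x, y) normalized by the average similarity of
    each side to its k nearest neighbors, so sentences that are close
    to everything do not win by default."""
    def avg_topk(v, neighbors):
        sims = sorted((cosine(v, n) for n in neighbors), reverse=True)
        return sum(sims[:k]) / min(k, len(sims))
    return 2 * cosine(x, y) / (avg_topk(x, x_neighbors) + avg_topk(y, y_neighbors))
```

With a source embedding `[1.0, 0.0]`, a closely aligned target `[0.9, 0.1]` gets a much higher margin than an unrelated one `[0.0, 1.0]`, which is the signal the mining pipeline thresholds to extract aligned sentence pairs.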
## Contributing
See the [CONTRIBUTING](CONTRIBUTING.md) file for how to help out.
## Contributors
You can find the list of the main contributors in the [NLLB](https://github.com/facebookresearch/fairseq/tree/nllb) and [Seamless Communication](https://github.com/facebookresearch/seamless_communication) papers.
(in alphabetical order)
## Citation
If you use `stopes` in your work, please cite:
```bibtex
@inproceedings{andrews-etal-2022-stopes,
title = "stopes - Modular Machine Translation Pipelines",
author = "Pierre Andrews and Guillaume Wenzek and Kevin Heffernan and Onur Çelebi and Anna Sun and Ammar Kamran and Yingzhe Guo and Alexandre Mourachko and Holger Schwenk and Angela Fan",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = dec,
year = "2022",
publisher = "Association for Computational Linguistics",
}
```
Some of the tools in stopes, like BLASER and ALTI+, have their own publications; please see their specific READMEs for the correct citations for these tools.
`stopes` was originally built as part of the NLLB project. If you use any models/datasets/artifacts published in NLLB, please cite:
```bibtex
@article{nllb2022,
title={No Language Left Behind: Scaling Human-Centered Machine Translation},
author={{NLLB Team} and Costa-jussà, Marta R. and Cross, James and Çelebi, Onur and Elbayad, Maha and Heafield, Kenneth and Heffernan, Kevin and Kalbassi, Elahe and Lam, Janice and Licht, Daniel and Maillard, Jean and Sun, Anna and Wang, Skyler and Wenzek, Guillaume and Youngblood, Al and Akula, Bapi and Barrault, Loic and Mejia-Gonzalez, Gabriel and Hansanti, Prangthip and Hoffman, John and Jarrett, Semarley and Sadagopan, Kaushik Ram and Rowe, Dirk and Spruit, Shannon and Tran, Chau and Andrews, Pierre and Ayan, Necip Fazil and Bhosale, Shruti and Edunov, Sergey and Fan, Angela and Gao, Cynthia and Goswami, Vedanuj and Guzmán, Francisco and Koehn, Philipp and Mourachko, Alexandre and Ropers, Christophe and Saleem, Safiyyah and Schwenk, Holger and Wang, Jeff},
year={2022}
}
```
If you use SeamlessM4T in your work, or any models/datasets/artifacts published in SeamlessM4T, please cite:
```bibtex
@article{seamlessm4t2023,
title={SeamlessM4T—Massively Multilingual \& Multimodal Machine Translation},
author={{Seamless Communication}, Lo\"{i}c Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-juss\`{a}, Onur \c{C}elebi, Maha Elbayad, Cynthia Gao, Francisco Guzm\'an, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang},
journal={ArXiv},
year={2023}
}
```
## License
`stopes` is MIT licensed, as found in the LICENSE file.
When using speech mining with the [SONAR](https://github.com/facebookresearch/SONAR) models, note that that code and those models are released under [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).