sotastream


* Name: sotastream
* Version: 1.0.1
* Summary: Sotastream is a command line tool that augments a batch of text and produces infinite stream of records.
* Upload time: 2023-08-29 16:39:00
* Requires Python: >=3.6
* Keywords: data augmentation, machine translation, natural language processing, text processing, text augmentation, machine learning, deep learning, artificial intelligence
* Requirements: No requirements were recorded.

# Sotastream
[![image](http://img.shields.io/pypi/v/sotastream.svg)](https://pypi.python.org/pypi/sotastream/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](./LICENSE)
[![Read the Docs](https://img.shields.io/readthedocs/sotastream.svg)](https://sotastream.readthedocs.io/)


Sotastream is a data augmentation tool for training
pipelines. It uses `infinibatch` internally to generate an infinite
stream of shuffled training data and provides a means for on-the-fly
data manipulation, augmentation, mixing, and sampling.


## Setup

To install from PyPI (https://pypi.org/project/sotastream/):
```bash
pip install sotastream
```

*Developer Setup:*

```bash
# To begin, clone the repository:
git clone https://github.com/marian-nmt/sotastream
cd sotastream
# option 1:
python -m pip install .
# option 2: install in --editable mode
python -m pip install -e .
```

*Entry points*
* As a module:  `python -m sotastream`
* As a bin in your $PATH: `sotastream`

## Development

Install the development tools:
```bash
python -m pip install -e .[dev,test]   # editable mode
```
Editable mode (`-e` / `--editable`) is recommended for development: `pip` creates a link to your source code so that any edits you make are reflected directly in the installed package. The `[dev,test]` extras install the dependencies for development and testing, which include `black` and `pytest`.

We use `black` to reformat code to a common style:
```bash
make reformat
```

Before creating a pull request, run:
```bash
make check          # runs reformatter and tests
```

## Running tests

```bash
make test           # run unit tests
make regression     # run regression tests
```

See `Makefile` for more details.


## Usage examples

A folder like `split/parallel` contains training data in TSV format (`src<tab>tgt`), split into
`*.gz` files of around 100,000 lines each for better shuffling. The command below outputs an infinite
stream of data generated from the gzipped files in these folders, according to the "wmt" recipe
found in `sotastream/pipelines/example_pipeline.py`.

```bash
python -m sotastream example split/parallel split/backtrans
```
You can also provide compressed TSV files directly, in which case sotastream will split them
into checksummed folders under `/tmp/sotastream/{checksum}`:

```bash
python -m sotastream example parallel.tsv.gz backtrans.tsv.gz
```
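
To make the sharded folder layout described above concrete, here is a minimal sketch of splitting a stream of TSV lines into gzipped shards. It is not sotastream's own splitting code: the function name `split_tsv`, the `part.NNNNN.gz` file naming, and the demo data are assumptions made purely for illustration.

```python
import gzip
import tempfile
from pathlib import Path


def split_tsv(lines, out_dir, lines_per_shard=100_000):
    """Write an iterable of text lines into numbered gzip shards in out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    shard = None
    for i, line in enumerate(lines):
        if i % lines_per_shard == 0:
            # Start a new shard every `lines_per_shard` lines.
            if shard is not None:
                shard.close()
            # Hypothetical file naming; sotastream may name shards differently.
            path = out_dir / f"part.{len(shard_paths):05d}.gz"
            shard = gzip.open(path, "wt", encoding="utf-8")
            shard_paths.append(path)
        # Normalize so every record ends with exactly one newline.
        shard.write(line.rstrip("\n") + "\n")
    if shard is not None:
        shard.close()
    return shard_paths


if __name__ == "__main__":
    # Demo data: 250 fake `src<tab>tgt` pairs split into shards of 100 lines.
    rows = (f"source {i}\ttarget {i}" for i in range(250))
    with tempfile.TemporaryDirectory() as tmp:
        paths = split_tsv(rows, tmp, lines_per_shard=100)
        print([p.name for p in paths])
        # prints ['part.00000.gz', 'part.00001.gz', 'part.00002.gz']
```

Smaller shards give a shuffler more units to reorder, which is the motivation stated above for files of around 100,000 lines.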

There are currently two main pipelines: "default" and "wmt". They differ in
the data sources they take as well as in the other options available to them.

There are global options that control behavioral aspects such as splitting and parallelization,
as well as pipeline-specific arguments. You can see these by running:

```bash
# see global options
python -m sotastream -h

# see default pipeline options
python -m sotastream default -h

# see wmt pipeline options
python -m sotastream wmt -h
```

## Don't cross the streams!

Sotastream workflows build a directed acyclic graph (DAG)
consisting of cascades of generators that pass mutable lines
from the graph inputs to the pipeline output. Because each step
transforms and manipulates the lines it receives, the one
requirement is that modifications made along separate branches must not be
merged into a single node in the graph, or at least that great care
is taken when doing so. The Mixer, for example,
does not actually merge modifications from alternate branches; instead, it
selects among multiple incoming branches according to a provided probability
distribution.
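
The selection behavior described above can be sketched in plain Python. This is an illustrative stand-in, not sotastream's actual Mixer: the names `mix` and `tagged`, the fixed seed, and the choice to stop when the selected branch is exhausted are all assumptions.

```python
import itertools
import random


def mix(streams, weights, seed=0):
    """Yield items by repeatedly picking one input stream according to `weights`."""
    rng = random.Random(seed)
    iterators = [iter(s) for s in streams]
    while True:
        # Select a branch; its next item passes through unchanged, so nothing
        # from the other branches is merged into it.
        (chosen,) = rng.choices(iterators, weights=weights, k=1)
        try:
            yield next(chosen)
        except StopIteration:
            # Assumption for this sketch: stop once a selected branch runs dry.
            return


def tagged(tag):
    """Infinite stand-in stream yielding 'tag-0', 'tag-1', ..."""
    return (f"{tag}-{i}" for i in itertools.count())


if __name__ == "__main__":
    mixed = mix([tagged("parallel"), tagged("backtrans")], weights=[0.75, 0.25])
    # Take a finite prefix of the infinite mixed stream.
    print(list(itertools.islice(mixed, 8)))
```

Each output item comes from exactly one branch, which is why a node of this kind does not violate the "don't merge modifications" rule.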

## Custom/private pipelines from your own (private) directory

You can create a custom pipeline by adding a file to the current (invocation)
directory whose name matches the pattern `*_pipeline.py`. The file should
follow the interface defined in `sotastream/pipelines`, namely:

* Decorate your pipeline class with `@pipeline("name")` to give it a name. This name must not conflict with existing names.
* Inherit from the `Pipeline` base class in `sotastream.pipeline`. For document pipelines, use `DocumentPipeline` as the base class.

You can find some examples in `test/dummy_pipeline.py`, as well as the real examples in `sotastream/pipelines`.
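
As a rough illustration of the decorator-plus-base-class pattern described above, the sketch below uses local stand-ins so that it runs without sotastream installed. The `PIPELINES` registry, the `pipeline` decorator, the `Pipeline` class, and `UppercasePipeline` defined here are all hypothetical and do not reproduce sotastream's real signatures; consult the examples in `sotastream/pipelines` for the actual interface.

```python
# Stand-ins for illustration only; these are NOT sotastream's implementations.
PIPELINES = {}  # hypothetical registry: pipeline name -> class


def pipeline(name):
    """Hypothetical decorator that registers a class under a unique name."""
    def register(cls):
        if name in PIPELINES:
            # Mirrors the rule that a name must not conflict with existing names.
            raise ValueError(f"pipeline name already in use: {name!r}")
        PIPELINES[name] = cls
        return cls
    return register


class Pipeline:
    """Hypothetical minimal base class: iterating a pipeline yields output lines."""

    def __iter__(self):
        raise NotImplementedError


@pipeline("uppercase")
class UppercasePipeline(Pipeline):
    """Toy pipeline that upper-cases the source side of `src<tab>tgt` lines."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            src, tgt = line.rstrip("\n").split("\t", 1)
            yield f"{src.upper()}\t{tgt}"


if __name__ == "__main__":
    for out in UppercasePipeline(["hello\tworld\n"]):
        print(out)
```

Registering by name is what would let a command-line front end look up a pipeline from its first argument, which is also why duplicate names have to be rejected.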

## Authors

Sotastream is developed by _TextMT Team_ @ Microsoft Translator.

If you use this tool, please cite: 
```bibtex
@misc{post2023sotastream,
      title={SOTASTREAM: A Streaming Approach to Machine Translation Training}, 
      author={Matt Post and Thamme Gowda and Roman Grundkiewicz and Huda Khayrallah and Rohit Jain and Marcin Junczys-Dowmunt},
      year={2023},
      eprint={2308.07489},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
Paper link: https://arxiv.org/abs/2308.07489 

            

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "sotastream",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "Thamme Gowda <thammegowda@microsoft.com>, Roman Grundkiewicz <roman.grundkiewicz@microsoft.com>, Matt Post <mattpost@microsoft.com>",
    "keywords": "data augmentation,machine translation,natural language processing,text processing,text augmentation,machine learning,deep learning,artificial intelligence",
    "author": "",
    "author_email": "\"Text MT @ Microsoft Translator\" <marcinjd@microsoft.com>",
    "download_url": "https://files.pythonhosted.org/packages/31/78/47bb3daab2f444d193c172394b50693a2661fb8bdb7e7ef459c630d12a34/sotastream-1.0.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Sotastream is a command line tool that augments a batch of text and produces infinite stream of records.",
    "version": "1.0.1",
    "project_urls": {
        "documentation": "https://github.com/marian-nmt/sotastream",
        "homepage": "https://github.com/marian-nmt/sotastream",
        "repository": "https://github.com/marian-nmt/sotastream"
    },
    "split_keywords": [
        "data augmentation",
        "machine translation",
        "natural language processing",
        "text processing",
        "text augmentation",
        "machine learning",
        "deep learning",
        "artificial intelligence"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f4153bb4e438a8c5cfece8f4c9f2280f7878282cd634a34d3bc3e1424d129138",
                "md5": "0961c7c67adce7a80fb03775e055be22",
                "sha256": "a03644b40ac960bde0a41217e5f108aaa5fd5202a18a0373207c257ce522c020"
            },
            "downloads": -1,
            "filename": "sotastream-1.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "0961c7c67adce7a80fb03775e055be22",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 27721,
            "upload_time": "2023-08-29T16:38:58",
            "upload_time_iso_8601": "2023-08-29T16:38:58.298546Z",
            "url": "https://files.pythonhosted.org/packages/f4/15/3bb4e438a8c5cfece8f4c9f2280f7878282cd634a34d3bc3e1424d129138/sotastream-1.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "317847bb3daab2f444d193c172394b50693a2661fb8bdb7e7ef459c630d12a34",
                "md5": "26f375a7abf6c7b0351e5d62ec11d25b",
                "sha256": "f3709874c96f2feb4307dea0f26fbab79c757c0567753b9ca20f93109beba4ad"
            },
            "downloads": -1,
            "filename": "sotastream-1.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "26f375a7abf6c7b0351e5d62ec11d25b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 30503,
            "upload_time": "2023-08-29T16:39:00",
            "upload_time_iso_8601": "2023-08-29T16:39:00.149194Z",
            "url": "https://files.pythonhosted.org/packages/31/78/47bb3daab2f444d193c172394b50693a2661fb8bdb7e7ef459c630d12a34/sotastream-1.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-08-29 16:39:00",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "marian-nmt",
    "github_project": "sotastream",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "sotastream"
}
        