unbabel-smaug


* Name: unbabel-smaug
* Version: 0.1.3
* Home page: https://github.com/Unbabel/smaug
* Summary: Sentence-level Multilingual Augmentation
* Upload time: 2023-01-10 17:57:28
* Author: Duarte Alves
* Requires Python: >=3.8,<4.0
* License: Apache-2.0
* Keywords: natural language processing, data augmentation

# SMAUG: Sentence-level Multilingual AUGmentation

`smaug` is a package for multilingual data augmentation. It offers transformations focused on changing specific aspects of sentences, such as Named Entities, Numbers, etc.

# Getting Started

To start using `smaug`, you can install it with `pip`:

```
pip install unbabel-smaug
```

To run a simple pipeline with all transforms and default validations, first create the following YAML file:

```yaml
pipeline:
- cmd: io-read-lines
  path: <path to input file with single sentence per line>
  lang: <two letter language code for the input sentences>
- cmd: transf-swp-ne
- cmd: transf-swp-num
- cmd: transf-swp-poisson-span
- cmd: transf-neg
- cmd: transf-ins-text
- cmd: transf-del-punct-span
- cmd: io-write-json
  path: <path to output file>
# Remove this line for no seed
seed: <seed for the pipeline>
```

Then run the following command:

```shell
augment --cfg <path_to_config_file>
```
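
To try the pipeline end to end, it helps to have a small input file and a filled-in configuration. The following is a minimal sketch, using only the Python standard library, that writes both. The function name `write_quickstart_files`, the file names, the sample sentences and the default seed are all assumptions made for illustration; the command names are copied from the configuration above.

```python
from pathlib import Path

# Transform commands copied from the pipeline configuration in this README.
TRANSFORMS = [
    "transf-swp-ne",
    "transf-swp-num",
    "transf-swp-poisson-span",
    "transf-neg",
    "transf-ins-text",
    "transf-del-punct-span",
]


def write_quickstart_files(workdir: Path, lang: str = "en", seed: int = 42) -> Path:
    """Write a sample input file and a matching pipeline config; return the config path."""
    workdir.mkdir(parents=True, exist_ok=True)
    input_path = workdir / "input.txt"     # hypothetical file name
    output_path = workdir / "output.json"  # hypothetical file name
    config_path = workdir / "config.yaml"  # hypothetical file name

    # One sentence per line, the layout io-read-lines is documented to read.
    sentences = [
        "Maria visited Lisbon in 2019.",
        "The train leaves at 7 in the morning.",
        "He bought 3 apples, 2 pears and a melon.",
    ]
    input_path.write_text("\n".join(sentences) + "\n", encoding="utf-8")

    # Build the YAML by hand so the sketch needs no third-party YAML library.
    # This assumes the paths contain no characters that need YAML quoting.
    lines = ["pipeline:", "- cmd: io-read-lines", f"  path: {input_path}", f"  lang: {lang}"]
    lines += [f"- cmd: {name}" for name in TRANSFORMS]
    lines += ["- cmd: io-write-json", f"  path: {output_path}", f"seed: {seed}"]
    config_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return config_path
```

Calling `write_quickstart_files(Path("smaug_demo"))` would create the three paths under `smaug_demo/`, after which `augment --cfg smaug_demo/config.yaml` can be run as described above.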

# Usage

The `smaug` package can be used through its command line interface (CLI) or by importing it and calling its Python API directly. Either way, first install it by following the [instructions](#getting-started) above.

## Command Line Interface

The CLI offers a way to read, transform, validate and write perturbed sentences to files. For more information, see the [full details](CLI.md).

### Configuration File

The easiest way to run `smaug` is through a configuration file (see the [full specification](CLI.md#configuration-file-specification)) that specifies an entire pipeline, as shown in the [Getting Started](#getting-started) section. Pass the file to the following command:

```shell
augment --cfg <path_to_config_file>
```

### Single Transform

As an alternative, you can use the command line to directly specify the pipeline to apply. To apply a single transform to a set of sentences, execute the following command:

```shell
augment io-read-lines -p <input_file> -l <input_lang_code> <transf_name> io-write-json -p <output_file>
```

> `<transf_name>` is the name of the transform to apply (see this [section](OPERATIONS.md#transforms) for a list of available transforms).
>
> `<input_file>` is a text file with one sentence per line.
>
> `<input_lang_code>` is a two-letter language code for the input sentences.
>
> `<output_file>` is a JSON file to be created with the transformed sentences.

### Multiple Transforms

To apply multiple transforms, just specify them in arbitrary order between the read and write operations:

```shell
augment io-read-lines -p <input_file> -l <input_lang_code> <transf_name_1> <transf_name_2> ... io-write-json -p <output_file>
```
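
When the pipeline is launched from a script, the single-transform and multiple-transform command lines above can be assembled as an argument list. The sketch below is one way to do that with the standard library; the helper `build_augment_command` is hypothetical, and it assumes that the transform names accepted on the command line match the `cmd` values used in the configuration file (for example `transf-swp-ne`). Only the `-p` and `-l` options shown in this README are used.

```python
import shlex
from typing import List, Sequence


def build_augment_command(
    input_file: str,
    lang: str,
    transforms: Sequence[str],
    output_file: str,
) -> List[str]:
    """Return the argv list for: read lines -> transforms -> write JSON."""
    if not transforms:
        raise ValueError("at least one transform name is required")
    return [
        "augment",
        "io-read-lines", "-p", input_file, "-l", lang,
        *transforms,  # one or more transform names, in the order given
        "io-write-json", "-p", output_file,
    ]


cmd = build_augment_command(
    "input.txt", "en", ["transf-swp-ne", "transf-neg"], "output.json"
)
# Show the equivalent shell command line.
print(shlex.join(cmd))
# augment io-read-lines -p input.txt -l en transf-swp-ne transf-neg io-write-json -p output.json
```

With `smaug` installed, the list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a single string avoids shell quoting problems when file names contain spaces.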

### Multiple Inputs

To read from multiple input files, also specify them in arbitrary order:

```shell
augment io-read-lines -p <input_file_1> -l <input_lang_code_1> io-read-lines -p <input_file_2> -l <input_lang_code_2> ... <transf_name_1> <transf_name_2> ... io-write-json -p <output_file>
```

A single file can also mix languages: give each line the structure \<lang code\>,\<sentence\> and read it with the following command:

```shell
augment io-read-csv -p <input_file> <transf_name_1> <transf_name_2> ... io-write-json -p <output_file>
```
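
As an illustration of the \<lang code\>,\<sentence\> layout, the following sketch writes a small mixed-language input file. The helper `write_multilingual_input` and the sample sentences are hypothetical. This README does not say how `io-read-csv` treats quoted fields or commas inside a sentence, so the sketch writes plain lines and the sample sentences avoid commas.

```python
from pathlib import Path
from typing import Iterable, Tuple


def write_multilingual_input(path: Path, rows: Iterable[Tuple[str, str]]) -> None:
    """Write one '<lang code>,<sentence>' line per (lang, sentence) pair."""
    lines = []
    for lang, sentence in rows:
        # The language code is the first field, so it must not contain the separator.
        if "," in lang:
            raise ValueError(f"language code must not contain a comma: {lang!r}")
        # Each record must stay on a single line.
        if "\n" in sentence:
            raise ValueError("sentences must not contain line breaks")
        lines.append(f"{lang},{sentence}")
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

For example, `write_multilingual_input(Path("mixed.csv"), [("en", "The meeting is on Monday."), ("pt", "A reunião é na segunda-feira.")])` produces a two-line file that can be passed to `io-read-csv` with `-p mixed.csv`.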

# Developing

To develop this package, execute the following steps:

* Install the [poetry](https://python-poetry.org/docs/#installation) tool for dependency management.

* Clone this git repository and install the project.

```
git clone https://github.com/Unbabel/smaug.git
cd smaug
poetry install
```
            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/Unbabel/smaug",
    "name": "unbabel-smaug",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8,<4.0",
    "maintainer_email": "",
    "keywords": "Natural Language Processing,Data Augmentation",
    "author": "Duarte Alves",
    "author_email": "duartemalves@tecnico.ulisboa.pt",
    "download_url": "https://files.pythonhosted.org/packages/2b/3c/955011efdc0357ab0981c2ea09ae59989f8f9ce5aa3cbc9930dceb6cc65a/unbabel_smaug-0.1.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "Sentence-level Multilingual Augmentation",
    "version": "0.1.3",
    "split_keywords": [
        "natural language processing",
        "data augmentation"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ecd4756429f983063b1617ad33b87d7f30c96dbc00f07f44a184c428444dc32a",
                "md5": "e600987da02e30249443d9cf011830cb",
                "sha256": "59e1522b8caae338114c0b01e4d58272587b2eda57f76beb5aa1628cb4c3bf9e"
            },
            "downloads": -1,
            "filename": "unbabel_smaug-0.1.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e600987da02e30249443d9cf011830cb",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8,<4.0",
            "size": 46450,
            "upload_time": "2023-01-10T17:57:26",
            "upload_time_iso_8601": "2023-01-10T17:57:26.369624Z",
            "url": "https://files.pythonhosted.org/packages/ec/d4/756429f983063b1617ad33b87d7f30c96dbc00f07f44a184c428444dc32a/unbabel_smaug-0.1.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "2b3c955011efdc0357ab0981c2ea09ae59989f8f9ce5aa3cbc9930dceb6cc65a",
                "md5": "d4f7b57eac5e851575f8a05fc7663397",
                "sha256": "348e37b2e59e7770363c156e0f3c3ee30688daac2587d45e9ca37163838881b7"
            },
            "downloads": -1,
            "filename": "unbabel_smaug-0.1.3.tar.gz",
            "has_sig": false,
            "md5_digest": "d4f7b57eac5e851575f8a05fc7663397",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8,<4.0",
            "size": 33446,
            "upload_time": "2023-01-10T17:57:28",
            "upload_time_iso_8601": "2023-01-10T17:57:28.052414Z",
            "url": "https://files.pythonhosted.org/packages/2b/3c/955011efdc0357ab0981c2ea09ae59989f8f9ce5aa3cbc9930dceb6cc65a/unbabel_smaug-0.1.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-01-10 17:57:28",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "Unbabel",
    "github_project": "smaug",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "unbabel-smaug"
}
        