# smashed

- **Version:** 0.21.5
- **Summary:** SMASHED is a toolkit designed to apply transformations to samples in datasets, such as field extraction, tokenization, prompting, batching, and more. It supports datasets from HuggingFace, torchdata iterables, or simple lists of dictionaries.
- **Homepage:** https://github.com/allenai/smashed
- **Requires Python:** >=3.8
- **License:** Apache-2.0
- **Keywords:** mappers, pytorch, torch, huggingface, transformers, datasets, dict, pipeline, preprocessing, nlp, natural language processing, text, prompting, prefix tuning, in context learning
- **Uploaded:** 2023-09-22 17:52:24

![Colorful logo of smashed. It is the word smashed written in a playful font that vaguely looks like pipes.](https://github.com/allenai/smashed/raw/main/resources/smashed.png)

**S**equential **MA**ppers for **S**equences of **HE**terogeneous **D**ictionaries is a set of Python interfaces designed to apply transformations to samples in datasets, which are often implemented as sequences of dictionaries. To start, run

```bash
pip install smashed
```

## Example of Usage

Mappers are initialized and then applied sequentially. In the following example, we create a list of mappers and apply them to samples, each of which contains a sequence of strings.
The mappers are responsible for the following operations:

1. Tokenize each sequence, cropping it to a maximum length if necessary.
2. Stride sequences together to a maximum length or number of samples.
3. Add padding symbols to sequences and attention masks.
4. Concatenate all sequences from a stride into a single sequence.

```python
import transformers
from smashed.mappers import (
    TokenizerMapper,
    MultiSequenceStriderMapper,
    TokensSequencesPaddingMapper,
    AttentionMaskSequencePaddingMapper,
    SequencesConcatenateMapper,
)

tokenizer = transformers.AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path='bert-base-uncased',
)

mappers = [
    # 1. Tokenize each sequence, truncating it to at most 80 tokens.
    TokenizerMapper(
        input_field='sentences',
        tokenizer=tokenizer,
        add_special_tokens=False,
        truncation=True,
        max_length=80
    ),
    # 2. Stride sequences together, up to 2 sequences or 512 tokens per stride.
    MultiSequenceStriderMapper(
        max_stride_count=2,
        max_length=512,
        tokenizer=tokenizer,
        length_reference_field='input_ids'
    ),
    # 3. Add padding symbols to token sequences and attention masks.
    TokensSequencesPaddingMapper(
        tokenizer=tokenizer,
        input_field='input_ids'
    ),
    AttentionMaskSequencePaddingMapper(
        tokenizer=tokenizer,
        input_field='attention_mask'
    ),
    # 4. Concatenate all sequences from a stride into a single sequence.
    SequencesConcatenateMapper()
]

dataset = [
    {
        'sentences': [
            'This is a sentence.',
            'This is another sentence.',
            'Together, they make a paragraph.',
        ]
    },
    {
        'sentences': [
            'This sentence belongs to another sample',
            'Overall, the dataset is made of multiple samples.',
            'Each sample is made of multiple sentences.',
            'Samples might have a different number of sentences.',
            'And that is the story!',
        ]
    }
]

for mapper in mappers:
    dataset = mapper.map(dataset)

print(len(dataset))

# >>> 5

print(dataset[0])

# >>> {
#     'input_ids': [101, 2023, 2003, 1037, 6251, 1012, 102,
#                   2023, 2003, 2178, 6251, 1012, 102],
#     'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
# }
```

## Building a Pipeline

Mappers can also be composed into a pipeline using the `>>` (or `<<`) operator. For example, the code above can be rewritten as follows:

```python
pipeline = TokenizerMapper(
    input_field='sentences',
    tokenizer=tokenizer,
    add_special_tokens=False,
    truncation=True,
    max_length=80
) >> MultiSequenceStriderMapper(
    max_stride_count=2,
    max_length=512,
    tokenizer=tokenizer,
    length_reference_field='input_ids'
) >> TokensSequencesPaddingMapper(
    tokenizer=tokenizer,
    input_field='input_ids'
) >> AttentionMaskSequencePaddingMapper(
    tokenizer=tokenizer,
    input_field='attention_mask'
) >> SequencesConcatenateMapper()

dataset = ...

# apply the full pipeline to the dataset
pipeline.map(dataset)
```

## Dataset Interfaces Available

The initial version of SMASHED supports two dataset interfaces:

1. **`interfaces.simple.Dataset`**: a simple dataset representation that is just a list of Python dictionaries with some extra convenience methods to make it work with SMASHED. You can create a simple dataset by passing a list of dictionaries to `interfaces.simple.Dataset`.
2. **HuggingFace `datasets` library**: SMASHED mappers work with any HuggingFace dataset, whether it is a regular or an iterable dataset. (A short usage sketch for both interfaces follows below.)
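
As a rough sketch of the two interfaces: the `smashed.interfaces.simple` import path follows the module name quoted above, `datasets.Dataset.from_list` assumes the HuggingFace `datasets` library is installed, and `pipeline` is the pipeline defined earlier; exact paths may differ slightly across versions.

```python
from smashed.interfaces.simple import Dataset as SimpleDataset

# Interface 1: wrap a plain list of dictionaries.
simple_dataset = SimpleDataset([
    {'sentences': ['A first sentence.', 'A second sentence.']},
    {'sentences': ['Another sample with a single sentence.']},
])
processed = pipeline.map(simple_dataset)

# Interface 2: pass a HuggingFace dataset directly to the same pipeline.
import datasets

hf_dataset = datasets.Dataset.from_list([
    {'sentences': ['A first sentence.', 'A second sentence.']},
])
processed_hf = pipeline.map(hf_dataset)
```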

## Developing SMASHED

To contribute to SMASHED, make sure to:

1. (If you are not part of AI2) Fork the repository on GitHub.
2. Clone it locally.
3. Create a new branch for the new feature.
4. Install development dependencies with `pip install -r dev-requirements.txt`.
5. Add your new mapper or feature.
6. Add unit tests.
7. Run tests, linting, and type checking from the root directory of the repo (a combined sketch of these commands appears after this list):
    1. *Style:* `black .` (Should format for you)
    2. *Style:* `flake8 .`  (Should return no error)
    3. *Style:* `isort .` (Should sort imports for you)
    4. *Static type check:* `mypy .` (Should return no error)
    5. *Tests:* `pytest -v --color=yes tests/` (Should return no error)
8. Commit, push, and create a pull request.
9. Tag `soldni` to review the PR.
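
For convenience, the style, type-checking, and test commands from step 7 can be run back to back from the repository root; the snippet below is just a sketch chaining the documented commands, not an official script.

```bash
# Run all checks from the repository root.
black .                        # formats code in place
isort .                        # sorts imports in place
flake8 .                       # should report no errors
mypy .                         # should report no errors
pytest -v --color=yes tests/   # should report no failures
```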

### A note about versioning

SMASHED follows [Semantic Versioning](https://semver.org/). In short, this means that the version number is MAJOR.MINOR.PATCH, where:

- MAJOR version when you make incompatible API changes,
- MINOR version when you add functionality in a backwards compatible manner; adding a mapper typically falls under this category, and
- PATCH version when you make backwards compatible bug fixes.

            
