![Colorful logo of smashed. It is the word smashed written in a playful font that vaguely looks like pipes.](https://github.com/allenai/smashed/raw/main/resources/smashed.png)
**S**equential **MA**ppers for **S**equences of **HE**terogeneous **D**ictionaries is a set of Python interfaces designed to apply transformations to samples in datasets, which are often implemented as sequences of dictionaries. To start, run
```bash
pip install smashed
```
## Example of Usage
Mappers are initialized and then applied sequentially. In the following example, we create a set of mappers and apply them to samples, each containing a sequence of strings.
The mappers perform the following operations:
1. Tokenize each sequence, cropping it to a maximum length if necessary.
2. Stride sequences together to a maximum length or number of samples.
3. Add padding symbols to sequences and attention masks.
4. Concatenate all sequences from a stride into a single sequence.
```python
import transformers

from smashed.mappers import (
    TokenizerMapper,
    MultiSequenceStriderMapper,
    TokensSequencesPaddingMapper,
    AttentionMaskSequencePaddingMapper,
    SequencesConcatenateMapper,
)

tokenizer = transformers.AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path='bert-base-uncased',
)

mappers = [
    # 1. tokenize each sequence, truncating it to at most 80 tokens
    TokenizerMapper(
        input_field='sentences',
        tokenizer=tokenizer,
        add_special_tokens=False,
        truncation=True,
        max_length=80
    ),
    # 2. group sequences into strides of at most 2 sequences / 512 tokens
    MultiSequenceStriderMapper(
        max_stride_count=2,
        max_length=512,
        tokenizer=tokenizer,
        length_reference_field='input_ids'
    ),
    # 3. add padding symbols to token sequences...
    TokensSequencesPaddingMapper(
        tokenizer=tokenizer,
        input_field='input_ids'
    ),
    # ...and to the corresponding attention masks
    AttentionMaskSequencePaddingMapper(
        tokenizer=tokenizer,
        input_field='attention_mask'
    ),
    # 4. concatenate all sequences in a stride into a single sequence
    SequencesConcatenateMapper()
]

dataset = [
    {
        'sentences': [
            'This is a sentence.',
            'This is another sentence.',
            'Together, they make a paragraph.',
        ]
    },
    {
        'sentences': [
            'This sentence belongs to another sample',
            'Overall, the dataset is made of multiple samples.',
            'Each sample is made of multiple sentences.',
            'Samples might have a different number of sentences.',
            'And that is the story!',
        ]
    }
]

for mapper in mappers:
    dataset = mapper.map(dataset)

print(len(dataset))
# >>> 5
# (striding in pairs turns the 3 sentences of the first sample into
# 2 new samples, and the 5 sentences of the second into 3)

print(dataset[0])
# >>> {
#     'input_ids': [101, 2023, 2003, 1037, 6251, 1012, 102,
#                   2023, 2003, 2178, 6251, 1012, 102],
#     'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
# }
```
## Building a Pipeline
Mappers can also be composed into a pipeline using the `>>` (or `<<`) operator. For example, the code above can be rewritten as follows:
```python
pipeline = TokenizerMapper(
    input_field='sentences',
    tokenizer=tokenizer,
    add_special_tokens=False,
    truncation=True,
    max_length=80
) >> MultiSequenceStriderMapper(
    max_stride_count=2,
    max_length=512,
    tokenizer=tokenizer,
    length_reference_field='input_ids'
) >> TokensSequencesPaddingMapper(
    tokenizer=tokenizer,
    input_field='input_ids'
) >> AttentionMaskSequencePaddingMapper(
    tokenizer=tokenizer,
    input_field='attention_mask'
) >> SequencesConcatenateMapper()

dataset = ...

# apply the full pipeline to the dataset
pipeline.map(dataset)
```
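The `<<` operator chains mappers in the opposite direction. A minimal sketch, under the assumption that `a << b` is equivalent to `b >> a`:

```python
# assumption: `<<` mirrors `>>`, so both pipelines below tokenize
# first and concatenate last
forward = (
    TokenizerMapper(input_field='sentences', tokenizer=tokenizer)
    >> SequencesConcatenateMapper()
)
backward = (
    SequencesConcatenateMapper()
    << TokenizerMapper(input_field='sentences', tokenizer=tokenizer)
)
```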
## Dataset Interfaces Available
The initial version of SMASHED supports two dataset interfaces:

1. **`interfaces.simple.Dataset`**: A simple dataset representation that is just a list of Python dictionaries with some extra convenience methods to make it work with SMASHED. You can create a simple dataset by passing a list of dictionaries to `interfaces.simple.Dataset`.
2. **HuggingFace `datasets` library**. SMASHED mappers work with any dataset from HuggingFace, whether regular or iterable (see the sketch after this list).
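For instance, the toy data from the first example could be wrapped in either interface. A minimal sketch (the `smashed.interfaces.simple` import path is inferred from the name above):

```python
import datasets  # HuggingFace datasets library
from smashed.interfaces.simple import Dataset  # import path assumed from the name above

records = [{'sentences': ['This is a sentence.', 'This is another sentence.']}]

# option 1: the simple interface, a thin wrapper over a list of dictionaries
simple_dataset = Dataset(records)

# option 2: a HuggingFace dataset; mappers accept either interface
hf_dataset = datasets.Dataset.from_list(records)
```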
## Developing SMASHED
To contribute to SMASHED, make sure to:
1. (If you are not part of AI2) Fork the repository on GitHub.
2. Clone it locally.
3. Create a new branch for the new feature.
4. Install development dependencies with `pip install -r dev-requirements.txt`.
5. Add your new mapper or feature.
6. Add unit tests.
7. Run tests, linting, and type checking from the root directory of the repo:
1. *Style:* `black .` (Should format for you)
2. *Style:* `flake8 .` (Should return no error)
3. *Style:* `isort .` (Should sort imports for you)
4. *Static type check:* `mypy .` (Should return no error)
5. *Tests:* `pytest -v --color=yes tests/` (Should return no error)
8. Commit, push, and create a pull request.
9. Tag `soldni` to review the PR.
### A note about versioning
SMASHED follows [Semantic Versioning](https://semver.org/). In short, this means that the version number is MAJOR.MINOR.PATCH, where:
- the MAJOR version increases when you make incompatible API changes,
- the MINOR version increases when you add functionality in a backwards-compatible manner (adding a mapper typically falls under this category), and
- the PATCH version increases when you make backwards-compatible bug fixes.