# SAF-Datasets
### Dataset loading and annotation facilities for the Simple Annotation Framework
The *saf-datasets* library provides easy access to Natural Language Processing (NLP) datasets, and tools to facilitate annotation at document, sentence and token levels.
It is being developed to address a need for flexibility in manipulating NLP annotations that is not entirely covered by popular dataset libraries such as HuggingFace Datasets and torch Datasets, namely:
- Including and modifying annotations on existing datasets.
- Standardized API.
- Support for complex and multi-level annotations.
*saf-datasets* is built upon the [Simple Annotation Framework (SAF)](https://github.com/dscarvalho/saf) library, which provides its data model and API.
It also provides annotator classes to automatically label existing and new datasets.
## Installation
To install, you can use pip:
```bash
pip install saf-datasets
```
## Usage
### Loading datasets
```python
from saf_datasets import STSBDataSet
dataset = STSBDataSet()
print(len(dataset)) # Size of the dataset
# 17256
print(dataset[0].surface) # First sentence in the dataset
# A plane is taking off
print([token.surface for token in dataset[0].tokens]) # Tokens (spaCy) of the first sentence.
# ['A', 'plane', 'is', 'taking', 'off', '.']
print(dataset[0].annotations) # Annotations for the first sentence
# {'split': 'train', 'genre': 'main-captions', 'dataset': 'MSRvid', 'year': '2012test', 'sid': '0001', 'score': '5.000', 'id': 0}
# There are no token annotations in this dataset
print([(tok.surface, tok.annotations) for tok in dataset[0].tokens])
# [('A', {}), ('plane', {}), ('is', {}), ('taking', {}), ('off', {}), ('.', {})]
```
**Available datasets:** AllNLI, CODWOE, CPAE, EntailmentBank, STSB, Wiktionary, WordNet (Filtered).
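
Sentence-level annotations such as `split` and `score` in the output above can be used to select and summarize data. The following is a minimal sketch, assuming `annotations` behaves like a plain `dict` (as the printed output suggests). The helpers `filter_by_annotation` and `mean_score` are hypothetical names, not part of the library, and the sentences are stand-in objects with illustrative values so that the snippet runs without downloading a dataset.

```python
from types import SimpleNamespace


def filter_by_annotation(sentences, key, value):
    """Return the sentences whose annotation `key` equals `value`."""
    return [s for s in sentences if s.annotations.get(key) == value]


def mean_score(sentences):
    """Average the 'score' annotation, which is stored as a string."""
    scores = [float(s.annotations["score"]) for s in sentences]
    return sum(scores) / len(scores)


# Stand-in sentences (hypothetical): only the attributes used here are modeled,
# and the values are illustrative rather than taken from STSB.
sentences = [
    SimpleNamespace(surface="Sentence one.", annotations={"split": "train", "score": "5.000"}),
    SimpleNamespace(surface="Sentence two.", annotations={"split": "train", "score": "3.000"}),
    SimpleNamespace(surface="Sentence three.", annotations={"split": "dev", "score": "1.000"}),
]

train = filter_by_annotation(sentences, "split", "train")
print(len(train), mean_score(train))
```

With a loaded dataset, the same helpers should work on any iterable of sentences that expose an `annotations` mapping.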
### Annotating datasets
```python
from saf_datasets import STSBDataSet
from saf_datasets.annotators import SpacyAnnotator
dataset = STSBDataSet()
annotator = SpacyAnnotator() # Needs spacy and en_core_web_sm to be installed.
annotator.annotate(dataset)
# Now tokens are annotated
for tok in dataset[0].tokens:
    print(tok.surface, tok.annotations)
# A {'pos': 'DET', 'lemma': 'a', 'dep': 'det', 'ctag': 'DT'}
# plane {'pos': 'NOUN', 'lemma': 'plane', 'dep': 'nsubj', 'ctag': 'NN'}
# is {'pos': 'AUX', 'lemma': 'be', 'dep': 'aux', 'ctag': 'VBZ'}
# taking {'pos': 'VERB', 'lemma': 'take', 'dep': 'ROOT', 'ctag': 'VBG'}
# off {'pos': 'ADP', 'lemma': 'off', 'dep': 'prt', 'ctag': 'RP'}
# . {'pos': 'PUNCT', 'lemma': '.', 'dep': 'punct', 'ctag': '.'}
```
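
Once tokens carry annotations, they can be queried like any other dictionary. The sketch below collects the lemmas of tokens with a given part-of-speech tag. It assumes token annotations are plain `dict`s with the `pos` and `lemma` keys shown above; `lemmas_with_pos` is a hypothetical helper, and the tokens are stand-ins that reproduce the printed output so that the snippet runs without spaCy.

```python
from types import SimpleNamespace


def lemmas_with_pos(tokens, pos):
    """Return the lemmas of all tokens whose 'pos' annotation equals `pos`."""
    return [t.annotations["lemma"] for t in tokens if t.annotations.get("pos") == pos]


# Stand-in tokens (hypothetical) mirroring the annotated output above.
tokens = [
    SimpleNamespace(surface="A", annotations={"pos": "DET", "lemma": "a"}),
    SimpleNamespace(surface="plane", annotations={"pos": "NOUN", "lemma": "plane"}),
    SimpleNamespace(surface="is", annotations={"pos": "AUX", "lemma": "be"}),
    SimpleNamespace(surface="taking", annotations={"pos": "VERB", "lemma": "take"}),
    SimpleNamespace(surface="off", annotations={"pos": "ADP", "lemma": "off"}),
    SimpleNamespace(surface=".", annotations={"pos": "PUNCT", "lemma": "."}),
]

print(lemmas_with_pos(tokens, "VERB"))
# ['take']
```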
### Using with other libraries
*saf-datasets* provides wrappers for using the datasets with libraries that expect HuggingFace (HF) or torch datasets:
```python
from saf_datasets import CPAEDataSet
from saf_datasets.wrappers.torch import TokenizedDataSet
from transformers import AutoTokenizer
dataset = CPAEDataSet()
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left", add_prefix_space=True)
tok_ds = TokenizedDataSet(dataset, tokenizer, max_len=128, one_hot=False)
print(tok_ds[:10])
# tensor([[50256, 50256, 50256, ..., 2263, 572, 13],
# [50256, 50256, 50256, ..., 2263, 572, 13],
# [50256, 50256, 50256, ..., 781, 1133, 13],
# ...,
# [50256, 50256, 50256, ..., 2712, 19780, 13],
# [50256, 50256, 50256, ..., 2685, 78, 13],
# [50256, 50256, 50256, ..., 2685, 78, 13]])
print(tok_ds[:10].shape)
# torch.Size([10, 128])
```
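
The leading `50256` values in the output above are consistent with left padding using GPT-2's end-of-text token id. The sketch below illustrates that layout in plain Python. It is not the wrapper's implementation: `left_pad`, the choice to truncate by keeping the first `max_len` ids, and the accompanying attention mask are all assumptions made for illustration.

```python
PAD_ID = 50256  # GPT-2 <|endoftext|> id, assumed here to double as the padding id


def left_pad(token_ids, max_len, pad_id=PAD_ID):
    """Truncate to `max_len` ids, then pad on the left up to `max_len`.

    Returns the padded ids and a mask with 1 for real tokens and 0 for padding.
    """
    ids = list(token_ids)[:max_len]  # assumption: keep the first max_len ids
    n_pad = max_len - len(ids)
    padded = [pad_id] * n_pad + ids
    mask = [0] * n_pad + [1] * len(ids)
    return padded, mask


padded, mask = left_pad([2263, 572, 13], max_len=6)
print(padded)
# [50256, 50256, 50256, 2263, 572, 13]
print(mask)
# [0, 0, 0, 1, 1, 1]
```

Left padding keeps the final real token in the last position of every row, which is convenient for decoder-only models that generate from the end of the sequence.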