saf-datasets

Name: saf-datasets
Version: 0.6.15
Summary: Dataset loading and annotation facilities for the Simple Annotation Framework
Homepage: https://github.com/neuro-symbolic-ai/saf_datasets
Author: Danilo S. Carvalho
Requires Python: >=3.9
Keywords: datasets, annotated, nlp
Upload time: 2025-02-15 16:07:29
Requirements: saf-nlp, spacy, gdown, tqdm, torch, jsonlines, transformers, sentencepiece, protobuf

# SAF-Datasets
### Dataset loading and annotation facilities for the Simple Annotation Framework

The *saf-datasets* library provides easy access to Natural Language Processing (NLP) datasets, and tools to facilitate annotation at the document, sentence, and token levels.

It is being developed to address a need for flexibility in manipulating NLP annotations that is not entirely covered by popular dataset libraries, such as HuggingFace Datasets and torch Datasets, namely:

- Including and modifying annotations on existing datasets.
- Standardized API.
- Support for complex and multi-level annotations.

*saf-datasets* is built upon the [Simple Annotation Framework (SAF)](https://github.com/dscarvalho/saf) library, which provides its data model and API.

It also provides annotator classes to automatically label existing and new datasets.


## Installation

To install, you can use pip:

```bash
pip install saf-datasets
```

## Usage
### Loading datasets

```python
from saf_datasets import STSBDataSet

dataset = STSBDataSet()
print(len(dataset))  # Size of the dataset
# 17256
print(dataset[0].surface)  # First sentence in the dataset
# A plane is taking off
print([token.surface for token in dataset[0].tokens])  # Tokens (SpaCy) of the first sentence.
# ['A', 'plane', 'is', 'taking', 'off', '.']
print(dataset[0].annotations)  # Annotations for the first sentence
# {'split': 'train', 'genre': 'main-captions', 'dataset': 'MSRvid', 'year': '2012test', 'sid': '0001', 'score': '5.000', 'id': 0}

# There are no token annotations in this dataset
print([(tok.surface, tok.annotations) for tok in dataset[0].tokens])
# [('A', {}), ('plane', {}), ('is', {}), ('taking', {}), ('off', {}), ('.', {})]
```
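
Because annotations are exposed as plain dictionaries, one way to add your own labels is to write into them directly. The following is a minimal sketch, assuming the `annotations` dicts are mutable; the `is_caption` key is purely illustrative and not part of the dataset:

```python
from saf_datasets import STSBDataSet

dataset = STSBDataSet()

# Flag sentences from the 'main-captions' genre (key name is illustrative).
for i in range(len(dataset)):
    sentence = dataset[i]
    sentence.annotations["is_caption"] = sentence.annotations.get("genre") == "main-captions"

print(dataset[0].annotations["is_caption"])
# True
```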

**Available datasets:** AllNLI, CODWOE, CPAE, EntailmentBank, STSB, Wiktionary, WordNet (Filtered).

### Annotating datasets

```python
from saf_datasets import STSBDataSet
from saf_datasets.annotators import SpacyAnnotator

dataset = STSBDataSet()
annotator = SpacyAnnotator()  # Requires spaCy and the en_core_web_sm model to be installed.
annotator.annotate(dataset)

# Now tokens are annotated
for tok in dataset[0].tokens:
    print(tok.surface, tok.annotations)

# A {'pos': 'DET', 'lemma': 'a', 'dep': 'det', 'ctag': 'DT'}
# plane {'pos': 'NOUN', 'lemma': 'plane', 'dep': 'nsubj', 'ctag': 'NN'}
# is {'pos': 'AUX', 'lemma': 'be', 'dep': 'aux', 'ctag': 'VBZ'}
# taking {'pos': 'VERB', 'lemma': 'take', 'dep': 'ROOT', 'ctag': 'VBG'}
# off {'pos': 'ADP', 'lemma': 'off', 'dep': 'prt', 'ctag': 'RP'}
# . {'pos': 'PUNCT', 'lemma': '.', 'dep': 'punct', 'ctag': '.'}
```
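
The `SpacyAnnotator` above depends on spaCy's small English model. If `en_core_web_sm` is not yet installed, the standard spaCy download command fetches it:

```bash
python -m spacy download en_core_web_sm
```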

### Using with other libraries

*saf-datasets* provides wrappers for using its datasets with libraries that expect HuggingFace (HF) or torch datasets:

```python
from saf_datasets import CPAEDataSet
from saf_datasets.wrappers.torch import TokenizedDataSet
from transformers import AutoTokenizer

dataset = CPAEDataSet()
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left", add_prefix_space=True)
tok_ds = TokenizedDataSet(dataset, tokenizer, max_len=128, one_hot=False)
print(tok_ds[:10])
# tensor([[50256, 50256, 50256,  ...,  2263,   572,    13],
#         [50256, 50256, 50256,  ...,  2263,   572,    13],
#         [50256, 50256, 50256,  ...,   781,  1133,    13],
#         ...,
#         [50256, 50256, 50256,  ...,  2712, 19780,    13],
#         [50256, 50256, 50256,  ...,  2685,    78,    13],
#         [50256, 50256, 50256,  ...,  2685,    78,    13]])

print(tok_ds[:10].shape)
# torch.Size([10, 128])
```
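
Since `TokenizedDataSet` indexes like a torch dataset, it should also be usable with a standard `DataLoader` for batching. A minimal sketch, assuming `TokenizedDataSet` supports `len()` and integer indexing as `torch.utils.data.Dataset` requires:

```python
from torch.utils.data import DataLoader

# Batch the tokenized dataset; the default collate function stacks
# the per-item tensors into a [batch_size, max_len] tensor.
loader = DataLoader(tok_ds, batch_size=32, shuffle=True)
for batch in loader:
    print(batch.shape)  # expected: torch.Size([32, 128]) for full batches
    break
```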

            
