saf-datasets

Name	saf-datasets
Version	0.6.10
home_page	None
Summary	Data set loading and annotation facilities for the Simple Annotation Framework
upload_time	2024-07-16 23:59:40
maintainer	None
docs_url	None
author	Danilo S. Carvalho
requires_python	>=3.9
license	None
keywords	datasets annotated nlp
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# SAF-Datasets
### Dataset loading and annotation facilities for the Simple Annotation Framework

The *saf-datasets* library provides easy access to Natural Language Processing (NLP) datasets, and tools to facilitate annotation at document, sentence and token levels. 

It is being developed to address a need for flexibility in manipulating NLP annotations that is not entirely covered by popular dataset libraries, such as HuggingFace Datasets and torch Datasets, namely:

- Including and modifying annotations on existing datasets.
- Standardized API.
- Support for complex and multi-level annotations.

*saf-datasets* is built upon the [Simple Annotation Framework (SAF)](https://github.com/dscarvalho/saf) library, which provides its data model and API.

It also provides annotator classes to automatically label existing and new datasets.


## Installation

To install, you can use pip:

```bash
pip install saf-datasets
```

## Usage
### Loading datasets

```python
from saf_datasets import STSBDataSet

dataset = STSBDataSet()
print(len(dataset))  # Size of the dataset
# 17256
print(dataset[0].surface)  # First sentence in the dataset
# A plane is taking off
print([token.surface for token in dataset[0].tokens])  # Tokens (spaCy) of the first sentence.
# ['A', 'plane', 'is', 'taking', 'off', '.']
print(dataset[0].annotations)  # Annotations for the first sentence
# {'split': 'train', 'genre': 'main-captions', 'dataset': 'MSRvid', 'year': '2012test', 'sid': '0001', 'score': '5.000', 'id': 0}

# There are no token annotations in this dataset
print([(tok.surface, tok.annotations) for tok in dataset[0].tokens])
# [('A', {}), ('plane', {}), ('is', {}), ('taking', {}), ('off', {}), ('.', {})]
```

**Available datasets:** AllNLI, CODWOE, CPAE, EntailmentBank, STSB, Wiktionary, WordNet (Filtered).
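
Because sentence-level annotations are exposed as a plain dictionary (see the `annotations` output above), selecting a subset of a dataset can be done with ordinary Python. The following is a minimal sketch, not part of the library: `select_by_annotation` and `mean_score` are hypothetical helpers, and the `SimpleNamespace` objects are stand-ins that only mimic the `surface` and `annotations` attributes so the example runs without downloading any data. With the real library, a loaded dataset would presumably be passed in place of `demo`.

```python
from types import SimpleNamespace


def select_by_annotation(sentences, key, value):
    """Return the sentences whose annotation `key` equals `value`."""
    return [s for s in sentences if s.annotations.get(key) == value]


def mean_score(sentences):
    """Average the 'score' annotation, which is stored as a string in the output above."""
    scores = [float(s.annotations["score"]) for s in sentences if "score" in s.annotations]
    return sum(scores) / len(scores) if scores else None


# Stand-in sentences (illustrative data, not taken from STSB).
demo = [
    SimpleNamespace(surface="A plane is taking off.",
                    annotations={"split": "train", "score": "5.000"}),
    SimpleNamespace(surface="A man is playing a flute.",
                    annotations={"split": "train", "score": "4.000"}),
    SimpleNamespace(surface="A dog runs.",
                    annotations={"split": "dev", "score": "2.000"}),
]

train = select_by_annotation(demo, "split", "train")
print(len(train), mean_score(train))
# 2 4.5
```

Note that the score is converted with `float()` because the example output shows it as the string `'5.000'`.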

### Annotating datasets

```python
from saf_datasets import STSBDataSet
from saf_datasets.annotators import SpacyAnnotator

dataset = STSBDataSet()
annotator = SpacyAnnotator()  # Needs spacy and en_core_web_sm to be installed.
annotator.annotate(dataset)

# Now tokens are annotated
for tok in dataset[0].tokens:
    print(tok.surface, tok.annotations)

# A {'pos': 'DET', 'lemma': 'a', 'dep': 'det', 'ctag': 'DT'}
# plane {'pos': 'NOUN', 'lemma': 'plane', 'dep': 'nsubj', 'ctag': 'NN'}
# is {'pos': 'AUX', 'lemma': 'be', 'dep': 'aux', 'ctag': 'VBZ'}
# taking {'pos': 'VERB', 'lemma': 'take', 'dep': 'ROOT', 'ctag': 'VBG'}
# off {'pos': 'ADP', 'lemma': 'off', 'dep': 'prt', 'ctag': 'RP'}
# . {'pos': 'PUNCT', 'lemma': '.', 'dep': 'punct', 'ctag': '.'}
```
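
Once tokens carry annotations, aggregate statistics can be computed by walking sentences and tokens. The sketch below counts part-of-speech tags with `collections.Counter`. `annotation_histogram` is a hypothetical helper, not a library function, and the stand-in sentence is rebuilt from the annotated output shown above so the example runs without spaCy; it assumes only the `tokens`, `surface` and `annotations` attributes used in this README.

```python
from collections import Counter
from types import SimpleNamespace


def annotation_histogram(sentences, key="pos"):
    """Count the values of one token-level annotation across all sentences."""
    counts = Counter()
    for sent in sentences:
        for tok in sent.tokens:
            value = tok.annotations.get(key)
            if value is not None:  # skip tokens that were never annotated
                counts[value] += 1
    return counts


# Stand-in for dataset[0] after annotation, copied from the output above.
annotated = [("A", "DET"), ("plane", "NOUN"), ("is", "AUX"),
             ("taking", "VERB"), ("off", "ADP"), (".", "PUNCT")]
sentence = SimpleNamespace(
    tokens=[SimpleNamespace(surface=s, annotations={"pos": p}) for s, p in annotated]
)

hist = annotation_histogram([sentence])
print(hist["NOUN"], sum(hist.values()))
# 1 6
```

The same helper works for the other keys shown above (`lemma`, `dep`, `ctag`) by changing the `key` argument.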

### Using with other libraries

*saf-datasets* provides wrappers for using the datasets with libraries that expect HuggingFace (HF) or torch datasets:

```python
from saf_datasets import CPAEDataSet
from saf_datasets.wrappers.torch import TokenizedDataSet
from transformers import AutoTokenizer

dataset = CPAEDataSet()
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left", add_prefix_space=True)
tok_ds = TokenizedDataSet(dataset, tokenizer, max_len=128, one_hot=False)
print(tok_ds[:10])
# tensor([[50256, 50256, 50256,  ...,  2263,   572,    13],
#         [50256, 50256, 50256,  ...,  2263,   572,    13],
#         [50256, 50256, 50256,  ...,   781,  1133,    13],
#         ...,
#         [50256, 50256, 50256,  ...,  2712, 19780,    13],
#         [50256, 50256, 50256,  ...,  2685,    78,    13],
#         [50256, 50256, 50256,  ...,  2685,    78,    13]])

print(tok_ds[:10].shape)
# torch.Size([10, 128])
```
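
The repeated `50256` at the start of each row above is the ID of GPT-2's `<|endoftext|>` token, which here appears to serve as the padding value on the left side (`padding_side="left"`). How `TokenizedDataSet` produces this internally is not shown in this README; the pure-Python sketch below only illustrates the left-padding idea. `left_pad` is a hypothetical helper, and truncating from the front when a sequence is too long is an assumption made for the example.

```python
GPT2_EOS_ID = 50256  # ID of GPT-2's <|endoftext|> token, seen as padding in the output above


def left_pad(token_ids, max_len, pad_id=GPT2_EOS_ID):
    """Left-pad (or truncate) a sequence of token IDs to exactly max_len entries."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    ids = list(token_ids)[-max_len:]  # assumption: keep the trailing tokens when too long
    return [pad_id] * (max_len - len(ids)) + ids


row = left_pad([2263, 572, 13], max_len=8)
print(row)
# [50256, 50256, 50256, 50256, 50256, 2263, 572, 13]
```

Left padding keeps the real tokens at the end of each row, which is the usual arrangement for decoder-only models such as GPT-2 when generating continuations.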

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "saf-datasets",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "datasets, annotated, nlp",
    "author": "Danilo S. Carvalho",
    "author_email": "\"Danilo S. Carvalho\" <danilo.carvalho@manchester.ac.uk>",
    "download_url": "https://files.pythonhosted.org/packages/0c/16/47cbe816c916b9f5b3b9b0e9a57fc1479dd0398fdf159b0ad6b5345c1b41/saf_datasets-0.6.10.tar.gz",
    "platform": null,
    "description": "# SAF-Datasets\n### Dataset loading and annotation facilities for the Simple Annotation Framework\n\nThe *saf-datasets* library provides easy access to Natural Language Processing (NLP) datasets, and tools to facilitate annotation at document, sentence and token levels. \n\nIt is being developed to address a need for flexibility in manipulating NLP annotations that is not entirely covered by popular dataset libraries, such as HuggingFace Datasets and torch Datasets, Namely:\n\n- Including and modifying annotations on existing datasets.\n- Standardized API.\n- Support for complex and multi-level annotations.\n\n*saf-datasets* is built upon the [Simple Annotation Framework (SAF)](https://github.com/dscarvalho/saf) library, which provides its data model and API.\n\nIt also provides annotator classes to automatically label existing and new datasets.\n\n\n## Installation\n\nTo install, you can use pip:\n\n```bash\npip install saf-datasets\n```\n\n## Usage\n### Loading datasets\n\n```python\nfrom saf_datasets import STSBDataSet\n\ndataset = STSBDataSet()\nprint(len(dataset))  # Size of the dataset\n# 17256\nprint(dataset[0].surface)  # First sentence in the dataset\n# A plane is taking off\nprint([token.surface for token in dataset[0].tokens])  # Tokens (SpaCy) of the first sentence.\n# ['A', 'plane', 'is', 'taking', 'off', '.']\nprint(dataset[0].annotations)  # Annotations for the first sentence\n# {'split': 'train', 'genre': 'main-captions', 'dataset': 'MSRvid', 'year': '2012test', 'sid': '0001', 'score': '5.000', 'id': 0}\n\n# There are no token annotations in this dataset\nprint([(tok.surface, tok.annotations) for tok in dataset[0].tokens])\n# [('A', {}), ('plane', {}), ('is', {}), ('taking', {}), ('off', {}), ('.', {})]\n```\n\n**Available datasets:** AllNLI, CODWOE, CPAE, EntailmentBank, STSB, Wiktionary, WordNet (Filtered).\n\n### Annotating datasets\n\n```python\nfrom saf_datasets import STSBDataSet\nfrom saf_datasets.annotators import 
SpacyAnnotator\n\ndataset = STSBDataSet()\nannotator = SpacyAnnotator()  # Needs spacy and en_core_web_sm to be installed.\nannotator.annotate(dataset)\n\n# Now tokens are annotated\nfor tok in dataset[0].tokens:\n    print(tok.surface, tok.annotations)\n\n# A {'pos': 'DET', 'lemma': 'a', 'dep': 'det', 'ctag': 'DT'}\n# plane {'pos': 'NOUN', 'lemma': 'plane', 'dep': 'nsubj', 'ctag': 'NN'}\n# is {'pos': 'AUX', 'lemma': 'be', 'dep': 'aux', 'ctag': 'VBZ'}\n# taking {'pos': 'VERB', 'lemma': 'take', 'dep': 'ROOT', 'ctag': 'VBG'}\n# off {'pos': 'ADP', 'lemma': 'off', 'dep': 'prt', 'ctag': 'RP'}\n# . {'pos': 'PUNCT', 'lemma': '.', 'dep': 'punct', 'ctag': '.'}\n```\n\n### Using with other libraries\n\n*saf-datasets* provides wrappers for using the datasets with libraries expecting HF or torch datasets:\n\n```python\nfrom saf_datasets import CPAEDataSet\nfrom saf_datasets.wrappers.torch import TokenizedDataSet\nfrom transformers import AutoTokenizer\n\ndataset = CPAEDataSet()\ntokenizer = AutoTokenizer.from_pretrained(\"gpt2\", padding_side=\"left\", add_prefix_space=True)\ntok_ds = TokenizedDataSet(dataset, tokenizer, max_len=128, one_hot=False)\nprint(tok_ds[:10])\n# tensor([[50256, 50256, 50256,  ...,  2263,   572,    13],\n#         [50256, 50256, 50256,  ...,  2263,   572,    13],\n#         [50256, 50256, 50256,  ...,   781,  1133,    13],\n#         ...,\n#         [50256, 50256, 50256,  ...,  2712, 19780,    13],\n#         [50256, 50256, 50256,  ...,  2685,    78,    13],\n#         [50256, 50256, 50256,  ...,  2685,    78,    13]])\n\nprint(tok_ds[:10].shape)\n# torch.Size([10, 128])\n```\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "Data set loading and annotation facilities for the Simple Annotation Framework",
    "version": "0.6.10",
    "project_urls": {
        "Homepage": "https://github.com/neuro-symbolic-ai/saf_datasets",
        "Issues": "https://github.com/neuro-symbolic-ai/saf_datasets/issues"
    },
    "split_keywords": [
        "datasets",
        " annotated",
        " nlp"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "98738bd65c88056ecc9769ad8e678f6ec17facef7373e7c36b887b051aaaf839",
                "md5": "08cfe868b3626946553b4ac7abd77f7c",
                "sha256": "0b214fe26d2db315cbc8d00eb6324aa75855d2be421e3be330c248a8c47d97bb"
            },
            "downloads": -1,
            "filename": "saf_datasets-0.6.10-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "08cfe868b3626946553b4ac7abd77f7c",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 37814,
            "upload_time": "2024-07-16T23:59:38",
            "upload_time_iso_8601": "2024-07-16T23:59:38.420312Z",
            "url": "https://files.pythonhosted.org/packages/98/73/8bd65c88056ecc9769ad8e678f6ec17facef7373e7c36b887b051aaaf839/saf_datasets-0.6.10-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0c1647cbe816c916b9f5b3b9b0e9a57fc1479dd0398fdf159b0ad6b5345c1b41",
                "md5": "68582f6294d53595d9492b886a7574f8",
                "sha256": "d29c4373e92dd20a5535ca1dad42c06984826a4714884d73a3a20d330f15c05f"
            },
            "downloads": -1,
            "filename": "saf_datasets-0.6.10.tar.gz",
            "has_sig": false,
            "md5_digest": "68582f6294d53595d9492b886a7574f8",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 30399,
            "upload_time": "2024-07-16T23:59:40",
            "upload_time_iso_8601": "2024-07-16T23:59:40.238835Z",
            "url": "https://files.pythonhosted.org/packages/0c/16/47cbe816c916b9f5b3b9b0e9a57fc1479dd0398fdf159b0ad6b5345c1b41/saf_datasets-0.6.10.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-07-16 23:59:40",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "neuro-symbolic-ai",
    "github_project": "saf_datasets",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "saf-datasets"
}
