| Field | Value |
| --- | --- |
| Name | saf-datasets |
| Version | 0.6.10 |
| home_page | None |
| Summary | Data set loading and annotation facilities for the Simple Annotation Framework |
| upload_time | 2024-07-16 23:59:40 |
| maintainer | None |
| docs_url | None |
| author | Danilo S. Carvalho |
| requires_python | >=3.9 |
| license | None |
| keywords | datasets, annotated, nlp |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# SAF-Datasets
### Dataset loading and annotation facilities for the Simple Annotation Framework
The *saf-datasets* library provides easy access to Natural Language Processing (NLP) datasets, and tools to facilitate annotation at document, sentence and token levels.
It is being developed to address a need for flexibility in manipulating NLP annotations that popular dataset libraries, such as HuggingFace Datasets and torch Datasets, do not entirely cover, namely:
- Including and modifying annotations on existing datasets.
- Standardized API.
- Support for complex and multi-level annotations.
*saf-datasets* is built upon the [Simple Annotation Framework (SAF)](https://github.com/dscarvalho/saf) library, which provides its data model and API.
*saf-datasets* also provides annotator classes to automatically label existing and new datasets.
## Installation
To install, you can use pip:
```bash
pip install saf-datasets
```
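The `SpacyAnnotator` used in the examples below additionally requires spaCy and its `en_core_web_sm` model, which are not pulled in automatically (no requirements are recorded for the package). The standard spaCy installation commands are:
```bash
pip install spacy
python -m spacy download en_core_web_sm
```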
## Usage
### Loading datasets
```python
from saf_datasets import STSBDataSet
dataset = STSBDataSet()
print(len(dataset)) # Size of the dataset
# 17256
print(dataset[0].surface) # First sentence in the dataset
# A plane is taking off
print([token.surface for token in dataset[0].tokens]) # Tokens (SpaCy) of the first sentence.
# ['A', 'plane', 'is', 'taking', 'off', '.']
print(dataset[0].annotations) # Annotations for the first sentence
# {'split': 'train', 'genre': 'main-captions', 'dataset': 'MSRvid', 'year': '2012test', 'sid': '0001', 'score': '5.000', 'id': 0}
# There are no token annotations in this dataset
print([(tok.surface, tok.annotations) for tok in dataset[0].tokens])
# [('A', {}), ('plane', {}), ('is', {}), ('taking', {}), ('off', {}), ('.', {})]
```
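Because annotations are exposed as plain dictionaries at both the sentence and token level, annotations on an existing dataset can be added or modified in place. The following is a minimal sketch using only the accessors shown above; the `length_class` key is an illustrative label, not an annotation shipped with STSB:
```python
from saf_datasets import STSBDataSet

dataset = STSBDataSet()

# Select the training split using the sentence-level 'split' annotation.
train_idx = [i for i in range(len(dataset)) if dataset[i].annotations["split"] == "train"]

# Attach a custom sentence-level annotation (hypothetical 'length_class' key).
for i in train_idx:
    sentence = dataset[i]
    sentence.annotations["length_class"] = "short" if len(sentence.tokens) < 10 else "long"

print(dataset[train_idx[0]].annotations)
```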
**Available datasets:** AllNLI, CODWOE, CPAE, EntailmentBank, STSB, Wiktionary, WordNet (Filtered).
### Annotating datasets
```python
from saf_datasets import STSBDataSet
from saf_datasets.annotators import SpacyAnnotator
dataset = STSBDataSet()
annotator = SpacyAnnotator() # Needs spacy and en_core_web_sm to be installed.
annotator.annotate(dataset)
# Now tokens are annotated
for tok in dataset[0].tokens:
print(tok.surface, tok.annotations)
# A {'pos': 'DET', 'lemma': 'a', 'dep': 'det', 'ctag': 'DT'}
# plane {'pos': 'NOUN', 'lemma': 'plane', 'dep': 'nsubj', 'ctag': 'NN'}
# is {'pos': 'AUX', 'lemma': 'be', 'dep': 'aux', 'ctag': 'VBZ'}
# taking {'pos': 'VERB', 'lemma': 'take', 'dep': 'ROOT', 'ctag': 'VBG'}
# off {'pos': 'ADP', 'lemma': 'off', 'dep': 'prt', 'ctag': 'RP'}
# . {'pos': 'PUNCT', 'lemma': '.', 'dep': 'punct', 'ctag': '.'}
```
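Once tokens carry SpaCy annotations, they can be consumed like any other token attribute. A small sketch, assuming the dataset and annotation keys from the example above, that counts the most frequent noun lemmas in the first 100 sentences:
```python
from collections import Counter
from saf_datasets import STSBDataSet
from saf_datasets.annotators import SpacyAnnotator

dataset = STSBDataSet()
SpacyAnnotator().annotate(dataset)

# Count noun lemmas across the first 100 sentences, using the 'pos' and
# 'lemma' token annotations produced by SpacyAnnotator.
noun_lemmas = Counter(
    tok.annotations["lemma"]
    for i in range(min(100, len(dataset)))
    for tok in dataset[i].tokens
    if tok.annotations.get("pos") == "NOUN"
)
print(noun_lemmas.most_common(5))
```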
### Using with other libraries
*saf-datasets* provides wrappers for using the datasets with libraries expecting HF or torch datasets:
```python
from saf_datasets import CPAEDataSet
from saf_datasets.wrappers.torch import TokenizedDataSet
from transformers import AutoTokenizer
dataset = CPAEDataSet()
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left", add_prefix_space=True)
tok_ds = TokenizedDataSet(dataset, tokenizer, max_len=128, one_hot=False)
print(tok_ds[:10])
# tensor([[50256, 50256, 50256, ..., 2263, 572, 13],
# [50256, 50256, 50256, ..., 2263, 572, 13],
# [50256, 50256, 50256, ..., 781, 1133, 13],
# ...,
# [50256, 50256, 50256, ..., 2712, 19780, 13],
# [50256, 50256, 50256, ..., 2685, 78, 13],
# [50256, 50256, 50256, ..., 2685, 78, 13]])
print(tok_ds[:10].shape)
# torch.Size([10, 128])
```
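Continuing from the snippet above, and assuming `TokenizedDataSet` follows the usual map-style dataset protocol (integer indexing returning a fixed-length tensor, as the slicing example suggests), it can plausibly be batched with a regular torch `DataLoader`:
```python
from torch.utils.data import DataLoader

# tok_ds is the TokenizedDataSet from the snippet above. With len() and
# integer indexing available, the default collate function stacks items
# into (batch_size, max_len) tensors.
loader = DataLoader(tok_ds, batch_size=32, shuffle=True)

for batch in loader:
    print(batch.shape)  # e.g. torch.Size([32, 128])
    break
```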
Raw data

```json
{
"_id": null,
"home_page": null,
"name": "saf-datasets",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "datasets, annotated, nlp",
"author": "Danilo S. Carvalho",
"author_email": "\"Danilo S. Carvalho\" <danilo.carvalho@manchester.ac.uk>",
"download_url": "https://files.pythonhosted.org/packages/0c/16/47cbe816c916b9f5b3b9b0e9a57fc1479dd0398fdf159b0ad6b5345c1b41/saf_datasets-0.6.10.tar.gz",
"platform": null,
"description": "# SAF-Datasets\n### Dataset loading and annotation facilities for the Simple Annotation Framework\n\nThe *saf-datasets* library provides easy access to Natural Language Processing (NLP) datasets, and tools to facilitate annotation at document, sentence and token levels. \n\nIt is being developed to address a need for flexibility in manipulating NLP annotations that is not entirely covered by popular dataset libraries, such as HuggingFace Datasets and torch Datasets, Namely:\n\n- Including and modifying annotations on existing datasets.\n- Standardized API.\n- Support for complex and multi-level annotations.\n\n*saf-datasets* is built upon the [Simple Annotation Framework (SAF)](https://github.com/dscarvalho/saf) library, which provides its data model and API.\n\nIt also provides annotator classes to automatically label existing and new datasets.\n\n\n## Installation\n\nTo install, you can use pip:\n\n```bash\npip install saf-datasets\n```\n\n## Usage\n### Loading datasets\n\n```python\nfrom saf_datasets import STSBDataSet\n\ndataset = STSBDataSet()\nprint(len(dataset)) # Size of the dataset\n# 17256\nprint(dataset[0].surface) # First sentence in the dataset\n# A plane is taking off\nprint([token.surface for token in dataset[0].tokens]) # Tokens (SpaCy) of the first sentence.\n# ['A', 'plane', 'is', 'taking', 'off', '.']\nprint(dataset[0].annotations) # Annotations for the first sentence\n# {'split': 'train', 'genre': 'main-captions', 'dataset': 'MSRvid', 'year': '2012test', 'sid': '0001', 'score': '5.000', 'id': 0}\n\n# There are no token annotations in this dataset\nprint([(tok.surface, tok.annotations) for tok in dataset[0].tokens])\n# [('A', {}), ('plane', {}), ('is', {}), ('taking', {}), ('off', {}), ('.', {})]\n```\n\n**Available datasets:** AllNLI, CODWOE, CPAE, EntailmentBank, STSB, Wiktionary, WordNet (Filtered).\n\n### Annotating datasets\n\n```python\nfrom saf_datasets import STSBDataSet\nfrom saf_datasets.annotators import SpacyAnnotator\n\ndataset = STSBDataSet()\nannotator = SpacyAnnotator() # Needs spacy and en_core_web_sm to be installed.\nannotator.annotate(dataset)\n\n# Now tokens are annotated\nfor tok in dataset[0].tokens:\n print(tok.surface, tok.annotations)\n\n# A {'pos': 'DET', 'lemma': 'a', 'dep': 'det', 'ctag': 'DT'}\n# plane {'pos': 'NOUN', 'lemma': 'plane', 'dep': 'nsubj', 'ctag': 'NN'}\n# is {'pos': 'AUX', 'lemma': 'be', 'dep': 'aux', 'ctag': 'VBZ'}\n# taking {'pos': 'VERB', 'lemma': 'take', 'dep': 'ROOT', 'ctag': 'VBG'}\n# off {'pos': 'ADP', 'lemma': 'off', 'dep': 'prt', 'ctag': 'RP'}\n# . {'pos': 'PUNCT', 'lemma': '.', 'dep': 'punct', 'ctag': '.'}\n```\n\n### Using with other libraries\n\n*saf-datasets* provides wrappers for using the datasets with libraries expecting HF or torch datasets:\n\n```python\nfrom saf_datasets import CPAEDataSet\nfrom saf_datasets.wrappers.torch import TokenizedDataSet\nfrom transformers import AutoTokenizer\n\ndataset = CPAEDataSet()\ntokenizer = AutoTokenizer.from_pretrained(\"gpt2\", padding_side=\"left\", add_prefix_space=True)\ntok_ds = TokenizedDataSet(dataset, tokenizer, max_len=128, one_hot=False)\nprint(tok_ds[:10])\n# tensor([[50256, 50256, 50256, ..., 2263, 572, 13],\n# [50256, 50256, 50256, ..., 2263, 572, 13],\n# [50256, 50256, 50256, ..., 781, 1133, 13],\n# ...,\n# [50256, 50256, 50256, ..., 2712, 19780, 13],\n# [50256, 50256, 50256, ..., 2685, 78, 13],\n# [50256, 50256, 50256, ..., 2685, 78, 13]])\n\nprint(tok_ds[:10].shape)\n# torch.Size([10, 128])\n```\n",
"bugtrack_url": null,
"license": null,
"summary": "Data set loading and annotation facilities for the Simple Annotation Framework",
"version": "0.6.10",
"project_urls": {
"Homepage": "https://github.com/neuro-symbolic-ai/saf_datasets",
"Issues": "https://github.com/neuro-symbolic-ai/saf_datasets/issues"
},
"split_keywords": [
"datasets",
" annotated",
" nlp"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "98738bd65c88056ecc9769ad8e678f6ec17facef7373e7c36b887b051aaaf839",
"md5": "08cfe868b3626946553b4ac7abd77f7c",
"sha256": "0b214fe26d2db315cbc8d00eb6324aa75855d2be421e3be330c248a8c47d97bb"
},
"downloads": -1,
"filename": "saf_datasets-0.6.10-py3-none-any.whl",
"has_sig": false,
"md5_digest": "08cfe868b3626946553b4ac7abd77f7c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 37814,
"upload_time": "2024-07-16T23:59:38",
"upload_time_iso_8601": "2024-07-16T23:59:38.420312Z",
"url": "https://files.pythonhosted.org/packages/98/73/8bd65c88056ecc9769ad8e678f6ec17facef7373e7c36b887b051aaaf839/saf_datasets-0.6.10-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "0c1647cbe816c916b9f5b3b9b0e9a57fc1479dd0398fdf159b0ad6b5345c1b41",
"md5": "68582f6294d53595d9492b886a7574f8",
"sha256": "d29c4373e92dd20a5535ca1dad42c06984826a4714884d73a3a20d330f15c05f"
},
"downloads": -1,
"filename": "saf_datasets-0.6.10.tar.gz",
"has_sig": false,
"md5_digest": "68582f6294d53595d9492b886a7574f8",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 30399,
"upload_time": "2024-07-16T23:59:40",
"upload_time_iso_8601": "2024-07-16T23:59:40.238835Z",
"url": "https://files.pythonhosted.org/packages/0c/16/47cbe816c916b9f5b3b9b0e9a57fc1479dd0398fdf159b0ad6b5345c1b41/saf_datasets-0.6.10.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-07-16 23:59:40",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "neuro-symbolic-ai",
"github_project": "saf_datasets",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "saf-datasets"
}
```