PET dataset reader
######################
A structured interface to interact with the `PET-dataset`_ hosted on HuggingFace.
.. _PET-dataset: https://huggingface.co/datasets/patriziobellan/PET
Created by `Patrizio Bellan`_.
.. _Patrizio Bellan: https://pdi.fbk.eu/bellan/
=================
Interacting with the data hosted on HuggingFace can be difficult because the data has a strict format.
For example, getting the list of PET activities of a PET document requires a user to write a custom script that scans the dataset, extracts the words and their NER tags, and combines them.
In addition, a document is spread across several samples of the HuggingFace dataset, and these samples are not always contiguous. Thus, conducting experiments with the `PET Dataset <https://huggingface.co/datasets/patriziobellan/PET>`_ can become a time-intensive operation.
To alleviate these difficulties, we developed the *PET dataset reader*, a Python package that makes interacting with the dataset easy.
The package is composed of three modules: the **TokenClassification** module, the **RelationExtraction** module, and the **ProcessInformation** module.
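
To make the manual effort described above concrete, here is a minimal sketch of the kind of custom script a user would otherwise have to write: it scans the tokens and BIO-style tags of one sentence and merges them into text spans of a given category. The toy sentence, the tag strings, and the helper name ``extract_spans`` are illustrative assumptions; they are not part of the package or taken from the dataset.

```python
from typing import List


def extract_spans(tokens: List[str], tags: List[str], label: str) -> List[str]:
    """Merge consecutive B-/I- tagged tokens of one label into text spans."""
    spans: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == f"B-{label}":
            # A new span starts; close the span in progress, if any.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == f"I-{label}" and current:
            # Continue the span that is currently open.
            current.append(token)
        else:
            # Any other tag closes the open span.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Hypothetical sentence and tags, for illustration only.
tokens = ["The", "clerk", "fills", "out", "the", "form", "and", "sends", "it"]
tags = [
    "B-Actor", "I-Actor", "B-Activity", "I-Activity",
    "B-Activity Data", "I-Activity Data", "O", "B-Activity", "B-Activity Data",
]
activities = extract_spans(tokens, tags, "Activity")
print(activities)
```

The same helper returns the activity data spans when called with ``"Activity Data"`` as the label.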
TokenClassification Module
****************************************
This module consists of a Python class that allows users to extract structured information at the token level.
This class has specific methods to get all the PET elements of a specific category. We briefly introduce the principal methods implemented in this module.
#. **GetDocumentNames** This method returns a list of the document names of the dataset.
#. **GetDocumentText** This method returns the textual description of a document.
#. **GetTokens** This method returns the sentence with a given sentence ID as a list of words.
#. **GetNerTagLabels** This method provides the list of NER tags of a sentence, a document, or the entire dataset. Since the NER tags are stored as numbers in the dataset, we created specific methods to convert a number into a textual tag. For example, the method *GetPrefixAndLabel* returns the NER marker (B, I, or O) and the tag text (e.g., Activity) of a specific NER tag number.
#. **Statistics** This method provides the statistics about the PET elements annotated.
In addition, specific methods were implemented to get the list of elements of a given category. For example, the method *GetActivity* returns all the *PET Activity* elements of a specific document or of the entire dataset. Similarly, the method *GetActivityData* returns the *PET Activity Data* elements.
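
The number-to-tag conversion mentioned above can be sketched as follows. The ``TAG_NAMES`` list is a hypothetical, shortened mapping used only for illustration (the dataset defines its own order and its full tag set), and ``split_prefix_and_label`` is a stand-in helper, not the package's *GetPrefixAndLabel* implementation.

```python
from typing import List, Tuple

# Hypothetical tag-ID order, for illustration only; the real dataset
# defines its own mapping from numbers to tag names.
TAG_NAMES: List[str] = ["O", "B-Activity", "I-Activity", "B-Actor", "I-Actor"]


def split_prefix_and_label(tag_id: int) -> Tuple[str, str]:
    """Return the BIO marker and the tag text for a numeric tag ID."""
    name = TAG_NAMES[tag_id]
    if name == "O":
        # Tokens outside any annotation carry no label text.
        return "O", ""
    prefix, label = name.split("-", 1)
    return prefix, label


print(split_prefix_and_label(1))
print(split_prefix_and_label(0))
```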
RelationExtraction Module
****************************************
This module consists of a Python class that allows users to extract structured information about the PET relations annotated in the dataset, e.g., the *PET Uses* relation.
This class has specific methods to get all the PET relations of a specific category. We briefly introduce the principal methods implemented in this module.
#. **GetNerLabels** This method returns the NER tag IDs of a given document.
#. **GetRelations** This method provides a list of PET relations of a given document.
#. **GetSentencesWithIdsAndNerTagLabels** This method provides a user with a list of sentences composed of word tokens and the corresponding NER tags.
#. **Statistics** This method provides the statistics about the PET relations.
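
As a rough illustration of what category-specific access and statistics over relations amount to, the sketch below filters a list of relations by type and counts them per type. The ``Relation`` tuple, its field names, the direction of each relation, and the sample values are assumptions made for this example; the structures returned by **GetRelations** may differ.

```python
from collections import Counter
from typing import List, NamedTuple


class Relation(NamedTuple):
    """Hypothetical relation record: source span, relation type, target span."""
    source: str
    relation_type: str
    target: str


def relations_of_type(relations: List[Relation], relation_type: str) -> List[Relation]:
    """Keep only the relations of the requested type."""
    return [r for r in relations if r.relation_type == relation_type]


def relation_statistics(relations: List[Relation]) -> Counter:
    """Count how many relations of each type are present."""
    return Counter(r.relation_type for r in relations)


# Hypothetical sample relations, for illustration only.
relations = [
    Relation("fills out", "Uses", "the form"),
    Relation("fills out", "Actor Performer", "The clerk"),
    Relation("fills out", "Flow", "sends"),
    Relation("sends", "Uses", "it"),
]
uses = relations_of_type(relations, "Uses")
stats = relation_statistics(relations)
print(uses)
print(stats)
```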
ProcessInformation Module
****************************************
This module contains the methods that produce a structured representation of a document in the form of a graph, e.g., a Directly Follows Graph.
The module has six main methods:
#. **GetRawActivityLabels** returns the activity labels (PET Activity + PET Activity Data) as they are annotated in the text.
#. **GetDFG** returns the Directly Follows Graph (DFG) representation of the annotations of a document. This graph is composed of behavioral elements only.
#. **GetKG_DFGActivityData** provides the DFG representation of a document enhanced with the *PET Activity Data* elements.
#. **GetKG_DFGPerformsActors** provides the DFG representation of a document enhanced with the *Actor Performer* information.
#. **GetPerformsActors** returns a graph representation of the DFG of a document enhanced with *Actor Performer* relations.
#. **GetKnowledgeGraph** returns a graph representation of a document representing the information about the behavioral elements, the activity data elements, and the actor performer elements.
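
A Directly Follows Graph can be pictured as a mapping from each behavioral element to the elements that directly follow it. The sketch below builds such a mapping from (source, target) flow pairs and then attaches performer information to each node. The input format, the function names, and the sample process are assumptions for illustration; they do not reflect the return types of **GetDFG** or **GetKnowledgeGraph**.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def build_dfg(flows: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Build an adjacency mapping: element -> elements that directly follow it."""
    dfg: Dict[str, List[str]] = defaultdict(list)
    for source, target in flows:
        if target not in dfg[source]:
            dfg[source].append(target)
        # Make sure elements without successors also appear as nodes.
        dfg.setdefault(target, [])
    return dict(dfg)


def add_performers(
    dfg: Dict[str, List[str]], performers: Dict[str, str]
) -> Dict[str, Dict[str, object]]:
    """Attach the performing actor (or None) to every node of the graph."""
    enriched: Dict[str, Dict[str, object]] = {}
    for node, successors in dfg.items():
        performer: Optional[str] = performers.get(node)
        enriched[node] = {"performer": performer, "next": list(successors)}
    return enriched


# Hypothetical process fragment, for illustration only.
flows = [
    ("receive order", "check stock"),
    ("check stock", "XOR"),
    ("XOR", "ship goods"),
    ("XOR", "reject order"),
]
performers = {"receive order": "clerk", "check stock": "warehouse"}

dfg = build_dfg(flows)
graph = add_performers(dfg, performers)
print(dfg)
print(graph["check stock"])
```

Keeping the plain graph and the enriched graph as separate steps mirrors the distinction the module draws between a DFG of behavioral elements only and its enhanced variants.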
How to Load the PET dataset
*********************************************
**Token-classification task**

.. code-block:: python

    from datasets import load_dataset

    modelhub_dataset = load_dataset("patriziobellan/PET", name='token-classification')

**Relations-extraction task**

.. code-block:: python

    from datasets import load_dataset

    modelhub_dataset = load_dataset("patriziobellan/PET", name='relations-extraction')

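
If, as described earlier, one document is spread over several samples that are not always contiguous, a typical first step after loading is to regroup the samples per document. The sketch below does this on plain dictionaries so that it runs without downloading anything. The field names ``document name``, ``sentence-ID``, and ``tokens`` are assumptions about the sample layout and should be checked against the loaded dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_document(samples: Iterable[dict]) -> Dict[str, List[List[str]]]:
    """Collect the sentences of each document, ordered by sentence ID."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for sample in samples:
        grouped[sample["document name"]].append(sample)
    return {
        name: [row["tokens"] for row in sorted(rows, key=lambda r: r["sentence-ID"])]
        for name, rows in grouped.items()
    }


# Hypothetical samples in the assumed layout, deliberately out of order.
samples = [
    {"document name": "doc-A", "sentence-ID": 1, "tokens": ["It", "ships", "goods"]},
    {"document name": "doc-B", "sentence-ID": 0, "tokens": ["A", "claim", "arrives"]},
    {"document name": "doc-A", "sentence-ID": 0, "tokens": ["An", "order", "arrives"]},
]
documents = group_by_document(samples)
print(documents["doc-A"])
```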
..
Raw data
{
"_id": null,
"home_page": "https://pdi.fbk.eu/pet-dataset",
"name": "petdatasetreader",
"maintainer": "Patrizio Bellan",
"docs_url": null,
"requires_python": null,
"maintainer_email": "patrizio.bellan@gmail.com",
"keywords": "huggingface, PET, dataset, process extraction from text, natural language processing, nlp, business process management, bpm",
"author": "Patrizio Bellan",
"author_email": "patrizio.bellan@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/ca/c5/ebd19aef1363df7e536e3c01400f485c23bac47427b49cbec91709ee1d1c/petdatasetreader-0.0.2.tar.gz",
"platform": "Any",
"bugtrack_url": null,
"license": "MIT",
"summary": "Convenient interface that provides structured representations of the PET dataset hosted on Huggingface",
"version": "0.0.2",
"project_urls": {
"Homepage": "https://pdi.fbk.eu/pet-dataset"
},
"split_keywords": [
"huggingface",
" pet",
" dataset",
" process extraction from text",
" natural language processing",
" nlp",
" business process management",
" bpm"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "cac5ebd19aef1363df7e536e3c01400f485c23bac47427b49cbec91709ee1d1c",
"md5": "fb55b5e82bc4b5968e2eb527a13e041b",
"sha256": "6b8cbe23f511228e6182571d71556ef5b97886872a5d878be46b274dad6e9645"
},
"downloads": -1,
"filename": "petdatasetreader-0.0.2.tar.gz",
"has_sig": false,
"md5_digest": "fb55b5e82bc4b5968e2eb527a13e041b",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 9783,
"upload_time": "2024-05-23T09:30:48",
"upload_time_iso_8601": "2024-05-23T09:30:48.021889Z",
"url": "https://files.pythonhosted.org/packages/ca/c5/ebd19aef1363df7e536e3c01400f485c23bac47427b49cbec91709ee1d1c/petdatasetreader-0.0.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-05-23 09:30:48",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "petdatasetreader"
}