petdatasetreader

Name	petdatasetreader
Version	0.0.2
home_page	https://pdi.fbk.eu/pet-dataset
Summary	Convenient interface that provides structured representations of the PET dataset hosted on Huggingface
upload_time	2024-05-23 09:30:48
maintainer	Patrizio Bellan
docs_url	None
author	Patrizio Bellan
requires_python	None
license	MIT
keywords	huggingface pet dataset process extraction from text natural language processing nlp business process management bpm

PET dataset reader
######################


A structured interface for working with the `PET-dataset`_ hosted on Hugging Face.

.. _PET-dataset: https://huggingface.co/datasets/patriziobellan/PET



Created by `Patrizio Bellan`_.

.. _Patrizio Bellan: https://pdi.fbk.eu/bellan/

=================

Working directly with the data hosted on Hugging Face can be difficult because the data follows a strict format.
For example, getting the list of PET activities of a document requires a custom script that scans the dataset, extracts the words and their NER tags, and combines them.
In addition, a document is spread over several samples of the Hugging Face dataset, and these samples are not always contiguous. As a result, running experiments with the `PET Dataset <https://huggingface.co/datasets/patriziobellan/PET>`_ can become time-consuming.
To alleviate these difficulties, we developed the *PET dataset reader*, a Python package that simplifies interaction with the dataset.
The package consists of three modules: **TokenClassification**, **RelationExtraction**, and **ProcessInformation**.
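To give an idea of the custom script mentioned above, here is a minimal sketch that combines tokens and BIO-style NER tags into element spans. It does not use the package's API: the function name ``extract_spans`` and the example sentence are hypothetical, and the tags are assumed to be already converted from numbers to text.

```python
from typing import List


def extract_spans(tokens: List[str], tags: List[str], label: str) -> List[str]:
    """Collect the text of every span tagged with `label` in BIO notation."""
    spans: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == f"B-{label}":
            # A new span starts; close any span that is still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == f"I-{label}" and current:
            # The open span continues.
            current.append(token)
        else:
            # Any other tag ends the open span.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Hypothetical sentence and tags, not taken from the dataset.
tokens = ["The", "clerk", "checks", "the", "invoice", "and", "sends", "it"]
tags = [
    "B-Actor", "I-Actor", "B-Activity", "B-Activity Data",
    "I-Activity Data", "O", "B-Activity", "B-Activity Data",
]

print(extract_spans(tokens, tags, "Activity"))
print(extract_spans(tokens, tags, "Activity Data"))
```

With the real dataset, the same loop would additionally have to gather all samples that belong to one document, which is the bookkeeping the package is meant to hide.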

TokenClassification Module
****************************************

This module provides a Python class for extracting structured information at the token level.
The class has dedicated methods for retrieving all PET elements of a given category. The principal methods are briefly introduced below.

#. **GetDocumentNames**  This method returns a list of the document names of the dataset.

#. **GetDocumentText**  This method returns the textual description of a document.

#. **GetTokens**  This method returns the sentence with a given sentence ID as a list of words.

#. **GetNerTagLabels**  This method provides the list of NER tags of a sentence, a document, or the entire dataset. Since the NER tags are stored as numbers in the dataset, we created specific methods to convert a number into a textual tag. For example, the method **GetPrefixAndLabel** returns the NER marker (B, I, or O) and the tag text (e.g., Activity) of a specific NER tag number.

#. **Statistics**  This method provides the statistics about the PET elements annotated.

In addition, specific methods were implemented to get the list of elements of a given category. For example, the method **GetActivity** returns all the PET Activity elements of a specific document or of the entire dataset. Similarly, the method **GetActivityData** returns the PET Activity Data elements.
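The conversion from a numeric NER tag to a marker and a tag text can be sketched as follows. This is an illustration only, not the package's implementation: ``split_prefix_and_label`` is a hypothetical name, and ``TAG_NAMES`` is a made-up, shortened table whose order does not necessarily match the dataset.

```python
from typing import List, Tuple

# Hypothetical ID-to-tag table; the real table and its order are defined by the dataset.
TAG_NAMES: List[str] = ["O", "B-Actor", "I-Actor", "B-Activity", "I-Activity"]


def split_prefix_and_label(tag_id: int, tag_names: List[str] = TAG_NAMES) -> Tuple[str, str]:
    """Return the BIO marker and the tag text for a numeric tag ID."""
    tag = tag_names[tag_id]
    if tag == "O":
        # Tokens outside any element carry no tag text.
        return "O", ""
    # Split only on the first hyphen so that the label itself is kept whole.
    prefix, _, label = tag.partition("-")
    return prefix, label


print(split_prefix_and_label(3))
print(split_prefix_and_label(0))
```

When a dataset stores its tags as a Hugging Face ``ClassLabel`` feature, a table like ``TAG_NAMES`` can usually be read from that feature's ``names`` attribute rather than written by hand.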


RelationExtraction Module
****************************************

This module provides a Python class for extracting structured information about the PET relations annotated in the dataset, e.g., the *PET Uses* relation.
The class has dedicated methods for retrieving all PET relations of a given category. The principal methods are briefly introduced below.

#. **GetNerLabels** This method returns the NER tag IDs of a given document.

#. **GetRelations**  This method provides a list of PET relations of a given document.

#. **GetSentencesWithIdsAndNerTagLabels** This method returns a list of sentences, each composed of its word tokens and the corresponding NER tags.

#. **Statistics**  This method provides the statistics about the PET relations.
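As a rough illustration of what filtering relations by category and computing relation statistics involve, the sketch below works on relations written as ``(source, relation type, target)`` triples. This representation, the example triples, and the function names are assumptions made for the example; the package's own data structures may differ.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical representation: (source text, relation type, target text).
Relation = Tuple[str, str, str]

relations: List[Relation] = [
    ("checks", "Uses", "the invoice"),
    ("The clerk", "Actor Performer", "checks"),
    ("checks", "Flow", "sends"),
    ("sends", "Uses", "it"),
]


def relations_of_type(items: List[Relation], relation_type: str) -> List[Relation]:
    """Keep only the relations of the requested type, in their original order."""
    return [item for item in items if item[1] == relation_type]


def relation_statistics(items: List[Relation]) -> Counter:
    """Count how many relations of each type are present."""
    return Counter(item[1] for item in items)


print(relations_of_type(relations, "Uses"))
print(relation_statistics(relations))
```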


ProcessInformation Module
****************************************
This module contains the methods that build a structured representation of a document in the form of a graph, e.g., a Directly-Follows Graph (DFG).
The module has six main methods:

#. **GetRawActivityLabels** returns the activity labels (PET Activity + PET Activity Data) as they are annotated in the text.

#. **GetDFG** returns the directly-follows graph (DFG) representation of the annotations of a document. This graph is composed of behavioral elements only.

#. **GetKG_DFGActivityData** provides the DFG representation of a document enhanced with the PET Activity Data elements.

#. **GetKG_DFGPerformsActors** provides the DFG representation of a document enhanced with the Actor Performer information.

#. **GetPerformsActors** returns a graph representation of the DFG of a document enhanced with Actor Performer relations.

#. **GetKnowledgeGraph** returns a graph representation of a document representing the information about the behavioral elements, the activity data elements, and the actor performer elements.
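For readers unfamiliar with the term, a directly-follows graph connects each behavioral element to the elements that can come immediately after it. The sketch below builds such a graph as a plain adjacency dictionary from a list of flow pairs. The function ``build_dfg``, the input format, and the example activities are hypothetical, and the graph objects returned by the package may look different.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_dfg(flows: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Build an adjacency-list DFG: each node maps to the nodes that directly follow it."""
    graph: Dict[str, List[str]] = defaultdict(list)
    for source, target in flows:
        # Record each edge once, even if the same flow is listed twice.
        if target not in graph[source]:
            graph[source].append(target)
        # Make sure nodes without outgoing edges also appear as keys.
        graph.setdefault(target, [])
    return dict(graph)


# Hypothetical flow pairs, not taken from the dataset.
flows = [
    ("receive order", "check stock"),
    ("check stock", "ship order"),
    ("check stock", "reject order"),
]

dfg = build_dfg(flows)
for node, successors in dfg.items():
    print(node, "->", successors)
```

An enhanced graph in the spirit of the methods above could be obtained by attaching extra attributes (for example, the performing actor) to each node, which is left out here for brevity.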



How to Load the PET dataset 
*********************************************

**Token-classification task**

.. code-block:: python
    
    from datasets import load_dataset
    
    modelhub_dataset = load_dataset("patriziobellan/PET", name='token-classification')


**Relations-extraction task**

.. code-block:: python

    from datasets import load_dataset 

    modelhub_dataset = load_dataset("patriziobellan/PET", name='relations-extraction')
..

Raw data

{
    "_id": null,
    "home_page": "https://pdi.fbk.eu/pet-dataset",
    "name": "petdatasetreader",
    "maintainer": "Patrizio Bellan",
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": "patrizio.bellan@gmail.com",
    "keywords": "huggingface, PET, dataset, process extraction from text, natural language processing, nlp, business process management, bpm",
    "author": "Patrizio Bellan",
    "author_email": "patrizio.bellan@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/ca/c5/ebd19aef1363df7e536e3c01400f485c23bac47427b49cbec91709ee1d1c/petdatasetreader-0.0.2.tar.gz",
    "platform": "Any",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Convenient interface that provides structured representations of the PET dataset hosted on Huggingface",
    "version": "0.0.2",
    "project_urls": {
        "Homepage": "https://pdi.fbk.eu/pet-dataset"
    },
    "split_keywords": [
        "huggingface",
        " pet",
        " dataset",
        " process extraction from text",
        " natural language processing",
        " nlp",
        " business process management",
        " bpm"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "cac5ebd19aef1363df7e536e3c01400f485c23bac47427b49cbec91709ee1d1c",
                "md5": "fb55b5e82bc4b5968e2eb527a13e041b",
                "sha256": "6b8cbe23f511228e6182571d71556ef5b97886872a5d878be46b274dad6e9645"
            },
            "downloads": -1,
            "filename": "petdatasetreader-0.0.2.tar.gz",
            "has_sig": false,
            "md5_digest": "fb55b5e82bc4b5968e2eb527a13e041b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 9783,
            "upload_time": "2024-05-23T09:30:48",
            "upload_time_iso_8601": "2024-05-23T09:30:48.021889Z",
            "url": "https://files.pythonhosted.org/packages/ca/c5/ebd19aef1363df7e536e3c01400f485c23bac47427b49cbec91709ee1d1c/petdatasetreader-0.0.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-05-23 09:30:48",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "petdatasetreader"
}
