text2graphapi


Name: text2graphapi
Version: 0.1.8
Summary: Use this library to transform raw text into different graph representations.
Upload time: 2023-05-16 22:34:07
Author: PLN-disca-iimas
License: MIT
Requirements: No requirements were recorded.
            
# text2graph Library

**text2graphapi** is a Python library for text-to-graph transformations. To use it, install the library and its dependencies in your application, then load the corpus of text documents that you want to transform into graphs.

The text-to-graph transformation pipeline consists of three main modules:
* **Text Preprocessing and Normalization**. This module cleans and preprocesses the text: it handles blank spaces, emoticons, HTML tags, stop words, etc.
* **Graph Model**. This module aims to define the entities/nodes and their relationships/edges according to the problem specification. 
* **Graph Transformation**. This module aims to apply vector transformations to the graph as final output, such as adjacency matrix, dense matrix, etc.

The following diagram gives an overview of the text-to-graph transformation pipeline described above:

![text to graph pipeline](https://www.linkpicture.com/q/texto-to-graph.pipeline.png#center)
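
To make the first module more concrete, here is a minimal sketch of the kind of cleaning it describes, using only the standard library. It is not the library's own preprocessing code: the `normalize` function, the regular expressions, and the tiny `STOP_WORDS` set are all assumptions made for illustration.

```python
import html
import re

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "on", "and"}

TAG_RE = re.compile(r"<[^>]+>")       # matches HTML tags such as <p> or </p>
TOKEN_RE = re.compile(r"[a-z0-9]+")   # keeps runs of ASCII letters and digits


def normalize(text, stop_words=STOP_WORDS):
    """Return lowercase tokens with markup, punctuation and stop words removed."""
    text = TAG_RE.sub(" ", text)             # drop HTML tags
    text = html.unescape(text)               # decode entities such as &nbsp;
    tokens = TOKEN_RE.findall(text.lower())  # also discards extra blanks and emoticons
    return [token for token in tokens if token not in stop_words]


print(normalize("<p>The   violence on the&nbsp;TV</p>"))
# ['violence', 'tv']
```

A real preprocessing step would normally rely on a language-specific stop-word list and tokenizer, which is presumably why the library asks for a `language` parameter.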

## **_Installation_ from PyPI**
Inside your project, run the following command from your CLI to install the latest version of the library:
```bash
pip install text2graphapi
```

## **_Types of graph representation available:_**
Currently, this library supports two types of graph representation: *Word Co-occurrence Graph* and *Heterogeneous Graph*. Both representations expect the same input, which must have the following structure:
```Python
# The input has to be a list of dictionaries, where each dict contains an 'id' and the 'doc' text data
# For example:
input_text_docs = [
    {"id": 1, "doc": "text for document 1"},
    {"id": 2, "doc": "text for document 2"}
]
```
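
If your corpus starts out as a plain list of strings, a small helper can wrap it into this structure and check it before calling the library. The sketch below is not part of text2graphapi: `to_input_docs` and `check_input_docs` are hypothetical names, and the unique-`id` check is a precaution assumed here, not a documented requirement.

```python
def to_input_docs(texts, start_id=1):
    """Wrap raw strings into [{'id': ..., 'doc': ...}, ...]."""
    return [{"id": i, "doc": text} for i, text in enumerate(texts, start=start_id)]


def check_input_docs(docs):
    """Raise ValueError unless docs is a list of {'id', 'doc'} dictionaries."""
    if not isinstance(docs, list):
        raise ValueError("input must be a list of dictionaries")
    seen_ids = set()
    for position, item in enumerate(docs):
        if not isinstance(item, dict) or "id" not in item or "doc" not in item:
            raise ValueError(f"item {position} needs 'id' and 'doc' keys")
        if not isinstance(item["doc"], str):
            raise ValueError(f"item {position}: 'doc' must be a string")
        if item["id"] in seen_ids:  # assumed precaution, see the note above
            raise ValueError(f"duplicate id {item['id']!r}")
        seen_ids.add(item["id"])
    return docs


input_text_docs = check_input_docs(
    to_input_docs(["text for document 1", "text for document 2"])
)
```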

In the next sections we describe each of these graph representations and provide some implementation examples:
 - **Word Co-occurrence Graph:**
   In this graph, each word is represented as a node, and the co-occurrence of two words within the document text is represented as an edge between the corresponding nodes. As attributes/weights, nodes carry a *Part Of Speech* tag and edges carry the *number of co-occurrences* of the two words in the text document. The output contains one graph representation for each text document in the corpus.
   For example, in the following code snippet we have a corpus of one document, and we apply a word co-occurrence transformation with these parameters: graph type DiGraph, window_size of 1, English language, adjacency matrix as the desired output format, etc.
   
```Python
from text2graphapi.src.Cooccurrence import Cooccurrence

# Corpus with a single document
corpus_docs = [{'id': 1, 'doc': 'The violence on the TV. The article discussed the idea of the amount of violence on the news'}]

to_cooccurrence = Cooccurrence(
    graph_type='DiGraph',        # build a directed graph
    apply_prep=True,             # apply text preprocessing
    parallel_exec=False,
    window_size=1,               # co-occurrence window size
    language='en',
    output_format='adj_matrix',  # return an adjacency matrix
)

output_text_graphs = to_cooccurrence.transform(corpus_docs)
```
After the execution of this code, we have one directed graph with 8 nodes and 15 edges:
```Python
[{
	'doc_id': 1, 
	'graph': <8x8 sparse array of type '<class 'numpy.int64'>' Sparse Row format>, 
	'number_of_edges': 15, 
	'number_of_nodes': 8, 
	'status': 'success'
}]
```
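
The idea behind the co-occurrence graph can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's implementation: the window definition (each word is paired with the next `window_size` words), the helper names `cooccurrence_edges` and `adjacency_matrix`, and the hand-written token list are all hypothetical, Part Of Speech tags are omitted, and the sketch is not expected to reproduce the node and edge counts shown above.

```python
from collections import Counter


def cooccurrence_edges(tokens, window_size=1):
    """Count directed (left, right) word pairs at most window_size positions apart."""
    edges = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window_size]:
            edges[(left, right)] += 1
    return edges


def adjacency_matrix(edges):
    """Return the sorted node list and a dense weight matrix as nested lists."""
    nodes = sorted({word for pair in edges for word in pair})
    index = {word: k for k, word in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for (left, right), weight in edges.items():
        matrix[index[left]][index[right]] = weight
    return nodes, matrix


tokens = ["violence", "tv", "article", "violence", "tv"]
edges = cooccurrence_edges(tokens, window_size=1)
nodes, matrix = adjacency_matrix(edges)

print(nodes)   # ['article', 'tv', 'violence']
print(matrix)  # [[0, 0, 1], [1, 0, 0], [0, 2, 0]]
```

Row `i`, column `j` of the matrix holds the number of times word `j` follows word `i` inside the window, so the pair `violence → tv`, which occurs twice, gets weight 2.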

- **Heterogeneous Graph:**
In this graph, both words and documents are represented as nodes, and word-to-word and word-to-document relations are represented as edges. As attributes/weights, the word-to-word relation carries the point-wise mutual information (PMI) measure, and the word-to-document relation carries the Term Frequency-Inverse Document Frequency (TF-IDF) measure. The output contains a single graph representation for all the text documents in the corpus.
For example, in the following code snippet we have a corpus of two documents, and we apply a Heterogeneous transformation with these parameters: graph type Graph, window_size of 20, English language, networkx object as the desired output format, etc.
```Python
from text2graphapi.src.Heterogeneous import Heterogeneous

# Corpus with two documents
corpus_docs = [
    {'id': 1, 'doc': "bible answers organization distribution"},
    {'id': 2, 'doc': "atheists agnostics organization"},
]

hetero_graph = Heterogeneous(
    graph_type='Graph',         # build an undirected graph
    window_size=20,
    parallel_exec=False,
    apply_preprocessing=True,
    language='en',              # the sample documents are in English
    output_format='networkx',   # return a networkx object
)

output_text_graphs = hetero_graph.transform(corpus_docs)
```

After the execution of this code, we have one undirected graph representing the whole corpus, with 8 nodes and 11 edges:
```Python
[{
	'id': 1, 
	'doc_graph': <networkx.classes.graph.Graph at 0x7f2b44e6d9a0>, 
	'number_of_edges': 11, 
	'number_of_nodes': 8, 
	'status': 'success'
}]
```
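
To illustrate how such edge weights can be derived, the sketch below computes TF-IDF and PMI values for the same two documents with the standard library only. The exact formulas used by text2graphapi are not stated here, so everything in this block is an assumption: the helper names `tfidf_edges` and `pmi_edges`, the `tf * log(N / df)` weighting, the sliding-window PMI estimate, and the choice to keep only positive PMI values.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_edges(docs):
    """Map (word, doc_id) to tf * log(N / df), with tf as relative frequency."""
    n_docs = len(docs)
    doc_freq = Counter(word for tokens in docs.values() for word in set(tokens))
    edges = {}
    for doc_id, tokens in docs.items():
        for word, count in Counter(tokens).items():
            edges[(word, doc_id)] = count / len(tokens) * math.log(n_docs / doc_freq[word])
    return edges


def pmi_edges(docs, window_size=20):
    """Map (word_a, word_b) to PMI estimated over sliding windows (positive values only)."""
    windows = []
    for tokens in docs.values():
        if len(tokens) <= window_size:
            windows.append(set(tokens))  # a short document is one window
        else:
            windows.extend(set(tokens[i:i + window_size])
                           for i in range(len(tokens) - window_size + 1))
    word_count = Counter(word for window in windows for word in window)
    pair_count = Counter(pair for window in windows
                         for pair in combinations(sorted(window), 2))
    total = len(windows)
    edges = {}
    for (a, b), joint in pair_count.items():
        # PMI = log( p(a, b) / (p(a) * p(b)) ), with probabilities as window fractions
        pmi = math.log(joint * total / (word_count[a] * word_count[b]))
        if pmi > 0:
            edges[(a, b)] = pmi
    return edges


docs = {
    1: "bible answers organization distribution".split(),
    2: "atheists agnostics organization".split(),
}
word_doc = tfidf_edges(docs)
word_word = pmi_edges(docs)
print(len(word_doc), len(word_word))  # 7 4
```

Under these assumptions the corpus yields 7 word-to-document edges and 4 word-to-word edges over 6 words and 2 documents. That happens to agree with the 8 nodes and 11 edges reported above, although the agreement alone does not confirm that the library uses this exact formulation.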


            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "text2graphapi",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "",
    "author": "PLN-disca-iimas",
    "author_email": "andric.valdez@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/40/90/7fb651a0b310336d18a247f5b6d78e8e563e820a80d3d4aa55b1aeabf7fc/text2graphapi-0.1.8.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Use this library to transform raw text into differents graph representations.",
    "version": "0.1.8",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7ccae4271101d1cf1d8171bdb614eba42da9eea43bdfb70d5791810f11a26136",
                "md5": "c68aad1c43c5026aa6eab3afab9b8d8f",
                "sha256": "509d4f71eaad74370a9890845d3195378d75a82381f9aea52fefa22a17268125"
            },
            "downloads": -1,
            "filename": "text2graphapi-0.1.8-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c68aad1c43c5026aa6eab3afab9b8d8f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 28562,
            "upload_time": "2023-05-16T22:34:04",
            "upload_time_iso_8601": "2023-05-16T22:34:04.591245Z",
            "url": "https://files.pythonhosted.org/packages/7c/ca/e4271101d1cf1d8171bdb614eba42da9eea43bdfb70d5791810f11a26136/text2graphapi-0.1.8-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "40907fb651a0b310336d18a247f5b6d78e8e563e820a80d3d4aa55b1aeabf7fc",
                "md5": "f328a4fb7a99e357ed0e9733deabea4b",
                "sha256": "1e075e9e39f203ae281cb15d5ec91f677f984b27ea09b6458a0128b3b4f918ad"
            },
            "downloads": -1,
            "filename": "text2graphapi-0.1.8.tar.gz",
            "has_sig": false,
            "md5_digest": "f328a4fb7a99e357ed0e9733deabea4b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 27825,
            "upload_time": "2023-05-16T22:34:07",
            "upload_time_iso_8601": "2023-05-16T22:34:07.376696Z",
            "url": "https://files.pythonhosted.org/packages/40/90/7fb651a0b310336d18a247f5b6d78e8e563e820a80d3d4aa55b1aeabf7fc/text2graphapi-0.1.8.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-05-16 22:34:07",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "text2graphapi"
}
        