textToKnowledgeGraph


- Name: textToKnowledgeGraph
- Version: 0.2.2
- Home page: https://github.com/ndexbio/llm-text-to-knowledge-graph
- Summary: A Python package to generate BEL statements and CX2 networks.
- Upload time: 2025-01-28 13:36:54
- Author: Favour James
- Requires Python: >=3.11
- Requirements: langchain, langchain_core, langchain_openai, lxml, ndex2, pandas, pydantic, python-dotenv, Requests, pytest

# textToKnowledgeGraph

A Python package to generate BEL statements and CX2 networks.

## Table of Contents

- [Project Description](#project-description)
- [Dependencies](#dependencies)
- [Glossary](#glossary)
- [Installation](#installation)
- [Methodology](#methodology)
  - [Features Available](#features-available)
  - [BEL Generation](#bel-generation)
  - [CX2 Network Generation](#cx2-network-generation)
  - [Uploading to NDEx](#uploading-to-ndex)
- [Usage](#usage)

## Project Description

`textToKnowledgeGraph` is a Python package that converts natural-language scientific text into structured knowledge graphs using large language models (LLMs). It can be used for:

- Generating BEL statements.
- Extracting entities and interactions from scientific text.
- Uploading the generated CX2 networks to NDEx.

## Dependencies

- `Python>=3.11`
- `langchain==0.3.13`
- `langchain_core==0.3.27`
- `langchain_openai==0.2.13`
- `lxml==5.2.1`
- `ndex2>=3.8.0,<4.0.0`
- `pandas`
- `pydantic==2.10.4`
- `python-dotenv==1.0.1`
- `Requests==2.32.3`

## Glossary

This section defines the terms used in this documentation:

- BEL (Biological Expression Language): BEL is a structured language used to represent scientific findings, especially in the biomedical domain, in a computable format. Learn More: [BEL Documentation](https://language.bel.bio/)
- CX2 (Cytoscape Exchange Format 2): CX2 is a JSON-based format used for storing and exchanging network data in Cytoscape. Learn More: [CX2 Specification](http://manual.cytoscape.org/en/stable/Supported_Network_File_Formats.html#cx2)
- PMCID (PubMed Central Identifier): A unique identifier for articles archived in PubMed Central (PMC), a free digital repository of biomedical and life sciences journal literature. Learn More: [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc/)
- NDEx (Network Data Exchange): NDEx is an online resource that facilitates the sharing, storage, and visualization of biological networks. Learn More: [NDEx](https://www.ndexbio.org)
- LangChain: LangChain is a framework for developing applications powered by language models. It allows easy integration of language models with data sources and APIs, enabling workflows like knowledge extraction and retrieval. Learn More: [LangChain](https://python.langchain.com/docs/introduction/)
- Cytoscape: Cytoscape is an open-source platform for visualizing and analyzing complex networks, including biological pathways, protein interaction networks, and more. Learn More: [Cytoscape](https://cytoscape.org)
- Knowledge Graph: A knowledge graph is a structured representation of knowledge in a graph format, where entities are nodes and relationships are edges. It enables intuitive querying, reasoning, and visualization of complex biological data, aiding in understanding biological systems and facilitating discoveries.
- PubTator: PubTator is a web-based tool that extracts and annotates biomedical entities and relations from scientific literature. It provides a user-friendly interface for exploring and analyzing scientific texts. Learn More: [PubTator](https://www.ncbi.nlm.nih.gov/research/pubtator/)
- OpenAI: OpenAI is an artificial intelligence research lab that develops advanced language models and other AI technologies. It provides APIs for accessing language models and other AI capabilities. This project uses the gpt-4o model. Other tests were carried out with gpt-3, gpt-4, and gpt-4o-mini. Learn More: [OpenAI](https://www.openai.com)
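
To make the BEL and knowledge graph entries more concrete, the sketch below splits a simple `subject relation object` BEL statement into one graph edge. It is an illustration only, not the package's parser: the `Edge` class, the `bel_to_edge` helper, and the short relation list are assumptions made for this example, and nested statements are not handled.

```python
from dataclasses import dataclass

# A small subset of BEL relationship keywords, listed here for illustration only.
RELATIONS = ("directlyIncreases", "directlyDecreases", "increases", "decreases", "association")


@dataclass(frozen=True)
class Edge:
    source: str    # subject term, becomes a node
    relation: str  # BEL relationship, becomes the edge label
    target: str    # object term, becomes a node


def bel_to_edge(statement: str) -> Edge:
    """Split a simple 'subject relation object' BEL statement into an edge."""
    for relation in RELATIONS:
        marker = f" {relation} "
        if marker in statement:
            source, target = statement.split(marker, 1)
            return Edge(source.strip(), relation, target.strip())
    raise ValueError(f"no known relation in: {statement!r}")


edge = bel_to_edge("p(HGNC:AKT1) increases p(HGNC:MTOR)")
print(edge)
```

Each term becomes a node and the relationship becomes the labelled edge between them, which is the shape a knowledge graph needs.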

## Installation

Install the package via pip:

```bash
pip install textToKnowledgeGraph
```

## Methodology

### Features Available

- **BEL Generation**: Extracts biological interactions from scientific papers and generates BEL statements.
- **CX2 Network Generation**: Converts extracted interactions into CX2 network format for visualization in Cytoscape.
- **Uploading to NDEx**: Uploads the generated CX2 networks to NDEx for sharing and visualization.

<!-- - ## Code WorkFlow -->

### BEL Generation

- The user provides a PMC ID and an OpenAI API key as input. The PMC ID is used to fetch the XML version of the paper from the PubTator API. The XML is then parsed into a list of dictionaries, one per paragraph, each containing the paragraph's index, its text, and the PubTator annotations for that paragraph. The result is saved to a JSON file.
- The instructions that tell the model how to extract BEL statements are stored in a prompt file, `prompt_file_v5.txt`, which is read and passed to the model as its prompt.
- **Model Creation**:
  - The `bel_model.py` script defines and initializes the model used to extract BEL statements.
  - It includes schema definitions, API call handling, and model initialization.
  - The `get_interactions.py` script handles the extraction of interactions, prompt processing, and chain initialization.
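
The paragraph-splitting step described above can be sketched with the standard library. This is a minimal illustration, not the package's code: the sample XML is invented and only assumes a BioC-like layout (`passage`, `text`, `annotation`, `infon`), and `extract_paragraphs` is a hypothetical helper name. The real PubTator response may be structured differently.

```python
import json
import xml.etree.ElementTree as ET

# Invented sample in a BioC-like layout; real PubTator output may differ.
SAMPLE_XML = """
<collection>
  <document>
    <id>PMC0000000</id>
    <passage>
      <offset>0</offset>
      <text>AKT1 activates MTOR signaling.</text>
      <annotation id="1">
        <infon key="type">Gene</infon>
        <location offset="0" length="4"/>
        <text>AKT1</text>
      </annotation>
    </passage>
    <passage>
      <offset>31</offset>
      <text>No entities are annotated here.</text>
    </passage>
  </document>
</collection>
"""


def extract_paragraphs(xml_text: str) -> list[dict]:
    """Return one dict per passage: its index, text, and annotations."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for index, passage in enumerate(root.iter("passage")):
        annotations = [
            {
                "entity_text": ann.findtext("text", default=""),
                "type": ann.findtext("infon[@key='type']", default=""),
            }
            for ann in passage.findall("annotation")
        ]
        paragraphs.append(
            {
                "index": index,
                "text": passage.findtext("text", default=""),
                "annotations": annotations,
            }
        )
    return paragraphs


paragraphs = extract_paragraphs(SAMPLE_XML)
print(json.dumps(paragraphs, indent=2))
```

Writing the same list with `json.dump` to a file would give the JSON output that the step describes.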

### CX2 Network Generation

- Converts extracted interactions into CX2 network format for visualization in Cytoscape.
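
As a rough picture of what this conversion involves, the sketch below turns `(source, relation, target)` triples into a simplified CX2-style list of aspects. It is an assumption-laden illustration, not the package's implementation: `interactions_to_cx2` is a hypothetical helper, and the output omits aspects such as `metaData` and `attributeDeclarations` that a complete CX2 document would carry, so it may not load into Cytoscape or NDEx as-is.

```python
import json


def interactions_to_cx2(interactions: list[tuple[str, str, str]]) -> list[dict]:
    """Build a simplified CX2-style aspect list from (source, relation, target) triples."""
    node_ids: dict[str, int] = {}
    edges = []
    for source, relation, target in interactions:
        # Assign each distinct term a node id the first time it is seen.
        for name in (source, target):
            node_ids.setdefault(name, len(node_ids))
        edges.append(
            {
                "id": len(edges),
                "s": node_ids[source],
                "t": node_ids[target],
                "v": {"interaction": relation},
            }
        )
    nodes = [{"id": node_id, "v": {"name": name}} for name, node_id in node_ids.items()]
    return [
        {"CXVersion": "2.0", "hasFragments": False},
        {"nodes": nodes},
        {"edges": edges},
        {"status": [{"success": True}]},
    ]


network = interactions_to_cx2([("p(HGNC:AKT1)", "increases", "p(HGNC:MTOR)")])
print(json.dumps(network, indent=2))
```

Terms that appear in several interactions are mapped to a single node, so repeated entities are merged rather than duplicated.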

### Uploading to NDEx

- Uploads the generated CX2 networks to NDEx for sharing and visualization. To use this feature, provide your NDEx email and password as arguments.
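
Because the upload needs an email and a password, one option is to keep them out of source code and read them from the environment. The sketch below is a suggestion rather than part of the package: the variable names `NDEX_EMAIL` and `NDEX_PASSWORD` and the helper `ndex_credentials` are hypothetical.

```python
import os
from typing import Mapping, Optional


def ndex_credentials(env: Mapping[str, str] = os.environ) -> Optional[tuple[str, str]]:
    """Return (email, password) if both are set, otherwise None."""
    email = env.get("NDEX_EMAIL", "").strip()
    password = env.get("NDEX_PASSWORD", "")
    if email and password:
        return email, password
    return None


credentials = ndex_credentials()
upload_to_ndex = credentials is not None
print("upload enabled" if upload_to_ndex else "NDEx credentials not set; skipping upload")
```

The resulting pair can then be passed on as the NDEx email and password arguments, and the upload skipped when either value is missing.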

## Usage

To install the Python package:

```bash
pip install textToKnowledgeGraph
```

**Required parameters**:

- **pmc_id**: The PMC ID of the paper to process. Only one ID can be processed per call.

- **api_key**: Your OpenAI API key.

**Optional parameters**:

- **ndex_email**: The NDEx email for authentication.
- **ndex_password**: The NDEx password for authentication.

**Expected output**:

- **BEL statements**: extracted from the paper
- **CX2 network**: generated from the extracted BEL statements
- **Example of CX2 network**:
![CX2 network image of paper:PMC8354587](https://github.com/ndexbio/llm-text-to-knowledge-graph/blob/main/PMC8354587_image.png?raw=true)

To run in an interactive Python environment:

```python
from textToKnowledgeGraph import process_paper

# Process a PMC ID without uploading to NDEx
process_paper("PMC8354587", "sk-....")

# Process a PMC ID and upload the resulting network to NDEx
process_paper("PMC8354587", "sk-..", "john_doe@gmail.com", "xxxx", upload_to_ndex=True)
```
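
Since only one PMC ID can be processed per call, several papers have to be handled in a loop. The sketch below shows one way to do that while continuing past failures. It is not part of the package: `process_many` and the ID pattern are assumptions, and the per-paper function is passed in so the example runs without network access.

```python
import re
from typing import Callable, Iterable

# Assumed shape of a PMC ID: the prefix "PMC" followed by digits.
PMC_ID_PATTERN = re.compile(r"^PMC\d+$")


def process_many(pmc_ids: Iterable[str], process_one: Callable[[str], object]) -> dict[str, str]:
    """Call process_one for each PMC ID and record 'ok' or an error message."""
    results: dict[str, str] = {}
    for pmc_id in pmc_ids:
        if not PMC_ID_PATTERN.match(pmc_id):
            results[pmc_id] = "skipped: not a PMC ID"
            continue
        try:
            process_one(pmc_id)
            results[pmc_id] = "ok"
        except Exception as error:  # keep going if one paper fails
            results[pmc_id] = f"failed: {error}"
    return results


# Stand-in for the real call, used here so the sketch is self-contained.
print(process_many(["PMC8354587", "not-an-id"], lambda pmc_id: None))
```

With the package installed, the stand-in could be replaced by something like `lambda pmc_id: process_paper(pmc_id, api_key)`, where `api_key` holds your OpenAI key.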

            

        