# Evaluating hypotheses using SKiM-GPT (Note: Must be on Mir-81)
This repository provides tools to SKiM through PubMed abstracts to evaluate hypotheses.
## Requirements
- Python 3.9+
- Libraries specified in `requirements.txt`
- OpenAI API key
- PubMed API key
- CHTC auth token
- Rstewart2 access
## Getting Started
1. **Setup**:
Clone the repository to your machine and change to its top level directory.
```bash
git clone <repository_url>
cd <repository_directory>
```
2. **Install Dependencies (with conda)**
Create a conda environment, then install the required packages using pip:
```bash
conda create --name <env_name> python=3.9
conda activate <env_name>
pip install -r requirements.txt
```
3. **Get CHTC auth token**
Log onto CHTC's submit node (ap2002.chtc.wisc.edu) and fetch your auth token using:
```bash
condor_token_fetch -file my_token
```
Copy the token so you can set your environment variable back on rstewart2.
4. **Environment Variables**
Before running the script, ensure your environment variables are set. We recommend exporting them in your shell profile, and you must source the profile after editing it (Jack has our OpenAI and PubMed keys in his .bashrc on the server, FYI):
```bash
export OPENAI_API_KEY=your_api_key_here
export PUBMED_API_KEY=your_api_key_here
export HTCONDOR_TOKEN=your_token_here
```
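A missing variable may only surface once a job is underway, so a quick fail-fast check can save a run. A minimal sketch (the variable names match the exports above; this helper is not part of the repository):

```python
import os

# Names match the exports above; HTCONDOR_TOKEN holds the CHTC auth token.
# Illustrative only -- not part of the repository's code.
REQUIRED_VARS = ["OPENAI_API_KEY", "PUBMED_API_KEY", "HTCONDOR_TOKEN"]

def missing_env_vars(required=REQUIRED_VARS):
    """Return the names of any required environment variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
```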
5. **Configuring Parameters**
The `config.json` file includes global parameters as well as several job types, each with unique parameters. See the [`config` Module Overview](#config-overview) to help set up your job.
6. **Running the script**
```bash
python main.py
```
---
<a name="config-overview"></a>
# `config` Module Overview
This configuration file contains various settings for different job types. Below are descriptions of each parameter:
## General Parameters
- `JOB_TYPE`: Specifies the type of job to be executed, e.g., `km_with_gpt` or `skim_with_gpt`.
- `KM_hypothesis`: Hypothesis template for KM analysis, using f-string format like `{a_term}` and `{b_term}` (e.g., `"Treatment with {b_term} will have no effect on {a_term} patient outcomes."`).
- `SKIM_hypotheses`: A dictionary of hypothesis templates for SKIM analysis (must use f-string format).
  - `AB`: Relevance hypothesis between `{a_term}` and `{b_term}` (e.g., `"There exists an interaction between the organ {a_term} and the gene {b_term}."`).
  - `BC`: Relevance hypothesis between `{c_term}` and `{b_term}` (e.g., `"There exists an interaction between the disease {c_term} and the gene {b_term}."`).
  - `rel_AC`: Relevance hypothesis between `{c_term}` and `{a_term}` (e.g., `"There exists an interaction between the disease {c_term} and the organ {a_term}."`).
  - `ABC`: Evaluation hypothesis (e.g., `"The gene {b_term} links the organ {a_term} to the disease {c_term}."`).
  - `AC`: Evaluation hypothesis (e.g., `"The gene {a_term} influences the disease {c_term}."`).
- `Evaluate_single_abstract`: Boolean flag to evaluate a single abstract (e.g., `false`).
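To make the placeholder convention concrete, here is a minimal sketch of how a template such as `ABC` can be filled with `str.format()`. The gene name `GeneX` is hypothetical, and the pipeline's actual substitution code may differ; the sketch only illustrates the `{a_term}`/`{b_term}`/`{c_term}` placeholders:

```python
# The template below is the example ABC value from config.json; "GeneX" is a
# hypothetical gene name used purely for illustration. The placeholders are
# standard str.format() fields.
abc_template = "The gene {b_term} links the organ {a_term} to the disease {c_term}."

hypothesis = abc_template.format(a_term="Thymus", b_term="GeneX", c_term="Down syndrome")
print(hypothesis)
# -> The gene GeneX links the organ Thymus to the disease Down syndrome.
```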
## Global Settings
- `A_TERM`: The primary term of interest, such as an organ (e.g., `"Thymus"`).
- `A_TERM_SUFFIX`: Optional suffix for the `A_TERM` (e.g., `""`).
- `TOP_N_ARTICLES_MOST_CITED`: Number of top-cited articles to consider (e.g., `300`).
- `TOP_N_ARTICLES_MOST_RECENT`: Number of most recent articles to consider (e.g., `0`).
- `POST_N`: Number of articles to process after relevance filtering (e.g., `20`).
- `MIN_WORD_COUNT`: Minimum word count for an abstract to be considered (e.g., `98`).
- `MODEL`: Machine learning model used for processing (e.g., `"gpt-4o-2024-08-06"`).
- `MAX_TOKENS`: Maximum number of tokens per API request (e.g., `1000`).
- `API_URL`: URL for the API endpoint (e.g., `"http://localhost:5099/skim/api/jobs"`).
- `PORT`: Port number for the API service (e.g., `"5081"`).
- `RATE_LIMIT`: Maximum number of requests allowed per time unit (e.g., `3`).
- `DELAY`: Time in seconds to wait before making a new request (e.g., `10`).
- `MAX_RETRIES`: Maximum number of retry attempts after a failed request (e.g., `10`).
- `RETRY_DELAY`: Delay in seconds before retrying a failed request (e.g., `5`).
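The retry-related settings can be read together: a failed request is retried up to `MAX_RETRIES` times, waiting `RETRY_DELAY` seconds between attempts. A minimal sketch of that pattern (this is an assumption about how the settings interact, not the pipeline's actual code):

```python
import time

# Illustrative retry loop; defaults mirror the example MAX_RETRIES / RETRY_DELAY
# values above. The pipeline's real retry logic may differ.
def call_with_retries(request_fn, max_retries=10, retry_delay=5, sleep=time.sleep):
    """Call request_fn, retrying up to max_retries times with retry_delay seconds between attempts."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return request_fn()
        except Exception as err:
            last_error = err
            if attempt < max_retries - 1:
                sleep(retry_delay)
    raise last_error
```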
## Abstract Filter Settings
- `MODEL`: Model used for abstract filtering (e.g., `"lexu14/porpoise1"`).
- `TEMPERATURE`: Sampling temperature for model inference (e.g., `0`).
- `TOP_K`: Number of highest-probability vocabulary tokens to keep for top-k-filtering (e.g., `20`).
- `TOP_P`: Cumulative probability for nucleus sampling (e.g., `0.95`).
- `MAX_COT_TOKENS`: Maximum tokens for Chain-of-Thought reasoning (e.g., `500`).
- `DEBUG`: Boolean flag to enable debug mode (e.g., `false`).
- `TEST_LEAKAGE`: Boolean flag to test for data leakage (e.g., `false`).
- `TEST_LEAKAGE_TYPE`: Type of data leakage test (e.g., `"empty"`).
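Gathered together, the abstract-filter keys above might look like this in `config.json` (an illustrative fragment using the example values; the exact section name and nesting may differ in your config):

```json
{
  "MODEL": "lexu14/porpoise1",
  "TEMPERATURE": 0,
  "TOP_K": 20,
  "TOP_P": 0.95,
  "MAX_COT_TOKENS": 500,
  "DEBUG": false,
  "TEST_LEAKAGE": false,
  "TEST_LEAKAGE_TYPE": "empty"
}
```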
## Job-Specific Settings
### km_with_gpt
- `position`: Boolean flag to consider positional data (e.g., `false`).
- `A_TERM_LIST`: Boolean to indicate if a list of `A` terms is used (e.g., `false`).
- `A_TERMS_FILE`: File path for the `A` terms list (e.g., `"../input_lists/test/km_a.txt"`).
- `B_TERMS_FILE`: File path for the `B` terms list (e.g., `"../input_lists/leakage_b_terms.txt"`).
- `SORT_COLUMN`: Column used for sorting A-B relationships (e.g., `"ab_sort_ratio"`).
- `NUM_B_TERMS`: Number of `B` terms to consider after sorting (e.g., `25`).
- `km_with_gpt`:
- `ab_fet_threshold`: Fisher Exact Test threshold for A-B relationships (e.g., `1`).
- `censor_year`: Year for data censoring or time-slicing (e.g., `2024`).
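To clarify what `ab_fet_threshold` filters on: a term pair passes when the p-value of a Fisher Exact Test on its 2x2 co-occurrence table is at or below the threshold. The sketch below computes a one-sided (enrichment) p-value from the hypergeometric distribution; it is an illustration of the statistic, not the pipeline's actual implementation:

```python
from math import comb

def fisher_exact_one_sided(a, b, c, d):
    """One-sided Fisher Exact Test: P(X >= a) for the 2x2 table [[a, b], [c, d]].

    With fixed margins, X follows a hypergeometric distribution; we sum its
    upper tail starting at the observed count a.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c          # fixed row/column margins
    denom = comb(n, col1)
    return sum(
        comb(row1, x) * comb(n - row1, col1 - x) / denom
        for x in range(a, min(row1, col1) + 1)
    )
```

A pair with a p-value of 0.04 would pass an `ab_fet_threshold` of `0.1` but not one of `0.01`.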
### skim_with_gpt
- `position`: Boolean flag to consider positional data (e.g., `false`).
- `A_TERM_LIST`: Boolean to indicate if a list of `A` terms is used (e.g., `false`).
- `A_TERMS_FILE`: File path for the `A` terms list (e.g., `"../input_lists/exercise3/skim_a.txt"`).
- `B_TERMS_FILE`: File path for the `B` terms list (e.g., `"../input_lists/genes_no_syn.txt"`).
- `NUM_B_TERMS`: Number of `B` terms to consider (e.g., `20000`).
- `C_TERMS_FILE`: File path for the `C` terms list (e.g., `"../input_lists/down_syndrome.txt"`).
- `SORT_COLUMN`: Column used for sorting B-C relationships (e.g., `"bc_sort_ratio"`).
- `skim`:
- `ab_fet_threshold`: Fisher Exact Test threshold for A-B relationships (e.g., `0.1`).
- `bc_fet_threshold`: Fisher Exact Test threshold for B-C relationships (e.g., `0.5`).
- `censor_year`: Year for data censoring or time-slicing (e.g., `2024`).
- `top_n`: Number of top items to consider after AB linkage (e.g., `300`).
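As an illustration, the `skim_with_gpt` settings above could appear in `config.json` as the following fragment, built from the example values in this overview (the exact nesting is an assumption; treat your own `config.json` as authoritative):

```json
"skim_with_gpt": {
  "position": false,
  "A_TERM_LIST": false,
  "A_TERMS_FILE": "../input_lists/exercise3/skim_a.txt",
  "B_TERMS_FILE": "../input_lists/genes_no_syn.txt",
  "NUM_B_TERMS": 20000,
  "C_TERMS_FILE": "../input_lists/down_syndrome.txt",
  "SORT_COLUMN": "bc_sort_ratio",
  "skim": {
    "ab_fet_threshold": 0.1,
    "bc_fet_threshold": 0.5,
    "censor_year": 2024,
    "top_n": 300
  }
}
```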
### km_with_gpt_direct_comp
- `position`: Boolean flag to consider positional data (e.g., `false`).
- `A_TERM_LIST`: Boolean to indicate if a list of `A` terms is used (e.g., `false`).
- `A_TERMS_FILE`: File path for the `A` terms list (e.g., `"../input_lists/test/km_a.txt"`).
- `B_TERMS_FILE`: File path for the `B` terms list (e.g., `"../input_lists/scrapie_b_terms_ProteinvsVI_directlyForNewCode.txt"`).
- `SORT_COLUMN`: Column used for sorting A-B relationships (e.g., `"ab_sort_ratio"`).
- `NUM_B_TERMS`: Number of `B` terms to consider (e.g., `25`).
- `km_with_gpt_direct_comp`:
- `ab_fet_threshold`: Fisher Exact Test threshold for A-B relationships (e.g., `1`).
- `censor_year`: Year for data censoring or time-slicing (e.g., `1990`).
This configuration is critical for tailoring the behavior of the system to specific job types and requirements. Ensure all file paths and parameters are correctly set before execution to avoid runtime errors.
## Contributions
Feel free to contribute to this repository by submitting a pull request, or open an issue to report bugs or suggest improvements.