<img width="685" alt="image" src="https://github.com/ohsu-comp-bio/vrs-python-testing/assets/47808/909db052-972c-4508-a2f4-8a389de03320">
# VRS AnVIL Toolkit
## Project Overview
This Python package processes Variant Call Format (VCF) files and performs lookups on GA4GH Variation Representation Specification (VRS) identifiers. VRS identifiers provide a standardized way to represent genomic variations, making it easier to exchange and share genomic information.
In addition, this project facilitates the retrieval of evidence associated with genomic alleles by leveraging the Genomic Data Representation and Knowledge Base (GA4GH MetaKB) service. GA4GH MetaKB provides a comprehensive knowledge base that links genomic variants to relevant clinical variant interpretations.
## Features
1. **VCF File Processing:**
- Streamlines reading and parsing of VCF files to extract relevant genomic information.
2. **GA4GH VRS Identifier Lookup:**
- Utilizes the GA4GH VRS API to perform lookups for each genomic variation mentioned in the VCF file.
- Retrieves standardized identifiers for the alleles, enhancing interoperability with GA4GH-compliant systems.
- GA4GH MetaKB Service Integration: Utilizes GA4GH MetaKB to retrieve evidence associated with specified genomic alleles.
3. **Output Generation:**
- Generates summary metrics about throughput, errors, evidence, and hits.
- Presents the retrieved evidence in a structured format, providing access to information about studies, publications, and other relevant details.
4. **Additional Features:**
- Provides configurable options like threading and caching for processing VCFs.
- Implements robust error handling to address issues like invalid input files, invalid variants, and more.
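As a rough illustration of the VCF processing step, the sketch below turns VCF records into `chrom-pos-ref-alt` strings (the gnomAD-style notation), one per alternate allele. It is a minimal standard-library sketch, not the toolkit's actual parser; the function name `iter_gnomad_variants` and the sample records are made up for this example.

```python
from typing import Iterable, Iterator


def iter_gnomad_variants(vcf_lines: Iterable[str]) -> Iterator[str]:
    """Yield one 'chrom-pos-ref-alt' string per ALT allele (hypothetical helper)."""
    for line in vcf_lines:
        line = line.rstrip("\n")
        # Skip blank lines, meta-information (##...) and the header (#CHROM...)
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            raise ValueError(f"Malformed VCF record: {line!r}")
        chrom, pos, _id, ref, alts = fields[:5]
        for alt in alts.split(","):
            # "." means no alternate allele; "*" marks an allele removed
            # by an upstream deletion, so neither describes a new variant
            if alt in {".", "*"}:
                continue
            yield f"{chrom}-{pos}-{ref}-{alt}"


# Made-up records, used only to show the behavior
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t10329\t.\tAC\tA\t.\tPASS\t.",
    "chr1\t10330\t.\tC\tA,T\t.\tPASS\t.",
]
variants = list(iter_gnomad_variants(example))
print(variants)
```

Multi-allelic records are split so that each alternate allele gets its own string, since each allele would need its own identifier lookup.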
## Getting Started
### Prerequisites
- Python 3.10 or later
- Internet connectivity for data dependency setup (seqrepo)
### Installation
1. Get the package from either...
1. Source code
```bash
git clone https://github.com/ohsu-comp-bio/vrs_anvil_toolkit
cd vrs_anvil_toolkit
```
2. PyPI
```bash
pip install vrs_anvil_toolkit
```
2. Install dependencies either...
1. for local use
```bash
# install postgresql@14 (required for vrs-python)
brew install postgresql@14
bash scripts/setup.sh
```
2. for use on Terra
```bash
bash terra/setup.sh
```
### Usage
**General**
All usage has the following general steps...
1. Create a manifest to configure your VCF processing run
1. Use the `vrs_bulk` CLI to create a metrics file of related evidence
1. Use the metrics files for downstream analysis
These steps are explained in detail below, with some additional info on using vrs-python to directly annotate VCFs with VRS IDs.
**Manifest**
The configuration of each VCF processing run is controlled by a `manifest.yaml` file. Most importantly, this file specifies the...
- input VCF file(s) to process
- working directories
- performance and strictness configurations
Use this commented [sample manifest](tests/fixtures/manifest.yaml) as a starting point for the variables you can set per run.
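To make the idea of a manifest concrete, here is a minimal sketch of validating a manifest after it has been loaded into a dictionary (for example with a YAML parser). The key names `vcf_files`, `work_directory`, `num_threads`, and `strict` are hypothetical stand-ins for the three categories listed above; the real keys are documented in the sample manifest, and this is not the toolkit's own loader.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class RunConfig:
    """Validated settings for one run (all field names are hypothetical)."""

    vcf_files: list[str]
    work_directory: str
    num_threads: int = 1
    strict: bool = False


def parse_manifest(raw: Mapping[str, Any]) -> RunConfig:
    """Check a loaded manifest dictionary and return a typed config."""
    vcf_files = raw.get("vcf_files")
    if not isinstance(vcf_files, list) or not vcf_files:
        raise ValueError("manifest must list at least one input VCF")
    if not all(isinstance(path, str) for path in vcf_files):
        raise ValueError("every VCF entry must be a string path or URI")

    work_directory = raw.get("work_directory")
    if not isinstance(work_directory, str) or not work_directory:
        raise ValueError("manifest must name a working directory")

    num_threads = raw.get("num_threads", 1)
    # bool is a subclass of int, so reject it explicitly
    if isinstance(num_threads, bool) or not isinstance(num_threads, int) or num_threads < 1:
        raise ValueError("num_threads must be a positive integer")

    return RunConfig(
        vcf_files=list(vcf_files),
        work_directory=work_directory,
        num_threads=num_threads,
        strict=bool(raw.get("strict", False)),
    )


# The kind of dictionary a YAML parser might return for a small manifest
config = parse_manifest(
    {
        "vcf_files": ["tests/fixtures/1kGP.chr1.1000.vcf"],
        "work_directory": "work",
        "num_threads": 2,
    }
)
print(config)
```

Failing early on a malformed manifest is cheaper than discovering the problem partway through a long annotation run.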
**CLI**
Below is a list of command-line utilities that may be useful:
```bash
# activate the environment
source venv/bin/activate
# run the vrs_bulk command in the foreground
vrs_bulk annotate
# run the vrs_bulk command in parallel, one process per VCF file
vrs_bulk annotate --scatter
# run the vrs_bulk command in parallel in the background
nohup vrs_bulk annotate --scatter & # press enter to continue
# get the status of the processes for the most recent scatter run
vrs_bulk ps
```
The command-line utility supports Google Cloud URIs and background execution, so it works with Terra out of the box, as the CLI usage above shows. For an example notebook, see `vrs-anvil-demo.ipynb` in the `vrs-anvil` workspace.
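When working from a notebook rather than a shell, a background run can be started from Python. The sketch below roughly mirrors the `nohup ... &` pattern using the standard library's `subprocess.Popen`; the helper name `launch_in_background` is hypothetical, and the demo runs a stand-in Python command so that it works without the toolkit installed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def launch_in_background(command: list[str], log_path: Path) -> subprocess.Popen:
    """Start `command` without blocking, sending its output to a log file."""
    with open(log_path, "w") as log_file:
        process = subprocess.Popen(
            command,
            stdout=log_file,
            stderr=subprocess.STDOUT,
            stdin=subprocess.DEVNULL,
            # On POSIX, run in a new session so the child is not tied
            # to the terminal session that launched it
            start_new_session=True,
        )
    # The child keeps its own handle to the log file after ours is closed
    return process


# Stand-in command; in a configured environment this would be something like
# ["vrs_bulk", "annotate", "--scatter"]
log_path = Path(tempfile.mkdtemp()) / "run.log"
process = launch_in_background(
    [sys.executable, "-c", "print('annotate finished')"], log_path
)
exit_code = process.wait()  # only for the demo; a real run would not wait here
print(exit_code, log_path.read_text().strip())
```

For a real scatter run you would skip the `wait()` call and check progress later with `vrs_bulk ps`.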
**Processing VCF Files ([vrs-python](https://github.com/ga4gh/vrs-python))**
vrs-python is a GA4GH GKS package centered around creating Variation Representation Specification (VRS) IDs: consistent, globally unique identifiers for variation. Its functionality includes variant ID translation and VCF annotation. It is a dependency of `vrs_bulk` and can also be used as a standalone package.
For an example of Python usage, see [vrs_vcf_annotator.py](scripts/vrs_vcf_annotator.py).
For CLI usage:
```bash
python3 -m ga4gh.vrs.extras.vcf_annotation --vcf_in tests/fixtures/1kGP.chr1.1000.vcf --vcf_out annotated_output.vcf.gz --vrs_pickle_out allele_dicts.pkl --seqrepo_root_dir ~/seqrepo/latest
```
The command above uses an example VCF. Replace the `--vcf_out` and `--vrs_pickle_out` values with your desired output file paths. The output VCF can be compressed (`.vcf.gz`) or uncompressed (`.vcf`).
Also, see the [VRS Annotator workflow](https://dockstore.org/workflows/github.com/ohsu-comp-bio/vrs-annotator/VRSAnnotator:main?tab=info) on Dockstore for a way to do this on Terra.
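Once a VCF has been annotated, the VRS IDs can be read back from each record. The sketch below assumes the annotator stores them in an INFO field named `VRS_Allele_IDs` as a comma-separated list; check the header of your own annotated file to confirm the field name. The helper `extract_vrs_ids` is hypothetical and the IDs in the sample record are placeholders, not real identifiers.

```python
def extract_vrs_ids(record_line: str, info_key: str = "VRS_Allele_IDs") -> list[str]:
    """Return the IDs stored under `info_key` in one VCF record line."""
    fields = record_line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated VCF columns")
    info = fields[7]  # the INFO column: semicolon-separated key=value pairs
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == info_key:
            return [item for item in value.split(",") if item]
    return []  # the record carries no annotation under this key


# Placeholder record; real IDs are "ga4gh:VA." followed by a digest
record = (
    "chr1\t10330\t.\tC\tA\t.\tPASS\t"
    "AC=1;VRS_Allele_IDs=ga4gh:VA.PLACEHOLDER_REF,ga4gh:VA.PLACEHOLDER_ALT"
)
ids = extract_vrs_ids(record)
print(ids)
```

For anything beyond a quick check, a dedicated VCF library is a better fit than manual string splitting, since it also handles compressed files and header metadata.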
## Contributing
This project is open to contributions from the research community. If you are interested in contributing to the project, please contact the project team.
See the [contributing guide](CONTRIBUTING.md) for more information on how to contribute to the project.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE.md) file for details.
Raw data
{
"_id": null,
"home_page": "https://github.com/ohsu-comp-bio/vrs_anvil_toolkit",
"name": "vrs-anvil-toolkit",
"maintainer": null,
"docs_url": null,
"requires_python": "<4,>=3.10",
"maintainer_email": null,
"keywords": "anvil terra bioinformatics",
"author": "Ellrott Lab",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/30/e2/1c238e2e9b27aa2d130dbf763a6ab99e73c319effd4ed35106235e922970/vrs_anvil_toolkit-0.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Commons utilities",
"version": "0.1.0",
"project_urls": {
"Bug Reports": "https://github.com/ohsu-comp-bio/vrs_anvil_toolkit/issues",
"Homepage": "https://github.com/ohsu-comp-bio/vrs_anvil_toolkit",
"Source": "https://github.com/ohsu-comp-bio/vrs_anvil_toolkit"
},
"split_keywords": [
"anvil",
"terra",
"bioinformatics"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "30e21c238e2e9b27aa2d130dbf763a6ab99e73c319effd4ed35106235e922970",
"md5": "c9428046e3d76e7f700df5ad1f07c864",
"sha256": "19f4f51276791a1c79262336167794ecacca3d58dd295d606a38e4de40e6c44d"
},
"downloads": -1,
"filename": "vrs_anvil_toolkit-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "c9428046e3d76e7f700df5ad1f07c864",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4,>=3.10",
"size": 26528,
"upload_time": "2024-07-19T17:36:11",
"upload_time_iso_8601": "2024-07-19T17:36:11.268326Z",
"url": "https://files.pythonhosted.org/packages/30/e2/1c238e2e9b27aa2d130dbf763a6ab99e73c319effd4ed35106235e922970/vrs_anvil_toolkit-0.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-07-19 17:36:11",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ohsu-comp-bio",
"github_project": "vrs_anvil_toolkit",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "ga4gh.vrs",
"specs": [
[
"==",
"2.0.0a10"
]
]
},
{
"name": "diskcache",
"specs": []
},
{
"name": "biocommons.seqrepo",
"specs": []
},
{
"name": "glom",
"specs": []
},
{
"name": "click",
"specs": []
},
{
"name": "pyyaml",
"specs": []
},
{
"name": "google",
"specs": []
},
{
"name": "requests",
"specs": []
},
{
"name": "boto3",
"specs": []
},
{
"name": "tqdm",
"specs": []
},
{
"name": "google-cloud-storage",
"specs": []
},
{
"name": "psutil",
"specs": []
},
{
"name": "setuptools",
"specs": []
}
],
"lcname": "vrs-anvil-toolkit"
}