# tp53 & seshat python tools
[![Python Tests](https://github.com/clintval/tp53/actions/workflows/tests_python.yml/badge.svg?branch=main)](https://github.com/clintval/tp53/actions/workflows/tests_python.yml?query=branch%3Amain)
[![PyPi Release](https://badge.fury.io/py/tp53.svg)](https://badge.fury.io/py/tp53)
[![Python Versions](https://img.shields.io/badge/python-3.11_|_3.12_|_3.13-blue)](https://github.com/clintval/typeline)
[![mypy](https://www.mypy-lang.org/static/mypy_badge.svg)](https://mypy-lang.org/)
[![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
[![basedpyright](https://img.shields.io/badge/basedpyright-checked-42b983)](https://docs.basedpyright.com/latest/)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://docs.astral.sh/ruff/)
Python tools for programmatically annotating VCFs with the Seshat TP53 database.
## Installation
The Python package can be installed with `pip`:
```console
pip install tp53
```
## Upload a VCF to Seshat
Upload a VCF to the [Seshat TP53 annotation server](http://vps338341.ovh.net/) using a headless browser.
```bash
❯ python -m tp53.seshat.upload_vcf \
--input "sample.library.vcf" \
--email "example@gmail.com"
```
```console
INFO:tp53.seshat.upload_vcf:Uploading 0 %...
INFO:tp53.seshat.upload_vcf:Uploading 53%...
INFO:tp53.seshat.upload_vcf:Uploading 53%...
INFO:tp53.seshat.upload_vcf:Uploading 60%...
INFO:tp53.seshat.upload_vcf:Uploading 60%...
INFO:tp53.seshat.upload_vcf:Uploading 66%...
INFO:tp53.seshat.upload_vcf:Uploading 66%...
INFO:tp53.seshat.upload_vcf:Uploading 80%...
INFO:tp53.seshat.upload_vcf:Uploading 80%...
INFO:tp53.seshat.upload_vcf:Upload complete!
```
This tool is used to programmatically configure and upload batch variants in VCF format to the Seshat annotation server.
The tool works by building a headless Chrome browser instance and then interacting with the Seshat website directly through simulated key presses and mouse clicks.
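
When the upload is one step in a larger Python pipeline, the command above can be assembled as an argument list and handed to `subprocess`. The following is a minimal sketch: the helper name `build_upload_command` is hypothetical and not part of `tp53`, and only the module path and the `--input` and `--email` flags come from the example above.

```python
import sys


def build_upload_command(vcf_path: str, email: str) -> list[str]:
    """Build the argv list for the upload invocation shown above."""
    return [
        sys.executable,  # reuse the interpreter that has tp53 installed
        "-m",
        "tp53.seshat.upload_vcf",
        "--input",
        vcf_path,
        "--email",
        email,
    ]


command = build_upload_command("sample.library.vcf", "example@gmail.com")
# To actually run the upload (requires tp53, Chrome, and network access):
# import subprocess; subprocess.run(command, check=True)
```

Passing a list rather than a shell string avoids quoting problems when a VCF path contains spaces.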
###### VCF Input Requirements
Seshat will not let the user know why a VCF fails to annotate, but it has been observed that Seshat can fail to parse some of [VarDictJava](https://github.com/AstraZeneca-NGS/VarDictJava)'s structural variants (SVs) as valid variant records.
One solution that has worked in the past is to remove SVs.
The following command will exclude all variants with a non-empty SVTYPE INFO key:
```bash
❯ bcftools view sample.library.vcf \
--exclude 'SVTYPE!="."' \
> sample.library.noSV.vcf
```
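
If `bcftools` is not available, roughly the same filter can be sketched in plain Python. This is an approximation rather than a drop-in replacement: it drops every record whose INFO column carries an `SVTYPE` key, keeps header lines untouched, and assumes an uncompressed, tab-delimited VCF. The function name `drop_sv_records` and the example records are made up for illustration.

```python
from collections.abc import Iterable, Iterator


def drop_sv_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield VCF lines, skipping records that have an SVTYPE INFO key."""
    for line in lines:
        if line.startswith("#"):
            yield line  # header and meta lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        info = fields[7] if len(fields) > 7 else "."  # INFO is the 8th column
        keys = {entry.split("=", 1)[0] for entry in info.split(";")}
        if "SVTYPE" not in keys:
            yield line


# Made-up records for illustration only.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr17\t7674220\t.\tC\tT\t.\tPASS\tDP=100\n",
    "chr17\t7675000\t.\tA\t<DEL>\t.\tPASS\tSVTYPE=DEL;DP=40\n",
]
kept = list(drop_sv_records(example))
```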
###### Automation
There are no terms and conditions posted on the Seshat annotation server's website, and there is no server-side `robots.txt` rule set.
In lieu of usage terms, we strongly encourage all users of this script to respect the Seshat resource by adhering to the following best practices:
- **Minimize Load**: Limit the rate of requests to the server
- **Minimize Connections**: Limit the number of concurrent requests
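
One way to follow both practices when annotating many VCFs is to upload them strictly one at a time with a pause in between. The sketch below is illustrative only: `upload_sequentially` is a hypothetical helper, the `upload` callable stands in for however you invoke the upload tool, and the 60-second default delay is an assumption rather than a documented Seshat limit.

```python
import time
from collections.abc import Callable, Iterable


def upload_sequentially(
    vcf_paths: Iterable[str],
    upload: Callable[[str], None],
    delay_seconds: float = 60.0,  # assumed default, not a Seshat requirement
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Upload one VCF at a time, pausing between uploads; return the count."""
    count = 0
    for path in vcf_paths:
        if count > 0:
            sleep(delay_seconds)  # pause between uploads, not before the first
        upload(path)  # a single connection at any moment
        count += 1
    return count
```

The `sleep` parameter is injectable so the loop can be tested without actually waiting.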
###### Environment Setup
This script relies on Google Chrome:
```console
❯ brew install --cask google-chrome
```
Some distributions of macOS may require you to authenticate the Chrome driver ([link](https://stackoverflow.com/a/60362134)).
## Download a Seshat Annotation from Gmail
Download [Seshat](http://vps338341.ovh.net/) VCF annotations by awaiting a server-generated email.
```bash
❯ python -m tp53.seshat.find_in_gmail \
--input "sample.library.vcf" \
--output "sample.library" \
--credentials "~/.secrets/credentials.json"
```
```console
INFO:tp53.seshat.find_in_gmail:Successfully logged into the Gmail service.
INFO:tp53.seshat.find_in_gmail:Querying for a VCF named: sample.library.vcf
INFO:tp53.seshat.find_in_gmail:Searching Gmail messages with: sample.library.vcf from:support@genevia.fi newer_than:5h subject:"Results of batch analysis"
INFO:tp53.seshat.find_in_gmail:Message found with the following metadata: {'id': '193c310d2714b389', 'threadId': '193c30b7244e2662'}
INFO:tp53.seshat.find_in_gmail:Message contents are as follows:
INFO:tp53.seshat.find_in_gmail: Results of batch analysis
INFO:tp53.seshat.find_in_gmail: Analyzed batch file:
INFO:tp53.seshat.find_in_gmail: sample.library.vcf
INFO:tp53.seshat.find_in_gmail: Time taken to run the analysis:
INFO:tp53.seshat.find_in_gmail: 0 minutes 10 seconds
INFO:tp53.seshat.find_in_gmail: Summary:
INFO:tp53.seshat.find_in_gmail: The input file contained
INFO:tp53.seshat.find_in_gmail: 23 mutations out of which
INFO:tp53.seshat.find_in_gmail: 23 were TP53 mutations.
INFO:tp53.seshat.find_in_gmail:Writing attachment to ZIP archive: sample.library.vcf.seshat.zip
INFO:tp53.seshat.find_in_gmail:Extracting ZIP archive: sample.library.vcf.seshat.zip
INFO:tp53.seshat.find_in_gmail:Output file renamed to: sample.library.seshat.short-20241214_034753_129732.tsv
INFO:tp53.seshat.find_in_gmail:Output file renamed to: sample.library.seshat.long-20241214_034753_217420.tsv
```
This tool is used to programmatically wait for, and retrieve, a batch results email from the Seshat TP53 annotation server.
The tool works by searching a user-controlled Gmail inbox for a recent Seshat email that contains the result annotations for a given VCF input file, by name.
It is critically important to be aware that there is no way to prove which annotation files, as they arrive via email, are to be linked with which VCF file on disk.
This tool assists in the correct pairing of VCF input files, and subsequent annotation files, by letting you specify how many hours back in time you will let the Gmail query search (`--newer-than`).
Limiting the window of time in which an email should have arrived minimizes the chance of discovering stale annotation files from an old Seshat execution in the cases where VCF filenames may be non-unique.
If the batch results email from the Seshat annotation server has not yet arrived, this tool will wait a set number of seconds (`--wait-for`) before exiting with an exception.
It normally takes less than 1 minute for the Seshat server to annotate an average TP53-only VCF.
###### Search Criteria
The following rules are used to find annotation files:
1. The email contains the filename of the input VCF
2. The email subject line contains "Results of batch analysis"
3. The email is no more than `--newer-than` hours old
4. The email is from the address [support@genevia.fi](mailto:support@genevia.fi)
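
These rules map onto Gmail search operators. The sketch below reconstructs the query string from the log line in the example output above; `build_gmail_query` is a hypothetical name, and the package's actual implementation may assemble the query differently.

```python
def build_gmail_query(vcf_name: str, newer_than_hours: int) -> str:
    """Compose a Gmail search string matching the four rules above."""
    return (
        f"{vcf_name} "  # rule 1: the email mentions the input VCF filename
        "from:support@genevia.fi "  # rule 4: the sender address
        f"newer_than:{newer_than_hours}h "  # rule 3: the recency window
        'subject:"Results of batch analysis"'  # rule 2: the subject line
    )


query = build_gmail_query("sample.library.vcf", 5)
```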
###### Outputs
- `<output>.seshat.long-\d{8}_\d{6}_\d{6}.tsv`: The long format Seshat annotations for the input VCF
- `<output>.seshat.short-\d{8}_\d{6}_\d{6}.tsv`: The short format Seshat annotations for the input VCF
- `<output>.seshat.zip`: The original ZIP archive from Seshat
###### Gmail Authentication
You must create a Google developer's OAuth file.
First-time 2FA may be required depending on the configuration of your Gmail service.
If 2FA is required, then this script will block until you acknowledge your 2FA prompt.
A 2FA prompt is often delivered through an auto-opening web browser.
To create a Google developer's OAuth file, navigate to the following URL and follow the instructions:
- [Authorize Credentials for a Desktop Application](https://developers.google.com/gmail/api/quickstart/python#authorize_credentials_for_a_desktop_application)
Ensure your OAuth file is configured as a "Desktop app" and then download the credentials as JSON.
Save your credentials file somewhere safe, ideally in a secure user folder with restricted permissions (`chmod 700`).
Set your OAuth file permissions to also restrict unwarranted access (`chmod 600`).
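
The same permissions can be applied and checked from Python on POSIX systems. This is a sketch with hypothetical helper names; note that `restrict_credentials` changes the mode of whatever directory contains the file, so point it at a dedicated secrets folder rather than your home directory.

```python
import stat
from pathlib import Path


def restrict_credentials(credentials: Path) -> None:
    """Apply chmod 700 to the containing folder and chmod 600 to the file."""
    credentials.parent.chmod(0o700)  # owner-only access to the folder
    credentials.chmod(0o600)  # owner-only read/write on the file


def is_private(path: Path) -> bool:
    """Return True when group and others have no permission bits set."""
    return stat.S_IMODE(path.stat().st_mode) & 0o077 == 0
```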
This script will store a cached token after first-time authentication is successful.
This cached token can be found in the user's home directory within a hidden directory.
Token caching greatly speeds up continued executions of this script.
As of now, the token is cached at the following location:
```bash
"~/.tp53/seshat/seshat-gmail-find-token.pickle"
```
If the cached token is missing, or becomes stale, then you will need to provide your OAuth credentials file.
A typical Google developer's OAuth file is of the format:
```json
{
  "installed": {
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "client_id": "272111863110-csldkfjlsdkfjlksdjflksdincie.apps.googleusercontent.com",
    "client_secret": "sdlfkjsdlkjfijciejijcei",
    "project_id": "gmail-access-2398293892838",
    "redirect_uris": [
      "urn:ietf:wg:oauth:2.0:oob",
      "http://localhost"
    ],
    "token_uri": "https://oauth2.googleapis.com/token"
  }
}
```
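
Before running the tool, it can help to confirm that a downloaded credentials file has this general shape. The check below is a sketch under assumptions: the set of keys treated as required is a guess based on the example above, not a list taken from the Google or `tp53` documentation, and `missing_oauth_keys` is a hypothetical helper.

```python
import json

# Assumed minimum for a "Desktop app" credentials file (not an official list).
REQUIRED_KEYS = {"client_id", "client_secret", "auth_uri", "token_uri"}


def missing_oauth_keys(text: str) -> set[str]:
    """Return the expected keys that are absent from the credentials JSON."""
    data = json.loads(text)
    installed = data.get("installed")
    if not isinstance(installed, dict):
        # A missing top-level "installed" object suggests another client type.
        return {"installed"}
    return REQUIRED_KEYS - installed.keys()
```

An empty set means the file has the expected shape; a result of `{"installed"}` suggests the credentials were not created as a "Desktop app".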
###### Server Failures
If Seshat fails to annotate the VCF file but still emails the user a response, then this tool will emit the email body to standard error and exit with a non-zero status.
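
A calling pipeline can rely on that exit status and capture standard error to surface the failure email. A minimal sketch using only the standard library; `run_and_report` is a hypothetical wrapper, and the command passed to it would be your own `find_in_gmail` invocation.

```python
import subprocess
from collections.abc import Sequence


def run_and_report(command: Sequence[str]) -> tuple[bool, str]:
    """Run a command; return (succeeded, captured standard error)."""
    completed = subprocess.run(
        command,
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,  # decode the captured streams to str
        check=False,  # inspect the return code ourselves
    )
    return completed.returncode == 0, completed.stderr
```

When the first element of the result is `False`, the second element is the text the tool wrote to standard error, which in the failure case described above includes the email body.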
## Development and Testing
See the [contributing guide](https://github.com/clintval/tp53/blob/main/python/CONTRIBUTING.md) for more information.
## References
- [Soussi, Thierry, et al. “Recommendations for Analyzing and Reporting TP53 Gene Variants in the High-Throughput Sequencing Era.” Human Mutation, vol. 35, no. 6, 2014, pp. 766–778., doi:10.1002/humu.22561](https://doi.org/10.1002/humu.22561)
Raw data
{
"_id": null,
"home_page": "https://github.com/clintval/tp53/blob/main/python",
"name": "tp53",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.11",
"maintainer_email": null,
"keywords": null,
"author": "Clint Valentine",
"author_email": "valentine.clint@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/c5/87/c0488f4f7d012b3d4a0420715ee788b6246dae2ef47f8c6b1b9375a60da1/tp53-0.13.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Python tools for programmatically annotating VCFs with the Seshat TP53 database.",
"version": "0.13.0",
"project_urls": {
"Bug Tracker": "https://github.com/clintval/tp53/issues",
"Documentation": "https://github.com/clintval/tp53/blob/main/python/README.md",
"Homepage": "https://github.com/clintval/tp53/blob/main/python",
"Repository": "https://github.com/clintval/tp53"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "17500da9bef684ea397cc5a2db16e08a4bde04acd5882f9b269f0018089a6c31",
"md5": "16fa884425e02dcecfbc6634aca59cd6",
"sha256": "7a016808dbdefd79972920c6226f218ad3f9705f6985b9a9c23ab04cc28fbdb7"
},
"downloads": -1,
"filename": "tp53-0.13.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "16fa884425e02dcecfbc6634aca59cd6",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.11",
"size": 17109,
"upload_time": "2024-12-16T03:08:47",
"upload_time_iso_8601": "2024-12-16T03:08:47.626998Z",
"url": "https://files.pythonhosted.org/packages/17/50/0da9bef684ea397cc5a2db16e08a4bde04acd5882f9b269f0018089a6c31/tp53-0.13.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "c587c0488f4f7d012b3d4a0420715ee788b6246dae2ef47f8c6b1b9375a60da1",
"md5": "b9dd1c5e7b2fe63330ba4b6f90763fef",
"sha256": "d0a774773287d137a06b2afa5f5bf5e14dd03b4fe3b1774a6c67556db539326a"
},
"downloads": -1,
"filename": "tp53-0.13.0.tar.gz",
"has_sig": false,
"md5_digest": "b9dd1c5e7b2fe63330ba4b6f90763fef",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.11",
"size": 16605,
"upload_time": "2024-12-16T03:08:50",
"upload_time_iso_8601": "2024-12-16T03:08:50.361696Z",
"url": "https://files.pythonhosted.org/packages/c5/87/c0488f4f7d012b3d4a0420715ee788b6246dae2ef47f8c6b1b9375a60da1/tp53-0.13.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-16 03:08:50",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "clintval",
"github_project": "tp53",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "tp53"
}