<p align="center">
<img src="gene_fetch_logo.svg" width="400" alt="gene_fetch_logo">
</p>
[PyPI](https://pypi.org/project/gene-fetch/)
[Bioconda](http://bioconda.github.io/recipes/gene-fetch/README.html)
[JOSS](https://joss.theoj.org/papers/2ce8ec99977083e2fa095223aa193538)
[GitHub Actions](https://github.com/bge-barcoding/gene_fetch/actions)
# GeneFetch
Gene Fetch enables high-throughput retrieval of sequence data from NCBI databases based on taxonomy IDs (taxids) or taxonomic hierarchies. It can retrieve protein and/or nucleotide sequences for various genes, including protein-coding genes (e.g., cox1, cytb, rbcl, matk) and rRNA genes (e.g., 16S, 18S).
## Highlight features
- Fetch protein and/or nucleotide sequences from the NCBI GenBank database.
- Handles both direct nucleotide sequences and protein-linked nucleotide searches (CDS extraction includes fallback mechanisms for atypical annotation formats).
- Support for both protein-coding and rDNA genes.
- Customisable length filtering thresholds for protein and nucleotide sequences (default: protein=500aa, nucleotide=1000bp).
- Default "batch" mode processes multiple input taxa based on a user-specified CSV file.
- Configurable "single" mode (-s/--single) for retrieving a specified number of target sequences for a particular taxon (default length thresholds can be bypassed by setting the value to zero or a negative number).
- Automatic taxonomy traversal: uses the NCBI taxonomic lineage fetched for a given taxid when no sequences are found at the input taxonomic level. The search starts at the given taxid level (e.g., species) and, if no sequences are found, escalates rank by rank (species->phylum) until a suitable sequence is found.
- Taxonomic validation: validates fetched sequence taxonomy against the input taxonomic hierarchy, avoiding potential taxonomic homonyms (i.e. when the same taxon name is used for different taxa across the tree of life).
- Robust error handling, progress tracking, and logging, with compliance to NCBI API rate limits (10 requests/second). Caches taxonomy lookups for reduced API calls.
- Handles complex sequence features (e.g., complement strands, joined sequences, WGS entries) in addition to 'simple' CDS extraction (if --type nucleotide/both). The tool avoids "unverified" sequences and WGS entries not containing sequence data (i.e. master records).
- 'Checkpointing': if a run fails/crashes, gene-fetch can be rerun using the same arguments and parameters, and it will resume from where it stopped (unless `--clean` is specified).
- When more than 50 matching GenBank records are found for a sample, the tool fetches summary information for all matches (using NCBI esummary API), orders the records by sequence length, and processes the longest sequences first.
- Can output the corresponding GenBank (.gb) file for each fetched nucleotide and/or protein sequence.
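
The automatic taxonomy traversal listed above can be pictured with a short sketch. This is not Gene Fetch's actual implementation: `find_at_lowest_rank` and `search_rank` are hypothetical names, the lookup is a toy dictionary standing in for an NCBI query, and the species and subfamily entries in the lineage are illustrative values only.

```python
from typing import Callable, Optional

# Hypothetical lineage for one sample, ordered from most to least specific rank.
Lineage = list[tuple[str, str]]  # (rank, taxon name)


def find_at_lowest_rank(
    lineage: Lineage,
    search_rank: Callable[[str, str], list[str]],
) -> Optional[tuple[str, str]]:
    """Return (matched_rank, accession) from the first rank that yields a hit."""
    for rank, name in lineage:
        hits = search_rank(rank, name)  # stand-in for a real NCBI search
        if hits:
            return f"{rank}:{name}", hits[0]
    return None  # nothing found at any rank


# Toy stand-in for NCBI: only the genus has a record.
fake_db = {("genus", "Apatania"): ["AHF21732.1"]}
lineage = [
    ("species", "Apatania sp."),      # illustrative value
    ("genus", "Apatania"),
    ("subfamily", "Apataniinae"),
]
result = find_at_lowest_rank(lineage, lambda rank, name: fake_db.get((rank, name), []))
print(result)
```

In this toy run the species-level search finds nothing, so the match is reported at genus level, in the same `rank:name` form as the `matched_rank` column of the output CSV.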
## Contents
- [Installation](#installation)
- [Usage](#usage)
- [Examples](#examples)
- [Input](#input)
- [Output](#output)
- [Cluster](#running-genefetch-on-a-cluster)
- [Supported targets](#supported-targets)
- [Benchmarking](#benchmarking)
- [Future development](#future-development)
- [Contributions and guidelines](#contributions-and-guidelines)
- [Authorship and citation](#authorship--citation)
## Installation
- Because of the risk of dependency conflicts, we recommend installing Gene Fetch in a Conda environment.
- Conda must be installed first; see the [Miniconda installation guide](https://www.anaconda.com/docs/getting-started/miniconda/install).
- Once installed:
```bash
# Create new environment
conda create -n gene-fetch
# Activate environment
conda activate gene-fetch
```
- Gene Fetch and all necessary dependencies can then be installed via [Bioconda](https://anaconda.org/bioconda/gene-fetch), [PyPI](https://pypi.org/project/gene-fetch/#description), or by specifying `environment.yaml`:
```bash
# Install via bioconda
conda install bioconda::gene-fetch
# Or, install via pip
pip install gene-fetch
# Or, via environment specification
conda env update --name gene-fetch -f environment.yaml --prune
# Verify installation
gene-fetch --help
```
- If you would rather clone this repository and run a standalone version of Gene Fetch, you can do so as follows:
```bash
# Clone the repository
git clone https://github.com/bge-barcoding/gene_fetch.git
cd gene_fetch
# Activate conda environment (once created), and install gene-fetch (+ dependencies) via your preferred method.
# Run standalone Gene Fetch
python /path/to/gene_fetch.py [options]
```
## Recommended: Testing
- The Gene Fetch package includes basic tests for each module, which we recommend running after installation.
```bash
# Clone the repository
git clone https://github.com/bge-barcoding/gene_fetch.git
cd gene_fetch
# Install pytest
pip install pytest
# Run tests
pytest
```
- The tests take a few minutes to run. Expect one warning about API credentials, as these are not provided in the basic tests.
## Usage
```bash
gene-fetch --gene <gene_name> --type <sequence_type> --in <samples.csv> --out <output_directory> --email example@example.co.uk --api-key 1234567890
```
* `--help`: Show usage help and exit.
### Required arguments
* `-g/--gene`: Name of the gene to search for in the NCBI GenBank database (e.g., cox1/16s/rbcl).
* `-t/--type`: Sequence type to fetch; 'protein', 'nucleotide', or 'both' ('both' first searches for and fetches a protein sequence, then fetches the corresponding nucleotide CDS for that protein sequence).
* `-i/--in`: Path to input CSV file containing sample IDs and TaxIDs (see [Input](#input) section below).
* `-i2/--in2`: Path to alternative input CSV file containing sample IDs and taxonomic information for each sample (see [Input](#input) section below).
* `-o/--out`: Path to output directory. The directory will be created if it does not exist.
* `-e/--email` and `-k/--api-key`: Email address and associated API key for your NCBI account. An NCBI account is required to run this tool, because API limits are otherwise strict. Information on how to create an NCBI account and find your API key can be found [here](https://support.nlm.nih.gov/kbArticle/?pn=KA-05317).
### Optional arguments
* `-ps/--protein-size`: Minimum protein sequence length filter. Applicable to both 'batch' and 'single' search modes (default: 500).
* `-ns/--nucleotide-size`: Minimum nucleotide sequence length filter. Applicable to both 'batch' and 'single' search modes (default: 1000).
* `-s/--single`: Taxonomic ID for 'single' sequence search mode (`-i` and `-i2` are ignored when run with `-s`). 'single' mode fetches all (or N, if `--max-sequences` is specified) target gene or protein sequences on GenBank for a specific taxonomic ID.
* `-ms/--max-sequences`: Maximum number of sequences to fetch for a specific taxonomic ID (only applies when run in 'single' mode).
* `-b/--genbank`: Saves genbank (.gb) files for fetched nucleotide and/or protein sequences to `genbank/` (applies when run in 'batch' or 'single' mode).
* `-c/--clear`: Forces a clean (re)start by clearing the output directory regardless of previous run parameters. If `--clear` is omitted and gene-fetch is rerun with the same arguments and parameters, checkpointing will be enabled.
## Examples
Fetch both protein and nucleotide sequences for COI with default sequence length thresholds, and store the corresponding genbank records.
```bash
gene-fetch -e your.email@domain.com -k your_api_key \
-g cox1 -o ./output_dir -i ./data/samples.csv \
--type both --genbank
```
Fetch COI nucleotide sequences using sample taxonomic information, applying a minimum nucleotide sequence length of 1000bp.
```bash
gene-fetch -e your.email@domain.com -k your_api_key \
-g cox1 -o ./output_dir -i2 ./data/samples_taxonomy.csv \
--type nucleotide --nucleotide-size 1000
```
Retrieve up to 1000 available rbcL protein sequences >400aa for _Arabidopsis thaliana_ (taxid: 3702).
```bash
gene-fetch -e your.email@domain.com -k your_api_key \
-g rbcL -o ./output_dir -s 3702 \
--type protein --protein-size 400 --max-sequences 1000
```
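
When Gene Fetch is driven from a Python workflow, the command lines above can be assembled programmatically. The following is a minimal sketch using only flags that appear in this README; `build_command` is a hypothetical helper, not part of the gene-fetch package, and the email and API key values are placeholders.

```python
import shlex


def build_command(gene, seq_type, out_dir, email, api_key,
                  samples_csv=None, single_taxid=None, genbank=False):
    """Assemble a gene-fetch argument list from flags documented in this README."""
    # Batch mode needs an input CSV; single mode needs a taxid. Require exactly one.
    if (samples_csv is None) == (single_taxid is None):
        raise ValueError("supply exactly one of samples_csv or single_taxid")
    cmd = ["gene-fetch", "-e", email, "-k", api_key,
           "-g", gene, "-o", str(out_dir), "--type", seq_type]
    if samples_csv is not None:
        cmd += ["-i", str(samples_csv)]
    else:
        cmd += ["-s", str(single_taxid)]
    if genbank:
        cmd.append("--genbank")
    return cmd


cmd = build_command("cox1", "both", "./output_dir",
                    "your.email@domain.com", "your_api_key",
                    samples_csv="./data/samples.csv", genbank=True)
print(shlex.join(cmd))  # shell-quoted form, for logging or copy-paste
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once gene-fetch is installed and real credentials are supplied.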
## Input
**Example 'samples.csv' input file (-i/--in)**
| ID | taxid |
| --- | --- |
| sample-1 | 177658 |
| sample-2 | 177627 |
| sample-3 | 3084599 |
**Example 'samples_taxonomy.csv' input file (-i2/--in2)**
| ID | phylum | class | order | family | genus | species |
| --- | --- | --- | --- | --- | --- | --- |
| sample-1 | Arthropoda | Insecta | Diptera | Acroceridae | Astomella | |
| sample-2 | Arthropoda | Insecta | Hemiptera | Cicadellidae | Psammotettix | Psammotettix sabulicola |
| sample-3 | Arthropoda | Insecta | Trichoptera | Limnephilidae | Dicosmoecus | Dicosmoecus palatus |
* Leave a rank blank if the taxonomic information is not known or not needed. At least one rank must be supplied for each sample.
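
An input file for `-i/--in` can be generated with Python's standard `csv` module. The sketch below assumes the `ID` and `taxid` column headers shown in the example table are what the tool expects; `write_samples` is a hypothetical helper, and it writes to an in-memory buffer so the result can be inspected before saving.

```python
import csv
import io

# Sample IDs and taxids taken from the example 'samples.csv' table above.
rows = [("sample-1", 177658), ("sample-2", 177627), ("sample-3", 3084599)]


def write_samples(handle, rows):
    """Write a two-column sample sheet (ID, taxid) to an open text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["ID", "taxid"])
    writer.writerows(rows)


buffer = io.StringIO()
write_samples(buffer, rows)
print(buffer.getvalue())
```

To write to disk instead, open the target with `open("samples.csv", "w", newline="")` and pass that handle to `write_samples`.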
## Output
### 'Batch' mode
```
output_dir/
├── genbank/ # Genbank (.gb) files for each fetched nucleotide and/or protein sequence.
│ ├── nucleotide/
│ └── protein/
├── nucleotide/ # Nucleotide sequences. Only populated if '--type nucleotide/both' utilised.
│ ├── sample-1.fasta
│ ├── sample-2.fasta
│ └── ...
├── protein/ # Protein sequences. Only populated if '--type protein/both' utilised.
│ ├── sample-1.fasta
│ ├── sample-2.fasta
│ └── ...
├── sequence_references.csv # Sequence metadata.
├── failed_searches.csv # Failed search attempts (if any).
└── gene_fetch.log # Log.
```
**sequence_references.csv output example**
| ID | taxid | protein_accession | protein_length | nucleotide_accession | nucleotide_length | matched_rank | ncbi_taxonomy | reference_name | protein_reference_path | nucleotide_reference_path |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sample-1 | 177658 | AHF21732.1 | 510 | KF756944.1 | 1530 | genus:Apatania | Eukaryota; ...; Apataniinae; Apatania | sample-1 | abs/path/to/protein_references/sample-1.fasta | abs/path/to/protein_references/sample-1_dna.fasta |
| sample-2 | 2719103 | QNE85983.1 | 518 | MT410852.1 | 1557 | species:Isoptena serricornis | Eukaryota; ...; Chloroperlinae; Isoptena | sample-2 | abs/path/to/protein_references/sample-2.fasta | abs/path/to/protein_references/sample-2_dna.fasta |
| sample-3 | 1876143 | YP_009526503.1 | 512 | NC_039659.1 | 1539 | genus:Triaenodes | Eukaryota; ...; Triaenodini; Triaenodes | sample-3 | abs/path/to/protein_references/sample-3.fasta | abs/path/to/protein_references/sample-3_dna.fasta |
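
Because taxonomy traversal can match a sample at a higher rank than requested, it may be useful to tally the `matched_rank` column after a batch run. The sketch below parses an inline excerpt of the example table (a subset of its columns) rather than a real output file; `count_matched_ranks` is a hypothetical helper, and the `rank:name` format is assumed from the example rows.

```python
import csv
import io
from collections import Counter

# Inline excerpt of the example table above (subset of columns).
example = """ID,taxid,protein_accession,protein_length,nucleotide_accession,nucleotide_length,matched_rank
sample-1,177658,AHF21732.1,510,KF756944.1,1530,genus:Apatania
sample-2,2719103,QNE85983.1,518,MT410852.1,1557,species:Isoptena serricornis
sample-3,1876143,YP_009526503.1,512,NC_039659.1,1539,genus:Triaenodes
"""


def count_matched_ranks(handle):
    """Count how many samples were matched at each taxonomic rank."""
    counts = Counter()
    for row in csv.DictReader(handle):
        # matched_rank is assumed to look like 'rank:taxon name'.
        rank, _, _name = row["matched_rank"].partition(":")
        counts[rank] += 1
    return counts


rank_counts = count_matched_ranks(io.StringIO(example))
print(dict(rank_counts))
```

For a real run, pass an open handle to `sequence_references.csv` in the output directory instead of the inline string.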
### 'Single' mode
```
output_dir/
├── genbank/ # Genbank (.gb) files for each fetched nucleotide and/or protein sequence.
├── nucleotide/ # Nucleotide sequences. Only populated if '--type nucleotide/both' utilised.
│ ├── ACCESSION1_dna.fasta
│ ├── ACCESSION2_dna.fasta
│ └── ...
├── ACCESSION1.fasta # Protein sequences.
├── ACCESSION2.fasta
├── fetched_nucleotide_sequences.csv # Sequence metadata. Only populated if '--type nucleotide/both' utilised.
├── fetched_protein_sequences.csv # Sequence metadata. Only populated if '--type protein/both' utilised.
├── failed_searches.csv # Failed search attempts (if any).
└── gene_fetch.log # Log.
```
**fetched_protein|nucleotide_sequences.csv output example**
| ID | taxid | Description |
| --- | --- | --- |
| PQ645072.1 | 1501 | Ochlerotatus nigripes isolate Pool11 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |
| PQ645071.1 | 1537 | Ochlerotatus nigripes isolate Pool10 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |
| PQ645070.1 | 1501 | Ochlerotatus impiger isolate Pool2 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |
| PQ645069.1 | 1518 | Ochlerotatus impiger isolate Pool1 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |
| PP355486.1 | 581 | Aedes scutellaris isolate NC.033 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |
## Running GeneFetch on a cluster
- See 'gene_fetch.sh' for running gene_fetch.py on an HPC cluster (SLURM job scheduler).
- Edit 'mem' and/or 'cpus-per-task' to set memory and CPU/threads. Allocating many CPUs is unnecessary, as Gene Fetch is not parallelised (yet). The tool should run well with 4-10G memory and 1-2 CPUs.
- Change paths and variables as needed.
- Run 'gene_fetch.sh' with:
```bash
sbatch gene_fetch.sh
```
## Supported targets
GeneFetch will work with targets other than those listed below, but it has hard-coded name variations and 'smarter' searching for the listed targets. More targets can be added if necessary (see 'class config').
- cox1/COI/cytochrome c oxidase subunit I
- cox2/COII/cytochrome c oxidase subunit II
- cox3/COIII/cytochrome c oxidase subunit III
- cytb/cob/cytochrome b
- nd1/NAD1/NADH dehydrogenase subunit 1
- nd2/NAD2/NADH dehydrogenase subunit 2
- rbcL/RuBisCO/ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit
- matK/maturase K/maturase type II intron splicing factor
- 16S ribosomal RNA/16s
- SSU/18s
- LSU/28s
- 12S ribosomal RNA/12s
- ITS (ITS1-5.8S-ITS2)
- ITS1/internal transcribed spacer 1
- ITS2/internal transcribed spacer 2
- tRNA-Leucine/trnL
## Benchmarking
| Sample Description | Run Mode | Target | Input File | Data Type | Memory | CPUs | Run Time (hh:mm:ss) |
|--------------------|----------|--------|------------|-----------|--------|------|----------|
| 570 Arthropod samples | Batch | COI | taxonomy.csv | Both | 4G | 1 | 01:34:47 |
| 570 Arthropod samples | Batch | COI | samples.csv | Both (+ genbank) | 4G | 1 | 01:42:37 |
| 570 Arthropod samples | Batch | COI | samples.csv | Nucleotide | 4G | 1 | 01:07:53 |
| 570 Arthropod samples | Batch | ND1 | samples.csv | Nucleotide (>500bp) | 4G | 1 | 01:23:26 |
| All available (30) _A. thaliana_ sequences | Single | rbcL | N/A | Protein (>300aa) | 4G | 1 | 00:00:25 |
| 1000 Culicidae sequences | Single | COI | N/A | Nucleotide (>500bp) | 4G | 1 | 00:31:05 |
| 1000 _M. tuberculosis_ sequences | Single | 16S | N/A | Nucleotide | 4G | 1 | 01:23:54 |
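
To estimate run time for a different number of samples, the batch timings above can be converted to an approximate per-sample figure. This is simple arithmetic on two rows of the table, not a performance guarantee; actual times will depend on the taxa, NCBI load, and network conditions. `to_seconds` is a hypothetical helper.

```python
def to_seconds(hms):
    """Convert an 'hh:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Two batch runs from the benchmarking table, each covering 570 samples.
runs = {
    "taxonomy.csv, both": "01:34:47",
    "samples.csv, both + genbank": "01:42:37",
}
for label, run_time in runs.items():
    per_sample = to_seconds(run_time) / 570
    print(f"{label}: {per_sample:.1f} s per sample")
```

Both runs work out to roughly ten seconds per sample.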
## Future Development
- Add optional alignment of retrieved sequences
- Further improve efficiency of record searching and selecting the longest sequence
- Add support for additional genetic markers beyond the currently supported set
- Add BOLD query fallback if no 'quality' sequence is found in GenBank
## Contributions and guidelines
First off, thanks for taking the time to contribute! ❤️
- If you have any questions, we assume that you have read the available [Documentation](https://github.com/bge-barcoding/gene_fetch/blob/main/README.md). It may also be worth searching for existing [Issues](https://github.com/bge-barcoding/gene_fetch/issues) that might answer your question(s). If you find a suitable issue and still need clarification, you can write your question in that issue.
- If you feel you still need clarification or want to report a possible bug/unexpected behaviour, we recommend opening an [Issue](https://github.com/bge-barcoding/gene_fetch/issues/) and providing as much context as you can about the behaviour you were expecting and the behaviour you're running into.
- If you want to suggest a novel feature or minor improvements to existing functionality, please make your case for the feature/enhancement by opening an [Issue](https://github.com/bge-barcoding/gene_fetch/issues/new) or creating a pull request with your contribution (at which point it will be evaluated as a possible addition). We aim to address any issues as soon as possible.
## Authorship & citation
GeneFetch was written by Dan Parsons & Ben Price @ NHMUK (2025).
If you use GeneFetch, please cite our publication: **[XYZ]()**