gene-fetch

Name	gene-fetch
Version	1.0.14
home_page	https://github.com/bge-barcoding/gene_fetch
Summary	Gene Fetch: High-throughput NCBI Sequence Retrieval Tool
upload_time	2025-07-28 12:34:51
maintainer	None
docs_url	None
author	D. Parsons
requires_python	>=3.9
license	MIT
keywords	bioinformatics ncbi sequence genomics taxonomy barcodes

<p align="center">
  <img src="gene_fetch_logo.svg" width="400" alt="gene_fetch_logo">
</p>

[![PyPI version](https://img.shields.io/pypi/v/gene-fetch.svg)](https://pypi.org/project/gene-fetch/)
[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/gene-fetch/README.html)
[![Python versions](https://img.shields.io/pypi/pyversions/gene-fetch.svg)](https://pypi.org/project/gene-fetch/)
[![status](https://joss.theoj.org/papers/2ce8ec99977083e2fa095223aa193538/status.svg)](https://joss.theoj.org/papers/2ce8ec99977083e2fa095223aa193538)
[![Github Action test](https://github.com/bge-barcoding/gene_fetch/workflows/Test%20gene-fetch/badge.svg)](https://github.com/bge-barcoding/gene_fetch/actions)

# GeneFetch 
Gene Fetch enables high-throughput retrieval of sequence data from NCBI's GenBank sequence database based on taxonomy IDs (taxids) or taxonomic hierarchies. It can retrieve protein and/or nucleotide sequences for various genes, including protein-coding genes (e.g., cox1, cytb, rbcL, matK) and rRNA genes (e.g., 16S, 18S).


## Highlight features
- Fetch protein and/or nucleotide sequences from the NCBI GenBank database.
- Handles both direct nucleotide sequences and protein-linked nucleotide searches (CDS extraction includes fallback mechanisms for atypical annotation formats).
- Support for both protein-coding and rDNA genes.
- Customisable length filtering thresholds for protein and nucleotide sequences (default: protein=500aa, nucleotide=1000bp).
- Default "batch" mode processes multiple input taxa based on a user-specified CSV file.
- Configurable "single" mode (-s/--single) for retrieving a specified number of target sequences for a particular taxon (default length thresholds can be bypassed by setting the value to zero or a negative number).
- Automatic taxonomy traversal: when no sequences are found at the input taxonomic level, the tool uses the fetched NCBI taxonomic lineage for the given taxid to search progressively higher ranks. For example, a search that starts at species level escalates rank by rank (species->phylum) until a suitable sequence is found.
- Taxonomic validation: validates the taxonomy of fetched sequences against the input taxonomic hierarchy, avoiding potential taxonomic homonyms (i.e., when the same taxon name is used for different taxa across the tree of life).
- Robust error handling, progress tracking, and logging, with compliance to NCBI API rate limits (10 requests/second). Caches taxonomy lookups for reduced API calls.
- Handles complex sequence features (e.g., complement strands, joined sequences, WGS entries) in addition to 'simple' CDS extraction (if `--type nucleotide/both`). The tool avoids "unverified" sequences and WGS entries that contain no sequence data (i.e., master records).
- 'Checkpointing': if a run fails/crashes, gene-fetch can be rerun using the same arguments and parameters, and it will resume from where it stopped (unless `--clean` is specified).
- When more than 50 matching GenBank records are found for a sample, the tool fetches summary information for all matches (using NCBI esummary API), orders the records by sequence length, and processes the longest sequences first.
- Can output the corresponding GenBank (.gb) file for each fetched nucleotide and/or protein sequence.
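
The taxonomy traversal described in the list above can be pictured with a short Python sketch. This is only an illustration of the idea, not Gene Fetch's actual code: the function name `search_with_escalation`, the `search` callback, and the toy taxids and accession are all hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical callback type: takes a taxid and returns an accession,
# or None when no suitable sequence is found at that rank.
SearchFn = Callable[[int], Optional[str]]


def search_with_escalation(
    lineage: Sequence[Tuple[str, int]],
    search: SearchFn,
) -> Optional[Tuple[str, int, str]]:
    """Try each (rank, taxid) from most to least specific; return the first hit."""
    for rank, taxid in lineage:
        accession = search(taxid)
        if accession is not None:
            return rank, taxid, accession
    return None  # nothing found at any rank


# Toy data: made-up taxids and accession, for illustration only.
lineage = [("species", 1001), ("genus", 1002), ("family", 1003), ("phylum", 1004)]
fake_database = {1002: "ACC0001.1"}  # only the genus-level taxid has a record

result = search_with_escalation(lineage, fake_database.get)
print(result)  # ('genus', 1002, 'ACC0001.1')
```

Passing the search step in as a callback keeps the escalation logic separate from any network access, so it can be exercised with a plain dictionary as shown.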

## Contents
 - [Installation](#installation)
 - [Usage](#usage)
 - [Examples](#examples)
 - [Input](#input)
 - [Output](#output)
 - [Cluster](#running-genefetch-on-a-cluster)
 - [Supported targets](#supported-targets)
 - [Benchmarking](#benchmarking)
 - [Future development](#future-development)
 - [Contributions and guidelines](#contributions-and-guidelines)
 - [Authorship and citation](#authorship--citation)


## Installation
- Due to the risk of dependency conflicts, it's recommended to install Gene Fetch in a Conda environment.
- Conda must be installed first; see the [Miniconda installation guide](https://www.anaconda.com/docs/getting-started/miniconda/install).
- Once installed:
```bash
# Create new environment
conda create -n gene-fetch

# Activate environment
conda activate gene-fetch
```

- Gene Fetch and all necessary dependencies can then be installed via [Bioconda](https://anaconda.org/bioconda/gene-fetch), [PyPI](https://pypi.org/project/gene-fetch/#description), or by specifying `environment.yaml`:
```bash
# Install via bioconda
conda install bioconda::gene-fetch

# Or, install via pip
pip install gene-fetch

# Or, via environment specification
conda env update --name gene-fetch -f environment.yaml --prune

# Verify installation
gene-fetch --help
```

- If you would rather clone this repository and run a standalone version of Gene Fetch, you can do so as follows:
```bash
# Clone the repository
git clone https://github.com/bge-barcoding/gene_fetch.git
cd gene_fetch

# Activate conda environment (once created), and install gene-fetch (+ dependencies) via your preferred method.

# Run standalone Gene Fetch:
python /path/to/gene_fetch.py [options]

```
  
## Recommended: Testing
- The Gene Fetch package includes some basic tests for each module, which we recommend running after installation.
```bash
# Clone the repository
git clone https://github.com/bge-barcoding/gene_fetch.git
cd gene_fetch

# Install pytest
pip install pytest

# [Optional] Locally install Gene Fetch in editable mode from source (when inside `gene_fetch`) - enables testing of source code in development
pip install -e .

# Run tests
pytest
```
* The tests take a few minutes to run. Expect one warning about API credentials, as these are not provided in the basic tests.

## Usage
```bash
gene-fetch --gene <gene_name> --type <sequence_type> --in <samples.csv> --out <output_directory> --email example@example.co.uk --api-key 1234567890
```
* `--help`: Show usage help and exit.

### Required arguments
* `-g/--gene`: Name of the gene to search for in the NCBI GenBank database (e.g., cox1/16s/rbcl).
* `-t/--type`: Sequence type to fetch; 'protein', 'nucleotide', or 'both' ('both' first searches for and fetches a protein sequence, then fetches the corresponding nucleotide CDS for that protein sequence).
* `-i/--in`: Path to input CSV file containing sample IDs and TaxIDs (see [Input](#input) section below).
* `-i2/--in2`: Path to alternative input CSV file containing sample IDs and taxonomic information for each sample (see [Input](#input) section below).
* `-o/--out`: Path to output directory. The directory will be created if it does not exist.
* `-e/--email` and `-k/--api-key`: Email address and associated API key for your NCBI account. An NCBI account is required to run this tool because API access is otherwise strictly rate-limited. Information on how to create an NCBI account and find your API key can be found [here](https://support.nlm.nih.gov/kbArticle/?pn=KA-05317).
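
To show why the API key matters, here is a minimal Python sketch of a request throttle that spaces calls to stay within a requests-per-second budget (10 per second is the limit this README quotes). The `Throttle` class is hypothetical and is not taken from the Gene Fetch source; it only illustrates the general technique.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Space calls at least 1 / max_per_second seconds apart (hypothetical helper)."""

    def __init__(
        self,
        max_per_second: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        # Record the moment at which the request is actually released.
        self._last_call = now + delay
        return delay


throttle = Throttle(max_per_second=10)
for _ in range(3):
    throttle.wait()
    # ... issue one NCBI request here ...
```

The clock and sleep functions are injectable so the spacing logic can be checked without real waiting.
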
### Optional arguments
* `-ps/--protein-size`: Minimum protein sequence length filter. Applies to both 'batch' and 'single' search modes (default: 500).
* `-ns/--nucleotide-size`: Minimum nucleotide sequence length filter. Applies to both 'batch' and 'single' search modes (default: 1000).
* `-s/--single`: Taxonomic ID for 'single' sequence search mode (`-i` and `-i2` are ignored when run with `-s`). 'single' mode fetches all (or N, if `--max-sequences` is specified) target gene or protein sequences on GenBank for a specific taxonomic ID.
* `-ms/--max-sequences`: Maximum number of sequences to fetch for a specific taxonomic ID (only applies when run in 'single' mode).
* `-b/--genbank`: Saves genbank (.gb) files for fetched nucleotide and/or protein sequences to `genbank/` (applies when run in 'batch' or 'single' mode).
* `-c/--clear`: Forces a clean (re)start by clearing the output directory regardless of previous run parameters. If `--clear` is omitted and gene-fetch is rerun with the same arguments and parameters, checkpointing is enabled.
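
As a rough illustration of how the length thresholds and `--max-sequences` interact, the Python sketch below filters toy records by a minimum length, orders them longest first, and keeps at most N. It is not the tool's implementation: `select_records` is a hypothetical name, and treating the threshold as inclusive (`>=`) is an assumption. The bypass for zero or negative thresholds follows the behaviour described in the feature list.

```python
from typing import List, Optional, Tuple

Record = Tuple[str, int]  # (accession, sequence length)


def select_records(
    records: List[Record],
    min_length: int,
    max_sequences: Optional[int] = None,
) -> List[Record]:
    """Filter by minimum length, sort longest first, keep at most max_sequences."""
    if min_length > 0:  # zero or negative: threshold is bypassed
        records = [r for r in records if r[1] >= min_length]  # inclusive: assumption
    ordered = sorted(records, key=lambda r: r[1], reverse=True)
    return ordered if max_sequences is None else ordered[:max_sequences]


# Accessions and lengths reused from the example output table in this README.
records = [("PQ645072.1", 1501), ("PQ645071.1", 1537), ("PP355486.1", 581)]

print(select_records(records, min_length=1000))
# [('PQ645071.1', 1537), ('PQ645072.1', 1501)]
print(select_records(records, min_length=0, max_sequences=1))
# [('PQ645071.1', 1537)]
```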


## Examples
Fetch both protein and nucleotide sequences for COI with default sequence length thresholds, and store the corresponding genbank records.
```bash
gene-fetch -e your.email@domain.com -k your_api_key \
            -g cox1 -o ./output_dir -i ./data/samples.csv \
            --type both --genbank
```

Fetch COI nucleotide sequences using sample taxonomic information, applying a minimum nucleotide sequence length of 1000bp.
```bash
gene-fetch -e your.email@domain.com -k your_api_key \
            -g cox1 -o ./output_dir -i2 ./data/samples_taxonomy.csv \
            --type nucleotide --nucleotide-size 1000
```

Retrieve up to 100 available rbcL protein sequences >400aa for _Arabidopsis thaliana_ (taxid: 3702).
```bash
gene-fetch -e your.email@domain.com -k your_api_key \
            -g rbcL -o ./output_dir -s 3702 \
            --type protein --protein-size 400 --max-sequences 100
```
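
The commands above can also be assembled from Python, for example when scripting runs over several genes. The sketch below only builds the argument list from the flags documented in this README. The helper `build_command` is hypothetical, and its rule that exactly one of `-i`, `-i2` or `-s` must be given is a convenience check in this sketch, not a statement about how gene-fetch validates its arguments.

```python
import shlex
from typing import List, Optional


def build_command(
    gene: str,
    seq_type: str,
    out_dir: str,
    email: str,
    api_key: str,
    samples_csv: Optional[str] = None,    # -i
    taxonomy_csv: Optional[str] = None,   # -i2
    single_taxid: Optional[int] = None,   # -s
    genbank: bool = False,
) -> List[str]:
    """Return a gene-fetch argument list using flags documented in this README."""
    sources = [samples_csv, taxonomy_csv, single_taxid]
    if sum(s is not None for s in sources) != 1:
        # Sketch-level check (assumption): require exactly one input source.
        raise ValueError("provide exactly one of samples_csv, taxonomy_csv, single_taxid")
    cmd = ["gene-fetch", "-e", email, "-k", api_key,
           "-g", gene, "-o", out_dir, "--type", seq_type]
    if samples_csv is not None:
        cmd += ["-i", samples_csv]
    elif taxonomy_csv is not None:
        cmd += ["-i2", taxonomy_csv]
    else:
        cmd += ["-s", str(single_taxid)]
    if genbank:
        cmd.append("--genbank")
    return cmd


cmd = build_command("cox1", "both", "./output_dir",
                    "your.email@domain.com", "your_api_key",
                    samples_csv="./data/samples.csv", genbank=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once gene-fetch is installed and real credentials are supplied.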


## Input
**Example 'samples.csv' input file (-i/--in)**
| ID | taxid |
| --- | --- |
| sample-1  | 177658 |
| sample-2 | 177627 |
| sample-3 | 3084599 |

**Example 'samples_taxonomy.csv' input file (-i2/--in2)**
| ID | phylum | class | order | family | genus | species |
| --- | --- | --- | --- | --- | --- | --- |
| sample-1  | Arthropoda | Insecta | Diptera | Acroceridae | Astomella | |
| sample-2 | Arthropoda | Insecta | Hemiptera | Cicadellidae | Psammotettix | Psammotettix sabulicola |
| sample-3 | Arthropoda | Insecta | Trichoptera | Limnephilidae | Dicosmoecus | Dicosmoecus palatus |

* Leave a rank blank if it is unknown or not needed. At least one rank must be supplied for each sample.
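
If the sample list already lives in a script or database, the input files can be generated rather than typed by hand. The following Python sketch writes both layouts with the standard `csv` module. It assumes the files are comma-delimited with exactly the column names shown in the tables above; the function names are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, TextIO, Tuple

TAXONOMY_COLUMNS = ["ID", "phylum", "class", "order", "family", "genus", "species"]


def write_samples_csv(rows: Iterable[Tuple[str, int]], handle: TextIO) -> None:
    """Write the two-column (ID, taxid) layout used with -i/--in."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["ID", "taxid"])
    writer.writerows(rows)


def write_taxonomy_csv(rows: Iterable[Dict[str, str]], handle: TextIO) -> None:
    """Write the taxonomy layout used with -i2/--in2; missing ranks are left blank."""
    writer = csv.DictWriter(handle, fieldnames=TAXONOMY_COLUMNS,
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


samples_buffer = io.StringIO()
write_samples_csv([("sample-1", 177658), ("sample-2", 177627)], samples_buffer)
print(samples_buffer.getvalue())

taxonomy_buffer = io.StringIO()
write_taxonomy_csv(
    [{"ID": "sample-1", "phylum": "Arthropoda", "class": "Insecta",
      "order": "Diptera", "family": "Acroceridae", "genus": "Astomella"}],
    taxonomy_buffer,
)
print(taxonomy_buffer.getvalue())
```

To write real files, replace the `io.StringIO()` buffers with `open("samples.csv", "w", newline="")`.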

## Output
### 'Batch' mode
```
output_dir/
├── genbank/                    # Genbank (.gb) files for each fetched nucleotide and/or protein sequence.
│   ├── nucleotide/  
│   └── protein/
├── nucleotide/                 # Nucleotide sequences. Only populated if '--type nucleotide/both' utilised.
│   ├── sample-1.fasta   
│   ├── sample-2.fasta
│   └── ...
├── protein/                    # Protein sequences. Only populated if '--type protein/both' utilised.
│   ├── sample-1.fasta   
│   ├── sample-2.fasta
│   └── ...
├── sequence_references.csv     # Sequence metadata.
├── failed_searches.csv         # Failed search attempts (if any).
└── gene_fetch.log              # Log.
```

**sequence_references.csv output example**
| ID | taxid | protein_accession | protein_length | nucleotide_accession | nucleotide_length | matched_rank | ncbi_taxonomy | reference_name | protein_reference_path | nucleotide_reference_path |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sample-1 | 177658 | AHF21732.1 | 510 | KF756944.1 | 1530 | genus:Apatania | Eukaryota; ...; Apataniinae; Apatania | sample-1 | abs/path/to/protein_references/sample-1.fasta | abs/path/to/protein_references/sample-1_dna.fasta |
| sample-2 | 2719103 | QNE85983.1 | 518 | MT410852.1 | 1557 | species:Isoptena serricornis | Eukaryota; ...; Chloroperlinae; Isoptena | sample-2 | abs/path/to/protein_references/sample-2.fasta | abs/path/to/protein_references/sample-2_dna.fasta |
| sample-3 | 1876143 | YP_009526503.1 | 512 | NC_039659.1 | 1539 | genus:Triaenodes | Eukaryota; ...; Triaenodini; Triaenodes | sample-3 | abs/path/to/protein_references/sample-3.fasta | abs/path/to/protein_references/sample-3_dna.fasta |
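
Because `matched_rank` records the rank at which a sequence was finally found, it is a quick way to see how often the search had to escalate above species level. The Python sketch below counts ranks in a small inline CSV that reuses some columns and rows from the example above. It assumes the file is comma-delimited with the column names shown; `count_matched_ranks` is a hypothetical helper.

```python
import csv
import io
from collections import Counter
from typing import TextIO

# A subset of the columns from the example table above.
example_csv = """ID,taxid,protein_accession,protein_length,nucleotide_accession,nucleotide_length,matched_rank
sample-1,177658,AHF21732.1,510,KF756944.1,1530,genus:Apatania
sample-2,2719103,QNE85983.1,518,MT410852.1,1557,species:Isoptena serricornis
sample-3,1876143,YP_009526503.1,512,NC_039659.1,1539,genus:Triaenodes
"""


def count_matched_ranks(handle: TextIO) -> Counter:
    """Count how many samples were matched at each taxonomic rank."""
    counts: Counter = Counter()
    for row in csv.DictReader(handle):
        # matched_rank looks like "genus:Apatania"; keep the part before the colon.
        rank, _, _name = row["matched_rank"].partition(":")
        counts[rank] += 1
    return counts


rank_counts = count_matched_ranks(io.StringIO(example_csv))
print(dict(rank_counts))  # {'genus': 2, 'species': 1}
```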


### 'Single' mode
```
output_dir/
├── genbank/                         # Genbank (.gb) files for each fetched nucleotide and/or protein sequence.
├── nucleotide/                      # Nucleotide sequences. Only populated if '--type nucleotide/both' utilised.
│   ├── ACCESSION1_dna.fasta   
│   ├── ACCESSION2_dna.fasta
│   └── ...
├── ACCESSION1.fasta                 # Protein sequences.
├── ACCESSION2.fasta
├── fetched_nucleotide_sequences.csv # Sequence metadata. Only populated if '--type nucleotide/both' utilised.
├── fetched_protein_sequences.csv    # Sequence metadata. Only populated if '--type protein/both' utilised.
├── failed_searches.csv              # Failed search attempts (if any).
└── gene_fetch.log                   # Log.
```

**fetched_protein|nucleotide_sequences.csv output example**
| ID | taxid | Description |
| --- | --- | --- |
| PQ645072.1 | 1501 | Ochlerotatus nigripes isolate Pool11 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |
| PQ645071.1 | 1537 | Ochlerotatus nigripes isolate Pool10 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |
| PQ645070.1 | 1501 | Ochlerotatus impiger isolate Pool2 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |
| PQ645069.1 | 1518 | Ochlerotatus impiger isolate Pool1 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |
| PP355486.1 | 581 | Aedes scutellaris isolate NC.033 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |


## Running GeneFetch on a cluster
- See 'gene_fetch.sh' for running gene_fetch.py on an HPC cluster (SLURM job scheduler).
- Edit 'mem' and/or 'cpus-per-task' to set memory and CPU/threads. Allocating many CPUs is unnecessary, as Gene Fetch is not parallelised (yet). The tool should run well with 4-10G memory and 1-2 CPUs.
- Change paths and variables as needed.
- Run 'gene_fetch.sh' with:
```
sbatch gene_fetch.sh
```

## Supported targets
GeneFetch will work with targets other than those listed below, but it has hard-coded name variations and 'smarter' searching for the listed targets. More targets can be added if necessary (see 'class config').
- cox1/COI/cytochrome c oxidase subunit I
- cox2/COII/cytochrome c oxidase subunit II
- cox3/COIII/cytochrome c oxidase subunit III
- cytb/cob/cytochrome b
- nd1/NAD1/NADH dehydrogenase subunit 1
- nd2/NAD2/NADH dehydrogenase subunit 2
- rbcL/RuBisCO/ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit
- matK/maturase K/maturase type II intron splicing factor
- 16S ribosomal RNA/16s
- SSU/18s
- LSU/28s
- 12S ribosomal RNA/12s
- ITS (ITS1-5.8S-ITS2)
- ITS1/internal transcribed spacer 1
- ITS2/internal transcribed spacer 2
- tRNA-Leucine/trnL


## Benchmarking
| Sample Description | Run Mode | Target | Input File | Data Type | Memory | CPUs | Run Time (hh:mm:ss) |
|--------------------|----------|--------|------------|-----------|--------|------|----------|
| 570 Arthropod samples | Batch | COI | taxonomy.csv | Both | 4G | 1 | 01:34:47 |
| 570 Arthropod samples | Batch | COI | samples.csv | Both (+ genbank) | 4G | 1 | 01:42:37 |
| 570 Arthropod samples | Batch | COI | samples.csv | Nucleotide | 4G | 1 | 01:07:53 |
| 570 Arthropod samples | Batch | ND1 | samples.csv | Nucleotide (>500bp) | 4G | 1 | 01:23:26 |
| All available (30) _A. thaliana_ sequences | Single | rbcL | N/A | Protein (>300aa) | 4G | 1 | 00:00:25 |
| 1000 Culicidae sequences | Single | COI | N/A | Nucleotide (>500bp) | 4G | 1 | 00:31:05 |
| 1000 _M. tuberculosis_ sequences | Single | 16S | N/A | Nucleotide | 4G | 1 | 01:23:54 |
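
To turn a run time from the table into a rough per-sample figure when planning a larger job, a small conversion helps. The Python sketch below uses the first benchmark row (570 samples in 01:34:47); the helper name `hms_to_seconds` is hypothetical, and actual throughput will vary with NCBI load and the taxa involved.

```python
def hms_to_seconds(value: str) -> int:
    """Convert an hh:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


total_seconds = hms_to_seconds("01:34:47")  # first row of the table
per_sample = total_seconds / 570            # 570 arthropod samples

print(total_seconds)                        # 5687
print(f"{per_sample:.2f} s per sample")     # 9.98 s per sample
```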

## Future Development
- Add optional alignment of retrieved sequences
- Further improve efficiency of record searching and selecting the longest sequence
- Add support for additional genetic markers beyond the currently supported set
- Add BOLD query fallback if no 'quality' sequence is found in GenBank


## Contributions and guidelines
First off, thanks for taking the time to contribute! ❤️

- If you have a question, please first read the available [Documentation](https://github.com/bge-barcoding/gene_fetch/blob/main/README.md). It is also worth searching the existing [Issues](https://github.com/bge-barcoding/gene_fetch/issues) for one that might answer it. If you find a relevant issue and still need clarification, you can post your question in that issue.
- If you still need clarification, or want to report a possible bug or unexpected behaviour, we recommend opening an [Issue](https://github.com/bge-barcoding/gene_fetch/issues/) and providing as much context as you can about the behaviour you expected and the behaviour you observed.
- If you want to suggest a new feature or a minor improvement to existing functionality, please make your case for the feature/enhancement by opening an [Issue](https://github.com/bge-barcoding/gene_fetch/issues/new) or by creating a pull request with your contribution (which will then be evaluated as a possible addition). We aim to address any issues as soon as possible.

## Authorship & citation
GeneFetch was written by Dan Parsons & Ben Price @ NHMUK (2025).

If you use GeneFetch, please cite our publication: **[XYZ]()**

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/bge-barcoding/gene_fetch",
    "name": "gene-fetch",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "bioinformatics, ncbi, sequence, genomics, taxonomy, barcodes",
    "author": "D. Parsons",
    "author_email": "d.parsons@nhm.ac.uk>, B. Price <b.price@nhm.ac.uk",
    "download_url": "https://files.pythonhosted.org/packages/53/5e/cd257f6fd032a05f9ed527776b5bdf3d063de041b8304e993979ee77dbcf/gene_fetch-1.0.14.tar.gz",
    "platform": null,
    "description": "<p align=\"center\">\n  <img src=\"gene_fetch_logo.svg\" width=\"400\" alt=\"gene_fetch_logo\">\n</p>\n\n[![PyPI version](https://img.shields.io/pypi/v/gene-fetch.svg)](https://pypi.org/project/gene-fetch/)\n[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/gene-fetch/README.html)\n[![Python versions](https://img.shields.io/pypi/pyversions/gene-fetch.svg)](https://pypi.org/project/gene-fetch/)\n[![status](https://joss.theoj.org/papers/2ce8ec99977083e2fa095223aa193538/status.svg)](https://joss.theoj.org/papers/2ce8ec99977083e2fa095223aa193538)\n[![Github Action test](https://github.com/bge-barcoding/gene_fetch/workflows/Test%20gene-fetch/badge.svg)](https://github.com/bge-barcoding/gene_fetch/actions)\n\n# GeneFetch \nGene Fetch enables high-throughput retreival of sequence data from NCBI's GenBank sequence database based on taxonomy IDs (taxids) or taxonomic heirarchies. It can retrieve both protein and/or nucleotide sequences for various genes, including protein-coding genes (e.g., cox1, cytb, rbcl, matk) and rRNA genes (e.g., 16S, 18S).\n\n\n## Highlight features\n- Fetch protein and/or nucleotide sequences from NCBI GenBank database.\n- Handles both direct nucleotide sequences and protein-linked nucleotide searches (CDS extraction includes fallback mechanisms for atypical annotation formats).\n- Support for both protein-coding and rDNA genes.\n- Customisable length filtering thresholds for protein and nucleotide sequences (default: protein=500aa. 
nucleotide=1000bp).\n- Default \"batch\" mode processes multiple input taxa based on a user specified CSV file.\n- Configurable \"single\" mode (-s/--single) for retrieving a specified number of target sequences for a particular taxon (default length thresholds can be bypassed by setting the value to zero or a negative number).\n- Automatic taxonomy traversal: Uses fetched NCBI taxonomic lineage for a given taxid when sequences are not found at the input taxonomic level. i.e., Search at given taxid level (e.g., species), if no sequences are found, escalate species->phylum until a suitable sequence is found.\n- Taxonomic validation: validates fetched sequence taxonomy against input taxonomic heirarchy, avoiding potential taxonomic homonyms (i.e. when the same taxon name is used for different taxa across the tree of life).\n- Robust error handling, progress tracking, and logging, with compliance to NCBI API rate limits (10 requests/second). Caches taxonomy lookups for reduced API calls.\n- Handles complex sequence features (e.g., complement strands, joined sequences, WGS entries) in addition to 'simple' cds extaction (if --type nucleotide/both). The tool avoids \"unverified\" sequences and WGS entries not containing sequence data (i.e. 
master records).\n- 'Checkpointing': if a run fails/crashes, gene-fetch can be rerun using the same arguments and parameters, and it will resume from where it stopped (unless `--clean` is specified).\n- When more than 50 matching GenBank records are found for a sample, the tool fetches summary information for all matches (using NCBI esummary API), orders the records by sequence length, and processes the longest sequences first.\n- Can output corresponding genbank (.gb) files for each fetched nucleotide and/or protein sequences\n\n## Contents\n - [Installation](#installation)\n - [Usage](#usage)\n - [Examples](#Examples)\n - [Input](#input)\n - [Output](#output)\n - [Cluster](#running-gene_fetch-on-a-cluster)\n - [Supported targets](#supported-targets)\n - [Notes](#notes)\n - [Benchmarking](#benchmarking)\n - [Future developments](#future-developments)\n - [Contributions and citation](#contributions-and-citations)\n\n\n## Installation\n- Due to the risk of dependency conflicts, it's recommended to install Gene Fetch in a Conda environment.\n- First Conda needs to be installed, which can be done from [here](https://www.anaconda.com/docs/getting-started/miniconda/install).\n- Once installed:\n```bash\n# Create new environment\nconda create -n gene-fetch\n\n# Activate environment\nconda activate gene-fetch\n```\n\n- Gene Fetch and all necessary dependencies can then be installed via [Bioconda](https://anaconda.org/bioconda/gene-fetch), [PyPI](https://pypi.org/project/gene-fetch/#description), or by specifying `environment.yaml`:\n```bash\n# Install via bioconda\nconda install bioconda::gene-fetch\n\n# Or, install via pip\npip install gene-fetch\n\n# Or, via environment specification\nconda env update --name gene-fetch -f environment.yaml --prune\n\n# Verify installation\ngene-fetch --help\n```\n\n- If you would rather clone this repository and run a standalone version of Gene Fetch for some reason, you can do that as follows:\n```bash\n# Clone the repository\ngit clone 
https://github.com/bge-barcoding/gene_fetch.git\ncd gene_fetch\n\n# Activate conda environment (once created), and install gene-fetch (+ dependencies) via your preferred method.\n\n# Run standalone Gene Fetch:\npython /path/to/gene_fetch.py [options]\n\n```\n  \n## Recommended: Testing\n- The Gene Fetch package includes some basic tests for each module that we recommend are run after installation.\n```bash\n# Clone the repository\ngit clone https://github.com/bge-barcoding/gene_fetch.git\ncd gene_fetch\n\n# Install pytest\npip install pytest\n\n# [Optional] Locally install Gene Fetch in editable mode from source (when inside `gene_fetch`) - enables testing of source code in development\npip install -e .\n\n# Run tests\npytest\n```\n* This will take a few minutes to run the tests. You will get 1 warning regarding API credentials as these are not provided in the basic tests.\n\n## Usage\n```bash\ngene-fetch --gene <gene_name> --type <sequence_type> --in <samples.csv> --out <output_directory> --email example@example.co.uk --api-key 1234567890\n```\n* `--help`: Show usage help and exit.\n\n### Required arguments\n* `-g/--gene`: Name of gene to search for in NCBI GenBank database (e.g., cox1/16s/rbcl).\n* `-t/--type`: Sequence type to fetch; 'protein', 'nucleotide', or 'both' ('both' will initially search and fetch a protein sequence, and then fetches the corresponding nucleotide CDS for that protein sequence).\n* `-i/--in`: Path to input CSV file containing sample IDs and TaxIDs (see [Input](#input) section below).\n* `-i2/--in2`: Path to alternative input CSV file containing sample IDs and taxonomic information for each sample (see [Input](#input) section below).\n* `o/--out`: Path to output directory. The directory will be created if it does not exist.\n* `e/--email` and `-k/--api-key`: Email address and associated API key for NCBI account. 
An NCBI account is required to run this tool (due to otherwise strict API limitations) - information on how to create an NCBI account and find your API key can be found [here](https://support.nlm.nih.gov/kbArticle/?pn=KA-05317).\n### Optional arguments\n* `-ps/--protein-size`: Minimum protein sequence length filter. Applicable to mode 'batch' and 'single' search modes (default: 500).\n* `-ns/--nucleotide-size`: Minimum nucleotide sequence length filter. Applicable to mode 'batch' and 'single' search modes (default: 1000).\n* `s/--single`: Taxonomic ID for 'single' sequence search mode (`-i` and `-i2` are ignored when run with `-s` mode). 'single' mode will fetch all (or N if specifying `--max-sequences`) target gene or protein sequences on GenBank for a specific taxonomic ID.\n* `-ms/--max-sequences`: Maximum number of sequences to fetch for a specific taxonomic ID (only applies when run in 'single' mode).\n* `-b/--genbank`: Saves genbank (.gb) files for fetched nucleotide and/or protein sequences to `genbank/` (applies when run in 'batch' or 'single' mode).\n* `-c/--clear`: Forces clean (re)start by clearing output directory regardless of previous run parameters. 
If ommiting `--clear` and rerunning gene-fetch with the same arguments and parameters, checkpoing will be enabled.\n\n\n## Examples\nFetch both protein and nucleotide sequences for COI with default sequence length thresholds, and store the corresponding genbank records.\n```\ngene-fetch -e your.email@domain.com -k your_api_key \\\n            -g cox1 -o ./output_dir -i ./data/samples.csv \\\n            --type both --genbank\n```\n\nFetch COI nucleotide sequences using sample taxonomic information, applying a minimum nucleotide sequence length of 1000bp\n```\ngene-fetch -e your.email@domain.com -k your_api_key \\\n            -g cox1 -o ./output_dir -i2 ./data/samples_taxonomy.csv \\\n            --type nucleotide --nucleotide-size 1000\n```\n\nRetrieve 100 available rbcL protein sequences >400aa for _Arabidopsis thaliana_ (taxid: 3702).\n```\ngene-fetch -e your.email@domain.com -k your_api_key \\\n            -g rbcL -o ./output_dir -s 3702 \\\n            --type protein --protein-size 400 --max-sequences 100\n```\n\n\n## Input\n**Example 'samples.csv' input file (-i/--in)**\n| ID | taxid |\n| --- | --- |\n| sample-1  | 177658 |\n| sample-2 | 177627 |\n| sample-3 | 3084599 |\n\n**Example 'samples_taxonomy.csv' input file (-i2/--in2)**\n| ID | phylum | class | order | family | genus | species |\n| --- | --- | --- | --- | --- | --- | --- |\n| sample-1  | Arthropoda | Insecta | Diptera | Acroceridae | Astomella | |\n| sample-2 | Arthropoda | Insecta | Hemiptera | Cicadellidae | Psammotettix | Psammotettix sabulicola |\n| sample-3 | Arthropoda | Insecta | Trichoptera | Limnephilidae | Dicosmoecus | Dicosmoecus palatus |\n* Leave blank if taxonomic information not known/needed. 
At least one rank must be supplied for each sample.\n\n## Output\n### 'Batch' mode\n```\noutput_dir/\n\u251c\u2500\u2500 genbank/                    # Genbank (.gb) files for each fetched nucleotide and/or protein sequence.\n\u2502   \u251c\u2500\u2500 nucleotide/  \n\u2502   \u251c\u2500\u2500 protein/  \n\u251c\u2500\u2500 nucleotide/                 # Nucleotide sequences. Only populated if '--type nucleotide/both' utilised.\n\u2502   \u251c\u2500\u2500 sample-1.fasta   \n\u2502   \u251c\u2500\u2500 sample-2.fasta\n\u2502   \u2514\u2500\u2500 ...\n\u251c\u2500\u2500 protein/                    # Protein sequences. Only populated if '--type protein/both' utilised.\n\u2502   \u251c\u2500\u2500 sample-1.fasta   \n\u2502   \u251c\u2500\u2500 sample-2.fasta\n\u2502   \u2514\u2500\u2500 ...\n\u251c\u2500\u2500 sequence_references.csv     # Sequence metadata.\n\u251c\u2500\u2500 failed_searches.csv         # Failed search attempts (if any).\n\u2514\u2500\u2500 gene_fetch.log              # Log.\n```\n\n**sequence_references.csv output example**\n| ID | taxid | protein_accession | protein_length | nucleotide_accession | nucleotide_length | matched_rank | ncbi_taxonomy | reference_name | protein_reference_path | nucleotide_reference_path |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| sample-1 | 177658 | AHF21732.1 | 510 | KF756944.1 | 1530 | genus:Apatania | Eukaryota; ...; Apataniinae; Apatania | sample-1 | abs/path/to/protein_references/sample-1.fasta | abs/path/to/protein_references/sample-1_dna.fasta |\n| sample-2 | 2719103 | QNE85983.1 | 518 | MT410852.1 | 1557 | species:Isoptena serricornis | Eukaryota; ...; Chloroperlinae; Isoptena | sample-2 | abs/path/to/protein_references/sample-2.fasta | abs/path/to/protein_references/sample-2_dna.fasta |\n| sample-3 | 1876143 | YP_009526503.1 | 512 | NC_039659.1 | 1539 | genus:Triaenodes | Eukaryota; ...; Triaenodini; Triaenodes | sample-3 | abs/path/to/protein_references/sample-3.fasta | 
abs/path/to/protein_references/sample-3_dna.fasta |\n\n\n### 'Single' mode\n```\noutput_dir/\n\u251c\u2500\u2500 genbank/                         # Genbank (.gb) files for each fetched nucleotide and/or protein sequence.\n\u251c\u2500\u2500 nucleotide/                      # Nucleotide sequences. Only populated if '--type nucleotide/both' utilised.\n\u2502   \u251c\u2500\u2500 ACCESSION1_dna.fasta   \n\u2502   \u251c\u2500\u2500 ACCESSION2_dna.fasta\n\u2502   \u2514\u2500\u2500 ...\n\u251c\u2500\u2500 ACCESSION1.fasta                 # Protein sequences.\n\u251c\u2500\u2500 ACCESSION2.fasta\n\u251c\u2500\u2500 fetched_nucleotide_sequences.csv # Sequence metadata. Only populated if '--type nucleotide/both' utilised.\n\u251c\u2500\u2500 fetched_protein_sequences.csv    # Sequence metadata. Only populated if '--type protein/both' utilised.\n\u251c\u2500\u2500 failed_searches.csv              # Failed search attempts (if any).\n\u2514\u2500\u2500 gene_fetch.log                   # Log.\n```\n\n**fetched_protein|nucleotide_sequences.csv output example**\n| ID | taxid | Description |\n| --- | --- | --- |\n| PQ645072.1 | 1501 | Ochlerotatus nigripes isolate Pool11 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |\n| PQ645071.1 | 1537 | Ochlerotatus nigripes isolate Pool10 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |\n| PQ645070.1 | 1501 | Ochlerotatus impiger isolate Pool2 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |\n| PQ645069.1 | 1518\t| Ochlerotatus impiger isolate Pool1 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |\n| PP355486.1 | 581 | Aedes scutellaris isolate NC.033 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |\n\n\n## Running GeneFetch on a cluster\n- See 'gene_fetch.sh' for running gene_fetch.py on a HPC cluster (SLURM job schedular). 
\n- Edit 'mem' and/or 'cpus-per-task' to set memory and CPU/threads - allocating lots of CPUs is unecessary as Gene Fetch is not paralellised (yet). The tool should run well with 4-10G memory and 1-2 CPUs.\n- Change paths and variables as needed.\n- Run 'gene_fetch.sh' with:\n```\nsbatch gene_fetch.sh\n```\n\n## Supported targets\nGeneFetch will function with other targets than those listed below, but it has hard-coded name variations and 'smarter' searching for the listed targets. More targets can be added if necessary (see 'class config').\n- cox1/COI/cytochrome c oxidase subunit I\n- cox2/COII/cytochrome c oxidase subunit II\n- cox3/COIII/cytochrome c oxidase subunit III\n- cytb/cob/cytochrome b\n- nd1/NAD1/NADH dehydrogenase subunit 1\n- nd2/NAD2/NADH dehydrogenase subunit 2\n- rbcL/RuBisCO/ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit\n- matK/maturase K/maturase type II intron splicing factor\n- 16S ribosomal RNA/16s\n- SSU/18s\n- LSU/28s\n- 12S ribosomal RNA/12s\n- ITS (ITS1-5.8S-ITS2)\n- ITS1/internal transcribed spacer 1\n- ITS2/internal transcribed spacer 2\n- tRNA-Leucine/trnL\n\n\n## Benchmarking\n| Sample Description | Run Mode | Target | Input File | Data Type | Memory | CPUs | Run Time (hh:mm:ss) |\n|--------------------|----------|--------|------------|-----------|--------|------|----------|\n| 570 Arthropod samples | Batch | COI | taxonomy.csv | Both | 4G | 1 | 01:34:47 |\n| 570 Arthropod samples | Batch | COI | samples.csv | Both (+ genbank) | 4G | 1 | 01:42:37 |\n| 570 Arthropod samples | Batch | COI | samples.csv | Nucleotide | 4G | 1 | 1:07:53  |\n| 570 Arthropod samples | Batch | ND1 | samples.csv | Nucleotide (>500bp) | 4G | 1 | 1:23:26 |\n| All available (30) _A. thaliana_ sequences | Single | rbcL | N/A | Protein (>300aa) | 4G | 1 | 00:00:25 |\n| 1000 Culicidae sequences | Single | COI | N/A | nucleotide (>500bp) | 4G | 1 | 0031:05 |\n| 1000 _M. 
tuberculosis_ sequences | Single | 16S | N/A | Nucleotide | 4G | 1 | 01:23:54 |\n\n## Future Development\n- Add optional alignment of retrieved sequences\n- Further improve efficiency of record searching and selecting the longest sequence\n- Add support for additional genetic markers beyond the currently supported set\n- Add BOLD query fallback if no 'quality' sequence is found in GenBank\n\n\n## Contributions and guidelines\nFirst off, thanks for taking the time to contribute! \u2764\ufe0f\n\n- If you have a question, please first read the available [Documentation](https://github.com/bge-barcoding/gene_fetch/blob/main/README.md). It may also be worth searching for existing [Issues](https://github.com/bge-barcoding/gene_fetch/issues) that might answer your question(s). If you find a relevant issue but still need clarification, you can post your question in that issue.\n- If you still need clarification or want to report a possible bug/unexpected behaviour, we recommend opening an [Issue](https://github.com/bge-barcoding/gene_fetch/issues/) and providing as much context as you can about the behaviour you were expecting and the behaviour you're running into.\n- If you want to suggest a novel feature or minor improvements to existing functionality, please make your case for the feature/enhancement by opening an [Issue](https://github.com/bge-barcoding/gene_fetch/issues/new) or by creating a pull request with your contribution (at which point it will be evaluated as a possible addition). We aim to address any issues as soon as possible.\n\n## Authorship & citation\nGeneFetch was written by Dan Parsons & Ben Price @ NHMUK (2025).\n\nIf you use GeneFetch, please cite our publication: **[XYZ]()**\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Gene Fetch: High-throughput NCBI Sequence Retrieval Tool",
    "version": "1.0.14",
    "project_urls": {
        "Bug Tracker": "https://github.com/bge-barcoding/gene_fetch/issues",
        "Homepage": "https://github.com/bge-barcoding/gene_fetch",
        "Repository": "https://github.com/bge-barcoding/gene_fetch"
    },
    "split_keywords": [
        "bioinformatics",
        " ncbi",
        " sequence",
        " genomics",
        " taxonomy",
        " barcodes"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f80475d83ca146db5006ee44167995a23c55a4891a834e1a6ffbd81497fc6505",
                "md5": "71ec5245ea7eda08b7470c305c01804b",
                "sha256": "c8ce4b126fb14f0da78b087c4d7952384a76d5d6c185d19146592d5ebc31598e"
            },
            "downloads": -1,
            "filename": "gene_fetch-1.0.14-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "71ec5245ea7eda08b7470c305c01804b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 41809,
            "upload_time": "2025-07-28T12:34:50",
            "upload_time_iso_8601": "2025-07-28T12:34:50.416633Z",
            "url": "https://files.pythonhosted.org/packages/f8/04/75d83ca146db5006ee44167995a23c55a4891a834e1a6ffbd81497fc6505/gene_fetch-1.0.14-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "535ecd257f6fd032a05f9ed527776b5bdf3d063de041b8304e993979ee77dbcf",
                "md5": "25244240fc87671f0b4eccc686115e59",
                "sha256": "d202b9790975d25e4f456e9cef42475ec74b4387e3b6229d2589751e10daa025"
            },
            "downloads": -1,
            "filename": "gene_fetch-1.0.14.tar.gz",
            "has_sig": false,
            "md5_digest": "25244240fc87671f0b4eccc686115e59",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 65281,
            "upload_time": "2025-07-28T12:34:51",
            "upload_time_iso_8601": "2025-07-28T12:34:51.789015Z",
            "url": "https://files.pythonhosted.org/packages/53/5e/cd257f6fd032a05f9ed527776b5bdf3d063de041b8304e993979ee77dbcf/gene_fetch-1.0.14.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-28 12:34:51",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "bge-barcoding",
    "github_project": "gene_fetch",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "gene-fetch"
}

D. Parsons