Name: ncbi-genome-download
Version: 0.3.3
Home page: https://github.com/kblin/ncbi-genome-download/
Summary: Download genome files from the NCBI FTP server.
Upload time: 2023-07-28 12:49:22
Author: Kai Blin
License: Apache Software License

# NCBI Genome Downloading Scripts

[![PyPI release](https://img.shields.io/pypi/v/ncbi-genome-download.svg)](https://pypi.python.org/pypi/ncbi-genome-download/)
[![DOI](https://zenodo.org/badge/57950916.svg)](https://zenodo.org/badge/latestdoi/57950916)

Some scripts to download bacterial and fungal genomes from NCBI after they
restructured their FTP a while ago.

Idea shamelessly stolen from [Mick Watson's Kraken downloader
scripts](http://www.opiniomics.org/building-a-kraken-database-with-new-ftp-structure-and-no-gi-numbers/)
that can also be found in [Mick's GitHub
repo](https://github.com/mw55309/Kraken_db_install_scripts). However, Mick's
scripts are ~~written in Perl~~ specific to actually building a Kraken database
(as advertised).

So this is a set of scripts that focuses on the actual genome downloading.

## Installation

```bash
pip install ncbi-genome-download
```

Alternatively, clone this repository from GitHub, then run (in a Python virtual environment):

```bash
pip install .
```

If this fails on older versions of Python, try updating your `pip` tool first:

```bash
pip install --upgrade pip
```

and then rerun the `ncbi-genome-download` install.

Alternatively, `ncbi-genome-download` is packaged in `conda`.
Refer to the Anaconda/[miniconda](https://conda.io/miniconda.html) site to
install a distribution (highly recommended). With that installed, you can run:

```bash
conda install -c bioconda ncbi-genome-download
```

`ncbi-genome-download` is only developed and tested on Python releases still
under active support by the Python project. At the moment, this means versions
3.7, 3.8, 3.9, 3.10 and 3.11.
Specifically, no attempt at testing under Python versions older than 3.7 is
being made.

If your system is stuck on an older version of Python, consider using a tool like
[Homebrew](http://brew.sh) to obtain a more up-to-date version.

`ncbi-genome-download` 0.2.12 was the last version to support Python 2.

## Usage

To download all bacterial RefSeq genomes in GenBank format from NCBI, run the following:

```bash
ncbi-genome-download bacteria
```

Downloading multiple groups is also possible:

```bash
ncbi-genome-download bacteria,viral
```

**Note**: To see all available groups, see `ncbi-genome-download --help`, or
simply use `all` to check all groups. Naming a more specific group will reduce
the download size and the time needed to find the sequences to download.

If you're on a reasonably fast connection, you might want to try running
multiple downloads in parallel:

```bash
ncbi-genome-download bacteria --parallel 4
```

To download all fungal GenBank genomes from NCBI in GenBank format, run:

```bash
ncbi-genome-download --section genbank fungi
```

To download all viral RefSeq genomes in FASTA format, run:

```bash
ncbi-genome-download --formats fasta viral
```

It is possible to download multiple formats by supplying a list of formats or
simply downloading all formats:

```bash
ncbi-genome-download --formats fasta,assembly-report viral
ncbi-genome-download --formats all viral
```

To download only completed bacterial RefSeq genomes in GenBank format, run:

```bash
ncbi-genome-download --assembly-levels complete bacteria
```

It is possible to download multiple assembly levels at once by supplying a list:

```bash
ncbi-genome-download --assembly-levels complete,chromosome bacteria
```

To download only bacterial reference genomes from RefSeq in GenBank format, run:

```bash
ncbi-genome-download --refseq-categories reference bacteria
```

To download bacterial RefSeq genomes of the genus _Streptomyces_, run:

```bash
ncbi-genome-download --genera Streptomyces bacteria
```

**Note**: This is a simple string match on the organism name provided by NCBI only.

You can also use this with a slight trick to download genomes of a certain
species as well:

```bash
ncbi-genome-download --genera "Streptomyces coelicolor" bacteria
```

**Note**: The quotes are important. Again, this is a simple string match on the organism
name provided by the NCBI.

Specifying multiple genera is also possible:

```bash
ncbi-genome-download --genera "Streptomyces coelicolor,Escherichia coli" bacteria
```

You can also put genus names into a file, one organism per line, e.g.:

```text
Streptomyces
Amycolatopsis
```

Then, pass the path to that file (e.g. `my_genera.txt`) to the `--genera`
option, like so:

```bash
ncbi-genome-download --genera my_genera.txt bacteria
```

**Note**: The above command will download all _Streptomyces_ and _Amycolatopsis_
genomes from RefSeq.

You can make the string match fuzzy using the `--fuzzy-genus` option. This can
be handy if you need to match a value in the middle of the NCBI organism name,
like so:

```bash
ncbi-genome-download --genera coelicolor --fuzzy-genus bacteria
```

**Note**: The above command will download all bacterial genomes containing
"coelicolor" anywhere in their organism name from RefSeq.
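
The difference between the default match and `--fuzzy-genus` can be pictured
with a small sketch. It assumes the default is a prefix match on the organism
name and the fuzzy variant a case-insensitive substring match; the function
`matches_genus` and the sample names are illustrative and not taken from the
tool's source.

```python
def matches_genus(organism_name, genera, fuzzy=False):
    """Return True if organism_name matches any entry in genera.

    Assumption: the default mode is a prefix match, the fuzzy mode a
    case-insensitive substring match.
    """
    for genus in genera:
        if fuzzy:
            if genus.lower() in organism_name.lower():
                return True
        elif organism_name.startswith(genus):
            return True
    return False


# Illustrative organism names, as they might appear in an assembly summary.
names = [
    "Streptomyces coelicolor A3(2)",
    "Escherichia coli str. K-12 substr. MG1655",
]

# Prefix match: "coelicolor" is not at the start of any name.
print([n for n in names if matches_genus(n, ["coelicolor"])])
# Fuzzy match: "coelicolor" may appear anywhere in the name.
print([n for n in names if matches_genus(n, ["coelicolor"], fuzzy=True)])
```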

To download bacterial RefSeq genomes based on their NCBI species taxonomy ID, run:

```bash
ncbi-genome-download --species-taxids 562 bacteria
```

**Note**: The above command will download all RefSeq genomes belonging to
_Escherichia coli_.

To download a specific bacterial RefSeq genome based on its NCBI taxonomy ID, run:

```bash
ncbi-genome-download --taxids 511145 bacteria
```

**Note**: The above command will download the RefSeq genome belonging to
_Escherichia coli str. K-12 substr. MG1655_.
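
The two options differ in which column of the assembly summary they compare
against. The sketch below illustrates this with toy rows: only the IDs 562 and
511145 come from the text above, while the accessions, the third row and the
helper `filter_by_taxid` are made up for illustration.

```python
# Toy assembly summary rows; accessions are placeholders.
ROWS = [
    {"accession": "ACC_1", "taxid": 511145, "species_taxid": 562},
    {"accession": "ACC_2", "taxid": 562, "species_taxid": 562},
    {"accession": "ACC_3", "taxid": 999001, "species_taxid": 999000},  # invented IDs
]


def filter_by_taxid(rows, taxids=None, species_taxids=None):
    """Keep rows whose taxid / species_taxid is in the requested sets.

    A filter left as None is not applied.
    """
    selected = []
    for row in rows:
        if taxids is not None and row["taxid"] not in taxids:
            continue
        if species_taxids is not None and row["species_taxid"] not in species_taxids:
            continue
        selected.append(row)
    return selected


# Species-level filter: every assembly that belongs to species 562.
print([r["accession"] for r in filter_by_taxid(ROWS, species_taxids={562})])
# Exact taxid filter: only the assembly whose own taxid is 511145.
print([r["accession"] for r in filter_by_taxid(ROWS, taxids={511145})])
```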

It is also possible to download multiple species taxids or taxids by supplying
the numbers in a comma-separated list:

```bash
ncbi-genome-download --taxids 9606,9685 --assembly-levels chromosome vertebrate_mammalian
```

**Note**: The above command will download the reference genomes for cat and human.

In addition, you can put multiple species taxids or taxids into a file, one per line,
and pass that filename to the `--species-taxids` or `--taxids` parameters, respectively.

Assuming you had a file `my_taxids.txt` with the following contents:

```text
9606
9685
```

You could download the reference genomes for cat and human like this:

```bash
ncbi-genome-download --taxids my_taxids.txt --assembly-levels chromosome vertebrate_mammalian
```
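
Accepting either a comma-separated list or a file name for the same option
could be handled roughly as follows. This is a sketch under the assumption
that an argument naming an existing file is read line by line and anything
else is split on commas; `read_values` is a hypothetical helper, not part of
the tool.

```python
import os
import tempfile


def read_values(argument):
    """Return a list of values from a comma-separated string or a file path."""
    if os.path.isfile(argument):
        # One value per line; blank lines are skipped.
        with open(argument, encoding="utf-8") as handle:
            return [line.strip() for line in handle if line.strip()]
    return [item.strip() for item in argument.split(",") if item.strip()]


print(read_values("9606,9685"))

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "my_taxids.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("9606\n9685\n")
    print(read_values(path))
```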

It is also possible to create a human-readable directory structure alongside
the one mirroring the layout used by NCBI:

```bash
ncbi-genome-download --human-readable bacteria
```

This will use links to point to the appropriate files in the NCBI directory structure,
so it saves file space. Note that links are not supported on some Windows file
systems and some older versions of Windows.

It is also possible to re-run a previous download with the `--human-readable` option.
In this case, `ncbi-genome-download` will not download any new genome files, and
will just create the human-readable directory structure. Note that if any files have
been changed on the NCBI side, a file download will be triggered.
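
The link-based approach can be sketched in a few lines of Python. The
directory names, the placeholder accession `GCF_EXAMPLE` and the helper
`link_human_readable` are assumptions chosen for illustration; the layout the
tool actually creates may differ.

```python
import os
import tempfile


def link_human_readable(source_file, readable_root, organism_parts):
    """Symlink source_file into readable_root/<organism parts...>/.

    Returns the path of the link. An existing link is left in place.
    """
    target_dir = os.path.join(readable_root, *organism_parts)
    os.makedirs(target_dir, exist_ok=True)
    link_path = os.path.join(target_dir, os.path.basename(source_file))
    if not os.path.lexists(link_path):
        os.symlink(os.path.abspath(source_file), link_path)
    return link_path


with tempfile.TemporaryDirectory() as workdir:
    # Stand-in for a file in the NCBI-style mirror directory.
    mirror_dir = os.path.join(workdir, "refseq", "bacteria", "GCF_EXAMPLE")
    os.makedirs(mirror_dir)
    genome = os.path.join(mirror_dir, "genome.gbff.gz")
    with open(genome, "w", encoding="utf-8") as handle:
        handle.write("placeholder")

    link = link_human_readable(
        genome,
        os.path.join(workdir, "human_readable"),
        ["Escherichia", "coli", "K-12"],
    )
    print(os.path.relpath(link, workdir))
    print(os.path.islink(link))
```

Because only a link is created, the genome file is stored once no matter how
many readable paths point at it.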

There is a "dry-run" option to show which accessions would be downloaded, given
your filters:

```bash
ncbi-genome-download --dry-run bacteria
```

If you want to filter on the "relation to type material" column of the
assembly summary file, you can use the `--type-materials` option. Possible
values are "any", "all", "type", "reference", "synonym", "proxytype", and/or
"neotype". "any" will include assemblies with no relation to type material
value defined, while "all" will download only assemblies with a defined value.
Multiple values can be given, separated by commas:

```bash
ncbi-genome-download --type-materials type,reference
```
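
The semantics of "any", "all" and the specific values can be sketched as
below. To keep the example short, the toy rows store the short option values
directly in a hypothetical `type_material` key instead of the longer text
found in the real column, and `filter_type_material` is an illustrative
helper rather than the tool's code.

```python
# Toy rows; an empty string means no relation to type material is defined.
ROWS = [
    {"accession": "ACC_1", "type_material": "type"},
    {"accession": "ACC_2", "type_material": "reference"},
    {"accession": "ACC_3", "type_material": "neotype"},
    {"accession": "ACC_4", "type_material": ""},
]


def filter_type_material(rows, wanted):
    """Filter rows according to the requested type material values."""
    wanted = set(wanted)
    if "any" in wanted:
        # No restriction: rows without a defined value are kept too.
        return list(rows)
    if "all" in wanted:
        # Any defined value qualifies; undefined values are dropped.
        return [row for row in rows if row["type_material"]]
    return [row for row in rows if row["type_material"] in wanted]


for values in (["any"], ["all"], ["type", "reference"]):
    kept = filter_type_material(ROWS, values)
    print(values, [row["accession"] for row in kept])
```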

By default, `ncbi-genome-download` caches the assembly summary files for the
respective taxonomic groups for one day. You can bypass the cache with the
`--no-cache` option. The output of `--help` also shows the cache
directory, should you want to remove any of the cached files.
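
A one-day cache of this kind is often implemented by comparing a file's
modification time with the current time. The sketch below shows that idea
only; whether the tool checks freshness this way is an assumption, and
`cache_is_fresh` and the file name are hypothetical.

```python
import os
import tempfile
import time

ONE_DAY_SECONDS = 24 * 60 * 60


def cache_is_fresh(path, max_age=ONE_DAY_SECONDS, now=None):
    """Return True if path exists and was modified less than max_age seconds ago."""
    if not os.path.isfile(path):
        return False
    if now is None:
        now = time.time()
    return (now - os.path.getmtime(path)) < max_age


with tempfile.TemporaryDirectory() as workdir:
    summary = os.path.join(workdir, "assembly_summary.txt")
    open(summary, "w").close()
    print(cache_is_fresh(summary))  # file was just written

    # Pretend the file was last modified two days ago.
    two_days_ago = time.time() - 2 * ONE_DAY_SECONDS
    os.utime(summary, (two_days_ago, two_days_ago))
    print(cache_is_fresh(summary))
```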

To get an overview of all options, run:

```bash
ncbi-genome-download --help
```
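
When scripting several downloads, it can help to assemble the command line
programmatically. The following sketch builds an argument list from the
options shown in this section; `build_command` is a hypothetical helper, and
it only prints the command rather than running it.

```python
import shlex


def build_command(groups, section=None, formats=None, assembly_levels=None,
                  parallel=None, dry_run=False):
    """Assemble an ncbi-genome-download argument list from keyword options."""
    cmd = ["ncbi-genome-download"]
    if section:
        cmd += ["--section", section]
    if formats:
        cmd += ["--formats", ",".join(formats)]
    if assembly_levels:
        cmd += ["--assembly-levels", ",".join(assembly_levels)]
    if parallel:
        cmd += ["--parallel", str(parallel)]
    if dry_run:
        cmd.append("--dry-run")
    # The taxonomic groups go last, joined by commas.
    cmd.append(",".join(groups))
    return cmd


cmd = build_command(["bacteria", "viral"], formats=["fasta"], dry_run=True)
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the list form to `subprocess.run` avoids shell quoting issues for
values that contain spaces.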

### As a method

You can also call `ncbi-genome-download` from Python. Pass the pythonised
keyword arguments (`_` instead of `-`) as described above or in the `--help`:

```python
import ncbi_genome_download as ngd
ngd.download()
```

**Note**: To specify a taxonomic group, like _bacteria_, use the `group` keyword.

### Contributed Scripts: `gimme_taxa.py`

This script lets you find out what TaxIDs to pass to `ngd`, and will write a
simple one-item-per-line file to pass in to it. It utilises the `ete3` toolkit,
so refer to their site to install the dependency if it's not already satisfied.

You can query the database using a particular TaxID, or a scientific name. The
primary function of the script is to return all the child taxa of the specified
parent taxa. The script has various options for what information is written in
the output.

A basic invocation may look like:

```bash
# Fetch all descendent taxa for Escherichia (taxid 561):
python gimme_taxa.py -o ~/mytaxafile.txt 561

# Alternatively, just provide the taxon name
python gimme_taxa.py -o all_descendent_taxids.txt Escherichia

# You can provide multiple taxids and/or names
python gimme_taxa.py -o all_descendent_taxids.txt 561,Methanobrevibacter
```

On first use, a small sqlite database will be created in your home directory
by default (change the location with the `--database` flag). You can update this
database by using the `--update` flag. Note that if the database is not in your
home directory, you must specify it with `--database` or a new database will be
created in your home directory.
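
Conceptually, returning all child taxa of a parent is a tree traversal. The
sketch below shows that idea on a hand-made child-to-parent table and writes
the result one item per line. The table is a toy that has not been checked
against the NCBI taxonomy, and the helpers `descendants` and `write_taxids`
are illustrative; the real script queries the taxonomy database through
`ete3` instead.

```python
from collections import defaultdict

# Toy child -> parent table (illustrative only).
PARENT = {
    562: 561,       # species under genus 561
    83333: 562,     # strain under species 562
    511145: 83333,  # substrain under strain 83333
}


def descendants(root, parent_of):
    """Return all taxids below root, sorted numerically."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    found, stack = [], [root]
    while stack:
        node = stack.pop()
        for child in children[node]:
            found.append(child)
            stack.append(child)
    return sorted(found)


def write_taxids(taxids, path):
    """Write one taxid per line, the format expected by --taxids files."""
    with open(path, "w", encoding="utf-8") as handle:
        for taxid in taxids:
            handle.write(f"{taxid}\n")


print(descendants(561, PARENT))
```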

To see all help:

```bash
python gimme_taxa.py
python gimme_taxa.py -h
python gimme_taxa.py --help
```

## Citing `ncbi-genome-download`

You can cite `ncbi-genome-download` via the Zenodo deposit under
[DOI: 10.5281/zenodo.8192433](https://doi.org/10.5281/zenodo.8192433).

## License

All code is available under the Apache License version 2, see the
[`LICENSE`](LICENSE) file for details.

            
