gencube


- Name: gencube
- Version: 0.9.0
- Summary: GenCube enables researchers to search for, download, and unify genome assemblies and diverse types of annotations, and retrieve metadata for sequencing-based experimental data suitable for specific requirements.
- Upload time: 2024-06-27 22:19:07
- License: MIT License. Copyright (c) 2024, Keun Hong Son. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- Keywords: genome, genomic, geneset, annotation, comparative, sequencing, metadata

Gencube
=======
<!-- 1. GitHub Version Badge: -->
<!-- 2. PyPI Version Badge: -->
<!-- 3. Supported Python Versions Badge: -->
<!-- 6. PyPI Downloads Badge: -->
<!-- 8. License Badge: -->
![github version](https://img.shields.io/badge/Version-1.0.0-informational)
[![pypi version](https://img.shields.io/pypi/v/gencube)](https://pypi.org/project/gencube/)
![python versions](https://img.shields.io/pypi/pyversions/gencube)
[![pypi downloads](https://img.shields.io/pypi/dm/gencube)](https://pypi.org/project/gencube/)
[![license](https://img.shields.io/pypi/l/gencube)](LICENSE)

<!-- 4. GitHub Actions CI Status Badge:
![status](https://github.com/keun-hong/gencube/workflows/CI/badge.svg) -->
<!-- 5. Codecov Badge:
[![codecov](https://codecov.io/gh/keun-hong/gencube/branch/master/graph/badge.svg)](https://codecov.io/gh/keun-hong/gencube) -->
<!-- 7. Documentation Badge:
[![docs](https://readthedocs.org/projects/gencube/badge/?version=latest)](https://gencube.readthedocs.io/en/latest/?badge=latest) -->

### Efficient retrieval, download, and unification of genomic data from leading biodiversity databases

[**Keun Hong Son**](https://keun-hong.github.io/)<sup>1,2,3</sup> and [**Je-Yoel Cho**](https://vetbio.snu.ac.kr/)<sup>1,2,3</sup>

<sup>1</sup> Department of Biochemistry, College of Veterinary Medicine, Seoul National University, Seoul, Korea<br>
<sup>2</sup> Comparative Medicine and Disease Research Center (CDRC), Science Research Center (SRC), Seoul National University, Seoul, Korea<br>
<sup>3</sup> BK21 PLUS Program for Creative Veterinary Science Research and Research Institute for Veterinary Science, Seoul National University, Seoul, Korea<br>
### Manuscript
[**bioRxiv**]() (uploaded: 2024.07.01)
<!-- Bioinformatics (accepted. 2024.09) -->

---
`gencube` enables researchers to search for, download, and unify genome assemblies and diverse types of annotations, and retrieve metadata for sequencing-based experimental data suitable for specific requirements.

![gencube_overview](https://github.com/keun-hong/gencube/blob/master/figures/gencube_overview.jpg?raw=true)

### Databases accessed from gencube
- [GenBank](https://www.ncbi.nlm.nih.gov/genbank/): NCBI GenBank Nucleotide Sequence Database
- [RefSeq](https://www.ncbi.nlm.nih.gov/refseq/): NCBI Reference Sequence Database
- [GenArk](https://hgdownload.soe.ucsc.edu/hubs/): UCSC Genome Archive
- [Ensembl Rapid Release](https://rapid.ensembl.org/index.html): Ensembl genome browser that provides frequent updates for newly sequenced species
- [Zoonomia TOGA](https://zoonomiaproject.org/the-data/): Tool to infer Orthologs from Genome Alignments
- [INSDC](https://www.insdc.org/): International Nucleotide Sequence Database Collaboration
- [SRA](https://www.ncbi.nlm.nih.gov/sra): NCBI Sequence Read Archive
- [ENA](https://www.ebi.ac.uk/ena/browser/home): EMBL-EBI European Nucleotide Archive
- [DDBJ](https://www.ddbj.nig.ac.jp/index-e.html): DNA Data Bank of Japan

### Detailed information on each database
- [GenBank & RefSeq README.txt](https://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt) - `genome`, `geneset`, `sequence`
- [UCSC GenArk paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03057-x) - `genome`, `geneset`, `annotation`
- [Ensembl Rapid Release Help & Docs](https://rapid.ensembl.org/info/index.html) & [Ensembl 2023 paper](https://academic.oup.com/nar/article/51/D1/D933/6786199?login=false) - `genome`, `geneset`, `sequence`, `crossgenome`
- [Zoonomia TOGA README.txt](https://genome.senckenberg.de/download/TOGA/README.txt) & [Paper](https://www.science.org/doi/10.1126/science.abn3107) - `geneset`, `crossgenome`
- [Search in SRA Entrez](https://www.ncbi.nlm.nih.gov/sra/docs/srasearch/), [Entrez Help](https://www.ncbi.nlm.nih.gov/books/NBK3837/) & [SRA Advanced Search Builder](https://www.ncbi.nlm.nih.gov/sra/advanced) - `seqmeta`


## Installation
The latest release can be installed with pip:
```bash
pip install gencube
```
Alternatively, install it with conda:
```bash
conda install -c bioconda gencube
```

## Tutorials
`gencube` consists of six subcommands:
```plaintext
$ gencube
usage: gencube [-h] {genome,geneset,sequence,annotation,crossgenome,seqmeta} ...

gencube v1.0.0

positional arguments:
  {genome,geneset,sequence,annotation,crossgenome,seqmeta}
    genome              Search, download, and modify chromosome labels for genome assemblies
    geneset             Search, download, and modify chromosome labels for genesets (gene annotations)
    sequence            Search and download sequence data of genesets
    annotation          Search, download, and modify chromosome labels for various genome annotations, such as gaps and repeats
    crossgenome         Search and download comparative genomics data, such as homology, and codon or protein alignment
    seqmeta             Search, fetch, and integrate metadata of experimental sequencing data

options:
  -h, --help            show this help message and exit
```
---
### The positional arguments and options shared among the `genome`, `geneset`, `sequence`, `annotation`, and `crossgenome` subcommands
All five of these subcommands start by identifying the genome assemblies relevant to your research.
The positional arguments and options below are shared by these subcommands and are used to browse and search for specific genome assemblies.

```plaintext
positional arguments:
  keywords              Taxonomic names to search for genomes. You can provide various forms 
                        such as species names or accession numbers.  
                        Examples: homo_sapiens, human, GCF_000001405.40, GCA_000001405.29, GRCh38, hg38 
                        
                        Multiple names can be combined and will be merged in the search results.
                        To specify multiple names, separate them with spaces.

options:
  -h, --help            show this help message and exit
  -v level, --level level
                        Specify the genome assembly level (default: complete,chromosome)
                        complete   : Fully assembled genomes
                        chromosome : Assembled at the chromosome level
                        scaffold   : Assembled into scaffolds, but not to the chromosome level
                        contig     : Contiguous sequences without gaps
                        
  -r, --refseq          Show genomes that have RefSeq accession (GCF_* format)
  -u, --ucsc            Show genomes that have UCSC name
  -l, --latest          Show genomes corresponding to the latest version
  -m, --metadata        Save metadata for the searched genomes
```
#### Examples
```bash
# Search using scientific or common name
$ gencube genome homo_sapiens canis_lupus_familiaris
$ gencube genome human dog

# Search using assembly name
$ gencube genome T2T-CHM13v2.0 GRCh38

# Search using UCSC name
$ gencube genome hg38 hg19

# Search using GenBank (GCA_*) or RefSeq (GCF_*) accession
$ gencube genome GCF_000001405.40 GCA_021950905.1

# Show searched genomes corresponding to all genome assembly levels
$ gencube genome homo_sapiens --level complete,chromosome,scaffold,contig

# Only show genomes that have RefSeq accession and UCSC name, and correspond to the latest version
$ gencube genome homo_sapiens --refseq --ucsc --latest

# Save the full metadata of the searched genomes
$ gencube genome homo_sapiens --metadata
```
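As a rough illustration of how accession keywords differ from name keywords, the sketch below labels a keyword by its prefix: NCBI assembly accessions start with `GCA_` for GenBank and `GCF_` for RefSeq. The helper `classify_keyword` and the regular expression are hypothetical and written only for this example; they are not part of gencube and do not describe how gencube itself parses keywords.

```python
import re

# NCBI assembly accessions: GCA_ (GenBank) or GCF_ (RefSeq),
# followed by nine digits and a version suffix such as ".40".
ACCESSION_RE = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")


def classify_keyword(keyword: str) -> str:
    """Label a keyword as 'genbank', 'refseq', or 'name' (hypothetical helper)."""
    match = ACCESSION_RE.match(keyword)
    if match is None:
        # Anything else (species name, assembly name, UCSC name) is free text here.
        return "name"
    return "genbank" if match.group(1) == "GCA" else "refseq"


for kw in ["GCF_000001405.40", "GCA_021950905.1", "homo_sapiens", "hg38"]:
    print(kw, "->", classify_keyword(kw))
```

A check like this can be handy when preparing keyword lists for batch searches, for example to confirm that every accession expected to be RefSeq really carries the `GCF_` prefix.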
#### Example output displayed in the terminal
```plaintext
$ gencube genome GCF_000001405.40 GCA_021950905.1

# Search assemblies in NCBI database
  Keyword: ['GCF_000001405.40', 'GCA_021950905.1']

  Total 3 genomes are searched.

# Convert JSON to dataframe format.
  Filter options
  Level:   ['Complete', 'Chromosome']
  RefSeq:  False
  UCSC:    False
  Latest:  False

# Check accessibility to GenArk, Ensembl Rapid Release
  UCSC GenArk  : 4167 genomes across 2813 species
  Ensembl Rapid: 2272 genomes across 1522 species

+----+------------------------+---------+------------+------------------+--------+----------+-----------+
|    | Assembly name          |   Taxid | Release    | NCBI             | UCSC   | GenArk   | Ensembl   |
+====+========================+=========+============+==================+========+==========+===========+
|  0 | HG002.mat.cur.20211005 |    9606 | 2022/02/04 | GCA_021951015.1  |        | v        | v         |
+----+------------------------+---------+------------+------------------+--------+----------+-----------+
|  1 | HG002.pat.cur.20211005 |    9606 | 2022/02/04 | GCA_021950905.1  |        | v        | v         |
+----+------------------------+---------+------------+------------------+--------+----------+-----------+
|  2 | GRCh38.p14             |    9606 | 2022/02/03 | GCF_000001405.40 | hg38   | v        |           |
+----+------------------------+---------+------------+------------------+--------+----------+-----------+
```

---
### `genome`: Search, download, and modify chromosome labels for genome assemblies
You can download genome data in FASTA format from four different databases (GenBank, RefSeq, GenArk, Ensembl Rapid Release). Each database uses a different soft-masking method, and you can selectively download the data as needed. You can also download unmasked and hard-masked genomes from the Ensembl Rapid Release database.
```plaintext
options:
  -d, --download        Download "fasta" formatted genome file.
  -f types, --fasta types
                        Type of "fasta" formatted genome file (default: refseq).
                        Default is from the RefSeq database.
                        If not available, download from the GenBank database.
                        genbank    : soft-masked genome by NCBI GenBank
                        refseq     : soft-masked genome by NCBI RefSeq
                        genark     : soft-masked genome by UCSC GenArk
                        ensembl    : soft-masked genome by Ensembl Rapid Release
                        ensembl_hm : hard-masked genome by Ensembl Rapid Release
                        ensembl_um : unmasked genome by Ensembl Rapid Release
  -c type, --chr_style type
                        Chromosome label style used in the download file (default: ensembl)
                        ensembl : 1, 2, X, MT. Unknowns use GenBank IDs.
                        gencode : chr1, chr2, chrX, chrM. Unknowns use GenBank IDs.
                        ucsc    : chr1, chr2, chrX, chrM. Uses UCSC-specific IDs for unknowns.
                                  (!! Limited use if UCSC IDs are not issued.)
                        raw     : Uses raw file labels without modification. Format depends on the database:
                                 - NCBI GenBank: CM_* or other-form IDs
                                 - NCBI RefSeq : NC_*, NW_* or other-form IDs
                                 - GenArk      : GenBank or RefSeq IDs
                                 - Ensembl     : Ensembl IDs
  -p 1-9, --compresslevel 1-9
                        Compression level for output data (default: 6).
                        Lower numbers are faster but have lower compression.
  --recursive           Download file regardless of their presence only if integrity check is not possible.

```
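To make the masking terminology concrete, the sketch below uses the conventional meanings: a soft-masked sequence marks repeats in lowercase, a hard-masked sequence replaces them with `N`, and an unmasked sequence is entirely uppercase. The functions `hard_mask` and `unmask` are hypothetical helpers for illustration only. The files gencube downloads are masked by each database's own pipeline, not by code like this.

```python
def hard_mask(seq: str) -> str:
    """Replace soft-masked (lowercase) bases with N."""
    return "".join("N" if base.islower() else base for base in seq)


def unmask(seq: str) -> str:
    """Drop the masking information by uppercasing every base."""
    return seq.upper()


soft_masked = "ACGTacgtacgtGGCC"  # lowercase run marks a repeat region
print(hard_mask(soft_masked))     # ACGTNNNNNNNNGGCC
print(unmask(soft_masked))        # ACGTACGTACGTGGCC
```

Note that the conversion only works in one direction: a soft-masked sequence can be turned into a hard-masked or unmasked one, but the repeat positions cannot be recovered from an unmasked sequence.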
#### Examples
```bash
# Download genome files under the default conditions (RefSeq or GenBank)
$ gencube genome GCF_011100685.1 --download
# Download multiple genomes from various databases
$ gencube genome GCF_011100685.1 --download --fasta refseq,genark,ensembl
# Change the chromosome labels to the GENCODE style and set the compression level of the file to 9.
$ gencube genome GCF_011100685.1 --download --chr_style gencode --compresslevel 9
```
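The difference between the `ensembl` and `gencode` label styles can be sketched as a small mapping, based only on the option descriptions above (`1, 2, X, MT` versus `chr1, chr2, chrX, chrM`, with other sequences keeping their IDs). The function names and the rule for deciding which labels count as assembled chromosomes are assumptions made for this example, not gencube's actual implementation.

```python
import re

# Assumption for this sketch: assembled chromosomes are numbers,
# the sex chromosomes X/Y/W/Z, or the mitochondrial genome MT.
_CHROMOSOME_RE = re.compile(r"^(\d+|[XYWZ]|MT)$")


def ensembl_to_gencode(label: str) -> str:
    """Convert an Ensembl-style label to GENCODE style (hypothetical helper)."""
    if not _CHROMOSOME_RE.match(label):
        return label  # unplaced or unknown sequences keep their IDs
    if label == "MT":
        return "chrM"
    return "chr" + label


def gencode_to_ensembl(label: str) -> str:
    """Convert a GENCODE-style label back to Ensembl style (hypothetical helper)."""
    if label == "chrM":
        return "MT"
    if label.startswith("chr") and _CHROMOSOME_RE.match(label[3:]):
        return label[3:]
    return label


for name in ["1", "38", "X", "MT", "JAAHUQ010000001.1"]:
    print(name, "->", ensembl_to_gencode(name))
```

The last label in the loop stands in for an unplaced scaffold ID and is passed through unchanged, mirroring the "Unknowns use GenBank IDs" note in the help text.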
---

### `geneset`: Search, download, and modify chromosome labels for genesets (gene annotations)
```plaintext
options:
  -d types, --download types
                        Type of gene set
                        refseq_gtf    : RefSeq gene set (GTF format)
                        refseq_gff    : RefSeq gene set (GFF)
                        gnomon        : RefSeq Gnomon gene prediction (GFF)
                        cross         : RefSeq Cross-species alignments (GFF)
                        same          : RefSeq Same-species alignments (GFF)
                        agustus       : GenArk Augustus gene prediction (GFF)
                        xenoref       : GenArk XenoRefGene (GFF)
                        genark_ref    : GenArk RefSeq gene models (GFF)
                        ensembl_gff   : Ensembl Rapid Release gene set (GFF)
                        toga_gtf      : Zoonomia TOGA gene set (GTF)
                        toga_bed      : Zoonomia TOGA gene set (BED)
                        toga_pseudo   : Zoonomia TOGA processed pseudogenes (BED)
```

#### Examples
```bash
# Search for usable and accessible data
$ gencube geneset GCF_011100685.1

# Download multiple genesets from various databases
$ gencube geneset GCF_011100685.1 --download refseq_gtf,agustus,toga_gtf
```
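Because one `--download` value can mix geneset types that arrive in different file formats, it can help to know in advance which formats to expect. The sketch below encodes the type-to-format table from the help text above and groups a comma-separated value by format. `GENESET_FORMATS` and `group_by_format` are hypothetical names for this example; the spelling `agustus` follows the option value as listed.

```python
from collections import defaultdict

# Geneset type -> file format, transcribed from the option list above.
GENESET_FORMATS = {
    "refseq_gtf": "GTF",
    "refseq_gff": "GFF",
    "gnomon": "GFF",
    "cross": "GFF",
    "same": "GFF",
    "agustus": "GFF",
    "xenoref": "GFF",
    "genark_ref": "GFF",
    "ensembl_gff": "GFF",
    "toga_gtf": "GTF",
    "toga_bed": "BED",
    "toga_pseudo": "BED",
}


def group_by_format(download_arg: str) -> dict[str, list[str]]:
    """Group a comma-separated --download value by expected file format."""
    grouped = defaultdict(list)
    for name in download_arg.split(","):
        if name not in GENESET_FORMATS:
            raise ValueError(f"unknown geneset type: {name}")
        grouped[GENESET_FORMATS[name]].append(name)
    return dict(grouped)


print(group_by_format("refseq_gtf,agustus,toga_gtf"))
# {'GTF': ['refseq_gtf', 'toga_gtf'], 'GFF': ['agustus']}
```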
---

### `sequence`: Search and download sequence data of genesets
```plaintext
options:
  -d types, --download types
                        Download "fasta" formatted sequence file. 
                        1. Nucleotide sequences:
                           refseq_rna         : Accessioned RNA sequences annotated on the genome assembly.
                           refseq_rna_genomic : RNA features based on the genome sequence.
                           refseq_cds_genomic : CDS features based on the genome sequence.
                           refseq_pseudo      : Pseudogene and other gene regions without transcribed RNA or translated protein products.
                           ensembl_cdna       : Ensembl Rapid Release cDNA sequences of transcripts.
                           ensembl_cds        : Ensembl Rapid Release coding sequences (CDS).
                           ensembl_repeat     : Ensembl repeat modeler sequences.
                        
                        2. Protein sequences:
                           refseq_pep         : Accessioned protein sequences annotated on the genome assembly.
                           refseq_pep_cds     : CDS features translated into protein sequences.
                           ensembl_pep        : Ensembl Rapid Release protein sequences.
```

#### Examples
```bash
# Search for usable and accessible data
$ gencube sequence GCF_011100685.1

# Download multiple sequence types from various databases
$ gencube sequence GCF_011100685.1 --download refseq_rna,ensembl_cdna,refseq_pep,ensembl_pep
```
---

### `annotation`: Search, download, and modify chromosome labels for various genome annotations, such as gaps and repeats
```plaintext
options:
  -d types, --download types
                        Download annotation file.
                        gap  : Genomic gaps - AGP defined (bigBed format)
                        sr   : Simple tandem repeats by TRF (bigBed) 
                        td   : Tandem duplications (bigBed) 
                        wm   : Genomic intervals masked by WindowMasker + SDust (bigBed) 
                        rmsk : Repeated elements annotated by RepeatMasker (bigBed) 
                        cpg  : CpG Islands - Islands < 300 bases are light green (bigBed) 
                        gc   : GC percent in 5-Base window (bigWig)
```

#### Examples
```bash
# Search for usable and accessible data
$ gencube annotation GCF_011100685.1

# Download multiple annotations
$ gencube annotation GCF_011100685.1 --download sr,td,rmsk,gc
```
---

### `crossgenome`: Search and download comparative genomics data, such as homology, and codon or protein alignment
```plaintext
options:
  -d types, --download types
                        ensembl_homology   : Homology data from Ensembl Rapid Release, detailing gene orthology relationships across species.
                        toga_homology      : Homology data from TOGA, providing predictions of orthologous genes based on genome alignments.
                        toga_align_codon   : Codon alignment data from TOGA, showing aligned codon sequences between reference and query species.
                        toga_align_protein : Protein alignment data from TOGA, detailing aligned protein sequences between reference and query species.
                        toga_inact_mut     : List of inactivating mutations from TOGA, identifying mutations that disrupt gene function.
```

#### Examples
```bash
# Search for usable and accessible data
$ gencube crossgenome GCF_011100685.1

# Download multiple crossgenome data
$ gencube crossgenome GCF_011100685.1 --download toga_homology,toga_align_codon
```
---

### `seqmeta`: Search, fetch, and integrate metadata of experimental sequencing data
![seqmeta_scheme](https://github.com/keun-hong/gencube/blob/master/figures/seqmeta_scheme.jpg?raw=true)
```plaintext
$ gencube seqmeta
usage: gencube seqmeta [-h] [--info] [-o string] [-st string] [-sr string] [-l string] [-ex keywords] [-m] [keywords ...]

Search, fetch, and integrate metadata of experimental sequencing data

positional arguments:
  keywords              Keywords to search for sequencing-based experimental data. You can provide various forms 
                        Examples: tissue name, cell line, disease name, etc 
                                  liver, k562, cancer, breast_cancer
                        Multiple keywords can be combined and will be merged in the search results.
                        To specify multiple names, separate them with spaces.

options:
  -h, --help            show this help message and exit
  --info                Show full information about organism, strategy, source and layout 
                         
  -o string, --organism string
                        Scientific name or common name 
                        Example: homo_sapiens or human 
                        
                        Available common names:
                        human, mouse, dog, dingo, wolf, cat, pig, pig_domestic, cow, dairy_cow, chicken, horse, rice, wheat
                        elephant, whale, naked_mole_rat, blind_mole_rat, gorilla, rhesus_monkey, cynomolgus_monkey, baboon
                        chimpanzee, marmoset, macaque, capuchin_monkey, squirrel_monkey, bonobo, yeast, fruit_fly, nematode
                        zebrafish, african clawed frog, rat, guinea pig, rabbit, opossum
                         
  -st string, --strategy string
                        Available strategies 
                        wgs, wga, wxs, targeted, synthetic_long_read, gbs, rad, tn, clone_end, amplicon, clone, rna, mrna
                        ncrna, ribo, rip, mirna, ssrna, est, fl_cdna, atac, dnase, faire, chip, mre, bisulfite, mbd, medip
                        hic, chiapet, tethered
                         
  -sr string, --source string
                        Available sources 
                        genomic, genomic_single_cell, transcriptomic, transcriptomic_single_cell, metagenomic
                        metatranscriptomic, synthetic, viral, other
                         
  -l string, --layout string
                        Available layout: paired, single (default: paired,single) 
                         
  -ex keywords, --exclude keywords
                        Exclude the results for the keywords used in this option  
                         
  -m, --metadata        Save integrated metadata
```

#### Examples
```bash
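# Search metadata for dog with the chip and chip_seq strategies
$ gencube seqmeta --organism dog --strategy chip,chip_seq

# Narrow the search with keywords (liver or lung, combined with cancer or tumor)
$ gencube seqmeta --organism dog --strategy chip,chip_seq liver,lung cancer,tumor

# Additionally exclude results that match a keyword
$ gencube seqmeta --organism dog --strategy chip,chip_seq liver,lung cancer,tumor --exclude crispr

# Example of a combined query string (here for a human RNA-seq search with the same keywords and exclusion):
# ((((("human"[Organism]) AND ("rna seq"[Strategy])) AND ("illumina"[Platform])) AND ("public"[Access])) AND (("liver" OR "lung") AND ("cancer" OR "tumor"))) NOT "crispr"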
```
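The query string shown in the examples above suggests how keywords are combined: terms separated by commas are joined with `OR`, space-separated groups are joined with `AND`, and excluded terms are attached with `NOT`. The sketch below reproduces that exact string under those inferred rules. `build_query` and `or_group` are hypothetical helpers for this example, and the handling of more than one excluded term is an assumption; none of this is confirmed to be how gencube builds its queries internally.

```python
def or_group(terms):
    """Quote each term and join the terms with OR inside parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(fields, keyword_groups, exclude=()):
    """Assemble an Entrez-style query string (hypothetical helper).

    fields         : list of (value, field_tag) pairs, e.g. ("human", "Organism")
    keyword_groups : list of comma-separated strings, e.g. ["liver,lung"]
    exclude        : terms to exclude with NOT
    """
    value, field = fields[0]
    query = f'("{value}"[{field}])'
    # Each further field filter wraps the query so far in one more pair of parentheses.
    for value, field in fields[1:]:
        query = f'({query} AND ("{value}"[{field}]))'
    if keyword_groups:
        keywords = "(" + " AND ".join(or_group(g.split(",")) for g in keyword_groups) + ")"
        query = f"({query} AND {keywords})"
    if exclude:
        terms = list(exclude)
        # Assumption: several excluded terms are OR-ed together before NOT.
        excluded = f'"{terms[0]}"' if len(terms) == 1 else or_group(terms)
        query = f"{query} NOT {excluded}"
    return query


print(
    build_query(
        [("human", "Organism"), ("rna seq", "Strategy"),
         ("illumina", "Platform"), ("public", "Access")],
        ["liver,lung", "cancer,tumor"],
        exclude=["crispr"],
    )
)
```

Pasting a string built this way into the SRA Advanced Search Builder is one way to check what a keyword combination matches before fetching metadata.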
---

#### Credits
This package was created with [`Cookiecutter`](https://github.com/audreyr/cookiecutter) and the [`audreyr/cookiecutter-pypackage`](https://github.com/audreyr/cookiecutter-pypackage) project template.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "gencube",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": "Keun Hong Son <newhong@snu.ac.kr>",
    "keywords": "Genome, Genomic, Geneset, Annotation, Comparative, Sequencing, Metadata",
    "author": null,
    "author_email": "Keun Hong Son <newhong@snu.ac.kr>",
    "download_url": "https://files.pythonhosted.org/packages/e5/15/f5ceb365c2896a948816912d4cdd1b601f7f06ab73d171776bcf98136a5b/gencube-0.9.0.tar.gz",
    "platform": null,
    "description": "Gencube\n=======\n<!-- 1. GitHub Version Badge: -->\n<!-- 2. PyPI Version Badge: -->\n<!-- 3. Supported Python Versions Badge: -->\n<!-- 6. PyPI Downloads Badge: -->\n<!-- 8. License Badge: -->\n![github version](https://img.shields.io/badge/Version-1.0.0-informational)\n[![pypi version](https://img.shields.io/pypi/v/gencube)](https://pypi.org/project/gencube/)\n![python versions](https://img.shields.io/pypi/pyversions/gencube)\n[![pypi downloads](https://img.shields.io/pypi/dm/gencube)](https://pypi.org/project/gencube/)\n[![license](https://img.shields.io/pypi/l/gencube)](LICENSE)\n\n<!-- 4. GitHub Actions CI Status Badge:\n![status](https://github.com/keun-hong/gencube/workflows/CI/badge.svg) -->\n<!-- 5. Codecov Badge:\n[![codecov](https://codecov.io/gh/keun-hong/gencube/branch/master/graph/badge.svg)](https://codecov.io/gh/keun-hong/gencube) -->\n<!-- 7. Documentation Badge:\n[![docs](https://readthedocs.org/projects/gencube/badge/?version=latest)](https://gencube.readthedocs.io/en/latest/?badge=latest) -->\n\n### Efficient retrieval, download, and unification of genomic data from leading biodiversity databases\n\n[**Keun Hong Son**](https://keun-hong.github.io/)<sup>1,2,3</sup>, and [**Je-Yoel Cho**](https://vetbio.snu.ac.kr/)<sup>1,2,3</sup>\n\n<sup>1</sup> Department of Biochemistry, College of Veterinary Medicine, Seoul National University, Seoul, Korea<br>\n<sup>2</sup> Comparative Medicine and Disease Research Center (CDRC), Science Research Center (SRC), Seoul National University, Seoul, Korea<br>\n<sup>3</sup> BK21 PLUS Program for Creative Veterinary Science Research and Research Institute for Veterinary Science, Seoul National University, Seoul, Korea<be>\n### Manuscript\n[**bioRxiv**]() (uploaded: 2024.07.01)\n<!-- Bioinformatics (accepted. 
2024.09) -->\n\n---\n`gencube` enables researchers to search for, download, and unify genome assemblies and diverse types of annotations, and retrieve metadata for sequencing-based experimental data suitable for specific requirements.\n\n![gencube_overview](https://github.com/keun-hong/gencube/blob/master/figures/gencube_overview.jpg?raw=true)\n\n### Databases accessed from gencube\n- [GenBank](https://www.ncbi.nlm.nih.gov/genbank/): NCBI GenBank Nucleotide Sequence Database\n- [RefSeq](https://www.ncbi.nlm.nih.gov/refseq/): NCBI Reference Sequence Database\n- [GenArk](https://hgdownload.soe.ucsc.edu/hubs/): UCSC Genome Archive\n- [Ensembl Rapid Release](https://rapid.ensembl.org/index.html): Ensembl genome browser that provides frequent updates for newly sequenced species\n- [Zoonomia TOGA](https://zoonomiaproject.org/the-data/): Tool to infer Orthologs from Genome Alignments\n- [INSDC](https://www.insdc.org/): International Nucleotide Sequence Database Collaboration\n- [SRA](https://www.ncbi.nlm.nih.gov/sra): NCBI Sequence Read Archive\n- [ENA](https://www.ebi.ac.uk/ena/browser/home): EMBL-EBI European Nucleotide Archive\n- [DDBJ](https://www.ddbj.nig.ac.jp/index-e.html): DNA Data Bank of Japan\n\n### Detailed information of each database\n- [GenBank & RefSeq README.txt](https://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt) - `genome`, `geneset`, `sequence`\n- [UCSC GenArk paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03057-x) - `genome`, `geneset`, `annotation`\n- [Ensembl Rapid Release Help & Docs](https://rapid.ensembl.org/info/index.html) & [Ensembl 2023 paper](https://academic.oup.com/nar/article/51/D1/D933/6786199?login=false) - `genome`, `geneset`, `sequence`, `crossgenome`\n- [Zoonomia TOGA README.txt](https://genome.senckenberg.de/download/TOGA/README.txt) & [Paper](https://www.science.org/doi/10.1126/science.abn3107) - `geneset`, `crossgenome`\n- [Search in SRA Entrez](https://www.ncbi.nlm.nih.gov/sra/docs/srasearch/), 
[Entrez Help](https://www.ncbi.nlm.nih.gov/books/NBK3837/) & [SRA Advanced Search Builder](https://www.ncbi.nlm.nih.gov/sra/advanced) - `seqmeta`\n\n\n## Installation\nThe latest release can be installed with\n```bash\npip install gencube\n```\nAlternative\n```bash\nconda install -c bioconda gencube\n```\n\n## Tutorials\n`gencube` consists of six subcommands\n```plaintext\n$ gencube\nusage: gencube [-h] {genome,geneset,sequence,annotation,crossgenome,seqmeta} ...\n\ngencube v1.0.0\n\npositional arguments:\n  {genome,geneset,sequence,annotation,crossgenome,seqmeta}\n    genome              Search, download, and modify chromosome labels for genome assemblies\n    geneset             Search, download, and modify chromosome labels for genesets (gene annotations)\n    sequence            Search and download sequence data of genesets\n    annotation          Search, download, and modify chromosome labels for various genome annotations, such as gaps and repeats\n    crossgenome         Search and download comparative genomics data, such as homology, and codon or protein alignment\n    seqmeta             Search, fetch, and integrate metadata of experimental sequencing data\n\noptions:\n  -h, --help            show this help message and exit\n```\n---\n### The positional arguments and options shared among the `genome`, `geneset`, `sequence`, `annotation`, and `crossgenome` subcommand\nWhen using the above five subcommands, it's important to find genome assemblies required for personal research.\nBelow are the positional arguments and options shared by the these subcommands to browse and search for specific genome assemblies.\n\n```plaintext\npositional arguments:\n  keywords              Taxonomic names to search for genomes. You can provide various forms \n                        such as species names or accession numbers.  
\n                        Examples: homo_sapiens, human, GCF_000001405.40, GCA_000001405.29, GRCh38, hg38 \n                        \n                        Multiple names can be combined and will be merged in the search results.\n                        To specify multiple names, separate them with spaces.\n\noptions:\n  -h, --help            show this help message and exit\n  -v level, --level level\n                        Specify the genome assembly level (default: complete,chromosome)\n                        complete   : Fully assembled genomes\n                        chromosome : Assembled at the chromosome level\n                        scaffold   : Assembled into scaffolds, but not to the chromosome level\n                        contig     : Contiguous sequences without gaps\n                        \n  -r, --refseq          Show genomes that have RefSeq accession (GCF_* format)\n  -u, --ucsc            Show genomes that have UCSC name\n  -l, --latest          Show genomes corresponding to the latest version\n  -m, --metadata        Save metadata for the searched genomes\n```\n#### Examples\n```bash\n# Search using scientific or common name\n$ gencube genome homo_sapiens canis_lupus_familiaris\n$ gencube genome human dog\n\n# Search using assembly name\n$ gencube genome T2T-CHM13v2.0 GRCh38\n\n# Search using UCSC name\n$ gencube genome hg38 hg19\n\n# Search using GenBank (GCF_*) or RefSeq (GCA_*) accession\n$ gencube genome GCF_000001405.40 GCA_021950905.1\n\n# Show searched genomes corresponding to all genome assembly levels\n$ gencube genome homo_sapiens --level complete,chromosome,scaffold,contig\n\n# Only show genomes that have RefSeq accession and UCSC name, and correspond to the latest version\n$ gencube genome homo_sapiens --refseq --ucsc --latest\n\n# Download the full information metadata of searched genomes\n$ gencube genome homo_sapiens --metadata\n```\n#### Example output displayed in the terminal\n```plaintext\n$ gencube genome 
GCF_000001405.40 GCA_021950905.1\n\n# Search assemblies in NCBI database\n  Keyword: ['GCF_000001405.40', 'GCA_021950905.1']\n\n  Total 3 genomes are searched.\n\n# Convert JSON to dataframe format.\n  Filter options\n  Level:   ['Complete', 'Chromosome']\n  RefSeq:  False\n  UCSC:    False\n  Latest:  False\n\n# Check accessibility to GenArk, Ensembl Rapid Release\n  UCSC GenArk  : 4167 genomes across 2813 species\n  Ensembl Rapid: 2272 genomes across 1522 species\n\n+----+------------------------+---------+------------+------------------+--------+----------+-----------+\n|    | Assembly name          |   Taxid | Release    | NCBI             | UCSC   | GenArk   | Ensembl   |\n+====+========================+=========+============+==================+========+==========+===========+\n|  0 | HG002.mat.cur.20211005 |    9606 | 2022/02/04 | GCA_021951015.1  |        | v        | v         |\n+----+------------------------+---------+------------+------------------+--------+----------+-----------+\n|  1 | HG002.pat.cur.20211005 |    9606 | 2022/02/04 | GCA_021950905.1  |        | v        | v         |\n+----+------------------------+---------+------------+------------------+--------+----------+-----------+\n|  2 | GRCh38.p14             |    9606 | 2022/02/03 | GCF_000001405.40 | hg38   | v        |           |\n+----+------------------------+---------+------------+------------------+--------+----------+-----------+\n```\n\n---\n### `genome`: Search, download, and modify chromosome labels for genome assemblies\nYou can download genome data in FASTA format from four different databases (GenBank, RefSeq, GenArk, Ensembl Rapid Release). Each database uses a different soft-masking method, and you can selectively download the data as needed. 
You can also download unmasked and hard-masked genomes from the Ensembl Rapid Release database.\n```plaintext\noptions:\n  -d, --download        Download \"fasta\" formatted genome file.\n  -f types, --fasta types\n                        Type of \"fasta\" formatted genome file (default: refseq).\n                        Default is from the RefSeq database.\n                        If not available, download from the GenBank database.\n                        genbank    : soft-masked genome by NCBI GenBank\n                        refseq     : soft-masked genome by NCBI RefSeq\n                        genark     : soft-masked genome by UCSC GenArk\n                        ensembl    : soft-masked genome by Ensembl Rapid Release\n                        ensembl_hm : hard-masked genome by Ensembl Rapid Release\n                        ensembl_um : unmasked genome by Ensembl Rapid Release\n  -c type, --chr_style type\n                        Chromosome label style used in the download file (default: ensembl)\n                        ensembl : 1, 2, X, MT. Unknowns use GenBank IDs.\n                        gencode : chr1, chr2, chrX, chrM. Unknowns use GenBank IDs.\n                        ucsc    : chr1, chr2, chrX, chrM. Uses UCSC-specific IDs for unknowns.\n                                  (!! Limited use if UCSC IDs are not issued.)\n                        raw     : Uses raw file labels without modification. 
Format depends on the database:\n                                 - NCBI GenBank: CM_* or other-form IDs\n                                 - NCBI RefSeq : NC_*, NW_* or other-form IDs\n                                 - GenArk      : GenBank or RefSeq IDs\n                                 - Ensembl     : Ensembl IDs\n  -p 1-9, --compresslevel 1-9\n                        Compression level for output data (default: 6).\n                        Lower numbers are faster but have lower compression.\n  --recursive           Download file regardless of their presence only if integrity check is not possible.\n\n```\n#### Examples\n```bash\n# Download genome files under the default conditions (RefSeq or GenBank)\n$ gencube genome GCF_011100685.1 --download\n# Download multiple genomes from various databases\n$ gencube genome GCF_011100685.1 --download --fasta refseq,genark,ensembl\n# Change the chromosome labels to the GENCODE style and set the compression level of the file to 9.\n$ gencube genome GCF_011100685.1 --download --chr_style gencode --compresslevel 9\n```\n---\n\n### `geneset`: Search, download, and modify chromosome labels for genesets (gene annotations)\n```plaintext\noptions:\n  -d types, --download types\n                        Type of gene set\n                        refseq_gtf    : RefSeq gene set (GTF format)\n                        refseq_gff    : RefSeq gene set (GFF)\n                        gnomon        : RefSeq Gnomon gene prediction (GFF)\n                        cross         : RefSeq Cross-species alignments (GFF)\n                        same          : RefSeq Same-species alignments (GFF)\n                        agustus       : GenArk Augustus gene prediction (GFF)\n                        xenoref       : GenArk XenoRefGene (GFF)\n                        genark_ref    : GenArk RefSeq gene models (GFF)\n                        ensembl_gff   : Ensembl Rapid Release gene set (GFF)\n                        toga_gtf      : Zoonomia TOGA gene set 
(GTF)\n                        toga_bed      : Zoonomia TOGA gene set (BED)\n                        toga_pseudo   : Zoonomia TOGA processed pseudogenes (BED)\n```\n\n#### Examples\n```bash\n# search usable and accessible data\ngencube geneset GCF_011100685.1\n\n# Download multiple genesets from various databases\n$ gencube geneset GCF_011100685.1 --download refseq_gtf,agustus,toga_gtf\n```\n---\n\n### `sequence`: Search and download sequence data of genesets\n```plaintext\noptions:\n  -d types, --download types\n                        Download \"fasta\" formatted sequence file. \n                        1. Nucleotide sequences:\n                           refseq_rna         : Accessioned RNA sequences annotated on the genome assembly.\n                           refseq_rna_genomic : RNA features based on the genome sequence.\n                           refseq_cds_genomic : CDS features based on the genome sequence.\n                           refseq_pseudo      : Pseudogene and other gene regions without transcribed RNA or translated protein products.\n                           ensembl_cdna       : Ensembl Rapid Release cDNA sequences of transcripts.\n                           ensembl_cds        : Ensembl Rapid Release coding sequences (CDS).\n                           ensembl_repeat     : Ensembl repeat modeler sequences.\n                        \n                        2. 
Protein sequences:\n                           refseq_pep         : Accessioned protein sequences annotated on the genome assembly.\n                           refseq_pep_cds     : CDS features translated into protein sequences.\n                           ensembl_pep        : Ensembl Rapid Release protein sequences.\n```\n\n#### Examples\n```bash\n# Search for usable and accessible data\ngencube sequence GCF_011100685.1\n\n# Download multiple sequence files from various databases\n$ gencube sequence GCF_011100685.1 --download refseq_rna,ensembl_cdna,refseq_pep,ensembl_pep\n```\n---\n\n### `annotation`: Search, download, and modify chromosome labels for various genome annotations, such as gaps and repeats\n```plaintext\noptions:\n  -d types, --download types\n                        Download annotation file.\n                        gap : Genomic gaps - AGP defined (bigBed format)\n                        sr   : Simple tandem repeats by TRF (bigBed) \n                        td   : Tandem duplications (bigBed) \n                        wm   : Genomic intervals masked by WindowMasker + SDust (bigBed) \n                        rmsk : Repeated elements annotated by RepeatMasker (bigBed) \n                        cpg  : CpG Islands - Islands < 300 bases are light green (bigBed) \n                        gc   : GC percent in 5-Base window (bigWig)\n```\n\n#### Examples\n```bash\n# Search for usable and accessible data\ngencube annotation GCF_011100685.1\n\n# Download multiple annotations\ngencube annotation GCF_011100685.1 --download sr,td,rmsk,gc\n```\n---\n\n### `crossgenome`: Search and download comparative genomics data, such as homology, and codon or protein alignment\n```plaintext\noptions:\n  -d types, --download types\n                        ensembl_homology   : Homology data from Ensembl Rapid Release, detailing gene orthology relationships across species.\n                        toga_homology      : Homology data from TOGA, providing predictions of orthologous genes based 
on genome alignments.\n                        toga_align_codon   : Codon alignment data from TOGA, showing aligned codon sequences between reference and query species.\n                        toga_align_protein : Protein alignment data from TOGA, detailing aligned protein sequences between reference and query species.\n                        toga_inact_mut     : List of inactivating mutations from TOGA, identifying mutations that disrupt gene function.\n```\n\n#### Examples\n```bash\n# Search for usable and accessible data\ngencube crossgenome GCF_011100685.1\n\n# Download multiple crossgenome data\n$ gencube crossgenome GCF_011100685.1 --download toga_homology,toga_align_codon\n```\n---\n\n### `seqmeta`: Search, fetch, and integrate metadata of experimental sequencing data\n![seqmeta_scheme](https://github.com/keun-hong/gencube/blob/master/figures/seqmeta_scheme.jpg?raw=true)\n```plaintext\n$ gencube seqmeta\nusage: gencube seqmeta [-h] [--info] [-o string] [-st string] [-sr string] [-l string] [-ex keywords] [-m] [keywords ...]\n\nSearch, fetch, and integrate metadata of experimental sequencing data\n\npositional arguments:\n  keywords              Keywords to search for sequencing-based experimental data. 
You can provide various forms \n                        Examples: tissue name, cell line, disease name, etc \n                                  liver, k562, cancer, breast_cancer\n                        Multiple keywords can be combined and will be merged in the search results.\n                        To specify multiple names, separate them with spaces.\n\noptions:\n  -h, --help            show this help message and exit\n  --info                Show full information about organism, strategy, source and layout \n                         \n  -o string, --organism string\n                        Scientific name or common name \n                        Example: homo_sapiens or human \n                        \n                        Available common names:\n                        human, mouse, dog, dingo, wolf, cat, pig, pig_domestic, cow, dairy_cow, chicken, horse, rice, wheat\n                        elephant, whale, naked_mole_rat, blind_mole_rat, gorilla, rhesus_monkey, cynomolgus_monkey, baboon\n                        chimpanzee, marmoset, macaque, capuchin_monkey, squirrel_monkey, bonobo, yeast, fruit_fly, nematode\n                        zebrafish, african clawed frog, rat, guinea pig, rabbit, opossum\n                         \n  -st string, --strategy string\n                        Available strategies \n                        wgs, wga, wxs, targeted, synthetic_long_read, gbs, rad, tn, clone_end, amplicon, clone, rna, mrna\n                        ncrna, ribo, rip, mirna, ssrna, est, fl_cdna, atac, dnase, faire, chip, mre, bisulfite, mbd, medip\n                        hic, chiapet, tethered\n                         \n  -sr string, --source string\n                        Available sources \n                        genomic, genomic_single_cell, transcriptomic, transcriptomic_single_cell, metagenomic\n                        metatranscriptomic, synthetic, viral, other\n                         \n  -l string, --layout string\n                        
Available layout: paired, single (default: paired,single) \n                         \n  -ex keywords, --exclude keywords\n                        Exclude the results for the keywords used in this option  \n                         \n  -m, --metadata        Save integrated metadata\n```\n\n#### Examples\n```bash\n$ gencube seqmeta --organism dog --strategy chip,chip_seq\n\n$ gencube seqmeta --organism dog --strategy chip,chip_seq liver,lung cancer,tumor\n\n$ gencube seqmeta --organism dog --strategy chip,chip_seq liver,lung cancer,tumor --exclude crispr\n\n(((((\"human\"[Organism]) AND (\"rna seq\"[Strategy])) AND (\"illumina\"[Platform])) AND (\"public\"[Access])) AND ((\"liver\" OR \"lung\") AND (\"cancer\" OR \"tumor\"))) NOT \"crispr\"\n```\n---\n\n#### Credits\nThis package was created with [`Cookiecutter`](https://github.com/audreyr/cookiecutter) and the [`audreyr/cookiecutter-pypackage`](https://github.com/audreyr/cookiecutter-pypackage) project template.\n",
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2024, Keun Hong Son  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.  ",
    "summary": "GenCube enables researchers to search for, download, and unify genome assemblies and diverse types of annotations, and retrieve metadata for sequencing-based experimental data suitable for specific requirements.",
    "version": "0.9.0",
    "project_urls": {
        "Bugs": "https://github.com/keun-hong/gencube/issues",
        "Changelog": "https://github.com/keun-hong/gencube/blob/master/changelog.md",
        "Homepage": "https://keun-hong.github.io/",
        "Repository": "https://github.com/keun-hong/gencube"
    },
    "split_keywords": [
        "genome",
        " genomic",
        " geneset",
        " annotation",
        " comparative",
        " sequencing",
        " metadata"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "588c3ba2c6bb6a561db69a04ce6d5885d120e3a342f3a7863bc0280259efb36a",
                "md5": "c6684e0ef1192b506642a83a72ca6c78",
                "sha256": "b94a2947a2cc8cbc15f648962d601c350b12c45a9270f1ebd111abbd7e6782ce"
            },
            "downloads": -1,
            "filename": "gencube-0.9.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c6684e0ef1192b506642a83a72ca6c78",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 56048,
            "upload_time": "2024-06-27T22:18:48",
            "upload_time_iso_8601": "2024-06-27T22:18:48.224868Z",
            "url": "https://files.pythonhosted.org/packages/58/8c/3ba2c6bb6a561db69a04ce6d5885d120e3a342f3a7863bc0280259efb36a/gencube-0.9.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e515f5ceb365c2896a948816912d4cdd1b601f7f06ab73d171776bcf98136a5b",
                "md5": "072a38cada432bcc53e61ffdf7b61b91",
                "sha256": "1bb46d2796eb796e399b7c63f883358704ba9bdfdecd73d584a2d4ae9d239ca2"
            },
            "downloads": -1,
            "filename": "gencube-0.9.0.tar.gz",
            "has_sig": false,
            "md5_digest": "072a38cada432bcc53e61ffdf7b61b91",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 75772,
            "upload_time": "2024-06-27T22:19:07",
            "upload_time_iso_8601": "2024-06-27T22:19:07.600581Z",
            "url": "https://files.pythonhosted.org/packages/e5/15/f5ceb365c2896a948816912d4cdd1b601f7f06ab73d171776bcf98136a5b/gencube-0.9.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-06-27 22:19:07",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "keun-hong",
    "github_project": "gencube",
    "travis_ci": true,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "tox": true,
    "lcname": "gencube"
}
        