Name | gencube |
Version | 0.9.0 |
home_page | None |
Summary | GenCube enables researchers to search for, download, and unify genome assemblies and diverse types of annotations, and retrieve metadata for sequencing-based experimental data suitable for specific requirements. |
upload_time | 2024-06-27 22:19:07 |
maintainer | None |
docs_url | None |
author | None |
requires_python | None |
license | MIT License Copyright (c) 2024, Keun Hong Son Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
keywords | genome, genomic, geneset, annotation, comparative, sequencing, metadata |
Gencube
=======
<!-- 1. GitHub Version Badge: -->
<!-- 2. PyPI Version Badge: -->
<!-- 3. Supported Python Versions Badge: -->
<!-- 6. PyPI Downloads Badge: -->
<!-- 8. License Badge: -->
![github version](https://img.shields.io/badge/Version-1.0.0-informational)
[![pypi version](https://img.shields.io/pypi/v/gencube)](https://pypi.org/project/gencube/)
![python versions](https://img.shields.io/pypi/pyversions/gencube)
[![pypi downloads](https://img.shields.io/pypi/dm/gencube)](https://pypi.org/project/gencube/)
[![license](https://img.shields.io/pypi/l/gencube)](LICENSE)
<!-- 4. GitHub Actions CI Status Badge:
![status](https://github.com/keun-hong/gencube/workflows/CI/badge.svg) -->
<!-- 5. Codecov Badge:
[![codecov](https://codecov.io/gh/keun-hong/gencube/branch/master/graph/badge.svg)](https://codecov.io/gh/keun-hong/gencube) -->
<!-- 7. Documentation Badge:
[![docs](https://readthedocs.org/projects/gencube/badge/?version=latest)](https://gencube.readthedocs.io/en/latest/?badge=latest) -->
### Efficient retrieval, download, and unification of genomic data from leading biodiversity databases
[**Keun Hong Son**](https://keun-hong.github.io/)<sup>1,2,3</sup>, and [**Je-Yoel Cho**](https://vetbio.snu.ac.kr/)<sup>1,2,3</sup>
<sup>1</sup> Department of Biochemistry, College of Veterinary Medicine, Seoul National University, Seoul, Korea<br>
<sup>2</sup> Comparative Medicine and Disease Research Center (CDRC), Science Research Center (SRC), Seoul National University, Seoul, Korea<br>
<sup>3</sup> BK21 PLUS Program for Creative Veterinary Science Research and Research Institute for Veterinary Science, Seoul National University, Seoul, Korea<br>
### Manuscript
[**bioRxiv**]() (uploaded: 2024.07.01)
<!-- Bioinformatics (accepted. 2024.09) -->
---
`gencube` enables researchers to search for, download, and unify genome assemblies and diverse types of annotations, and retrieve metadata for sequencing-based experimental data suitable for specific requirements.
![gencube_overview](https://github.com/keun-hong/gencube/blob/master/figures/gencube_overview.jpg?raw=true)
### Databases accessed from gencube
- [GenBank](https://www.ncbi.nlm.nih.gov/genbank/): NCBI GenBank Nucleotide Sequence Database
- [RefSeq](https://www.ncbi.nlm.nih.gov/refseq/): NCBI Reference Sequence Database
- [GenArk](https://hgdownload.soe.ucsc.edu/hubs/): UCSC Genome Archive
- [Ensembl Rapid Release](https://rapid.ensembl.org/index.html): Ensembl genome browser that provides frequent updates for newly sequenced species
- [Zoonomia TOGA](https://zoonomiaproject.org/the-data/): Tool to infer Orthologs from Genome Alignments
- [INSDC](https://www.insdc.org/): International Nucleotide Sequence Database Collaboration
- [SRA](https://www.ncbi.nlm.nih.gov/sra): NCBI Sequence Read Archive
- [ENA](https://www.ebi.ac.uk/ena/browser/home): EMBL-EBI European Nucleotide Archive
- [DDBJ](https://www.ddbj.nig.ac.jp/index-e.html): DNA Data Bank of Japan
### Detailed information on each database
- [GenBank & RefSeq README.txt](https://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt) - `genome`, `geneset`, `sequence`
- [UCSC GenArk paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03057-x) - `genome`, `geneset`, `annotation`
- [Ensembl Rapid Release Help & Docs](https://rapid.ensembl.org/info/index.html) & [Ensembl 2023 paper](https://academic.oup.com/nar/article/51/D1/D933/6786199?login=false) - `genome`, `geneset`, `sequence`, `crossgenome`
- [Zoonomia TOGA README.txt](https://genome.senckenberg.de/download/TOGA/README.txt) & [Paper](https://www.science.org/doi/10.1126/science.abn3107) - `geneset`, `crossgenome`
- [Search in SRA Entrez](https://www.ncbi.nlm.nih.gov/sra/docs/srasearch/), [Entrez Help](https://www.ncbi.nlm.nih.gov/books/NBK3837/) & [SRA Advanced Search Builder](https://www.ncbi.nlm.nih.gov/sra/advanced) - `seqmeta`
## Installation
Install the latest release from PyPI with pip:
```bash
pip install gencube
```
Alternatively, install it with conda from the Bioconda channel:
```bash
conda install -c bioconda gencube
```
## Tutorials
`gencube` consists of six subcommands:
```plaintext
$ gencube
usage: gencube [-h] {genome,geneset,sequence,annotation,crossgenome,seqmeta} ...
gencube v1.0.0
positional arguments:
{genome,geneset,sequence,annotation,crossgenome,seqmeta}
genome Search, download, and modify chromosome labels for genome assemblies
geneset Search, download, and modify chromosome labels for genesets (gene annotations)
sequence Search and download sequence data of genesets
annotation Search, download, and modify chromosome labels for various genome annotations, such as gaps and repeats
crossgenome Search and download comparative genomics data, such as homology, and codon or protein alignment
seqmeta Search, fetch, and integrate metadata of experimental sequencing data
options:
-h, --help show this help message and exit
```
---
### Positional arguments and options shared among the `genome`, `geneset`, `sequence`, `annotation`, and `crossgenome` subcommands
With any of these five subcommands, the first step is to find the genome assemblies your research requires.
Below are the positional arguments and options these subcommands share for browsing and searching for specific genome assemblies.
```plaintext
positional arguments:
keywords Taxonomic names to search for genomes. You can provide various forms
such as species names or accession numbers.
Examples: homo_sapiens, human, GCF_000001405.40, GCA_000001405.29, GRCh38, hg38
Multiple names can be combined and will be merged in the search results.
To specify multiple names, separate them with spaces.
options:
-h, --help show this help message and exit
-v level, --level level
Specify the genome assembly level (default: complete,chromosome)
complete : Fully assembled genomes
chromosome : Assembled at the chromosome level
scaffold : Assembled into scaffolds, but not to the chromosome level
contig : Contiguous sequences without gaps
-r, --refseq Show genomes that have RefSeq accession (GCF_* format)
-u, --ucsc Show genomes that have UCSC name
-l, --latest Show genomes corresponding to the latest version
-m, --metadata Save metadata for the searched genomes
```
#### Examples
```bash
# Search using scientific or common name
$ gencube genome homo_sapiens canis_lupus_familiaris
$ gencube genome human dog
# Search using assembly name
$ gencube genome T2T-CHM13v2.0 GRCh38
# Search using UCSC name
$ gencube genome hg38 hg19
# Search using RefSeq (GCF_*) or GenBank (GCA_*) accession
$ gencube genome GCF_000001405.40 GCA_021950905.1
# Show searched genomes corresponding to all genome assembly levels
$ gencube genome homo_sapiens --level complete,chromosome,scaffold,contig
# Only show genomes that have RefSeq accession and UCSC name, and correspond to the latest version
$ gencube genome homo_sapiens --refseq --ucsc --latest
# Download the full information metadata of searched genomes
$ gencube genome homo_sapiens --metadata
```
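Accession keywords follow a fixed pattern: a `GCF_` (RefSeq) or `GCA_` (GenBank) prefix, nine digits, and a version suffix. The Python sketch below shows one way to tell the two apart before passing a keyword to `gencube`. It is an illustration only, not part of gencube; the function name `classify_accession` is hypothetical.

```python
import re

# Assembly accessions look like GCA_000001405.29 or GCF_000001405.40:
# a GCA_/GCF_ prefix, nine digits, a dot, and a version number.
ACCESSION_RE = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")


def classify_accession(keyword):
    """Return 'RefSeq', 'GenBank', or None if the keyword is not an assembly accession.

    Hypothetical helper for illustration; gencube does not expose this function.
    """
    match = ACCESSION_RE.match(keyword)
    if match is None:
        return None
    return "RefSeq" if match.group(1) == "GCF" else "GenBank"


for kw in ["GCF_000001405.40", "GCA_021950905.1", "hg38"]:
    print(kw, "->", classify_accession(kw))
```

Keywords that are not accessions, such as `hg38` or `human`, return `None` here and would simply be passed to the search as names.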
#### Example output displayed in the terminal
```plaintext
$ gencube genome GCF_000001405.40 GCA_021950905.1
# Search assemblies in NCBI database
Keyword: ['GCF_000001405.40', 'GCA_021950905.1']
Total 3 genomes are searched.
# Convert JSON to dataframe format.
Filter options
Level: ['Complete', 'Chromosome']
RefSeq: False
UCSC: False
Latest: False
# Check accessibility to GenArk, Ensembl Rapid Release
UCSC GenArk : 4167 genomes across 2813 species
Ensembl Rapid: 2272 genomes across 1522 species
+----+------------------------+---------+------------+------------------+--------+----------+-----------+
| | Assembly name | Taxid | Release | NCBI | UCSC | GenArk | Ensembl |
+====+========================+=========+============+==================+========+==========+===========+
| 0 | HG002.mat.cur.20211005 | 9606 | 2022/02/04 | GCA_021951015.1 | | v | v |
+----+------------------------+---------+------------+------------------+--------+----------+-----------+
| 1 | HG002.pat.cur.20211005 | 9606 | 2022/02/04 | GCA_021950905.1 | | v | v |
+----+------------------------+---------+------------+------------------+--------+----------+-----------+
| 2 | GRCh38.p14 | 9606 | 2022/02/03 | GCF_000001405.40 | hg38 | v | |
+----+------------------------+---------+------------+------------------+--------+----------+-----------+
```
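To make the effect of `--refseq` and `--ucsc` concrete, the Python sketch below applies the same two conditions to the three rows of the table above. It mirrors the option descriptions only; the `filter_rows` helper and the dictionary keys are hypothetical and are not gencube's internal code.

```python
# The three rows from the example output, transcribed by hand.
rows = [
    {"assembly": "HG002.mat.cur.20211005", "ncbi": "GCA_021951015.1", "ucsc": ""},
    {"assembly": "HG002.pat.cur.20211005", "ncbi": "GCA_021950905.1", "ucsc": ""},
    {"assembly": "GRCh38.p14", "ncbi": "GCF_000001405.40", "ucsc": "hg38"},
]


def filter_rows(rows, refseq=False, ucsc=False):
    """Keep rows that satisfy the requested conditions (hypothetical helper)."""
    kept = []
    for row in rows:
        # --refseq: keep only genomes with a RefSeq accession (GCF_* format).
        if refseq and not row["ncbi"].startswith("GCF_"):
            continue
        # --ucsc: keep only genomes that have a UCSC name.
        if ucsc and not row["ucsc"]:
            continue
        kept.append(row)
    return kept


for row in filter_rows(rows, refseq=True, ucsc=True):
    print(row["assembly"], row["ncbi"], row["ucsc"])
```

With both conditions enabled, only the GRCh38.p14 row remains, because the two HG002 assemblies have GenBank accessions and no UCSC name.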
---
### `genome`: Search, download, and modify chromosome labels for genome assemblies
You can download genome data in FASTA format from four different databases (GenBank, RefSeq, GenArk, Ensembl Rapid Release). Each database uses a different soft-masking method, and you can selectively download the data as needed. You can also download unmasked and hard-masked genomes from the Ensembl Rapid Release database.
```plaintext
options:
-d, --download Download "fasta" formatted genome file.
-f types, --fasta types
Type of "fasta" formatted genome file (default: refseq).
Default is from the RefSeq database.
If not available, download from the GenBank database.
genbank : soft-masked genome by NCBI GenBank
refseq : soft-masked genome by NCBI RefSeq
genark : soft-masked genome by UCSC GenArk
ensembl : soft-masked genome by Ensembl Rapid Release
ensembl_hm : hard-masked genome by Ensembl Rapid Release
ensembl_um : unmasked genome by Ensembl Rapid Release
-c type, --chr_style type
Chromosome label style used in the download file (default: ensembl)
ensembl : 1, 2, X, MT. Unknowns use GenBank IDs.
gencode : chr1, chr2, chrX, chrM. Unknowns use GenBank IDs.
ucsc : chr1, chr2, chrX, chrM. Uses UCSC-specific IDs for unknowns.
(!! Limited use if UCSC IDs are not issued.)
raw : Uses raw file labels without modification. Format depends on the database:
- NCBI GenBank: CM_* or other-form IDs
- NCBI RefSeq : NC_*, NW_* or other-form IDs
- GenArk : GenBank or RefSeq IDs
- Ensembl : Ensembl IDs
-p 1-9, --compresslevel 1-9
Compression level for output data (default: 6).
Lower numbers are faster but have lower compression.
--recursive Download file regardless of their presence only if integrity check is not possible.
```
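The `ensembl` and `gencode` styles listed above differ only in how assembled chromosomes are named. The following Python sketch illustrates that mapping on FASTA header lines. It is not gencube's implementation: the function names are hypothetical, and treating `Y` like `X` is an assumption that goes beyond the labels shown in the help text.

```python
def to_gencode(label):
    """Convert an Ensembl-style label (1, 2, X, MT) to GENCODE style (chr1, chr2, chrX, chrM)."""
    if label == "MT":
        return "chrM"
    # Assumption: numbered chromosomes and the sex chromosomes X/Y get a "chr" prefix.
    if label.isdigit() or label in {"X", "Y"}:
        return "chr" + label
    # Unplaced or unknown sequences keep their original ID.
    return label


def relabel_fasta_header(line, convert=to_gencode):
    """Rewrite the sequence name of a FASTA header line; leave other lines untouched."""
    if not line.startswith(">"):
        return line
    body = line.rstrip("\n")
    newline = line[len(body):]
    name, sep, rest = body[1:].partition(" ")
    return ">" + convert(name) + sep + rest + newline


for header in [">1 dna:chromosome", ">X", ">MT", ">JAAHUQ010000001.1"]:
    print(relabel_fasta_header(header))
```

Sequence lines pass through unchanged, so the function can be applied to every line of a file in a single pass.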
#### Examples
```bash
# Download genome files under the default conditions (RefSeq or GenBank)
$ gencube genome GCF_011100685.1 --download
# Download multiple genomes from various databases
$ gencube genome GCF_011100685.1 --download --fasta refseq,genark,ensembl
# Change the chromosome labels to the GENCODE style and set the compression level of the file to 9.
$ gencube genome GCF_011100685.1 --download --chr_style gencode --compresslevel 9
```
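The trade-off behind `--compresslevel` can be tried out with Python's standard `gzip` module. This sketch assumes the output is gzip-compressed, which the help text does not state, and it uses pseudo-random bases as a stand-in for a real genome, so the sizes are only indicative.

```python
import gzip
import random

# Stand-in for genome sequence: 200,000 pseudo-random bases (fixed seed for repeatability).
rng = random.Random(0)
data = "".join(rng.choices("ACGT", k=200_000)).encode("ascii")

# Compress the same bytes at a fast level, a middle level, and the maximum level.
sizes = {level: len(gzip.compress(data, compresslevel=level)) for level in (1, 6, 9)}

for level, size in sizes.items():
    print(f"compresslevel={level}: {size} bytes ({size / len(data):.1%} of original)")
```

Real genomes contain repeats and low-complexity regions, so the gap between levels may be larger than on random data. Timing each call with `time.perf_counter()` would show the speed side of the trade-off.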
---
### `geneset`: Search, download, and modify chromosome labels for genesets (gene annotations)
```plaintext
options:
-d types, --download types
Type of gene set
refseq_gtf : RefSeq gene set (GTF format)
refseq_gff : RefSeq gene set (GFF)
gnomon : RefSeq Gnomon gene prediction (GFF)
cross : RefSeq Cross-species alignments (GFF)
same : RefSeq Same-species alignments (GFF)
agustus : GenArk Augustus gene prediction (GFF)
xenoref : GenArk XenoRefGene (GFF)
genark_ref : GenArk RefSeq gene models (GFF)
ensembl_gff : Ensembl Rapid Release gene set (GFF)
toga_gtf : Zoonomia TOGA gene set (GTF)
toga_bed : Zoonomia TOGA gene set (BED)
toga_pseudo : Zoonomia TOGA processed pseudogenes (BED)
```
#### Examples
```bash
# Search for usable and accessible data
$ gencube geneset GCF_011100685.1
# Download multiple genesets from various databases
$ gencube geneset GCF_011100685.1 --download refseq_gtf,agustus,toga_gtf
```
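When driving `gencube` from a script, it can help to validate the requested types before launching a download. The Python sketch below builds the argument list for the `geneset` subcommand and rejects names that are not in the help text above. The helper `build_geneset_command` is hypothetical; only the `geneset` subcommand, the `--download` option, and the type names come from the documentation.

```python
import shlex

# Type names copied from the `geneset --download` help text above.
GENESET_TYPES = {
    "refseq_gtf", "refseq_gff", "gnomon", "cross", "same",
    "agustus", "xenoref", "genark_ref", "ensembl_gff",
    "toga_gtf", "toga_bed", "toga_pseudo",
}


def build_geneset_command(keyword, types=()):
    """Return the argv list for a gencube geneset call (hypothetical helper)."""
    unknown = sorted(set(types) - GENESET_TYPES)
    if unknown:
        raise ValueError("unknown geneset type(s): " + ", ".join(unknown))
    cmd = ["gencube", "geneset", keyword]
    if types:
        # Multiple types are passed as one comma-separated value.
        cmd += ["--download", ",".join(types)]
    return cmd


cmd = build_geneset_command("GCF_011100685.1", ["refseq_gtf", "agustus", "toga_gtf"])
print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` on a machine where `gencube` is installed. Note that the type name is spelled `agustus` in the help text, so the common spelling `augustus` is rejected by this check.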
---
### `sequence`: Search and download sequence data of genesets
```plaintext
options:
-d types, --download types
Download "fasta" formatted sequence file.
1. Nucleotide sequences:
refseq_rna : Accessioned RNA sequences annotated on the genome assembly.
refseq_rna_genomic : RNA features based on the genome sequence.
refseq_cds_genomic : CDS features based on the genome sequence.
refseq_pseudo : Pseudogene and other gene regions without transcribed RNA or translated protein products.
ensembl_cdna : Ensembl Rapid Release cDNA sequences of transcripts.
ensembl_cds : Ensembl Rapid Release coding sequences (CDS).
ensembl_repeat : Ensembl repeat modeler sequences.
2. Protein sequences:
refseq_pep : Accessioned protein sequences annotated on the genome assembly.
refseq_pep_cds : CDS features translated into protein sequences.
ensembl_pep : Ensembl Rapid Release protein sequences.
```
#### Examples
```bash
# Search for usable and accessible data
$ gencube sequence GCF_011100685.1
# Download multiple sequence files from various databases
$ gencube sequence GCF_011100685.1 --download refseq_rna,ensembl_cdna,refseq_pep,ensembl_pep
```
---
### `annotation`: Search, download, and modify chromosome labels for various genome annotations, such as gaps and repeats
```plaintext
options:
-d types, --download types
Download annotation file.
gap : Genomic gaps - AGP defined (bigBed format)
sr : Simple tandem repeats by TRF (bigBed)
td : Tandem duplications (bigBed)
wm : Genomic intervals masked by WindowMasker + SDust (bigBed)
rmsk : Repeated elements annotated by RepeatMasker (bigBed)
cpg : CpG Islands - Islands < 300 bases are light green (bigBed)
gc : GC percent in 5-Base window (bigWig)
```
#### Examples
```bash
# Search for usable and accessible data
$ gencube annotation GCF_011100685.1
# Download multiple annotations
$ gencube annotation GCF_011100685.1 --download sr,td,rmsk,gc
```
---
### `crossgenome`: Search and download comparative genomics data, such as homology, and codon or protein alignment
```plaintext
options:
-d types, --download types
ensembl_homology : Homology data from Ensembl Rapid Release, detailing gene orthology relationships across species.
toga_homology : Homology data from TOGA, providing predictions of orthologous genes based on genome alignments.
toga_align_codon : Codon alignment data from TOGA, showing aligned codon sequences between reference and query species.
toga_align_protein : Protein alignment data from TOGA, detailing aligned protein sequences between reference and query species.
toga_inact_mut : List of inactivating mutations from TOGA, identifying mutations that disrupt gene function.
```
#### Examples
```bash
# Search for usable and accessible data
$ gencube crossgenome GCF_011100685.1
# Download multiple crossgenome data
$ gencube crossgenome GCF_011100685.1 --download toga_homology,toga_align_codon
```
---
### `seqmeta`: Search, fetch, and integrate metadata of experimental sequencing data
![seqmeta_scheme](https://github.com/keun-hong/gencube/blob/master/figures/seqmeta_scheme.jpg?raw=true)
```plaintext
$ gencube seqmeta
usage: gencube seqmeta [-h] [--info] [-o string] [-st string] [-sr string] [-l string] [-ex keywords] [-m] [keywords ...]
Search, fetch, and integrate metadata of experimental sequencing data
positional arguments:
keywords Keywords to search for sequencing-based experimental data. You can provide various forms
Examples: tissue name, cell line, disease name, etc
liver, k562, cancer, breast_cancer
Multiple keywords can be combined and will be merged in the search results.
To specify multiple names, separate them with spaces.
options:
-h, --help show this help message and exit
--info Show full information about organism, strategy, source and layout
-o string, --organism string
Scientific name or common name
Example: homo_sapiens or human
Available common names:
human, mouse, dog, dingo, wolf, cat, pig, pig_domestic, cow, dairy_cow, chicken, horse, rice, wheat
elephant, whale, naked_mole_rat, blind_mole_rat, gorilla, rhesus_monkey, cynomolgus_monkey, baboon
chimpanzee, marmoset, macaque, capuchin_monkey, squirrel_monkey, bonobo, yeast, fruit_fly, nematode
zebrafish, african clawed frog, rat, guinea pig, rabbit, opossum
-st string, --strategy string
Available strategies
wgs, wga, wxs, targeted, synthetic_long_read, gbs, rad, tn, clone_end, amplicon, clone, rna, mrna
ncrna, ribo, rip, mirna, ssrna, est, fl_cdna, atac, dnase, faire, chip, mre, bisulfite, mbd, medip
hic, chiapet, tethered
-sr string, --source string
Available sources
genomic, genomic_single_cell, transcriptomic, transcriptomic_single_cell, metagenomic
metatranscriptomic, synthetic, viral, other
-l string, --layout string
Available layout: paired, single (default: paired,single)
-ex keywords, --exclude keywords
Exclude the results for the keywords used in this option
-m, --metadata Save integrated metadata
```
#### Examples
```bash
$ gencube seqmeta --organism dog --strategy chip,chip_seq
$ gencube seqmeta --organism dog --strategy chip,chip_seq liver,lung cancer,tumor
$ gencube seqmeta --organism dog --strategy chip,chip_seq liver,lung cancer,tumor --exclude crispr,
# Example of an SRA Entrez-style search query that combines such terms:
# ((((("human"[Organism]) AND ("rna seq"[Strategy])) AND ("illumina"[Platform])) AND ("public"[Access])) AND (("liver" OR "lung") AND ("cancer" OR "tumor"))) NOT "crispr"
```
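Judging from the examples above, comma-separated keywords within one argument appear to be combined with OR, separate arguments with AND, and `--exclude` terms with NOT. The Python sketch below builds the keyword part of such a query under that reading. It is an inference from the examples, not gencube's actual query builder, and `build_keyword_query` is a hypothetical name.

```python
def _terms(group):
    """Split a comma-separated group and drop empty entries (e.g. from a trailing comma)."""
    return [term for term in group.split(",") if term]


def _any_of(terms):
    """Quote each term and join alternatives with OR."""
    quoted = [f'"{term}"' for term in terms]
    if len(quoted) == 1:
        return quoted[0]
    return "(" + " OR ".join(quoted) + ")"


def build_keyword_query(keyword_groups, exclude=""):
    """Combine groups with AND and append excluded terms with NOT (assumed semantics)."""
    groups = [_any_of(_terms(group)) for group in keyword_groups if _terms(group)]
    if not groups:
        raise ValueError("at least one keyword is required")
    query = "(" + " AND ".join(groups) + ")"
    excluded = _terms(exclude)
    if excluded:
        query += " NOT " + _any_of(excluded)
    return query


print(build_keyword_query(["liver,lung", "cancer,tumor"], exclude="crispr,"))
```

The organism, strategy, and other field-qualified clauses would be combined with this keyword clause using further AND operators.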
---
#### Credits
This package was created with [`Cookiecutter`](https://github.com/audreyr/cookiecutter) and the [`audreyr/cookiecutter-pypackage`](https://github.com/audreyr/cookiecutter-pypackage) project template.
Raw data
{
"_id": null,
"home_page": null,
"name": "gencube",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": "Keun Hong Son <newhong@snu.ac.kr>",
"keywords": "Genome, Genomic, Geneset, Annotation, Comparative, Sequencing, Metadata",
"author": null,
"author_email": "Keun Hong Son <newhong@snu.ac.kr>",
"download_url": "https://files.pythonhosted.org/packages/e5/15/f5ceb365c2896a948816912d4cdd1b601f7f06ab73d171776bcf98136a5b/gencube-0.9.0.tar.gz",
"platform": null,
"description": "Gencube\n=======\n<!-- 1. GitHub Version Badge: -->\n<!-- 2. PyPI Version Badge: -->\n<!-- 3. Supported Python Versions Badge: -->\n<!-- 6. PyPI Downloads Badge: -->\n<!-- 8. License Badge: -->\n![github version](https://img.shields.io/badge/Version-1.0.0-informational)\n[![pypi version](https://img.shields.io/pypi/v/gencube)](https://pypi.org/project/gencube/)\n![python versions](https://img.shields.io/pypi/pyversions/gencube)\n[![pypi downloads](https://img.shields.io/pypi/dm/gencube)](https://pypi.org/project/gencube/)\n[![license](https://img.shields.io/pypi/l/gencube)](LICENSE)\n\n<!-- 4. GitHub Actions CI Status Badge:\n![status](https://github.com/keun-hong/gencube/workflows/CI/badge.svg) -->\n<!-- 5. Codecov Badge:\n[![codecov](https://codecov.io/gh/keun-hong/gencube/branch/master/graph/badge.svg)](https://codecov.io/gh/keun-hong/gencube) -->\n<!-- 7. Documentation Badge:\n[![docs](https://readthedocs.org/projects/gencube/badge/?version=latest)](https://gencube.readthedocs.io/en/latest/?badge=latest) -->\n\n### Efficient retrieval, download, and unification of genomic data from leading biodiversity databases\n\n[**Keun Hong Son**](https://keun-hong.github.io/)<sup>1,2,3</sup>, and [**Je-Yoel Cho**](https://vetbio.snu.ac.kr/)<sup>1,2,3</sup>\n\n<sup>1</sup> Department of Biochemistry, College of Veterinary Medicine, Seoul National University, Seoul, Korea<br>\n<sup>2</sup> Comparative Medicine and Disease Research Center (CDRC), Science Research Center (SRC), Seoul National University, Seoul, Korea<br>\n<sup>3</sup> BK21 PLUS Program for Creative Veterinary Science Research and Research Institute for Veterinary Science, Seoul National University, Seoul, Korea<be>\n### Manuscript\n[**bioRxiv**]() (uploaded: 2024.07.01)\n<!-- Bioinformatics (accepted. 
2024.09) -->\n\n---\n`gencube` enables researchers to search for, download, and unify genome assemblies and diverse types of annotations, and retrieve metadata for sequencing-based experimental data suitable for specific requirements.\n\n![gencube_overview](https://github.com/keun-hong/gencube/blob/master/figures/gencube_overview.jpg?raw=true)\n\n### Databases accessed from gencube\n- [GenBank](https://www.ncbi.nlm.nih.gov/genbank/): NCBI GenBank Nucleotide Sequence Database\n- [RefSeq](https://www.ncbi.nlm.nih.gov/refseq/): NCBI Reference Sequence Database\n- [GenArk](https://hgdownload.soe.ucsc.edu/hubs/): UCSC Genome Archive\n- [Ensembl Rapid Release](https://rapid.ensembl.org/index.html): Ensembl genome browser that provides frequent updates for newly sequenced species\n- [Zoonomia TOGA](https://zoonomiaproject.org/the-data/): Tool to infer Orthologs from Genome Alignments\n- [INSDC](https://www.insdc.org/): International Nucleotide Sequence Database Collaboration\n- [SRA](https://www.ncbi.nlm.nih.gov/sra): NCBI Sequence Read Archive\n- [ENA](https://www.ebi.ac.uk/ena/browser/home): EMBL-EBI European Nucleotide Archive\n- [DDBJ](https://www.ddbj.nig.ac.jp/index-e.html): DNA Data Bank of Japan\n\n### Detailed information of each database\n- [GenBank & RefSeq README.txt](https://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt) - `genome`, `geneset`, `sequence`\n- [UCSC GenArk paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03057-x) - `genome`, `geneset`, `annotation`\n- [Ensembl Rapid Release Help & Docs](https://rapid.ensembl.org/info/index.html) & [Ensembl 2023 paper](https://academic.oup.com/nar/article/51/D1/D933/6786199?login=false) - `genome`, `geneset`, `sequence`, `crossgenome`\n- [Zoonomia TOGA README.txt](https://genome.senckenberg.de/download/TOGA/README.txt) & [Paper](https://www.science.org/doi/10.1126/science.abn3107) - `geneset`, `crossgenome`\n- [Search in SRA Entrez](https://www.ncbi.nlm.nih.gov/sra/docs/srasearch/), 
[Entrez Help](https://www.ncbi.nlm.nih.gov/books/NBK3837/) & [SRA Advanced Search Builder](https://www.ncbi.nlm.nih.gov/sra/advanced) - `seqmeta`\n\n\n## Installation\nThe latest release can be installed with\n```bash\npip install gencube\n```\nAlternative\n```bash\nconda install -c bioconda gencube\n```\n\n## Tutorials\n`gencube` consists of six subcommands\n```plaintext\n$ gencube\nusage: gencube [-h] {genome,geneset,sequence,annotation,crossgenome,seqmeta} ...\n\ngencube v1.0.0\n\npositional arguments:\n {genome,geneset,sequence,annotation,crossgenome,seqmeta}\n genome Search, download, and modify chromosome labels for genome assemblies\n geneset Search, download, and modify chromosome labels for genesets (gene annotations)\n sequence Search and download sequence data of genesets\n annotation Search, download, and modify chromosome labels for various genome annotations, such as gaps and repeats\n crossgenome Search and download comparative genomics data, such as homology, and codon or protein alignment\n seqmeta Search, fetch, and integrate metadata of experimental sequencing data\n\noptions:\n -h, --help show this help message and exit\n```\n---\n### The positional arguments and options shared among the `genome`, `geneset`, `sequence`, `annotation`, and `crossgenome` subcommand\nWhen using the above five subcommands, it's important to find genome assemblies required for personal research.\nBelow are the positional arguments and options shared by the these subcommands to browse and search for specific genome assemblies.\n\n```plaintext\npositional arguments:\n keywords Taxonomic names to search for genomes. You can provide various forms \n such as species names or accession numbers. 
\n Examples: homo_sapiens, human, GCF_000001405.40, GCA_000001405.29, GRCh38, hg38 \n \n Multiple names can be combined and will be merged in the search results.\n To specify multiple names, separate them with spaces.\n\noptions:\n -h, --help show this help message and exit\n -v level, --level level\n Specify the genome assembly level (default: complete,chromosome)\n complete : Fully assembled genomes\n chromosome : Assembled at the chromosome level\n scaffold : Assembled into scaffolds, but not to the chromosome level\n contig : Contiguous sequences without gaps\n \n -r, --refseq Show genomes that have RefSeq accession (GCF_* format)\n -u, --ucsc Show genomes that have UCSC name\n -l, --latest Show genomes corresponding to the latest version\n -m, --metadata Save metadata for the searched genomes\n```\n#### Examples\n```bash\n# Search using scientific or common name\n$ gencube genome homo_sapiens canis_lupus_familiaris\n$ gencube genome human dog\n\n# Search using assembly name\n$ gencube genome T2T-CHM13v2.0 GRCh38\n\n# Search using UCSC name\n$ gencube genome hg38 hg19\n\n# Search using GenBank (GCF_*) or RefSeq (GCA_*) accession\n$ gencube genome GCF_000001405.40 GCA_021950905.1\n\n# Show searched genomes corresponding to all genome assembly levels\n$ gencube genome homo_sapiens --level complete,chromosome,scaffold,contig\n\n# Only show genomes that have RefSeq accession and UCSC name, and correspond to the latest version\n$ gencube genome homo_sapiens --refseq --ucsc --latest\n\n# Download the full information metadata of searched genomes\n$ gencube genome homo_sapiens --metadata\n```\n#### Example output displayed in the terminal\n```plaintext\n$ gencube genome GCF_000001405.40 GCA_021950905.1\n\n# Search assemblies in NCBI database\n Keyword: ['GCF_000001405.40', 'GCA_021950905.1']\n\n Total 3 genomes are searched.\n\n# Convert JSON to dataframe format.\n Filter options\n Level: ['Complete', 'Chromosome']\n RefSeq: False\n UCSC: False\n Latest: False\n\n# 
Check accessibility to GenArk, Ensembl Rapid Release\n UCSC GenArk : 4167 genomes across 2813 species\n Ensembl Rapid: 2272 genomes across 1522 species\n\n+----+------------------------+---------+------------+------------------+--------+----------+-----------+\n| | Assembly name | Taxid | Release | NCBI | UCSC | GenArk | Ensembl |\n+====+========================+=========+============+==================+========+==========+===========+\n| 0 | HG002.mat.cur.20211005 | 9606 | 2022/02/04 | GCA_021951015.1 | | v | v |\n+----+------------------------+---------+------------+------------------+--------+----------+-----------+\n| 1 | HG002.pat.cur.20211005 | 9606 | 2022/02/04 | GCA_021950905.1 | | v | v |\n+----+------------------------+---------+------------+------------------+--------+----------+-----------+\n| 2 | GRCh38.p14 | 9606 | 2022/02/03 | GCF_000001405.40 | hg38 | v | |\n+----+------------------------+---------+------------+------------------+--------+----------+-----------+\n```\n\n---\n### `genome`: Search, download, and modify chromosome labels for genome assemblies\nYou can download genome data in FASTA format from four different databases (GenBank, RefSeq, GenArk, Ensembl Rapid Release). Each database uses a different soft-masking method, and you can selectively download the data as needed. 
You can also download unmasked and hard-masked genomes from the Ensembl Rapid Release database.\n```plaintext\noptions:\n -d, --download Download \"fasta\" formatted genome file.\n -f types, --fasta types\n Type of \"fasta\" formatted genome file (default: refseq).\n Default is from the RefSeq database.\n If not available, download from the GenBank database.\n genbank : soft-masked genome by NCBI GenBank\n refseq : soft-masked genome by NCBI RefSeq\n genark : soft-masked genome by UCSC GenArk\n ensembl : soft-masked genome by Ensembl Rapid Release\n ensembl_hm : hard-masked genome by Ensembl Rapid Release\n ensembl_um : unmasked genome by Ensembl Rapid Release\n -c type, --chr_style type\n Chromosome label style used in the download file (default: ensembl)\n ensembl : 1, 2, X, MT. Unknowns use GenBank IDs.\n gencode : chr1, chr2, chrX, chrM. Unknowns use GenBank IDs.\n ucsc : chr1, chr2, chrX, chrM. Uses UCSC-specific IDs for unknowns.\n (!! Limited use if UCSC IDs are not issued.)\n raw : Uses raw file labels without modification. 
Format depends on the database:\n - NCBI GenBank: CM_* or other-form IDs\n - NCBI RefSeq : NC_*, NW_* or other-form IDs\n - GenArk : GenBank or RefSeq IDs\n - Ensembl : Ensembl IDs\n -p 1-9, --compresslevel 1-9\n Compression level for output data (default: 6).\n Lower numbers are faster but have lower compression.\n --recursive Download file regardless of their presence only if integrity check is not possible.\n\n```\n#### Examples\n```bash\n# Download genome files under the default conditions (RefSeq or GenBank)\n$ gencube genome GCF_011100685.1 --download\n# Download multiple genomes from various databases\n$ gencube genome GCF_011100685.1 --download --fasta refseq,genark,ensembl\n# Change the chromosome labels to the GENCODE style and set the compression level of the file to 9.\n$ gencube genome GCF_011100685.1 --download --chr_style gencode --compresslevel 9\n```\n---\n\n### `geneset`: Search, download, and modify chromosome labels for genesets (gene annotations)\n```plaintext\noptions:\n -d types, --download types\n Type of gene set\n refseq_gtf : RefSeq gene set (GTF format)\n refseq_gff : RefSeq gene set (GFF)\n gnomon : RefSeq Gnomon gene prediction (GFF)\n cross : RefSeq Cross-species alignments (GFF)\n same : RefSeq Same-species alignments (GFF)\n agustus : GenArk Augustus gene prediction (GFF)\n xenoref : GenArk XenoRefGene (GFF)\n genark_ref : GenArk RefSeq gene models (GFF)\n ensembl_gff : Ensembl Rapid Release gene set (GFF)\n toga_gtf : Zoonomia TOGA gene set (GTF)\n toga_bed : Zoonomia TOGA gene set (BED)\n toga_pseudo : Zoonomia TOGA processed pseudogenes (BED)\n```\n\n#### Examples\n```bash\n# search usable and accessible data\ngencube geneset GCF_011100685.1\n\n# Download multiple genesets from various databases\n$ gencube geneset GCF_011100685.1 --download refseq_gtf,agustus,toga_gtf\n```\n---\n\n### `sequence`: Search and download sequence data of genesets\n```plaintext\noptions:\n -d types, --download types\n Download \"fasta\" formatted 
sequence file. \n 1. Nucleotide sequences:\n refseq_rna : Accessioned RNA sequences annotated on the genome assembly.\n refseq_rna_genomic : RNA features based on the genome sequence.\n refseq_cds_genomic : CDS features based on the genome sequence.\n refseq_pseudo : Pseudogene and other gene regions without transcribed RNA or translated protein products.\n ensembl_cdna : Ensembl Rapid Release cDNA sequences of transcripts.\n ensembl_cds : Ensembl Rapid Release coding sequences (CDS).\n ensembl_repeat : Ensembl repeat modeler sequences.\n \n 2. Protein sequences:\n refseq_pep : Accessioned protein sequences annotated on the genome assembly.\n refseq_pep_cds : CDS features translated into protein sequences.\n ensembl_pep : Ensembl Rapid Release protein sequences.\n```\n\n#### Examples\n```bash\n# search usable and accessible data\ngencube sequence GCF_011100685.1\n\n# Download multiple genesets from various databases\n$ gencube sequence GCF_011100685.1 --download refseq_rna,ensembl_cdna,refseq_pep,ensembl_pep\n```\n---\n\n### `annotation`: Search, download, and modify chromosome labels for various genome annotations, such as gaps and repeats\n```plaintext\noptions:\n -d types, --download types\n Download annotation file.\n gap : Genomic gaps - AGP defined (bigBed format)\n sr : Simple tandem repeats by TRF (bigBed) \n td : Tandem duplications (bigBed) \n wm : Genomic intervals masked by WindowMasker + SDust (bigBed) \n rmsk : Repeated elements annotated by RepeatMasker (bigBed) \n cpg : CpG Islands - Islands < 300 bases are light green (bigBed) \n gc : GC percent in 5-Base window (bigWig)\n```\n\n#### Examples\n```bash\n# search usable and accessible data\ngencube annotation GCF_011100685.1\n\n# Download multiple annotations\ngencube annotation GCF_011100685.1 --download sr,td,rmsk,gc\n```\n---\n\n### `crossgenome`: Search and download comparative genomics data, such as homology, and codon or protein alignment\n```plaintext\noptions:\n -d types, --download types\n 
ensembl_homology : Homology data from Ensembl Rapid Release, detailing gene orthology relationships across species.\n toga_homology : Homology data from TOGA, providing predictions of orthologous genes based on genome alignments.\n toga_align_codon : Codon alignment data from TOGA, showing aligned codon sequences between reference and query species.\n toga_align_protein : Protein alignment data from TOGA, detailing aligned protein sequences between reference and query species.\n toga_inact_mut : List of inactivating mutations from TOGA, identifying mutations that disrupt gene function.\n```\n\n#### Examples\n```bash\n# search usable and accessible data\ngencube crossgenome GCF_011100685.1\n\n# Download multiple crossgenome datasets\n$ gencube crossgenome GCF_011100685.1 --download toga_homology,toga_align_codon\n```\n---\n\n### `seqmeta`: Search, fetch, and integrate metadata of experimental sequencing data\n![seqmeta_scheme](https://github.com/keun-hong/gencube/blob/master/figures/seqmeta_scheme.jpg?raw=true)\n```plaintext\n$ gencube seqmeta\nusage: gencube seqmeta [-h] [--info] [-o string] [-st string] [-sr string] [-l string] [-ex keywords] [-m] [keywords ...]\n\nSearch, fetch, and integrate metadata of experimental sequencing data\n\npositional arguments:\n keywords Keywords to search for sequencing-based experimental data. 
You can provide various forms \n Examples: tissue name, cell line, disease name, etc \n liver, k562, cancer, breast_cancer\n Multiple keywords can be combined and will be merged in the search results.\n To specify multiple names, separate them with spaces.\n\noptions:\n -h, --help show this help message and exit\n --info Show full information about organism, strategy, source and layout \n \n -o string, --organism string\n Scientific name or common name \n Example: homo_sapiens or human \n \n Available common names:\n human, mouse, dog, dingo, wolf, cat, pig, pig_domestic, cow, dairy_cow, chicken, horse, rice, wheat\n elephant, whale, naked_mole_rat, blind_mole_rat, gorilla, rhesus_monkey, cynomolgus_monkey, baboon\n chimpanzee, marmoset, macaque, capuchin_monkey, squirrel_monkey, bonobo, yeast, fruit_fly, nematode\n zebrafish, african clawed frog, rat, guinea pig, rabbit, opossum\n \n -st string, --strategy string\n Available strategies \n wgs, wga, wxs, targeted, synthetic_long_read, gbs, rad, tn, clone_end, amplicon, clone, rna, mrna\n ncrna, ribo, rip, mirna, ssrna, est, fl_cdna, atac, dnase, faire, chip, mre, bisulfite, mbd, medip\n hic, chiapet, tethered\n \n -sr string, --source string\n Available sources \n genomic, genomic_single_cell, transcriptomic, transcriptomic_single_cell, metagenomic\n metatranscriptomic, synthetic, viral, other\n \n -l string, --layout string\n Available layout: paired, single (default: paired,single) \n \n -ex keywords, --exclude keywords\n Exclude the results for the keywords used in this option \n \n -m, --metadata Save integrated metadata\n```\n\n#### Examples\n```bash\n$ gencube seqmeta --organism dog --strategy chip,chip_seq\n\n$ gencube seqmeta --organism dog --strategy chip,chip_seq liver,lung cancer,tumor\n\n$ gencube seqmeta --organism dog --strategy chip,chip_seq liver,lung cancer,tumor --exclude crispr,\n\n(((((\"human\"[Organism]) AND (\"rna seq\"[Strategy])) AND (\"illumina\"[Platform])) AND (\"public\"[Access])) AND 
((\"liver\" OR \"lung\") AND (\"cancer\" OR \"tumor\"))) NOT \"crispr\"\n```\n---\n\n#### Credits\nThis package was created with [`Cookiecutter`](https://github.com/audreyr/cookiecutter) and the [`audreyr/cookiecutter-pypackage`](https://github.com/audreyr/cookiecutter-pypackage) project template.\n",
"bugtrack_url": null,
"license": "MIT License Copyright (c) 2024, Keun Hong Son Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
"summary": "GenCube enables researchers to search for, download, and unify genome assemblies and diverse types of annotations, and retrieve metadata for sequencing-based experimental data suitable for specific requirements.",
"version": "0.9.0",
"project_urls": {
"Bugs": "https://github.com/keun-hong/gencube/issues",
"Changelog": "https://github.com/keun-hong/gencube/blob/master/changelog.md",
"Homepage": "https://keun-hong.github.io/",
"Repository": "https://github.com/keun-hong/gencube"
},
"split_keywords": [
"genome",
" genomic",
" geneset",
" annotation",
" comparative",
" sequencing",
" metadata"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "588c3ba2c6bb6a561db69a04ce6d5885d120e3a342f3a7863bc0280259efb36a",
"md5": "c6684e0ef1192b506642a83a72ca6c78",
"sha256": "b94a2947a2cc8cbc15f648962d601c350b12c45a9270f1ebd111abbd7e6782ce"
},
"downloads": -1,
"filename": "gencube-0.9.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "c6684e0ef1192b506642a83a72ca6c78",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 56048,
"upload_time": "2024-06-27T22:18:48",
"upload_time_iso_8601": "2024-06-27T22:18:48.224868Z",
"url": "https://files.pythonhosted.org/packages/58/8c/3ba2c6bb6a561db69a04ce6d5885d120e3a342f3a7863bc0280259efb36a/gencube-0.9.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "e515f5ceb365c2896a948816912d4cdd1b601f7f06ab73d171776bcf98136a5b",
"md5": "072a38cada432bcc53e61ffdf7b61b91",
"sha256": "1bb46d2796eb796e399b7c63f883358704ba9bdfdecd73d584a2d4ae9d239ca2"
},
"downloads": -1,
"filename": "gencube-0.9.0.tar.gz",
"has_sig": false,
"md5_digest": "072a38cada432bcc53e61ffdf7b61b91",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 75772,
"upload_time": "2024-06-27T22:19:07",
"upload_time_iso_8601": "2024-06-27T22:19:07.600581Z",
"url": "https://files.pythonhosted.org/packages/e5/15/f5ceb365c2896a948816912d4cdd1b601f7f06ab73d171776bcf98136a5b/gencube-0.9.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-06-27 22:19:07",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "keun-hong",
"github_project": "gencube",
"travis_ci": true,
"coveralls": false,
"github_actions": false,
"requirements": [],
"tox": true,
"lcname": "gencube"
}