# ncbi_counts
Download the [NCBI-generated RNA-seq count data](https://www.ncbi.nlm.nih.gov/geo/info/rnaseqcounts.html) by specifying Series accession number(s) and regular expressions for Sample attributes.
If you just need a count matrix for all samples (GSM) in a series (GSE), this library is not needed. However, if you need, for each GSE, a count matrix that contains only the specified control group samples and treatment group samples, this library may be useful.
## Installation
From [PyPI](https://pypi.org/project/ncbi-counts/):
```sh
pip install ncbi-counts
```
## Usage
```sh
python -m ncbi_counts [-h] [-n NORM] [-a ANNOT_VER] [-k [KEEP_ANNOT ...]] [-s SRC_DIR] [-o OUTPUT] [-q] [-S SEP] [-y GSM_YAML] [-c] FILE
```
### Options
```sh
positional arguments:
  FILE                  Path to input file (.yaml, .yml) which represents each GSE accession number(s) which contains a sequence of maps with two keys: 'control' and 'treatment'. Each of these maps further contains key(s) (e.g., 'title', 'characteristics_ch1').

options:
  -h, --help            show this help message and exit
  -n NORM, --norm-type NORM
                        Normalization type of counts (choices: None, fpkm, tpm, default: None)
  -a ANNOT_VER, --annot-ver ANNOT_VER
                        Annotation version of counts (default: GRCh38.p13)
  -k [KEEP_ANNOT ...], --keep-annot [KEEP_ANNOT ...]
                        Annotation column(s) to keep (choices: Symbol, Description, Synonyms, GeneType, EnsemblGeneID, Status, ChrAcc, ChrStart, ChrStop, Orientation, Length, GOFunctionID, GOProcessID, GOComponentID, GOFunction, GOProcess, GOComponent, default: None)
  -s SRC_DIR, --src-dir SRC_DIR
                        A directory to save the source obtained from NCBI (default: ./)
  -o OUTPUT, --output OUTPUT
                        A directory to save the count matrix (or matrices) (default: ./)
  -q, --silent          If True, suppress warnings (default: False)
  -S SEP, --sep SEP     Separator between group and GSM in column (default: -)
  -y GSM_YAML, --yaml GSM_YAML
                        Path to save YAML file which contains GSMs (default: None)
  -c, --cleanup         If True, remove source files (default: False)
```
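As a rough illustration of the `FILE` layout described above (a GSE key holding a sequence of maps, each with a 'control' and a 'treatment' map of attribute patterns), the following sketch builds the YAML text with the standard library only. The helper names `quote` and `build_input_yaml` are hypothetical and not part of ncbi_counts; the tissue names and attribute values are taken from the GSE164073 example in the next section. Values are single-quoted so that regular-expression characters are kept literally by YAML.

```python
def quote(value: str) -> str:
    """Return value as a single-quoted YAML scalar ('' escapes a quote)."""
    return "'" + value.replace("'", "''") + "'"


def build_input_yaml(gse: str, pairs: list) -> str:
    """Render one GSE entry: a sequence of control/treatment maps.

    Hypothetical helper for illustration; not part of ncbi_counts.
    """
    lines = [f"{gse}:"]
    for pair in pairs:
        for i, group in enumerate(("control", "treatment")):
            # The first key of each sequence item carries the "- " marker.
            prefix = "- " if i == 0 else "  "
            lines.append(f"{prefix}{group}:")
            for key, pattern in pair[group].items():
                lines.append(f"    {key}: {quote(pattern)}")
    return "\n".join(lines) + "\n"


tissues = ["Cornea", "Limbus", "Sclera"]
pairs = [
    {
        "control": {"title": t, "characteristics_ch1": "mock"},
        "treatment": {"title": t, "characteristics_ch1": "SARS-CoV-2"},
    }
    for t in tissues
]
yaml_text = build_input_yaml("GSE164073", pairs)
print(yaml_text)
```

Writing `yaml_text` to a file such as `sample_regex.yaml` should give an input equivalent to a hand-written one, assuming the tool reads standard YAML.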
### Command-line Example
To create a mock vs. SARS-CoV-2 comparison pair for each tissue from [GSE164073](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE164073), prepare the following YAML file (the words beginning with "!!" are type hints and can be omitted):
> [!NOTE]
> The acceptable keys for Sample attributes (such as 'title' and 'characteristics_ch1') are listed in the [Sample Attributes](https://www.ncbi.nlm.nih.gov/geo/info/soft.html#sample_tab) table and the [SOFT download](https://www.ncbi.nlm.nih.gov/geo/info/soft.html#download) section of the [SOFT submission instructions](https://www.ncbi.nlm.nih.gov/geo/info/soft.html) page.
> Use the values in the 'Label' column of the table as keys in the YAML file, omitting the '!Sample_' prefix.
>
> If you want a comprehensive list of attributes for all samples in a series, the [`GEOparse` library](https://geoparse.readthedocs.io/en/latest/GEOparse.html#GEOparse.GEOTypes.GSE.phenotype_data) is useful.
>
> ```python
> import GEOparse
> GEOparse.get_GEO("GSExxxxx").phenotype_data
> ```
```sample_regex.yaml
GSE164073: !!seq
- control: !!map
    title: !!str Cornea
    characteristics_ch1: !!str mock
  treatment: !!map
    title: !!str Cornea
    characteristics_ch1: !!str SARS-CoV-2
- control: !!map
    title: !!str Limbus
    characteristics_ch1: !!str mock
  treatment: !!map
    title: !!str Limbus
    characteristics_ch1: !!str SARS-CoV-2
- control: !!map
    title: !!str Sclera
    characteristics_ch1: !!str mock
  treatment: !!map
    title: !!str Sclera
    characteristics_ch1: !!str SARS-CoV-2
```
Alternatively, to specify the GSMs directly, prepare the following YAML file:
```samples.yaml
GSE164073: !!seq
- control: !!map
    geo_accession: !!str ^GSM4996084$|^GSM4996085$|^GSM4996086$
  treatment: !!map
    geo_accession: !!str ^GSM4996087$|^GSM4996088$|^GSM4996089$
- control: !!map
    geo_accession: !!str ^GSM4996090$|^GSM4996091$|^GSM4996092$
  treatment: !!map
    geo_accession: !!str ^GSM4996093$|^GSM4996094$|^GSM4996095$
- control: !!map
    geo_accession: !!str ^GSM4996096$|^GSM4996097$|^GSM4996098$
  treatment: !!map
    geo_accession: !!str ^GSM4996099$|^GSM4996100$|^GSM4996101$
```
Then run the following command (the "Symbol" column is kept in this example):
```sh
python -m ncbi_counts sample_regex.yaml -k Symbol -c
```
This produces the following files:
<details open><summary>GSE164073-1.tsv</summary>
|GeneID|Symbol|control-GSM4996084|control-GSM4996085|control-GSM4996086|treatment-GSM4996088|treatment-GSM4996087|treatment-GSM4996089|
|:----|:----|:----|:----|:----|:----|:----|:----|
|1|A1BG|144|197|157|156|133|122|
|2|A2M|254|276|262|178|153|178|
|3|A2MP1|1|0|2|0|0|0|
|9|NAT1|97|133|103|83|93|88|
|...|...|...|...|...|...|...|...|
</details>
<details><summary>GSE164073-2.tsv</summary>
|GeneID|Symbol|control-GSM4996092|control-GSM4996091|control-GSM4996090|treatment-GSM4996095|treatment-GSM4996094|treatment-GSM4996093|
|:----|:----|:----|:----|:----|:----|:----|:----|
|1|A1BG|175|167|203|143|145|145|
|2|A2M|261|158|427|215|145|169|
|3|A2MP1|0|0|0|0|0|2|
|9|NAT1|122|100|133|90|78|80|
|...|...|...|...|...|...|...|...|
</details>
<details><summary>GSE164073-3.tsv</summary>
|GeneID|Symbol|control-GSM4996098|control-GSM4996097|control-GSM4996096|treatment-GSM4996099|treatment-GSM4996100|treatment-GSM4996101|
|:----|:----|:----|:----|:----|:----|:----|:----|
|1|A1BG|158|115|140|136|124|145|
|2|A2M|3337|2261|2536|1524|1288|1807|
|3|A2MP1|0|0|0|0|0|0|
|9|NAT1|83|64|68|65|52|79|
|...|...|...|...|...|...|...|...|
</details>
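The sample columns in these files combine the group name and the GSM accession with the separator set by `-S` (default `-`). A minimal sketch of recovering the group membership from a header is shown below, using only the standard library. It assumes the files are tab-separated (suggested by the `.tsv` extension); `group_columns` is a hypothetical helper, and the inline text reproduces the first rows of GSE164073-1.tsv shown above.

```python
import csv
import io

# First rows of GSE164073-1.tsv as shown above (assumed tab-separated).
tsv_text = (
    "GeneID\tSymbol\tcontrol-GSM4996084\tcontrol-GSM4996085\tcontrol-GSM4996086"
    "\ttreatment-GSM4996088\ttreatment-GSM4996087\ttreatment-GSM4996089\n"
    "1\tA1BG\t144\t197\t157\t156\t133\t122\n"
    "2\tA2M\t254\t276\t262\t178\t153\t178\n"
)


def group_columns(header, sep="-", groups=("control", "treatment")):
    """Map each group name to the GSM accessions found in the header.

    Hypothetical helper for illustration; not part of ncbi_counts.
    """
    result = {group: [] for group in groups}
    for name in header:
        group, found, gsm = name.partition(sep)
        # Columns such as GeneID or Symbol contain no separator and are skipped.
        if found and group in result:
            result[group].append(gsm)
    return result


reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
header = next(reader)
samples = group_columns(header)
print(samples)
```

To apply this to a real output file, replace `io.StringIO(tsv_text)` with an opened file handle and pass the same `sep` value that was given to `-S`.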
If you don't need the source files downloaded from NCBI, remove them by passing `-c` (`--cleanup`), as in the command above.
### Example in Python
To get the output as a pandas DataFrame, please refer to the following code:
```python
from ncbi_counts import Series

series = Series(
    "GSE164073",
    [
        {
            "control": {"title": "Cornea", "characteristics_ch1": "mock"},
            "treatment": {"title": "Cornea", "characteristics_ch1": "SARS-CoV-2"},
        },
        {
            "control": {"title": "Limbus", "characteristics_ch1": "mock"},
            "treatment": {"title": "Limbus", "characteristics_ch1": "SARS-CoV-2"},
        },
        {
            "control": {"geo_accession": "^GSM499609[6-8]$"},
            "treatment": {"geo_accession": "^GSM4996099$|^GSM4996100$|^GSM4996101$"},
        },
    ],
    keep_annot=["Symbol"],
    save_to=None,
)
series.generate_pair_matrix()
# series.cleanup()  # remove source files
series.pair_count_list[0]  # Corresponds to GSE164073-1.tsv
series.pair_count_list[1]  # Corresponds to GSE164073-2.tsv
series.pair_count_list[2]  # Corresponds to GSE164073-3.tsv
```
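The third pair above selects its control samples with the character class `^GSM499609[6-8]$`, while the YAML example lists each accession explicitly. The sketch below checks that the two forms select the same GSMs and shows why the `^` and `$` anchors matter. It uses only the standard `re` module and assumes the patterns are applied with search semantics (as in `re.search`), which this README does not state explicitly; `exact_pattern` is a hypothetical helper, and `GSM499609` is a made-up short accession used only to show the effect of anchoring.

```python
import re

# The 18 samples of the example series: GSM4996084 ... GSM4996101.
gsms = [f"GSM{n}" for n in range(4996084, 4996102)]

char_class = re.compile(r"^GSM499609[6-8]$")
alternation = re.compile(r"^GSM4996096$|^GSM4996097$|^GSM4996098$")

by_class = [g for g in gsms if char_class.search(g)]
by_alternation = [g for g in gsms if alternation.search(g)]
print(by_class == by_alternation)


def exact_pattern(accessions):
    """Build an anchored alternation that matches only the given accessions.

    Hypothetical helper for illustration; not part of ncbi_counts.
    """
    return "|".join(f"^{re.escape(acc)}$" for acc in accessions)


treatment_pattern = exact_pattern(["GSM4996099", "GSM4996100", "GSM4996101"])
print(treatment_pattern)

# Without anchors, a pattern also matches longer accessions that contain it.
unanchored = [g for g in gsms if re.search("GSM499609", g)]
anchored = [g for g in gsms if re.search("^GSM499609$", g)]
print(len(unanchored), len(anchored))
```

Anchoring each accession keeps a short pattern from accidentally pulling in unrelated samples whose accession merely contains it.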
## License
ncbi_counts is released under an [MIT license](LICENSE).
Raw data
{
"_id": null,
"home_page": "https://github.com/136s/ncbi_counts",
"name": "ncbi-counts",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9.0",
"maintainer_email": null,
"keywords": "GEO, Gene Expression Omnibus, Bioinformatics, RNA-seq, NCBI",
"author": "Yuki SUYAMA",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/6b/86/aaa903c47fca3da2a27f7c62032ad1fc21918a730454dfd81af764b094ea/ncbi_counts-0.2.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Download the NCBI-generated RNA-seq count data by specifying the Series accession number(s), and the regular expression of the Sample attributes.",
"version": "0.2.0",
"project_urls": {
"Homepage": "https://github.com/136s/ncbi_counts"
},
"split_keywords": [
"geo",
" gene expression omnibus",
" bioinformatics",
" rna-seq",
" ncbi"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "8f57806a8093a26f689b128f5da050934f1d1fb7d2c913d25fe25dbd45cf931c",
"md5": "5b8892ad4b65a287c3d10deb6f2f54bc",
"sha256": "ecec3bb7e01a6aab54cc6c459145e371f79b53aaa372a3e7d91916ae99f1dae8"
},
"downloads": -1,
"filename": "ncbi_counts-0.2.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "5b8892ad4b65a287c3d10deb6f2f54bc",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9.0",
"size": 13005,
"upload_time": "2024-06-04T10:34:44",
"upload_time_iso_8601": "2024-06-04T10:34:44.560999Z",
"url": "https://files.pythonhosted.org/packages/8f/57/806a8093a26f689b128f5da050934f1d1fb7d2c913d25fe25dbd45cf931c/ncbi_counts-0.2.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "6b86aaa903c47fca3da2a27f7c62032ad1fc21918a730454dfd81af764b094ea",
"md5": "6f77ccee5c14ba89dc0bbba4e715c9b8",
"sha256": "9414e707974148fa7b24d57fbe0a1e4c295a6a91969447921c234d86d0524202"
},
"downloads": -1,
"filename": "ncbi_counts-0.2.0.tar.gz",
"has_sig": false,
"md5_digest": "6f77ccee5c14ba89dc0bbba4e715c9b8",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9.0",
"size": 13670,
"upload_time": "2024-06-04T10:34:45",
"upload_time_iso_8601": "2024-06-04T10:34:45.715101Z",
"url": "https://files.pythonhosted.org/packages/6b/86/aaa903c47fca3da2a27f7c62032ad1fc21918a730454dfd81af764b094ea/ncbi_counts-0.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-06-04 10:34:45",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "136s",
"github_project": "ncbi_counts",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "ncbi-counts"
}