ncbi-counts


|Field|Value|
|:----|:----|
|Name|ncbi-counts|
|Version|0.1.0|
|Home page|https://github.com/136s/ncbi_counts|
|Summary|Download the NCBI-generated RNA-seq count data by specifying the Series accession number(s), and the regular expression of the Sample attributes.|
|Upload time|2023-12-05 02:16:57|
|Author|Yuki SUYAMA|
|Requires Python|>=3.10.0|
|License|MIT|
|Keywords|geo, gene expression omnibus, bioinformatics, rna-seq, ncbi|

# ncbi_counts

Download the [NCBI-generated RNA-seq count data](https://www.ncbi.nlm.nih.gov/geo/info/rnaseqcounts.html) by specifying the Series accession number(s), and the regular expression of the Sample attributes.

If you only need a count matrix covering all samples (GSM) in a series (GSE), you do not need this library. It is useful when you need, for each GSE, a count matrix restricted to the samples you designate as the control group and the treatment group.

## Installation

From [PyPI](https://pypi.org/project/ncbi-counts/):

```sh
pip install ncbi-counts
```

## Usage

```sh
python -m ncbi_counts [-h] [-n NORM] [-a ANNOT_VER] [-k [KEEP_ANNOT ...]] [-s SRC_DIR] [-o OUTPUT] [-q] [-S SEP] [-y GSM_YAML] [-c] FILE
```

### Options

```sh
positional arguments:
  FILE                  Path to input file (.yaml, .yml) which represents each GSE accession number(s) which contains a sequence of maps with two keys: 'control' and 'treatment'. Each of these maps further contains key(s) (e.g., 'title', 'characteristics_ch1').

options:
  -h, --help            show this help message and exit
  -n NORM, --norm-type NORM
                        Normalization type of counts (choices: None, fpkm, tpm, default: None)
  -a ANNOT_VER, --annot-ver ANNOT_VER
                        Annotation version of counts (default: GRCh38.p13)
  -k [KEEP_ANNOT ...], --keep-annot [KEEP_ANNOT ...]
                        Annotation column(s) to keep (choices: Symbol, Description, Synonyms, GeneType, EnsemblGeneID, Status, ChrAcc, ChrStart, ChrStop, Orientation, Length, GOFunctionID, GOProcessID, GOComponentID, GOFunction, GOProcess, GOComponent, default: None)
  -s SRC_DIR, --src-dir SRC_DIR
                        A directory to save the source obtained from NCBI (default: ./)
  -o OUTPUT, --output OUTPUT
                        A directory to save the count matrix (or matrices) (default: ./)
  -q, --silent          If True, suppress warnings (default: False)
  -S SEP, --sep SEP     Separator between group and GSM in column (default: -)
  -y GSM_YAML, --yaml GSM_YAML
                        Path to save YAML file which contains GSMs (default: None)
  -c, --cleanup         If True, remove source files (default: False)
```
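
The structure expected in `FILE` can be checked before running the command. The following is a minimal sketch, not part of ncbi_counts: `validate_pairs` is a hypothetical helper that takes the already-parsed YAML content as a Python dict and verifies only the shape described in the help text above (each GSE maps to a sequence of maps with the keys 'control' and 'treatment', each holding attribute-to-regex entries).

```python
import re

REQUIRED_GROUPS = {"control", "treatment"}


def validate_pairs(config: dict) -> list[str]:
    """Return a list of problems found in a parsed input file (empty if none).

    Hypothetical helper, not part of ncbi_counts; it checks only the shape
    described in the help text for FILE.
    """
    problems = []
    for gse, pairs in config.items():
        if not isinstance(pairs, list):
            problems.append(f"{gse}: expected a sequence of maps")
            continue
        for i, pair in enumerate(pairs, start=1):
            if not isinstance(pair, dict) or set(pair) != REQUIRED_GROUPS:
                problems.append(f"{gse} pair {i}: needs 'control' and 'treatment'")
                continue
            for group, attrs in pair.items():
                if not isinstance(attrs, dict) or not attrs:
                    problems.append(f"{gse} pair {i} {group}: needs an attribute")
                    continue
                for key, pattern in attrs.items():
                    try:
                        re.compile(str(pattern))  # each value is a regular expression
                    except re.error as exc:
                        problems.append(f"{gse} pair {i} {group}.{key}: {exc}")
    return problems


config = {
    "GSE164073": [
        {
            "control": {"title": "Cornea", "characteristics_ch1": "mock"},
            "treatment": {"title": "Cornea", "characteristics_ch1": "SARS-CoV-2"},
        }
    ]
}
print(validate_pairs(config))  # prints []
```

An empty list means the shape is as documented; it does not guarantee that the patterns match any sample in the series.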

### Command-line Example

To create a mock vs. CoV2 comparison pair for each tissue from [GSE164073](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE164073), prepare the following YAML file. The words beginning with "!!" are type hints and can be omitted:

> [!NOTE]
> The acceptable options for Sample attributes (such as 'title' and 'characteristics_ch1') are listed in the [Sample Attributes](https://www.ncbi.nlm.nih.gov/geo/info/soft.html#sample_tab) table and the [SOFT download](https://www.ncbi.nlm.nih.gov/geo/info/soft.html#download) section of the [SOFT submission instructions](https://www.ncbi.nlm.nih.gov/geo/info/soft.html) page.
> You can use the values in the 'Label' column of the table as keys in the YAML file after removing the '!Sample_' prefix (for example, '!Sample_title' becomes 'title').
>
> If you want a comprehensive list of attributes for all samples in a series, the [`GEOparse` library](https://geoparse.readthedocs.io/en/latest/GEOparse.html#GEOparse.GEOTypes.GSE.phenotype_data) is useful.
>
> ```python
>  import GEOparse
>  GEOparse.get_GEO("GSExxxxx").phenotype_data
> ```

```sample_regex.yaml
GSE164073: !!seq
- control: !!map
    title: !!str Cornea
    characteristics_ch1: !!str mock
  treatment: !!map
    title: !!str Cornea
    characteristics_ch1: !!str SARS-CoV-2
- control: !!map
    title: !!str Limbus
    characteristics_ch1: !!str mock
  treatment: !!map
    title: !!str Limbus
    characteristics_ch1: !!str SARS-CoV-2
- control: !!map
    title: !!str Sclera
    characteristics_ch1: !!str mock
  treatment: !!map
    title: !!str Sclera
    characteristics_ch1: !!str SARS-CoV-2
```

Alternatively, to specify the GSMs directly, prepare the following YAML file:

```samples.yaml
GSE164073: !!seq
- control: !!map
    geo_accession: !!str ^GSM4996084$|^GSM4996085$|^GSM4996086$
  treatment: !!map
    geo_accession: !!str ^GSM4996087$|^GSM4996088$|^GSM4996089$
- control: !!map
    geo_accession: !!str ^GSM4996090$|^GSM4996091$|^GSM4996092$
  treatment: !!map
    geo_accession: !!str ^GSM4996093$|^GSM4996094$|^GSM4996095$
- control: !!map
    geo_accession: !!str ^GSM4996096$|^GSM4996097$|^GSM4996098$
  treatment: !!map
    geo_accession: !!str ^GSM4996099$|^GSM4996100$|^GSM4996101$
```

Then run the following command (the "Symbol" column is kept in this example):

```sh
python -m ncbi_counts sample_regex.yaml -k Symbol -c
```

You will then get the following files:

<details open><summary>GSE164073-1.tsv</summary>

|GeneID|Symbol|control-GSM4996084|control-GSM4996085|control-GSM4996086|treatment-GSM4996088|treatment-GSM4996087|treatment-GSM4996089|
|:----|:----|:----|:----|:----|:----|:----|:----|
|1|A1BG|144|197|157|156|133|122|
|2|A2M|254|276|262|178|153|178|
|3|A2MP1|1|0|2|0|0|0|
|9|NAT1|97|133|103|83|93|88|
|...|...|...|...|...|...|...|...|
</details>
<details><summary>GSE164073-2.tsv</summary>

|GeneID|Symbol|control-GSM4996092|control-GSM4996091|control-GSM4996090|treatment-GSM4996095|treatment-GSM4996094|treatment-GSM4996093|
|:----|:----|:----|:----|:----|:----|:----|:----|
|1|A1BG|175|167|203|143|145|145|
|2|A2M|261|158|427|215|145|169|
|3|A2MP1|0|0|0|0|0|2|
|9|NAT1|122|100|133|90|78|80|
|...|...|...|...|...|...|...|...|
</details>

<details><summary>GSE164073-3.tsv</summary>

|GeneID|Symbol|control-GSM4996098|control-GSM4996097|control-GSM4996096|treatment-GSM4996099|treatment-GSM4996100|treatment-GSM4996101|
|:----|:----|:----|:----|:----|:----|:----|:----|
|1|A1BG|158|115|140|136|124|145|
|2|A2M|3337|2261|2536|1524|1288|1807|
|3|A2MP1|0|0|0|0|0|0|
|9|NAT1|83|64|68|65|52|79|
|...|...|...|...|...|...|...|...|
</details>
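
Each sample column in these files is named with the group, the separator (`-` by default, see `-S`), and the GSM accession. As a minimal sketch for downstream use, the hypothetical helper `split_columns` below (not part of ncbi_counts) groups such column names by their group prefix, assuming the default separator.

```python
def split_columns(columns: list[str], sep: str = "-") -> dict[str, list[str]]:
    """Group sample columns named '<group><sep><GSM>' by their group prefix.

    Hypothetical helper, not part of ncbi_counts. Columns that do not
    follow this naming (e.g. GeneID, Symbol) are skipped.
    """
    groups: dict[str, list[str]] = {}
    for column in columns:
        group, found, gsm = column.partition(sep)
        if found and gsm.startswith("GSM"):
            groups.setdefault(group, []).append(gsm)
    return groups


# Header row of GSE164073-1.tsv as shown above
header = [
    "GeneID", "Symbol",
    "control-GSM4996084", "control-GSM4996085", "control-GSM4996086",
    "treatment-GSM4996088", "treatment-GSM4996087", "treatment-GSM4996089",
]
groups = split_columns(header)
print(groups["control"])    # ['GSM4996084', 'GSM4996085', 'GSM4996086']
print(groups["treatment"])  # ['GSM4996088', 'GSM4996087', 'GSM4996089']
```

The column order within each group is preserved as it appears in the file.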

Because the command above passes `-c`, the source files obtained from NCBI are removed. Omit `-c` if you want to keep them in the directory given by `-s` (default: `./`).

### Example in Python

To get the output as a pandas DataFrame, please refer to the following code:

```python
from ncbi_counts import Series

series = Series(
    "GSE164073",
    [
        {
            "control": {"title": "Cornea", "characteristics_ch1": "mock"},
            "treatment": {"title": "Cornea", "characteristics_ch1": "SARS-CoV-2"},
        },
        {
            "control": {"title": "Limbus", "characteristics_ch1": "mock"},
            "treatment": {"title": "Limbus", "characteristics_ch1": "SARS-CoV-2"},
        },
        {
            "control": {"geo_accession": "^GSM499609[6-8]$"},
            "treatment": {"geo_accession": "^GSM4996099$|^GSM4996100$|^GSM4996101$"},
        },
    ],
    keep_annot=["Symbol"],
    save_to=None,
)
series.generate_pair_matrix()
# series.cleanup()  # remove source files
series.pair_count_list[0]  # Corresponds to GSE164073-1.tsv
series.pair_count_list[1]  # Corresponds to GSE164073-2.tsv
series.pair_count_list[2]  # Corresponds to GSE164073-3.tsv
```
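
The `geo_accession` patterns in the examples above anchor each accession with `^` and `$` and join them with `|`. A minimal sketch of building such a pattern from a list of accessions follows. `gsm_pattern` is a hypothetical helper, not part of ncbi_counts, and the checks with `re.search` only illustrate what the pattern matches; how the library applies the regular expression internally is not shown here.

```python
import re


def gsm_pattern(accessions: list[str]) -> str:
    """Build an anchored alternation such as '^GSM1$|^GSM2$'.

    Hypothetical helper, not part of ncbi_counts.
    """
    return "|".join(f"^{re.escape(acc)}$" for acc in accessions)


pattern = gsm_pattern(["GSM4996099", "GSM4996100", "GSM4996101"])
print(pattern)  # ^GSM4996099$|^GSM4996100$|^GSM4996101$

# The anchors prevent partial matches on longer accessions.
assert re.search(pattern, "GSM4996100")
assert not re.search(pattern, "GSM49961000")

# The character-class form used for the third pair is equivalent
# to listing GSM4996096 to GSM4996098 explicitly.
short = "^GSM499609[6-8]$"
explicit = gsm_pattern(["GSM4996096", "GSM4996097", "GSM4996098"])
for acc in ["GSM4996095", "GSM4996096", "GSM4996097", "GSM4996098", "GSM4996099"]:
    assert bool(re.search(short, acc)) == bool(re.search(explicit, acc))
```

Without the anchors, a pattern such as `GSM499610` would also match `GSM4996100` and `GSM4996101`, which is why the examples spell out each accession in full.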

## License

ncbi_counts is released under the [MIT license](LICENSE).

            
