archs4py

Name	archs4py
Version	0.2.19
home_page	https://github.com/maayanlab/archs4py
Summary	ARCHS4 Python package supporting data loading and data queries.
upload_time	2024-08-26 15:20:38
maintainer	None
docs_url	None
author	Alexander Lachmann
requires_python	>=3.7
license	None
keywords
VCS
bugtrack_url
requirements	h5py numpy pandas qnorm setuptools tqdm wget s3fs xalign
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            <img title="archs4py" alt="archs4py" src="https://user-images.githubusercontent.com/32603869/243101679-c5147d56-fce0-4498-9577-a300df7d6dce.png">

# archs4py - Official Python package to load and query ARCHS4 data

Official ARCHS4 companion package. This package is a wrapper for basic H5 commands performed on the ARCHS4 data files. Some of the data access is optimized for specific query strategies and should make this implementation faster than manually querying the data. The package supports automated file download, multithreading, and some convenience functions such as data normalization.

ARCHS4py also supports the ARCHS4 alignment pipeline. When aligning FASTQ files using ARCHS4py, gene and transcript counts will be compatible with the preprocessed ARCHS4 samples.

[Installation](#installation) | [Download H5 Files](#usage) | [List H5 Contents](#list-data-fields-in-h5) | [Extract Counts](#data-access) | [Extract Meta Data](#meta-data) | [Normalize Samples](#normalizing-data) | [Filter Genes](#filter-genes-with-low-expression) | [Aggregate Duplicate Genes](#aggregate-duplicate-genes) | [FASTQ Alignment](#sequence-alignment) | [Versions](#list-versions)

## ARCHS4 data

ARCHS4 data is regularly updated to include publicly available gene expression samples from RNA-seq. ARCHS4 processes the major platforms for human and mouse. As of 6/2023 ARCHS4 encompasses more than 1.5 million RNA-seq samples. All samples in ARCHS4 are homogeneously processed. ARCHS4 currently does not discern whether samples are bulk or single-cell and purely crawls GEO. Since samples are not always correctly annotated as single-cell, ARCHS4 uses a machine learning approach to predict single-cell samples and associates a `singlecellprobability` with each sample. Samples with a value larger than 0.5 can be removed from the queries if needed.
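
The 0.5 cutoff can be pictured as a simple threshold filter. The sketch below is illustrative only: the sample IDs and probabilities are made up, and `keep_bulk_samples` is a hypothetical helper that is not part of archs4py. In the package itself this filtering is requested with the `remove_sc=True` argument used in the data access examples.

```python
def keep_bulk_samples(sc_probability, cutoff=0.5):
    """Return sample IDs whose predicted single-cell probability is at most `cutoff`."""
    return [gsm for gsm, p in sc_probability.items() if p <= cutoff]

# Made-up probabilities, for illustration only
example = {"GSM_A": 0.03, "GSM_B": 0.91, "GSM_C": 0.50}
bulk = keep_bulk_samples(example)  # GSM_B exceeds the cutoff and is dropped
```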

## Installation

The Python package can be installed directly with the following command (use `pip` or `pip3` depending on system setup):

```
pip3 install archs4py
```

## Usage

### Download data file


The data is stored in large HDF5 files, which first need to be downloaded. HDF5 stores matrix information in a compressed data structure that allows efficient access to slices of the data. There are separate files for `human` and `mouse` data. The supported files are `gene counts` and `transcript counts`. As of 6/2023 the files are larger than 30GB and, depending on the network speed, will take some time to download.

```python
import archs4py as a4

file_path = a4.download.counts("human", path="", version="latest")
```
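
Because the files are larger than 30GB, it can be worth checking free disk space before starting a download. The following is a minimal sketch using only the standard library; `has_free_space` is a hypothetical helper and not part of archs4py, and the 30 GB figure is simply the size mentioned above.

```python
import shutil

def has_free_space(directory=".", required_gb=30.0):
    """Return True if `directory` has at least `required_gb` gigabytes free."""
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= required_gb * 1e9

# Check the current working directory before calling a4.download.counts(...)
enough = has_free_space(".", required_gb=30.0)
```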

## List data fields in H5

The H5 files contain data and meta data information. To list the contents of ARCHS4 H5 files, use the built-in `ls` function.

```python
import archs4py as a4

file = "human_gene_v2.4.h5"
a4.ls(file)
```

## Data access

archs4py supports several ways to load gene expression data. When querying ARCHS4, be aware that loading too many samples at once (e.g. when the meta data search term is very broad) might cause the system to run out of memory. In most cases loading several thousand samples at the same time should be no problem. To find relevant samples there are 5 main functions in the `archs4py.data` module: `archs4py.data.rand()` extracts N random samples, `archs4py.data.index()` extracts samples by index, `archs4py.data.meta()` extracts samples based on a meta data search, `archs4py.data.samples()` extracts samples based on a list of GEO accessions, and `archs4py.data.series()` extracts all samples belonging to a series.
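
To get a rough feeling for the memory concern, the size of a dense genes x samples matrix can be estimated up front. This is a back-of-the-envelope sketch: the gene count, sample count, and 8 bytes per value are assumptions chosen for illustration, and `estimate_matrix_gib` is a hypothetical helper, not part of archs4py.

```python
def estimate_matrix_gib(n_genes, n_samples, bytes_per_value=8):
    """Approximate in-memory size of a dense genes x samples matrix in GiB."""
    return n_genes * n_samples * bytes_per_value / 1024**3

# Hypothetical numbers: 60,000 genes, 5,000 samples, 8-byte values
size_gib = estimate_matrix_gib(60_000, 5_000)
```

With these assumed numbers the estimate is a little over 2 GiB, so a very broad query returning hundreds of thousands of samples would not fit in the memory of a typical workstation.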

<span id="#extract-counts"></span>

#### Extract a random set of samples

To extract a random gene expression matrix use the `archs4py.data.rand()` function. The function will return a pandas dataframe with samples as columns and genes as rows.

```python
import archs4py as a4

#path to file
file = "human_gene_v2.4.h5"

# extract 100 random samples and remove single-cell data
rand_counts = a4.data.rand(file, 100, remove_sc=True)

```

#### Extract samples at specified index positions

Extract samples based on their index positions in the H5 file.

```python
import archs4py as a4

#path to file
file = "human_gene_v2.4.h5"

# get counts for samples at position [0,1,2,3,4]
pos_counts = a4.data.index(file, [0,1,2,3,4])

```

#### Extract samples matching search term in meta data

The ARCHS4 H5 file contains all meta data of samples. A meta data search extracts all samples whose meta data match a search term. There is also an `archs4py.meta` module that will only return meta data. Meta data fields to be returned can be specified with `meta_fields=["geo_accession", "series_id", "characteristics_ch1", "extract_protocol_ch1", "source_name_ch1", "title"]`.

```python
import archs4py as a4

#path to file
file = "human_gene_v2.4.h5"

# search and extract samples matching regex (ignores whitespaces)
meta_counts = a4.data.meta(file, "myoblast", remove_sc=True)

```
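
The idea of a regex search that ignores whitespace can be sketched with the standard `re` module. This is not the package's implementation: `matches_term` is a hypothetical helper, and the case-insensitive matching and the way whitespace is stripped are assumptions made for the illustration.

```python
import re

def matches_term(text, pattern):
    """Regex search after removing all whitespace from both `text` and `pattern`."""
    squeezed_text = re.sub(r"\s+", "", text)
    squeezed_pattern = re.sub(r"\s+", "", pattern)
    # Case-insensitive matching is an assumption of this sketch
    return re.search(squeezed_pattern, squeezed_text, flags=re.IGNORECASE) is not None

# Made-up sample titles
titles = ["Primary myoblast culture", "HeLa cells", "C2C12 Myo blast day 3"]
hits = [t for t in titles if matches_term(t, "myoblast")]
```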

#### Extract samples in a list of GEO accession IDs

Samples can be extracted directly by providing a list of GSM IDs. Samples not contained in ARCHS4 will be ignored.

```python
import archs4py as a4

#path to file
file = "human_gene_v2.4.h5"

#get sample counts
sample_counts = a4.data.samples(file, ["GSM1158284","GSM1482938","GSM1562817"])

```
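
Since missing samples are ignored silently, it can help to check beforehand which requested IDs are actually present. A minimal sketch, assuming the list of available accessions has already been obtained (for example with `a4.meta.field(file, "geo_accession")`); `split_known_ids` is a hypothetical helper and `GSM0000000` is a made-up ID.

```python
def split_known_ids(requested, available):
    """Split `requested` into IDs found in `available` and IDs that are missing, keeping order."""
    available_set = set(available)
    found = [gsm for gsm in requested if gsm in available_set]
    missing = [gsm for gsm in requested if gsm not in available_set]
    return found, missing

available_ids = ["GSM1158284", "GSM1482938"]  # stand-in for the accessions in the H5 file
found, missing = split_known_ids(["GSM1158284", "GSM0000000", "GSM1482938"], available_ids)
```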

#### Extract samples belonging to a GEO series

To extract all samples of a GEO series, for example `GSE64016`, use the `series` function.

```python
import archs4py as a4

#path to file
file = "human_gene_v2.4.h5"

#get sample counts
series_counts = a4.data.series(file, "GSE64016")

```

## Meta data

<span id="#extract-meta"></span>

In addition to the data module, archs4py also supports the extraction of meta data. It supports similar endpoints to the `archs4py.data` module. Meta data fields can be specified with: `meta_fields=["geo_accession", "series_id", "characteristics_ch1", "extract_protocol_ch1", "source_name_ch1", "title"]`

```python
import archs4py as a4

#path to file
file = "human_gene_v2.4.h5"

# get sample meta data based on search term
meta_meta = a4.meta.meta(file, "myoblast", meta_fields=["characteristics_ch1", "source_name_ch1"])

# get sample meta data
sample_meta = a4.meta.samples(file, ["GSM1158284","GSM1482938","GSM1562817"])

# get series meta data
series_meta = a4.meta.series(file, "GSE64016")

# get all entries of a meta data field for all samples. In this example get all sample ids and gene symbols in H5 file
all_samples = a4.meta.field(file, "geo_accession")
all_symbols = a4.meta.field(file, "symbol")
```

## Normalizing data
<span id="#normalize"></span>
The package also supports simple normalization. Currently supported are quantile normalization, log2 + quantile normalization, CPM, and TMM (see the `method` options listed in the example below). In the example below we load 100 random samples and apply log quantile normalization.

```python
import archs4py as a4

file = "human_gene_v2.4.h5"
rand_counts = a4.data.rand(file, 100)

#normalize using log quantile (method options for now = ["log_quantile", "quantile", "cpm", "tmm"])
norm_exp = a4.normalize(rand_counts, method="log_quantile")

```
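
For readers unfamiliar with these methods, the following NumPy sketch shows what CPM and log2 + quantile normalization compute on a small toy matrix (genes as rows, samples as columns). It is an independent illustration, not the code archs4py runs: the `+ 1` pseudocount and the simple tie handling (ties are ranked in arbitrary order rather than averaged) are assumptions of this sketch.

```python
import numpy as np

def cpm(counts):
    """Scale each column (sample) so that it sums to one million."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=0, keepdims=True) * 1e6

def log_quantile(counts):
    """log2(x + 1) transform followed by quantile normalization across columns."""
    logged = np.log2(np.asarray(counts, dtype=float) + 1.0)
    order = np.argsort(logged, axis=0)                 # row order of each column, smallest first
    reference = np.sort(logged, axis=0).mean(axis=1)   # mean of the k-th smallest values
    normalized = np.empty_like(logged)
    # Write the k-th reference value back to the row holding each column's k-th smallest value
    np.put_along_axis(normalized, order, reference[:, None], axis=0)
    return normalized

toy = np.array([[10, 0], [30, 100], [60, 300]])
toy_cpm = cpm(toy)
toy_lq = log_quantile(toy)
```

After quantile normalization every sample has the same distribution of values; only the assignment of values to genes differs between samples.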

## Filter genes with low expression

<span id="#filter-genes"></span>

To filter genes with low expression, use the `utils.filter_genes()` function. It uses two parameters to determine which genes to filter: `readThreshold` and `sampleThreshold`. In the example below, genes that do not have at least 50 reads in 2% of samples are removed. `aggregate` will also deal with duplicate gene symbols in the ARCHS4 data and aggregate the counts.

```python
import archs4py as a4

file = "human_gene_v2.4.h5"
rand_counts = a4.data.rand(file, 100)

# filter genes with low expression and aggregate duplicate gene symbols
filtered_exp = a4.utils.filter_genes(rand_counts, readThreshold=50, sampleThreshold=0.02, deterministic=True, aggregate=True)
```
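
The filtering rule itself can be written in a few lines of pandas. The sketch below is an assumption about how the two thresholds combine (a gene is kept if at least `sample_threshold` of the samples have `read_threshold` or more reads); it does not model the `deterministic` or `aggregate` options, and `filter_low_expression` is a hypothetical helper, not the package function.

```python
import pandas as pd

def filter_low_expression(counts, read_threshold=50, sample_threshold=0.02):
    """Keep genes with >= read_threshold reads in at least sample_threshold of the samples."""
    fraction_passing = (counts >= read_threshold).mean(axis=1)
    return counts.loc[fraction_passing >= sample_threshold]

# Toy matrix: genes as rows, samples as columns
toy = pd.DataFrame(
    {"S1": [100, 3, 0], "S2": [80, 60, 1], "S3": [5, 2, 0], "S4": [0, 1, 2]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)
# With a 50% sample threshold only GENE_A (2 of 4 samples pass) is kept
kept = filter_low_expression(toy, read_threshold=50, sample_threshold=0.5)
```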

## Aggregate duplicate genes

<span id="#aggregate-genes"></span>

Some gene symbols are duplicated, which is an artifact from the Ensembl gene annotation. The transcript sequences are often identical and reads are split between the different entries. The `utils.aggregate_duplicate_genes()` function will sum all counts of duplicate gene symbols and eliminate duplicate entries.

```python
import archs4py as a4

file = "human_gene_v2.4.h5"
rand_counts = a4.data.rand(file, 100)

# aggregate duplicate genes
agg_exp = a4.utils.aggregate_duplicate_genes(rand_counts)

```
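
Summing duplicate symbols corresponds to a group-by on the gene index. A minimal pandas sketch on made-up data, assuming gene symbols are the DataFrame index; `sum_duplicate_genes` is a hypothetical stand-in and not the package's own implementation.

```python
import pandas as pd

def sum_duplicate_genes(counts):
    """Sum rows that share the same gene symbol (the DataFrame index)."""
    return counts.groupby(level=0, sort=False).sum()

# GENE_A appears twice; its rows are summed into one entry
toy = pd.DataFrame({"S1": [5, 7, 1], "S2": [0, 2, 4]}, index=["GENE_A", "GENE_A", "GENE_B"])
agg = sum_duplicate_genes(toy)
```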

## Sequence alignment

<span id="#align"></span>

The `align` module contains a replication of the ARCHS4 alignment pipeline. When used on FASTQ files, the resulting gene or transcript counts are compatible with the previously aligned samples in ARCHS4. The package is highly automated and only requires a path to a FASTQ file or a folder containing multiple FASTQ files. All file dependencies will be downloaded automatically and the index will be built when needed.

### Align FASTQ file

Pass either a single FASTQ file or a pair of paired-end FASTQ files. This function can return transcript counts, gene counts, or transcript-level TPM data.

```python

import archs4py as a4

a4.align.load(["SRR14457464"], "data/example_1")

result = a4.align.fastq("human", "data/example_1/SRR14457464.fastq", return_type="gene", identifier="symbol")

```

The next example is an SRR sample that extracts into a pair of paired-end FASTQ files. They can be passed to ARCHS4py like this:

```python
import archs4py as a4

# the sample is paired-end and will result in two files (SRR15972519_1.fastq, SRR15972519_2.fastq)
a4.align.load(["SRR15972519"], "data/example_2")

result = a4.align.fastq("mouse", ["data/example_2/SRR15972519_1.fastq", "data/example_2/SRR15972519_2.fastq"], return_type="transcript")

```

### Align FASTQ files from folder

Align all FASTQ files in a folder using the function `a4.align.folder()`. ARCHS4py will automatically match samples if the data is paired-end.

```python

import archs4py as a4

a4.align.load(["SRR15972519", "SRR15972520", "SRR15972521"], "data/example_3")

result = a4.align.folder("mouse", "data/example_3", return_type="gene", identifier="symbol")

```
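
Matching paired-end files in a folder comes down to grouping file names by sample. The sketch below illustrates the idea using the `_1`/`_2` suffix convention seen in the paired-end example; `group_fastq_files` is a hypothetical helper, and how archs4py actually detects pairs is not specified here.

```python
import re
from collections import defaultdict

def group_fastq_files(filenames):
    """Group FASTQ file names by sample, treating `_1`/`_2` suffixes as mates."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        match = re.fullmatch(r"(.+?)(?:_([12]))?\.fastq", name)
        if match:  # non-FASTQ files are skipped
            groups[match.group(1)].append(name)
    return dict(groups)

files = ["SRR15972519_2.fastq", "SRR15972519_1.fastq", "SRR14457464.fastq", "notes.txt"]
grouped = group_fastq_files(files)
```

Samples that map to two files are paired-end, samples with a single file are single-end.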

## List versions
<span id="#version"></span>
ARCHS4 has different versions to download from. The default setting is recommended; it downloads the latest data release.

```python
import archs4py as a4

print(a4.versions())

```

# Citation

When using ARCHS4 please cite the following reference:

Lachmann, Alexander, Denis Torre, Alexandra B. Keenan, Kathleen M. Jagodnik, Hoyjin J. Lee, Lily Wang, Moshe C. Silverstein, and Avi Ma’ayan. "Massive mining of publicly available RNA-seq data from human and mouse." Nature communications 9, no. 1 (2018): 1366.
https://www.nature.com/articles/s41467-018-03751-6

Raw data

{
    "_id": null,
    "home_page": "https://github.com/maayanlab/archs4py",
    "name": "archs4py",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": null,
    "author": "Alexander Lachmann",
    "author_email": "alexander.lachmann@mssm.edu",
    "download_url": "https://files.pythonhosted.org/packages/8c/24/bbbd4d0089ee0d2bab3b7534ecaa8b706892fad7393b9bfcd87181fe7018/archs4py-0.2.19.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "ARCHS4 Python package supporting data loading and data queries.",
    "version": "0.2.19",
    "project_urls": {
        "Homepage": "https://github.com/maayanlab/archs4py"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8c24bbbd4d0089ee0d2bab3b7534ecaa8b706892fad7393b9bfcd87181fe7018",
                "md5": "de2de5c99d787b78ccd528fa7c8851b6",
                "sha256": "a6d31725e40aa148d321d9ac8a9a4a61e761a3bbcea4e7eaf84cf2ca22594ce7"
            },
            "downloads": -1,
            "filename": "archs4py-0.2.19.tar.gz",
            "has_sig": false,
            "md5_digest": "de2de5c99d787b78ccd528fa7c8851b6",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 20076,
            "upload_time": "2024-08-26T15:20:38",
            "upload_time_iso_8601": "2024-08-26T15:20:38.881055Z",
            "url": "https://files.pythonhosted.org/packages/8c/24/bbbd4d0089ee0d2bab3b7534ecaa8b706892fad7393b9bfcd87181fe7018/archs4py-0.2.19.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-26 15:20:38",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "maayanlab",
    "github_project": "archs4py",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "h5py",
            "specs": []
        },
        {
            "name": "numpy",
            "specs": []
        },
        {
            "name": "pandas",
            "specs": []
        },
        {
            "name": "qnorm",
            "specs": []
        },
        {
            "name": "setuptools",
            "specs": []
        },
        {
            "name": "tqdm",
            "specs": []
        },
        {
            "name": "wget",
            "specs": []
        },
        {
            "name": "s3fs",
            "specs": []
        },
        {
            "name": "xalign",
            "specs": []
        }
    ],
    "lcname": "archs4py"
}
