xalign


Name: xalign
Version: 0.1.74
Home page: https://github.com/maayanlab/xalign
Summary: Alignment in a python wrapper.
Upload time: 2023-06-09 14:31:16
Author: Alexander Lachmann
Requires Python: >=3.6
Requirements: matplotlib, mygene, numpy, pandas, requests, setuptools, tqdm

# xAlign: Hassle-free transcript quantification

xAlign is an efficient Python package for aligning FASTQ files against any Ensembl reference genome. The currently supported alignment algorithms are `kallisto` (https://pachterlab.github.io/kallisto/) and `Salmon` (https://salmon.readthedocs.io/en/latest/salmon.html). The package also contains modules for mapping Ensembl IDs to gene symbols via the `mygene.info` Python package and for downloading SRA files. When using this package, please cite the corresponding alignment algorithm.

## Installation

```
pip3 install git+https://github.com/MaayanLab/xalign.git
```

## Requirements

The alignment algorithms require a minimum of around 5 GB of memory to run. When downloading SRA files, make sure that sufficient disk space is available. `xalign` currently works only on `Linux` operating systems.
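
The requirements above can be checked before starting a long download or alignment. The following is a minimal sketch using only the Python standard library; it is not part of `xalign`. The function names `preflight` and `total_memory_gb` are hypothetical, and the 20 GB free-space default is an arbitrary assumption that should be adjusted to the size of the samples being downloaded.

```python
import os
import platform
import shutil


def total_memory_gb():
    """Return total physical memory in GB, or None if it cannot be determined."""
    try:
        return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 1e9
    except (AttributeError, ValueError, OSError):
        return None


def preflight(download_dir=".", min_free_gb=20.0, min_memory_gb=5.0):
    """Return a list of warnings; an empty list means no problem was detected."""
    warnings = []
    # The README states that only Linux is supported.
    if platform.system() != "Linux":
        warnings.append(f"unsupported platform: {platform.system()}")
    # The README states that roughly 5 GB of memory is needed.
    memory_gb = total_memory_gb()
    if memory_gb is not None and memory_gb < min_memory_gb:
        warnings.append(f"only {memory_gb:.1f} GB of memory available")
    # Free space on the filesystem that holds download_dir (threshold is an assumption).
    free_gb = shutil.disk_usage(download_dir).free / 1e9
    if free_gb < min_free_gb:
        warnings.append(f"only {free_gb:.1f} GB free in {download_dir!r}")
    return warnings


if __name__ == "__main__":
    for message in preflight("."):
        print("warning:", message)
```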

## Usage

If there are multiple FASTQ files, the recommended entry point is `xalign.align_folder()`. Alternatively, FASTQ files can be aligned one by one with `xalign.align_fastq()`, and gene-level counts can then be aggregated using `xalign.ensembl.agg_gene_counts()`.

### Align a single FASTQ file in single-read mode

To align a single RNA-seq file, we first download an example SRA file and save it in the folder `data/example_1` relative to the working directory. The function `xalign.align_fastq()` generates the required cDNA index from the Ensembl reference genome if the index has not been built yet. `result` is a dataframe with transcript IDs, transcript-level counts, and TPM values.

The first alignment against a new species takes a few minutes longer, because the index has to be built and the gene mapping files have to be created.

```python

import xalign

xalign.sra.load_sras(["SRR14457464"], "data/example_1")

result = xalign.align_fastq("homo_sapiens", "data/example_1/SRR14457464.fastq", t=8)

```

### Align a single FASTQ file in paired-end mode

To align a single RNA-seq sample in paired-end mode, we first download an example SRA file and save it in the folder `data/example_2` relative to the working directory. If the SRA file is a paired-end sample, two FASTQ files are generated with the suffixes `_1` and `_2`. The function `xalign.align_fastq()` generates the required cDNA index from the Ensembl reference genome if the index has not been built yet. `result` is a dataframe with transcript IDs, transcript-level counts, and TPM values.

The first alignment against a new species takes a couple of minutes longer, because the index has to be built and the gene mapping files have to be created.

```python

import xalign

# the sample is paired-end and will result in two files (SRR15972519_1.fastq, SRR15972519_2.fastq)
xalign.sra.load_sras(["SRR15972519"], "data/example_2")

result = xalign.align_fastq("homo_sapiens", ["data/example_2/SRR15972519_1.fastq", "data/example_2/SRR15972519_2.fastq"], t=8)

```
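
Paired-end alignment expects both mate files to hold the same number of reads, so a quick consistency check before calling `xalign.align_fastq()` can catch an interrupted download. The sketch below is not part of `xalign`; the helper names `count_fastq_records` and `check_mates` are hypothetical, and it assumes uncompressed FASTQ files with exactly four lines per record.

```python
def count_fastq_records(path):
    """Count records in an uncompressed FASTQ file, assuming four lines per record."""
    with open(path, "rb") as handle:
        line_count = sum(1 for _ in handle)
    if line_count % 4 != 0:
        raise ValueError(f"{path}: {line_count} lines is not a multiple of 4")
    return line_count // 4


def check_mates(path_1, path_2):
    """Return the shared record count, or raise ValueError if the mate files differ."""
    n_1 = count_fastq_records(path_1)
    n_2 = count_fastq_records(path_2)
    if n_1 != n_2:
        raise ValueError(f"mate files differ: {n_1} vs {n_2} records")
    return n_1
```

For the example above, this would be called as `check_mates("data/example_2/SRR15972519_1.fastq", "data/example_2/SRR15972519_2.fastq")` after the download has finished. Counting lines reads each file once, which can take a while for multi-gigabyte samples.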

### Align FASTQ files in a directory

`xalign` can align all files in a given folder automatically, so `xalign.align_fastq()` does not have to be called once per file. `xalign.align_folder()` detects whether the folder contains paired-end or single-read samples and groups the files accordingly without manual input. It returns two dataframes. `gene_count` contains gene-level counts, which can be aggregated by different gene identifiers (`symbol`, the default; `ensembl_id`; or `entrezgene_id`). Transcripts that cannot be mapped to the chosen identifier are discarded. `transcript_count` contains the read counts at the transcript level.

```python

import xalign

# this will download multiple GB of samples
xalign.sra.load_sras(["SRR15972519", "SRR15972520", "SRR15972521"], "data/example_3")

gene_count, transcript_count = xalign.align_folder("homo_sapiens", "data/example_3", t=8, overwrite=False)

```
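
To make the grouping step concrete, the following sketch shows one way files in a folder could be grouped into single-read and paired-end samples by file name. It is an illustration only and not the detection logic used inside `xalign.align_folder()`. The function name `group_fastq_files` is hypothetical, and the naming scheme (`<sample>.fastq` for single-read, `<sample>_1.fastq` and `<sample>_2.fastq` for paired-end) is assumed from the SRA examples above.

```python
import os
import re
from collections import defaultdict

# Assumed naming scheme: <sample>.fastq, or <sample>_1.fastq / <sample>_2.fastq.
FASTQ_NAME = re.compile(r"^(?P<sample>.+?)(?:_(?P<mate>[12]))?\.fastq$")


def group_fastq_files(folder):
    """Map each sample name to its FASTQ paths (one path: single-read, two: paired-end)."""
    groups = defaultdict(list)
    # Sorting keeps the _1 file ahead of the _2 file within a sample.
    for name in sorted(os.listdir(folder)):
        match = FASTQ_NAME.match(name)
        if match:
            groups[match.group("sample")].append(os.path.join(folder, name))
    return dict(groups)
```

A single-read sample whose name happens to end in `_1` or `_2` would be misread as one mate of a pair under this scheme, which is one reason to treat the sketch as a starting point rather than a replacement for the built-in detection.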

### Mapping transcript counts to gene-level counts

When FASTQ files are aligned individually using `xalign.align_fastq()`, the output is at the transcript level. To aggregate the counts to the gene level, use the function `xalign.ensembl.agg_gene_counts()`.

```python

import xalign

xalign.sra.load_sras(["SRR14457464"], "data/example_4")

# align the file that was downloaded above
result = xalign.align_fastq("homo_sapiens", "data/example_4/SRR14457464.fastq", t=8)

# identifier can be symbol/ensembl_id/entrezgene_id
gene_counts = xalign.ensembl.agg_gene_counts(result, "homo_sapiens", identifier="symbol")

```
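
Conceptually, gene-level aggregation combines the counts of all transcripts that map to the same gene and drops transcripts without a mapping. The sketch below illustrates that idea in plain Python. It is not the implementation of `xalign.ensembl.agg_gene_counts()`; the function name `aggregate_to_genes`, the transcript and gene names, and the choice of summing counts per gene are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_to_genes(transcript_counts, transcript_to_gene):
    """Sum transcript-level counts per gene; transcripts without a mapping are dropped."""
    gene_counts = defaultdict(float)
    for transcript_id, count in transcript_counts.items():
        gene = transcript_to_gene.get(transcript_id)
        if gene is None:
            continue  # no identifier available for this transcript
        gene_counts[gene] += count
    return dict(gene_counts)


if __name__ == "__main__":
    # Made-up identifiers and counts, for illustration only.
    counts = {"tx_a1": 10.0, "tx_a2": 5.0, "tx_b1": 7.0, "tx_unmapped": 3.0}
    mapping = {"tx_a1": "GENE_A", "tx_a2": "GENE_A", "tx_b1": "GENE_B"}
    print(aggregate_to_genes(counts, mapping))
```

With a pandas dataframe, the same aggregation is a `groupby(...).sum()` over a column holding the mapped gene identifier.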

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/maayanlab/xalign",
    "name": "xalign",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "",
    "author": "Alexander Lachmann",
    "author_email": "alexander.lachmann@mssm.edu",
    "download_url": "https://files.pythonhosted.org/packages/f6/79/822e98e6ef0670a245f39338196c73b416b0fc5f63460fa7fba7d0b07b35/xalign-0.1.74.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Alignment in a python wrapper.",
    "version": "0.1.74",
    "project_urls": {
        "Homepage": "https://github.com/maayanlab/xalign"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f679822e98e6ef0670a245f39338196c73b416b0fc5f63460fa7fba7d0b07b35",
                "md5": "32b55b83d94fc651d0504aa522ab77f2",
                "sha256": "2f378c582af972cfac702ef9ee578403ecabf572649f6c5027717ca0b7a12b14"
            },
            "downloads": -1,
            "filename": "xalign-0.1.74.tar.gz",
            "has_sig": false,
            "md5_digest": "32b55b83d94fc651d0504aa522ab77f2",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 6056225,
            "upload_time": "2023-06-09T14:31:16",
            "upload_time_iso_8601": "2023-06-09T14:31:16.097394Z",
            "url": "https://files.pythonhosted.org/packages/f6/79/822e98e6ef0670a245f39338196c73b416b0fc5f63460fa7fba7d0b07b35/xalign-0.1.74.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-06-09 14:31:16",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "maayanlab",
    "github_project": "xalign",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "matplotlib",
            "specs": [
                [
                    "==",
                    "3.5.2"
                ]
            ]
        },
        {
            "name": "mygene",
            "specs": [
                [
                    "==",
                    "3.2.2"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "==",
                    "1.22.4"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    "==",
                    "1.4.3"
                ]
            ]
        },
        {
            "name": "requests",
            "specs": [
                [
                    "==",
                    "2.23.0"
                ]
            ]
        },
        {
            "name": "setuptools",
            "specs": [
                [
                    "==",
                    "49.2.1"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    "==",
                    "4.64.0"
                ]
            ]
        }
    ],
    "lcname": "xalign"
}
        