# xAlign: Hassle-free transcript quantification
xAlign is an efficient Python package for aligning FASTQ files against any Ensembl reference genome. The currently supported alignment algorithms are `kallisto` (https://pachterlab.github.io/kallisto/) and `Salmon` (https://salmon.readthedocs.io/en/latest/salmon.html). The package also contains modules for mapping Ensembl IDs to gene symbols via the `mygene.info` Python package and for downloading SRA files. When using this package, please cite the corresponding alignment algorithm.
## Installation
```
pip3 install git+https://github.com/MaayanLab/xalign.git
```
## Requirements
The alignment algorithms require at least about 5GB of memory to run. When downloading SRA files, make sure that sufficient disk space is available. `xalign` currently works only on `Linux` operating systems.
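The requirements above can be checked before starting a long download. The following is a minimal sketch using only the Python standard library; it is not part of `xalign`. The function name `check_requirements` and the 20GB disk threshold are assumptions chosen for illustration, and the memory value is the total physical memory of the machine, not the memory that is currently free.

```python
import os
import platform
import shutil


def check_requirements(download_dir=".", min_free_gb=20.0, min_mem_gb=5.0):
    """Return simple pre-flight checks (hypothetical helper, illustrative thresholds)."""
    # Free disk space on the file system that holds download_dir.
    free_gb = shutil.disk_usage(download_dir).free / 1e9
    # Total physical memory; these os.sysconf names are available on Linux.
    try:
        mem_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    except (AttributeError, ValueError, OSError):
        mem_gb = None  # could not be determined on this platform
    return {
        "is_linux": platform.system() == "Linux",
        "free_disk_gb": free_gb,
        "enough_disk": free_gb >= min_free_gb,
        "memory_gb": mem_gb,
        "enough_memory": mem_gb is not None and mem_gb >= min_mem_gb,
    }


report = check_requirements(".")
print(report)
```

The disk threshold should be adjusted to the number and size of the samples that will be downloaded.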
## Usage
The recommended usage is `xalign.align_folder()` if there are multiple FASTQ files. Alternatively, FASTQ files can be aligned one by one with `xalign.align_fastq()`, and gene-level counts can then be aggregated using the function `xalign.ensembl.agg_gene_counts()`.
### Align a single FASTQ file in single-read mode
To align a single RNA-seq file, we first download an example SRA file and save it in the folder `data/example_1` relative to the working directory. The function `xalign.align_fastq()` builds the required cDNA index from the Ensembl reference genome if the index does not already exist. `result` is a dataframe with transcript IDs, counts, and TPM.
The first alignment against a new species takes a few minutes longer, because a new index has to be built and the gene mapping files have to be created.
```python
import xalign
xalign.sra.load_sras(["SRR14457464"], "data/example_1")
result = xalign.align_fastq("homo_sapiens", "data/example_1/SRR14457464.fastq", t=8)
```
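For readers unfamiliar with the TPM (transcripts per million) column in `result`, the sketch below shows the standard definition: each count is divided by the transcript length, and the resulting rates are rescaled to sum to one million. This is not `xalign` code. The function name `tpm` and the transcript names and lengths are made up for illustration, and `kallisto` and `Salmon` use effective transcript lengths rather than the plain lengths assumed here.

```python
def tpm(counts, lengths):
    """Convert per-transcript counts to TPM (hypothetical helper).

    counts and lengths are dicts keyed by transcript ID; lengths are in bases.
    """
    # Length-normalised rate per transcript.
    rates = {t: counts[t] / lengths[t] for t in counts}
    total = sum(rates.values())
    if total == 0:
        return {t: 0.0 for t in counts}
    # Rescale so that all values sum to one million.
    return {t: rate / total * 1e6 for t, rate in rates.items()}


# Made-up example values, not real data.
values = tpm({"tx_a": 100, "tx_b": 300}, {"tx_a": 1000, "tx_b": 1500})
print(values)
```

In this example `tx_b` has three times the counts of `tx_a` but is 1.5 times longer, so its TPM is twice as high.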
### Align a single FASTQ file in paired-end mode
To align a single RNA-seq sample in paired-end mode, we first download an example SRA file and save it in the folder `data/example_2` relative to the working directory. If the SRA file is a paired-end sample, two files are generated with the suffixes `_1` and `_2`. The function `xalign.align_fastq()` builds the required cDNA index from the Ensembl reference genome if the index does not already exist. `result` is a dataframe with transcript IDs, counts, and TPM.
The first alignment against a new species takes a couple of minutes longer, because the index has to be built and the gene mapping files have to be created.
```python
import xalign
# the sample is paired-end and will result in two files (SRR15972519_1.fastq, SRR15972519_2.fastq)
xalign.sra.load_sras(["SRR15972519"], "data/example_2")
result = xalign.align_fastq("homo_sapiens", ["data/example_2/SRR15972519_1.fastq", "data/example_2/SRR15972519_2.fastq"], t=8)
```
### Align FASTQ files in a directory
`xalign` can automatically align all files in a given folder, which avoids calling `xalign.align_fastq()` multiple times. In this case `xalign.align_folder()` detects whether the folder contains paired-end or single-end samples and groups the files accordingly without manual input. The output is two dataframes. `gene_count` contains gene-level counts, which can be aggregated for different gene identifiers (`symbol` by default, `ensembl_id`, or `entrezgene_id`). Transcripts that cannot be mapped to a corresponding identifier are discarded. `transcript_count` contains the read counts at the transcript level.
```python
import xalign
# this will download multiple GB of samples
xalign.sra.load_sras(["SRR15972519", "SRR15972520", "SRR15972521"], "data/example_3")
gene_count, transcript_count = xalign.align_folder("homo_sapiens", "data/example_3", t=8, overwrite=False)
```
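To make the grouping of paired-end and single-end files more concrete, here is a minimal sketch of how files could be grouped by name. It is not the implementation used by `xalign.align_folder()`. The function name `group_fastq_files` is hypothetical, and the sketch assumes the naming shown in the examples above: `<sample>.fastq` for single-end samples and `<sample>_1.fastq` / `<sample>_2.fastq` for paired-end samples.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming convention: a trailing _1 or _2 marks the two mates of a pair.
_PAIRED = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])$")


def group_fastq_files(filenames):
    """Group FASTQ file names by sample (hypothetical helper)."""
    samples = defaultdict(dict)
    for name in sorted(filenames):
        stem = Path(name).name
        # Strip an optional .gz suffix first, then the FASTQ extension.
        for ext in (".gz", ".fastq", ".fq"):
            if stem.endswith(ext):
                stem = stem[: -len(ext)]
        match = _PAIRED.match(stem)
        if match:
            samples[match["sample"]][match["mate"]] = name
        else:
            samples[stem]["single"] = name

    grouped = {}
    for sample, files in samples.items():
        if "1" in files and "2" in files:
            # Both mates present: treat as one paired-end sample.
            grouped[sample] = [files["1"], files["2"]]
        else:
            # Otherwise fall back to a single-end sample.
            grouped[sample] = [files.get("single") or files.get("1") or files.get("2")]
    return grouped


groups = group_fastq_files(
    ["SRR15972519_1.fastq", "SRR15972519_2.fastq", "SRR14457464.fastq"]
)
print(groups)
```

Each value in the returned dictionary is a list with one file for a single-end sample or two files for a paired-end sample.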
### Mapping transcript counts to gene-level counts
When FASTQ files are aligned individually using `xalign.align_fastq()`, the output is at the transcript level. To aggregate the counts to the gene level, use the function `xalign.ensembl.agg_gene_counts()`.
```python
import xalign
xalign.sra.load_sras(["SRR14457464"], "data/example_4")
# align the file that was just downloaded
result = xalign.align_fastq("homo_sapiens", "data/example_4/SRR14457464.fastq", t=8)
# identifier can be symbol/ensembl_id/entrezgene_id
gene_counts = xalign.ensembl.agg_gene_counts(result, "homo_sapiens", identifier="symbol")
```
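Conceptually, the aggregation sums the counts of all transcripts that belong to the same gene and drops transcripts without a mapping. The sketch below illustrates this idea with plain dictionaries. It is not the code of `xalign.ensembl.agg_gene_counts()`, the function name `aggregate_to_genes` is hypothetical, and the transcript and gene names are placeholders rather than real identifiers.

```python
from collections import defaultdict


def aggregate_to_genes(transcript_counts, transcript_to_gene):
    """Sum transcript-level counts per gene (hypothetical helper)."""
    gene_counts = defaultdict(float)
    for transcript_id, count in transcript_counts.items():
        gene = transcript_to_gene.get(transcript_id)
        if gene is None:
            continue  # transcripts without a mapping are discarded
        gene_counts[gene] += count
    return dict(gene_counts)


# Placeholder identifiers and counts, not real data.
transcript_counts = {"tx_1": 10, "tx_2": 5, "tx_3": 7, "tx_4": 3}
transcript_to_gene = {"tx_1": "GENE1", "tx_2": "GENE1", "tx_3": "GENE2"}

gene_level = aggregate_to_genes(transcript_counts, transcript_to_gene)
print(gene_level)
```

Here `tx_4` has no entry in the mapping, so its counts do not appear in the gene-level result.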