<p align="center">
<img width="500" alt="demuxalot_logo_small" src="https://user-images.githubusercontent.com/6318811/118947887-a261da00-b90c-11eb-8932-a66e6d2caa1f.png">
</p>
[![Run tests and deploy](https://github.com/herophilus/demuxalot/actions/workflows/run_test.yml/badge.svg)](https://github.com/herophilus/demuxalot/actions/workflows/run_test.yml)
<img src="./.github/python_badge.svg">
# Demuxalot
Reliable and efficient identification of genotypes for individual cells
in RNA sequencing that refines the knowledge about genotypes from the data.
Demuxalot is fast and optimized to work with lots of genotypes.
A preprint [is available on bioRxiv](https://www.biorxiv.org/content/10.1101/2021.05.22.443646v2).
## Background
During single-cell RNA-sequencing (scRnaSeq) we pool cells from different donors and process them together.
- Pro: all cells come through the same pipeline, so preparation/biological variation effects are cancelled out from analysis automatically.
Also experiments are much cheaper!
- Con: we don't know cell origin, everything is mixed!
Demuxalot solves the con:
it guesses the genotype of each cell by matching the reads coming from that cell against the known genotypes.
This is called *demultiplexing*.
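
To make the matching idea concrete, here is a minimal sketch in plain Python. It is not demuxalot's actual model: the donor names, the per-position allele probabilities and the fixed `ERROR_RATE` are made-up illustration values, and reads are treated as independent.

```python
import math

# Hypothetical alternative-allele probabilities per donor at three SNP positions
# (0.0 = homozygous reference, 0.5 = heterozygous, 1.0 = homozygous alternative).
DONOR_ALT_PROB = {
    "Donor1": {"chr1:100": 0.0, "chr1:200": 1.0, "chr2:50": 0.5},
    "Donor2": {"chr1:100": 1.0, "chr1:200": 0.0, "chr2:50": 0.5},
    "Donor3": {"chr1:100": 0.5, "chr1:200": 0.5, "chr2:50": 0.0},
}
ERROR_RATE = 0.01  # assumed chance that a base is read incorrectly


def log_likelihood(reads, alt_prob):
    """Sum log-probabilities of the observed alleles given one donor's genotype."""
    total = 0.0
    for position, is_alt in reads:
        g = alt_prob[position]
        p_alt = g * (1 - ERROR_RATE) + (1 - g) * ERROR_RATE
        total += math.log(p_alt if is_alt else 1 - p_alt)
    return total


def posterior(reads):
    """Posterior over donors under a uniform prior (softmax of log-likelihoods)."""
    logs = {d: log_likelihood(reads, probs) for d, probs in DONOR_ALT_PROB.items()}
    top = max(logs.values())  # subtract the maximum for numerical stability
    weights = {d: math.exp(v - top) for d, v in logs.items()}
    norm = sum(weights.values())
    return {d: w / norm for d, w in weights.items()}


# Reads from one cell barcode: (position, True if the alternative allele was seen)
cell_reads = [
    ("chr1:100", False),
    ("chr1:200", True),
    ("chr2:50", True),
    ("chr1:200", True),
]
cell_posterior = posterior(cell_reads)
best_donor = max(cell_posterior, key=cell_posterior.get)
print(best_donor, cell_posterior)
```

With more donors and sparse coverage per cell, the same comparison is simply run over many more positions, which is why the number of informative SNPs matters.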
Herophilus uses scRnaSeq to study cells in organoids with multiple genetic backgrounds at scale.
## Comparisons
Demuxalot shows high reliability, data efficiency and speed.
Below is a benchmark on PBMC data with 32 donors from the [preprint](https://www.biorxiv.org/content/10.1101/2021.05.22.443646v2).
<img width="1434" alt="Screen Shot 2021-06-03 at 6 03 12 PM" src="https://user-images.githubusercontent.com/6318811/120730293-07cdd300-c496-11eb-8a9c-62c8b8cf9847.png">
## Known genotypes and refined genotypes: the tale of two scenarios
Typical approaches to obtaining genotype-specific mutations are:

- whole-genome sequencing (expensive, very good)
    - you have information about almost all (ok, >90%) of the genotype, and it is unlikely that you need to refine it
    - so you just go straight to demultiplexing
    - demuxlet solves this case
- bead arrays (aka SNP arrays aka DNA microarrays), which are super cheap and practically more relevant
    - you get information about the 50k to 650k most common SNPs, which is only a small fraction, but you also pay very little
    - this case is covered by `demuxalot` (this package)
    - [Illumina's video](https://www.youtube.com/watch?v=lVG04dAAyvY) about this technology
## Why is it worth refining genotypes?
An SNP array provides up to ~650k (as of 2021) positions in the genome.
Around 20-30% of them would be specific to a genotype (i.e. deviate from the majority).
- Each genotype has around 10 times more SNVs (single nucleotide variations)
that are not captured by the array. Some of these missing SNPs are very valuable for demultiplexing.
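
As a rough back-of-the-envelope check of the figures above, the snippet below computes how many array positions would be genotype-specific at the quoted upper bound. The constant names are ours, and the numbers are just the approximate values stated in this section.

```python
ARRAY_POSITIONS = 650_000       # upper bound on array positions quoted above
INFORMATIVE_PERCENT = (20, 30)  # share of positions specific to a genotype

# Integer arithmetic avoids floating-point rounding in the printed counts
low, high = (ARRAY_POSITIONS * pct // 100 for pct in INFORMATIVE_PERCENT)
print(f"genotype-specific array positions: {low:,} to {high:,}")
```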
## What's the special power of demuxalot?
- much better handling of multiple reads coming from the same UMI (i.e. the same transcript)
    - `demuxalot` efficiently combines information from multiple reads with the same UMI and cross-checks it
- default settings are CellRanger-specific (that is, optimized for the 10X pipeline). CellRanger's and STAR's flags in BAM break some common conventions,
  but we can still use them efficiently (by using filtering callbacks)
- ability to refine genotypes without failing or diverging
    - Vireo is a tool that was created for similar purposes, but it either diverges or does not learn better genotypes
- optimized variant calling. It's also faster than `demuxlet` due to multiprocessing
- this is not a command-line tool, and is not meant to be
    - you write Python code, which gives full control and flexibility over demultiplexing
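
To illustrate why reads sharing a UMI should not be counted as independent evidence, here is a minimal sketch that collapses reads to one consensus base per molecule. The strict majority vote and the toy reads are assumptions made for illustration; this is not demuxalot's implementation.

```python
from collections import Counter, defaultdict

# Each read: (cell barcode, UMI, SNP position, observed base). Values are made up.
reads = [
    ("AAAC", "UMI1", "chr1:100", "A"),
    ("AAAC", "UMI1", "chr1:100", "A"),
    ("AAAC", "UMI1", "chr1:100", "G"),  # disagrees with the other reads of this UMI
    ("AAAC", "UMI2", "chr1:100", "G"),
    ("TTTG", "UMI7", "chr1:100", "A"),
    ("TTTG", "UMI7", "chr1:100", "G"),  # tie: no consensus for this UMI
]


def collapse_umis(reads):
    """Return one consensus base per (barcode, UMI, position); drop ties."""
    groups = defaultdict(Counter)
    for barcode, umi, position, base in reads:
        groups[(barcode, umi, position)][base] += 1

    consensus = {}
    for key, counts in groups.items():
        ranked = counts.most_common(2)
        # Cross-check: keep the molecule only if one base strictly dominates
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            consensus[key] = ranked[0][0]
    return consensus


calls = collapse_umis(reads)
for key, base in sorted(calls.items()):
    print(key, base)
```

Three reads of `UMI1` contribute a single observation here, so a heavily amplified transcript cannot outvote the rest of the cell's molecules.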
## Installation
The package is pip-installable and requires Python >= 3.8.
```bash
pip install demuxalot
```
Developer installation:
```bash
git clone https://github.com/herophilus/demuxalot
cd demuxalot && pip install -e .
```
Here are some common scenarios and how they are implemented in demuxalot.
Also see the `examples/` folder.
## Running (simple scenario)
Only using provided genotypes
```python
from demuxalot import Demultiplexer, BarcodeHandler, ProbabilisticGenotypes, count_snps

# Loading genotypes
genotypes = ProbabilisticGenotypes(genotype_names=['Donor1', 'Donor2', 'Donor3'])
genotypes.add_vcf('path/to/genotypes.vcf')

# Loading barcodes
barcode_handler = BarcodeHandler.from_file('path/to/barcodes.csv')

# Counting reads that cover the genotyped positions
snps = count_snps(
    bamfile_location='path/to/sorted_alignments.bam',
    chromosome2positions=genotypes.get_chromosome2positions(),
    barcode_handler=barcode_handler,
)

# Returns two dataframes: likelihoods and posterior probabilities
likelihoods, posterior_probabilities = Demultiplexer.predict_posteriors(
    snps,
    genotypes=genotypes,
    barcode_handler=barcode_handler,
)
```
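
A common next step is to turn the posteriors into one call per barcode. The sketch below uses a toy dataframe in place of `posterior_probabilities`; the layout (one row per barcode, one column per genotype), the barcodes and the `MIN_CONFIDENCE` cut-off are assumptions for illustration, so check them against your own output.

```python
import pandas as pd

# Toy stand-in for `posterior_probabilities`; assumed layout: one row per
# barcode, one column per genotype, each row summing to 1.
posterior_probabilities = pd.DataFrame(
    {
        "Donor1": [0.98, 0.10, 0.40],
        "Donor2": [0.01, 0.85, 0.35],
        "Donor3": [0.01, 0.05, 0.25],
    },
    index=["AAACCTGA-1", "AAACGGGT-1", "AAAGATGC-1"],
)

MIN_CONFIDENCE = 0.8  # hypothetical cut-off, tune for your data

# Most probable genotype per barcode and the probability assigned to it
assignments = pd.DataFrame(
    {
        "best_genotype": posterior_probabilities.idxmax(axis=1),
        "probability": posterior_probabilities.max(axis=1),
    }
)
# Flag barcodes whose best call does not reach the cut-off
assignments["confident"] = assignments["probability"] >= MIN_CONFIDENCE
print(assignments)
```

Barcodes flagged as not confident are usually worth inspecting separately rather than forcing into a donor.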
## Running (complex scenario)
Refinement of known genotypes is shown in a notebook, see `examples/`
## Saving/loading genotypes
```python
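from demuxalot import ProbabilisticGenotypes

# You can always export learnt genotypes to be used later.
# `refined_genotypes` is a ProbabilisticGenotypes object, e.g. the result of refinement.
refined_genotypes.save_betas('learnt_genotypes.parquet')
# When loading, list which genotypes to load (the names here are placeholders)
refined_genotypes = ProbabilisticGenotypes(genotype_names=['Donor1', 'Donor2', 'Donor3'])
refined_genotypes.add_prior_betas('learnt_genotypes.parquet')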
```
## Re-saving VCF genotypes with betas (optional, recommended)
It generally makes sense to export a VCF to the internal format only when you plan to load it many times.
Loading the internal format is *much* faster than parsing and validating a VCF.
```python
genotypes = ProbabilisticGenotypes(genotype_names=['Donor1', 'Donor2', 'Donor3'])
genotypes.add_vcf('path/to/genotypes.vcf')
genotypes.save_betas('learnt_genotypes.parquet')
```