- Name: paraphase
- Version: 3.1.1
- Home page: https://github.com/PacificBiosciences/paraphase
- Summary: paraphase: HiFi-based caller for highly homologous genes
- Upload time: 2024-04-18 18:41:05
- Author: Xiao Chen
- License: BSD-3-Clause-Clear

<h1 align="center"><img width="300px" src="docs/logo_Paraphase.svg"/></h1>

<h1 align="center">Paraphase</h1>

<h3 align="center">HiFi-based caller for highly similar paralogous genes</h3>

Many medically relevant genes fall into 'dark' regions where variant calling is limited due to high sequence homology with paralogs or pseudogenes. Paraphase is a Python tool that takes HiFi aligned BAMs as input (whole-genome or enrichment), phases haplotypes for genes of the same family, determines copy numbers and makes phased variant calls. 

![Paraphase diagram](docs/figures/paraphase_diagram.png)
Paraphase takes all reads from a gene family, realigns to one representative gene of the family and then phases them into haplotypes. This approach bypasses the error-prone process of aligning reads to multiple similar regions and allows us to examine all copies of genes in a gene family. This gene-family-centered approach allows Paraphase to perform well when there is a copy number difference between an individual and the reference, as is often the case in segmental duplications.
Furthermore, this approach streamlines sequence comparisons between genes within the same family, making it straightforward to conduct analyses such as identifying non-allelic gene conversions.  
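
The gene-family-centered idea can be illustrated with a toy sketch. This is not Paraphase's phasing algorithm; it is a simplified illustration that assumes reads have already been realigned to one representative gene and reduced to their bases at a few differentiating sites. The function name, the read names and the positions are all hypothetical.

```python
from collections import defaultdict

def group_reads_by_signature(reads, sites, min_support=2):
    """Cluster reads whose bases agree at every differentiating site."""
    clusters = defaultdict(list)
    for name, bases in reads.items():
        # Toy simplification: ignore reads that do not cover all sites.
        if not all(pos in bases for pos in sites):
            continue
        signature = "".join(bases[pos] for pos in sites)
        clusters[signature].append(name)
    # Keep only clusters supported by enough reads.
    return {sig: sorted(names) for sig, names in clusters.items()
            if len(names) >= min_support}

# Hand-written toy reads: position -> base on the representative gene.
reads = {
    "read1": {100: "A", 250: "C", 400: "G"},
    "read2": {100: "A", 250: "C", 400: "G"},
    "read3": {100: "T", 250: "C", 400: "G"},
    "read4": {100: "T", 250: "C", 400: "G"},
    "read5": {100: "T", 250: "T", 400: "A"},
    "read6": {100: "T", 250: "T", 400: "A"},
    "read7": {100: "A", 250: "C"},  # does not span all sites
}
haplotypes = group_reads_by_signature(reads, sites=[100, 250, 400])
for signature, names in sorted(haplotypes.items()):
    print(signature, names)
```

In this toy example the reads separate into three groups, regardless of how many copies a reference genome would contain for the family.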

Paraphase supports 160 segmental duplication [regions](docs/regions.md) in GRCh38. Among these, there are 11 medically relevant regions that are also supported in GRCh37/hg19:
- SMN1/SMN2 (spinal muscular atrophy)
- RCCX module
  - CYP21A2 (21-Hydroxylase-Deficient Congenital Adrenal Hyperplasia)
  - TNXB (Ehlers-Danlos syndrome)
  - C4A/C4B (relevant in autoimmune diseases)
- PMS2 (Lynch Syndrome)
- STRC (hereditary hearing loss and deafness)
- IKBKG (Incontinentia Pigmenti)
- NCF1 (chronic granulomatous disease; Williams syndrome)
- NEB (Nemaline myopathy)
- F8 (intron 22 inversion, Hemophilia A)
- CFC1 (heterotaxy syndrome)
- OPN1LW/OPN1MW (color vision deficiencies)
- HBA1/HBA2 (Alpha-Thalassemia)
- GBA (Gaucher disease and Parkinson's disease)
- CYP11B1/CYP11B2 (Glucocorticoid-remediable aldosteronism)
- CFH/CFHR1/CFHR2/CFHR3/CFHR4 (large deletions/duplications, atypical hemolytic uremic syndrome and age-related macular degeneration)

Please check out our [paper](https://www.cell.com/ajhg/fulltext/S0002-9297(23)00001-0) on its application to the gene SMN1 for more details about Paraphase.   
Chen X, Harting J, Farrow E, et al. Comprehensive SMN1 and SMN2 profiling for spinal muscular atrophy analysis using long-read PacBio HiFi sequencing. The American Journal of Human Genetics. 2023;0(0). doi:10.1016/j.ajhg.2023.01.001

For whole-genome sequencing (WGS) data, we recommend >20X, ideally 30X, genome coverage. Low coverage or short read length could result in less accurate phasing, especially when gene copies are highly similar to each other. For hybrid capture-based enrichment data, a higher read depth (>50X) is recommended, as reads are generally shorter than in WGS.
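
As a minimal sketch, the depth recommendations above could be encoded as a pre-run check. The thresholds come from the text; the function name, the data type labels and the way mean depth is obtained are assumptions for illustration only.

```python
# Recommended minimum depths from the text above (depth should exceed these).
MIN_DEPTH = {"wgs": 20, "enrichment": 50}

def depth_warning(mean_depth, data_type="wgs"):
    """Return a warning string if depth is not above the recommendation, else None."""
    minimum = MIN_DEPTH[data_type]
    if mean_depth <= minimum:
        return (f"{data_type} depth {mean_depth}X is not above the recommended "
                f"{minimum}X; phasing may be less accurate")
    return None

print(depth_warning(15))                # below the WGS recommendation
print(depth_warning(60, "enrichment"))  # None: above the enrichment recommendation
```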

## Contact

If you have suggestions or need assistance, please don't hesitate to reach out by email or open a GitHub issue.

Xiao Chen: xchen@pacificbiosciences.com

## Dependencies

- [samtools](http://www.htslib.org/)
- [minimap2](https://github.com/lh3/minimap2)

## Installation

Paraphase can be installed through pip or conda:
```bash
pip install paraphase
# or
conda install -c conda-forge -c bioconda paraphase
```

Alternatively, Paraphase can be installed from GitHub.
```bash
git clone https://github.com/PacificBiosciences/paraphase
cd paraphase
python setup.py install
```

## Running the program

```bash
paraphase -b input.bam -o output_directory -r genome_fasta
```

Alternatively, when you have a list of BAM files:
```bash
paraphase -l list.txt -o output_directory -r genome_fasta
```

Required parameters:
- `-b`: Input BAM file, or `-l`: text file listing BAM files, one per line (a BAI index file must exist in the same directory as each BAM)
- `-o`: Output directory
- `-r`: Path to the reference genome fasta file

Please note that the input BAM should be one that's aligned to the ENTIRE reference genome (either GRCh38 or GRCh37/hg19), and this reference should NOT include ALT contigs. The fasta file of this reference genome should be provided to Paraphase with `-r`. Recommendations on reference genomes to use are documented [here](https://github.com/PacificBiosciences/reference_genomes).
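
One way to check a reference for ALT contigs is to scan its FASTA index (`.fai`). This is a rough sketch under the assumption that the reference uses UCSC-style contig names, where ALT contigs end in `_alt`; other naming schemes would need a different test. The function name and the example index lines are illustrative.

```python
def find_alt_contigs(fai_lines):
    """Return contig names in a FASTA index that look like ALT contigs."""
    alt = []
    for line in fai_lines:
        if not line.strip():
            continue
        name = line.split("\t", 1)[0]  # first .fai column is the contig name
        # Assumption: UCSC-style naming, where ALT contigs end in "_alt".
        if name.endswith("_alt"):
            alt.append(name)
    return alt

# Toy index lines (name, length, offset, line bases, line width); values are illustrative.
example_fai = [
    "chr1\t1000\t6\t60\t61\n",
    "chr1_KI270762v1_alt\t500\t1100\t60\t61\n",
]
print(find_alt_contigs(example_fai))
```

With a real reference, the lines would come from the index file, e.g. `open("genome.fa.fai")`, and an empty result suggests the reference has no ALT contigs under this naming assumption.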

Optional parameters:
- `-g`: Region(s) to analyze, comma-separated. All supported [regions](docs/regions.md) will be analyzed if not specified. Use the region name, i.e. the first column in the regions doc.
- `-t`: Number of threads.
- `-p`: Prefix of output files when the input is a single sample, i.e. use with `-b`. If not provided, prefix will be extracted from the header of the input BAM. 
- `--genome`: Genome reference build. Default is `38`. If `37` or `19` is specified, Paraphase will run the analysis for GRCh37 or hg19, respectively (note that only 11 medically relevant [regions](docs/regions.md) are supported now for GRCh37/hg19).
- `--gene1only`: If specified, variant calls will be made against the main gene only for SMN1, PMS2, STRC, NCF1 and IKBKG. See more information [here](docs/vcf.md).
- `--novcf`: If specified, no VCF files will be produced.
- `--samtools`: Path to samtools. If samtools or minimap2 cannot be found through the `PATH` environment variable, their locations can be provided through the `--samtools` and `--minimap2` parameters.
- `--minimap2`: Path to minimap2.
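
When Paraphase is driven from a script, the options above can be assembled into an argument list. The sketch below uses only the flags documented in this README; the helper names are hypothetical and the region names in the example are placeholders that should be replaced with names from the regions doc.

```python
import shutil

def build_paraphase_command(bam, out_dir, reference, regions=None, threads=None,
                            prefix=None, genome=None, gene1only=False,
                            novcf=False, samtools=None, minimap2=None):
    """Assemble a single-sample Paraphase command as an argument list."""
    cmd = ["paraphase", "-b", str(bam), "-o", str(out_dir), "-r", str(reference)]
    if regions:
        cmd += ["-g", ",".join(regions)]  # comma-separated region names
    if threads:
        cmd += ["-t", str(threads)]
    if prefix:
        cmd += ["-p", prefix]
    if genome:
        cmd += ["--genome", str(genome)]  # 38 (default), 37 or 19
    if gene1only:
        cmd.append("--gene1only")
    if novcf:
        cmd.append("--novcf")
    if samtools:
        cmd += ["--samtools", str(samtools)]
    if minimap2:
        cmd += ["--minimap2", str(minimap2)]
    return cmd

def missing_tools(samtools=None, minimap2=None):
    """List dependencies that were not given explicitly and are not on PATH."""
    given = {"samtools": samtools, "minimap2": minimap2}
    return [name for name, path in given.items()
            if path is None and shutil.which(name) is None]

cmd = build_paraphase_command("sample.bam", "out", "genome.fa",
                              regions=["smn1", "pms2"], threads=4, genome=37)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; checking `missing_tools()` first gives a clearer message when samtools or minimap2 cannot be found.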

## Interpreting the output

Paraphase produces a few output files in the directory specified by `-o`, with the specified or default prefix.

1. `.vcf` in `${prefix}_paraphase_vcfs` folder. A VCF file is written for each region (gene family). More details on the VCF can be found [here](docs/vcf.md).

2. `.paraphase.bam`: This BAM file can be loaded into IGV for visualization of haplotypes (group reads by `HP` tag and color alignments by `YC` tag). All haplotypes are aligned against the main gene of interest. Tutorials/Examples are provided for medically relevant genes (see below).  

3. `.paraphase.json`: Output file summarizing haplotypes and variant calls for each gene family in each sample. In brief, a few generally used fields are explained below.
- `final_haplotypes`: phased haplotypes for all gene copies in a gene family
- `total_cn`: total copy number of the family (sum of gene and paralog/pseudogene)
- `two_copy_haplotypes`: haplotypes that are present in two copies based on depth. In a small number of cases, two gene copies have identical haplotypes, and Paraphase infers from the read depth that the haplotype is present twice rather than once.
- `haplotype_details`: lists information about each haplotype 
  - `boundary`: the boundary of the region that is resolved on the haplotype. This is useful when a haplotype is only partially phased.
- `alleles_final`: haplotypes phased into alleles. This is possible when the segmental duplication is in tandem.
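
The fields above can be pulled out of the `.paraphase.json` file with the standard `json` module. The sketch below is hedged: it assumes the file is a JSON object keyed by region name, with each region holding the fields listed above, and it treats the collections generically. The example data is hand-written for illustration and is not real Paraphase output; see the tutorials for the exact layout.

```python
import json

def summarize_region(name, region):
    """Collect a few commonly used fields for one region (gene family)."""
    region = region or {}  # tolerate regions with no result
    return {
        "region": name,
        "total_cn": region.get("total_cn"),
        "n_haplotypes": len(region.get("final_haplotypes") or []),
        "two_copy_haplotypes": list(region.get("two_copy_haplotypes") or []),
        "alleles_final": region.get("alleles_final"),
    }

def summarize(json_text):
    """Assumption: top-level keys are region names."""
    data = json.loads(json_text)
    return [summarize_region(name, region) for name, region in data.items()]

# Hand-written toy input, NOT real Paraphase output.
example = json.dumps({
    "smn1": {
        "total_cn": 4,
        "final_haplotypes": {"seq_a": "hap1", "seq_b": "hap2", "seq_c": "hap3"},
        "two_copy_haplotypes": ["hap2"],
        "alleles_final": [["hap1", "hap2"], ["hap2", "hap3"]],
    }
})
for row in summarize(example):
    print(row)
```

For a real run, the text would come from the output file, e.g. `open("out/sample.paraphase.json").read()`, where the path depends on the `-o` directory and the prefix.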

Tutorials/Examples are provided for interpreting the `json` output and visualizing haplotypes for medically relevant genes listed below: 
- [SMN1/SMN2](docs/SMN1_SMN2.md)
- [RCCX module (CYP21A2)](docs/RCCX.md)
- [PMS2](docs/PMS2.md)
- [STRC](docs/STRC.md)
- [OPN1LW/OPN1MW](docs/OPN1LW_OPN1MW.md)
- [HBA1/HBA2](docs/HBA1_HBA2.md)
- [IKBKG](docs/IKBKG.md)
- [F8](docs/F8.md)
- [NEB](docs/NEB.md)
- [NCF1](docs/NCF1.md)
- [GBA](docs/GBA.md)
- [CFH gene cluster](docs/CFH.md)


            
