# HaplotagLR

| Field | Value |
|---|---|
| Name | HaplotagLR |
| Version | 1.1.5 |
| Home page | https://github.com/Boyle-Lab/HaplotagLR.git |
| Summary | Phasing individual long reads using known haplotype information. |
| Upload time | 2023-12-07 21:26:34 |
| Author | Greg Farnum |
| Requires Python | >=3.7 |
| License | MIT |
| Keywords | long-reads, phasing, haplotype |
| Requirements | virtualenv, pip, setuptools, six, pysam, pyliftover, biopython, requests, numpy, powerlaw |

## A tool for haplotagging long-read sequencing results.

HaplotagLR haplotags long reads based on existing, pre-phased haplotypes in VCF format.


## Dependencies
### All modes:
* HTSlib (https://www.htslib.org/)
* Python >= v3.7
* minimap2  (https://github.com/lh3/minimap2)
* numpy (https://numpy.org/)
* powerlaw (https://github.com/jeffalstott/powerlaw)
* pysam (https://github.com/pysam-developers/pysam)
* pyliftover (https://github.com/konstantint/pyliftover)
* requests (http://python-requests.org)
* samtools (https://github.com/samtools/samtools)

### Simulation mode
* pbsim2 (https://github.com/yukiteruono/pbsim2)
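
Before running HaplotagLR, it can help to confirm that the external command-line tools are reachable. The snippet below is a minimal sketch, not part of HaplotagLR: the helper name `find_tools` is hypothetical, and the executable names are assumptions based on the tools listed above (`bgzip` and `tabix` ship with HTSlib).

```python
import shutil

# Executable names are assumptions derived from the dependency list above.
REQUIRED_TOOLS = ["samtools", "minimap2", "bgzip", "tabix"]


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Report the status of each tool.
for name, path in find_tools().items():
    print(f"{name}: {path or 'NOT FOUND'}")
```

Any tool reported as `NOT FOUND` should be installed (for example via the conda command in the Installation section) before proceeding.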


## Installation

We strongly recommend installing with conda, into a new environment:
```
conda create -n HaplotagLR_env -c conda-forge -c bioconda numpy pysam powerlaw pyliftover pbsim2 minimap2 requests samtools HaplotagLR
```

Install with pip:
```
pip install HaplotagLR
```

Installation from the GitHub repository is not recommended. However, if you must, follow the steps below:
1) git clone https://github.com/Boyle-Lab/HaplotagLR.git
2) cd HaplotagLR/
3) python3 -m pip install -e .


## Usage

```
HaplotagLR [-h] [--version] [-q] {haplotag} ...
```

HaplotagLR currently only offers haplotag mode, but may support more operations in future releases.


### Haplotag Mode
A tool for haplotagging individual long reads using pre-phased haplotypes

```
usage: HaplotagLR haplotag [-h] -v <VCF_FILE> -i <SAM/BAM/FASTQ>             
                           [-o </path/to/output>] [-r <REF_FASTA>]            
                           [-A <ASSEMBLY_NAME>] [-t <THREADS>] [-q] [-S] 
                           [-O {combined,phase_tagged,full}]
                           [-s <SAMPLE_NAME>] [-e EPSILON] [-c]
                           [-F FDR_THRESHOLD]                              
                           [--log_likelihood_threshold <MIN_LIKELIHOOD_RATIO>]
                           [--no_multcoeff]
```
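
When calling HaplotagLR from a pipeline script, the command line can be assembled programmatically. The following is a minimal sketch that uses only flags shown in the usage message above. The helper `build_haplotag_command` and the reference path `hg38.fa` are hypothetical; the other file names are taken from the examples in this README.

```python
import shlex


def build_haplotag_command(vcf, reads, output_dir=None, reference=None,
                           threads=None, output_mode=None):
    """Assemble an argv list for `HaplotagLR haplotag` from the documented flags."""
    cmd = ["HaplotagLR", "haplotag", "-v", vcf, "-i", reads]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    if reference is not None:
        cmd += ["-r", reference]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if output_mode is not None:
        # Only the three modes listed in the usage message are accepted.
        if output_mode not in ("combined", "phase_tagged", "full"):
            raise ValueError(f"unknown output mode: {output_mode}")
        cmd += ["-O", output_mode]
    return cmd


cmd = build_haplotag_command(
    "GM12878_haplotype.vcf.gz", "minion_GM12878_run3.fastq",
    reference="hg38.fa", threads=4, output_mode="combined")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.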

#### Required Arguments
| Argument | Description |
|---|---|
| __-v <VCF_FILE>, --vcf <VCF_FILE>__ | Path to the VCF file with haplotype information that will be used for haplotagging. Must be in .vcf.gz format with a tabix index in the same folder. If a .vcf file is provided, bgzip and tabix must be installed and available on PATH because HaplotagLR will attempt to convert it. EX: -v GM12878_haplotype.vcf.gz |
| __-i <SAM/BAM/FASTQ>__ | Path to the long reads that will be used for haplotagging, either as an alignment file (.bam or .sam) or as a sequencing file (.fastq). If a .sam file is provided, or an index is not found, the .sam or .bam file will be sorted and indexed with SAMtools. Sorted .bam files should be in the same directory as their index (.sorted.bam.bai). Reads in .fastq format will be aligned before haplotagging. EX: -i data/minion_GM12878_run3.sorted.bam, -i minion_GM12878_run3.sam, -i minion_GM12878_run3.fastq. **** NOTE: the -r/--reference argument is REQUIRED if using input in fastq format! **** |
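
The requirements in the table above can be checked before launching a long run. This is a rough sketch under stated assumptions: the helper `check_inputs` is hypothetical, the tabix index is assumed to use tabix's default name (the VCF file name plus `.tbi`), and fastq input is recognized only by a `.fastq` extension.

```python
from pathlib import Path


def check_inputs(vcf, reads, reference=None):
    """Return a list of problems with the given arguments (empty if none found)."""
    problems = []
    vcf_path = Path(vcf)
    if vcf_path.name.endswith(".vcf.gz"):
        # Assumption: the index sits next to the VCF as <name>.vcf.gz.tbi.
        index = vcf_path.with_name(vcf_path.name + ".tbi")
        if not index.exists():
            problems.append(f"missing tabix index: {index}")
    elif vcf_path.suffix == ".vcf":
        problems.append("plain .vcf given: bgzip and tabix must be on PATH")
    else:
        problems.append(f"unrecognized VCF extension: {vcf_path.name}")
    # fastq input must be aligned first, which needs a reference genome.
    if Path(reads).suffix.lower() == ".fastq" and reference is None:
        problems.append("fastq input requires -r/--reference")
    return problems


for problem in check_inputs("GM12878_haplotype.vcf", "minion_GM12878_run3.fastq"):
    print("WARNING:", problem)
```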

#### Optional Arguments
| Argument | Description |
|---|---|
| __-h, --help__ | Show help message and exit |
| __-o </path/to/output>, --output_directory_name </path/to/output_directory>__ | Name of the directory where results will be written. |
| __-r <REF_FASTA>, --reference <REF_FASTA>__ | Path to reference genome sequence file. REQUIRED if the argument to -i is a fastq file. |
| __-A <ASSEMBLY_NAME>, --reference_assembly <ASSEMBLY_NAME>__ | Assembly for the reference genome. EX: -A hg38. |
| __-t <THREADS>, --threads <THREADS>__ | Number of threads to use for mapping, sorting, and indexing steps. |
| __-q, --quiet__ | Output to stderr from subprocesses will be muted. |
| __-S, --silent__ | Output to stderr and stdout from subprocesses will be muted. |

#### Output Options
| Argument | Description |
|---|---|
| __-O {combined,phase_tagged,full}, --output_mode {combined,phase_tagged,full}__ | Specify whether and how phased, unphased, and nonphasable reads are written to output. Available modes: __combined__: All reads are written to a common output file. The phase tag (HP:i:N) can be used to extract maternal/paternal phased reads, unphased reads, and nonphasable reads. __phase_tagged__: Phased reads for both maternal and paternal phases are written to a single output file, while unphased and nonphasable reads are written to their own respective output files. __full__: Maternal, paternal, unphased, and nonphasable reads are written to separate output files. |
| __-s <SAMPLE_NAME>, --one_sample <SAMPLE_NAME>__ | Haplotag a specific sample present in the input reads and VCF file. EX: -s HG001 |

#### Statistical Options for Haplotag Mode
| Argument | Description |
|---|---|
| __-e GLOBAL_EPSILON, --global_epsilon GLOBAL_EPSILON__ | Use a global value for the sequencing error rate, epsilon. By default, epsilon is calculated per read as the mean observed error rate. With --global_epsilon, epsilon will be fixed at the given value when scoring reads and in calculating the FDR threshold value for the optional haplotagging error model (see --FDR_threshold). Supersedes --epsilon_from_quality_scores. |
| __-c, --epsilon_from_quality_scores__ | Obtain the sequencing error rate, epsilon, as per-base observed error rates, calculated directly from Phred scores in each BAM record. Default = Epsilon is calculated as the mean error rate per-read. Superseded by --global_epsilon. |
| __-F FDR_THRESHOLD, --FDR_threshold FDR_THRESHOLD__ | Control the false discovery rate at the given value using a negative-binomial estimate of the number of haplotagging errors (N) given the average per-base sequencing error rate observed among all taggable reads. Haplotagged reads are sorted by their observed log-likelihood ratios and the bottom N*(1-FDR) reads will be reassigned to the "Untagged" set. Set this to zero to skip this step and return all haplotagging predictions. Default = 0. |
| __--log_likelihood_threshold <LOG_LIKELIHOOD_THRESHOLD>__ | Use a hard threshold on log-likelihood ratios when haplotagging reads. Results will only be printed for predicted haplotaggings with log-likelihood ratios equal to or greater than this threshold. Setting this to zero will cause all reads to be assigned to the phase with which they share the greatest number of matches. Log-likelihood ratios will still be reported in the output in this case, but are not used for haplotagging decisions. |
| __--no_multcoeff__ | Do not apply the multinomial coefficient in the likelihood calculation. WARNING: The model will not output valid probabilities without this! Default=False (The multinomial coefficient will be used.) |
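
To make the epsilon options more concrete, the sketch below shows the standard conversion from a Phred quality score Q to an error probability, 10^(-Q/10), and a per-read mean of those probabilities. It only illustrates the quantities involved: the function names are hypothetical, and HaplotagLR's internal calculation may differ in detail.

```python
def phred_to_error(q):
    """Convert a Phred quality score to an error probability: 10^(-Q/10)."""
    return 10 ** (-q / 10)


def per_base_epsilons(quals):
    """Per-base error probabilities for one read's Phred scores."""
    return [phred_to_error(q) for q in quals]


def mean_read_epsilon(quals):
    """Mean error probability across one read's Phred scores."""
    if not quals:
        raise ValueError("read has no quality scores")
    return sum(per_base_epsilons(quals)) / len(quals)


# Hypothetical quality scores for a short read.
quals = [10, 20, 30]
print(per_base_epsilons(quals))
print(mean_read_epsilon(quals))
```

A single low-quality base dominates the mean, which is one reason per-base values (as with `-c`) and a per-read mean can lead to different scores for the same read.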

## Interpreting the Output
By default, HaplotagLR tags each record with several key:type:value tuples to encode the haplotagging decision and several values used in the tagging decision. These are written as BAM records to one or more output files, depending on invocation (See help for -O option). Specific tags added to each record are described below:

| Tag | Description |
|---|---|
| al | Alignment type, e.g., supplementary. Added if not already present in BAM record. |
| PS | Name of the overlapping phase set. |
| py | Ploidy number of overlapping phase set. |
| HS | Number of heterozygous variants overlapping read. |
| GP | Comma-delimited list of overlapping variants' position(s) relative to the read start. |
| PA | Phased alleles for all haplotypes overlapping read. Comma-delimited list of tuples. |
| OA | Observed allele(s) in sequenced read at heterozygous positions. May or may not match one of the values in PA! |
| PR | Comma-delimited list of Bayesian prior values. |
| LS | Comma-delimited list of Log-Likelihood-Ratios for each haplotype. |
| PC | Log-Likelihood-Ratio for assigned haplotag. |
| HP | Assigned haplotag. |
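
The tags above follow the SAM optional-field layout `TAG:TYPE:VALUE`, so they can be read from SAM text with a few lines of Python. This is a minimal sketch: the example record is made up, and the type codes shown for `PS` and `PC` are assumptions (only `HP:i:N` is stated in this README).

```python
def parse_sam_tags(sam_line):
    """Return {tag: value} for the optional fields of one SAM record line."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # optional fields start at column 12
        tag, type_code, value = field.split(":", 2)
        if type_code == "i":
            value = int(value)
        elif type_code == "f":
            value = float(value)
        tags[tag] = value
    return tags


# Hypothetical record; tag types other than HP:i are assumptions.
record = "\t".join([
    "read1", "0", "chr1", "100", "60", "4M", "*", "0", "0", "ACGT", "IIII",
    "HP:i:1", "PS:Z:chr1_100", "PC:f:12.5",
])
tags = parse_sam_tags(record)
print(tags["HP"], tags["PS"], tags["PC"])
```

For BAM output, pysam's `AlignedSegment.get_tag("HP")` gives the same information without manual parsing.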

## Example Dataset
We provide a sample dataset and example usage [here](https://github.com/Boyle-Lab/HaplotagLR/tree/main/example_data)

## Citing HaplotagLR
The HaplotagLR algorithm and software release 1.0.3 are described in [pub link here](https://example.com). Please use the following citation if you use this software in your work:

HaplotagLR: an efficient algorithm for assigning haplotypic identity to long reads.
Monica J. Holmes, Babak Mahjour, Christopher Castro, Gregory A. Farnum, Adam G. Diehl, Alan P. Boyle.
2022. bioRxiv. [URL](http://example.com)



            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/Boyle-Lab/HaplotagLR.git",
    "name": "HaplotagLR",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "long-reads,phasing,haplotype",
    "author": "Greg Farnum",
    "author_email": "gregfar@umich.edu",
    "download_url": "https://files.pythonhosted.org/packages/d2/ee/6a54dd85a43c26dec4e4b7554622b2bc92298bc60c6870fad504de6a964e/HaplotagLR-1.1.5.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Phasing individual long reads using known haplotype information.",
    "version": "1.1.5",
    "project_urls": {
        "Homepage": "https://github.com/Boyle-Lab/HaplotagLR.git",
        "Issue Tracker": "https://github.com/Boyle-Lab/HaplotagLR/issues"
    },
    "split_keywords": [
        "long-reads",
        "phasing",
        "haplotype"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ab2da9501f4e2f08278d4e153b9f2b31636baf3924379a54bebdd214135c8bcc",
                "md5": "8a9903b5e0c397f56e3e48fa5af256a9",
                "sha256": "612b9244df5891735410eebc81d221010302a12d36a36eba34fedf6ac1f56759"
            },
            "downloads": -1,
            "filename": "HaplotagLR-1.1.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8a9903b5e0c397f56e3e48fa5af256a9",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 158878,
            "upload_time": "2023-12-07T21:26:32",
            "upload_time_iso_8601": "2023-12-07T21:26:32.633072Z",
            "url": "https://files.pythonhosted.org/packages/ab/2d/a9501f4e2f08278d4e153b9f2b31636baf3924379a54bebdd214135c8bcc/HaplotagLR-1.1.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d2ee6a54dd85a43c26dec4e4b7554622b2bc92298bc60c6870fad504de6a964e",
                "md5": "10510f1822987e4f67c1e93fdc866fc3",
                "sha256": "5cbaa37b817a40f4b06ff3437f24e5364d516477df3abd550e3b00382be7b7f8"
            },
            "downloads": -1,
            "filename": "HaplotagLR-1.1.5.tar.gz",
            "has_sig": false,
            "md5_digest": "10510f1822987e4f67c1e93fdc866fc3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 101649,
            "upload_time": "2023-12-07T21:26:34",
            "upload_time_iso_8601": "2023-12-07T21:26:34.506215Z",
            "url": "https://files.pythonhosted.org/packages/d2/ee/6a54dd85a43c26dec4e4b7554622b2bc92298bc60c6870fad504de6a964e/HaplotagLR-1.1.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-12-07 21:26:34",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Boyle-Lab",
    "github_project": "HaplotagLR",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "virtualenv",
            "specs": [
                [
                    ">=",
                    "16.6.0"
                ]
            ]
        },
        {
            "name": "pip",
            "specs": [
                [
                    ">=",
                    "19.1.1"
                ]
            ]
        },
        {
            "name": "setuptools",
            "specs": [
                [
                    ">=",
                    "18.0.1"
                ]
            ]
        },
        {
            "name": "six",
            "specs": [
                [
                    ">=",
                    "1.14.0"
                ]
            ]
        },
        {
            "name": "pysam",
            "specs": [
                [
                    ">=",
                    "0.16.0.1"
                ]
            ]
        },
        {
            "name": "pyliftover",
            "specs": [
                [
                    ">=",
                    "0.4"
                ]
            ]
        },
        {
            "name": "biopython",
            "specs": [
                [
                    ">=",
                    "1.78"
                ]
            ]
        },
        {
            "name": "requests",
            "specs": [
                [
                    ">=",
                    "2.26.0"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.20.1"
                ]
            ]
        },
        {
            "name": "powerlaw",
            "specs": [
                [
                    ">=",
                    "1.4.6"
                ]
            ]
        }
    ],
    "tox": true,
    "lcname": "haplotaglr"
}
        