[![License: MIT](https://img.shields.io/badge/License-MIT%20-blue.svg)](https://www.mit.edu/~amini/LICENSE.md)
# PHASEfilter
PHASEfilter is a software package to filter variants, SNPs and INDELs, that are present in heterozygous form in phased genomes.
It is an easily implementable tool that provides a simple approach to detect and filter heterozygous SNPs and INDELs in diploid species based on a phased genome assembly.
# Installation
This installation is oriented for Linux distributions.
### Install directly
```
$ pip3 install PHASEfilter
```
### Install with virtualenv
```
$ virtualenv PHASEfilter --python=python3 --prompt "(PHASEfilter 1.1.0) "
$ . PHASEfilter/bin/activate
(phasefilter 1.1.0) $ pip install PHASEfilter
## install all Software dependencies of PHASEfilter
(phasefilter 1.1.0) $ cd PHASEfilter/bin/
(phasefilter 1.1.0) $ ./install_phasefilter_dependencies.sh
```
### Install with conda
```
$ wget https://raw.githubusercontent.com/ibigen/PHASEfilter/main/conda/conda_phasefilter_env.yml -O conda_phasefilter_env.yml
$ conda env create -f conda_phasefilter_env.yml
$ conda activate PHASEfilter
```
The follow software must be available in your computer:
* [minimpa2](https://github.com/lh3/minimap2) v2.22 or up
* [bcftools](http://www.htslib.org/download/) v1.3 or up
* [samtools](http://www.htslib.org/download/) v1.3 or up
* [htslib](http://www.htslib.org/download/) v1.3 or up
# All software available
## Filter variants in phased genomes
This software that can identify heterozygosity positions between two phased references.
The software starts by aligning pairs of diploid chromosomes, based on Minimap2 aligner. With synchronization done it is possible to identify the position of a variation, in both pair of chromosomes, allowing variants to be removed if they meets some established criterias.
To classify variants it is necessary to pass two VCF files, one for each reference phase. After that, the PHASEfilter will go through the variants called in reference A and check if there are any homologous in the variants called in reference B. For each variant called in the reference A it can happen three situations: 1) both references, for the position in analysis, are equal and the variant is valid; 2) the position is heterozygous in the references and the variant reflects it, so the variant is removed; 3) the position is heterozygous in the references and the variant is homozygous. It goes to the valid variants file but it also go to the Loss Of Heterozygous (LOH) file.
The variant file in analysis it is always the one passed in parameter '--vcf1'.
```
$ phasefilter --help
## You can can copy some example data to test the commands
$ copy_raw_data_example_phasefilter --out temp_raw_data
$ phasefilter --ref1 temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --ref2 temp_raw_data/Ca22chr7B_C_albicans_SC5314.fasta --vcf1 temp_raw_data/T1_Fluc_7A_snps.vcf.gz --vcf2 temp_raw_data/T1_Fluc_7B_snps.vcf.gz --out output_dir
## you can use chain if exists
$ phasefilter --ref1 temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --ref2 temp_raw_data/Ca22chr7B_C_albicans_SC5314.fasta --vcf1 temp_raw_data/T1_Fluc_7A_snps.vcf.gz --vcf2 temp_raw_data/T1_Fluc_7B_snps.vcf.gz --out output_dir --chain_A_B temp_raw_data/Assembly22_hapA_To_Assembly22_hapB.over.chain --chain_B_A temp_raw_data/Assembly22_hapB_To_Assembly22_hapA.over.chain
```
Eighth possible files will be created after the commands ends. The outputs are from refrence A (ref1) to reference B (ref2), and from reference B (ref2) to reference A (ref1).
- report_[A]_to_[B].txt - has the statistics about the analysis;
- valid_[A]_to_[B].vcf.gz - has all variants that are not heterozygous between two references;
- removed_[A]_to_[B].vcf.gz - has all heterozygous variants;
- LOH_[A]_to_[B].vcf.gz - has all variants that are loss of heterozygous between two references. This variants are also in 'out_file.vcf.gz' file.
- report_[B]_to_[A].txt - has the statistics about the analysis from ;
- valid_[B]_to_[A].vcf.gz - has all variants that are not heterozygous between two references;
- removed_[B]_to_[A].vcf.gz - has all heterozygous variants;
- LOH_[B]_to_[A].vcf.gz - has all variants that are loss of heterozygous between two references. This variants are also in 'out_file.vcf.gz' file.
Headings description in report files:
- **Heterozygous (Removed)** Heterozygous identified and they go the re remove_[YYY]_to[XXX].vcf.gz file
- **Keep alleles** Alleles present in valid_[YYY]_to[XXX].vcf.gz file
- **LOH alleles Loss of Heterozygous** They are in valid_[YYY]_to[XXX].vcf.gz and LOH_[YYY]_to[XXX].vcf.gz file.
- **Other than SNP** Other variants thar are not SNPs and INDELs and they go to valid_[YYY]_to[XXX].vcf.gz file
- **Don't have hit position** Variants that don’t have position in hit (ref B) genome and they go to valid_[YYY]_to[XXX].vcf.gz file
- **Could Not Fetch VCF Record on Hit** Variants that are present in source file but not in hit VCF file. They go to valid_[YYY]_to[XXX].vcf.gz file
- **Total alleles** All the alleles present in the source vcf file. Analyzed alleles.
- **Total Alleles new Source VCF** Total alleles that are in valid_[YYY]_to[XXX].vcf.gz file
- **Method** Alignment method.
- **Alignment %** Percentage of alignment.
**Note:** You can can copy some example data to test the commands.
{: .note}
## Filter variants in phased genomes but only one direction
This tool do as the same of the previous script but only analysis from Reference A (ref1) to Reference B (ref2)
```
$ phasefilter_single --help
$ phasefilter_single --ref1 Ca22chr1A_C_albicans_SC5314.fasta --ref2 Ca22chr1B_C_albicans_SC5314.fasta --vcf1 A-M_S4_chrA_filtered_snps.vcf.gz --vcf2 A-M_S4_chrB_filtered_snps.vcf.gz --out_vcf out_result.vcf.gz
## with chain
$ phasefilter_single --ref1 Ca22chr1A_C_albicans_SC5314.fasta --ref2 Ca22chr1B_C_albicans_SC5314.fasta --vcf1 A-M_S4_chrA_filtered_snps.vcf.gz --vcf2 A-M_S4_chrB_filtered_snps.vcf.gz --out_vcf out_result.vcf.gz --chain temp_raw_data/Assembly22_hapA_To_Assembly22_hapB.over.chain
```
## Synchronize annotation genomes
Synchronize annotations genomes adapting the annotations that are in reference 1 to the reference 2, adding the tags 'StartHit' and 'EndHit' to the result file. In VCF type files only add 'StartHit' tag in Info. The annotations (input file need to be in VCF or GFF3 and belong to the reference 1.
```
$ synchronize_genomes --help
## You can can copy some example data to test the commands
$ copy_raw_data_example_phasefilter --out temp_raw_data
$ synchronize_genomes --ref1 temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --ref2 temp_raw_data/Ca22chr7B_C_albicans_SC5314.fasta --gff temp_raw_data/T1_Fluc_7A_snps.gff3.gz --out T1_Fluc_7A_snps.sync.gff3.gz
$ synchronize_genomes --ref1 S288C_reference.fna --ref2 S01.assembly.final.fa --gff S288C_reference.gff3 --out result.gff3 --pass_chr chrmt
$ synchronize_genomes --ref1 S288C_reference.fna --ref2 S01.assembly.final.fa --vcf S288C_reference.vcf.gz --out result.vcf.gz
```
## Make alignments
Obtain the percentage of the minimap2 alignment between chromosomes and create an output in ClustalX format.
```
$ make_alignment --help
## You can can copy some example data to test the commands
$ copy_raw_data_example_phasefilter --out temp_raw_data
$ make_alignment --ref1 temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --ref2 temp_raw_data/Ca22chr7B_C_albicans_SC5314.fasta --out report.txt
$ make_alignment --ref1 Ca22chr1A_C_albicans_SC5314.fasta --ref2 Ca22chr1B_C_albicans_SC5314.fasta --out report.txt --pass_chr chrmt --out_alignment syncronizationSacharo
```
## Reference Statistics
With this application it is possible to obtain the number of nucleotides by chromosome.
```
$ reference_statistics --help
## You can can copy some example data to test the commands
$ copy_raw_data_example_phasefilter --out temp_raw_data
$ reference_statistics --ref temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --out report_stats.txt
$ reference_statistics --ref Ca22chr1A_C_albicans_SC5314.fasta.gz --out retport.txt
```
## Copy some example data to test all tools
It is possible to copy some example data and test the tools available
```
$ copy_raw_data_example_phasefilter --help
$ copy_raw_data_example_phasefilter --out temp_dir
```
# Documentation
PHASEfilter documentation is available in [ReadTheDocs: PHASEfilter](https://phasefilter.readthedocs.io/en/latest/)
Raw data
{
"_id": null,
"home_page": "https://github.com/ibigen/PHASEfilter",
"name": "PHASEfilter",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.5",
"maintainer_email": "",
"keywords": "SNPs INDEL phased genome filtering bioinformatics",
"author": "Miguel Pinheiro",
"author_email": "monsanto@ua.pt",
"download_url": "https://files.pythonhosted.org/packages/45/4f/4afad6213ebd3b21f8a36cbcd8b17c3c6264054507b9790378441c6fd589/PHASEfilter-1.1.0.tar.gz",
"platform": "Linux",
"description": "\n[![License: MIT](https://img.shields.io/badge/License-MIT%20-blue.svg)](https://www.mit.edu/~amini/LICENSE.md)\n\n# PHASEfilter\nPHASEfilter is a software package to filter variants, SNPs and INDELs, that are present in heterozygous form in phased genomes.\nIt is an easily implementable tool that provides a simple approach to detect and filter heterozygous SNPs and INDELs in diploid species based on a phased genome assembly.\n\n# Installation\n\nThis installation is oriented for Linux distributions.\n\n### Install directly\n\n```\n$ pip3 install PHASEfilter\n```\n\n### Install with virtualenv\n\n```\n$ virtualenv PHASEfilter --python=python3 --prompt \"(PHASEfilter 1.1.0) \"\n$ . PHASEfilter/bin/activate\n(phasefilter 1.1.0) $ pip install PHASEfilter\n\n## install all Software dependencies of PHASEfilter \n(phasefilter 1.1.0) $ cd PHASEfilter/bin/\n(phasefilter 1.1.0) $ ./install_phasefilter_dependencies.sh\n```\n\n### Install with conda\n\n```\n$ wget https://raw.githubusercontent.com/ibigen/PHASEfilter/main/conda/conda_phasefilter_env.yml -O conda_phasefilter_env.yml\n$ conda env create -f conda_phasefilter_env.yml\n$ conda activate PHASEfilter\n```\n\n\nThe follow software must be available in your computer:\n* [minimpa2](https://github.com/lh3/minimap2) v2.22 or up\n* [bcftools](http://www.htslib.org/download/) v1.3 or up\n* [samtools](http://www.htslib.org/download/) v1.3 or up\n* [htslib](http://www.htslib.org/download/) v1.3 or up\n\n\n# All software available\n\n## Filter variants in phased genomes\n\nThis software that can identify heterozygosity positions between two phased references.\nThe software starts by aligning pairs of diploid chromosomes, based on Minimap2 aligner. With synchronization done it is possible to identify the position of a variation, in both pair of chromosomes, allowing variants to be removed if they meets some established criterias.\nTo classify variants it is necessary to pass two VCF files, one for each reference phase. After that, the PHASEfilter will go through the variants called in reference A and check if there are any homologous in the variants called in reference B. For each variant called in the reference A it can happen three situations: 1) both references, for the position in analysis, are equal and the variant is valid; 2) the position is heterozygous in the references and the variant reflects it, so the variant is removed; 3) the position is heterozygous in the references and the variant is homozygous. It goes to the valid variants file but it also go to the Loss Of Heterozygous (LOH) file.\nThe variant file in analysis it is always the one passed in parameter '--vcf1'.\n\n```\n$ phasefilter --help\n## You can can copy some example data to test the commands\n$ copy_raw_data_example_phasefilter --out temp_raw_data\n$ phasefilter --ref1 temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --ref2 temp_raw_data/Ca22chr7B_C_albicans_SC5314.fasta --vcf1 temp_raw_data/T1_Fluc_7A_snps.vcf.gz --vcf2 temp_raw_data/T1_Fluc_7B_snps.vcf.gz --out output_dir\n\n## you can use chain if exists\n$ phasefilter --ref1 temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --ref2 temp_raw_data/Ca22chr7B_C_albicans_SC5314.fasta --vcf1 temp_raw_data/T1_Fluc_7A_snps.vcf.gz --vcf2 temp_raw_data/T1_Fluc_7B_snps.vcf.gz --out output_dir --chain_A_B temp_raw_data/Assembly22_hapA_To_Assembly22_hapB.over.chain --chain_B_A temp_raw_data/Assembly22_hapB_To_Assembly22_hapA.over.chain\n```\n\nEighth possible files will be created after the commands ends. The outputs are from refrence A (ref1) to reference B (ref2), and from reference B (ref2) to reference A (ref1).\n\n- report_[A]_to_[B].txt - has the statistics about the analysis;\n- valid_[A]_to_[B].vcf.gz - has all variants that are not heterozygous between two references;\n- removed_[A]_to_[B].vcf.gz - has all heterozygous variants;\n- LOH_[A]_to_[B].vcf.gz - has all variants that are loss of heterozygous between two references. This variants are also in 'out_file.vcf.gz' file.\n\n- report_[B]_to_[A].txt - has the statistics about the analysis from ;\n- valid_[B]_to_[A].vcf.gz - has all variants that are not heterozygous between two references;\n- removed_[B]_to_[A].vcf.gz - has all heterozygous variants;\n- LOH_[B]_to_[A].vcf.gz - has all variants that are loss of heterozygous between two references. This variants are also in 'out_file.vcf.gz' file.\n\nHeadings description in report files:\n\n- **Heterozygous (Removed)** Heterozygous identified and they go the re remove_[YYY]_to[XXX].vcf.gz file\n- **Keep alleles** Alleles present in valid_[YYY]_to[XXX].vcf.gz file\n- **LOH alleles Loss of Heterozygous** They are in valid_[YYY]_to[XXX].vcf.gz and LOH_[YYY]_to[XXX].vcf.gz file.\n- **Other than SNP** Other variants thar are not SNPs and INDELs and they go to valid_[YYY]_to[XXX].vcf.gz file\n- **Don't have hit position** Variants that don\u2019t have position in hit (ref B) genome and they go to valid_[YYY]_to[XXX].vcf.gz file\n- **Could Not Fetch VCF Record on Hit** Variants that are present in source file but not in hit VCF file. They go to valid_[YYY]_to[XXX].vcf.gz file\n- **Total alleles** All the alleles present in the source vcf file. Analyzed alleles.\n- **Total Alleles new Source VCF** Total alleles that are in valid_[YYY]_to[XXX].vcf.gz file\n- **Method** Alignment method.\n- **Alignment %** Percentage of alignment.\n\n**Note:** You can can copy some example data to test the commands.\n{: .note}\n\n\n## Filter variants in phased genomes but only one direction\n\nThis tool do as the same of the previous script but only analysis from Reference A (ref1) to Reference B (ref2)\n\n```\n$ phasefilter_single --help\n$ phasefilter_single --ref1 Ca22chr1A_C_albicans_SC5314.fasta --ref2 Ca22chr1B_C_albicans_SC5314.fasta --vcf1 A-M_S4_chrA_filtered_snps.vcf.gz --vcf2 A-M_S4_chrB_filtered_snps.vcf.gz --out_vcf out_result.vcf.gz\n\n## with chain\n$ phasefilter_single --ref1 Ca22chr1A_C_albicans_SC5314.fasta --ref2 Ca22chr1B_C_albicans_SC5314.fasta --vcf1 A-M_S4_chrA_filtered_snps.vcf.gz --vcf2 A-M_S4_chrB_filtered_snps.vcf.gz --out_vcf out_result.vcf.gz --chain temp_raw_data/Assembly22_hapA_To_Assembly22_hapB.over.chain\n```\n\n## Synchronize annotation genomes\n\nSynchronize annotations genomes adapting the annotations that are in reference 1 to the reference 2, adding the tags 'StartHit' and 'EndHit' to the result file. In VCF type files only add 'StartHit' tag in Info. The annotations (input file need to be in VCF or GFF3 and belong to the reference 1.\n\n```\n$ synchronize_genomes --help\n## You can can copy some example data to test the commands\n$ copy_raw_data_example_phasefilter --out temp_raw_data\n$ synchronize_genomes --ref1 temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --ref2 temp_raw_data/Ca22chr7B_C_albicans_SC5314.fasta --gff temp_raw_data/T1_Fluc_7A_snps.gff3.gz --out T1_Fluc_7A_snps.sync.gff3.gz\n$ synchronize_genomes --ref1 S288C_reference.fna --ref2 S01.assembly.final.fa --gff S288C_reference.gff3 --out result.gff3 --pass_chr chrmt\n$ synchronize_genomes --ref1 S288C_reference.fna --ref2 S01.assembly.final.fa --vcf S288C_reference.vcf.gz --out result.vcf.gz\n```\n\n## Make alignments\n\nObtain the percentage of the minimap2 alignment between chromosomes and create an output in ClustalX format.\n\n```\n$ make_alignment --help\n## You can can copy some example data to test the commands\n$ copy_raw_data_example_phasefilter --out temp_raw_data\n$ make_alignment --ref1 temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --ref2 temp_raw_data/Ca22chr7B_C_albicans_SC5314.fasta --out report.txt\n$ make_alignment --ref1 Ca22chr1A_C_albicans_SC5314.fasta --ref2 Ca22chr1B_C_albicans_SC5314.fasta --out report.txt --pass_chr chrmt --out_alignment syncronizationSacharo\n```\n\n## Reference Statistics\n\nWith this application it is possible to obtain the number of nucleotides by chromosome.\n\n```\n$ reference_statistics --help\n## You can can copy some example data to test the commands\n$ copy_raw_data_example_phasefilter --out temp_raw_data\n$ reference_statistics --ref temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --out report_stats.txt\n$ reference_statistics --ref Ca22chr1A_C_albicans_SC5314.fasta.gz --out retport.txt\n```\n\n## Copy some example data to test all tools\n\nIt is possible to copy some example data and test the tools available\n\n```\n$ copy_raw_data_example_phasefilter --help\n$ copy_raw_data_example_phasefilter --out temp_dir\n```\n\n# Documentation\n\nPHASEfilter documentation is available in [ReadTheDocs: PHASEfilter](https://phasefilter.readthedocs.io/en/latest/)\n\n\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "Software package to filter variants, SNPs and INDELs, that are present in heterozygous form in phased genomes.",
"version": "1.1.0",
"project_urls": {
"Bug Reports": "https://github.com/ibigen/PHASEfilter/issues",
"Documentation": "https://phasefilter.readthedocs.io/en/latest/index.html",
"Homepage": "https://github.com/ibigen/PHASEfilter",
"Source": "https://github.com/ibigen/PHASEfilter"
},
"split_keywords": [
"snps",
"indel",
"phased",
"genome",
"filtering",
"bioinformatics"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "e80cb1ba707d308767faf0768f96d331da1e112b9a337f43b56ee9f8d6d8aff6",
"md5": "dc508fdc00e06904f6a871582af22cad",
"sha256": "087e4d5940560b121fd64e4de84c809fbe3c09c6eb8aedb4a9532e8de21acacf"
},
"downloads": -1,
"filename": "PHASEfilter-1.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "dc508fdc00e06904f6a871582af22cad",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.5",
"size": 5295981,
"upload_time": "2023-05-05T14:19:39",
"upload_time_iso_8601": "2023-05-05T14:19:39.016348Z",
"url": "https://files.pythonhosted.org/packages/e8/0c/b1ba707d308767faf0768f96d331da1e112b9a337f43b56ee9f8d6d8aff6/PHASEfilter-1.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "454f4afad6213ebd3b21f8a36cbcd8b17c3c6264054507b9790378441c6fd589",
"md5": "ab9e8f332ff281a27699d1ff95b754c7",
"sha256": "bc1db109e9d28168138ce32e62bf36e19ca3e17650fc962c9dc0426d12adbd59"
},
"downloads": -1,
"filename": "PHASEfilter-1.1.0.tar.gz",
"has_sig": false,
"md5_digest": "ab9e8f332ff281a27699d1ff95b754c7",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.5",
"size": 61716,
"upload_time": "2023-05-05T14:20:25",
"upload_time_iso_8601": "2023-05-05T14:20:25.176678Z",
"url": "https://files.pythonhosted.org/packages/45/4f/4afad6213ebd3b21f8a36cbcd8b17c3c6264054507b9790378441c6fd589/PHASEfilter-1.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-05-05 14:20:25",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ibigen",
"github_project": "PHASEfilter",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"tox": true,
"lcname": "phasefilter"
}