# SNPio: A Python API for Population Genetic File Processing, Filtering, and Encoding
## Introduction
This guide provides an overview of how to get started with the SNPio
library. It covers the basic steps to read, manipulate, and analyze
genotype data using the VCFReader, PhylipReader, StructureReader, and
NRemover2 classes. SNPio is designed to simplify the process of handling
genotype data and preparing it for downstream analysis, such as
population genetics, phylogenetics, and machine learning. The library
supports various file formats, including VCF, PHYLIP, and STRUCTURE, and
provides tools for filtering, encoding, and visualizing genotype data.
This guide will help you get up and running with SNPio quickly and
efficiently.
`VCFReader`, `PhylipReader`, and `StructureReader` classes are used to
read genotype data from VCF, PHYLIP, and STRUCTURE files, respectively.
These classes load the data into a `GenotypeData` object that has
various useful methods and properties.
The `NRemover2` class is used to filter genotype data based on various
criteria, such as missing data, minor allele count, minor allele
frequency, and more. The `GenotypeEncoder` class is used to encode
genotype data into different formats, such as one-hot encoding, integer
encoding, and 0-1-2 encoding, for downstream analysis and machine
learning tasks.
Below is a step-by-step guide to using SNPio to read, filter, and encode
genotype data for analysis.
## Installation
Before using SNPio, ensure it is installed in your Python environment.
You can install it using pip. In the project root directory (the
directory containing setup.py), type the following command into your
terminal:
``` shell
pip install snpio
```
We recommend using a virtual environment to manage your Python packages.
If you do not have a virtual environment set up, you can create one
using the following commands:
``` shell
python3 -m venv snpio_env
source snpio_env/bin/activate
```
This will create a virtual environment named `snpio_env` and activate
it. You can then install SNPio in this virtual environment using the pip
command mentioned above.
**Note:**
SNPio does not support Windows operating systems at the moment. We
recommend using a Unix-based operating system such as Linux or macOS.
**Note:**
We aim to support anaconda environments in the future. For now, we
recommend using a virtual environment with `pip` to install SNPio.
## Importing SNPio
To start using SNPio, import the necessary modules:
``` python
# Import the necessary modules
from snpio import (
NRemover2,
VCFReader,
PhylipReader,
StructureReader,
Plotting,
GenotypeEncoder,
)
```
Example usage:
``` python
# Define input filenames
vcf = "snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz"
popmap = "snpio/example_data/popmaps/phylogen_nomx.popmap"
# Load the genotype data from a VCF file
gd = VCFReader(
filename=vcf,
popmapfile=popmap,
force_popmap=True,
verbose=True,
plot_format="png",
plot_fontsize=20,
plot_dpi=300,
despine=True,
prefix="snpio_example"
)
```
You can also include or exclude any populations from the analysis by
using the `include_pops` and `exclude_pops` parameters in the reader
classes. For example:
``` python
# Only include the populations "ON", "DS", "EA", "GU", and "TT"
# Exclude the populations "MX", "YU", and "CH"
gd = VCFReader(
filename=vcf,
popmapfile=popmap,
force_popmap=True,
verbose=True,
plot_format="png",
plot_fontsize=20,
plot_dpi=300,
despine=True,
prefix="snpio_example",
include_pops=["ON", "DS", "EA", "GU"],
exclude_pops=["MX", "YU", "CH"],
)
```
The `include_pops` and `exclude_pops` parameters are optional and can be
used to filter the populations included in the analysis. If both
parameters are provided, the populations in `include_pops` will be
included, and the populations in `exclude_pops` will be excluded.
However, populations cannot overlap between lists.
## Important Notes
- The `VCFReader`, `PhylipReader`, `StructureReader`, `NRemover2`, and `GenotypeEncoder` classes treat the following characters as missing data:
- "N"
- "."
- "?"
- "-"
- The `VCFReader` class can read both uncompressed and compressed VCF
files (gzipped). If your input file is in PHYLIP or STRUCTURE format, it
will be forced to be biallelic. To handle more than two alleles per
site, use the VCF format.
## The Population Map File
To use `VCFReader`, `PhylipReader`, or `StructureReader`, you can
optionally use a population map (popmap) file. This is a simple
two-column, whitespace-delimited or comma-delimited file with SampleIDs
in the first column and the corresponding PopulationIDs in the second
column. It can optionally contain a header line, with the first column
labeled "SampleID" and the second column labeled "PopulationID"
(case-insensitive). The population IDs can be any string, such as
"Population1", "Population2", etc, or an integer. SampleIDs must match
the sample names in the alignment file.
For example:
``` shell
Sample1,Population1
Sample2,Population1
Sample3,Population2
Sample4,Population2
```
Or, with a header:
``` shell
SampleID,PopulationID
Sample1,Population1
Sample2,Population1
Sample3,Population2
Sample4,Population2
```
The population map file is used to assign samples to populations and is
useful for filtering and visualizing genotype data by population. If you
do not provide a population map file, the samples will be treated as a
single population.
The population map file can be provided as an argument to the reader
classes. For example:
``` python
vcf = "snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz"
popmap = "snpio/example_data/popmaps/phylogen_nomx.popmap"
gd = VCFReader(
filename=vcf,
popmapfile=popmap,
force_popmap=True,
verbose=True,
plot_format="png",
plot_fontsize=20,
plot_dpi=300,
despine=True,
prefix="snpio_example"
)
```
**Note:**
The `force_popmap` parameter in the reader classes is used to force the
population map file to align with the samples in the alignment without
an error. If set to `False`, the population map file must match the
samples in the alignment exactly, and if they do not match, an error
will be raised. If set to `True`, the population map file will be forced
to align with the samples in the alignment by removing extra samples.
This parameter is set to `False` by default.
The `verbose` parameter in the reader classes is used to print
additional information about the genotype data and filtering steps.
The `plot_format`, `plot_fontsize`, `plot_dpi`, and `despine` parameters
in the reader classes are used to customize the output plots generated
by the reader classes. See API documentation for more details.
## Reading Genotype Data
SNPio provides readers for different file formats. Here are examples of
how to read genotype data from various file formats:
### VCFReader
``` python
vcf = "snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz"
popmap = "snpio/example_data/popmaps/phylogen_nomx.popmap"
gd = VCFReader(
filename=vcf,
popmapfile=popmap,
force_popmap=True,
verbose=True,
plot_format="png",
plot_fontsize=20,
plot_dpi=300,
despine=True,
prefix="snpio_example",
exclude_pops=["MX", "YU", "CH"],
include_pops=["ON", "DS", "EA", "GU", "TT"],
)
```
This will read the genotype data from a VCF file and apply the
population map if provided.
### PhylipReader
If you would like to read a Phylip file, you can use the `PhylipReader`
class:
``` python
phylip = "snpio/example_data/phylip_files/phylogen_subset14K.phy"
popmap = "snpio/example_data/popmaps/phylogen_nomx.popmap"
gd = PhylipReader(
filename=phylip,
popmapfile=popmap,
force_popmap=True,
verbose=True,
plot_format="png",
plot_fontsize=20,
plot_dpi=300,
despine=True,
prefix="snpio_example",
exclude_pops=["MX", "YU", "CH"],
include_pops=["ON", "DS", "EA", "GU", "TT"],
)
```
### StructureReader
If you would like to read in a Structure file, you can use the
`StructureReader` class. For example:
``` python
structure = "snpio/example_data/structure_files/phylogen_subset14K.str"
popmap = "snpio/example_data/popmaps/phylogen_nomx.popmap"
gd = StructureReader(
filename=structure,
popmapfile=popmap,
force_popmap=True,
verbose=True,
plot_format="png",
plot_fontsize=20,
plot_dpi=300,
despine=True,
prefix="snpio_example",
exclude_pops=["MX", "YU", "CH"],
include_pops=["ON", "DS", "EA", "GU", "TT"],
)
```
**Note:**
The `StructureReader` class will automatically detect the format of the
STRUCTURE file. It can be in one-line or two-line format (see STRUCTURE
documentation), and can optionally contain population information in the
file as the second tab-delimited column. If the population information
is not provided in the STRUCTURE file, you can provide a population map
file to assign samples to populations.
### Key Methods in VCFReader, PhylipReader, and StructureReader
`VCFReader(filename, popmapfile, force_popmap, ...)`: Reads and writes
genotype data from/ to a VCF file and applies a population map if
provided.
`write_vcf(output_file)`: Writes the filtered or modified genotype data
back to a VCF file (for all three readers).
`PhylipReader(filename, popmapfile, force_popmap, ...)`: Reads and
writes genotype data from/ to a PHYLIP file and applies a population
map.
`write_phylip(output_file)`: Writes the filtered or modified genotype
data back to a PHYLIP file (for PhylipReader).
`StructureReader(filename, popmapfile, force_popmap, ...)`: Reads and
writes genotype data from/ to a STRUCTURE file and applies a population
map.
`write_structure(output_file)`: Writes the filtered or modified genotype
data back to a STRUCTURE file (for StructureReader).
**Note:**
The `write_vcf`, `write_phylip`, and `write_structure` methods are used
to write the filtered or modified genotype data back to a VCF, PHYLIP,
or STRUCTURE file, respectively. **These methods can also be used to
convert between file VCF, PHYLIP, and STRUCTURE formats.**
## Other GenotypeData Methods
The `GenotypeData` along with the `Plotting` classes have several useful
methods for working with genotype data:
- `Plotting.run_pca()`: Runs principal component analysis (PCA) on the
genotype data and plots the results. The PCA plot can help visualize
the genetic structure of the populations in the dataset, with each
point representing an individual. Individuals are colored by missing
data proportion, and populations are represented by different
shapes. A 2-dimensional PCA plot is generated by default, but you
can specify three PCA axes as well. For example:
![Principle Component Analysis (PCA) colored by missingness proportion. Shapes depict distinct populations.](snpio/img/pca_missingness.png)
- `GenotypeData.missingness_reports()`: Generates missing data reports
and plots for the dataset. The reports include the proportion of
missing data per individual, per locus, and per population. These
reports can help you identify samples, loci, or populations with
high levels of missing data. For example:
![Plots depicting missing data proportions for samples and loci (SNPs).](snpio/img/missingness_report.png)
- The `GenotypeData` class will automatically create a plot showing
the number of inidviduals present in each population, if a
`popmapfile` is provided. For example:
![Bar plot depicting counts per populations as provided in the population map file.](snpio/img/population_counts.png)
## Filtering Genotype Data with NRemover2
NRemover2 provides a variety of filtering methods to clean your genotype
data. Here is an example of how to apply filters to remove samples and
loci with too much missing data, monomorphic sites, singletons, minor
allele count (MAC), minor allele frequency (MAF), and more:
``` python
# Apply filters to remove samples and loci with too much missing data
gd_filt = nrm.filter_missing_sample(0.75)
.filter_missing(0.75)
.filter_missing_pop(0.75)
.filter_mac(2)
.filter_monomorphic(exclude_heterozygous=False)
.filter_singletons(exclude_heterozygous=False)
.filter_biallelic(exclude_heterozygous=False)
.resolve()
# Write the filtered VCF to a new file
gd_filt.write_vcf("filtered_output.vcf")
```
### Key Methods in NRemover2
`filter_missing_sample(threshold)`: Filters samples with missing data
above the threshold.
`filter_missing(threshold)`: Filters loci with missing data above the
threshold.
`filter_missing_pop(threshold)`: Filters loci where missing data for any
given population is above the threshold.
`filter_mac(threshold)`: Filters loci with a minor allele count below
the threshold.
`filter_maf(threshold)`: Filters loci with a minor allele frequency
below the threshold.
`filter_monomorphic(exclude_heterozygous)`: Filters monomorphic loci
(sites with only one allele).
`filter_singletons(exclude_heterozygous)`: Filters singletons (sites
with only one occurrence of an allele).
`filter_biallelic(exclude_heterozygous)`: Filters biallelic loci (sites
with only two alleles).
`thin_loci(size)`: Thins loci by removing loci within `size` bases of
each other on the same locus or chromosome (based on input VCF `CHROM`
and `POS` fields). Note that this method only works with VCFReader and
is not available for `PhylipReader` and `StructureReader`. For example,
`thin_loci(100)` will remove all but one locus within 100 bases of eaach
other on the same chromosome.
`filter_linked(size)`: Filters loci that are linked to other loci within
a specified distance (size), only considering the `CHROM` field from the
VCF file and ignoring the `POS` field. This method only works with
VCFReader and is not available for PhylipReader and StructureReader.
`random_subset_loci(size)`: Randomly selects `size` number of loci from
the input dataset, where size is an integer.
`resolve()`: Applies the filters and returns the filtered GenotypeData
object. This method must be called at the end of the filtering chain to
apply the filters.
**Note:**
You must call `resolve()` at the end of the filtering chain to apply the
filters and return the filtered GenotypeData object.
**Note:**
The `exclude_heterozygous` parameter in `filter_monomorphic`,
`filter_singletons`, and `filter_biallelic` methods allows you to
exclude heterozygous genotypes from the filtering process. By default,
heterozygous genotypes are included in the filtering process.
**Note:**
`thin_loci` and `filter_linked` are only available for VCFReader and not
for PhylipReader and StructureReader.
**Warning:**
The `filter_linked(size)` method might yield a limited number of loci
with SNP data. It is recommended to use this method with caution and
check the output carefully.
### Additional Methods in NRemover2
`search_thresholds()` searches a range of filtering thresholds for all
missing data, minor allele frequency (MAF), and minor allele count (MAC)
filters. This method helps you find the optimal thresholds for your
dataset. It will plot the threshold search results so you can visualize
the impact of different thresholds on the dataset.
With `search_thresholds()`, you can specify the thresholds to search for
and the order in which to apply the filters:
``` python
# Initialize NRemover2 with GenotypeData object
nrm = NRemover2(gd)
# Specify filtering thresholds and order of filters
nrm.search_thresholds(
thresholds=[0.25, 0.5, 0.75, 1.0],
maf_thresholds=[0.01, 0.05],
mac_thresholds=[2, 5],
filter_order=[
"filter_missing_sample",
"filter_missing",
"filter_missing_pop",
"filter_mac",
"filter_monomorphic",
"filter_singletons",
"filter_biallelic"
]
)
```
The `search_thresholds()` method will search for the optimal thresholds
for missing data, MAF, and MAC filters based on the specified thresholds
and filter order. It will plot the results so you can visualize the
impact of different thresholds on the dataset.
Below are example plots that are created when running the
`search_thresholds()` method:
Filtering Results for Singletons, Monomorphic Sites, and Biallelic
Sites:
![NRemover2 filtering results for the boolean filtering methods ('filter_monomorphic', 'filter_singletons', and 'filter_biallelic')](snpio/img/filtering_results_bool.png)
Filtering Results for Minor Allele Count (MAC):
![NRemover2 filtering results for the Minor Allele Count filtering method](snpio/img/filtering_results_mac.png)
Filtering Results for Minor Allele Frequency:
![NRemover2 filtering results for the Minor Allele Frequency method](snpio/img/filtering_results_maf.png)
Missing Data Filtering for Loci and Samples:
![NRemover2 filtering results for the missing data methods ('filter_missing' and 'filter_missing_samples'). The 'filter_missing' method filters out columns (loci) exceeding a missing data threshold, whereas the 'filter_missing_sample' method filters out samples (rows) exceeding the threhold.](snpio/img/filtering_results_missing_loci_samples.png)
Missing Data Filtering for Populations:
![NRemover2 filtering results for the 'filter_missing_pop' method, which filters out loci (SNP columns) wherein any given population group exceeds the provided missing data threshold.](snpio/img/filtering_results_missing_population.png)
**Note:**
The `search_thresholds()` method is incompatible with `thin_loci(size)`
and `filter_linked()` being in the filter_order list.
**Warning:**
The `search_thresholds()` method can also be called either before or
after any other filtering, but note that it will reset the filtering
chain to the original state.
`plot_sankey_filtering_report()` generates a Sankey plot to visualize
how SNPs are filtered at each step of the pipeline. For example:
``` python
from snpio import NRemover2, VCFReader
vcf = "snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz"
popmap = "snpio/example_data/popmaps/phylogen_nomx.popmap"
gd = VCFReader(
filename=vcf,
popmapfile=popmap,
force_popmap=True,
verbose=True,
plot_format="png",
plot_fontsize=20,
plot_dpi=300,
despine=True,
prefix="snpio_example"
)
# Initialize NRemover2.
nrm = NRemover2(gd)
# Apply filters to remove samples and loci.
gd_filt = nrm.filter_missing_sample(0.75)
.filter_missing(0.75)
.filter_missing_pop(0.75)
.filter_mac(2)
.filter_monomorphic(exclude_heterozygous=False)
.filter_singletons(exclude_heterozygous=False)
.filter_biallelic(exclude_heterozygous=False)
.resolve()
nrm.plot_sankey_filtering_report()
```
This will automatically track the number of loci at each filtering step
and generate a Sankey plot to visualize the filtering process. The
Sankey plot shows how many loci are removed at each step of the
filtering process. For example:
![Sankey Diagram depicting the number (count) of loci retained (green bands) and removed (red bands) at each NRemover2 filtering step. Band widths are proportional to the number of loci retained and removed at each consecutive step.](snpio/img/nremover_sankey_plot.png)
In the Sankey Diagram above, the green nodes represent the number of
loci remaining after each filtering step, and the red nodes represent
the number of loci removed at each filtering step. The size of each edge
is proportional to the number of loci retained or removed at each step.
The Sankey plot provides a visual representation of the filtering
process and helps you understand how each filtering method affects the
dataset. The filtering order is dynamic based on the order each method was called.
**Note:**
The `plot_sankey_filtering_report()` must be called after filtering and
calling the `resolve()` method to generate the Sankey plot. It is also
incompatible with `thin_loci()`, `filter_linked()`, and
`random_subset_loci()` being in the filter_order list.
`plot_sankey_filtering_report()` only plots loci removed at each
filtering step and does not plot samples removed.
## GenotypeData Properties
Once genotype data is loaded using any of the readers, you can access
several useful properties from the `GenotypeData` object:
`num_snps`: Number of SNPs or loci in the dataset.
`num_inds`: Number of individuals in the dataset.
`populations`: List of populations in the dataset.
`popmap`: Mapping of SampleIDs to PopulationIDs.
`popmap_inverse`: Dictionary with population IDs as keys and lists of
samples as values.
`samples`: List of samples in the dataset.
`snpsdict`: Dictionary with sampleIDs as keys and genotypes as values.
`loci_indices`: Numpy array with boolean values indicating the loci that
passed the filtering criteria set to `True`.
`sample_indices`: Numpy arrray with boolean values indicating the
samples that passed the filtering criteria set to `True`.
`snp_data`: 2D numpy array of SNP data of shape (num_inds, num_snps).
`ref`: List of reference alleles for each locus.
`alt`: List of alternate alleles for each locus.
`inputs`: Dictionary of input parameters used to load the genotype data.
## Genotype Encoding with GenotypeEncoder
SNPio also includes the GenotypeEncoder class for encoding genotype data
into formats useful for downstream analysis and commonly used for
machine and deep learning tasks.
The GenotypeEncoder class provides three encoding properties:
`genotypes_onehot`: Encodes genotype data into one-hot encoding, where
each possible biallelic IUPAC genotype is represented by a one-hot
vector. Heterozygotes are represented as multi-label vectors as follows:
``` python
onehot_dict = {
"A": [1.0, 0.0, 0.0, 0.0],
"T": [0.0, 1.0, 0.0, 0.0],
"G": [0.0, 0.0, 1.0, 0.0],
"C": [0.0, 0.0, 0.0, 1.0],
"N": [0.0, 0.0, 0.0, 0.0],
"W": [0.5, 0.5, 0.0, 0.0],
"R": [0.5, 0.0, 0.5, 0.0],
"M": [0.5, 0.0, 0.0, 0.5],
"K": [0.0, 0.5, 0.5, 0.0],
"Y": [0.0, 0.5, 0.0, 0.5],
"S": [0.0, 0.0, 0.5, 0.5],
"N": [0.0, 0.0, 0.0, 0.0],
}
```
`genotypes_int`: Encodes genotype data into integer encoding, where each
possible biallelic IUPAC genotype is represented by an integer as
follows: as follows:
`A=0, T=1, G=2, C=3, W=4, R=5, M=6, K=7, Y=8, S=9, N=-9`.
`genotypes_012`: Encodes genotype data into 0-1-2 encoding, where 0
represents the homozygous reference genotype, 1 represents the
heterozygous genotype, and 2 represents the homozygous alternate
genotype.
Example Usage:
``` python
from snpio import VCFReader, GenotypeEncoder
vcf = "snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz"
popmap = "snpio/example_data/popmaps/phylogen_nomx.popmap"
gd = VCFReader(
filename=vcf,
popmapfile=popmap,
force_popmap=True,
verbose=True,
plot_format="png",
plot_fontsize=20,
plot_dpi=300,
despine=True,
prefix="snpio_example"
)
encoder = GenotypeEncoder(gd)
# Convert genotype data to one-hot encoding
gt_ohe = encoder.genotypes_onehot
# Convert genotype data to integer encoding
gt_int = encoder.genotypes_int
# Convert genotype data to 0-1-2 encoding.
gt_012 = encoder.genotypes_012
```
The GenotypeEncoder allows you to seamlessly convert genotype data into
different formats depending on your needs for analysis or machine
learning workflows.
You can also inversely convert the encoded data back to the original
genotypes by just setting the GenotypeEncoder properties to a new value.
For example:
``` python
# Convert one-hot encoded data back to genotypes
encoder.genotypes_onehot = gt_ohe
# Convert integer encoded data back to genotypes
encoder.genotypes_int = gt_int
# Convert 0-1-2 encoded data back to genotypes
encoder.genotypes_012 = gt_012
```
This will automatically update the original genotype data in the
GenotypeData object and convert it to the original format stored in the
`snp_data` property of the GenotypeData object.
## Loading and Parsing Phylogenetic TreeParser
SNPio also provides a `TreeParser` class to load and parse phylogenetic trees in Newick and NEXUS formats. The `TreeParser` class can read and parse tree files, modify tree structures, draw trees, and save trees in different formats.
Here are some examples of how to load and parse a phylogenetic tree using the `TreeParser` class:
```python
from snpio import TreeParser, VCFReader
vcf = "snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz"
popmap = "snpio/example_data/popmaps/phylogen_nomx.popmap"
gd = VCFReader(
filename=vcf,
popmapfile=popmap,
force_popmap=True,
verbose=True,
plot_format="pdf",
plot_fontsize=20,
plot_dpi=300,
despine=True,
prefix="snpio_example"
)
# Load a phylogenetic tree from a Newick file
tp = TreeParser(
genotype_data=gd,
treefile="snpio/example_data/trees/test.tre",
siterates="snpio/example_data/trees/test14K.rates",
qmatrix="snpio/example_data/trees/test.iqtree",
verbose=True
)
tree = tp.read_tree()
tree.draw() # Draw the tree
# Save the tree in Newick format
tp.write_tree(tree, save_path="snpio/example_data/trees/test_newick.tre")
# Save the tree in NEXUS format
tp.write_tree(tree, save_path="snpio/example_data/trees/test_nexus.nex", nexus=True)
# Returns the tree in Newick format as a string
tp.write_tree(tree, save_path=None)
# Get the tree stats. Returns a dictionary of tree stats.
print(tp.tree_stats())
# Reroot the tree at any nodes containing the string 'EA' in the sampleID.
# Use the '~' character to specify a regular expression pattern to match.
tp.reroot_tree("~EA")
# Get a distance matrix between all nodes in the tree.
print(tp.get_distance_matrix())
# Get the Rate Matrix Q from the Qmatrix file.
print(tp.qmat)
# Get the Site Rates from the Site Rates file.
print(tp.site_rates)
# Get a subtree with only the samples containing 'EA' in the sampleID.
# Use the '~' character to specify a regular expression pattern to select all
# tips containing the pattern.
subtree = tp.get_subtree("~EA")
# Prune the tree to remove samples containing 'ON' in the sampleID.
pruned_tree = tp.prune_tree("~ON")
# Write the subtree and pruned tree. Returns a Newick string if 'save_path'
# is None. Otherwise saves it to 'save_path'.
print(tp.write_tree(subtree, save_path=None))
print(tp.write_tree(pruned_tree, save_path=None))
```
The `TreeParser` class provides several methods for working with phylogenetic trees, including reading, writing, and modifying trees. You can use these methods to analyze and manipulate phylogenetic trees for your research and analysis tasks.
The `TreeParser` class also provides methods for calculating tree statistics, rerooting trees, getting distance matrices, and extracting subtrees based on sample IDs. These methods can help you analyze and visualize phylogenetic trees and extract relevant information for downstream analysis.
The `Rate matrix Q` and Site Rates can be accessed from the Qmatrix and Site Rates files, respectively. These matrices can be used to calculate evolutionary distances and rates between samples in the phylogenetic tree. The siterates file can be output by IQ-TREE or specified as a one-column file with the rates for each site in the alignment (header optional). The `qmatrix` file can be obtained from the `IQ-TREE` standard output (.iqtree file) or from a stand-alone Qmatrix file with the rate matrix Q. In the latter case, the file should be a tab-delimited or comma-delimited file with the rate matrix Q with substitution rates in the order: "A, "C", "G", "T". A header line is optional.
The rate matrix and site rates objects can be accessed by their corresponding properties:
- `tp.qmat`: Rate matrix Q.
- `tp.site_rates`: Site rates.
For more information on the `TreeParser` class and its methods, please refer to the [API documentation](https://snpio.readthedocs.io).
## Benchmarking the Performance
You can benchmark the filtering performance using the Benchmark class to
visualize how thresholds affect the dataset, if you have installed the
snpio dev requirements:
``` shell
pip install snpio[dev]
```
Then, you can use the Benchmark class to plot performance metrics for
your filtered genotype data after the `resolve()` method is called. For
example:
``` python
from snpio.utils.benchmarking import Benchmark
Benchmark.plot_performance(nrm.genotype_data, nrm.genotype_data.resource_data)
```
This function will plot performance metrics for your filtered genotype
data and for the `VCFReader` class, giving insights into data quality
changes.
For more information on the Benchmark class and how to use it, see the
API documentation.
## Conclusion
This guide provides an overview of how to get started with the SNPio
library. It covers the basic steps to read, manipulate, and analyze
genotype data using the VCFReader, PhylipReader, StructureReader, and
NRemover2 classes. SNPio is designed to simplify the process of handling
genotype data and preparing it for downstream analysis, such as
population genetics, phylogenetics, and machine learning. The library
supports various file formats, including VCF, PHYLIP, and STRUCTURE, and
provides tools for filtering, encoding, and visualizing genotype data.
This guide will help you get up and running with SNPio quickly and
efficiently.
For more information on the SNPio library, please refer to the [API documentation](https://snpio.readthedocs.io) and examples provided in the repository. If you have any questions or feedback, please feel free to reach out to the developers.
We hope you find SNPio useful for your bioinformatic analyses!
**Note:**
The SNPio library is under active development, and we welcome
contributions from the community. If you would like to contribute to the
project, please check the GitHub repository for open issues and submit a
pull request. We appreciate your support and feedback!
If you encounter any issues or have any questions about the SNPio
library, please feel free to reach out to the developers or open an
issue on the GitHub repository. We are here to help and improve the
library based on your feedback.
The SNPio library is licensed under the GPL3 License, and we encourage
you to use it for your research and analysis tasks. If you find the
library useful, please cite it in your publications. We appreciate your
support and feedback!
Raw data
{
"_id": null,
"home_page": null,
"name": "snpio",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.11",
"maintainer_email": null,
"keywords": "genomics, bioinformatics, population genetics, SNP, VCF, PHYLIP, STRUCTURE, missing data, filtering, filter, MAF, minor allele frequency, MAC, minor allele count, biallelic, monomorphic, singleton",
"author": null,
"author_email": "\"Drs. Bradley T. Martin and Tyler K. Chafin\" <evobio721@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/26/d8/2bb2b6a776ed3eabd765b033039950d573a6289b55096f0a4cd18d186878/snpio-1.1.4.tar.gz",
"platform": null,
"description": "# SNPio: A Python API for Population Genetic File Processing, Filtering, and Encoding\n\n## Introduction\n\nThis guide provides an overview of how to get started with the SNPio\nlibrary. It covers the basic steps to read, manipulate, and analyze\ngenotype data using the VCFReader, PhylipReader, StructureReader, and\nNRemover2 classes. SNPio is designed to simplify the process of handling\ngenotype data and preparing it for downstream analysis, such as\npopulation genetics, phylogenetics, and machine learning. The library\nsupports various file formats, including VCF, PHYLIP, and STRUCTURE, and\nprovides tools for filtering, encoding, and visualizing genotype data.\nThis guide will help you get up and running with SNPio quickly and\nefficiently.\n\n`VCFReader`, `PhylipReader`, and `StructureReader` classes are used to\nread genotype data from VCF, PHYLIP, and STRUCTURE files, respectively.\nThese classes load the data into a `GenotypeData` object that has\nvarious useful methods and properties.\n\nThe `NRemover2` class is used to filter genotype data based on various\ncriteria, such as missing data, minor allele count, minor allele\nfrequency, and more. The `GenotypeEncoder` class is used to encode\ngenotype data into different formats, such as one-hot encoding, integer\nencoding, and 0-1-2 encoding, for downstream analysis and machine\nlearning tasks.\n\nBelow is a step-by-step guide to using SNPio to read, filter, and encode\ngenotype data for analysis.\n\n## Installation\n\nBefore using SNPio, ensure it is installed in your Python environment.\nYou can install it using pip. In the project root directory (the\ndirectory containing setup.py), type the following command into your\nterminal:\n\n``` shell\npip install snpio\n```\n\nWe recommend using a virtual environment to manage your Python packages.\nIf you do not have a virtual environment set up, you can create one\nusing the following commands:\n\n``` shell\npython3 -m venv snpio_env\nsource snpio_env/bin/activate\n```\n\nThis will create a virtual environment named `snpio_env` and activate\nit. You can then install SNPio in this virtual environment using the pip\ncommand mentioned above.\n\n**Note:**\n\nSNPio does not support Windows operating systems at the moment. We\nrecommend using a Unix-based operating system such as Linux or macOS.\n\n**Note:**\n\nWe aim to support anaconda environments in the future. For now, we\nrecommend using a virtual environment with `pip` to install SNPio.\n\n## Importing SNPio\n\nTo start using SNPio, import the necessary modules:\n\n``` python\n# Import the necessary modules\nfrom snpio import (\n NRemover2, \n VCFReader, \n PhylipReader, \n StructureReader, \n Plotting, \n GenotypeEncoder,\n)\n```\n\nExample usage:\n\n``` python\n# Define input filenames\nvcf = \"snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz\" \npopmap = \"snpio/example_data/popmaps/phylogen_nomx.popmap\" \n\n# Load the genotype data from a VCF file\ngd = VCFReader(\n filename=vcf, \n popmapfile=popmap, \n force_popmap=True, \n verbose=True, \n plot_format=\"png\", \n plot_fontsize=20, \n plot_dpi=300, \n despine=True, \n prefix=\"snpio_example\"\n)\n```\n\nYou can also include or exclude any populations from the analysis by\nusing the `include_pops` and `exclude_pops` parameters in the reader\nclasses. For example:\n\n``` python\n# Only include the populations \"ON\", \"DS\", \"EA\", \"GU\", and \"TT\"\n# Exclude the populations \"MX\", \"YU\", and \"CH\"\ngd = VCFReader(\n filename=vcf, \n popmapfile=popmap, \n force_popmap=True, \n verbose=True, \n plot_format=\"png\", \n plot_fontsize=20, \n plot_dpi=300, \n despine=True, \n prefix=\"snpio_example\", \n include_pops=[\"ON\", \"DS\", \"EA\", \"GU\"], \n exclude_pops=[\"MX\", \"YU\", \"CH\"],\n)\n```\n\nThe `include_pops` and `exclude_pops` parameters are optional and can be\nused to filter the populations included in the analysis. If both\nparameters are provided, the populations in `include_pops` will be\nincluded, and the populations in `exclude_pops` will be excluded.\nHowever, populations cannot overlap between lists.\n\n## Important Notes\n\n- The `VCFReader`, `PhylipReader`, `StructureReader`, `NRemover2`, and `GenotypeEncoder` classes treat the following characters as missing data: \n\n - \"N\"\n - \".\"\n - \"?\"\n - \"-\"\n\n- The `VCFReader` class can read both uncompressed and compressed VCF\nfiles (gzipped). If your input file is in PHYLIP or STRUCTURE format, it\nwill be forced to be biallelic. To handle more than two alleles per\nsite, use the VCF format.\n\n## The Population Map File\n\nTo use `VCFReader`, `PhylipReader`, or `StructureReader`, you can\noptionally use a population map (popmap) file. This is a simple\ntwo-column, whitespace-delimited or comma-delimited file with SampleIDs\nin the first column and the corresponding PopulationIDs in the second\ncolumn. It can optionally contain a header line, with the first column\nlabeled \"SampleID\" and the second column labeled \"PopulationID\"\n(case-insensitive). The population IDs can be any string, such as\n\"Population1\", \"Population2\", etc, or an integer. SampleIDs must match\nthe sample names in the alignment file.\n\nFor example:\n\n``` shell\nSample1,Population1\nSample2,Population1\nSample3,Population2\nSample4,Population2\n```\n\nOr, with a header:\n\n``` shell\nSampleID,PopulationID\nSample1,Population1\nSample2,Population1\nSample3,Population2\nSample4,Population2\n```\n\nThe population map file is used to assign samples to populations and is\nuseful for filtering and visualizing genotype data by population. If you\ndo not provide a population map file, the samples will be treated as a\nsingle population.\n\nThe population map file can be provided as an argument to the reader\nclasses. For example:\n\n``` python\nvcf = \"snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz\" \npopmap = \"snpio/example_data/popmaps/phylogen_nomx.popmap\" \n\ngd = VCFReader(\n filename=vcf, \n popmapfile=popmap, \n force_popmap=True, \n verbose=True, \n plot_format=\"png\", \n plot_fontsize=20, \n plot_dpi=300, \n despine=True, \n prefix=\"snpio_example\"\n)\n```\n\n**Note:**\n\nThe `force_popmap` parameter in the reader classes is used to force the\npopulation map file to align with the samples in the alignment without\nan error. If set to `False`, the population map file must match the\nsamples in the alignment exactly, and if they do not match, an error\nwill be raised. If set to `True`, the population map file will be forced\nto align with the samples in the alignment by removing extra samples.\nThis parameter is set to `False` by default.\n\nThe `verbose` parameter in the reader classes is used to print\nadditional information about the genotype data and filtering steps.\n\nThe `plot_format`, `plot_fontsize`, `plot_dpi`, and `despine` parameters\nin the reader classes are used to customize the output plots generated\nby the reader classes. See API documentation for more details.\n\n## Reading Genotype Data\n\nSNPio provides readers for different file formats. Here are examples of\nhow to read genotype data from various file formats:\n\n### VCFReader\n\n``` python\nvcf = \"snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz\" \npopmap = \"snpio/example_data/popmaps/phylogen_nomx.popmap\" \n\ngd = VCFReader(\n filename=vcf, \n popmapfile=popmap, \n force_popmap=True, \n verbose=True, \n plot_format=\"png\", \n plot_fontsize=20, \n plot_dpi=300, \n despine=True, \n prefix=\"snpio_example\", \n exclude_pops=[\"MX\", \"YU\", \"CH\"], \n include_pops=[\"ON\", \"DS\", \"EA\", \"GU\", \"TT\"],\n)\n```\n\nThis will read the genotype data from a VCF file and apply the\npopulation map if provided.\n\n### PhylipReader\n\nIf you would like to read a Phylip file, you can use the `PhylipReader`\nclass:\n\n``` python\nphylip = \"snpio/example_data/phylip_files/phylogen_subset14K.phy\" \npopmap = \"snpio/example_data/popmaps/phylogen_nomx.popmap\" \n\ngd = PhylipReader(\n filename=phylip, \n popmapfile=popmap, \n force_popmap=True, \n verbose=True, \n plot_format=\"png\", \n plot_fontsize=20, \n plot_dpi=300, \n despine=True, \n prefix=\"snpio_example\", \n exclude_pops=[\"MX\", \"YU\", \"CH\"], \n include_pops=[\"ON\", \"DS\", \"EA\", \"GU\", \"TT\"],\n)\n```\n\n### StructureReader\n\nIf you would like to read in a Structure file, you can use the\n`StructureReader` class. For example:\n\n``` python\nstructure = \"snpio/example_data/structure_files/phylogen_subset14K.str\" \npopmap = \"snpio/example_data/popmaps/phylogen_nomx.popmap\" \n\ngd = StructureReader(\n filename=structure, \n popmapfile=popmap, \n force_popmap=True, \n verbose=True, \n plot_format=\"png\",\n plot_fontsize=20, \n plot_dpi=300, \n despine=True, \n prefix=\"snpio_example\", \n exclude_pops=[\"MX\", \"YU\", \"CH\"], \n include_pops=[\"ON\", \"DS\", \"EA\", \"GU\", \"TT\"],\n)\n```\n\n**Note:**\n\nThe `StructureReader` class will automatically detect the format of the\nSTRUCTURE file. It can be in one-line or two-line format (see STRUCTURE\ndocumentation), and can optionally contain population information in the\nfile as the second tab-delimited column. If the population information\nis not provided in the STRUCTURE file, you can provide a population map\nfile to assign samples to populations.\n\n### Key Methods in VCFReader, PhylipReader, and StructureReader\n\n`VCFReader(filename, popmapfile, force_popmap, ...)`: Reads and writes\ngenotype data from/ to a VCF file and applies a population map if\nprovided.\n\n`write_vcf(output_file)`: Writes the filtered or modified genotype data\nback to a VCF file (for all three readers).\n\n`PhylipReader(filename, popmapfile, force_popmap, ...)`: Reads and\nwrites genotype data from/ to a PHYLIP file and applies a population\nmap.\n\n`write_phylip(output_file)`: Writes the filtered or modified genotype\ndata back to a PHYLIP file (for PhylipReader).\n\n`StructureReader(filename, popmapfile, force_popmap, ...)`: Reads and\nwrites genotype data from/ to a STRUCTURE file and applies a population\nmap.\n\n`write_structure(output_file)`: Writes the filtered or modified genotype\ndata back to a STRUCTURE file (for StructureReader).\n\n**Note:**\n\nThe `write_vcf`, `write_phylip`, and `write_structure` methods are used\nto write the filtered or modified genotype data back to a VCF, PHYLIP,\nor STRUCTURE file, respectively. **These methods can also be used to\nconvert between file VCF, PHYLIP, and STRUCTURE formats.**\n\n## Other GenotypeData Methods\n\nThe `GenotypeData` along with the `Plotting` classes have several useful\nmethods for working with genotype data:\n\n- `Plotting.run_pca()`: Runs principal component analysis (PCA) on the\n genotype data and plots the results. The PCA plot can help visualize\n the genetic structure of the populations in the dataset, with each\n point representing an individual. Individuals are colored by missing\n data proportion, and populations are represented by different\n shapes. A 2-dimensional PCA plot is generated by default, but you\n can specify three PCA axes as well. For example:\n\n![Principle Component Analysis (PCA) colored by missingness proportion. Shapes depict distinct populations.](snpio/img/pca_missingness.png)\n\n- `GenotypeData.missingness_reports()`: Generates missing data reports\n and plots for the dataset. The reports include the proportion of\n missing data per individual, per locus, and per population. These\n reports can help you identify samples, loci, or populations with\n high levels of missing data. For example:\n\n![Plots depicting missing data proportions for samples and loci (SNPs).](snpio/img/missingness_report.png)\n\n- The `GenotypeData` class will automatically create a plot showing\n the number of inidviduals present in each population, if a\n `popmapfile` is provided. For example:\n\n![Bar plot depicting counts per populations as provided in the population map file.](snpio/img/population_counts.png)\n\n## Filtering Genotype Data with NRemover2\n\nNRemover2 provides a variety of filtering methods to clean your genotype\ndata. Here is an example of how to apply filters to remove samples and\nloci with too much missing data, monomorphic sites, singletons, minor\nallele count (MAC), minor allele frequency (MAF), and more:\n\n``` python\n# Apply filters to remove samples and loci with too much missing data\ngd_filt = nrm.filter_missing_sample(0.75)\n .filter_missing(0.75)\n .filter_missing_pop(0.75)\n .filter_mac(2)\n .filter_monomorphic(exclude_heterozygous=False)\n .filter_singletons(exclude_heterozygous=False)\n .filter_biallelic(exclude_heterozygous=False)\n .resolve()\n\n# Write the filtered VCF to a new file\ngd_filt.write_vcf(\"filtered_output.vcf\")\n```\n\n### Key Methods in NRemover2\n\n`filter_missing_sample(threshold)`: Filters samples with missing data\nabove the threshold.\n\n`filter_missing(threshold)`: Filters loci with missing data above the\nthreshold.\n\n`filter_missing_pop(threshold)`: Filters loci where missing data for any\ngiven population is above the threshold.\n\n`filter_mac(threshold)`: Filters loci with a minor allele count below\nthe threshold.\n\n`filter_maf(threshold)`: Filters loci with a minor allele frequency\nbelow the threshold.\n\n`filter_monomorphic(exclude_heterozygous)`: Filters monomorphic loci\n(sites with only one allele).\n\n`filter_singletons(exclude_heterozygous)`: Filters singletons (sites\nwith only one occurrence of an allele).\n\n`filter_biallelic(exclude_heterozygous)`: Filters biallelic loci (sites\nwith only two alleles).\n\n`thin_loci(size)`: Thins loci by removing loci within `size` bases of\neach other on the same locus or chromosome (based on input VCF `CHROM`\nand `POS` fields). Note that this method only works with VCFReader and\nis not available for `PhylipReader` and `StructureReader`. For example,\n`thin_loci(100)` will remove all but one locus within 100 bases of eaach\nother on the same chromosome.\n\n`filter_linked(size)`: Filters loci that are linked to other loci within\na specified distance (size), only considering the `CHROM` field from the\nVCF file and ignoring the `POS` field. This method only works with\nVCFReader and is not available for PhylipReader and StructureReader.\n\n`random_subset_loci(size)`: Randomly selects `size` number of loci from\nthe input dataset, where size is an integer.\n\n`resolve()`: Applies the filters and returns the filtered GenotypeData\nobject. This method must be called at the end of the filtering chain to\napply the filters.\n\n**Note:**\n\nYou must call `resolve()` at the end of the filtering chain to apply the\nfilters and return the filtered GenotypeData object.\n\n**Note:**\n\nThe `exclude_heterozygous` parameter in `filter_monomorphic`,\n`filter_singletons`, and `filter_biallelic` methods allows you to\nexclude heterozygous genotypes from the filtering process. By default,\nheterozygous genotypes are included in the filtering process.\n\n**Note:**\n\n`thin_loci` and `filter_linked` are only available for VCFReader and not\nfor PhylipReader and StructureReader.\n\n**Warning:**\n\nThe `filter_linked(size)` method might yield a limited number of loci\nwith SNP data. It is recommended to use this method with caution and\ncheck the output carefully.\n\n### Additional Methods in NRemover2\n\n`search_thresholds()` searches a range of filtering thresholds for all\nmissing data, minor allele frequency (MAF), and minor allele count (MAC)\nfilters. This method helps you find the optimal thresholds for your\ndataset. It will plot the threshold search results so you can visualize\nthe impact of different thresholds on the dataset.\n\nWith `search_thresholds()`, you can specify the thresholds to search for\nand the order in which to apply the filters:\n\n``` python\n# Initialize NRemover2 with GenotypeData object\nnrm = NRemover2(gd)\n\n# Specify filtering thresholds and order of filters\nnrm.search_thresholds(\n thresholds=[0.25, 0.5, 0.75, 1.0],\n maf_thresholds=[0.01, 0.05], \n mac_thresholds=[2, 5], \n filter_order=[\n \"filter_missing_sample\", \n \"filter_missing\", \n \"filter_missing_pop\", \n \"filter_mac\", \n \"filter_monomorphic\", \n \"filter_singletons\", \n \"filter_biallelic\"\n ]\n)\n```\n\nThe `search_thresholds()` method will search for the optimal thresholds\nfor missing data, MAF, and MAC filters based on the specified thresholds\nand filter order. It will plot the results so you can visualize the\nimpact of different thresholds on the dataset.\n\nBelow are example plots that are created when running the\n`search_thresholds()` method:\n\nFiltering Results for Singletons, Monomorphic Sites, and Biallelic\nSites:\n\n![NRemover2 filtering results for the boolean filtering methods ('filter_monomorphic', 'filter_singletons', and 'filter_biallelic')](snpio/img/filtering_results_bool.png)\n\nFiltering Results for Minor Allele Count (MAC):\n\n![NRemover2 filtering results for the Minor Allele Count filtering method](snpio/img/filtering_results_mac.png)\n\nFiltering Results for Minor Allele Frequency:\n\n![NRemover2 filtering results for the Minor Allele Frequency method](snpio/img/filtering_results_maf.png)\n\nMissing Data Filtering for Loci and Samples:\n\n![NRemover2 filtering results for the missing data methods ('filter_missing' and 'filter_missing_samples'). The 'filter_missing' method filters out columns (loci) exceeding a missing data threshold, whereas the 'filter_missing_sample' method filters out samples (rows) exceeding the threhold.](snpio/img/filtering_results_missing_loci_samples.png)\n\nMissing Data Filtering for Populations:\n\n![NRemover2 filtering results for the 'filter_missing_pop' method, which filters out loci (SNP columns) wherein any given population group exceeds the provided missing data threshold.](snpio/img/filtering_results_missing_population.png)\n\n**Note:**\n\nThe `search_thresholds()` method is incompatible with `thin_loci(size)`\nand `filter_linked()` being in the filter_order list.\n\n**Warning:**\n\nThe `search_thresholds()` method can also be called either before or\nafter any other filtering, but note that it will reset the filtering\nchain to the original state.\n\n`plot_sankey_filtering_report()` generates a Sankey plot to visualize\nhow SNPs are filtered at each step of the pipeline. For example:\n\n``` python\nfrom snpio import NRemover2, VCFReader\n\nvcf = \"snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz\"\npopmap = \"snpio/example_data/popmaps/phylogen_nomx.popmap\"\n\ngd = VCFReader(\n filename=vcf, \n popmapfile=popmap, \n force_popmap=True, \n verbose=True, \n plot_format=\"png\",\n plot_fontsize=20, \n plot_dpi=300, \n despine=True, \n prefix=\"snpio_example\"\n)\n\n# Initialize NRemover2.\nnrm = NRemover2(gd)\n\n# Apply filters to remove samples and loci.\ngd_filt = nrm.filter_missing_sample(0.75)\n .filter_missing(0.75)\n .filter_missing_pop(0.75)\n .filter_mac(2)\n .filter_monomorphic(exclude_heterozygous=False)\n .filter_singletons(exclude_heterozygous=False)\n .filter_biallelic(exclude_heterozygous=False)\n .resolve()\n\nnrm.plot_sankey_filtering_report()\n```\n\nThis will automatically track the number of loci at each filtering step\nand generate a Sankey plot to visualize the filtering process. The\nSankey plot shows how many loci are removed at each step of the\nfiltering process. For example:\n\n![Sankey Diagram depicting the number (count) of loci retained (green bands) and removed (red bands) at each NRemover2 filtering step. Band widths are proportional to the number of loci retained and removed at each consecutive step.](snpio/img/nremover_sankey_plot.png)\n\nIn the Sankey Diagram above, the green nodes represent the number of\nloci remaining after each filtering step, and the red nodes represent\nthe number of loci removed at each filtering step. The size of each edge\nis proportional to the number of loci retained or removed at each step.\nThe Sankey plot provides a visual representation of the filtering\nprocess and helps you understand how each filtering method affects the\ndataset. The filtering order is dynamic based on the order each method was called.\n\n**Note:**\n\nThe `plot_sankey_filtering_report()` must be called after filtering and\ncalling the `resolve()` method to generate the Sankey plot. It is also\nincompatible with `thin_loci()`, `filter_linked()`, and\n`random_subset_loci()` being in the filter_order list.\n\n`plot_sankey_filtering_report()` only plots loci removed at each\nfiltering step and does not plot samples removed.\n\n## GenotypeData Properties\n\nOnce genotype data is loaded using any of the readers, you can access\nseveral useful properties from the `GenotypeData` object:\n\n`num_snps`: Number of SNPs or loci in the dataset.\n\n`num_inds`: Number of individuals in the dataset.\n\n`populations`: List of populations in the dataset.\n\n`popmap`: Mapping of SampleIDs to PopulationIDs.\n\n`popmap_inverse`: Dictionary with population IDs as keys and lists of\nsamples as values.\n\n`samples`: List of samples in the dataset.\n\n`snpsdict`: Dictionary with sampleIDs as keys and genotypes as values.\n\n`loci_indices`: Numpy array with boolean values indicating the loci that\npassed the filtering criteria set to `True`.\n\n`sample_indices`: Numpy arrray with boolean values indicating the\nsamples that passed the filtering criteria set to `True`.\n\n`snp_data`: 2D numpy array of SNP data of shape (num_inds, num_snps).\n\n`ref`: List of reference alleles for each locus.\n\n`alt`: List of alternate alleles for each locus.\n\n`inputs`: Dictionary of input parameters used to load the genotype data.\n\n## Genotype Encoding with GenotypeEncoder\n\nSNPio also includes the GenotypeEncoder class for encoding genotype data\ninto formats useful for downstream analysis and commonly used for\nmachine and deep learning tasks.\n\nThe GenotypeEncoder class provides three encoding properties:\n\n`genotypes_onehot`: Encodes genotype data into one-hot encoding, where\neach possible biallelic IUPAC genotype is represented by a one-hot\nvector. Heterozygotes are represented as multi-label vectors as follows:\n\n``` python\nonehot_dict = {\n \"A\": [1.0, 0.0, 0.0, 0.0],\n \"T\": [0.0, 1.0, 0.0, 0.0],\n \"G\": [0.0, 0.0, 1.0, 0.0],\n \"C\": [0.0, 0.0, 0.0, 1.0],\n \"N\": [0.0, 0.0, 0.0, 0.0],\n \"W\": [0.5, 0.5, 0.0, 0.0],\n \"R\": [0.5, 0.0, 0.5, 0.0],\n \"M\": [0.5, 0.0, 0.0, 0.5],\n \"K\": [0.0, 0.5, 0.5, 0.0],\n \"Y\": [0.0, 0.5, 0.0, 0.5],\n \"S\": [0.0, 0.0, 0.5, 0.5],\n \"N\": [0.0, 0.0, 0.0, 0.0],\n}\n```\n\n`genotypes_int`: Encodes genotype data into integer encoding, where each\npossible biallelic IUPAC genotype is represented by an integer as\nfollows: as follows:\n`A=0, T=1, G=2, C=3, W=4, R=5, M=6, K=7, Y=8, S=9, N=-9`.\n\n`genotypes_012`: Encodes genotype data into 0-1-2 encoding, where 0\nrepresents the homozygous reference genotype, 1 represents the\nheterozygous genotype, and 2 represents the homozygous alternate\ngenotype.\n\nExample Usage:\n\n``` python\nfrom snpio import VCFReader, GenotypeEncoder\n\nvcf = \"snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz\"\npopmap = \"snpio/example_data/popmaps/phylogen_nomx.popmap\"\n\ngd = VCFReader(\n filename=vcf, \n popmapfile=popmap, \n force_popmap=True, \n verbose=True, \n plot_format=\"png\", \n plot_fontsize=20, \n plot_dpi=300, \n despine=True, \n prefix=\"snpio_example\"\n)\n\nencoder = GenotypeEncoder(gd)\n\n# Convert genotype data to one-hot encoding\ngt_ohe = encoder.genotypes_onehot\n\n# Convert genotype data to integer encoding\ngt_int = encoder.genotypes_int\n\n# Convert genotype data to 0-1-2 encoding.\ngt_012 = encoder.genotypes_012\n```\n\nThe GenotypeEncoder allows you to seamlessly convert genotype data into\ndifferent formats depending on your needs for analysis or machine\nlearning workflows.\n\nYou can also inversely convert the encoded data back to the original\ngenotypes by just setting the GenotypeEncoder properties to a new value.\nFor example:\n\n``` python\n# Convert one-hot encoded data back to genotypes\nencoder.genotypes_onehot = gt_ohe\n\n# Convert integer encoded data back to genotypes\nencoder.genotypes_int = gt_int\n\n# Convert 0-1-2 encoded data back to genotypes\nencoder.genotypes_012 = gt_012\n```\n\nThis will automatically update the original genotype data in the\nGenotypeData object and convert it to the original format stored in the\n`snp_data` property of the GenotypeData object.\n\n## Loading and Parsing Phylogenetic TreeParser\n\nSNPio also provides a `TreeParser` class to load and parse phylogenetic trees in Newick and NEXUS formats. The `TreeParser` class can read and parse tree files, modify tree structures, draw trees, and save trees in different formats.\n\nHere are some examples of how to load and parse a phylogenetic tree using the `TreeParser` class:\n\n```python\nfrom snpio import TreeParser, VCFReader\n\nvcf = \"snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz\"\npopmap = \"snpio/example_data/popmaps/phylogen_nomx.popmap\"\n\ngd = VCFReader(\n filename=vcf,\n popmapfile=popmap,\n force_popmap=True,\n verbose=True,\n plot_format=\"pdf\",\n plot_fontsize=20,\n plot_dpi=300,\n despine=True,\n prefix=\"snpio_example\"\n)\n\n# Load a phylogenetic tree from a Newick file\ntp = TreeParser(\n genotype_data=gd,\n treefile=\"snpio/example_data/trees/test.tre\",\n siterates=\"snpio/example_data/trees/test14K.rates\",\n qmatrix=\"snpio/example_data/trees/test.iqtree\",\n verbose=True\n)\n\ntree = tp.read_tree()\ntree.draw() # Draw the tree\n\n# Save the tree in Newick format\ntp.write_tree(tree, save_path=\"snpio/example_data/trees/test_newick.tre\")\n\n# Save the tree in NEXUS format\ntp.write_tree(tree, save_path=\"snpio/example_data/trees/test_nexus.nex\", nexus=True)\n\n# Returns the tree in Newick format as a string\ntp.write_tree(tree, save_path=None)\n\n# Get the tree stats. Returns a dictionary of tree stats.\nprint(tp.tree_stats())\n\n# Reroot the tree at any nodes containing the string 'EA' in the sampleID.\n# Use the '~' character to specify a regular expression pattern to match.\ntp.reroot_tree(\"~EA\")\n\n# Get a distance matrix between all nodes in the tree.\nprint(tp.get_distance_matrix())\n\n# Get the Rate Matrix Q from the Qmatrix file.\nprint(tp.qmat)\n\n# Get the Site Rates from the Site Rates file.\nprint(tp.site_rates)\n\n# Get a subtree with only the samples containing 'EA' in the sampleID.\n# Use the '~' character to specify a regular expression pattern to select all\n# tips containing the pattern.\nsubtree = tp.get_subtree(\"~EA\")\n\n# Prune the tree to remove samples containing 'ON' in the sampleID.\npruned_tree = tp.prune_tree(\"~ON\")\n\n# Write the subtree and pruned tree. Returns a Newick string if 'save_path'\n# is None. Otherwise saves it to 'save_path'.\nprint(tp.write_tree(subtree, save_path=None))\nprint(tp.write_tree(pruned_tree, save_path=None))\n```\n\nThe `TreeParser` class provides several methods for working with phylogenetic trees, including reading, writing, and modifying trees. You can use these methods to analyze and manipulate phylogenetic trees for your research and analysis tasks.\n\nThe `TreeParser` class also provides methods for calculating tree statistics, rerooting trees, getting distance matrices, and extracting subtrees based on sample IDs. These methods can help you analyze and visualize phylogenetic trees and extract relevant information for downstream analysis.\n\nThe `Rate matrix Q` and Site Rates can be accessed from the Qmatrix and Site Rates files, respectively. These matrices can be used to calculate evolutionary distances and rates between samples in the phylogenetic tree. The siterates file can be output by IQ-TREE or specified as a one-column file with the rates for each site in the alignment (header optional). The `qmatrix` file can be obtained from the `IQ-TREE` standard output (.iqtree file) or from a stand-alone Qmatrix file with the rate matrix Q. In the latter case, the file should be a tab-delimited or comma-delimited file with the rate matrix Q with substitution rates in the order: \"A, \"C\", \"G\", \"T\". A header line is optional.\n\nThe rate matrix and site rates objects can be accessed by their corresponding properties:\n\n- `tp.qmat`: Rate matrix Q.\n- `tp.site_rates`: Site rates.\n\nFor more information on the `TreeParser` class and its methods, please refer to the [API documentation](https://snpio.readthedocs.io).\n\n## Benchmarking the Performance\n\nYou can benchmark the filtering performance using the Benchmark class to\nvisualize how thresholds affect the dataset, if you have installed the\nsnpio dev requirements:\n\n``` shell\npip install snpio[dev]\n```\n\nThen, you can use the Benchmark class to plot performance metrics for\nyour filtered genotype data after the `resolve()` method is called. For\nexample:\n\n``` python\nfrom snpio.utils.benchmarking import Benchmark \n\nBenchmark.plot_performance(nrm.genotype_data, nrm.genotype_data.resource_data)\n```\n\nThis function will plot performance metrics for your filtered genotype\ndata and for the `VCFReader` class, giving insights into data quality\nchanges.\n\nFor more information on the Benchmark class and how to use it, see the\nAPI documentation.\n\n## Conclusion\n\nThis guide provides an overview of how to get started with the SNPio\nlibrary. It covers the basic steps to read, manipulate, and analyze\ngenotype data using the VCFReader, PhylipReader, StructureReader, and\nNRemover2 classes. SNPio is designed to simplify the process of handling\ngenotype data and preparing it for downstream analysis, such as\npopulation genetics, phylogenetics, and machine learning. The library\nsupports various file formats, including VCF, PHYLIP, and STRUCTURE, and\nprovides tools for filtering, encoding, and visualizing genotype data.\nThis guide will help you get up and running with SNPio quickly and\nefficiently. \n\nFor more information on the SNPio library, please refer to the [API documentation](https://snpio.readthedocs.io) and examples provided in the repository. If you have any questions or feedback, please feel free to reach out to the developers. \n\nWe hope you find SNPio useful for your bioinformatic analyses!\n\n**Note:**\n\nThe SNPio library is under active development, and we welcome\ncontributions from the community. If you would like to contribute to the\nproject, please check the GitHub repository for open issues and submit a\npull request. We appreciate your support and feedback!\n\nIf you encounter any issues or have any questions about the SNPio\nlibrary, please feel free to reach out to the developers or open an\nissue on the GitHub repository. We are here to help and improve the\nlibrary based on your feedback.\n\nThe SNPio library is licensed under the GPL3 License, and we encourage\nyou to use it for your research and analysis tasks. If you find the\nlibrary useful, please cite it in your publications. We appreciate your\nsupport and feedback!\n",
"bugtrack_url": null,
"license": "GPL-3.0-or-later",
"summary": "Reads and writes VCF, PHYLIP, and STRUCTURE files and performs data filtering on the alignment.",
"version": "1.1.4",
"project_urls": {
"Bug Tracker": "https://github.com/btmartin721/SNPio/issues",
"Source Code": "https://github.com/btmartin721/SNPio"
},
"split_keywords": [
"genomics",
" bioinformatics",
" population genetics",
" snp",
" vcf",
" phylip",
" structure",
" missing data",
" filtering",
" filter",
" maf",
" minor allele frequency",
" mac",
" minor allele count",
" biallelic",
" monomorphic",
" singleton"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "664ea3ca6c9b1b7da92ca5c22f9b81ba945d95ec4e7fb1a4394f774e03ccf076",
"md5": "c7df0a9aaf2b1ccc8fe9fd6faab118b5",
"sha256": "2eedba81c990e05c798ae93d6eb54e81915d5124635e1a0853d5edaed8015a0e"
},
"downloads": -1,
"filename": "snpio-1.1.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "c7df0a9aaf2b1ccc8fe9fd6faab118b5",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.11",
"size": 74545694,
"upload_time": "2024-10-27T19:28:42",
"upload_time_iso_8601": "2024-10-27T19:28:42.048541Z",
"url": "https://files.pythonhosted.org/packages/66/4e/a3ca6c9b1b7da92ca5c22f9b81ba945d95ec4e7fb1a4394f774e03ccf076/snpio-1.1.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "26d82bb2b6a776ed3eabd765b033039950d573a6289b55096f0a4cd18d186878",
"md5": "da55e62ce6ca6c430cd4252c6370a5ca",
"sha256": "58688ba32047646fb608ba68d5ea6f8b771a37c0cf85381b35d947aa64f991b1"
},
"downloads": -1,
"filename": "snpio-1.1.4.tar.gz",
"has_sig": false,
"md5_digest": "da55e62ce6ca6c430cd4252c6370a5ca",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.11",
"size": 71178992,
"upload_time": "2024-10-27T19:29:10",
"upload_time_iso_8601": "2024-10-27T19:29:10.906347Z",
"url": "https://files.pythonhosted.org/packages/26/d8/2bb2b6a776ed3eabd765b033039950d573a6289b55096f0a4cd18d186878/snpio-1.1.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-27 19:29:10",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "btmartin721",
"github_project": "SNPio",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "snpio"
}