snpio

- Name: snpio
- Version: 1.0.5.2
- Home page: https://github.com/btmartin721/SNPio
- Summary: Reads and writes VCF, PHYLIP, and STRUCTURE files and performs data filtering on the alignment.
- Upload time: 2023-09-17 09:53:05
- Authors: Bradley T. Martin and Tyler K. Chafin
- Requires Python: >=3.8
- License: GPL3
- Keywords: genomics, bioinformatics, population genetics, snp, vcf, phylip, structure, missing data, filtering, maf, biallelic

<img src="https://github.com/btmartin721/SNPio/blob/master/img/snpio_logo.png" width="50%" alt="SNPio logo">

# SNPio
API to read, write, and filter PHYLIP, STRUCTURE, and VCF files using a GenotypeData object.

In addition to the below tutorial, see our [API Documentation](https://snpio.readthedocs.io/en/latest/#) for more information.

# Getting Started

This guide provides an overview of how to get started with the SNPio library. It covers the basic steps to read, manipulate, and analyze genotype data using the `GenotypeData` class.

## Installation

Before using SNPio, make sure it is installed in your Python environment. You can install it from PyPI with pip by typing the following command into your terminal:

```
pip install snpio
```

## Importing SNPio

To start using SNPio, import the necessary modules:

```
from snpio import GenotypeData
from snpio import Plotting
```

> **Important Notes:** GenotypeData and NRemover2 treat gap ('-', '?', '.') and 'N' characters as missing data. Also, if your input file is PHYLIP or STRUCTURE, all sites will be forced to be biallelic. If you need more than two alleles per site, you must use the VCF file format, and even then some of the transformations force all sites to be biallelic.

## The Population Map File

To use `GenotypeData` you'll need a population map (popmap) file. It is basically just a two-column tab-delimited file with SampleIDs in the first column and the corresponding PopulationIDs in the second column. 

For example:

```
Sample1\tPopulation1
Sample2\tPopulation1
Sample3\tPopulation2
Sample4\tPopulation2
...
```
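To make the popmap layout concrete, here is a minimal standard-library sketch that parses such a file into the two dictionary shapes described later in this guide (sample-to-population and population-to-samples). It is only an illustration, not SNPio's own parser: `read_popmap` and `invert_popmap` are hypothetical helper names, and it assumes that `\t` in the example above stands for a real tab character.

```python
import io
from collections import defaultdict


def read_popmap(handle):
    """Parse a two-column, tab-delimited popmap into {sample_id: population_id}.

    Hypothetical helper for illustration; SNPio's own parsing may differ.
    """
    popmap = {}
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # Skip blank lines.
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"Line {line_number}: expected 2 tab-separated columns, got {len(fields)}"
            )
        sample_id, population_id = fields[0].strip(), fields[1].strip()
        if sample_id in popmap:
            raise ValueError(f"Line {line_number}: duplicate sample ID {sample_id!r}")
        popmap[sample_id] = population_id
    return popmap


def invert_popmap(popmap):
    """Build {population_id: [sample_id, ...]}, keeping samples in input order."""
    inverse = defaultdict(list)
    for sample_id, population_id in popmap.items():
        inverse[population_id].append(sample_id)
    return dict(inverse)


# An in-memory stand-in for a popmap file on disk.
example = io.StringIO(
    "Sample1\tPopulation1\n"
    "Sample2\tPopulation1\n"
    "Sample3\tPopulation2\n"
    "Sample4\tPopulation2\n"
)
popmap = read_popmap(example)
popmap_inverse = invert_popmap(popmap)
print(popmap)
print(popmap_inverse)
```

Rejecting duplicate sample IDs and malformed lines up front is a design choice for this sketch; it surfaces popmap mistakes before they show up as missing or mislabeled samples downstream.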

## Optional Input files

There are some other optional input files you can provide as well. These include a phylogenetic tree in NEWICK format and the site rates and Q-matrix obtained from running IQ-TREE. The latter two can be found in the output from IQ-TREE.

Currently, we don't have functionality to do any analyses on the tree, site_rates, and q-matrix objects, but we plan to implement more features that incorporate them in the future.

## Reading Alignment with Genotype Data

The first step is to read genotype data from an alignment. The `GenotypeData` class can read and write PHYLIP, STRUCTURE, and VCF files. VCF files can be either compressed with bgzip or uncompressed. GenotypeData can also convert between these three file formats and make some informative plots. An example script, `run_snpio.py`, is provided to showcase some of SNPio's functionality.

The files referenced in the code blocks below can be found in the provided `example_data/` directory.

The `GenotypeData` class provides methods to read and write data in various formats, such as VCF, PHYLIP, STRUCTURE, and custom formats. Here's an example of reading genotype data from a PHYLIP file:

```
import numpy as np

from snpio import GenotypeData
from snpio import Plotting

# Read the alignment, popmap, and tree files
gd = GenotypeData(
    filename="example_data/phylip_files/phylogen_nomx.u.snps.phy",
    popmapfile="example_data/popmaps/phylogen_nomx.popmap",
    force_popmap=True,
    filetype="auto",
    qmatrix_iqtree="example_data/trees/test.qmat",
    siterates_iqtree="example_data/trees/test.rate",
    guidetree="example_data/trees/test.tre",
    # Only include these populations. There's also an exclude_pops
    # option that will exclude the provided populations.
    include_pops=["EA", "TT", "GU"],
)

# Print out the phylogenetic tree that you provided as a newick input file.
print(gd.tree)

# Access basic properties
print(gd.num_snps)  # Number of SNPs (loci) in the dataset
print(gd.num_inds)  # Number of samples in the dataset
print(gd.populations)  # List of population IDs
print(gd.popmap)  # Dictionary of SampleIDs as keys and popIDs as values

# Dictionary of PopulationIDs as keys and lists of SampleIDs
# for the given population as values.
print(gd.popmap_inverse)

print(gd.samples)  # Sample IDs in input order
print(gd.loci_indices)  # If loci were removed, will be subset.
print(gd.sample_indices)  # If samples were removed, will be subset.

# You can print the alignment as a Biopython MultipleSeqAlignment
# This is useful for visualization.
print(gd.alignment)

# Or you can use the alignment as a 2D list.
print(gd.snp_data)

# Get a numpy array of snp_data
print(np.array(gd.snp_data))
```

Here's the alignment object:

```
Alignment with 161 rows and 6724 columns
GNNNNCNNNNRNCNTNCNANNCNCGGGGCNNNCNTNNNTNNNNN...NCN EAAL_BX1380
NNGNNCNCNRGNNGTNCCNNNCCSNNNNNNGNNNYCCATTNGKN...NNT EAAL_BX211
GAGTACNCGGRGCNTTCCACGCNCGGGGCGGTCNTCCAYTCGTN...ANT EAAL_BXEA27
GAGTACCCGRRGCGTTYCACGNCCGGGGCGGTCGTCCATTCGTR...ACT EAGA_BX301
GAGTACNCGGGGCGTTYCACGCNCNGGGCGGTNGNCCATTCGTG...ACT EAGA_BX346
GAGTACCCGGRGCGTTYCACGCCCGGGGCGGNCNTCCATTCGTG...ACT EAGA_BX472
GAGTACNNGGGGCGTTCCACNCCCGGGGCGGTCGTCCATTCNTG...ACT EAGA_BX660
GAGTACNCGGRGCGTTCCACNNNSGGRGCGGTCGNCCATTCGTG...ACT EAGA_BXEA15_654
GWGTACCCGGRGCNTTCCACRNCCGGGGCGNTCGNCCNTTCGNG...ACT EAGA_BXEA17
GAGTACCCGGGGCGTTCCACGCCCGGGGCGGNCGNCCATTYGTG...ACT EAGA_BXEA21
NAGTACCCGGGGCGTTCCANNCNNGGGGCGGTCNYCCATTCGTG...ACT EAGA_BXEA25
GNNNASNNGNRNCNTTNNNCNNNCNNNGNGGNNNNNNNTNNNTG...ANN EAGA_BXEA29_655
GAGTACCCGGRGCGTTCCACGCCNGGGGCGGTCGNCCATTCGGN...ACT EAGA_BXEA31_659
GAGTACCCGGAGCGTTCCACGNCSGNGGCGNNCGTCNATTCGTG...ACT EAGA_BXEA32_662
GWGTACNCGNGGCGTTCCACGNNNNGGGNGGTCGTCNNTNCGTG...ACT EAGA_BXEA33_663
GAGTACCCGGRGCGTTCCACGNCSGGGGCGGTCGNCNATTCGTG...ACT EAGA_BXEA34_665
GAGTACNCGGRGCGTTCCACNNNSGGGGCGGTNGNCCANNCNTG...ACT EAGA_BXEA35_666
GWGTNCCYGGRGCNTNCCACRNCCGGGGCGNTCGNCCNTTCGNG...ACT EAGA_BXEA49_564
...
NANNNCNNGGGGCNTTNCNNNCCCGGGNCNGNCNTCCATTNNNN...ANT TTTX_BX23
```

## Data Transformation and Analysis

Once you have the genotype data, you can perform various data transformations and analyses. Here's an example of running principal component analysis (PCA) on the genotype data:

```
# Generate plots to assess the amount of missing data in alignment.
gd.missingness_reports(prefix="unfiltered")

# Run a principal component analysis and make a scatterplot.
components, pca = Plotting.run_pca(
    gd,  # GenotypeData instance from above.
    plot_dir="plots",
    prefix="unfiltered",
    n_components=None,  # If None, then uses all components.
    center=True,
    scale=False,
    n_axes=2,  # Can be 2 or 3. If 3, makes a 3D plot.
    point_size=15,
    font_size=15,
    plot_format="pdf",
    bottom_margin=0,
    top_margin=0,
    left_margin=0,
    right_margin=0,
    width=1088,
    height=700,
)
explvar = pca.explained_variance_ratio_  # Can use this to make a plot.

# Access other transformed genotype data and attributes

# 012-encoded genotypes, with ref=0, heterozygous=1, alt=2
genotypes_012 = gd.genotypes_012(fmt="list")

# One-hot encoded genotypes.
genotypes_onehot = gd.genotypes_onehot

# Dictionary object with all the VCF file fields.
# All values will be None if VCF file wasn't the input file type.
vcf_attributes = gd.vcf_attributes

# Access optional properties
q_matrix = gd.q
site_rates = gd.site_rates
tree = gd.tree
```
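The 012 encoding used above (ref=0, heterozygous=1, alt=2) can be sketched without SNPio. The following is a minimal standard-library illustration for a single alignment column of IUPAC codes, not SNPio's implementation. The function name `encode_012`, the `-9` missing-data code, and the choice of the most frequent allele as the reference are all assumptions made for this example.

```python
from collections import Counter

# IUPAC codes expanded to the two alleles of a diploid genotype.
IUPAC_TO_ALLELES = {
    "A": ("A", "A"), "C": ("C", "C"), "G": ("G", "G"), "T": ("T", "T"),
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}
# Characters treated as missing data, as listed in the Important Notes.
MISSING = {"N", "-", "?", "."}


def encode_012(column, missing_code=-9):
    """Encode one alignment column as counts of non-reference alleles.

    Assumption for this sketch: the reference allele is the most frequent
    allele in the column, and missing genotypes get `missing_code`.
    Codes outside IUPAC_TO_ALLELES and MISSING raise a KeyError.
    """
    allele_counts = Counter()
    for genotype in column:
        code = genotype.upper()
        if code not in MISSING:
            allele_counts.update(IUPAC_TO_ALLELES[code])
    if not allele_counts:
        return [missing_code] * len(column)  # Column is entirely missing.

    ref = allele_counts.most_common(1)[0][0]
    encoded = []
    for genotype in column:
        code = genotype.upper()
        if code in MISSING:
            encoded.append(missing_code)
        else:
            # 0 = homozygous ref, 1 = heterozygous, 2 = homozygous non-ref.
            encoded.append(sum(allele != ref for allele in IUPAC_TO_ALLELES[code]))
    return encoded


column = ["G", "G", "R", "A", "N", "G"]
encoded_column = encode_012(column)
print(encoded_column)
```

Here `G` is the most frequent allele, so `G` encodes as 0, the heterozygote `R` (A/G) as 1, `A` as 2, and `N` as the missing code.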

## GenotypeData Plots

There are a number of informative plots that GenotypeData makes.

Here is a plot showing the counts per population:

<img src="https://github.com/btmartin721/SNPio/blob/master/plots/population_counts.png" width="50%" alt="Barplot with counts per population">

Here is a plot showing the distribution of genotypes in the alignment:

<img src="https://github.com/btmartin721/SNPio/blob/master/plots/genotype_distributions.png" width="50%" alt="Plot showing IUPAC genotype distributions">

## Alignment Filtering

The `NRemover2` class provides methods for filtering genetic alignments based on the proportion of missing data, the minor allele frequency (MAF), and monomorphic, non-biallelic, and singleton sites. It allows you to filter out sequences (samples) and loci (columns) that exceed the provided thresholds. Missing data filtering options include removing loci whose columns exceed global missing and per-population thresholds and removing samples that exceed a per-sample threshold. The class also provides informative plots related to the filtering process.
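The missing-data thresholds described above can be illustrated with a small standard-library sketch. This is not the `NRemover2` implementation: the helper names are hypothetical, and the order of operations (samples first, then loci) is an assumption made for this example. It treats '-', '?', '.', and 'N' as missing and removes anything whose missing proportion exceeds the given threshold.

```python
MISSING = {"N", "-", "?", "."}


def missing_proportion(genotypes):
    """Fraction of genotypes in a row or column that are missing."""
    if not genotypes:
        return 0.0
    return sum(g.upper() in MISSING for g in genotypes) / len(genotypes)


def filter_missing(snp_data, populations, max_missing_sample,
                   max_missing_global, max_missing_pop):
    """Return (kept_sample_indices, kept_locus_indices).

    Assumed order for this sketch: drop samples first, then evaluate loci
    on the remaining samples, globally and within each population.
    """
    kept_samples = [
        i for i, row in enumerate(snp_data)
        if missing_proportion(row) <= max_missing_sample
    ]
    rows = [snp_data[i] for i in kept_samples]
    pops = [populations[i] for i in kept_samples]

    kept_loci = []
    n_loci = len(snp_data[0]) if snp_data else 0
    for j in range(n_loci):
        column = [row[j] for row in rows]
        if missing_proportion(column) > max_missing_global:
            continue  # Too much missing data across all samples.
        per_pop = {}
        for genotype, pop in zip(column, pops):
            per_pop.setdefault(pop, []).append(genotype)
        if any(missing_proportion(g) > max_missing_pop for g in per_pop.values()):
            continue  # Too much missing data in at least one population.
        kept_loci.append(j)
    return kept_samples, kept_loci


snp_data = [list("ANGT"), list("ANGN"), list("NNNN"), list("ANGT")]
populations = ["P1", "P1", "P2", "P2"]
kept_samples, kept_loci = filter_missing(
    snp_data, populations,
    max_missing_sample=0.8, max_missing_global=0.5, max_missing_pop=0.5,
)
print(kept_samples, kept_loci)
```

In this toy alignment the third sample is entirely missing and is dropped, and the second locus is missing in every remaining sample and is dropped as well.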

### Attributes:

- `alignment` (list of Bio.SeqRecord.SeqRecord): The input alignment to filter.
- `populations` (list of str): The population for each sequence in the alignment.
- `loci_indices` (list of int): Indices that were retained post-filtering.
- `sample_indices` (list of int): Indices that were retained post-filtering.
- `msa` (MultipleSeqAlignment): BioPython MultipleSeqAlignment object.

### Methods:

- `nremover()`: Runs the whole NRemover2 pipeline. Includes arguments for all thresholds and settings that you'll need. You can also toggle a threshold search that plots the proportion of missing data across all the filtering options across multiple thresholds.
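For readers unfamiliar with the terms behind these filters, the sketch below classifies a single locus as monomorphic, biallelic, or singleton and computes its minor allele frequency (MAF). It is a hedged illustration using only the standard library, not what `nremover()` does internally. The name `classify_locus` is hypothetical, and the definitions used here (MAF as the frequency of the second most common allele, singleton as a biallelic site whose minor allele is observed exactly once) are assumptions that may differ from SNPio's.

```python
from collections import Counter

# IUPAC codes expanded to the two alleles of a diploid genotype.
IUPAC_TO_ALLELES = {
    "A": ("A", "A"), "C": ("C", "C"), "G": ("G", "G"), "T": ("T", "T"),
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}
MISSING = {"N", "-", "?", "."}


def classify_locus(column):
    """Summarize one alignment column; missing genotypes are ignored."""
    counts = Counter()
    for genotype in column:
        code = genotype.upper()
        if code not in MISSING:
            counts.update(IUPAC_TO_ALLELES[code])

    ordered = counts.most_common()  # Most frequent allele first.
    total = sum(counts.values())
    n_alleles = len(ordered)
    # Assumed definition: MAF is the frequency of the second most common allele.
    maf = ordered[1][1] / total if n_alleles >= 2 else 0.0
    return {
        "n_alleles": n_alleles,
        "monomorphic": n_alleles <= 1,
        "biallelic": n_alleles == 2,
        # Assumed definition: minor allele observed exactly once.
        "singleton": n_alleles == 2 and ordered[1][1] == 1,
        "maf": maf,
    }


singleton_site = classify_locus(["G", "G", "R", "G", "N"])
monomorphic_site = classify_locus(["C", "C", "C"])
print(singleton_site)
print(monomorphic_site)
```

With a `min_maf` threshold of 0.2, for example, the first site (MAF 0.125) would fall below the threshold under these definitions.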

### Usage Example:

To illustrate how to use the `NRemover2` class, here's an example:

```
from snpio import NRemover2

# Create an instance of NRemover2
# Provide it the GenotypeData instance from above.
nrm = NRemover2(gd)

# Run nremover to filter out missing data.
# Set the thresholds as desired.
# Returns a GenotypeData object.
gd_filtered = nrm.nremover(
    max_missing_global=0.5, # Maximum global missing data threshold.
    max_missing_pop=0.5, # Maximum per-population threshold.
    max_missing_sample=0.8, # Maximum per-sample threshold.
    singletons=True, # Filter out singletons.
    biallelic=True, # Filter out non-biallelic sites.
    monomorphic=True, # Filter out monomorphic loci.
    min_maf=0.01, # Only retain loci with a MAF above this threshold.
    search_thresholds=True, # Plots against multiple thresholds.
    plot_dir="plots", # Where to save the plots to.
)

# Makes an informative plot showing missing data proportions.
gd_filtered.missingness_reports(prefix="filtered")

# Run a PCA on the filtered data and make a scatterplot.
Plotting.run_pca(gd_filtered, prefix="filtered")
```

Running the above code makes a number of informative plots. See below.

Here is a Sankey diagram showing the number of loci removed at each filtering step.

<img src="https://github.com/btmartin721/SNPio/blob/master/plots/sankey_filtering_report.png" width="75%" alt="Sankey filtering report for loci removed at each filtering step">

Here are the proportions of missing data from the filtered missingness report:

<img src="https://github.com/btmartin721/SNPio/blob/master/plots/filtered_missingness.png" width="75%" alt="Missingness filtering report plot">

Here is the PCA we ran on the filtered data, with colors being a gradient corresponding to the proportion of missing data in each sample:

<img src="https://github.com/btmartin721/SNPio/blob/master/plots/filtered_pca.png" width="50%" alt="Principal Component Analysis scatterplot for filtered data">

The two plots below show how the proportion of missing data varies across thresholds. They are generated if you set `search_thresholds=True` when you run the `nremover()` function. The first shows the missing data filters, and the second shows the MAF, biallelic, monomorphic, and singleton filters. Both are shown globally and per population.

First, the missing data filter report:

<img src="https://github.com/btmartin721/SNPio/blob/master/plots/missingness_report.png" width="75%" alt="Plots showing missingness proportion variance for each filtering step">

And now the MAF, biallelic, singleton, and monomorphic filter report:

<img src="https://github.com/btmartin721/SNPio/blob/master/plots/maf_missingness_report.png" width="50%" alt="Plots showing missingness proportion variance among the MAF thresholds and singleton, biallelic, and monomorphic filters (toggled off and on)">

If you do not want to use some of the filtering options, leave those arguments at their default values.

## Writing to File and File Conversions

If you want to write your output to a file, use one of the write functions. Any of the input file types can be written with any of the write functions.

```
gd_filtered.write_phylip("example_data/phylip_files/nremover_test.phy")

gd_filtered.write_structure("example_data/structure_files/nremover_test.str")

gd_filtered.write_vcf("example_data/vcf_files/nremover_test.vcf")
```
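To show what a written alignment looks like on disk, here is a minimal standard-library sketch of a relaxed, sequential PHYLIP writer: a header line with the number of samples and sites, followed by one line per sample. It is an illustration only and not SNPio's `write_phylip`; the function name `write_relaxed_phylip` and the tab separator between sample ID and sequence are assumptions for this example.

```python
import io


def write_relaxed_phylip(sample_ids, snp_data, handle):
    """Write samples and genotypes in relaxed sequential PHYLIP layout.

    Hypothetical helper for illustration. `snp_data` is a 2D list with one
    row of single-character genotypes per sample.
    """
    if len(sample_ids) != len(snp_data):
        raise ValueError("sample_ids and snp_data must have the same length")
    n_sites = len(snp_data[0]) if snp_data else 0
    if any(len(row) != n_sites for row in snp_data):
        raise ValueError("all rows in snp_data must have the same number of sites")

    # Header: number of samples, then number of sites.
    handle.write(f"{len(sample_ids)} {n_sites}\n")
    for sample_id, row in zip(sample_ids, snp_data):
        # Assumption: a tab separates the sample ID from its sequence.
        handle.write(f"{sample_id}\t{''.join(row)}\n")


buffer = io.StringIO()
write_relaxed_phylip(
    ["Sample1", "Sample2"],
    [list("ACGT"), list("ANGT")],
    buffer,
)
phylip_text = buffer.getvalue()
print(phylip_text)
```

Writing to a file-like handle rather than a path keeps the sketch easy to test in memory; passing an open file object works the same way.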

For detailed information about the available methods and attributes, refer to the API Reference.

That's it! You have successfully completed the basic steps to get started with SNPio. Explore the library further to discover more functionality and advanced features.


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/btmartin721/SNPio",
    "name": "snpio",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "",
    "keywords": "genomics,bioinformatics,population genetics,SNP,VCF,PHYLIP,STRUCTURE,missing data,filtering,MAF,biallelic",
    "author": "Bradley T. Martin and Tyler K. Chafin",
    "author_email": "evobio721@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/3c/07/ab3120056a33d87ff1e0807177e98d461465760cf4333dd1d044194fe285/snpio-1.0.5.2.tar.gz",
    "platform": "Any",
    "bugtrack_url": null,
    "license": "GPL3",
    "summary": "Reads and writes VCF, PHYLIP, and STRUCTURE files and performs data filtering on the alignment.",
    "version": "1.0.5.2",
    "project_urls": {
        "Bug Tracker": "https://github.com/btmartin721/SNPio/issues",
        "Homepage": "https://github.com/btmartin721/SNPio",
        "Source Code": "https://github.com/btmartin721/SNPio"
    },
    "split_keywords": [
        "genomics",
        "bioinformatics",
        "population genetics",
        "snp",
        "vcf",
        "phylip",
        "structure",
        "missing data",
        "filtering",
        "maf",
        "biallelic"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "02fa9a424aed5508f0bc8427ac25f0beb381a5e4c07cfcf16ba340d44ec8fc0c",
                "md5": "9d11063ec7fbdc9a51c5166cae0025b1",
                "sha256": "05450af96d42e2309bec087e9f7ddcbd11f49c446be63707e147ec6ad1012ebf"
            },
            "downloads": -1,
            "filename": "snpio-1.0.5.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "9d11063ec7fbdc9a51c5166cae0025b1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 102694,
            "upload_time": "2023-09-17T09:53:04",
            "upload_time_iso_8601": "2023-09-17T09:53:04.025745Z",
            "url": "https://files.pythonhosted.org/packages/02/fa/9a424aed5508f0bc8427ac25f0beb381a5e4c07cfcf16ba340d44ec8fc0c/snpio-1.0.5.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3c07ab3120056a33d87ff1e0807177e98d461465760cf4333dd1d044194fe285",
                "md5": "802b797d1258d154d4e36959a9e1be3d",
                "sha256": "edd8e43abbfa3deb9a1bf82002aa311e5a4f915976c4c4a5db3217c5ef02cd75"
            },
            "downloads": -1,
            "filename": "snpio-1.0.5.2.tar.gz",
            "has_sig": false,
            "md5_digest": "802b797d1258d154d4e36959a9e1be3d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 102501,
            "upload_time": "2023-09-17T09:53:05",
            "upload_time_iso_8601": "2023-09-17T09:53:05.995994Z",
            "url": "https://files.pythonhosted.org/packages/3c/07/ab3120056a33d87ff1e0807177e98d461465760cf4333dd1d044194fe285/snpio-1.0.5.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-09-17 09:53:05",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "btmartin721",
    "github_project": "SNPio",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "snpio"
}
        