# phylostan: phylogenetic inference using Stan
## Introduction
*phylostan* is a tool written in Python for inferring phylogenetic trees from nucleotide datasets.
It generates a variety of phylogenetic models using the Stan language.
Through the pystan library, *phylostan* has access to Stan's variational inference and sampling (NUTS and HMC) engines.
The program has been described and its performance evaluated in an [article](https://doi.org/10.7717/peerj.8272). The data and scripts used to generate the results can be found [here](examples/README.md).
## Features
Phylogenetic model components:
- Nucleotide substitution models: JC69, HKY, GTR
- Rate heterogeneity: discretized Weibull distribution and general discrete distribution
- Tree without clock constraint with uniform prior on topology
- Time tree:
  - Homochronous sequences: same sampling date
  - Heterochronous sequences: sequences sampled at different time points
  - Molecular clocks:
    - Strict
    - [Autocorrelated](https://doi.org/10.1093/oxfordjournals.molbev.a025892)
    - [Uncorrelated](https://dx.doi.org/10.1371%2Fjournal.pbio.0040088): log-normal hierarchical prior
  - Coalescent models:
    - Constant population size
    - [Skyride](https://doi.org/10.1093/molbev/msn090)
    - [Skygrid](https://doi.org/10.1093/molbev/mss265)
Algorithms provided by Stan:
- Variational inference:
  - Mean-field distribution
  - Full-rank distribution
- No U-Turn Sampler ([NUTS](https://arxiv.org/abs/1111.4246))
- Hamiltonian Monte Carlo (HMC)
## Prerequisites
| Program/Library | Version | Description |
|----------- | --------| -- |
| python | Tested on python 3.6, 3.7, 3.9 | |
| [pystan](https://pystan.readthedocs.io/) | >=2.19 <3 | API for [Stan](https://mc-stan.org) |
| [dendropy](https://www.dendropy.org) | | Library for manipulating trees and alignments|
| numpy | >=1.7 | |
## You can install phylostan using pip
```bash
pip install phylostan
```
## You can also run it locally
```bash
python -m phylostan.phylostan <COMMAND>
```
where `<COMMAND>` is either the *build* or *run* command.
## Command-line usage
*phylostan* is decomposed into two sub-commands:
- *build*: creates a Stan file: a text file containing the model.
- *run*: runs a Stan file with the data.
These two steps are separated so the user can edit the Stan model. The main reason would be to modify the priors.
To get some help about the *build* or *run* commands:
```bash
phylostan build --help
phylostan run --help
```
## Quickstart
We are going to use the `fluA.fa` alignment and `fluA.tree` tree files. This dataset contains 69 influenza A virus haemagglutinin nucleotide sequences isolated between 1981 and 1998.
First, a Stan script needs to be generated using the *build* command:
```bash
cd examples/fluA
phylostan build -s fluA-GTR-W4.stan -m HKY -C 4 \
--heterochronous --estimate_rate --clock strict --coalescent constant
```
This command is going to create a Stan file `fluA-GTR-W4.stan` with the following model:
- Hasegawa, Kishino and Yano (HKY) nucleotide substitution model
- Rate heterogeneity with 4 rate categories using the Weibull distribution
- Assumes that sequences were sampled at different time points (heterochronous)
- Strict molecular clock
- Constant effective population size
- The substitution rate will be estimated
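To give a feel for what "4 rate categories using the Weibull distribution" can mean, here is a minimal Python sketch of one common discretization: take the Weibull quantiles at the midpoints of equal-probability bins and rescale them to have mean 1. The function name `discretized_weibull_rates`, the fixed scale of 1 and the midpoint scheme are assumptions made for illustration; the Stan code generated by *phylostan* may discretize the distribution differently.

```python
import math


def discretized_weibull_rates(shape, n_categories):
    """Return n_categories relative rates with mean 1 (illustrative helper).

    Assumption: Weibull with scale 1, evaluated at the midpoints of
    n_categories equal-probability bins.
    """
    # Midpoint probabilities: 1/(2K), 3/(2K), ..., (2K-1)/(2K)
    probs = [(2 * i + 1) / (2 * n_categories) for i in range(n_categories)]
    # Weibull quantile function with scale 1: (-ln(1 - p)) ** (1 / shape)
    quantiles = [(-math.log(1.0 - p)) ** (1.0 / shape) for p in probs]
    # Rescale so that the average rate across categories equals 1
    mean = sum(quantiles) / n_categories
    return [q / mean for q in quantiles]


# The shape value is only an example input
rates = discretized_weibull_rates(shape=0.488, n_categories=4)
print([round(r, 3) for r in rates])
```

A shape below 1 gives a strongly skewed set of rates: most sites evolve slowly and a few evolve much faster than average.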
In the second step, we compile and run the script with our data:
```bash
phylostan run -s fluA-GTR-W4.stan -m HKY -C 4 \
--heterochronous --estimate_rate --clock strict --coalescent constant \
-i fluA.fa -t fluA.tree -o fluA -q meanfield
```
The *run* command requires the data (tree and alignment) and an output parameter.
It also needs the parameters that were provided to the *build* command.
The output will consist of four files:
- `fluA`: this file is the output file of Stan. It contains the samples drawn from the variational distribution (or MCMC samples).
- `fluA.diag`: this file is also generated by Stan and it contains some information such as the ELBO at each iteration.
- `fluA.trees`: this file is a nexus file containing trees. It can be opened with a program such as [FigTree](https://github.com/rambaut/figtree) or summarized using *treeannotator* from [BEAST](https://beast.community/treeannotator) or [BEAST2](https://www.beast2.org/treeannotator/).
- `fluA-GTR-W4.pkl`: the compiled Stan model, stored as a binary file. *phylostan* reuses this file automatically on later runs; use the `--compile` option to force recompilation.
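The samples in the Stan output file can be post-processed with a few lines of Python. The sketch below assumes the file follows the usual Stan CSV layout (comment lines starting with `#`, a header row, then one draw per row). The helper `read_stan_csv` and the column names in the toy input are hypothetical; check the header of your own `fluA` file for the actual parameter names.

```python
import csv
import io


def read_stan_csv(lines):
    """Parse Stan-style CSV lines into a dict mapping column name -> list of floats."""
    # Drop blank lines and the '#' comment lines that Stan writes around the data
    rows = [line for line in lines if line.strip() and not line.startswith("#")]
    reader = csv.DictReader(rows)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(float(value))
    return columns


# Toy input standing in for the real output file (column names are made up)
example = io.StringIO(
    "# hypothetical header comment\n"
    "lp__,kappa\n"
    "0,5.5\n"
    "-1.2,5.7\n"
)
samples = read_stan_csv(example)
print(samples["kappa"])
```

To read the real file, pass an open file handle instead of the `io.StringIO` object, for example `with open("fluA") as handle: samples = read_stan_csv(handle)`.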
At the end of the run, *phylostan* prints the mean and 95% credible interval of the parameters of interest:
```
Weibull (shape) mean: 0.488 95% CI: (0.383,0.616)
Strict clock (rate) mean: 0.00499 95% CI: (0.00432,0.00577)
Constant population size (theta) mean: 4.03 95% CI: (3.14,5.05)
HKY (kappa) mean: 5.58 95% CI: (4.37 7.039)
Root height mean: 18.96 95% CI: (18.36 19.74)
```
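A summary of this kind can be reproduced from a list of samples. The following sketch computes the mean and an equal-tailed 95% interval from the 2.5% and 97.5% percentiles. Whether *phylostan* uses an equal-tailed interval or another definition (such as a highest posterior density interval) is not stated here, so treat the interval type and the helper names `percentile` and `summarize` as assumptions.

```python
import math
import random
import statistics


def percentile(ordered, q):
    """Percentile of an already sorted list, with linear interpolation; q is in [0, 1]."""
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def summarize(samples, level=0.95):
    """Return (mean, lower, upper) for an equal-tailed interval at the given level."""
    ordered = sorted(samples)
    tail = (1.0 - level) / 2.0
    return (
        statistics.mean(ordered),
        percentile(ordered, tail),
        percentile(ordered, 1.0 - tail),
    )


# Synthetic draws standing in for the samples of one parameter
random.seed(1)
draws = [random.gauss(5.5, 0.7) for _ in range(1000)]
mean, lower, upper = summarize(draws)
print(f"mean: {mean:.3g} 95% CI: ({lower:.3g},{upper:.3g})")
```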
In this example we have used a mean-field distribution (`-q meanfield`) to approximate the posterior using variational inference.
Because the Stan model is already compiled, we can run the NUTS algorithm without regenerating the script file by issuing the command:
```bash
phylostan run -s fluA-GTR-W4.stan -m HKY -C 4 \
--heterochronous --estimate_rate --clock strict --coalescent constant \
-i fluA.fa -t fluA.tree -o fluA -a nuts
```
The NUTS algorithm is much slower (and more accurate) than variational inference, so it is best suited to small datasets.
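Since *build* and *run* must receive the same model options, it can help to keep those options in one place when scripting several analyses. The sketch below assembles the three command lines used in this quickstart from shared lists. It only uses flags that appear above; the helper `phylostan_command` is a hypothetical name, and the commands are printed rather than executed.

```python
import shlex

# Model options copied from the quickstart; build and run must agree on these
MODEL_OPTIONS = [
    "-s", "fluA-GTR-W4.stan", "-m", "HKY", "-C", "4",
    "--heterochronous", "--estimate_rate",
    "--clock", "strict", "--coalescent", "constant",
]
# Data and output options, only needed by the run command
DATA_OPTIONS = ["-i", "fluA.fa", "-t", "fluA.tree", "-o", "fluA"]


def phylostan_command(subcommand, extra=()):
    """Return the argument list for `phylostan build` or `phylostan run`."""
    if subcommand not in ("build", "run"):
        raise ValueError(f"unknown sub-command: {subcommand}")
    command = ["phylostan", subcommand] + MODEL_OPTIONS
    if subcommand == "run":
        command += DATA_OPTIONS
    return command + list(extra)


commands = [
    phylostan_command("build"),
    phylostan_command("run", ["-q", "meanfield"]),  # variational inference
    phylostan_command("run", ["-a", "nuts"]),       # NUTS sampling
]
for cmd in commands:
    print(" ".join(shlex.quote(part) for part in cmd))
```

To actually launch a command, pass the list to `subprocess.run(cmd, check=True)` from the `examples/fluA` directory.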
## Citing phylostan
Mathieu Fourment and Aaron E. Darling. Evaluating Probabilistic Programming and Fast Variational Bayesian Inference in Phylogenetics. 2019 _PeerJ_. doi: [10.7717/peerj.8272](https://doi.org/10.7717/peerj.8272).
```
@article{fourment2019phylostan,
title = "Evaluating probabilistic programming and fast variational
{B}ayesian inference in phylogenetics",
author = "Fourment, Mathieu and Darling, Aaron E",
journal = "PeerJ",
volume = 7,
pages = "e8272",
month = dec,
year = 2019
}
```