# PLNmodels: Poisson lognormal models
> The Poisson lognormal model and its variants can be used to analyse multivariate count data.
> This package implements
> efficient algorithms that extract meaningful information from complex,
> hard-to-interpret multivariate count data. It is built to scale to large datasets,
> although memory remains a limiting factor. Possible fields of application include
> - Genomics (number of times a gene is expressed in a cell)
> - Ecology (species abundances)
>
> One main functionality is to normalize the count data to obtain more valuable
> data. It also analyses the significance of each variable and their correlations, as well as the weight of
> covariates (if available).
<!-- accompanied with a set of -->
<!-- > functions for visualization and diagnostic. See [this deck of -->
<!-- > slides](https://pln-team.github.io/slideshow/) for a -->
<!-- > comprehensive introduction. -->
## Getting started
[A notebook to get started can be found
here](https://github.com/PLN-team/pyPLNmodels/blob/main/Getting_started.ipynb).
If you only need a quick overview of the package, see the Quickstart below.
## 🛠 Installation
**pyPLNmodels** is available on
[pypi](https://pypi.org/project/pyPLNmodels/). The development
version is available on [GitHub](https://github.com/PLN-team/pyPLNmodels) and [GitLab](https://gitlab.com/Bastien-mva/pyplnmodels).
### Package installation
```
pip install pyPLNmodels
```
## Statistical description
For those unfamiliar with the concepts of Poisson or Gaussian random variables,
it is not necessary to delve into these statistical descriptions. The key
takeaway is as follows:
This package is designed to analyze multi-dimensional count data. It
effectively extracts significant information, such as
the mean, the relationships with covariates, and the correlation between count
variables, in a manner appropriate for count data.
Consider $\mathbf Y$ a count matrix (denoted as ```endog``` in the package) consisting of $n$ rows and $p$ columns.
It is assumed that each individual $\mathbf Y_i$, that is the $i^{\text{th}}$
row of $\mathbf Y$, is independent from the others and follows a Poisson
lognormal distribution:
$$\mathbf Y_{i}\sim \mathcal P(\exp(\mathbf Z_{i})), \quad \mathbf Z_i \sim
\mathcal N(\mathbf o_i + \mathbf B ^{\top} \mathbf x_i, \mathbf \Sigma),$$
where $\mathbf x_i \in \mathbb R^d$ (`exog`) and $\mathbf o_i \in \mathbb R^p$ (`offsets`) are
user-specified covariates and offsets. The matrix $\mathbf B$ is a $d\times p$
matrix of regression coefficients and $\mathbf \Sigma$ is a $p\times p$
covariance matrix. The goal is to estimate the parameters $\mathbf B$ and
$\mathbf \Sigma$, denoted as ```coef``` and ```covariance``` in the package,
respectively. Once the parameters are learned, extracting the ```latent_variables``` $\mathbf Z_i$
provides a normalization suited to count data.
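To make the generative model concrete, here is a minimal NumPy sketch that simulates data from it. It does not use pyPLNmodels itself; the sizes, the random seed, and the parameter values are illustrative assumptions, and the variable names simply mirror the `endog`, `exog`, `offsets`, `coef` and `covariance` notation above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): samples, count variables, covariates
n, p, d = 200, 4, 2

exog = np.column_stack([np.ones(n), rng.normal(size=n)])  # x_i: intercept + one covariate
offsets = np.zeros((n, p))                                 # o_i
coef = rng.normal(scale=0.5, size=(d, p))                  # B, shape d x p
a = rng.normal(scale=0.3, size=(p, p))
covariance = a @ a.T + 0.1 * np.eye(p)                     # Sigma, symmetric positive definite

# Z_i ~ N(o_i + B^T x_i, Sigma)
mean = offsets + exog @ coef
latent = mean + rng.multivariate_normal(np.zeros(p), covariance, size=n)

# Y_i | Z_i ~ Poisson(exp(Z_i)), applied elementwise
endog = rng.poisson(np.exp(latent))

print(endog.shape)
```

Fitting a model then amounts to recovering `coef` and `covariance` from `endog`, `exog` and `offsets` alone, since `latent` is never observed.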
## ⚡️ Quickstart
The package comes with an ecological data set to present the functionality:
```
import pyPLNmodels
from pyPLNmodels.models import PlnPCAcollection, Pln, ZIPln
from pyPLNmodels.oaks import load_oaks
oaks = load_oaks()
```
### How to specify a model
Each model can be specified in two distinct manners:
* by formula (similar to R), where a data frame is passed and the formula is specified using the ```from_formula``` initialization:
```model = Model.from_formula("endog ~ 1 + covariate_name ", data = oaks)# not run```
We rely on the [patsy](https://github.com/pydata/patsy) package for the formula parsing.
* by specifying the endog, exog, and offsets matrices directly:
```model = Model(endog = oaks["endog"], exog = oaks[["covariate_name"]], offsets = oaks[["offset_name"]])# not run```
The parameters `exog` and `offsets` are optional. By default,
`exog` is set to represent an intercept, which is a vector of ones. Similarly,
`offsets` defaults to a matrix of zeros.
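The effect of these defaults on the model can be sketched with plain NumPy. The shapes below (an `n x 1` column of ones and an `n x p` matrix of zeros) and the coefficient values are assumptions made for illustration; the arrays the package builds internally may be stored differently.

```python
import numpy as np

n, p = 5, 3

# Assumed shape of the defaults described above
exog = np.ones((n, 1))        # intercept only, so d = 1
offsets = np.zeros((n, p))    # no offset

# Hypothetical 1 x p coefficient matrix B
coef = np.array([[0.2, -1.0, 0.5]])

# Mean of the latent variables: o_i + B^T x_i
latent_mean = offsets + exog @ coef
print(latent_mean)
```

With these defaults every row of `latent_mean` equals the single row of `coef`, so all individuals share the same latent mean.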
### Unpenalized Poisson lognormal model (aka `Pln`)
This is the building block of the models implemented in this package. It fits a Poisson lognormal model to the data:
```
pln = Pln.from_formula("endog ~ 1 + tree ", data = oaks)
pln.fit()
print(pln)
transformed_data = pln.transform()
pln.show()
```
### Rank Constrained Poisson lognormal for Poisson Principal Component Analysis (aka `PlnPCA` and `PlnPCAcollection`)
This model excels in dimension reduction and is capable of scaling to
high-dimensional count data ($p \gg 1$). It represents a variant of the PLN
model, incorporating a rank constraint on the covariance matrix. This can be
interpreted as an extension of the [probabilistic
PCA](https://academic.oup.com/jrsssb/article/61/3/611/7083217) for
count data, where the rank determines the number of components in the
probabilistic PCA. Users have the flexibility to define the rank of the
covariance matrix via the `rank` keyword of the `PlnPCA` object. Furthermore, they can specify multiple ranks simultaneously
within a single object (`PlnPCAcollection`), and then select the optimal model based on either the
AIC (default) or BIC criterion:
```
pca_col = PlnPCAcollection.from_formula("endog ~ 1 + tree ", data = oaks, ranks = [3,4,5])
pca_col.fit()
print(pca_col)
pca_col.show()
best_pca = pca_col.best_model()
best_pca.show()
transformed_data = best_pca.transform(project = True)
print('Original data shape: ', oaks["endog"].shape)
print('Transformed data shape: ', transformed_data.shape)
```
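The rank constraint can be illustrated independently of the package. A standard way to obtain a covariance matrix of rank at most $q$ is to write it as $\mathbf C \mathbf C^{\top}$ with $\mathbf C$ a $p \times q$ matrix; the sketch below assumes this parametrization, and the names `components` and `factors` are illustrative rather than package attributes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rank = 100, 10, 3

components = rng.normal(size=(p, rank))   # C, a p x q loading matrix
covariance = components @ components.T    # Sigma = C C^T has rank at most q

factors = rng.normal(size=(n, rank))      # low-dimensional Gaussian factors
latent = factors @ components.T           # zero-mean latent variables, n x p

print(np.linalg.matrix_rank(covariance))
print(factors.shape, latent.shape)
```

The `n x p` latent matrix is entirely determined by the `n x rank` factors, which is what makes the model usable for dimension reduction.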
A correlation circle may be employed to graphically represent the relationship
between the variables and the components:
```
best_pca.plot_pca_correlation_circle(["var_1","var_2"], indices_of_variables = [0,1])
```
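For readers unfamiliar with correlation circles, the underlying computation can be sketched as follows. This is a generic NumPy illustration on synthetic data, not the package's implementation: each variable is placed at the coordinates given by its correlation with the first two principal components.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5

# Synthetic stand-in for the transformed (latent) data
latent = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
centered = latent - latent.mean(axis=0)

# PCA via SVD: u * s gives the component scores
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = u[:, :2] * s[:2]

# Correlation of each variable with each of the first two components
full = np.corrcoef(centered.T, scores.T)   # (p + 2) x (p + 2) correlation matrix
corr = full[:p, p:]                        # p x 2 block: variables vs components

# Distance of each variable from the origin of the circle
radius = np.sqrt((corr ** 2).sum(axis=1))
print(corr.shape)
```

Each row of `corr` lies inside the unit circle; variables close to the circle are well represented by the first two components.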
### Zero-inflated Poisson lognormal model (aka `ZIPln`)
The `ZIPln` model, a variant of the PLN model, is designed to handle zero
inflation in the data. It is defined as follows:
$$Y_{ij}\sim W_{ij} \times \mathcal P(\exp(Z_{ij})), \quad \mathbf Z_i \sim \mathcal N(\mathbf o_i + \mathbf B ^{\top} \mathbf x_i, \mathbf \Sigma), \quad W_{ij} \sim \mathcal B(\sigma( \mathbf x_i^{0^{\top}}\mathbf B^0_j))$$
where $\mathcal B$ denotes the Bernoulli distribution, $\sigma$ the sigmoid function, and $\mathbf x_i^0$ and $\mathbf B^0$ the covariates and regression coefficients of the zero-inflation part.
This model is particularly beneficial when the data contains a significant
number of zeros. The zero-inflation part has its own covariates, which are
specified after the pipe `|` symbol in the formula or via the `exog_inflation` keyword. If they are not specified, the covariates of the Poisson part are used.
```
zi = ZIPln.from_formula("endog ~ 1 + tree | 1 + tree", data = oaks)
zi.fit()
print(zi)
print("Transformed data shape: ", zi.transform().shape)
z_latent_variables, w_latent_variables = zi.transform(return_latent_prob = True)
print(r'$Z$ latent variables shape', z_latent_variables.shape)
print(r'$W$ latent variables shape', w_latent_variables.shape)
```
By default, the transformation of the data returns only the $\mathbf Z$ latent
variable. However, if the `return_latent_prob`
parameter is set to `True`, the transformed data will include both the latent
variables $\mathbf W$ and $\mathbf Z$. Here, $\mathbf W$ accounts for the zero
inflation, while $\mathbf Z$ accounts for the Poisson parameter.
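The roles of $\mathbf W$ and $\mathbf Z$ can be seen by simulating from the model as written above. The sketch below uses NumPy only; the sizes, the coefficient values, and the diagonal covariance are simplifying assumptions, and `coef_inflation` is an illustrative name for $\mathbf B^0$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 300, 4


def sigmoid(t):
    """Logistic function mapping real values to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))


exog = np.column_stack([np.ones(n), rng.normal(size=n)])  # covariates of the Poisson part
exog_inflation = exog                                      # same covariates for the inflation part
coef = rng.normal(scale=0.5, size=(2, p))                  # B
coef_inflation = rng.normal(scale=0.5, size=(2, p))        # B^0 (illustrative name)

# Z_i with a diagonal covariance (0.25 * identity) for brevity, zero offsets
latent = exog @ coef + rng.normal(scale=0.5, size=(n, p))

prob = sigmoid(exog_inflation @ coef_inflation)            # Bernoulli parameter of W_ij
w = rng.binomial(1, prob)                                  # W_ij in {0, 1}
endog = w * rng.poisson(np.exp(latent))                    # Y_ij = W_ij * Poisson(exp(Z_ij))

print((endog == 0).mean())
```

Zeros in `endog` come from two sources: entries where `w` is zero, and ordinary Poisson zeros where `exp(latent)` is small.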
### Visualization
The package is equipped with a set of visualization functions designed to
help the user interpret the data. The `viz` function conducts Principal
Component Analysis (PCA) on the latent variables, while the `viz_positions` function
carries out PCA on the latent variables, adjusted for covariates. Additionally,
the `viz_prob` function provides a visual representation of the zero-inflation
probability.
```
best_pca.viz(colors = oaks["tree"])
best_pca.viz_positions(colors = oaks["dist2ground"])
pln.viz(colors = oaks["tree"])
pln.viz_positions(colors = oaks["dist2ground"])
zi.viz(colors = oaks["tree"])
zi.viz_positions(colors = oaks["dist2ground"])
zi.viz_prob(colors = oaks["tree"])
```
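The difference between the two PCA views can be sketched with NumPy on synthetic data. This assumes that "adjusted for covariates" means subtracting the fitted mean $\mathbf X \mathbf B$ from the latent variables before the PCA; the package's exact computation may differ, and `pca_2d` is a hypothetical helper, not a package function.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 150, 6

# Synthetic stand-ins for fitted quantities
exog = np.column_stack([np.ones(n), rng.integers(0, 2, size=n)])  # intercept + binary covariate
coef = rng.normal(size=(2, p))
latent = exog @ coef + rng.normal(size=(n, p))


def pca_2d(x):
    """Project the rows of x onto their first two principal components."""
    centered = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :2] * s[:2]


proj_latent = pca_2d(latent)                    # PCA on the latent variables
proj_positions = pca_2d(latent - exog @ coef)   # PCA after removing the covariate effect

print(proj_latent.shape, proj_positions.shape)
```

In the first projection the binary covariate typically separates the points into groups; in the second, that structure is removed and only the residual variation remains.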
## 👐 Contributing
Feel free to contribute, but read the [CONTRIBUTING.md](https://forgemia.inra.fr/bbatardiere/pyplnmodels/-/blob/main/CONTRIBUTING.md) first. A public roadmap will be available soon.
## ⚡️ Citations
Please cite our work using the following references:
- J. Chiquet, M. Mariadassou and S. Robin: Variational inference for
probabilistic Poisson PCA, The Annals of Applied Statistics, 12:
2674–2698, 2018. [pdf](http://dx.doi.org/10.1214/18%2DAOAS1177)
- B. Batardiere, J. Chiquet and M. Mariadassou: Zero-inflation in the Multivariate
Poisson Lognormal Family. [pdf](https://arxiv.org/abs/2405.14711)
Raw data
{
"_id": null,
"home_page": null,
"name": "pyPLNmodels",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": "Bastien Batardi\u00e8re <bastien.batardiere@gmail.com>, Julien Chiquet <julien.chiquet@inrae.fr>",
"keywords": "python, count, data, count data, high dimension, scRNAseq, PLN",
"author": null,
"author_email": "Bastien Batardiere <bastien.batardiere@gmail.com>, Julien Chiquet <julien.chiquet@inrae.fr>, Joon Kwon <joon.kwon@inrae.fr>",
"download_url": "https://files.pythonhosted.org/packages/7a/e3/2c68ab0a71cf72d9b2a986963469773d1a8888c6e09c40da6943e55d6ec4/pyplnmodels-0.0.82.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT License",
"summary": "Package implementing PLN models",
"version": "0.0.82",
"project_urls": {
"Documentation": "https://bbatardiere.pages.mia.inra.fr/pyplnmodels",
"Repository": "https://github.com/PLN-team/pyPLNmodels"
},
"split_keywords": [
"python",
" count",
" data",
" count data",
" high dimension",
" scrnaseq",
" pln"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "e4912e5e28959d292a815ffb3c9dfa9a5fe37a9b63049326d50a739e4fdf7525",
"md5": "fd747f6f917dfdd8944a7a97f2763907",
"sha256": "88836257c82425b81c7d1c927a7ca47a185e128b2246e539edfc078734529c8c"
},
"downloads": -1,
"filename": "pyPLNmodels-0.0.82-py3-none-any.whl",
"has_sig": false,
"md5_digest": "fd747f6f917dfdd8944a7a97f2763907",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 322984,
"upload_time": "2024-08-10T08:37:34",
"upload_time_iso_8601": "2024-08-10T08:37:34.240155Z",
"url": "https://files.pythonhosted.org/packages/e4/91/2e5e28959d292a815ffb3c9dfa9a5fe37a9b63049326d50a739e4fdf7525/pyPLNmodels-0.0.82-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "7ae32c68ab0a71cf72d9b2a986963469773d1a8888c6e09c40da6943e55d6ec4",
"md5": "3dcb9bad51434f4be5cacc2c29c6c9e2",
"sha256": "b3238a1d85939405d6f608ecdfdd71fba45bdd01fd3dbd0f669eefac8a51f7cb"
},
"downloads": -1,
"filename": "pyplnmodels-0.0.82.tar.gz",
"has_sig": false,
"md5_digest": "3dcb9bad51434f4be5cacc2c29c6c9e2",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 381202,
"upload_time": "2024-08-10T08:37:36",
"upload_time_iso_8601": "2024-08-10T08:37:36.537319Z",
"url": "https://files.pythonhosted.org/packages/7a/e3/2c68ab0a71cf72d9b2a986963469773d1a8888c6e09c40da6943e55d6ec4/pyplnmodels-0.0.82.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-10 08:37:36",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "PLN-team",
"github_project": "pyPLNmodels",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "pyplnmodels"
}