# InfoVar

The `infovar` Python package provides tools to efficiently study the informativity of variables on data of interest.
## Context
The informativity of a variable or set of variables is defined here as the ability of these variables, if known, to reduce the uncertainty we have about a quantity of interest. This uncertainty can be defined in several ways, for example in the sense of Shannon's information theory.
This is a ubiquitous problem in science in general, with very concrete applications in climatology, economics, psychology, sociology, and astrophysics, to name a few. Consequently, `InfoVar` has been designed to be very general.
This package provides tools to quantify the statistical dependence between continuous numerical data (e.g., with mutual information, although other metrics are available), to estimate the associated error, and to assess how that error affects the ranking of variables by importance.
## Installation
*(optional)* Create a virtual environment and activate it:
```shell
python -m venv .venv
source .venv/bin/activate
```
**Note 1:** To deactivate the virtual environment:
```shell
deactivate
```
**Note 2:** To delete the virtual environment:
```shell
rm -r .venv
```
### From PyPI (recommended)
To install `infovar`:
```shell
pip install infovar
```
### From local package
To get the source code:
```shell
git clone git@github.com:einigl/infovar.git
```
To install `infovar` in editable mode from the cloned repository:
```shell
cd infovar
pip install -e .
```
## Get started
To get started, check out the Jupyter notebooks provided in the `examples` folder.
## Tests
To test, run:
```shell
pytest --cov && coverage-badge -o coverage.svg -f
```
## Documentation
To build the documentation, run:
```bash
cd docs
sphinx-apidoc -o . ../infovar
make html
```
Outputs are in `docs/_build/html`.
## Features
### Statistics
In this project, we propose to measure the statistical dependence between variables with the mutual information. Other metrics can also be used, such as the conditional differential entropy, which is closely related to mutual information, or the canonical correlation coefficient.
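For reference, the relation between the two quantities can be written as the standard identity below. The notation is an assumption made here for illustration and is not taken from the package: $Y$ is the quantity of interest, $X$ the variable or set of variables, $h$ the differential entropy and $p$ the joint or conditional density.

```latex
I(X; Y) = h(Y) - h(Y \mid X),
\qquad
h(Y \mid X) = -\iint p(x, y) \, \log p(y \mid x) \, \mathrm{d}x \, \mathrm{d}y
```

Since $h(Y)$ does not depend on $X$, ranking candidate variables by decreasing mutual information is equivalent to ranking them by increasing conditional differential entropy.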
Mutual information and conditional differential entropy are estimated nonparametrically using [Greg Ver Steeg's implementation](http://www.isi.edu/~gregv/npeet.html). More details are given in the `assessment` directory, which evaluates the properties of each available statistic and provides further mathematical context and references.
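To give a feel for what a mutual information value means, here is a minimal sketch that uses only the Python standard library. It is not the nonparametric estimator used by the package: it assumes that the data are jointly Gaussian, in which case the mutual information has the closed form $-\frac{1}{2}\ln(1-\rho^2)$, where $\rho$ is the correlation coefficient. All function and variable names are hypothetical.

```python
import math
import random


def pearson_correlation(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def gaussian_mutual_information(x, y):
    """Mutual information in nats, ASSUMING (x, y) is jointly Gaussian.

    Uses the closed form I(X; Y) = -0.5 * ln(1 - rho^2). This is a parametric
    stand-in, not the nearest-neighbour estimator used by the package.
    """
    rho = pearson_correlation(x, y)
    return -0.5 * math.log(1.0 - rho**2)


# Synthetic data: y = x + noise, so the true correlation is 1 / sqrt(2)
# and the true mutual information is 0.5 * ln(2) nats.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(5000)]
y = [a + rng.gauss(0.0, 1.0) for a in x]
z = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # drawn independently of y

mi_informative = gaussian_mutual_information(x, y)
mi_uninformative = gaussian_mutual_information(z, y)
print(f"I(x; y) = {mi_informative:.3f} nats")
print(f"I(z; y) = {mi_uninformative:.3f} nats")
```

In this toy setting, `x` is informative about `y` while `z` is not, so the first value should be clearly larger than the second. A nonparametric estimator is needed as soon as the Gaussian assumption cannot be justified.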
If you're interested in other metrics, it's possible to add and use them.
### Uncertainty on estimations
Uncertainty in the estimation of the above statistics can arise from various sources:
- the variance of the estimator,
- statistical fluctuations of samples from the same distribution.
To account for these uncertainties and to be able to compare different values properly, we propose implementations of several approaches, based on bootstrapping or subsampling.
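The bootstrap idea can be sketched as follows, using only the standard library. This is an illustration under stated assumptions, not the package's implementation: `bootstrap_std` and `pearson_correlation` are hypothetical names, and the Pearson correlation is used as a simple stand-in for the statistic of interest.

```python
import math
import random


def pearson_correlation(x, y):
    """Sample Pearson correlation, used here as a stand-in statistic."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def bootstrap_std(statistic, x, y, n_resamples=200, seed=0):
    """Bootstrap estimate of the standard deviation of `statistic(x, y)`.

    Pairs (x[i], y[i]) are resampled together, with replacement, so that
    the dependence between the two variables is preserved.
    """
    rng = random.Random(seed)
    n = len(x)
    values = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(statistic([x[i] for i in idx], [y[i] for i in idx]))
    mean = sum(values) / n_resamples
    var = sum((v - mean) ** 2 for v in values) / (n_resamples - 1)
    return math.sqrt(var)


# Synthetic data: y = x + noise.
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(500)]
y = [a + rng.gauss(0.0, 1.0) for a in x]

rho_hat = pearson_correlation(x, y)
rho_std = bootstrap_std(pearson_correlation, x, y)
print(f"correlation = {rho_hat:.3f} +/- {rho_std:.3f}")
```

Two estimated values can then be considered different only if their gap is large compared with these error bars. Note that resampling with replacement creates duplicated points, which can be a concern for nearest-neighbour estimators; subsampling without replacement avoids that issue at the cost of smaller samples.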
### Estimation for different ranges of values
The heart of `InfoVar` lies in the fact that the informativity of a variable on a quantity of interest can vary with the selected range of values of this quantity.
For example, if we're interested in house prices in California (see `examples/california-housing`), among a set of variables, geographical location (latitude, longitude) appears to be the most important pair of variables. However, if we restrict ourselves to the 10% most expensive homes, it appears that the number of rooms in the house becomes most useful. This type of observation is important, for example, from a data analysis point of view, but also in a variable selection context.
More generally, taking into account these variations as a function of ranges of values of the variable of interest enables more refined analysis of phenomena. To help you understand, here are a few examples of possible applications.
Determining factors on ...
**... students' grades as a function of the grade obtained.**
- *Data of interest:* student marks on an exam.
- *Variables:* time spent working at home, missed lessons, parents' income, etc.
**... number of species in forests.**
- *Data of interest:* number of species.
- *Variables:* forest age, humidity, distance to nearest town, number of visitors per day, etc.
**... the number of medals a country has won at the Olympic Games.**
- *Data of interest:* number of medals won by each country in each of the last 10 editions of the games.
- *Variables:* amount invested by the national Olympic committee, population, per capita income, unemployment rate, etc.
It is also possible to perform the same analysis, but according to the value range of *another* variable.
**... the average annual temperature in a city as a function of altitude.**
- *Data of interest:* average temperature.
- *Variables:* duration of sunshine, percentage of vegetated land, altitude.
**... the number of medals won by a country at the Olympic Games as a function of its population.**
- *Data of interest:* number of medals won by each country in each of the last 10 editions of the games.
- *Variables:* amount invested by the national Olympic committee, population, per capita income, unemployment rate.
`InfoVar` allows you to perform this sensitivity analysis in two ways:
1. Define fixed intervals for the quantity that varies (example: houses priced below $150k, between $150k and $350k, and above $350k).
2. Define a sliding window and calculate the evolution of the statistics almost continuously.
In case 1 (discrete case), the `DiscreteHandler` class provides all the important functions for calculating, storing and accessing results. In case 2 (continuous case), the `ContinuousHandler` class is used. The notebooks in `examples` give an example of the use of each of these two classes.
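The difference between the two modes can be sketched with plain Python. The code below does not use the `DiscreteHandler` or `ContinuousHandler` API (see the notebooks for that): every function name is hypothetical, and the Pearson correlation stands in for the statistic of interest.

```python
import math
import random


def pearson_correlation(x, y):
    """Sample Pearson correlation, used here as a stand-in statistic."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def statistic_in_range(statistic, x, y, lo, hi, min_samples=10):
    """Evaluate `statistic` on the samples whose y value lies in [lo, hi)."""
    pairs = [(a, b) for a, b in zip(x, y) if lo <= b < hi]
    if len(pairs) < min_samples:
        return None  # too few samples for a meaningful estimate
    xs, ys = zip(*pairs)
    return statistic(xs, ys)


def discrete_analysis(statistic, x, y, edges):
    """Case 1: one value per fixed interval [edges[i], edges[i + 1])."""
    return [
        ((lo, hi), statistic_in_range(statistic, x, y, lo, hi))
        for lo, hi in zip(edges[:-1], edges[1:])
    ]


def sliding_analysis(statistic, x, y, start, stop, width, step):
    """Case 2: one value per window of size `width`, moved by `step`."""
    results = []
    i = 0
    while start + i * step + width <= stop + 1e-9:
        lo = start + i * step
        value = statistic_in_range(statistic, x, y, lo, lo + width)
        results.append((lo + width / 2, value))  # (window center, value)
        i += 1
    return results


# Synthetic data: x is informative about y only when y >= 2.
rng = random.Random(0)
y = [rng.uniform(0.0, 4.0) for _ in range(2000)]
x = [b + rng.gauss(0.0, 0.2) if b >= 2.0 else rng.gauss(0.0, 1.0) for b in y]

discrete = discrete_analysis(pearson_correlation, x, y, edges=[0.0, 2.0, 4.0])
sliding = sliding_analysis(pearson_correlation, x, y, 0.0, 4.0, width=1.0, step=0.5)
for interval, value in discrete:
    print(interval, round(value, 3))
```

The discrete mode gives a small number of easily comparable values, while the sliding window shows where along the range the informativity changes, at the cost of fewer samples per estimate.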
## Associated packages
[**A&A papers repository**](https://github.com/einigl/informative-obs-paper): Reproduce the results in Einig et al. (2024, 2025)
[**IRAM 30m EMIR informative observables**](https://github.com/einigl/iram-30m-emir-obs-info): Informativity of molecular lines to estimate astrophysical parameters.
## References
[1] Einig, L. & Palud, P. & Roueff, A. & Pety, J. & Bron, E. & Le Petit, F. & Gerin, M. & Chanussot, J. & Chainais, P. & Thouvenin, P.-A. & Languignon, D. & Bešlić, I. & Coudé, S. & Mazurek, H. & Orkisz, J. H. & Santa-Maria, M. G. & Ségal, L. & Zakardjian, A. & Bardeau, S. & Demyk, K. & de Souza Magalhães, V. & Goicoechea, J. R. & Gratier, P. & Guzmán, V. V. & Hughes, A. & Levrier, F. & Le Bourlot, J. & Lis, D. C. & Liszt, H. S. & Peretto, N. & Roueff, E. & Sievers, A. (2024).
**Quantifying the informativity of emission lines to infer physical conditions in giant molecular clouds. I. Application to model predictions.** *Astronomy & Astrophysics.*
10.xxxx/xxxx-xxxx/xxxxxxxxx.
[2] Einig, L. et al. (2024, in prep.).
**Quantifying the informativity of emission lines to infer physical conditions in giant molecular clouds. II. Training robust models from selected observations.** *Astronomy & Astrophysics.*
10.xxxx/xxxx-xxxx/xxxxxxxxx.
Raw data
{
"_id": null,
"home_page": "https://github.com/einigl/infovar",
"name": "infovar",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "statistics, information theory",
"author": "Lucas Einig",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/3a/15/746aa3955ed0312714249949581a02aa7d962d1c42eb6a4bce1207300fdb/infovar-0.2.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Informative variables (InfoVar)",
"version": "0.2.0",
"project_urls": {
"Bug Tracker": "https://github.com/einigl/infovar/issues",
"Homepage": "https://github.com/einigl/infovar",
"Repository": "https://github.com/einigl/infovar"
},
"split_keywords": [
"statistics",
" information theory"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "6925f173598eae6d74e3d112058a2c0e3caa3328ec5da07192bcc264616f5b52",
"md5": "b5f55810771aa128c7a8b83920b4a244",
"sha256": "b3ddef4a75b71c46f28782892f09593d7894362f4718327d2c815aec14c8a0f6"
},
"downloads": -1,
"filename": "infovar-0.2.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b5f55810771aa128c7a8b83920b4a244",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 30974,
"upload_time": "2024-10-02T13:01:55",
"upload_time_iso_8601": "2024-10-02T13:01:55.520990Z",
"url": "https://files.pythonhosted.org/packages/69/25/f173598eae6d74e3d112058a2c0e3caa3328ec5da07192bcc264616f5b52/infovar-0.2.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "3a15746aa3955ed0312714249949581a02aa7d962d1c42eb6a4bce1207300fdb",
"md5": "ed6e6c549edd805524be9e6a405a8d19",
"sha256": "19ede53583fb4e7b98afd4a8434fcba43197d96c85b9fafd5c31f20f1ebcf97b"
},
"downloads": -1,
"filename": "infovar-0.2.0.tar.gz",
"has_sig": false,
"md5_digest": "ed6e6c549edd805524be9e6a405a8d19",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 27389,
"upload_time": "2024-10-02T13:01:56",
"upload_time_iso_8601": "2024-10-02T13:01:56.530347Z",
"url": "https://files.pythonhosted.org/packages/3a/15/746aa3955ed0312714249949581a02aa7d962d1c42eb6a4bce1207300fdb/infovar-0.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-02 13:01:56",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "einigl",
"github_project": "infovar",
"travis_ci": false,
"coveralls": true,
"github_actions": true,
"lcname": "infovar"
}