.. -*- mode: rst -*-
|GitHubActions|_ |ReadTheDocs|_ |License|_ |PythonVersion|_ |PyPi|_ |Release|_ |Commits|_ |Codecov|_
.. |GitHubActions| image:: https://github.com/Quantmetry/qolmat/actions/workflows/test.yml/badge.svg
.. _GitHubActions: https://github.com/Quantmetry/qolmat/actions
.. |ReadTheDocs| image:: https://readthedocs.org/projects/qolmat/badge
.. _ReadTheDocs: https://qolmat.readthedocs.io/en/latest
.. |License| image:: https://img.shields.io/github/license/Quantmetry/qolmat
.. _License: https://github.com/Quantmetry/qolmat/blob/main/LICENSE
.. |PythonVersion| image:: https://img.shields.io/pypi/pyversions/qolmat
.. _PythonVersion: https://pypi.org/project/qolmat/
.. |PyPi| image:: https://img.shields.io/pypi/v/qolmat
.. _PyPi: https://pypi.org/project/qolmat/
.. |Release| image:: https://img.shields.io/github/v/release/Quantmetry/qolmat
.. _Release: https://github.com/Quantmetry/qolmat
.. |Commits| image:: https://img.shields.io/github/commits-since/Quantmetry/qolmat/latest/main
.. _Commits: https://github.com/Quantmetry/qolmat/commits/main
.. |Codecov| image:: https://codecov.io/gh/quantmetry/qolmat/branch/main/graph/badge.svg
.. _Codecov: https://codecov.io/gh/quantmetry/qolmat
.. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/logo.png
    :align: center
Qolmat - The Tool for Data Imputation
======================================
**Qolmat** provides a convenient way to estimate optimal data imputation techniques by leveraging scikit-learn-compatible algorithms. Users can compare various methods based on different evaluation metrics.
🔗 Requirements
===============
Python 3.9 to 3.12
🛠 Installation
===============
Qolmat can be installed in several ways:

.. code:: sh

    $ pip install qolmat  # installation via `pip`
    $ pip install qolmat[pytorch]  # if you need ImputerDiffusion, which relies on PyTorch
    $ pip install git+https://github.com/Quantmetry/qolmat  # or directly from the GitHub repository
⚡️ Quickstart
==============
Let us start with a basic imputation problem.
We load the ``Beijing`` dataset with ``data.get_data``, keep three of its columns, and add artificial holes to them with ``data.add_holes``.
With just these few lines of code, you can see how easy it is to
- impute missing values with one particular imputer;
- benchmark multiple imputation methods with different metrics.
.. code-block:: python

    import numpy as np
    import pandas as pd

    from qolmat.benchmark import comparator, missing_patterns
    from qolmat.imputations import imputers
    from qolmat.utils import data

    # load and prepare csv data
    df_data = data.get_data("Beijing")
    columns = ["TEMP", "PRES", "WSPM"]
    df_data = df_data[columns]
    df_with_nan = data.add_holes(df_data, ratio_masked=0.2, mean_size=120)

    # impute and compare
    imputer_median = imputers.ImputerSimple(groups=("station",))
    imputer_interpol = imputers.ImputerInterpolation(method="linear", groups=("station",))
    imputer_var1 = imputers.ImputerEM(
        model="VAR", groups=("station",), method="mle",
        max_iter_em=50, n_iter_ou=15, dt=1e-3, p=1,
    )
    dict_imputers = {
        "median": imputer_median,
        "interpolation": imputer_interpol,
        "VAR(1) process": imputer_var1,
    }
    generator_holes = missing_patterns.EmpiricalHoleGenerator(n_splits=4, ratio_masked=0.1)
    comparison = comparator.Comparator(
        dict_imputers,
        generator_holes=generator_holes,
        metrics=["mae", "wmape", "kl_columnwise", "frechet"],
    )
    results = comparison.compare(df_with_nan)
    results.style.highlight_min(color="lightsteelblue", axis=1)
.. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/readme_tabular_comparison.png
    :align: center
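
The metric names passed to the comparator above include ``mae`` and ``wmape``. As a rough illustration of what such metrics measure, here is a minimal pandas sketch using common textbook definitions, evaluated only on the masked entries. The helper functions and the toy data are assumptions made for this example; they are not Qolmat's implementation, whose exact definitions may differ.

```python
import pandas as pd


def mae(df_true, df_imputed, mask):
    """Mean absolute error per column, over the masked entries only."""
    return (df_true - df_imputed).abs().where(mask).mean()


def wmape(df_true, df_imputed, mask):
    """Sum of absolute errors divided by the sum of absolute true values."""
    abs_err = (df_true - df_imputed).abs().where(mask).sum()
    abs_true = df_true.abs().where(mask).sum()
    return abs_err / abs_true


# Toy data: the two middle values were hidden, then imputed.
df_true = pd.DataFrame({"TEMP": [10.0, 12.0, 14.0, 16.0]})
mask = pd.DataFrame({"TEMP": [False, True, True, False]})
df_imputed = pd.DataFrame({"TEMP": [10.0, 13.0, 12.0, 16.0]})

mae_scores = mae(df_true, df_imputed, mask)      # TEMP: (1 + 2) / 2 = 1.5
wmape_scores = wmape(df_true, df_imputed, mask)  # TEMP: 3 / 26
print(mae_scores, wmape_scores, sep="\n")
```

Both scores are computed on the masked positions only, since these are the only entries for which the true value is known and was hidden from the imputer.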
📘 Documentation
================
The full documentation can be found `on this link <https://qolmat.readthedocs.io/en/latest/>`_.
**How does Qolmat work?**
Qolmat allows model selection for scikit-learn-compatible imputation algorithms by performing the three steps pictured below:
1) For each of the K folds, Qolmat artificially masks a set of observed values using a default or user-specified `hole generator <explanation.html#hole-generator>`_.
2) For each fold and each compared `imputation method <imputers.html>`_, Qolmat fills both the missing and the masked values, then computes each of the default or user-specified `performance metrics <explanation.html#metrics>`_.
3) For each compared imputer, Qolmat pools the computed metrics from the K folds into a single value.
This is very similar in spirit to the `cross_val_score <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html>`_ function in scikit-learn.
.. image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/schema_qolmat.png
    :align: center
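
The three steps above can be sketched with plain NumPy and pandas. This is only an illustration of the idea under simplifying assumptions: holes are drawn uniformly at random, the only metric is the mean absolute error, and folds are pooled by averaging. The functions ``impute_mean``, ``impute_median`` and ``benchmark`` are hypothetical helpers written for this example, not part of Qolmat's API.

```python
import numpy as np
import pandas as pd


def impute_mean(df):
    """Fill each column's NaNs with that column's mean."""
    return df.fillna(df.mean())


def impute_median(df):
    """Fill each column's NaNs with that column's median."""
    return df.fillna(df.median())


def benchmark(df, imputers, n_folds=4, ratio_masked=0.1, seed=0):
    """Return the MAE of each imputer, averaged over the folds."""
    rng = np.random.default_rng(seed)
    observed = df.notna().to_numpy()
    scores = {name: [] for name in imputers}
    for _ in range(n_folds):
        # Step 1: hide a random subset of the observed entries.
        mask = observed & (rng.random(df.shape) < ratio_masked)
        df_masked = df.mask(mask)
        for name, impute in imputers.items():
            # Step 2: impute, then score on the hidden entries only.
            df_imputed = impute(df_masked)
            errors = (df_imputed - df).abs().to_numpy()[mask]
            scores[name].append(errors.mean())
    # Step 3: pool the per-fold scores into one value per imputer.
    return pd.Series({name: np.mean(vals) for name, vals in scores.items()})


# Toy data: Gaussian noise with about 20% genuinely missing values.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
df = df.mask(rng.random(df.shape) < 0.2)

results = benchmark(df, {"mean": impute_mean, "median": impute_median})
print(results)
```

Only entries that were observed in the input are ever masked, so every score is computed against a known true value, which is what makes the comparison between imputers meaningful.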
**Imputation methods**
The following table contains the available imputation methods. We distinguish single imputation methods (aiming for pointwise accuracy, mostly deterministic) from multiple imputation methods (aiming for distribution similarity, mostly stochastic). For further details regarding the distinction between single and multiple imputation, you can refer to the `Imputation article <https://en.wikipedia.org/wiki/Imputation_(statistics)>`_ on Wikipedia.
.. list-table::
   :widths: 25 70 15 15
   :header-rows: 1

   * - Method
     - Description
     - Tabular or Time series
     - Single or Multiple
   * - mean
     - Imputes the missing values using the mean along each column
     - tabular
     - single
   * - median
     - Imputes the missing values using the median along each column
     - tabular
     - single
   * - LOCF
     - Imputes missing entries by carrying the last observation forward in each column
     - time series
     - single
   * - shuffle
     - Imputes missing entries with a random value drawn from each column
     - tabular
     - multiple
   * - interpolation
     - Imputes missing values using one of the interpolation strategies supported by pd.Series.interpolate
     - time series
     - single
   * - impute on residuals
     - The series are de-seasonalised, residuals are imputed via linear interpolation, then residuals are re-seasonalised
     - time series
     - single
   * - MICE
     - Multiple Imputation by Chained Equations
     - tabular
     - both
   * - RPCA
     - Robust Principal Component Analysis
     - both
     - single
   * - SoftImpute
     - Iterative method for matrix completion that uses nuclear-norm regularization
     - tabular
     - single
   * - KNN
     - K-nearest neighbors
     - tabular
     - single
   * - EM sampler
     - Imputes missing values via the EM algorithm
     - both
     - both
   * - MLP
     - Imputer based on a multilayer perceptron model
     - both
     - both
   * - Autoencoder
     - Imputer based on an autoencoder model with a variational method
     - both
     - both
   * - TabDDPM
     - Imputer based on Denoising Diffusion Probabilistic Models
     - both
     - both
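
To make the distinction between deterministic and stochastic imputation concrete, the following sketch reproduces the idea behind four of the simplest methods in the table (mean, LOCF, interpolation, shuffle) with plain pandas on a toy series. It is an assumption-laden illustration of the concepts, not the code of Qolmat's imputers.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])
is_missing = s.isna()

# Deterministic imputations: the same input always gives the same output.
imputed_mean = s.fillna(s.mean())                # fills with 10 / 3
imputed_locf = s.ffill()                         # [1, 1, 3, 3, 3, 6]
imputed_linear = s.interpolate(method="linear")  # [1, 2, 3, 4, 5, 6]

# Stochastic imputation in the spirit of "shuffle": each hole receives a
# value drawn at random from the observed values of the same series.
rng = np.random.default_rng(0)
observed = s.dropna().to_numpy()
imputed_shuffle = s.copy()
imputed_shuffle[is_missing] = rng.choice(observed, size=int(is_missing.sum()))

print(imputed_locf.tolist())
print(imputed_linear.tolist())
print(imputed_shuffle.tolist())
```

Running the stochastic variant with different seeds yields different completed series, which is what allows multiple imputation to reflect the uncertainty about the missing values rather than returning a single point estimate.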
📝 Contributing
===============
You are welcome to propose and contribute new ideas.
We encourage you to `open an issue <https://github.com/quantmetry/qolmat/issues>`_ so that we can align on the work to be done.
It is generally a good idea to have a quick discussion before opening a pull request that is potentially out-of-scope.
For more information on the contribution process, please see the `contributing guide <https://github.com/Quantmetry/qolmat/blob/main/CONTRIBUTING.rst>`_.
🤝 Affiliation
================
Qolmat has been developed by Quantmetry.
|Quantmetry|_
.. |Quantmetry| image:: https://raw.githubusercontent.com/Quantmetry/qolmat/main/docs/images/quantmetry.png
    :width: 150
.. _Quantmetry: https://www.quantmetry.com/
🔍 References
==============
[1] Candès, Emmanuel J., et al. “Robust principal component analysis?”
Journal of the ACM (JACM) 58.3 (2011): 1-37,
(`pdf <https://arxiv.org/abs/0912.3599>`__)
[2] Wang, Xuehui, et al. “An improved robust principal component
analysis model for anomalies detection of subway passenger flow.”
Journal of advanced transportation 2018 (2018).
(`pdf <https://www.hindawi.com/journals/jat/2018/7191549/>`__)
[3] Chen, Yuxin, et al. “Bridging convex and nonconvex optimization in
robust PCA: Noise, outliers, and missing data.” Annals of statistics, 49(5), 2948 (2021), (`pdf <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9491514/pdf/nihms-1782570.pdf>`__)
[4] Shahid, Nauman, et al. “Fast robust PCA on graphs.” IEEE Journal of
Selected Topics in Signal Processing 10.4 (2016): 740-756.
(`pdf <https://arxiv.org/abs/1507.08173>`__)
[5] Feng, Jiashi, et al. “Online robust PCA via stochastic optimization.” Advances in neural information processing systems, 26, 2013.
(`pdf <https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.721.7506&rep=rep1&type=pdf>`__)
[6] García, S., Luengo, J., & Herrera, F. "Data preprocessing in data mining". 2015.
(`pdf <https://www.academia.edu/download/60477900/Garcia__Luengo__Herrera-Data_Preprocessing_in_Data_Mining_-_Springer_International_Publishing_201520190903-77973-th1o73.pdf>`__)
[7] Botterman, HL., Roussel, J., Morzadec, T., Jabbari, A., Brunel, N. "Robust PCA for Anomaly Detection and Data Imputation in Seasonal Time Series" (2022) in International Conference on Machine Learning, Optimization, and Data Science. Cham: Springer Nature Switzerland, (`pdf <https://link.springer.com/chapter/10.1007/978-3-031-25891-6_21>`__)
📝 License
==========
Qolmat is free and open-source software licensed under the `BSD 3-Clause license <https://github.com/quantmetry/qolmat/blob/main/LICENSE>`_.
Raw data
{
"_id": null,
"home_page": "https://github.com/Quantmetry/qolmat",
"name": "qolmat",
"maintainer": null,
"docs_url": null,
"requires_python": "<3.13,>=3.9",
"maintainer_email": null,
"keywords": "imputation",
"author": "Julien ROUSSEL",
"author_email": "julien.roussel@capgemini.com",
"download_url": "https://files.pythonhosted.org/packages/15/98/8ee56c252dd10d290f7d8f165172339032ac748d56823d6a38f7adc25147/qolmat-0.1.10.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "BSD-3-Clause",
"summary": "A Python library for optimal data imputation.",
"version": "0.1.10",
"project_urls": {
"Bug Tracker": "https://github.com/Quantmetry/qolmat",
"Documentation": "https://qolmat.readthedocs.io/en/latest/",
"Homepage": "https://github.com/Quantmetry/qolmat",
"Repository": "https://github.com/Quantmetry/qolmat",
"Source Code": "https://github.com/Quantmetry/qolmat"
},
"split_keywords": [
"imputation"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "943543787132f66a32613bad0cdac5ce76f6bc843e834b2787526f1108309736",
"md5": "e3583335d4ceb6a5e3be10df0ac686b4",
"sha256": "451f3d8a4f14024adcadcb5c7bee892e77a0f9f77f50eb962288efb383b3db7c"
},
"downloads": -1,
"filename": "qolmat-0.1.10-py3-none-any.whl",
"has_sig": false,
"md5_digest": "e3583335d4ceb6a5e3be10df0ac686b4",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<3.13,>=3.9",
"size": 16428750,
"upload_time": "2025-08-30T19:53:42",
"upload_time_iso_8601": "2025-08-30T19:53:42.451417Z",
"url": "https://files.pythonhosted.org/packages/94/35/43787132f66a32613bad0cdac5ce76f6bc843e834b2787526f1108309736/qolmat-0.1.10-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "15988ee56c252dd10d290f7d8f165172339032ac748d56823d6a38f7adc25147",
"md5": "bb8d47b205c112ada75926d0aabb1787",
"sha256": "a433da173cd3c6bc46eb61b8cb1ec666dda8ff7eb15dd3e5bf9df94a89b5e1a4"
},
"downloads": -1,
"filename": "qolmat-0.1.10.tar.gz",
"has_sig": false,
"md5_digest": "bb8d47b205c112ada75926d0aabb1787",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<3.13,>=3.9",
"size": 16039570,
"upload_time": "2025-08-30T19:53:45",
"upload_time_iso_8601": "2025-08-30T19:53:45.193502Z",
"url": "https://files.pythonhosted.org/packages/15/98/8ee56c252dd10d290f7d8f165172339032ac748d56823d6a38f7adc25147/qolmat-0.1.10.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-30 19:53:45",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Quantmetry",
"github_project": "qolmat",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "qolmat"
}