# mdatagen: A Python Library for the Generation of Artificial Missing Data
![Python3](https://img.shields.io/badge/Language-Python3-steelblue)
[![License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Documentation](https://img.shields.io/badge/Documentation-Link-green.svg)](docs/)
[![Latest PyPI Version](https://img.shields.io/pypi/v/mdatagen.svg)](https://pypi.org/project/mdatagen/)
## Overview
This package has been developed to address a gap in machine learning research, specifically the artificial generation of missing data. Santos et al. (2019) provided a survey that presents various strategies for both univariate and multivariate scenarios, but the Python community still needs implementations of these strategies. Our Python library **missing-data-generator** (mdatagen) puts forward a comprehensive set of implementations of missing data mechanisms, covering Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR), allowing users to simulate several real-world scenarios comprising absent observations. The library is designed for easy integration with existing Python-based data analysis workflows, including well-established modules such as scikit-learn, and popular libraries for missing data visualization, such as missingno, enhancing its accessibility and usability for researchers.
This Python package is a collaboration between researchers at the Aeronautics Institute of Technology (Brazil) and the University of Coimbra (Portugal).
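To make the three mechanisms named in the overview concrete, here is a minimal sketch that uses only NumPy on synthetic data. It does not use the mdatagen API and is not the library's implementation; the variable names (`x_obs`, `x_miss`, `rate`) and the "remove the k smallest" rule are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 1_000
x_obs = rng.normal(size=n)            # feature that stays fully observed
x_miss = x_obs + rng.normal(size=n)   # feature that will receive NaNs
rate = 0.25                           # target fraction of missing values
k = int(rate * n)

# MCAR: the k positions are drawn uniformly at random.
mcar = x_miss.copy()
mcar[rng.choice(n, size=k, replace=False)] = np.nan

# MAR: the positions depend only on the observed feature
# (here, the rows with the k smallest values of x_obs).
mar = x_miss.copy()
mar[np.argsort(x_obs)[:k]] = np.nan

# MNAR: the positions depend on the values being removed
# (here, the k smallest values of x_miss itself).
mnar = x_miss.copy()
mnar[np.argsort(x_miss)[:k]] = np.nan

for name, col in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, "missing rate:", np.isnan(col).mean(),
          "mean of remaining values:", np.nanmean(col))
```

All three columns end up with the same missing rate, but only the MCAR column keeps a mean close to that of the complete data, which is why the mechanism matters when benchmarking imputation methods.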
## User Guide
Please refer to the [univariate docs](docs/univariate.md) or [multivariate docs](docs/multivariate.md) for more implementation details.
### Installation
To install the package, please use the `pip` installation as follows:
```bash
pip install mdatagen
```
## API Usage
API usage is demonstrated in the following notebooks:
- [MCAR univariate example](docs/mcar_univariate_example.ipynb)
- [MNAR univariate example](docs/mnar_univariate_example.ipynb)
- [MAR univariate example](docs/mar_univariate_example.ipynb)
- [MNAR Multivariate Examples](docs/mnar_multivariate_examples.ipynb)
- [Novel MNAR Multivariate mechanism](docs/novel_mnar_multivariate_example.ipynb)
- [Evaluation of Imputation Quality](docs/evaluation_imputation_quality.ipynb)
- [Visualization Plots](docs/examples_plots.ipynb)
- [Complete Pipeline Example](docs/complete_pipeline_example.ipynb)
### Code examples
To get started, the examples below show basic usage of the MAR mechanism in both univariate and multivariate
scenarios. They also illustrate how to draw the histogram plot and how to evaluate imputation
quality.
### MAR univariate
```python
import pandas as pd
from sklearn.datasets import load_iris
from mdatagen.univariate.uMAR import uMAR
# Load the data
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data,
                       columns=iris.feature_names)
X = iris_df.copy()  # Features
y = iris.target  # Label values
# x_miss receives the missing values; x_obs is the observed feature
# that drives the missingness
generator = uMAR(X=X,
                 y=y,
                 missing_rate=50,
                 x_miss='sepal length (cm)',
                 x_obs='petal length (cm)')
# Generate missing data under the univariate MAR mechanism
generate_data = generator.rank()
print(generate_data.isna().sum())
```
### MAR multivariate
```python
import pandas as pd
from sklearn.datasets import load_iris
from mdatagen.multivariate.mMAR import mMAR
# Load the data
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data,
                       columns=iris.feature_names)
X = iris_df.copy()  # Features
y = iris.target  # Label values
generator = mMAR(X=X,
                 y=y)
# Generate missing data under the multivariate MAR mechanism
generate_data = generator.correlated(missing_rate=25)
print(generate_data.isna().sum())
```
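The MAR examples above print per-column NaN counts. To check the result against the requested `missing_rate`, it can help to convert those counts to percentages. The sketch below uses a small made-up DataFrame as a stand-in for the generator output; the column names `a` and `b` are hypothetical, and the `target` column is assumed to be present because the evaluation example in this README drops it.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the generator output (values are made up).
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, 2.0, 3.0, 4.0],
    "target": [0, 1, 0, 1],
})

# Percentage of missing values in each column.
per_column = df.isna().mean() * 100

# Percentage of missing cells over all feature columns (label excluded).
overall = df.drop(columns="target").isna().to_numpy().mean() * 100

print(per_column)
print(f"overall feature missing rate: {overall:.1f}%")
```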
### Histogram plot
```python
import pandas as pd
from sklearn.datasets import load_iris
from mdatagen.univariate.uMCAR import uMCAR
from mdatagen.plots.plot import PlotMissingData
# Load the data
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data,
                       columns=iris.feature_names)
X = iris_df.copy()  # Features
y = iris.target  # Label values
# Create an instance with a missing rate
# of 25% in the dataset under the MCAR mechanism
generator = uMCAR(X=X,
                  y=y,
                  missing_rate=25,
                  x_miss='petal length (cm)')
# Generate the missing data under the MCAR mechanism
generate_data = generator.random()
# Compare the generated data against the original data
miss_plot = PlotMissingData(data_missing=generate_data,
                            data_original=X)
miss_plot.visualize_miss("histogram",
                         col_missing="petal length (cm)",
                         save=True,
                         path_save_fig="MCAR_iris.png")
```
### Imputation Quality Evaluation: Mean Squared Error (MSE)
```python
import pandas as pd
from sklearn.datasets import load_iris
from mdatagen.univariate.uMCAR import uMCAR
from mdatagen.metrics.metrics import EvaluateImputation
# Load the data
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data,
                       columns=iris.feature_names)
X = iris_df.copy()  # Features
y = iris.target  # Label values
# Create an instance with a missing rate
# of 25% in the dataset under the MCAR mechanism
generator = uMCAR(X=X,
                  y=y,
                  missing_rate=25,
                  x_miss='petal length (cm)')
# Generate the missing data under the MCAR mechanism
generate_data = generator.random()
# Impute with zeros (a deliberately naive baseline), then calculate the MSE
fill_zero = generate_data.drop("target", axis=1).fillna(0)
eval_metric = EvaluateImputation(
    data_imputed=fill_zero,
    data_original=X,
    metric="mean_squared_error")
print(eval_metric.show())
```
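The example above scores a zero-fill baseline. As a rough illustration of why the choice of imputer matters, the following sketch compares zero filling with mean filling on synthetic data using only NumPy and pandas. The `mse` helper is hypothetical and written for this sketch; it is not claimed to match how `EvaluateImputation` computes its metric.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
original = pd.DataFrame({"x": rng.normal(loc=5.0, scale=1.0, size=200)})

# Hide 25% of the entries completely at random.
missing = original.copy()
hidden = rng.choice(len(missing), size=50, replace=False)
missing.loc[hidden, "x"] = np.nan


def mse(imputed: pd.DataFrame, reference: pd.DataFrame) -> float:
    """Mean squared error over every cell of the two frames."""
    return float(((imputed - reference) ** 2).to_numpy().mean())


zero_filled = missing.fillna(0)               # naive baseline
mean_filled = missing.fillna(missing.mean())  # column mean of observed values

print("MSE, zero fill:", mse(zero_filled, original))
print("MSE, mean fill:", mse(mean_filled, original))
```

Because the synthetic feature is centred far from zero, the zero-fill error is much larger than the mean-fill error; on real data the gap depends on the feature's scale and on the missing mechanism.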
## Contributions
Contributions are welcome! Feel free to open issues, submit pull requests, or provide feedback.
## Citation
If you use **mdatagen** in your research, please cite the [original paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8605316).
BibTeX entry:
```bibtex
@ARTICLE{Santos2019,
author={Santos, Miriam Seoane and Pereira, Ricardo Cardoso and Costa, Adriana Fonseca and Soares, Jastin Pompeu and Santos, João and Abreu, Pedro Henriques},
journal={IEEE Access},
title={Generating Synthetic Missing Data: A Review by Missing Mechanism},
year={2019},
volume={7},
number={},
pages={11651-11667},
doi={10.1109/ACCESS.2019.2891360}}
```
## Acknowledgements
The authors gratefully acknowledge the Brazilian funding agency FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo) under grants 2022/10553-6, 2023/13688-2, and 2021/06870-3. Moreover, this research was supported in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, and by the Portuguese Recovery and Resilience Plan (PRR) through project C645008882-00000055, Center for Responsible AI.