mdatagen


- Name: mdatagen
- Version: 0.1.63
- Home page: https://github.com/ArthurMangussi/pymdatagen
- Summary: mdatagen: A Python Library for the Generation of Artificial Missing Data
- Upload time: 2024-08-13 19:27:13
- Author: Arthur Dantas Mangussi
- Requires Python: >=3.10.12
- License: MIT
- Keywords: machine learning, preprocessing data

# mdatagen: A Python Library for the Generation of Artificial Missing Data

![Python3](https://img.shields.io/badge/Language-Python3-steelblue)
[![License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Documentation](https://img.shields.io/badge/Documentation-Link-green.svg)](docs/)
[![Latest PyPI Version](https://img.shields.io/pypi/v/mdatagen.svg)](https://pypi.org/project/mdatagen/)


## Overview
This package has been developed to address a gap in machine learning research, specifically the artificial generation of missing data. Santos et al. (2019) provided a survey that presents various strategies for both univariate and multivariate scenarios, but the Python community still needs implementations of these strategies. Our Python library **missing-data-generator** (mdatagen) puts forward a comprehensive set of implementations of missing data mechanisms, covering Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR), allowing users to simulate several real-world scenarios comprising absent observations. The library is designed for easy integration with existing Python-based data analysis workflows, including well-established modules such as scikit-learn, and popular libraries for missing data visualization, such as missingno, enhancing its accessibility and usability for researchers.
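
To make the three mechanisms concrete, here is a minimal NumPy sketch of how MCAR, MAR, and MNAR masks differ. It does not use mdatagen and is not how the library implements them; the variable names, the 25% rate, and the "remove the smallest values" rule are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
rate = 0.25                    # illustrative missing rate (25%)
x_obs = rng.normal(size=n)     # feature that stays fully observed
x_miss = rng.normal(size=n)    # feature that will receive missing values

# MCAR: every value has the same chance of being removed
mcar_mask = rng.random(n) < rate

# MAR: missingness in x_miss depends on the *observed* feature x_obs
mar_mask = x_obs <= np.quantile(x_obs, rate)

# MNAR: missingness in x_miss depends on the values of x_miss *itself*
mnar_mask = x_miss <= np.quantile(x_miss, rate)


def apply_mask(values, mask):
    """Return a copy of values with NaN wherever mask is True."""
    out = values.copy()
    out[mask] = np.nan
    return out


x_mcar = apply_mask(x_miss, mcar_mask)
x_mar = apply_mask(x_miss, mar_mask)
x_mnar = apply_mask(x_miss, mnar_mask)
```

Under MNAR the surviving values of `x_miss` are visibly truncated, while under MCAR and MAR (with an independent `x_obs`, as here) their distribution is roughly unchanged.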

This Python package is a collaboration between researchers at the Aeronautics Institute of Technology (Brazil) and the University of Coimbra (Portugal).

## User Guide

Please refer to the [univariate docs](docs/univariate.md) or [multivariate docs](docs/multivariate.md) for more implementation details.


### Installation
To install the package, please use the `pip` installation as follows:

```bash
pip install mdatagen
```
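
After installing, you may want to confirm which version is available in the active environment. The sketch below uses only the standard library; `installed_version` is a hypothetical helper name, not part of mdatagen.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if mdatagen is installed, otherwise None
print(installed_version("mdatagen"))
```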

## API Usage

API usage is described in each of the following notebooks:

- [MCAR univariate example](docs/mcar_univariate_example.ipynb)
- [MNAR univariate example](docs/mnar_univariate_example.ipynb)
- [MAR univariate example](docs/mar_univariate_example.ipynb)
- [MNAR Multivariate Examples](docs/mnar_multivariate_examples.ipynb)
- [Novel MNAR Multivariate mechanism](docs/novel_mnar_multivariate_example.ipynb)
- [Evaluation of Imputation Quality](docs/evaluation_imputation_quality.ipynb)
- [Visualization Plots](docs/examples_plots.ipynb)
- [Complete Pipeline Example](docs/complete_pipeline_example.ipynb)


### Code examples
Below is basic usage of the MAR mechanism in both univariate and multivariate
scenarios to help you get started. We also show how to draw the histogram plot and how to evaluate
imputation quality.

### MAR univariate 
```python
import pandas as pd
from sklearn.datasets import load_iris

from mdatagen.univariate.uMAR import uMAR

# Load the data
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, 
                      columns=iris.feature_names)

X = iris_df.copy()   # Features
y = iris.target      # Label values

generator = uMAR(X=X, 
                  y=y, 
                  missing_rate=50, 
                  x_miss='sepal length (cm)',
                  x_obs='petal length (cm)')

# Generate the missing data under MAR mechanism univariate
generate_data = generator.rank()
print(generate_data.isna().sum())

```
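
One way to sanity-check a univariate MAR result is to compare the share of missing values in the target feature across quantile bins of the observed feature. If missingness depends on the observed feature, the rates should differ between bins. The helper below is a hypothetical pandas-only sketch (`missing_rate_by_bin` is not part of mdatagen), shown on a toy frame so it runs on its own.

```python
import numpy as np
import pandas as pd


def missing_rate_by_bin(df, x_miss, x_obs, bins=4):
    """Percentage of missing x_miss values within quantile bins of x_obs."""
    groups = pd.qcut(df[x_obs], q=bins, duplicates="drop")
    return df[x_miss].isna().groupby(groups, observed=True).mean() * 100


# Toy data: "miss" is removed exactly where "obs" is in its lower half
toy = pd.DataFrame({"obs": np.arange(8.0), "miss": np.arange(8.0)})
toy.loc[toy["obs"] < 4, "miss"] = np.nan

rates = missing_rate_by_bin(toy, x_miss="miss", x_obs="obs", bins=2)
print(rates)  # 100% missing in the lower bin, 0% in the upper bin
```

The same call could be applied to a generated frame, for example with `x_miss='sepal length (cm)'` and `x_obs='petal length (cm)'`.
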
### MAR multivariate

```python
import pandas as pd
from sklearn.datasets import load_iris

from mdatagen.multivariate.mMAR import mMAR

# Load the data
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, 
                      columns=iris.feature_names)

X = iris_df.copy()   # Features
y = iris.target      # Label values

generator = mMAR(X=X, 
                  y=y)

# Generate the missing data under MAR mechanism multivariate
generate_data = generator.correlated(missing_rate=25)
print(generate_data.isna().sum())
```
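
In the multivariate case it can help to look at both the per-column and the overall share of missing cells, since a single `missing_rate` may be spread over several features. The helper below is a hypothetical pandas-only sketch (`missing_summary` is not part of mdatagen) and makes no assumption about how the library distributes the rate.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return (per-column missing %, overall missing %) for a DataFrame."""
    per_column = df.isna().mean() * 100
    overall = float(df.isna().to_numpy().mean() * 100)
    return per_column, overall


# Toy frame: one missing cell out of eight
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                    "b": [1.0, 2.0, 3.0, 4.0]})

per_column, overall = missing_summary(toy)
print(per_column)  # a: 25.0, b: 0.0
print(overall)     # 12.5
```
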
### Histogram plot
 
```python
import pandas as pd
from sklearn.datasets import load_iris

from mdatagen.univariate.uMCAR import uMCAR
from mdatagen.plots.plot import PlotMissingData

# Load the data
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, 
                        columns=iris.feature_names)

X = iris_df.copy()   # Features
y = iris.target      # Label values

# Create an instance with a missing rate
# of 25% in the dataset under the MCAR mechanism
generator = uMCAR(X=X, 
                  y=y, 
                  missing_rate=25, 
                  x_miss='petal length (cm)')

# Generate the missing data under the MCAR mechanism
generate_data = generator.random()


miss_plot = PlotMissingData(data_missing=generate_data,
                            data_original=X
                            )
miss_plot.visualize_miss("histogram",
                         col_missing="petal length (cm)",
                         save=True,
                         path_save_fig = "MCAR_iris.png")
```
### Imputation Quality Evaluation: Mean Squared Error (MSE) 

```python
import pandas as pd
from sklearn.datasets import load_iris

from mdatagen.univariate.uMCAR import uMCAR
from mdatagen.metrics.metrics import EvaluateImputation

# Load the data
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, 
                        columns=iris.feature_names)

X = iris_df.copy()   # Features
y = iris.target      # Label values

# Create an instance with a missing rate
# of 25% in the dataset under the MCAR mechanism
generator = uMCAR(X=X, 
                  y=y, 
                  missing_rate=25, 
                  x_miss='petal length (cm)')

# Generate the missing data under the MCAR mechanism
generate_data = generator.random()

# Calculate the metric: MSE
fill_zero = generate_data.drop("target",axis=1).fillna(0)
eval_metric = EvaluateImputation(
            data_imputed=fill_zero,
            data_original=X,
            metric="mean_squared_error")
print(eval_metric.show())
```
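
For reference, the mean squared error itself can be sketched in a few lines of NumPy. The `mse` helper and its optional `mask` argument are assumptions for illustration and do not describe how `EvaluateImputation` computes its result; the example also shows why zero-filling tends to be a weak baseline compared with mean imputation.

```python
import numpy as np


def mse(original, imputed, mask=None):
    """Mean squared error over all cells, or only over cells where mask is True."""
    original = np.asarray(original, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    squared_error = (original - imputed) ** 2
    if mask is not None:
        squared_error = squared_error[mask]
    return float(squared_error.mean())


original = np.array([1.0, 2.0, 3.0, 4.0])
observed = np.array([1.0, np.nan, 3.0, np.nan])
mask = np.isnan(observed)  # cells that had to be imputed

zero_fill = np.where(mask, 0.0, observed)
mean_fill = np.where(mask, np.nanmean(observed), observed)

print(mse(original, zero_fill))        # 5.0
print(mse(original, mean_fill))        # 1.0
print(mse(original, zero_fill, mask))  # 10.0 (imputed cells only)
print(mse(original, mean_fill, mask))  # 2.0 (imputed cells only)
```

Restricting the error to the imputed cells avoids diluting it with the cells that were never missing.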


## Contributions
Contributions are welcome! Feel free to open issues, submit pull requests, or provide feedback.

## Citation
If you use **mdatagen** in your research, please cite the [original paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8605316).

BibTeX entry:
```bibtex
@ARTICLE{Santos2019,
  author={Santos, Miriam Seoane and Pereira, Ricardo Cardoso and Costa, Adriana Fonseca and Soares, Jastin Pompeu and Santos, João and Abreu, Pedro Henriques},
  journal={IEEE Access}, 
  title={Generating Synthetic Missing Data: A Review by Missing Mechanism}, 
  year={2019},
  volume={7},
  number={},
  pages={11651-11667},
  doi={10.1109/ACCESS.2019.2891360}}
```
## Acknowledgements
The authors gratefully acknowledge the Brazilian funding agency FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo) under grants 2022/10553-6, 2023/13688-2, and 2021/06870-3. Moreover, this research was supported in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, and by the Portuguese Recovery and Resilience Plan (PRR) through project C645008882-00000055, Center for Responsible AI.

            
