mdatagen

- Name: mdatagen
- Version: 0.1.71
- Home page: https://github.com/ArthurMangussi/pymdatagen
- Summary: mdatagen: A Python Library for the Generation of Artificial Missing Data
- Upload time: 2024-11-25 18:50:48
- Author: Arthur Dantas Mangussi
- Requires Python: >=3.10.12
- License: MIT
- Keywords: machine learning, preprocessing data

# mdatagen: A Python Library for the Generation of Artificial Missing Data

![Python3](https://img.shields.io/badge/Language-Python3-steelblue)
[![License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Documentation](https://img.shields.io/badge/Documentation-Link-green.svg)](docs/)
[![Latest PyPI Version](https://img.shields.io/pypi/v/mdatagen.svg)](https://pypi.org/project/mdatagen/)


## Overview
This package has been developed to address a gap in machine learning research, specifically the artificial generation of missing data. Santos et al. (2019) provided a survey that presents various strategies for both univariate and multivariate scenarios, but the Python community still needs implementations of these strategies. Our Python library **missing-data-generator** (mdatagen) puts forward a comprehensive set of implementations of missing data mechanisms, covering Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR), allowing users to simulate several real-world scenarios comprising absent observations. The library is designed for easy integration with existing Python-based data analysis workflows, including well-established modules such as scikit-learn, and popular libraries for missing data visualization, such as missingno, enhancing its accessibility and usability for researchers.
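
To make the three mechanisms concrete, here is a minimal sketch written with plain NumPy and pandas. It does not use mdatagen and is not the library's implementation. The column names `age` and `income`, the 25% rate, and the rule of removing the largest values are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data: two correlated numeric features (hypothetical names).
n = 1000
age = rng.normal(50, 10, size=n)
income = 1000 + 40 * age + rng.normal(0, 100, size=n)
df = pd.DataFrame({"age": age, "income": income})

rate = 0.25  # target fraction of missing values in "income"
n_miss = int(rate * n)

# MCAR: every row has the same chance of losing its value.
mcar = df.copy()
mcar_rows = rng.choice(n, size=n_miss, replace=False)
mcar.loc[mcar_rows, "income"] = np.nan

# MAR: missingness in "income" depends on the observed feature "age".
mar = df.copy()
mar_rows = df["age"].nlargest(n_miss).index
mar.loc[mar_rows, "income"] = np.nan

# MNAR: missingness in "income" depends on the "income" values themselves.
mnar = df.copy()
mnar_rows = df["income"].nlargest(n_miss).index
mnar.loc[mnar_rows, "income"] = np.nan

for name, data in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    n_missing = int(data["income"].isna().sum())
    observed_mean = data["income"].mean()  # NaN values are skipped by default
    print(f"{name}: {n_missing} missing, observed mean = {observed_mean:.1f}")
```

In this sketch the MAR and MNAR variants shift the observed mean of `income` downward, because the removed rows are not a random subset, while the MCAR variant removes rows without regard to any value.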

This Python package is a collaboration between researchers at the Aeronautics Institute of Technology (Brazil) and the University of Coimbra (Portugal).

## User Guide

Please refer to the [univariate docs](docs/univariate.md) or [multivariate docs](docs/multivariate.md) for more implementation details.


### Installation
Install the package with `pip`:

```bash
pip install mdatagen
```
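
After installing, a quick way to confirm the setup is to query the installed version through the standard library. The sketch below is only a convenience check and is not part of mdatagen; the helper name `installed_version` is hypothetical, and the `>=3.10.12` bound is taken from the package metadata.

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("mdatagen")
print("mdatagen:", version if version is not None else "not installed")
print("Python meets >=3.10.12:", sys.version_info[:3] >= (3, 10, 12))
```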

## API Usage

API usage is described in the following notebooks:

- [MCAR univariate example](docs/mcar_univariate_example.ipynb)
- [MNAR univariate example](docs/mnar_univariate_example.ipynb)
- [MAR univariate example](docs/mar_univariate_example.ipynb)
- [MNAR Multivariate Examples](docs/mnar_multivariate_examples.ipynb)
- [Novel MNAR Multivariate mechanism](docs/novel_mnar_multivariate_example.ipynb)
- [Evaluation of Imputation Quality](docs/evaluation_imputation_quality.ipynb)
- [Visualization Plots](docs/examples_plots.ipynb)
- [Complete Pipeline Example](docs/complete_pipeline_example.ipynb)


### Code examples
The examples below show basic usage of the MAR mechanism in both univariate and multivariate
scenarios to help you get started. They also illustrate how to draw a histogram plot and how to evaluate
imputation quality.

### MAR univariate 
```python
import pandas as pd
from sklearn.datasets import load_iris

from mdatagen.univariate.uMAR import uMAR

# Load the data
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, 
                      columns=iris.feature_names)

X = iris_df.copy()   # Features
y = iris.target      # Label values

generator = uMAR(X=X, 
                  y=y, 
                  missing_rate=50, 
                  x_miss='sepal length (cm)',
                  x_obs='petal length (cm)')

# Generate the missing data under MAR mechanism univariate
generate_data = generator.rank()
print(generate_data.isna().sum())

```
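
One way to sanity-check a MAR result is to compare the observed feature between rows where the target feature is missing and rows where it is present. The sketch below assumes a hand-made DataFrame rather than the output of `uMAR`. The column names `x_obs` and `x_miss`, the helper `obs_mean_by_missingness`, and the rule of removing values in the rows with the smallest `x_obs` are illustrative assumptions, not a description of what `rank()` does internally.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hand-made stand-in for a generator output: "x_miss" is removed
# in the rows with the smallest "x_obs" values (a simple MAR rule).
df = pd.DataFrame({
    "x_obs": rng.normal(size=200),
    "x_miss": rng.normal(size=200),
})
lowest = df["x_obs"].nsmallest(100).index  # 50% missing rate
df.loc[lowest, "x_miss"] = np.nan


def obs_mean_by_missingness(data: pd.DataFrame, x_miss: str, x_obs: str) -> pd.Series:
    """Mean of `x_obs`, split by whether `x_miss` is missing in the row."""
    group = np.where(data[x_miss].isna(), "missing", "observed")
    return data.groupby(group)[x_obs].mean()


means = obs_mean_by_missingness(df, "x_miss", "x_obs")
print(means)
```

A clear gap between the two group means suggests that missingness is tied to the observed feature. Under MCAR the two means would be expected to be close.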
### MAR multivariate

```python
import pandas as pd
from sklearn.datasets import load_iris

from mdatagen.multivariate.mMAR import mMAR

# Load the data
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, 
                      columns=iris.feature_names)

X = iris_df.copy()   # Features
y = iris.target      # Label values

generator = mMAR(X=X, 
                  y=y)

# Generate the missing data under MAR mechanism multivariate
generate_data = generator.correlated(missing_rate=25)
print(generate_data.isna().sum())
```
### Histogram plot
 
```python
import pandas as pd
from sklearn.datasets import load_iris

from mdatagen.univariate.uMCAR import uMCAR
from mdatagen.plots.plot import PlotMissingData

# Load the data
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, 
                        columns=iris.feature_names)

X = iris_df.copy()   # Features
y = iris.target      # Label values

# Create an instance of the MCAR generator
# with a 25% missing rate
generator = uMCAR(X=X, 
                  y=y, 
                  missing_rate=25, 
                  x_miss='petal length (cm)')

# Generate the missing data under MCAR mechanism
generate_data = generator.random()


miss_plot = PlotMissingData(data_missing=generate_data,
                            data_original=X
                            )
miss_plot.visualize_miss("histogram",
                         col_missing="petal length (cm)",
                         save=True,
                         path_save_fig = "MCAR_iris.png")
```
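
The comparison that such a histogram conveys can also be checked numerically. The following sketch uses only NumPy and synthetic values as a stand-in for a feature column. It is not how `PlotMissingData` works internally, and the sample size, distribution parameters, and bin count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

original = rng.normal(loc=3.8, scale=1.7, size=150)  # stand-in for a feature column
amputated = original.copy()
missing_idx = rng.choice(original.size, size=37, replace=False)  # about 25%
amputated[missing_idx] = np.nan

# Shared bin edges, computed on the complete data, keep the two histograms comparable.
edges = np.histogram_bin_edges(original, bins=10)
counts_original, _ = np.histogram(original, bins=edges)
counts_observed, _ = np.histogram(amputated[~np.isnan(amputated)], bins=edges)

for left, right, c_all, c_obs in zip(edges[:-1], edges[1:], counts_original, counts_observed):
    print(f"{left:5.2f} to {right:5.2f}  original={int(c_all):3d}  observed={int(c_obs):3d}")
```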
### Imputation Quality Evaluation: Mean Squared Error (MSE) 

```python
import pandas as pd
from sklearn.datasets import load_iris

from mdatagen.univariate.uMCAR import uMCAR
from mdatagen.metrics import EvaluateImputation

# Load the data
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, 
                        columns=iris.feature_names)

X = iris_df.copy()   # Features
y = iris.target      # Label values

# Create an instance of the MCAR generator
# with a 25% missing rate
generator = uMCAR(X=X, 
                  y=y, 
                  missing_rate=25, 
                  x_miss='petal length (cm)')

# Generate the missing data under MCAR mechanism
generate_data = generator.random()

# Calculate the metric: MSE
fill_zero = generate_data.drop("target",axis=1).fillna(0)
eval_metric = EvaluateImputation(
            data_imputed=fill_zero,
            data_original=X,
            metric="mean_squared_error")
print(eval_metric.show())
```
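
For intuition about the metric, the MSE can also be computed by hand. The sketch below uses synthetic data and plain pandas. Whether `EvaluateImputation` averages over all cells or only over the imputed ones is not stated here, so both variants are shown. The helper names `mse_all_cells` and `mse_imputed_cells` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Complete toy data and an amputated copy (25% of "b" removed at random).
original = pd.DataFrame({"a": rng.normal(5, 1, 100), "b": rng.normal(3, 1, 100)})
amputated = original.copy()
amputated.loc[rng.choice(100, size=25, replace=False), "b"] = np.nan
mask = amputated.isna()  # True where a value was removed


def mse_all_cells(imputed: pd.DataFrame, truth: pd.DataFrame) -> float:
    """MSE averaged over every cell of the table."""
    return float(((imputed - truth) ** 2).to_numpy().mean())


def mse_imputed_cells(imputed: pd.DataFrame, truth: pd.DataFrame, mask: pd.DataFrame) -> float:
    """MSE averaged only over the cells that were missing before imputation."""
    squared_error = ((imputed - truth) ** 2).to_numpy()
    return float(squared_error[mask.to_numpy()].mean())


zero_fill = amputated.fillna(0)
mean_fill = amputated.fillna(amputated.mean())

for name, imputed in [("zero", zero_fill), ("mean", mean_fill)]:
    print(name, mse_all_cells(imputed, original), mse_imputed_cells(imputed, original, mask))
```

Filling with zeros is a deliberately weak baseline. In this sketch, filling with the column mean gives a much lower error on the imputed cells, which is the kind of contrast the metric is meant to expose.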


## Contributions
Contributions are welcome! Feel free to open issues, submit pull requests, or provide feedback.

## Citation
If you use **mdatagen** in your research, please cite the package as an independent software tool. Additionally, if your research benefits from the concepts or methodologies discussed in the survey by Santos et al. (2019), we encourage citing their work as well.

### Citation for the mdatagen package:
The **mdatagen** package is an independent Python library designed to generate artificial missing data. It does not reproduce or extend the implementation of any previous work, but it is related to ideas explored in Santos et al. (2019). Please cite the package as follows:

BibTeX entry:
```bibtex
@misc{mdatagen2024,
  author = {Mangussi, Arthur Dantas and Santos, Miriam Seoane and Lopes, Filipe Loyola and Pereira, Ricardo Cardoso and Lorena, Ana Carolina and Abreu, Pedro Henriques},
  title = {mdatagen: A Python library for generating missing data},
  year = {2024},
  howpublished = {\url{https://arthurmangussi.github.io/pymdatagen/}},
}
```
### Citation for the survey by Santos et al. (2019):
If your work references concepts or methodologies discussed in the survey by [Santos et al. (2019)](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8605316), please also cite their paper:


BibTeX entry:
```bibtex
@ARTICLE{Santos2019,
  author={Santos, Miriam Seoane and Pereira, Ricardo Cardoso and Costa, Adriana Fonseca and Soares, Jastin Pompeu and Santos, João and Abreu, Pedro Henriques},
  journal={IEEE Access}, 
  title={Generating Synthetic Missing Data: A Review by Missing Mechanism}, 
  year={2019},
  volume={7},
  number={},
  pages={11651-11667},
  doi={10.1109/ACCESS.2019.2891360}}
```
## Acknowledgements
The authors gratefully acknowledge the Brazilian funding agency FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo) under grants 2021/06870-3, 2022/10553-6, and 2023/13688-2. Moreover, this research was supported in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, and by the Portuguese Recovery and Resilience Plan (PRR) through project C645008882-00000055 Center for Responsible AI.

            
