<h1 align="center">SDQCPy</h1>
<p align="center"><strong>SDQCPy: A Comprehensive Python Package for Synthetic Data Management</strong></p>
<p align="center"><a href="README.zh-CN.md">中文版本</a></p>
## Table of Contents
- [Features](#features)
- [Installation](#installation)
- [Results Display](#results-display)
- [Usage](#usage)
  - [Demo](#demo)
  - [Data Synthesis](#data-synthesis)
- [Workflow](#workflow)
- [Support](#support)
- [License](#license)
## Features
`SDQCPy` offers a comprehensive toolkit for synthetic data generation, quality assessment, and analysis:
1. **Data Synthesis**: Generate synthetic data using various models.
2. **Quality Evaluation**: Assess synthetic data quality through statistical tests, classification metrics, explainability analysis, and causal inference.
3. **End-to-End Analysis**: Perform holistic analysis by integrating multiple evaluation methods to provide a comprehensive view of synthetic data quality.
4. **Results Display**: Store the results in *an HTML file*.
## Installation
***You can install `SDQCPy` using pip:***
```bash
pip install sdqcpy
```
***Alternatively, you can install it from the source:***
```bash
git clone https://github.com/T0217/sdqcpy.git
cd sdqcpy
pip install -e .
```
## Results Display
`SDQCPy` provides a `SequentialAnalysis` class that runs the sequential analysis and stores the results in *an HTML file*.
![Sample Result](./Results%20Display/sample%20result.jpg)
## Usage
### Demo
The following code runs the sequential analysis and stores the results in an HTML file:
```python
from sdqc_integration import SequentialAnalysis
from sdqc_data import read_data
import logging
import warnings
# Ignore warnings and set logging level to ERROR
warnings.filterwarnings('ignore')
logger = logging.getLogger()
logger.setLevel(logging.ERROR)
# Set random seed
random_seed = 17
# Replace with your own data path and use pandas to read the data
raw_data = read_data('3_raw')
synthetic_data = read_data('3_synth')
output_path = 'raw_synth.html'
# Perform sequential analysis
sequential = SequentialAnalysis(
    raw_data=raw_data,
    synthetic_data=synthetic_data,
    random_seed=random_seed,
    use_cols=None,
)
results = sequential.run()
sequential.visualize_html(output_path)
```
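If you read your own data with pandas, as the comment in the demo suggests, it can help to align the two tables before handing them to `SequentialAnalysis`. The snippet below is a minimal sketch of that preparation step only: `load_pair` is a hypothetical helper, not part of `SDQCPy`, and the in-memory CSV text stands in for your own file paths.

```python
import io

import pandas as pd


def load_pair(raw_source, synthetic_source):
    """Read two CSV sources and keep only the columns they share (hypothetical helper)."""
    raw = pd.read_csv(raw_source)
    synthetic = pd.read_csv(synthetic_source)
    # Preserve the column order of the raw data
    shared = [col for col in raw.columns if col in synthetic.columns]
    if not shared:
        raise ValueError("raw and synthetic data share no columns")
    return raw[shared], synthetic[shared]


# In-memory CSV text stands in for real file paths such as "raw_data.csv"
raw_csv = io.StringIO("age,income,label\n31,4200,0\n45,5100,1\n")
synth_csv = io.StringIO("age,income,label,extra\n29,4000,0,x\n50,5300,1,y\n")

raw_data, synthetic_data = load_pair(raw_csv, synth_csv)
print(list(synthetic_data.columns))
```

The resulting `raw_data` and `synthetic_data` frames can then be passed to `SequentialAnalysis` in place of the `read_data` calls above.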
### Data Synthesis
`SDQCPy` supports several synthesis methods, implemented with [`ydata-synthetic`](https://github.com/ydataai/ydata-synthetic) and [`SDV`](https://github.com/sdv-dev/SDV).
> [!TIP]
>
> ***Only minimal code is shown here; the parameters of each model can be adjusted as needed.***
- **YData Synthesizer**
```python
import pandas as pd
from sdqc_synthesize import YDataSynthesizer
raw_data = pd.read_csv("raw_data.csv") # Please replace with your own data path
ydata_synth = YDataSynthesizer(data=raw_data)
synthetic_data = ydata_synth.generate()
```
> [!IMPORTANT]
>
> ***In its latest version, [`ydata-synthetic`](https://github.com/ydataai/ydata-synthetic) has switched to [ydata-sdk](https://github.com/ydataai/ydata-sdk). However, because data synthesis is only a supplementary feature of this library, `SDQCPy` has not been updated to match yet.***
- **SDV Synthesizer**
```python
import pandas as pd
from sdqc_synthesize import SDVSynthesizer
raw_data = pd.read_csv("raw_data.csv") # Please replace with your own data path
sdv_synth = SDVSynthesizer(data=raw_data)
synthetic_data = sdv_synth.generate()
```
## Workflow
`SDQCPy` uses the process shown below to perform the quality check and analysis:
```mermaid
---
title: Main Idea
---
flowchart TB
%% Define the style
classDef default stroke:#000,fill:none
%% Define the nodes
initial([Input Real Data and Synthetic Data])
step1[Statistical Test]
step2[Classification]
step3[Explainability]
step4[Causal Analysis]
endprocess[Export HTML file]
%% Define the relationships between nodes
initial --> step1
step1 --> step2
step2 --> step3
step3 --> step4
step4 --> endprocess
```
- **Statistical Test**
  `SDQCPy` employs various methods for *descriptive analysis*, *distribution comparison*, and *correlation testing* tailored to ***different data types***.
- **Classification**
  `SDQCPy` employs machine learning models (`SVC`, `RandomForestClassifier`, `XGBClassifier`, `LGBMClassifier`) to evaluate the similarity between the real and synthetic data.
- **Explainability**
  `SDQCPy` employs several mainstream explainability methods (`Model-Based`, `SHAP`, `PFI`) to evaluate the explainability of the synthetic data.
- **Causal Analysis**
  `SDQCPy` employs several causal structure learning methods and evaluation metrics to compare the adjacency matrices of the raw and synthetic data. These methods are implemented with [`gCastle`](https://github.com/huawei-noah/trustworthyAI/tree/master/gcastle).
- **End-to-End Analysis** (named `SequentialAnalysis`)
  To save you from calling the individual modules one by one, all of the steps above are integrated into a single class. If you have specific needs, you can also use the individual modules on their own.
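To give a feel for what these checks measure, the sketch below illustrates three of the ideas with general-purpose libraries. It does not use `SDQCPy`'s own modules and is not its implementation: the function names `ks_by_column`, `discriminator_auc`, and `adjacency_mismatch` are hypothetical, the Kolmogorov-Smirnov test is just one possible distribution comparison, training a classifier to tell raw rows from synthetic rows is one common reading of classification-based similarity, and the adjacency matrices are written by hand rather than learned with `gCastle`.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def ks_by_column(raw, synthetic):
    """Two-sample Kolmogorov-Smirnov p-value for each shared numeric column."""
    numeric = raw.select_dtypes("number").columns.intersection(synthetic.columns)
    return {col: ks_2samp(raw[col], synthetic[col]).pvalue for col in numeric}


def discriminator_auc(raw, synthetic, random_seed=17):
    """Cross-validated ROC AUC of a classifier that separates raw from synthetic rows.

    Values near 0.5 mean the two tables are hard to tell apart.
    """
    features = pd.concat([raw, synthetic], ignore_index=True)
    labels = np.r_[np.zeros(len(raw), dtype=int), np.ones(len(synthetic), dtype=int)]
    model = RandomForestClassifier(n_estimators=100, random_state=random_seed)
    return cross_val_score(model, features, labels, cv=5, scoring="roc_auc").mean()


def adjacency_mismatch(adj_raw, adj_synthetic):
    """Number of entries in which two adjacency matrices of equal shape differ."""
    return int(np.sum(np.asarray(adj_raw) != np.asarray(adj_synthetic)))


# Toy data: `shifted` deliberately moves the mean of column "x" away from the raw data
rng = np.random.default_rng(17)
raw = pd.DataFrame({"x": rng.normal(0, 1, 300), "y": rng.normal(5, 2, 300)})
shifted = pd.DataFrame({"x": rng.normal(3, 1, 300), "y": rng.normal(5, 2, 300)})

print(ks_by_column(raw, shifted))       # a very small p-value for "x" flags the shift
print(discriminator_auc(raw, shifted))  # an AUC well above 0.5 means easy to separate

# Hand-written causal graphs: the second one is missing the edge 1 -> 2
graph_raw = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
graph_synthetic = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
print(adjacency_mismatch(graph_raw, graph_synthetic))
```

In practice `SequentialAnalysis` runs the library's own versions of these steps for you; a sketch like this is mainly useful for understanding what the reported numbers mean.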
## Support
Need help, want to report a bug, or have an idea for a collaboration? Reach out via [GitHub Issues](https://github.com/T0217/sdqcpy/issues).
> [!IMPORTANT]
>
> ***Before reporting an issue on `GitHub`, please check the existing [Issues](https://github.com/T0217/sdqcpy/issues) to avoid duplicates.***
>
> ***If you wish to contribute to this library, <span style="color: red;">please first open an Issue to discuss your proposed changes.</span> Once discussed, you are welcome to submit a Pull Request.***
## License
[Apache-2.0](LICENSE) @[T0217](https://github.com/T0217)