# Synthetic Data Generation for Tabular, Classification, and Time-Series Labels
This repository contains a Python-based framework for generating accurate and safe synthetic datasets for tabular, classification, and time-series labeling tasks. It is designed to help researchers, data scientists, and machine learning engineers create high-quality, realistic datasets for training and evaluating their models while ensuring privacy and compliance with data protection regulations.
## Features
1. **Tabular Data Generation**: Easily generate synthetic tabular datasets with customizable column types, distribution patterns, and correlations between variables.
2. **Classification Data Generation**: Create datasets for binary or multi-class classification tasks, controlling class imbalance and feature importance.
3. **Time-Series Data Generation**: Generate synthetic time-series datasets with user-defined seasonality, trend, and noise components.
4. **Data Privacy**: Ensure data privacy by using differential privacy techniques and limiting the degree of similarity between the original and synthetic datasets.
5. **Flexible and Extensible**: The framework is designed to be easily extended and adapted to a wide range of data generation tasks, with support for custom data generation modules and integration with other data generation tools.
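To make the seasonality, trend, and noise components from the time-series feature concrete, here is a minimal, package-independent sketch using only the Python standard library. It is not the framework's implementation: the function `make_series` and its parameters `slope`, `amplitude`, `noise_std`, and `seed` are hypothetical names chosen for illustration, and the additive model (trend + seasonal + noise) is an assumption.

```python
import math
import random


def make_series(num_samples, seasonal_period, slope=0.05, amplitude=1.0,
                noise_std=0.1, seed=0):
    """Return a list of floats built as linear trend + sine seasonality + Gaussian noise.

    Hypothetical helper for illustration only; not part of the package.
    """
    rng = random.Random(seed)  # seeded so repeated calls give the same series
    series = []
    for t in range(num_samples):
        trend = slope * t  # linear trend component
        seasonal = amplitude * math.sin(2 * math.pi * t / seasonal_period)  # repeats every seasonal_period steps
        noise = rng.gauss(0.0, noise_std)  # zero-mean Gaussian noise
        series.append(trend + seasonal + noise)
    return series


series = make_series(num_samples=1000, seasonal_period=12)
```

Setting `noise_std=0.0` yields a series in which values one full period apart differ only by the trend, which is a quick way to check that the seasonal component behaves as intended.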
## Installation
Clone the repository and install the required dependencies:
```bash
git clone https://github.com/syntheticdataset/synthetic-dataset.git
cd synthetic-dataset
pip install -r requirements.txt
```
## Usage
Refer to the provided examples and documentation for guidance on how to generate synthetic datasets for your specific use case.
```python
from synthetic_data import TabularDataGenerator, ClassificationDataGenerator, TimeSeriesDataGenerator

# Tabular data generation
tabular_gen = TabularDataGenerator(num_rows=1000)
tabular_data = tabular_gen.generate()
# Classification data generation
classification_gen = ClassificationDataGenerator(num_samples=1000, num_classes=3)
classification_data, labels = classification_gen.generate()
# Time-series data generation
time_series_gen = TimeSeriesDataGenerator(num_samples=1000, seasonal_period=12)
time_series_data = time_series_gen.generate()
```
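The example above does not show how class imbalance is configured, so the following is a package-independent sketch of the idea rather than the framework's API. It assumes a hypothetical helper `make_imbalanced` with a `class_weights` parameter, draws labels in proportion to those weights, and places each class in its own Gaussian blob; all of these names and modeling choices are assumptions for illustration.

```python
import random
from collections import Counter


def make_imbalanced(num_samples, class_weights, num_features=2, seed=0):
    """Return (features, labels) with label frequencies following class_weights.

    Hypothetical helper for illustration only; not part of the package.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    classes = list(range(len(class_weights)))
    # Sample labels with replacement, proportionally to the given weights.
    labels = rng.choices(classes, weights=class_weights, k=num_samples)
    # Each class is a unit-variance Gaussian blob centred at 2 * label on every axis.
    features = [
        [rng.gauss(2.0 * label, 1.0) for _ in range(num_features)]
        for label in labels
    ]
    return features, labels


features, labels = make_imbalanced(1000, class_weights=[0.7, 0.2, 0.1])
counts = Counter(labels)  # per-class sample counts, useful for checking the imbalance
```

Inspecting `counts` after generation is a simple sanity check that the realized class proportions are close to the requested weights.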
## Contributing
Please read the CONTRIBUTING.md file for details on how to contribute to the project. We welcome pull requests, bug reports, and feature requests.
## License
This project is licensed under the MIT License; see the [LICENSE](https://github.com/syntheticdataset/synthetic-dataset/blob/main/LICENSE) file for details.
Raw data
{
"_id": null,
"home_page": "",
"name": "synthetic-dataset",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "",
"keywords": "python,pandas,numpy,scikit-learn,scipy,matplotlib,seaborn",
"author": "Synthetic Dataset AI Team",
"author_email": "<admin@syntheticdataset.ai>",
"download_url": "https://files.pythonhosted.org/packages/26/ea/2f021b6a2a16c960aece62899bfc33fc19fe2516d07d3ad88f6cfa4bbc27/synthetic-dataset-0.0.0.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "Generating accurate and safe synthetic datasets for tabular, classification, and time-series labeling tasks",
"version": "0.0.0.2",
"split_keywords": [
"python",
"pandas",
"numpy",
"scikit-learn",
"scipy",
"matplotlib",
"seaborn"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "2e324614b7ca4899ff2a5ab1bf39f04bb0344654d0dbba14ba72d16a124ff8fc",
"md5": "57de5825ef26283b781b9e474fdb5428",
"sha256": "ea0bfa4cd8b0039e0b78c70e24e7e9fa053eabb7125ece98cb1851df587bfbc0"
},
"downloads": -1,
"filename": "synthetic_dataset-0.0.0.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "57de5825ef26283b781b9e474fdb5428",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 3322,
"upload_time": "2023-04-10T04:36:59",
"upload_time_iso_8601": "2023-04-10T04:36:59.519759Z",
"url": "https://files.pythonhosted.org/packages/2e/32/4614b7ca4899ff2a5ab1bf39f04bb0344654d0dbba14ba72d16a124ff8fc/synthetic_dataset-0.0.0.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "26ea2f021b6a2a16c960aece62899bfc33fc19fe2516d07d3ad88f6cfa4bbc27",
"md5": "6716b0b014950fce8ad53fdad1dd9898",
"sha256": "41b8ab040623c3b440fc518275a1260c82e1282c172d0603e044a4c910b3125d"
},
"downloads": -1,
"filename": "synthetic-dataset-0.0.0.2.tar.gz",
"has_sig": false,
"md5_digest": "6716b0b014950fce8ad53fdad1dd9898",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 3457,
"upload_time": "2023-04-10T04:37:01",
"upload_time_iso_8601": "2023-04-10T04:37:01.684270Z",
"url": "https://files.pythonhosted.org/packages/26/ea/2f021b6a2a16c960aece62899bfc33fc19fe2516d07d3ad88f6cfa4bbc27/synthetic-dataset-0.0.0.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-04-10 04:37:01",
"github": false,
"gitlab": false,
"bitbucket": false,
"lcname": "synthetic-dataset"
}