# IndoxGen-Torch: Advanced GAN-based Synthetic Data Generation Framework
[![License](https://img.shields.io/github/license/osllmai/IndoxGen-Torch)](https://github.com/osllmai/IndoxGen/tree/master/libs/indoxGen_torch/LICENSE)
[![PyPI](https://badge.fury.io/py/IndoxGen-Torch.svg)](https://pypi.org/project/indoxGen-torch/)
[![Python](https://img.shields.io/pypi/pyversions/IndoxGen-Torch.svg)](https://pypi.org/project/indoxGen-torch/)
[![Downloads](https://static.pepy.tech/badge/indoxGen-torch)](https://pepy.tech/project/indoxGen-torch)
[![Discord](https://img.shields.io/discord/1223867382460579961?label=Discord&logo=Discord&style=social)](https://discord.com/invite/ossllmai)
[![GitHub stars](https://img.shields.io/github/stars/osllmai/IndoxGen-torch?style=social)](https://github.com/osllmai/indoxGen)
<p align="center">
<a href="https://osllm.ai">Official Website</a> • <a href="https://docs.osllm.ai/index.html">Documentation</a> • <a href="https://discord.gg/qrCc56ZR">Discord</a>
</p>
<p align="center">
<b>NEW:</b> <a href="https://docs.google.com/forms/d/1CQXJvxLUqLBSXnjqQmRpOyZqD6nrKubLz2WTcIJ37fU/prefill">Subscribe to our mailing list</a> for updates and news!
</p>
## Overview
IndoxGen-Torch is a cutting-edge framework for generating high-quality synthetic data using Generative Adversarial Networks (GANs) powered by PyTorch. This module extends the capabilities of IndoxGen by providing a robust, PyTorch-based solution for creating realistic tabular data, particularly suited for complex datasets with mixed data types.
## Key Features
- **GAN-based Generation**: Utilizes advanced GAN architecture for high-fidelity synthetic data creation.
- **PyTorch Integration**: Built on PyTorch for efficient, GPU-accelerated training and generation.
- **Flexible Data Handling**: Supports categorical, mixed, and integer columns for versatile data modeling.
- **Customizable Architecture**: Easily configure generator and discriminator layers, learning rates, and other hyperparameters.
- **Training Monitoring**: Built-in patience-based early stopping for optimal model training.
- **Scalable Generation**: Efficiently generate large volumes of synthetic data post-training.
## Installation
```bash
pip install IndoxGen-Torch
```
## Quick Start Guide
### Basic Usage
```python
from indoxGen_pytorch import TabularGANConfig, TabularGANTrainer
import pandas as pd

# Load your data
data = pd.read_csv("data/Adult.csv")

# Define column types
categorical_columns = [
    "workclass", "education", "marital-status", "occupation",
    "relationship", "race", "gender", "native-country", "income",
]
mixed_columns = {"capital-gain": "positive", "capital-loss": "positive"}
integer_columns = ["age", "fnlwgt", "hours-per-week", "capital-gain", "capital-loss"]

# Set up the configuration
config = TabularGANConfig(
    input_dim=200,
    generator_layers=[128, 256, 512],
    discriminator_layers=[512, 256, 128],
    learning_rate=2e-4,
    beta_1=0.5,
    beta_2=0.9,
    batch_size=128,
    epochs=50,
    n_critic=5,
)

# Initialize and train the model
trainer = TabularGANTrainer(
    config=config,
    categorical_columns=categorical_columns,
    mixed_columns=mixed_columns,
    integer_columns=integer_columns,
)
history = trainer.train(data, patience=15)

# Generate synthetic data
synthetic_data = trainer.generate_samples(50000)
```
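The `patience=15` argument enables patience-based early stopping. The sketch below illustrates the general technique only; it is not IndoxGen-Torch's internal code, and the function name `train_with_patience` and the monitored per-epoch loss are assumptions made for illustration. Training stops once the monitored value has failed to improve for `patience` consecutive epochs.

```python
def train_with_patience(losses, patience):
    """Return the index of the last epoch that runs.

    `losses` stands in for a per-epoch monitored metric (lower is better).
    Hypothetical helper for illustration; not part of IndoxGen-Torch.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            # New best value: remember it and reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # stop early
    return len(losses) - 1  # every epoch ran


# The best loss (0.8) occurs at epoch 1; epochs 2, 3 and 4 do not improve on it.
stopped_at = train_with_patience([1.0, 0.8, 0.9, 0.85, 0.95, 0.7], patience=3)
print(stopped_at)
```

A larger `patience` tolerates the noisy loss curves typical of GAN training, at the cost of extra epochs before stopping.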
## Advanced Techniques
### Customizing the GAN Architecture
```python
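custom_config = TabularGANConfig(
    input_dim=300,
    generator_layers=[256, 512, 1024, 512],
    discriminator_layers=[512, 1024, 512, 256],
    learning_rate=1e-4,
    batch_size=256,
    epochs=100,
    n_critic=3,
)

# Reuse the column definitions from the Basic Usage example
custom_trainer = TabularGANTrainer(
    config=custom_config,
    categorical_columns=categorical_columns,
    mixed_columns=mixed_columns,
    integer_columns=integer_columns,
)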
```
### Handling Imbalanced Datasets
```python
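# Compare class proportions in the real and synthetic data.
# "target_column" is a placeholder for the name of your label column.
original_class_distribution = data['target_column'].value_counts(normalize=True)
synthetic_data = trainer.generate_samples(100000)
synthetic_class_distribution = synthetic_data['target_column'].value_counts(normalize=True)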
```
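To turn the two distributions above into a single number, one option is the total variation distance. The following is a minimal sketch using only pandas; `class_distribution_gap` is a hypothetical helper, not part of IndoxGen-Torch, and the two series are toy stand-ins for a real and a synthetic label column.

```python
import pandas as pd


def class_distribution_gap(real: pd.Series, synthetic: pd.Series) -> float:
    """Total variation distance between two categorical distributions.

    0.0 means identical class proportions, 1.0 means no shared classes.
    """
    p = real.value_counts(normalize=True)
    q = synthetic.value_counts(normalize=True)
    # Align on the union of classes; a class missing on one side gets probability 0.
    p, q = p.align(q, fill_value=0.0)
    return 0.5 * float((p - q).abs().sum())


# Toy stand-ins for data['target_column'] and synthetic_data['target_column'].
real = pd.Series(["<=50K"] * 75 + [">50K"] * 25)
synthetic = pd.Series(["<=50K"] * 60 + [">50K"] * 40)

gap = class_distribution_gap(real, synthetic)
print(f"Total variation distance: {gap:.2f}")
```

A gap close to zero suggests the generator preserves the class balance; a large gap can indicate that minority classes are under- or over-represented in the synthetic data.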
## Configuration and Customization
The `TabularGANConfig` class allows for extensive customization:
- `input_dim`: Dimension of the input noise vector
- `generator_layers` and `discriminator_layers`: List of layer sizes for the generator and discriminator
- `learning_rate`, `beta_1`, `beta_2`: Adam optimizer parameters
- `batch_size`, `epochs`: Training configuration
- `n_critic`: Number of discriminator updates per generator update
Refer to the API documentation for a comprehensive list of configuration options.
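As an illustration of what `n_critic` usually means in GAN training, the sketch below lists the order of updates when the discriminator is stepped on every batch and the generator only on every `n_critic`-th batch. This reflects the common convention, not a confirmed description of the library's training loop, and `update_schedule` is a hypothetical helper.

```python
def update_schedule(num_batches, n_critic):
    """Yield "D" for each discriminator step and "G" for each generator step.

    Hypothetical helper: shows the usual n_critic convention only.
    """
    for step in range(1, num_batches + 1):
        yield "D"  # the discriminator is updated on every batch
        if step % n_critic == 0:
            yield "G"  # the generator is updated once per n_critic batches


schedule = list(update_schedule(num_batches=10, n_critic=5))
print("".join(schedule))
```

Under this convention, a higher `n_critic` gives the discriminator more updates relative to the generator, which lengthens each epoch's generator-side progress but can stabilize training.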
## Best Practices
1. **Data Preprocessing**: Ensure your data is properly cleaned and normalized before training.
2. **Hyperparameter Tuning**: Experiment with different configurations to find the optimal setup for your dataset.
3. **Validation**: Regularly compare the distribution of synthetic data with the original dataset.
4. **Privacy Considerations**: Implement differential privacy techniques when dealing with sensitive data.
5. **Scalability**: For large datasets, consider PyTorch's distributed training capabilities; built-in distributed training support is listed on the roadmap below.
## Roadmap
* [x] Implement basic GAN architecture for tabular data
* [x] Add support for mixed data types (categorical, continuous, integer)
* [x] Integrate early stopping and training history
* [ ] Implement more advanced GAN variants (WGAN, CGAN)
* [ ] Add built-in privacy preserving mechanisms
* [ ] Develop automated hyperparameter tuning
* [ ] Create visualization tools for synthetic data quality assessment
* [ ] Implement distributed training support for large-scale datasets
## Contributing
We welcome contributions! Please see our [CONTRIBUTING.md](CONTRIBUTING.md) file for details on how to get started.
## License
IndoxGen-Torch is released under the MIT License. See [LICENSE.md](LICENSE.md) for more details.
---
Raw data
{
"_id": null,
"home_page": "https://github.com/osllmai/IndoxGen/tree/master/libs/indoxGen_torch",
"name": "indoxGen-torch",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "AI, deep learning, language models, synthetic data generation, machine learning, NLP",
"author": "nerdstudio",
"author_email": "ashkan@nematifamilyfundation.onmicrosoft.com",
"download_url": "https://files.pythonhosted.org/packages/cf/7c/f17f7256e2120e1110e52bb36d163c59c3510fab4d84fe12ab2e53a48d57/indoxgen_torch-0.0.9.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "AGPL-3.0-or-later",
"summary": "Indox Synthetic Data Generation (GAN-pytorch)",
"version": "0.0.9",
"project_urls": {
"Homepage": "https://github.com/osllmai/IndoxGen/tree/master/libs/indoxGen_torch"
},
"split_keywords": [
"ai",
" deep learning",
" language models",
" synthetic data generation",
" machine learning",
" nlp"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "988868f47b7f0ecf6c1e5f33dff5a52682c7920a74c39b886fc215ca8d1bc14e",
"md5": "024e214efe4d458b8f59b729249916b6",
"sha256": "c314d81cead763586a4b2cf54aa7a01123363b14a3cfbaff0acb81e1440aec69"
},
"downloads": -1,
"filename": "indoxGen_torch-0.0.9-py3-none-any.whl",
"has_sig": false,
"md5_digest": "024e214efe4d458b8f59b729249916b6",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 32384,
"upload_time": "2024-10-14T06:14:37",
"upload_time_iso_8601": "2024-10-14T06:14:37.534044Z",
"url": "https://files.pythonhosted.org/packages/98/88/68f47b7f0ecf6c1e5f33dff5a52682c7920a74c39b886fc215ca8d1bc14e/indoxGen_torch-0.0.9-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "cf7cf17f7256e2120e1110e52bb36d163c59c3510fab4d84fe12ab2e53a48d57",
"md5": "9df208f482005db6fd41592209917417",
"sha256": "44f5619b22d29be8deab50d040cd6aa2f9938d5ce678fbf06073c1143c1cb1bf"
},
"downloads": -1,
"filename": "indoxgen_torch-0.0.9.tar.gz",
"has_sig": false,
"md5_digest": "9df208f482005db6fd41592209917417",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 31177,
"upload_time": "2024-10-14T06:14:39",
"upload_time_iso_8601": "2024-10-14T06:14:39.780222Z",
"url": "https://files.pythonhosted.org/packages/cf/7c/f17f7256e2120e1110e52bb36d163c59c3510fab4d84fe12ab2e53a48d57/indoxgen_torch-0.0.9.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-14 06:14:39",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "osllmai",
"github_project": "IndoxGen",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "indoxgen-torch"
}