# IndoxGen-Tensor: Advanced GAN-based Synthetic Data Generation Framework
[![License](https://img.shields.io/github/license/osllmai/indoxGen_tensor)](https://github.com/osllmai/indoxGen_tensor/blob/main/LICENSE)
[![PyPI](https://badge.fury.io/py/indoxGen-tensor.svg)](https://pypi.org/project/indoxGen-tensor/)
[![Python](https://img.shields.io/pypi/pyversions/indoxGen-tensor.svg)](https://pypi.org/project/indoxGen-tensor/)
[![Downloads](https://static.pepy.tech/badge/indoxGen-tensor)](https://pepy.tech/project/indoxGen-tensor)
[![Discord](https://img.shields.io/discord/1223867382460579961?label=Discord&logo=Discord&style=social)](https://discord.com/invite/ossllmai)
[![GitHub stars](https://img.shields.io/github/stars/osllmai/indoxGen-tensor?style=social)](https://github.com/osllmai/indoxGen_tensor)
<p align="center">
<a href="https://osllm.ai">Official Website</a> • <a href="https://docs.osllm.ai/index.html">Documentation</a> • <a href="https://discord.gg/qrCc56ZR">Discord</a>
</p>
<p align="center">
<b>NEW:</b> <a href="https://docs.google.com/forms/d/1CQXJvxLUqLBSXnjqQmRpOyZqD6nrKubLz2WTcIJ37fU/prefill">Subscribe to our mailing list</a> for updates and news!
</p>
## Overview
IndoxGen-Tensor is a TensorFlow-based framework for generating synthetic tabular data with Generative Adversarial Networks (GANs). It extends IndoxGen with a GAN trainer aimed at datasets that mix categorical, continuous, and integer columns.
## Key Features
- **GAN-based Generation**: Utilizes advanced GAN architecture for high-fidelity synthetic data creation.
- **TensorFlow Integration**: Built on TensorFlow for efficient, GPU-accelerated training and generation.
- **Flexible Data Handling**: Supports categorical, mixed, and integer columns for versatile data modeling.
- **Customizable Architecture**: Easily configure generator and discriminator layers, learning rates, and other hyperparameters.
- **Training Monitoring**: Built-in patience-based early stopping for optimal model training.
- **Scalable Generation**: Efficiently generate large volumes of synthetic data post-training.
## Installation
```bash
pip install indoxgen-tensor
```
## Quick Start Guide
### Basic Usage
```python
from indoxGen_tensor import TabularGANConfig, TabularGANTrainer
import pandas as pd
# Load your data
data = pd.read_csv("data/Adult.csv")
# Define column types
categorical_columns = ["workclass", "education", "marital-status", "occupation",
                       "relationship", "race", "gender", "native-country", "income"]
mixed_columns = {"capital-gain": "positive", "capital-loss": "positive"}
integer_columns = ["age", "fnlwgt", "hours-per-week", "capital-gain", "capital-loss"]
# Set up the configuration
config = TabularGANConfig(
    input_dim=200,
    generator_layers=[128, 256, 512],
    discriminator_layers=[512, 256, 128],
    learning_rate=2e-4,
    beta_1=0.5,
    beta_2=0.9,
    batch_size=128,
    epochs=50,
    n_critic=5
)
# Initialize and train the model
trainer = TabularGANTrainer(
    config=config,
    categorical_columns=categorical_columns,
    mixed_columns=mixed_columns,
    integer_columns=integer_columns
)
history = trainer.train(data, patience=15)
# Generate synthetic data
synthetic_data = trainer.generate_samples(50000)
```
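Because the column lists are plain strings, a misspelled name is easy to miss. The helper below is a minimal sketch of a pre-training check; `find_missing_columns` is a hypothetical name and not part of IndoxGen-Tensor. It only compares the configured names against the columns that are actually available.

```python
def find_missing_columns(available, categorical, mixed, integer):
    """Return the configured column names that are absent from `available`."""
    available = set(available)
    # Iterating a dict yields its keys, so `mixed` may be a dict or a list
    configured = list(categorical) + list(mixed) + list(integer)
    # dict.fromkeys de-duplicates while preserving first-seen order
    return [name for name in dict.fromkeys(configured) if name not in available]


missing = find_missing_columns(
    available=["age", "workclass", "capital-gain"],
    categorical=["workclass", "gender"],
    mixed={"capital-gain": "positive"},
    integer=["age", "capital-gain"],
)
print(missing)  # ['gender']
```

With the Quick Start variables, the call would be `find_missing_columns(data.columns, categorical_columns, mixed_columns, integer_columns)`; an empty list means every configured name exists in the DataFrame.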
## Advanced Techniques
### Customizing the GAN Architecture
```python
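from indoxGen_tensor import TabularGANConfig, TabularGANTrainer

# A wider, deeper network with a lower learning rate and fewer critic steps
custom_config = TabularGANConfig(
    input_dim=300,
    generator_layers=[256, 512, 1024, 512],
    discriminator_layers=[512, 1024, 512, 256],
    learning_rate=1e-4,
    batch_size=256,
    epochs=100,
    n_critic=3
)
# Pass the same column arguments as in Basic Usage
custom_trainer = TabularGANTrainer(
    config=custom_config,
    categorical_columns=categorical_columns,
    mixed_columns=mixed_columns,
    integer_columns=integer_columns
)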
```
### Handling Imbalanced Datasets
```python
# Assuming the value 'rare_class' of 'target_column' is underrepresented in your original data
original_class_distribution = data['target_column'].value_counts(normalize=True)
synthetic_data = trainer.generate_samples(100000)
synthetic_class_distribution = synthetic_data['target_column'].value_counts(normalize=True)
# Compare the two distributions, then oversample or filter the synthetic rows
# until the class shares match the balance you want
```
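One way to reach a desired class balance is to resample the generated rows after the fact. The function below is a sketch using only pandas; `resample_to_distribution` and the class names in the demo are hypothetical, and it assumes the generated samples are a pandas DataFrame, as the `value_counts` calls above suggest.

```python
import pandas as pd


def resample_to_distribution(df, column, target, n_rows, seed=0):
    """Draw about `n_rows` rows from `df` so that `column` follows the `target` shares."""
    parts = []
    for value, share in target.items():
        rows = df[df[column] == value]
        wanted = round(share * n_rows)
        if wanted == 0 or rows.empty:
            continue  # nothing requested, or this class was never generated
        # Sample with replacement only when the pool is smaller than the quota
        parts.append(rows.sample(n=wanted, replace=len(rows) < wanted, random_state=seed))
    # Shuffle so the classes are interleaved again
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Small stand-in for generated data: 90 common rows, 10 rare rows
demo = pd.DataFrame({"target_column": ["common"] * 90 + ["rare"] * 10, "x": range(100)})
balanced = resample_to_distribution(
    demo, "target_column", {"common": 0.5, "rare": 0.5}, n_rows=40
)
print(balanced["target_column"].value_counts().to_dict())
```

Sampling with replacement duplicates rows of the rare class, so generating a larger pool first and filtering it down is usually preferable when the rare class is very small.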
## Configuration and Customization
The `TabularGANConfig` class allows for extensive customization:
- `input_dim`: Dimension of the input noise vector
- `generator_layers` and `discriminator_layers`: List of layer sizes for the generator and discriminator
- `learning_rate`, `beta_1`, `beta_2`: Adam optimizer parameters
- `batch_size`, `epochs`: Training configuration
- `n_critic`: Number of discriminator updates per generator update
Refer to the API documentation for a comprehensive list of configuration options.
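To make `n_critic` concrete, the sketch below prints the order of updates under the conventional reading of that parameter: one generator step after every `n_critic` discriminator steps. This is an illustration of the general technique in plain Python, not a copy of the library's training loop, and `update_schedule` is a hypothetical name.

```python
def update_schedule(n_batches, n_critic):
    """Yield (network, batch_index) pairs in the order the updates would run."""
    for batch_index in range(n_batches):
        yield ("discriminator", batch_index)
        # After every n_critic discriminator steps, take one generator step
        if (batch_index + 1) % n_critic == 0:
            yield ("generator", batch_index)


steps = [network for network, _ in update_schedule(n_batches=6, n_critic=3)]
print(steps)
```

Under this reading, a larger `n_critic` gives the discriminator more updates relative to the generator, which also means fewer generator updates per epoch.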
## Best Practices
1. **Data Preprocessing**: Ensure your data is properly cleaned and normalized before training.
2. **Hyperparameter Tuning**: Experiment with different configurations to find the optimal setup for your dataset.
3. **Validation**: Regularly compare the distribution of synthetic data with the original dataset.
4. **Privacy Considerations**: Built-in privacy-preserving mechanisms are still on the roadmap, so apply differential privacy or similar techniques yourself when working with sensitive data.
5. **Scalability**: For large datasets, consider using distributed training capabilities of TensorFlow.
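For the validation step, one simple check on a categorical column is the total variation distance between the real and synthetic category frequencies. The sketch below uses only the standard library; `total_variation_distance` is a hypothetical helper, not a function provided by IndoxGen-Tensor.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Half the L1 distance between two empirical category distributions.

    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    """
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real = sum(real_counts.values())
    n_synth = sum(synth_counts.values())
    categories = set(real_counts) | set(synth_counts)
    # A Counter returns 0 for categories it has never seen
    return 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth) for c in categories
    )


real = ["a", "a", "b", "b"]
synthetic = ["a", "a", "a", "b"]
print(total_variation_distance(real, synthetic))  # 0.25
```

The function accepts any iterable of category labels, so a DataFrame column such as `data["workclass"]` can be passed directly. Numeric columns need a different comparison, for example histograms or summary statistics.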
## Roadmap
* [x] Implement basic GAN architecture for tabular data
* [x] Add support for mixed data types (categorical, continuous, integer)
* [x] Integrate early stopping and training history
* [ ] Implement more advanced GAN variants (WGAN, CGAN)
* [ ] Add built-in privacy preserving mechanisms
* [ ] Develop automated hyperparameter tuning
* [ ] Create visualization tools for synthetic data quality assessment
* [ ] Implement distributed training support for large-scale datasets
## Contributing
We welcome contributions! Please see our [CONTRIBUTING.md](CONTRIBUTING.md) file for details on how to get started.
## License
IndoxGen-Tensor is released under the MIT License. See [LICENSE.md](LICENSE.md) for more details.
---
IndoxGen-Tensor - Advancing Synthetic Data Generation with GAN Technology
Raw data
{
"_id": null,
"home_page": "https://github.com/osllmai/IndoxGen/tree/master/libs/indoxGen_tensor",
"name": "indoxGen-tensor",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "AI, deep learning, language models, synthetic data generation, machine learning, NLP, GAN, tensorflow",
"author": "nerdstudio",
"author_email": "ashkan@nematifamilyfundation.onmicrosoft.com",
"download_url": "https://files.pythonhosted.org/packages/4e/25/f2a448ff755b8a104dd8ffc430262aa44e437d4a8787cf62947ed376160c/indoxgen_tensor-0.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "AGPL-3.0-or-later",
"summary": "Indox Synthetic Data Generation (GAN-tensorflow)",
"version": "0.1.0",
"project_urls": {
"Homepage": "https://github.com/osllmai/IndoxGen/tree/master/libs/indoxGen_tensor"
},
"split_keywords": [
"ai",
" deep learning",
" language models",
" synthetic data generation",
" machine learning",
" nlp",
" gan",
" tensorflow"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "ac28db18674cefcbb01503be575d407f4bbb5d7dd096d8739364f62f1c54caa8",
"md5": "4041743a6fd69fcb468d0cf1d4da7fe8",
"sha256": "aa84cba244d6a5aa0505e939d506b2e14c9a029e0fa70c871f9dc6f6a17d3cdb"
},
"downloads": -1,
"filename": "indoxGen_tensor-0.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "4041743a6fd69fcb468d0cf1d4da7fe8",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 31504,
"upload_time": "2024-10-13T14:10:45",
"upload_time_iso_8601": "2024-10-13T14:10:45.263081Z",
"url": "https://files.pythonhosted.org/packages/ac/28/db18674cefcbb01503be575d407f4bbb5d7dd096d8739364f62f1c54caa8/indoxGen_tensor-0.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "4e25f2a448ff755b8a104dd8ffc430262aa44e437d4a8787cf62947ed376160c",
"md5": "7df6276f712d55f7d5be6360801286e5",
"sha256": "778d86079f7080ced26708e2462b0f1399b4e5b9cffbe384da5f6f1fcfda7837"
},
"downloads": -1,
"filename": "indoxgen_tensor-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "7df6276f712d55f7d5be6360801286e5",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 30043,
"upload_time": "2024-10-13T14:10:47",
"upload_time_iso_8601": "2024-10-13T14:10:47.047131Z",
"url": "https://files.pythonhosted.org/packages/4e/25/f2a448ff755b8a104dd8ffc430262aa44e437d4a8787cf62947ed376160c/indoxgen_tensor-0.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-13 14:10:47",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "osllmai",
"github_project": "IndoxGen",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "indoxgen-tensor"
}