indoxGen-tensor


- **Name**: indoxGen-tensor
- **Version**: 0.1.0
- **Home page**: https://github.com/osllmai/IndoxGen/tree/master/libs/indoxGen_tensor
- **Summary**: Indox Synthetic Data Generation (GAN-tensorflow)
- **Upload time**: 2024-10-13 14:10:47
- **Maintainer**: None
- **Docs URL**: None
- **Author**: nerdstudio
- **Requires Python**: >=3.9
- **License**: AGPL-3.0-or-later
- **Keywords**: ai, deep learning, language models, synthetic data generation, machine learning, nlp, gan, tensorflow
# IndoxGen-Tensor: Advanced GAN-based Synthetic Data Generation Framework

[![License](https://img.shields.io/github/license/osllmai/indoxGen_tensor)](https://github.com/osllmai/indoxGen_tensor/blob/main/LICENSE)
[![PyPI](https://badge.fury.io/py/indoxGen-tensor.svg)](https://pypi.org/project/indoxGen-tensor/)
[![Python](https://img.shields.io/pypi/pyversions/indoxGen-tensor.svg)](https://pypi.org/project/indoxGen-tensor/)
[![Downloads](https://static.pepy.tech/badge/indoxGen-tensor)](https://pepy.tech/project/indoxGen-tensor)

[![Discord](https://img.shields.io/discord/1223867382460579961?label=Discord&logo=Discord&style=social)](https://discord.com/invite/ossllmai)
[![GitHub stars](https://img.shields.io/github/stars/osllmai/indoxGen-tensor?style=social)](https://github.com/osllmai/indoxGen_tensor)

<p align="center">
  <a href="https://osllm.ai">Official Website</a> &bull; <a href="https://docs.osllm.ai/index.html">Documentation</a> &bull; <a href="https://discord.gg/qrCc56ZR">Discord</a>
</p>

<p align="center">
  <b>NEW:</b> <a href="https://docs.google.com/forms/d/1CQXJvxLUqLBSXnjqQmRpOyZqD6nrKubLz2WTcIJ37fU/prefill">Subscribe to our mailing list</a> for updates and news!
</p>

## Overview

IndoxGen-Tensor is a framework for generating synthetic tabular data with Generative Adversarial Networks (GANs) built on TensorFlow. It extends IndoxGen with a TensorFlow-based backend and targets datasets that mix categorical, continuous, and integer columns.

## Key Features

- **GAN-based Generation**: Utilizes advanced GAN architecture for high-fidelity synthetic data creation.
- **TensorFlow Integration**: Built on TensorFlow for efficient, GPU-accelerated training and generation.
- **Flexible Data Handling**: Supports categorical, mixed, and integer columns for versatile data modeling.
- **Customizable Architecture**: Easily configure generator and discriminator layers, learning rates, and other hyperparameters.
- **Training Monitoring**: Patience-based early stopping, controlled by the `patience` argument of `train`, which also returns a training history.
- **Scalable Generation**: Efficiently generate large volumes of synthetic data post-training.

## Installation

```bash
pip install indoxgen-tensor
```

## Quick Start Guide

### Basic Usage

```python
from indoxGen_tensor import TabularGANConfig, TabularGANTrainer
import pandas as pd

# Load your data
data = pd.read_csv("data/Adult.csv")

# Define column types
categorical_columns = ["workclass", "education", "marital-status", "occupation",
                       "relationship", "race", "gender", "native-country", "income"]
mixed_columns = {"capital-gain": "positive", "capital-loss": "positive"}
integer_columns = ["age", "fnlwgt", "hours-per-week", "capital-gain", "capital-loss"]

# Set up the configuration
config = TabularGANConfig(
    input_dim=200,
    generator_layers=[128, 256, 512],
    discriminator_layers=[512, 256, 128],
    learning_rate=2e-4,
    beta_1=0.5,
    beta_2=0.9,
    batch_size=128,
    epochs=50,
    n_critic=5
)

# Initialize and train the model
trainer = TabularGANTrainer(
    config=config,
    categorical_columns=categorical_columns,
    mixed_columns=mixed_columns,
    integer_columns=integer_columns
)
history = trainer.train(data, patience=15)

# Generate synthetic data
synthetic_data = trainer.generate_samples(50000)
```
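The `patience=15` argument above enables early stopping. As a rough illustration of the general technique, the sketch below stops once a monitored loss has failed to improve for `patience` consecutive epochs. It is not the library's implementation: `train_with_patience` is a hypothetical helper, and which metric IndoxGen-Tensor actually monitors is an assumption left open here.

```python
def train_with_patience(losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch losses.

    Hypothetical helper for illustration only; not part of indoxGen_tensor.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            # New best value: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # No improvement for `patience` epochs in a row: stop early.
                return epoch, best_epoch

    # Ran through every epoch without triggering early stopping.
    return len(losses) - 1, best_epoch


# Toy loss curve: improves until epoch 2, then never beats that value.
toy_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
stop_epoch, best_epoch = train_with_patience(toy_losses, patience=3)
print(f"stopped at epoch {stop_epoch}, best epoch was {best_epoch}")
```

A larger `patience` tolerates longer plateaus, which can matter for GANs because their losses tend to oscillate rather than decrease steadily.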

## Advanced Techniques

### Customizing the GAN Architecture

```python
custom_config = TabularGANConfig(
    input_dim=300,
    generator_layers=[256, 512, 1024, 512],
    discriminator_layers=[512, 1024, 512, 256],
    learning_rate=1e-4,
    batch_size=256,
    epochs=100,
    n_critic=3
)

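# Reuse the column definitions from the Basic Usage example
custom_trainer = TabularGANTrainer(
    config=custom_config,
    categorical_columns=categorical_columns,
    mixed_columns=mixed_columns,
    integer_columns=integer_columns
)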
```

### Handling Imbalanced Datasets

```python
# 'target_column' is a placeholder for your label column, which is assumed
# to contain an underrepresented class (for example 'rare_class')
original_class_distribution = data['target_column'].value_counts(normalize=True)
synthetic_data = trainer.generate_samples(100000)
synthetic_class_distribution = synthetic_data['target_column'].value_counts(normalize=True)

# If the synthetic distribution differs from the one you want, generate more
# rows and resample them to the desired class proportions
```
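One way to carry out the resampling step is to draw a fixed number of generated rows per class. The sketch below uses only pandas. `resample_to_distribution` is a hypothetical helper, not part of IndoxGen-Tensor, and the toy DataFrame stands in for the output of `trainer.generate_samples(...)`.

```python
import pandas as pd


def resample_to_distribution(df, column, target_fractions, n_rows, seed=0):
    """Draw about n_rows rows from df so that `column` follows target_fractions.

    Hypothetical helper for illustration; not part of indoxGen_tensor.
    """
    parts = []
    for value, fraction in target_fractions.items():
        pool = df[df[column] == value]
        if pool.empty:
            raise ValueError(f"no rows with {column} == {value!r}")
        n_wanted = round(n_rows * fraction)
        # Sample with replacement only when the pool is too small.
        parts.append(
            pool.sample(n=n_wanted, replace=n_wanted > len(pool), random_state=seed)
        )
    # Shuffle so the classes are not grouped together in the result.
    return pd.concat(parts).sample(frac=1.0, random_state=seed).reset_index(drop=True)


# Toy stand-in for generated data: 90% 'common', 10% 'rare_class'.
synthetic = pd.DataFrame({
    "target_column": ["common"] * 90 + ["rare_class"] * 10,
    "feature": range(100),
})

balanced = resample_to_distribution(
    synthetic, "target_column", {"common": 0.5, "rare_class": 0.5}, n_rows=20
)
print(balanced["target_column"].value_counts(normalize=True))
```

Generating more rows than you need and then subsampling avoids duplicating rows of the rare class, which is what sampling with replacement would do.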

## Configuration and Customization

The `TabularGANConfig` class allows for extensive customization:

- `input_dim`: Dimension of the input noise vector
- `generator_layers` and `discriminator_layers`: List of layer sizes for the generator and discriminator
- `learning_rate`, `beta_1`, `beta_2`: Adam optimizer parameters
- `batch_size`, `epochs`: Training configuration
- `n_critic`: Number of discriminator updates per generator update

Refer to the API documentation for a comprehensive list of configuration options.
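To make the meaning of `n_critic` concrete, the sketch below prints the order of updates it implies: `n_critic` discriminator steps followed by one generator step. This only illustrates the ratio described above. `update_schedule` is a hypothetical helper, and the library's actual training loop (for example, whether each discriminator step uses a fresh batch) may differ.

```python
def update_schedule(num_batches, n_critic):
    """Yield "D" for each discriminator step and "G" after every n_critic of them.

    Illustrative only; not the training loop used by indoxGen_tensor.
    """
    for batch_index in range(1, num_batches + 1):
        yield "D"
        if batch_index % n_critic == 0:
            yield "G"


schedule = list(update_schedule(num_batches=10, n_critic=5))
print("".join(schedule))
print(schedule.count("D"), "discriminator steps,", schedule.count("G"), "generator steps")
```

With `n_critic=5` the generator is updated once for every five discriminator updates, so lowering `n_critic` gives the generator more updates over the same number of batches.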

## Best Practices

1. **Data Preprocessing**: Ensure your data is properly cleaned and normalized before training.
2. **Hyperparameter Tuning**: Experiment with different configurations to find the optimal setup for your dataset.
3. **Validation**: Regularly compare the distribution of synthetic data with the original dataset.
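4. **Privacy Considerations**: Built-in privacy-preserving mechanisms are still on the roadmap, so apply differential privacy techniques yourself when working with sensitive data.
5. **Scalability**: Built-in distributed training support is also still on the roadmap; for large datasets, consider using TensorFlow's distributed training capabilities.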
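
The validation step in item 3 can start with simple per-column checks. The sketch below, which assumes pandas is available, compares category frequencies using total variation distance and puts numeric summary statistics side by side. The helper names and toy data are hypothetical and not part of IndoxGen-Tensor; in practice `real_df` would be your original data and `synthetic_df` the output of `generate_samples`.

```python
import pandas as pd


def total_variation_distance(real, synthetic):
    """TVD between the category frequencies of two Series (0 = identical, 1 = disjoint)."""
    real_freq = real.value_counts(normalize=True)
    synth_freq = synthetic.value_counts(normalize=True)
    # Align both frequency tables on the union of observed categories.
    categories = real_freq.index.union(synth_freq.index)
    diff = (real_freq.reindex(categories, fill_value=0.0)
            - synth_freq.reindex(categories, fill_value=0.0))
    return 0.5 * diff.abs().sum()


def compare_numeric(real, synthetic):
    """Summary statistics of a numeric column, real vs. synthetic, side by side."""
    return pd.DataFrame({"real": real.describe(), "synthetic": synthetic.describe()})


# Toy stand-ins for the original and generated data.
real_df = pd.DataFrame({
    "gender": ["Male"] * 6 + ["Female"] * 4,
    "age": [25, 32, 47, 51, 38, 29, 44, 36, 60, 23],
})
synthetic_df = pd.DataFrame({
    "gender": ["Male"] * 5 + ["Female"] * 5,
    "age": [27, 30, 45, 55, 40, 31, 42, 35, 58, 26],
})

tvd = total_variation_distance(real_df["gender"], synthetic_df["gender"])
print(f"gender TVD: {tvd:.3f}")
print(compare_numeric(real_df["age"], synthetic_df["age"]))
```

Per-column checks like these do not capture relationships between columns, so they are a starting point rather than a full quality assessment.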

## Roadmap
* [x] Implement basic GAN architecture for tabular data
* [x] Add support for mixed data types (categorical, continuous, integer)
* [x] Integrate early stopping and training history
* [ ] Implement more advanced GAN variants (WGAN, CGAN)
* [ ] Add built-in privacy preserving mechanisms
* [ ] Develop automated hyperparameter tuning
* [ ] Create visualization tools for synthetic data quality assessment
* [ ] Implement distributed training support for large-scale datasets

## Contributing

We welcome contributions! Please see our [CONTRIBUTING.md](CONTRIBUTING.md) file for details on how to get started.

## License

IndoxGen-Tensor is released under the AGPL-3.0-or-later license, as declared in the package metadata. See [LICENSE.md](LICENSE.md) for more details.

---

IndoxGen-Tensor - Advancing Synthetic Data Generation with GAN Technology

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/osllmai/IndoxGen/tree/master/libs/indoxGen_tensor",
    "name": "indoxGen-tensor",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "AI, deep learning, language models, synthetic data generation, machine learning, NLP, GAN, tensorflow",
    "author": "nerdstudio",
    "author_email": "ashkan@nematifamilyfundation.onmicrosoft.com",
    "download_url": "https://files.pythonhosted.org/packages/4e/25/f2a448ff755b8a104dd8ffc430262aa44e437d4a8787cf62947ed376160c/indoxgen_tensor-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "AGPL-3.0-or-later",
    "summary": "Indox Synthetic Data Generation (GAN-tensorflow)",
    "version": "0.1.0",
    "project_urls": {
        "Homepage": "https://github.com/osllmai/IndoxGen/tree/master/libs/indoxGen_tensor"
    },
    "split_keywords": [
        "ai",
        " deep learning",
        " language models",
        " synthetic data generation",
        " machine learning",
        " nlp",
        " gan",
        " tensorflow"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ac28db18674cefcbb01503be575d407f4bbb5d7dd096d8739364f62f1c54caa8",
                "md5": "4041743a6fd69fcb468d0cf1d4da7fe8",
                "sha256": "aa84cba244d6a5aa0505e939d506b2e14c9a029e0fa70c871f9dc6f6a17d3cdb"
            },
            "downloads": -1,
            "filename": "indoxGen_tensor-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "4041743a6fd69fcb468d0cf1d4da7fe8",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 31504,
            "upload_time": "2024-10-13T14:10:45",
            "upload_time_iso_8601": "2024-10-13T14:10:45.263081Z",
            "url": "https://files.pythonhosted.org/packages/ac/28/db18674cefcbb01503be575d407f4bbb5d7dd096d8739364f62f1c54caa8/indoxGen_tensor-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4e25f2a448ff755b8a104dd8ffc430262aa44e437d4a8787cf62947ed376160c",
                "md5": "7df6276f712d55f7d5be6360801286e5",
                "sha256": "778d86079f7080ced26708e2462b0f1399b4e5b9cffbe384da5f6f1fcfda7837"
            },
            "downloads": -1,
            "filename": "indoxgen_tensor-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "7df6276f712d55f7d5be6360801286e5",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 30043,
            "upload_time": "2024-10-13T14:10:47",
            "upload_time_iso_8601": "2024-10-13T14:10:47.047131Z",
            "url": "https://files.pythonhosted.org/packages/4e/25/f2a448ff755b8a104dd8ffc430262aa44e437d4a8787cf62947ed376160c/indoxgen_tensor-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-13 14:10:47",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "osllmai",
    "github_project": "IndoxGen",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "indoxgen-tensor"
}
        