indoxGen-torch


- **Name**: indoxGen-torch
- **Version**: 0.0.9
- **Home page**: https://github.com/osllmai/IndoxGen/tree/master/libs/indoxGen_torch
- **Summary**: Indox Synthetic Data Generation (GAN-pytorch)
- **Upload time**: 2024-10-14 06:14:39
- **Author**: nerdstudio
- **Requires Python**: >=3.9
- **License**: AGPL-3.0-or-later
- **Keywords**: AI, deep learning, language models, synthetic data generation, machine learning, NLP

# IndoxGen-Torch: Advanced GAN-based Synthetic Data Generation Framework

[![License](https://img.shields.io/github/license/osllmai/IndoxGen-Torch)](https://github.com/osllmai/IndoxGen/tree/master/libs/indoxGen_torch/LICENSE)
[![PyPI](https://badge.fury.io/py/IndoxGen-Torch.svg)](https://pypi.org/project/indoxGen-torch/)
[![Python](https://img.shields.io/pypi/pyversions/IndoxGen-Torch.svg)](https://pypi.org/project/indoxGen-torch/)
[![Downloads](https://static.pepy.tech/badge/indoxGen-torch)](https://pepy.tech/project/indoxGen-torch)

[![Discord](https://img.shields.io/discord/1223867382460579961?label=Discord&logo=Discord&style=social)](https://discord.com/invite/ossllmai)
[![GitHub stars](https://img.shields.io/github/stars/osllmai/IndoxGen-torch?style=social)](https://github.com/osllmai/indoxGen)

<p align="center">
  <a href="https://osllm.ai">Official Website</a> &bull; <a href="https://docs.osllm.ai/index.html">Documentation</a> &bull; <a href="https://discord.gg/qrCc56ZR">Discord</a>
</p>

<p align="center">
  <b>NEW:</b> <a href="https://docs.google.com/forms/d/1CQXJvxLUqLBSXnjqQmRpOyZqD6nrKubLz2WTcIJ37fU/prefill">Subscribe to our mailing list</a> for updates and news!
</p>

## Overview

IndoxGen-Torch is a framework for generating synthetic tabular data with Generative Adversarial Networks (GANs) built on PyTorch. It extends IndoxGen with a PyTorch-based backend and targets datasets that mix categorical, continuous, and integer columns.

## Key Features

- **GAN-based Generation**: Utilizes advanced GAN architecture for high-fidelity synthetic data creation.
- **PyTorch Integration**: Built on PyTorch for efficient, GPU-accelerated training and generation.
- **Flexible Data Handling**: Supports categorical, mixed, and integer columns for versatile data modeling.
- **Customizable Architecture**: Easily configure generator and discriminator layers, learning rates, and other hyperparameters.
- **Training Monitoring**: Built-in patience-based early stopping for optimal model training.
- **Scalable Generation**: Efficiently generate large volumes of synthetic data post-training.
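
The patience-based early stopping mentioned above can be sketched in plain Python. This is a generic illustration, not the library's implementation: the `EarlyStopper` class, the `min_delta` parameter, and the choice of a single "lower is better" loss are assumptions, since this README does not say which quantity IndoxGen-Torch monitors.

```python
class EarlyStopper:
    """Patience-based early stopping on a monitored loss (lower is better).

    Hypothetical helper for illustration; not part of IndoxGen-Torch.
    """

    def __init__(self, patience: int, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, loss: float) -> bool:
        """Record one epoch's loss; return True once patience is exhausted."""
        if loss < self.best - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(losses, patience):
    """Return how many epochs execute before the stopper halts training."""
    stopper = EarlyStopper(patience)
    for epoch, loss in enumerate(losses, start=1):
        if stopper.should_stop(loss):
            return epoch
    return len(losses)
```

With `patience=15`, as in the Quick Start call `trainer.train(data, patience=15)`, training under this scheme would halt after 15 consecutive epochs without improvement.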

## Installation

```bash
pip install IndoxGen-Torch
```

## Quick Start Guide

### Basic Usage

```python
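# Note: the published wheel is named `indoxGen_torch`; if the import below
# fails, try importing from `indoxGen_torch` instead.
from indoxGen_pytorch import TabularGANConfig, TabularGANTrainer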
import pandas as pd

# Load your data
data = pd.read_csv("data/Adult.csv")

# Define column types
categorical_columns = ["workclass", "education", "marital-status", "occupation",
                       "relationship", "race", "gender", "native-country", "income"]
mixed_columns = {"capital-gain": "positive", "capital-loss": "positive"}
integer_columns = ["age", "fnlwgt", "hours-per-week", "capital-gain", "capital-loss"]

# Set up the configuration
config = TabularGANConfig(
    input_dim=200,
    generator_layers=[128, 256, 512],
    discriminator_layers=[512, 256, 128],
    learning_rate=2e-4,
    beta_1=0.5,
    beta_2=0.9,
    batch_size=128,
    epochs=50,
    n_critic=5
)

# Initialize and train the model
trainer = TabularGANTrainer(
    config=config,
    categorical_columns=categorical_columns,
    mixed_columns=mixed_columns,
    integer_columns=integer_columns
)
history = trainer.train(data, patience=15)

# Generate synthetic data
synthetic_data = trainer.generate_samples(50000)
```
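
Because the trainer receives column names as plain strings, a misspelled or differently named column may only surface once training starts. A minimal sketch of a pre-flight check follows; `find_unknown_columns` is a hypothetical helper written for this example and is not part of IndoxGen-Torch.

```python
def find_unknown_columns(available, categorical=(), mixed=None, integer=()):
    """Return declared column names that are absent from `available`, sorted.

    `available` can be any iterable of names, e.g. `data.columns`.
    """
    declared = set(categorical) | set(mixed or {}) | set(integer)
    return sorted(declared - set(available))
```

For the Quick Start data, `find_unknown_columns(data.columns, categorical_columns, mixed_columns, integer_columns)` should return an empty list before `trainer.train` is called.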

## Advanced Techniques

### Customizing the GAN Architecture

```python
custom_config = TabularGANConfig(
    input_dim=300,
    generator_layers=[256, 512, 1024, 512],
    discriminator_layers=[512, 1024, 512, 256],
    learning_rate=1e-4,
    batch_size=256,
    epochs=100,
    n_critic=3
)

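# Pass the same column definitions used in Basic Usage.
custom_trainer = TabularGANTrainer(
    config=custom_config,
    categorical_columns=categorical_columns,
    mixed_columns=mixed_columns,
    integer_columns=integer_columns
)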
```

### Handling Imbalanced Datasets

```python
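# 'target_column' is a placeholder; replace it with your label column
# (for example "income" in the Adult dataset used above).

# Share of each class in the original data
original_class_distribution = data['target_column'].value_counts(normalize=True)

# Generate a large sample and measure the same shares in the synthetic data
synthetic_data = trainer.generate_samples(100000)
synthetic_class_distribution = synthetic_data['target_column'].value_counts(normalize=True)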
```
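
The two distributions computed above can be reduced to a single number. The sketch below uses the total variation distance, which is one reasonable choice among several rather than a metric prescribed by IndoxGen-Torch; `class_shares` and `total_variation` are hypothetical helpers that accept any non-empty iterable of labels, including a pandas Series.

```python
from collections import Counter


def class_shares(labels):
    """Map each label to its relative frequency."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must be non-empty")
    return {label: n / total for label, n in counts.items()}


def total_variation(real_labels, synthetic_labels):
    """Total variation distance between two label distributions, in [0, 1]."""
    p = class_shares(real_labels)
    q = class_shares(synthetic_labels)
    # Labels missing from one side contribute their full share to the gap.
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in p.keys() | q.keys())
```

A value of 0 means the class shares match exactly; a value near 1 means the synthetic data has almost no overlap with the original class mix, for example because a minority class was never generated.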

## Configuration and Customization

The `TabularGANConfig` class allows for extensive customization:

- `input_dim`: Dimension of the input noise vector
- `generator_layers` and `discriminator_layers`: List of layer sizes for the generator and discriminator
- `learning_rate`, `beta_1`, `beta_2`: Adam optimizer parameters
- `batch_size`, `epochs`: Training configuration
- `n_critic`: Number of discriminator updates per generator update

Refer to the API documentation for a comprehensive list of configuration options.
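
To make the `n_critic` parameter concrete, the sketch below lists which network would be updated on each step of one epoch. It assumes the common scheme in which the discriminator updates on every batch and the generator updates once after every `n_critic` discriminator updates; the actual training loop inside `TabularGANTrainer` may differ, and `update_schedule` is a hypothetical helper.

```python
def update_schedule(num_batches: int, n_critic: int):
    """List the update order for one epoch: "D" = discriminator, "G" = generator.

    Assumed scheme: one discriminator update per batch, and one generator
    update after every `n_critic` discriminator updates.
    """
    steps = []
    for batch_index in range(1, num_batches + 1):
        steps.append("D")
        if batch_index % n_critic == 0:
            steps.append("G")
    return steps
```

Under this assumption, lowering `n_critic` from 5 to 3 (as in the custom configuration above) gives the generator more frequent updates for the same number of batches.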

## Best Practices

1. **Data Preprocessing**: Ensure your data is properly cleaned and normalized before training.
2. **Hyperparameter Tuning**: Experiment with different configurations to find the optimal setup for your dataset.
3. **Validation**: Regularly compare the distribution of synthetic data with the original dataset.
4. **Privacy Considerations**: Built-in privacy-preserving mechanisms are still on the roadmap, so apply differential privacy or similar techniques yourself when dealing with sensitive data.
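5. **Scalability**: For large datasets, consider PyTorch's distributed training facilities (for example `torch.nn.parallel.DistributedDataParallel`); built-in distributed training support is still on the roadmap.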
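
For the validation practice above, one way to compare a numeric column between the original and synthetic data is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distribution functions. The sketch below is a plain-Python illustration with a hypothetical helper name; it is not a metric shipped with IndoxGen-Torch.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(real), sorted(synthetic)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap
```

For example, `ks_statistic(data["age"], synthetic_data["age"])` returns a value in [0, 1], where values close to 0 indicate that the two columns follow similar distributions.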

## Roadmap
* [x] Implement basic GAN architecture for tabular data
* [x] Add support for mixed data types (categorical, continuous, integer)
* [x] Integrate early stopping and training history
* [ ] Implement more advanced GAN variants (WGAN, CGAN)
* [ ] Add built-in privacy preserving mechanisms
* [ ] Develop automated hyperparameter tuning
* [ ] Create visualization tools for synthetic data quality assessment
* [ ] Implement distributed training support for large-scale datasets

## Contributing

We welcome contributions! Please see our [CONTRIBUTING.md](CONTRIBUTING.md) file for details on how to get started.

## License

IndoxGen-Torch is released under the AGPL-3.0-or-later license, as declared in the package metadata. See [LICENSE.md](LICENSE.md) for more details.

---

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/osllmai/IndoxGen/tree/master/libs/indoxGen_torch",
    "name": "indoxGen-torch",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "AI, deep learning, language models, synthetic data generation, machine learning, NLP",
    "author": "nerdstudio",
    "author_email": "ashkan@nematifamilyfundation.onmicrosoft.com",
    "download_url": "https://files.pythonhosted.org/packages/cf/7c/f17f7256e2120e1110e52bb36d163c59c3510fab4d84fe12ab2e53a48d57/indoxgen_torch-0.0.9.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "AGPL-3.0-or-later",
    "summary": "Indox Synthetic Data Generation (GAN-pytorch)",
    "version": "0.0.9",
    "project_urls": {
        "Homepage": "https://github.com/osllmai/IndoxGen/tree/master/libs/indoxGen_torch"
    },
    "split_keywords": [
        "ai",
        " deep learning",
        " language models",
        " synthetic data generation",
        " machine learning",
        " nlp"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "988868f47b7f0ecf6c1e5f33dff5a52682c7920a74c39b886fc215ca8d1bc14e",
                "md5": "024e214efe4d458b8f59b729249916b6",
                "sha256": "c314d81cead763586a4b2cf54aa7a01123363b14a3cfbaff0acb81e1440aec69"
            },
            "downloads": -1,
            "filename": "indoxGen_torch-0.0.9-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "024e214efe4d458b8f59b729249916b6",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 32384,
            "upload_time": "2024-10-14T06:14:37",
            "upload_time_iso_8601": "2024-10-14T06:14:37.534044Z",
            "url": "https://files.pythonhosted.org/packages/98/88/68f47b7f0ecf6c1e5f33dff5a52682c7920a74c39b886fc215ca8d1bc14e/indoxGen_torch-0.0.9-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "cf7cf17f7256e2120e1110e52bb36d163c59c3510fab4d84fe12ab2e53a48d57",
                "md5": "9df208f482005db6fd41592209917417",
                "sha256": "44f5619b22d29be8deab50d040cd6aa2f9938d5ce678fbf06073c1143c1cb1bf"
            },
            "downloads": -1,
            "filename": "indoxgen_torch-0.0.9.tar.gz",
            "has_sig": false,
            "md5_digest": "9df208f482005db6fd41592209917417",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 31177,
            "upload_time": "2024-10-14T06:14:39",
            "upload_time_iso_8601": "2024-10-14T06:14:39.780222Z",
            "url": "https://files.pythonhosted.org/packages/cf/7c/f17f7256e2120e1110e52bb36d163c59c3510fab4d84fe12ab2e53a48d57/indoxgen_torch-0.0.9.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-14 06:14:39",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "osllmai",
    "github_project": "IndoxGen",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "indoxgen-torch"
}
        