# **Synthetic Data Generator**
The Synthetic Data Generator package generates synthetic datasets that resemble real-world data. It supports continuous, categorical, and time-series features, and can add noise to simulate real-world imperfections.
This package is ideal for data scientists, researchers, and machine learning engineers who need synthetic data for testing, modeling, or experimentation when real data is unavailable or when data privacy is a concern.
## **Key Features**
- Generate continuous data following a normal distribution.
- Generate categorical data with customizable categories.
- Generate time-series data with specified start dates, frequencies, and durations.
- Create combined datasets with continuous, categorical, and time-series features.
- Optionally add noise to both continuous and categorical data to simulate real-world variability.
## **Installation**
To install the package from PyPI, use the following command:
```bash
pip install synthetic-data-generator
```
If you are testing from TestPyPI, install it using:
```bash
pip install --index-url https://test.pypi.org/simple/ synthetic-data-generator
```
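If dependencies are not available on TestPyPI, keep PyPI as the primary index, add TestPyPI as an extra index, and pin the version you want: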
```bash
pip install --extra-index-url https://test.pypi.org/simple/ synthetic-data-generator==0.0.2
```
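To confirm which version pip actually installed, the standard-library `importlib.metadata` module can look up a distribution by name. This is a minimal sketch, assuming the distribution name matches the one used in the install commands above; the helper `installed_version` is hypothetical and not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string for dist_name, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution name assumed from the pip commands in this README.
print(installed_version("synthetic-data-generator"))
```

A result of `None` means the distribution is not installed in the active environment.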
## **Usage**
The examples below show how to generate datasets with continuous, categorical, and time-series features in a few lines of code.
### **Example 1: Basic Usage**
```python
from synthetic_data_generator import SyntheticDataGenerator
# Initialize the generator
generator = SyntheticDataGenerator()
# Generate synthetic data with 2 continuous features, 1 categorical feature, and 1000 samples
df = generator.create_synthetic_dataset(continuous_features=2, categorical_features=1, num_samples=1000)
print(df.head())
```
### **Example 2: Including Time-Series Data**
```python
# Create a dataset with 2 continuous features, 1 categorical feature, and a time-series column
df = generator.create_synthetic_dataset(continuous_features=2, categorical_features=1, num_samples=1000, include_time_series=True, start_date='2023-01-01', freq='D')
print(df.head())
```
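The `start_date` and `freq` arguments behave like the inputs to a pandas date range. The sketch below uses `pandas.date_range` directly to show what a daily series starting on `2023-01-01` looks like; that the package builds its time-series column this way is an assumption, not something this README confirms.

```python
import pandas as pd

# Assumed mapping: start_date -> start, num_samples -> periods, freq -> freq.
index = pd.date_range(start="2023-01-01", periods=5, freq="D")

print(type(index).__name__)               # DatetimeIndex
print(index[0].date(), index[-1].date())  # 2023-01-01 2023-01-05
```

With `freq='D'`, each row advances by one calendar day, so 1000 samples starting on `2023-01-01` would span a little under three years.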
### **Example 3: Adding Noise to the Data**
To simulate real-world data, you can add noise to both continuous and categorical features.
```python
# Generate a clean synthetic dataset first
df = generator.create_synthetic_dataset(continuous_features=2, categorical_features=1, num_samples=1000)
# Add noise to the dataset
noisy_df = generator.add_noise(df, continuous_noise_level=0.05, categorical_noise_level=0.1)
print(noisy_df.head())
```
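To make the two noise levels concrete, here is a minimal NumPy sketch of one plausible interpretation: Gaussian noise whose standard deviation equals `continuous_noise_level`, and random redrawing of a `categorical_noise_level` proportion of labels. Both forms are assumptions for illustration; the package's `add_noise` may scale or sample differently.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

values = rng.normal(loc=0.0, scale=1.0, size=1000)
labels = rng.choice(["A", "B", "C"], size=1000)

# Continuous noise (assumed form): zero-mean Gaussian with std = noise level.
continuous_noise_level = 0.05
noisy_values = values + rng.normal(0.0, continuous_noise_level, size=values.shape)

# Categorical noise (assumed form): redraw a fixed proportion of entries.
categorical_noise_level = 0.1
n_replace = round(categorical_noise_level * labels.size)
positions = rng.choice(labels.size, size=n_replace, replace=False)
noisy_labels = labels.copy()
noisy_labels[positions] = rng.choice(["A", "B", "C"], size=n_replace)

changed = int((noisy_labels != labels).sum())
print(f"labels changed: {changed} of {n_replace} redrawn")
```

Because a redrawn label can land on its original value, the number of labels that actually change is at most the number redrawn.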
## **API Reference**
### **Class: `SyntheticDataGenerator`**
This is the main class of the package responsible for generating synthetic datasets.
#### **Methods:**
1. **`generate_continuous(num_samples=1000, mean=0, std=1)`**
   - Generates continuous data that follows a normal distribution.
   - **Parameters:**
     - `num_samples`: Number of samples to generate (default is 1000).
     - `mean`: The mean of the normal distribution (default is 0).
     - `std`: The standard deviation of the normal distribution (default is 1).
   - **Returns**: A NumPy array of continuous data.
2. **`generate_categorical(num_samples=1000, categories=None)`**
   - Generates categorical data with user-defined categories.
   - **Parameters:**
     - `num_samples`: Number of samples to generate (default is 1000).
     - `categories`: List of categories (when `None`, defaults to `['A', 'B', 'C']`).
   - **Returns**: A NumPy array of categorical data.
3. **`generate_time_series(start_date='2022-01-01', periods=1000, freq='D')`**
   - Generates time-series data based on the specified start date and frequency.
   - **Parameters:**
     - `start_date`: The start date for the time-series data (default is `'2022-01-01'`).
     - `periods`: Number of periods to generate (default is 1000).
     - `freq`: Frequency of the time series (default is `'D'` for daily).
   - **Returns**: A Pandas `DatetimeIndex` object.
4. **`create_synthetic_dataset(continuous_features=1, categorical_features=1, num_samples=1000, include_time_series=False, start_date='2022-01-01', periods=None, freq='D')`**
   - Generates a dataset containing continuous, categorical, and optional time-series features.
   - **Parameters:**
     - `continuous_features`: Number of continuous features to generate (default is 1).
     - `categorical_features`: Number of categorical features to generate (default is 1).
     - `num_samples`: Number of samples to generate (default is 1000).
     - `include_time_series`: Whether to include a time-series column (default is `False`).
     - `start_date`: The start date for the time series (default is `'2022-01-01'`; only applicable if `include_time_series=True`).
     - `periods`: Number of periods to generate for the time series (when `None`, defaults to `num_samples`).
     - `freq`: Frequency of the time series (default is `'D'` for daily).
   - **Returns**: A Pandas `DataFrame` with the generated data.
5. **`add_noise(df, continuous_noise_level=0.01, categorical_noise_level=0.01)`**
   - Adds noise to the continuous and categorical features of the dataset.
   - **Parameters:**
     - `df`: The dataset to which noise will be added.
     - `continuous_noise_level`: The intensity of noise added to continuous features (default is `0.01`).
     - `categorical_noise_level`: The proportion of categorical values to replace with random values (default is `0.01`).
   - **Returns**: A noisy version of the input dataset.
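The way these methods combine into one `DataFrame` can be pictured with a standalone NumPy and pandas sketch. The function `sketch_dataset`, its `seed` parameter, the uniform category sampling, and the column names (`continuous_1`, `categorical_1`, `timestamp`) are all hypothetical; this is not the package's actual implementation, only an illustration that mirrors the documented parameters and defaults.

```python
import numpy as np
import pandas as pd


def sketch_dataset(continuous_features=1, categorical_features=1,
                   num_samples=1000, include_time_series=False,
                   start_date="2022-01-01", freq="D", seed=None):
    """Hypothetical stand-in that mirrors the documented parameters."""
    rng = np.random.default_rng(seed)
    columns = {}
    for i in range(continuous_features):
        # Standard normal draws, matching the documented mean=0, std=1 defaults.
        columns[f"continuous_{i + 1}"] = rng.normal(0.0, 1.0, size=num_samples)
    for i in range(categorical_features):
        # Uniform draws from the documented default categories (assumed sampling).
        columns[f"categorical_{i + 1}"] = rng.choice(["A", "B", "C"], size=num_samples)
    if include_time_series:
        # One timestamp per row, so periods equals num_samples.
        columns["timestamp"] = pd.date_range(start=start_date, periods=num_samples, freq=freq)
    return pd.DataFrame(columns)


df = sketch_dataset(continuous_features=2, categorical_features=1,
                    num_samples=10, include_time_series=True, seed=0)
print(df.shape)  # (10, 4)
```

Two continuous columns, one categorical column, and the timestamp column give four columns in total, one row per sample.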
## **Development and Contribution**
If you want to contribute to the development of this package, you can set up the development environment and run the tests with `unittest`.
### **Installing Development Dependencies**
To install the development dependencies, use the following command:
```bash
pip install synthetic-data-generator[dev]
```
### **Running Tests**
Tests are written using the `unittest` framework. You can run the tests by executing:
```bash
python -m unittest discover tests
```
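A test module that `unittest` discovery would pick up from the `tests` directory might look like the sketch below. The function `fake_categorical` is a hypothetical stand-in so the example runs without the package installed; a real test would call the generator's own methods instead.

```python
import random
import unittest


def fake_categorical(num_samples, categories, seed=0):
    """Hypothetical stand-in for the generator so the sketch runs on its own."""
    rng = random.Random(seed)
    return [rng.choice(categories) for _ in range(num_samples)]


class TestCategoricalOutput(unittest.TestCase):
    def test_length_matches_num_samples(self):
        self.assertEqual(len(fake_categorical(50, ["A", "B", "C"])), 50)

    def test_values_come_from_categories(self):
        values = fake_categorical(50, ["A", "B", "C"])
        self.assertTrue(set(values) <= {"A", "B", "C"})
```

Discovery matches files named `test*.py` by default, so saving this as something like `tests/test_categorical.py` (a hypothetical filename) lets the command above find it.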
## **License**
This project is licensed under the MIT License - see the License file for details.
## **Contributing**
Feel free to submit pull requests or report issues on [GitHub](https://github.com/Gouranga-GH/custom_pypi_sdg.git). Contributions are welcome!
## **Author**
- **Gouranga Jha**
- [post.gourang@gmail.com](mailto:post.gourang@gmail.com)
- GitHub: [https://github.com/Gouranga-GH](https://github.com/Gouranga-GH)
Raw data
{
"_id": null,
"home_page": "https://github.com/Gouranga-GH/custom_pypi_sdg.git",
"name": "auto-synth-data-gen",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": null,
"author": "Gouranga Jha",
"author_email": "post.gourang@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/ae/f2/7d97923491982b81f556413e6ba0cf94c52602fc07adb9fbbb653ed2fb4f/auto_synth_data_gen-0.0.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A Python package to generate synthetic datasets resembling real-world data",
"version": "0.0.3",
"project_urls": {
"Homepage": "https://github.com/Gouranga-GH/custom_pypi_sdg.git"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "5f1f1720e6d0f45e913d46581ca5d278cea384be61875603e85eb9b49c44da24",
"md5": "2522d59e2bfb4e6d72f64d70748fd82b",
"sha256": "69fa45ac5eee76dd476a43eb35c492eb4daeae4ca9a870e92d5aaf702230cf7c"
},
"downloads": -1,
"filename": "auto_synth_data_gen-0.0.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "2522d59e2bfb4e6d72f64d70748fd82b",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 8452,
"upload_time": "2024-09-04T18:32:13",
"upload_time_iso_8601": "2024-09-04T18:32:13.017487Z",
"url": "https://files.pythonhosted.org/packages/5f/1f/1720e6d0f45e913d46581ca5d278cea384be61875603e85eb9b49c44da24/auto_synth_data_gen-0.0.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "aef27d97923491982b81f556413e6ba0cf94c52602fc07adb9fbbb653ed2fb4f",
"md5": "d222a5439b4578ee9b699d0df01b0a1d",
"sha256": "8d303726b9fd57a7e16d71dbf8a45149623eb0ef02e524de339897a00ef0d8e6"
},
"downloads": -1,
"filename": "auto_synth_data_gen-0.0.3.tar.gz",
"has_sig": false,
"md5_digest": "d222a5439b4578ee9b699d0df01b0a1d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 7943,
"upload_time": "2024-09-04T18:32:14",
"upload_time_iso_8601": "2024-09-04T18:32:14.754067Z",
"url": "https://files.pythonhosted.org/packages/ae/f2/7d97923491982b81f556413e6ba0cf94c52602fc07adb9fbbb653ed2fb4f/auto_synth_data_gen-0.0.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-09-04 18:32:14",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Gouranga-GH",
"github_project": "custom_pypi_sdg",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "auto-synth-data-gen"
}