# sdv

| Field           | Value                                                                                 |
| --------------- | ------------------------------------------------------------------------------------- |
| Name            | sdv                                                                                   |
| Version         | 1.12.1                                                                                |
| Summary         | Generate synthetic data for single table, multi table and sequential data             |
| Upload time     | 2024-04-19 20:14:44                                                                   |
| Requires Python | `<3.13,>=3.8`                                                                         |
| License         | BSL-1.1                                                                               |
| Keywords        | sdv, synthetic-data, synhtetic-data-generation, timeseries, single-table, multi-table |

<div align="center">
<br/>
<p align="center">
    <i>This repository is part of <a href="https://sdv.dev">The Synthetic Data Vault Project</a>, a project from <a href="https://datacebo.com">DataCebo</a>.</i>
</p>

[![Dev Status](https://img.shields.io/badge/Dev%20Status-5%20--%20Production%2fStable-green)](https://pypi.org/search/?c=Development+Status+%3A%3A+5+-+Production%2FStable)
[![PyPi Shield](https://img.shields.io/pypi/v/SDV.svg)](https://pypi.python.org/pypi/SDV)
[![Unit Tests](https://github.com/sdv-dev/SDV/actions/workflows/unit.yml/badge.svg?branch=main)](https://github.com/sdv-dev/SDV/actions/workflows/unit.yml?query=branch%3Amain)
[![Integration Tests](https://github.com/sdv-dev/SDV/actions/workflows/integration.yml/badge.svg?branch=main)](https://github.com/sdv-dev/SDV/actions/workflows/integration.yml?query=branch%3Amain)
[![Coverage Status](https://codecov.io/gh/sdv-dev/SDV/branch/main/graph/badge.svg)](https://codecov.io/gh/sdv-dev/SDV)
[![Downloads](https://static.pepy.tech/personalized-badge/sdv?period=total&units=international_system&left_color=grey&right_color=blue&left_text=Downloads)](https://pepy.tech/project/sdv)
[![Colab](https://img.shields.io/badge/Tutorials-Try%20now!-orange?logo=googlecolab)](https://docs.sdv.dev/sdv/demos)
[![Slack](https://img.shields.io/badge/Slack-Join%20now!-36C5F0?logo=slack)](https://bit.ly/sdv-slack-invite)

<div align="left">
<br/>
<p align="center">
<a href="https://github.com/sdv-dev/SDV">
<img align="center" width=40% src="https://github.com/sdv-dev/SDV/blob/stable/docs/images/SDV-logo.png"></img>
</a>
</p>
</div>

</div>

# Overview

The **Synthetic Data Vault** (SDV) is a Python library designed to be your one-stop shop for
creating tabular synthetic data. The SDV uses a variety of machine learning algorithms to learn
patterns from your real data and emulate them in synthetic data.

## Features
:brain: **Create synthetic data using machine learning.** The SDV offers multiple models, ranging
from classical statistical methods (GaussianCopula) to deep learning methods (CTGAN). Generate
data for single tables, multiple connected tables or sequential tables.

:bar_chart: **Evaluate and visualize data.** Compare the synthetic data to the real data against a
variety of measures. Diagnose problems and generate a quality report to get more insights.

:arrows_counterclockwise: **Preprocess, anonymize and define constraints.** Control data
processing to improve the quality of synthetic data, choose from different types of anonymization
and define business rules in the form of logical constraints.

| Important Links                               |                                                                                                     |
| --------------------------------------------- | ----------------------------------------------------------------------------------------------------|
| [![][Colab Logo] **Tutorials**][Tutorials]    | Get some hands-on experience with the SDV. Launch the tutorial notebooks and run the code yourself. |
| :book: **[Docs]**                             | Learn how to use the SDV library with user guides and API references.                               |
| :orange_book: **[Blog]**                      | Get more insights about using the SDV, deploying models and our synthetic data community.          |
| [![][Slack Logo] **Community**][Community]    | Join our Slack workspace for announcements and discussions.                                         |
| :computer: **[Website]**                      | Check out the SDV website for more information about the project.                                   |

[Website]: https://sdv.dev
[Blog]: https://datacebo.com/blog
[Docs]: https://bit.ly/sdv-docs
[Repository]: https://github.com/sdv-dev/SDV
[License]: https://github.com/sdv-dev/SDV/blob/main/LICENSE
[Development Status]: https://pypi.org/search/?c=Development+Status+%3A%3A+5+-+Production%2FStable
[Slack Logo]: https://github.com/sdv-dev/SDV/blob/stable/docs/images/slack.png
[Community]: https://bit.ly/sdv-slack-invite
[Colab Logo]: https://github.com/sdv-dev/SDV/blob/stable/docs/images/google_colab.png
[Tutorials]: https://docs.sdv.dev/sdv/demos

# Install
The SDV is publicly available under the [Business Source License](https://github.com/sdv-dev/SDV/blob/main/LICENSE).
Install SDV using pip or conda. We recommend using a virtual environment to avoid conflicts with
other software on your device.

```bash
pip install sdv
```

```bash
conda install -c pytorch -c conda-forge sdv
```
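
Before installing, it can help to confirm that the interpreter falls inside the range this release declares (`<3.13,>=3.8`), and afterwards to confirm that the package is visible. The following is a minimal sketch using only the standard library; the helper names `python_is_supported` and `installed_version` are illustrative and not part of SDV.

```python
import sys
from importlib import metadata


def python_is_supported(version_info=sys.version_info):
    """Return True if the interpreter satisfies requires_python '<3.13,>=3.8'."""
    major_minor = tuple(version_info[:2])
    return (3, 8) <= major_minor < (3, 13)


def installed_version(package='sdv'):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print('Python supported:', python_is_supported())
print('sdv version:', installed_version())
```

If the second line prints `None`, the package is not installed in the environment that is currently active.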

# Getting Started
Load a demo dataset to get started. This dataset is a single table describing guests staying at a
fictional hotel.

```python
from sdv.datasets.demo import download_demo

real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests')
```

![Single Table Metadata Example](https://github.com/sdv-dev/SDV/blob/stable/docs/images/Single-Table-Metadata-Example.png)

The demo also includes **metadata**, a description of the dataset, including the data types in each
column and the primary key (`guest_email`).
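
To make the idea concrete, the metadata can be pictured as a nested dictionary that names a primary key and assigns a semantic type to each column. The sketch below is hand-written for illustration and uses only the standard library. The column entries and `sdtype` values are assumptions about this demo dataset, not output copied from SDV, and `check_metadata` is a hypothetical helper.

```python
# Hand-written illustration of a single-table metadata description.
# Column names and sdtype values are assumed for the example.
example_metadata = {
    'primary_key': 'guest_email',
    'columns': {
        'guest_email': {'sdtype': 'email', 'pii': True},
        'room_type': {'sdtype': 'categorical'},
        'amenities_fee': {'sdtype': 'numerical'},
        'checkin_date': {'sdtype': 'datetime', 'datetime_format': '%d %b %Y'},
    },
}


def check_metadata(meta):
    """Return a list of problems found in a metadata dictionary."""
    problems = []
    columns = meta.get('columns', {})
    primary_key = meta.get('primary_key')
    # The primary key must refer to a column that is actually described.
    if primary_key is not None and primary_key not in columns:
        problems.append(f'primary key {primary_key!r} is not a listed column')
    # Every column needs a semantic type.
    for name, spec in columns.items():
        if 'sdtype' not in spec:
            problems.append(f'column {name!r} has no sdtype')
    return problems


print(check_metadata(example_metadata))
```

An empty list means the description is internally consistent under these two simple checks.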

## Synthesizing Data
Next, we can create an **SDV synthesizer**, an object that you can use to create synthetic data.
It learns patterns from the real data and replicates them to generate synthetic data. Let's use
the `FAST_ML` preset synthesizer, which is optimized for performance.

```python
from sdv.lite import SingleTablePreset

synthesizer = SingleTablePreset(metadata, name='FAST_ML')
synthesizer.fit(data=real_data)
```

And now the synthesizer is ready to create synthetic data!

```python
synthetic_data = synthesizer.sample(num_rows=500)
```

The synthetic data will have the following properties:
- **Sensitive columns are fully anonymized.** The email, billing address and credit card number
columns contain new data so you don't expose the real values.
- **Other columns follow statistical patterns.** For example, the proportion of room types, the
distribution of check-in dates and the correlations between room rate and room type are preserved.
- **Keys and other relationships are intact.** The primary key (guest email) is unique for each row.
If you have multiple tables, the connections between primary and foreign keys remain consistent.
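
The first and third properties can be spot-checked by hand. The following is a minimal sketch over plain lists of dictionaries (one per row) rather than the tables SDV returns; the helper names and the tiny sample rows are made up for illustration.

```python
def primary_key_is_unique(rows, key):
    """True if every row has a distinct value in the key column."""
    values = [row[key] for row in rows]
    return len(values) == len(set(values))


def leaked_values(real_rows, synthetic_rows, column):
    """Return values of a sensitive column that appear in both tables."""
    real_values = {row[column] for row in real_rows}
    return {row[column] for row in synthetic_rows} & real_values


# Made-up rows standing in for the real and synthetic tables.
real = [
    {'guest_email': 'ana@example.com', 'room_type': 'BASIC'},
    {'guest_email': 'bo@example.com', 'room_type': 'SUITE'},
]
synthetic = [
    {'guest_email': 'x1@example.org', 'room_type': 'BASIC'},
    {'guest_email': 'x2@example.org', 'room_type': 'BASIC'},
]

print(primary_key_is_unique(synthetic, 'guest_email'))
print(leaked_values(real, synthetic, 'guest_email'))
```

If the data is held in pandas DataFrames, `DataFrame.to_dict('records')` produces rows in the shape these helpers expect.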

## Evaluating Synthetic Data
The SDV library allows you to evaluate the synthetic data by comparing it to the real data. Get
started by generating a quality report.

```python
from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    real_data,
    synthetic_data,
    metadata)
```

```
Creating report: 100%|██████████| 4/4 [00:00<00:00, 19.30it/s]
Overall Quality Score: 89.12%
Properties:
Column Shapes: 90.27%
Column Pair Trends: 87.97%
```
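
The overall score in the sample output is consistent with a plain average of the two property scores. Here is a quick arithmetic check with the values copied from that output; treating the overall score as the mean of the properties is an inference from these numbers rather than something stated here.

```python
# Property scores copied from the sample report output.
property_scores = {
    'Column Shapes': 90.27,
    'Column Pair Trends': 87.97,
}

# Mean of the property scores.
overall = sum(property_scores.values()) / len(property_scores)
print(f'Overall Quality Score: {overall:.2f}%')  # Overall Quality Score: 89.12%
```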

The quality report computes an overall quality score on a scale of 0 to 100% (100% being the best)
as well as detailed breakdowns. For more insights, you can also visualize the synthetic vs. real data.

```python
from sdv.evaluation.single_table import get_column_plot

fig = get_column_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    column_name='amenities_fee',
    metadata=metadata
)

fig.show()
```

![Real vs. Synthetic Data](https://github.com/sdv-dev/SDV/blob/stable/docs/images/Real-vs-Synthetic-Evaluation.png)

# What's Next?
Using the SDV library, you can synthesize single table, multi table and sequential data. You can
also customize the full synthetic data workflow, including preprocessing, anonymization and adding
constraints.

To learn more, visit the [SDV Demo page](https://docs.sdv.dev/sdv/demos).

# Credits
Thank you to our team of contributors who have built and maintained the SDV ecosystem over the
years!

[View Contributors](https://github.com/sdv-dev/SDV/graphs/contributors)

## Citation
If you use SDV for your research, please cite the following paper:

*Neha Patki, Roy Wedge, Kalyan Veeramachaneni*. [The Synthetic Data Vault](https://dai.lids.mit.edu/wp-content/uploads/2018/03/SDV.pdf). [IEEE DSAA 2016](https://ieeexplore.ieee.org/document/7796926).

```
@inproceedings{
    SDV,
    title={The Synthetic data vault},
    author={Patki, Neha and Wedge, Roy and Veeramachaneni, Kalyan},
    booktitle={IEEE International Conference on Data Science and Advanced Analytics (DSAA)},
    year={2016},
    pages={399-410},
    doi={10.1109/DSAA.2016.49},
    month={Oct}
}
```

---


<div align="center">
  <a href="https://datacebo.com"><picture>
      <source media="(prefers-color-scheme: dark)" srcset="https://github.com/sdv-dev/SDV/blob/stable/docs/images/datacebo-logo-dark-mode.png">
      <img align="center" width=40% src="https://github.com/sdv-dev/SDV/blob/stable/docs/images/datacebo-logo.png"></img>
  </picture></a>
</div>
<br/>
<br/>

[The Synthetic Data Vault Project](https://sdv.dev) was first created at MIT's
[Data to AI Lab](https://dai.lids.mit.edu/) in 2016. After 4 years of research and traction with
enterprise, we created [DataCebo](https://datacebo.com) in 2020 with the goal of growing the project.
Today, DataCebo is the proud developer of SDV, the largest ecosystem for
synthetic data generation & evaluation. It is home to multiple libraries that support synthetic
data, including:

* 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
* 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular,
  multi table and time series data.
* 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data
  generation models.

[Get started using the SDV package](https://bit.ly/sdv-docs) -- a fully
integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries
for specific needs.

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "sdv",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.13,>=3.8",
    "maintainer_email": null,
    "keywords": "sdv, synthetic-data, synhtetic-data-generation, timeseries, single-table, multi-table",
    "author": null,
    "author_email": "\"DataCebo, Inc.\" <info@sdv.dev>",
    "download_url": "https://files.pythonhosted.org/packages/d8/ff/5daa3a701dd073babf797ce30f05395c631ce1bed51cfdaa3a3a415fb374/sdv-1.12.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "BSL-1.1",
    "summary": "Generate synthetic data for single table, multi table and sequential data",
    "version": "1.12.1",
    "project_urls": {
        "Changes": "https://github.com/sdv-dev/SDV/blob/main/HISTORY.md",
        "Chat": "https://bit.ly/sdv-slack-invite",
        "Issue Tracker": "https://github.com/sdv-dev/SDV/issues",
        "Source Code": "https://github.com/sdv-dev/SDV/",
        "Twitter": "https://twitter.com/sdv_dev"
    },
    "split_keywords": [
        "sdv",
        " synthetic-data",
        " synhtetic-data-generation",
        " timeseries",
        " single-table",
        " multi-table"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "59946945a14b5b85b3604c5d74648b6eb4575b3dc1d73087bc9acdf89d97f3e0",
                "md5": "a10e174f71d9ce6b1cfa03e01b7186a0",
                "sha256": "bda76bdd6f4d612877b7d5e28a2c88b0c27f51535fde2c6b06100dec5b212a7a"
            },
            "downloads": -1,
            "filename": "sdv-1.12.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a10e174f71d9ce6b1cfa03e01b7186a0",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.13,>=3.8",
            "size": 133532,
            "upload_time": "2024-04-19T20:14:27",
            "upload_time_iso_8601": "2024-04-19T20:14:27.262404Z",
            "url": "https://files.pythonhosted.org/packages/59/94/6945a14b5b85b3604c5d74648b6eb4575b3dc1d73087bc9acdf89d97f3e0/sdv-1.12.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d8ff5daa3a701dd073babf797ce30f05395c631ce1bed51cfdaa3a3a415fb374",
                "md5": "6bbac1981c0b72b48d82f244d7c26446",
                "sha256": "2339635601b4ca2d7687ebbf7898df4c9fd8172acc3469049c6308762bcd10d5"
            },
            "downloads": -1,
            "filename": "sdv-1.12.1.tar.gz",
            "has_sig": false,
            "md5_digest": "6bbac1981c0b72b48d82f244d7c26446",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.13,>=3.8",
            "size": 118315,
            "upload_time": "2024-04-19T20:14:44",
            "upload_time_iso_8601": "2024-04-19T20:14:44.594012Z",
            "url": "https://files.pythonhosted.org/packages/d8/ff/5daa3a701dd073babf797ce30f05395c631ce1bed51cfdaa3a3a415fb374/sdv-1.12.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-04-19 20:14:44",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "sdv-dev",
    "github_project": "SDV",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "tox": true,
    "lcname": "sdv"
}
        