# _dpmm_: Differentially Private Marginal Models, a Library for Synthetic Tabular Data Generation
## Overview
_dpmm_ is a Python library that implements state-of-the-art Differentially Private Marginal Models for generating synthetic tabular data.
Marginal Models have consistently been shown to capture key statistical properties like marginal distributions from the original data and reproduce them in the synthetic data, while Differential Privacy (DP) ensures that individual privacy is rigorously protected.
Summary of main features:
* end-to-end DP pipelines including data preprocessing, generative models, and mechanisms:
  * DP data preprocessing -- 1) data domain is either provided as input or extracted with DP<sup>[paper](https://www.research-collection.ethz.ch/handle/20.500.11850/508570)</sup>, and 2) continuous data is discretized with DP (Uniform and PrivTree<sup>[paper](https://arxiv.org/abs/1601.03229)</sup>)
  * state-of-the-art DP generative models relying on the select-measure-generate paradigm<sup>[paper<sub>1</sub>](https://arxiv.org/abs/2108.04978),[paper<sub>2</sub>](https://differentialprivacy.org/synth-data-1/)</sup> and Private-PGM<sup>[paper](https://arxiv.org/abs/1901.09136)</sup> -- PrivBayes<sup>[paper](https://dl.acm.org/doi/10.1145/3134428)</sup>, MST<sup>[paper](https://arxiv.org/abs/2108.04978)</sup>, and AIM<sup>[paper](https://arxiv.org/abs/2201.12677)</sup>
  * floating-point precision of DP mechanisms<sup>[paper](https://arxiv.org/abs/2207.10635)</sup>
* superior utility and performance
* rich functionality across all models/pipelines
* DP auditing of underlying mechanisms and models/pipelines<sup>[paper<sub>1</sub>](https://arxiv.org/abs/2405.10994),[paper<sub>2</sub>](https://dl.acm.org/doi/10.1145/3576915.3616607)</sup>
__NB: Intended Use -- _dpmm_ is designed for research and exploratory use in privacy-preserving synthetic data generation (particularly in simple scenarios such as preserving high-quality 1/2-way marginals in datasets with up to 32 features<sup>[paper<sub>1</sub>](https://arxiv.org/abs/2112.09238),[paper<sub>2</sub>](https://arxiv.org/abs/2305.10994)</sup>) and is not intended for production use in complex, real-world applications.__
## Installation
### Prerequisites
- Python 3.10 or 3.11
### PyPI install
You can install the latest release from PyPI by running:
```sh
pip install dpmm
```
### Local Install
To install from a local clone of the GitHub repository, run the following commands:
```sh
git clone git@github.com:sassoftware/dpmm.git
cd dpmm
poetry install
```
### Tests
To run the unit tests, go to the root of the repository (if installed locally), and use the following command:
```sh
pytest tests/
```
## Functionality
We provide numerous examples demonstrating the features of _dpmm_ across data preprocessing as well as the training of generative models and the generation of synthetic data.
The examples are available across all models and model settings, and are accessible from the repository (if installed locally).
### Preprocessing
The provided generative pipelines combine automatic DP discretization preprocessing with a generative model and allow for the following features:
| Feature | Description | Example |
| --- | --- | --- |
| __dtype support__ | the following pandas data types are supported natively: `datetime`, `timedelta`, `float`, `int`, `category`, `bool`. | [Dtypes example](https://github.com/sassoftware/dpmm/tree/main/examples/example_dtypes.ipynb) |
|__null-value support__ | missing values are supported and will be reproduced accordingly if present in any column within the real data. | |
|__automatic discretisation__ | the default discretisation strategy used by _dpmm_ is `priv-tree`; a more typical `uniform` strategy is also available. Both can be combined with an `'auto'` mode, which attempts to identify the optimal number of bins for each numerical column. | |
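
To make the idea of a `uniform` strategy concrete, the following is a minimal, generic sketch of equal-width binning. It is not _dpmm_'s implementation: the helper names `uniform_bin_edges` and `discretise` are hypothetical, the number of bins is fixed by hand, and the bounds are assumed to be public (for instance, taken from a provided domain) so that the bin edges do not depend on the private data.

```python
from bisect import bisect_right


def uniform_bin_edges(lower: float, upper: float, n_bins: int) -> list[float]:
    """Return n_bins + 1 equally spaced edges covering [lower, upper]."""
    width = (upper - lower) / n_bins
    return [lower + i * width for i in range(n_bins + 1)]


def discretise(value: float, edges: list[float]) -> int:
    """Map a value to a bin index in [0, n_bins - 1], clipping out-of-range values."""
    n_bins = len(edges) - 1
    idx = bisect_right(edges, value) - 1
    return min(max(idx, 0), n_bins - 1)


# Hypothetical public bounds [0, 10] split into 5 bins of width 2.
edges = uniform_bin_edges(0.0, 10.0, 5)
bins = [discretise(v, edges) for v in [-1.0, 0.0, 3.9, 4.0, 10.0, 12.5]]
print(edges)  # [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
print(bins)   # [0, 0, 1, 2, 4, 4]
```

Values outside the bounds are clipped into the first or last bin, which keeps the discretised domain fixed regardless of what the data contains.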
### Model Features
| Feature | Description | Example |
| --- | --- | --- |
| __domain compression__ | a `compress` flag can be set to `True` to ensure the discretised domain is compressed to improve the privacy budget / data quality trade-off. | |
|__model size control__ | a `max_model_size` parameter that ensures the memory footprint of the selected marginals remains lower than the specified upper threshold. | [Max Memory example](https://github.com/sassoftware/dpmm/tree/main/examples/example_memory.ipynb) |
|__model serialisation__ | pipelines can be serialised to / deserialised from disk by providing a valid folder in which to store the model. | [Serialisation example](https://github.com/sassoftware/dpmm/tree/main/examples/example_serialisation.ipynb) |
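
As a rough intuition for why a size limit matters, the number of cells in a marginal is the product of the domain sizes of its columns, so it grows quickly with the number of bins and columns involved. The sketch below is a back-of-the-envelope estimate only: the helper names, the 8 bytes per cell, and the megabyte unit are assumptions for illustration, and the exact accounting and unit used by `max_model_size` are not described here.

```python
from math import prod


def marginal_cells(domain_sizes: dict[str, int], columns: tuple[str, ...]) -> int:
    """Number of cells in the marginal over the given columns."""
    return prod(domain_sizes[c] for c in columns)


def total_size_mb(
    domain_sizes: dict[str, int],
    marginals: list[tuple[str, ...]],
    bytes_per_cell: int = 8,  # assumption: one 64-bit float per cell
) -> float:
    """Approximate memory needed to store all selected marginals, in MiB."""
    cells = sum(marginal_cells(domain_sizes, m) for m in marginals)
    return cells * bytes_per_cell / 2**20


# Hypothetical discretised domain: number of distinct values per column.
domain_sizes = {"type": 2, "quality": 2, "alcohol": 32, "pH": 32}
marginals = [("type", "alcohol"), ("alcohol", "pH"), ("alcohol", "pH", "quality")]

size_mb = total_size_mb(domain_sizes, marginals)
print(f"{size_mb:.4f} MiB")
```

Adding a single extra column with 32 bins to a marginal multiplies its cell count by 32, which is why capping the total size constrains which marginals can be selected.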
### Generation Features
| Feature | Description | Example |
| --- | --- | --- |
| __conditional generation__ | at generation time, it is also possible to provide a partial dataframe containing only some of the columns; in that case, the generative pipeline will conditionally generate the remaining columns. | [Conditional Generation example](https://github.com/sassoftware/dpmm/tree/main/examples/example_conditional.ipynb) |
| __deterministic generation__ | when a `random_state` value is provided at generation time, the generative process becomes deterministic assuming the same input parameters are provided. | [Random State example](https://github.com/sassoftware/dpmm/tree/main/examples/example_seed.ipynb) |
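
The two generation features can be illustrated with a toy example that is independent of _dpmm_'s API. Assuming a hypothetical 2-way table of counts over `(type, quality)`, conditional generation amounts to sampling the missing column from the slice of the table that matches the provided value, and a fixed seed makes that sampling repeatable. The function `sample_conditional` and the counts are made up for this sketch.

```python
import random

# Hypothetical joint counts over (type, quality).
joint = {
    ("red", "0"): 30,
    ("red", "1"): 10,
    ("white", "0"): 20,
    ("white", "1"): 40,
}


def sample_conditional(joint, given, n, seed=None):
    """Sample n values of the second column conditioned on the first column."""
    rng = random.Random(seed)  # a fixed seed gives a repeatable stream
    values = [b for (a, b) in joint if a == given]
    weights = [joint[(given, b)] for b in values]
    return rng.choices(values, weights=weights, k=n)


first = sample_conditional(joint, given="white", n=5, seed=0)
second = sample_conditional(joint, given="white", n=5, seed=0)
print(first == second)  # True: same seed and same inputs give the same output
```

Changing the seed, the conditioning value, or the number of records changes the output, which matches the caveat that determinism only holds when the same input parameters are provided.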
### Models
The implemented models include:
| Method | Description | Reference | Example |
|--- | --- | --- | --- |
|**PrivBayes+PGM**| Differentially Private Bayesian Network. | [PrivBayes: Private Data Release via Bayesian Networks](https://dl.acm.org/doi/10.1145/3134428)| [PrivBayes example](https://github.com/sassoftware/dpmm/tree/main/examples/example_privbayes.ipynb) |
|**MST**| Maximum Spanning Tree. | [Winning the NIST Contest: A scalable and general approach to differentially private synthetic data](https://arxiv.org/abs/2108.04978)| [MST example](https://github.com/sassoftware/dpmm/tree/main/examples/example_mst.ipynb) |
|**AIM**| Adaptive and Iterative Mechanism. | [AIM: An Adaptive and Iterative Mechanism for Differentially Private Synthetic Data](https://arxiv.org/abs/2201.12677)| [AIM example](https://github.com/sassoftware/dpmm/tree/main/examples/example_aim.ipynb) |
__NB: All models rely on the select-measure-generate paradigm<sup>[paper<sub>1</sub>](https://arxiv.org/abs/2108.04978),[paper<sub>2</sub>](https://differentialprivacy.org/synth-data-1/)</sup> and Private-PGM<sup>[paper](https://arxiv.org/abs/1901.09136)</sup>.__
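
For readers new to the select-measure-generate paradigm, the toy sketch below walks through the three steps using only the Python standard library. It is a heavily simplified illustration, not how PrivBayes, MST, or AIM work in _dpmm_: it only measures 1-way marginals, samples columns independently instead of fitting a graphical model with Private-PGM, and uses an arbitrary noise scale `sigma` that is not calibrated to any `epsilon` or `delta`. All function names are hypothetical.

```python
import random
from collections import Counter


def select(columns):
    """Select: pick the marginals to measure (here, every 1-way marginal)."""
    return list(columns)


def measure(rows, column, values, sigma, rng):
    """Measure: count each value, then add Gaussian noise to every cell."""
    counts = Counter(row[column] for row in rows)
    return {v: counts[v] + rng.gauss(0.0, sigma) for v in values}


def generate(noisy_marginals, n_records, rng):
    """Generate: sample each column from its clipped noisy marginal."""
    sampled = {}
    for column, cells in noisy_marginals.items():
        values = list(cells)
        weights = [max(cells[v], 0.0) for v in values]  # noise can go negative
        if sum(weights) == 0:  # every cell clipped to zero: fall back to uniform
            weights = [1.0] * len(values)
        sampled[column] = rng.choices(values, weights=weights, k=n_records)
    return [dict(zip(sampled, record)) for record in zip(*sampled.values())]


rng = random.Random(0)
domain = {"type": ["red", "white"], "quality": [0, 1]}
rows = [{"type": "white", "quality": 1}] * 60 + [{"type": "red", "quality": 0}] * 40

noisy = {c: measure(rows, c, domain[c], sigma=2.0, rng=rng) for c in select(domain)}
synth = generate(noisy, n_records=5, rng=rng)
print(synth)
```

Noise is added to every cell of the domain, including cells with a zero count, so that the set of measured cells does not itself reveal which values occur in the data.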
## Getting Started
To get started with _dpmm_, follow the steps below:
1. Import the necessary modules and load your data:
```python
import json
from pathlib import Path

import pandas as pd

from dpmm.pipelines import MSTPipeline

# Folder holding the example wine dataset and its domain bounds
wine_dir = Path().parent / "wine"

df = pd.read_pickle(wine_dir / "wine.pkl.gz")
with (wine_dir / "wine_bounds.json").open("r") as f:
    domain = json.load(f)
```
2. Initialize and fit a model:
```python
model = MSTPipeline(
    # Generator parameters
    epsilon=1.0,
    delta=1e-5,
    # Discretiser parameters
    proc_epsilon=0.1,
)
model.fit(df, domain)
```
3. Generate synthetic data:
```python
synth_df = model.generate(n_records=100)
print(synth_df)
"""
type fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 white 5.288142 0.190330 0.212473 1.402665 0.032305 37.097305 60.585301 0.990234 2.998241 0.658841 12.467682 1
1 white 5.956364 0.225099 0.210124 15.968057 0.043620 70.073909 202.689578 0.995807 3.198247 0.318414 10.290390 0
2 white 5.315535 0.341091 0.247268 0.628240 0.024938 52.468176 104.892353 0.990975 3.161218 0.971699 11.181373 1
3 white 7.879125 0.234170 0.275704 3.711610 0.039565 68.977194 163.380550 1.005989 3.068622 0.798520 8.075999 0
4 white 6.981342 0.358461 0.337705 3.600390 0.050450 51.567452 134.896467 0.996149 3.272745 0.599021 10.200400 0
"""
```
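
Once synthetic data has been generated, a quick sanity check is to compare 1-way marginals between the real and synthetic data. The sketch below computes the total variation distance between two lists of categorical values. The helper `tvd` is hypothetical and not part of _dpmm_, and the two short lists are made-up stand-ins for a real and a synthetic column.

```python
from collections import Counter


def tvd(real, synth):
    """Total variation distance between the empirical distributions of two samples."""
    p, q = Counter(real), Counter(synth)
    n_p, n_q = len(real), len(synth)
    return 0.5 * sum(abs(p[k] / n_p - q[k] / n_q) for k in set(p) | set(q))


# Made-up stand-ins for one real and one synthetic column.
real_type = ["white", "white", "white", "red"]
synth_type = ["white", "white", "red", "red"]

distance = tvd(real_type, synth_type)
print(distance)  # 0.25
```

A value of 0 means the two marginals match exactly and 1 means they do not overlap at all. Continuous columns would need to be binned before applying this check, and computing such a metric on the real data is itself not differentially private.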
### Troubleshooting
If you encounter any issues, please check the following:
- Ensure that all required packages are installed.
- Verify that your data does not contain missing values or non-integer columns if using certain models.
- Check the model parameters and ensure they are set correctly.
## Contributing
Maintainers are accepting patches and contributions to this project.
Please read [CONTRIBUTING.md](https://github.com/sassoftware/dpmm/tree/main/CONTRIBUTING.md) for details about submitting contributions to this project.
## License
This project is licensed under the [Apache 2.0 License](https://github.com/sassoftware/dpmm/tree/main/LICENSE).
This project also uses code snippets from the following projects:
- [private-pgm](https://github.com/ryan112358/private-pgm): Apache 2.0
- [opendp](https://github.com/opendp/smartnoise-sdk): MIT License
- [ektelo](https://github.com/ektelo/ektelo): Apache 2.0
## Additional Resources
* [SAS Global Forum Papers](https://www.sas.com/en_us/events/sas-global-forum.html)
* [SAS Communities](https://communities.sas.com/)
## Citing
If you use this code, please cite the associated paper:
```
@inproceedings{mahiou2025dpmm,
  title={{dpmm: Differentially Private Marginal Models, a Library for Synthetic Tabular Data Generation}},
  author={Mahiou, Sofiane and Dizche, Amir and Nazari, Reza and Wu, Xinmin and Abbey, Ralph and Silva, Jorge and Ganev, Georgi},
  booktitle={TPDP},
  year={2025}
}
```