dpmm

- Name: dpmm
- Version: 0.1.9
- Home page: https://github.com/sassoftware/dpmm
- arXiv: https://arxiv.org/abs/2506.00322
- Summary: dpmm: a library for synthetic tabular data generation with rich functionality and end-to-end Differential Privacy guarantees
- Upload time: 2025-08-28 13:36:37
- Maintainer: Sofiane Mahiou
- Authors: Sofiane Mahiou, Georgi Ganev
- Requires Python: `<3.12,>=3.10`
- License: Apache-2.0
- Keywords: machine-learning, tabular-data, differential-privacy, synthetic-data

# _dpmm_: Differentially Private Marginal Models, a Library for Synthetic Tabular Data Generation


## Overview

_dpmm_ is a Python library that implements state-of-the-art Differentially Private Marginal Models for generating synthetic tabular data.
Marginal Models have consistently been shown to capture key statistical properties like marginal distributions from the original data and reproduce them in the synthetic data, while Differential Privacy (DP) ensures that individual privacy is rigorously protected.

Summary of main features:
* end-to-end DP pipelines including data preprocessing, generative models, and mechanisms:
   * DP data preprocessing -- 1) data domain is either provided as input or extracted with DP<sup>[paper](https://www.research-collection.ethz.ch/handle/20.500.11850/508570)</sup>, and 2) continuous data is discretized with DP (Uniform and PrivTree<sup>[paper](https://arxiv.org/abs/1601.03229)</sup>)
   * state-of-the-art DP generative models relying on the select-measure-generate paradigm<sup>[paper<sub>1</sub>](https://arxiv.org/abs/2108.04978),[paper<sub>2</sub>](https://differentialprivacy.org/synth-data-1/)</sup> and Private-PGM<sup>[paper](https://arxiv.org/abs/1901.09136)</sup> -- PrivBayes<sup>[paper](https://dl.acm.org/doi/10.1145/3134428)</sup>, MST<sup>[paper](https://arxiv.org/abs/2108.04978)</sup>, and AIM<sup>[paper](https://arxiv.org/abs/2201.12677)</sup>
   * floating-point precision of DP mechanisms<sup>[paper](https://arxiv.org/abs/2207.10635)</sup>
* superior utility and performance
* rich functionality across all models/pipelines
* DP auditing of underlying mechanisms and models/pipelines<sup>[paper<sub>1</sub>](https://arxiv.org/abs/2405.10994),[paper<sub>2</sub>](https://dl.acm.org/doi/10.1145/3576915.3616607)</sup>

__NB: Intended Use -- _dpmm_ is designed for research and exploratory use in privacy-preserving synthetic data generation (particularly in simple scenarios such as preserving high-quality 1/2-way marginals in datasets with up to 32 features<sup>[paper<sub>1</sub>](https://arxiv.org/abs/2112.09238),[paper<sub>2</sub>](https://arxiv.org/abs/2305.10994)</sup>) and is not intended for production use in complex, real-world applications.__

 

## Installation

### Prerequisites

- Python 3.10 or 3.11
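
As a quick sanity check before installing, the supported interpreter range can be tested from Python itself. This is a minimal sketch: the bounds come from the package metadata (`>=3.10,<3.12`), while the helper name `is_supported` is hypothetical and not part of _dpmm_.

```python
import sys

# Supported range taken from the package metadata: >=3.10,<3.12.
MIN_VERSION = (3, 10)
MAX_EXCLUSIVE = (3, 12)


def is_supported(version: tuple[int, int]) -> bool:
    """Return True if (major, minor) falls inside the supported range."""
    return MIN_VERSION <= version < MAX_EXCLUSIVE


if __name__ == "__main__":
    current = sys.version_info[:2]
    status = "supported" if is_supported(current) else "not supported"
    print(f"Python {current[0]}.{current[1]} is {status} by dpmm 0.1.9")
```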

### PyPI install

You can install from PyPI by running:

```sh
pip install dpmm
```

### Local Install 

To install from a local clone of the GitHub repository, run the following commands:

```sh
git clone git@github.com:sassoftware/dpmm.git
cd dpmm
poetry install
```

### Tests

To run the unit tests, go to the root of the repository (if installed locally), and use the following command:

```sh
pytest tests/
```



## Functionality

We provide numerous examples demonstrating the features of _dpmm_ across data preprocessing as well as the training and generation of generative models.
The examples are available across all models and model settings, and are accessible from the repository (if installed locally).


### Preprocessing
The provided generative pipelines combine automatic DP discretization preprocessing with a generative model and allow for the following features:

| Feature | Description | Example |
| --- | --- | --- |
| __dtype support__ | the following pandas data types are supported natively: `datetime`, `timedelta`, `float`, `int`, `category`, `bool`. | [Dtypes example](https://github.com/sassoftware/dpmm/tree/main/examples/example_dtypes.ipynb) |
|__null-value support__ | missing values are supported and will be reproduced accordingly if present in any column within the real data. | |
|__automatic discretisation__ | the default discretisation strategy used by _dpmm_ is `priv-tree`; a more typical `uniform` strategy is also available. Both can be combined with an `'auto'` mode, which attempts to identify the optimal number of bins for each numerical column. | |
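
To make the idea of uniform discretisation concrete, the sketch below bins a numerical column into equal-width intervals over bounds that are supplied up front. It only illustrates the concept: it adds no DP noise, does not implement PrivTree or the `'auto'` bin selection, and the function names `uniform_bin_edges` and `discretise` are hypothetical rather than part of _dpmm_.

```python
from bisect import bisect_right


def uniform_bin_edges(low: float, high: float, n_bins: int) -> list[float]:
    """Return n_bins + 1 equally spaced edges covering [low, high]."""
    width = (high - low) / n_bins
    return [low + i * width for i in range(n_bins + 1)]


def discretise(value: float, edges: list[float]) -> int:
    """Map a value to its bin index, clipping values outside the bounds."""
    n_bins = len(edges) - 1
    index = bisect_right(edges, value) - 1
    return min(max(index, 0), n_bins - 1)


edges = uniform_bin_edges(low=0.0, high=10.0, n_bins=5)
codes = [discretise(v, edges) for v in [0.0, 1.9, 2.0, 9.99, 10.0, 12.5]]
print(edges)  # [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
print(codes)  # [0, 0, 1, 4, 4, 4]
```

Values outside the bounds are clipped into the first or last bin, which keeps the discretised domain fixed regardless of the data.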


### Model Features

| Feature | Description | Example |
| --- | --- | --- |
| __domain compression__ | a `compress` flag can be set to `True` to ensure the discretised domain is compressed to improve the privacy budget / data quality trade-off. |  |
|__model size control__ | a `max_model_size` parameter that ensures the memory footprint of the selected marginals remains lower than the specified upper threshold. | [Max Memory example](https://github.com/sassoftware/dpmm/tree/main/examples/example_memory.ipynb) |
|__model serialisation__ | pipelines can be serialised to / deserialised from disk by providing a valid folder in which to store the model. | [Serialisation example](https://github.com/sassoftware/dpmm/tree/main/examples/example_serialisation.ipynb) |
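
The reasoning behind model size control can be sketched as follows: the table for a marginal has one cell per combination of bins, so its footprint grows with the product of the column domain sizes. The example below is an illustration under stated assumptions, not the logic _dpmm_ uses: the domain sizes, the 8 bytes per cell, the megabyte unit, and the helper names are all hypothetical.

```python
from itertools import combinations
from math import prod

# Hypothetical discretised domain: column name -> number of distinct bins.
domain_sizes = {"type": 2, "alcohol": 32, "pH": 32, "quality": 2}

BYTES_PER_CELL = 8  # assumption: one float64 per marginal cell


def marginal_megabytes(columns: tuple[str, ...]) -> float:
    """Size of the contingency table over `columns`, in megabytes."""
    cells = prod(domain_sizes[c] for c in columns)
    return cells * BYTES_PER_CELL / 1e6


def within_budget(candidates, selected, max_megabytes):
    """Keep candidates whose addition keeps the total under the limit."""
    used = sum(marginal_megabytes(m) for m in selected)
    return [m for m in candidates if used + marginal_megabytes(m) <= max_megabytes]


pairs = list(combinations(domain_sizes, 2))
allowed = within_budget(pairs, selected=[], max_megabytes=0.005)
print(len(pairs), len(allowed))  # 6 5: only ("alcohol", "pH") is too large
```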


### Generation Features

| Feature | Description | Example |
| --- | --- | --- |
| __conditional generation__ | at generation time, it is also possible to provide a partial dataframe containing only some of the columns; in that case, the generative pipeline will conditionally generate the remaining columns. | [Conditional Generation example](https://github.com/sassoftware/dpmm/tree/main/examples/example_conditional.ipynb) |
| __deterministic generation__ | when a `random_state` value is provided at generation time, the generative process becomes deterministic assuming the same input parameters are provided. | [Random State example](https://github.com/sassoftware/dpmm/tree/main/examples/example_seed.ipynb) |
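
Both generation features can be illustrated with a toy model. The sketch below completes a partially specified record by sampling the missing column from a joint table restricted to the provided value, and uses a seeded generator so that repeated calls return the same rows. The joint counts and the functions `sample_quality_given_type` and `generate` are hypothetical stand-ins; they do not reflect the _dpmm_ API or its internals.

```python
import random

# Hypothetical joint counts over two discretised columns (type, quality).
joint_counts = {
    ("red", 0): 30, ("red", 1): 10,
    ("white", 0): 20, ("white", 1): 40,
}


def sample_quality_given_type(wine_type: str, rng: random.Random) -> int:
    """Draw `quality` from the joint table restricted to the given type."""
    values = [q for (t, q) in joint_counts if t == wine_type]
    weights = [joint_counts[(wine_type, q)] for q in values]
    return rng.choices(values, weights=weights, k=1)[0]


def generate(partial_rows: list[str], random_state: int) -> list[tuple[str, int]]:
    """Complete each provided `type` value with a sampled `quality`."""
    rng = random.Random(random_state)
    return [(t, sample_quality_given_type(t, rng)) for t in partial_rows]


partial = ["red", "white", "white"]
first = generate(partial, random_state=0)
second = generate(partial, random_state=0)
print(first == second)  # True: same seed and inputs give the same rows
```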

### Models
The implemented models include:

| Method | Description | Reference | Example | 
|--- | --- | --- | --- | 
|**PrivBayes+PGM**|  Differentially Private Bayesian Network. | [PrivBayes: Private Data Release via Bayesian Networks](https://dl.acm.org/doi/10.1145/3134428)| [PrivBayes example](https://github.com/sassoftware/dpmm/tree/main/examples/example_privbayes.ipynb) |
|**MST**|  Maximum Spanning Tree. | [Winning the NIST Contest: A scalable and general approach to differentially private synthetic data](https://arxiv.org/abs/2108.04978)| [MST example](https://github.com/sassoftware/dpmm/tree/main/examples/example_mst.ipynb) | 
|**AIM**|  Adaptive and Iterative Mechanism. | [AIM: An Adaptive and Iterative Mechanism for Differentially Private Synthetic Data](https://arxiv.org/abs/2201.12677)| [AIM example](https://github.com/sassoftware/dpmm/tree/main/examples/example_aim.ipynb) |

__NB: All models rely on the select-measure-generate paradigm<sup>[paper<sub>1</sub>](https://arxiv.org/abs/2108.04978),[paper<sub>2</sub>](https://differentialprivacy.org/synth-data-1/)</sup> and Private-PGM<sup>[paper](https://arxiv.org/abs/1901.09136)</sup>.__
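
The select-measure-generate paradigm can be sketched with a deliberately simplified toy. It is not how PrivBayes, MST, or AIM work: selection is fixed to all 1-way marginals, measurement uses the Laplace mechanism assuming add/remove neighbouring datasets (sensitivity 1 per marginal), and generation samples each column independently instead of fitting Private-PGM. The naive floating-point noise sampling shown here is also the kind of implementation that the floating-point literature warns about, so treat it purely as an illustration.

```python
import random
from collections import Counter


def laplace(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def measure(column, categories, epsilon, rng):
    """Noisy 1-way marginal: true counts plus Laplace(1/epsilon) noise."""
    counts = Counter(column)
    return {c: counts[c] + laplace(1 / epsilon, rng) for c in categories}


def generate(noisy, n_records, rng):
    """Sample records from a noisy marginal after clipping negatives."""
    cats = list(noisy)
    weights = [max(noisy[c], 0.0) for c in cats]
    if sum(weights) == 0:  # all counts were noised below zero
        weights = [1.0] * len(cats)
    return rng.choices(cats, weights=weights, k=n_records)


rng = random.Random(0)
data = {"type": ["red"] * 30 + ["white"] * 70, "quality": [0] * 55 + [1] * 45}
domain = {"type": ["red", "white"], "quality": [0, 1]}

# Select: every 1-way marginal (fixed, so no budget is spent selecting).
# Measure: split epsilon evenly across the measured marginals.
epsilon_each = 1.0 / len(domain)
noisy = {col: measure(data[col], domain[col], epsilon_each, rng) for col in domain}
# Generate: sample each column independently from its noisy marginal.
synthetic = {col: generate(noisy[col], 5, rng) for col in domain}
print(synthetic)
```

Splitting the budget evenly relies on sequential composition: measuring two marginals at epsilon 0.5 each costs epsilon 1.0 in total.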



## Getting Started

To get started with _dpmm_, follow the steps below:

1. Import the necessary modules and load your data:
   ```python
   import json
   from pathlib import Path

   import pandas as pd
   from dpmm.pipelines import MSTPipeline


   wine_dir = Path().parent / "wine"

   df = pd.read_pickle(wine_dir / "wine.pkl.gz")
   with (wine_dir / "wine_bounds.json").open("r") as f:
      domain = json.load(f)
   ```

2. Initialize and fit a model:

   ```python
   model = MSTPipeline(
      # Generator Parameters
      epsilon=1.0, 
      delta=1e-5,
      # Discretiser Parameters
      proc_epsilon=0.1,
   )
   model.fit(df, domain)
   ```

3. Generate synthetic data:
   ```python
   synth_df = model.generate(n_records=100)
   print(synth_df)
   """
         type  fixed acidity  volatile acidity  citric acid  residual sugar   chlorides free sulfur dioxide  total sulfur dioxide   density        pH   sulphates    alcohol quality  
      0  white       5.288142          0.190330     0.212473        1.402665    0.032305            37.097305             60.585301  0.990234  2.998241    0.658841  12.467682       1  
      1  white       5.956364          0.225099     0.210124       15.968057    0.043620            70.073909            202.689578  0.995807  3.198247    0.318414  10.290390       0  
      2  white       5.315535          0.341091     0.247268        0.628240    0.024938            52.468176            104.892353  0.990975  3.161218    0.971699  11.181373       1  
      3  white       7.879125          0.234170     0.275704        3.711610    0.039565            68.977194            163.380550  1.005989  3.068622    0.798520   8.075999       0  
      4  white       6.981342          0.358461     0.337705        3.600390    0.050450            51.567452            134.896467  0.996149  3.272745    0.599021  10.200400       0  

   """
   ```
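
Once synthetic data has been generated, a simple way to check how well a categorical 1-way marginal is preserved is the total variation distance between the real and synthetic frequencies. The sketch below uses plain lists and only the standard library; the helper names are hypothetical and not part of _dpmm_. With pandas, the inputs could be obtained with `df["type"].tolist()` and `synth_df["type"].tolist()`; continuous columns would need to be binned first.

```python
from collections import Counter


def marginal(values: list) -> dict:
    """Empirical 1-way marginal: category -> relative frequency."""
    counts = Counter(values)
    return {k: v / len(values) for k, v in counts.items()}


def total_variation(real: list, synth: list) -> float:
    """Half the L1 distance between two empirical marginals (0 = identical)."""
    p, q = marginal(real), marginal(synth)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


real_type = ["white"] * 75 + ["red"] * 25
synth_type = ["white"] * 80 + ["red"] * 20
print(round(total_variation(real_type, synth_type), 3))  # 0.05
```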



### Troubleshooting

If you encounter any issues, please check the following:

- Ensure that all required packages are installed.
- Verify that your data does not contain missing values or non-integer columns if using certain models.
- Check the model parameters and ensure they are set correctly.



## Contributing

Maintainers are accepting patches and contributions to this project.
Please read [CONTRIBUTING.md](https://github.com/sassoftware/dpmm/tree/main/CONTRIBUTING.md) for details about submitting contributions to this project.



## License

This project is licensed under the [Apache 2.0 License](https://github.com/sassoftware/dpmm/tree/main/LICENSE).
This project also uses code snippets from the following projects: 
- [private-pgm](https://github.com/ryan112358/private-pgm): Apache 2.0
- [opendp](https://github.com/opendp/smartnoise-sdk): MIT License
- [ektelo](https://github.com/ektelo/ektelo): Apache 2.0



## Additional Resources

* [SAS Global Forum Papers](https://www.sas.com/en_us/events/sas-global-forum.html)
* [SAS Communities](https://communities.sas.com/)



## Citing

If you use this code, please cite the associated paper:
```
@inproceedings{mahiou2025dpmm,
  title={{dpmm: Differentially Private Marginal Models, a Library for Synthetic Tabular Data Generation}},
  author={Mahiou, Sofiane and Dizche, Amir and Nazari, Reza and Wu, Xinmin and Abbey, Ralph and Silva, Jorge and Ganev, Georgi},
  booktitle={TPDP},
  year={2025}
}
```

            
