metasyn


- **Name**: metasyn
- **Version**: 0.7.1
- **Summary**: Package for creating synthetic datasets while preserving privacy.
- **Upload time**: 2024-02-28 10:24:49
- **Requires Python**: >=3.8
- **License**: MIT License, Copyright (c) 2024 SoDa
- **Keywords**: metadata, open-data, privacy, synthetic-data, tabular datasets

            <p align="center">
  <img src="docs/source/images/logos/blue.svg" width="600px" alt="Metasyn logo"></img>
  <h3 align="center">Transparent and privacy-friendly synthetic data generation</h3>
  <p align="center">
    <span>
        <a href="https://www.repostatus.org/#wip"><img src="https://www.repostatus.org/badges/latest/wip.svg" alt="Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public."/></a>
        <a href="https://pypi.org/project/metasyn"><img src="https://img.shields.io/pypi/pyversions/metasyn" alt="metasyn on pypi"></img></a>
        <a href="https://colab.research.google.com/github/sodascience/metasyn/blob/main/examples/getting_started.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="open getting started on colab"></img></a>
        <a href="https://metasyn.readthedocs.io/en/latest/index.html"><img src="https://readthedocs.org/projects/metasyn/badge/?version=latest" alt="Readthedocs"></img></a>
        <a href="https://hub.docker.com/r/sodateam/metasyn"><img src="https://img.shields.io/docker/v/sodateam/metasyn?logo=docker&label=docker&color=blue" alt="Docker image version"></img></a>
    </span>
  </p>
</p>
<br/>

Metasyn is a Python package that **generates synthetic data**, and allows **sharing of the data generation model**, to facilitate collaboration and testing on sensitive data without exposing the original data.

It has three main functionalities:

1. **[Estimation](https://metasynth.readthedocs.io/en/latest/usage/generating_metaframes.html)**: Metasyn can analyze a dataset and create a *MetaFrame* for it. This is essentially a blueprint (or data generation model) that captures the structure and distributions of the columns without storing any entries.
2. **[Generation](https://metasynth.readthedocs.io/en/latest/usage/generating_synthetic_data.html)**: From a *MetaFrame*, metasyn can generate new synthetic data that resembles the original, on a column-by-column basis. 
3. **[Serialization](https://metasynth.readthedocs.io/en/latest/usage/exporting_metaframes.html)**: Metasyn can export and import *MetaFrames* to an easy-to-read format. This allows for easy modification and sharing of the model.
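
The three steps above can be pictured with a toy sketch that uses only the Python standard library. This is not metasyn's API or implementation: the helper names (`fit_column`, `fit_table`, `draw`, `synthesize`), the choice between a normal and a categorical model, and the JSON layout are all assumptions made for illustration.

```python
import json
import random
import statistics
from collections import Counter


def fit_column(values):
    """Describe one column by a simple distribution (illustrative only)."""
    if all(isinstance(v, (int, float)) for v in values):
        # Numeric column: keep only the mean and standard deviation.
        return {"type": "normal",
                "mean": statistics.fmean(values),
                "sd": statistics.pstdev(values)}
    # Anything else: keep the labels and their relative frequencies.
    counts = Counter(values)
    labels = sorted(counts)
    return {"type": "categorical",
            "labels": labels,
            "probs": [counts[label] / len(values) for label in labels]}


def fit_table(table):
    """Estimation: fit every column independently of the others."""
    return {name: fit_column(values) for name, values in table.items()}


def draw(spec, rng):
    """Draw a single value from a fitted column description."""
    if spec["type"] == "normal":
        return rng.gauss(spec["mean"], spec["sd"])
    return rng.choices(spec["labels"], weights=spec["probs"], k=1)[0]


def synthesize(model, n_rows, seed=None):
    """Generation: sample each column on its own, row by row."""
    rng = random.Random(seed)
    return {name: [draw(spec, rng) for _ in range(n_rows)]
            for name, spec in model.items()}


original = {"age": [23, 35, 41, 52],
            "fruit": ["apple", "pear", "apple", "apple"]}

model = fit_table(original)               # 1. Estimation
as_json = json.dumps(model, indent=2)     # 3. Serialization (export)
restored = json.loads(as_json)            #    ... and import
synthetic = synthesize(restored, n_rows=5, seed=1)  # 2. Generation
```

Because each column is fitted and sampled separately, relationships between columns are not reproduced, which is what "on a column-by-column basis" implies.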


![Metasyn Pipeline](docs/source/images/pipeline_basic.png)

## Why Metasyn?
- **Privacy**: With metasyn you can share not only synthetic data, but also the model used to create it. This increases transparency and facilitates collaboration and testing on sensitive data without exposing the original data.
- **Extensible**: Metasyn is designed to be extensible and customizable, and supports plugins for custom distributions and privacy control.
- **Faker**: Metasyn integrates with the [Faker](https://faker.readthedocs.io/en/master/) plugin to generate real-sounding entries for names, emails, phone numbers, etc.
- **DataFrame-based**: Metasyn is built on top of [Polars](https://pola.rs/), and supports both Polars and [Pandas](https://pandas.pydata.org/) DataFrames as input.
- **Flexibility**: Metasyn supports a variety of distributions and data types, and can automatically select and fit a suitable distribution for each column. It also detects and supports columns with unique values or structured strings.
- **Ease of use**: Metasyn is designed to be easy to use and understand.

## Example
The following diagram shows how metasyn can generate synthetic data from an input dataset:

![Example input and output](docs/source/images/example_input_output_concise.png)

This can be reproduced using the following code:


```python
import polars as pl
from metasyn import MetaFrame

# Create a Polars DataFrame. In this case we load it from a csv file.
# It is important to specify which columns are categorical, as Polars does not infer this automatically.
# (Newer Polars releases name this argument `schema_overrides` instead of `dtypes`.)
df = pl.read_csv("example.csv", dtypes={"fruits": pl.Categorical, "cars": pl.Categorical})

# Create a MetaFrame from the DataFrame.
mf = MetaFrame.fit_dataframe(df)

# Generate a new DataFrame with 5 rows of synthetic data from the MetaFrame.
output_df = mf.synthesize(5)

# This DataFrame can be exported to csv, parquet, excel and more. E.g., to csv:
output_df.write_csv("output.csv")
```

This example covers the most basic use case. As a next step, we recommend checking out the [User Guide](https://metasyn.readthedocs.io/en/latest/usage/usage.html) for more detailed examples, or following along with our [interactive tutorial](https://metasyn.readthedocs.io/en/latest/usage/interactive_tutorials.html).

For more information on how to use Polars DataFrames, refer to the [Polars documentation](https://pola.rs/).


## Installing metasyn
Metasyn can be installed directly from PyPI using the following command in the terminal:

```sh
pip install metasyn
```

After that, metasyn is available in your Python scripts and notebooks, as well as through its [command-line interface](https://metasyn.readthedocs.io/en/latest/usage/cli.html). The CLI can also be run through a Docker container available on [Docker Hub](https://hub.docker.com/r/sodateam/metasyn).

For more information on installing metasyn, refer to the [installation guide](https://metasyn.readthedocs.io/en/latest/usage/installation.html).


## Documentation and help
- **Documentation**: For a detailed overview of metasyn, refer to the [documentation](https://metasyn.readthedocs.io/en/latest/index.html). 
- **Quick start**: Our [quick start guide](https://metasyn.readthedocs.io/en/latest/usage/quick_start.html) acts as a crash course on the functionality and workflow of metasyn.
- **Interactive tutorial**: Our [interactive tutorial](https://metasyn.readthedocs.io/en/latest/usage/interactive_tutorials.html) (Jupyter Notebook) follows and expands on the quick start guide, providing a step-by-step walkthrough and example to get you started. This tutorial can be followed without installing metasyn locally by running it in Google Colab or Binder.

## Contributing
Metasyn is an open-source project, and we welcome contributions from the community.

To contribute to the codebase, follow these steps:
1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

More information on contributing can be found in the [contributing](https://metasyn.readthedocs.io/en/latest/developer/contributing.html) section of the documentation.


## Contact
**Metasyn** is a project by the [ODISSEI Social Data Science (SoDa)](https://odissei-data.nl/nl/soda/) team.
Do you have questions, suggestions, or remarks on the technical implementation? File an issue in the issue tracker or feel free to contact [Erik-Jan van Kesteren](https://github.com/vankesteren) or [Raoul Schram](https://github.com/qubixes).

<img src="docs/source/images/logos/soda.png" alt="SoDa logo" width="250px"/> 

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "metasyn",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "",
    "keywords": "metadata,open-data,privacy,synthetic-data,tabular datasets",
    "author": "",
    "author_email": "Raoul Schram <r.d.schram@uu.nl>, Erik-Jan van Kesteren <e.vankesteren1@uu.nl>",
    "download_url": "https://files.pythonhosted.org/packages/a9/8d/c74a1b3addf79ebcc4be3462ce16dff73e7c246bb62f54ed188674340e61/metasyn-0.7.1.tar.gz",
    "platform": null,
    "description": "<p align=\"center\">\n  <img src=\"docs/source/images/logos/blue.svg\" width=\"600px\" alt=\"Metasyn logo\"></img>\n  <h3 align=\"center\">Transparent and privacy-friendly synthetic data generation</h3>\n  <p align=\"center\">\n    <span>\n        <a href=\"https://www.repostatus.org/#wip\"><img src=\"https://www.repostatus.org/badges/latest/wip.svg\" alt=\"Project Status: WIP \u2013 Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.\"/></a>\n        <a href=\"https://pypi.org/project/metasyn\"><img src=\"https://img.shields.io/pypi/pyversions/metasyn\" alt=\"metasyn on pypi\"></img></a>\n        <a href=\"https://colab.research.google.com/github/sodascience/metasyn/blob/main/examples/getting_started.ipynb\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"open getting started on colab\"></img></a>\n        <a href=\"https://metasyn.readthedocs.io/en/latest/index.html\"><img src=\"https://readthedocs.org/projects/metasyn/badge/?version=latest\" alt=\"Readthedocs\"></img></a>\n        <a href=\"https://hub.docker.com/r/sodateam/metasyn\"><img src=\"https://img.shields.io/docker/v/sodateam/metasyn?logo=docker&label=docker&color=blue\" alt=\"Docker image version\"></img></a>\n    </span>\n  </p>\n</p>\n<br/>\n\nMetasyn is a Python package that **generates synthetic data**, and allows **sharing of the data generation model**, to facilitate collaboration and testing on sensitive data without exposing the original data.\n\nIt has three main functionalities:\n\n1. **[Estimation](https://metasynth.readthedocs.io/en/latest/usage/generating_metaframes.html)**: Metasyn can analyze a dataset and create a *MetaFrame* for it. This is essentially a blueprint (or data generation model) that captures the structure and distributions of the columns without storing any entries.\n2. 
**[Generation](https://metasynth.readthedocs.io/en/latest/usage/generating_synthetic_data.html)**: From a *MetaFrame*, metasyn can generate new synthetic data that resembles the original, on a column-by-column basis. \n3. **[Serialization](https://metasynth.readthedocs.io/en/latest/usage/exporting_metaframes.html)**: Metasyn can export and import *MetaFrames* to an easy-to-read format. This allows for easy modification and sharing of the model.\n\n\n![Metasyn Pipeline](docs/source/images/pipeline_basic.png)\n\n## Why Metasyn?\n- **Privacy**: With metasyn you can share not only synthetic data, but also the model used to create it. This increases transparency and facilitates collaboration and testing on sensitive data without exposing the original data.\n- **Extensible**: Metasyn is designed to be easily extendable and customizable and supports plugins for custom distributions and privacy control.\n- **Faker**: Metasyn integrates with the [Faker](https://faker.readthedocs.io/en/master/) plugin to generate real-sounding entries for names, emails, phone numbers, etc.\n- **DataFrame-based**: Metasyn is built on top of [Polars](https://pola.rs/), and supports both Polars and [Pandas](https://pandas.pydata.org/) DataFrames as input.\n- **Flexibility**: Metasyn supports a variety of distribution and data types and can automatically select and fit to them. It also supports and detects columns with unique values or structured strings.\n- **Ease of use**: Metasyn is designed to be easy to use and understand.\n\n## Example\nThe following diagram shows how metasyn can generate synthetic data from an input dataset:\n\n![Example input and output](docs/source/images/example_input_output_concise.png)\n\nThis can be reproduced using the following code:\n\n\n```python\n# Create a Polars DataFrame. 
In this case we load it from a csv file.\n# It is important to specify which categories are categorical, as Polars does not infer this automatically.\ndf = pl.read_csv(\"example.csv\", dtypes={\"fruits\": pl.Categorical, \"cars\": pl.Categorical})\n\n# Create a MetaFrame from the DataFrame.\nmf = MetaFrame.fit_dataframe(df)\n\n# Generate a new DataFrame, with 5 rows data from the MetaFrame.\noutput_df = mf.synthesize(5)\n\n# This DataFrame can be exported to csv, parquet, excel and more. E.g., to csv:\noutput_df.write_csv(\"output.csv\")\n```\n\nThis example is the most basic use case, as a next step we recommend to check out the [User Guide](https://metasyn.readthedocs.io/en/latest/usage/usage.html) for more detailed examples or to follow along our [interactive tutorial](https://metasyn.readthedocs.io/en/latest/usage/interactive_tutorials.html). \n\nFor more information on how to use Polars DataFrames, refer to the [Polars documentation](https://pola.rs/).\n\n\n## Installing metasyn\nMetasyn can be installed directly from PyPI using the following command in the terminal:\n\n```sh\npip install metasyn\n```\n\nAfter that metasyn is available to use in your Python scripts and notebooks. It will also be accessible through its [command-line interface](https://metasyn.readthedocs.io/en/latest/usage/cli.html). It is also possible to run and access metasyn's CLI through a Docker container available on [Docker Hub](https://hub.docker.com/r/sodateam/metasyn).  \n\nFor more information on installing metasyn, refer to the [installation guide](https://metasyn.readthedocs.io/en/latest/usage/installation.html).\n\n\n## Documentation and help\n- **Documentation**: For a detailed overview of metasyn, refer to the [documentation](https://metasyn.readthedocs.io/en/latest/index.html). 
\n- **Quick-start:** Our [quick start guide](https://metasyn.readthedocs.io/en/latest/usage/quick_start.html) acts as a crash-course on the functionality and workflow of metasyn.\n- **Interactive tutorial** Our [interactive tutorial](https://metasyn.readthedocs.io/en/latest/usage/interactive_tutorials.html) (Jupyter Notebook) follows and expands on the quick start guide, providing a step-by-step walkthrough and example to get you started. This tutorial can be followed without having to install metasyn locally by running it in Google Colab or Binder.\n\n## Contributing\nMetasyn is an open-source project, and we welcome contributions from the community.\n\nTo contribute to the codebase, follow these steps:\n1. Fork the Project\n2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)\n3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)\n4. Push to the Branch (`git push origin feature/AmazingFeature`)\n5. Open a Pull Request\n\nMore information on contributing can be found in the [contributing](https://metasyn.readthedocs.io/en/latest/developer/contributing.html) section of the documentation.\n\n\n## Contact\n**Metasyn** is a project by the [ODISSEI Social Data Science (SoDa)](https://odissei-data.nl/nl/soda/) team.\nDo you have questions, suggestions, or remarks on the technical implementation? File an issue in the issue tracker or feel free to contact [Erik-Jan van Kesteren](https://github.com/vankesteren) or [Raoul Schram](https://github.com/qubixes).\n\n<img src=\"docs/source/images/logos/soda.png\" alt=\"SoDa logo\" width=\"250px\"/> \n",
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2024 SoDa  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "Package for creating synthetic datasets while preserving privacy.",
    "version": "0.7.1",
    "project_urls": {
        "GitHub": "https://github.com/sodascience/metasyn",
        "documentation": "https://metasyn.readthedocs.io/en/latest/index.html"
    },
    "split_keywords": [
        "metadata",
        "open-data",
        "privacy",
        "synthetic-data",
        "tabular datasets"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c5e8b5d0915590640c73df90854e8b9f2d113e7e2822b0e2c74b807d0e5c07fd",
                "md5": "135fc2131d36c3377c0f46a8a7bcd6ca",
                "sha256": "6fc104c933bd7abcb6085e1b57067bffcb45f544f610a25872fb1b8bf5484962"
            },
            "downloads": -1,
            "filename": "metasyn-0.7.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "135fc2131d36c3377c0f46a8a7bcd6ca",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 282066,
            "upload_time": "2024-02-28T10:24:45",
            "upload_time_iso_8601": "2024-02-28T10:24:45.439243Z",
            "url": "https://files.pythonhosted.org/packages/c5/e8/b5d0915590640c73df90854e8b9f2d113e7e2822b0e2c74b807d0e5c07fd/metasyn-0.7.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a98dc74a1b3addf79ebcc4be3462ce16dff73e7c246bb62f54ed188674340e61",
                "md5": "e04a74ba7bce61ed93fcbdce82c77dfa",
                "sha256": "e117598b4801404482d206dd99ea3698513dc34bbda84090941555c6566fc8d0"
            },
            "downloads": -1,
            "filename": "metasyn-0.7.1.tar.gz",
            "has_sig": false,
            "md5_digest": "e04a74ba7bce61ed93fcbdce82c77dfa",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 2567984,
            "upload_time": "2024-02-28T10:24:49",
            "upload_time_iso_8601": "2024-02-28T10:24:49.319459Z",
            "url": "https://files.pythonhosted.org/packages/a9/8d/c74a1b3addf79ebcc4be3462ce16dff73e7c246bb62f54ed188674340e61/metasyn-0.7.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-28 10:24:49",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "sodascience",
    "github_project": "metasyn",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "metasyn"
}
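
The `sha256` values listed under `digests` in the raw data can be used to check a downloaded file. The following is a minimal sketch using only the Python standard library. The helper names `sha256_of` and `matches_digest` are hypothetical, and the sdist is assumed to have been downloaded into the working directory under its listed filename.

```python
import hashlib
import hmac
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare the file's digest with the expected hex string."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())


if __name__ == "__main__":
    # Filename and digest copied from the sdist entry in the raw data.
    sdist = "metasyn-0.7.1.tar.gz"
    expected = "e117598b4801404482d206dd99ea3698513dc34bbda84090941555c6566fc8d0"
    if os.path.exists(sdist):
        print("digest ok" if matches_digest(sdist, expected) else "digest mismatch")
```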
        