pynonymizer


| Field | Value |
| - | - |
| Name | pynonymizer |
| Version | 2.2.0 |
| Home page | https://github.com/rwnx/pynonymizer |
| Summary | An anonymization tool for production databases |
| Upload time | 2024-04-14 16:34:12 |
| Author | Rowan Twell |
| Requires Python | >3.9.0 |
| License | MIT |
| Keywords | anonymization, gdpr, database, mysql |

# `pynonymizer` [![pynonymizer on PyPI](https://img.shields.io/pypi/v/pynonymizer)](https://pypi.org/project/pynonymizer/) [![Downloads](https://static.pepy.tech/badge/pynonymizer)](https://pepy.tech/project/pynonymizer) ![License](https://img.shields.io/pypi/l/pynonymizer)

pynonymizer is a universal tool for translating sensitive production database dumps into anonymized copies.

This can help you support GDPR/Data Protection in your organization without compromising on quality testing data.

## Why are anonymized databases important?
The primary source of information on how your database is used is _your production database_. The production dataset is usually significantly larger than any development copy
and contains a wider range of data.

From time to time, it is prudent to run a new feature or stage a test against this dataset, rather
than one that is artificially created by developers or by testing frameworks. Anonymized databases allow us to use the structures present in production while stripping them of any personally identifiable data that would
constitute a breach of privacy for end-users and, subsequently, a breach of GDPR.

With anonymized databases, copies can be processed regularly and distributed easily, leaving your developers and testers with a rich source of information on the volume and general makeup of the system in production. They can
be used to run better staging environments, integration tests, and even simulate database migrations.

Below is an excerpt from an anonymized database:

| id | salutation | firstname | surname | email | dob |
| - | - | - | - | - | - |
| 1 | Dr. | Bernard | Gough | `tnelson@powell.com` | 2000-07-03 |
| 2 | Mr. | Molly | Bennett | `clarkeharriet@price-fry.com` | 2014-05-19 |
| 3 | Mrs. | Chelsea | Reid | `adamsamber@clayton.com` | 1974-09-08 |
| 4 | Dr. | Grace | Armstrong | `tracy36@wilson-matthews.com` | 1963-12-15 |
| 5 | Dr. | Stanley | James | `christine15@stewart.net` | 1976-09-16 |
| 6 | Dr. | Mark | Walsh | `dgardner@ward.biz` | 2004-08-28 |
| 7 | Mrs. | Josephine | Chambers | `hperry@allen.com` | 1916-04-04 |
| 8 | Dr. | Stephen | Thomas | `thompsonheather@smith-stevens.com` | 1995-04-17 |
| 9 | Ms. | Damian | Thompson | `yjones@cox.biz` | 2016-10-02 |
| 10 | Miss | Geraldine | Harris | `porteralice@francis-patel.com` | 1910-09-28 |
| 11 | Ms. | Gemma | Jones | `mandylewis@patel-thomas.net` | 1990-06-03 |
| 12 | Dr. | Glenn | Carr | `garnervalerie@farrell-parsons.biz` | 1998-04-19 |


## How does it work?
`pynonymizer` replaces personally identifiable data in your database with **realistic** pseudorandom data, from the `Faker` library or from other functions.
A wide variety of data types is available to suit the column in question, for example:

* `unique_email`
* `company`
* `file_path`
* `[...]`

Pynonymizer's main data replacement mechanism, `fake_update`, is a random selection from a small pool of data (`--seed-rows` controls the available Faker data). This approach was chosen for compatibility and speed of operation, but it does not guarantee uniqueness.
This may or may not suit your exact use-case. For a full list of data generation strategies, see the docs on [strategyfiles](https://github.com/rwnx/pynonymizer/blob/main/doc/strategyfiles.md).
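
To make the uniqueness caveat concrete, here is a minimal sketch of pool-based replacement using only the Python standard library. It is not pynonymizer's actual implementation: `build_seed_pool`, `fake_email`, and `anonymize_column` are hypothetical names, and `fake_email` is a stand-in for a Faker provider. The `seed_rows` argument plays the role that `--seed-rows` is described as playing above.

```python
import random


def fake_email(rng):
    # Hypothetical stand-in for a Faker provider (assumption, not library code).
    user = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(8))
    return f"{user}@example.com"


def build_seed_pool(generator, seed_rows, rng):
    """Generate a fixed pool of fake values, one per seed row."""
    return [generator(rng) for _ in range(seed_rows)]


def anonymize_column(rows, column, pool, rng):
    """Replace `column` in every row with a random pick from the pool."""
    return [{**row, column: rng.choice(pool)} for row in rows]


rng = random.Random(0)
pool = build_seed_pool(fake_email, seed_rows=3, rng=rng)
rows = [{"id": i, "email": f"real{i}@corp.test"} for i in range(10)]
anonymized = anonymize_column(rows, "email", pool, rng)

# Ten rows drawn from a pool of three values must contain duplicates,
# which is why this style of replacement cannot guarantee uniqueness.
distinct = {row["email"] for row in anonymized}
print(len(rows), "rows ->", len(distinct), "distinct emails")
```

A larger pool reduces collisions but never rules them out, so columns with a uniqueness constraint need a strategy that is built for it.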

### Examples

You can see strategyfile examples for existing databases, such as the WordPress or AdventureWorks sample databases, in the [examples folder](https://github.com/rwnx/pynonymizer/blob/main/examples).

### Process outline

1. Restore from dumpfile to temporary database.
1. Anonymize temporary database with strategy.
1. Dump resulting data to file.
1. Drop temporary database.

If this workflow doesn't work for you, see [process control](https://github.com/rwnx/pynonymizer/blob/main/doc/process-control.md) to see if it can be adjusted to suit your needs.
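
The four-step outline above can be illustrated with a small, self-contained sketch. It uses Python's built-in `sqlite3` module purely for demonstration (pynonymizer itself targets mysql, mssql, and postgres), and the table, the placeholder values, and the `run_pipeline` function are all assumptions made for this example.

```python
import sqlite3

# A tiny stand-in for a production dump file (made-up data).
DUMP_SQL = """
CREATE TABLE users (id INTEGER PRIMARY KEY, firstname TEXT, email TEXT);
INSERT INTO users VALUES (1, 'Alice', 'alice@real.test');
INSERT INTO users VALUES (2, 'Bob', 'bob@real.test');
"""


def run_pipeline(dump_sql):
    # 1. Restore the dump into a temporary (in-memory) database.
    conn = sqlite3.connect(":memory:")
    conn.executescript(dump_sql)

    # 2. Anonymize: overwrite sensitive columns with placeholder values.
    conn.execute(
        "UPDATE users SET firstname = 'user' || id, "
        "email = 'user' || id || '@example.com'"
    )
    conn.commit()

    # 3. Dump the resulting data back to SQL text.
    anonymized_dump = "\n".join(conn.iterdump())

    # 4. Drop the temporary database (closing discards an in-memory DB).
    conn.close()
    return anonymized_dump


result = run_pipeline(DUMP_SQL)
print(result)
```

Working on a restored copy rather than the live database means the source data is never modified, and the temporary database can be discarded once the anonymized dump has been written.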

### mysql
* `mysql`/`mysqldump` must be in `$PATH`
* Local or remote mysql >= 5.5
* Supported Inputs:
  * Plain SQL over stdin
  * Plain SQL file `.sql`
  * GZip-compressed SQL file `.gz` 
* Supported Outputs:
  * Plain SQL over stdout
  * Plain SQL file `.sql`
  * GZip-compressed SQL file `.gz` 
  * LZMA-compressed SQL file `.xz`

### mssql
* Requires extra dependencies: install package `pynonymizer[mssql]`
* MSSQL >= 2008
* For `RESTORE_DB`/`DUMP_DB` operations, the database server *must* be running
  locally with pynonymizer. This is because MSSQL `RESTORE` and `BACKUP` instructions
  are received by the database, so piping a local backup to a remote server is not possible.
* The anonymize process can be performed on remote servers, but you are responsible for creating/managing the target database.
* Supported Inputs:
  * Local backup file
* Supported Outputs:
  * Local backup file

### postgres
* `psql`/`pg_dump` must be in `$PATH`
* Local or remote postgres server
* Supported Inputs:
  * Plain SQL over stdin
  * Plain SQL file `.sql`
  * GZip-compressed SQL file `.gz` 
* Supported Outputs:
  * Plain SQL over stdout
  * Plain SQL file `.sql`
  * GZip-compressed SQL file `.gz` 
  * LZMA-compressed SQL file `.xz`
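
The file formats listed for mysql and postgres output (`.sql`, `.gz`, `.xz`) can all be written with the Python standard library. The sketch below shows one way to pick a writer from the file extension; `open_output` and `read_back` are hypothetical helpers written for this example and are not part of pynonymizer's API.

```python
import gzip
import lzma
import os
import tempfile


def open_output(path):
    """Open `path` for text writing, choosing compression by extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "wt", encoding="utf-8")
    if path.endswith(".xz"):
        return lzma.open(path, "wt", encoding="utf-8")
    return open(path, "w", encoding="utf-8")


def read_back(path):
    """Read a file written by open_output, using the matching decompressor."""
    if path.endswith(".gz"):
        opener = gzip.open
    elif path.endswith(".xz"):
        opener = lzma.open
    else:
        opener = open
    with opener(path, "rt", encoding="utf-8") as fh:
        return fh.read()


sql = "INSERT INTO users VALUES (1, 'user1');\n"
roundtrip = {}
with tempfile.TemporaryDirectory() as tmp:
    for name in ("dump.sql", "dump.sql.gz", "dump.sql.xz"):
        path = os.path.join(tmp, name)
        with open_output(path) as fh:
            fh.write(sql)
        roundtrip[name] = read_back(path)

print(all(text == sql for text in roundtrip.values()))
```

Dispatching on the extension keeps the calling code identical for plain and compressed dumps.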

# Getting Started

## Usage
### CLI
1. Write a [strategyfile](https://github.com/rwnx/pynonymizer/blob/main/doc/strategyfiles.md) for your database
1. Check the help for a description of the options: `pynonymizer --help`
1. Start Anonymizing!

### Package
Pynonymizer can also be invoked programmatically from other Python code. See the module entrypoint [pynonymizer](pynonymizer/__init__.py) or [pynonymizer/pynonymize.py](pynonymizer/pynonymize.py).

```python
import pynonymizer

pynonymizer.run(
    input_path="./backup.sql",
    strategyfile_path="./strategy.yml",
    # [...] further keyword arguments are elided in the original README
)
```

            
