- Name: django-devdata
- Version: 1.1.0
- Home page: https://github.com/danpalmer/django-devdata
- Summary: Django tooling for creating development databases seeded with anonymised production data.
- Upload time: 2024-07-14 07:27:15
- Author: Dan Palmer
- Requires Python: >=3.8
- License: MIT
- Keywords: django, development, databases

# django-devdata

`django-devdata` provides a convenient workflow for creating development
databases seeded with anonymised production data. Have a development database
that contains useful data, and is fast to create and keep up to date.

As of 1.x, `django-devdata` is ready for use in real-world projects. See
releases for more details.

## Elevator pitch

```python
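# blog.models
from django.contrib.auth.models import User
from django.db import models


class Post(models.Model):
    content = models.TextField()
    published = models.DateTimeField()


class Comment(models.Model):
    # on_delete is a required argument for ForeignKey in Django 2.0 and later
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    post = models.ForeignKey(Post, on_delete=models.CASCADE)
    text = models.TextField()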

# settings.py
DEVDATA_STRATEGIES = {
    'auth.User': [
        # We want all internal users
        InternalUsersStrategy(name='internal_users'),
        # Get some random other users, we don't need everyone
        RandomSampleQuerySetStrategy(name='random_users', count=10),
    ],
    'blog.Post': [
        # Only the latest blog posts necessary for testing...
        LatestSampleQuerySetStrategy(name='latest_posts', count=3, order='-published'),
        # ...plus that one weird edge case
        ExactQuerySetStrategy(name='edge_case', pks=(42,)),
    ],
    'blog.Comment': [
        # Get all the comments – devdata will automatically restrict to only
        # those that maintain referential integrity, i.e. comments from users
        # not selected, or on posts not selected, will be skipped.
        QuerySetStrategy(name='all'),
    ],
}
```

```shell
(prod)$ python manage.py devdata_export devdata
(prod)$ tar -czf devdata.tar.gz devdata/
(local)$ scp prod:~/devdata.tar.gz devdata.tar.gz
(local)$ tar -xzf devdata.tar.gz
(local)$ python manage.py devdata_import devdata/
```

#### Problem

In the same way that development environments should be close in configuration
to production environments, it's important that the data in the databases we use
for development is a realistic representation of the data in production.

We could use a dump of a production database, but there are several problems
with this:

1. It's bad for user privacy and a security risk. It may not be allowed in some
   organisations.
2. Production databases can be too big, making them impractical or unusable for
   development.
3. Test data is limited to that available in production.
4. Preserving referential integrity for a sample of data is hard.
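
To make the last point concrete, here is a minimal sketch in plain Python (no Django involved) of what goes wrong when parent tables are sampled, and of the kind of filtering needed to fix it. The rows, column names, and the `restrict_to_sample` helper are all made up for illustration and are not part of `django-devdata`.

```python
# Toy in-memory "tables"; the rows and column names are invented for this sketch.
comments = [
    {"id": 100, "user_id": 1, "post_id": 10},
    {"id": 101, "user_id": 2, "post_id": 10},
    {"id": 102, "user_id": 3, "post_id": 11},
    {"id": 103, "user_id": 1, "post_id": 11},
]


def restrict_to_sample(rows, foreign_keys):
    """Keep only rows whose foreign keys all point at sampled parent rows.

    `foreign_keys` maps a column name to the set of parent ids that were kept.
    """
    return [
        row
        for row in rows
        if all(row[column] in kept for column, kept in foreign_keys.items())
    ]


# Suppose only users 1 and 3, and only post 10, made it into the sample.
sampled_user_ids = {1, 3}
sampled_post_ids = {10}

# Comments pointing at an unsampled user or post would dangle, so drop them.
kept_comments = restrict_to_sample(
    comments, {"user_id": sampled_user_ids, "post_id": sampled_post_ids}
)
```

Every additional foreign key adds another constraint of this kind, which is why doing this by hand across a large schema becomes tedious.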

Another option is to use factories or fake data to generate the entire
development database. This is mostly desirable, but...

- It can be a burden to maintain factories once there are hundreds or thousands
  of them.
- It can be hard to retroactively add these to a Django site of a significant
  size.

#### Solution

`django-devdata` defines a three-step workflow:

1. _Exporting_ data, with customisable export strategies per model.
2. _Anonymising_ data, with customisable anonymisation per field/model.
3. _Importing_ data, with customisable importing per model.

`django-devdata` ships with built-in support for:

- Exporting full tables
- Exporting subsets (random, latest, specified primary keys)
- Anonymising data with [`faker`](https://github.com/joke2k/faker/)
- Importing exported data
- Importing data from [`factory-boy`](https://github.com/FactoryBoy/factory_boy)
  factories

In addition to this, the structure provided by `django-devdata` can be extended
to support extraction from other data sources, to import/export Django fixtures,
or to work with other factory libraries.

Exporting, anonymising, and importing are all configurable, and
`django-devdata`'s base classes help you do this without much work.

## Workflow

#### Exporting

``` console
$ python manage.py devdata_export [dest] [app_label.ModelName ...]
```

This step allows a sync strategy to persist some data that will be used to
create a new development database. For example, the `QuerySetStrategy` can
export data from a table to a filesystem for later import.

This can be used for:

- Exporting a manually created database for other developers to use.
- Exporting realistic data from a production database.
- A cron job to maintain a development dataset hosted on cloud storage.

This step is optional (the built-in factory strategy doesn't do this).
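
For the cron job use case, a small wrapper script might look something like the sketch below. It only assumes the `devdata_export [dest] [app_label.ModelName ...]` command line shown above; the helper names `build_export_command` and `archive_export` are hypothetical, and uploading the archive to cloud storage is deliberately left out.

```python
import sys
import tarfile
from pathlib import Path


def build_export_command(dest, model_labels=()):
    """Build the argv for `manage.py devdata_export [dest] [app_label.ModelName ...]`."""
    return [sys.executable, "manage.py", "devdata_export", str(dest), *model_labels]


def archive_export(export_dir, archive_path):
    """Pack the exported directory into a gzipped tarball for sharing."""
    export_dir = Path(export_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the export directory.
        tar.add(export_dir, arcname=export_dir.name)
    return Path(archive_path)
```

From the project directory, the job would run `subprocess.run(build_export_command("devdata"), check=True)` and then call `archive_export("devdata", "devdata.tar.gz")` before handing the archive to whatever upload tool is in use.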

#### Anonymisation

This step is critical when using `django-devdata` to export from production
sources. It's not a distinct step, but rather an opt-out part of the export
step.

#### Importing

``` console
$ python manage.py devdata_import [src]
```

This step is responsible for preparing the database and filling it. If any
exporting strategies are in use, they must have been run first, or their outputs
must have been downloaded if they are shared or hosted somewhere.

Factory-based strategies generate data during this process.

##### Reset modes

``` console
$ python manage.py devdata_import --reset-mode=$MODE [src]
```

By default any existing database will be removed, ensuring that a fresh database
is created for the imported data. This is expected to be the most common case
for local development, but may not always be suitable.

The following modes are offered:

- `drop-database`: the default; drops the database & re-creates it.
- `drop-tables`: drops the tables the Django codebase is aware of, useful if the
  Django database user doesn't have access to drop the entire database.
- `none`: no attempt to reset the database, useful if the user has already
  manually configured the database or otherwise wants more control over setup.

See the docstrings in [`src/devdata/reset_modes.py`](src/devdata/reset_modes.py)
for more details.
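
If the import is driven from a script, it can help to validate the mode before invoking the command. The following is a sketch of such a wrapper, assuming only the `--reset-mode` flag and the three mode names listed above; `build_import_command` is a hypothetical helper, not something the package provides.

```python
import sys

# The three modes documented above; "drop-database" is the documented default.
RESET_MODES = ("drop-database", "drop-tables", "none")


def build_import_command(src, reset_mode="drop-database"):
    """Build the argv for `manage.py devdata_import --reset-mode=MODE [src]`."""
    if reset_mode not in RESET_MODES:
        raise ValueError(
            f"Unknown reset mode {reset_mode!r}; expected one of {RESET_MODES}"
        )
    return [
        sys.executable,
        "manage.py",
        "devdata_import",
        f"--reset-mode={reset_mode}",
        str(src),
    ]
```

For example, `build_import_command("devdata/", reset_mode="drop-tables")` suits a database user that is not allowed to drop the whole database.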

## Customising

#### Strategies

The `django-devdata` strategies define how an import, and optionally an export,
happens. Each model is configured with a list of strategies to use.

Classes are provided to inherit from for customising this behaviour:

- `Strategy` – the base class of all strategies.
- `Exportable` – a mixin that opts this strategy in to the export step.
- `QuerySetStrategy` – the base of all strategies that export production data
  to a filesystem. Handles referential integrity, serialisation, and
  anonymisation of the data pre-export.
- `FactoryStrategy` – the base of all strategies that create data based on
  `factory-boy` factories.

The API necessary for classes to implement is small, and there are customisation
points provided for common patterns.

In our experience most models can be exported with just the un-customised
`QuerySetStrategy`, some will need to use other pre-provided strategies, and
a small number will need custom exporters based on the classes provided.
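
To illustrate the idea of configuring several strategies for one model, here is a toy sketch in plain Python. The selector functions below are stand-ins that mimic the built-in subset kinds (random, latest, specified primary keys); they are not the library's classes, and merging the selections by primary key is an assumption made for this illustration.

```python
import random


def exact(pks):
    """Select the rows with the given primary keys."""
    wanted = set(pks)
    return lambda rows: [row for row in rows if row["pk"] in wanted]


def latest(count, key):
    """Select the `count` rows with the highest value of `key`."""
    return lambda rows: sorted(rows, key=lambda row: row[key], reverse=True)[:count]


def random_sample(count, seed=None):
    """Select up to `count` rows at random."""
    rng = random.Random(seed)
    return lambda rows: rng.sample(rows, min(count, len(rows)))


def select(rows, strategies):
    """Apply each selector and merge the results, de-duplicating by primary key."""
    chosen = {}
    for strategy in strategies:
        for row in strategy(rows):
            chosen[row["pk"]] = row
    return sorted(chosen.values(), key=lambda row: row["pk"])


# Fifty fake posts whose "published" value simply increases with the pk.
posts = [{"pk": pk, "published": pk} for pk in range(1, 51)]
selected = select(posts, [latest(3, "published"), exact([42])])
```

Here `selected` holds the three most recent posts plus post 42, mirroring the `blog.Post` configuration in the elevator pitch.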

##### Extra Strategies

Sometimes it is useful to export and import data that lives in the database but
outside the tables that Django manages via models.

The "extra" strategies provide hooks which support transferring these data.

Classes are provided to inherit from for customising this behaviour:

- `ExtraExport` – defines how to get data out of the database.
- `ExtraImport` – defines how to get data into a database.

The API necessary for classes to implement is small and reminiscent of those for
`Strategy` and `Exportable`.

The following "extra" strategies are provided out of the box:

- `PostgresSequences` – transfers data about Postgres sequences which are not
  attached to tables.

#### Anonymisers

Anonymisers are configured either by field name, applying to every field with
that name, or by model and field name, applying to just that model.

Each anonymiser is a function that takes a number of kwargs with useful context
and returns a new value, compatible with the Django JSON encoder/decoder.

The signature for an anonymiser is:

```python
def anonymise(*, obj: Model, field: str, pii_value: Any, fake: Faker) -> Any:
    ...
```
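
As an illustration, a custom anonymiser following this signature could look like the sketch below. The name `anonymise_email` is made up for the example, and the type hints are loosened to `Any` so the snippet stands alone; `fake` is expected to be a `faker.Faker` instance, of which only the `email()` provider is used.

```python
from typing import Any


def anonymise_email(*, obj: Any, field: str, pii_value: Any, fake: Any) -> Any:
    """Replace an email address with a generated one.

    Uses the keyword-only signature shown above. `obj` and `field` are
    accepted for compatibility but not needed by this anonymiser.
    """
    if pii_value is None:
        # Nothing to anonymise; keep missing values missing.
        return None
    return fake.email()
```

Going by the settings documented below, it would be registered with something like `DEVDATA_FIELD_ANONYMISERS = {'email': anonymise_email}`.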

There are several anonymisers provided to use or to build off:

- `faker_anonymise` – Use `faker` to anonymise this field with the provided
  generator, e.g. `faker_anonymise('pyint', min_value=15, max_value=85)`.
- `const` – anonymise to a constant value, e.g. `const('ch_XXXXXXXX')`.
- `random_foreign_key` – anonymise to a random foreign key.

`django-devdata`'s anonymisation is not intended to be perfect, but rather to be
a reasonable default that does a good enough job of creating useful data.
_Structure_ in data can be used to de-anonymise users in some cases with
advanced techniques, and `django-devdata` does not attempt to solve for this
case, as most attackers, users, and legislators are more concerned about
obviously personally identifiable information such as names and email addresses.
This anonymisation is no replacement for encryption at rest with tools like
FileVault or BitLocker on development machines.

An example of this pragmatism in anonymisation is the `preserve_nulls` argument
taken by some built-in anonymisers. This goes against _true_ anonymisation, but
the absence of data is typically not of much use to attackers (or concern for
users), if the actual data is anonymised, while this can be of huge benefit to
developers in maintaining data consistency.
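
The trade-off can be sketched with a small anonymiser factory. This is not the library's `const`; `const_like` is a hypothetical stand-in written for this example, and which built-in anonymisers actually accept `preserve_nulls` is not specified here.

```python
def const_like(value, preserve_nulls=False):
    """Return an anonymiser that replaces every value with `value`.

    With preserve_nulls=True, a missing value stays missing, so columns
    keep the same pattern of NULLs as in the source data.
    """

    def anonymise(*, obj, field, pii_value, fake):
        if preserve_nulls and pii_value is None:
            return None
        return value

    return anonymise


anonymise_token = const_like("ch_XXXXXXXX", preserve_nulls=True)
```

Code paths that branch on whether a value is present then behave the same way against development data as they do in production.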

#### Settings

`django-devdata` makes heavy use of Django settings, both for defining how it
should act for your site and for configuring how you'll use your workflow.

```python
"""
django-devdata default settings, with documentation on usage.
"""

# Required
# A mapping of app model label to list of strategies to be used.
DEVDATA_STRATEGIES = ...
# {'auth.User': [QuerySetStrategy(name='all')], 'sessions.Session': []}

# Optional
# A list of strategies for transferring data about a database which are not
# captured in the tables themselves.
DEVDATA_EXTRA_STRATEGIES = ...
# [
#   ('devdata.extras.PostgresSequences', {}),
# ]

# Optional
# A mapping of field name to an anonymiser to be used for all fields with that
# name.
DEVDATA_FIELD_ANONYMISERS = {}
# {'first_name': faker_anonymise('first_name'), 'ip': const('127.0.0.1')}

# Optional
# A mapping of app model label to a mapping of fields and anonymisers to be
# scoped to just that model.
DEVDATA_MODEL_ANONYMISERS = {}
# {'auth.User': {'first_name': faker_anonymise('first_name')}}

# Optional
# List of locales to be used for Faker in generating anonymised data.
DEVDATA_FAKER_LOCALES = None
# ['en_GB', 'en_AU']

# Optional
# In many codebases, there will only be a few models that will do most of the
# work to restrict the total export size – only taking a few users, or a few
# comments – for many models a default behaviour of taking everything
# following the restrictions from other models would be sufficient. This setting
# allows for specifying a default strategy.
# Important:
# - When using this, no errors will be raised if a model is missed from the list
#   of strategies.
# - This strategy is not added to all models, and it does not override an empty
#   list of strategies. It is only used when a model is not defined in the
#   strategy config at all.
DEVDATA_DEFAULT_STRATEGY = None
```
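
The lookup rules described for `DEVDATA_DEFAULT_STRATEGY` can be summarised in a few lines. This is a sketch of the documented behaviour, not the library's code: the function name `strategies_for` and the choice of `LookupError` are assumptions, and plain strings stand in for strategy instances.

```python
def strategies_for(model_label, strategies, default_strategy=None):
    """Return the list of strategies to use for one model label."""
    if model_label in strategies:
        # An explicit entry always wins, including an explicit empty list.
        return strategies[model_label]
    if default_strategy is not None:
        # Only models missing from the config fall back to the default.
        return [default_strategy]
    raise LookupError(f"No strategies configured for {model_label}")


config = {"auth.User": ["all_users"], "sessions.Session": []}
```

With this config, `sessions.Session` resolves to an empty list even when a default is set, while an unlisted model such as `blog.Post` picks up the default.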

Strategies can be defined either as a strategy instance or as a tuple of
dotted path and kwargs. For example, the following are equivalent:

```python
DEVDATA_STRATEGIES = {
    'auth.User': [
        QuerySetStrategy(name='all_users'),
    ],
}

DEVDATA_STRATEGIES = {
    'auth.User': [
        ('devdata.strategies.QuerySetStrategy', {'name': 'all_users'}),
    ],
}
```

This alternate configuration format is provided for projects that make extensive
use of custom strategies. Strategies often import models, but because of the
Django startup process, models can't be imported until the settings have been
imported.
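
The deferred form works because the class is looked up only when it is needed, after settings have loaded. A minimal sketch of that kind of lazy resolution follows, using the standard library's `importlib`; `resolve_strategy` is a hypothetical helper, and the library's own loading code may differ.

```python
from importlib import import_module


def resolve_strategy(spec):
    """Return a strategy instance from an instance or a (dotted_path, kwargs) tuple."""
    if isinstance(spec, tuple):
        dotted_path, kwargs = spec
        module_path, _, class_name = dotted_path.rpartition(".")
        # The module is imported here, at resolution time, not at settings import time.
        cls = getattr(import_module(module_path), class_name)
        return cls(**kwargs)
    # Already an instance: use it as-is.
    return spec
```

Because the settings module holds only a string and a dict, it never has to import the strategy module, or the models that module depends on.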

            
