datapatch


Name: datapatch
Version: 1.2.4
Summary: A library for defining rule-based overrides on messy data.
upload_time: 2025-07-27 19:40:10
requires_python: >=3.9
license: MIT
requirements: No requirements were recorded.
# datapatch

A Python library for defining rule-based overrides on messy data. Imagine, for example,
trying to import a dataset in which each row is associated with a country whose name
was entered by a human. You might find country names like `Northkorea` or `Greet Britain`
that you want to normalise. `datapatch` provides a mechanism for building a flexible lookup
table (usually stored as a YAML file) to catch and repair these data issues.

## Installation

You can install `datapatch` from the Python Package Index (PyPI):

```bash
pip install datapatch
```

## Example

Given a YAML file like this:

```yaml
countries:
  normalize: true
  lowercase: true
  asciify: true
  options:
    - match: Frankreich
      value: France
    - match:
        - Northkorea
        - Nordkorea
        - Northern Korea
        - NKorea
        - DPRK
      value: North Korea
    - contains: Britain
      value: Great Britain
```

The file can be used to apply the data patches against raw input:

```python
from datapatch import read_lookups

lookups = read_lookups("countries.yml")
countries = lookups.get("countries")

# Stand-in for your own data source; any iterable of dict-like rows works.
rows = [{"Country": "Northkorea"}, {"Country": "Greet Britain"}, {"Country": "Chile"}]

# Apply the patch, or fall back to the original string if no option matches:
for row in rows:
    raw = row.get("Country")
    row["Country"] = countries.get_value(raw, default=raw)
```
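
To make the difference between `match` and `contains` rules concrete, here is a minimal standard-library sketch of the idea. It is not `datapatch`'s implementation: the `normalise` and `lookup` helpers and the `OPTIONS` list are hypothetical, and the cleaning step only approximates what `normalize`, `lowercase` and `asciify` might do (it strips Latin accents but does not transliterate other alphabets).

```python
import re
import unicodedata


def normalise(text: str, lowercase: bool = True, asciify: bool = True) -> str:
    """Roughly clean a string: strip accents, punctuation and extra spaces."""
    if asciify:
        # Decompose accented characters and drop the combining marks.
        # This handles Latin diacritics only, not cross-alphabet transliteration.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
    if lowercase:
        text = text.lower()
    # Replace everything except word characters and whitespace with a space,
    # then collapse runs of whitespace into a single space.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical in-memory version of the `options` list from the YAML above.
OPTIONS = [
    {"match": ["Frankreich"], "value": "France"},
    {
        "match": ["Northkorea", "Nordkorea", "Northern Korea", "NKorea", "DPRK"],
        "value": "North Korea",
    },
    {"contains": ["Britain"], "value": "Great Britain"},
]


def lookup(raw: str, default=None):
    """Return the value of the first option whose rule fits the cleaned input."""
    cleaned = normalise(raw)
    for option in OPTIONS:
        # `match`: the whole cleaned input must equal one of the candidates.
        if cleaned in {normalise(m) for m in option.get("match", [])}:
            return option["value"]
        # `contains`: a candidate only needs to appear somewhere in the input.
        if any(normalise(c) in cleaned for c in option.get("contains", [])):
            return option["value"]
    return default


print(lookup("  NORTHKOREA "))
print(lookup("Greet Britain"))
print(lookup("Chile", default="Chile"))
```

Because both the rule and the input pass through the same cleaning step, `Greet Britain` is caught by the `contains: Britain` rule even though it matches no option exactly.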

### Extended options

There's a host of options available to configure the application of the data
patches:

```yaml
countries:
  # If you mark a lookup as required, a value that matches no option will
  # raise a `datapatch.exc:LookupException`.
  required: true
  # Normalisation removes many special characters and collapses repeated spaces.
  normalize: false
  # By default, normalize also performs transliteration across alphabets (Путин -> Putin).
  # Set asciify to false if you want to keep non-ASCII alphabets as they are.
  asciify: false
  options:
    - match: Francois
      value: France
  # This is a shorthand for defining options that have just one `match` and
  # one `value` defined:
  map:
    Luxemborg: Luxembourg
    Lux: Luxembourg
```
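
The `map` shorthand and the `required` flag can be pictured with a small sketch. This is an illustration under stated assumptions, not the library's code: `expand_map`, `get_value` and `NoMatchError` are hypothetical names, with `NoMatchError` standing in for `LookupException`, and matching is reduced to exact string equality.

```python
class NoMatchError(Exception):
    """Hypothetical stand-in for the library's LookupException."""


def expand_map(config: dict) -> list:
    """Merge the `map` shorthand into the explicit `options` list."""
    options = list(config.get("options", []))
    for match, value in config.get("map", {}).items():
        # Each map entry is equivalent to an option with one match and one value.
        options.append({"match": match, "value": value})
    return options


def get_value(config: dict, raw: str, default=None):
    """Exact-match lookup that honours a `required` flag."""
    for option in expand_map(config):
        if option["match"] == raw:
            return option["value"]
    if config.get("required", False):
        # A required lookup refuses to pass unmatched values through silently.
        raise NoMatchError(f"No option matches {raw!r}")
    return default


config = {
    "required": True,
    "options": [{"match": "Francois", "value": "France"}],
    "map": {"Luxemborg": "Luxembourg", "Lux": "Luxembourg"},
}

print(get_value(config, "Lux"))
try:
    get_value(config, "Atlantis")
except NoMatchError as exc:
    print("unmatched:", exc)
```

Marking a lookup as required is useful when an unmatched value should stop the import so that the lookup table can be extended, rather than letting unknown values slip into the output.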

### Result objects

An option can also carry additional attributes, which you can then read from the result:

```yaml
countries:
  options:
    - match: Frankreich
      # These can be arbitrary attributes:
      label: France
      code: FR
```

A match is returned as a result object that exposes these attributes:

```python
from datapatch import read_lookups

lookups = read_lookups("countries.yml")
countries = lookups.get("countries")

# `match` returns a result object built from the matching option:
result = countries.match("Frankreich")
print(result.label, result.code)
# Attributes that the option does not define come back as None:
assert result.capital is None, result.capital
```
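
The behaviour shown by the `assert` above, where an undefined attribute reads as `None` instead of raising an error, can be reproduced with a few lines of plain Python. The `Result` class below is a hypothetical sketch of that pattern and makes no claim about how `datapatch` builds its result objects.

```python
class Result:
    """Hypothetical result wrapper: option keys become attributes."""

    def __init__(self, attributes: dict):
        self._attributes = dict(attributes)

    def __getattr__(self, name: str):
        # Only called when normal attribute lookup fails, so unknown
        # public names fall back to None instead of raising AttributeError.
        if name.startswith("_"):
            raise AttributeError(name)
        return self._attributes.get(name)


result = Result({"label": "France", "code": "FR"})
print(result.label, result.code)
print(result.capital)
```

The trade-off of this design is that a misspelled attribute name also yields `None`, so callers should check values they depend on.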

## License

`datapatch` is licensed under the terms of the MIT license, which is included as
`LICENSE`.
            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "datapatch",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": null,
    "author": null,
    "author_email": "OpenSanctions <info@opensanctions.org>",
    "download_url": null,
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A library for defining rule-based overrides on messy data.",
    "version": "1.2.4",
    "project_urls": {
        "Homepage": "https://github.com/opensanctions/datapatch"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d8dd1df187bea2546fa7c3de04d34a366ad5a8095febb61468baddf077c8fd73",
                "md5": "597f471b838cb7ccf52b21862e740eca",
                "sha256": "b6c2dae33a6635d6526b122bcd2229f098ef9f833bd60a93f644a10d82dde699"
            },
            "downloads": -1,
            "filename": "datapatch-1.2.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "597f471b838cb7ccf52b21862e740eca",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 8270,
            "upload_time": "2025-07-27T19:40:10",
            "upload_time_iso_8601": "2025-07-27T19:40:10.616619Z",
            "url": "https://files.pythonhosted.org/packages/d8/dd/1df187bea2546fa7c3de04d34a366ad5a8095febb61468baddf077c8fd73/datapatch-1.2.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-27 19:40:10",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "opensanctions",
    "github_project": "datapatch",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "datapatch"
}
        