datapatch


| Field | Value |
| --- | --- |
| Name | datapatch |
| Version | 1.2.0 |
| Home page | https://github.com/opensanctions/datapatch |
| Upload time | 2024-01-12 07:00:55 |
| Docs URL | None |
| Author | Friedrich Lindenberg |
| License | MIT |
| Requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| Coveralls test coverage | No coveralls. |

# datapatch

A Python library for defining rule-based overrides on messy data. Imagine, for example,
trying to import a dataset in which each row is associated with a country, and the
country names have been entered by humans. You might find names like `Northkorea` or
`Greet Britain` that you want to normalise. `datapatch` provides a mechanism to build a
flexible lookup table (usually stored as a YAML file) to catch and repair these data issues.

## Installation

You can install `datapatch` from the Python Package Index (PyPI):

```bash
pip install datapatch
```

## Example

Given a YAML file like this:

```yaml
countries:
  normalize: true
  lowercase: true
  asciify: true
  options:
    - match: Frankreich
      value: France
    - match:
        - Northkorea
        - Nordkorea
        - Northern Korea
        - NKorea
        - DPRK
      value: North Korea
    - contains: Britain
      value: Great Britain
```

The file can be used to apply the data patches against raw input:

```python
from datapatch import read_lookups

lookups = read_lookups("countries.yml")
countries = lookups.get("countries")

# `iter_data()` stands in for whatever yields your rows as dicts.
# This will apply the patch, or fall back to the original string if none exists:
for row in iter_data():
    raw = row.get("Country")
    row["Country"] = countries.get_value(raw, default=raw)
```
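
To make the semantics of the YAML above more concrete, here is a minimal pure-Python sketch of the idea. It is not `datapatch`'s actual implementation: the `normalise` and `lookup` helpers and the `OPTIONS` list are hypothetical names introduced for illustration. The sketch assumes that `match` compares the whole normalised string, that `contains` tests for a normalised substring, and it only approximates `lowercase` and whitespace handling (no transliteration).

```python
from typing import Optional

# Hypothetical in-memory equivalent of the `countries` options above.
OPTIONS = [
    {"match": ["Frankreich"], "value": "France"},
    {
        "match": ["Northkorea", "Nordkorea", "Northern Korea", "NKorea", "DPRK"],
        "value": "North Korea",
    },
    {"contains": ["Britain"], "value": "Great Britain"},
]


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace (a rough stand-in for the real options)."""
    return " ".join(text.lower().split())


def lookup(raw: Optional[str], default: Optional[str] = None) -> Optional[str]:
    """Return the patched value for `raw`, or `default` if no option applies."""
    if raw is None:
        return default
    norm = normalise(raw)
    for option in OPTIONS:
        # `match`: the whole normalised string must equal one of the candidates.
        if any(norm == normalise(m) for m in option.get("match", [])):
            return option["value"]
        # `contains`: a candidate must occur somewhere inside the string.
        if any(normalise(c) in norm for c in option.get("contains", [])):
            return option["value"]
    return default


for raw in ["Nordkorea", "  Greet   Britain ", "Atlantis"]:
    print(repr(raw), "->", lookup(raw, default=raw))
```

Under these assumptions a misspelling such as `Greet Britain` is still caught by the `contains` rule, while an unknown value passes through unchanged because it is supplied as its own default.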

### Extended options

There's a host of options available to configure the application of the data
patches:

```yaml
countries:
  # If you mark a lookup as required, a value that matches no options will
  # throw a `datapatch.exc:LookupException`.
  required: true
  # Normalisation will remove many special characters and collapse multiple spaces
  normalize: false
  # By default, normalize performs transliteration across alphabets (Путин -> Putin);
  # set asciify to false if you want to keep non-ASCII alphabets as is
  asciify: false
  options:
    - match: Francois
      value: France
  # This is a shorthand for defining options that have just one `match` and
  # one `value` defined:
  map:
    Luxemborg: Luxembourg
    Lux: Luxembourg
```
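
The `map` shorthand and the `required` flag can be pictured with the following sketch. It is an illustration in plain Python rather than the library's code: `expand_map`, `resolve` and the `MissingLookup` exception are hypothetical stand-ins, and exact string comparison is assumed because `normalize` is `false` in the configuration above.

```python
from typing import Dict, List, Optional


class MissingLookup(LookupError):
    """Hypothetical stand-in for `datapatch.exc:LookupException`."""


def expand_map(mapping: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn the `map` shorthand into one match/value option per entry."""
    return [{"match": match, "value": value} for match, value in mapping.items()]


def resolve(
    raw: str, options: List[Dict[str, str]], required: bool = False
) -> Optional[str]:
    """Return the value of the first option matching `raw` exactly."""
    for option in options:
        if option["match"] == raw:
            return option["value"]
    if required:
        # Mirrors the documented behaviour: unmatched values are an error.
        raise MissingLookup(f"No option matches {raw!r}")
    return None


options = [{"match": "Francois", "value": "France"}]
options += expand_map({"Luxemborg": "Luxembourg", "Lux": "Luxembourg"})

print(resolve("Lux", options, required=True))
try:
    resolve("Atlantis", options, required=True)
except MissingLookup as exc:
    print("unmatched:", exc)
```

With `required` left off, the same unmatched value would simply yield `None` in this sketch, which is why marking a lookup as required is useful when silent gaps in the table are not acceptable.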

### Result objects

You can also have more details associated with a result and access them:

```yaml
countries:
  options:
    - match: Frankreich
      # These can be arbitrary attributes:
      label: France
      code: FR
```

This can be accessed as a result object with attributes:

```python
from datapatch import read_lookups

lookups = read_lookups("countries.yml")
countries = lookups.get("countries")

result = countries.match("Frankreich")
print(result.label, result.code)
# Attributes that are not defined on the matched option resolve to None:
assert result.capital is None, result.capital
```
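
The behaviour shown above, where defined attributes are returned and undefined ones come back as `None`, can be mimicked with a small wrapper class. This is a sketch under that assumption only; the `Result` class below is hypothetical and says nothing about how `datapatch` builds its own result objects.

```python
from typing import Any, Dict


class Result:
    """Hypothetical result wrapper: known attributes resolve, others are None."""

    def __init__(self, attributes: Dict[str, Any]) -> None:
        self._attributes = dict(attributes)

    def __getattr__(self, name: str) -> Any:
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        return self._attributes.get(name)


result = Result({"label": "France", "code": "FR"})
print(result.label, result.code)
assert result.capital is None
```

Returning `None` for unknown attributes keeps calling code free of `hasattr` checks, at the cost of hiding typos in attribute names.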

## License

`datapatch` is licensed under the terms of the MIT license, which is included as
`LICENSE`.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/opensanctions/datapatch",
    "name": "datapatch",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "",
    "author": "Friedrich Lindenberg",
    "author_email": "tech@opensanctions.org",
    "download_url": "https://files.pythonhosted.org/packages/a2/e7/42394b1477d0543e5f17a1e2036ca85f0eabbad747475c2a3666bd6e4129/datapatch-1.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "",
    "version": "1.2.0",
    "project_urls": {
        "Homepage": "https://github.com/opensanctions/datapatch"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8d0a74df812e274f2af44a1392ab6ebe99349182059ed982f2e0ef3d20f8ab98",
                "md5": "6adea57249b5a76bd50ebf82800df993",
                "sha256": "a6676a5b7e55fcae21d502a7cfea06101116e13b73cc1d30bb310f03ee6f9dce"
            },
            "downloads": -1,
            "filename": "datapatch-1.2.0-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "6adea57249b5a76bd50ebf82800df993",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": null,
            "size": 8579,
            "upload_time": "2024-01-12T07:00:53",
            "upload_time_iso_8601": "2024-01-12T07:00:53.657037Z",
            "url": "https://files.pythonhosted.org/packages/8d/0a/74df812e274f2af44a1392ab6ebe99349182059ed982f2e0ef3d20f8ab98/datapatch-1.2.0-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a2e742394b1477d0543e5f17a1e2036ca85f0eabbad747475c2a3666bd6e4129",
                "md5": "25e15760edab9969c17adc498b742641",
                "sha256": "a08c7a0f33e88653b61088835fb2cd8ee8a65c2d81f92ae1210089a4d89d3061"
            },
            "downloads": -1,
            "filename": "datapatch-1.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "25e15760edab9969c17adc498b742641",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 8132,
            "upload_time": "2024-01-12T07:00:55",
            "upload_time_iso_8601": "2024-01-12T07:00:55.217429Z",
            "url": "https://files.pythonhosted.org/packages/a2/e7/42394b1477d0543e5f17a1e2036ca85f0eabbad747475c2a3666bd6e4129/datapatch-1.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-01-12 07:00:55",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "opensanctions",
    "github_project": "datapatch",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "datapatch"
}
        