datapatch


- Name: datapatch
- Version: 1.2.2
- Home page: https://github.com/opensanctions/datapatch
- Upload time: 2024-11-19 21:35:19
- Author: Friedrich Lindenberg
- License: MIT
# datapatch

A Python library for defining rule-based overrides on messy data. Imagine, for example,
importing a dataset in which each row is associated with a country, and the country
names were entered by hand. You might find names like `Northkorea` or `Greet Britain`
that you want to normalise. `datapatch` provides a mechanism for building a flexible
lookup table (usually stored as a YAML file) to catch and repair these data issues.

## Installation

You can install `datapatch` from the Python package index:

```bash
pip install datapatch
```

## Example

Given a YAML file like this:

```yaml
countries:
  normalize: true
  lowercase: true
  asciify: true
  options:
    - match: Frankreich
      value: France
    - match:
        - Northkorea
        - Nordkorea
        - Northern Korea
        - NKorea
        - DPRK
      value: North Korea
    - contains: Britain
      value: Great Britain
```

The file can be used to apply the data patches against raw input:

```python
from datapatch import read_lookups

lookups = read_lookups("countries.yml")
countries = lookups.get("countries")

# `iter_data()` stands in for your own row source (for example, a csv.DictReader).
# Apply the patch, falling back to the original string if no option matches:
for row in iter_data():
    raw = row.get("Country")
    row["Country"] = countries.get_value(raw, default=raw)
```
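
To make the difference between `match` and `contains` concrete, here is a minimal, stdlib-only sketch of what the `countries` rules above express. It is an illustration under stated assumptions, not `datapatch`'s own implementation: the `OPTIONS` list, `clean()` and `lookup()` are hypothetical names, and `clean()` only approximates the `normalize`, `lowercase` and `asciify` settings (it strips accents rather than transliterating between alphabets).

```python
import re
import unicodedata
from typing import Optional

# The YAML rules from the example, rewritten as plain Python data (hypothetical layout).
OPTIONS = [
    {"match": ["Frankreich"], "value": "France"},
    {
        "match": ["Northkorea", "Nordkorea", "Northern Korea", "NKorea", "DPRK"],
        "value": "North Korea",
    },
    {"contains": ["Britain"], "value": "Great Britain"},
]


def clean(text: str) -> str:
    """Rough stand-in for normalize/lowercase/asciify; not the library's code."""
    # Decompose accented characters and drop anything that is not ASCII.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Turn punctuation into spaces, collapse repeated whitespace, lowercase.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def lookup(raw: Optional[str], default: Optional[str] = None) -> Optional[str]:
    """Return the value of the first option that applies, else `default`."""
    if raw is None:
        return default
    cleaned = clean(raw)
    for option in OPTIONS:
        # `match` requires the whole cleaned string to be equal.
        if any(clean(m) == cleaned for m in option.get("match", [])):
            return option["value"]
        # `contains` only requires the cleaned term to appear somewhere.
        if any(clean(c) in cleaned for c in option.get("contains", [])):
            return option["value"]
    return default
```

With these rules, `lookup("Greet Britain")` resolves to `Great Britain` through the `contains` rule, while an unknown name such as `lookup("Germany", default="Germany")` falls through to the default.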

### Extended options

Several further options control how the data patches are applied:

```yaml
countries:
  # If you mark a lookup as required, a value that matches no option will
  # raise a `datapatch.exc:LookupException`.
  required: true
  # Normalisation removes many special characters and collapses repeated spaces.
  normalize: false
  # By default, normalize performs transliteration across alphabets (Путин -> Putin);
  # set asciify to false if you want to keep non-ASCII alphabets as is.
  asciify: false
  options:
    - match: Francois
      value: France
  # This is a shorthand for defining options that have just one `match` and
  # one `value` defined:
  map:
    Luxemborg: Luxembourg
    Lux: Luxembourg
```
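
The interaction between `required` and the `map` shorthand can be sketched in a few lines of plain Python. This is a hedged illustration of the behaviour described in the comments above, not the library's code: `NoMatchError` is a hypothetical stand-in for `LookupException`, `expand_options()` and `get_value()` are names made up for this example, and matching is reduced to exact string equality.

```python
from typing import Any, Dict, List, Optional


class NoMatchError(Exception):
    """Hypothetical stand-in for the library's LookupException."""


def expand_options(config: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Combine explicit options with the entries of the `map` shorthand."""
    options = list(config.get("options", []))
    for match, value in config.get("map", {}).items():
        # Each `map` entry is equivalent to one option with a single match and value.
        options.append({"match": match, "value": value})
    return options


def get_value(config: Dict[str, Any], raw: str) -> Optional[str]:
    """Return the patched value, or raise if the lookup is required and nothing matches."""
    for option in expand_options(config):
        if option["match"] == raw:
            return option["value"]
    if config.get("required", False):
        raise NoMatchError(f"No option matches {raw!r}")
    return None


# The same configuration as the YAML above, as a Python dict.
config = {
    "required": True,
    "options": [{"match": "Francois", "value": "France"}],
    "map": {"Luxemborg": "Luxembourg", "Lux": "Luxembourg"},
}
```

Here `get_value(config, "Lux")` resolves through the `map` shorthand, and an input that matches nothing raises because the lookup is marked `required`.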

### Result objects

An option can also carry additional details, which you can read from the result:

```yaml
countries:
  options:
    - match: Frankreich
      # These can be arbitrary attributes:
      label: France
      code: FR
```

The matched option is returned as a result object that exposes these details as attributes:

```python
from datapatch import read_lookups

lookups = read_lookups("countries.yml")
countries = lookups.get("countries")

result = countries.match("Frankreich")
print(result.label, result.code)
# Attributes that the matched option does not define are None:
assert result.capital is None, result.capital
```
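
One way to get this "arbitrary attributes, `None` when missing" behaviour in Python is a small wrapper around a dictionary. The sketch below is an assumption about how such an object could be built, not a description of how `datapatch` implements its result type; the `Result` class here is hypothetical.

```python
from typing import Any, Dict


class Result:
    """Hypothetical result wrapper: option keys become attributes."""

    def __init__(self, data: Dict[str, Any]) -> None:
        self._data = dict(data)

    def __getattr__(self, name: str) -> Any:
        # __getattr__ runs only when normal attribute lookup fails.
        # Private names are rejected so a missing `_data` cannot recurse.
        if name.startswith("_"):
            raise AttributeError(name)
        # Unknown keys resolve to None instead of raising.
        return self._data.get(name)


result = Result({"label": "France", "code": "FR"})
```

With this wrapper, `result.label` and `result.code` return the stored values, and `result.capital` is `None` because the option never defined it.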

## License

`datapatch` is licensed under the terms of the MIT license, which is included as
`LICENSE`.

            
