# datapatch
A Python library for defining rule-based overrides on messy data. Imagine, for example,
importing a dataset in which each row is associated with a country, and the country names
were entered by humans. You might find names like `Northkorea` or `Greet Britain`
that you want to normalise. `datapatch` provides a mechanism to build a flexible lookup
table (usually stored as a YAML file) to catch and repair these data issues.
## Installation
You can install `datapatch` from the Python package index:
```bash
pip install datapatch
```
## Example
Given a YAML file like this:
```yaml
countries:
  normalize: true
  lowercase: true
  asciify: true
  options:
    - match: Frankreich
      value: France
    - match:
        - Northkorea
        - Nordkorea
        - Northern Korea
        - NKorea
        - DPRK
      value: North Korea
    - contains: Britain
      value: Great Britain
```
The file can be used to apply the data patches against raw input:
```python
from datapatch import read_lookups, LookupException

lookups = read_lookups("countries.yml")
countries = lookups.get("countries")

# This will apply the patch or default to the original string if none exists.
# `iter_data()` stands in for whatever yields your input rows as dicts:
for row in iter_data():
    raw = row.get("Country")
    row["Country"] = countries.get_value(raw, default=raw)
```
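To make the difference between `match` and `contains` concrete, here is a minimal pure-Python sketch of how such a lookup could behave. It is not `datapatch`'s implementation: the `normalize` and `get_value` functions and the in-memory `OPTIONS` list are hypothetical stand-ins, and the normalisation shown only strips accents, lowercases and collapses whitespace.

```python
import unicodedata
from typing import Optional


def normalize(text: str) -> str:
    """Illustrative normalisation: strip accents, lowercase, collapse spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.lower().split())


# Hypothetical in-memory equivalent of the YAML options shown above.
OPTIONS = [
    {"match": ["Frankreich"], "value": "France"},
    {
        "match": ["Northkorea", "Nordkorea", "Northern Korea", "NKorea", "DPRK"],
        "value": "North Korea",
    },
    {"contains": ["Britain"], "value": "Great Britain"},
]


def get_value(raw: Optional[str], default: Optional[str] = None) -> Optional[str]:
    """Return the first option's value that applies to `raw`, else `default`."""
    if raw is None:
        return default
    norm = normalize(raw)
    for option in OPTIONS:
        # `match` requires the whole normalised string to be equal.
        if norm in {normalize(m) for m in option.get("match", [])}:
            return option["value"]
        # `contains` only requires a normalised substring.
        if any(normalize(c) in norm for c in option.get("contains", [])):
            return option["value"]
    return default


print(get_value("Greet Britain", default="Greet Britain"))
print(get_value("Germany", default="Germany"))
```

In this sketch, `Greet Britain` is repaired by the `contains` rule, while `Germany` matches nothing and falls back to the supplied default.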
### Extended options
There's a host of options available to configure the application of the data
patches:
```yaml
countries:
  # If you mark a lookup as required, a value that matches no options will
  # throw a `datapatch.exc:LookupException`.
  required: true
  # Normalisation will remove many special characters and collapse multiple spaces
  normalize: false
  # By default, normalize performs transliteration across alphabets (Путин -> Putin);
  # set asciify to false if you want to keep non-ASCII alphabets as is
  asciify: false
  options:
    - match: Francois
      value: France
  # This is a shorthand for defining options that have just one `match` and
  # one `value` defined:
  map:
    Luxemborg: Luxembourg
    Lux: Luxembourg
```
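The `required` flag and the `map` shorthand can be illustrated with a small pure-Python sketch. This is an assumption about the behaviour described in the comments above, not the library's code: `NoMatchError` is a hypothetical stand-in for `LookupException`, and `expand_map` and `lookup` are names invented for this example.

```python
from typing import Dict, List, Optional


class NoMatchError(Exception):
    """Hypothetical stand-in for the exception a required lookup raises."""


def expand_map(mapping: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn the `map` shorthand into full options with one match and one value."""
    return [{"match": key, "value": value} for key, value in mapping.items()]


def lookup(raw: str, options: List[Dict[str, str]], required: bool = False) -> Optional[str]:
    """Return the value of the first exactly matching option."""
    for option in options:
        if raw == option["match"]:
            return option["value"]
    if required:
        # A required lookup treats an unmatched value as an error.
        raise NoMatchError(f"No option matches {raw!r}")
    return None


options = [{"match": "Francois", "value": "France"}]
options += expand_map({"Luxemborg": "Luxembourg", "Lux": "Luxembourg"})

print(lookup("Lux", options))
print(lookup("Atlantis", options))
```

With `required=True`, the same call for `Atlantis` would raise instead of returning `None`, which is useful when an unmatched value should stop an import rather than pass through silently.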
### Result objects
You can also have more details associated with a result and access them:
```yaml
countries:
  options:
    - match: Frankreich
      # These can be arbitrary attributes:
      label: France
      code: FR
```
This can be accessed as a result object with attributes:
```python
from datapatch import read_lookups, LookupException
lookups = read_lookups("countries.yml")
countries = lookups.get("countries")
result = countries.match("Frankreich")
print(result.label, result.code)
assert result.capital is None, result.capital
```
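The final assertion above relies on attributes that were never defined resolving to `None` rather than raising. One way such an object could be built is sketched below; the `Result` class here is hypothetical and written only to show the pattern, not taken from `datapatch`.

```python
from typing import Any


class Result:
    """Hypothetical result object exposing arbitrary option attributes."""

    def __init__(self, **attributes: Any) -> None:
        self._attributes = dict(attributes)

    def __getattr__(self, name: str) -> Any:
        # __getattr__ is only called when normal attribute lookup fails.
        # Private names must raise, so copying or pickling behaves normally.
        if name.startswith("_"):
            raise AttributeError(name)
        # Unknown public attributes resolve to None instead of raising.
        return self._attributes.get(name)


result = Result(label="France", code="FR")
print(result.label, result.code)
assert result.capital is None
```

Returning `None` for unknown attributes keeps calling code short, at the cost of hiding typos in attribute names.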
## License
`datapatch` is licensed under the terms of the MIT license, which is included as
`LICENSE`.