# datapatch
A Python library for defining rule-based overrides on messy data. Imagine, for example,
trying to import a dataset in which each row is associated with a country, and the country
names have been entered by humans. You might find names like `Northkorea` or `Greet Britain`
that you want to normalise. `datapatch` provides a mechanism for building a flexible lookup
table (usually stored as a YAML file) to catch and repair these data issues.
## Installation
You can install `datapatch` from the Python package index:
```bash
pip install datapatch
```
## Example
Given a YAML file like this:
```yaml
countries:
  normalize: true
  lowercase: true
  asciify: true
  options:
    - match: Frankreich
      value: France
    - match:
        - Northkorea
        - Nordkorea
        - Northern Korea
        - NKorea
        - DPRK
      value: North Korea
    - contains: Britain
      value: Great Britain
```
The file can be used to apply the data patches against raw input:
```python
from datapatch import read_lookups, LookupException
lookups = read_lookups("countries.yml")
countries = lookups.get("countries")
# This will apply the patch or default to the original string if none exists.
# `iter_data()` stands in for whatever yields your rows as dictionaries:
for row in iter_data():
    raw = row.get("Country")
    row["Country"] = countries.get_value(raw, default=raw)
```
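To make the matching rules concrete, here is a minimal standard-library sketch of the idea behind the `countries` lookup above. It is not how `datapatch` is implemented: the `OPTIONS` list and the `normalize` and `get_value` functions are hypothetical stand-ins, normalisation is reduced to lowercasing and whitespace collapsing, and `contains` is assumed to mean a substring test.

```python
import re

# Hypothetical in-memory version of the `options` from the YAML example.
OPTIONS = [
    {"match": ["Frankreich"], "value": "France"},
    {
        "match": ["Northkorea", "Nordkorea", "Northern Korea", "NKorea", "DPRK"],
        "value": "North Korea",
    },
    {"contains": ["Britain"], "value": "Great Britain"},
]


def normalize(text):
    """Crude stand-in for normalisation: lowercase and collapse whitespace."""
    return re.sub(r"\s+", " ", text).strip().lower()


def get_value(raw, default=None):
    """Return the value of the first option that matches, else the default."""
    if raw is None:
        return default
    norm = normalize(raw)
    for option in OPTIONS:
        # `match` requires the whole normalised string to be equal.
        if norm in (normalize(m) for m in option.get("match", [])):
            return option["value"]
        # `contains` only requires a normalised substring (an assumption).
        if any(normalize(c) in norm for c in option.get("contains", [])):
            return option["value"]
    return default


for raw in ["NORDKOREA", "Greet  Britain", "Atlantis"]:
    print(raw, "->", get_value(raw, default=raw))
```

Passing `default=raw` mirrors the loop above: values with no matching option pass through unchanged instead of becoming `None`.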
### Extended options
There's a host of options available to configure the application of the data
patches:
```yaml
countries:
  # If you mark a lookup as required, a value that matches no options will
  # throw a `datapatch.exc:LookupException`.
  required: true
  # Normalisation will remove many special characters and multiple spaces
  normalize: false
  # By default, normalize performs transliteration across alphabets (Путин -> Putin);
  # set asciify to false if you want to keep non-ASCII alphabets as is
  asciify: false
  options:
    - match: Francois
      value: France
  # This is a shorthand for defining options that have just one `match` and
  # one `value` defined:
  map:
    Luxemborg: Luxembourg
    Lux: Luxembourg
```
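The `map` shorthand and the `required` flag can be illustrated with a short standard-library sketch. This is an assumption-laden model, not the library's code: `expand_map`, `lookup` and `MissingLookupError` are hypothetical names, with `MissingLookupError` standing in for the `LookupException` mentioned in the comments above.

```python
class MissingLookupError(Exception):
    """Hypothetical stand-in for datapatch's LookupException."""


def expand_map(mapping):
    """Turn a `map` shorthand into the equivalent list of options."""
    return [{"match": match, "value": value} for match, value in mapping.items()]


def lookup(options, raw, required=False):
    """Exact-match lookup; raise if `required` is set and nothing matches."""
    for option in options:
        if option["match"] == raw:
            return option["value"]
    if required:
        raise MissingLookupError(f"No option matches {raw!r}")
    return None


# One explicit option plus two options generated from the shorthand.
options = [{"match": "Francois", "value": "France"}]
options += expand_map({"Luxemborg": "Luxembourg", "Lux": "Luxembourg"})

print(lookup(options, "Lux"))
try:
    lookup(options, "Atlantis", required=True)
except MissingLookupError as exc:
    print("unmatched:", exc)
```

Treating `map` as sugar for single-`match` options keeps one code path for matching, whichever way the rule was written.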
### Result objects
You can also have more details associated with a result and access them:
```yaml
countries:
  options:
    - match: Frankreich
      # These can be arbitrary attributes:
      label: France
      code: FR
```
This can be accessed as a result object with attributes:
```python
from datapatch import read_lookups, LookupException
lookups = read_lookups("countries.yml")
countries = lookups.get("countries")
result = countries.match("Frankreich")
print(result.label, result.code)
assert result.capital is None, result.capital
```
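The behaviour shown above, where defined attributes are readable and an undefined one such as `capital` reads as `None`, can be modelled in plain Python. The `Result` class below is a hypothetical sketch of that behaviour, not the class `datapatch` actually returns.

```python
class Result:
    """Hypothetical result wrapper: unknown attributes read as None."""

    def __init__(self, **attributes):
        self._attributes = dict(attributes)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        return self._attributes.get(name)


result = Result(label="France", code="FR")
print(result.label, result.code)
assert result.capital is None
```

Returning `None` for missing attributes lets calling code probe optional fields without wrapping each access in `getattr` or `try`/`except`.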
## License
`datapatch` is licensed under the terms of the MIT license, which is included as
`LICENSE`.
Raw data
{
"_id": null,
"home_page": "https://github.com/opensanctions/datapatch",
"name": "datapatch",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": null,
"author": "Friedrich Lindenberg",
"author_email": "tech@opensanctions.org",
"download_url": "https://files.pythonhosted.org/packages/86/56/90a895e72fb2d73dcb9c6bc42bfb4a3e1658ecd524169f3536c77f566fb0/datapatch-1.2.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": null,
"version": "1.2.2",
"project_urls": {
"Homepage": "https://github.com/opensanctions/datapatch"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "1e87b4ffe1fbc43f7544415c1782e0232bd416b1d19954ff968e25c96331ec8d",
"md5": "1626c5dd283f8e2ee128015178a1cde9",
"sha256": "6110112bc017fe51b3d7c3cc00d7f5abfd02481076743368afa442d6c0e6326f"
},
"downloads": -1,
"filename": "datapatch-1.2.2-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "1626c5dd283f8e2ee128015178a1cde9",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": null,
"size": 8608,
"upload_time": "2024-11-19T21:35:17",
"upload_time_iso_8601": "2024-11-19T21:35:17.970707Z",
"url": "https://files.pythonhosted.org/packages/1e/87/b4ffe1fbc43f7544415c1782e0232bd416b1d19954ff968e25c96331ec8d/datapatch-1.2.2-py2.py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "865690a895e72fb2d73dcb9c6bc42bfb4a3e1658ecd524169f3536c77f566fb0",
"md5": "9ea14b4ac97cd166fbe57ae783d65bb2",
"sha256": "c4656685a03a7bb2e9e482220a130c4ead53999b0c46075809827b9e1cd2baf1"
},
"downloads": -1,
"filename": "datapatch-1.2.2.tar.gz",
"has_sig": false,
"md5_digest": "9ea14b4ac97cd166fbe57ae783d65bb2",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 8118,
"upload_time": "2024-11-19T21:35:19",
"upload_time_iso_8601": "2024-11-19T21:35:19.880645Z",
"url": "https://files.pythonhosted.org/packages/86/56/90a895e72fb2d73dcb9c6bc42bfb4a3e1658ecd524169f3536c77f566fb0/datapatch-1.2.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-19 21:35:19",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "opensanctions",
"github_project": "datapatch",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "datapatch"
}