| Field | Value |
| --- | --- |
| Name | datapatch |
| Version | 1.2.4 |
| home_page | None |
| Summary | A library for defining rule-based overrides on messy data. |
| upload_time | 2025-07-27 19:40:10 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.9 |
| license | MIT |
| requirements | No requirements were recorded. |

# datapatch
A Python library for defining rule-based overrides on messy data. Imagine, for example,
importing a dataset in which each row is associated with a country, and the country names
were entered by humans. You might find names like `Northkorea` or `Greet Britain`
that you want to normalise. `datapatch` provides a mechanism for building a flexible lookup
table (usually stored as a YAML file) to catch and repair these data issues.
## Installation
You can install `datapatch` from the Python Package Index (PyPI):
```bash
pip install datapatch
```
## Example
Given a YAML file like this:
```yaml
countries:
  normalize: true
  lowercase: true
  asciify: true
  options:
    - match: Frankreich
      value: France
    - match:
        - Northkorea
        - Nordkorea
        - Northern Korea
        - NKorea
        - DPRK
      value: North Korea
    - contains: Britain
      value: Great Britain
```
The file can be used to apply the data patches against raw input:
```python
from datapatch import read_lookups, LookupException
lookups = read_lookups("countries.yml")
countries = lookups.get("countries")
# This will apply the patch or default to the original string if none exists:
for row in iter_data():  # iter_data() is a placeholder for your own source of dict rows
    raw = row.get("Country")
    row["Country"] = countries.get_value(raw, default=raw)
```
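To make the rule semantics concrete, here is a minimal pure-Python sketch of how `match` (exact) and `contains` (substring) rules could be evaluated. It does not reproduce `datapatch`'s internals: the names `simplify`, `Rule` and `apply_rules` are hypothetical, normalisation is reduced to lowercasing and whitespace collapsing, and returning the first applicable rule is a simplification chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def simplify(text: str) -> str:
    """Lowercase and collapse whitespace (a stand-in for real normalisation)."""
    return " ".join(text.lower().split())


@dataclass
class Rule:
    """Hypothetical rule: one output value plus exact and substring patterns."""

    value: str
    match: List[str] = field(default_factory=list)     # exact matches
    contains: List[str] = field(default_factory=list)  # substring matches

    def applies(self, raw: str) -> bool:
        norm = simplify(raw)
        if any(norm == simplify(m) for m in self.match):
            return True
        return any(simplify(c) in norm for c in self.contains)


def apply_rules(
    rules: List[Rule], raw: str, default: Optional[str] = None
) -> Optional[str]:
    """Return the value of the first rule that applies, else the default."""
    for rule in rules:
        if rule.applies(raw):
            return rule.value
    return default


rules = [
    Rule(value="France", match=["Frankreich"]),
    Rule(value="North Korea", match=["Northkorea", "Nordkorea", "DPRK"]),
    Rule(value="Great Britain", contains=["Britain"]),
]

print(apply_rules(rules, "  frankreich ", default="?"))
print(apply_rules(rules, "Greet Britain", default="?"))
print(apply_rules(rules, "Atlantis", default="Atlantis"))
```

Passing the raw value as the default, as in the last call, mirrors the `get_value(raw, default=raw)` pattern above: unmatched input passes through unchanged.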
### Extended options
Several options control how the data patches are applied:
```yaml
countries:
  # If you mark a lookup as required, a value that matches no options will
  # raise a `datapatch.exc:LookupException`.
  required: true
  # Normalisation will remove many special characters and collapse multiple spaces.
  normalize: false
  # By default, normalize performs transliteration across alphabets (Путин -> Putin).
  # Set asciify to false if you want to keep non-ASCII alphabets as is.
  asciify: false
  options:
    - match: Francois
      value: France
  # This is a shorthand for defining options that have just one `match` and
  # one `value` defined:
  map:
    Luxemborg: Luxembourg
    Lux: Luxembourg
```
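The `map` shorthand can be read as one `match`/`value` option per entry. The sketch below spells out that equivalence on a plain dictionary. It is an illustration only: `expand_map` is a hypothetical helper, not part of the `datapatch` API.

```python
from typing import Any, Dict, List


def expand_map(config: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the explicit options followed by one option per `map` entry."""
    options = list(config.get("options", []))
    for match, value in config.get("map", {}).items():
        options.append({"match": match, "value": value})
    return options


# Mirrors the `options` and `map` keys of the YAML example above.
config = {
    "options": [{"match": "Francois", "value": "France"}],
    "map": {"Luxemborg": "Luxembourg", "Lux": "Luxembourg"},
}

for option in expand_map(config):
    print(option)
```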
### Result objects
An option can also carry additional attributes, which are available on the result:
```yaml
countries:
  options:
    - match: Frankreich
      # These can be arbitrary attributes:
      label: France
      code: FR
```
The matched option is returned as a result object that exposes these attributes:
```python
from datapatch import read_lookups, LookupException
lookups = read_lookups("countries.yml")
countries = lookups.get("countries")
result = countries.match("Frankreich")
print(result.label, result.code)
assert result.capital is None, result.capital
```
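The final assertion in the example shows that an attribute not defined on the option, here `capital`, reads as `None` rather than raising an error. One way to get that behaviour in plain Python is sketched below. `ResultSketch` is a hypothetical class written for this illustration and is not how `datapatch` necessarily implements its result objects.

```python
from typing import Any, Dict


class ResultSketch:
    """Expose a dict of attributes; unknown attributes read as None."""

    def __init__(self, data: Dict[str, Any]) -> None:
        self._data = dict(data)

    def __getattr__(self, name: str) -> Any:
        # __getattr__ is only called when normal attribute lookup fails.
        if name.startswith("_"):
            # Keep private and dunder lookups failing normally.
            raise AttributeError(name)
        return self._data.get(name)


result = ResultSketch({"label": "France", "code": "FR"})
print(result.label, result.code)
assert result.capital is None
```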
## License
`datapatch` is licensed under the terms of the MIT license, which is included as
`LICENSE`.
Raw data
{
"_id": null,
"home_page": null,
"name": "datapatch",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": null,
"author": null,
"author_email": "OpenSanctions <info@opensanctions.org>",
"download_url": null,
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A library for defining rule-based overrides on messy data.",
"version": "1.2.4",
"project_urls": {
"Homepage": "https://github.com/opensanctions/datapatch"
},
"split_keywords": [],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "d8dd1df187bea2546fa7c3de04d34a366ad5a8095febb61468baddf077c8fd73",
"md5": "597f471b838cb7ccf52b21862e740eca",
"sha256": "b6c2dae33a6635d6526b122bcd2229f098ef9f833bd60a93f644a10d82dde699"
},
"downloads": -1,
"filename": "datapatch-1.2.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "597f471b838cb7ccf52b21862e740eca",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 8270,
"upload_time": "2025-07-27T19:40:10",
"upload_time_iso_8601": "2025-07-27T19:40:10.616619Z",
"url": "https://files.pythonhosted.org/packages/d8/dd/1df187bea2546fa7c3de04d34a366ad5a8095febb61468baddf077c8fd73/datapatch-1.2.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-27 19:40:10",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "opensanctions",
"github_project": "datapatch",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "datapatch"
}