# whyqd: simplicity, transparency, speed
[![Documentation Status](https://readthedocs.org/projects/whyqd/badge/?version=latest)](https://whyqd.readthedocs.io/en/latest/?badge=latest)
[![Build Status](https://travis-ci.com/whythawk/whyqd.svg?branch=master)](https://travis-ci.com/whythawk/whyqd)
[![DOI](https://zenodo.org/badge/239159569.svg)](https://zenodo.org/badge/latestdoi/239159569)
## What is it?
> More research, less wrangling
[**whyqd**](https://whyqd.com) (/wɪkɪd/) is a curatorial toolkit intended to produce well-structured and predictable
data for research analysis.
It provides an intuitive method for creating schema-to-schema crosswalks for restructuring messy data to conform to a
standardised metadata schema. It supports rapid and continuous transformation of messy data using a simple series of
steps. Once complete, you can import wrangled data into more complex analytical or database systems.
**whyqd** plays well with your existing Python-based data-analytical tools. It uses [Ray](https://www.ray.io/) and
[Modin](https://modin.readthedocs.io/) as a drop-in replacement for [Pandas](https://pandas.pydata.org/) to support
processing of large datasets, and [Pydantic](https://pydantic-docs.helpmanual.io/) for data models.
Each definition is saved as a [JSON Schema-compliant](https://json-schema.org/) file. This permits others to read and
scrutinise your approach, validate your methodology, or even use your crosswalks to import and transform data in
production.
Once complete, a transform file can be shared, along with your input data, and anyone can import and validate your
crosswalk to verify that your output data is the product of these inputs.
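Conceptually, this kind of verification can rest on content checksums: a shared definition pins the exact bytes of the input it was built against, so anyone re-running the transform can confirm they hold the same source. The sketch below is a stdlib-only illustration of that idea, not whyqd's implementation; the field names and sample data are hypothetical:

```python
import hashlib

def checksum(data: bytes) -> str:
    """Hex digest identifying the exact bytes of a data source."""
    return hashlib.blake2b(data).hexdigest()

# A shared "transform file" can pin the input it was built against.
source_bytes = b"country,year,hdi_rank\nSingapore,2008,25\n"
transform_definition = {
    "crosswalk": ["RENAME > 'country_name' < ['country']"],
    "input_checksum": checksum(source_bytes),
}

def verify(definition: dict, data: bytes) -> bool:
    """True only if the supplied data matches the pinned input."""
    return definition["input_checksum"] == checksum(data)

print(verify(transform_definition, source_bytes))  # True
print(verify(transform_definition, b"tampered"))   # False
```

Because the digest covers every byte, any edit to the source data, however small, fails verification.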
## Why use it?
**whyqd** allows you to get to work without requiring you to achieve buy-in from anyone or change your existing code.
If you don't want to spend days or weeks slogging through data when all you want to do is test whether your source
data are even useful. If you already have a workflow and established software which includes Python and pandas, and
don't want to change your code every time your source data changes.
If you want to go from a [Cthulhu dataset](https://whyqd.readthedocs.io/en/latest/tutorials/tutorial3) like this:
![UNDP Human Development Index 2007-2008: a beautiful example of messy data.](docs/images/undp-hdi-2007-8.jpg)
*UNDP Human Development Index 2007-2008: a beautiful example of messy data.*
To this:
| | country_name | indicator_name | reference | year | values |
|:---|:-----------------------|:-----------------|:------------|:-------|:---------|
| 0 | Hong Kong, China (SAR) | HDI rank | e | 2008 | 21 |
| 1 | Singapore | HDI rank | nan | 2008 | 25 |
| 2 | Korea (Republic of) | HDI rank | nan | 2008 | 26 |
| 3 | Cyprus | HDI rank | nan | 2008 | 28 |
| 4 | Brunei Darussalam | HDI rank | nan | 2008 | 30 |
| 5 | Barbados | HDI rank | e,g,f | 2008 | 31 |
With a readable set of scripts to ensure that your process can be audited and repeated:
```python
schema_scripts = [
f"UNITE > 'reference' < {REFERENCE_COLUMNS}",
"RENAME > 'country_name' < ['Country']",
"PIVOT_LONGER > ['indicator_name', 'values'] < ['HDI rank', 'HDI Category', 'Human poverty index (HPI-1) - Rank;;2008', 'Human poverty index (HPI-1) - Value (%);;2008', 'Probability at birth of not surviving to age 40 (% of cohort);;2000-05', 'Adult illiteracy rate (% aged 15 and older);;1995-2005', 'Population not using an improved water source (%);;2004', 'Children under weight for age (% under age 5);;1996-2005', 'Population below income poverty line (%) - $1 a day;;1990-2005', 'Population below income poverty line (%) - $2 a day;;1990-2005', 'Population below income poverty line (%) - National poverty line;;1990-2004', 'HPI-1 rank minus income poverty rank;;2008']",
"SEPARATE > ['indicator_name', 'year'] < ';;'::['indicator_name']",
"DEBLANK",
"DEDUPE",
]
```
Then **whyqd** may be for you.
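The scripts above follow a consistent `ACTION > destination < source` shape. Purely as an illustration of that grammar (this is not whyqd's parser, and the pattern here is simplified — it ignores, for example, `SEPARATE`'s `::` modifier), the structure can be teased apart with ordinary string handling:

```python
import ast
import re

# Simplified pattern: ACTION, optional "> destination", optional "< source".
SCRIPT = re.compile(
    r"^(?P<action>[A-Z_]+)"
    r"(?:\s*>\s*(?P<dest>[^<]+?))?"
    r"(?:\s*<\s*(?P<source>.+))?$"
)

def parse(script: str) -> dict:
    match = SCRIPT.match(script.strip())
    if not match:
        raise ValueError(f"Unparseable script: {script!r}")
    action, dest, source = match.group("action", "dest", "source")
    return {
        "action": action,
        "destination": ast.literal_eval(dest) if dest else None,
        "source": ast.literal_eval(source) if source else None,
    }

print(parse("RENAME > 'country_name' < ['Country']"))
# {'action': 'RENAME', 'destination': 'country_name', 'source': ['Country']}
print(parse("DEBLANK"))
# {'action': 'DEBLANK', 'destination': None, 'source': None}
```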
## How does it work?
> Crosswalks are mappings of the relationships between fields defined in different metadata
> [schemas](https://whyqd.readthedocs.io/en/latest/strategies/schema). Ideally, these are one-to-one, where a field in
> one has an exact match in the other. In practice, it's more complicated than that.
Your workflow is:
1. Define a single destination schema,
2. Derive a source schema from a data source,
3. Review your source data structure,
4. Develop a crosswalk to define the relationship between source and destination,
5. Transform and validate your outputs,
6. Share your output data, transform definitions, and a citation.
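In outline, the crosswalk in steps 4 and 5 is a declarative mapping applied uniformly to every row. A toy, stdlib-only illustration of a rename plus pivot-to-long transform — not whyqd's API, which expresses these operations as action scripts — might look like:

```python
# Toy wide-format source rows; column names carry the year after ';;'.
rows = [
    {"Country": "Singapore", "HDI rank;;2008": 25, "HPI-1 rank;;2008": 7},
]

def transform(rows: list[dict]) -> list[dict]:
    out = []
    for row in rows:
        country = row["Country"]  # cf. RENAME > 'country_name' < ['Country']
        for column, value in row.items():
            if column == "Country":
                continue
            # cf. PIVOT_LONGER, then SEPARATE on ';;' into indicator/year.
            indicator, year = column.split(";;")
            out.append({
                "country_name": country,
                "indicator_name": indicator,
                "year": year,
                "values": value,
            })
    return out

for record in transform(rows):
    print(record)
```

The point of the real thing is that the mapping lives in a readable, shareable definition rather than in one-off code like this.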
It starts like this:
```python
import whyqd as qd
```
[Install](https://whyqd.readthedocs.io/en/latest/installation) and then read the [quickstart](https://whyqd.readthedocs.io/en/latest/quickstart).
There are four worked tutorials to guide you through typical scenarios:
- [Aligning multiple sources of local government data from a many-headed Excel spreadsheet to a single schema](https://whyqd.readthedocs.io/en/latest/tutorials/tutorial1)
- [Pivoting wide-format data into archival long-format](https://whyqd.readthedocs.io/en/latest/tutorials/tutorial2)
- [Wrangling Cthulhu data without losing your mind](https://whyqd.readthedocs.io/en/latest/tutorials/tutorial3)
- [Transforming data containing American dates, currencies as strings and misaligned columns](https://whyqd.readthedocs.io/en/latest/tutorials/tutorial4)
## Installation
You'll need at least Python 3.9, then install with your favourite package manager:
```bash
pip install whyqd
```
To derive a source schema from tabular data, import your data from `DATASOURCE_PATH`, declare its `MIMETYPE`, and derive a schema:
```python
import whyqd as qd
datasource = qd.DataSourceDefinition()
datasource.derive_model(source=DATASOURCE_PATH, mimetype=MIMETYPE)
schema_source = qd.SchemaDefinition()
schema_source.derive_model(data=datasource.get)
schema_source.fields.set_categories(name=CATEGORY_FIELD, terms=datasource.get_data())
schema_source.save()
```
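Under the hood, deriving a source schema amounts to naming each column and inferring a type for it. The rough stdlib sketch below shows that idea only — whyqd's own derivation is richer, producing full Pydantic field models — and the sample data and type labels are hypothetical:

```python
import csv
import io

def infer_type(values: list[str]) -> str:
    """Crude per-column inference: all-numeric wins, otherwise string."""
    def numeric(v: str) -> bool:
        try:
            float(v)
            return True
        except ValueError:
            return False
    non_blank = [v for v in values if v.strip()]
    if non_blank and all(numeric(v) for v in non_blank):
        return "number"
    return "string"

def derive_schema(csv_text: str) -> dict:
    reader = csv.DictReader(io.StringIO(csv_text))
    table = list(reader)
    return {
        "fields": [
            {"name": name, "type": infer_type([row[name] for row in table])}
            for name in reader.fieldnames
        ]
    }

schema = derive_schema("country,hdi_rank\nSingapore,25\nCyprus,28\n")
print(schema)
# {'fields': [{'name': 'country', 'type': 'string'}, {'name': 'hdi_rank', 'type': 'number'}]}
```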
[Get started...](https://whyqd.readthedocs.io/en/latest/quickstart)
## Changelog
The version history can be found in the [changelog](https://whyqd.readthedocs.io/en/latest/changelog).
## Background and funding
**whyqd** was created to serve a continuous data wrangling process, including collaboration on more complex messy
sources, ensuring the integrity of the source data, and producing a complete audit trail from data imported to our
database, back to source. You can see the product of that at [openLocal.uk](https://openlocal.uk).
**whyqd** [received initial funding](https://eoscfuture-grants.eu/meet-the-grantees/implementation-no-code-method-schema-schema-data-transformations-interoperability)
from the European Union's Horizon 2020 research and innovation programme under grant agreement No 101017536. Technical
development support is from [EOSC Future](https://eoscfuture.eu/) through the
[RDA Open Call mechanism](https://eoscfuture-grants.eu/provider/research-data-alliance), based on evaluations of
external, independent experts.
The 'backronym' for **whyqd** /wɪkɪd/ is *Whythawk Quantitative Data*. [Whythawk](https://whythawk.com)
is an open data science and open research technical consultancy.
## Licence
The [**whyqd** Python distribution](https://github.com/whythawk/whyqd) is licensed under the terms of the
[BSD 3-Clause license](https://github.com/whythawk/whyqd/blob/master/LICENSE). All documentation is released under
[Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/). **whyqd** tradenames and
marks are copyright [Whythawk](https://whythawk.com).