harmonize-wq


Field | Value
--- | ---
Name | harmonize-wq
Version | 0.5.0
Summary | Package to standardize, clean, and wrangle Water Quality Portal data into more analytic-ready formats
Upload time | 2024-08-21 19:34:58
Requires Python | <3.12,>=3.8
License | MIT License Copyright (c) 2023 U.S. Federal Government (in countries where recognized) Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Keywords | usepa, water data, water quality
Requirements | No requirements were recorded.
[![PyPi](https://img.shields.io/pypi/v/harmonize-wq.svg)](https://pypi.python.org/pypi/harmonize-wq)
[![Documentation Status](https://github.com/USEPA/harmonize-wq/actions/workflows/documentation_deploy.yaml/badge.svg)](https://github.com/USEPA/harmonize-wq/actions/workflows/documentation_deploy.yaml)
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![test](https://github.com/USEPA/harmonize-wq/actions/workflows/test.yml/badge.svg)](https://github.com/USEPA/harmonize-wq/actions/workflows/test.yml)
[![Python Version from PEP 621 TOML](https://img.shields.io/python/required-version-toml?tomlFilePath=https://raw.githubusercontent.com/USEPA/harmonize-wq/main/pyproject.toml)](https://www.python.org/downloads/)
[![pyOpenSci Peer-Reviewed](https://pyopensci.org/badges/peer-reviewed.svg)](https://github.com/pyOpenSci/software-review/issues/157)

# harmonize-wq
Standardize, clean, and wrangle Water Quality Portal data into more analytic-ready formats

US EPA’s [Water Quality Portal (WQP)](https://www.waterqualitydata.us/) aggregates water quality, biological, and physical data provided by many organizations and has become an essential resource, with tools to query and retrieve data using [Python](https://github.com/USGS-python/dataretrieval) or [R](https://github.com/USGS-R/dataRetrieval).
Given the variety of data and of data originators, using the data in analysis often requires data cleaning to ensure it meets the required quality standards and data wrangling to get it into a more analytic-ready format.
Recognizing that the definition of analytic-ready varies depending on the analysis, the harmonize_wq package is intended to be a flexible, water-quality-specific framework to help:

- Identify differences in data units (including speciation and basis)
- Identify differences in sampling or analytic methods
- Resolve data errors using transparent assumptions
- Transform data from long to wide format

Domain experts must decide what data meets their quality standards for data comparability and any thresholds for acceptance or rejection.
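
As a rough illustration of the first goal in the list above, the sketch below counts the units reported for each characteristic in a small, made-up table. It uses plain pandas rather than harmonize_wq, the column names follow WQP conventions, and the rows and unit codes are invented for the example.

```python
import pandas as pd

# Made-up rows using WQP-style column names; values are illustrative only.
results = pd.DataFrame(
    {
        "CharacteristicName": [
            "Temperature, water",
            "Temperature, water",
            "Depth, Secchi disk depth",
            "Depth, Secchi disk depth",
        ],
        "ResultMeasureValue": [25.0, 77.0, 1.5, 4.9],
        "ResultMeasure/MeasureUnitCode": ["deg C", "deg F", "m", "ft"],
    }
)

# Count how many results use each unit for each characteristic.
unit_counts = (
    results.groupby(["CharacteristicName", "ResultMeasure/MeasureUnitCode"])
    .size()
    .rename("n_results")
    .reset_index()
)

# Characteristics reported in more than one unit are candidates for harmonizing.
units_per_char = unit_counts.groupby("CharacteristicName").size()
mixed = sorted(units_per_char[units_per_char > 1].index)

print(unit_counts)
print(mixed)
```

A summary like this only shows where units differ; deciding which differences matter remains a domain decision.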

For complete documentation, see the [docs](https://usepa.github.io/harmonize-wq/index.html). For more detailed tutorials, see the [demos](https://github.com/USEPA/harmonize-wq/tree/main/demos).

## Quick Start

harmonize_wq can be installed using pip:
```bash
python3 -m pip install harmonize-wq
```

To install the latest development version of harmonize_wq using pip:

```bash
pip install git+https://github.com/USEPA/harmonize-wq.git
```
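
After installing, it can be useful to confirm which version was picked up and whether the interpreter falls inside the range the 0.5.0 release declares (`<3.12,>=3.8`). The sketch below uses only the standard library; the helper names `installed_version` and `python_supported` are made up for this example.

```python
import sys
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def python_supported(version_info=None):
    """Check an interpreter version against the '<3.12,>=3.8' range."""
    major_minor = tuple((version_info or sys.version_info)[:2])
    return (3, 8) <= major_minor < (3, 12)


print("harmonize-wq:", installed_version("harmonize-wq"))
print("Python supported:", python_supported())
```

The version range is hard-coded from the 0.5.0 metadata, so it may not match later releases.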

## Example Workflow
### dataretrieval Query for a geojson

```python
import dataretrieval.wqp as wqp
from harmonize_wq import wrangle

# File for area of interest
aoi_url = r'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests/data/PPBays_NCCA.geojson'

# Build query
query = {'characteristicName': ['Temperature, water',
                                'Depth, Secchi disk depth',
                                ]}
query['bBox'] = wrangle.get_bounding_box(aoi_url)
query['dataProfile'] = 'narrowResult'

# Run query
res_narrow, md_narrow = wqp.get_results(**query)

# dataframe of downloaded results
res_narrow
```
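
The `bBox` parameter of a WQP query is a comma-separated string of western-most longitude, southern-most latitude, eastern-most longitude, and northern-most latitude. The sketch below shows one way such a string could be derived from an in-memory GeoJSON dict. It is not the implementation of `wrangle.get_bounding_box`; the helper names and the polygon are hypothetical, and the coordinates are assumed to already be longitude/latitude in decimal degrees.

```python
def iter_positions(coords):
    """Yield (lon, lat) pairs from arbitrarily nested GeoJSON coordinates."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords[0], coords[1]
    else:
        for part in coords:
            yield from iter_positions(part)


def bounding_box(geojson):
    """Return 'west,south,east,north' for a FeatureCollection dict."""
    lons, lats = [], []
    for feature in geojson["features"]:
        for lon, lat in iter_positions(feature["geometry"]["coordinates"]):
            lons.append(lon)
            lats.append(lat)
    bounds = (min(lons), min(lats), max(lons), max(lats))
    return ",".join(str(value) for value in bounds)


# Hypothetical area of interest: a single rectangular polygon.
aoi = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {},
            "geometry": {
                "type": "Polygon",
                "coordinates": [
                    [
                        [-87.5, 30.2],
                        [-87.0, 30.2],
                        [-87.0, 30.6],
                        [-87.5, 30.6],
                        [-87.5, 30.2],
                    ]
                ],
            },
        }
    ],
}

bbox = bounding_box(aoi)
print(bbox)
```

A bounding box is usually larger than the area of interest itself, so results may need to be clipped to the original geometry afterwards.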

### Harmonize results

```python
from harmonize_wq import harmonize

# Harmonize all results
df_harmonized = harmonize.harmonize_all(res_narrow, errors='raise')
df_harmonized
```
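
To give a feel for what harmonizing a single characteristic involves, the sketch below converts temperatures reported in different units to degrees Celsius and returns a QA note when the unit is not recognized. It is a hand-rolled illustration, not the package's conversion code; the function name, the unit strings, and the QA message wording are assumptions.

```python
import math


def to_celsius(value, unit):
    """Convert one temperature to degrees Celsius.

    Returns (converted_value, qa_flag); qa_flag is None when nothing went wrong.
    """
    converters = {
        "deg C": lambda v: v,
        "deg F": lambda v: (v - 32.0) * 5.0 / 9.0,
        "K": lambda v: v - 273.15,
    }
    if unit not in converters:
        # Keep the row but record why it could not be converted.
        return math.nan, f"unrecognized unit: {unit!r}"
    return converters[unit](value), None


for value, unit in [(25.0, "deg C"), (77.0, "deg F"), (12.0, "furlongs")]:
    print(value, unit, "->", to_celsius(value, unit))
```

Recording the problem alongside the value, rather than silently dropping the row, keeps the assumptions transparent for whoever reviews the data later.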

### Clean results

```python
from harmonize_wq import clean

# Clean up other columns of data
df_cleaned = clean.datetime(df_harmonized)  # datetime
df_cleaned = clean.harmonize_depth(df_cleaned)  # Sample depth
df_cleaned
```
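
The datetime step combines the activity date, time, and time zone columns into a single UTC timestamp. The sketch below shows the general idea with plain pandas; it is not the code behind `clean.datetime`, the sample rows are invented, and the `UTC_OFFSET_HOURS` lookup is a hypothetical, deliberately incomplete mapping.

```python
import pandas as pd

# Hypothetical offsets for a few time zone codes; not an exhaustive list.
UTC_OFFSET_HOURS = {"EST": -5, "EDT": -4, "CST": -6, "CDT": -5, "UTC": 0}

samples = pd.DataFrame(
    {
        "ActivityStartDate": ["2020-06-01", "2020-06-02"],
        "ActivityStartTime/Time": ["10:30:00", None],
        "ActivityStartTime/TimeZoneCode": ["CDT", None],
    }
)

# Join date and time text; rows with a missing time stay missing.
combined = samples["ActivityStartDate"].str.cat(
    samples["ActivityStartTime/Time"], sep=" "
)
local = pd.to_datetime(combined, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Shift local time to UTC using the offset for each time zone code.
offset = pd.to_timedelta(
    samples["ActivityStartTime/TimeZoneCode"].map(UTC_OFFSET_HOURS), unit="h"
)
samples["Activity_datetime"] = (local - offset).dt.tz_localize("UTC")

# Keep the date on its own so it survives where the time is missing.
samples["StartDate"] = samples["ActivityStartDate"]

print(samples[["Activity_datetime", "StartDate"]])
```

Rows without a time end up with `NaT` in `Activity_datetime`, which is why a separate date column is worth keeping.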

### Transform results from long to wide format
Many columns in the dataframe are characteristic-specific; that is, they have different values for the same sample depending on the characteristic.
To ensure one result per sample after the transformation, these columns must either be split, generating a new column for each characteristic that has values, or moved out of the table if they are not being used.

```python
from harmonize_wq import wrangle

# Split QA column into multiple characteristic specific QA columns
df_full = wrangle.split_col(df_cleaned)

# Divide table into columns of interest (main_df) and characteristic specific metadata (chars_df)
main_df, chars_df = wrangle.split_table(df_full)

# Combine rows with the same sample organization, activity, location, and datetime
df_wide = wrangle.collapse_results(main_df)

```
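
The idea behind combining rows can be sketched with a plain pandas `groupby`: rows that share the same identifying columns are merged, and each result column keeps its first non-null value. This is an illustration only, not the implementation of `wrangle.collapse_results`; the identifiers and values are made up.

```python
import pandas as pd

# One row per result: each row fills only its own characteristic column.
long_df = pd.DataFrame(
    {
        "MonitoringLocationIdentifier": ["SITE-1", "SITE-1", "SITE-2"],
        "ActivityIdentifier": ["ACT-1", "ACT-1", "ACT-2"],
        "Temperature": [25.0, None, 22.0],
        "Secchi": [None, 1.5, None],
    }
)

# Columns that define a row in the wide table.
keys = ["MonitoringLocationIdentifier", "ActivityIdentifier"]

# first() takes the first non-null value in each column within a group.
wide_df = long_df.groupby(keys, as_index=False).first()

print(wide_df)
```

If two rows in a group both carry a value for the same characteristic, `first()` silently keeps only one of them, so duplicates should be reviewed before collapsing.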

The number of columns in the resulting table is greatly reduced:

Output Column | Type | Source | Changes
--- | --- | --- | ---
MonitoringLocationIdentifier | Defines row | MonitoringLocationIdentifier | NA 
Activity_datetime | Defines row | ActivityStartDate, ActivityStartTime/Time, ActivityStartTime/TimeZoneCode | Combined and UTC
ActivityIdentifier | Defines row | ActivityIdentifier | NA
OrganizationIdentifier | Defines row | OrganizationIdentifier | NA 
OrganizationFormalName | Metadata | OrganizationFormalName | NA
ProviderName | Metadata | ProviderName | NA
StartDate | Metadata | ActivityStartDate | Preserves date where time is NaT
Depth | Metadata | ResultDepthHeightMeasure/MeasureValue, ResultDepthHeightMeasure/MeasureUnitCode | standardized to meters
Secchi | Result | ResultMeasureValue, ResultMeasure/MeasureUnitCode | standardized to meters
QA_Secchi | QA | NA | harmonization processing quality issues
Temperature | Result | ResultMeasureValue, ResultMeasure/MeasureUnitCode | standardized to degrees Celsius
QA_Temperature | QA | NA | harmonization processing quality issues

## Issue Tracker
harmonize_wq is under development. Please report bugs and enhancement ideas using [issues](https://github.com/USEPA/harmonize-wq/issues).


## Disclaimer
The United States Environmental Protection Agency (EPA) GitHub project code is provided on an "as is" basis and the user assumes responsibility for its use.
EPA has relinquished control of the information and no longer has responsibility to protect the integrity, confidentiality, or availability of the information. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by EPA.
The EPA seal and logo shall not be used in any manner to imply endorsement of any commercial product or activity by EPA or the United States Government.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "harmonize-wq",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.12,>=3.8",
    "maintainer_email": null,
    "keywords": "USEPA, water data, water quality",
    "author": null,
    "author_email": "Justin Bousquin <Bousquin.Justin@epa.gov>",
    "download_url": "https://files.pythonhosted.org/packages/25/34/a0175b4aaa1d82fa02a61eb06fde2743c800f6939196b6264760b68062fa/harmonize_wq-0.5.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "summary": "Package to standardize, clean, and wrangle Water Quality Portal data into more analytic-ready formats",
    "version": "0.5.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/USEPA/harmonize-wq/issues",
        "Documentation": "https://usepa.github.io/harmonize-wq/",
        "Homepage": "https://github.com/USEPA/harmonize-wq"
    },
    "split_keywords": [
        "usepa",
        " water data",
        " water quality"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9ecb1225d3144b5b7c4c124f3270ecb5b49d9be8e105d019c900537e24284514",
                "md5": "f498be5bec9295bd03628113463d7019",
                "sha256": "4096e9fee48d96f517ebd0fa9cdf8e7cc561f9130da452229183fd3ad47b3a1d"
            },
            "downloads": -1,
            "filename": "harmonize_wq-0.5.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f498be5bec9295bd03628113463d7019",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.12,>=3.8",
            "size": 57183,
            "upload_time": "2024-08-21T19:34:54",
            "upload_time_iso_8601": "2024-08-21T19:34:54.026183Z",
            "url": "https://files.pythonhosted.org/packages/9e/cb/1225d3144b5b7c4c124f3270ecb5b49d9be8e105d019c900537e24284514/harmonize_wq-0.5.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "2534a0175b4aaa1d82fa02a61eb06fde2743c800f6939196b6264760b68062fa",
                "md5": "275dd0ded87643f7688ec50e1f615e07",
                "sha256": "190fbf6fc725ac570d5efd624cd207d7dcf6bfbec315d34eb2abe50831746e9b"
            },
            "downloads": -1,
            "filename": "harmonize_wq-0.5.0.tar.gz",
            "has_sig": false,
            "md5_digest": "275dd0ded87643f7688ec50e1f615e07",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.12,>=3.8",
            "size": 54807,
            "upload_time": "2024-08-21T19:34:58",
            "upload_time_iso_8601": "2024-08-21T19:34:58.434867Z",
            "url": "https://files.pythonhosted.org/packages/25/34/a0175b4aaa1d82fa02a61eb06fde2743c800f6939196b6264760b68062fa/harmonize_wq-0.5.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-21 19:34:58",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "USEPA",
    "github_project": "harmonize-wq",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "harmonize-wq"
}
        