oca-ds-validator


- Name: oca-ds-validator
- Version: 0.0.2
- Summary: Validate OCA dataset in python workflows
- Upload time: 2024-02-08 14:47:54
- Author: Xingjian Xu and Steven Mugisha Mizero
- Requires Python: >=3.8
- Keywords: data entry, oca json bundle

# OCA Data Set Validator
This is a Python package for validating [Overlays Capture Architecture (OCA)](https://oca.colossi.network/) data sets. It includes three classes: `OCADataSet`, `OCADataSetErr`, and `OCABundle`. For more information about OCA, please check [OCA Specification v1.0.0](https://oca.colossi.network/specification/).

- `OCADataSet` represents an OCA data set to be validated, and can be loaded from a pandas DataFrame, an OCA Excel Data Entry File, or a CSV file.

- `OCADataSetErr` represents the result set of an OCA data set validation. An instance of this class is generated by the data set validation, contains all the error information, and provides three methods for a quick overview: `overview()`, `first_err_col()`, and `get_err_col(attr_name)`.

- `OCABundle` represents schema overlays from a loaded `.json` OCA bundle used to validate the data set.

## Dependencies
- pandas
- pathlib (part of the Python standard library)

## Usage

### Installation
Install the package by running `pip install oca_ds_validator` in the console. You can then import the classes in any Python script.

- The package can be found here: [oca_ds_validator](https://pypi.org/project/oca-ds-validator/)

### Validation steps
1. Import the OCA Bundle using `OCABundle(path)`.
2. Import the OCA Data Set using `OCADataSet(pandas_dataframe)` or `OCADataSet.from_path(path)`.
3. Generate the validation result by calling the `validate()` method of the `OCABundle` instance.

```python
from oca_ds_validator import OCADataSet, OCADataSetErr, OCABundle

test_bundle = OCABundle("/path/to/oca/bundle.json")

# data_set_dataframe is a pandas DataFrame that holds the data set.
test_data = OCADataSet(data_set_dataframe)
# test_data = OCADataSet.from_path("/path/to/oca/data_entry_file.xlsx")
# test_data = OCADataSet.from_path("/path/to/oca/data_set_file.csv")

test_rslt = test_bundle.validate(test_data)
#########################################################################################
# Example of a possible test_rslt:
#   attr_err:
#     [('missing_attribute',
#       'Missing attribute (attribute not found in the data set).'),
#      ('unmatched_attribute',
#       'Unmatched attribute (attribute not found in the OCA Bundle).')]
#   format_err:
#     {'attribute_with_format_error_on_row_0': {0: 'Format mismatch.'},
#      'array_attribute_without_array_data_on_row_0': {0: 'Valid array required.'},
#      'attribute_with_format_error_on_row_42': {42: 'Format mismatch.'},
#      'attribute_with_errors_on_row_0_and_1': {0: 'Format mismatch.',
#                                               1: 'Valid array required.'},
#      'mandatory_attribute_with_missing_data': {0: 'Missing mandatory attribute.'},
#      'attribute_without_error': {}}
#   ecode_err:  # Not matching any of the entry codes
#     {'attribute_with_entry_codes': {0: 'One of the entry codes required.'}}
#########################################################################################
```

### Optional Messages
There are three optional boolean arguments that control the messages printed.

| argument | default value | usage |
| -------- | ------------- | ----- |
| `show_data_preview` | `False` | If enabled, prints a pandas preview of the data set before validation. |
| `enable_flagged_alarm` | `True` | If enabled, prints a warning message if flagged attributes exist. |
| `enable_version_alarm` | `True` | If enabled, prints a warning message for each overlay that contains an OCA version number different from the development version of this script (1.0). |


### Result Observation
The errors found in the data set are stored in the generated `OCADataSetErr` object.

```python
# Prints a brief summary of errors.
test_rslt.overview()
#########################################################################################
# Attribute error.
# {'missing_attribute'} found in the OCA Bundle but not in the data set;
# {'unmatched_attribute'} found in the data set but not in the OCA Bundle.
# Found 3 problematic row(s) in the following attribute(s):
# {'attribute_with_format_error_on_row_0',
#  'array_attribute_without_array_data_on_row_0',
#  'attribute_with_format_error_on_row_42',
#  'attribute_with_errors_on_row_0_and_1',
#  'mandatory_attribute_with_missing_data',
#  'attribute_with_entry_codes'}
#########################################################################################

# Prints the information of the first problematic column.
test_rslt.first_err_col()
#########################################################################################
# The first problematic column is: attribute_with_format_error_on_row_0
# Format error(s) would occur in the following rows:
# row 0 : Format mismatch.
# No entry code error found in the column.
#########################################################################################

# Prints the information for a specific column.
test_rslt.get_err_col("attribute_with_format_error_on_row_42")
#########################################################################################
# Format error(s) would occur in the following rows of column
# attribute_with_format_error_on_row_42:
# row 42 : Format mismatch.
# No entry code error found in the column.
#########################################################################################
```

### Further Processing
```python
# Get objects of full error details.
# You may find it useful for data visualization or further analysis.
test_rslt.get_attr_err()
test_rslt.get_format_err()
test_rslt.get_ecode_err()
test_rslt.get_char_encode_err()
```
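
As a rough sketch of such further processing, the snippet below summarizes error dictionaries shaped like the example result in the Validation steps section. It assumes that `get_format_err()` and `get_ecode_err()` return nested dictionaries of the form `{attribute: {row: message}}`. The `summarize` helper and the hard-coded sample data are hypothetical and not part of the package.

```python
# Sample data copied from the example result shown earlier. In practice these
# would presumably come from test_rslt.get_format_err() and
# test_rslt.get_ecode_err() (assumed return shape: {attribute: {row: message}}).
format_err = {
    "attribute_with_format_error_on_row_0": {0: "Format mismatch."},
    "array_attribute_without_array_data_on_row_0": {0: "Valid array required."},
    "attribute_with_format_error_on_row_42": {42: "Format mismatch."},
    "attribute_with_errors_on_row_0_and_1": {0: "Format mismatch.",
                                             1: "Valid array required."},
    "mandatory_attribute_with_missing_data": {0: "Missing mandatory attribute."},
    "attribute_without_error": {},
}
ecode_err = {
    "attribute_with_entry_codes": {0: "One of the entry codes required."},
}


def summarize(*error_dicts):
    """Hypothetical helper: list attributes with errors and the affected rows."""
    attrs, rows = [], set()
    for errors in error_dicts:
        for attr, row_msgs in errors.items():
            if row_msgs:  # skip attributes with no recorded errors
                attrs.append(attr)
                rows.update(row_msgs)  # adds the row numbers (dict keys)
    return attrs, sorted(rows)


bad_attrs, bad_rows = summarize(format_err, ecode_err)
print(f"{len(bad_rows)} problematic row(s): {bad_rows}")
print(f"{len(bad_attrs)} attribute(s) with errors")
```

With the sample data, the counts agree with the `overview()` output shown above: three problematic rows spread over six attributes.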

## Development Status

This script was created with support from [Agri-food Data Canada](https://agrifooddatacanada.ca/), funded by [CFREF](https://www.cfref-apogee.gc.ca/) through the [Food from Thought grant](https://foodfromthought.ca/) held at the [University of Guelph](https://www.uoguelph.ca/). Currently, we do not provide any warranty of any kind regarding the accuracy, security, completeness or reliability of this script or any of its parts.

At the moment, this script is developed for the validation of the following [OCA attribute types](https://oca.colossi.network/specification/#attribute-type):
- Text (with regular expressions)
- DateTime (with ISO 8601 formats)
- Array[Type]; for any types not mentioned above, only the validity of the array is checked.
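
To make these three checks concrete, here is a minimal standard-library sketch of what each one could look like for a single cell value. It is not the package's implementation: the function names are hypothetical, the DateTime check assumes the ISO 8601 format has already been translated into a `strptime` pattern, and the array check assumes array cells are stored as JSON-encoded text.

```python
import json
import re
from datetime import datetime


def check_text(value, pattern):
    """Return True if the whole string matches the regular expression."""
    return re.fullmatch(pattern, value) is not None


def check_datetime(value, fmt="%Y-%m-%d"):
    """Return True if the string parses with the given strptime pattern.

    Assumption: an ISO 8601 format such as YYYY-MM-DD has been mapped
    to the equivalent strptime pattern beforehand.
    """
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def check_array(value):
    """Return True if the cell holds a JSON-encoded list (assumed encoding)."""
    try:
        return isinstance(json.loads(value), list)
    except (json.JSONDecodeError, TypeError):
        return False


print(check_text("AB-123", r"[A-Z]{2}-\d{3}"))  # regex matches the full value
print(check_datetime("2024-02-08"))             # valid calendar date
print(check_datetime("08/02/2024"))             # wrong layout for the pattern
print(check_array('["a", "b"]'))                # parses as a list
print(check_array("a, b"))                      # not valid JSON
```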

Also, besides the format overlay, the data set will be validated with the following overlays:
- [Conformance Overlay](https://oca.colossi.network/specification/#conformance-overlay), for any missing mandatory data
- [Entry Code Overlay](https://oca.colossi.network/specification/#entry-code-overlay), for any data that mismatches entry codes
- [Character Encoding Overlay](https://oca.colossi.network/specification/#character-encoding-overlay), for any data that mismatches the specified character encoding
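
The overlay checks listed above can be pictured with the following sketch, which applies each check to a column of values and collects `{row: message}` dictionaries like those in the example result. The helper names, the sample columns, and the exact rules (for example, treating blank strings as missing) are assumptions made for illustration; the package may decide these cases differently.

```python
def check_mandatory(value):
    """Conformance-style check: a mandatory value must not be empty."""
    return value is not None and str(value).strip() != ""


def check_entry_code(value, entry_codes):
    """Entry-code-style check: the value must be one of the allowed codes."""
    return value in entry_codes


def check_encoding(value, encoding="utf-8"):
    """Encoding-style check: the text must be representable in the encoding."""
    try:
        value.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def collect_errors(values, check, message):
    """Hypothetical helper: map each failing row index to an error message."""
    return {row: message for row, value in enumerate(values) if not check(value)}


answers = ["yes", "no", "maybe"]   # hypothetical column with entry codes
names = ["Ada", "", "Zoë"]         # hypothetical mandatory text column

ecode_errors = collect_errors(
    answers,
    lambda v: check_entry_code(v, {"yes", "no"}),
    "One of the entry codes required.",
)
missing_errors = collect_errors(names, check_mandatory, "Missing mandatory attribute.")
ascii_errors = collect_errors(
    names, lambda v: check_encoding(v, "ascii"), "Character encoding mismatch."
)

print(ecode_errors)    # row 2 is not an allowed code
print(missing_errors)  # row 1 is empty
print(ascii_errors)    # row 2 contains a non-ASCII character
```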

JSON data types are **NOT** validated because of the type coercion that occurs while importing Excel or CSV files. We also recommend that you import the data set as a pandas DataFrame to prevent unexpected DateTime formatting by software such as Microsoft Excel.

Any validation errors other than the above are **NOT** guaranteed to be caught by this script. Please feel free to contact us with any suggestions for future development.

You can also find a well-developed [OCA Validator](https://github.com/THCLab/oca-conductor) by [The Human Colossus Lab](https://github.com/THCLab) (Rust required).


## License

EUPL (European Union Public License), version 1.2

We have distilled the most crucial license specifics to make your adoption seamless: [see here for details](https://github.com/THCLab/licensing).

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "oca-ds-validator",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "",
    "keywords": "Data Entry,OCA JSON Bundle",
    "author": "Xingjian Xu and Steven Mugisha Mizero",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/c9/81/7a9268556ed66309f8b9aaf939ca7c881f367f27372c35af2b3df6a20753/oca_ds_validator-0.0.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Validate OCA dataset in python workflows",
    "version": "0.0.2",
    "project_urls": {
        "Homepage": "https://github.com/agrifooddatacanada/OCA_data_set_validator",
        "Issues": "https://github.com/agrifooddatacanada/OCA_data_set_validator/issues"
    },
    "split_keywords": [
        "data entry",
        "oca json bundle"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6031b2ed103200399938220a502c75059985ba8616b49d2fd637688c96f23f4b",
                "md5": "c05650268e141c88a4b3ea0e1f10b328",
                "sha256": "ea40149a488de906075d888a15b5afefbcc377c13fb02f49f870a99b6d35e042"
            },
            "downloads": -1,
            "filename": "oca_ds_validator-0.0.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c05650268e141c88a4b3ea0e1f10b328",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 14236,
            "upload_time": "2024-02-08T14:47:53",
            "upload_time_iso_8601": "2024-02-08T14:47:53.278687Z",
            "url": "https://files.pythonhosted.org/packages/60/31/b2ed103200399938220a502c75059985ba8616b49d2fd637688c96f23f4b/oca_ds_validator-0.0.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c9817a9268556ed66309f8b9aaf939ca7c881f367f27372c35af2b3df6a20753",
                "md5": "427c92ee53f3a61bc176f0745b6d0fe7",
                "sha256": "d9d26b5d6b02a0108d8c14217be2ff4b58ded6d5e3ac68d75bc44d12b231814d"
            },
            "downloads": -1,
            "filename": "oca_ds_validator-0.0.2.tar.gz",
            "has_sig": false,
            "md5_digest": "427c92ee53f3a61bc176f0745b6d0fe7",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 44373,
            "upload_time": "2024-02-08T14:47:54",
            "upload_time_iso_8601": "2024-02-08T14:47:54.715628Z",
            "url": "https://files.pythonhosted.org/packages/c9/81/7a9268556ed66309f8b9aaf939ca7c881f367f27372c35af2b3df6a20753/oca_ds_validator-0.0.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-08 14:47:54",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "agrifooddatacanada",
    "github_project": "OCA_data_set_validator",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "oca-ds-validator"
}
        