### Still in an early development stage and regularly undergoing significant changes
# dataframe-auditor
A dataframe auditor that computes a number of characteristics of the data.
> [Summary](#summary)
>
> [Installation](#installation)
>
> [Testing](#testing)
>
> [Usage](#usage)
>
> [Contributions](#contributions)
## Summary
[Data profiling](https://en.wikipedia.org/wiki/Data_profiling) is important in data analysis and analytics, as well as in determining characteristics of data pipelines.
This repository aims to provide a means to extract a selection of attributes from data.
It is currently focused on processing _pandas_ dataframes, but this functionality is being
extended to _spark_ dataframes too.
Given a pandas dataframe, the following values are extracted, where _object_ and _category_ types are mapped to
_string_ and all numerical types to _numeric_:
|Type | Measure |
|:---|:---|
|**String & Numeric** | Percentage null |
|**String** | Distinct counts |
| | Most frequent categories |
|**Numeric** | Mean |
| | Standard deviation |
| | Variance |
| | Min value|
| | Max value|
| | Range |
| | Kurtosis |
| | Skewness |
| | Kullback-Leibler divergence |
| | Mean absolute deviation |
| | Median |
| | Interquartile range |
| | Percentage zero values |
| | Percentage nan values |
Naturally, many of these characteristics are not independent of one another, so some may be excluded to suit the application.
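To make the measures in the table concrete, the following is a minimal plain-Python sketch, not the library's implementation. The function names `sketch_numeric` and `sketch_string` and the string-column keys (`p_null`, `distinct`, `most_frequent`) are hypothetical; the numeric keys mirror the example output in this README. The conventions chosen here (population variance, percentages on a 0 to 100 scale measured against all rows) are assumptions. Kurtosis, skewness, and Kullback-Leibler divergence are left out because the README does not state which estimator or reference distribution is used.

```python
from collections import Counter
from statistics import mean, median, pstdev, pvariance, quantiles


def sketch_numeric(values):
    """Illustrative measures for one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    avg = mean(present)
    # Quartiles with linear interpolation between the observed data points
    q1, _, q3 = quantiles(present, n=4, method="inclusive")
    return {
        "mean": avg,
        "median": median(present),
        "variance": pvariance(present),  # population variance (assumed convention)
        "std": pstdev(present),
        "min": min(present),
        "max": max(present),
        "range": max(present) - min(present),
        "iqr": q3 - q1,
        "mad": mean([abs(v - avg) for v in present]),  # mean absolute deviation
        "p_zeros": 100 * sum(v == 0 for v in present) / len(values),
        "p_nan": 100 * (len(values) - len(present)) / len(values),
    }


def sketch_string(values, top=3):
    """Illustrative measures for one string column (hypothetical key names)."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "p_null": 100 * (len(values) - len(present)) / len(values),
        "distinct": len(counts),
        "most_frequent": counts.most_common(top),
    }


print(sketch_numeric([1] * 10))
print(sketch_string(["a", "b", "a", None]))
```

The sketch also shows why the measures are not independent: `range` follows from `min` and `max`, and `std` is the square root of `variance`.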
Auditing a dataframe with this library returns a dictionary of these measures for each column in the dataframe.
For example, if a dataframe consists of a single column named _trivial_ in which every value is `1`, the result is:
```json
[{
"attr": "trivial",
"type": "NUMERIC",
"median": 1.0,
"variance": 0.0,
"std": 0.0,
"max": 1,
"min": 1,
"mad": 0.0,
"p_zeros": 0.0,
"kurtosis": 0,
"skewness": 0,
"iqr": 0.0,
"range": 0,
"p_nan": 0.0,
"mean": 1.0
}]
```
For a dataframe with columns `["trivial", "non-trivial"]`, a list of dictionaries is returned, one per column (only the `attr` key is shown here):
```json
[{
"attr": "trivial"
},
{
"attr": "non-trivial"
}]
```
## Installation
* Dependencies are contained in `requirements.txt`:
```bash
pip install -r requirements.txt
```
* Alternatively, to install directly from GitHub, use:
```bash
pip install git+https://github.com/jackdotwa/dataframe-auditor.git
```
## Testing
* Unit tests may be run via:
```bash
python -m unittest discover tests
```
* Code coverage may be determined via:
```bash
coverage run -m unittest discover tests && coverage report
```
## Usage
An example of using this package is:
```python
import pandas as pd
import dfauditor

# Three numeric columns; None marks a missing value
numeric_data = {
    'x': [50, 50, -10, 0, 0, 5, 15, -3, None, 0],
    'y': [0.00001, 256.128, None, 16.32, 2048, -3.1415926535, 111, 2.4, 4.8, 0.0],
    'trivial': [1]*10
}
numeric_df = pd.DataFrame(numeric_data)

# Audit every column of the dataframe
result_dict = dfauditor.audit_dataframe(numeric_df, nr_processes=3)
```
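As a sanity check on an audit, a few of the measures can be recomputed directly with pandas. The sketch below is independent of `dfauditor`: `pandas_summary` is a hypothetical helper, the key names are borrowed from the example output in the Summary, and the percentage convention (0 to 100, measured against all rows) is an assumption that the library may not share.

```python
import pandas as pd

# Two of the columns from the usage example above
numeric_df = pd.DataFrame({
    "x": [50, 50, -10, 0, 0, 5, 15, -3, None, 0],
    "trivial": [1] * 10,
})


def pandas_summary(series):
    """Recompute a handful of measures with plain pandas (NaN values are skipped)."""
    return {
        "mean": series.mean(),
        "median": series.median(),
        "min": series.min(),
        "max": series.max(),
        "range": series.max() - series.min(),
        "p_zeros": 100 * (series == 0).mean(),   # share of rows equal to zero
        "p_nan": 100 * series.isna().mean(),     # share of rows that are missing
    }


summaries = {name: pandas_summary(numeric_df[name]) for name in numeric_df.columns}
for name, measures in summaries.items():
    print(name, measures)
```

Comparing these figures with the corresponding entries returned by the audit is a quick way to confirm which conventions the library follows.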
## Contributions
Pull requests are always welcome.
Raw data
{
"_id": null,
"home_page": "https://gitlab.com/spatialedge/ml-engineering/dataframe-auditor",
"name": "spatialedge-analytics-dfauditor",
"maintainer": null,
"docs_url": null,
"requires_python": "<3.12,>=3.9",
"maintainer_email": null,
"keywords": "analytics, utilities",
"author": "Jacques du Toit, Carl du Plessis, Jean Naude",
"author_email": null,
"download_url": null,
"platform": null,
"bugtrack_url": null,
"license": "Spatialedge Community License",
"summary": "A dataframe auditor that extracts descriptive statistics from dataframe columns",
"version": "1.1.0",
"project_urls": {
"Homepage": "https://gitlab.com/spatialedge/ml-engineering/dataframe-auditor",
"Repository": "https://gitlab.com/spatialedge/ml-engineering/dataframe-auditor"
},
"split_keywords": [
"analytics",
" utilities"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "f1ebe55a613ae65bec1a52691bd08a4be5c0b68ea78f24f741ab7eebd608bd6b",
"md5": "5d6d4fd8289c0f68fbc2dacfe22c314f",
"sha256": "5827650c25138761ab7a4681c3a575740a860005794324c8f8eabf3f6dac850f"
},
"downloads": -1,
"filename": "spatialedge_analytics_dfauditor-1.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "5d6d4fd8289c0f68fbc2dacfe22c314f",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<3.12,>=3.9",
"size": 11000,
"upload_time": "2024-12-12T10:10:07",
"upload_time_iso_8601": "2024-12-12T10:10:07.897810Z",
"url": "https://files.pythonhosted.org/packages/f1/eb/e55a613ae65bec1a52691bd08a4be5c0b68ea78f24f741ab7eebd608bd6b/spatialedge_analytics_dfauditor-1.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-12 10:10:07",
"github": false,
"gitlab": true,
"bitbucket": false,
"codeberg": false,
"gitlab_user": "spatialedge",
"gitlab_project": "ml-engineering",
"lcname": "spatialedge-analytics-dfauditor"
}