spatialedge-analytics-dfauditor


* Name: spatialedge-analytics-dfauditor
* Version: 1.1.0
* Home page: https://gitlab.com/spatialedge/ml-engineering/dataframe-auditor
* Summary: A dataframe auditor that extracts descriptive statistics from dataframe columns
* Upload time: 2024-12-12 10:10:07
* Authors: Jacques du Toit, Carl du Plessis, Jean Naude
* Requires Python: <3.12,>=3.9
* License: Spatialedge Community License
* Keywords: analytics, utilities
### Still in an early development stage and undergoing significant changes regularly

# dataframe-auditor

A dataframe auditor that computes a number of characteristics of the data.


> [Summary](#summary)
> 
> [Installation](#installation)
>
> [Testing](#testing)
>
> [Usage](#usage)
> 
> [Contributions](#contributions)

## Summary

  [Data profiling](https://en.wikipedia.org/wiki/Data_profiling) is important in data analysis and analytics, as well as in determining characteristics of data pipelines.
  This repository aims to provide a means to extract a selection of attributes from data.
  
  It is currently focused on processing _pandas_ dataframes, but this functionality is being 
  extended to _spark_ dataframes too.
  
  Given a pandas dataframe, the following values are extracted. Columns of _object_ and _category_ type are
  treated as _string_, and all numerical types as _numeric_:
  
  |Type | Measure |   
  |:---|:---|
  |**String & Numeric** | Percentage null |
  |**String** | Distinct counts |
  | | Most frequent categories |
  |**Numeric** | Mean | 
  | | Standard deviation |
  | | Variance |
  | | Min value| 
  | | Max value|
  | | Range |
  | | Kurtosis |
  | | Skewness |
  | | Kullback-Leibler divergence |
  | | Mean absolute deviation |
  | | Median |
  | | Interquartile range |
  | | Percentage zero values |
  | | Percentage nan values |
     

  Naturally, many of these characteristics are not independent of one another, so some may be excluded to suit the application.
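
To make the measures in the table concrete, here is a minimal sketch of how several of them could be computed with plain pandas calls. It is not the library's implementation: `describe_numeric` and `describe_string` are hypothetical helpers, the dictionary keys for string columns are invented for illustration, and expressing percentages on a 0 to 100 scale relative to all rows is an assumption.

```python
import pandas as pd


def describe_numeric(series: pd.Series) -> dict:
    """Hypothetical helper: a few numeric measures for one column."""
    values = series.dropna()  # statistics are computed on non-missing values
    total = len(series)       # percentages are relative to all rows (an assumption)
    return {
        "mean": values.mean(),
        "median": values.median(),
        "std": values.std(),        # sample standard deviation (ddof=1)
        "variance": values.var(),   # sample variance (ddof=1)
        "min": values.min(),
        "max": values.max(),
        "range": values.max() - values.min(),
        "iqr": values.quantile(0.75) - values.quantile(0.25),
        "mad": (values - values.mean()).abs().mean(),  # mean absolute deviation
        "skewness": values.skew(),
        "kurtosis": values.kurt(),
        "p_zeros": 100.0 * (values == 0).sum() / total,
        "p_nan": 100.0 * series.isna().sum() / total,
    }


def describe_string(series: pd.Series, top_n: int = 3) -> dict:
    """Hypothetical helper: measures for a string-like column."""
    return {
        "p_null": 100.0 * series.isna().sum() / len(series),
        "distinct": series.nunique(),  # missing values are not counted
        "most_frequent": series.value_counts().head(top_n).to_dict(),
    }


numeric_stats = describe_numeric(pd.Series([50, 50, -10, 0, 0, 5, 15, -3, None, 0]))
string_stats = describe_string(pd.Series(["a", "b", "a", None]))
print(numeric_stats["range"], string_stats["most_frequent"])
```

The sketch also shows why the measures are not independent: `range` follows from `min` and `max`, and `variance` is the square of `std`.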
  
  Auditing a dataframe with this library returns a dictionary of these measures for each column in the dataframe.
  For example, if a dataframe consists of a single column named _trivial_, in which every value is `1`, the result is:
  
  ```json
    [{
     "attr":  "trivial",
     "type": "NUMERIC",
     "median": 1.0,
     "variance": 0.0,
     "std": 0.0,
     "max": 1,
     "min": 1,
     "mad": 0.0,
     "p_zeros": 0.0,
     "kurtosis": 0,
     "skewness": 0,
     "iqr": 0.0,
     "range": 0,
     "p_nan": 0.0,
     "mean": 1.0
     }]
  ```
  
  For a dataframe with columns `["trivial", "non-trivial"]`, a list with one dictionary per column is returned (only the `attr` key is shown here):
  ```json
    [{
      "attr":  "trivial"
      },
     {
      "attr": "non-trivial"
     }]
```
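
Because the result is a list with one dictionary per column, it can be convenient to re-key it by column name. The following is a minimal sketch: `index_by_column` is a hypothetical helper (not part of `dfauditor`), and the sample list is hand-written to mirror the shape shown above, with most measures omitted.

```python
def index_by_column(audit_result):
    """Map each column name (the "attr" key) to its dictionary of measures."""
    return {entry["attr"]: entry for entry in audit_result}


# Hand-written sample in the documented shape; real output has more keys.
sample_result = [
    {"attr": "trivial", "type": "NUMERIC", "mean": 1.0, "std": 0.0},
    {"attr": "non-trivial", "type": "NUMERIC"},
]

by_column = index_by_column(sample_result)
print(by_column["trivial"]["mean"])
print(sorted(by_column))
```

Looking up a column by name then avoids scanning the list each time a single measure is needed.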
    
  
## Installation

  * Dependencies are listed in `requirements.txt` and can be installed with:
      
    ```bash
    pip install -r requirements.txt
    ```
    
  * Alternatively, to install directly from GitHub:
  
    ```bash
    pip install git+https://github.com/jackdotwa/dataframe-auditor.git
    ```
 
    
## Testing

  * Unit tests may be run via:
   
  ```bash
    python -m unittest discover tests
  ```
  * Code coverage may be determined via:
  
  ```bash
    coverage run -m unittest discover tests && coverage report 
  ```
  

## Usage

  An example of using this package is:
  
  ```python
  import pandas as pd
  import dfauditor

  # Example data: two numeric columns with missing values and one constant column
  numeric_data = {
        'x': [50, 50, -10, 0, 0, 5, 15, -3, None, 0],
        'y': [0.00001, 256.128, None, 16.32, 2048, -3.1415926535, 111, 2.4, 4.8, 0.0],
        'trivial': [1]*10
  }
  numeric_df = pd.DataFrame(numeric_data)

  # Audit every column; nr_processes sets the number of processes to use
  result_dict = dfauditor.audit_dataframe(numeric_df, nr_processes=3)
  ``` 
 
## Contributions
Pull requests are always welcome.


            
