structured-profiling


| Field | Value |
| --- | --- |
| Name | structured-profiling |
| Version | 0.3.11 |
| home_page | https://github.com/Clearbox-AI/StructuredDataProfiling |
| Summary | A Python library to check for data quality and automatically generate data tests. |
| upload_time | 2023-10-26 09:01:36 |
| docs_url | None |
| author | Clearbox AI |
| requires_python | >=3.9,<3.13 |
| license | GPL |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
            
## StructuredDataProfiling

StructuredDataProfiling is a Python library developed to automatically profile structured datasets and to facilitate the creation of **data tests**.

The library creates data tests in the form of **Expectations** using the [great_expectations](https://www.greatexpectations.io) framework. Expectations are 'declarative statements that a computer can evaluate and that are semantically meaningful to humans'.

An expectation could be, for example, 'the sum of columns a and b should be equal to one' or 'the values in column c should be non-negative'.
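
To make these two example expectations concrete, here is a minimal sketch in plain Python. It does not use the great_expectations API; the function names `rows_sum_to_one` and `column_non_negative` and the sample rows are hypothetical and exist only for illustration.

```python
import math

# Hypothetical sample data: each dict is one row of a small table.
rows = [
    {"a": 0.25, "b": 0.75, "c": 3.0},
    {"a": 0.5, "b": 0.5, "c": 0.0},
    {"a": 0.9, "b": 0.1, "c": 12.5},
]


def rows_sum_to_one(rows, first, second, tol=1e-9):
    """Check that first + second equals one in every row, within a tolerance."""
    return all(math.isclose(r[first] + r[second], 1.0, abs_tol=tol) for r in rows)


def column_non_negative(rows, column):
    """Check that every value in the column is greater than or equal to zero."""
    return all(r[column] >= 0 for r in rows)


print(rows_sum_to_one(rows, "a", "b"))
print(column_non_negative(rows, "c"))
```

Each function is a declarative statement about the data that either holds or does not, which is the idea behind an expectation.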

StructuredDataProfiling runs a series of tests aimed at identifying statistics, rules, and constraints characterising a given dataset. The information generated by the profiler is collected by performing the following operations:

- Characterise uni- and bi-variate distributions.
- Identify data quality issues.
- Evaluate relationships between attributes (e.g. column C is the difference between columns A and B).
- Understand ontologies characterising categorical data (e.g. column A contains names, while column B contains geographical places).
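
As an illustration of the third operation in the list above, the sketch below searches a small table for columns that equal the difference of two other columns. It is a brute-force example under stated assumptions, not the algorithm used by the library; the function name `find_difference_relations` and the sample table are hypothetical.

```python
import math
from itertools import permutations


def find_difference_relations(table, tol=1e-9):
    """Return (target, left, right) triples where target == left - right in every row.

    `table` maps column names to equal-length lists of numbers.
    """
    relations = []
    for target, left, right in permutations(table, 3):
        if all(
            math.isclose(t, l - r, abs_tol=tol)
            for t, l, r in zip(table[target], table[left], table[right])
        ):
            relations.append((target, left, right))
    return relations


# Hypothetical data in which C = A - B holds for every row.
table = {
    "A": [10.0, 7.0, 3.5],
    "B": [4.0, 2.0, 1.5],
    "C": [6.0, 5.0, 2.0],
}
print(find_difference_relations(table))
```

Because C = A - B implies B = A - C, the search reports the relation in both algebraically equivalent forms. Checking every ordered triple is cubic in the number of columns, so this approach only suits small tables.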

For an overview of the library outputs, please check the [examples](./examples) section.

# Installation
You can install StructuredDataProfiling using pip:

`pip install structured-profiling`

# Quickstart
You can import the profiler using:

```python
from structured_data_profiling.profiler import DatasetProfiler
```
You can then create a profiler instance by passing the path to a CSV file:
```python
profiler = DatasetProfiler('./csv_path.csv')
```
If the dataset has a primary key (for example, to define relations between tables or sequences), you can specify it with the **primary key** argument, which accepts one or more column names.

To start the profiling scripts, run the `profile()` method:
```python
profiler.profile()
```
The `generate_expectations()` method converts the results of the profiling process into data expectations. Note that the method requires an existing local great_expectations project.
If you haven't created one yet, run `great_expectations init` in your working directory.
```python
profiler.generate_expectations()
```
The expectations are generated in JSON format using the great_expectations schema. The method also creates data docs using the renderer provided by the great_expectations library.

These docs can be found in the local folder ```great_expectations/uncommitted/data_docs```.
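
To locate the rendered pages programmatically, a small helper such as the following could be used. This is a sketch that assumes only the folder path given above; the function name `find_data_docs` is hypothetical, and the exact layout beneath `data_docs` depends on your great_expectations configuration.

```python
from pathlib import Path


def find_data_docs(project_root="."):
    """Return the HTML files under great_expectations/uncommitted/data_docs, sorted."""
    docs_dir = Path(project_root) / "great_expectations" / "uncommitted" / "data_docs"
    if not docs_dir.is_dir():
        # No data docs have been generated in this project yet.
        return []
    return sorted(docs_dir.rglob("*.html"))


for page in find_data_docs():
    print(page)
```

The helper returns an empty list rather than raising an error when the folder does not exist, so it can safely be called before `generate_expectations()` has been run.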

# Profiling outputs
The profiler generates three JSON files describing the ingested dataset. These files contain the following information:
- column_profiles: the statistical characterisation of the dataset columns.
- dataset_profile: issues and limitations affecting the dataset.
- tests: the data tests found by the profiler.
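
The output directory and the exact file names are not stated here, so the following sketch simply loads every `.json` file found in a given folder. The function name `load_profiles` and the `output_dir` argument are hypothetical; point it at wherever the profiler writes its files in your setup.

```python
import json
from pathlib import Path


def load_profiles(output_dir):
    """Load every JSON file in output_dir into a dict keyed by file name without extension."""
    profiles = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            profiles[path.stem] = json.load(handle)
    return profiles
```

Keying the result by file stem makes it straightforward to inspect each of the three outputs separately once you know what the files are called.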

Generating expectations uses the great_expectations library to produce an HTML file containing data docs. An example of a data doc for a given column is shown in the image below.

<img alt="data docs example 1" src="https://raw.githubusercontent.com/Clearbox-AI/StructuredDataProfiling/main/examples/num_columns.png"/>


# Examples
You can find a couple of notebook examples in the [examples](./examples) folder.
# To-dos
Disclaimer: this library is still at a very early stage. Among other things, we still need to:

- [ ] Support more data formats (Feather, Parquet)
- [ ] Add more Expectations
- [ ] Integrate PII identification using Presidio
- [ ] Optimise and compile part of the profiling routines using Cython
- [ ] Write library tests

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/Clearbox-AI/StructuredDataProfiling",
    "name": "structured-profiling",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.9,<3.13",
    "maintainer_email": "",
    "keywords": "",
    "author": "Clearbox AI",
    "author_email": "info@clearbox.ai",
    "download_url": "https://files.pythonhosted.org/packages/de/06/c958dbee553ac4416b95be6e65b93c6f81734c5afa87ffb413644979d86e/structured_profiling-0.3.11.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "GPL",
    "summary": "A Python library to check for data quality and automatically generate data tests. ",
    "version": "0.3.11",
    "project_urls": {
        "Homepage": "https://github.com/Clearbox-AI/StructuredDataProfiling",
        "Repository": "https://github.com/Clearbox-AI/StructuredDataProfiling"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a98f95e73649e66edef34f6957fdfb6f0a3c4ba506a51418d859d970e38e531a",
                "md5": "43af92fd138259cb39e6b196affdc3cb",
                "sha256": "20fa40ea2ed2ac50dd00623df96d299acbb69ae7205c5c955fac61776448b5bb"
            },
            "downloads": -1,
            "filename": "structured_profiling-0.3.11-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "43af92fd138259cb39e6b196affdc3cb",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9,<3.13",
            "size": 29850,
            "upload_time": "2023-10-26T09:01:35",
            "upload_time_iso_8601": "2023-10-26T09:01:35.239269Z",
            "url": "https://files.pythonhosted.org/packages/a9/8f/95e73649e66edef34f6957fdfb6f0a3c4ba506a51418d859d970e38e531a/structured_profiling-0.3.11-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "de06c958dbee553ac4416b95be6e65b93c6f81734c5afa87ffb413644979d86e",
                "md5": "b26f89048f63347a04f2e40e38e35d35",
                "sha256": "7b0366de3bcd25afe0ab61bde34ecccf56460d2dbe8ebb96e0bcd5155924b8fa"
            },
            "downloads": -1,
            "filename": "structured_profiling-0.3.11.tar.gz",
            "has_sig": false,
            "md5_digest": "b26f89048f63347a04f2e40e38e35d35",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9,<3.13",
            "size": 25809,
            "upload_time": "2023-10-26T09:01:36",
            "upload_time_iso_8601": "2023-10-26T09:01:36.664812Z",
            "url": "https://files.pythonhosted.org/packages/de/06/c958dbee553ac4416b95be6e65b93c6f81734c5afa87ffb413644979d86e/structured_profiling-0.3.11.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-10-26 09:01:36",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Clearbox-AI",
    "github_project": "StructuredDataProfiling",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "structured-profiling"
}
        