zarque-profiling


- Name: zarque-profiling
- Version: 0.5.10
- Home page: https://github.com/crescendo-medix/zarque-profiling
- Summary: Data profiling tools for Big Data
- Upload time: 2023-07-19 04:12:58
- Author: Crescendo Medix
- Requires Python: >=3.7, <3.12
- License: MIT
- Keywords: big-data, polars, pandas, data-profiling, eda, data-science, data-analysis, python, jupyter, ipython
- Requirements: No requirements were recorded.


<p align="center">
<img src="https://user-images.githubusercontent.com/132550577/236653863-ccf98580-4a6f-46ba-abde-d3e20a87b354.png" alt="Zarque-profiling">
</p>

[![PyPI version](https://badge.fury.io/py/zarque-profiling.svg)](https://badge.fury.io/py/zarque-profiling)
![Python Versions](https://img.shields.io/pypi/pyversions/zarque-profiling.svg)
[![Downloads](https://static.pepy.tech/badge/zarque-profiling)](https://pepy.tech/project/zarque-profiling)
![GitHub](https://img.shields.io/github/license/crescendo-medix/zarque-profiling)

Zarque-profiling is a data profiling tool that runs 3x faster than Pandas-profiling, offering a new option for profiling big data.

### Features

Zarque-profiling provides the same features, analysis items, and output reports as Pandas-profiling. It supports minimal profiling (minimal=True), maximal profiling (minimal=False), and comparison of two reports.  

>*Note:*    
*For big data, maximal profiling (minimal=False) is not recommended because the analysis takes a long time. Minimal profiling (minimal=True) is the default.*  


### Powered by Polars

Zarque-profiling is based on pandas-profiling (ydata-profiling) and uses Polars instead of Pandas to speed up the analysis process.  

### Use cases

- Profiling large datasets as a standalone package  
  Profile large datasets that take too long to process with Pandas-profiling.  
  Profile data when Polars is already used for data analytics and data science.  
- Seamless integration with existing packages  
- EDA (Exploratory Data Analysis)  
  Simple data analysis without writing code (histograms, scatter plots, heat maps, text analysis).  
- Comparing multiple versions of the same dataset (profiling reports)  
  Compare data before and after data wrangling.  
  Compare training data with evaluation data in machine learning.  
- Data preparation and data migration solutions  
  Estimate the man-hours required.  
  Help create data specifications.  
  Determine whether a dataset should be migrated.  


***

### Benchmark

The figure below shows benchmark results for data acquisition and analysis processing time on 1 million to 100 million rows with minimal profiling (minimal=True). These results are for reference only; processing times vary with the performance of the PC and the amount of memory.  

<p align="center">
<img src="https://user-images.githubusercontent.com/132550577/236175318-f7f34294-b7cd-48ab-b13b-acfc4cc3e442.png">
</p>

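Because the benchmark numbers depend on the machine, it can be useful to measure on your own hardware. The following is a minimal timing sketch using only the standard library; the helper name `time_call` is hypothetical and not part of zarque-profiling, and the workload shown is a stand-in that you would replace with your own profiling call.

```python
import time


def time_call(func, repeats=3):
    """Run func `repeats` times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Stand-in workload. To time a real run, pass something like
    #   lambda: ProfileReport(df).to_file("path/file_name.html")
    elapsed = time_call(lambda: sum(range(1_000_000)))
    print(f"best of 3: {elapsed:.3f}s")
```

Taking the best of several runs reduces noise from other processes, though for very large datasets a single run may be all that is practical.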
***

### Installation

Install the package with the `pip` package manager.

```sh
pip install zarque-profiling
```
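The package metadata declares a supported Python range of `>=3.7, <3.12`, so it may help to check the interpreter before installing. A minimal sketch using only the standard library; the helper name `is_supported` is hypothetical and not part of zarque-profiling.

```python
import sys

# Bounds taken from the package metadata: requires_python ">=3.7, <3.12".
MIN_VERSION = (3, 7)
MAX_VERSION_EXCLUSIVE = (3, 12)


def is_supported(version=None):
    """Return True if the (major, minor) version falls inside the declared range."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


if __name__ == "__main__":
    status = "supported" if is_supported() else "not supported"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```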

### How to use

*See the Pandas-profiling documentation for details on usage.*  

Prepare a Polars DataFrame.

```py
import polars as pol
# CSV file
df = pol.read_csv("path/file_name.csv")
# Parquet file
df = pol.read_parquet("path/file_name.parquet")
```
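When input files arrive in either format, the reader can be chosen from the file extension. This is a minimal sketch: `reader_name` and `load_frame` are hypothetical helpers, and only the two Polars readers shown above (`read_csv`, `read_parquet`) are assumed.

```python
from pathlib import Path

# Map file suffixes to the Polars reader functions used above.
READERS = {".csv": "read_csv", ".parquet": "read_parquet"}


def reader_name(path):
    """Return the name of the Polars reader matching the file's suffix."""
    suffix = Path(path).suffix.lower()
    try:
        return READERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix!r}") from None


def load_frame(path):
    """Load a CSV or Parquet file into a Polars DataFrame."""
    import polars as pol  # imported lazily so reader_name works without Polars

    return getattr(pol, reader_name(path))(path)
```

For example, `load_frame("path/file_name.parquet")` would dispatch to `pol.read_parquet`.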

Generate the standard profiling report.  

```py
from zarque_profiling import ProfileReport
# Minimal-profiling
ProfileReport(df, title="Zarque Profiling Report")
# Maximal-profiling
ProfileReport(df, minimal=False, title="Zarque Profiling Report")
```
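Since maximal profiling is not recommended for big data, one option is to pick the mode from the row count. The sketch below is only an illustration: the helper `profile_kwargs` and the cutoff `MAX_ROWS_FOR_MAXIMAL` are assumptions, not part of zarque-profiling, and the cutoff should be tuned to your hardware and time budget.

```python
# Hypothetical cutoff; tune it to your hardware and time budget.
MAX_ROWS_FOR_MAXIMAL = 1_000_000


def profile_kwargs(n_rows, title="Zarque Profiling Report"):
    """Build ProfileReport keyword arguments, using minimal mode above the cutoff."""
    return {"minimal": n_rows > MAX_ROWS_FOR_MAXIMAL, "title": title}


# Usage, assuming `df` is a Polars DataFrame:
#   from zarque_profiling import ProfileReport
#   report = ProfileReport(df, **profile_kwargs(df.height))
```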

Use inside JupyterLab.  

```py
report = ProfileReport(df)
# Displaying the report as a set of widgets
report.to_widgets()
# Directly embedded in a cell
report.to_notebook_iframe()
```

Exporting the report to a file.  

```py
# As an HTML file
report.to_file("path/file_name.html")
# As a JSON file
report.to_json("path/file_name.json")
```

Compare two profiling reports.  

```py
from zarque_profiling import compare
df1 = pol.read_csv("path/file_name.csv")
df2 = pol.read_csv("path/file_name_corrected.csv")
report1 = ProfileReport(df1, title="Original Data")
report2 = ProfileReport(df2, title="Corrected Data")
compare([report1, report2])
```

### Customization examples  

>*For big data, the following code examples take a long time to run.*

Correlation diagrams (spearman, pearson, phi_k, cramers, and kendall).  

```py
ProfileReport(
    df,
    minimal=False,
    correlations={
        "spearman": {"calculate": True, "warn_high_correlations": True, "threshold": 0.9},
        "pearson" : {"calculate": True, "warn_high_correlations": True, "threshold": 0.9},
        "phi_k"   : {"calculate": True, "warn_high_correlations": True, "threshold": 0.9},
        "cramers" : {"calculate": True, "warn_high_correlations": True, "threshold": 0.9},
        "kendall" : {"calculate": True, "warn_high_correlations": True, "threshold": 0.9},
        "auto"    : {"calculate": False, "warn_high_correlations": False, "threshold": 0.9}
    }
)
```
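The correlations dictionary above is repetitive, so it can be generated instead of written by hand. This is a minimal sketch: `build_correlations` is a hypothetical helper, and the method names and option keys are copied from the example above rather than from an API reference.

```python
# Method names as they appear in the example above.
ALL_METHODS = ("spearman", "pearson", "phi_k", "cramers", "kendall", "auto")


def build_correlations(enabled, threshold=0.9):
    """Return a correlations dict with only the `enabled` methods switched on."""
    unknown = set(enabled) - set(ALL_METHODS)
    if unknown:
        raise ValueError(f"unknown correlation methods: {sorted(unknown)}")
    return {
        name: {
            "calculate": name in enabled,
            "warn_high_correlations": name in enabled,
            "threshold": threshold,
        }
        for name in ALL_METHODS
    }


# Usage, assuming `df` and ProfileReport from the examples above:
#   ProfileReport(df, minimal=False,
#                 correlations=build_correlations({"pearson", "spearman"}))
```

Enabling only the methods you need keeps the run time down, which matters most on large datasets.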

Text analysis (length distribution, word distribution and character information).  

```py
ProfileReport(
    df,
    vars={"cat": {"length": True, "words": True, "characters": True}}
)
```

Change the matplotlib font family.  
If column names contain Japanese text, change the default font to one that can display Japanese characters.
>*The following code is an example of setting an IPAex font (a Japanese font).*

```py
ProfileReport(
    df,
    minimal=False,
    font_family="IPAexGothic"
)
```

### License

- MIT license  

            
