<p align="center">
<img src="https://user-images.githubusercontent.com/132550577/236653863-ccf98580-4a6f-46ba-abde-d3e20a87b354.png" alt="Zarque-profiling">
</p>
[![PyPI version](https://badge.fury.io/py/zarque-profiling.svg)](https://badge.fury.io/py/zarque-profiling)
![Python Versions](https://img.shields.io/pypi/pyversions/zarque-profiling.svg)
[![Downloads](https://static.pepy.tech/badge/zarque-profiling)](https://pepy.tech/project/zarque-profiling)
![GitHub](https://img.shields.io/github/license/crescendo-medix/zarque-profiling)
Zarque-profiling is a data profiling tool that is 3x faster than Pandas-profiling. It offers a new option for your big data profiling needs.
### Features
Zarque-profiling provides the same features, analysis items, and output reports as Pandas-profiling. It supports minimal-profiling (minimal=True), maximal-profiling (minimal=False), and comparison of two reports.
>*Note:*
*Maximal-profiling (minimal=False) is not recommended for big data because the analysis takes a long time. Minimal-profiling (minimal=True) is the default.*
### Powered by Polars
Zarque-profiling is based on pandas-profiling (ydata-profiling) and uses Polars instead of Pandas to speed up the analysis process.
### Use cases
- Profiling large datasets as a standalone package
  - Profiling large datasets that take too long to process with Pandas-profiling.
  - Data profiling when Polars is used for data analytics and data science.
- Seamless integration with existing packages
- EDA (Exploratory Data Analysis)
  - Simple data analysis without writing code (histograms, scatter plots, heat maps, text analysis).
- Comparing multiple versions of the same dataset (profiling reports)
  - Compare data before and after data wrangling.
  - Compare training data with evaluation data in machine learning.
- Data preparation / data migration solution business
  - Estimating the man-hours required.
  - Helping to create data specifications.
  - Determining whether a dataset should be migrated.
***
### Benchmark
The figure below shows the benchmark results of data acquisition and analysis processing time for 1 million to 100 million rows in minimal profiling (minimal=True). This data is for reference only. Processing times vary depending on the performance of the PC used and the amount of memory.
<p align="center">
<img src="https://user-images.githubusercontent.com/132550577/236175318-f7f34294-b7cd-48ab-b13b-acfc4cc3e442.png">
</p>
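Because timings depend on the machine, it can help to measure on your own hardware. The following is a minimal sketch using only the standard library; `timed` is a hypothetical helper and not part of Zarque-profiling. The commented usage line assumes a Polars data frame `df` and the `ProfileReport` class described under "How to use".

```py
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; replace it with the profiling call, for example:
# timed(lambda: ProfileReport(df).to_file("report.html"))
total, seconds = timed(sum, range(1_000_000))
print(f"result={total}, elapsed={seconds:.3f}s")
```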
***
### Installation
You can install using the `pip` package manager.
```sh
pip install zarque-profiling
```
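The package metadata for version 0.5.10 declares `requires_python` as `>=3.7, <3.12`. As a small sketch, the check below tests whether the running interpreter falls inside that range before installing; `is_supported` is a hypothetical helper, and the bounds are copied from that metadata, so they may differ for other releases.

```py
import sys

# Bounds taken from the package metadata (requires_python ">=3.7, <3.12").
MIN_VERSION = (3, 7)
MAX_VERSION_EXCLUSIVE = (3, 12)


def is_supported(version_info=None):
    """Return True if the (major, minor) version is within the declared range."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


print("supported interpreter:", is_supported())
```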
### How to use
*See Pandas-profiling for details on usage.*
Prepare a Polars data frame.
```py
import polars as pol
# CSV file
df = pol.read_csv("path/file_name.csv")
# Parquet file
df = pol.read_parquet("path/file_name.parquet")
```
Generate the standard profiling report.
```py
from zarque_profiling import ProfileReport
# Minimal-profiling
ProfileReport(df, title="Zarque Profiling Report")
# Maximal-profiling
ProfileReport(df, minimal=False, title="Zarque Profiling Report")
```
Use inside Jupyter Lab.
```py
report = ProfileReport(df)
# Displaying the report as a set of widgets
report.to_widgets()
# Directly embedded in a cell
report.to_notebook_iframe()
```
Export the report to a file.
```py
# As an HTML file
report.to_file("path/file_name.html")
# As a JSON file
report.to_json("path/file_name.json")
```
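When exporting to a directory that may not exist yet, a small wrapper can create it first. This is a sketch under the assumption that `to_file` does not create missing directories itself; `export_html` is a hypothetical helper, and it relies only on the `to_file` method shown above.

```py
from pathlib import Path


def export_html(report, out_dir, stem="report"):
    """Create out_dir if needed, write the report as HTML, and return the path."""
    out_path = Path(out_dir) / f"{stem}.html"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    report.to_file(str(out_path))
    return out_path
```

For example, `export_html(report, "reports/2023", "customers")` would write `reports/2023/customers.html`.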
Compare two profiling reports.
```py
from zarque_profiling import compare
df1 = pol.read_csv("path/file_name.csv")
df2 = pol.read_csv("path/file_name_corrected.csv")
report1 = ProfileReport(df1, title="Original Data")
report2 = ProfileReport(df2, title="Corrected Data")
compare([report1, report2])
```
### Customization examples
>*For big data, the following code examples take a long time to run.*
Correlation Diagram (spearman, pearson, phi_k, cramers and kendall).
```py
ProfileReport(
    df,
    minimal=False,
    correlations={
        "spearman": {"calculate": True, "warn_high_correlations": True, "threshold": 0.9},
        "pearson": {"calculate": True, "warn_high_correlations": True, "threshold": 0.9},
        "phi_k": {"calculate": True, "warn_high_correlations": True, "threshold": 0.9},
        "cramers": {"calculate": True, "warn_high_correlations": True, "threshold": 0.9},
        "kendall": {"calculate": True, "warn_high_correlations": True, "threshold": 0.9},
        "auto": {"calculate": False, "warn_high_correlations": False, "threshold": 0.9}
    }
)
```
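Since every entry in the `correlations` dictionary has the same shape, it can be generated instead of written out by hand. The sketch below is one way to do that; `correlation_config` and `CORRELATION_METHODS` are hypothetical names, and the keys and fields are copied from the example above rather than from an authoritative list of supported options.

```py
# Method names as they appear in the example above.
CORRELATION_METHODS = ("spearman", "pearson", "phi_k", "cramers", "kendall", "auto")


def correlation_config(enabled, threshold=0.9):
    """Build a correlations dict that enables only the methods in `enabled`."""
    unknown = set(enabled) - set(CORRELATION_METHODS)
    if unknown:
        raise ValueError(f"unknown correlation methods: {sorted(unknown)}")
    return {
        name: {
            "calculate": name in enabled,
            "warn_high_correlations": name in enabled,
            "threshold": threshold,
        }
        for name in CORRELATION_METHODS
    }


config = correlation_config({"spearman", "pearson"})
# Pass it on as: ProfileReport(df, minimal=False, correlations=config)
```

Enabling fewer methods is also a simple way to shorten the analysis on large data.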
Text analysis (length distribution, word distribution and character information).
```py
ProfileReport(
    df,
    vars={"cat": {"length": True, "words": True, "characters": True}}
)
```
Change the matplotlib font family.
If column names contain Japanese text, change the default font to one that can display Japanese.
>*The following code is an example that uses an IPAex font (a Japanese font).*
```py
ProfileReport(
    df,
    minimal=False,
    font_family="IPAexGothic"
)
```
### License
- MIT license
Raw data
{
"_id": null,
"home_page": "https://github.com/crescendo-medix/zarque-profiling",
"name": "zarque-profiling",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.7, <3.12",
"maintainer_email": "",
"keywords": "big-data polars pandas data-profiling eda data-science data-analysis python jupyter ipython",
"author": "Crescendo Medix",
"author_email": "g-ikeba@nifty.com",
"download_url": "",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Data profiling tools for Big Data",
"version": "0.5.10",
"project_urls": {
"Homepage": "https://github.com/crescendo-medix/zarque-profiling"
},
"split_keywords": [
"big-data",
"polars",
"pandas",
"data-profiling",
"eda",
"data-science",
"data-analysis",
"python",
"jupyter",
"ipython"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "e71179e9c813a7c36bc113414bc9510aa464b000ab9354f42f8de7e4a41dfb9d",
"md5": "0f498747f006476e7ec2cf8301d5cd25",
"sha256": "7f97399d34cd8cc57878da28cf8ec3b17ec367d610532e52c93c454903110886"
},
"downloads": -1,
"filename": "zarque_profiling-0.5.10-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "0f498747f006476e7ec2cf8301d5cd25",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": ">=3.7, <3.12",
"size": 352609,
"upload_time": "2023-07-19T04:12:58",
"upload_time_iso_8601": "2023-07-19T04:12:58.516891Z",
"url": "https://files.pythonhosted.org/packages/e7/11/79e9c813a7c36bc113414bc9510aa464b000ab9354f42f8de7e4a41dfb9d/zarque_profiling-0.5.10-py2.py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-07-19 04:12:58",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "crescendo-medix",
"github_project": "zarque-profiling",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "zarque-profiling"
}