<br />
<img src="https://www.datawaza.com/en/latest/_static/datawaza_logo_name_trans.svg" alt="datawaza_logo_name_trans.svg" width="300"/>
--------------------------------------
[![PyPI Version](https://img.shields.io/pypi/v/datawaza)](https://pypi.org/project/datawaza/)
[![License](https://img.shields.io/github/license/jbeno/datawaza)](https://github.com/jbeno/datawaza/blob/main/LICENSE)
[![Last Commit](https://img.shields.io/github/last-commit/jbeno/datawaza)](https://github.com/jbeno/datawaza)
[![Documentation Status](https://readthedocs.org/projects/datawaza/badge/?version=latest)](https://www.datawaza.com/en/latest/?badge=latest)
[![Coverage Status](https://coveralls.io/repos/github/jbeno/datawaza/badge.svg?branch=main)](https://coveralls.io/github/jbeno/datawaza?branch=main)
[![Python Version](https://img.shields.io/pypi/pyversions/datawaza)]()
Datawaza streamlines common Data Science tasks. It's a collection of tools for data exploration, visualization, data cleaning, pipeline creation, hyper-parameter searching, model iteration, and evaluation. It builds upon core libraries like [Pandas](https://pandas.pydata.org/), [Matplotlib](https://matplotlib.org/), [Seaborn](https://seaborn.pydata.org/), and [Scikit-Learn](https://scikit-learn.org/stable/).
<p align="center">
<a href="https://www.datawaza.com/en/latest/explore.html#datawaza.explore.plot_charts"><img src="https://www.datawaza.com/en/latest/_static/plot_charts.png" width="30%" /></a>
<a href="https://www.datawaza.com/en/latest/explore.html#datawaza.explore.plot_map_ca"><img src="https://www.datawaza.com/en/latest/_static/plot_map_ca.png" width="30%" style="margin:0 1%;" /></a>
<a href="https://www.datawaza.com/en/latest/explore.html#datawaza.explore.plot_3d"><img src="https://www.datawaza.com/en/latest/_static/plot_3d.png" width="30%" /></a>
</p>
<p align="center">
<a href="https://www.datawaza.com/en/latest/model.html#datawaza.model.iterate_model"><img src="https://www.datawaza.com/en/latest/_static/iterate_model_1.png" width="30%" /></a>
<a href="https://www.datawaza.com/en/latest/model.html#datawaza.model.iterate_model"><img src="https://www.datawaza.com/en/latest/_static/iterate_model_2.png" width="30%" style="margin:0 1%;" /></a>
<a href="https://www.datawaza.com/en/latest/model.html#datawaza.model.plot_results"><img src="https://www.datawaza.com/en/latest/_static/plot_results.png" width="30%" /></a>
</p>
<p align="center">
<a href="https://www.datawaza.com/en/latest/explore.html#datawaza.explore.get_corr"><img src="https://www.datawaza.com/en/latest/_static/get_corr.png" width="30%" /></a>
<a href="https://www.datawaza.com/en/latest/clean.html#datawaza.clean.reduce_multicollinearity"><img src="https://www.datawaza.com/en/latest/_static/reduce_multicollinearity.png" width="30%" style="margin:0 1%;" /></a>
<a href="https://www.datawaza.com/en/latest/explore.html#datawaza.explore.plot_corr"><img src="https://www.datawaza.com/en/latest/_static/plot_corr.png" width="30%" /></a>
</p>
Installation
------------
The latest release can be found on [PyPI](https://pypi.org/project/datawaza/). See the [Change Log](CHANGELOG.md) for a history of changes. Install Datawaza with pip:
pip install datawaza
Documentation
-------------
Online documentation is available at [Datawaza.com](https://datawaza.com).
The [User Guide](https://www.datawaza.com/en/latest/userguide.html) is a Jupyter notebook that walks through how to use the Datawaza functions. It's probably the best place to start. There is also an API reference for the major modules: [Clean](https://www.datawaza.com/en/latest/clean.html), [Explore](https://www.datawaza.com/en/latest/explore.html), [Model](https://www.datawaza.com/en/latest/model.html), and [Tools](https://www.datawaza.com/en/latest/tools.html).
Development
-----------
The [Datawaza repo](https://github.com/jbeno/datawaza) is on GitHub.
Please submit bugs that you encounter to the [Issue Tracker](https://github.com/jbeno/datawaza/issues). Contributions and ideas for enhancements are welcome! So far this is a solo effort, but I would love to collaborate.
Dependencies
------------
Datawaza supports Python 3.10. It may support other versions, but these have not been tested yet.
Due to the breadth of use cases, installation requires NumPy, Pandas, Matplotlib, Seaborn, Plotly, Scikit-Learn, SciPy, Cartopy, GeoPandas, StatsModels, and a few other supporting packages. See the [Requirements.txt](https://github.com/jbeno/datawaza/blob/main/requirements.txt).
What is Waza?
-------------
Waza (技) means "technique" in Japanese. In martial arts like Aikido, it is paired with words like "suwari-waza" (sitting techniques) or "kaeshi-waza" (reversal techniques). So we've paired it with "data" to represent Data Science techniques: データ技 "data-waza".
Origin Story
-------------
Most of these functions were created while I was pursuing a [Professional Certificate in Machine Learning & Artificial Intelligence](https://em-executive.berkeley.edu/professional-certificate-machine-learning-artificial-intelligence) from U.C. Berkeley. With every assignment, I tried to simplify repetitive tasks and streamline my workflow. They served me well, and I hope you will find some value in them.
Quick Start
-----------
The [User Guide](https://www.datawaza.com/en/latest/userguide.html) will show you how to use Datawaza's functions in depth. Assuming you already have data loaded, here are some examples of what it can do:
>>> import datawaza as dw
Show the unique values of each variable below the threshold of n = 12:
>>> dw.get_unique(df, 12, count=True, percent=True)
CATEGORICAL: Variables with unique values equal to or below: 12
job has 12 unique values:
admin. 10422 25.3%
blue-collar 9254 22.47%
technician 6743 16.37%
services 3969 9.64%
management 2924 7.1%
retired 1720 4.18%
entrepreneur 1456 3.54%
self-employed 1421 3.45%
housemaid 1060 2.57%
unemployed 1014 2.46%
student 875 2.12%
unknown 330 0.8%
marital has 4 unique values:
married 24928 60.52%
single 11568 28.09%
divorced 4612 11.2%
unknown 80 0.19%
Plot bar charts of categorical variables, dimensioned by the target variable:
>>> dw.plot_charts(df, plot_type='cat', cat_cols=cat_columns, hue='y', rotation=90)
![plot_charts output](https://www.datawaza.com/en/latest/_static/plot_charts_output.png)
Get the top positive and negative correlations with the target variable, and save to lists:
>>> pos_features, neg_features = dw.get_corr(df_enc, n=10, var='subscribed_enc', return_arrays=True)
Top 10 positive correlations:
Variable 1 Variable 2 Correlation
0 duration subscribed_enc 0.41
1 poutcome_success subscribed_enc 0.32
2 previously_contacted subscribed_enc 0.32
3 pdays subscribed_enc 0.27
4 previous subscribed_enc 0.23
5 month_mar subscribed_enc 0.14
6 month_oct subscribed_enc 0.14
7 month_sep subscribed_enc 0.12
8 no_default_1 subscribed_enc 0.10
9 job_student subscribed_enc 0.09
Top 10 negative correlations:
Variable 1 Variable 2 Correlation
0 nr.employed subscribed_enc -0.35
1 euribor3m subscribed_enc -0.31
2 emp.var.rate subscribed_enc -0.30
3 poutcome_nonexistent subscribed_enc -0.19
4 contact_telephone subscribed_enc -0.14
5 cons.price.idx subscribed_enc -0.14
6 month_may subscribed_enc -0.11
7 campaign subscribed_enc -0.07
8 job_blue-collar subscribed_enc -0.07
9 education_basic.9y subscribed_enc -0.05
Plot a chart showing the top correlations with the target variable:
>>> dw.plot_corr(df_enc, 'subscribed_enc', n=16, size=(12,6), rotation=90)
![plot_corr output](https://www.datawaza.com/en/latest/_static/plot_corr_output.png)
Run a model iteration, which dynamically assembles a pipeline and evaluates the model, including
charts of residuals, predicted vs. actual, and coefficients:
>>> results_df, iteration_6 = dw.iterate_model(X2_train, X2_test, y2_train, y2_test,
... transformers=['ohe', 'log', 'poly3'], model='linreg',
... iteration='6', note='X2. Test size: 0.25, Pipeline: OHE > Log > Poly3 > LinReg',
... plot=True, lowess=True, coef=True, perm=True, vif=True, decimal=2,
... save=True, save_df=results_df, config=my_config)
![iterate_model output 1 of 3](https://www.datawaza.com/en/latest/_static/iterate_model_output_1.png)
![iterate_model output 2 of 3](https://www.datawaza.com/en/latest/_static/iterate_model_output_2.png)
![iterate_model output 3 of 3](https://www.datawaza.com/en/latest/_static/iterate_model_output_3.png)
Compare train/test scores across model iterations, and select the best result:
>>> dw.plot_results(results_df, metrics=['Train MAE', 'Test MAE'], y_label='Mean Absolute Error',
... select_metric='Test MAE', select_criteria='min', decimal=0)
![plot_results output](https://www.datawaza.com/en/latest/_static/plot_results_output.png)
This was just a sample of some Datawaza tools. Download [userguide.ipynb](https://github.com/jbeno/datawaza/blob/main/docs/userguide.ipynb) and explore the full breadth of the library in your Jupyter environment.
Raw data
{
"_id": null,
"home_page": "https://datawaza.com",
"name": "datawaza",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": "data science, visualization, machine learning, data analysis",
"author": "Jim Beno",
"author_email": "jim@jimbeno.net",
"download_url": "https://files.pythonhosted.org/packages/f3/7e/84b1fd2a585b796fc56d91d6ca25b54ef58493ce9ad525cd5ae3dcfc89f2/datawaza-0.1.2.tar.gz",
"platform": null,
"description": "<br />\n<img src=\"https://www.datawaza.com/en/latest/_static/datawaza_logo_name_trans.svg\" alt=\"datawaza_logo_name_trans.svg\" width=\"300\"/>\n\n--------------------------------------\n[![PyPI Version](https://img.shields.io/pypi/v/datawaza)](https://pypi.org/project/datawaza/)\n[![License](https://img.shields.io/github/license/jbeno/datawaza)](https://github.com/jbeno/datawaza/blob/main/LICENSE)\n[![Last Commit](https://img.shields.io/github/last-commit/jbeno/datawaza)](https://github.com/jbeno/datawaza)\n[![Documentation Status](https://readthedocs.org/projects/datawaza/badge/?version=latest)](https://www.datawaza.com/en/latest/?badge=latest)\n[![Coverage Status](https://coveralls.io/repos/github/jbeno/datawaza/badge.svg?branch=main)](https://coveralls.io/github/jbeno/datawaza?branch=main)\n[![Python Version](https://img.shields.io/pypi/pyversions/datawaza)]()\n\nDatawaza streamlines common Data Science tasks. It's a collection of tools for data exploration, visualization, data cleaning, pipeline creation, hyper-parameter searching, model iteration, and evaluation. It builds upon core libraries like [Pandas](https://pandas.pydata.org/), [Matplotlib](https://matplotlib.org/), [Seaborn](https://seaborn.pydata.org/), and [Scikit-Learn](https://scikit-learn.org/stable/).\n\n<p align=\"center\">\n <a href=\"https://www.datawaza.com/en/latest/explore.html#datawaza.explore.plot_charts\"><img src=\"https://www.datawaza.com/en/latest/_static/plot_charts.png\" width=\"30%\" /></a>\n <a href=\"https://www.datawaza.com/en/latest/explore.html#datawaza.explore.plot_map_ca\"><img src=\"https://www.datawaza.com/en/latest/_static/plot_map_ca.png\" width=\"30%\" style=\"margin:0 1%;\" /></a>\n <a href=\"https://www.datawaza.com/en/latest/explore.html#datawaza.explore.plot_3d\"><img src=\"https://www.datawaza.com/en/latest/_static/plot_3d.png\" width=\"30%\" /></a>\n</p>\n<p align=\"center\">\n <a href=\"https://www.datawaza.com/en/latest/model.html#datawaza.model.iterate_model\"><img src=\"https://www.datawaza.com/en/latest/_static/iterate_model_1.png\" width=\"30%\" /></a>\n <a href=\"https://www.datawaza.com/en/latest/model.html#datawaza.model.iterate_model\"><img src=\"https://www.datawaza.com/en/latest/_static/iterate_model_2.png\" width=\"30%\" style=\"margin:0 1%;\" /></a>\n <a href=\"https://www.datawaza.com/en/latest/model.html#datawaza.model.plot_results\"><img src=\"https://www.datawaza.com/en/latest/_static/plot_results.png\" width=\"30%\" /></a>\n</p>\n<p align=\"center\">\n <a href=\"https://www.datawaza.com/en/latest/explore.html#datawaza.explore.get_corr\"><img src=\"https://www.datawaza.com/en/latest/_static/get_corr.png\" width=\"30%\" /></a>\n <a href=\"https://www.datawaza.com/en/latest/clean.html#datawaza.clean.reduce_multicollinearity\"><img src=\"https://www.datawaza.com/en/latest/_static/reduce_multicollinearity.png\" width=\"30%\" style=\"margin:0 1%;\" /></a>\n <a href=\"https://www.datawaza.com/en/latest/explore.html#datawaza.explore.plot_corr\"><img src=\"https://www.datawaza.com/en/latest/_static/plot_corr.png\" width=\"30%\" /></a>\n</p>\n\nInstallation\n------------\n\nThe latest release can be found on [PyPI](https://pypi.org/project/datawaza/). See the [Change Log](CHANGELOG.md) for a history of changes. Install Datawaza with pip:\n\n pip install datawaza\n\nDocumentation\n-------------\n\nOnline documentation is available at [Datawaza.com](https://datawaza.com).\n\nThe [User Guide](https://www.datawaza.com/en/latest/userguide.html) is a Jupyter notebook that walks through how to use the Datawaza functions. It's probably the best place to start. There is also an API reference for the major modules: [Clean](https://www.datawaza.com/en/latest/clean.html), [Explore](https://www.datawaza.com/en/latest/explore.html), [Model](https://www.datawaza.com/en/latest/model.html), and [Tools](https://www.datawaza.com/en/latest/tools.html).\n\nDevelopment\n-----------\n\nThe [Datawaza repo](https://github.com/jbeno/datawaza) is on GitHub.\n\nPlease submit bugs that you encounter to the [Issue Tracker](https://github.com/jbeno/datawaza/issues). Contributions and ideas for enhancements are welcome! So far this is a solo effort, but I would love to collaborate.\n\nDependencies\n------------\n\nDatawaza supports Python 3.10. It may support other versions, but these have not been tested yet.\n\nDue to the breadth of use cases, installation requires NumPy, Pandas, Matplotlib, Seaborn, Plotly, Scikit-Learn, SciPy, Cartopy, GeoPandas, StatsModels, and a few other supporting packages. See the [Requirements.txt](https://github.com/jbeno/datawaza/blob/main/requirements.txt).\n\nWhat is Waza?\n-------------\n\nWaza (\u6280) means \"technique\" in Japanese. In martial arts like Aikido, it is paired with words like \"suwari-waza\" (sitting techniques) or \"kaeshi-waza\" (reversal techniques). So we've paired it with \"data\" to represent Data Science techniques: \u30c7\u30fc\u30bf\u6280 \"data-waza\".\n\nOrigin Story\n-------------\n\nMost of these functions were created while I was pursuing a [Professional Certificate in Machine Learning & Artificial Intelligence](https://em-executive.berkeley.edu/professional-certificate-machine-learning-artificial-intelligence) from U.C. Berkeley. With every assignment, I tried to simplify repetitive tasks and streamline my workflow. They served me well, and I hope you will find some value in them.\n\nQuick Start\n-----------\n\nThe [User Guide](https://www.datawaza.com/en/latest/userguide.html) will show you how to use Datawaza's functions in depth. Assuming you already have data loaded, here are some examples of what it can do:\n\n >>> import datawaza as dw\n \nShow the unique values of each variable below the threshold of n = 12:\n\n >>> dw.get_unique(df, 12, count=True, percent=True)\n\n CATEGORICAL: Variables with unique values equal to or below: 12\n \n job has 12 unique values:\n \n admin. 10422 25.3%\n blue-collar 9254 22.47%\n technician 6743 16.37%\n services 3969 9.64%\n management 2924 7.1%\n retired 1720 4.18%\n entrepreneur 1456 3.54%\n self-employed 1421 3.45%\n housemaid 1060 2.57%\n unemployed 1014 2.46%\n student 875 2.12%\n unknown 330 0.8%\n \n marital has 4 unique values:\n \n married 24928 60.52%\n single 11568 28.09%\n divorced 4612 11.2%\n unknown 80 0.19%\n\nPlot bar charts of categorical variables, dimensioned by the target variable:\n\n >>> dw.plot_charts(df, plot_type='cat', cat_cols=cat_columns, hue='y', rotation=90)\n\n![plot_charts output](https://www.datawaza.com/en/latest/_static/plot_charts_output.png)\n\nGet the top positive and negative correlations with the target variable, and save to lists:\n\n >>> pos_features, neg_features = dw.get_corr(df_enc, n=10, var='subscribed_enc', return_arrays=True)\n\n Top 10 positive correlations:\n Variable 1 Variable 2 Correlation\n 0 duration subscribed_enc 0.41\n 1 poutcome_success subscribed_enc 0.32\n 2 previously_contacted subscribed_enc 0.32\n 3 pdays subscribed_enc 0.27\n 4 previous subscribed_enc 0.23\n 5 month_mar subscribed_enc 0.14\n 6 month_oct subscribed_enc 0.14\n 7 month_sep subscribed_enc 0.12\n 8 no_default_1 subscribed_enc 0.10\n 9 job_student subscribed_enc 0.09\n \n Top 10 negative correlations:\n Variable 1 Variable 2 Correlation\n 0 nr.employed subscribed_enc -0.35\n 1 euribor3m subscribed_enc -0.31\n 2 emp.var.rate subscribed_enc -0.30\n 3 poutcome_nonexistent subscribed_enc -0.19\n 4 contact_telephone subscribed_enc -0.14\n 5 cons.price.idx subscribed_enc -0.14\n 6 month_may subscribed_enc -0.11\n 7 campaign subscribed_enc -0.07\n 8 job_blue-collar subscribed_enc -0.07\n 9 education_basic.9y subscribed_enc -0.05\n\nPlot a chart showing the top correlations with the target variable:\n\n >>> dw.plot_corr(df_enc, 'subscribed_enc', n=16, size=(12,6), rotation=90)\n\n![plot_corr output](https://www.datawaza.com/en/latest/_static/plot_corr_output.png)\n\nRun a model iteration, which dynamically assembles a pipeline and evaluates the model, including\ncharts of residuals, predicted vs. actual, and coefficients:\n\n >>> results_df, iteration_6 = dw.iterate_model(X2_train, X2_test, y2_train, y2_test,\n ... transformers=['ohe', 'log', 'poly3'], model='linreg',\n ... iteration='6', note='X2. Test size: 0.25, Pipeline: OHE > Log > Poly3 > LinReg',\n ... plot=True, lowess=True, coef=True, perm=True, vif=True, decimal=2,\n ... save=True, save_df=results_df, config=my_config)\n\n![iterate_model output 1 of 3](https://www.datawaza.com/en/latest/_static/iterate_model_output_1.png)\n![iterate_model output 2 of 3](https://www.datawaza.com/en/latest/_static/iterate_model_output_2.png)\n![iterate_model output 3 of 3](https://www.datawaza.com/en/latest/_static/iterate_model_output_3.png)\n\nCompare train/test scores across model iterations, and select the best result:\n\n >>> dw.plot_results(results_df, metrics=['Train MAE', 'Test MAE'], y_label='Mean Absolute Error',\n ... select_metric='Test MAE', select_criteria='min', decimal=0)\n\n![plot_results output](https://www.datawaza.com/en/latest/_static/plot_results_output.png)\n\nThis was just a sample of some Datawaza tools. Download [userguide.ipynb](https://github.com/jbeno/datawaza/blob/main/docs/userguide.ipynb) and explore the full breadth of the library in your Jupyter environment.\n",
"bugtrack_url": null,
"license": null,
"summary": "Datawaza is a collection of tools for data exploration, visualization, data cleaning, pipeline creation, model iteration, and evaluation.",
"version": "0.1.2",
"project_urls": {
"Documentation": "https://datawaza.com",
"Homepage": "https://datawaza.com",
"Source": "https://github.com/jbeno/datawaza"
},
"split_keywords": [
"data science",
" visualization",
" machine learning",
" data analysis"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "c4586ea930c1af89542a95719eb12f35411c2058ea640c32e28144fbe2300634",
"md5": "9365b31f98dd42429e2ca32f60610892",
"sha256": "d0b1483cf2b4b9b364173c0825a557ad2c7bd54a0b46cf970c5da88a949630b0"
},
"downloads": -1,
"filename": "datawaza-0.1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "9365b31f98dd42429e2ca32f60610892",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 2851369,
"upload_time": "2024-03-20T06:52:56",
"upload_time_iso_8601": "2024-03-20T06:52:56.350053Z",
"url": "https://files.pythonhosted.org/packages/c4/58/6ea930c1af89542a95719eb12f35411c2058ea640c32e28144fbe2300634/datawaza-0.1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "f37e84b1fd2a585b796fc56d91d6ca25b54ef58493ce9ad525cd5ae3dcfc89f2",
"md5": "3a5df821f5f2dc54cd6e14e79c1ee75e",
"sha256": "a5ab458febacebd7f764d613e53916e215f4d1e120bc51e64bd780d03c025724"
},
"downloads": -1,
"filename": "datawaza-0.1.2.tar.gz",
"has_sig": false,
"md5_digest": "3a5df821f5f2dc54cd6e14e79c1ee75e",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 2853251,
"upload_time": "2024-03-20T06:52:58",
"upload_time_iso_8601": "2024-03-20T06:52:58.774183Z",
"url": "https://files.pythonhosted.org/packages/f3/7e/84b1fd2a585b796fc56d91d6ca25b54ef58493ce9ad525cd5ae3dcfc89f2/datawaza-0.1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-03-20 06:52:58",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "jbeno",
"github_project": "datawaza",
"travis_ci": false,
"coveralls": true,
"github_actions": false,
"requirements": [],
"lcname": "datawaza"
}