========
clean_df
========
.. image:: https://img.shields.io/pypi/v/clean_df.svg
   :target: https://pypi.python.org/pypi/clean_df

.. image:: https://github.com/NaelAqel/clean_df/actions/workflows/test.yml/badge.svg
   :target: https://github.com/NaelAqel/clean_df/actions/workflows/test.yml

.. image:: https://readthedocs.org/projects/clean-df/badge/?version=latest
   :target: https://clean-df.readthedocs.io/en/latest/?version=latest
   :alt: Documentation Status

.. image:: https://img.shields.io/pypi/l/clean_df.svg
   :target: https://github.com/NaelAqel/clean_df/blob/main/LICENSE

Python module to report, clean, and optimize **Pandas Dataframes** effectively.
**Full Documentation** `Here`_.
.. _Here: https://naelaqel.com/clean_df/
Description and Features
------------------------
The first step of any data analysis project is to check and clean the data. This module provides efficient code that can:

* Automatically drop columns that contain only a single unique value (such columns carry no information).
* Report on your **Pandas DataFrame** to help you decide which actions to take. The report shows:

  #. Duplicated rows.
  #. Columns whose data types can be changed to save memory.
  #. Columns that can be converted to categorical.
  #. Outliers.
  #. Missing values.

* Clean the dataframe by dropping columns that have a high ratio of missing values, rows with missing values, and duplicated rows.
* Optimize the dataframe by converting columns to the desired data type and converting categorical columns to the ``category`` data type.

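To make the first feature and the duplicated-rows report concrete, here is a minimal sketch in plain pandas. It is not the module's implementation; the sample data and the variable names (``constant_cols``, ``n_duplicates``) are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical sample data: "source" holds the same value in every row.
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 3],
        "source": ["web", "web", "web", "web"],
        "score": [0.5, 0.7, 0.9, 0.9],
    }
)

# A column with exactly one distinct value carries no information.
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) == 1]
df = df.drop(columns=constant_cols)

# Count fully duplicated rows (each later copy of an earlier row).
n_duplicates = int(df.duplicated().sum())

print(constant_cols)  # ['source']
print(n_duplicates)   # 1
```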
Installation
------------
To install ``clean_df``, run this command in your terminal::

    $ pip install clean_df

For more information on installation details for this project, please see the ``docs/installation.rst`` file.
Usage
-----
This module is easy to use. For a fully detailed example, see the ``docs/usage.rst`` file.
Importing the module
^^^^^^^^^^^^^^^^^^^^
::

    from clean_df import CleanDataFrame

Defining the class
^^^^^^^^^^^^^^^^^^
Pass your pandas dataframe to the ``CleanDataFrame`` class::

    cdf = CleanDataFrame(
        df=df,         # the dataframe to be cleaned
        max_num_cat=5  # maximum number of unique values for a column to be
    )                  # converted to the categorical data type, default is 5

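As a rough illustration of what a threshold like ``max_num_cat`` means, the sketch below picks out columns with few distinct values using plain pandas. This is an assumption about the general idea, not the module's actual selection logic; the sample data and the name ``categorical_candidates`` are hypothetical.

```python
import pandas as pd

max_num_cat = 5  # same value as the default shown above

# Hypothetical sample data.
df = pd.DataFrame(
    {
        "city": ["Amman", "Irbid", "Amman", "Zarqa", "Amman", "Irbid"],
        "name": ["a", "b", "c", "d", "e", "f"],
        "age": [31, 45, 27, 52, 38, 41],
    }
)

# Columns with at most max_num_cat distinct values are candidates
# for conversion to the 'category' data type.
categorical_candidates = [
    col for col in df.columns if df[col].nunique() <= max_num_cat
]

print(categorical_candidates)  # ['city']
```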
Reporting
^^^^^^^^^
Call the ``report`` method to see a full report about the dataframe (duplicated rows, columns whose data types can be optimized, categorical columns, outliers, and missing values)::

    cdf.report(
        show_matrix=True,  # show the missing-values matrix plot (from the missingno package), default is True
        show_heat=True,    # show the missing-values heatmap (from the missingno package), default is True
        matrix_kws={},     # extra keyword arguments for the matrix plot, default is {}
        heat_kws={}        # extra keyword arguments for the heatmap, default is {}
    )

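For readers who want to see what two of these report sections measure, here is a small plain-pandas sketch of a missing-values summary and an outlier count. The 1.5 × IQR rule used here is an assumption chosen for illustration; the module may use a different outlier criterion. The sample data and the helper ``count_iqr_outliers`` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one extreme height and two missing weights.
df = pd.DataFrame(
    {
        "height": [170.0, 165.0, 172.0, 168.0, 171.0, 250.0],
        "weight": [70.0, np.nan, 80.0, 65.0, np.nan, 75.0],
    }
)

# Missing values: count and ratio per column.
missing = pd.DataFrame({"count": df.isna().sum(), "ratio": df.isna().mean()})


def count_iqr_outliers(series: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (assumed rule)."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return int(mask.sum())


outliers = {col: count_iqr_outliers(df[col]) for col in df.columns}

print(missing)
print(outliers)  # {'height': 1, 'weight': 0}
```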
Cleaning
^^^^^^^^
Call the ``clean`` method to drop columns with a high ratio of missing values, duplicated rows, and rows with missing values::

    cdf.clean(
        min_missing_ratio=0.05,  # minimum ratio of missing values for a column to be dropped, default is 0.05
        drop_nan=True,           # if True, drop the rows with missing values after dropping the
                                 # columns whose missing ratio is above min_missing_ratio
        drop_kws={},             # extra keyword arguments for pd.DataFrame.drop(), default is {}
        drop_duplicates_kws={}   # same as drop_kws, but for the drop_duplicates function
    )

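The steps described above can be approximated in plain pandas as follows. This is only a sketch under stated assumptions: the ``>=`` comparison against the threshold and the order in which duplicates are removed are guesses, not confirmed behaviour of ``clean``. The sample data and the names ``cols_to_drop`` and ``cleaned`` are hypothetical.

```python
import numpy as np
import pandas as pd

min_missing_ratio = 0.05

# Hypothetical sample data with 40 rows.
df = pd.DataFrame(
    {
        "a": [1.0, 2.0, 2.0, 4.0] * 10,                           # no missing values
        "b": [np.nan, 1.0, 1.0, 1.0] * 10,                        # 25% missing
        "c": [5.0, 6.0, 6.0, 8.0] * 9 + [5.0, 6.0, 6.0, np.nan],  # 2.5% missing
    }
)

# 1. Drop columns whose missing ratio reaches the threshold (assumed >=).
missing_ratio = df.isna().mean()
cols_to_drop = missing_ratio[missing_ratio >= min_missing_ratio].index.tolist()

# 2. Drop the remaining rows with missing values, then duplicated rows.
cleaned = (
    df.drop(columns=cols_to_drop)
    .dropna()
    .drop_duplicates()
    .reset_index(drop=True)
)

print(cols_to_drop)   # ['b']
print(cleaned.shape)  # (3, 2)
```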
Optimizing
^^^^^^^^^^
Call the ``optimize`` method to reduce the dataframe's memory usage by changing each column's data type based on its values::

    cdf.optimize()

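To show the kind of saving that data type changes can give, here is a plain-pandas sketch that downcasts numeric columns and converts a low-cardinality text column to ``category``. It is not the module's implementation; the sample data and the names ``before`` and ``after`` are hypothetical. Note that downcasting floats to ``float32`` can lose precision for values that are not exactly representable.

```python
import pandas as pd

# Hypothetical sample data with 1000 rows.
df = pd.DataFrame(
    {
        "count": pd.Series([1, 2, 3, 4] * 250, dtype="int64"),
        "price": pd.Series([1.5, 2.5, 3.5, 4.5] * 250, dtype="float64"),
        "grade": ["low", "mid", "high", "mid"] * 250,
    }
)
before = df.memory_usage(deep=True).sum()

optimized = df.copy()
# Use the smallest numeric types that can hold the values.
optimized["count"] = pd.to_numeric(optimized["count"], downcast="integer")
optimized["price"] = pd.to_numeric(optimized["price"], downcast="float")
# Store repeated labels once and reference them by integer codes.
optimized["grade"] = optimized["grade"].astype("category")
after = optimized.memory_usage(deep=True).sum()

print(optimized.dtypes)
print(f"memory: {before} bytes -> {after} bytes")
```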
Accessing the Cleaned DataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
::

    cdf.df

Contributing
------------
See ``CONTRIBUTING.rst`` for contribution details. Feel free to contact me about anything through:
* `Email`_
* `LinkedIn`_
* `WhatsApp`_
You are also welcome to visit my personal `blog`_.
.. _Email: mailto:dev@naelaqel.com
.. _LinkedIn: https://www.linkedin.com/in/naelaqel1
.. _WhatsApp: https://wa.me/962796780232
.. _blog: https://naelaqel.com
License
-------
Free software: MIT license.
Documentation
-------------
* The full documentation is hosted on my `website`_ and on `ReadTheDocs`_.
* The source code is available on `GitHub`_.
.. _website: https://naelaqel.com/clean_df/
.. _ReadTheDocs: https://clean-df.readthedocs.io
.. _GitHub: https://github.com/naelaqel/clean_df
Credits
-------
* This package was created with Cookiecutter_ and the `audreyr/cookiecutter-pypackage`_ project template.
* I also learned a lot from these `additional`_ resources.
.. _Cookiecutter: https://github.com/audreyr/cookiecutter
.. _`audreyr/cookiecutter-pypackage`: https://github.com/audreyr/cookiecutter-pypackage
.. _`additional`: https://naelaqel.com/resources/
=======
History
=======
0.3.0 (2023-08-23)
------------------
* Improve performance when calling the ``report`` method.
* The ``pytest`` suite now covers all methods in the module.
0.2.3 (2023-03-04)
------------------
* Improve memory consumption and module performance.
0.2.2 (2023-03-03)
------------------
* Fix a bug that raised a "dict_keys" error in some special cases.
0.2.1 (2023-03-03)
------------------
* Improve module performance.
0.2.0 (2023-03-02)
------------------
* Add a new report for categorical columns.
* Make the module more efficient.
0.1.1 (2023-02-27)
------------------
* Rectify and organize documentation.
* Add tests that run on GitHub Actions.
0.1.0 (2023-02-27)
------------------
* First release on PyPI.
Raw data
{
"_id": null,
"home_page": "https://github.com/naelaqel/clean_df",
"name": "clean-df",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": "",
"keywords": "clean_df,cleaning,data analysis,data science,wrangling,reporting,optimization,outliers,missing",
"author": "Nael Aqel",
"author_email": "dev@naelaqel.com",
"download_url": "https://files.pythonhosted.org/packages/ea/09/522afe48a2f2bc41dedce50a7fcd00398114ade3bec40fd2a0285179cf41/clean_df-0.3.0.tar.gz",
"platform": null,
"description": "========\r\nclean_df\r\n========\r\n\r\n.. image:: https://img.shields.io/pypi/v/clean_df.svg\r\n :target: https://pypi.python.org/pypi/clean_df\r\n\r\n.. image:: https://github.com/NaelAqel/clean_df/actions/workflows/test.yml/badge.svg\r\n :target: https://github.com/NaelAqel/clean_df/actions/workflows/test.yml\r\n\r\n.. image:: https://readthedocs.org/projects/clean-df/badge/?version=latest\r\n :target: https://clean-df.readthedocs.io/en/latest/?version=latest\r\n :alt: Documentation Status\r\n\r\n.. image:: https://img.shields.io/pypi/l/clean_df.svg\r\n :target: https://github.com/NaelAqel/clean_df/blob/main/LICENSE \r\n \r\n \r\n \r\nPython module to report, clean, and optimize **Pandas Dataframes** effectively.\r\n\r\n**Full Documentation** `Here`_.\r\n\r\n.. _Here: https://naelaqel.com/clean_df/\r\n \r\nDescription and Features\r\n------------------------\r\nThe first step of any data analysis project is to check and clean the data, in this module I implemented a very effiecint code that can: \r\n\r\n* Automatically drop columns that have a unique value (these columns are useless, so it will be dropped).\r\n* Report your **Pandas DataFrame** to decide for actions, this report will show: \r\n\r\n #. Duplicated rows report.\r\n #. Columns\u00e2\u20ac\u2122 Datatype to optimize memory report.\r\n #. Columns to convert to categorical report.\r\n #. Outliers report.\r\n #. 
Missing values report.\r\n\r\n\r\n* Clean the dataframe by dropping columns that have a high ratio of missing values, rows with missing values, and duplicated rows in the dataframe.\r\n\r\n* Optimize the dataframe by converting columns to the desired data type and converting categorical columns to 'category' data type.\r\n\r\nInstallation\r\n------------\r\nTo install ``clean_df``, run this command in your terminal:: \r\n\r\n $ pip install clean_df\r\n\r\nFor more information on installation details for this project, please see the ``docs/installation.rst`` file.\r\n\r\n\r\n \r\nUsage\r\n-----\r\nThis module is very easy to use, for a full detailed example please see the ``docs/usage.rst`` file.\r\n\r\nImporting the module\r\n^^^^^^^^^^^^^^^^^^^^\r\n::\r\n\r\n from clean_df import CleanDataFrame \r\n\r\nDefining the class\r\n^^^^^^^^^^^^^^^^^^\r\nPass your pandas dataframe to ``CleanDataFrame`` class::\r\n\r\n cdf = CleanDataFrame(\r\n df=df, # the dataframe to be cleaned\r\n max_num_cat=5 # maximum number of unique values in a column to be \r\n ) # converted to categorical datatype, default is 5\r\n\r\nReporting\r\n^^^^^^^^^\r\nCall ``report`` method to see a full report about the dataframe (duplications, columns to optimize its data types, categorical columns, outliers, and missing values::\r\n\r\n cdf.report(\r\n show_matrix=True, # show matrix missing values (from missingno package), default is True\r\n show_heat=True, # show heat missing values (from missingno package), default is True\r\n matrix_kws={}, # if need to pass any arguments to matrix plot, default is {}\r\n heat_kws={} # if need to pass any arguments to heat plot, default is {}\r\n )\r\n\r\nCleaning\r\n^^^^^^^^\r\nCall ``clean`` method to drop high number of missing value columns, duplicated rows, and rows with missing values::\r\n\r\n cdf.clean(\r\n min_missing_ratio=0.05, # the minimum ratio of missing values to drop a column, default is 0.05\r\n drop_nan=True # if True, drop the rows with 
missing values after dropping columns \r\n # with missingsa above min_missing_ratio\r\n drop_kws={}, # if need to pass any arguments to pd.DataFrame.drop(), default is {}\r\n drop_duplicates_kws={} # same drop_kws, but for drop_duplicates function\r\n )\r\n\r\nOptimizing\r\n^^^^^^^^^^\r\nCall ``optimize`` method to optimize the dataframe by changing columns' data types based on its values for maximum memory savings::\r\n\r\n cdf.optimize()\r\n\r\n\r\nAccessing the Cleaned Data DataFrame\r\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r\n::\r\n\r\n cdf.df \r\n\r\n\r\n \r\nContributing\r\n------------\r\nSee the ``CONTRIBUTING.rst`` for contribution details. Feel free to contact me for any subject through my: \r\n\r\n* `Email`_\r\n* `LinkedIn`_\r\n* `WhatsApp`_\r\n\r\nAlso, you are welcomed to visit my personal `blog`_ .\r\n\r\n.. _Email: mailto:dev@naelaqel.com\r\n.. _LinkedIn: https://www.linkedin.com/in/naelaqel1\r\n.. _WhatsApp: https://wa.me/962796780232\r\n.. _blog: https://naelaqel.com\r\n\r\n \r\n\r\nLicense\r\n-------\r\nFree software: MIT license.\r\n\r\n \r\n\r\nDocumentation\r\n-------------\r\n* The full documentation is hosted on my `website`_, and on `ReadTheDocs`_.\r\n* The source code is available in `GitHub`_.\r\n\r\n.. _website: https://naelaqel.com/clean_df/\r\n.. _ReadTheDocs: https://clean_df.readthedocs.io\r\n.. _GitHub: https://github.com/naelaqel/clean_df\r\n\r\n \r\n \r\nCredits\r\n-------\r\n* This package was created with Cookiecutter_ and the `audreyr/cookiecutter-pypackage`_ project template. \r\n* Here are `additional`_ resources I got a lot from them.\r\n\r\n.. _Cookiecutter: https://github.com/audreyr/cookiecutter\r\n.. _`audreyr/cookiecutter-pypackage`: https://github.com/audreyr/cookiecutter-pypackage\r\n.. 
_`additional`: https://naelaqel.com/resources/\r\n\r\n\r\n=======\r\nHistory\r\n=======\r\n0.3.0 (2023-08-23)\r\n------------------\r\n* Improve the performance when calling ``report`` method.\r\n* The ``pytest`` now is including the full methods in the module. \r\n\r\n0.2.3 (2023-03-04)\r\n------------------\r\n* Improve memory consumption and module performance.\r\n\r\n0.2.2 (2023-03-03)\r\n------------------\r\n* Fix a bug that made \"dict_keys\" error in some speical cases.\r\n\r\n0.2.1 (2023-03-03)\r\n------------------\r\n* Improve module performance.\r\n\r\n0.2.0 (2023-03-02)\r\n------------------\r\n* Add a new report for categorical columns.\r\n* Make the module more efficient.\r\n\r\n0.1.1 (2023-02-27)\r\n------------------\r\n* Rectify and organize documentation.\r\n* Provide test to GitHub Actions.\r\n\r\n0.1.0 (2023-02-27)\r\n------------------\r\n\r\n* First release on PyPI.\r\n",
"bugtrack_url": null,
"license": "MIT license",
"summary": "Python module to report, clean, and optimize Pandas Dataframes effectively",
"version": "0.3.0",
"project_urls": {
"Homepage": "https://github.com/naelaqel/clean_df"
},
"split_keywords": [
"clean_df",
"cleaning",
"data analysis",
"data science",
"wrangling",
"reporting",
"optimization",
"outliers",
"missing"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "61324712907b66148e9977ccfa76efb1b441a2664d759e99ed3e47f5994ad786",
"md5": "67a8f37f6be096e99af1d28ce3e8aeea",
"sha256": "26085edad095995e96f12c6e9e4ee523ebf5477103e22f474dc2a3f731bb682d"
},
"downloads": -1,
"filename": "clean_df-0.3.0-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "67a8f37f6be096e99af1d28ce3e8aeea",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": ">=3.7",
"size": 11717,
"upload_time": "2023-08-22T21:24:54",
"upload_time_iso_8601": "2023-08-22T21:24:54.723870Z",
"url": "https://files.pythonhosted.org/packages/61/32/4712907b66148e9977ccfa76efb1b441a2664d759e99ed3e47f5994ad786/clean_df-0.3.0-py2.py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "ea09522afe48a2f2bc41dedce50a7fcd00398114ade3bec40fd2a0285179cf41",
"md5": "cffeb5992f714964a78fe6f14e615c33",
"sha256": "defe0284ddf9352d6d6ced16e6e9408337561f018bd8bf3b365a63511028360b"
},
"downloads": -1,
"filename": "clean_df-0.3.0.tar.gz",
"has_sig": false,
"md5_digest": "cffeb5992f714964a78fe6f14e615c33",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 410984,
"upload_time": "2023-08-22T21:24:59",
"upload_time_iso_8601": "2023-08-22T21:24:59.176528Z",
"url": "https://files.pythonhosted.org/packages/ea/09/522afe48a2f2bc41dedce50a7fcd00398114ade3bec40fd2a0285179cf41/clean_df-0.3.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-08-22 21:24:59",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "naelaqel",
"github_project": "clean_df",
"travis_ci": true,
"coveralls": false,
"github_actions": true,
"requirements": [],
"tox": true,
"lcname": "clean-df"
}