clean-df


:Name: clean-df
:Version: 0.3.0
:Home page: https://github.com/naelaqel/clean_df
:Summary: Python module to report, clean, and optimize Pandas Dataframes effectively
:Upload time: 2023-08-22 21:24:59
:Author: Nael Aqel
:Requires Python: >=3.7
:License: MIT license
:Keywords: clean_df, cleaning, data analysis, data science, wrangling, reporting, optimization, outliers, missing
:Requirements: No requirements were recorded.

========
clean_df
========

.. image:: https://img.shields.io/pypi/v/clean_df.svg
        :target: https://pypi.python.org/pypi/clean_df

.. image:: https://github.com/NaelAqel/clean_df/actions/workflows/test.yml/badge.svg
   :target: https://github.com/NaelAqel/clean_df/actions/workflows/test.yml

.. image:: https://readthedocs.org/projects/clean-df/badge/?version=latest
        :target: https://clean-df.readthedocs.io/en/latest/?version=latest
        :alt: Documentation Status

.. image:: https://img.shields.io/pypi/l/clean_df.svg
   :target: https://github.com/NaelAqel/clean_df/blob/main/LICENSE  
  
  
  
Python module to report, clean, and optimize **pandas DataFrames** effectively.

**Full Documentation** `Here`_.

.. _Here: https://naelaqel.com/clean_df/
  
Description and Features
------------------------
The first step of any data analysis project is to check and clean the data. In this module I implemented efficient code that can:

* Automatically drop columns that hold a single unique value (such columns carry no information, so they are dropped).
* Report on your **Pandas DataFrame** to help you decide which actions to take. The report covers:

  #. Duplicated rows.
  #. Columns whose data types can be changed to save memory.
  #. Columns that can be converted to categorical.
  #. Outliers.
  #. Missing values.


* Clean the dataframe by dropping columns that have a high ratio of missing values, rows with missing values, and duplicated rows in the dataframe.

* Optimize the dataframe by converting columns to the desired data type and converting categorical columns to 'category' data type.
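
As an illustration of the first feature, the sketch below drops single-valued columns with plain pandas. It is not the module's own code: the sample data, the variable names, and the choice to count ``NaN`` as a value of its own are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample data; "country" holds one value in every row.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "country": ["JO", "JO", "JO", "JO"],
    "score": [0.5, 0.7, None, 0.9],
})

# Count distinct values per column, treating NaN as a value of its own
# (an assumption; the module may treat missing values differently).
n_unique = df.nunique(dropna=False)

# Columns with exactly one distinct value carry no information.
constant_cols = n_unique[n_unique == 1].index.tolist()
reduced = df.drop(columns=constant_cols)

print(constant_cols)
print(list(reduced.columns))
```

A column that is constant across all rows cannot distinguish one row from another, which is why dropping it loses nothing for analysis.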

Installation
------------
To install ``clean_df``, run this command in your terminal:: 

    $ pip install clean_df

For more information on installation details for this project, please see the ``docs/installation.rst`` file.


    
Usage
-----
This module is very easy to use. For a full, detailed example, please see the ``docs/usage.rst`` file.

Importing the module
^^^^^^^^^^^^^^^^^^^^
::

        from clean_df import CleanDataFrame   

Defining the class
^^^^^^^^^^^^^^^^^^
Pass your pandas dataframe to the ``CleanDataFrame`` class::

        cdf = CleanDataFrame(
                df=df,             # the dataframe to be cleaned
                max_num_cat=5      # maximum number of unique values for a column to be
                                   # converted to the categorical data type, default is 5
                )
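
To make the role of ``max_num_cat`` concrete, here is a minimal pandas sketch of how such a threshold could select candidate columns. The helper ``categorical_candidates`` is hypothetical, and the rule (at most ``max_num_cat`` distinct values) is an assumed reading of the parameter description, not the module's confirmed logic.

```python
import pandas as pd

def categorical_candidates(df, max_num_cat=5):
    """Hypothetical helper: list columns with at most max_num_cat distinct values."""
    counts = df.nunique()
    return [
        col for col in df.columns
        if counts[col] <= max_num_cat and df[col].dtype.name != "category"
    ]

# Hypothetical sample data.
df = pd.DataFrame({
    "city": ["Amman", "Irbid", "Amman", "Zarqa", "Amman", "Irbid"],
    "user_id": [101, 102, 103, 104, 105, 106],
})

candidates = categorical_candidates(df, max_num_cat=5)
print(candidates)
```

Here ``city`` has three distinct values and qualifies, while ``user_id`` has six and does not.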

Reporting
^^^^^^^^^
Call the ``report`` method to see a full report about the dataframe (duplicated rows, columns whose data types can be optimized, categorical columns, outliers, and missing values)::

        cdf.report(
                show_matrix=True,   # show the missing values matrix plot (from the missingno package), default is True
                show_heat=True,     # show the missing values heatmap (from the missingno package), default is True
                matrix_kws={},      # extra keyword arguments for the matrix plot, default is {}
                heat_kws={}         # extra keyword arguments for the heatmap plot, default is {}
                )
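
For readers who want to see the kind of numbers such a report is built from, the sketch below computes three of them with plain pandas. It is only an approximation under stated assumptions: the sample data is made up, and the 1.5 × IQR outlier rule is a common convention chosen for the example, since the module's own outlier rule is not described here.

```python
import pandas as pd

# Hypothetical sample data with one repeated row, one missing value,
# and one extreme value in column "a".
df = pd.DataFrame({
    "a": [1.0, 2.0, 2.0, 3.0, 100.0, None],
    "b": ["x", "y", "y", "z", "x", "y"],
})

# Duplicated rows: each repeat of an earlier row is counted once.
n_duplicates = int(df.duplicated().sum())

# Missing values per column, as a count and as a share of all rows.
missing_count = df.isna().sum()
missing_ratio = df.isna().mean()

# Outliers in a numeric column by the 1.5 * IQR rule (an assumed rule).
q1, q3 = df["a"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["a"] < q1 - 1.5 * iqr) | (df["a"] > q3 + 1.5 * iqr)
n_outliers = int(is_outlier.sum())

print(n_duplicates, n_outliers)
print(missing_ratio)
```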

Cleaning
^^^^^^^^
Call the ``clean`` method to drop columns with a high ratio of missing values, duplicated rows, and rows with missing values::

        cdf.clean(
                min_missing_ratio=0.05,    # the minimum ratio of missing values to drop a column, default is 0.05
                drop_nan=True,             # if True, drop the rows with missing values after dropping columns
                                           # with a missing ratio above min_missing_ratio
                drop_kws={},               # extra keyword arguments for pd.DataFrame.drop(), default is {}
                drop_duplicates_kws={}     # same as drop_kws, but for pd.DataFrame.drop_duplicates()
                )

Optimizing
^^^^^^^^^^
Call the ``optimize`` method to optimize the dataframe by changing each column's data type, based on its values, for maximum memory savings::

        cdf.optimize()
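
The idea behind this kind of optimization can be sketched with standard pandas calls. ``optimize_sketch`` is a hypothetical function for illustration only; which target types the module actually picks is not stated here, so the downcasting strategy below is an assumption.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype, is_numeric_dtype

def optimize_sketch(df, max_num_cat=5):
    """Hypothetical sketch: shrink numeric types and categorize small text columns."""
    out = df.copy()
    for col in out.columns:
        if is_integer_dtype(out[col]):
            # Smallest signed integer type that can hold every value.
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif is_float_dtype(out[col]):
            # float32 when the values survive the conversion.
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif not is_numeric_dtype(out[col]) and out[col].nunique() <= max_num_cat:
            # Non-numeric columns with few distinct values become categories.
            out[col] = out[col].astype("category")
    return out

# Hypothetical sample data.
df = pd.DataFrame({
    "count": [1, 2, 3, 4, 5, 6],
    "price": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5],
    "grade": ["a", "b", "a", "c", "b", "a"],
})

optimized = optimize_sketch(df)
print(optimized.dtypes)
```

Comparing ``df.memory_usage(deep=True)`` before and after is a simple way to check how much memory a conversion like this saves on real data.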


Accessing the Cleaned Data DataFrame
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
::

        cdf.df 


  
Contributing
------------
See ``CONTRIBUTING.rst`` for contribution details. Feel free to contact me about anything through my:

* `Email`_
* `LinkedIn`_
* `WhatsApp`_

You are also welcome to visit my personal `blog`_.

.. _Email: mailto:dev@naelaqel.com
.. _LinkedIn: https://www.linkedin.com/in/naelaqel1
.. _WhatsApp: https://wa.me/962796780232
.. _blog: https://naelaqel.com

   

License
-------
Free software: MIT license.

    

Documentation
-------------
* The full documentation is hosted on my `website`_, and on `ReadTheDocs`_.
* The source code is available on `GitHub`_.

.. _website: https://naelaqel.com/clean_df/
.. _ReadTheDocs: https://clean-df.readthedocs.io
.. _GitHub: https://github.com/naelaqel/clean_df

    
    
Credits
-------
* This package was created with Cookiecutter_ and the `audreyr/cookiecutter-pypackage`_ project template.  
* Here are `additional`_ resources that I learned a lot from.

.. _Cookiecutter: https://github.com/audreyr/cookiecutter
.. _`audreyr/cookiecutter-pypackage`: https://github.com/audreyr/cookiecutter-pypackage
.. _`additional`: https://naelaqel.com/resources/


=======
History
=======
0.3.0 (2023-08-23)
------------------
* Improve performance when calling the ``report`` method.
* The ``pytest`` suite now covers all methods in the module.

0.2.3 (2023-03-04)
------------------
* Improve memory consumption and module performance.

0.2.2 (2023-03-03)
------------------
* Fix a bug that raised a "dict_keys" error in some special cases.

0.2.1 (2023-03-03)
------------------
* Improve module performance.

0.2.0 (2023-03-02)
------------------
* Add a new report for categorical columns.
* Make the module more efficient.

0.1.1 (2023-02-27)
------------------
* Rectify and organize documentation.
* Add tests to GitHub Actions.

0.1.0 (2023-02-27)
------------------

* First release on PyPI.

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/naelaqel/clean_df",
    "name": "clean-df",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "clean_df,cleaning,data analysis,data science,wrangling,reporting,optimization,outliers,missing",
    "author": "Nael Aqel",
    "author_email": "dev@naelaqel.com",
    "download_url": "https://files.pythonhosted.org/packages/ea/09/522afe48a2f2bc41dedce50a7fcd00398114ade3bec40fd2a0285179cf41/clean_df-0.3.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT license",
    "summary": "Python module to report, clean, and optimize Pandas Dataframes effectively",
    "version": "0.3.0",
    "project_urls": {
        "Homepage": "https://github.com/naelaqel/clean_df"
    },
    "split_keywords": [
        "clean_df",
        "cleaning",
        "data analysis",
        "data science",
        "wrangling",
        "reporting",
        "optimization",
        "outliers",
        "missing"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "61324712907b66148e9977ccfa76efb1b441a2664d759e99ed3e47f5994ad786",
                "md5": "67a8f37f6be096e99af1d28ce3e8aeea",
                "sha256": "26085edad095995e96f12c6e9e4ee523ebf5477103e22f474dc2a3f731bb682d"
            },
            "downloads": -1,
            "filename": "clean_df-0.3.0-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "67a8f37f6be096e99af1d28ce3e8aeea",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": ">=3.7",
            "size": 11717,
            "upload_time": "2023-08-22T21:24:54",
            "upload_time_iso_8601": "2023-08-22T21:24:54.723870Z",
            "url": "https://files.pythonhosted.org/packages/61/32/4712907b66148e9977ccfa76efb1b441a2664d759e99ed3e47f5994ad786/clean_df-0.3.0-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ea09522afe48a2f2bc41dedce50a7fcd00398114ade3bec40fd2a0285179cf41",
                "md5": "cffeb5992f714964a78fe6f14e615c33",
                "sha256": "defe0284ddf9352d6d6ced16e6e9408337561f018bd8bf3b365a63511028360b"
            },
            "downloads": -1,
            "filename": "clean_df-0.3.0.tar.gz",
            "has_sig": false,
            "md5_digest": "cffeb5992f714964a78fe6f14e615c33",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 410984,
            "upload_time": "2023-08-22T21:24:59",
            "upload_time_iso_8601": "2023-08-22T21:24:59.176528Z",
            "url": "https://files.pythonhosted.org/packages/ea/09/522afe48a2f2bc41dedce50a7fcd00398114ade3bec40fd2a0285179cf41/clean_df-0.3.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-08-22 21:24:59",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "naelaqel",
    "github_project": "clean_df",
    "travis_ci": true,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "tox": true,
    "lcname": "clean-df"
}
        