# dtype-diet

- **Version:** 0.0.2 (PyPI)
- **Homepage:** https://github.com/noklam/dtype_diet/tree/master/
- **Summary:** Attempt to shrink Pandas `dtypes` without losing data so you have more RAM (and maybe more speed)
- **Upload time:** 2023-09-27 23:22:26
- **Author:** noklam
- **Requires Python:** >=3.7
- **License:** MIT License
- **Keywords:** pandas, optimization
# dtype_diet
> Attempt to shrink Pandas `dtypes` without losing data so you have more RAM (and maybe more speed)



## Install

`pip install dtype_diet`

## Documentation
https://noklam.github.io/dtype_diet/

## How to use

> This is a fork of https://github.com/ianozsvald/dtype_diet to continue supporting and developing the library, with approval from the original author @ianozsvald.

This tool checks each column to see if larger dtypes (e.g. 8-byte `float64` and `int64`) could be shrunk to smaller `dtypes` without causing any data loss.
Dropping an 8-byte type to a 4-byte (or 2- or 1-byte) type halves (or quarters, or eighths) the RAM requirement for that column. Categoricals are proposed for `object` columns, which can bring significant speed and RAM benefits.
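
As a quick sanity check of that arithmetic, here is a small sketch in plain pandas (the column values are made up for illustration):

```python
import numpy as np
import pandas as pd

n = 1_000_000
ints = pd.Series(np.random.randint(0, 100, size=n, dtype="int64"))
print(ints.memory_usage(deep=True))                 # ~8 MB: 8 bytes per row
print(ints.astype("int8").memory_usage(deep=True))  # ~1 MB: these values fit in -128..127

strs = pd.Series(np.random.choice(["CA_1", "TX_2", "WI_3"], size=n))
print(strs.memory_usage(deep=True))                     # tens of MB of Python string objects
print(strs.astype("category").memory_usage(deep=True))  # ~1 MB: int8 codes plus 3 labels
```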


Here's a minimal example with 3 lines of code running on a Kaggle dataset, showing a reduction from 957 MB to 85 MB; you can find the notebook in the [repository](https://github.com/noklam/dtype_diet/01_example.ipynb):

```python
#slow
# sell_prices.csv.zip 
# Source data: https://www.kaggle.com/c/m5-forecasting-uncertainty/
import pandas as pd
from dtype_diet import report_on_dataframe, optimize_dtypes
df = pd.read_csv('data/sell_prices.csv')
proposed_df = report_on_dataframe(df, unit="MB")
new_df = optimize_dtypes(df, proposed_df)
print(f'Original df memory: {df.memory_usage(deep=True).sum()/1024/1024} MB')
print(f'Proposed df memory: {new_df.memory_usage(deep=True).sum()/1024/1024} MB')
```

    Original df memory: 957.5197134017944 MB
    Proposed df memory: 85.09655094146729 MB
    

```python
#slow
proposed_df
```




| Column | Current dtype | Proposed dtype | Current Memory (MB) | Proposed Memory (MB) | RAM Usage Improvement (MB) | RAM Usage Improvement (%) |
|---|---|---|---|---|---|---|
| store_id | object | category | 203763.920410 | 3340.907715 | 200423.012695 | 98.360403 |
| item_id | object | category | 233039.977539 | 6824.677734 | 226215.299805 | 97.071456 |
| wm_yr_wk | int64 | int16 | 26723.191406 | 6680.844727 | 20042.346680 | 74.999825 |
| sell_price | float64 | None | 26723.191406 | NaN | NaN | NaN |



Recommendations:

* Run `report_on_dataframe(your_df)` to get recommendations
* Run `optimize_dtypes(df, proposed_df)` to convert to the recommended dtypes
* Consider if Categoricals will save you RAM (see Caveats below)
* Consider if f32 or f16 will be useful (see Caveats - f32 is _probably_ a reasonable choice unless you have huge ranges of floats)
* Consider if int32, int16, or int8 will be useful (see Caveats - overflow may be an issue)
* Look at [`DataFrame.convert_dtypes`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html), which recommends Pandas nullable dtype alternatives (e.g. instead of promoting an int64 column with NaN items to float64, you get a nullable Int64 with NaNs and no data loss); see the sketch after this list
* Look at extension arrays like [rle-array](https://github.com/JDASoftwareGroup/rle-array) (thanks @crepererum [for the tweet](https://twitter.com/crepererum/status/1267441357339201536))
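
A minimal sketch of the nullable-dtype point above (plain pandas, not part of dtype_diet):

```python
import pandas as pd

s = pd.Series([10, 20, None])
print(s.dtype)                   # float64: the missing value forced promotion
print(s.convert_dtypes().dtype)  # Int64: nullable integer, no data loss
```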

`report_on_dataframe(your_df)` only generates a report - no changes are made to your dataframe.
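
If you want a quick, self-contained check without downloading the Kaggle data, here is a tiny sketch using the same two functions (the column names and values are made up):

```python
import pandas as pd
from dtype_diet import report_on_dataframe, optimize_dtypes

df = pd.DataFrame({
    "wm_yr_wk": pd.Series(range(100), dtype="int64"),  # 0..99 fits comfortably in int8
    "store_id": ["CA_1", "TX_2"] * 50,                 # low-cardinality strings
})
proposed = report_on_dataframe(df, unit="MB")
print(optimize_dtypes(df, proposed).dtypes)  # expect e.g. int8 and category
```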

## Caveats

* Reduced numeric ranges might lead to overflow (see the sketch below)
* The category dtype can have unexpected effects, e.g. needing `observed=True` in `groupby` (see the sketch below)
* f16 is likely to be simulated on modern hardware, so calculations will be 2-3x slower than on f32 or f64
* We could do with a link that explains the binary representation of floats & ints for those wanting to learn more
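
Two of the caveats above, sketched with illustrative snippets (plain pandas, not the dtype_diet API):

```python
import pandas as pd

# Overflow: values that fit a smaller dtype today can silently wrap
# after later arithmetic.
s = pd.Series([100, 120], dtype="int8")  # int8 holds -128..127
print(s + 10)                            # 110 is fine, but 120 + 10 wraps to -126

# groupby on a categorical: unobserved categories appear in the result
# unless observed=True is passed.
df = pd.DataFrame({
    "store": pd.Categorical(["CA_1"], categories=["CA_1", "TX_2"]),
    "price": [1.0],
})
print(df.groupby("store", observed=False)["price"].sum())  # includes a zero row for TX_2
print(df.groupby("store", observed=True)["price"].sum())   # only CA_1
```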

## Development 


### Contributors

* Antony Milbourne https://github.com/amilbourne
* Mani https://github.com/neomatrix369

### Local Setup

```
$ conda create -n dtype_diet python=3.8 pandas jupyter pyarrow pytest
$ conda activate dtype_diet
```

## Release
```
make release
```
## Contributing
The repository is developed with `nbdev`, a system for developing libraries with notebooks.

Make sure you run the following command if you want to contribute to the library. For details, please refer to the [nbdev documentation](https://github.com/fastai/nbdev):
```
nbdev_install_git_hooks
```

Some other useful commands:
```
nbdev_build_docs
nbdev_build_lib
nbdev_test_nbs
```



            
