# dtype_diet
> Attempt to shrink Pandas `dtypes` without losing data so you have more RAM (and maybe more speed)
## Install
`pip install dtype_diet`
# Documentation
https://noklam.github.io/dtype_diet/
## How to use
> This is a fork of https://github.com/ianozsvald/dtype_diet, maintained with approval from the original author @ianozsvald to continue support and development of the library.
This tool checks each column to see if larger dtypes (e.g. 8-byte `float64` and `int64`) could be shrunk to smaller `dtypes` without causing any data loss.
Dropping an 8-byte type to a 4- (or 2- or 1-) byte type keeps halving the RAM requirement for that column. Categoricals are proposed for `object` columns, which can bring significant speed and RAM benefits.
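As a rough illustration of that halving effect (plain pandas, not dtype_diet's own API):

```python
import numpy as np
import pandas as pd

# One million integers stored as int64 vs int32:
# each downcast halves the size of the column's data buffer
s64 = pd.Series(np.arange(1_000_000, dtype="int64"))
s32 = s64.astype("int32")
print(s64.memory_usage(deep=True))  # ~8,000,000 bytes
print(s32.memory_usage(deep=True))  # ~4,000,000 bytes
```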
Here's a minimal example with 3 lines of code running on a Kaggle dataset, showing a reduction from 957 MB to 85 MB; you can find the notebook in the [repository](https://github.com/noklam/dtype_diet/01_example.ipynb):
```python
#slow
# sell_prices.csv.zip
# Source data: https://www.kaggle.com/c/m5-forecasting-uncertainty/
import pandas as pd
from dtype_diet import report_on_dataframe, optimize_dtypes
df = pd.read_csv('data/sell_prices.csv')
proposed_df = report_on_dataframe(df, unit="MB")
new_df = optimize_dtypes(df, proposed_df)
print(f'Original df memory: {df.memory_usage(deep=True).sum()/1024/1024} MB')
print(f'Proposed df memory: {new_df.memory_usage(deep=True).sum()/1024/1024} MB')
```
Original df memory: 957.5197134017944 MB
Proposed df memory: 85.09655094146729 MB
```python
#slow
proposed_df
```
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Current dtype</th>
<th>Proposed dtype</th>
<th>Current Memory (MB)</th>
<th>Proposed Memory (MB)</th>
<th>Ram Usage Improvement (MB)</th>
<th>Ram Usage Improvement (%)</th>
</tr>
<tr>
<th>Column</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>store_id</th>
<td>object</td>
<td>category</td>
<td>203763.920410</td>
<td>3340.907715</td>
<td>200423.012695</td>
<td>98.360403</td>
</tr>
<tr>
<th>item_id</th>
<td>object</td>
<td>category</td>
<td>233039.977539</td>
<td>6824.677734</td>
<td>226215.299805</td>
<td>97.071456</td>
</tr>
<tr>
<th>wm_yr_wk</th>
<td>int64</td>
<td>int16</td>
<td>26723.191406</td>
<td>6680.844727</td>
<td>20042.346680</td>
<td>74.999825</td>
</tr>
<tr>
<th>sell_price</th>
<td>float64</td>
<td>None</td>
<td>26723.191406</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div>
Recommendations:
* Run `report_on_dataframe(your_df)` to get recommendations
* Run `optimize_dtypes(df, proposed_df)` to convert to the recommended dtypes.
* Consider if Categoricals will save you RAM (see Caveats below)
* Consider if f32 or f16 will be useful (see Caveats - f32 is _probably_ a reasonable choice unless you have huge ranges of floats)
* Consider if int32, int16, int8 will be useful (see Caveats - overflow may be an issue)
* Look at [`DataFrame.convert_dtypes`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html), which recommends Pandas' nullable dtype alternatives (e.g. instead of promoting an int64 column with NaN items to float64, you get Int64 with NaNs and no data loss)
* Look at Extension arrays like https://github.com/JDASoftwareGroup/rle-array (thanks @crepererum [for the tweet](https://twitter.com/crepererum/status/1267441357339201536))
Note that `report_on_dataframe(your_df)` only prints a report - no changes are made to your dataframe.
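To see why the nullable-dtype recommendation above matters, here is a plain-pandas sketch (unrelated to dtype_diet's own API) of the NaN-forces-float64 promotion that `convert_dtypes` avoids:

```python
import pandas as pd

# A missing value silently promotes an integer column to float64
s = pd.Series([1, 2, None])
print(s.dtype)                   # float64: the NaN forced the promotion

# convert_dtypes proposes the nullable Int64 dtype instead - no data loss
print(s.convert_dtypes().dtype)  # Int64
```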
## Caveats
* Reduced numeric ranges might lead to overflow (TODO: document)
* The category dtype can have unexpected effects, e.g. the need for `observed=True` in `groupby` (TODO: document)
* f16 is likely to be simulated on modern hardware, so calculations will be 2-3x slower than on f32 or f64
* We could do with a link that explains the binary representation of floats & ints for those wanting to learn more
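To make the overflow caveat concrete, here is a sketch in plain numpy/pandas (not dtype_diet's code) of why value ranges must be checked before proposing a smaller integer type:

```python
import numpy as np
import pandas as pd

a = np.array([30_000, 40_000], dtype=np.int64)

# int16 tops out at 32767, so a naive cast silently wraps around
print(a.astype(np.int16))  # [ 30000 -25536]

# pandas' to_numeric with downcast picks the smallest *lossless* integer type
print(pd.to_numeric(pd.Series(a), downcast="integer").dtype)  # int32
```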
## Development
### Contributors
* Antony Milbourne https://github.com/amilbourne
* Mani https://github.com/neomatrix369
### Local Setup
```
$ conda create -n dtype_diet python=3.8 pandas jupyter pyarrow pytest
$ conda activate dtype_diet
```
## Release
```
make release
```
# Contributing
The repository is developed with `nbdev`, a system for developing libraries with notebooks.
Make sure you run this command if you want to contribute to the library. For details, please refer to the [nbdev documentation](https://github.com/fastai/nbdev):
```
nbdev_install_git_hooks
```
Some other useful commands
```
nbdev_build_docs
nbdev_build_lib
nbdev_test_nbs
```