Name | nanocube |
Version | 0.2.1 |
home_page | https://github.com/Zeutschler/nanocube |
Summary | Lightning fast OLAP-style point queries on Pandas DataFrames. |
upload_time | 2024-10-14 21:50:18 |
maintainer | None |
docs_url | None |
author | Thomas Zeutschler |
requires_python | >=3.10 |
license | MIT License Copyright (c) 2024 Thomas Zeutschler Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
keywords | python, pandas, numpy, spark, data analysis, olap, cube, dataframe |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# NanoCube
## Lightning fast OLAP-style point queries on DataFrames.
![GitHub license](https://img.shields.io/github/license/Zeutschler/nanocube?color=A1C547)
![PyPI version](https://img.shields.io/pypi/v/nanocube?logo=pypi&logoColor=979DA4&color=A1C547)
![PyPI Downloads](https://img.shields.io/pypi/dm/nanocube.svg?logo=pypi&logoColor=979DA4&label=PyPI%20downloads&color=A1C547)
![GitHub last commit](https://img.shields.io/github/last-commit/Zeutschler/nanocube?logo=github&logoColor=979DA4&color=A1C547)
![unit tests](https://img.shields.io/github/actions/workflow/status/zeutschler/nanocube/python-package.yml?logo=GitHub&logoColor=979DA4&label=unit%20tests&color=A1C547)
-----------------
**NanoCube** is a minimalistic in-memory, in-process OLAP engine for lightning fast point queries
on Pandas DataFrames. NanoCube shines when filtering and/or point queries need to be executed on a DataFrame,
e.g. for financial data analysis, business intelligence or fast web services.
If you think it would be valuable to **extend NanoCube with additional OLAP features**
please let me know. You can reach out by opening an issue or contacting me
on [LinkedIn](https://www.linkedin.com/in/thomas-zeutschler/).
``` bash
pip install nanocube
```
```python
import pandas as pd
from nanocube import NanoCube
# create a DataFrame
df = pd.read_csv('sale_data.csv')
value = df.loc[df['make'].isin(['Audi', 'BMW']) & (df['engine'] == 'hybrid'), 'revenue'].sum()

# create a NanoCube and run sum-aggregated point queries
nc = NanoCube(df)
for i in range(1000):
    value = nc.get('revenue', make=['Audi', 'BMW'], engine='hybrid')
```
### Lightning fast - really?
Aggregated point queries with NanoCube are often 100x to 1,000x faster than with Pandas.
The more selective the query, the more you benefit from NanoCube. For highly selective queries,
NanoCube can even be 10,000x faster than Pandas. For non-selective queries, the advantage shrinks
from around 10x toward parity with Pandas, as both ultimately rely on Numpy for aggregation:
NanoCube only accelerates the filtering of data, not the aggregation.
For the special purpose of aggregative point queries, NanoCube is even faster than other
DataFrame-related technologies like Spark, Polars, Modin, Dask or Vaex. If such a library is
a drop-in replacement for Pandas, you should be able to speed up its filtering quite noticeably.
Try it and let me know how it performs.
NanoCube is only beneficial if several point queries (> 5) need to be executed, as the
initialization time of the NanoCube must be taken into account.
The more point queries you run, the more you benefit from NanoCube.
### Benchmark - NanoCube vs. Others
The following table shows the duration for a single point query on the
`car_prices_us` dataset (available on [kaggle.com](https://www.kaggle.com)) containing 16 columns and 558,837 rows.
The query is highly selective, filtering on 4 dimensions `(model='Optima', trim='LX', make='Kia', body='Sedan')` and
aggregating column `mmr`. The factor is the speedup of NanoCube vs. the respective technology.
To reproduce the benchmark, you can execute file [nano_vs_others.py](benchmarks/nano_vs_others.py).
| | technology | duration_sec | factor |
|---:|:-----------------|---------------:|---------:|
| 0 | NanoCube | 0.021 | 1 |
| 1 | SQLite (indexed) | 0.196 | 9.333 |
| 2 | Polars | 0.844 | 40.19 |
| 3 | DuckDB | 5.315 | 253.095 |
| 4 | SQLite | 17.54 | 835.238 |
| 5 | Pandas | 51.931 | 2472.91 |
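The factor column is simply each technology's duration divided by NanoCube's. A quick sketch reproducing it from the durations above (the durations are copied from the table, i.e. measurements from the original benchmark run, not values produced by this snippet):

```python
# Durations (seconds) per technology, taken from the benchmark table above.
durations = {
    "NanoCube": 0.021,
    "SQLite (indexed)": 0.196,
    "Polars": 0.844,
    "DuckDB": 5.315,
    "SQLite": 17.54,
    "Pandas": 51.931,
}

# Factor = duration of the technology / duration of NanoCube.
baseline = durations["NanoCube"]
factors = {tech: round(sec / baseline, 3) for tech, sec in durations.items()}
print(factors)
```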
### How is this possible?
NanoCube creates an in-memory multi-dimensional index over all relevant entities/columns in a dataframe.
Internally, Roaring Bitmaps (https://roaringbitmap.org) are used by default for representing the index.
Initialization may take some time, but yields very fast filtering and point queries. As an alternative
to Roaring Bitmaps, Numpy-based indexing can be used. Numpy-based indexes are faster if only
one filter is applied, but can be orders of magnitude slower when multiple filters are combined.
Approach: For each unique value in every relevant dimension column, a bitmap is created that represents the
rows in the DataFrame where this value occurs. These bitmaps can then be combined or intersected to determine
the rows relevant for a specific filter or point query. Once the relevant rows are determined, Numpy is then
used to aggregate the requested measures.
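The approach above can be sketched in a few lines of plain Python, with sets of row numbers standing in for Roaring Bitmaps. All names here are illustrative, not NanoCube's actual internals:

```python
# Tiny stand-in dataset: a list of dicts instead of a DataFrame.
rows = [
    {"make": "Audi", "engine": "hybrid", "revenue": 100},
    {"make": "BMW",  "engine": "petrol", "revenue": 150},
    {"make": "Audi", "engine": "petrol", "revenue": 120},
    {"make": "BMW",  "engine": "hybrid", "revenue": 180},
]

# Build one "bitmap" (here: a set of row ids) per unique value
# per dimension column.
index = {}
for i, row in enumerate(rows):
    for col in ("make", "engine"):
        index.setdefault((col, row[col]), set()).add(i)

def point_query(measure, **filters):
    # Union the bitmaps of the requested values within one dimension,
    # intersect across dimensions, then aggregate the measure over
    # the surviving rows.
    matching = set(range(len(rows)))
    for col, values in filters.items():
        values = values if isinstance(values, list) else [values]
        union = set().union(*(index.get((col, v), set()) for v in values))
        matching &= union
    return sum(rows[i][measure] for i in matching)

total = point_query("revenue", make=["Audi", "BMW"], engine="hybrid")
print(total)  # rows 0 and 3 match -> 100 + 180 = 280
```

Real bitmap implementations such as Roaring perform these unions and intersections on compressed bitmaps, which is what makes the filtering step so fast.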
NanoCube is a by-product of the CubedPandas project (https://github.com/Zeutschler/cubedpandas) and will be integrated
into CubedPandas in the future. But for now, NanoCube is a standalone library that can be used with
any Pandas DataFrame for the special purpose of point queries.
### Tips for using NanoCube
> **Tip**: Only include those columns in the NanoCube setup that you actually want to query!
> The more columns you include, the more memory and time are needed for initialization.
> ```python
> df = pd.read_csv('dataframe_with_100_columns.csv')
> nc = NanoCube(df, dimensions=['col1', 'col2'], measures=['col100'])
> ```
> **Tip**: If you have a DataFrame with more than 1 million rows, you may want to sort the DataFrame
> before creating the NanoCube. This can improve the performance of NanoCube significantly, up to 10x.
> **Tip**: NanoCubes can be saved to and loaded from disk. This is useful if you want to reuse a NanoCube
> for multiple queries or share it with others. NanoCubes are saved in Arrow format and
> load up to 4x faster than the respective Parquet DataFrame file.
> ```python
> nc = NanoCube(df, dimensions=['col1', 'col2'], measures=['col100'])
> nc.save('nanocube.nc')
> nc_reloaded = NanoCube.load('nanocube.nc')
> ```
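A plausible reason why sorting helps, as suggested in the tip above: sorting groups equal values into contiguous row ranges, so each value's bitmap collapses into a handful of runs, a shape that run-length-aware bitmap formats such as Roaring compress and intersect very cheaply. A tiny sketch counting runs per column illustrates the effect:

```python
def runs_per_value(column):
    """Count contiguous runs of identical values in a column."""
    runs = 0
    prev = object()  # sentinel that equals nothing in the column
    for v in column:
        if v != prev:
            runs += 1
            prev = v
    return runs

column = ["B", "A", "B", "A", "B", "A", "B", "A"]
print(runs_per_value(column))          # 8 runs: every row starts a new run
print(runs_per_value(sorted(column)))  # 2 runs: one per distinct value
```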
### What price do I have to pay?
NanoCube is free and MIT licensed. The price to pay is additional memory consumption, typically
around 25% on top of the original DataFrame depending on the use case, plus the time needed to
initialize the multi-dimensional index, typically around 250k rows/sec depending on the number of
columns to be indexed and your hardware. The initialization time is proportional to the number of rows in the DataFrame (see below).
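Based on the ~250k rows/sec figure above, a rough, hardware-dependent estimate of initialization time can be computed (a planning aid, not a measurement):

```python
def estimated_init_seconds(n_rows, rows_per_sec=250_000):
    """Rough NanoCube initialization time, assuming a fixed indexing throughput."""
    return n_rows / rows_per_sec

print(estimated_init_seconds(558_837))     # ~2.2 s for the car_prices_us dataset
print(estimated_init_seconds(10_000_000))  # ~40 s for 10M rows
```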
You may want to try and adapt the included samples [`sample.py`](samples/sample.py) and benchmarks
[`benchmark.py`](benchmarks/benchmark.py) and [`benchmark.ipynb`](benchmarks/benchmark.ipynb) to test the behavior of NanoCube
on your data.
## NanoCube Benchmarks
Using the Python script [benchmark.py](benchmarks/benchmark.py), the following comparison charts can be created.
The data set contains 7 dimension columns and 2 measure columns.
#### Point query for a single row
A highly selective query, fully qualified and filtering on all 7 dimensions. It returns and aggregates a single row.
NanoCube is 250x up to 60,000x faster than Pandas, depending on the number of rows in the DataFrame;
the more rows, the faster NanoCube is in comparison to Pandas.
![Point query for single row](benchmarks/charts/s.png)
#### Point query on high cardinality column
A highly selective query, filtering on a single high-cardinality dimension, where each member
represents ±0.01% of rows. NanoCube is 100x or more faster than Pandas.
![Query on single high cardinality column](benchmarks/charts/hk.png)
#### Point query aggregating 0.1% of rows
A highly selective query, filtering on 1 dimension; it affects and aggregates 0.1% of rows.
NanoCube is 100x or more faster than Pandas.
![Point query aggregating 0.1% of rows](benchmarks/charts/m.png)
#### Point query aggregating 5% of rows
A barely selective query, filtering on 2 dimensions; it affects and aggregates 5% of rows.
NanoCube is consistently 10x faster than Pandas. But you can already see that the
aggregation in Numpy becomes more dominant: compare the line for the number of returned
records with the line for the NanoCube response time, they are almost parallel.
![Point query aggregating 5% of rows](benchmarks/charts/l.png)
If the DataFrame is sorted beforehand - low-cardinality dimension columns first, higher-cardinality
columns last - the performance of NanoCube can improve dramatically, ranging from no speedup
up to ±10x or more. Here, the same query as above, but on a sorted DataFrame.
![Point query for single row](benchmarks/charts/l_sorted.png)
#### Point query aggregating 50% of rows
A non-selective query, filtering on 1 dimension that affects and aggregates 50% of rows.
Here, most of the time is spent in Numpy, aggregating the rows. The more
rows, the closer Pandas and NanoCube get as both rely finally on Numpy for
aggregation, which is very fast.
![Point query aggregating 50% of rows](benchmarks/charts/xl.png)
#### NanoCube initialization time
The time required to initialize a NanoCube instance is almost linear in the number of rows.
The initialization throughput heavily depends on the number of dimension columns.
A custom file format will be added soon, allowing ±4x faster loading
of a NanoCube compared to loading the respective Parquet DataFrame file
using Arrow.
![NanoCube initialization time](benchmarks/charts/init.png)
Raw data
{
"_id": null,
"home_page": "https://github.com/Zeutschler/nanocube",
"name": "nanocube",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": "Thomas Zeutschler <cubedpandas@gmail.com>",
"keywords": "python, pandas, numpy, spark, data analysis, OLAP, cube, dataframe",
"author": "Thomas Zeutschler",
"author_email": "cubedpandas@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/72/2b/ee0b25f441f0826418781e8bc3a53de4e40f13f368ffb2de10aff1accf36/nanocube-0.2.1.tar.gz",
"platform": "any",
"description": "(identical to the README shown above)",
"bugtrack_url": null,
"license": "MIT License Copyright (c) 2024 Thomas Zeutschler Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
"summary": "Lightning fast OLAP-style point queries on Pandas DataFrames.",
"version": "0.2.1",
"project_urls": {
"Documentation": "https://github.com/Zeutschler/nanocube",
"Homepage": "https://github.com/Zeutschler/nanocube",
"Issues": "https://github.com/Zeutschler/nanocube/issues",
"Repository": "https://github.com/Zeutschler/nanocube.git",
"pypi": "https://pypi.org/project/nanocube/"
},
"split_keywords": [
"python",
" pandas",
" numpy",
" spark",
" data analysis",
" olap",
" cube",
" dataframe"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "9f51678ac394ef50d24e8a336603cb475c92fb769646e6f89acb346e35094d71",
"md5": "08f2b387288072015b12a6283fa92fb5",
"sha256": "1c1c6abd14150934f353bb1f9ffad5a27f1d7e063e84639595d77b5b143538bc"
},
"downloads": -1,
"filename": "nanocube-0.2.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "08f2b387288072015b12a6283fa92fb5",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 12776,
"upload_time": "2024-10-14T21:50:16",
"upload_time_iso_8601": "2024-10-14T21:50:16.700133Z",
"url": "https://files.pythonhosted.org/packages/9f/51/678ac394ef50d24e8a336603cb475c92fb769646e6f89acb346e35094d71/nanocube-0.2.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "722bee0b25f441f0826418781e8bc3a53de4e40f13f368ffb2de10aff1accf36",
"md5": "02ba3cd335eff71b7b3bd841a4a62377",
"sha256": "5737404f67c8cc1cc05c308a9fc08d04bf0e99568009dd07b1b3c904e8368863"
},
"downloads": -1,
"filename": "nanocube-0.2.1.tar.gz",
"has_sig": false,
"md5_digest": "02ba3cd335eff71b7b3bd841a4a62377",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 16409,
"upload_time": "2024-10-14T21:50:18",
"upload_time_iso_8601": "2024-10-14T21:50:18.422696Z",
"url": "https://files.pythonhosted.org/packages/72/2b/ee0b25f441f0826418781e8bc3a53de4e40f13f368ffb2de10aff1accf36/nanocube-0.2.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-14 21:50:18",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Zeutschler",
"github_project": "nanocube",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "nanocube"
}