# CaTabRa-pandas
<p align="center">
<a href="#About"><b>About</b></a> •
<a href="#Quickstart"><b>Quickstart</b></a> •
<a href="#References"><b>References</b></a> •
<a href="#Contact"><b>Contact</b></a> •
<a href="#Acknowledgments"><b>Acknowledgments</b></a>
</p>
## About
**CaTabRa-pandas** is a Python library with a couple of useful functions for efficiently working with [pandas](https://pandas.pydata.org/) DataFrames. In particular, many of these functions are concerned with DataFrames containing *intervals*, i.e., DataFrames with (at least) two columns `"start"` and `"stop"` defining the left and right endpoints of intervals.
**Highlights**:
* Resample observations with respect to arbitrary (possibly irregular, possibly overlapping) windows: `catabra_pandas.resample_eav` and `catabra_pandas.resample_interval`.
* Compute the intersection, union, difference, etc. of intervals: `catabra_pandas.combine_intervals`.
* Group intervals by their distance to each other: `catabra_pandas.group_intervals`.
* For each point in a given DataFrame, find the interval that contains it: `catabra_pandas.find_containing_interval`.
* Find the previous/next observation for each entry in a DataFrame of timestamped observations: `catabra_pandas.prev_next_values`.

None of these functions has a native pandas equivalent. **CaTabRa-pandas** implements each of them efficiently, so DataFrames with 10M+ rows are no problem.

**[Dask](https://docs.dask.org/en/stable/index.html) DataFrames are partly supported, too.**

If you are interested in **CaTabRa-pandas**, you might also be interested in **[CaTabRa](https://github.com/risc-mi/catabra)**: a full-fledged tabular data analysis framework that lets you calculate statistics, generate appealing visualizations and train machine learning models with a single command.
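
To make the interval operations listed above more concrete, the following is a minimal plain-pandas sketch of what the *union* of intervals means: overlapping or touching intervals are merged into maximal blocks. The function `naive_interval_union` is a hypothetical helper written for illustration only; it is not part of **CaTabRa-pandas** and says nothing about how `catabra_pandas.combine_intervals` is called or implemented.

```python
import pandas as pd


def naive_interval_union(intervals: pd.DataFrame) -> pd.DataFrame:
    """Merge overlapping or touching intervals (hypothetical helper, illustration only)."""
    ivs = intervals.sort_values("start")
    # Largest right endpoint seen strictly before each row (NaN for the first row).
    prev_max_stop = ivs["stop"].cummax().shift()
    # A new merged block begins when an interval starts after everything seen so far.
    starts_new_block = ivs["start"] > prev_max_stop
    block_id = starts_new_block.cumsum().to_numpy()
    return (
        ivs.groupby(block_id)
        .agg(start=("start", "min"), stop=("stop", "max"))
        .reset_index(drop=True)
    )


example = pd.DataFrame({"start": [7, 1, 2, 8], "stop": [8, 3, 5, 9]})
merged = naive_interval_union(example)
print(merged)
```

In this sketch, intervals that merely touch (such as `[7, 8]` and `[8, 9]`) are merged; whether the library treats endpoints the same way is not stated here and should be checked in its documentation.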
## Quickstart
**CaTabRa-pandas** has minimal requirements and can be installed in any environment with Python >= 3.6.1 and pandas >= 1.0, for example with `pip install catabra-pandas`.

Once installed, **CaTabRa-pandas** can be readily used:
```python
import pandas as pd
import catabra_pandas

# use-case: resample observations wrt. given windows
observations = pd.DataFrame(
    data={
        "subject_id": [0, 0, 0, 0, 1, 1],
        "attribute": ["HR", "Temp", "HR", "HR", "Temp", "HR"],
        "timestamp": [1, 1, 5, 7, 2, 3],
        "value": [82.7, 36.9, 79.5, 78.7, 37.2, 89.4]
    }
)
windows = pd.DataFrame(
    data={
        ("subject_id", ""): [0, 0, 1],
        ("timestamp", "start"): [0, 4, 1],
        ("timestamp", "stop"): [6, 8, 4]
    }
)
catabra_pandas.resample_eav(
    observations,
    windows,
    agg={
        "HR": ["mean", "p75", "r-1"],   # mean value, 75-th percentile, last observed value
        "Temp": ["count", "mode"]       # number of observations, mode
    },
    entity_col="subject_id",
    time_col="timestamp",
    attribute_col="attribute",
    value_col="value"
)
```
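
For readers who want to see what such a window-based resampling computes, here is a naive plain-pandas reference for a single aggregation (the mean of one attribute per window). It is a sketch for illustration, not the library's implementation: `naive_window_mean` is a hypothetical helper, the windows use flat `"start"`/`"stop"` columns instead of the two-level columns above, and the assumption that both window endpoints are inclusive is mine.

```python
import pandas as pd

observations = pd.DataFrame(
    {
        "subject_id": [0, 0, 0, 0, 1, 1],
        "attribute": ["HR", "Temp", "HR", "HR", "Temp", "HR"],
        "timestamp": [1, 1, 5, 7, 2, 3],
        "value": [82.7, 36.9, 79.5, 78.7, 37.2, 89.4],
    }
)
# Same windows as above, but with flat column names (assumption for simplicity).
windows = pd.DataFrame(
    {"subject_id": [0, 0, 1], "start": [0, 4, 1], "stop": [6, 8, 4]}
)


def naive_window_mean(observations, windows, attribute):
    """Mean value of `attribute` in each window (hypothetical helper, illustration only)."""
    obs = observations[observations["attribute"] == attribute]
    # Pair every window with every observation of the same subject.
    pairs = windows.rename_axis("window").reset_index().merge(
        obs, on="subject_id", how="left"
    )
    # Keep pairs whose timestamp lies in the window (both endpoints inclusive here).
    inside = pairs["timestamp"].between(pairs["start"], pairs["stop"])
    means = pairs[inside].groupby("window")["value"].mean()
    # Windows without any matching observation get NaN.
    return means.reindex(windows.index)


hr_mean = naive_window_mean(observations, windows, "HR")
print(hr_mean)
```

This approach materializes all window-observation pairs per subject, so it only serves as a readable baseline on small data; it does not scale the way a dedicated implementation can.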
```python
import pandas as pd
import catabra_pandas

# use-case: find containing intervals
# note: intervals must be pairwise disjoint (in each group)
intervals = pd.DataFrame(
    data={
        "subject_id": [0, 0, 1],
        "start": [0.5, 3.0, -10.7],
        "stop": [2.3, 10., 10.7]
    }
)
points = pd.DataFrame(
    data={
        "subject_id": [0, 0, 0, 1, 1],
        "point": [1.0, 2.5, 9.9, 0.0, -8.8]
    }
)
catabra_pandas.find_containing_interval(
    points,
    intervals,
    ["point"],
    start_col="start",
    stop_col="stop",
    group_by="subject_id"
)
```
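
The lookup performed in the example above can be sketched in plain pandas and NumPy with a binary search per group, which is also why the intervals need to be pairwise disjoint. The helper `naive_containing_interval` is hypothetical and only illustrates the idea; treating both endpoints as inclusive and returning `-1` for points outside every interval are assumptions of this sketch, not documented behavior of `catabra_pandas.find_containing_interval`.

```python
import numpy as np
import pandas as pd

intervals = pd.DataFrame(
    {"subject_id": [0, 0, 1], "start": [0.5, 3.0, -10.7], "stop": [2.3, 10.0, 10.7]}
)
points = pd.DataFrame(
    {"subject_id": [0, 0, 0, 1, 1], "point": [1.0, 2.5, 9.9, 0.0, -8.8]}
)


def naive_containing_interval(points, intervals):
    """Index label of the interval containing each point, or -1 (illustration only)."""
    result = pd.Series(-1, index=points.index, dtype="int64")
    for subject, pts in points.groupby("subject_id"):
        ivs = intervals[intervals["subject_id"] == subject].sort_values("start")
        if ivs.empty:
            continue
        starts = ivs["start"].to_numpy()
        stops = ivs["stop"].to_numpy()
        values = pts["point"].to_numpy()
        # Position of the last interval whose start is <= the point (-1 if none).
        pos = np.searchsorted(starts, values, side="right") - 1
        safe = np.clip(pos, 0, None)
        # Disjoint intervals: only that one candidate can contain the point.
        hit = (pos >= 0) & (values <= stops[safe])
        result.loc[pts.index[hit]] = ivs.index.to_numpy()[safe[hit]]
    return result


containing = naive_containing_interval(points, intervals)
print(containing)
```

Sorting the intervals once per group and using `np.searchsorted` keeps the cost at roughly O((n + m) log n) per group instead of comparing every point with every interval.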
## References
**If you use CaTabRa-pandas in your research, we would appreciate it if you cited the following conference paper:**
* A. Maletzky, S. Kaltenleithner, P. Moser and M. Giretzlehner.
*CaTabRa: Efficient Analysis and Predictive Modeling of Tabular Data*. In: I. Maglogiannis, L. Iliadis, J. MacIntyre
and M. Dominguez (eds), Artificial Intelligence Applications and Innovations (AIAI 2023). IFIP Advances in
Information and Communication Technology, vol 676, pp 57-68, 2023.
[DOI:10.1007/978-3-031-34107-6_5](https://doi.org/10.1007/978-3-031-34107-6_5)
```
@inproceedings{CaTabRa2023,
author = {Maletzky, Alexander and Kaltenleithner, Sophie and Moser, Philipp and Giretzlehner, Michael},
editor = {Maglogiannis, Ilias and Iliadis, Lazaros and MacIntyre, John and Dominguez, Manuel},
title = {{CaTabRa}: Efficient Analysis and Predictive Modeling of Tabular Data},
booktitle = {Artificial Intelligence Applications and Innovations},
year = {2023},
publisher = {Springer Nature Switzerland},
address = {Cham},
pages = {57--68},
isbn = {978-3-031-34107-6},
doi = {10.1007/978-3-031-34107-6_5}
}
```
## Contact
If you have any inquiries, please open a GitHub issue.
## Acknowledgments
This project is financed by research subsidies granted by the government of Upper Austria. RISC Software GmbH is a member
of the UAR (Upper Austrian Research) Innovation Network.
Raw data
{
"_id": null,
"home_page": "https://github.com/risc-mi/catabra-pandas",
"name": "catabra-pandas",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0.0,>=3.6.1",
"maintainer_email": null,
"keywords": null,
"author": "RISC Software GmbH",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/20/d3/0bf54c5e80c65f90ae6b29d01ad700f999846e238b3bf6e439f52a15f8f2/catabra_pandas-0.0.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache 2.0 with Commons Clause",
"summary": "CaTabRa-pandas is a library with additional functionality for pandas",
"version": "0.0.1",
"project_urls": {
"Homepage": "https://github.com/risc-mi/catabra-pandas",
"Repository": "https://github.com/risc-mi/catabra-pandas"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "c39c6ad6079658b4d840ccd5494f02b395f884b5593522c21ce1e518e7721f8b",
"md5": "5c0130e157977e4b23596a8089f4d567",
"sha256": "1f65ebb7a35e4d73ec108549d7dff08faded704c68fa4a2dde742d11b25c867c"
},
"downloads": -1,
"filename": "catabra_pandas-0.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "5c0130e157977e4b23596a8089f4d567",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0.0,>=3.6.1",
"size": 39842,
"upload_time": "2024-09-16T11:25:17",
"upload_time_iso_8601": "2024-09-16T11:25:17.563450Z",
"url": "https://files.pythonhosted.org/packages/c3/9c/6ad6079658b4d840ccd5494f02b395f884b5593522c21ce1e518e7721f8b/catabra_pandas-0.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "20d30bf54c5e80c65f90ae6b29d01ad700f999846e238b3bf6e439f52a15f8f2",
"md5": "a9efd7f63c5a458c3f6085c785b486f0",
"sha256": "964256c1807d7b25fd2fa36910f9e415e7d1cccbedcaa0a6705c6476d392c6a6"
},
"downloads": -1,
"filename": "catabra_pandas-0.0.1.tar.gz",
"has_sig": false,
"md5_digest": "a9efd7f63c5a458c3f6085c785b486f0",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0.0,>=3.6.1",
"size": 39315,
"upload_time": "2024-09-16T11:25:18",
"upload_time_iso_8601": "2024-09-16T11:25:18.817934Z",
"url": "https://files.pythonhosted.org/packages/20/d3/0bf54c5e80c65f90ae6b29d01ad700f999846e238b3bf6e439f52a15f8f2/catabra_pandas-0.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-09-16 11:25:18",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "risc-mi",
"github_project": "catabra-pandas",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "catabra-pandas"
}