| Field | Value |
|---|---|
| Name | catabra-pandas |
| Version | 0.0.1 |
| Home page | https://github.com/risc-mi/catabra-pandas |
| Summary | CaTabRa-pandas is a library with additional functionality for pandas |
| Upload time | 2024-09-16 11:25:18 |
| Maintainer | None |
| Docs URL | None |
| Author | RISC Software GmbH |
| Requires Python | `<4.0.0,>=3.6.1` |
| License | Apache 2.0 with Commons Clause |
| Requirements | No requirements were recorded. |

# CaTabRa-pandas

<p align="center">
  <a href="#About"><b>About</b></a> &bull;
  <a href="#Quickstart"><b>Quickstart</b></a> &bull;
  <a href="#References"><b>References</b></a> &bull;
  <a href="#Contact"><b>Contact</b></a> &bull;
  <a href="#Acknowledgments"><b>Acknowledgments</b></a>
</p>

[![Platform Support](https://img.shields.io/badge/python->=3.6-blue)]()
[![Platform Support](https://img.shields.io/badge/pandas->=1.0-blue)]()
[![Platform Support](https://img.shields.io/badge/platform-Linux%20|%20Windows%20|%20MacOS-blue)]()

## About

**CaTabRa-pandas** is a Python library with a couple of useful functions for efficiently working with [pandas](https://pandas.pydata.org/) DataFrames. In particular, many of these functions are concerned with DataFrames containing *intervals*, i.e., DataFrames with (at least) two columns `"start"` and `"stop"` defining the left and right endpoints of intervals.

**Highlights**:
* Resample observations with respect to arbitrary (possibly irregular, possibly overlapping) windows: `catabra_pandas.resample_eav` and `catabra_pandas.resample_interval`.
* Compute the intersection, union, difference, etc. of intervals: `catabra_pandas.combine_intervals`.
* Group intervals by their distance to each other: `catabra_pandas.group_intervals`.
* For each point in a given DataFrame, find the interval that contains it: `catabra_pandas.find_containing_interval`.
* Find the previous/next observation for each entry in a DataFrame of timestamped observations: `catabra_pandas.prev_next_values`.

None of these operations has a direct native pandas equivalent, and each is implemented *extremely efficiently* in **CaTabRa-pandas**: DataFrames with 10M+ rows are no problem.
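To make the interval operations above more concrete, here is a minimal plain-pandas sketch of one of them, the union of possibly overlapping intervals. It is not the implementation used by **CaTabRa-pandas** and does not call its API; the function name `union_of_intervals` is hypothetical, and the sketch assumes closed intervals, so touching intervals are merged.

```python
import pandas as pd


def union_of_intervals(df, start_col="start", stop_col="stop"):
    """Merge overlapping or touching intervals into disjoint ones (naive sketch)."""
    s = df.sort_values(start_col).reset_index(drop=True)
    # Largest right endpoint among all rows that come before the current one.
    prev_max_stop = s[stop_col].cummax().shift()
    # A new merged interval begins when a row starts after everything before it has ended.
    # The first row compares against NaN, which yields False, so it lands in block 0.
    block = (s[start_col] > prev_max_stop).cumsum().rename("block")
    merged = s.groupby(block).agg({start_col: "min", stop_col: "max"})
    return merged.reset_index(drop=True)


intervals = pd.DataFrame({"start": [7, 1, 12, 2, 8], "stop": [8, 3, 13, 5, 9]})
merged = union_of_intervals(intervals)
print(merged)
```

For the five intervals above, the result consists of the three disjoint intervals `[1, 5]`, `[7, 9]` and `[12, 13]`.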

**[Dask](https://docs.dask.org/en/stable/index.html) DataFrames are partly supported, too.**

If you are interested in **CaTabRa-pandas**, you might be interested in **[CaTabRa](https://github.com/risc-mi/catabra)**, too: **CaTabRa** is a full-fledged tabular data analysis framework that enables you to calculate statistics, generate appealing visualizations and train machine learning models with a single command.

## Quickstart

**CaTabRa-pandas** has minimal requirements and can be installed in any environment with Python >= 3.6.1 (and < 4.0) and pandas >= 1.0, for example with `pip install catabra-pandas`.

Once installed, **CaTabRa-pandas** can be readily used:

```python
import pandas as pd
import catabra_pandas

# use-case: resample observations wrt. given windows
observations = pd.DataFrame(
    data={
        "subject_id": [0, 0, 0, 0, 1, 1],
        "attribute": ["HR", "Temp", "HR", "HR", "Temp", "HR"],
        "timestamp": [1, 1, 5, 7, 2, 3],
        "value": [82.7, 36.9, 79.5, 78.7, 37.2, 89.4]
    }
)
windows = pd.DataFrame(
    data={
        ("subject_id", ""): [0, 0, 1],
        ("timestamp", "start"): [0, 4, 1],
        ("timestamp", "stop"): [6, 8, 4]
    }
)
catabra_pandas.resample_eav(
    observations,
    windows,
    agg={
        "HR": ["mean", "p75", "r-1"],   # mean value, 75th percentile, last observed value
        "Temp": ["count", "mode"]       # number of observations, mode
    },
    entity_col="subject_id",
    time_col="timestamp",
    attribute_col="attribute",
    value_col="value"
)
```
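To illustrate what resampling with respect to windows means, the following plain-pandas sketch recomputes only the per-window mean by brute force. It is a cross-check under stated assumptions, not the library's method: the helper `mean_per_window` is hypothetical, the windows use flat `"start"`/`"stop"` columns instead of the two-level columns above, and both endpoints are treated as inclusive (which makes no difference here, because no observation falls exactly on a window endpoint).

```python
import pandas as pd

observations = pd.DataFrame(
    data={
        "subject_id": [0, 0, 0, 0, 1, 1],
        "attribute": ["HR", "Temp", "HR", "HR", "Temp", "HR"],
        "timestamp": [1, 1, 5, 7, 2, 3],
        "value": [82.7, 36.9, 79.5, 78.7, 37.2, 89.4]
    }
)
# Same windows as above, but with flat column names (a simplification for this sketch).
windows = pd.DataFrame(
    data={
        "subject_id": [0, 0, 1],
        "start": [0, 4, 1],
        "stop": [6, 8, 4]
    }
)


def mean_per_window(observations, windows, attribute):
    """Mean value of one attribute in each window, computed naively."""
    obs = observations[observations["attribute"] == attribute]
    # Pair every window with every observation of the same subject (quadratic in the worst case).
    pairs = (
        windows.reset_index()
        .rename(columns={"index": "window"})
        .merge(obs, on="subject_id", how="inner")
    )
    inside = (pairs["timestamp"] >= pairs["start"]) & (pairs["timestamp"] <= pairs["stop"])
    means = pairs[inside].groupby("window")["value"].mean()
    # Windows without any matching observation get NaN.
    return means.reindex(windows.index)


hr_mean = mean_per_window(observations, windows, "HR")
temp_mean = mean_per_window(observations, windows, "Temp")
print(hr_mean)
print(temp_mean)
```

The mean heart rate comes out at about 81.1, 79.1 and 89.4 for the three windows, and the second window has no temperature observation. The pairwise merge grows with the product of observations and windows per subject, which is the kind of cost a dedicated implementation avoids on large inputs.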

```python
import pandas as pd
import catabra_pandas

# use-case: find containing intervals
# note: intervals must be pairwise disjoint (in each group)
intervals = pd.DataFrame(
    data={
        "subject_id": [0, 0, 1],
        "start": [0.5, 3.0, -10.7],
        "stop": [2.3, 10., 10.7]
    }
)
points = pd.DataFrame(
    data={
        "subject_id": [0, 0, 0, 1, 1],
        "point": [1.0, 2.5, 9.9, 0.0, -8.8]
    }
)
catabra_pandas.find_containing_interval(
    points,
    intervals,
    ["point"],
    start_col="start",
    stop_col="stop",
    group_by="subject_id"
)
```
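The idea behind looking up containing intervals can be sketched in plain pandas with `pandas.merge_asof`, which works precisely because the intervals in each group are pairwise disjoint. This is an illustrative sketch, not how **CaTabRa-pandas** implements it and not its return format: the helper `containing_interval_sketch` is hypothetical, intervals are assumed to be closed on both ends, and points outside every interval are mapped to `NaN`.

```python
import pandas as pd

intervals = pd.DataFrame(
    data={
        "subject_id": [0, 0, 1],
        "start": [0.5, 3.0, -10.7],
        "stop": [2.3, 10., 10.7]
    }
)
points = pd.DataFrame(
    data={
        "subject_id": [0, 0, 0, 1, 1],
        "point": [1.0, 2.5, 9.9, 0.0, -8.8]
    }
)


def containing_interval_sketch(points, intervals):
    """For each point, return the row label of the interval containing it (NaN if none)."""
    # merge_asof requires both sides to be sorted by their join keys.
    left = points.reset_index().rename(columns={"index": "point_idx"}).sort_values("point")
    right = intervals.reset_index().rename(columns={"index": "interval_idx"}).sort_values("start")
    # For each point, pick the interval of the same subject with the largest start <= point.
    matched = pd.merge_asof(
        left, right, left_on="point", right_on="start", by="subject_id", direction="backward"
    ).set_index("point_idx")
    # Since intervals are disjoint, that candidate contains the point iff the point is <= its stop.
    inside = matched["point"] <= matched["stop"]
    return matched["interval_idx"].where(inside).sort_index()


containing = containing_interval_sketch(points, intervals)
print(containing)
```

In this example the point `2.5` of subject 0 lies in the gap between the two intervals of that subject, so it has no containing interval; all other points are matched.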

## References

**If you use CaTabRa-pandas in your research, please cite the following conference paper:**

* A. Maletzky, S. Kaltenleithner, P. Moser and M. Giretzlehner.
  *CaTabRa: Efficient Analysis and Predictive Modeling of Tabular Data*. In: I. Maglogiannis, L. Iliadis, J. MacIntyre
  and M. Dominguez (eds), Artificial Intelligence Applications and Innovations (AIAI 2023). IFIP Advances in
  Information and Communication Technology, vol 676, pp 57-68, 2023.
  [DOI:10.1007/978-3-031-34107-6_5](https://doi.org/10.1007/978-3-031-34107-6_5)

  ```
  @inproceedings{CaTabRa2023,
    author = {Maletzky, Alexander and Kaltenleithner, Sophie and Moser, Philipp and Giretzlehner, Michael},
    editor = {Maglogiannis, Ilias and Iliadis, Lazaros and MacIntyre, John and Dominguez, Manuel},
    title = {{CaTabRa}: Efficient Analysis and Predictive Modeling of Tabular Data},
    booktitle = {Artificial Intelligence Applications and Innovations},
    year = {2023},
    publisher = {Springer Nature Switzerland},
    address = {Cham},
    pages = {57--68},
    isbn = {978-3-031-34107-6},
    doi = {10.1007/978-3-031-34107-6_5}
  }
  ```

## Contact

If you have any inquiries, please open a GitHub issue.

## Acknowledgments

This project is financed by research subsidies granted by the government of Upper Austria. RISC Software GmbH is a member
of the UAR (Upper Austrian Research) Innovation Network.
            
