timebasedcv


- Name: timebasedcv
- Version: 0.2.1
- Summary: Time based cross validation
- Upload time: 2024-05-19 10:38:04
- Author: Francesco Bruzzesi
- Requires Python: >=3.8
- License: MIT License Copyright (c) 2023 Francesco Bruzzesi Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- Requirements: No requirements were recorded.

<img src="https://raw.githubusercontent.com/FBruzzesi/timebasedcv/main/docs/img/timebasedcv-logo.svg" width=185 height=185 align="right">

![license-shield](https://img.shields.io/github/license/FBruzzesi/timebasedcv)
![interrogate-badge](https://raw.githubusercontent.com/FBruzzesi/timebasedcv/main/docs/img/interrogate-shield.svg)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
![coverage-badge](https://raw.githubusercontent.com/FBruzzesi/timebasedcv/main/docs/img/coverage.svg)
![versions-shield](https://img.shields.io/pypi/pyversions/timebasedcv)

# Time based cross validation

**timebasedcv** is a Python codebase that provides a cross validation strategy based on time.

---

[Documentation](https://fbruzzesi.github.io/timebasedcv) | [Repository](https://github.com/fbruzzesi/timebasedcv) | [Issue Tracker](https://github.com/fbruzzesi/timebasedcv/issues)

---

## Disclaimer ⚠️

This codebase is experimental and works for my use cases. There are very likely cases that are not fully covered and for which it could break (badly). If you find one, please feel free to open an issue on the [issue page](https://github.com/FBruzzesi/timebasedcv/issues/new) of the repo.

## Description ✨

The current implementation of [scikit-learn TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html) lacks the flexibility to handle multiple samples within the same time period (or time unit).

**timebasedcv** addresses this problem by providing a cross validation strategy based on a **time period** rather than the number of samples. This is useful when the data is time dependent and the split should keep together samples within the same time window.

Temporal data leakage is a real risk, and we prevent it by providing splits that keep the past and the future well separated, so that leakage does not spoil a model's cross validation.

Again, these split points depend solely on the time period and not on the number of observations.
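To make the difference concrete, here is a minimal sketch that uses only the standard library and does not call timebasedcv. The toy data, the `n_train` count, and the `cutoff` day are illustrative assumptions. It shows how a cut placed by sample count can leave one day on both sides of the split, while a cut placed by time cannot.

```python
from datetime import date

# Toy data (assumed for illustration): one entry per sample,
# with an uneven number of samples per day.
days = (
    [date(2023, 1, 1)] * 2
    + [date(2023, 1, 2)] * 5
    + [date(2023, 1, 3)] * 3
)

# Count-based split: the first 4 samples train, the rest test.
n_train = 4
count_train, count_test = days[:n_train], days[n_train:]
# Days that appear on both sides of the count-based split.
shared_days = set(count_train) & set(count_test)

# Time-based split: samples before the cutoff day train, the rest test.
cutoff = date(2023, 1, 3)
time_train = [d for d in days if d < cutoff]
time_test = [d for d in days if d >= cutoff]

print(sorted(shared_days))              # days cut in two by the count-based split
print(len(time_train), len(time_test))  # sizes follow the data, not a fixed count
```

Here 2023-01-02 is shared by the count-based train and test sets, whereas the time-based sets never share a day.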

### Features 📜

We introduce two main classes:

- [`TimeBasedSplit`](https://fbruzzesi.github.io/timebasedcv/api/timebasedcv/#timebasedcv.core.TimeBasedSplit) lets you define a split based on time unit (frequency), train size, test size (`forecast_horizon`), gap, stride, window type and mode. Note that `TimeBasedSplit` is **not** compatible with [scikit-learn CV Splitters](https://scikit-learn.org/stable/common_pitfalls.html#id3). In fact, we have made the (opinionated) choice to:

  - Return the sliced arrays from `.split(...)`, while scikit-learn CV Splitters return the train and test indices of the split.
  - Require the time series to be passed as input to the `.split(...)` method, while scikit-learn CV Splitters only take `X, y, groups` in `.split(...)`.
  - Use that time series to generate the boolean masks with which we slice the original arrays into train and test for each split.

- Given the above choices, we also provide a scikit-learn compatible splitter: [`TimeBasedCVSplitter`](https://fbruzzesi.github.io/timebasedcv/api/sklearn/#timebasedcv.sklearn.TimeBasedCVSplitter). Because of the signature that `.split(...)` requires, and because CV Splitters need to know the number of splits a priori, `TimeBasedCVSplitter` is initialized with the time series containing the time information used to generate the train and test indices of each split.
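The distinction between returning sliced arrays and returning indices can be sketched as follows. This is not the library's implementation: `split_masks` is a hypothetical helper, and treating each window as half-open (`start <= t < end`) is an assumption made for the example.

```python
from datetime import date, timedelta


def split_masks(time_series, train_start, train_end, forecast_start, forecast_end):
    """Hypothetical helper: boolean masks for half-open [start, end) windows."""
    train_mask = [train_start <= t < train_end for t in time_series]
    forecast_mask = [forecast_start <= t < forecast_end for t in time_series]
    return train_mask, forecast_mask


# Toy time series with several samples on some days, plus matching features.
time_series = [date(2023, 1, 1) + timedelta(days=d) for d in (0, 0, 1, 2, 2, 2, 3)]
X = [f"x{i}" for i in range(len(time_series))]

train_mask, forecast_mask = split_masks(
    time_series,
    train_start=date(2023, 1, 1),
    train_end=date(2023, 1, 3),
    forecast_start=date(2023, 1, 3),
    forecast_end=date(2023, 1, 5),
)

# Style 1: slice the arrays directly with the masks.
X_train = [x for x, keep in zip(X, train_mask) if keep]
X_forecast = [x for x, keep in zip(X, forecast_mask) if keep]

# Style 2: convert the same masks to positional indices,
# which is the shape of output scikit-learn CV Splitters produce.
train_idx = [i for i, keep in enumerate(train_mask) if keep]
forecast_idx = [i for i, keep in enumerate(forecast_mask) if keep]

print(X_train, X_forecast)
print(train_idx, forecast_idx)
```

Both styles come from the same masks, which is why the time series has to be available either at `.split(...)` time or at initialization.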

## Installation 💻

TL;DR:

```bash
python -m pip install timebasedcv
```

For further information, please refer to the dedicated [installation](https://fbruzzesi.github.io/timebasedcv/installation) section.

## Quickstart 🏃

The following code snippet is all you need to get started, but consider checking out the [getting started](https://fbruzzesi.github.io/timebasedcv/user-guide/getting-started/) section of the documentation for a detailed guide on how to use the library.

The main takeaway should be that `TimeBasedSplit` allows for a lot of flexibility at the cost of having to specify a long list of parameters. This is what makes the library powerful and flexible enough to cover the large majority of use cases.

First, let's generate some data with a different number of points per day:

```python
import numpy as np
import pandas as pd

RNG = np.random.default_rng(seed=42)

dates = pd.Series(pd.date_range("2023-01-01", "2023-01-31", freq="D"))
size = len(dates)

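# Build one frame per day with a random number of evenly spaced timestamps,
# then concatenate them so that each day holds a different number of rows.
df = (pd.concat([
        pd.DataFrame({
            # `inclusive="left"` drops the end point, leaving `_size - 1` rows
            "time": pd.date_range(start, end, periods=_size, inclusive="left"),
            "a": RNG.normal(size=_size-1),
            "b": RNG.normal(size=_size-1),
        })
        for start, end, _size in zip(dates[:-1], dates[1:], RNG.integers(2, 24, size-1))
    ])
    .reset_index(drop=True)
    # Target: sum of the two features plus a small amount of noise
    .assign(y=lambda t: t[["a", "b"]].sum(axis=1) + RNG.normal(size=t.shape[0])/25)
)

# Count the rows per day to confirm the uneven sampling
df.set_index("time").resample("D").agg(count=("y", np.size)).head(5)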
```

```terminal
            count
time
2023-01-01      2
2023-01-02     18
2023-01-03     15
2023-01-04     10
2023-01-05     10
```

Then let's instantiate the `TimeBasedSplit` class:

```python
from timebasedcv import TimeBasedSplit

tbs = TimeBasedSplit(
    frequency="days",
    train_size=10,
    forecast_horizon=5,
    gap=1,
    stride=3,
    window="rolling",
    mode="forward",
)
```
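To build intuition for how these parameters interact, the sketch below derives window boundaries for a rolling, forward configuration with the same numbers. It is one plausible reading of the parameters, not the library's implementation: `rolling_forward_windows` is a hypothetical helper, and the half-open windows, the way the gap is counted, and the stopping rule are all assumptions. Refer to the documentation for the exact semantics.

```python
from datetime import date, timedelta


def rolling_forward_windows(time_start, time_end, train_size, forecast_horizon, gap, stride):
    """Hypothetical helper: yield (train_start, train_end, forecast_start, forecast_end).

    Assumptions: all sizes are in days, each window is half-open [start, end),
    and iteration stops once the forecast window would start at or after time_end.
    """
    train_start = time_start
    while True:
        train_end = train_start + timedelta(days=train_size)
        forecast_start = train_end + timedelta(days=gap)
        forecast_end = forecast_start + timedelta(days=forecast_horizon)
        if forecast_start >= time_end:  # assumed stopping rule
            break
        yield train_start, train_end, forecast_start, forecast_end
        # Rolling window: the train start moves forward by the stride each time.
        train_start += timedelta(days=stride)


windows = list(
    rolling_forward_windows(
        time_start=date(2023, 1, 1),
        time_end=date(2023, 1, 31),
        train_size=10,
        forecast_horizon=5,
        gap=1,
        stride=3,
    )
)

for train_start, train_end, forecast_start, forecast_end in windows[:2]:
    print(f"train [{train_start}, {train_end})  forecast [{forecast_start}, {forecast_end})")
```

Under these assumptions the first train window covers 2023-01-01 up to 2023-01-11, the gap skips one day, the forecast window covers 2023-01-12 up to 2023-01-17, and the next split starts three days later.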

Now let's split the data with the provided `TimeBasedSplit` instance:

```python
X, y, time_series = df.loc[:, ["a", "b"]], df["y"], df["time"]

for X_train, X_forecast, y_train, y_forecast in tbs.split(X, y, time_series=time_series):
    print(f"Train: {X_train.shape}, Forecast: {X_forecast.shape}")
```

```terminal
Train: (100, 2), Forecast: (51, 2)
Train: (114, 2), Forecast: (50, 2)
...
Train: (124, 2), Forecast: (40, 2)
Train: (137, 2), Forecast: (22, 2)
```

As we can see, each split does not necessarily have the same number of points. This is because the time series has a different number of points per day.

A picture is worth a thousand words, so let's visualize the splits (blue dots represent the train points, while the red dots represent the forecasting points):
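As a small worked example of why sizes vary, the sketch below takes the five daily counts shown in the resampled output earlier and sums them over a sliding two-day window. The two-day window length is an arbitrary choice for illustration and is unrelated to the `train_size` used above.

```python
# Daily row counts copied from the resampled output shown earlier.
daily_counts = {
    "2023-01-01": 2,
    "2023-01-02": 18,
    "2023-01-03": 15,
    "2023-01-04": 10,
    "2023-01-05": 10,
}

counts = list(daily_counts.values())
window_days = 2  # illustrative window length, in days

# Number of rows that fall in each consecutive two-day window.
window_sizes = [
    sum(counts[i : i + window_days])
    for i in range(len(counts) - window_days + 1)
]

print(window_sizes)  # [20, 33, 25, 20]
```

Every window spans the same amount of time, yet the number of rows it contains changes with the data.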

![cross-validation](docs/img/basic-cv-split.png)

## Contributing ✌️

Please read the [Contributing guidelines](https://fbruzzesi.github.io/timebasedcv/contribute/) in the documentation site.

## License 👀

The project is released under the [MIT License](https://github.com/FBruzzesi/timebasedcv/blob/main/LICENSE).

            

        