timefiller


- Name: timefiller
- Version: 0.1.7
- Home page: https://github.com/CyrilJl/TimeFiller
- Summary: A package for imputing missing data in time series
- Upload time: 2024-10-26 18:22:47
- Author: Cyril Joly
- Requires Python: >=3.8

[![PyPI - Version](https://img.shields.io/pypi/v/timefiller)](https://pypi.org/project/timefiller/)
[![Conda Version](https://img.shields.io/conda/vn/conda-forge/timefiller.svg)](https://anaconda.org/conda-forge/timefiller)
[![Documentation Status](https://readthedocs.org/projects/timefiller/badge/?version=latest)](https://timefiller.readthedocs.io/en/latest/?badge=latest)
[![Unit tests](https://github.com/CyrilJl/timefiller/actions/workflows/pytest.yml/badge.svg)](https://github.com/CyrilJl/timefiller/actions/workflows/pytest.yml)
[![Codacy Badge](https://app.codacy.com/project/badge/Grade/51d0dd39565a410985a6836e7d6bcd0b)](https://app.codacy.com/gh/CyrilJl/TimeFiller/dashboard?utm_source=gh&utm_medium=referral&utm_content=&utm_campaign=Badge_grade)

# <img src="https://raw.githubusercontent.com/CyrilJl/timefiller/main/_static/logo_timefiller.svg" alt="Logo timefiller" width="200" height="200" align="right"> timefiller

`timefiller` is a Python package for time series imputation and forecasting. When applied to a set of correlated time series, it processes each series individually, using both its own auto-regressive patterns and its correlations with the other series. The API is kept simple so that the package remains accessible to non-experts.

Originally developed for imputation, it also proves useful for forecasting, particularly when covariates contain missing values.

## Installation

You can install ``timefiller`` from PyPI:
```bash
pip install timefiller
```
It is also available from conda-forge, using either `conda` or `mamba`:
```bash
conda install -c conda-forge timefiller
```

```bash
mamba install timefiller
```

## Why this package?

While other Python packages address similar tasks, this one is lightweight and has a simple, straightforward API. Its speed may currently be a limitation on large datasets, but it remains useful in many cases.

## Basic Usage

The simplest usage example:

```python
from timefiller import TimeSeriesImputer

df = load_your_dataset()  # placeholder: replace with your own loading code
tsi = TimeSeriesImputer()
df_imputed = tsi(X=df)
```
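
`load_your_dataset()` above is a placeholder, not a `timefiller` function. As a minimal sketch, assuming the imputer is fed a pandas `DataFrame` with a `DatetimeIndex` and one column per series, a small synthetic stand-in could be built as follows. The names `make_toy_dataset`, `n_steps`, `n_series` and `nan_ratio` are hypothetical and only serve the illustration:

```python
import numpy as np
import pandas as pd


def make_toy_dataset(n_steps=500, n_series=4, nan_ratio=0.05, seed=0):
    """Build correlated hourly series with randomly placed NaNs (illustrative only)."""
    rng = np.random.default_rng(seed)
    index = pd.date_range("2024-01-01", periods=n_steps, freq=pd.Timedelta(hours=1))
    # A shared daily cycle makes the series correlated with one another.
    base = np.sin(2 * np.pi * np.arange(n_steps) / 24)
    data = {
        f"col_{k}": (1 + 0.2 * k) * base + 0.1 * rng.standard_normal(n_steps)
        for k in range(n_series)
    }
    df = pd.DataFrame(data, index=index)
    # Knock out a random fraction of the cells to mimic missing observations.
    mask = rng.random(df.shape) < nan_ratio
    return df.mask(mask)


df = make_toy_dataset()
print(df.isna().mean().round(3))  # share of missing values per column
```

The resulting `df` can then be passed to `tsi(X=df)` in place of your own data.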

## Advanced Usage

```python
from sklearn.linear_model import LassoCV
from timefiller import PositiveOutput, TimeSeriesImputer

df = load_your_dataset()  # placeholder: replace with your own loading code
tsi = TimeSeriesImputer(estimator=LassoCV(),
                        ar_lags=(1, 2, 3, 6, 24),
                        multivariate_lags=6,
                        preprocessing=PositiveOutput())
df_imputed = tsi(X=df,
                 subset_cols=['col_1', 'col_17'],
                 after='2024-06-14',
                 n_nearest_features=35)
```

Check out the [documentation](https://timefiller.readthedocs.io/en/latest/index.html) for details on available options to customize your imputation.
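
To give some intuition for `ar_lags` and `multivariate_lags`, here is a rough pandas sketch of how lagged features are commonly derived. It reflects an assumption about what such parameters usually mean (shifted copies of the target series for the autoregressive part, shifted copies of the other series for the multivariate part) and is not a description of `timefiller`'s internals. The helper `build_lagged_features` is hypothetical, and only past values are used here:

```python
import pandas as pd


def build_lagged_features(df, target, ar_lags=(1, 2, 3), multivariate_lags=2):
    """Return a frame of shifted series (illustrative, not timefiller's code)."""
    features = {}
    # Autoregressive part: past values of the target series itself.
    for lag in ar_lags:
        features[f"{target}_lag{lag}"] = df[target].shift(lag)
    # Multivariate part: current and past values of every other series.
    for col in df.columns.drop(target):
        for lag in range(multivariate_lags + 1):
            features[f"{col}_lag{lag}"] = df[col].shift(lag)
    return pd.DataFrame(features, index=df.index)


df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0], "b": [10.0, 20.0, 30.0, 40.0, 50.0]})
X = build_lagged_features(df, target="a", ar_lags=(1, 2), multivariate_lags=1)
print(X)
```

Note that shifting introduces NaNs at the start of each lagged column, so a lagged feature matrix contains missing cells even when the raw data has none.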

## Real data example

Let's evaluate how ``timefiller`` performs on a real-world dataset, the [PeMS-Bay traffic data](https://zenodo.org/records/5724362). One sensor is selected at random, and a contiguous block of its values is set to missing. To make the task harder, additional Missing At Random (MAR) values are introduced across 1% of the entire dataset:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from timefiller import TimeSeriesImputer
from timefiller.utils import add_mar_nan, fetch_pems_bay

# Fetch the time series dataset (e.g., PeMS-Bay traffic data)
df = fetch_pems_bay()
dfm = df.copy()  # Create a copy to introduce missing values later

# Randomly select one column (sensor ID) to introduce missing values
k = np.random.randint(df.shape[1])
col = df.columns[k]
i, j = 20_000, 22_500  # Define a range in the dataset to set as NaN (missing values)
dfm.iloc[i:j, k] = np.nan  # Introduce missing values in this range for the selected column

# Add more missing values randomly across the dataset (1% of the data)
dfm = add_mar_nan(dfm, ratio=0.01)

# Initialize the TimeSeriesImputer with AR lags and multivariate lags
tsi = TimeSeriesImputer(ar_lags=48, multivariate_lags=6)

# Apply the imputation method on the modified dataframe
df_imputed = tsi(dfm, subset_cols=col, n_nearest_features=75)

# Plot the imputed data alongside the data with missing values
df_imputed[col].rename('imputation').plot(figsize=(10, 3), lw=0.8, c='C0')
dfm[col].rename('data to impute').plot(ax=plt.gca(), lw=0.8, c='C1')
plt.title(f'sensor_id {col}')
plt.legend()
plt.show()

# Plot the imputed data vs the original complete data for comparison
df_imputed[col].rename('imputation').plot(figsize=(10, 3), lw=0.8, c='C0')
df[col].rename('complete data').plot(ax=plt.gca(), lw=0.8, c='C2')
plt.xlim(dfm.index[i], dfm.index[j])  # Focus on the region where data was missing
plt.legend()
plt.show()
```

<img src="https://raw.githubusercontent.com/CyrilJl/timefiller/main/_static/result_imputation.png" width="750">
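
The plots give a visual check; a numeric score on the artificially removed block makes comparisons easier. The sketch below is not part of `timefiller`. It assumes you still hold the complete series (`df[col]` above) alongside the imputed one (`df_imputed[col]`), and the helper name `masked_errors` is hypothetical. To stay self-contained, it scores a simple time-interpolation baseline on synthetic data; passing the imputed series instead of `baseline` would score the imputer:

```python
import numpy as np
import pandas as pd


def masked_errors(truth, estimate, mask):
    """MAE and RMSE restricted to the positions that were artificially removed."""
    diff = (estimate - truth)[mask]
    return {"mae": float(np.abs(diff).mean()), "rmse": float(np.sqrt((diff ** 2).mean()))}


# Synthetic stand-in for one sensor: a daily cycle sampled every 5 minutes.
index = pd.date_range("2024-01-01", periods=2_000, freq=pd.Timedelta(minutes=5))
truth = pd.Series(60 + 10 * np.sin(2 * np.pi * np.arange(2_000) / 288), index=index)

# Remove a contiguous block, as in the example above.
mask = np.zeros(len(truth), dtype=bool)
mask[800:1_000] = True
observed = truth.mask(mask)

# Naive baseline: linear interpolation in time across the gap.
baseline = observed.interpolate(method="time")
print(masked_errors(truth, baseline, mask))
```

A baseline of this kind gives a reference point: an imputer that exploits correlated sensors should score noticeably better on long gaps, where interpolation cannot follow the daily cycle.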

## Algorithmic Approach

`timefiller` relies heavily on [scikit-learn](https://scikit-learn.org/stable/) for the learning process and uses [OptiMask](https://optimask.readthedocs.io/en/latest/index.html) to create NaN-free train and predict matrices for the estimator.
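
To see why a dedicated tool helps here, consider the naive ways of obtaining a NaN-free matrix: drop every row that contains a NaN, or drop every column that contains one. The sketch below compares them with a hand-picked mix of both on a toy array. It only illustrates the problem and neither uses OptiMask nor reproduces its algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
# One mostly empty column plus a few isolated missing cells.
X[:150, 3] = np.nan
X[10, 0] = X[60, 5] = X[170, 8] = np.nan

has_nan = np.isnan(X)
rows_kept = X[~has_nan.any(axis=1)]     # drop rows with any NaN
cols_kept = X[:, ~has_nan.any(axis=0)]  # drop columns with any NaN

# Hand-picked mix: drop the mostly empty column, then the few remaining bad rows.
mixed = np.delete(X, 3, axis=1)
mixed = mixed[~np.isnan(mixed).any(axis=1)]

print("drop rows   :", rows_kept.shape, rows_kept.size, "cells")  # (49, 10) 490 cells
print("drop columns:", cols_kept.shape, cols_kept.size, "cells")  # (200, 6) 1200 cells
print("mixed       :", mixed.shape, mixed.size, "cells")          # (197, 9) 1773 cells
```

Here the mixed removal keeps the most cells, but the right combination had to be chosen by hand. Finding a good combination automatically is the kind of problem the text above delegates to OptiMask.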

For each column requiring imputation, the algorithm separates rows with valid data from rows with missing values. For each row to impute, it identifies which of the other columns (features) are available. For each distinct set of available features, OptiMask is called to extract the largest possible NaN-free submatrix, on which the chosen scikit-learn estimator is trained. This becomes computationally expensive when there are many distinct feature sets, or when each set covers only a few rows, because every set requires its own call to OptiMask and its own fit and predict of the estimator.
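
The grouping step described above can be pictured as follows. This is a simplified sketch under stated assumptions (hypothetical helper `group_by_available_features`, no lags, no estimator), not the package's code: for one target column, the rows to impute are grouped by which other columns are observed on that row, and each group would then get its own training matrix and fitted estimator.

```python
import numpy as np
import pandas as pd


def group_by_available_features(df, target):
    """Map each pattern of observed feature columns to the rows it would impute."""
    features = df.columns.drop(target)
    to_impute = df.index[df[target].isna()]
    groups = {}
    for row in to_impute:
        observed = df.loc[row, features].notna()
        pattern = tuple(features[observed.to_numpy()])  # columns usable for this row
        groups.setdefault(pattern, []).append(row)
    return groups


nan = np.nan
df = pd.DataFrame(
    {
        "y": [1.0, nan, nan, 4.0, nan],
        "a": [0.1, 0.2, 0.3, 0.4, nan],
        "b": [1.1, nan, 1.3, 1.4, 1.5],
    }
)
for pattern, rows in group_by_available_features(df, "y").items():
    print(pattern, "->", rows)
```

In this toy frame the three rows to impute yield three distinct patterns, so three separate fits would be needed. This is the situation in which the cost grows quickly.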

Keep in mind that, within a single column, two different rows (timestamps) may be imputed by different estimators (regressors), each trained on a distinct set of columns (covariate features) and samples (rows/timestamps).

``timefiller`` provides its own default regressor, a scikit-learn-style implementation of an Extreme Learning Machine with ReLU activation. It handles non-linear relationships between input features while being faster to fit than a conventional neural network.
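
For readers unfamiliar with Extreme Learning Machines, the idea is a single hidden layer whose weights are drawn at random and never trained, followed by a linear readout fitted in closed form. The NumPy sketch below illustrates that general technique with a ReLU activation and a ridge-regularized readout. The class name `TinyELM` and its parameters are hypothetical, and this is not `timefiller`'s default regressor:

```python
import numpy as np


class TinyELM:
    """Minimal Extreme Learning Machine regressor (illustrative sketch)."""

    def __init__(self, n_hidden=100, alpha=1e-3, seed=0):
        self.n_hidden = n_hidden
        self.alpha = alpha  # ridge penalty on the output weights
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Fixed random projection followed by ReLU.
        return np.maximum(X @ self.W + self.b, 0.0)

    def fit(self, X, y):
        self.W = self.rng.standard_normal((X.shape[1], self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        H = self._hidden(X)
        # Only the readout is learned: solve (H'H + alpha * I) beta = H'y.
        A = H.T @ H + self.alpha * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ y)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta


rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(500, 1))
y = np.abs(X[:, 0])  # a non-linear target that a plain linear model cannot fit
model = TinyELM().fit(X, y)
print("train MSE:", np.mean((model.predict(X) - y) ** 2))
```

Because fitting reduces to one linear solve, training is cheap compared with gradient-based neural networks, which matters when many estimators have to be fitted for a single column.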

            
