- Name: tscausalinference
- Version: 0.2.0.9
- Summary: Bootstrap random walk simulation methodology applied on top of Facebook Prophet to analyse causal effects
- Upload time: 2023-04-17 13:23:42
- Author: Carlos Trujillo
- Keywords: python, causalimpact, causal inference, marketing, prophet, bootstrap, monte carlo random walks

# tscausalinference

`tscausalinference` is a Python library for causal inference analysis on time series data. It applies the counterfactual methodology on top of the Prophet time-series forecasting library and uses Bootstrap simulations to test statistical significance and manage uncertainty.

## How it works

Causal inference is a family of statistical methods for determining whether a change in one variable causes a change in another. The `tscausalinference` library builds a synthetic control group (the forecast response) and compares it with the real treatment group (the actual response). The difference between the two groups is the estimated effect of the intervention, and the library then tests whether that effect is statistically significant.

The Prophet model is used to generate control data by making predictions about what would have happened in the absence of the intervention. This control data represents a counterfactual scenario, where the intervention did not occur, and allows us to compare the actual outcomes to what would have happened if the intervention had not been implemented. 

To manage the uncertainty of the prediction and to test the statistical significance of the results, the `tscausalinference` library performs Monte Carlo simulations through the Bootstrap method. This method involves resampling a single dataset to create many simulated datasets. The library creates a set of alternative random walks from the synthetic control data, computes the mean of each walk over the complete simulation period, and builds a distribution from those means.

The real mean of the period is then compared against this distribution: using the cumulative distribution function (CDF), we calculate how extreme the real mean is relative to the distribution of **what should happen**. This tells us whether the effect is statistically significant.

The library works as follows:

1. Build a Prophet model to create a synthetic control group.
2. Perform Monte Carlo simulations using the Bootstrap method to create a set of alternative random walks from the synthetic control data.
3. Build a distribution from the mean of each random walk over the complete simulation period.
4. Compare the real mean of the period against this distribution using the CDF to determine the statistical significance of the effect.
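
Steps 2 to 4 can be sketched with NumPy as follows. This is a minimal illustration, not the library's internal code: the function name, the `increments` pool of step sizes, and the two-sided p-value convention are all assumptions made for the example, and a constant array stands in for the Prophet forecast from step 1.

```python
import numpy as np

def bootstrap_random_walk_pvalue(actual, forecast, increments, n_samples=1000, seed=0):
    """Two-sided bootstrap p-value for the mean of `actual` against simulated walks.

    Hypothetical helper for illustration only; not part of tscausalinference.
    """
    rng = np.random.default_rng(seed)
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    increments = np.asarray(increments, dtype=float)
    increments = increments - increments.mean()  # center so the walks have no drift

    # Step 2: resample step sizes with replacement and accumulate them,
    # so that each row is one random walk around the forecast.
    steps = rng.choice(increments, size=(n_samples, forecast.size), replace=True)
    walks = forecast + np.cumsum(steps, axis=1)

    # Step 3: one mean per simulated walk gives the reference distribution.
    sim_means = walks.mean(axis=1)

    # Step 4: empirical CDF evaluated at the real mean, turned into a p-value.
    cdf = np.mean(sim_means <= actual.mean())
    p_value = 2 * min(cdf, 1 - cdf)
    return p_value, sim_means

# Made-up inputs: a flat forecast and an observed series lifted by 50 units.
forecast = np.full(14, 100.0)
increments = np.random.default_rng(1).normal(0, 1, 200)
p_value, sim_means = bootstrap_random_walk_pvalue(forecast + 50, forecast, increments)
print(f"p-value: {p_value:.4f}")
```

A small p-value means the real mean sits in the tail of the simulated means, that is, it would be unusual if the intervention had no effect.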

## Why Prophet?

Prophet is a time-series forecasting library in Python that uses statistical models to make predictions about future values based on past trends and regressors. It takes into account seasonal trends, holiday effects, and other factors that can affect the outcome variable. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well. Additionally, Prophet is a simple and scalable framework that is well-documented and supported by its own community.

## Why Bootstrap?

Bootstrap is a statistical procedure that involves resampling a single dataset to create many simulated datasets. This process allows us to calculate standard errors, construct confidence intervals, and perform hypothesis testing for numerous types of sample statistics. Bootstrap is a useful method for estimating the effect of an intervention because it can help us detect significant changes in the mean or variance of a time series. One of the main challenges in time series analysis is that we often have a limited amount of data, especially when studying the effects of a specific intervention. Because Bootstrap is a non-parametric method that makes few assumptions about the underlying distribution of the data, it stays usable in these small-sample settings and can be applied in a wide range of situations.
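
As a minimal sketch of the resampling idea, the snippet below builds a percentile bootstrap confidence interval for a mean using only the Python standard library. The helper name and the sample values are made up for illustration, and the plain resampling shown here treats observations as independent, which is a simplification for time series.

```python
import random
import statistics

def bootstrap_mean_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement, record the mean of each resample, then sort.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

daily_values = [102, 98, 110, 95, 105, 99, 101, 97, 108, 103]  # made-up data
low, high = bootstrap_mean_ci(daily_values)
print(f"95% bootstrap CI for the mean: [{low:.2f}, {high:.2f}]")
```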

## Installation

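`tscausalinference` can be installed using pip. The leading `!` is only needed when running the command inside a notebook cell; drop it in a regular terminal:

```python
!pip install tscausalinference
```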

## Example Usage

The `tscausalinference` class takes the following arguments:

- `data`: the time series data as a Pandas DataFrame
- `intervention`: the time period of the intervention as a pair of start and end dates, for example `['2022-07-04', '2022-07-19']`
- `regressors`: optional list of regressors to be included in the Prophet model
- `seasonality`: boolean indicating whether to include seasonality in the Prophet model
- `cross_validation_steps`: number of steps to use in cross-validation for Prophet model tuning
- `seasonality_mode`: optional string setting the seasonality mode of the Prophet model (`'additive'` or `'multiplicative'`), default `'additive'`

```python
from tscausalinference import tscausalinference as tsci
import pandas as pd

# Load data
df = pd.read_csv('mydata.csv')
intervention = ['2022-07-04', '2022-07-19']

model = tsci(data = df, intervention = intervention)
model.run()

model.plot()
```
![plot_intervention Method](https://github.com/carlangastr/marketing-science-projects/blob/main/tscausalinference/introduction_notebooks/plots/output.png)

```python
model.summarization(method = 'incremental', interrupted_variable = df.set_index('ds').regressor1, window = 180)
```
```md
summary
-------
Each extra unit on ´y´ represents 5.23 units on your variable.

+-----------------------+------------+
 / / / /   DETAILED OVERVIEW   / / / /
+-----------------------+------------+

| METRIC                 |     VALUE |
|:-----------------------|----------:|
| Last 180 days Mean     | 1502.74   |
| Intervention Mean      | 1809.59   |
| Increase (%)           |  30.4029  |
| Variable Change        |  307      |
| Incremental Unit Value |  5.231456 |
```

### Create your own data
```python
from tscausalinference import synth_dataframe

synth = synth_dataframe() #Create fake data using our custom function.
df = synth.DataFrame()
```
```md
Min date: 2022-01-01 00:00:00
Max date: 2022-12-31 00:00:00
Day where effect was injected: 2022-12-17 00:00:00
Power of the effect: 30.0%
```
**Expected Schema**
```md
<class 'pandas.core.frame.DataFrame'>
Int64Index: 365 entries, 0 to 364
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   ds          365 non-null    datetime64[ns]
 1   y           365 non-null    float64       
 2   regressor1  365 non-null    float64       
 3   regressor2  365 non-null    float64       
dtypes: datetime64[ns](1), float64(3)
memory usage: 14.3 KB
```
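
Before passing your own data to the library, it can help to confirm that it matches this schema. The helper below is a hypothetical sketch (it is not part of `tscausalinference`) that checks for the `ds` and `y` columns shown above, plus any regressor columns you plan to use:

```python
import pandas as pd

def check_schema(df, regressors=()):
    """Return a list of problems found in `df`; an empty list means it looks fine."""
    problems = []
    for column in ("ds", "y", *regressors):
        if column not in df.columns:
            problems.append(f"missing column: {column}")
    if "ds" in df.columns and not pd.api.types.is_datetime64_any_dtype(df["ds"]):
        problems.append("ds is not a datetime column; try pd.to_datetime(df['ds'])")
    if "y" in df.columns and not pd.api.types.is_numeric_dtype(df["y"]):
        problems.append("y is not numeric")
    return problems

# Dates read from a CSV usually arrive as strings and need converting.
raw = pd.DataFrame({"ds": ["2022-01-01", "2022-01-02"], "y": [10.0, 12.5]})
print(check_schema(raw))   # reports that ds is not a datetime column
raw["ds"] = pd.to_datetime(raw["ds"])
print(check_schema(raw))   # no problems left
```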
### Checking seasonal decomposition
```python
model = tsci(data = df, intervention = intervention)
model.run()

model.plot(method = 'decompose')
```
![seasonal_decompose Method](https://github.com/carlangastr/marketing-science-projects/blob/main/tscausalinference/introduction_notebooks/plots/seasonal_decompose.png)

### Check the sensitivity of your time series
Before running your experiment, you can check how large an effect must be for the library to detect it.

```python
from tscausalinference import sensitivity

model_check = sensitivity(
    df=df,
    test_period=['2023-12-25', '2024-01-05'],
    cross_validation_steps=10,
    alpha=0.05,
    model_params={'changepoint_range': [0.85,0.50],
                  'changepoint_prior_scale': [0.05, 0.50]},
    regressors=['regressor2', 'regressor1'],
    verbose=True,
    autocorrelation = False,
    n_samples=1000)
model_check.run(prior = False)
model_check.plot()
```
![sensitivity Method](https://github.com/carlangastr/marketing-science-projects/blob/main/tscausalinference/introduction_notebooks/plots/sensitivity.png)
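
The idea behind such a check can be sketched independently of the library. The snippet below is an assumption-laden illustration, not how `sensitivity` is implemented: it simulates no-effect random walks around a forecast and reports the smallest relative lift in the period mean that would fall outside the central `1 - alpha` band of simulated means. The function name and inputs are hypothetical.

```python
import numpy as np

def minimum_detectable_effect(forecast, increments, alpha=0.05, n_samples=1000, seed=0):
    """Smallest relative lift in the period mean that exceeds the upper
    (1 - alpha / 2) quantile of simulated no-effect means."""
    rng = np.random.default_rng(seed)
    forecast = np.asarray(forecast, dtype=float)
    increments = np.asarray(increments, dtype=float)
    increments = increments - increments.mean()  # no drift under the null

    # Simulate random walks around the forecast and take one mean per walk.
    steps = rng.choice(increments, size=(n_samples, forecast.size), replace=True)
    sim_means = (forecast + np.cumsum(steps, axis=1)).mean(axis=1)

    upper = np.quantile(sim_means, 1 - alpha / 2)
    return (upper - forecast.mean()) / forecast.mean()

# Made-up inputs: the same flat forecast with a quiet and a noisy step pool.
forecast = np.full(14, 100.0)
quiet_steps = np.random.default_rng(1).normal(0, 1, 200)
for label, pool in [("quiet", quiet_steps), ("noisy", 5 * quiet_steps)]:
    mde = minimum_detectable_effect(forecast, pool)
    print(f"{label} series: effect must exceed about {mde:.1%} of the forecast mean")
```

The noisier the series, the wider the band of plausible no-effect means, so a larger effect is needed before it can be told apart from chance.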

### Checking p-value, simulations & distributions
```python
model = tsci(data = df, intervention = intervention)
model.run()  # fit the model first, as in the examples above

model.plot(method = 'simulations')
```
![plot_simulations Method](https://github.com/carlangastr/marketing-science-projects/blob/main/tscausalinference/introduction_notebooks/plots/pvalue.png)

### Customizing model
```python
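# Regressor names must match columns in `df`; these are the ones from the schema above.
model = tsci(data = df, intervention = intervention,
             regressors = ['regressor1', 'regressor2'],
             alpha = 0.03, n_samples = 10000, cross_validation_steps = 15)
model.run()

model.summarization()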
```
```md
Summary
-------
During the intervention period, the response variable had an average value of approximately 295.04. 
By contrast, in the absence of an intervention, we would have expected an average response of 272.87. 
The 95% confidence interval of this counterfactual prediction is 223.68 to 321.77.

The usual error of your model is between -9.45% to 9.45% , during the intervention period was 12.45%. 
suggesting that the model can explain well what should happen, and that the differences are not significant.

The probability of obtaining this effect by chance is not small 
(after 10000 simulations, bootstrap probability p = 0.38485). 
This means that the causal effect cannot be considered statistically significant.
```

## Articles:
1. Carlos Trujillo on [Medium](https://medium.com/@carlangastr/combining-random-walks-and-bootstrap-for-causal-inference-analysis-on-time-series-2ca89f95bbe6)

## Extra documentation
1. Check out [PyPI](https://pypi.org/project/tscausalinference) for more information.
2. Check out the [Introduction Notebooks](https://github.com/carlangastr/marketing-science-projects/blob/main/tscausalinference/introduction_notebooks/basic.ipynb) to see an example of usage.
3. Try it on [Google Colab](https://colab.research.google.com/drive/1OeF0OLlu0d9oM0xWiJSIEnxN-eWyFskq#scrollTo=YTX8Z4ZOwpyw) 🔥.

## Inspirational articles:
1. [Bootstrap random walks](https://reader.elsevier.com/reader/sd/pii/S0304414915300247?token=0E54369709F75136F10874CA9318FB348A6B9ED117081D7607994EDB862C09E8F95AE336C38CD97AD7A2C50FF14A8708&originRegion=eu-west-1&originCreation=20230224195555)
2. [Public Libs](https://github.com/lytics/impact)
3. [A Nonparametric approach for multiple change point analysis](https://arxiv.org/pdf/1306.4933.pdf)
4. [Causal Impact on python](https://www.youtube.com/watch?v=GTgZfCltMm8&t=272s)
5. [Causal Inference Using Bayesian Structural Time-Series Models](https://towardsdatascience.com/causal-inference-using-bayesian-structural-time-series-models-ab1a3da45cd0)
6. [Wikipedia](https://en.wikipedia.org/wiki/Random_walk)

## License
This project is licensed under the MIT License - see the LICENSE file for details.

            
