# bootstrap-ci

***Toolbox for bootstrap sampling and estimation of confidence intervals.***

You can choose between hierarchical and non-parametric sampling and combine them 
with multiple bootstrap methods for estimation of confidence intervals.

## Table of Contents
- [Getting Started](#getting-started)
- [Bootstrap sampling](#bootstrap-sampling)
- [Bootstrap methods](#bootstrap-methods)
- [Parameters](#parameters)
- [Suggestions](#suggestions-on-which-method-and-parameters-to-use)

# Getting started
Installation and a simple use case example.

## Installation
To use the `bootstrap-ci` package, install it with [pip](https://pypi.org/project/pip/): 
```
pip install bootstrap-ci
```
## Simple example
Once you have installed the package, you can use the `ci` method to obtain confidence intervals for your chosen
statistic on a given sample:
```
import bootstrap_ci as boot
import numpy as np

np.random.seed(0)
sample = np.random.normal(0, 1, size=1000)

bootstrap = boot.Bootstrap(sample, statistic=np.mean)

onesided_95 = bootstrap.ci(coverages=[0.95], nr_bootstrap_samples=1000)
print(f'One-sided 95% confidence interval for mean is equal to (-inf, {round(onesided_95[0], 3)}).')

>>> One-sided 95% confidence interval for mean is equal to (-inf, 0.004).

twosided_95 = bootstrap.ci(coverages=0.95, side='two', nr_bootstrap_samples=1000)
print(f'Two-sided 95% confidence interval for mean is equal to ({round(twosided_95[0], 3)}, {round(twosided_95[1], 3)}).')

>>> Two-sided 95% confidence interval for mean is equal to (-0.108, 0.014).
```

To see more examples for different sampling possibilities go to [Parameters](#parameters).

# Bootstrap sampling
Bootstrap can be divided into two separate steps. The first one is **bootstrap sampling**, that produces the 
bootstrap distribution, which approximates the distribution of the observed parameter. 
There are different approaches to bootstrap sampling, differing primarily in their underlying data assumptions and 
parameter estimation. In this package you can choose between non-parametric and hierarchical sampling. 

### Non-parametric sampling
Non-parametric sampling is assumption-free and estimates the underlying data distribution $F$ directly with 
the original sample $X$. 
This means that for each bootstrap sample, it samples with replacement directly from the original sample. 
There are $n^n$ different possible samples that can arise with such a procedure, but because of the computational 
cost, you can choose the number of independent samples, $B$, that you want to obtain. 
To obtain the bootstrap distribution, the value of the observed statistic is calculated on each of them.
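
To make the procedure concrete, here is a minimal plain-`numpy` sketch of non-parametric sampling of the mean; it only illustrates the principle and is not the package's implementation (the names `B`, `idx` and `bootstrap_values` are ours):
```
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(0, 1, size=1000)

B = 1000                                     # number of bootstrap samples
n = sample.size
idx = rng.integers(0, n, size=(B, n))        # B index vectors drawn with replacement
bootstrap_values = sample[idx].mean(axis=1)  # bootstrap distribution of the mean
```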

### Hierarchical sampling
Hierarchical bootstrap sampling takes into account the group dependencies of the underlying data generating process.
We implemented the completely non-parametric *cases sampling*, where you can choose between all possible strategies,
and the parametric *random-effect sampling*.

#### Cases sampling
Bootstrap samples are obtained by resampling the groups on each level. 
They can be resampled with or without replacement, the latter meaning that we just take all the groups (or data points) 
on that level. Sampling strategy is selected with a vector of zeros and ones, $s = (s_1, \dots, s_{n_{lvl}})$, 
of the same length as the number of levels in the sample. The value 1 in the vector denotes sampling with replacement 
from that particular level. The value of 0 denotes sampling without replacement for that level.
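
As an illustration (not the package's code), a two-level dataset resampled with strategy $s = (1, 0)$ — groups drawn with replacement, all points kept within each chosen group — could be sketched like this:
```
import numpy as np

rng = np.random.default_rng(0)
# three groups of data points; one hierarchy level above the individual points
groups = [np.array([0.1, -0.2]), np.array([1.0, -0.5]), np.array([10.0, 11.0])]

def one_cases_sample(groups, rng):
    # level 1: draw groups with replacement (strategy value 1)
    chosen = rng.integers(0, len(groups), size=len(groups))
    # level 2: take every point inside each chosen group (strategy value 0)
    return np.concatenate([groups[g] for g in chosen])

bootstrap_values = np.array([one_cases_sample(groups, rng).mean()
                             for _ in range(1000)])
```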

#### Random-effect sampling
Random-effect sampling is a parametric sampling method that assumes that the data come from a random-effect model.
It first estimates the random effects of each group on each level of the sample, then draws those random effects with
replacement, to produce new bootstrap samples.
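
A rough one-level sketch of this idea, assuming the simple model $y_{ij} = \mu + b_i + e_{ij}$, is shown below; it is only illustrative and does not mirror the package's multi-level implementation:
```
import numpy as np

rng = np.random.default_rng(0)
# four groups generated from a simple one-level random-effect model
groups = [rng.normal(loc, 1, size=20) for loc in (0.0, 0.5, 1.0, 1.5)]

grand_mean = np.mean(np.concatenate(groups))
effects = np.array([g.mean() - grand_mean for g in groups])   # estimated b_i
residuals = np.concatenate([g - g.mean() for g in groups])    # estimated e_ij

def one_random_effect_sample(rng):
    parts = []
    for g in groups:
        b = rng.choice(effects)                               # resampled group effect
        e = rng.choice(residuals, size=g.size, replace=True)  # resampled residuals
        parts.append(grand_mean + b + e)
    return np.concatenate(parts)

bootstrap_values = np.array([one_random_effect_sample(rng).mean()
                             for _ in range(1000)])
```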

# Bootstrap methods
After bootstrap sampling, you can use one of the **bootstrap methods** to construct a confidence interval from the 
acquired bootstrap distribution. 

### Percentile
The percentile method is the original bootstrap method. 
Even though multiple improvements have been made since, it is probably still the most widely used one.
The percentile estimation of confidence level $\alpha$ is obtained by taking the $\alpha$ quantile of the bootstrap 
distribution,

$$\hat{\theta}\_{perc}\[\alpha\] = \hat{\theta}^*_\alpha.$$

In all the implementations of methods that use quantiles, the "median-unbiased" version of quantile calculation is used.
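
Assuming `bootstrap_values` holds the bootstrap distribution (as in the sampling sketch above), the percentile endpoint reduces to a single quantile call with `numpy`'s median-unbiased method; a sketch:
```
import numpy as np

def percentile_ci(bootstrap_values, alpha):
    # alpha quantile of the bootstrap distribution, median-unbiased quantiles
    return np.quantile(bootstrap_values, alpha, method="median_unbiased")
```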

### Standard
The standard method, sometimes also called the normal method, assumes that the bootstrap distribution is normal and 
estimates standard deviation based on that. The estimations of confidence levels are obtained with

$$\hat{\theta}\_{std}\[\alpha\] = \hat{\theta} + \hat{\sigma} z_\alpha,$$
where $\hat{\theta}$ is the parameter value on the original sample, $\hat{\sigma}$ is the standard deviation estimate 
from the bootstrap distribution and $z_\alpha$ is the $\alpha$ quantile of the standard normal distribution.
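
A sketch of the standard endpoint, assuming `theta_hat` is the statistic on the original sample and `bootstrap_values` is the bootstrap distribution (`scipy` is used here only for the normal quantile):
```
import numpy as np
from scipy.stats import norm

def standard_ci(bootstrap_values, theta_hat, alpha):
    sigma_hat = np.std(bootstrap_values)   # std. deviation of the bootstrap distribution
    return theta_hat + sigma_hat * norm.ppf(alpha)
```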

### Basic
In the basic method, also sometimes called the reverse percentile method, the observed bootstrap distribution, 
$\theta^\*$, is replaced with $W^\* = \theta^* - \hat{\theta}$. This results in 
$$\hat{\theta}\_{bsc}\[\alpha\] = 2\hat{\theta} - \hat{\theta}^*\_{1 - \alpha}.$$
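
The basic endpoint, under the same assumptions as above, could be sketched as:
```
import numpy as np

def basic_ci(bootstrap_values, theta_hat, alpha):
    # 2 * theta_hat minus the (1 - alpha) quantile of the bootstrap distribution
    return 2 * theta_hat - np.quantile(bootstrap_values, 1 - alpha,
                                       method="median_unbiased")
```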

### BC
$BC$ applies an important correction to the percentile interval. It removes the bias that arises from $\hat{\theta}$ 
not being the median of the bootstrap distribution, and is thus better in asymmetric problems, 
where the percentile method can fail.
The confidence level is estimated by:

$$\hat{\theta}\_{BC}\[\alpha\] = \hat{\theta}^*\_{\alpha_{BC}}, $$

$$\alpha_{BC} = \Phi\big(2\Phi^{-1}(\hat{b}) + z_\alpha \big),$$

where $\Phi$ is the CDF of the standard normal distribution and $\hat{b}$ is the bias, calculated as the proportion of 
values from the bootstrap distribution that are lower than the parameter's value on the original sample, $\hat{\theta}$.
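
A sketch of the $BC$ endpoint under the same assumptions (the helper name `bc_ci` is ours, not the package's API):
```
import numpy as np
from scipy.stats import norm

def bc_ci(bootstrap_values, theta_hat, alpha):
    b_hat = np.mean(bootstrap_values < theta_hat)   # bias estimate
    alpha_bc = norm.cdf(2 * norm.ppf(b_hat) + norm.ppf(alpha))
    return np.quantile(bootstrap_values, alpha_bc, method="median_unbiased")
```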

### BC<sub>a</sub>
$BC_a$ does another correction to the $BC$ interval, by computing the acceleration constant $a$, which can account 
for the skewness of the bootstrap distribution.

This further adjusts the $\alpha_{BCa}$, which is then calculated by:

$$ \hat{\theta}\_{BCa}\[\alpha\] = \hat{\theta}^*\_{\alpha_{BCa}}$$

$$\alpha_{BCa} = \Phi\Big(\Phi^{-1}(\hat{b}) + \frac{\Phi^{-1}(\hat{b}) + z_\alpha}{1 - \hat{a} (\Phi^{-1}(\hat{b}) + z_\alpha)} \Big),$$
where $\hat{a}$ is the approximation of the acceleration constant, which can be calculated using the leave-one-out jackknife:

$$\hat{a} = \frac{1}{6}\frac{\sum U_i^3}{(\sum U_i^2)^\frac{3}{2}} $$
$$U_i = (n-1)(\hat{\theta}\_. - \hat{\theta}\_{(i)}),$$
where $\hat{\theta}\_{(i)}$ is the estimation of $\theta$ without the $i$-th datapoint and $\hat{\theta}\_.$ is the mean 
of all $\hat{\theta}_{(i)}$.
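
Putting the pieces together, here is a sketch of the $BC_a$ endpoint that also runs the leave-one-out jackknife for $\hat{a}$ (again an illustration, not the package's API; `sample` is the original `numpy` array and `statistic` any callable mapping a sample to a scalar):
```
import numpy as np
from scipy.stats import norm

def bca_ci(sample, bootstrap_values, statistic, alpha):
    theta_hat = statistic(sample)
    b_hat = np.mean(bootstrap_values < theta_hat)     # bias estimate
    # leave-one-out jackknife for the acceleration constant
    n = sample.size
    jack = np.array([statistic(np.delete(sample, i)) for i in range(n)])
    u = (n - 1) * (jack.mean() - jack)                # U_i values
    a_hat = np.sum(u**3) / (6 * np.sum(u**2) ** 1.5)
    z_b, z_a = norm.ppf(b_hat), norm.ppf(alpha)
    alpha_bca = norm.cdf(z_b + (z_b + z_a) / (1 - a_hat * (z_b + z_a)))
    return np.quantile(bootstrap_values, alpha_bca, method="median_unbiased")
```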

### Smoothed
The smoothed method replaces bootstrap distribution with a smoothed version of it ($\Theta^*$), by adding random noise, 
with a normal kernel centered on 0. 
The kernel's size is determined by a rule of thumb width selection: 
$h = 0.9 \min \big( \sigma^\*, \frac{iqr}{1.34} \big),$
where $iqr$ is the inter-quartile range of bootstrap distribution, the difference between its first and third quartile.

The estimation of the confidence level is then obtained by taking the $\alpha$ quantile of the smoothed distribution:

$$\hat{\theta}\_{smooth}\[\alpha\] = \hat{\Theta}^\*\_\alpha.$$
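
A sketch of the smoothed endpoint, adding normal noise with the rule-of-thumb width $h$ from above (illustrative only):
```
import numpy as np

def smoothed_ci(bootstrap_values, alpha, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    q25, q75 = np.percentile(bootstrap_values, [25, 75])
    h = 0.9 * min(np.std(bootstrap_values), (q75 - q25) / 1.34)  # kernel width
    smoothed = bootstrap_values + rng.normal(0, h, size=bootstrap_values.size)
    return np.quantile(smoothed, alpha, method="median_unbiased")
```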

### Studentized
The studentized or bootstrap-t method, generalizes the Student's t method, using the distribution of 
$T = \dfrac{\hat{\theta} - \theta}{\hat{\sigma}}$ to estimate the confidence level $\alpha$.
It is computed by
$$\hat{\theta}\_{t}\[\alpha\] = \hat{\theta} - \hat{\sigma} T\_{1-\alpha},$$
where $\hat{\theta}$ is the parameter value on the original sample and $\hat{\sigma}$ is the standard deviation estimate 
from the bootstrap distribution.
Since the distribution of $T$ is not known, its percentiles are approximated from the bootstrap distribution.
That is done by defining $T^\* = \dfrac{\hat{\theta}^\* - \hat\theta}{\hat{\sigma}^\*}$, where $\hat{\theta}^\*$ is the 
parameter's value on each bootstrap sample, and $\hat{\sigma}^\*$ is obtained by running another, inner bootstrap sampling 
on each of the outer samples. There are other possible ways to acquire $\hat{\sigma}^\*$, but we chose this approach because 
it is very general and fully automatic.
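
A sketch of the bootstrap-t endpoint with an inner bootstrap for $\hat{\sigma}^\*$; it is intentionally simple and slow, and is not the package's implementation (`B`, `inner_B` and the helper name are ours):
```
import numpy as np

def studentized_ci(sample, statistic, alpha, B=1000, inner_B=50, seed=0):
    rng = np.random.default_rng(seed)
    n = sample.size
    theta_hat = statistic(sample)
    theta_star = np.empty(B)
    t_star = np.empty(B)
    for b in range(B):
        boot = sample[rng.integers(0, n, size=n)]            # outer bootstrap sample
        theta_star[b] = statistic(boot)
        inner = np.array([statistic(boot[rng.integers(0, n, size=n)])
                          for _ in range(inner_B)])          # inner bootstrap for sigma*
        t_star[b] = (theta_star[b] - theta_hat) / inner.std()
    sigma_hat = theta_star.std()                             # sigma-hat from the outer distribution
    return theta_hat - sigma_hat * np.quantile(t_star, 1 - alpha,
                                               method="median_unbiased")
```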

### Double
The double bootstrap is made to adjust bias from a single bootstrap iteration with another layer of bootstraps.
The bootstrap procedure is repeated on each of the bootstrap samples to calculate the bias - the percentage of times 
that the parameter on its inner bootstrap sample is smaller from the original parameter's value. 
We want to take such a limit that $P \{\hat{\theta} \in (-\infty, \hat{\theta}\_{double}\[\alpha\])\} = \alpha$, 
which is why we need to select the $\alpha$-th quantile of biases $\hat{b}^*$ for the adjusted level $\alpha_{double}$. 
This leads to:

$$\hat{\theta}\_{double}\[\alpha\] = \hat{\theta}^\*\_{\alpha\_{double}}$$ 
$$\alpha_{double} = \hat{b}^\*_\alpha.$$
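
A sketch of the double bootstrap endpoint following the description above (illustrative only; the nested loop is the expensive part):
```
import numpy as np

def double_ci(sample, statistic, alpha, B=1000, inner_B=100, seed=0):
    rng = np.random.default_rng(seed)
    n = sample.size
    theta_hat = statistic(sample)
    theta_star = np.empty(B)
    b_star = np.empty(B)
    for j in range(B):
        boot = sample[rng.integers(0, n, size=n)]            # outer bootstrap sample
        theta_star[j] = statistic(boot)
        inner = np.array([statistic(boot[rng.integers(0, n, size=n)])
                          for _ in range(inner_B)])
        b_star[j] = np.mean(inner < theta_hat)               # bias of this outer sample
    alpha_double = np.quantile(b_star, alpha, method="median_unbiased")
    return np.quantile(theta_star, alpha_double, method="median_unbiased")
```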

# Parameters
Here we describe the possible parameter values on different steps and present some additional examples.

## Initialization
First, a `Bootstrap` instance needs to be initialized. The following parameters can be set:
- `data`: a `numpy` array containing the values of the sample of interest.
- `statistic`: a callable that accepts arrays of the same structure as the `data` parameter and returns a single 
  value.
- `use_jit`: a bool selecting whether to use the `numba` library to speed up the sampling. The default value is 
  `False`. Change it to `True` if you use a large number of bootstrap samples and want to speed up the calculations.
- `group_indices`: a parameter given only for hierarchical data. A list of lists that tells us how the data points in
  the `data` parameter group together. For example, the indices \[\[\[0, 1], \[2]], \[\[3]]] together with the array \[0, 1, 2, 3] tell 
  us we have one group containing a group with points 0 and 1 and a group with point 2, and another group containing 
  a group with point 3.

You initialize an instance that will estimate the distribution of the mean statistic on a given sample from a normal 
distribution with the following code:
```
import bootstrap_ci as boot
import numpy as np

np.random.seed(0)
sample = np.random.normal(0, 1, size=1000)

bootstrap = boot.Bootstrap(sample, statistic=np.mean)
```

## Sampling

The method `sample` draws bootstrap samples from the original dataset. The following parameters can be used:
- `nr_bootstrap_samples`: how many bootstrap samples to draw, i.e. the size of the bootstrap distribution. The default 
  value is 1000, but we propose taking the largest feasible number to get the best results.
- `seed`: random seed. The default value `None` skips setting the seed.
- `sampling`: the type of sampling; it is possible to choose between *nonparametric* and *hierarchical* sampling, and the 
  default is *nonparametric*.
- `sampling_args`: sampling arguments, used only when doing hierarchical sampling. They should be passed as a dictionary
  that includes the key *method*. The implemented methods to choose from are *cases* and *random-effect*. 
  For *cases* sampling, a *strategy* also needs to be defined: an array of zeros and ones of the same length as the number 
  of levels in the dataset, telling us on which levels we sample with replacement and on which without.

For example, you can use non-parametric sampling to get a bootstrap distribution of size 1000 on the `bootstrap`
instance from above:
```
bootstrap.sample(nr_bootstrap_samples=1000, seed=0)
```

The values of the bootstrap distribution are now saved in the `bootstrap.bootstrap_values` attribute.

### Hierarchical sampling
If you are working with hierarchical data, you need to specify the group structure together with the given sample.
There are two different hierarchical sampling methods available to choose from, *random-effect* and *cases* sampling.
Here is an example of *cases* sampling where we sample with replacement on all but the last level:
```
# sample that is grouped like this: [[[0.1, -0.2], [1, -0.5]], [[10, 11]]]
sample = np.array([0.1, -0.2, 1, -0.5, 10, 11])
indices = [[[0, 1], [2, 3]], [[4, 5]]]

hierarchical_bootstrap = boot.Bootstrap(sample, statistic=np.mean, group_indices=indices)

samp_args = {'method': 'cases', 'strategy': [1, 1, 0]}
hierarchical_bootstrap.sample(nr_bootstrap_samples=1000, sampling='hierarchical', sampling_args=samp_args)
```

## Confidence intervals
After the bootstrap distribution is obtained, you can produce the confidence intervals by calling the method `ci`.
The following parameters can be set:
- `coverages`: an array of coverage levels for which the values need to be computed. In the case of two-sided
  intervals (`side`=*two*) it is a single float.
- `side`: it is possible to choose between *one*- and *two*-sided confidence intervals. One-sided returns
  the left-sided confidence interval threshold x, representing a CI of the shape (-inf, x).
- `method`: which method to use for construction of the confidence intervals. It is possible to select from
  *percentile*, *basic*, *bca*, *bc*, *standard*, *smoothed*, *double* and *studentized*.
- `nr_bootstrap_samples`: the number of bootstrap samples. The default value `None` should be used if the sampling was done 
  before as a separate step and you don't want to repeat it. If the sampling was not done, you should specify the 
  number of samples.
- `seed`: random seed. The default value `None` skips setting the seed.
- `sampling`: the type of sampling; it is possible to choose between *nonparametric* and *hierarchical*. 
  Passed to the method `sample`.
- `sampling_args`: additional arguments used with hierarchical sampling, passed to the method `sample`.
- `quantile_type`: the type of quantiles; it is possible to select from the methods used in `numpy`'s quantile function.

The method returns an array of threshold values for the confidence intervals at the corresponding coverage levels.
An example of getting the one-sided 95% and 97.5% confidence intervals from the `bootstrap` instance from above, 
where sampling was already done:
```
bootstrap.ci(coverages=[0.95, 0.975], method='bca')

>>> array([0.00272853, 0.0119834 ])
```

## Jackknife after bootstrap
After bootstrap sampling you can diagnose the sampling process with the use of jackknife after bootstrap method, 
that draws a plot showing the influence each data point has on the statistic value.
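
A rough standalone sketch of the idea — comparing bootstrap quantiles computed only from the samples that exclude each data point — could look like this (plain `numpy` and `matplotlib`, not the package's plotting API):
```
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sample = rng.normal(0, 1, size=200)
n, B = sample.size, 2000

idx = rng.integers(0, n, size=(B, n))        # keep the resampled indices
values = sample[idx].mean(axis=1)            # bootstrap distribution of the mean

quantiles = [0.05, 0.5, 0.95]
# for each data point i, quantiles of the bootstrap values from samples without i
per_point = np.array([np.quantile(values[~(idx == i).any(axis=1)], quantiles)
                      for i in range(n)])

for k, q in enumerate(quantiles):
    plt.scatter(np.arange(n), per_point[:, k], s=5, label=f"quantile {q}")
plt.xlabel("left-out data point")
plt.ylabel("bootstrap quantile without that point")
plt.legend()
plt.show()
```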

# Suggestions on which method and parameters to use
For the general use case we propose the **double** bootstrap method. For confidence intervals of
extreme percentiles, we propose the **standard** bootstrap method.

We suggest always using the largest number of bootstrap samples that is feasible for your sample size and statistic.
If you need to speed up the calculations, lower the number of bootstrap samples from the default value of 1000.

Go to the repository [Bootstrap-CI-analysis](https://github.com/zrimseku/Bootstrap-CI-analysis) for more detailed 
information. It includes a detailed study of where bootstrap methods can be used and which one is suggested in 
each use case.

            
