dddex: Data-Driven Density Estimation x
================

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Install

``` sh
pip install dddex
```

## What is dddex?

The package name `dddex` stands for *Data-Driven Density Estimation x*.
The package implements new approaches for estimating conditional
densities without any parametric assumptions about the underlying
distribution. All of these approaches take an arbitrary point forecaster
as input and turn it into a new object that outputs an estimation of the
conditional density based on the point predictions of the original point
forecaster. The *x* in the name emphasizes that the approaches can be
applied to any point forecaster. In this package, several approaches are
implemented via the following classes:

- [`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex)
- [`LevelSetKDEx_kNN`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex_knn)
- [`LevelSetKDEx_NN`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex_nn)
- [`LevelSetKDEx_multivariate`](https://kaiguender.github.io/dddex/levelsetkdex_multivariate.html#levelsetkdex_multivariate)

In the following we work exclusively with the class
[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex),
because the most important methods are nearly identical across all
classes. All models can be run with only a few lines of code and are
designed to be compatible with the well-known *Scikit-Learn* framework.

## How to use: LevelSetKDEx

To ensure compatibility with Scikit-Learn, the class
[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex)
implements the usual `fit` and `predict` methods. Since the purpose of
these classes is to estimate conditional densities, the `predict` method
outputs p-quantiles rather than point forecasts.

Our choice of class names is meant to be indicative of the underlying
models: the name *LevelSet* stems from the fact that the methods operate
under the assumption that point forecasts generated by the same point
forecaster can be interpreted as a similarity measure between samples.
*KDE* is short for *Kernel Density Estimator* and the *x* yet again
signals that the classes can be initialized with any point forecasting
model.
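
To make this concrete, below is a conceptual sketch of the level-set
idea. It is a deliberate simplification rather than the package's exact
algorithm, and the function `level_set_density` is purely illustrative:
training samples whose point forecasts are close to the test sample's
forecast are treated as similar, and the empirical distribution of their
observed outcomes serves as the density estimate.

``` python
import numpy as np

def level_set_density(point_forecaster, X_train, y_train, x_test, bin_size):
    # Point forecasts act as the similarity measure between samples.
    y_pred_train = point_forecaster.predict(X_train)
    y_pred_test = point_forecaster.predict(x_test.reshape(1, -1))[0]

    # Select the bin_size training samples with the closest point forecasts.
    nearest = np.argsort(np.abs(y_pred_train - y_pred_test))[:bin_size]

    # The empirical distribution of their outcomes estimates the density.
    values, counts = np.unique(y_train[nearest], return_counts = True)
    return counts / bin_size, values  # (probabilities, demand values)
```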

In the following, we demonstrate how to use the class
[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex)
to compute estimations of the conditional densities and quantiles for
the [Yaz Data
Set](https://opimwue.github.io/ddop/modules/auto_generated/ddop.datasets.load_yaz.html#ddop.datasets.load_yaz).
As explained above,
[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex)
is always based on a point forecaster specified by the user. In our
example we use the well-known `LGBMRegressor` from *LightGBM* as the
underlying point predictor.

``` python
from dddex.levelSetKDEx_univariate import LevelSetKDEx, LevelSetKDEx_kNN, LevelSetKDEx_NN
from dddex.levelSetKDEx_multivariate import LevelSetKDEx_multivariate

from dddex.loadData import loadDataYaz
from lightgbm import LGBMRegressor
```

``` python
dataYaz, XTrain, yTrain, XTest, yTest = loadDataYaz(returnXY = True)
LGBM = LGBMRegressor(n_jobs = 1)
```

There are three parameters for
[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex):

- **estimator**: A point forecasting model that must have a `predict`
  method.
- **binSize**: The number of training samples considered to compute the
  conditional densities (for more details, see *To be written*).
- **weightsByDistance**: If *False*, all considered training samples are
  weighted equally. If *True*, training samples are weighted by the
  inverse of the distance of their respective point forecast to the
  point forecast of the test sample at hand.

``` python
LSKDEx = LevelSetKDEx(estimator = LGBM, 
                      binSize = 100,
                      weightsByDistance = False)
```

There is no need to run `fit` on the point forecasting model before
initializing *LevelSetKDEx*: the `fit` method of
[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex)
automatically checks whether the provided model has already been fitted
and, if not, runs the point forecaster's `fit` method itself.
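
A minimal sketch of what such a fit-if-needed check can look like,
assuming Scikit-Learn's fitted-state conventions (dddex's internal check
may differ):

``` python
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

def ensure_fitted(estimator, X, y):
    try:
        check_is_fitted(estimator)  # raises NotFittedError if not yet fitted
    except NotFittedError:
        estimator.fit(X, y)         # fit the point forecaster only if needed
    return estimator
```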

It should be noted that running `fit` for the *LevelSetKDEx* approaches
takes exceptionally little time even for datasets with $>10^6$ samples
(provided, of course, that the underlying point forecasting model has
been fitted beforehand).

``` python
LSKDEx.fit(X = XTrain, y = yTrain)
```

To compute conditional densities for test samples, we simply run the
`getWeights` method.

``` python
conditionalDensities = LSKDEx.getWeights(X = XTest,
                                         outputType = 'summarized')

print(f"probabilities: {conditionalDensities[0][0]}")
print(f"demand values: {conditionalDensities[0][1]}")
```

    probabilities: [0.49 0.01 0.21 0.01 0.16 0.07 0.04 0.01]
    demand values: [0.         0.01075269 0.04       0.04878049 0.08       0.12
     0.16       0.2       ]

Here, `conditionalDensities` is a list whose elements correspond to the
samples specified via `X`. Every element is a tuple whose first entry
contains probabilities and whose second entry contains the corresponding
demand values (side note: the demand values have been scaled to lie in
$[0, 1]$). In the above example we can see, for instance, that our model
estimates the demand of the first test sample to be 0 with a probability
of 49%.
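
These (probabilities, demand values) tuples can be used directly for
downstream computations. For instance, the conditional mean demand of
the first test sample is simply the probability-weighted average of the
demand values:

``` python
import numpy as np

probs, values = conditionalDensities[0]
print(f"expected demand: {np.dot(probs, values):.4f}")
```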

As the input argument *outputType* of `getWeights` suggests, the
conditional density estimations can be output in various forms. All in
all, there are currently 5 output types specifying what the output for
each sample looks like (see the sketch after this list for how two of
them relate):

- **all**: An array with the same length as the number of training
  samples. Each entry represents the probability of each training
  sample.
- **onlyPositiveWeights**: A tuple. The first element of the tuple
  contains the probabilities and the second one the indices of the
  corresponding training samples. Only probabilities greater than zero
  are returned. Note: This is the most memory- and computationally
  efficient output type.
- **summarized**: A tuple. The first element of the tuple represents the
  probabilities and the second one the corresponding value of `yTrain`.
  The probabilities corresponding to identical values of `yTrain` are
  aggregated.
- **cumulativeDistribution**: A tuple. The first element of the tuple
  contains the cumulated probabilities and the second one the
  corresponding values of `yTrain`.
- **cumulativeDistributionSummarized**: A tuple. The first element of
  the tuple contains the cumulated probabilities and the second one the
  corresponding values of `yTrain`. The probabilities corresponding to
  identical values of `yTrain` are aggregated.
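
The sketch below shows how *summarized* can be derived from
*onlyPositiveWeights* by aggregating probabilities over identical
`yTrain` values. It is illustrative rather than dddex's internal code
and assumes that `yTrain` is on the same scale as the summarized output:

``` python
import numpy as np

# 'onlyPositiveWeights' returns (probabilities, training indices) per sample.
weightsList = LSKDEx.getWeights(X = XTest, outputType = 'onlyPositiveWeights')
probs, indices = weightsList[0]

# Summing probabilities over identical yTrain values yields 'summarized'.
yValues = np.asarray(yTrain)[indices]
uniqueValues, inverse = np.unique(yValues, return_inverse = True)
aggProbs = np.bincount(inverse, weights = probs)
print(aggProbs, uniqueValues)
```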

For example, by setting
`outputType = 'cumulativeDistributionSummarized'` we can compute an
estimation of the conditional cumulative distribution function for each
sample. Below, we can see that our model predicts the demand of the
first sample to be lower than or equal to 0.16 with a probability of
99%.

``` python
cumulativeDistributions = LSKDEx.getWeights(X = XTest,
                                            outputType = 'cumulativeDistributionSummarized')

print(f"cumulated probabilities: {cumulativeDistributions[0][0]}")
print(f"demand values: {cumulativeDistributions[0][1]}")
```

    cumulated probabilities: [0.49 0.5  0.71 0.72 0.88 0.95 0.99 1.  ]
    demand values: [0.         0.01075269 0.04       0.04878049 0.08       0.12
     0.16       0.2       ]

We can also compute estimations of quantiles using the `predict` method.
The parameter *probs* specifies the quantiles we want to predict.

``` python
predRes = LSKDEx.predict(X = XTest,
                         outputAsDf = True, 
                         probs = [0.1, 0.5, 0.75, 0.99])
print(predRes.iloc[0:6, :].to_markdown())
```

    |    |       0.1 |       0.5 |   0.75 |   0.99 |
    |---:|----------:|----------:|-------:|-------:|
    |  0 | 0         | 0.0107527 |   0.08 |   0.16 |
    |  1 | 0         | 0.08      |   0.12 |   0.2  |
    |  2 | 0.04      | 0.0967742 |   0.12 |   0.24 |
    |  3 | 0.056338  | 0.12      |   0.16 |   0.28 |
    |  4 | 0.04      | 0.0967742 |   0.12 |   0.24 |
    |  5 | 0.0666667 | 0.16      |   0.2  |   0.32 |
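
As a cross-check, a p-quantile can also be read off the cumulative
distribution computed above: it is the smallest demand value whose
cumulated probability reaches p. The snippet below is illustrative only;
`predict` is the intended interface.

``` python
import numpy as np

cumProbs, values = cumulativeDistributions[0]
p = 0.75

# First index at which the cumulated probability reaches p.
print(values[np.searchsorted(cumProbs, p)])  # 0.08, matching the table above
```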

## How to tune the binSize parameter of LevelSetKDEx

`dddex` also comes with the class
[`QuantileCrossValidation`](https://kaiguender.github.io/dddex/crossvalidation.html#quantilecrossvalidation),
which allows quantile predictors to be tuned efficiently. The class is
designed in a very similar fashion to the cross-validation classes of
Scikit-Learn. As such,
[`QuantileCrossValidation`](https://kaiguender.github.io/dddex/crossvalidation.html#quantilecrossvalidation)
is first initialized with all the settings for the cross-validation:

- **quantileEstimator**: A model that must have `set_params`, `fit`
  and `predict` methods. Additionally, the `predict` method must (!)
  have an argument called `probs` that specifies which quantiles to
  predict.
- **cvFolds**: An iterable yielding (train, test) splits as arrays of
  indices.
- **parameterGrid**: The candidate parameter values to evaluate. Must be
  a dict.
- **probs**: The probabilities for which quantiles are computed and
  evaluated.
- **refitPerProb**: If *True*, for every probability a fitted copy of
  *quantileEstimator* with the best parameter setting for the respective
  p-quantile is stored in the attribute *bestEstimator_perProb*.
- **n_jobs**: How many cross-validation split results to compute in
  parallel.

After specifying the settings, `fit` has to be called to compute the
results of the cross-validation. The performance of every parameter
setting is evaluated by computing, for every quantile, the relative
reduction of the pinball loss in comparison to the quantile estimations
generated by *SAA* (Sample Average Approximation).
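
For reference, a minimal sketch of this metric (illustrative, not
dddex's internal code): the pinball loss of a quantile prediction is
divided by the pinball loss of the SAA benchmark, which simply uses the
unconditional empirical p-quantile of the training demands. Values below
1 mean the model beats SAA.

``` python
import numpy as np

def pinball_loss(y_true, y_quantile, p):
    # Asymmetric loss: underestimation is penalized with weight p,
    # overestimation with weight 1 - p.
    diff = y_true - y_quantile
    return np.mean(np.maximum(p * diff, (p - 1) * diff))

def relative_to_saa(y_true, y_quantile, y_train, p):
    saa_quantile = np.quantile(y_train, p)  # SAA ignores the features entirely
    return pinball_loss(y_true, y_quantile, p) / pinball_loss(y_true, saa_quantile, p)
```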

``` python
from dddex.crossValidation import groupedTimeSeriesSplit, QuantileCrossValidation

dataTrain = dataYaz[dataYaz['label'] == 'train']
cvFolds = groupedTimeSeriesSplit(data = dataTrain, 
                                 kFolds = 3,
                                 testLength = 28,
                                 groupFeature = 'id',
                                 timeFeature = 'dayIndex')

LSKDEx = LevelSetKDEx(estimator = LGBM)
paramGrid = {'binSize': [20, 100, 400, 1000],
             'weightsByDistance': [True, False]}

CV = QuantileCrossValidation(quantileEstimator = LSKDEx,
                             parameterGrid = paramGrid,
                             cvFolds = cvFolds,
                             probs = [0.01, 0.25, 0.5, 0.75, 0.99],
                             refitPerProb = True,
                             n_jobs = 3)

CV.fit(X = XTrain, y = yTrain)
```

The best value for *binSize* can either be computed for every quantile
separately or for all quantiles at once by computing the average cost
reduction over all quantiles.

``` python
print(f"Best binSize over all quantiles: {CV.bestParams}")
CV.bestParams_perProb
```

    Best binSize over all quantiles: {'binSize': 1000, 'weightsByDistance': False}

    {0.01: {'binSize': 1000, 'weightsByDistance': False},
     0.25: {'binSize': 20, 'weightsByDistance': False},
     0.5: {'binSize': 100, 'weightsByDistance': False},
     0.75: {'binSize': 100, 'weightsByDistance': False},
     0.99: {'binSize': 1000, 'weightsByDistance': False}}

The exact results are also stored as attributes. The easiest way to
view the results is via `cvResults`, which depicts the average results
over all cross-validation folds:

``` python
print(CV.cvResults.to_markdown())
```

    |               |    0.01 |     0.25 |      0.5 |     0.75 |    0.99 |
    |:--------------|--------:|---------:|---------:|---------:|--------:|
    | (20, True)    | 3.79553 | 0.946626 | 0.89631  | 0.974659 | 2.98365 |
    | (20, False)   | 3.23956 | 0.849528 | 0.808262 | 0.854069 | 2.46195 |
    | (100, True)   | 3.11384 | 0.92145  | 0.871266 | 0.922703 | 2.22249 |
    | (100, False)  | 1.65191 | 0.857026 | 0.803632 | 0.835323 | 1.81003 |
    | (400, True)   | 2.57563 | 0.908214 | 0.851471 | 0.900311 | 2.03445 |
    | (400, False)  | 1.64183 | 0.860281 | 0.812806 | 0.837641 | 1.57534 |
    | (1000, True)  | 2.34575 | 0.893628 | 0.843721 | 0.888143 | 1.82368 |
    | (1000, False) | 1.54641 | 0.869606 | 0.854369 | 0.88065  | 1.52644 |

The attentive reader will certainly notice that values greater than 1
imply that the respective model performed worse than SAA. This is, of
course, simply due to the fact that we didn't tune the hyperparameters
of the underlying `LGBMRegressor` point predictor and instead used the
default parameter values. The
[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex)
classes are able to produce highly accurate density estimations, but
they obviously cannot turn a terrible point predictor into a highly
performant conditional density estimator. The performance of the
underlying point predictor and the constructed
[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex)
model go hand in hand.

We can also access the results for every fold separately via
`cvResults_raw`, which is a list with one entry per fold:

``` python
CV.cvResults_raw
```

    [                               0.01      0.25      0.50      0.75      0.99
     binSize weightsByDistance                                                  
     20      True               3.730363  0.977152  0.949944  1.093261  4.590650
             False              3.068598  0.854633  0.855041  0.953362  3.663885
     100     True               3.359961  0.945510  0.922778  1.027477  3.475501
             False              1.626054  0.871327  0.833379  0.907911  2.591117
     400     True               2.663854  0.928036  0.907505  0.995238  3.149022
             False              1.732673  0.860440  0.828015  0.890643  2.190292
     1000    True               2.463221  0.914308  0.897978  0.979345  2.753553
             False              1.464534  0.873277  0.858563  0.891858  1.830334,
                                    0.01      0.25      0.50      0.75      0.99
     binSize weightsByDistance                                                  
     20      True               4.725018  0.958236  0.891472  0.914408  2.253200
             False              4.157297  0.841141  0.795929  0.830544  1.883320
     100     True               3.687090  0.933531  0.876655  0.875718  1.551640
             False              1.752709  0.862970  0.812126  0.819613  1.416013
     400     True               3.061210  0.920190  0.851794  0.873496  1.464974
             False              2.085622  0.887758  0.839370  0.859290  1.296445
     1000    True               2.784076  0.903801  0.840009  0.856845  1.381658
             False              1.767468  0.869484  0.860893  0.876293  1.464460,
                                    0.01      0.25      0.50      0.75      0.99
     binSize weightsByDistance                                                  
     20      True               2.931208  0.904490  0.847513  0.916307  2.107091
             False              2.492787  0.852811  0.773815  0.778301  1.838642
     100     True               2.294471  0.885308  0.814365  0.864913  1.640339
             False              1.576956  0.836781  0.765390  0.778446  1.422947
     400     True               2.001828  0.876417  0.795114  0.832198  1.489340
             False              1.107203  0.832645  0.771034  0.762992  1.239275
     1000    True               1.789944  0.862776  0.793177  0.828237  1.335825
             False              1.407221  0.866058  0.843651  0.873799  1.284521]
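
The averaged table shown above can be recovered from these raw per-fold
results. A quick sketch, assuming each list entry is a DataFrame indexed
by the parameter grid (as the output above suggests):

``` python
avgResults = sum(CV.cvResults_raw) / len(CV.cvResults_raw)
print(avgResults.to_markdown())  # should match CV.cvResults
```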

The models with the best parameter settings are automatically fitted
while running `fit`. Since we set `refitPerProb = True`, the attribute
`bestEstimator_perProb` is a dictionary whose keys are the probabilities
specified via the parameter *probs*.

``` python
LSKDEx_best99 = CV.bestEstimator_perProb[0.99]
predRes = LSKDEx_best99.predict(X = XTest,
                                probs = 0.99)
print(predRes.iloc[0:6, ].to_markdown())
```

    |    |   0.99 |
    |---:|-------:|
    |  0 |   0.32 |
    |  1 |   0.32 |
    |  2 |   0.32 |
    |  3 |   0.32 |
    |  4 |   0.32 |
    |  5 |   0.32 |

## Benchmarks: Random Forest wSAA

The `dddex` package also contains useful non-parametric benchmark models
for comparing the performance of the
[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex)
models to other state-of-the-art non-parametric models capable of
generating conditional density estimations. In a [meta-analysis
conducted by S. Butler et
al.](https://ml-eval.github.io/assets/pdf/ICLR22_Workshop_ML_Eval_DDNV.pdf)
the most performant model was found to be [weighted sample average
approximation
(wSAA)](https://pubsonline.informs.org/doi/10.1287/mnsc.2018.3253) based
on *Random Forest*. This model has been implemented in a Scikit-Learn
fashion as well.

``` python
from dddex.wSAA import RandomForestWSAA
RF = RandomForestWSAA()
```

[`RandomForestWSAA`](https://kaiguender.github.io/dddex/wsaa.html#randomforestwsaa)
is a class derived from Scikit-Learn's original `RandomForestRegressor`
that has been extended to generate conditional density estimations in
the manner described by Bertsimas et al. in their paper [*From
Predictive to Prescriptive
Analytics*](https://pubsonline.informs.org/doi/10.1287/mnsc.2018.3253).
The *Random Forest* model is fitted in exactly the same way as the
original *RandomForestRegressor*:

``` python
RF.fit(X = XTrain, y = yTrain)
```
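
To illustrate what happens under the hood, here is a sketch of the wSAA
weighting idea from Bertsimas et al., written against Scikit-Learn's
`apply` method (which returns leaf indices per tree). The function
`wsaa_weights` is illustrative and not dddex's actual implementation:

``` python
import numpy as np

def wsaa_weights(forest, X_train, x_test):
    leavesTrain = forest.apply(X_train)               # (nTrain, nTrees) leaf ids
    leavesTest = forest.apply(x_test.reshape(1, -1))  # (1, nTrees)
    sameLeaf = leavesTrain == leavesTest              # broadcasts over trees

    # Per tree: uniform weight over the training samples in the test
    # sample's leaf; averaging over trees yields weights summing to 1.
    return (sameLeaf / sameLeaf.sum(axis = 0)).mean(axis = 1)
```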

As with the
[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex)
and
[`LevelSetKDEx_kNN`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex_knn)
classes, the methods `getWeights` and `predict` are implemented to
compute conditional density estimations and quantiles. The output format
is the same as before.

``` python
conditionalDensities = RF.getWeights(X = XTest,
                                     outputType = 'summarized')

print(f"probabilities: {conditionalDensities[0][0]}")
print(f"demand values: {conditionalDensities[0][1]}")
```

    probabilities: [0.08334138 0.17368071 0.2987331  0.10053752 0.1893534  0.09121861
     0.04362338 0.0145119  0.005     ]
    demand values: [0.   0.04 0.08 0.12 0.16 0.2  0.24 0.28 0.32]

``` python
predRes = RF.predict(X = XTest,
                     probs = [0.01, 0.5, 0.99],
                     outputAsDf = True)
print(predRes.iloc[0:6, :].to_markdown())
```

    |    |   0.01 |   0.5 |   0.99 |
    |---:|-------:|------:|-------:|
    |  0 |      0 |  0.08 |   0.28 |
    |  1 |      0 |  0.12 |   0.32 |
    |  2 |      0 |  0.12 |   0.32 |
    |  3 |      0 |  0.12 |   0.32 |
    |  4 |      0 |  0.12 |   0.32 |
    |  5 |      0 |  0.2  |   0.4  |

The original `predict` method of the `RandomForestRegressor` has been
renamed to `pointPredict`:

``` python
RF.pointPredict(X = XTest)[0:6]
```

    array([0.1064    , 0.1184    , 0.1324    , 0.1324    , 0.1364    ,
           0.18892683])

            
