dsblocks 0.0.15

- Home page: https://github.com/Jaume-JCI/ds-blocks
- Summary: DS Blocks
- Author: Jaume Amores
- License: Apache Software License 2.0
- Requires Python: >=3.7,<=3.12
- Keywords: nbdev, jupyter, notebook, python
- Upload time: 2022-12-02 01:25:24

DS Blocks
=========

<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

`DS Blocks` makes it easy to write highly modular and compact data
science pipelines. It is based on a generalization of the well-known
scikit-learn pipeline design, enriching and extending it in multiple
ways. By doing so, `DS Blocks` makes it possible to express the ML
solution in terms of independent building blocks that can be easily
moved around and reused to create different solutions. At the same time,
`DS Blocks` makes it possible to write concise code by automatically
taking care of common steps that are needed when building a data science
pipeline, such as checkpointing, logging, profiling, data conversion,
and more, resulting in a significant reduction of boiler-plate code.

`DS Blocks` also provides a number of features that facilitate working
with notebooks, such as:

- Integration with [nbdev](https://nbdev.fast.ai/) and extension of its
  functionalities. `nbdev` is a powerful framework that streamlines
  development on notebooks using best software practices. `DS Blocks`
  extends `nbdev` by making it possible to convert notebooks into a test
  suite for external engines such as `pytest`. It also allows conveniently
  freezing and unfreezing notebook test cells, to avoid recomputing the
  tests every time we need to restart and re-run the notebook.

- `DS Blocks` provides several magic functions that facilitate
  reproducibility. It also provides convenient decorators for converting
  functions into pipeline components and reducing boiler-plate.

- In addition to a powerful pipeline design (see below), `DS Blocks`
  provides out-of-the-box components frequently used in Data Science,
  such as for cross-validation and model-selection, building ensembles,
  working with time-series, and more.

## Features

The following is a selection of some of the benefits provided by using
`DS Blocks` pipelines:

- Automate common steps that are usually present in ML code, including
  caching / loading of intermediate results across the entire pipeline,
  logging, profiling, conversion of data to appropriate format, and
  more.

- Easy debugging of the entire pipeline, both during the current run and
  post-mortem, facilitating the investigation of issues that occurred
  during past runs.

- Make it possible to easily show statistics and other types of
  information about the output of each component in the pipeline, print
  a summary of the pipeline, plot a diagram of the components, and show
  the dimensionality of the output provided by each component.

- Make it possible to use any data type in the communication between
  components. This is done through data conversion layers that
  facilitate reusing the components across different pipelines,
  regardless of the data format used by the rest of the components. This
  functionality allows, for instance, consistent use of
  DataFrames across the whole pipeline: when the input is a DataFrame,
  the output will be a DataFrame as well, and when the input is a numpy
  array the output is a numpy array. This is just an example; the
  proposed design makes it easy to support many other use cases.

- Enable the use of sampling components that not only change the
  variables (or columns) but also change the number of observations (or
  rows), by either under-sampling or over-sampling, which is not
  supported by common pipelines such as the ones provided in
  scikit-learn.

- Integrated experiment tracking and hyper-parameter optimization.

- And many more!

## Comparison against other frameworks

`DS Blocks` provides functionalities that are also present in frameworks
such as [Metaflow](https://metaflow.org/),
[Kedro](https://kedro.readthedocs.io/),
[Ploomber](https://ploomber.io/), and others. In this section we briefly
comment on the differences with respect to these three frameworks, which are
among the most popular ones. An important difference with respect to
these frameworks is that, while our design allows building any kind of
Directed Acyclic Graph (DAG), we do not need to express the edges of
such a graph explicitly, reducing the corresponding boiler-plate. Another
difference is the use of a compact design loosely similar to
scikit-learn’s pipelines and estimators, which allows any ML solution to be
expressed concisely in a familiar syntax.

Apart from those differences, we comment here on more specific
differences with respect to each framework:

- The main difference with respect to frameworks such as `Kedro` is
  that we use a pure-code approach, avoiding the need to write separate
  config files that govern the behaviour of the pipeline.
- The main difference with respect to `Metaflow` is that `DS Blocks` allows
  keeping the original code without changes, extending its functionality
  by simply declaring sequences of the original functions and classes.
  While `Metaflow` also allows creating flows of original functions, it uses
  a more verbose approach to achieve this.
- The main difference with `Ploomber`, `Luigi`, and other frameworks is
  that our pipelines are constructed programmatically with pure Python,
  not by gluing together the inputs and outputs of applications that are
  run separately.

## Installation

DS Blocks is pip installable:

``` bash
pip install dsblocks
```

## Example usage

### Baseline problem

In the first problem, we will only use the `Sequential` class. Let us
import it, together with the numpy library.

``` python
import numpy as np
from dsblocks import Sequential
```

This first example is taken from
[Optuna](https://optuna.org/#code_examples)’s quadratic problem: find
the value of $X$ that minimizes:

$$(X-2)^2$$

We start by using a simple data vector as input: $$X=(0,1,2,3,4)^T$$

``` python
X = np.arange (5)
```

For the sake of this example, we decompose the aforementioned quadratic
equation into two simple functions: `subtract2` and `square`, and add a
third function `np.argmin` to find the value of `X` that minimizes this
equation. The three functions are then assembled in a `Sequential`
pipeline as follows:

``` python
def subtract2 (X): 
    return X-2
def square (X): 
    return X*X

pipeline = Sequential (subtract2, square, np.argmin)
```

The `Sequential` pipeline feeds the results from one function into the
next, the final one being `np.argmin`. In this toy example each function
performs a simple calculation, but in general they perform
time-consuming processes. After this, we obtain the result of this
pipeline by just calling it on the input data `X`:

``` python
idx_min = pipeline (X)
print (f'Value of X that minimizes the equation : {X[idx_min]}')
```

    Value of X that minimizes the equation : 2

Many times, the first step of the pipeline is to get the data from an
external source or storage. We now augment the pipeline by including a
new function, `get_data`, which runs as the first step. We also include
persistence and logging in the pipeline by passing `verbose=2` and
`path_results='square_problem'`:

``` python
def get_data ():
    return np.arange (5)

pipeline = Sequential (get_data, subtract2, square, np.argmin,
                       verbose=2, path_results='square_problem')
idx_min = pipeline ()
print (f'Value of X that minimizes the equation : {X[idx_min]}')
```

    applying pipeline (on whole data)
    applying get_data (on whole data)
    saving to /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/get_data_result.pk
    applying subtract2 (on whole data)
    saving to /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/subtract2_result.pk
    applying square (on whole data)
    saving to /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/square_result.pk
    applying argmin (on whole data)
    saving to /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/argmin_result.pk
    saving to /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/pipeline_result.pk

    Value of X that minimizes the equation : 2

We can see the logs of each step being executed and its results saved to
disk.

Now we can easily load the results of intermediate steps:

``` python
result = pipeline.subtract2.load_result ()
print ('result of X-2: ', result)
```

    loading from /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/subtract2_result.pk

    result of X-2:  [-2 -1  0  1  2]

``` python
result = pipeline.square.load_result()
print ('result of (X-2)^2: ', result)
```

    loading from /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/square_result.pk

    result of (X-2)^2:  [4 1 0 1 4]

Let us see the case where there was an interruption in the execution
and we need to resume it. We simulate this case by removing the
intermediate results produced after the interruption:
`subtract2_result`, `square_result`, and the final `pipeline_result`:

``` python
!rm square_problem/whole/subtract2_result.pk
!rm square_problem/whole/square_result.pk
!rm square_problem/whole/pipeline_result.pk
```

Let us now re-run the pipeline, and see which steps are loaded and which
ones are re-computed:

``` python
pipeline ()
```

    applying pipeline (on whole data)
    applying get_data (on whole data)
    loading from /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/get_data_result.pk
    loaded pre-computed result
    applying subtract2 (on whole data)
    saving to /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/subtract2_result.pk
    applying square (on whole data)
    saving to /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/square_result.pk
    applying argmin (on whole data)
    loading from /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/argmin_result.pk
    loaded pre-computed result
    saving to /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/pipeline_result.pk

    2

We can see that the results of `get_data` and `argmin`, which were kept on
disk, are loaded, while `subtract2`, `square` and the final result of the
`pipeline` are re-computed (since their results were removed from disk) and
saved to disk again.

By default, results are always loaded and saved if we provide a
`path_results` when constructing our pipeline. This default behaviour
can be changed by specifying the values of
[`load`](https://Jaume-JCI.github.io/ds-blocks/utils/session.html#load)
and `save` at construction time. For instance:

``` python
pipeline = Sequential (component_1, component2,
                       path_results='my_results', load=False)
```

will save the result of the computation but not load it. This might be
useful when we want to overwrite the previous result with a newly
calculated one. The following:

``` python
pipeline = Sequential (component_1, component2,
                       path_results='my_results', save=False)
```

will load the result, if it exists. If it doesn’t, it will compute the
result but it won’t save it.

### Modified problem

Let us now modify the previous problem as follows: we want to find the
hyper-parameter `c` that minimizes the mean squared error of the
regression model

$$
\hat{y}_i = (x_i + c)^2,
$$

given a simple 1D dataset:

$$
X = (0, 1, 2)^T 
$$

$$
Y = (4, 9, 16)^T
$$

In this data, we have $y_i = (x_i+2) ^ 2$ $\forall i$, and therefore the
optimal solution is $c=2$.
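
A quick sanity check of this claim, using plain numpy (independent of `DS Blocks`):

``` python
import numpy as np

X = np.array ([0, 1, 2])
Y = np.array ([4, 9, 16])
# with c = 2, the model (X + c)^2 reproduces Y exactly, so the error is zero
print ((X + 2) ** 2 - Y)   # [0 0 0]
```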

For this problem we will measure the regression error using
`mean_squared_error` from sklearn. Let us import it:

``` python
from sklearn.metrics import mean_squared_error
```

We decompose the problem into four functions: `get_data`, `add_c`,
`square`, and `mean_squared_error`:

``` python
def get_data ():
    X = np.array ([0, 1, 2])
    Y = np.array ([4, 9, 16])
    return X, Y

def add_c (X, c):
    return X+c

def square (X):
    return X*X

pipeline = Sequential (get_data, add_c, square, mean_squared_error)
```

There are two issues with the above pipeline:

1.  The first function `get_data ()` returns `X` and `Y`. However, the
    subsequent component `add_c` does not consume `Y`. Therefore, it is
    not correct to simply pass the output of the first step directly
    into the next step; it is only the last function of the pipeline,
    `mean_squared_error`, that consumes `Y`. We address this by using
    *data converters*, which drop the `Y` variable in all cases except
    the last step, where it is needed.

2.  The function `add_c` has an argument `c` whose value is not provided
    by the previous step.

Before illustrating how those items are typically implemented with
`DS Blocks`, let us first see a more standard solution: for solving
issue 1, we use wrappers that perform data conversion from one step to
the next. This is suitable if we reuse external functions in our
pipeline and we cannot modify those functions to our needs. The second
issue is addressed by using a `partial` function where we fix the value
of `c`. Let us see the resulting code:

``` python
from functools import partial
```

``` python
def ignore_labels (func):
    def wrapper (X, Y):
        # 1. "data conversion" before calling function: Y is dropped, and only X is passed
        result = func (X)
        # 2. "data conversion" after calling the function: Y is attached to the result
        return result, Y
    return wrapper

c = 0 # pipeline parametrized with c=0
pipeline = Sequential (get_data, 
                       ignore_labels (partial (add_c, c=c)), 
                       ignore_labels (square), 
                       mean_squared_error)
error = pipeline () 
print (f'the error obtained with c={c} is {error}')
```

    the error obtained with c=0 is 74.66666666666667
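
As a quick hand check of this number (plain numpy, outside the pipeline): with `c=0` the predictions are `X**2 = [0, 1, 4]`, so the squared errors are `[16, 64, 144]`, whose mean is `224/3 ≈ 74.67`:

``` python
import numpy as np

X = np.array ([0, 1, 2])
Y = np.array ([4, 9, 16])
print (np.mean ((X ** 2 - Y) ** 2))   # 74.666...
```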

The previous approach works fine in the current example. However, in
general, our pipelines are designed to not only work with functions, as
in this example, but to also work with estimators that have methods
similar to `fit`, `predict` and `transform`. For such cases, it is more
convenient to use
[`DataConverter`](https://Jaume-JCI.github.io/ds-blocks/core/data_conversion.html#dataconverter)
objects, as illustrated in the code below. The
[`DataConverter`](https://Jaume-JCI.github.io/ds-blocks/core/data_conversion.html#dataconverter)
allows providing different conversion rules for each of the
methods, `fit` and `predict`, called by the pipeline. A similar thing
happens regarding the use of `partial`: it works well when the steps of
the pipeline are single functions, but it is more problematic when each
step runs more than one method (e.g., `fit` and `predict`). The next
code illustrates how this is addressed in `DS Blocks`.

We start by importing the
[`DataConverter`](https://Jaume-JCI.github.io/ds-blocks/core/data_conversion.html#dataconverter)
class:

``` python
from dsblocks.core.data_conversion import DataConverter
```

… and by defining a `DataConverter` for our pipeline, as follows:

``` python
class IgnoreLabels (DataConverter):
    def __init__ (self, **kwargs):
        super ().__init__ (**kwargs)
    def convert_before_applying (self, X, Y, **kwargs):
        self.Y = Y
        return X
    def convert_after_applying (self, result, **kwargs):
        return result, self.Y
```

As we can see, our data converter implements two methods:

- `convert_before_applying`: run *before* the given step of the pipeline
  is run. It stores the variable `Y` returned by the previous step, and
  only returns the variable `X`, so that the current step only receives
  `X`.
- `convert_after_applying`: run *after* the given step of the pipeline
  is run. It attaches the variable `Y`, stored before, to whatever is
  returned by the current step, so that the next step of the pipeline
  will receive both the result of the current step and `Y`.

The above two methods manage the data conversion for *applying* the
current step. In the `DS Blocks` terminology, `apply` is equivalent to
`predict` or `transform` on a scikit-learn estimator, and can be done
either by calling the `apply` method, calling `predict` or `transform`
(which are aliases), or just calling the component on the input data, as
if it were a function (i.e., using `__call__`), which is what we do in
this tutorial.
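
As a minimal sketch of these aliases (assuming the `Component` wrapper that is imported later in this tutorial, and the default data conversion), the following calls are interchangeable ways of applying a component to some input:

``` python
from dsblocks import Component

comp = Component (square)                    # wrap the plain `square` function defined above
r1 = comp.apply (np.array ([1, 2, 3]))       # explicit apply
r2 = comp.transform (np.array ([1, 2, 3]))   # alias of apply
r3 = comp (np.array ([1, 2, 3]))             # __call__, also an alias of apply
```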

Later we will see how we can add methods to our
[`DataConverter`](https://Jaume-JCI.github.io/ds-blocks/core/data_conversion.html#dataconverter)
in order to manage data conversion before and after calling the `fit` in
our pipeline components.

Now, in order to use the implemented
[`DataConverter`](https://Jaume-JCI.github.io/ds-blocks/core/data_conversion.html#dataconverter),
we need to wrap the functions that need this converter in a
[`Component`](https://Jaume-JCI.github.io/ds-blocks/core/components.html#component)
class. These functions are `add_c` and `square`, and we indicate the
DataConverter they need to use as follows:

``` python
Component (add_c, data_converter=IgnoreLabels)
Component (square, data_converter=IgnoreLabels)
```

Furthermore, in the case of `add_c`, we also want to indicate the value
of the parameter `c`. This will prove useful later when estimating the
error for multiple values in parallel, see `Using Parallel` below:

`Component (add_c, c=c, data_converter=IgnoreLabels)`,

where `c` is some variable defined previously. Any parameter to be used
by a given step can be specified in such a way. When the step uses an
object with several methods (e.g., `fit` and `transform`), we can
indicate parameters to be used for both methods in the same way, like
`Component (my_object, fit_param1=value1, fit_param2=value2, transform_param1=value3, ...)`.

Putting all this together, we construct `Sequential` as follows:

``` python
from dsblocks import Component
```

``` python
pipeline = Sequential (get_data,
                       Component(add_c, c=c, data_converter=IgnoreLabels),
                       Component(square, data_converter=IgnoreLabels), 
                       mean_squared_error)
```

… and call it as usual

``` python
error = pipeline () 
print (f'the error obtained with c={c} is {error}')
```

    the error obtained with c=0 is 74.66666666666667

In the previous construction of `Sequential`, some steps are
indicated by simply passing the function that implements
them, like `get_data` and `mean_squared_error`. Those steps are
automatically wrapped into
[`Component`](https://Jaume-JCI.github.io/ds-blocks/core/components.html#component)
classes, so that, in the end, all the steps of the pipeline are defined by
Components:

``` python
pipeline.components
```

    [Component GetData (name=get_data),
     Component AddC (name=add_c),
     Component Square (name=square),
     Component MeanSquaredError (name=mean_squared_error)]

Only when we need to specify parameters that are specific to a given
step do we need to explicitly use a
[`Component`](https://Jaume-JCI.github.io/ds-blocks/core/components.html#component)
for doing so, as we have done for `add_c` and `square`. If the parameter
is common to all the steps, we can just pass it in the construction of
`Sequential` and it will be propagated to all the components, like so:

``` python
Sequential (my_step1, my_step2, my_step3,
               data_converter=MyDataConverter)
```

in which case all the components use `MyDataConverter` for data
conversion.

### Using [`Parallel`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#parallel)

We can estimate the error obtained by multiple values of the parameter
`c`, using a
[`Parallel`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#parallel)
object. This object is a pipeline similar to `Sequential` but where the
outputs are not piped linearly from one step to the next. By default,
the same initial input is fed to all the components that compose the
[`Parallel`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#parallel)
object, and the output from all of them is gathered in a tuple. Both
behaviours can, however, be configured through callbacks. Let us see how it
works in our case.

Let’s start by importing
[`Parallel`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#parallel):

``` python
from dsblocks import Parallel
```

Now we define the components to be run in this pipeline. We can do so
when constructing it, `Parallel (component1, component2, component3)` or
beforehand,

``` python
components=(component1, component2, component3)
Parallel (*components)
```

In our case, each of the components to be run is a `Sequential`
pipeline, the only difference between them being the value of the parameter `c`:

``` python
pipelines = (Sequential (get_data, 
                         Component(add_c, c=c, data_converter=IgnoreLabels),
                         Component(square, data_converter=IgnoreLabels), 
                         mean_squared_error)
              for c in range(0,5))
```

We pass those components to construct our
[`Parallel`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#parallel)
component

``` python
parallel = Parallel (*pipelines)
```

… and call that component as usual:

``` python
result = parallel ()
print (f'result: {result}')
```

    result: (74.66666666666667, 27.666666666666668, 0.0, 51.666666666666664, 266.6666666666667)

As we can see, our
[`Parallel`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#parallel)
object is composed of 5 pipeline components, each pipeline receiving a
different value of parameter `c`. The
[`Parallel`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#parallel)
object then runs those pipelines and gathers their result in a tuple.

In general, the
[`Parallel`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#parallel)
object can be constructed by passing any collection of components, and
this collection can be heterogeneous. While in the current case we have
constructed multiple copies of the same `Sequential` object, we could
instead have a single copy that receives a different value of `c` each time,
by using the
[`ParallelInstances`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#parallelinstances)
class. However, the use of such a class is a bit more elaborate and we
leave this topic for an advanced tutorial. Let us now see the error as a
function of `c`:

``` python
import matplotlib.pyplot as plt
```

``` python
plt.plot (result, 'b.-')
```

![](index_files/figure-commonmark/cell-26-output-1.png)

### Fitting models

Until now the steps of the pipeline have been functions. In our
pipelines, steps can also be specified in terms of objects. This is
suitable for models or estimators whose state changes as a result
of applying the step, using methods such as `fit` or `fit_transform`, as
done in scikit-learn. Specifically, each object passed as one of the
steps needs to have at least one of the following methods: an
`apply` method, which can also be called `transform` or `predict`, a
`fit` method, or a `fit_apply` method, which can also be called
`fit_transform` or `fit_predict`, following the same terminology as in
scikit-learn.

We see now an example of this where we use one such object, whose class
we call `BruteForceModel`. This uses a simple brute-force search to find
the value of `c` that minimizes the error in our current objective,
given a set of candidate values `c_values`:

``` python
class BruteForceModel ():
    def __init__ (self, c_values=range(5), **kwargs):
        self.c_values = c_values
    
    def transform (self, X):
        return (X+self.c)**2

    def fit (self, X, Y):
        error = np.empty ((len(self.c_values),))
        for i, c in enumerate(self.c_values):
            self.c = c
            Y_hat = self.transform (X)
            error[i] = mean_squared_error (Y, Y_hat)
        self.c = self.c_values[np.argmin (error)]
        return self
```

In order to use such an object, we need to indicate the data conversion
step for both the `apply` (or `transform`) and the `fit` methods.
Since we already did that for the `apply` method above, we just need to
add the data conversion for `fit`. This is done by adding a new method,
`convert_before_fitting`, to the `IgnoreLabels` class:

``` python
def convert_before_fitting (self, X, Y, **kwargs):
    self.Y = Y
    return X, Y

IgnoreLabels.convert_before_fitting = convert_before_fitting
```

We can construct our pipeline now, as follows:

``` python
pipeline = Sequential (get_data,
                       Component (BruteForceModel(), data_converter=IgnoreLabels),
                       Component (mean_squared_error, data_converter='NoConverter'))
```

As we can see above, the second component performs the data conversion
indicated in `IgnoreLabels`, and the third component doesn’t perform any
data conversion. In order to achieve that, we need to indicate that, for
the third component, the data converter is
[`NoConverter`](https://Jaume-JCI.github.io/ds-blocks/core/data_conversion.html#noconverter),
since the default converter used by `DS Blocks` does perform a specific
type of data conversion when we call the `fit_apply` method, as
explained below.

In order to fit our newly created model, we will be calling the
`fit_apply` method on the entire pipeline. This method is semantically
equivalent to the `fit_transform` and `fit_predict` methods of
scikit-learn. It makes the components of the pipeline fit to the
data, using the labels `Y`, then transform the data based on the fitted
parameters, and pass the transformed data on to the next component of the
pipeline. This is the behaviour we have in scikit-learn, and it is
replicated by default in `DS Blocks`.

The default data converter used by `DS Blocks`, called
[`GenericConverter`](https://Jaume-JCI.github.io/ds-blocks/core/data_conversion.html#genericconverter),
makes the sequential pipelines behave like the scikit-learn pipelines
when calling the `fit_apply` method. Just like in scikit-learn, when we
use the default converter, the `fit` method of our components receives
both the data `X` and the labels `Y`, but the `apply` method only
receives the (transformed) data `X`.

In our current pipeline, however, we want the last component, which
applies the `mean_squared_error` function, to receive both `X` and `Y`,
in order to be able to calculate that error. Therefore, in the last
component we indicate that no conversion should be applied, so as to
avoid dropping `Y` when calling `apply` on that component.

Once the new pipeline is defined, we simply call the `fit_apply` method as
follows:

``` python
pipeline.fit_apply ()
```

    0.0

### MultiSplit objects

Let us now make the problem a little bit more interesting: we split the
data into two subsets, *training* and *test*, fit our model on the
training set, and have a separate estimate of the error for each of the
two subsets. For that purpose, it will be handy to use the
[`MultiSplitComponent`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#multisplitcomponent)
from `DS Blocks`. In particular, we will use the
[`MultiSplitDict`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#multisplitdict)
subclass for the current problem. Let us import it, together with
`train_test_split`.

``` python
from sklearn.model_selection import train_test_split
from dsblocks.core.compose import MultiSplitDict
```

Let us now define a slightly different `get_data` function where we can
indicate the number of data points we want to have, and a noise level
that is added to the data.

``` python
def get_data (n=225, noise=1.0):
    X = np.arange (n) 
    Y = (X+2)**2 + np.random.randn (n) * noise
    return X, Y
```

Let us also define a new function, `generate_split`, which splits the
data into training and test, and returns a dictionary with both subsets.

``` python
def generate_split (X, Y, proportion_training=0.8):
    n_samples_train=int(len(X)*proportion_training)
    n_samples_test=len(X)-n_samples_train
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, train_size=n_samples_train, test_size=n_samples_test, shuffle=False
    )
    data = dict (training=(X_train, Y_train),
                 test=(X_test, Y_test))
    return data
```

Finally, we set our `BruteForceModel` component to use finer granularity
for the values of the `c` parameter:

``` python
brute_force_model = BruteForceModel(c_values=[1.8, 1.9, 2.0, 2.1, 2.2])
```

With all this, the new sequential pipeline is defined as follows:

``` python
pipeline = Sequential (get_data, 
                       Component(generate_split, data_converter='NoConverter'), 
                       MultiSplitDict (Component(brute_force_model, data_converter=IgnoreLabels)), 
                       MultiSplitDict(Component(mean_squared_error, data_converter='NoConverter')))
```

As we can see, the only two changes are:

1. We have an additional component in the pipeline, which applies the
   function `generate_split`.
2. The two final components are wrapped in a
   [`MultiSplitDict`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#multisplitdict)
   class.

By default,
[`MultiSplitDict`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#multisplitdict)
fits the wrapped component using the `training` data, i.e., the data
found in the `training` field of the input dictionary. After fitting the
component, it applies it separately to the training, the test, and, if
present, the validation subsets from the input dictionary. We can see
this by observing the output of the previous pipeline when calling
`fit_apply` on it:

``` python
pipeline.fit_apply()
```

    {'training': 0.9495413378781785, 'test': 0.8841783243494468}

As we can see, the model’s error is estimated separately on the training
and test set, and the output is a dictionary with the same fields as the
input to
[`MultiSplitDict`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#multisplitdict).
The specific subsets to which the component is fitted and/or applied
can be indicated by parameters as follows:

``` python
MultiSplitDict (my_component, 
                fit_to=subset_name, 
                apply_to=[subset_name_1, subset_name_2, ...])
```

where `subset_name` is a string indicating the name of the field where
the subset of data is found in the input dictionary. Let’s see this:

``` python
pipeline = Sequential (get_data, 
                       Component(generate_split, data_converter='NoConverter'), 
                       MultiSplitDict (Component(brute_force_model, data_converter=IgnoreLabels), 
                                       apply_to=['test']), 
                       MultiSplitDict(Component(mean_squared_error, data_converter='NoConverter'), 
                                      apply_to=['test']))
pipeline.fit_apply()
```

    {'test': 0.6858969144872996}

As we can see, the error is now estimated only for the test set.

### Experiment tracking

Many times we want to be able to track the results obtained with
different values of our parameters, or across multiple runs if our pipeline has
some stochasticity. DS Blocks provides experiment tracking through
different mechanisms. The easiest one is probably to just wrap any
pipeline created before with a
[`TrackingComponent`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#trackingcomponent)
wrapper. Another possibility is to use the class
[`SequentialWithTracking`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#sequentialwithtracking)
instead of using `Sequential`. Let us see each of those in turn. First
let us import those two classes:

``` python
from dsblocks.core.compose import TrackingComponent, SequentialWithTracking
import joblib
```

Using the first option, we can wrap the previously created `pipeline`
object with the
[`TrackingComponent`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#trackingcomponent)
class:

``` python
tracking_pipeline = TrackingComponent (pipeline)
```

This is appropriate if we first created the pipeline without the
objective of tracking the results, and later we want to add tracking to
it. However, it is more common to directly define our pipeline with the
objective of tracking the results obtained with it. We do that by
constructing our top-level pipeline using
[`SequentialWithTracking`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#sequentialwithtracking),
instead of using `Sequential` as done previously:

``` python
tracking_pipeline = SequentialWithTracking (
    get_data, 
    Component (generate_split, data_converter='NoConverter'), 
    MultiSplitDict (Component(brute_force_model, data_converter=IgnoreLabels)), 
    MultiSplitDict (Component(mean_squared_error, data_converter='NoConverter'))
)
```

    could not pickle object: Can't pickle <class 'dsblocks.core.utils.get_ds_experiment_manager.<locals>.DSExperimentManager'>: it's not found as dsblocks.core.utils.get_ds_experiment_manager.<locals>.DSExperimentManager
    could not pickle object: Can't pickle <class 'dsblocks.core.utils.get_ds_experiment_manager.<locals>.DSExperimentManager'>: it's not found as dsblocks.core.utils.get_ds_experiment_manager.<locals>.DSExperimentManager

As we can see, the construction is done exactly the same way as the last
construction we did with `Sequential`. Now, each time we run this
pipeline with new parameters, the resulting metrics are added to a
database which can be queried. Let’s see that with three example runs:

``` python
error = tracking_pipeline.fit_apply (n=5, noise=1000)
print (error)

error = tracking_pipeline.fit_apply (n=1000, noise=1000)
print (error)

error = tracking_pipeline.fit_apply (n=10000, noise=1000)
print (error)
```

    Could not run pickle object: Can't pickle <class 'dsblocks.core.utils.get_ds_experiment_manager.<locals>.DSExperimentManager'>: it's not found as dsblocks.core.utils.get_ds_experiment_manager.<locals>.DSExperimentManager

    {'training': 332782.06925177074, 'test': 4497361.700722283}
    {'training': 908724.5037988542, 'test': 959655.3088879937}
    {'training': 989310.3821379375, 'test': 991343.8955351294}

We can now query the last results as follows:

``` python
em = tracking_pipeline.get_experiment_manager ()
df = em.get_experiment_data ()
df [['parameters','scores']]
```

|   | parameters: function | parameters: n | parameters: noise | scores: test | scores: training |
|---|----------------------|---------------|-------------------|----------------|------------------|
| 0 | fit_apply            | 5             | 1000              | 4497361.700722 | 332782.069252    |
| 1 | fit_apply            | 1000          | 1000              | 959655.308888  | 908724.503799    |
| 2 | fit_apply            | 10000         | 1000              | 991343.895535  | 989310.382138    |

The table above shows, from left to right:

- The experiment ID corresponding to the last three executions.
- The parameters used for each execution. The first parameter included by
  default is the method used in the execution, which is always `fit_apply`
  in our case. The second and third parameters are `n` and `noise`, which
  indicate the number of observations in our data and the noise level,
  respectively.
- The metric scores obtained by each experiment. There are as many score
  names as fields in the dictionary returned by the last component of the
  pipeline, where each score name is the corresponding dictionary field.
  In our case, we have `test` and `training`, corresponding to the test
  error and the training error. For each score name, we have as many
  scores as runs done with the same parameters. In our case, we have only
  run the pipeline once for each set of parameters, and therefore we only
  have one run number, `0`.

The intermediate steps of the execution are stored in a path associated
with each experiment ID. Let’s say we want to revisit the results of
experiment 1 above; in particular, we want to see the output of the step
`generate_split`. We can do that as follows:

``` python
path_results = em.get_path_results (experiment_id=1, run_number=0)
data=tracking_pipeline.main.generate_split.load_result(path_results=path_results)
print ('training X: ', data['training'][0][:3], '\ntraining Y: ', data['training'][1][:3])
```

    training X:  [0 1 2] 
    training Y:  [  289.81113214 -1228.50705265  -570.5033283 ]

Above we have explored the results of using different combinations of
parameters by running the `fit_apply` method of our pipeline multiple
times. Instead of doing that, we can explore many combinations of
parameters using search strategies like grid search or Bayesian
optimization. Let’s now see how this is done with grid search, and later
we will see it with Bayesian optimization as part of the integration
with `Optuna`. In order to use grid search, we can call:

``` python
em.grid_search (
    parameters_multiple_values=dict(n=[int(1e5), int(1e6), int(1e7)],
                                    noise=[10.0,100.0,1000.0]),
    parameters_single_value=dict(function='fit_apply')
)
```

where we have explored all the combinations of `n` and `noise` that appear
in the lists provided in `parameters_multiple_values` (3 × 3 = 9
experiments in this case). At the same time, the parameters in
`parameters_single_value` are kept fixed to the indicated value; in our
case, the function is `fit_apply` in all the experiments.

In order to inspect the results, we can obtain a dataframe with the same
structure as the one explained above:

``` python
df = em.get_experiment_data()
```

Based on this, we can plot, for instance, the evolution of the error as a
function of the total number of observations, for training and test
separately, fixing the noise level to 100:

``` python
df_to_analyze = df[df[('parameters','noise')]==100.0]
#df_to_analyze = df[df[('parameters','n')]==int(1e5)]
parameter_to_analyze = df_to_analyze[('parameters','n')].values
#parameter_to_analyze = df_to_analyze[('parameters','noise')].values
scores_training = df_to_analyze[('scores','training')].values
scores_test = df_to_analyze[('scores','test')].values

plt.plot (parameter_to_analyze, scores_training, 'b.-')
plt.plot (parameter_to_analyze, scores_test, 'm.-')
plt.xlabel ('size of data')
#plt.xlabel ('noise')
plt.ylabel ('error')
plt.legend (['training', 'test']);
```

![](index_files/figure-commonmark/cell-46-output-1.png)

### Integration with Optuna

In order to use optuna-based search strategies like Bayesian
Optimization, we first need to define a parameter sampler that makes use
of the constructs provided in [optuna](https://optuna.org/). Let’s do that for our two
parameters, `noise` and `n`. We define a function which samples `noise`
from a uniform distribution, and `n` from a discrete set of values, and
returns a dictionary with the sampled values:

``` python
def parameter_sampler (trial):
    noise = trial.suggest_uniform('noise', 1000, 10000)
    n = trial.suggest_categorical('n', [100, 250, 500])

    parameters = dict(noise=noise,
                      n=n)

    return parameters
```

Next, we need to indicate which metric score needs to be optimized by
Bayesian Optimization. This is done by indicating the value of the
`key_score` property of our experiment manager. This property can be
indicated at construction time, when building our
[`SequentialWithTracking`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#sequentialwithtracking)
object, or simply by assigning a value to it:

``` python
em.key_score = 'training'
```

…by which we indicate that we will be optimizing the value of the
training error.

Finally, we run the Bayesian Optimization on the indicated parameters by
passing the defined `parameter_sampler` function, and indicating
additional parameters that remain constant across experiments, in our
case `function='fit_apply'`, as follows:

``` python
em.hp_optimization (parameter_sampler=parameter_sampler, 
                    parameters=dict(function='fit_apply'))
```

    [I 2022-11-18 16:01:41,782] A new study created in RDB with name: hp_study
    [I 2022-11-18 16:01:42,408] Trial 0 finished with value: 34975291.89393385 and parameters: {'noise': 5939.321535345923, 'n': 100}. Best is trial 0 with value: 34975291.89393385.
    [I 2022-11-18 16:01:42,823] Trial 1 finished with value: 24467912.653205547 and parameters: {'noise': 4812.8931940501425, 'n': 500}. Best is trial 0 with value: 34975291.89393385.
    [I 2022-11-18 16:01:43,229] Trial 2 finished with value: 105726109.1003538 and parameters: {'noise': 9672.964844509264, 'n': 250}. Best is trial 2 with value: 105726109.1003538.
    [I 2022-11-18 16:01:43,650] Trial 3 finished with value: 36215819.30834488 and parameters: {'noise': 6112.401049845391, 'n': 100}. Best is trial 2 with value: 105726109.1003538.
    [I 2022-11-18 16:01:44,088] Trial 4 finished with value: 1404915.7659739438 and parameters: {'noise': 1181.9655769629314, 'n': 500}. Best is trial 2 with value: 105726109.1003538.
    [I 2022-11-18 16:01:44,526] Trial 5 finished with value: 93425542.45875728 and parameters: {'noise': 9807.565080094875, 'n': 100}. Best is trial 2 with value: 105726109.1003538.
    [I 2022-11-18 16:01:44,979] Trial 6 finished with value: 4904244.708255151 and parameters: {'noise': 2064.469832820399, 'n': 500}. Best is trial 2 with value: 105726109.1003538.
    [I 2022-11-18 16:01:45,416] Trial 7 finished with value: 30972553.479057707 and parameters: {'noise': 5696.634895750645, 'n': 500}. Best is trial 2 with value: 105726109.1003538.
    [I 2022-11-18 16:01:45,869] Trial 8 finished with value: 27207384.64360299 and parameters: {'noise': 5105.352989948937, 'n': 500}. Best is trial 2 with value: 105726109.1003538.
    [I 2022-11-18 16:01:46,338] Trial 9 finished with value: 49351456.854481995 and parameters: {'noise': 6508.861504501792, 'n': 250}. Best is trial 2 with value: 105726109.1003538.

    105726109.1003538

Again, after running this we can inspect the results by obtaining the
history of experiments in the form of a dataframe, as done previously:

``` python
df = em.get_experiment_data ()
```

And look at the combinations of parameters and their resulting scores:

``` python
df[['parameters', 'scores']]
```

|    | parameters: function | parameters: n | parameters: noise | scores: test | scores: training |
|----|----------------------|---------------|-------------------|------------------|------------------|
| 0  | fit_apply            | 5             | 1000.0            | 4497361.700722   | 332782.069252    |
| 1  | fit_apply            | 1000          | 1000.0            | 959655.308888    | 908724.503799    |
| 2  | fit_apply            | 10000         | 1000.0            | 991343.895535    | 989310.382138    |
| 3  | fit_apply            | 100000        | 10.0              | 101.674732       | 100.014967       |
| 4  | fit_apply            | 100000        | 100.0             | 10060.595463     | 9960.776863      |
| 5  | fit_apply            | 100000        | 1000.0            | 986655.689255    | 1000000.368654   |
| 6  | fit_apply            | 1000000       | 10.0              | 100.209543       | 99.914734        |
| 7  | fit_apply            | 1000000       | 100.0             | 10017.505157     | 10009.726013     |
| 8  | fit_apply            | 1000000       | 1000.0            | 998290.266422    | 1000026.188421   |
| 9  | fit_apply            | 10000000      | 10.0              | 99.995626        | 99.945308        |
| 10 | fit_apply            | 10000000      | 100.0             | 9998.97338       | 9988.258447      |
| 11 | fit_apply            | 10000000      | 1000.0            | 999088.076542    | 1000050.766496   |
| 12 | fit_apply            | 100           | 5939.321535       | 27276806.205229  | 34975291.893934  |
| 13 | fit_apply            | 500           | 4812.893194       | 31520677.399808  | 24467912.653206  |
| 14 | fit_apply            | 250           | 9672.964845       | 101906074.636583 | 105726109.100354 |
| 15 | fit_apply            | 100           | 6112.40105        | 14342902.683279  | 36215819.308345  |
| 16 | fit_apply            | 500           | 1181.965577       | 1187953.622504   | 1404915.765974   |
| 17 | fit_apply            | 100           | 9807.56508        | 85306914.7339    | 93425542.458757  |
| 18 | fit_apply            | 500           | 2064.469833       | 4002535.937879   | 4904244.708255   |
| 19 | fit_apply            | 500           | 5696.634896       | 27071477.780816  | 30972553.479058  |
| 20 | fit_apply            | 500           | 5105.35299        | 21697305.132237  | 27207384.643603  |
| 21 | fit_apply            | 250           | 6508.861505       | 33483663.716484  | 49351456.854482  |

## Documentation

For further details, please see the
[documentation](https://jaume-jci.github.io/ds-blocks/).

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/Jaume-JCI/ds-blocks",
    "name": "dsblocks",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7,<=3.12",
    "maintainer_email": "",
    "keywords": "nbdev jupyter notebook python",
    "author": "Jaume Amores",
    "author_email": "jamorej@jci.com",
    "download_url": "https://files.pythonhosted.org/packages/53/8c/fa09d62eed830500936c28b86f6b89d9262140f6410b8fa39d207db2591c/dsblocks-0.0.15.tar.gz",
    "platform": null,
    "description": "DS Blocks\n================\n\n<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->\n\n`DS Blocks` makes it easy to write highly modular and compact data\nscience pipelines. It is based on a generalization of the well-known\nscikit-learn pipeline design, enriching and extending it in multiple\nways. By doing so, `DS Blocks` makes it possible to express the ML\nsolution in terms of independent building blocks that can be easily\nmoved around and reused to create different solutions. At the same time,\n`DS Blocks` makes it possible to write concise code by automatically\ntaking care of common steps that are needed when building a data science\npipeline, such as checkpointing, logging, profiling, data conversion,\nand more, resulting in a significant reduction of boiler-plate code.\n\n`DS Blocks` also provides a number of features that facilitate working\nwith notebooks, such as:\n\n- Integration with [nbdev](https://nbdev.fast.ai/) and extension of its\n  functionalities. `nbdev` is a powerful framework that streamlines\n  development on notebooks using best software practices. `DS Blocks`\n  extends `nbdev` by making it possible to convert notebooks into a test\n  suite for external engines such as `pytest`. It also allows convenient\n  freezing and unfreezing notebook test cells to avoid recomputing the\n  tests every time we need to restart and re-run the notebook.\n\n- `DS Blocks` provides several magic functions that facilitate\n  reproducibility. It also provides convenient decorators for converting\n  functions into pipeline components and reducing boiler-plate.\n\n- In addition to a powerful pipeline design (see below), `DS Blocks`\n  provides out-of-the-box components frequently used in Data Science,\n  such as for cross-validation and model-selection, building ensembles,\n  working with time-series, and more.\n\n## Features\n\nThe following is a selection of some of the benefits provided by using\n`DS Blocks` pipelines:\n\n- Automatize common steps that are usually present in ML code, including\n  caching / loading of intermediate results across the entire pipeline,\n  logging, profiling, conversion of data to appropriate format, and\n  more.\n\n- Easy debugging of the entire pipeline, both during the current run as\n  well as post-mortem. Facilitates investigation of issues occurred\n  during past runs.\n\n- Make it possible to easily show statistics and other types of\n  information about the output of each component in the pipeline, print\n  a summary of the pipeline, plot a diagram of the components, and show\n  the dimensionality of the output provided by each component.\n\n- Make it possible to use any data type in the communication between\n  components. This is done through data conversion layers that\n  facilitate reusing the components across different pipelines,\n  regardless of the data format used by rest of the components. This\n  functionality allows, for instance, to have a consistent use of\n  DataFrames across the whole pipeline: when the input is a DataFrame,\n  the output will be a DataFrame as well, and when the input is a numpy\n  array the output is a numpy array. 
This is just an example, the\n  proposed design allows to easily support many other use cases.\n\n- Enable the use of sampling components that not only change the\n  variables (or columns) but also change the number of observations (or\n  rows), by either under-sampling or over-sampling, which is not\n  supported by common pipelines such as the ones provided in\n  scikit-learn.\n\n- Integrated experiment tracking and hyper-parameter optimization.\n\n- And many more!\n\n## Comparison against other frameworks\n\n`DS Blocks` provides functionalities that are also present in frameworks\nsuch as [Metaflow](https://metaflow.org/),\n[Kedro](https://kedro.readthedocs.io/),\n[Ploomber](https://ploomber.io/), and others. In this section we briefly\ncomment on the differences against these three frameworks, which are\namong the most popular ones. An important difference with respect to\nthese frameworks is that, while our design allows to build any kind of\nDirected Acyclic Graph (DAG), we do not need to express the edges of\nsuch graph explicitly, reducing the corresponding boiler-plate. Another\ndifference is the use of a compact design loosely similar to\nscikit-learn\u2019s pipelines and estimators, which allows to concisely\nexpress any ML solution in a familiar syntax.\n\nApart from those differences, we comment here on more specific\ndifferences wrt each framework:\n\n- The main difference with respect to frameworks such as `Kedro`, is\n  that we use a pure-code approach, avoding the need of writing separate\n  config files that govern the behaviour of the pipeline.\n- The main difference wrt to `Metaflow`, is that `DS Blocks` allows to\n  keep the original code without changes, and extend its functionality\n  by simply declaring sequences of the original functions and classes.\n  While `Metaflow` allows to create flows of original functions, it uses\n  a more verbose approach for achieving this.\n- The main difference with `Ploomber`, `Luigi`, and other frameworks is\n  that our pipelines are constructed programmatically with pure python,\n  not by gluing together the inputs and outputs of applications that are\n  run separately.\n\n## Installation\n\nDS Blocks is pip installable:\n\n``` bash\npip install dsblocks\n```\n\n## Example usage\n\n### Baseline problem\n\nIn the first problem, we will only use the `Sequential` class. Let us\nimport it, together with the numpy library.\n\n``` python\nimport numpy as np\nfrom dsblocks import Sequential\n```\n\nThis first example is taken from\n[Optuna](https://optuna.org/#code_examples)\u2019s quadratic problem: find\nthe value of $X$ that minimizes:\n\n$$(X-2)^2$$\n\nWe start by using a simple data vector as input: $$X=(0,1,2,3,4)^T$$\n\n``` python\nX = np.arange (5)\n```\n\nFor the sake of this example, we decompose the aforementioned quadratic\nequation into two simple functions: `subtract2` and `square`, and add a\nthird function `np.argmin` to find the value of `X` that minimizes this\nequation. The three functions are then assembled in a `Sequential`\npipeline as follows:\n\n``` python\ndef subtract2 (X): \n    return X-2\ndef square (X): \n    return X*X\n\npipeline = Sequential (subtract2, square, np.argmin)\n```\n\nThe `Sequential` pipeline feeds the results from one function into the\nnext, the final one being `np.argmin`. In this toy example each function\nperforms a simple calculation, but in general they perform\ntime-consuming processes. 
After this, we obtain the result of this\npipeline by just calling it on the input data `X`:\n\n``` python\nidx_min = pipeline (X)\nprint (f'Value of X that minimizes the equation : {X[idx_min]}')\n```\n\n    Value of X that minimizes the equation : 2\n\nMany times, the first step of the pipeline is to get the data from an\nexternal source or storage. We now augment the pipeline by including a\nnew function `get_data` which runs as first step. We also include\npersistence and logging in the pipeline by passing `verbose=2` and\n`path_results='square_problem'`:\n\n``` python\ndef get_data ():\n    return np.arange (5)\n\npipeline = Sequential (get_data, subtract2, square, np.argmin,\n                       verbose=2, path_results='square_problem')\npipeline()\nprint (f'Value of X that minimizes the equation : {X[idx_min]}')\n```\n\n    applying pipeline (on whole data)\n    applying get_data (on whole data)\n    saving to /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/get_data_result.pk\n    applying subtract2 (on whole data)\n    saving to /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/subtract2_result.pk\n    applying square (on whole data)\n    saving to /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/square_result.pk\n    applying argmin (on whole data)\n    saving to /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/argmin_result.pk\n    saving to /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/pipeline_result.pk\n\n    Value of X that minimizes the equation : 2\n\nWe can see the logs of each step being executed and its results saved to\ndisk.\n\nNow we can easily load the results of intermediate steps:\n\n``` python\nresult = pipeline.subtract2.load_result ()\nprint ('result of X-2: ', result)\n```\n\n    loading from /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/subtract2_result.pk\n\n    result of X-2:  [-2 -1  0  1  2]\n\n``` python\nresult = pipeline.square.load_result()\nprint ('result of (X-2)^2: ', result)\n```\n\n    loading from /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/square_result.pk\n\n    result of (X-2)^2:  [4 1 0 1 4]\n\nLet us see the case where there was an interrumption in the execution\nand we need to resume it. 
\nWe simulate this case by removing the intermediate results that come\nafter the point of interruption: `subtract2_result`, `square_result`, and\nthe final `pipeline_result`:\n\n``` python\n!rm square_problem/whole/subtract2_result.pk\n!rm square_problem/whole/square_result.pk\n!rm square_problem/whole/pipeline_result.pk\n```\n\nLet us now re-run the pipeline, and see which steps are loaded and which\nones are re-computed:\n\n``` python\npipeline ()\n```\n\n    applying pipeline (on whole data)\n    applying get_data (on whole data)\n    loading from /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/get_data_result.pk\n    loaded pre-computed result\n    applying subtract2 (on whole data)\n    saving to /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/subtract2_result.pk\n    applying square (on whole data)\n    saving to /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/square_result.pk\n    applying argmin (on whole data)\n    loading from /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/argmin_result.pk\n    loaded pre-computed result\n    saving to /home/jcidatascience/jaume/workspace/remote/ds-blocks/square_problem/whole/pipeline_result.pk\n\n    2\n\nWe can see that the results of `get_data` and `argmin` are loaded from\ndisk, while `subtract2`, `square` and the final result of the `pipeline`\nare re-computed (since their results were removed from disk) and saved\nto disk again.\n\nBy default, results are always loaded and saved if we provide a\n`path_results` when constructing our pipeline. This default behaviour\ncan be changed by specifying the values of\n[`load`](https://Jaume-JCI.github.io/ds-blocks/utils/session.html#load)\nand `save` at construction time. For instance:\n\n``` python\npipeline = Sequential (component_1, component_2,\n                       path_results='my_results', load=False)\n```\n\nwill save the result of the computation but not load it. This might be\nuseful when we want to overwrite the previous result with a newly\ncalculated one. The following:\n\n``` python\npipeline = Sequential (component_1, component_2,\n                       path_results='my_results', save=False)\n```\n\nwill load the result, if it exists. If it doesn\u2019t, it will compute the\nresult but it won\u2019t save it.\n\n### Modified problem\n\nLet us now modify the previous problem as follows: we want to find the\nhyper-parameter `c` that minimizes the error of the following regression\nmodel:\n\n$$\n(x_i + c)^2 = y_i \quad \forall i,\n$$\n\ngiven a simple 1D dataset:\n\n$$\nX = (0, 1, 2)^T \n$$\n\n$$\nY = (4, 9, 16)^T\n$$\n\nIn this data, we have $y_i = (x_i+2) ^ 2$ $\forall i$, and therefore the\noptimal solution is $c=2$.\n\nFor this problem we will measure the regression error using\n`mean_squared_error` from sklearn. Let us import it:\n\n``` python\nfrom sklearn.metrics import mean_squared_error\n```\n\nWe decompose the problem into four functions: `get_data`, `add_c`,\n`square`, and `mean_squared_error`:\n\n``` python\ndef get_data ():\n    X = np.array ([0, 1, 2])\n    Y = np.array ([4, 9, 16])\n    return X, Y\n\ndef add_c (X, c):\n    return X+c\n\ndef square (X):\n    return X*X\n\npipeline = Sequential (get_data, add_c, square, mean_squared_error)\n```\n\nThere are two issues with the above pipeline:\n\n1.  The first function `get_data ()` returns `X` and `Y`. However, the\n    subsequent component `add_c` does not consume `Y`.
\n    Therefore, it is not correct to simply pass the output of the first\n    step directly into the next step. It is only the last function of the\n    pipeline, `mean_squared_error`, which consumes `Y`. We address this by\n    using *data converters*, which drop the `Y` variable in all the cases\n    except the last step, where it is needed.\n\n2.  The function `add_c` has an argument `c` whose value is not provided\n    by the previous step.\n\nBefore illustrating how those items are typically implemented with\n`DS Blocks`, let us first see a more standard solution: for solving\nissue 1, we use wrappers that perform data conversion from one step to\nthe next. This is suitable if we reuse external functions in our\npipeline and we cannot modify those functions to our needs. The second\nissue is addressed by using a `partial` function where we fix the value\nof `c`. Let us see the resulting code:\n\n``` python\nfrom functools import partial\n```\n\n``` python\ndef ignore_labels (func):\n    def wrapper (X, Y):\n        # 1. \"data conversion\" before calling function: Y is dropped, and only X is passed\n        result = func (X)\n        # 2. \"data conversion\" after calling the function: Y is attached to the result\n        return result, Y\n    return wrapper\n\nc = 0 # pipeline parametrized with c=0\npipeline = Sequential (get_data, \n                       ignore_labels (partial (add_c, c=c)), \n                       ignore_labels (square), \n                       mean_squared_error)\nerror = pipeline () \nprint (f'the error obtained with c={c} is {error}')\n```\n\n    the error obtained with c=0 is 74.66666666666667\n\nThe previous approach works fine in the current example. However, in\ngeneral, our pipelines are designed to not only work with functions, as\nin this example, but to also work with estimators that have methods\nsimilar to `fit`, `predict` and `transform`. For such a case, it is more\nconvenient to use\n[`DataConverter`](https://Jaume-JCI.github.io/ds-blocks/core/data_conversion.html#dataconverter)\nobjects, as illustrated in the code below. The\n[`DataConverter`](https://Jaume-JCI.github.io/ds-blocks/core/data_conversion.html#dataconverter)\nallows providing different conversion rules for each of the\nmethods, `fit` and `predict`, called by the pipeline. A similar thing\nhappens regarding the use of `partial`: it works well when the steps of\nthe pipeline are single functions, but it is more problematic when each\nstep runs more than one method (e.g., `fit` and `predict`). The next\ncode illustrates how this is addressed in `DS Blocks`.\n\nWe start by importing the\n[`DataConverter`](https://Jaume-JCI.github.io/ds-blocks/core/data_conversion.html#dataconverter)\nclass:\n\n``` python\nfrom dsblocks.core.data_conversion import DataConverter\n```\n\n\u2026 and defining a `DataConverter` for our pipeline, as follows:\n\n``` python\nclass IgnoreLabels (DataConverter):\n    def __init__ (self, **kwargs):\n        super ().__init__ (**kwargs)\n    def convert_before_applying (self, X, Y, **kwargs):\n        self.Y = Y\n        return X\n    def convert_after_applying (self, result, **kwargs):\n        return result, self.Y\n```\n\nAs we can see, our data converter implements two methods:\n\n- `convert_before_applying`: run *before* the given step of the pipeline\n  is run.
\n  It stores the variable `Y` returned by the previous step, and\n  only returns the variable `X`, so that the current step only receives\n  `X`.\n- `convert_after_applying`: run *after* the given step of the pipeline\n  is run. It attaches the variable `Y`, stored before, to whatever is\n  returned by the current step, so that the next step of the pipeline\n  will receive both the result of the current step and `Y`.\n\nThe above two methods manage the data conversion for *applying* the\ncurrent step. In the `DS Blocks` terminology, `apply` is equivalent to\n`predict` or `transform` on a scikit-learn estimator, and can be done\neither by calling the `apply` method, calling `predict` or `transform`\n(which are aliases), or just calling the component on the input data, as\nif it were a function (i.e., using `__call__`), which is what we do in\nthis tutorial.\n\nLater we will see how we can add methods to our\n[`DataConverter`](https://Jaume-JCI.github.io/ds-blocks/core/data_conversion.html#dataconverter)\nin order to manage data conversion before and after calling `fit` on\nour pipeline components.\n\nNow, in order to use the implemented\n[`DataConverter`](https://Jaume-JCI.github.io/ds-blocks/core/data_conversion.html#dataconverter),\nwe need to wrap the functions that need this converter in a\n[`Component`](https://Jaume-JCI.github.io/ds-blocks/core/components.html#component)\nclass. These functions are `add_c` and `square`, and we indicate the\n`DataConverter` they need to use as follows:\n\n``` python\nComponent (add_c, data_converter=IgnoreLabels)\nComponent (square, data_converter=IgnoreLabels)\n```\n\nFurthermore, in the case of `add_c`, we also want to indicate the value\nof the parameter `c`. This will prove useful later when estimating the\nerror for multiple values in parallel (see `Using Parallel` below):\n\n`Component (add_c, c=c, data_converter=IgnoreLabels)`,\n\nwhere `c` is some variable defined previously. Any parameter to be used\nby a given step can be specified in such a way. When the step uses an\nobject with several methods (e.g., `fit` and `transform`), we can\nindicate parameters to be used for both methods in the same way, like\n`Component (my_object, fit_param1=value1, fit_param2=value2, transform_param1=value3, ...)`.\n\nPutting all this together, we construct `Sequential` as follows:\n\n``` python\nfrom dsblocks import Component\n```\n\n``` python\npipeline = Sequential (get_data,\n                       Component(add_c, c=c, data_converter=IgnoreLabels),\n                       Component(square, data_converter=IgnoreLabels), \n                       mean_squared_error)\n```\n\n\u2026 and call it as usual:\n\n``` python\nerror = pipeline () \nprint (f'the error obtained with c={c} is {error}')\n```\n\n    the error obtained with c=0 is 74.66666666666667\n\nIn the previous construction of `Sequential`, some steps are indicated\nby simply passing the name of the function that implements them, like\n`get_data` and `mean_squared_error`.
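\n\n(As a purely illustrative sketch: assuming default settings wherever\nnothing is specified, the construction above is presumably equivalent to\nthe fully explicit form below, in which every step is wrapped in a\n[`Component`](https://Jaume-JCI.github.io/ds-blocks/core/components.html#component).)\n\n``` python\n# presumably equivalent, fully explicit construction (illustrative sketch)\npipeline = Sequential (Component (get_data),\n                       Component (add_c, c=c, data_converter=IgnoreLabels),\n                       Component (square, data_converter=IgnoreLabels),\n                       Component (mean_squared_error))\n```\n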
\nIndeed, such steps are automatically wrapped into\n[`Component`](https://Jaume-JCI.github.io/ds-blocks/core/components.html#component)\nclasses, so that, in the end, all the steps of the pipeline are defined by\nComponents:\n\n``` python\npipeline.components\n```\n\n    [Component GetData (name=get_data),\n     Component AddC (name=add_c),\n     Component Square (name=square),\n     Component MeanSquaredError (name=mean_squared_error)]\n\nOnly when we need to specify parameters that are specific to a given\nstep do we need to explicitly use a\n[`Component`](https://Jaume-JCI.github.io/ds-blocks/core/components.html#component)\nfor doing so, as we have done for `add_c` and `square`. If the parameter\nis common to all the steps, we can just pass it in the construction of\nSequential and it will be propagated to all the components, like so:\n\n``` python\nSequential (my_step1, my_step2, my_step3,\n               data_converter=MyDataConverter)\n```\n\nin which case all the components use `MyDataConverter` for data\nconversion.\n\n### Using [`Parallel`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#parallel)\n\nWe can estimate the error obtained with multiple values of the parameter\n`c`, using a\n[`Parallel`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#parallel)\nobject. This object is a pipeline similar to `Sequential` but where the\noutputs are not piped linearly from one step to the next. By default,\nthe same initial input is fed to all the components that compose the\n[`Parallel`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#parallel)\nobject, and the output from all of them is gathered in a tuple. Both\nbehaviours can, however, be configured through callbacks. Let us see how\nit works in our case.\n\nLet\u2019s start by importing\n[`Parallel`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#parallel):\n\n``` python\nfrom dsblocks import Parallel\n```\n\nNow we define the components to be run in this pipeline. We can do so\nwhen constructing it, `Parallel (component1, component2, component3)`, or\nbeforehand:\n\n``` python\ncomponents=(component1, component2, component3)\nParallel (*components)\n```\n\nIn our case, each of the components to be run is a `Sequential`\npipeline, where the only difference is the value of parameter `c`:\n\n``` python\npipelines = (Sequential (get_data, \n                         Component(add_c, c=c, data_converter=IgnoreLabels),\n                         Component(square, data_converter=IgnoreLabels), \n                         mean_squared_error)\n              for c in range(0,5))\n```\n\nWe pass those components to construct our\n[`Parallel`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#parallel)\ncomponent:\n\n``` python\nparallel = Parallel (*pipelines)\n```\n\n\u2026 and call that component as usual:\n\n``` python\nresult = parallel ()\nprint (f'result: {result}')\n```\n\n    result: (74.66666666666667, 27.666666666666668, 0.0, 51.666666666666664, 266.6666666666667)\n\nAs we can see, our\n[`Parallel`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#parallel)\nobject is composed of 5 pipeline components, each pipeline receiving a\ndifferent value of parameter `c`.
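\n\nSince the returned tuple keeps the same order as the pipelines we passed\n(one entry per value of `c` in `range(0, 5)`), the best value of `c` can\nbe read off directly; a minimal sketch, reusing the `result` obtained\nabove:\n\n``` python\n# the index of the smallest error corresponds to the best value of c\nbest_c = int (np.argmin (result))\nprint (f'best value of c: {best_c}')  # expected: best value of c: 2\n```\n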
\nAs noted above, the\n[`Parallel`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#parallel)\nobject runs those pipelines and gathers their results in a tuple. In\ngeneral, the\n[`Parallel`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#parallel)\nobject can be constructed by passing any collection of components, and\nthis collection can be heterogeneous. While in the current case we have\nconstructed multiple copies of the same `Sequential` object, we can just\nas well have a single copy that receives different values of `c` each\ntime, by using the\n[`ParallelInstances`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#parallelinstances)\nclass. However, the use of such a class is a bit more elaborate and we\nleave this topic for an advanced tutorial. Let us now see the error as a\nfunction of `c`:\n\n``` python\nimport matplotlib.pyplot as plt\n```\n\n``` python\nplt.plot (result, 'b.-')\n```\n\n![](index_files/figure-commonmark/cell-26-output-1.png)\n\n### Fitting models\n\nUntil now the steps of the pipeline have been functions. In our\npipelines, steps can also be specified in terms of objects. This is\nsuitable for having models or estimators whose state changes as a result\nof applying the step, using methods such as `fit` or `fit_transform`, as\ndone in scikit-learn. Specifically, each object passed as one of the\nsteps needs to have at least one of the following methods: an\n`apply` method, which can also be called `transform` or `predict`, a\n`fit` method, and a `fit_apply` method, which can also be called\n`fit_transform` or `fit_predict`, following the same terminology as in\nscikit-learn.\n\nWe now see an example of this, where we use one such object, whose class\nwe call `BruteForceModel`. This uses a simple brute-force search to find\nthe value of `c` that minimizes the error in our current objective,\ngiven a set of candidate values `c_values`:\n\n``` python\nclass BruteForceModel ():\n    def __init__ (self, c_values=range(5), **kwargs):\n        self.c_values = c_values\n    \n    def transform (self, X):\n        return (X+self.c)**2\n\n    def fit (self, X, Y):\n        error = np.empty ((len(self.c_values),))\n        for i, c in enumerate(self.c_values):\n            self.c = c\n            Y_hat = self.transform (X)\n            error[i] = mean_squared_error (Y, Y_hat)\n        self.c = self.c_values[np.argmin (error)]\n        return self\n```\n\nIn order to use such an object, we need to indicate the data conversion\nfor both the `apply` (or `transform`) and the `fit` methods.\nSince we already did that for the `apply` method above, we just need to\nadd the data conversion for `fit`. This is done by adding a new method\n`convert_before_fitting` to the `IgnoreLabels` class:\n\n``` python\ndef convert_before_fitting (self, X, Y, **kwargs):\n    self.Y = Y\n    return X, Y\n\nIgnoreLabels.convert_before_fitting = convert_before_fitting\n```\n\nWe can now construct our pipeline as follows:\n\n``` python\npipeline = Sequential (get_data,\n                       Component (BruteForceModel(), data_converter=IgnoreLabels),\n                       Component (mean_squared_error, data_converter='NoConverter'))\n```\n\nAs we can see above, the second component performs the data conversion\nindicated in `IgnoreLabels`, and the third component doesn\u2019t perform any\ndata conversion.
\nIn order to achieve that, we need to indicate that, for\nthe third component, the data converter is\n[`NoConverter`](https://Jaume-JCI.github.io/ds-blocks/core/data_conversion.html#noconverter),\nsince the default converter used by `DS Blocks` does perform a specific\ntype of data conversion when we call the `fit_apply` method, as\nexplained below.\n\nIn order to fit our newly created model, we will be calling the\n`fit_apply` method on the entire pipeline. This method is semantically\nequivalent to the `fit_transform` and `fit_predict` methods of\nscikit-learn. It makes each component of the pipeline fit to the data,\nusing the labels `Y`, then transform the data based on the fitted\nparameters, and pass the transformed data on to the next component of\nthe pipeline. This is the behaviour we have in scikit-learn, and it is\nreplicated by default in `DS Blocks`.\n\nThe default data converter used by `DS Blocks`, called\n[`GenericConverter`](https://Jaume-JCI.github.io/ds-blocks/core/data_conversion.html#genericconverter),\nmakes the sequential pipelines behave like the scikit-learn pipelines\nwhen calling the `fit_apply` method. Just like in scikit-learn, when we\nuse the default converter, the `fit` method of our components receives\nboth the data `X` and the labels `Y`, but the `apply` method only\nreceives the (transformed) data `X`.\n\nIn our current pipeline, however, we want the last component, which\napplies the `mean_squared_error` function, to receive both `X` and `Y`,\nin order to be able to calculate that error. Therefore, in the last\ncomponent we indicate that no conversion should be applied, in order to\navoid skipping the `Y` when calling `apply` on that component.\n\nOnce the new pipeline is defined, we simply call the `fit_apply` method\nas follows:\n\n``` python\npipeline.fit_apply ()\n```\n\n    0.0\n\n### MultiSplit objects\n\nLet us now make the problem a little bit more interesting: we split the\ndata into two subsets, *training* and *test*, fit our model on the\ntraining set, and have a separate estimate of the error for each of the\ntwo subsets. For that purpose, it will be handy to use the\n[`MultiSplitComponent`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#multisplitcomponent)\nfrom `DS Blocks`. In particular, we will use the\n[`MultiSplitDict`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#multisplitdict)\nsubclass for the current problem.
\nLet us import it, together with\n`train_test_split`.\n\n``` python\nfrom sklearn.model_selection import train_test_split\nfrom dsblocks.core.compose import MultiSplitDict\n```\n\nLet us now define a slightly different `get_data` function where we can\nindicate the number of data points we want to have, and a noise level\nthat is added to the data.\n\n``` python\ndef get_data (n=225, noise=1.0):\n    X = np.arange (n) \n    Y = (X+2)**2 + np.random.randn (n) * noise\n    return X, Y\n```\n\nLet us also define a new function `generate_split`, which splits the\ndata into training and test, and returns a dictionary with both subsets.\n\n``` python\ndef generate_split (X, Y, proportion_training=0.8):\n    n_samples_train=int(len(X)*proportion_training)\n    n_samples_test=len(X)-n_samples_train\n    X_train, X_test, Y_train, Y_test = train_test_split(\n        X, Y, train_size=n_samples_train, test_size=n_samples_test, shuffle=False\n    )\n    data = dict (training=(X_train, Y_train),\n                 test=(X_test, Y_test))\n    return data\n```\n\nFinally, we set our `BruteForceModel` component to use finer granularity\nfor the values of the `c` parameter:\n\n``` python\nbrute_force_model = BruteForceModel(c_values=[1.8, 1.9, 2.0, 2.1, 2.2])\n```\n\nWith all this, the new sequential pipeline is defined as follows:\n\n``` python\npipeline = Sequential (get_data, \n                       Component(generate_split, data_converter='NoConverter'), \n                       MultiSplitDict (Component(brute_force_model, data_converter=IgnoreLabels)), \n                       MultiSplitDict(Component(mean_squared_error, data_converter='NoConverter')))\n```\n\nAs we can see, there are only two changes:\n\n1.  We have an additional component in the pipeline, which applies the\n    function `generate_split`.\n2.  The two final components are wrapped in a\n    [`MultiSplitDict`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#multisplitdict)\n    class.\n\nBy default,\n[`MultiSplitDict`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#multisplitdict)\nfits the wrapped component using the `training` data, i.e., the data\nfound in the `training` field of the input dictionary. After fitting the\ncomponent, it applies it separately to the training, the test, and, if\npresent, the validation subsets from the input dictionary. We can see\nthis by observing the output of the previous pipeline when calling\n`fit_apply` on it:\n\n``` python\npipeline.fit_apply()\n```\n\n    {'training': 0.9495413378781785, 'test': 0.8841783243494468}\n\nAs we can see, the model\u2019s error is estimated separately on the training\nand the test set, and the output is a dictionary with the same fields as\nthe input to\n[`MultiSplitDict`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#multisplitdict).\nThe specific subsets to which the component is fitted and/or applied can\nbe indicated by parameter, as follows:\n\n``` python\nMultiSplitDict (my_component, \n                fit_to=subset_name, \n                apply_to=[subset_name_1, subset_name_2, ...])\n```\n\nwhere `subset_name` is a string indicating the name of the field where\nthe subset of data is found in the input dictionary.
\nLet\u2019s see this:\n\n``` python\npipeline = Sequential (get_data, \n                       Component(generate_split, data_converter='NoConverter'), \n                       MultiSplitDict (Component(brute_force_model, data_converter=IgnoreLabels), \n                                       apply_to=['test']), \n                       MultiSplitDict(Component(mean_squared_error, data_converter='NoConverter'), \n                                      apply_to=['test']))\npipeline.fit_apply()\n```\n\n    {'test': 0.6858969144872996}\n\nAs we can see, the error is now estimated only for the test set.\n\n### Experiment tracking\n\nMany times we want to be able to track the results obtained with\ndifferent values of our parameters, or across multiple runs if our\npipeline has some stochasticity. `DS Blocks` provides experiment tracking\nthrough different mechanisms. The easiest one is probably to just wrap\nany pipeline created before with a\n[`TrackingComponent`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#trackingcomponent)\nwrapper. Another possibility is to use the class\n[`SequentialWithTracking`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#sequentialwithtracking)\ninstead of using `Sequential`. Let us see each of those in turn. First,\nlet us import those two classes:\n\n``` python\nfrom dsblocks.core.compose import TrackingComponent, SequentialWithTracking\nimport joblib\n```\n\nUsing the first option, we can wrap the previously created `pipeline`\nobject with the\n[`TrackingComponent`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#trackingcomponent)\nclass:\n\n``` python\ntracking_pipeline = TrackingComponent (pipeline)\n```\n\nThis is appropriate if we first created the pipeline without the\nobjective of tracking the results, and later we want to add tracking to\nit. However, it is more common to directly define our pipeline with the\nobjective of tracking the results obtained with it. We do that by\nconstructing our top-level pipeline using\n[`SequentialWithTracking`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#sequentialwithtracking),\ninstead of using `Sequential` as done previously:\n\n``` python\ntracking_pipeline = SequentialWithTracking (\n    get_data, \n    Component (generate_split, data_converter='NoConverter'), \n    MultiSplitDict (Component(brute_force_model, data_converter=IgnoreLabels)), \n    MultiSplitDict (Component(mean_squared_error, data_converter='NoConverter'))\n)\n```\n\n    could not pickle object: Can't pickle <class 'dsblocks.core.utils.get_ds_experiment_manager.<locals>.DSExperimentManager'>: it's not found as dsblocks.core.utils.get_ds_experiment_manager.<locals>.DSExperimentManager\n    could not pickle object: Can't pickle <class 'dsblocks.core.utils.get_ds_experiment_manager.<locals>.DSExperimentManager'>: it's not found as dsblocks.core.utils.get_ds_experiment_manager.<locals>.DSExperimentManager\n\nAs we can see, the construction is done exactly the same way as the last\nconstruction we did with `Sequential`. Now, each time we run this\npipeline with new parameters, the resulting metrics are added to a\ndatabase which can be queried.
\nLet\u2019s see that with three example runs:\n\n``` python\nerror = tracking_pipeline.fit_apply (n=5, noise=1000)\nprint (error)\n\nerror = tracking_pipeline.fit_apply (n=1000, noise=1000)\nprint (error)\n\nerror = tracking_pipeline.fit_apply (n=10000, noise=1000)\nprint (error)\n```\n\n    Could not run pickle object: Can't pickle <class 'dsblocks.core.utils.get_ds_experiment_manager.<locals>.DSExperimentManager'>: it's not found as dsblocks.core.utils.get_ds_experiment_manager.<locals>.DSExperimentManager\n\n    {'training': 332782.06925177074, 'test': 4497361.700722283}\n    {'training': 908724.5037988542, 'test': 959655.3088879937}\n    {'training': 989310.3821379375, 'test': 991343.8955351294}\n\nWe can now query the last results as follows:\n\n``` python\nem = tracking_pipeline.get_experiment_manager ()\ndf = em.get_experiment_data ()\ndf [['parameters','scores']]\n```\n\n<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead tr th {\n        text-align: left;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr>\n      <th></th>\n      <th colspan=\"3\" halign=\"left\">parameters</th>\n      <th colspan=\"2\" halign=\"left\">scores</th>\n    </tr>\n    <tr>\n      <th></th>\n      <th>function</th>\n      <th>n</th>\n      <th>noise</th>\n      <th>test</th>\n      <th>training</th>\n    </tr>\n    <tr>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th>0</th>\n      <th>0</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>fit_apply</td>\n      <td>5</td>\n      <td>1000</td>\n      <td>4497361.700722</td>\n      <td>332782.069252</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>fit_apply</td>\n      <td>1000</td>\n      <td>1000</td>\n      <td>959655.308888</td>\n      <td>908724.503799</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>fit_apply</td>\n      <td>10000</td>\n      <td>1000</td>\n      <td>991343.895535</td>\n      <td>989310.382138</td>\n    </tr>\n  </tbody>\n</table>\n</div>\n\nThe table above shows, from left to right:\n\n- The experiment ID corresponding to the last three executions.\n- The parameters used for each execution. The first parameter included\n  by default is the method used in the execution, which is always\n  `fit_apply` in our case. The second and third parameters are `n` and\n  `noise`, which indicate the number of observations in our data and the\n  noise level, respectively.\n- The metric scores obtained by each experiment. There are as many score\n  names as fields in the dictionary returned by the last component of the\n  pipeline, where each score name is the corresponding dictionary field.\n  In our case, we have `test` and `training`, corresponding to the test\n  error and the training error. For each score name, we have as many\n  scores as runs we have done with the same parameters. In our case, we\n  have only run the pipeline one time for each set of parameters, and\n  therefore we only have one run number, `0`.\n\nThe intermediate steps of the execution are stored in a path associated\nwith each experiment ID. Let\u2019s say we want to revisit results for\nexperiment 1 above; in particular, we want to see the output of the step\n`generate_split`.
\nWe can do that as follows:\n\n``` python\npath_results = em.get_path_results (experiment_id=1, run_number=0)\ndata=tracking_pipeline.main.generate_split.load_result(path_results=path_results)\nprint ('training X: ', data['training'][0][:3], '\\ntraining Y: ', data['training'][1][:3])\n```\n\n    training X:  [0 1 2] \n    training Y:  [  289.81113214 -1228.50705265  -570.5033283 ]\n\nAbove, we have explored the results of using different combinations of\nparameters by running the `fit_apply` method of our pipeline multiple\ntimes. Instead of doing that, we can explore many combinations of\nparameters using search strategies like grid search or Bayesian\noptimization. Let\u2019s now see how this is done with grid search; later we\nwill see it with Bayesian optimization as part of the integration with\n`Optuna`. In order to use grid search, we can call:\n\n``` python\nem.grid_search (\n    parameters_multiple_values=dict(n=[int(1e5), int(1e6), int(1e7)],\n                                    noise=[10.0,100.0,1000.0]),\n    parameters_single_value=dict(function='fit_apply')\n)\n```\n\nwhere we have explored the combinations of `n` and `noise` that appear\nin the lists provided in `parameters_multiple_values`. At the same time,\nwe kept the parameters in `parameters_single_value` fixed to the value\nindicated; in our case, the function is `fit_apply` in all the\nexperiments.\n\nIn order to inspect the results, we can obtain a dataframe with the same\nstructure as the one explained above:\n\n``` python\ndf = em.get_experiment_data()\n```\n\nBased on this, we can plot, for instance, the evolution of the error as a\nfunction of the total number of observations, for training and test\nseparately, fixing the noise level to 100:\n\n``` python\ndf_to_analyze = df[df[('parameters','noise')]==100.0]\n#df_to_analyze = df[df[('parameters','n')]==int(1e5)]\nparameter_to_analyze = df_to_analyze[('parameters','n')].values\n#parameter_to_analyze = df_to_analyze[('parameters','noise')].values\nscores_training = df_to_analyze[('scores','training')].values\nscores_test = df_to_analyze[('scores','test')].values\n\nplt.plot (parameter_to_analyze, scores_training, 'b.-')\nplt.plot (parameter_to_analyze, scores_test, 'm.-')\nplt.xlabel ('size of data')\n#plt.xlabel ('noise')\nplt.ylabel ('error')\nplt.legend (['training', 'test']);\n```\n\n![](index_files/figure-commonmark/cell-46-output-1.png)\n\n### Integration with Optuna\n\nIn order to use Optuna-based search strategies like Bayesian\nOptimization, we first need to define a parameter sampler that makes use\nof the constructs provided in [Optuna](https://optuna.org/). Let\u2019s do\nthat for our two parameters, `noise` and `n`. We define a function that\nsamples `noise` from a uniform distribution and `n` from a discrete set\nof values, and returns a dictionary with the sampled values:\n\n``` python\ndef parameter_sampler (trial):\n    noise = trial.suggest_uniform('noise', 1000, 10000)\n    n = trial.suggest_categorical('n', [100, 250, 500])\n\n    parameters = dict(noise=noise,\n                      n=n)\n\n    return parameters\n```\n\nNext, we need to indicate which metric score needs to be optimized by\nBayesian Optimization. This is done by indicating the value of the\n`key_score` property of our experiment manager.\n
This property can be\nindicated at construction time, when building our\n[`SequentialWithTracking`](https://Jaume-JCI.github.io/ds-blocks/core/compose.html#sequentialwithtracking)\nobject, or just indicate it by assigning a value to it:\n\n``` python\nem.key_score = 'training'\n```\n\n\u2026by which we indicate that we will be optimizing the value of the\ntraining error.\n\nFinally, we run the Bayesian Optimization on the indicated parameters by\npassing the defined `parameter_sampler` function, and indicating\nadditional parameters that remain constant across experiments, in our\ncase `function='fit_apply'`, as follows:\n\n``` python\nem.hp_optimization (parameter_sampler=parameter_sampler, \n                    parameters=dict(function='fit_apply'))\n```\n\n    [I 2022-11-18 16:01:41,782] A new study created in RDB with name: hp_study\n    [I 2022-11-18 16:01:42,408] Trial 0 finished with value: 34975291.89393385 and parameters: {'noise': 5939.321535345923, 'n': 100}. Best is trial 0 with value: 34975291.89393385.\n    [I 2022-11-18 16:01:42,823] Trial 1 finished with value: 24467912.653205547 and parameters: {'noise': 4812.8931940501425, 'n': 500}. Best is trial 0 with value: 34975291.89393385.\n    [I 2022-11-18 16:01:43,229] Trial 2 finished with value: 105726109.1003538 and parameters: {'noise': 9672.964844509264, 'n': 250}. Best is trial 2 with value: 105726109.1003538.\n    [I 2022-11-18 16:01:43,650] Trial 3 finished with value: 36215819.30834488 and parameters: {'noise': 6112.401049845391, 'n': 100}. Best is trial 2 with value: 105726109.1003538.\n    [I 2022-11-18 16:01:44,088] Trial 4 finished with value: 1404915.7659739438 and parameters: {'noise': 1181.9655769629314, 'n': 500}. Best is trial 2 with value: 105726109.1003538.\n    [I 2022-11-18 16:01:44,526] Trial 5 finished with value: 93425542.45875728 and parameters: {'noise': 9807.565080094875, 'n': 100}. Best is trial 2 with value: 105726109.1003538.\n    [I 2022-11-18 16:01:44,979] Trial 6 finished with value: 4904244.708255151 and parameters: {'noise': 2064.469832820399, 'n': 500}. Best is trial 2 with value: 105726109.1003538.\n    [I 2022-11-18 16:01:45,416] Trial 7 finished with value: 30972553.479057707 and parameters: {'noise': 5696.634895750645, 'n': 500}. Best is trial 2 with value: 105726109.1003538.\n    [I 2022-11-18 16:01:45,869] Trial 8 finished with value: 27207384.64360299 and parameters: {'noise': 5105.352989948937, 'n': 500}. Best is trial 2 with value: 105726109.1003538.\n    [I 2022-11-18 16:01:46,338] Trial 9 finished with value: 49351456.854481995 and parameters: {'noise': 6508.861504501792, 'n': 250}. 
Best is trial 2 with value: 105726109.1003538.\n\n    105726109.1003538\n\nAgain, after running this we can inspect the results by obtaining the\nhistory of experiments in the form of a dataframe, as done previously:\n\n``` python\ndf = em.get_experiment_data ()\n```\n\nAnd look at the combinations of parameters and their resulting scores:\n\n``` python\ndf[['parameters', 'scores']]\n```\n\n<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead tr th {\n        text-align: left;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr>\n      <th></th>\n      <th colspan=\"3\" halign=\"left\">parameters</th>\n      <th colspan=\"2\" halign=\"left\">scores</th>\n    </tr>\n    <tr>\n      <th></th>\n      <th>function</th>\n      <th>n</th>\n      <th>noise</th>\n      <th>test</th>\n      <th>training</th>\n    </tr>\n    <tr>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th>0</th>\n      <th>0</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>fit_apply</td>\n      <td>5</td>\n      <td>1000.0</td>\n      <td>4497361.700722</td>\n      <td>332782.069252</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>fit_apply</td>\n      <td>1000</td>\n      <td>1000.0</td>\n      <td>959655.308888</td>\n      <td>908724.503799</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>fit_apply</td>\n      <td>10000</td>\n      <td>1000.0</td>\n      <td>991343.895535</td>\n      <td>989310.382138</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>fit_apply</td>\n      <td>100000</td>\n      <td>10.0</td>\n      <td>101.674732</td>\n      <td>100.014967</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>fit_apply</td>\n      <td>100000</td>\n      <td>100.0</td>\n      <td>10060.595463</td>\n      <td>9960.776863</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>fit_apply</td>\n      <td>100000</td>\n      <td>1000.0</td>\n      <td>986655.689255</td>\n      <td>1000000.368654</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>fit_apply</td>\n      <td>1000000</td>\n      <td>10.0</td>\n      <td>100.209543</td>\n      <td>99.914734</td>\n    </tr>\n    <tr>\n      <th>7</th>\n      <td>fit_apply</td>\n      <td>1000000</td>\n      <td>100.0</td>\n      <td>10017.505157</td>\n      <td>10009.726013</td>\n    </tr>\n    <tr>\n      <th>8</th>\n      <td>fit_apply</td>\n      <td>1000000</td>\n      <td>1000.0</td>\n      <td>998290.266422</td>\n      <td>1000026.188421</td>\n    </tr>\n    <tr>\n      <th>9</th>\n      <td>fit_apply</td>\n      <td>10000000</td>\n      <td>10.0</td>\n      <td>99.995626</td>\n      <td>99.945308</td>\n    </tr>\n    <tr>\n      <th>10</th>\n      <td>fit_apply</td>\n      <td>10000000</td>\n      <td>100.0</td>\n      <td>9998.97338</td>\n      <td>9988.258447</td>\n    </tr>\n    <tr>\n      <th>11</th>\n      <td>fit_apply</td>\n      <td>10000000</td>\n      <td>1000.0</td>\n      <td>999088.076542</td>\n      <td>1000050.766496</td>\n    </tr>\n    <tr>\n      <th>12</th>\n      <td>fit_apply</td>\n      <td>100</td>\n      <td>5939.321535</td>\n      <td>27276806.205229</td>\n      <td>34975291.893934</td>\n    </tr>\n    <tr>\n      <th>13</th>\n      <td>fit_apply</td>\n      <td>500</td>\n      <td>4812.893194</td>\n      <td>31520677.399808</td>\n      <td>24467912.653206</td>\n    </tr>\n    <tr>\n      <th>14</th>\n      
<td>fit_apply</td>\n      <td>250</td>\n      <td>9672.964845</td>\n      <td>101906074.636583</td>\n      <td>105726109.100354</td>\n    </tr>\n    <tr>\n      <th>15</th>\n      <td>fit_apply</td>\n      <td>100</td>\n      <td>6112.40105</td>\n      <td>14342902.683279</td>\n      <td>36215819.308345</td>\n    </tr>\n    <tr>\n      <th>16</th>\n      <td>fit_apply</td>\n      <td>500</td>\n      <td>1181.965577</td>\n      <td>1187953.622504</td>\n      <td>1404915.765974</td>\n    </tr>\n    <tr>\n      <th>17</th>\n      <td>fit_apply</td>\n      <td>100</td>\n      <td>9807.56508</td>\n      <td>85306914.7339</td>\n      <td>93425542.458757</td>\n    </tr>\n    <tr>\n      <th>18</th>\n      <td>fit_apply</td>\n      <td>500</td>\n      <td>2064.469833</td>\n      <td>4002535.937879</td>\n      <td>4904244.708255</td>\n    </tr>\n    <tr>\n      <th>19</th>\n      <td>fit_apply</td>\n      <td>500</td>\n      <td>5696.634896</td>\n      <td>27071477.780816</td>\n      <td>30972553.479058</td>\n    </tr>\n    <tr>\n      <th>20</th>\n      <td>fit_apply</td>\n      <td>500</td>\n      <td>5105.35299</td>\n      <td>21697305.132237</td>\n      <td>27207384.643603</td>\n    </tr>\n    <tr>\n      <th>21</th>\n      <td>fit_apply</td>\n      <td>250</td>\n      <td>6508.861505</td>\n      <td>33483663.716484</td>\n      <td>49351456.854482</td>\n    </tr>\n  </tbody>\n</table>\n</div>\n\n## Documentation\n\nFor further details, please see the\n[documentation](https://jaume-jci.github.io/ds-blocks/)\n",
    "bugtrack_url": null,
    "license": "Apache Software License 2.0",
    "summary": "DS Blocks",
    "version": "0.0.15",
    "split_keywords": [
        "nbdev",
        "jupyter",
        "notebook",
        "python"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "bb54be5a3ee4e2b964f15c32658039ff",
                "sha256": "3b9344e504b25cecde2363e371d0c5b6a5a28e95c518d11a713a5360c9f3d787"
            },
            "downloads": -1,
            "filename": "dsblocks-0.0.15-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "bb54be5a3ee4e2b964f15c32658039ff",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7,<=3.12",
            "size": 154337,
            "upload_time": "2022-12-02T01:25:21",
            "upload_time_iso_8601": "2022-12-02T01:25:21.963707Z",
            "url": "https://files.pythonhosted.org/packages/19/87/31c383d30c6d049b5d216e7b4cb72aa4490aaa9c26f932cca2bfa5c5eede/dsblocks-0.0.15-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "8c7264896d55b989424e98669373c9ab",
                "sha256": "6f42a171e296bfd4da99c19db9c7d64abbb517bdba9f9275ad438c4cdfd0b7c2"
            },
            "downloads": -1,
            "filename": "dsblocks-0.0.15.tar.gz",
            "has_sig": false,
            "md5_digest": "8c7264896d55b989424e98669373c9ab",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7,<=3.12",
            "size": 166425,
            "upload_time": "2022-12-02T01:25:24",
            "upload_time_iso_8601": "2022-12-02T01:25:24.410314Z",
            "url": "https://files.pythonhosted.org/packages/53/8c/fa09d62eed830500936c28b86f6b89d9262140f6410b8fa39d207db2591c/dsblocks-0.0.15.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-12-02 01:25:24",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "Jaume-JCI",
    "github_project": "ds-blocks",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "dsblocks"
}
        