pyRobustRegressionLib

Name: pyRobustRegressionLib
Version: 1.3.2
Home page: https://github.com/bschulz81/robustregression
Summary: A library that implements algorithms for linear and non-linear robust regression.
Upload time: 2024-01-02 05:50:37
Author: Benjamin Schulz
Requires Python: >=3.7
License: MIT License
Keywords: robust regression, forward-search, Huber's loss function, repeated median regression, simple linear regression, non-linear regression, Levenberg-Marquardt algorithm, statistics, estimators, S-estimator, Q-estimator, Student t-distribution, interquartile range, machine learning
# RobustregressionLib
This is a C++ library with statistical machine learning algorithms for linear and non-linear robust regression.

It implements the statistical algorithms that were originally developed by the author for an autofocus application for telescopes

and published in arXiv:2201.12466 [astro-ph.IM], https://doi.org/10.1093/mnras/stac189

In addition to these, two other robust algorithms were added, and the curve fitting library has been brought into the form of a
clear and simple API that can easily be used for very broad and general applications.

The library offers Python bindings for most functions, so the programmer has the choice between C++ and Python. In order to
compile the library with Python bindings, pybind11 and Python 3 should be installed and found by CMake.
Otherwise, only the C++ standard template library is used, together with OpenMP.

The documentation of the library's functions is written in the C++ header files and in the __doc__ strings of the Python bindings.

In addition, a C++ application and a Python script are provided that show the functions of the library with very simple data.

The library is released under the MIT License.

Apart from his own publication, the author has not found the main robust curve fitting algorithms from this library in the statistical literature.

One of the algorithms presented here is a modification of the forward search algorithms by Hadi and Simonoff, Atkinson and Riani, and the least trimmed squares
method of Rousseeuw. The author's modification is to use various estimators to decide which data to include after the algorithm has tried a random initial combination.

The algorithm was originally developed for physical problems where one has outliers but also data that is often subject to random fluctuations, such as astronomical seeing.
As we observed during trials with the astronomy application, including the S estimator in the forward search removes large outliers but allows for small random fluctuations
in the data, which results in more natural curve fits than simply selecting the "best" model obtained from the forward search. If some degree of randomness is present,
the "best" model chosen by such a method would almost certainly have the smallest error by accident and would not include enough points for a precise curve fit.
The use of the statistical estimators in the forward search appeared to prevent this problem.

The modified least trimmed squares method was also used by the author in arXiv:2201.12466 with the various estimators to judge the quality of measurement data, which was
defined as "better" when the algorithm, if used successively with several different estimators, came to more similar results.

Another algorithm presented in this library is an iterative method which also employs various estimators. It has the advantage that it should work with larger datasets but its statistical 
properties have not been extensively tested yet.

Because of the use of various statistical estimators and methods, the library builds on previous articles from the statistical literature. 
Some references are:

1. Smiley W. Cheng, James C. Fu, Statistics & Probability Letters 1 (1983), 223-227, for the t distribution
2. B. Peirce, Astronomical Journal II 45 (1852), for the Peirce criterion
3. Peter J. Rousseeuw, Christophe Croux, J. of the Amer. Statistical Assoc. (Theory and Methods), 88 (1993), p. 1273, for the S, Q, and T estimators
4. T. C. Beers, K. Flynn and K. Gebhardt, Astron. J. 100 (1), 32 (1990), for the biweight midvariance
5. Transtrum, Mark K, Sethna, James P (2012). "Improvements to the Levenberg-Marquardt algorithm for nonlinear least-squares minimization". arXiv:1201.5885, for the Levenberg-Marquardt algorithm
6. Rousseeuw, P. J. (1984). Journal of the American Statistical Association. 79 (388): 871–880. doi:10.1080/01621459.1984.10477105. JSTOR 2288718.
   Rousseeuw, P. J.; Leroy, A. M. (2005) [1987]. Robust Regression and Outlier Detection. Wiley. doi:10.1002/0471725382. ISBN 978-0-471-85233-9, for the least trimmed squares algorithm
7. Hadi and Simonoff, J. Amer. Statist. Assoc. 88 (1993) 1264-1272, Atkinson and Riani, Robust Diagnostic Regression Analysis (2000), Springer, for the forward search
8. Croux, C., Rousseeuw, P.J. (1992). Time-Efficient Algorithms for Two Highly Robust Estimators of Scale. In: Dodge, Y., Whittaker, J. (eds) Computational Statistics. Physica, Heidelberg. https://doi.org/10.1007/978-3-662-26811-7_58,
   for the faster versions of the S and Q estimators. The versions of the S and Q estimators in this library are adapted from Croux and Rousseeuw to the C language. Note that it is not the same code because of some optimizations. Since many variables act on array indices in this algorithm, it was actually non-trivial to convert from Fortran to C. For the Q estimator, the naive algorithm is faster for fewer than 100 datapoints; for the S estimator this is the case for fewer than 10 datapoints. Therefore, in these cases the naive versions are still used.
9. Andrew F. Siegel. Robust regression using repeated medians. Biometrika, 69(1):242–244, 1982; Andrew Stein and Michael Werman. 1992. Finding the repeated median regression line. In Proceedings of the third annual ACM-SIAM symposium on Discrete algorithms (SODA '92). Society for Industrial and Applied Mathematics, USA, 409–413. https://dl.acm.org/doi/10.5555/139404.139485

# Compiling and Installing the library:

The library needs CMake and a C++ compiler that is at least able to generate code according to the C++14 standard
(by default, if one does not use Clang or macOS, it uses C++17, but with the CXX_STANDARD 14 switch set in the
CMakeLists.txt of the library, it can use C++14, which is the default for Clang and macOS).

The library also makes use of OpenMP and needs Python 3 and pybind11 to compile.

By default, the library also contains two test applications.

If the variable $WITH_TESTAPP is set to ON, a C++ test application is compiled and put into an /output directory.

The library also ships with a Python module. By default, the CMake variable $WITH_PYTHON is set to ON and a Python module will
be generated in addition to the C++ library.

If $WITH_TESTAPP and $WITH_PYTHON are set, which is the default, then a Python test application will be generated in addition to
the C++ test application.

## Installing with CMake
One can also compile the library in the traditional way via CMake. Typing

> cmake . 

in the package directory will generate the files necessary to compile the library, depending on the CMake generator set by the user.

Under Linux, the command

> make 

will then compile the library.

After compilation, an /output directory will appear with the library in binary form. 
By default, the library also contains two test applications.

If the variable $WITH_TESTAPP is set, a C++ test application is compiled and put into the /output directory.

The library also ships with a Python module. By default, the CMake variable $WITH_PYTHON is ON and a Python module will
be generated in addition to the C++ library and copied into the /output directory.

If $WITH_TESTAPP and $WITH_PYTHON are set to ON, which is the default, then a Python test application will be generated and compiled into the /output directory.

When compiling with CMake, the Python module is only compiled into the /output directory. It is not installed in a path for system libraries
or Python packages. So if one wants to use the Python module, one has either a) to write the script in the same folder where the module is, b) to load
it by pointing Python to the explicit path of the module, or c) to copy the module to a place where Python can find it.
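
For option b), a minimal sketch (assuming the script is run from the package directory and the module was built into the /output directory; adjust the path as needed):

> import sys
> 
> sys.path.insert(0, "./output")  # directory containing the compiled module (assumed location)
> 
> import pyRobustRegressionLib as rrl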


If one does not want the Python module to be compiled, one may set the cmake variable $WITH_PYTHON to OFF.

## Installing with PIP (This option is mostly for Windows since Linux distributions have their own package managers)

If one wants the module to be installed into a library path where Python scripts can find it easily, one can compile and install the module
with pip instead of CMake by typing

> pip install .

in the package directory. 

After that, the module is compiled and the binaries are copied into a path for Python site-packages, where Python scripts should be able to find it.

This is successful on Windows.

Unfortunately, problems finding the module remain on Linux.
If pip is called by the root user, the module is copied into the /usr/lib directory. Despite this, Python scripts have difficulties loading the module if they do not use a hardcoded import path.
If one does not install the module as the root user, pip will install it in a local site-packages directory, where Python also has problems finding the module without a hardcoded import path.
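
To check where pip actually placed the module, one can print its location from Python (standard Python, not part of the library):

> import pyRobustRegressionLib
> 
> print(pyRobustRegressionLib.\_\_file__)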

If the module was compiled by pip, one can uninstall it by typing

> pip uninstall pyRobustRegressionLib

Under Linux, compiling with CMake should be preferred, not least because Linux package managers (e.g. Portage from Gentoo) sometimes have conflicts with pip.

Additionally, the Python build environment will select Ninja as the default generator, which requires cleaning the build files
if an earlier build with CMake used a different generator.


# Documentation of the library functions

## For the Python language

### Calling the documentation
Documentation of the API is provided in the C++ header files in the /library/include directory and in the docstrings of the Python module in the src/pyRobustRegressionLib
module. The latter can be loaded in Python scripts with

> import pyRobustRegressionLib as rrl

The command 

> print(rrl.\_\_doc__)

will list the sub-modules of the library, which are

- StatisticFunctions,
- LinearRegression,
- MatrixCode,
- NonLinearRegression,
- RobustRegression, and
- LossFunctions

Their docstrings can be called by

> print(rrl.*SubModuleName*.\_\_doc__)

For example,

> print(rrl.StatisticFunctions.\_\_doc__)

will list the functions and classes of the sub-module StatisticFunctions. The free functions and classes all have more detailed docstrings
that can be called, for example, as

> print (rrl.MatrixCode.Identity.\_\_doc__)

More convenient documentation is provided in the header files of the C++ source code of the package,
which can be found in the /library/include directory.

In the testapp folder, two example programs, one in Python and one in C++, are provided.
These test applications have extensive comments and call many functions of the library, which shows the basic usage.

The curve fits done in the provided example programs are, of course, very simple.
This was done in order to keep the demonstration short and simple.
The library is intended to be used with larger and much more complicated data.

### Simple linear regression

Let us now define some arrays of data X and Y which we want to fit to a line.

> print("\nDefine some arrays X and Y")
> 
> X=[-3.0,5.0,7.0,10.0,13.0,16.0,20.0,22.0]
> 
> Y=[-210.0,430.0,590.0,830.0,1070.0,1310.0,1630.0,1790.0]

A simple linear fit can be called as follows:
At first we create an instance of the result structure, where the result is stored.

> res=rrl.LinearRegression.result()

Then, we call the linear regression function
> rrl.LinearRegression.linear_regression(X, Y, res)



And finally, we print out the slope and intercept
> print("Slope")
> 
> print(res.main_slope)
> 
> print("Intercept")
> 
> print(res.main_intercept)

### Robust regression
The robust regression is just slightly more complicated. Let us first add two outliers into the data:

> X2=[-3.0, 5.0,7.0, 10.0,13.0,15.0,16.0,20.0,22.0,25.0]
> 
> Y2=[ -210.0, 430.0, 590.0,830.0,1070.0,20.0,1310.0,1630.0,1790.0,-3.0]


#### Siegel's repeated median regression
For linear regression, the library also has Siegel's repeated median regression, which can be called in the same way

> rrl.LinearRegression.median_linear_regression(X2, Y2, res)

but is slightly more robust.


#### Modified forward search/modified Lts regression
Median linear regression is a bit slower than simple linear regression and can yield wrong results if many outliers are present.

Therefore, the library has two methods for robust regression that can remove outliers.

The structure that stores the result for robust linear regression now includes the indices of the used and rejected points.

We instantiate it with

> res= rrl.RobustRegression.linear_algorithm_result()

Additionally, there is a struct that determines the behavior of the algorithm.  
Upon construction without arguments, it gets filled with default values.

> ctrl= rrl.RobustRegression.modified_lts_control_linear()

By default, the S-estimator is used with an outlier_tolerance parameter of 3, and the method can flag at most 30% of the points as outliers.
All of this can be changed, of course.

Now we call the modified forward search/modified lts algorithm
> rrl.RobustRegression.modified_lts_regression_linear(X2, Y2, ctrl, res)

and then print the result, first the slope and intercept 
> print("Slope")
> 
> print(res.main_slope)
> 
> print("Intercept")
> 
> print(res.main_intercept)

Then the indices of the outliers

> print("\nOutlier indices")
> 
> for ind in res.indices_of_removedpoints:
> 
>  print(ind)

When we want to change the estimator, the outlier_tolerance parameter, the loss function, the maximum number of outliers that can be found, or other details, we can simply
set this in the control struct.

By default, the S estimator is used with an outlier_tolerance of 3, i.e. one has

> ctrl.rejection_method=rrl.RobustRegression.estimator_name.tolerance_is_decision_in_S_ESTIMATION

and a point is an outlier if 
> |err-median(errs)|/S_estimator(errs)>outlier_tolerance

where err is the residual of the point and errs is the array of residuals.
They are measured by the specified loss function. If none was given, squared errors are used by default.

By default, the outlier_tolerance parameter is set to 3.
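
As a plain illustration of this decision rule (a standalone sketch, not the library's implementation; scale_estimate is a hypothetical placeholder for the S, Q, or another robust scale estimator):

> from statistics import median
> 
> def is_outlier(err, errs, scale_estimate, outlier_tolerance=3.0):
> 
>      # err: residual of one point, errs: list of all residuals
> 
>      # scale_estimate: callable returning a robust scale of errs (stand-in for the S or Q estimator)
> 
>      return abs(err - median(errs)) / scale_estimate(errs) > outlier_tolerance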

If we want to have a different value, e.g. 3.5, for the outlier_tolerance parameter, we can easily set e.g.

> ctrl.outlier_tolerance=3.5

The command below would imply that the Q estimator is used:

> ctrl= rrl.RobustRegression.modified_lts_control_linear()
> 
> ctrl.rejection_method=rrl.RobustRegression.estimator_name.tolerance_is_decision_in_Q_ESTIMATION

Then a point is an outlier if, for its residual err with respect to the curve fit, given the array of all residuals errs:
> |err-median(errs)|/Q_estimator(errs)>outlier_tolerance


The command below would change the estimator to the interquartile range method when the modified lts/modified forward search algorithm is used.

With the setting below, a point is an outlier if its residual is below Q1 − tolerance*IQR or above Q3 + tolerance*IQR.
If we do not change the loss function, least squares is used by default.

> ctrl= rrl.RobustRegression.modified_lts_control_linear()
> 
> ctrl.rejection_method=rrl.RobustRegression.estimator_name.tolerance_is_interquartile_range

For the interquartile range estimator, we should usually set the tolerance to 1.5

> ctrl.outlier_tolerance=1.5

before we call the regression function.
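
As an illustration of this rule (a standalone sketch, not a call into the library; numpy is used here only to compute the quartiles):

> import numpy as np
> 
> def iqr_outliers(errs, outlier_tolerance=1.5):
> 
>      # flags residuals below Q1 - tolerance*IQR or above Q3 + tolerance*IQR
> 
>      q1, q3 = np.percentile(errs, [25, 75])
> 
>      iqr = q3 - q1
> 
>      return [e < q1 - outlier_tolerance * iqr or e > q3 + outlier_tolerance * iqr for e in errs]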

Similarly, the loss function can be changed. For example, the absolute value of the residuals is usually more statistically robust than the square of the residuals

> ctrl.lossfunction=rrl.LossFunctions.absolutevalue

changes the loss function to the absolute value.

One can also specify Huber's loss function, but then one also has to supply a border parameter beyond which the
function starts its linear behavior.
One also can set a log cosh loss function, or a quantile loss function. The latter needs a gamma parameter to be specified within the interval of 0 and 1.
Finally, one can define a custom loss function with a callback mechanism.

Note that if we use the linear versions of the robust regression, then these methods would just make simple linear
fits or repeated median fits, which minimize their own loss function, and the selected loss function of the library
is then only used for outlier removal.

With the robust non-linear regression algorithms, the custom error functions are used for the curve fits as well 
as for the outlier removal procedures.

If one needs a linear fit where the custom error function is used as a metric for the curve fit as well as for the outlier
removal, one has to use the non-linear algorithm with a linear call back function. 

Note also that the quantile loss function is asymmetric. Therefore, the quantile loss function should mostly be used with
the linear robust curve fitting algorithms, since then it is only used for outlier removal. 
If the quantile loss function is used with the non-linear robust algorithms it is likely to confuse the Levenberg-Marquardt algorithm 
because of its asymmetry.




Note that the forward search can be very time consuming, as its complexity is given by the binomial coefficient of the number of points over the maximum number of
outliers it can find (which per default is 30% of the number of points).
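
To get a feeling for this growth, a small illustration in plain Python (unrelated to the library API):

> from math import comb
> 
> for n in (10, 20, 40):
> 
>      k = int(0.3 * n)  # default: at most 30% of the points may be outliers
> 
>      print(n, k, comb(n, k))  # prints 10 3 120, 20 6 38760, 40 12 5586853480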

#### Iterative outlier removal

A faster algorithm is the iterative outlier removal method, which makes a curve fit with all points, then removes the points whose residuals are outliers, and
then makes another curve fit with the remaining points, repeating until no more outliers are found.
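
Conceptually, the loop can be sketched as follows (a simplified illustration of the idea, not the library's actual code; fit and find_outlier_indices are hypothetical placeholders):

> def iterative_outlier_removal(X, Y, fit, find_outlier_indices):
> 
>      points = list(zip(X, Y))
> 
>      while True:
> 
>           model = fit(points)  # curve fit with the current point set
> 
>           outliers = find_outlier_indices(model, points)  # indices of points whose residuals are outliers
> 
>           if not outliers:
> 
>                return model, points
> 
>           points = [p for i, p in enumerate(points) if i not in outliers]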

It can be called similarly:

We define the control structure. Note that it is different from the modified forward search/modified lts control structure. 
> ctrl= rrl.RobustRegression.linear_algorithm_control()

And again create a structure for the result
> res= rrl.RobustRegression.linear_algorithm_result()

Then we start the algorithm
> rrl.RobustRegression.iterative_outlier_removal_regression_linear(X2, Y2, ctrl, res)

And print the result
> print("Slope")
> 
> print(res.main_slope)
> 
> print("Intercept")
> 
> print(res.main_intercept)
> 
> print("\nOutlier indices")
> 
> for ind in res.indices_of_removedpoints:
> 
>  print(ind)



### Non-linear regression
Non-linear regression uses an implementation of the Levenberg-Marquardt algorithm.

The Levenberg-Marquardt algorithm needs an initial guess for an array of parameters beta, a function f(X,beta) to be fitted and a Jacobian matrix J(X,beta).
For example, a curve fit can be made by initialising the result, control and initdata structures as follows.
After a call of the constructors:

> res=rrl.NonLinearRegression.result()
> 
> ctrl=rrl.NonLinearRegression.control()
> 
> init=rrl.NonLinearRegression.initdata()

We  supply a Jacobi matrix, a function and an initial guess for the parameters to be found (assuming the function has just 2 curve fitting parameters):

> init.Jacobian=Jacobi
>
> init.f=linear
> 
> init.initialguess = [1,1]

where Jacobi and linear are two user-defined functions.

If we wanted to fit a line, we would have to implement the function f(X,beta) and the Jacobian J(X,beta) as follows:

> def linear(X, beta):
>
>     Y=[]
>
>     for i in range(0,len(X)):
>
>          Y.append(beta[0] * X[i] + beta[1])
>
>     return Y


> def Jacobi(X, beta):
> 
> 	m=rrl.MatrixCode.Matrix(len(X), len(beta))
> 
> 	for i in range(0,len(X)):
> 
> 	 	m[i, 0] = X[i]
> 
> 	 	m[i, 1] = 1
> 
> 	return m

Then we can call the Levenberg-Marquardt algorithm

> rrl.NonLinearRegression.non_linear_regression(X2, Y2, init, ctrl, res)

and then print the result:
> print("Slope")
> 
> print(res.beta[0])
> 
> print("Intercept")
> 
> print(res.beta[1])

The class NonLinearRegression.control has various parameters that control the behavior of the Levenberg-Marquardt algorithm, among them
conditions for when the algorithm should stop (e.g. after some time, after there is no improvement, or after the error is below a certain margin).
They may be set as desired.

Additionally, there are parameters that control the step size of the algorithm. These parameters have the same names as described at
https://en.wikipedia.org/wiki/Levenberg%E2%80%93Marquardt_algorithm and, if not otherwise specified,
defaults are used for them that usually work.


### Non-linear robust regression
For non-linear curve fits the library also has the modified forward search/modified lts algorithm and the iterative outlier removal as for linear regression.
For a modified forward search/lts algorithm with default parameters (S estimator, outlier_tolerance=3, loss function as least squares, 30% of the points are outliers at maximum), a call looks as follows:

First the initialisation:
> res=rrl.RobustRegression.nonlinear_algorithm_result()
> 
> ctrl=rrl.RobustRegression.modified_lts_control_nonlinear()
> 
> init=rrl.NonLinearRegression.initdata()
> 
> init.Jacobian=Jacobi
> 
> init.f=linear
> 
> init.initialguess = [1,1]

Then the function call:
> rrl.RobustRegression.modified_lts_regression_nonlinear(X2, Y2, init, ctrl, res)

Finally, we print the result:
> print("Slope")
> 
> print(res.beta[0])
> 
> print("Intercept")
> 
> print(res.beta[1])
> 
> print("\nOutlier indices")
> 
> for ind in res.indices_of_removedpoints:
> 
>    print(ind)




For the iterative outlier removal algorithm, a call to the regression function with default parameters (S estimator, outlier_tolerance=3, loss function as least squares, 30% of the points are outliers at maximum) would look as follows:

> res=rrl.RobustRegression.nonlinear_algorithm_result()
> 
> ctrl=rrl.RobustRegression.nonlinear_algorithm_control()
> 
> init=rrl.NonLinearRegression.initdata()
> 
> init.Jacobian=Jacobi
> 
> init.f=linear
> 
> init.initialguess = [1,1]

> rrl.RobustRegression.iterative_outlier_removal_regression_nonlinear(X2, Y2, init, ctrl, res)

### Custom error functions
By default, the library uses the sum of the squared residuals divided by the number of points as a loss function.
One can also specify Huber's loss function, but then one also has to supply a border parameter beyond which the loss function starts its linear behavior.
One also can set a log cosh loss function, or a quantile loss function. The latter needs a gamma parameter to be specified within the interval of 0 and 1.

Finally, one can define a custom loss function with a callback mechanism.

We may define user-defined loss functions. This is done in two steps. A function

> def err_pp(Y,fY,pointnumber):
>     return (Y-fY)*(Y-fY)/pointnumber

computes the residual between the data and the curve fit for a single point.
Another function

> def aggregate_err(errs):
>     res=0
>     for i in range(0,len(errs)):
>          res+=errs[i]
>     return res

computes an aggregate error from the list of residuals generated by the function err_pp.
Note that if the data does not correspond perfectly to the curve, this should preferably be some kind of average error instead of a simple sum,
since otherwise removing a point will always reduce the error. Because the robust methods delete points based on the aggregate error, a plain sum would usually lead to
curve fits which do not take enough points into consideration. The division by the number of points can be done in err_pp (as in this example) or in aggregate_err.
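
A tiny numeric illustration of why an average is preferable to a plain sum (standard Python, unrelated to the library API):

> errs = [1.0, 2.0, 3.0]
> 
> print(sum(errs), sum(errs[1:]))  # 6.0 -> 5.0: the plain sum always shrinks when a point is removed
> 
> print(sum(errs)/len(errs), sum(errs[1:])/len(errs[1:]))  # 2.0 -> 2.5: the average need not shrink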

The following call will then make a robust curve fit with the custom error function

> res9=rrl.RobustRegression.nonlinear_algorithm_result() 
> ctrl9=rrl.RobustRegression.modified_lts_control_nonlinear()
> ctrl9.lossfunction=rrl.LossFunctions.custom
> ctrl9.loss_perpoint=err_pp
> ctrl9.aggregate_error=aggregate_err

> init9=rrl.NonLinearRegression.initdata() 
> init9.Jacobian=Jacobi
> init9.f=linear
> init9.initialguess = [1,1]
> rrl.RobustRegression.modified_lts_regression_nonlinear(X2, Y2, init9, ctrl9, res9)

Note that if we use the linear versions of the robust regression, then these methods would just make simple linear
fits or repeated median fits, which minimize their own loss function, and the selected loss function of the library
is then only used for outlier removal.

With the robust non-linear regression algorithms, the custom error functions are used for the curve fits as well 
as for the outlier removal procedures.

If one needs a linear fit where the custom error function is used as a metric for the curve fit as well as for the outlier
removal, one has to use the non-linear algorithm with a linear call back function. 

Note also that the quantile loss function is asymmetric. Therefore, the quantile loss function should mostly be used with
the linear robust curve fitting algorithms, since then it is only used for outlier removal. 
If the quantile loss function is used with the non-linear robust algorithms it is likely to confuse the Levenberg-Marquardt algorithm 
because of its asymmetry.




## For the C++ language:

In general, one has to include the library headers as follows:

> #include "statisticfunctions.h"
>
> #include "Matrixcode.h"
> 
> #include "linearregression.h"
> 
> #include "robustregression.h"
> 
> #include "nonlinearregression.h"
>
> #include "lossfunctions.h" 
>

### Simple Linear Regression
The usage of the library in C++ is essentially similar to that in Python. The file testapplication.cpp demonstrates the same function calls.
The X and Y data are stored in C++ valarrays. The control, result and initdata objects are not classes, but structs.

For example, if we define some X,Y data:

> valarray<double> X = { -3, 5,7, 10,13,16,20,22 };
> 
> valarray<double> Y = { -210, 430, 590,830,1070,1310,1630,1790 };

and initialize the struct where we store the result,

> Linear_Regression::result res;

we can call a linear regression as follows:

> Linear_Regression::linear_regression(X, Y, res);

and we may print the result:

> printf(" Slope ");
> 
> printf("%f", res.main_slope);
> 
> printf("\n Intercept ");
> 
> printf("%f", res.main_intercept);

### Robust Regression methods

Let us first define X and Y data with two outliers added.
>	valarray<double> X2 = { -3, 5,7, 10,13,15,16,20,22,25 };
>
>	valarray<double> Y2 = { -210, 430, 590,830,1070,20,1310,1630,1790,-3 };

#### Median Linear Regression: 
Median regression is slower but more robust than simple linear regression.
This command calls a robust curve fit with median regression

> Linear_Regression::result res;
> 
> Linear_Regression::median_linear_regression(X2, Y2, res);

and then we print the result

> printf(" Slope ");
> 
> printf("%f", res.main_slope);
> 
> printf("\n Intercept ");
> 
> printf("%f", res.main_intercept);

#### Modified forward search
When many and large outliers are present, median regression sometimes does not deliver the desired results.

The library therefore has the modified forward search/modified lts algorithm and the iterative outlier removal algorithm that can
find outliers and remove them from the curve fit entirely.

Below is a call for a robust modified forward search/modified lts algorithm initialized with default
parameters:

First the structs for controlling the algorithm and storing the results are initialised,

> Robust_Regression::modified_lts_control_linear ctrl;
> 
> Robust_Regression::linear_algorithm_result res;

Then, we call the function for the curve fit:
>	Robust_Regression::modified_lts_regression_linear(X2, Y2, ctrl, res);

Then we print the result:

>	printf(" Slope ");
>
>	printf("%f", res.main_slope);
>
>	printf("\n Intercept ");
>
>	printf("%f", res.main_intercept);
>
>	printf("\n Indices of outliers ");
>
>	for (size_t i = 0; i < res.indices_of_removedpoints.size(); i++)
>	{
>	&emsp; 	size_t w = res.indices_of_removedpoints[i];
>
>	&emsp;	printf("%lu", (unsigned long)w);
>
>	&emsp;	printf(", ");
>
>	}


The default parameters are as follows:

The S estimator is used, outlier_tolerance=3, at most 30% of the points are outliers, and the loss function is given by the least squares of the residuals.

As in the Python documentation, a point with residual err is then an outlier if 

> |err-median(errs)|/S_estimator(errs)>3

where errs is the array of residuals.


If the Q-estimator is used instead, the initialisation for the modified forward search/lts algorithm looks like

>	Robust_Regression::modified_lts_control_linear ctrl;
>
>	Robust_Regression::linear_algorithm_result res;
>
>	ctrl.rejection_method = Robust_Regression::tolerance_is_decision_in_Q_ESTIMATION;

Then we call the regression function:
>	Robust_Regression::modified_lts_regression_linear(X2, Y2, ctrl, res);

If the interquartile range estimator should be used, so that a point is removed if it is below Q1 − outlier_tolerance*IQR or above Q3 + outlier_tolerance*IQR,
we would have to set:

> ctrl.rejection_method = Robust_Regression::tolerance_is_interquartile_range;

For the interquartile range estimator, outlier_tolerance should be set to 1.5, so we additionally have to write:
>	ctrl.outlier_tolerance = 1.5;

before we call the regression function:
>	Robust_Regression::modified_lts_regression_linear(X2, Y2, ctrl, res);

Similarly, some may prefer to set the outlier_tolerance parameter to 3.5 when the S, Q, MAD, or other estimators are used.

The loss function can also be changed. For example, the absolute value |err| of the residuals is usually more robust than the square err^2.
The command

> ctrl.lossfunction = LossFunctions::absolutevalue;

changes the loss function to the absolute value.

One can also specify Huber's loss function, but then one also has to supply a border parameter  beyond which the function starts its linear behavior.
One also can set a log cosh loss function, or a quantile loss function. The latter needs a gamma parameter to be specified within the interval of 0 and 1.
Finally, one can define a custom loss function with a callback mechanism.


Note that if we use the linear versions of the robust regression, then these methods would just make simple linear
fits or repeated median fits, which minimize their own loss function, and the selected loss function of the library
is then only used for outlier removal.

With the robust non-linear regression algorithms, the custom error functions are used for the curve fits as well 
as for the outlier removal procedures.

If one needs a linear fit where the custom error function is used as a metric for the curve fit as well as for the outlier
removal, one has to use the non-linear algorithm with a linear call back function. 

Note also that the quantile loss function is asymmetric. Therefore, the quantile loss function should mostly be used with
the linear robust curve fitting algorithms, since then it is only used for outlier removal. 
If the quantile loss function is used with the non-linear robust algorithms it is likely to confuse the Levenberg-Marquardt algorithm 
because of its asymmetry.

#### Iterative outlier removal
The modified forward search/modified lts algorithm can be slow since its complexity is given by the binomial coefficient of the number of points over the maximum
number of outliers to be found. The iterative outlier removal algorithm is faster. It makes a curve fit of all points, then removes the outliers based on the
loss function and estimator, makes another curve fit, and repeats the procedure until no outliers are found.

A call to this method with the default parameters S estimator, outlier_tolerance=3, the loss function as least_squares and at most 30% of the points
designated as outliers, would look as follows:

>	Robust_Regression::linear_algorithm_result res;
>
>	Robust_Regression::linear_algorithm_control ctrl;
>
>	Robust_Regression::iterative_outlier_removal_regression_linear(X2, Y2, ctrl, res);

### Non-linear Regression
The non-linear curve fitting algorithm implements a Levenberg-Marquardt algorithm.

For it to work, we must first supply initialisation data in the form of a valarray of initial parameters beta,
a function f(X,beta) to be fitted and a Jacobian J(X,beta).

If we wanted to fit a line, we would have to define the function f(X,beta) to be fitted

> valarray<double> linear(const valarray<double>&X,const  valarray<double>& beta)
> {
> 
>	&emsp;valarray<double> Y(X.size());
> 
>	&emsp;for (size_t i = 0; i < X.size(); i++)
> 
>	&emsp;&emsp;	Y[i] = beta[0] * X[i] + beta[1];
> 
>	&emsp;return Y;
> 
> }
>

and its Jacobian Matrix J(X,beta)
> Matrix Jacobi(const valarray<double>& X, const  valarray<double>& beta)
> {
> 
>	&emsp;Matrix ret(X.size(), beta.size());
> 
>	&emsp;for (size_t i = 0; i < X.size(); i++)
> 
>	&emsp;{
> 
>	&emsp;&emsp;	ret(i, 0) = X[i];
> 
>	&emsp;&emsp;	ret(i, 1) = 1;
> 
>	&emsp;}
> 
>	&emsp;return ret;
> 
> }

A non-linear fit can then be initialised with default control parameters for the Levenberg-Marquardt algorithm like this:


>	Non_Linear_Regression::result res;
>
>	Non_Linear_Regression::control ctrl;
>
>	Non_Linear_Regression::initdata init;
>
>	init.f = linear;
>
>	init.J = Jacobi;
>
>	valarray<double>beta = { 1,1 };
>
>	init.initialguess = beta;

Then one can call the function:
>	Non_Linear_Regression::non_linear_regression(X, Y, init, ctrl, res);

Then we may print the result:

>	printf("\n Slope ");
>
>	printf("%f", res.beta[0]);
>
>	printf("\n intercept ");
>
>	printf("%f", res.beta[1]);

The struct Non_Linear_Regression::control has various parameters that control the behavior of the Levenberg-Marquardt algorithm, among them
conditions for when the algorithm should stop (e.g. after some time, after there is no improvement, or after the error is below a certain margin).
They may be set as desired.

Additionally, there are parameters that control the step size of the algorithm. These parameters have the same names as described at
https://en.wikipedia.org/wiki/Levenberg%E2%80%93Marquardt_algorithm and, if not otherwise specified,
defaults are used for them that usually work.


### Non-linear robust curve fits

As for linear regression, the library has the same modified forward search/lts and iterative outlier removal algorithms for the non-linear case.

For a modified forward search/lts algorithm, a call looks as follows:

First the initialisation, here with default parameters for the algorithm control:

> Robust_Regression::modified_lts_control_nonlinear ctrl;
> 
> Robust_Regression::nonlinear_algorithm_result res;
> 
> Non_Linear_Regression::initdata init;
> 
> init.f = linear;
> 
> init.J = Jacobi;
> 
> valarray<double>beta = { 1,1 };
> 
> init.initialguess = beta;

Then the call:

> Robust_Regression::modified_lts_regression_nonlinear(X2, Y2, init, ctrl, res);

Then we print the result:
> 	printf(" Slope ");
> 
>	printf("%f", res.beta[0]);
> 
>	printf("\n Intercept ");
>
>	printf("%f", res.beta[1]);
> 
>	printf("\n Indices of outliers ");
>
>	for (size_t i = 0; i < res.indices_of_removedpoints.size(); i++){
>	&emsp;	size_t w = res.indices_of_removedpoints[i];
>  	&emsp;	printf("%lu", (unsigned long)w);
>   &emsp;  printf(", ");
>	}


For the iterative outlier removal algorithm, the call to the regression function would look as follows.
First the initialisation, here again with default parameters for the algorithm control:

> Non_Linear_Regression::initdata init;
> 
> Robust_Regression::nonlinear_algorithm_control ctrl;
> 
> Robust_Regression::nonlinear_algorithm_result res;
> 
> init.f = linear;
> 
> init.J = Jacobi;
> 
> valarray<double>beta = { 1,1 };
> 
> init.initialguess = beta;

Then the call,

> Robust_Regression::iterative_outlier_removal_regression_nonlinear(X2, Y2, init, ctrl, res);

Printing the result works similarly to the above.

### Custom error functions
By default, the library uses the sum of the squared residuals divided by the number of points as a loss function.
One can also specify Huber's loss function, but then one also has to supply a border parameter beyond which the loss function becomes linear.
One also can set a log cosh loss function, or a quantile loss function. The latter needs a gamma parameter to be specified within the interval of 0 and 1.

Finally, one can define a custom loss function with a callback mechanism.

We may define user-defined loss functions. This is done in two steps. A function

> double err_pp(const double Y, double fY, const size_t pointnumber) {
> &emsp;	return ((Y - fY)* (Y - fY)) /(double) pointnumber;
> }

computes the residual between the data and the curve fit for a single point.

Another function

> double aggregate_err(valarray<double>& err){
> &emsp;	return err.sum();
> }

computes an aggregate error from the list of residuals generated by the function err_pp.
Note that if the data does not correspond perfectly to the curve, this should preferably be some kind of average error instead of a simple sum,
since otherwise removing a point will always reduce the error. Because the robust methods delete points based on the aggregate error, a plain sum would usually lead to
curve fits which do not take enough points into consideration. The division by the number of points can be done in err_pp (as in this example) or in aggregate_err.

The following will then make a robust curve fit with the custom error function:
At first the usual initialisation
> Non_Linear_Regression::initdata init13;
>	init13.f = linear;
>	init13.J = Jacobi;
>	init13.initialguess = beta;
>  Robust_Regression::nonlinear_algorithm_result res13;

Then we set the custom loss function:

> Robust_Regression::modified_lts_control_nonlinear ctrl13;
> ctrl13.lossfunction = LossFunctions::custom;
> ctrl13.loss_pp = err_pp;
> ctrl13.agg_err = aggregate_err;

Note that if the aggregate error function were not defined here, the results of the per-point loss function calls would just be summed.

Finally, we can make the function call:

> Robust_Regression::modified_lts_regression_nonlinear(X2, Y2, init13, ctrl13, res13);

Note that if we use the linear versions of the robust regression, then these methods would just make simple linear
fits or repeated median fits, which minimize their own loss function, and the selected loss function of the library
is then only used for outlier removal.

With the robust non-linear regression algorithms, the custom error functions are used for the curve fits as well 
as for the outlier removal procedures.

If one needs a linear fit where the custom error function is used as a metric for the curve fit as well as for the outlier
removal, one has to use the non-linear algorithm with a linear call back function. 

Note also that the quantile loss function is asymmetric. Therefore, the quantile loss function should mostly be used with
the linear robust curve fitting algorithms, since then it is only used for outlier removal. 
If the quantile loss function is used with the non-linear robust algorithms it is likely to confuse the Levenberg-Marquardt algorithm 
because of its asymmetry.

# Further documentation
The library has an online repository at https://github.com/bschulz81/robustregression where the source code can be accessed. 

The detailed documentation of all the various control parameters of the curve fitting algorithms is in the docstrings of the Python module and in the C++ header files.

Also, the C++/Python test applications in the folder testapp are documented and show many function calls.


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/bschulz81/robustregression",
    "name": "pyRobustRegressionLib",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "robust regression,forward-search,Huber's loss function,repeated median regression,simple linear regression,non-linear regression,Levenberg-Marquardt algorithm,statistics,estimators,S-estimator,Q-estimator,Student T-distribution,interquartile range,machine learning",
    "author": "Benjamin Schulz",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/57/ea/0097ded14ba2a6decb30f9799658ed2278a86fba36d9a371deb422003d5d/pyRobustRegressionLib-1.3.2.tar.gz",
    "platform": null,
    "description": "# RobustregressionLib\r\nThis is a c++ library with statistical machine learning algorithms for linear and non-linear robust regression.\r\n\r\nIt implements the statistical algorithms that were originally developed by the author for an autofocus application for telescopes\r\n\r\nand published in \tarXiv:2201.12466 [astro-ph.IM], https://doi.org/10.1093/mnras/stac189\r\n\r\nIn addition to these, two other robust algorithms were added and the curve fitting library has been brought into a form of a\r\nclear and simply API that can be easily used for very broad and general applications.\r\n\r\nThe library offers Python bindings for most functions. So the programmer has the choice between c++ and Python. In order to \r\ncompile the library with Python bindings Pybind11 and Python3 should be installed and be found by CMake. \r\nOtherwise, only the C++ standard template library is used, together with OpenMP. \r\n\r\nThe documentation of the functions that the library uses are written in the C++header files and in the __doc__ methods of the Python bindings.\r\n\r\nIn addition, a c++ application and a Python script is provided that show the functions of the library with very simple data.\r\n\r\nThe Library is released under MIT License.\r\n\r\nApart from his own publication, the author has not found the main robust curve fitting algorithms from this library in the statistical literature.\r\n\r\nOne of the algorithms presented here is a modification of the forward search algorithms by  Hadi and Simonoff, Atkinson and Riani and the least trimmed squares\r\nmethod of Rousseeuw. The modification of the author is to use various estimators to include data after the algorithm tried a random initial combination.\r\n\r\nThe algorithm was originally developed for physical problems, where one has outliers but also data, which is often subject to random fluctuations, like astronomical seeing.\r\nAs we observed during trials with the astronomy application, including the S estimator in the forward search removes large outliers but allows for small random fluctuations \r\nin the data, which resulted in more natural curve fits than if we would simply select the \"best\" model that we would get from the forward search. If some degree of randomness is present,\r\nthe \"best\" model chosen by such a method would have the smallest error almost certainly by accident and would not include enough points for a precise curve fit.\r\nThe usage of the statistical estimators in the forward search appeared to prevent this problem.\r\n\r\nThe modified least trimmed squares method has also been used by the author in arXiv:2201.12466 with the various estimators to judge the quality of measurement data, which was \r\ndefined as \"Better\" when the algorithm, if used sucessively with several different estimators, comes to a more similar result. \r\n\r\nAnother algorithm presented in this library is an iterative method which also employs various estimators. It has the advantage that it should work with larger datasets but its statistical \r\nproperties have not been extensively tested yet.\r\n\r\nBecause of the use of various statistical estimators and methods, the library builds on previous articles from the statistical literature. \r\nSome references are:\r\n\r\n1. Smiley W. Cheng, James C. Fu, Statistics & Probability Letters 1 (1983), 223-227, for the t distribution\r\n2. B. Peirce,  Astronomical Journal II 45 (1852) for the peirce criterion\r\n3. Peter J. Rousseeuw, Christophe Croux, J. of the Amer. 
Statistical Assoc. (Theory and Methods), 88 (1993), p. 1273, for the S, Q, and T estimator\r\n5. T. C. Beers,K. Flynn and K. Gebhardt,  Astron. J. 100 (1),32 (1990), for the Biweight Midvariance\r\n6. Transtrum, Mark K, Sethna, James P (2012). \"Improvements to the Levenberg-Marquardt algorithm for nonlinear least-squares minimization\". arXiv:1201.5885, for the Levenberg Marquardt Algorithm,\r\n7. Rousseeuw, P. J. (1984).Journal of the American Statistical Association. 79 (388): 871\u00e2\u20ac\u201c880. doi:10.1080/01621459.1984.10477105. JSTOR 2288718.\r\n   Rousseeuw, P. J.; Leroy, A. M. (2005) [1987]. Robust Regression and Outlier Detection. Wiley. doi:10.1002/0471725382. ISBN 978-0-471-85233-9, for the least trimmed squares algorithm\r\n8. Hadi and Simonoff, J. Amer. Statist. Assoc. 88 (1993) 1264-1272, Atkinson and Riani,Robust Diagnostic Regression Analysis (2000), Springer, for the forward search\r\n9. Croux, C., Rousseeuw, P.J. (1992). Time-Efficient Algorithms for Two Highly Robust Estimators of Scale. In: Dodge, Y., Whittaker, J. (eds) Computational Statistics. Physica, Heidelberg. https://doi.org/10.1007/978-3-662-26811-7_58 \r\n (For the faster version of the S and Q estimators.) The versions of the S and Q estimators in this library are now adapted from Croux and Rousseeuw to the C language. Note that it is not the same Code because of some optimizations. Since many variables act on array indices in this algorithm, it was actually non-trivial to convert from Fortran to C. Especially for the Q estimator, the naive algorithm is faster for less than 100 datapoints. For the S estimator this is the case for less than 10 datapoints. Therefore, in these cases the naive versions are still used.\r\n10. Andrew F. Siegel. Robust regression using repeated medians. Bionaetrika, 69(1):242\u00e2\u20ac\u201c244, 1982,Andrew Stein and Michael Werman. 1992. Finding the repeated median regression line. In Proceedings of the third annual ACM-SIAM symposium on Discrete algorithms (SODA '92). Society for Industrial and Applied Mathematics, USA, 409\u00e2\u20ac\u201c413. https://dl.acm.org/doi/10.5555/139404.139485\r\n\r\n# Compiling and Installing the library:\r\n\r\nThe Library needs CMake and a C compiler that is at least able to generate code according to the C14 standard\r\n(per default, if one does not use Clang or MacOs, it uses C17, but with a CXX_STANDARD 14 switch set in the \r\nCMakeLists.txt for the library, it can use C14, which is the the default for Clang and MacOS.) \r\n\r\nThe library also makes use of OpenMP and needs Python in version 3 and pybind11 to compile. \r\n\r\nBy default, the library also containts two test applications. \r\n\r\nIf the variable $WITH_TESTAPP is set to ON, a c++ test application is compiled and put in an /output directory. \r\n\r\nThe library also shipps with a Python module. By default, the CMake variable $With_Python is set to ON and a Python module will\r\nbe generated in addition to a c++ library.\r\n\r\nIf $WITH_TESTAPP and $WITH_PYTHON are set, which is the default, then a Python test application will be generated in addition to\r\nthe C++ test application.\r\n\r\n## Installing with CMake\r\nOne can compile the library also traditionally via CMake. Typing \r\n\r\n> cmake . 
\r\n\r\nin the package directory will generate the files necessary to compile the library, depending on the CMake generator set by the user.\r\n\r\nUnder Linux, the command\r\n\r\n> make \r\n\r\nwill then compile the library.\r\n\r\nAfter compilation, an /output directory will appear with the library in binary form. \r\nBy default, the library also containts two test applications. \r\n\r\nIf the variable $WITH_TESTAPP is set, a c++ test application is compiled and put in an /output directory. \r\n\r\nThe library also ships with a Python module. By default, the CMake variable $WITH_PYTHON is ON and a Python module will\r\nbe generated in addition to a c++ library and copied into the /output directory.\r\n\r\nIf $WITH_TESTAPP and $WITH_PYTHON are set to ON, which is the default, then a Python test application will be generated and compiled into the /output directory.\r\n\r\nBy compiling with CMake, the Python module is just compiled into the /output directory. It is not installed in a path for system libraries\r\nor python packages. So if one wants to use the Python module, one has either a) to write the script in the same folder where the module is, or b) load\r\nit by pointing Python to the explicit path of the module, or c) copy the module to a place where Python can find it.\r\n\r\n\r\nIf one does not want the Python module to be compiled, one may set the cmake variable $WITH_PYTHON to OFF.\r\n\r\n## Installing with PIP (This option is mostly for Windows since Linux distributions have their own package managers)\r\n\r\nIf one wants that the module is installed into a library path, where Python scripts may be able to find it easily, one can compile and install the module also \r\nwith pip instead of CMake by typing\r\n\r\n> pip install .\r\n\r\nin the package directory. \r\n\r\nAfter that, the module is compiled and the binaries are copied into a path for Python site-packages, where Python scripts should be able to find it.\r\n\r\nThis is successfull on Windows.\r\n\r\nUnfortunately, problems to find the module remain on Linux.\r\nIf pip is called by the  root user, the module is copied into the /usr/lib directory. Despite this,  python scripts have difficulties to load the module if they do not use a hardcoded import path. \r\nIf one does not install the module as root user, pip will install it in a local site-package directory, where python also has problems to find the module without a hard coded import path.\r\n\r\nIf the module was compiled by pip, one can uninstall it by typing\r\n\r\n> pip uninstall pyRobustRegressionLib\r\n\r\nUnder Linux, compiling with cmake should be preferred. Not at least because linux package managers (e.g. portage from gentoo) sometimes have conflicts with pip.\r\n\r\nAdditionally, the python environment will select ninja as a default generator, which will require to clean the build files\r\nif an earlier generation based on cmake was done that may have used a different generator.\r\n\r\n\r\n# Documentation of the library functions\r\n\r\n## For the Python language\r\n\r\n### Calling the documentation\r\nDocumentation of the API is provided in C++ header files in the /library/include directory and the docstrings for the Python module in the src/pyRobustRegressionLib\r\nModule. 
The latter It can be loaded in Python scripts with \r\n\r\n> import pyRobustRegressionLib as rrl\r\n\r\nThe command \r\n\r\n> print(rrl.\\_\\_doc__)\r\n\r\nWill list the sub-modules of the library, which are \r\n\r\n- StatisticFunctions, \r\n- LinearRegression, \r\n- MatrixCode, \r\n- NonLinearRegression and \r\n- RobustRegression\r\n- LossFunctions\r\n\r\nAnd their docstrings can be called e.g. by\r\n> print(rrl.*SubModuleName*.\\_\\_doc__)\r\n\r\ne.g.\r\n\r\n> print(rrl.StatisticFunctions.\\_\\_doc__).\r\n\r\nWill list the functions and classes of the sub-module StatisticFunctions. The free functions and classes all have more detailed doc\r\nstrings that can be called as below for example\r\n\r\n> print (rrl.MatrixCode.Identity.\\_\\_doc__)\r\n\r\nMore convenient documentation is provided in the header files of the C++ source code of the package,\r\nwhich can be found in the /library/include directory.\r\n\r\nThe header files can be found in the include subdirectory of the package.\r\n\r\nIn the testapp folder, two example programs, one in Python and one in C++ is provided.\r\nThese test applications have extensive comments and call many functions of the librarym which show the basic usage. \r\n\r\nThe curve fits that are done in the provided example programs are, however, very simple of course.\r\nThis was done in order to keep the demonstration short and simple.\r\nThe library is of course intended to be used with larger and much more complicated data.\r\n\r\n### Simple linear regression\r\n\r\nLet us now define some vector for data X and > which we want to fit to a line.\r\n\r\n> print(\"\\nDefine some arrays X and Y\")\r\n> \r\n> X=[-3.0,5.0,7.0,10.0,13.0,16.0,20.0,22.0]\r\n> \r\n> Y=[-210.0,430.0,590.0,830.0,1070.0,1310.0,1630.0,1790.0]\r\n\r\nA simple linear fit can be called as follows:\r\nAt first we create an instance of the result structure, where the result is stored.\r\n\r\n> res=rrl.LinearRegression.result()\r\n\r\nThen, we call the linear regression function\r\n> rrl.LinearRegression.linear_regression(X, Y, res)\r\n\r\n\r\n\r\nAnd finally, we print out the slope and intercept\r\n> print(\"Slope\")\r\n> \r\n> print(res.main_slope)\r\n> \r\n> print(\"Intercept\")\r\n> \r\n> print(res.main_intercept)\r\n\r\n### Robust regression\r\nThe robust regression is just slightly more complicated. Let us first add two outliers into the data:\r\n\r\n> X2=[-3.0, 5.0,7.0, 10.0,13.0,15.0,16.0,20.0,22.0,25.0]\r\n> \r\n> Y2=[ -210.0, 430.0, 590.0,830.0,1070.0,20.0,1310.0,1630.0,1790.0,-3.0]\r\n\r\n\r\n#### Siegel's repeated Median regression\r\nFor linear regression, the library also has the median linear regression from Siegel, which can be called in the same way\r\n\r\n> rrl.LinearRegression.median_linear_regression(X2, Y2, res)\r\n\r\nbut is slightly more robust.\r\n\r\n\r\n#### Modified forward search/modified Lts regression\r\nMedian linear regression is a bit slower as simple linear regression and can get wrong if many outliers are present.\r\n\r\nTherefore, the library has two methods for robust regression that can remove outliers.\r\n\r\nThe  structure that stores the result for robust linear regression now includes the indices of the used and rejected point.\r\n\r\nWe instantiate it with\r\n\r\n> res= rrl.RobustRegression.linear_algorithm_result()\r\n\r\nAdditionally, there is a struct that determines the behavior of the algorithm.  
\r\nUpon construction without arguments, it gets filled with default values.\r\n\r\n> ctrl= rrl.RobustRegression.modified_lts_control_linear()\r\n\r\nBy default, the S-estimator is used with an outlier_tolerance parameter of 3, and the method can find 30% of the points as outliers at maximum. \r\nBut all this can be changed, of course\r\n\r\nNow we call the modified forward search/ modified lts algorithm\r\n> rrl.RobustRegression.modified_lts_regression_linear(X2, Y2, ctrl, res)\r\n\r\nand then print the result, first the slope and intercept \r\n> print(\"Slope\")\r\n> \r\n> print(res.main_slope)\r\n> \r\n> print(\"Intercept\")\r\n> \r\n> print(res.main_intercept)\r\n\r\nThen the indices of the outliers\r\n\r\n> print(\"\\nOutlier indices\")\r\n> \r\n> for ind in res.indices_of_removedpoints:\r\n> \r\n> &emsp;print(ind)\r\n\r\nWhen we want to change the estimators, or the outlier tolerance parameter, the loss function, or the maximum number of outliers we can find, or other details, we can simply\r\nset this in the control struct.\r\n\r\nBy default, the S estimator is used with an outlier_tolerance of 3 in the same formula i.e. one has\r\n\r\n> ctrl.rejection_method=rrl.RobustRegression.estimator_name.tolerance_is_decision_in_S_ESTIMATION\r\n\r\nand a point is an outlier if \r\n> |err-median(errs)|/S_estimator(errs)>outlier_tolerance\r\n\r\nwhere err is the residuum of the point and errs is the array of residuals. \r\nThey are measured by a specified loss function. If none was given, squared errors are used by default. \r\n\r\nBy default, the outlier_tolerance parameter is set to 3.\r\n\r\nIf we want to have a different value, e.g. 3.5, for the outlier_tolerance parameter, we can easily set e.g.\r\n\r\n> ctrl.outlier_tolerance=3.5\r\n\r\nThe command below would imply that the Q estimator is used:\r\n\r\n> ctrl= rrl.RobustRegression.modified_lts_control_linear()\r\n> \r\n> ctrl.rejection_method=rrl.RobustRegression.estimator_name.tolerance_is_decision_in_Q_ESTIMATION\r\n\r\nThen a point is an outlier if, for its residual with the curve fit err, we have,given the array of all residuals errs:\r\n> |err-median(errs)|/Q_estimator(errs)>outlier_tolerance\r\n\r\n\r\nThe command below would change the estimator to the interquartile range method when the modified lts/modified forward search algorithm is used. \r\n\r\nWith the setting below, a point is an outlier if its residual is below Q1 \u00e2\u02c6\u2019 tolerance* IQR or above Q3 + tolerance IQR.\r\nIf we do not change the loss function, least squares is used by default.\r\n\r\n> ctrl= rrl.RobustRegression.modified_lts_control_linear()\r\n> \r\n> ctrl.rejection_method=rrl.RobustRegression.estimator_name.tolerance_is_interquartile_range\r\n\r\n For the interquartile range estimator, we should set the tolerance usually to 1.5\r\n\r\n> ctrl.outlier_tolerance=1.5\r\n\r\nbefore we call the regression function.\r\n\r\nSimilarly, the loss function can be changed. For example, the absolute value of the residuals is usually more statistically robust than the square of the residuals\r\n\r\n> ctrl.lossfunction=rrl.LossFunctions.absolutevalue\r\n\r\nchanges the lossfunction to the absolute value. \r\n\r\nOne can also specify Huber's loss function, but then one also has to supply a border parameter beyond which the\r\nfunction starts its linear behavior.\r\nOne also can set a log cosh loss function, or a quantile loss function. 
One can also specify Huber's loss function, but then one also has to supply a border parameter beyond which the function starts its linear behavior. One can also set a log-cosh loss function or a quantile loss function; the latter needs a gamma parameter in the interval between 0 and 1.
Finally, one can define a custom loss function with a callback mechanism.

Note that if we use the linear versions of the robust regression, these methods just make simple linear fits or repeated median fits, which minimize their own loss functions; the selected loss function of the library is then only used for outlier removal.

With the robust non-linear regression algorithms, the custom error functions are used for the curve fits as well as for the outlier removal procedures.

If one needs a linear fit where the custom error function is used as a metric for the curve fit as well as for the outlier removal, one has to use the non-linear algorithm with a linear callback function.

Note also that the quantile loss function is asymmetric. Therefore, the quantile loss function should mostly be used with the linear robust curve fitting algorithms, since then it is only used for outlier removal. If the quantile loss function is used with the non-linear robust algorithms, it is likely to confuse the Levenberg-Marquardt algorithm because of its asymmetry.

Note that the forward search can be very time consuming, since its cost is given by the binomial coefficient of the number of points over the maximum number of outliers it can find (by default 30% of the points). For example, with 20 points and at most 6 outliers, 20 choose 6 = 38760 point subsets have to be considered.

#### Iterative outlier removal

A faster algorithm is the iterative outlier removal method, which makes a curve fit with all points, then removes the points whose residuals are outliers, and then makes another curve fit with the remaining point set, until no outliers are found anymore.

It can be called similarly.

We define the control structure. Note that it is different from the modified forward search/modified LTS control structure:

> ctrl= rrl.RobustRegression.linear_algorithm_control()

We again create a structure for the result,

> res= rrl.RobustRegression.linear_algorithm_result()

then we start the algorithm

> rrl.RobustRegression.iterative_outlier_removal_regression_linear(X2, Y2, ctrl, res)

and print the result:

> print("Slope")
> 
> print(res.main_slope)
> 
> print("Intercept")
> 
> print(res.main_intercept)
> 
> print("\nOutlier indices")
> 
> for ind in res.indices_of_removedpoints:
> 
> &emsp;print(ind)
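
The examples above only print the coefficients and the removed indices. If one also wants to inspect how well the remaining points are described, one can compute the residuals by hand from main_slope and main_intercept, as in the small sketch below; it uses only plain Python and the result fields shown above, and the helper names are introduced just for this example.

> \# Sketch: residuals of the points that were kept by the robust fit
>
> removed=set(res.indices_of_removedpoints)
>
> for i in range(0,len(X2)):
>
> &emsp;if i in removed:
>
> &emsp;&emsp;continue  # skip the points flagged as outliers
>
> &emsp;prediction=res.main_slope*X2[i]+res.main_intercept
>
> &emsp;print("point", i, "residual", Y2[i]-prediction)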
### Non-linear Regression
Non-linear regression uses an implementation of the Levenberg-Marquardt algorithm.

The Levenberg-Marquardt algorithm needs an initial guess for an array of parameters beta, a function f(X,beta) to be fitted and a Jacobian matrix J(X,beta).
For example, a curve fit can be made by initialising the result, control and initdata structures as follows.
After a call of the constructors,

> res=rrl.NonLinearRegression.result()
> 
> ctrl=rrl.NonLinearRegression.control()
> 
> init=rrl.NonLinearRegression.initdata()

we supply a Jacobian matrix, a function and an initial guess for the parameters to be found (assuming the function has just 2 curve fitting parameters):

> init.Jacobian=Jacobi
>
> init.f=linear
> 
> init.initialguess = [1,1]

Here Jacobi and linear are two user-defined functions.

If we want to fit a line, we have to implement the function f(X,beta) and the Jacobian J(X,beta) as follows:

> def linear(X, beta):
>
> &emsp;Y=[]
>
> &emsp;for i in range(0,len(X)):
>
> &emsp;&emsp;Y.append(beta[0] * X[i] + beta[1])
>
> &emsp;return Y

> def Jacobi(X, beta):
> 
> &emsp;m=rrl.MatrixCode.Matrix(len(X), len(beta))
> 
> &emsp;for i in range(0,len(X)):
> 
> &emsp;&emsp;m[i, 0] = X[i]
> 
> &emsp;&emsp;m[i, 1] = 1
> 
> &emsp;return m

Then we can call the Levenberg-Marquardt algorithm

> rrl.NonLinearRegression.non_linear_regression(X2, Y2, init, ctrl, res)

and print the result:

> print("Slope")
> 
> print(res.beta[0])
> 
> print("Intercept")
> 
> print(res.beta[1])

The class NonLinearRegression.control has various parameters that control the behavior of the Levenberg-Marquardt algorithm, among them several stopping conditions (e.g. after some time, when there is no improvement, or when the error is below a certain margin). They may be set as desired.

Additionally, there are parameters that control the step size of the algorithm. These parameters have the same names as described at
https://en.wikipedia.org/wiki/Levenberg%E2%80%93Marquardt_algorithm; if not specified otherwise, defaults that usually work are used.


### Non-linear robust regression
For non-linear curve fits the library also offers the modified forward search/modified LTS algorithm and the iterative outlier removal, just as for linear regression.
For a modified forward search/LTS fit with default parameters (S estimator, outlier_tolerance=3, least-squares loss function, at most 30% of the points flagged as outliers), a call looks as follows.

First the initialisation:
> res=rrl.RobustRegression.nonlinear_algorithm_result()
> 
> ctrl=rrl.RobustRegression.modified_lts_control_nonlinear()
> 
> init=rrl.NonLinearRegression.initdata()
> 
> init.Jacobian=Jacobi
> 
> init.f=linear
> 
> init.initialguess = [1,1]

Then the function call:
> rrl.RobustRegression.modified_lts_regression_nonlinear(X2, Y2, init, ctrl, res)

Finally, we print the result:
> print("Slope")
> 
> print(res.beta[0])
> 
> print("Intercept")
> 
> print(res.beta[1])
> 
> print("\nOutlier indices")
> 
> for ind in res.indices_of_removedpoints:
> 
> &emsp;print(ind)


For the iterative outlier removal algorithm, a call to the regression function with default parameters (S estimator, outlier_tolerance=3, least-squares loss function, at most 30% of the points flagged as outliers) would look as follows:

> res=rrl.RobustRegression.nonlinear_algorithm_result()
> 
> ctrl=rrl.RobustRegression.nonlinear_algorithm_control()
> 
> init=rrl.NonLinearRegression.initdata()
> 
> init.Jacobian=Jacobi
> 
> init.f=linear
> 
> init.initialguess = [1,1]

> rrl.RobustRegression.iterative_outlier_removal_regression_nonlinear(X2, Y2, init, ctrl, res)
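
The callback mechanism is, of course, not limited to straight lines. As a sketch of how another model could be set up, the following fits a parabola with three parameters; the functions quadratic and quadratic_jacobian, the variable names and the small data set are made up for illustration, while the library calls are the ones documented above.

> \# Sketch: robust non-linear fit of a quadratic model y = b0*x^2 + b1*x + b2
>
> def quadratic(X, beta):
>
> &emsp;return [beta[0]*x*x + beta[1]*x + beta[2] for x in X]

> def quadratic_jacobian(X, beta):
>
> &emsp;m=rrl.MatrixCode.Matrix(len(X), len(beta))
>
> &emsp;for i in range(0,len(X)):
>
> &emsp;&emsp;m[i, 0] = X[i]*X[i]  # derivative with respect to beta[0]
>
> &emsp;&emsp;m[i, 1] = X[i]       # derivative with respect to beta[1]
>
> &emsp;&emsp;m[i, 2] = 1          # derivative with respect to beta[2]
>
> &emsp;return m

> \# made-up data following y = 2x^2 - x + 5, with one outlier at the end
>
> Xq=[0.0,1.0,2.0,3.0,4.0,5.0,6.0]
>
> Yq=[5.0,6.0,11.0,20.0,33.0,50.0,-40.0]
>
> initq=rrl.NonLinearRegression.initdata()
>
> initq.f=quadratic
>
> initq.Jacobian=quadratic_jacobian
>
> initq.initialguess=[1.0,1.0,1.0]
>
> ctrlq=rrl.RobustRegression.modified_lts_control_nonlinear()
>
> resq=rrl.RobustRegression.nonlinear_algorithm_result()
>
> rrl.RobustRegression.modified_lts_regression_nonlinear(Xq, Yq, initq, ctrlq, resq)
>
> print(resq.beta[0], resq.beta[1], resq.beta[2], list(resq.indices_of_removedpoints))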
### Custom error functions
By default, the library uses the sum of the squared residuals divided by the number of points as its loss function.
One can also specify Huber's loss function, but then one also has to supply a border parameter beyond which the loss function starts its linear behavior.
One can also set a log-cosh loss function or a quantile loss function; the latter needs a gamma parameter in the interval between 0 and 1.

Finally, one can define a custom loss function with a callback mechanism.

User-defined loss functions are set up in two steps. A function

> def err_pp(Y,fY,pointnumber):
>
> &emsp;return (Y-fY)*(Y-fY)/pointnumber

computes a residual between the data and the curve fit for a single point.
Another function

> def aggregate_err(errs):
>
> &emsp;res=0
>
> &emsp;for i in range(0,len(errs)):
>
> &emsp;&emsp;res+=errs[i]
>
> &emsp;return res

computes the total error from the list of residuals generated by err_pp.
Note that if the data does not correspond perfectly to the curve, this should preferably be some kind of average error instead of a simple sum, since otherwise removing a point will always reduce the error. Because the robust methods delete points based on the aggregate error, a plain sum would usually lead to curve fits that do not take enough points into consideration. The division by the number of points can be done in err_pp (as in this example) or in aggregate_err.

The following call then makes a robust curve fit with the custom error function:

> res9=rrl.RobustRegression.nonlinear_algorithm_result()
>
> ctrl9=rrl.RobustRegression.modified_lts_control_nonlinear()
>
> ctrl9.lossfunction=rrl.LossFunctions.custom
>
> ctrl9.loss_perpoint=err_pp
>
> ctrl9.aggregate_error=aggregate_err

> init9=rrl.NonLinearRegression.initdata()
>
> init9.Jacobian=Jacobi
>
> init9.f=linear
>
> init9.initialguess = [1,1]
>
> rrl.RobustRegression.modified_lts_regression_nonlinear(X2, Y2, init9, ctrl9, res9)
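
Any other per-point loss can be plugged in the same way. As a further sketch, the following replaces the squared error by an absolute error per point, i.e. a mean absolute error; the function err_abs and the remaining variable names are only introduced for this illustration, and the aggregate from above is reused.

> \# Sketch: custom mean-absolute-error loss via the same callback mechanism
>
> def err_abs(Y,fY,pointnumber):
>
> &emsp;return abs(Y-fY)/pointnumber  # per-point loss, already divided by the number of points

> ctrl10=rrl.RobustRegression.modified_lts_control_nonlinear()
>
> ctrl10.lossfunction=rrl.LossFunctions.custom
>
> ctrl10.loss_perpoint=err_abs
>
> ctrl10.aggregate_error=aggregate_err  # reuse the summing aggregate defined above
>
> res10=rrl.RobustRegression.nonlinear_algorithm_result()
>
> rrl.RobustRegression.modified_lts_regression_nonlinear(X2, Y2, init9, ctrl10, res10)
>
> print(res10.beta[0], res10.beta[1], list(res10.indices_of_removedpoints))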
Note that if we use the linear versions of the robust regression, these methods just make simple linear fits or repeated median fits, which minimize their own loss functions; the selected loss function of the library is then only used for outlier removal.

With the robust non-linear regression algorithms, the custom error functions are used for the curve fits as well as for the outlier removal procedures.

If one needs a linear fit where the custom error function is used as a metric for the curve fit as well as for the outlier removal, one has to use the non-linear algorithm with a linear callback function.

Note also that the quantile loss function is asymmetric. Therefore, the quantile loss function should mostly be used with the linear robust curve fitting algorithms, since then it is only used for outlier removal. If the quantile loss function is used with the non-linear robust algorithms, it is likely to confuse the Levenberg-Marquardt algorithm because of its asymmetry.


## For the C++ language:

In general, one has to include the library headers as follows:

> #include "statisticfunctions.h"
>
> #include "Matrixcode.h"
> 
> #include "linearregression.h"
> 
> #include "robustregression.h"
> 
> #include "nonlinearregression.h"
>
> #include "lossfunctions.h"

### Simple Linear Regression
The usage of the library in C++ is essentially the same as in Python; testapplication.cpp demonstrates the same function calls.
The X and Y data are stored in C++ valarrays, and control, result and initdata are not classes but structs.

For example, if we define some X,Y data

> valarray<double> X = { -3, 5,7, 10,13,16,20,22 };
> 
> valarray<double> Y = { -210, 430, 590,830,1070,1310,1630,1790 };

and initialize the struct where we store the result,

> Linear_Regression::result res;

we can call a linear regression as follows:

> Linear_Regression::linear_regression(X, Y, res);

and we may print the result:

> printf(" Slope ");
> 
> printf("%f", res.main_slope);
> 
> printf("\n Intercept ");
> 
> printf("%f", res.main_intercept);

### Robust Regression methods

Let us first define X and Y data with two outliers added:
> valarray<double> X2 = { -3, 5,7, 10,13,15,16,20,22,25 };
>
> valarray<double> Y2 = { -210, 430, 590,830,1070,20,1310,1630,1790,-3 };

#### Median Linear Regression
Median regression is slower but more robust than simple linear regression.
This command calls a robust curve fit with median regression

> Linear_Regression::result res;
> 
> Linear_Regression::median_linear_regression(X2, Y2, res);

and then we print the result:

> printf(" Slope ");
> 
> printf("%f", res.main_slope);
> 
> printf("\n Intercept ");
> 
> printf("%f", res.main_intercept);

#### Modified forward search
When many and large outliers are present, median regression sometimes does not deliver the desired results.

The library therefore has the modified forward search/modified LTS algorithm and the iterative outlier removal algorithm, which can find outliers and remove them from the curve fit entirely.

Below is a call of the modified forward search/modified LTS algorithm initialized with default parameters.

First the structs for controlling the algorithm and storing the results are initialised,

> Robust_Regression::modified_lts_control_linear ctrl;
> 
> Robust_Regression::linear_algorithm_result res;

then we call the function for the curve fit:
> Robust_Regression::modified_lts_regression_linear(X2, Y2, ctrl, res);

Then we print the result:

> printf(" Slope ");
>
> printf("%f", res.main_slope);
>
> printf("\n Intercept ");
>
> printf("%f", res.main_intercept);
>
> printf("\n Indices of outliers ");
>
> for (size_t i = 0; i < res.indices_of_removedpoints.size(); i++)
>
> {
>
> &emsp;size_t w = res.indices_of_removedpoints[i];
>
> &emsp;printf("%lu", (unsigned long)w);
>
> &emsp;printf(", ");
>
> }


The default parameters are as follows:
the S estimator is used, outlier_tolerance=3, at most 30% of the points are flagged as outliers, and the loss function is the least squares of the residuals.

As in the Python documentation, a point with residual err is then an outlier if

> |err-median(errs)|/S_estimator(errs)>3

where errs is the array of residuals.


If the Q-estimator is to be used instead, the initialisation for the modified forward search/LTS algorithm looks like

> Robust_Regression::modified_lts_control_linear ctrl;
>
> Robust_Regression::linear_algorithm_result res;
>
> ctrl.rejection_method = Robust_Regression::tolerance_is_decision_in_Q_ESTIMATION;

Then we call the regression function:
> Robust_Regression::modified_lts_regression_linear(X2, Y2, ctrl, res);
If the interquartile range estimator should be used, so that a point is removed if its residual lies below Q1 - outlier_tolerance * IQR or above Q3 + outlier_tolerance * IQR, we have to set:

> ctrl.rejection_method = Robust_Regression::tolerance_is_interquartile_range;

For the interquartile range estimator, outlier_tolerance should be set to 1.5, so we additionally have to write

> ctrl.outlier_tolerance = 1.5;

before we call the regression function:
> Robust_Regression::modified_lts_regression_linear(X2, Y2, ctrl, res);

Similarly, some may prefer to set the outlier_tolerance parameter to 3.5 when the S, Q, MAD or other estimators are used.

The loss function can also be changed. For example, the absolute value |err| of the residuals is usually more robust than the square err^2.
The command

> ctrl.lossfunction = LossFunctions::absolutevalue;

changes the loss function to the absolute value.

One can also specify Huber's loss function, but then one also has to supply a border parameter beyond which the function starts its linear behavior.
One can also set a log-cosh loss function or a quantile loss function; the latter needs a gamma parameter in the interval between 0 and 1.
Finally, one can define a custom loss function with a callback mechanism.


Note that if we use the linear versions of the robust regression, these methods just make simple linear fits or repeated median fits, which minimize their own loss functions; the selected loss function of the library is then only used for outlier removal.

With the robust non-linear regression algorithms, the custom error functions are used for the curve fits as well as for the outlier removal procedures.

If one needs a linear fit where the custom error function is used as a metric for the curve fit as well as for the outlier removal, one has to use the non-linear algorithm with a linear callback function.

Note also that the quantile loss function is asymmetric. Therefore, the quantile loss function should mostly be used with the linear robust curve fitting algorithms, since then it is only used for outlier removal. If the quantile loss function is used with the non-linear robust algorithms, it is likely to confuse the Levenberg-Marquardt algorithm because of its asymmetry.

#### Iterative outlier removal
The modified forward search/modified LTS algorithm can be slow, since its complexity is given by the binomial coefficient of the number of points over the maximum number of outliers to be found. The iterative outlier removal algorithm is faster.
It makes a curve fit with all points, then removes the points whose residuals are outliers according to the loss function and estimator, makes another curve fit, and repeats the procedure until no outliers are found.

A call to this method with the default parameters (S estimator, outlier_tolerance=3, least-squares loss function, at most 30% of the points designated as outliers) would look as follows:

> Robust_Regression::linear_algorithm_result res;
>
> Robust_Regression::linear_algorithm_control ctrl;
>
> Robust_Regression::iterative_outlier_removal_regression_linear(X2, Y2, ctrl, res);

### Non-linear Regression
The non-linear curve fitting algorithm implements a Levenberg-Marquardt algorithm.

For it to work, we must first supply initialisation data in the form of a valarray of initial parameters beta, a function f(X,beta) to be fitted and a Jacobian J(X,beta).

If we wanted to fit a line, we would have to define the function f(X,beta) to be fitted,

> valarray<double> linear(const valarray<double>& X, const valarray<double>& beta)
>
> {
> 
> &emsp;valarray<double> Y(X.size());
> 
> &emsp;for (size_t i = 0; i < X.size(); i++)
> 
> &emsp;&emsp;Y[i] = beta[0] * X[i] + beta[1];
> 
> &emsp;return Y;
> 
> }
>

and its Jacobian matrix J(X,beta),
> Matrix Jacobi(const valarray<double>& X, const valarray<double>& beta)
>
> {
> 
> &emsp;Matrix ret(X.size(), beta.size());
> 
> &emsp;for (size_t i = 0; i < X.size(); i++)
> 
> &emsp;{
> 
> &emsp;&emsp;ret(i, 0) = X[i];
> 
> &emsp;&emsp;ret(i, 1) = 1;
> 
> &emsp;}
> 
> &emsp;return ret;
> 
> }

A non-linear fit can then be initialised with default control parameters for the Levenberg-Marquardt algorithm like this:


> Non_Linear_Regression::result res;
>
> Non_Linear_Regression::control ctrl;
>
> Non_Linear_Regression::initdata init;
>
> init.f = linear;
>
> init.J = Jacobi;
>
> valarray<double> beta = { 1,1 };
>
> init.initialguess = beta;

Then one can call the function:
> Non_Linear_Regression::non_linear_regression(X, Y, init, ctrl, res);

Then we may print the result:

> printf("\n Slope ");
>
> printf("%f", res.beta[0]);
>
> printf("\n Intercept ");
>
> printf("%f", res.beta[1]);

The struct Non_Linear_Regression::control has various parameters that control the behavior of the Levenberg-Marquardt algorithm, among them several stopping conditions (e.g. after some time, when there is no improvement, or when the error is below a certain margin).
They may be set as desired.

Additionally, there are parameters that control the step size of the algorithm.
These parameters have the same names as described at
https://en.wikipedia.org/wiki/Levenberg%E2%80%93Marquardt_algorithm; if not specified otherwise, defaults that usually work are used.


### Non-linear robust curve fits

As for linear regression, the library has the modified forward search/LTS and iterative outlier removal algorithms for the non-linear case.

For a modified forward search/LTS fit, a call looks as follows.

First the initialisation, here with default parameters for the algorithm control:

> Robust_Regression::modified_lts_control_nonlinear ctrl;
> 
> Robust_Regression::nonlinear_algorithm_result res;
> 
> Non_Linear_Regression::initdata init;
> 
> init.f = linear;
> 
> init.J = Jacobi;
> 
> valarray<double> beta = { 1,1 };
> 
> init.initialguess = beta;

Then the call:

> Robust_Regression::modified_lts_regression_nonlinear(X2, Y2, init, ctrl, res);

Then we print the result:
> printf(" Slope ");
> 
> printf("%f", res.beta[0]);
> 
> printf("\n Intercept ");
>
> printf("%f", res.beta[1]);
> 
> printf("\n Indices of outliers ");
>
> for (size_t i = 0; i < res.indices_of_removedpoints.size(); i++){
>
> &emsp;size_t w = res.indices_of_removedpoints[i];
>
> &emsp;printf("%lu", (unsigned long)w);
>
> &emsp;printf(", ");
>
> }


For the iterative outlier removal algorithm, the call to the regression function would look as follows.
First the initialisation, again with default parameters for the algorithm control (the result struct is the same type as above):

> Non_Linear_Regression::initdata init;
> 
> Robust_Regression::nonlinear_algorithm_control ctrl;
> 
> Robust_Regression::nonlinear_algorithm_result res;
> 
> init.f = linear;
> 
> init.J = Jacobi;
> 
> valarray<double> beta = { 1,1 };
> 
> init.initialguess = beta;

Then the call:

> Robust_Regression::iterative_outlier_removal_regression_nonlinear(X2, Y2, init, ctrl, res);

Printing the result works as above.

### Custom error functions
By default, the library uses the sum of the squared residuals divided by the number of points as its loss function.
One can also specify Huber's loss function, but then one also has to supply a border parameter beyond which the loss function becomes linear.
One can also set a log-cosh loss function or a quantile loss function; the latter needs a gamma parameter in the interval between 0 and 1.

Finally, one can define a custom loss function with a callback mechanism.

User-defined loss functions are set up in two steps. A function

> double err_pp(const double Y, double fY, const size_t pointnumber) {
>
> &emsp;return ((Y - fY) * (Y - fY)) / (double)pointnumber;
>
> }

computes a residual between the data and the curve fit for a single point.

Another function

> double aggregate_err(valarray<double>& err){
>
> &emsp;return err.sum();
>
> }

computes the total error from the array of residuals generated by err_pp.
Note that if the data does not correspond perfectly to the curve, this should preferably be some kind of average error instead of a simple sum, since otherwise removing a point will always reduce the error. Because the robust methods delete points based on the aggregate error, a plain sum would usually lead to curve fits that do not take enough points into consideration.
The division by the number of points can be done in err_pp (as in this example) or in aggregate_err.

The following then makes a robust curve fit with the custom error function.
At first, the usual initialisation:
> Non_Linear_Regression::initdata init13;
>
> init13.f = linear;
>
> init13.J = Jacobi;
>
> init13.initialguess = beta;
>
> Robust_Regression::nonlinear_algorithm_result res13;

Then we set the custom loss function:

> Robust_Regression::modified_lts_control_nonlinear ctrl13;
>
> ctrl13.lossfunction = LossFunctions::custom;
>
> ctrl13.loss_pp = err_pp;
>
> ctrl13.agg_err = aggregate_err;

Note that if the aggregate error were not defined here, the per-point loss values would simply be summed.

Finally, we can make the function call:

> Robust_Regression::modified_lts_regression_nonlinear(X2, Y2, init13, ctrl13, res13);

Note that if we use the linear versions of the robust regression, these methods just make simple linear fits or repeated median fits, which minimize their own loss functions; the selected loss function of the library is then only used for outlier removal.

With the robust non-linear regression algorithms, the custom error functions are used for the curve fits as well as for the outlier removal procedures.

If one needs a linear fit where the custom error function is used as a metric for the curve fit as well as for the outlier removal, one has to use the non-linear algorithm with a linear callback function.

Note also that the quantile loss function is asymmetric. Therefore, the quantile loss function should mostly be used with the linear robust curve fitting algorithms, since then it is only used for outlier removal. If the quantile loss function is used with the non-linear robust algorithms, it is likely to confuse the Levenberg-Marquardt algorithm because of its asymmetry.

# Further documentation
The library has an online repository at https://github.com/bschulz81/robustregression, where the source code can be accessed.

The detailed documentation of all the control parameters of the curve fitting algorithms is in the docstrings of the Python module and in the C++ header files.

Also, the C++/Python test applications in the testapp folder are documented and show many function calls.
    "bugtrack_url": null,
    "license": "MIT License",
    "summary": "A library that implements algorithms for linear and non-linear robust regression.",
    "version": "1.3.2",
    "project_urls": {
        "Homepage": "https://github.com/bschulz81/robustregression"
    },
    "split_keywords": [
        "robust regression",
        "forward-search",
        "huber's loss function",
        "repeated median regression",
        "simple linear regression",
        "non-linear regression",
        "levenberg-marquardt algorithm",
        "statistics",
        "estimators",
        "s-estimator",
        "q-estimator",
        "student t-distribution",
        "interquartile range",
        "machine learning"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "771caa919b3a4fb9948448b64e8edf70429d25663bce5e48e7a02986707c2f46",
                "md5": "9d8b319860c2b5dd8b4b6b0b5fd2af5f",
                "sha256": "a1561066ed08998f4915cdcfe69d2cf3e68270a24d2f95fef915e389d2c2fa31"
            },
            "downloads": -1,
            "filename": "pyRobustRegressionLib-1.3.2-cp312-cp312-win_amd64.whl",
            "has_sig": false,
            "md5_digest": "9d8b319860c2b5dd8b4b6b0b5fd2af5f",
            "packagetype": "bdist_wheel",
            "python_version": "cp312",
            "requires_python": ">=3.7",
            "size": 209959,
            "upload_time": "2024-01-02T05:50:36",
            "upload_time_iso_8601": "2024-01-02T05:50:36.577254Z",
            "url": "https://files.pythonhosted.org/packages/77/1c/aa919b3a4fb9948448b64e8edf70429d25663bce5e48e7a02986707c2f46/pyRobustRegressionLib-1.3.2-cp312-cp312-win_amd64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "57ea0097ded14ba2a6decb30f9799658ed2278a86fba36d9a371deb422003d5d",
                "md5": "32db10ab4c95bebfc8423390c822c76b",
                "sha256": "b7fdc0860e5ab272a9e0d2cbda3214728f12f908535b79380178de3c2a6852ff"
            },
            "downloads": -1,
            "filename": "pyRobustRegressionLib-1.3.2.tar.gz",
            "has_sig": false,
            "md5_digest": "32db10ab4c95bebfc8423390c822c76b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 72046,
            "upload_time": "2024-01-02T05:50:37",
            "upload_time_iso_8601": "2024-01-02T05:50:37.938402Z",
            "url": "https://files.pythonhosted.org/packages/57/ea/0097ded14ba2a6decb30f9799658ed2278a86fba36d9a371deb422003d5d/pyRobustRegressionLib-1.3.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-01-02 05:50:37",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "bschulz81",
    "github_project": "robustregression",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "pyrobustregressionlib"
}
        
Elapsed time: 0.18554s