| Field | Value |
| --- | --- |
| Name | flexcv |
| Version | 24.0b0 |
| home_page | |
| Summary | Easy and flexible nested cross validation for tabular data in Python. |
| upload_time | 2024-01-23 19:57:03 |
| maintainer | |
| docs_url | None |
| author | Patrick Blättermann, Siegbert Versümer |
| requires_python | <3.12,>=3.10 |
| license | MIT License Copyright (c) 2023 Fabian Rosenthal, Patrick Blättermann, Siegbert Versümer Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
| keywords | machine learning, cross validation |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
<table>
<tr>
<td><img src="https://github.com/radlfabs/flexcv/blob/main/imgs/logo_colored.png?raw=true" width="200"></td>
<td><h1>Flexible Cross Validation and Machine Learning for Regression on Tabular Data
</h1></td>
</tr>
</table>
[![tests](https://github.com/radlfabs/flexcv/actions/workflows/test.yml/badge.svg)](https://github.com/radlfabs/flexcv/actions/workflows/test.yml) [![DOI](https://zenodo.org/badge/708442681.svg)](https://zenodo.org/doi/10.5281/zenodo.10160846)
Authors: Fabian Rosenthal, Patrick Blättermann and Siegbert Versümer
## Introduction
This repository contains the code for the Python package `flexcv`, which implements flexible cross validation and machine learning for tabular data. Its code is used for the machine learning evaluations in Versümer et al. (2023).
The core functionality has been developed in the course of a research project at the University of Applied Sciences Düsseldorf, Germany.
`flexcv` is a method comparison package for Python that wraps around popular libraries to easily tailor complex cross validation code to your needs.
It provides a range of features for comparing machine learning models on different datasets with different sets of predictors, letting you customize just about everything around cross validation. It supports both fixed and random effects, as well as random slopes.
Install the package and give it a try:
`pip install flexcv`
You can find our documentation [here](https://radlfabs.github.io/flexcv/).
## Features
The `flexcv` package provides the following features:
1. Cross-validation of model performance (generalization estimation)
2. Selection of model hyperparameters using an inner cross-validation and a state-of-the-art optimization provided by `optuna`.
3. Customization of objective functions for optimization to select meaningful model parameters.
4. Fixed and mixed effects modeling (random intercepts and slopes).
5. Scaling of inner and outer cross-validation folds separately.
6. Easy usage of the state-of-the-art logging dashboard `neptune` to track all of your experiments.
7. Adaptations for cross validation splits with stratification for continuous target variables (a generic sketch of this idea follows below the list).
8. Easy local summary of all evaluation metrics in a single table.
9. Wrapper classes for the `statsmodels` package to use their mixed effects models inside of a `sklearn` Pipeline. Read more about `statsmodels` [here](https://www.statsmodels.org/).
10. Uses the `merf` package to apply corrections for clustered data via the expectation maximization algorithm, supporting any `sklearn` BaseEstimator. Read more about that package [here](https://github.com/manifoldai/merf).
11. Inner cross validation implementation that lets you push groups to the inner split, e.g. to apply GroupKFold.
12. Customizable ObjectiveScorer function for hyperparameter tuning that lets you trade off between under- and overfitting.
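A quick illustration of item 7: stratified splitting normally requires discrete class labels, so stratifying on a continuous target is typically done by binning the target first. The following is a generic `sklearn` sketch of that idea with made-up data; it shows the concept only and is not `flexcv`'s internal implementation.

```python
# Generic sketch: stratified CV splits for a continuous target by quantile-binning it.
# Illustrates the idea behind feature 7; this is NOT flexcv's internal code.
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))   # 100 samples, 5 features (arbitrary)
y = rng.normal(size=100)        # continuous target

# discretize the target into quantile bins so StratifiedKFold has labels to balance
n_bins = 5
bin_edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
y_binned = np.digitize(y, bin_edges)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y_binned):
    # each fold now sees a similar distribution of the continuous target
    print(len(train_idx), len(test_idx))
```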
These are the core packages used under the hood in `flexcv`:
1. `sklearn` - A very popular machine learning library. We use their Estimator API for models, the pipeline module, the StandardScaler, metrics and of course wrap around their cross validation split methods. Learn more [here](https://scikit-learn.org/stable/).
2. `Optuna` - A state-of-the-art optimization package. We use it for parameter selection in the inner loop of our nested cross validation. Learn more about theoretical background and opportunities [here](https://optuna.org/).
3. `neptune` - Awesome logging dashboard with lots of integrations. It is a charm in combination with `Optuna`. We used it to track all of our experiments. `Neptune` is quite deeply integrated into `flexcv`. Learn more about this great library [here](https://neptune.ai/).
4. `merf` - Mixed Effects for Random Forests. Applies correction terms to the predictions on clustered data. Works not only with random forests but with any `sklearn` BaseEstimator (a minimal usage sketch follows below this list).
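Since `merf` is central to how clustered data is handled, here is a minimal sketch of how `merf` itself wraps a `sklearn` estimator. It is shown for orientation only: it mirrors the `merf` README rather than `flexcv` internals, assumes `merf` >= 1.0 (where the constructor takes a `fixed_effects_model`), and uses made-up data.

```python
# Minimal merf sketch (orientation only): EM-based correction around a sklearn estimator.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from merf import MERF  # assumes merf >= 1.0

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
clusters = pd.Series(rng.integers(0, 5, size=n))  # cluster ids for the grouped structure
Z = np.ones((n, 1))                               # random-intercept design matrix
y = X["x1"] + rng.normal(size=n)                  # toy target

# any sklearn BaseEstimator can serve as the fixed-effects model
merf_model = MERF(fixed_effects_model=RandomForestRegressor(n_estimators=50), max_iterations=5)
merf_model.fit(X, Z, clusters, y)
y_hat = merf_model.predict(X, Z, clusters)
```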
## Why would you use `flexcv`?
Working with cross validation in Python usually starts with creating a sklearn pipeline. Pipelines are super useful to combine preprocessing steps with model fitting and prevent data leakage.
However, there are limitations, e.g. if you want to pass the training part of your clustering variable on to the inner cross validation split. For some features, you would have to write a lot of boilerplate code to get it working, and you end up with a lot of code duplication.
As soon as you want to use a linear mixed effects model, you have to use the `statsmodels` package, which is not compatible with the `sklearn` pipeline.
`flexcv` solves these problems and provides a lot of useful features for cross validation and machine learning on tabular data, so you can focus on your data and your models.
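For contrast, this is roughly the plain `sklearn` starting point described above: a pipeline that scales features and fits a model, evaluated with a grouped split. Everything beyond this (nested tuning, group-aware inner splits, mixed effects, experiment logging) is where the boilerplate starts to pile up. The snippet is a generic sketch with made-up data, not part of `flexcv`.

```python
# Plain-sklearn baseline that flexcv builds on (generic sketch, made-up data).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)
groups = rng.integers(0, 10, size=100)  # clustering variable

# the pipeline keeps scaling inside each fold and thus prevents data leakage
pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
scores = cross_val_score(pipe, X, y, groups=groups, cv=GroupKFold(n_splits=5))
print(scores.mean())
```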
## Getting Started
Let's set up a minimal working example using a simple linear model and some randomly generated regression data.
```py
# import the interface class, a data generator and our model
from flexcv import CrossValidation
from flexcv.synthesizer import generate_regression
from flexcv.models import LinearModel
# generate some random sample data that is clustered
X, y, group, _ = generate_regression(10, 100, 5, n_slopes=1, noise_level=9.1e-2, random_seed=42)
```
The `CrossValidation` class is the core of this package. It holds all the information about the data, the models, the cross validation splits and the results, and it is responsible for performing the cross validation and logging the results.

Setting up the `CrossValidation` object is easy. We can use method chaining to set up our configuration and perform the cross validation; you might be familiar with this pattern from `pandas` and other packages. The set-methods all return the `CrossValidation` object itself, so we can chain them together. The `perform` method then performs the cross validation and returns the `CrossValidation` object again. The `get_results` method returns a `CrossValidationResults` object which holds all the results of the cross validation. It has a `summary` property which returns a `pandas.DataFrame` with all the results, and we can use the DataFrame's `to_excel` method to save the results to an Excel file.
```python
import flexcv  # provides flexcv.CrossValMethod used in set_splits below

# instantiate our cross validation class
cv = CrossValidation()

# now we can use method chaining to set up our configuration and perform the cross validation
results = (
    cv
    .set_data(X, y, group, dataset_name="ExampleData")
    # configure our split strategies. Let's go for a GroupKFold since our data is clustered
    .set_splits(method_outer_split=flexcv.CrossValMethod.GROUP)
    # add the model class
    .add_model(LinearModel)
    .perform()
    .get_results()
)

# results has a summary property which returns a dataframe
# we can simply call the pandas method "to_excel"
results.summary.to_excel("my_cv_results.xlsx")
```
You can then use the various functions and classes provided by the framework to compare machine learning models on your data.
Additional info on how to get started with this package will be added here soon, as well as to the [documentation](https://radlfabs.github.io/flexcv/).
## Documentation
Have a look at our [documentation](https://radlfabs.github.io/flexcv/). We are currently adding lots of additional guides and tutorials to help you get started with `flexcv`. If you are interested in writing a guide or tutorial, feel free to contact us. It would be great to have some community contributions here.
## Conclusion
`flexcv` is a powerful tool for comparing machine learning models on different datasets with different sets of predictors. It provides a range of features for cross-validation, parameter selection, and experiment tracking. With its Optuna-based hyperparameter optimization and neptune logging integration, it is a valuable addition to any machine learning workflow.
## Earth Extension
A wrapper implementation of the Earth regression package for R exists that you can use with `flexcv`. It is called [flexcv-earth](https://github.com/radlfabs/flexcv-earth). It is not yet available on PyPI, but you can install it from GitHub with `pip install git+https://github.com/radlfabs/flexcv-earth.git`. You can then use the `EarthModel` class in your `flexcv` configuration by importing it from `flexcv_earth`. Further information is available in the [documentation](https://radlfabs.github.io/flexcv-earth/).
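Based on the description above, using the extension might look like the sketch below. The exact import path of `EarthModel` and the way it plugs into the configuration are assumptions here, so please check the `flexcv-earth` documentation for the authoritative usage.

```python
# Hypothetical sketch of using flexcv-earth; the EarthModel import path is an assumption.
from flexcv import CrossValidation
from flexcv.synthesizer import generate_regression
from flexcv_earth import EarthModel  # assumed import location

X, y, group, _ = generate_regression(10, 100, 5, n_slopes=1, noise_level=9.1e-2, random_seed=42)

results = (
    CrossValidation()
    .set_data(X, y, group, dataset_name="ExampleData")
    .add_model(EarthModel)  # used like any other model class in the configuration
    .perform()
    .get_results()
)
```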
## Acknowledgements
We would like to thank the developers of `sklearn`, `optuna`, `neptune` and `merf` for their great work. Without their awesome packages and dedication, this project would not have been possible. The logo design was generated by [Microsoft Bing Chat Image Creator](https://www.bing.com/images/create) using the prompt "Generate a logo graphic where a line graph becomes the letters 'c' and 'v'. Be as simple and simplistic as possible."
## Contributions
We welcome contributions to this repository. Feel free to open an issue or pull request if you have any suggestions, problems or questions. Since the project is maintained as a side project, we cannot guarantee a quick response or fix. However, we will try to respond as soon as possible. We strongly welcome contributions to the documentation and tests. If you have any questions about contributing, feel free to contact us.
## About
`flexcv` was developed at the [Institute of Sound and Vibration Engineering](https://isave.hs-duesseldorf.de/) at the University of Applied Sciences Düsseldorf, Germany, and is now published and maintained by Fabian Rosenthal as a personal project.
Raw data
{
"_id": null,
"home_page": "",
"name": "flexcv",
"maintainer": "",
"docs_url": null,
"requires_python": "<3.12,>=3.10",
"maintainer_email": "",
"keywords": "machine learning,cross validation",
"author": "Patrick Bl\u00e4ttermann, Siegbert Vers\u00fcmer",
"author_email": "Fabian Rosenthal <rosenthal.fabian@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/7e/12/771a1132d259438ecfbc5b4faf00ae55e43b9ffc592289bb6dbcb95234b5/flexcv-24.0b0.tar.gz",
"platform": null,
"description": "<table>\n <tr>\n <td><img src=\"https://github.com/radlfabs/flexcv/blob/main/imgs/logo_colored.png?raw=true\" width=\"200\"></td>\n <td><h1>Flexible Cross Validation and Machine Learning for Regression on Tabular Data\n</h1></td>\n </tr>\n</table>\n\n[![tests](https://github.com/radlfabs/flexcv/actions/workflows/test.yml/badge.svg)](https://github.com/radlfabs/flexcv/actions/workflows/test.yml) [![DOI](https://zenodo.org/badge/708442681.svg)](https://zenodo.org/doi/10.5281/zenodo.10160846)\n\nAuthors: Fabian Rosenthal, Patrick Bl\u00e4ttermann and Siegbert Vers\u00fcmer\n\n## Introduction\nThis repository contains the code for the python package `flexcv` which implements flexible cross validation and machine learning for tabular data. It's code is used for the machine learning evaluations in Vers\u00fcmer et al. (2023).\nThe core functionality has been developed in the course of a research project at D\u00fcsseldorf University of Applied Science, Germany.\n\n`flexcv` is a method comparison package for Python that wraps around popular libraries to easily taylor complex cross validation code to your needs.\n\nIt provides a range of features for comparing machine learning models on different datasets with different sets of predictors customizing just about everything around cross validations. It supports both fixed and random effects, as well as random slopes.\n\nInstall the package and give it a try:\n\n`pip install flexcv`\n\nYou can find our documentation [here](https://radlfabs.github.io/flexcv/).\n\n## Features\n\nThe `flexcv` package provides the following features:\n\n1. Cross-validation of model performance (generalization estimation)\n2. Selection of model hyperparameters using an inner cross-validation and a state-of-the-art optimization provided by `optuna`.\n3. Customization of objective functions for optimization to select meaningful model parameters.\n4. Fixed and mixed effects modeling (random intercepts and slopes).\n5. Scaling of inner and outer cross-validation folds separately.\n6. Easy usage of the state-of-the-art logging dashboard `neptune` to track all of your experiments.\n7. Adaptations for cross validation splits with stratification for continuous target variables.\n8. Easy local summary of all evaluation metrics in a single table.\n9. Wrapper classes for the `statsmodels` package to use their mixed effects models inside of a `sklearn` Pipeline. Read more about that package [here](https://github.com/manifoldai/merf).\n10. Uses the `merf` package to apply correction for clustered data using the expectation maximization algorithm and supporting any `sklearn` BaseEstimator. Read more about that package [here](https://github.com/manifoldai/merf).\n11. Inner cross validation implementation that let's you push groups to the inner split, e. g. to apply GroupKFold.\n12. Customizable ObjectiveScorer function for hyperparameter tuning, that let's you make a trade-off between under- and overfitting.\n\nThese are the core packages used under the hood in `flexcv`:\n\n1. `sklearn` - A very popular machine learning library. We use their Estimator API for models, the pipeline module, the StandardScaler, metrics and of course wrap around their cross validation split methods. Learn more [here](https://scikit-learn.org/stable/).\n2. `Optuna` - A state-of-the-art optimization package. We use it for parameter selection in the inner loop of our nested cross validation. Learn more about theoretical background and opportunities [here](https://optuna.org/).\n3. 
`neptune` - Awesome logging dashboard with lots of integrations. It is a charm in combination with `Optuna`. We used it to track all of our experiments. `Neptune` is quite deeply integrated into `flexcv`. Learn more about this great library [here](https://neptune.ai/).\n4. `merf` - Mixed Effects for Random Forests. Applies correction terms on the predictions of clustered data. Works not only with random forest but with every `sklearn` BaseEstimator.\n\n## Why would you use `flexcv`?\n\nWorking with cross validation in Python usually starts with creating a sklearn pipeline. Pipelines are super useful to combine preprocessing steps with model fitting and prevent data leakage. \nHowever, there are limitations, e. g. if you want to push the training part of your clustering variable to the inner cross validation split. For some of the features, you would have to write a lot of boilerplate code to get it working, and you end up with a lot of code duplication.\nAs soon as you want to use a linear mixed effects model, you have to use the `statsmodels` package, which is not compatible with the `sklearn` pipeline.\n`flexcv` solves these problems and provides a lot of useful features for cross validation and machine learning on tabular data, so you can focus on your data and your models.\n\n## Getting Started\n\n\nLet's set up a minimal working example using a LinearRegression estimator and some randomly generated regression data.\n\n```py\n# import the interface class, a data generator and our model\nfrom flexcv import CrossValidation\nfrom flexcv.synthesizer import generate_regression\nfrom flexcv.models import LinearModel\n \n# generate some random sample data that is clustered\nX, y, group, _ = generate_regression(10, 100, 5, n_slopes=1, noise_level=9.1e-2, random_seed=42)\n```\n\nThe `CrossValidation` class is the core of this package. It holds all the information about the data, the models, the cross validation splits and the results. It is also responsible for performing the cross validation and logging the results. Setting up the `CrossValidation` object is easy. We can use method chaining to set up our configuration and perform the cross validation. You might be familiar with this pattern from `pandas` and other packages. The set-methods all return the `CrossValidation` object itself, so we can chain them together. The `perform` method then performs the cross validation and returns the `CrossValidation` object again. The `get_results` method returns a `CrossValidationResults` object which holds all the results of the cross validation. It has a `summary` property which returns a `pandas.DataFrame` with all the results. We can then use the `to_excel` method of the `DataFrame` to save the results to an excel file.\n\n```python\n# instantiate our cross validation class\ncv = CrossValidation()\n\n# now we can use method chaining to set up our configuration perform the cross validation\nresults = (\n cv\n .set_data(X, y, group, dataset_name=\"ExampleData\")\n # configure our split strategies. 
Lets go for a GroupKFold since our data is clustered\n .set_splits(\n method_outer_split=flexcv.CrossValMethod.GROUP\n # add the model class\n .add_model(LinearModel)\n .perform()\n .get_results()\n)\n\n# results has a summary property which returns a dataframe\n# we can simply call the pandas method \"to_excel\"\nresults.summary.to_excel(\"my_cv_results.xlsx\")\n```\n\nYou can then use the various functions and classes provided by the framework to compare machine learning models on your data.\nAdditional info on how to get started working with this package will be added here soon as well as to the (documentation)[radlfabs.github.io/flexcv/].\n\n## Documentation\n\nHave a look at our [documentation](https://radlfabs.github.io/flexcv/). We currently add lots of additional guides and tutorials to help you get started with `flexcv`. If you are interested in writing a guide or tutorial, feel free to contact us. It would be great to have some community contributions here.\n\n## Conclusion\n\n`flexcv` is a powerful tool for comparing machine learning models on different datasets with different sets of predictors. It provides a range of features for cross-validation, parameter selection, and experiment tracking. With its state-of-the-art optimization package and logging dashboard, it is a valuable addition to any machine learning workflow.\n\n## Earth Extension\n\nAn wrapper implementation of the Earth Regression package for R exists that you can use with flexcv. It is called [flexcv-earth](https:github.com/radlfabs/flexcv-earth). It is not yet available on PyPI, but you can install it from GitHub with the command `pip install git+https://github.com/radlfabs/flexcv-earth.git`. You can then use the `EarthModel` class in your `flexcv` configuration by importing it from `flexcv_earth`. Further information is available in the [documentation](https://radlfabs.github.io/flexcv-earth/).\n\n## Acknowledgements\n\nWe would like to thank the developers of `sklearn`, `optuna`, `neptune` and `merf` for their great work. Without their awesome packages and dedication, this project would not have been possible. The logo design was generated by [Microsoft Bing Chat Image Creator](https://www.bing.com/images/create) using the prompt \"Generate a logo graphic where a line graph becomes the letters 'c' and 'v'. Be as simple and simplistic as possible.\"\n\n## Contributions\n\nWe welcome contributions to this repository. Feel free to open an issue or pull request if you have any suggestions, problems or questions. Since the project is maintained as a side project, we cannot guarantee a quick response or fix. However, we will try to respond as soon as possible. We strongly welcome contributions to the documentation and tests. If you have any questions about contributing, feel free to contact us.\n\n## About\n\n`flexcv` was developed at the [Institute of Sound and Vibration Engineering](https://isave.hs-duesseldorf.de/) at the University of Applied Science D\u00fcsseldorf, Germany and is now published and maintained by Fabian Rosenthal as a personal project.\n",
"bugtrack_url": null,
"license": "MIT License Copyright (c) 2023 Fabian Rosenthal, Patrick Bl\u00e4ttermann, Siegbert Vers\u00fcmer Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.",
"summary": "Easy and flexible nested cross validation for tabular data in python.",
"version": "24.0b0",
"project_urls": {
"Bug Tracker": "https://github.com/radlfabs/flexcv/issues",
"Docs": "https://radlfabs.github.io/flexcv",
"Homepage": "https://github.com/radlfabs/flexcv"
},
"split_keywords": [
"machine learning",
"cross validation"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "c54ca60f82fc55ddc39c37ccdab8ab411491fe57e617624ec541bc3061ae5dc7",
"md5": "f8f84b3eaef04972906f80d4f84092a6",
"sha256": "4231441b1c413c70cecd9fdefa86b269ad13a672be7679150e7d4ad38b26bfc8"
},
"downloads": -1,
"filename": "flexcv-24.0b0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "f8f84b3eaef04972906f80d4f84092a6",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<3.12,>=3.10",
"size": 62174,
"upload_time": "2024-01-23T19:57:01",
"upload_time_iso_8601": "2024-01-23T19:57:01.082916Z",
"url": "https://files.pythonhosted.org/packages/c5/4c/a60f82fc55ddc39c37ccdab8ab411491fe57e617624ec541bc3061ae5dc7/flexcv-24.0b0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "7e12771a1132d259438ecfbc5b4faf00ae55e43b9ffc592289bb6dbcb95234b5",
"md5": "47e195fe41218d2df7dfb65faee5f7d8",
"sha256": "03cad1731bd7f0f61337b1169746cb854ecab1593a7bd48d7b6d98b311f9cb05"
},
"downloads": -1,
"filename": "flexcv-24.0b0.tar.gz",
"has_sig": false,
"md5_digest": "47e195fe41218d2df7dfb65faee5f7d8",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<3.12,>=3.10",
"size": 2850001,
"upload_time": "2024-01-23T19:57:03",
"upload_time_iso_8601": "2024-01-23T19:57:03.167424Z",
"url": "https://files.pythonhosted.org/packages/7e/12/771a1132d259438ecfbc5b4faf00ae55e43b9ffc592289bb6dbcb95234b5/flexcv-24.0b0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-01-23 19:57:03",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "radlfabs",
"github_project": "flexcv",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "flexcv"
}