ogboost

Name: ogboost
Version: 0.6.2
Summary: Ordinal Gradient Boosting
Upload time: 2025-02-20 07:11:10
Home page: None
Author: None
Maintainer: None
Requires Python: None
Docs URL: None
License: MIT License (Copyright (c) 2024 asmahani)
Keywords: ordinal regression, gradient boosting, machine learning, scikit-learn
Requirements: No requirements were recorded.
# Ordinal Gradient Boosting (`OGBoost`)

## Overview

`OGBoost` is a scikit-learn-compatible Python package for gradient boosting tailored to ordinal regression problems. It fits models by alternating between:
1. Fitting a Machine Learning (ML) regression model - such as a decision tree - to predict a latent score that specifies the mean of a probability density function (PDF), and 
1. Fitting a set of thresholds that generate discrete outcomes from the PDF.

In other words, `OGBoost` implements coordinate-descent optimization that combines functional gradient descent - for updating the regression function - with ordinary gradient descent - for updating the threshold vector.
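As a mental model, the latent score and thresholds combine in the standard cumulative-link fashion: with a logit link, the probability of class `k` is the difference between adjacent cumulative probabilities `F(theta_k - g(x))`. The snippet below is a purely illustrative sketch of that mapping, not the package's internal code:

```python
import numpy as np

def class_probabilities(latent, thresholds):
    """Map a latent score g(x) and an increasing threshold vector theta to K class probabilities.

    Uses a logistic CDF (logit link); K = len(thresholds) + 1. Purely illustrative.
    """
    F = lambda z: 1.0 / (1.0 + np.exp(-z))                        # logistic CDF
    cum = np.concatenate(([0.0], F(thresholds - latent), [1.0]))  # cumulative class probabilities
    return np.diff(cum)                                           # P(y = k | x) for k = 0..K-1

probs = class_probabilities(latent=0.3, thresholds=np.array([-1.0, 0.5, 2.0]))
print(probs, probs.sum())  # four class probabilities that sum to 1
```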

The main class of the package, `GradientBoostingOrdinal`, is designed to have the same look and feel as `scikit-learn`'s `GradientBoostingClassifier`. It includes many of the same features, such as custom link functions, sample weighting, early stopping using a validation set, and staged predictions.
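For instance, sample weights can be passed at fit time in the familiar scikit-learn style. The snippet below is a minimal sketch: it assumes the standard `sample_weight` keyword and the `X`, `y` data loaded in the Quick Start further down:

```python
import numpy as np
from ogboost import GradientBoostingOrdinal

# Minimal sketch: inverse-frequency sample weights (assumes the standard
# scikit-learn `sample_weight` keyword and the X, y from the Quick Start below).
weights = 1.0 / np.bincount(y)[y]  # one weight per sample, larger for rarer classes
model = GradientBoostingOrdinal().fit(X, y, sample_weight=weights)
```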

There are, however, important differences as well.

## Unique Features of `OGBoost`

### Latent-Score Prediction

The `decision_function` method of `GradientBoostingOrdinal` behaves differently from `scikit-learn`'s classifiers. Assuming the target variable has `K` distinct classes, a nominal classifier's decision function returns `K` values for each sample. In contrast, `decision_function` in `ogboost` returns a single value per sample: the latent score. This latent score can be considered a high-resolution alternative to class labels, and thus may have superior ranking performance.
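Concretely, the expected output shapes compare as follows (an illustrative sketch; it assumes a fitted model and the data from the Quick Start below):

```python
# Illustrative shape comparison for a dataset with K distinct classes
# (assumes a fitted GradientBoostingOrdinal `model` and feature matrix `X`).
latent = model.decision_function(X)   # shape (n_samples,): one latent score per sample
proba = model.predict_proba(X)        # shape (n_samples, K): one probability per class
labels = model.predict(X)             # shape (n_samples,): hard class labels
```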

### Early Stopping using Cross-Validation (CV)

In addition to using a single validation set for early stopping, similar to `GradientBoostingClassifier`, `ogboost` implements early stopping using CV, which means the entire dataset is used for estimating out-of-sample performance. This can improve the robustness of early stopping, especially for small and/or imbalanced datasets.

### Heterogeneous Ensemble

While most gradient-boosting software packages exclusively use decision trees with a predetermined set of hyperparameters as the base learner in all boosting iterations, `ogboost` offers significantly more flexibility.

1. Users can pass in a `base_learner` parameter to the class initializer to override the default choice of a `DecisionTreeRegressor`. This can be any scikit-learn regressor, such as a feed-forward neural network (`MLPRegressor`) or a k-nearest-neighbors regressor (`KNeighborsRegressor`), as shown in the sketch after this list.
1. Rather than a single base learner, users can specify a list (or a generator) of base learners, which are used in that order, one per boosting iteration. This amounts to creating a *heterogeneous* ensemble as opposed to a *homogeneous* ensemble.
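For example, a k-nearest-neighbors base learner can be supplied instead of the default tree (an illustrative sketch; the hyperparameter values are arbitrary):

```python
from sklearn.neighbors import KNeighborsRegressor
from ogboost import GradientBoostingOrdinal

# Illustrative: override the default DecisionTreeRegressor base learner.
model_knn = GradientBoostingOrdinal(
    base_learner=KNeighborsRegressor(n_neighbors=10),  # any scikit-learn regressor
    n_estimators=50,
)
```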

## Installation
```bash
pip install ogboost
```

## Package Vignette

For a more detailed introduction to `OGBoost`, including the underlying math, see the [package vignette](https://arxiv.org/abs/2502.13456), available on arXiv.

## Quick Start
### Load the Wine Quality Dataset
The package includes a utility to load the wine quality dataset (red and white) from the UCI repository. Note that `load_wine_quality` shifts the target variable (`quality`) to start from `0`. (This is required by the `GradientBoostingOrdinal` class.)

```python
from ogboost import load_wine_quality
X, y, _, _ = load_wine_quality(return_X_y=True)
```

### Training, Prediction and Evaluation
Latent scores typically outperform class labels on discriminative (ranking) tasks, since their higher resolution carries more information:
```python
from ogboost import GradientBoostingOrdinal

## training ##
model = GradientBoostingOrdinal(n_estimators=100, link_function='logit', verbose=1)
model.fit(X, y)

## prediction ##
# class labels
predicted_labels = model.predict(X)
# class probabilities
predicted_probabilities = model.predict_proba(X)
# latent score
predicted_latent = model.decision_function(X)

# evaluation
concordance_latent = model.score(X, y) # concordance using latent scores
concordance_label = model.score(X, y, pred_type='labels') # concordance using class labels
print(f"Concordance - class labels: {concordance_label:.3f}")
print(f"Concordance - latent scores: {concordance_latent:.3f}")
```

### Early-Stopping using Cross-Validation
Using cross-validation for early stopping can produce more robust results compared to a single holdout set, especially for small and/or imbalanced datasets:
```python
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
import time

n_splits = 10
n_repeats = 10
kf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats)

# early-stopping using a simple holdout set
model_earlystop_simple = GradientBoostingOrdinal(n_iter_no_change=10, validation_fraction=0.2)
start = time.time()
c_index_simple = cross_val_score(model_earlystop_simple, X, y, cv=kf, n_jobs=-1)
end = time.time()
print(f'Simple early stopping: {c_index_simple.mean():.3f} ({end - start:.1f} seconds)')

# early-stopping using cross-validation
model_earlystop_cv = GradientBoostingOrdinal(n_iter_no_change=10, cv_early_stopping_splits=5)
start = time.time()
c_index_cv = cross_val_score(model_earlystop_cv, X, y, cv=kf, n_jobs=-1)
end = time.time()
print(f'CV early stopping: {c_index_cv.mean():.3f} ({end - start:.1f} seconds)')
```

### Heterogeneous Ensemble

Rather than a single base learner, users can supply a heterogeneous list of base learners to `GradientBoostingOrdinal`. The utility function `generate_heterogeneous_learners` makes it easy to generate random samples from the hyperparameter spaces of one or more base learners:
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from ogboost import generate_heterogeneous_learners

# Number of samples to generate
n_samples = 100

max_depth_choices = [3, 6, 9, None]
max_leaf_nodes_choices = [10, 20, 30, None]

dt_overrides = {
    "max_depth": lambda rng: rng.choice(max_depth_choices),
    "max_leaf_nodes": lambda rng: rng.choice(max_leaf_nodes_choices)
}

# Create list of DecisionTreeRegressor models
random_learners = generate_heterogeneous_learners(
    [DecisionTreeRegressor()], 
    [dt_overrides], 
    total_samples=n_samples
)
```
Such heterogeneous boosting ensembles can be a more efficient alternative to hyperparameter tuning (e.g., via grid search):
```python
model_heter = GradientBoostingOrdinal(
    base_learner=random_learners,
    n_estimators=n_samples
)
cv_heter = cross_val_score(model_heter, X, y, cv=kf, n_jobs=-1)
print(f'average cv score of heterogeneous ensemble: {np.mean(cv_heter):.3f}')
```

## Release Notes

### 0.6.2

- Added a link to the package vignette on arXiv to ```README.md```.
- Simplified the initialization of fold-level models in ```_fit_cv```.
- Fixed a bug in ```_fit_cv``` that prevented using CV-based early stopping with heterogeneous base learners.

### 0.6.1

- Debugged the ```_fit_cv``` and ```plot_loss``` methods of ```GradientBoostingOrdinal``` to produce correct plots of training/validation loss, and of the loss improvement after each g and theta update, when using cross-validation for early stopping.
- Enhanced docstrings for ```plot_loss```.

### 0.6.0

- Improved the logic for detecting ```random_state``` as a parameter in the base learners (switching from ```hasattr``` to ```get_params```), as the old method was tricked by sklearn's inheritance mechanics into thinking estimators such as SVM included ```random_state``` as a modifiable parameter.
- Added a utility function, ```generate_heterogeneous_learners```, to stochastically generate a list of base learners to supply to ```GradientBoostingOrdinal``` (heterogeneous boosting ensemble).
- Edited code examples in ```README.md``` to reflect the enhancements to the package.
- Enhanced ```load_wine_quality``` to add an option for returning X and y - instead of a single dataframe - for the red and white datasets.

### 0.5.6

- Tweaked the default hyperparameters of ```DecisionTreeRegressor``` (itself the default ```base_learner``` for ```GradientBoostingOrdinal```) to match those in scikit-learn's ```GradientBoostingClassifier```.
- Small improvements to the ```plot_loss``` method of ```GradientBoostingOrdinal```.
- Added the *Release Notes* section to the ```README.md``` file.
- Small edits to the text and code in ```README.md```.

### 0.5.5

- First public release. 

## License
This package is licensed under the [MIT License](./LICENSE).

            
