[Open In Colab](https://colab.research.google.com/github/dholzmueller/pytabkit/blob/main/examples/tutorial_notebook.ipynb)
[Documentation](https://pytabkit.readthedocs.io/en/latest/)
[Tests](https://github.com/dholzmueller/pytabkit/actions/workflows/testing.yml)
# PyTabKit: Tabular ML models and benchmarking (NeurIPS 2024)
[Paper](https://arxiv.org/abs/2407.04491) | [Documentation](https://pytabkit.readthedocs.io) | [RealMLP-TD-S standalone implementation](https://github.com/dholzmueller/realmlp-td-s_standalone) | [Grinsztajn et al. benchmark code](https://github.com/LeoGrin/tabular-benchmark/tree/better_by_default) | [Data archive](https://doi.org/10.18419/darus-4555) |
| --- | --- |---------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------|
PyTabKit provides **scikit-learn interfaces for modern tabular classification and regression methods**
benchmarked in our [paper](https://arxiv.org/abs/2407.04491), see below.
It also contains the code we used for **benchmarking** these methods.
*Figure: Meta-test benchmark results.*
## Installation
```commandline
pip install pytabkit
```
- If you want to use **TabR**, you have to manually install
[faiss](https://github.com/facebookresearch/faiss/blob/main/INSTALL.md),
which is only available on **conda**.
- Please install torch separately if you want to control the version (CPU/GPU etc.).
- Use `pytabkit[autogluon,extra,hpo,bench,dev]` to install additional dependencies for
AutoGluon models, extra preprocessing,
hyperparameter optimization methods beyond random search (hyperopt/SMAC),
the benchmarking part, and testing/documentation. For the hpo part,
you might need to install *swig* (e.g. via pip) if the build of *pyrfr* fails.
See also the [documentation](https://pytabkit.readthedocs.io).
To run the data download for the meta-train benchmark, you need one of rar, unrar, or 7-zip
to be installed on the system.
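Before running TabR or the meta-train data download, it can help to check which optional pieces are present. The helper below is a minimal sketch using only the Python standard library; the function name `check_optional_dependencies` is hypothetical, and the executable name `7z` for 7-zip is an assumption (some systems ship it as `7za` or `7zz`).

```python
import importlib.util
import shutil


def check_optional_dependencies():
    """Report which optional modules and archive tools are available.

    Hypothetical helper, not part of pytabkit.
    """
    # importable Python modules mentioned in the installation notes
    modules = {
        name: importlib.util.find_spec(name) is not None
        for name in ("torch", "faiss")
    }
    # command-line archive tools; "7z" is an assumed executable name for 7-zip
    archivers = {
        exe: shutil.which(exe) is not None
        for exe in ("rar", "unrar", "7z")
    }
    return {"modules": modules, "archivers": archivers}


status = check_optional_dependencies()
print(status)
```

If `faiss` is reported as missing, TabR will likely not work; if none of the archive tools is found, the meta-train download is expected to fail.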
## Using the ML models
Most of our machine learning models are directly available via scikit-learn interfaces.
For example, you can use RealMLP-TD for classification as follows:
```python
from pytabkit import RealMLP_TD_Classifier
model = RealMLP_TD_Classifier() # or TabR_S_D_Classifier, CatBoost_TD_Classifier, etc.
model.fit(X_train, y_train)
model.predict(X_test)
```
The code above will automatically select a GPU if available,
try to detect categorical columns in dataframes,
preprocess numerical variables and regression targets (no standardization required),
and use a training-validation split for early stopping.
All of this (and much more) can be configured through the constructor
and the parameters of the fit() method.
For example, it is possible to do bagging
(ensembling of models on 5-fold cross-validation)
simply by passing `n_cv=5` to the constructor.
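The bagging option mentioned above could be wrapped as follows. This is only a sketch: the helper name `fit_bagged_realmlp` is hypothetical, and `pytabkit` is imported inside the function so that the helper can be defined without loading torch.

```python
def fit_bagged_realmlp(X_train, y_train, X_test, n_cv=5, random_state=0):
    """Fit RealMLP-TD with n_cv-fold bagging and return class probabilities.

    Hypothetical convenience wrapper, not part of pytabkit.
    """
    # lazy import: pytabkit (and torch) are only needed when the function is called
    from pytabkit import RealMLP_TD_Classifier

    # n_cv=5 trains one model per cross-validation fold and ensembles them
    model = RealMLP_TD_Classifier(n_cv=n_cv, random_state=random_state)
    model.fit(X_train, y_train)
    return model.predict_proba(X_test)
```

Calling `fit_bagged_realmlp(X_train, y_train, X_test)` trains roughly five times as long as the single-model default, since one model is fitted per fold.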
Here is an example for some of the parameters that can be set explicitly:
```python
from pytabkit import RealMLP_TD_Classifier
model = RealMLP_TD_Classifier(device='cpu', random_state=0, n_cv=1, n_refit=0,
                              n_epochs=256, batch_size=256, hidden_sizes=[256] * 3,
                              val_metric_name='cross_entropy',
                              use_ls=False,  # for metrics like AUC / log-loss
                              lr=0.04, verbosity=2)
model.fit(X_train, y_train, X_val, y_val, cat_col_names=['Education'])
model.predict_proba(X_test)
```
See [this notebook](https://colab.research.google.com/github/dholzmueller/pytabkit/blob/main/examples/tutorial_notebook.ipynb)
for more examples. Missing numerical values are currently *not* allowed and need to be imputed beforehand.
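Since missing numerical values have to be imputed before calling `fit()`, one simple option is median imputation with pandas. The sketch below is not part of pytabkit, and the function name `impute_numeric_median` is hypothetical. The medians are computed on the training data only and reused for other splits, to avoid leaking information from validation or test data.

```python
import numpy as np
import pandas as pd


def impute_numeric_median(X_train, X_other):
    """Fill NaNs in numerical columns with the training-set median.

    Returns imputed copies; the input dataframes are left unchanged.
    """
    num_cols = X_train.select_dtypes(include="number").columns
    medians = X_train[num_cols].median()  # NaNs are skipped by default

    X_train_imp = X_train.copy()
    X_other_imp = X_other.copy()
    # fillna with a Series fills each column with the value matching its name
    X_train_imp[num_cols] = X_train_imp[num_cols].fillna(medians)
    X_other_imp[num_cols] = X_other_imp[num_cols].fillna(medians)
    return X_train_imp, X_other_imp


train = pd.DataFrame({"age": [20.0, np.nan, 40.0], "Education": ["a", "b", "a"]})
test = pd.DataFrame({"age": [np.nan], "Education": ["b"]})
train_imp, test_imp = impute_numeric_median(train, test)
print(train_imp)
print(test_imp)
```

Other strategies (mean imputation, an additional missingness indicator column, model-based imputation) may work better depending on the dataset.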
### Available ML models
Our ML models are available in up to three variants, all with best-epoch selection:
- library defaults (D)
- our tuned defaults (TD)
- random search hyperparameter optimization (HPO), sometimes also the Tree-structured Parzen Estimator (HPO-TPE)
We provide the following ML models:
- **RealMLP** (TD, HPO): Our new neural net models with tuned defaults (TD)
or random search hyperparameter optimization (HPO)
- **XGB**, **LGBM**, **CatBoost** (D, TD, HPO, HPO-TPE): Interfaces for gradient-boosted
tree libraries XGBoost, LightGBM, CatBoost
- **MLP**, **ResNet**, **FTT** (D, HPO): Models from [Revisiting Deep Learning Models for Tabular Data](https://proceedings.neurips.cc/paper_files/paper/2021/hash/9d86d83f925f2149e9edb0ac3b49229c-Abstract.html)
- **MLP-PLR** (D, HPO): MLP with numerical embeddings from [On Embeddings for Numerical Features in Tabular Deep Learning](https://proceedings.neurips.cc/paper_files/paper/2022/hash/9e9f0ffc3d836836ca96cbf8fe14b105-Abstract-Conference.html)
- **TabR** (D, HPO): TabR model from [TabR: Tabular Deep Learning Meets Nearest Neighbors](https://openreview.net/forum?id=rhgIgTSSxW)
- **TabM** (D): TabM model from [TabM: Advancing Tabular Deep Learning with Parameter-Efficient Ensembling](https://arxiv.org/abs/2410.24210)
- **RealTabR** (D): Our new TabR variant with default parameters
- **Ensemble-TD**: Weighted ensemble of all TD models (RealMLP, XGB, LGBM, CatBoost)
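Because the models expose scikit-learn interfaces, several of them can in principle be compared with standard scikit-learn tooling. The following sketch assumes that the estimators work with `cross_val_score` like regular scikit-learn estimators; the helper name `compare_estimators` is hypothetical and not part of pytabkit.

```python
from sklearn.model_selection import cross_val_score


def compare_estimators(estimators, X, y, cv=3, scoring="accuracy"):
    """Return the mean cross-validation score for each named estimator.

    `estimators` is a dict mapping a display name to an unfitted
    scikit-learn-compatible estimator.
    """
    results = {}
    for name, estimator in estimators.items():
        # cross_val_score clones the estimator for every fold
        scores = cross_val_score(estimator, X, y, cv=cv, scoring=scoring)
        results[name] = float(scores.mean())
    return results
```

With pytabkit installed, a call could look like `compare_estimators({"RealMLP-TD": RealMLP_TD_Classifier(), "CatBoost-TD": CatBoost_TD_Classifier()}, X, y)`. For rigorous comparisons, the benchmarking code in this repository is the better tool.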
## Benchmarking code
Our benchmarking code has functionality for
- dataset download
- running methods in a highly parallel fashion on single-node/multi-node/multi-GPU hardware,
with automatic scheduling that tries to respect RAM constraints
- analyzing/plotting results
For more details, we refer to the [documentation](https://pytabkit.readthedocs.io).
## Preprocessing code
While many preprocessing methods are implemented in this repository,
a standalone version of our robust scaling + smooth clipping
can be found [here](https://github.com/dholzmueller/realmlp-td-s_standalone/blob/main/preprocessing.py#L65C7-L65C37).
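To give a rough idea of what robust scaling followed by smooth clipping does, here is a simplified NumPy sketch. It is not the code from the linked file: the function name `robust_scale_smooth_clip` is hypothetical, and the specific choices (centering at the median, scaling by the interquartile range with a fallback to the full range, and the clipping function x / sqrt(1 + (x/3)^2)) are assumptions based on our reading of the paper. Please refer to the linked standalone implementation for the authoritative version.

```python
import numpy as np


def robust_scale_smooth_clip(X_train, X):
    """Scale columns of X using statistics of X_train, then clip smoothly.

    Simplified sketch under the assumptions stated above.
    Both inputs are 2D arrays of shape (n_samples, n_features).
    """
    X_train = np.asarray(X_train, dtype=float)
    X = np.asarray(X, dtype=float)

    median = np.median(X_train, axis=0)
    q25, q75 = np.quantile(X_train, [0.25, 0.75], axis=0)
    iqr = q75 - q25
    full_range = X_train.max(axis=0) - X_train.min(axis=0)

    # scale factor per column: 1/IQR, else 2/range, else 0 (constant column)
    scale = np.zeros_like(iqr)
    use_iqr = iqr > 0
    use_range = ~use_iqr & (full_range > 0)
    scale[use_iqr] = 1.0 / iqr[use_iqr]
    scale[use_range] = 2.0 / full_range[use_range]

    z = (X - median) * scale
    # smooth clipping: close to the identity near 0, bounded by 3 in absolute value
    return z / np.sqrt(1.0 + (z / 3.0) ** 2)


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
X_demo[0, 0] = 1e6   # extreme outlier
X_demo[:, 2] = 1.0   # constant column
Z_demo = robust_scale_smooth_clip(X_demo, X_demo)
print(np.abs(Z_demo).max())
```

The outlier ends up close to the bound instead of dominating the column, while values near the median are almost unchanged up to scaling.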
## Citation
If you use this repository for research purposes, please cite our [paper](https://arxiv.org/abs/2407.04491):
```
@inproceedings{holzmuller2024better,
title={Better by default: {S}trong pre-tuned {MLPs} and boosted trees on tabular data},
author={Holzm{\"u}ller, David and Grinsztajn, Leo and Steinwart, Ingo},
booktitle = {Neural {Information} {Processing} {Systems}},
year={2024}
}
```
## Contributors
- David Holzmüller (main developer)
- Léo Grinsztajn (deep learning baselines, plotting)
- Ingo Steinwart (UCI dataset download)
- Katharina Strecker (PyTorch-Lightning interface)
- Lennart Purucker (some features/fixes)
- Jérôme Dockès (deployment, continuous integration)
## Acknowledgements
Code from other repositories is acknowledged as well as possible in code comments.
In particular, we used code from https://github.com/yandex-research/rtdl
and sub-packages (Apache 2.0 license),
code from https://github.com/catboost/benchmarks/
(Apache 2.0 license),
and code from https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html
(Apache 2.0 license).
## Releases (see git tags)
- v1.1.2:
    - Some compatibility improvements for scikit-learn 1.6
      (scikit-learn 1.6 is still disabled since skorch is not compatible with it).
    - Improved documentation for the PyTorch-Lightning interface.
    - Other small bugfixes and improvements.
- v1.1.1:
    - Added parameters `weight_decay`, `tfms`,
      and `gradient_clipping_norm` to TabM.
      The updated default parameters now apply the RTDL quantile transform.
- v1.1.0:
    - Included TabM.
    - Replaced `__` by `_` in parameter names for MLP, MLP-PLR, ResNet, and FTT,
      to comply with scikit-learn interface requirements.
    - Fixed non-determinism in NN baselines
      by initializing the random state of quantile (and KDI)
      preprocessing transforms.
    - The `n_threads` parameter is no longer ignored by NNs.
    - Changes by [Lennart Purucker](https://github.com/LennartPurucker):
      added a time limit for RealMLP,
      added support for `lightning` (while still allowing `pytorch-lightning`),
      made skorch a lazy import, and removed the msgpack\_numpy dependency.
- v1.0.0: Release for the NeurIPS version and arXiv v2.
    - More baselines (MLP-PLR, FT-Transformer, TabR-HPO, RF-HPO),
      also some unpolished internal interfaces for other methods,
      esp. the ones in AutoGluon.
    - Updated benchmarking code (configurations, plots),
      including the new version of the Grinsztajn et al. benchmark.
    - Updated `fit()` parameters in scikit-learn interfaces, etc.
- v0.0.1: First release for arXiv v1.
  Code and data are archived at [DaRUS](https://doi.org/10.18419/darus-4255).