# TabML: a Machine Learning pipeline for tabular data
[![Python 3.8](https://img.shields.io/badge/python-3.8-blue.svg)](https://www.python.org/downloads/release/python-380/)
[![Python 3.9](https://img.shields.io/badge/python-3.9-blue.svg)](https://www.python.org/downloads/release/python-390/)
[![tests](https://github.com/tiepvupsu/tabml/actions/workflows/python-package.yml/badge.svg)](https://github.com/tiepvupsu/tabml/actions/workflows/python-package.yml)
[![codecov](https://codecov.io/gh/tiepvupsu/tabml/branch/master/graph/badge.svg?token=4JLG0YYUZU)](https://codecov.io/gh/tiepvupsu/tabml)
- [TabML: a Machine Learning pipeline for tabular data](#tabml-a-machine-learning-pipeline-for-tabular-data)
  - [Introduction](#introduction)
  - [Installation](#installation)
  - [Main components](#main-components)
  - [Examples](#examples)
  - [Setup for development](#setup-for-development)
    - [Add path to this repo](#add-path-to-this-repo)
    - [Create the environment](#create-the-environment)
    - [Check that everthing is working](#check-that-everthing-is-working)
    - [Author's notes](#authors-notes)
      - [How to release a new version](#how-to-release-a-new-version)
    - [Common errors](#common-errors)
## Introduction
This is an active project that aims to create a general machine learning framework for working with tabular data.
Key features:
- Feature extraction is one of the most important tasks when working with tabular data. TabML allows users to define each feature in isolation, without worrying about other features. This reduces code conflicts when several team members develop different features at the same time. In addition, when one feature needs to be updated, unrelated features are left untouched, so the computation cost stays small compared with rerunning a pipeline that regenerates every feature.
- Parameters are specified in a config file. This config file is automatically saved into an experiment folder after each training run for reproducibility.
- Supports multiple ML packages for tabular data:
  - [x] [LightGBM](https://lightgbm.readthedocs.io/en/latest/)
  - [x] [XGBoost](https://xgboost.readthedocs.io/en/latest/)
  - [x] [CatBoost](https://catboost.ai/)
  - [x] Scikit-learn
  - [ ] Keras
  - [ ] PyTorch
  - [ ] TabNet
  - [ ] ...
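
The config snapshot mentioned in the key features can be pictured with a minimal sketch. This is not TabML's actual implementation: the helper name `save_config_snapshot`, the folder naming, and the timestamp format are assumptions made for illustration only.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def save_config_snapshot(config_path: Path, experiments_root: Path) -> Path:
    """Copy a config file into a timestamped experiment folder.

    Hypothetical helper: TabML's real folder layout may differ.
    """
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    experiment_dir = experiments_root / f"{config_path.stem}_{stamp}"
    experiment_dir.mkdir(parents=True, exist_ok=True)
    snapshot_path = experiment_dir / config_path.name
    # copy2 also preserves file metadata such as the modification time.
    shutil.copy2(config_path, snapshot_path)
    return snapshot_path


# Usage with a throwaway directory and a made-up config file.
with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "lgbm_config.yaml"
    config.write_text("learning_rate: 0.1\n")
    snapshot = save_config_snapshot(config, Path(tmp) / "experiments")
    print(snapshot.read_text())
```

Keeping a copy of the exact config next to the training outputs means an experiment can be rerun later even if the original config file has since been edited.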
## Installation
```shell
pip install tabml
```
## Main components
![components](flow.png)
In the TRAINING step:
1. The **FeatureManager** class is responsible for loading raw data and engineering it into relevant features for model training and analysis. If a feature requires a `fit` step, e.g. imputation, the fitted parameters are stored for later use in the `transform` step. This matters in the serving step, where only `transform` is run. Each project has one `feature_manager.py` file that specifies how each feature is computed ([example](https://github.com/tiepvupsu/tabml/blob/master/examples/titanic/feature_manager.py)). The computation order and the feature dependencies are specified in a YAML config file ([example](https://github.com/tiepvupsu/tabml/blob/master/tabml/examples/titanic/configs/feature_config.yaml)).
2. The **DataLoader** loads training and validation data for model training and analysis. In a typical project, TabML already takes care of this class; users only need to specify its configuration in the pipeline config file ([example](https://github.com/tiepvupsu/tabml/blob/95da6aa7f8947329487ff70f189ce213469ebbf1/examples/titanic/configs/lgbm_config.yaml#L2-L19)). That file must specify the features and the label used for training. In addition, a set of boolean features serves as conditions for selecting training and validation data: only rows that meet all training (or validation) conditions are selected.
3. The **ModelWrapper** class defines the model, how to train it, and the methods for loading the model and making predictions.
4. The **ModelAnalysis** class evaluates the model on different metrics along user-defined dimensions. Analyzing metrics on different slices of the data can reveal whether the trained model is biased toward some feature value, or whether there is a slice on which model performance could be improved.
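
The row selection done by the DataLoader and the per-slice metrics produced by ModelAnalysis can be illustrated with a small, library-free sketch. It does not use TabML's API: the functions `select_rows` and `accuracy_by_slice`, the column names, and the toy rows are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, object]


def select_rows(rows: List[Row], conditions: List[str]) -> List[Row]:
    """Keep only rows for which every boolean condition column is True."""
    return [row for row in rows if all(row[name] for name in conditions)]


def accuracy_by_slice(
    rows: List[Row], slice_by: str, label: str, prediction: str
) -> Dict[object, float]:
    """Compute accuracy separately for each value of the `slice_by` column."""
    hits: Dict[object, int] = defaultdict(int)
    totals: Dict[object, int] = defaultdict(int)
    for row in rows:
        key = row[slice_by]
        totals[key] += 1
        hits[key] += int(row[label] == row[prediction])
    return {key: hits[key] / totals[key] for key in totals}


# Made-up data: two boolean condition columns, a label, and a prediction.
rows = [
    {"sex": "female", "is_validation": True, "has_age": True, "survived": 1, "predicted": 1},
    {"sex": "female", "is_validation": True, "has_age": True, "survived": 0, "predicted": 1},
    {"sex": "male", "is_validation": True, "has_age": True, "survived": 0, "predicted": 0},
    {"sex": "male", "is_validation": True, "has_age": False, "survived": 1, "predicted": 0},
    {"sex": "male", "is_validation": False, "has_age": True, "survived": 1, "predicted": 1},
]

# Only the first three rows satisfy both conditions.
validation_rows = select_rows(rows, ["is_validation", "has_age"])
print(accuracy_by_slice(validation_rows, "sex", "survived", "predicted"))
# {'female': 0.5, 'male': 1.0}
```

In this toy data, the gap between the two slices is the kind of signal a sliced analysis is meant to surface: an overall accuracy figure alone would hide it.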
In the SERVING step, raw data is fed into the *fitted* FeatureManager to get the transformed features that the trained model can use. The model then makes predictions on the transformed features.
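
The split between `fit` at training time and `transform` at serving time can be sketched with a mean imputer. This is a generic illustration of the pattern, not TabML's FeatureManager: the class `MeanImputer` and the sample values are assumptions.

```python
from statistics import mean
from typing import List, Optional


class MeanImputer:
    """Fill missing values (None) with the mean seen during fitting."""

    def __init__(self) -> None:
        self.fill_value: Optional[float] = None

    def fit(self, values: List[Optional[float]]) -> "MeanImputer":
        observed = [v for v in values if v is not None]
        if not observed:
            raise ValueError("cannot fit on a column with no observed values")
        # The fitted parameter is stored so that serving can reuse it.
        self.fill_value = mean(observed)
        return self

    def transform(self, values: List[Optional[float]]) -> List[float]:
        if self.fill_value is None:
            raise RuntimeError("call fit() before transform()")
        return [self.fill_value if v is None else v for v in values]


# Training: fit on the training column (the mean of 20.0 and 40.0 is 30.0).
imputer = MeanImputer().fit([20.0, None, 40.0])

# Serving: only transform is run, using the stored training mean.
serving = imputer.transform([None, 25.0])
print(serving)  # [30.0, 25.0]
```

Serving data is never used to re-estimate the fill value, so a feature is computed the same way at prediction time as it was during training.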
## Examples
Please check the [`examples`](https://github.com/tiepvupsu/tabml/tree/master/tabml/examples) folder for several example projects. For each project:
```bash
python feature_manager.py # to generate features
python pipelines.py # to train the model
```
You can change some parameters in the config file, then run `python pipelines.py` again.
In most projects, users only need to focus their efforts on designing features. The feature dependencies are defined in a YAML config file, and the feature implementations are stored in `feature_manager.py`.
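
To make the role of the dependency config concrete, here is a minimal sketch of how a dependency map can yield a computation order and the set of features to regenerate after a change. The `DEPENDENCIES` mapping, the feature names, and both functions are hypothetical and do not reflect TabML's config schema or internals.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: feature -> features it is computed from.
DEPENDENCIES: Dict[str, List[str]] = {
    "age": [],
    "fare": [],
    "imputed_age": ["age"],
    "bucketized_age": ["imputed_age"],
    "log_fare": ["fare"],
}


def computation_order(dependencies: Dict[str, List[str]]) -> List[str]:
    """Return features ordered so that each one comes after its dependencies."""
    order: List[str] = []
    visiting: Set[str] = set()
    done: Set[str] = set()

    def visit(feature: str) -> None:
        if feature in done:
            return
        if feature in visiting:
            raise ValueError(f"cyclic dependency involving {feature!r}")
        visiting.add(feature)
        for parent in dependencies[feature]:
            visit(parent)
        visiting.discard(feature)
        done.add(feature)
        order.append(feature)

    for feature in dependencies:
        visit(feature)
    return order


def features_to_recompute(dependencies: Dict[str, List[str]], changed: str) -> List[str]:
    """Return the changed feature and everything downstream of it, in order."""
    affected = {changed}
    result: List[str] = []
    for feature in computation_order(dependencies):
        if feature in affected or any(p in affected for p in dependencies[feature]):
            affected.add(feature)
            result.append(feature)
    return result


print(computation_order(DEPENDENCIES))
# ['age', 'fare', 'imputed_age', 'bucketized_age', 'log_fare']
print(features_to_recompute(DEPENDENCIES, "imputed_age"))
# ['imputed_age', 'bucketized_age']
```

In this sketch, changing `imputed_age` leaves `fare` and `log_fare` untouched, which is the cost saving that isolated feature definitions aim for.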
## Setup for development
### Add path to this repo
Add the following lines to your shell config file (`~/.bashrc`, `~/.zshrc` or any shell config file of
your choice):
```shell
export TABML=<local_path_to_this_git_repo>
alias 2tabml='cd $TABML; source bashrc; source tabml_env/bin/activate; python3 setup.py install'
```
### Create the environment
```shell
cd $TABML
python3 -m venv tabml_env
source tabml_env/bin/activate
pip3 install -r requirements.txt
```
Set up [pre-commit](https://pre-commit.com/) to auto-format code when creating a git
commit:
```shell
pre-commit install
```
### Check that everthing is working
Run the test suite:
```shell
2tabml
python3 -m pytest ./tests ./examples
```
### Author's notes
#### How to release a new version
1. Increase `version` in `setup.py` as in [this PR example](https://github.com/tiepvupsu/tabml/pull/220).
2. Generate tar file:
```shell
python setup.py sdist
```
3. Upload tar file:
```shell
twine upload dist/tabml-x.x.xx.tar.gz
```
### Common errors
1. SHAP
SHAP might not work on macOS if the Xcode version is below 13. Try upgrading to Xcode 13. [Related issue](https://github.com/slundberg/shap/issues/1386).
2. LightGBM
`pip install lightgbm` might not work on macOS. Try following the [official installation guide for macOS](https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#macos).
---
If you find a bug or want to request a feature, feel free to create an issue. Any Pull Request would be much appreciated.
Raw data
{
"_id": null,
"home_page": "https://github.com/tiepvupsu/tabml",
"name": "tabml",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "Machine Learning,Tabular",
"author": "Tiep Vu",
"author_email": "vuhuutiep@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/13/64/1b7e678803f1691f613caa9a958c1171ad470adb74463c4b9559a6e1ea1b/tabml-0.2.9.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "apache-2.0",
"summary": "A package for machine learning with tabular data",
"version": "0.2.9",
"split_keywords": [
"machine learning",
"tabular"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "13641b7e678803f1691f613caa9a958c1171ad470adb74463c4b9559a6e1ea1b",
"md5": "a07d6411b588348e4be86f02772a3c42",
"sha256": "385b2bd366c4735d809e5967bb2df349ac6bf9520d96232f6516ffcf4a0bf81d"
},
"downloads": -1,
"filename": "tabml-0.2.9.tar.gz",
"has_sig": false,
"md5_digest": "a07d6411b588348e4be86f02772a3c42",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 39877,
"upload_time": "2023-01-09T18:51:57",
"upload_time_iso_8601": "2023-01-09T18:51:57.606583Z",
"url": "https://files.pythonhosted.org/packages/13/64/1b7e678803f1691f613caa9a958c1171ad470adb74463c4b9559a6e1ea1b/tabml-0.2.9.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-01-09 18:51:57",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "tiepvupsu",
"github_project": "tabml",
"travis_ci": false,
"coveralls": true,
"github_actions": true,
"requirements": [
{
"name": "catboost",
"specs": [
[
"==",
"1.1"
]
]
},
{
"name": "flake8",
"specs": [
[
"==",
"3.9.0"
]
]
},
{
"name": "gputil",
"specs": [
[
"==",
"1.4.0"
]
]
},
{
"name": "isort",
"specs": [
[
"==",
"4.3.21"
]
]
},
{
"name": "lightgbm",
"specs": [
[
"==",
"2.3.1"
]
]
},
{
"name": "loguru",
"specs": [
[
"==",
"0.5.1"
]
]
},
{
"name": "mypy",
"specs": [
[
"==",
"0.910"
]
]
},
{
"name": "numpy",
"specs": [
[
"==",
"1.23.0"
]
]
},
{
"name": "pandas",
"specs": [
[
"==",
"1.4.3"
]
]
},
{
"name": "pandas-profiling",
"specs": [
[
"==",
"2.9.0"
]
]
},
{
"name": "pre-commit",
"specs": [
[
"==",
"2.2.0"
]
]
},
{
"name": "protobuf",
"specs": [
[
"==",
"3.20.1"
]
]
},
{
"name": "pydantic",
"specs": [
[
"==",
"1.8.2"
]
]
},
{
"name": "pytest",
"specs": [
[
"==",
"5.3.5"
]
]
},
{
"name": "pyyaml",
"specs": [
[
"==",
"6.0"
]
]
},
{
"name": "scikit-learn",
"specs": [
[
"==",
"1.1.2"
]
]
},
{
"name": "scipy",
"specs": [
[
"==",
"1.9.3"
]
]
},
{
"name": "shap",
"specs": [
[
"==",
"0.39.0"
]
]
},
{
"name": "termgraph",
"specs": [
[
"==",
"0.4.2"
]
]
},
{
"name": "types-six",
"specs": [
[
"==",
"1.16.1"
]
]
},
{
"name": "xgboost",
"specs": [
[
"==",
"1.7.1"
]
]
}
],
"lcname": "tabml"
}