tabular-transformer
===================

Name: tabular-transformer
Version: 0.3.0
Home page: https://github.com/echosprint/TabularTransformer
Summary: Transformer adapted for tabular data domain
Upload time: 2024-09-24 03:22:23
Author: Qiao Qian
Requires Python: >=3.10
License: MIT
Keywords: artificial intelligence, transformers, attention mechanism, tabular data
Transformer adapted for tabular data domain
===========================================


TabularTransformer is a lightweight, end-to-end deep learning framework built with PyTorch, leveraging the power of the Transformer architecture. It is designed to be scalable and efficient, with the following advantages:

- Streamlined workflow with no need for preprocessing or handling missing values (see the sketch after this list).
- Unleashes the power of the Transformer architecture in the tabular data domain.
- Native GPU support through PyTorch.
- Minimal APIs to get started quickly.
- Capable of handling large-scale data.
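
As a quick illustration of the first point, a CSV with gaps and raw categorical values can be handed to `DataReader` as-is. A minimal sketch: the toy file and its columns are hypothetical, while the `DataReader` call mirrors the Usage section below.

```python
import tabular_transformer as ttf

# write a toy CSV with a missing numerical entry -- no imputation or encoding
with open("toy.csv", "w") as f:
    f.write("color,size,label\n")
    f.write("red,1.0,yes\n")
    f.write("blue,,no\n")  # 'size' is missing here and simply left empty

reader = ttf.DataReader(
    file_path="toy.csv",
    ensure_categorical_cols=['color', 'label'],
    ensure_numerical_cols=['size'],
    label='label',
)
```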


Get Started and Documentation
-----------------------------

Our primary documentation is at https://echosprint.github.io/TabularTransformer/ and is generated from this repository. 

### Installation

```bash
$ pip install tabular-transformer
```

### Usage

Here we use the [Adult Income dataset](https://huggingface.co/datasets/scikit-learn/adult-census-income) to demonstrate the `tabular-transformer` package; for more examples, see the [notebooks](https://github.com/echosprint/TabularTransformer/tree/main/notebooks) folder in this repo.

 <a target="_blank" href="https://colab.research.google.com/github/echosprint/TabularTransformer/blob/main/notebooks/supervised_training.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

```python
import tabular_transformer as ttf
import torch

# download the Adult Income dataset and return its local file path
income_dataset_path = ttf.prepare_income_dataset()

categorical_cols = [
    'workclass', 'education',
    'marital.status', 'occupation',
    'relationship', 'race', 'sex',
    'native.country', 'income']

# all remaining columns are treated as numerical
numerical_cols = []

income_reader = ttf.DataReader(
    file_path=income_dataset_path,
    ensure_categorical_cols=categorical_cols,
    ensure_numerical_cols=numerical_cols,
    label='income',
)

# hold out 20% of rows as the test split; -1 assigns all remaining rows to 'train'
split = income_reader.split_data({'test': 0.2, 'train': -1})

# prefer bfloat16 on GPUs that support it, otherwise fall back to float16
device = 'cuda' if torch.cuda.is_available() else 'cpu'
dtype = 'bfloat16' if torch.cuda.is_available() \
    and torch.cuda.is_bf16_supported() else 'float16'

ts = ttf.TrainSettings(device=device, dtype=dtype)

# single output dimension with binary cross-entropy ('BINCE') for binary classification
tp = ttf.TrainParameters(max_iters=3000, learning_rate=5e-4,
                         output_dim=1, loss_type='BINCE',
                         batch_size=128, eval_interval=100,
                         eval_iters=20, warmup_iters=100,
                         validate_split=0.2)

hp = ttf.HyperParameters(dim=64, n_layers=6)

trainer = ttf.Trainer(hp=hp, ts=ts)

# train on the 'train' split; a checkpoint is saved to out/ckpt.pt
trainer.train(
    data_reader=income_reader(file_path=split['train']),
    tp=tp)

predictor = ttf.Predictor(checkpoint='out/ckpt.pt')

# predict on the held-out test split and save the results as CSV
predictor.predict(
    data_reader=income_reader(file_path=split['test']),
    save_as="prediction_income.csv"
)
```
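
Once prediction finishes, the output is an ordinary CSV; a quick way to preview it (the exact column layout of the file is determined by the package):

```python
import pandas as pd

preds = pd.read_csv("prediction_income.csv")
print(preds.head())  # preview the first few predictions
```
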
Comparison
----------

We used the [Higgs](https://archive.ics.uci.edu/dataset/280/higgs) dataset to conduct our comparison experiment. Details of the data are listed in the following table:

| Training Samples | Features | Test Set Description                 | Task                  |
|------------------|----------|--------------------------------------|-----------------------|
| 10,500,000       | 28       | Last 500,000 samples as the test set | Binary classification |
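
The split in the table can be set up along these lines; a minimal sketch assuming the raw `HIGGS.csv` from the UCI repository, where column 0 is the binary label and columns 1-28 are the features:

```python
import pandas as pd

# HIGGS.csv has 11,000,000 rows and no header row in the raw file
df = pd.read_csv("HIGGS.csv", header=None)

train = df.iloc[:10_500_000]  # first 10.5M rows for training
test = df.iloc[-500_000:]     # last 500k rows as the test set
```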


We computed the AUC metric only on the test set; see the [benchmark source](https://github.com/microsoft/LightGBM/blob/master/docs/Experiments.rst#accuracy) for the baseline setup.

| Data  | Metric | XGBoost | XGBoost_Hist | LightGBM       | TabularTransformer |
|-------|--------|---------|--------------|----------------|--------------------|
| Higgs | AUC    | 0.839593| 0.845314     | 0.845724       | **0.848628**       |

To reproduce these results, see the [source code](https://github.com/echosprint/TabularTransformer/blob/main/notebooks/higgs_classification.ipynb).
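
For reference, AUC itself can be computed with scikit-learn once test-set scores are available; `y_true` and `y_score` below are toy stand-ins for the held-out labels and predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 1, 0, 1])              # ground-truth 0/1 labels
y_score = np.array([0.1, 0.8, 0.65, 0.3, 0.9])  # positive-class scores

print(f"AUC: {roc_auc_score(y_true, y_score):.6f}")
```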

Support
-------

Open **bug reports** and **feature requests** on [GitHub issues](https://github.com/echosprint/TabularTransformer/issues).


Reference Papers
----------------

Xin Huang, Ashish Khetan, Milan Cvitkovic, and Zohar Karnin. "[TabTransformer: Tabular Data Modeling Using Contextual Embeddings](https://arxiv.org/abs/2012.06678)". arXiv, 2020.

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. "[Supervised Contrastive Learning](https://arxiv.org/abs/2004.11362)". arXiv, 2020.

Roman Levin, Valeriia Cherepanova, Avi Schwarzschild, Arpit Bansal, C. Bayan Bruss, Tom Goldstein, Andrew Gordon Wilson, and Micah Goldblum. "[Transfer Learning with Deep Tabular Models](https://arxiv.org/abs/2206.15306)". arXiv, 2022.

License
-------

This project is licensed under the terms of the MIT license. See [LICENSE](https://github.com/echosprint/TabularTransformer/blob/main/LICENSE) for additional details.

Thanks
-------

The prototype of this project is adapted from the Python parts of [Andrej Karpathy](https://x.com/karpathy)'s [llama2.c](https://github.com/karpathy/llama2.c). Andrej is a mentor; we wish him great success with his startup.

            
