ts3l

- Name: ts3l
- Version: 0.30
- Summary: A PyTorch Lightning-based library for self- and semi-supervised learning on tabular data.
- Home page: https://github.com/Alcoholrithm/TabularS3L
- Author: Minwook Kim
- Requires Python: >=3.7
- Keywords: tabular-data, semi-supervised-learning, self-supervised-learning, vime, subtab, scarf, denoisingautoencoder
- Upload time: 2024-04-22 05:19:01

# TabularS3L

[**Overview**](#tabulars3l)
| [**Installation**](#installation)
| [**Available Models with Quick Start Guides**](#available-models-with-quick-start)
| [**Benchmark**](#benchmark)
| [**To Do**](#to-do)
| [**Contributing**](#contributing)
| [**Credit**](#credit)


[![pypi](https://img.shields.io/pypi/v/ts3l)](https://pypi.org/project/ts3l/0.20/)
[![DOI](https://zenodo.org/badge/756740921.svg)](https://zenodo.org/doi/10.5281/zenodo.10776537)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

TabularS3L is a PyTorch Lightning-based library designed to facilitate self- and semi-supervised learning with tabular data. While numerous self- and semi-supervised learning tabular models have been proposed, there is a lack of a comprehensive library that addresses the needs of tabular practitioners. This library aims to fill this gap by providing a unified PyTorch Lightning-based framework for exploring and deploying such models.

## Installation
We provide the Python package `ts3l` for users who want to apply self- and semi-supervised learning models to tabular data.

```sh
pip install ts3l
```

## Available Models with Quick Start

TabularS3L employs a two-phase learning approach, where the learning strategies differ between phases. Below is an overview of the models available within TabularS3L, highlighting the learning strategies employed in each phase. The abbreviations 'Self-SL', 'Semi-SL', and 'SL' represent self-supervised learning, semi-supervised learning, and supervised learning, respectively.

| Model | First Phase | Second Phase |
|:---:|:---:|:---:|
| **DAE** ([GitHub](https://github.com/ryancheunggit/tabular_dae))| Self-SL | SL |
| **VIME** ([NeurIPS'20](https://proceedings.neurips.cc/paper/2020/hash/7d97667a3e056acab9aaf653807b4a03-Abstract.html)) | Self-SL | Semi-SL or SL |
| **SubTab** ([NeurIPS'21](https://proceedings.neurips.cc/paper/2021/hash/9c8661befae6dbcd08304dbf4dcaf0db-Abstract.html)) | Self-SL | SL |
| **SCARF** ([ICLR'22](https://iclr.cc/virtual/2022/spotlight/6297))| Self-SL | SL |
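
All of these models share the same two-phase workflow in TabularS3L: fit the first-phase (pretext) objective, call `set_second_phase()`, and then fit the second-phase objective. The following is a minimal sketch of that loop; the helper function and its arguments are illustrative, and the per-model quick starts below show how to construct the modules and data modules.

```python
# A minimal, illustrative sketch of the shared two-phase training loop.
# `pl_module` is any ts3l LightningModule (e.g. DAELightning) and the two
# data modules are TS3LDataModule instances built as in the quick starts below.
from pytorch_lightning import Trainer

def run_two_phase(pl_module, first_phase_dm, second_phase_dm, max_epochs=20):
    # Phase 1: self-supervised pretext training
    Trainer(accelerator="cpu", max_epochs=max_epochs).fit(pl_module, first_phase_dm)

    # Phase 2: switch to the prediction head and fine-tune with labels
    pl_module.set_second_phase()
    trainer = Trainer(accelerator="cpu", max_epochs=max_epochs)
    trainer.fit(pl_module, second_phase_dm)
    return trainer
```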

#### Denoising AutoEncoder (DAE)
DAE takes partially corrupted input data and, during self-supervised learning, learns to reconstruct the clean data while predicting which features were corrupted.
The denoising task enables the model to learn the input distribution and generate latent representations that are robust to corruption. 
These latent representations can be utilized for a variety of downstream tasks.
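
For intuition, here is a minimal sketch of a DAE-style pretext objective. It only illustrates the idea; it is not the loss implemented inside ts3l, and the tensor names are assumptions.

```python
# Illustrative DAE-style pretext objective (not ts3l's internal implementation):
# reconstruct the clean features and predict which features were corrupted.
import torch.nn.functional as F

def dae_pretext_loss(x_clean, x_recon, mask_true, mask_logits):
    # x_clean, x_recon: (batch, n_features) original and reconstructed features
    # mask_true: (batch, n_features), 1.0 where a feature was corrupted, else 0.0
    # mask_logits: (batch, n_features) corruption-mask predictions (logits)
    recon_loss = F.mse_loss(x_recon, x_clean)
    mask_loss = F.binary_cross_entropy_with_logits(mask_logits, mask_true)
    return recon_loss + mask_loss
```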

<details close>
  <summary>Quick Start</summary>
  
  ```python
  # Assume that we have X_train, X_valid, X_test, y_train, y_valid, y_test, category_cols, and continuous_cols

  # Prepare the DAELightning Module
  from ts3l.pl_modules import DAELightning
  from ts3l.utils.dae_utils import DAEDataset, DAECollateFN
  from ts3l.utils import TS3LDataModule
  from ts3l.utils.dae_utils import DAEConfig
  from pytorch_lightning import Trainer
  from sklearn.model_selection import train_test_split
  
  metric = "accuracy_score"
  input_dim = X_train.shape[1]
  hidden_dim = 1024
  output_dim = 2
  encoder_depth=4
  head_depth = 2
  noise_type = "Swap"
  noise_ratio = 0.3
  
  max_epochs = 20
  batch_size = 128
  
  X_train, X_unlabeled, y_train, _ = train_test_split(X_train, y_train, train_size = 0.1, random_state=0, stratify=y_train)
  
  config = DAEConfig(
      task="classification", loss_fn="CrossEntropyLoss", metric=metric, metric_hparams={},
      input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim,
      encoder_depth=encoder_depth, head_depth=head_depth,
      noise_type=noise_type, noise_ratio=noise_ratio,
      num_categoricals=len(category_cols), num_continuous=len(continuous_cols),
  )
  
  pl_dae = DAELightning(config)
  
  ### First Phase Learning
  train_ds = DAEDataset(X = X_train, unlabeled_data = X_unlabeled, continuous_cols = continuous_cols, category_cols = category_cols)
  valid_ds = DAEDataset(X = X_valid, continuous_cols = continuous_cols, category_cols = category_cols)
  
  datamodule = TS3LDataModule(train_ds, valid_ds, batch_size, train_sampler='random', train_collate_fn=DAECollateFN(config), valid_collate_fn=DAECollateFN(config))
  
  trainer = Trainer(
                      accelerator = 'cpu',
                      max_epochs = max_epochs,
                      num_sanity_val_steps = 2,
      )
  
  trainer.fit(pl_dae, datamodule)
  
  ### Second Phase Learning
  
  pl_dae.set_second_phase()
  
  train_ds = DAEDataset(X = X_train, Y = y_train.values, unlabeled_data=X_unlabeled, continuous_cols=continuous_cols, category_cols=category_cols)
  valid_ds = DAEDataset(X = X_valid, Y = y_valid.values, continuous_cols=continuous_cols, category_cols=category_cols)
          
  datamodule = TS3LDataModule(train_ds, valid_ds, batch_size = batch_size, train_sampler="weighted")
  
  trainer = Trainer(
                      accelerator = 'cpu',
                      max_epochs = max_epochs,
                      num_sanity_val_steps = 2,
      )
  
  trainer.fit(pl_dae, datamodule)
  
  # Evaluation
  from sklearn.metrics import accuracy_score
  import torch
  from torch.nn import functional as F
  from torch.utils.data import DataLoader, SequentialSampler
  
  test_ds = DAEDataset(X_test, category_cols=category_cols, continuous_cols=continuous_cols)
  test_dl = DataLoader(test_ds, batch_size, shuffle=False, sampler = SequentialSampler(test_ds))
  
  preds = trainer.predict(pl_dae, test_dl)
          
  preds = F.softmax(torch.concat([out.cpu() for out in preds]).squeeze(),dim=1)
  
  accuracy = accuracy_score(y_test, preds.argmax(1))
  
  print("Accuracy %.2f" % accuracy)
  ```

</details>

#### VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain
VIME enhances tabular data learning through a dual approach. In its first phase, it uses a pretext task of estimating mask vectors from corrupted tabular data, together with a reconstruction pretext task, for self-supervised learning. The second phase leverages consistency regularization on unlabeled data.
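
For intuition, a minimal sketch of these two kinds of objectives follows. It only illustrates the idea and is not ts3l's internal implementation; the hyperparameter names mirror the quick start below.

```python
# Illustrative VIME-style objectives (not ts3l's internal implementation).
import torch.nn.functional as F

def vime_self_loss(x, x_recon, mask, mask_logits, alpha=2.0):
    # First phase: estimate the corruption mask and reconstruct the original features.
    mask_loss = F.binary_cross_entropy_with_logits(mask_logits, mask)
    recon_loss = F.mse_loss(x_recon, x)
    return mask_loss + alpha * recon_loss

def vime_consistency_loss(view_predictions):
    # Second phase: predictions on K corrupted views of the same unlabeled rows
    # should agree, i.e. have low variance across views.
    # view_predictions: (K, batch, n_classes)
    mean_prediction = view_predictions.mean(dim=0, keepdim=True)
    return ((view_predictions - mean_prediction) ** 2).mean()
```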

<details close>
  <summary>Quick Start</summary>
  
  ```python
  # Assume that we have X_train, X_valid, X_test, y_train, y_valid, y_test, category_cols, and continuous_cols

  # Prepare the VIMELightning Module
  from ts3l.pl_modules import VIMELightning
  from ts3l.utils.vime_utils import VIMEDataset
  from ts3l.utils import TS3LDataModule
  from ts3l.utils.vime_utils import VIMEConfig
  from pytorch_lightning import Trainer
  from sklearn.model_selection import train_test_split

  metric = "accuracy_score"
  input_dim = X_train.shape[1]
  hidden_dim = 1024
  output_dim = 2
  alpha1 = 2.0
  alpha2 = 2.0
  beta = 1.0
  K = 3
  p_m = 0.2

  batch_size = 128

  X_train, X_unlabeled, y_train, _ = train_test_split(X_train, y_train, train_size = 0.1, random_state=0, stratify=y_train)

  config = VIMEConfig(
      task="classification", loss_fn="CrossEntropyLoss", metric=metric, metric_hparams={},
      input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim,
      alpha1=alpha1, alpha2=alpha2, beta=beta, K=K, p_m=p_m,
      num_categoricals=len(category_cols), num_continuous=len(continuous_cols),
  )

  pl_vime = VIMELightning(config)

  ### First Phase Learning
  train_ds = VIMEDataset(X = X_train, unlabeled_data = X_unlabeled, config=config, continuous_cols = continuous_cols, category_cols = category_cols)
  valid_ds = VIMEDataset(X = X_valid, config=config, continuous_cols = continuous_cols, category_cols = category_cols)

  datamodule = TS3LDataModule(train_ds, valid_ds, batch_size, train_sampler='random')

  trainer = Trainer(
                      accelerator = 'cpu',
                      max_epochs = 20,
                      num_sanity_val_steps = 2,
      )

  trainer.fit(pl_vime, datamodule)

  ### Second Phase Learning
  from ts3l.utils.vime_utils import VIMESemiSLCollateFN

  pl_vime.set_second_phase()

  train_ds = VIMEDataset(X_train, y_train.values, config, unlabeled_data=X_unlabeled, continuous_cols=continuous_cols, category_cols=category_cols, is_second_phase=True)
  valid_ds = VIMEDataset(X_valid, y_valid.values, config, continuous_cols=continuous_cols, category_cols=category_cols, is_second_phase=True)
          
  datamodule = TS3LDataModule(train_ds, valid_ds, batch_size = batch_size, train_sampler="weighted", train_collate_fn=VIMESemiSLCollateFN())

  trainer.fit(pl_vime, datamodule)

  # Evaluation
  from sklearn.metrics import accuracy_score
  import torch
  from torch.nn import functional as F
  from torch.utils.data import DataLoader, SequentialSampler

  test_ds = VIMEDataset(X_test, category_cols=category_cols, continuous_cols=continuous_cols, is_second_phase=True)
  test_dl = DataLoader(test_ds, batch_size, shuffle=False, sampler = SequentialSampler(test_ds))

  preds = trainer.predict(pl_vime, test_dl)
          
  preds = F.softmax(torch.concat([out.cpu() for out in preds]).squeeze(),dim=1)

  accuracy = accuracy_score(y_test, preds.argmax(1))

  print("Accuracy %.2f" % accuracy)
  ```

</details>


#### SubTab: Subsetting Features of Tabular Data for Self-Supervised Representation Learning
SubTab recasts learning from tabular data as a multi-view representation learning problem by dividing the input features into multiple subsets during its first phase. During the second phase, collaborative inference derives a joint representation by aggregating latent variables across subsets. This approach improves the model's performance in supervised learning tasks.
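
For intuition, the sketch below builds overlapping feature subsets and aggregates the per-subset latents by averaging. It is a simplified illustration; the exact subsetting scheme inside ts3l may differ.

```python
# Illustrative SubTab-style feature subsetting and collaborative inference
# (not ts3l's internal implementation; the subsetting scheme is simplified).
import torch

def overlapping_feature_subsets(n_features, n_subsets=4, overlap_ratio=0.75):
    # Each subset is a contiguous block plus an overlap into its neighbour
    # (clipped at the edges in this simplified version).
    block = n_features // n_subsets
    overlap = int(block * overlap_ratio)
    subsets = []
    for i in range(n_subsets):
        start = i * block
        end = min(n_features, start + block + overlap)
        subsets.append(list(range(start, end)))
    return subsets

def collaborative_inference(encoder, x, subsets):
    # Joint representation = mean of the latent variables across feature subsets.
    latents = [encoder(x[:, idx]) for idx in subsets]
    return torch.stack(latents).mean(dim=0)
```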

<details close>
  <summary>Quick Start</summary>
  
  ```python
  # Assume that we have X_train, X_valid, X_test, y_train, y_valid, y_test, category_cols, and continuous_cols

  # Prepare the SubTabLightning Module
  from ts3l.pl_modules import SubTabLightning
  from ts3l.utils.subtab_utils import SubTabDataset, SubTabCollateFN
  from ts3l.utils import TS3LDataModule
  from ts3l.utils.subtab_utils import SubTabConfig
  from pytorch_lightning import Trainer
  from sklearn.model_selection import train_test_split

  metric = "accuracy_score"
  input_dim = X_train.shape[1]
  hidden_dim = 1024
  output_dim = 2
  tau = 1.0
  use_cosine_similarity = True
  use_contrastive = True
  use_distance = True
  n_subsets = 4
  overlap_ratio = 0.75

  mask_ratio = 0.1
  noise_type = "Swap"
  noise_level = 0.1

  batch_size = 128
  max_epochs = 3

  X_train, X_unlabeled, y_train, _ = train_test_split(X_train, y_train, train_size = 0.1, random_state=0, stratify=y_train)

  config = SubTabConfig(
      task="classification", loss_fn="CrossEntropyLoss", metric=metric, metric_hparams={},
      input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim,
      tau=tau, use_cosine_similarity=use_cosine_similarity, use_contrastive=use_contrastive, use_distance=use_distance,
      n_subsets=n_subsets, overlap_ratio=overlap_ratio,
      mask_ratio=mask_ratio, noise_type=noise_type, noise_level=noise_level,
  )

  pl_subtab = SubTabLightning(config)

  ### First Phase Learning
  train_ds = SubTabDataset(X_train, unlabeled_data=X_unlabeled)
  valid_ds = SubTabDataset(X_valid)

  datamodule = TS3LDataModule(train_ds, valid_ds, batch_size, train_sampler='random', train_collate_fn=SubTabCollateFN(config), valid_collate_fn=SubTabCollateFN(config), n_jobs = 4)

  trainer = Trainer(
                      accelerator = 'cpu',
                      max_epochs = max_epochs,
                      num_sanity_val_steps = 2,
      )

  trainer.fit(pl_subtab, datamodule)

  ### Second Phase Learning

  pl_subtab.set_second_phase()

  train_ds = SubTabDataset(X_train, y_train.values)
  valid_ds = SubTabDataset(X_valid, y_valid.values)

  datamodule = TS3LDataModule(train_ds, valid_ds, batch_size = batch_size, train_sampler="weighted", train_collate_fn=SubTabCollateFN(config), valid_collate_fn=SubTabCollateFN(config))

  trainer.fit(pl_subtab, datamodule)

  # Evaluation
  from sklearn.metrics import accuracy_score
  import torch
  from torch.nn import functional as F
  from torch.utils.data import DataLoader, SequentialSampler

  test_ds = SubTabDataset(X_test)
  test_dl = DataLoader(test_ds, batch_size, shuffle=False, sampler = SequentialSampler(test_ds), num_workers=4, collate_fn=SubTabCollateFN(config))

  preds = trainer.predict(pl_subtab, test_dl)
          
  preds = F.softmax(torch.concat([out.cpu() for out in preds]).squeeze(),dim=1)

  accuracy = accuracy_score(y_test, preds.argmax(1))

  print("Accuracy %.2f" % accuracy)
  ```

</details>

#### SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption
SCARF introduces a contrastive learning framework specifically tailored for tabular data. By corrupting random subsets of features, SCARF creates diverse views for self-supervised learning in its first phase. The subsequent phase transitions to supervised learning, utilizing a pretrained encoder to enhance model accuracy and robustness.
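
The corruption step is simple to picture: a random subset of each row's features is replaced with values drawn from those features' empirical marginal distributions, and the original and corrupted rows form the two views for the contrastive loss. A minimal NumPy sketch (illustrative only, not ts3l's internal code):

```python
# Illustrative SCARF-style corruption (not ts3l's internal implementation):
# replace a random subset of features in each row with values sampled from
# the corresponding feature's empirical marginal distribution.
import numpy as np

def scarf_corrupt(x_batch, x_reference, corruption_rate=0.6, seed=0):
    rng = np.random.default_rng(seed)
    x_corrupted = x_batch.copy()
    n_rows, n_features = x_batch.shape
    n_corrupt = int(n_features * corruption_rate)
    for i in range(n_rows):
        cols = rng.choice(n_features, size=n_corrupt, replace=False)
        donors = rng.integers(0, len(x_reference), size=n_corrupt)
        x_corrupted[i, cols] = x_reference[donors, cols]
    return x_corrupted  # (x_batch, x_corrupted) are the two contrastive views
```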

<details close>
  <summary>Quick Start</summary>
  
  ```python
  # Assume that we have X_train, X_valid, X_test, y_train, y_valid, y_test, category_cols, and continuous_cols

  # Prepare the SCARFLightning Module
  from ts3l.pl_modules import SCARFLightning
  from ts3l.utils.scarf_utils import SCARFDataset
  from ts3l.utils import TS3LDataModule
  from ts3l.utils.scarf_utils import SCARFConfig
  from pytorch_lightning import Trainer
  from sklearn.model_selection import train_test_split

  metric = "accuracy_score"
  input_dim = X_train.shape[1]
  hidden_dim = 1024
  output_dim = 2
  encoder_depth = 3
  head_depth = 1
  dropout_rate = 0.04

  corruption_rate = 0.6

  batch_size = 128
  max_epochs = 10

  X_train, X_unlabeled, y_train, _ = train_test_split(X_train, y_train, train_size = 0.1, random_state=0, stratify=y_train)

  config = SCARFConfig(
      task="classification", loss_fn="CrossEntropyLoss", metric=metric, metric_hparams={},
      input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim,
      encoder_depth=encoder_depth, head_depth=head_depth,
      dropout_rate=dropout_rate, corruption_rate=corruption_rate,
  )

  pl_scarf = SCARFLightning(config)

  ### First Phase Learning
  train_ds = SCARFDataset(X_train, unlabeled_data=X_unlabeled, config = config)
  valid_ds = SCARFDataset(X_valid, config=config)

  datamodule = TS3LDataModule(train_ds, valid_ds, batch_size=batch_size, train_sampler="random")

  trainer = Trainer(
                      accelerator = 'cpu',
                      max_epochs = max_epochs,
                      num_sanity_val_steps = 2,
      )

  trainer.fit(pl_scarf, datamodule)

  ### Second Phase Learning

  pl_scarf.set_second_phase()

  train_ds = SCARFDataset(X_train, y_train.values, is_second_phase=True)
  valid_ds = SCARFDataset(X_valid, y_valid.values, is_second_phase=True)

  datamodule = TS3LDataModule(train_ds, valid_ds, batch_size = batch_size, train_sampler="weighted")

  trainer.fit(pl_scarf, datamodule)

  # Evaluation
  from sklearn.metrics import accuracy_score
  import torch
  from torch.nn import functional as F
  from torch.utils.data import DataLoader, SequentialSampler

  test_ds = SCARFDataset(X_test, is_second_phase=True)
  test_dl = DataLoader(test_ds, batch_size, shuffle=False, sampler = SequentialSampler(test_ds), num_workers=4)

  preds = trainer.predict(pl_scarf, test_dl)
          
  preds = F.softmax(torch.concat([out.cpu() for out in preds]).squeeze(),dim=1)

  accuracy = accuracy_score(y_test, preds.argmax(1))

  print("Accuracy %.2f" % accuracy)
  ```

</details>


## Benchmark

We provide a simple benchmark of TabularS3L against XGBoost. The train-validation-test ratio is 6:2:2, and each model was tuned over 50 trials with Optuna. Results are averaged over random seeds 0 through 4, and the best result in each column is shown in bold. 'acc', 'b-acc', and 'mse' denote accuracy, balanced accuracy, and mean squared error, respectively.

Use this benchmark for reference only, as only a small number of random seeds were used.
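
For reference, a minimal sketch of the split and seed-averaging protocol, assuming scikit-learn; the function names are illustrative and this is not the exact benchmark script.

```python
# Illustrative 6:2:2 train/validation/test split and seed averaging
# (names are illustrative; this is not the exact benchmark script).
from sklearn.model_selection import train_test_split

def split_622(X, y, seed):
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.6, random_state=seed)
    X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, train_size=0.5, random_state=seed)
    return X_train, X_valid, X_test, y_train, y_valid, y_test

# scores = [evaluate_model(*split_622(X, y, seed)) for seed in range(5)]  # averaged over seeds 0-4
```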

##### 10% labeled samples 

| Model | diabetes (acc) | cmc (b-acc) | abalone (mse) |
|:---:|:---:|:---:|:---:|
| XGBoost | 0.7325 | 0.4763 | **5.5739** |
| DAE | 0.7208 | 0.4885 | 5.6168 | 
| VIME | 0.7182 | **0.5087** | 5.6637 |
| SubTab | 0.7312 | 0.4930 | 7.2418 |
| SCARF | **0.7416** | 0.4710 | 5.8888 | 

--------

##### 100% labeled samples

| Model | diabetes (acc) | cmc (b-acc) | abalone (mse) |
|:---:|:---:|:---:|:---:|
| XGBoost | 0.7234 | 0.5291 | 4.8377 |
| DAE | 0.7390 | 0.5500 | 4.5758 |
| VIME | **0.7688** | 0.5477 | 4.5804 |
| SubTab | 0.7390 | 0.5432 | 6.3104 |
| SCARF | 0.7442 | **0.5521** | **4.4443** |

## To Do

- [x] Release nn.Module and Dataset of VIME, SubTab, and SCARF
  - [x] VIME
  - [x] SubTab
  - [x] SCARF
- [x] Release LightningModules of VIME, SubTab, and SCARF
  - [x] VIME
  - [x] SubTab
  - [x] SCARF
- [x] Release Denoising AutoEncoder
  - [x] nn.Module
  - [x] LightningModule
- [ ] Release SwitchTab
  - [ ] nn.Module
  - [ ] LightningModule
- [x] Add example codes

## Contributing

Contributions to this implementation are highly appreciated. Whether it's suggesting improvements, reporting bugs, or proposing new features, feel free to open an issue or submit a pull request.


## Credit  
```bibtex
@software{alcoholrithm_2024_10776538,
  author       = {Minwook Kim},
  title        = {TabularS3L},
  month        = mar,
  year         = 2024,
  publisher    = {Zenodo},
  version      = {v0.21},
  doi          = {10.5281/zenodo.10776538},
  url          = {https://doi.org/10.5281/zenodo.10776538}
}
```


            
