# Table Toolkit (tabkit)
A Python library for consistent preprocessing of tabular data. It handles column
type inference, missing-value imputation, feature binning, stratified
splitting/sampling, and more, all in a configuration-driven manner. I built this
toolkit because I needed a reliable, reproducible way to preprocess and cache datasets.
## Installation
Stable release via PyPI:
```bash
pip install table-toolkit
```
Or install the latest development version directly from GitHub:
```bash
pip install git+https://github.com/inwonakng/tabkit.git@main
```
This package requires Python 3.10 or later.
## Quick Start
```python
from tabkit import TableProcessor, DatasetConfig, TableProcessorConfig

# Define your dataset and processing configs
dataset_config = DatasetConfig(
    dataset_name="my_dataset",
    data_source="disk",
    file_path="path/to/your/data.csv",
    file_type="csv",
    label_col="target"
)

processor_config = TableProcessorConfig(
    task_kind="classification",  # or "regression"
    n_splits=5,
    random_state=42
)

# Create processor
processor = TableProcessor(
    dataset_config=dataset_config,
    config=processor_config
)

# Prepare data (this caches results for future runs)
processor.prepare()

# Get splits
X_train, y_train = processor.get_split("train")
X_val, y_val = processor.get_split("val")
X_test, y_test = processor.get_split("test")

# Or get the raw dataframe
df = processor.get("raw_df")
```
**Note:** You can also use plain dictionaries instead of config classes - both work identically! See [Configuration Options](#configuration-options) below.
For more examples, see [examples/basic_usage.py](examples/basic_usage.py).
## Features
- **Automatic type inference**: Detects categorical, continuous, binary, and datetime columns
- **Flexible preprocessing pipelines**: Chain transforms like imputation, encoding, scaling, discretization
- **Smart caching**: Preprocessed data is cached based on a hash of the config - perfect for distributed training (see the sketch after this list)
- **Stratified splitting**: Automatically handles stratified train/val/test splits
- **Reproducible**: Same config always produces same results
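tabkit's cache layout is an internal detail, but the idea behind config-hash caching can be sketched in a few lines. Everything below (`cache_path`, `load_or_process`, the `.cache` directory) is a hypothetical illustration, not tabkit's API:

```python
import hashlib
import json
from pathlib import Path

import pandas as pd

def cache_path(config: dict, cache_dir: str = ".cache") -> Path:
    # Serialize the config deterministically so the same config always
    # hashes to the same cache key.
    key = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()[:16]
    return Path(cache_dir) / f"{key}.parquet"

def load_or_process(config: dict, process) -> pd.DataFrame:
    path = cache_path(config)
    if path.exists():  # cache hit: skip preprocessing entirely
        return pd.read_parquet(path)
    df = process(config)  # cache miss: run the pipeline once
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(path)
    return df
```

Because the key depends only on the config, many workers in a distributed job can share one cache without re-running the pipeline.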
## Configuration Options
Tabkit provides **type-safe configuration classes** with IDE autocomplete and inline documentation. You can also use plain dictionaries if you prefer - both approaches work identically.
### Using Config Classes (Recommended)
```python
from tabkit import DatasetConfig, TableProcessorConfig

# Dataset configuration with type hints and autocomplete
dataset_config = DatasetConfig(
    dataset_name="my_dataset",
    data_source="disk",  # "disk", "openml", "uci", "automm"
    file_path="data.csv",
    file_type="csv",     # "csv" or "parquet"
    label_col="target"
)

# Processor configuration
processor_config = TableProcessorConfig(
    task_kind="classification",  # or "regression"
    random_state=42,
    pipeline=[...],              # Custom pipeline (optional)
    exclude_columns=["id"],      # Columns to exclude (optional)

    # Splitting configuration - see next section
    test_ratio=0.2,  # For ratio-based splitting
    val_ratio=0.1,   # For ratio-based splitting
    # OR
    n_splits=10,     # For K-fold splitting
    split_idx=0      # For K-fold splitting
)
```
For detailed documentation on all available options, see the docstrings in `DatasetConfig` and `TableProcessorConfig`, or check the [config source](src/tabkit/data/data_config.py).
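For example, the docstrings are available directly in an interactive session:

```python
from tabkit import DatasetConfig, TableProcessorConfig

# Print the full docstring and field documentation for each config class.
help(DatasetConfig)
help(TableProcessorConfig)
```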
### Using Plain Dictionaries (Also supported)
```python
# Same functionality, dictionary-based
dataset_config = {
    "dataset_name": "my_dataset",
    "data_source": "disk",
    "file_path": "data.csv",
    "file_type": "csv",
    "label_col": "target"
}

processor_config = {
    "task_kind": "classification",
    "test_ratio": 0.2,
    "val_ratio": 0.1,
    "random_state": 42
}
```
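Since both forms work identically, the dictionaries above drop into `TableProcessor` exactly where the config classes would go:

```python
from tabkit import TableProcessor

processor = TableProcessor(
    dataset_config=dataset_config,  # plain dict from above
    config=processor_config         # plain dict from above
)
processor.prepare()
X_train, y_train = processor.get_split("train")
```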
## Data Splitting Modes
Tabkit supports two distinct approaches for splitting your data into train/validation/test sets. Choose based on your use case:
### Mode 1: Ratio-Based Splitting (Quick & Simple)
**When to use:**
- You want a simple percentage-based split (e.g., 70/15/15)
- You're doing quick prototyping or one-off experiments
- You don't need full dataset coverage
**How it works:**
- Performs a single random stratified split based on specified ratios
- Fast and intuitive
- Different random seeds give different splits, but no systematic coverage
**Example:**
```python
from tabkit import TableProcessorConfig

config = TableProcessorConfig(
    test_ratio=0.2,  # 20% test
    val_ratio=0.1,   # 10% validation
    random_state=42  # the remaining 70% is used for training
)
```
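Conceptually, this mode is a stratified two-stage split. A minimal scikit-learn sketch of the idea (illustrative only; tabkit's internals may differ, and `ratio_split` is a made-up helper):

```python
from sklearn.model_selection import train_test_split

def ratio_split(X, y, test_ratio=0.2, val_ratio=0.1, seed=42):
    # First carve out the test set, stratified on the labels.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_ratio, stratify=y, random_state=seed
    )
    # Then take the validation set out of the remainder. The ratio is
    # rescaled so that val_ratio stays a fraction of the FULL dataset.
    val_frac = val_ratio / (1 - test_ratio)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=val_frac, stratify=y_rest, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```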
### Mode 2: K-Fold Based Splitting (Robust & Reproducible)
**When to use:**
- You need robust cross-validation
- You want to ensure every sample appears in the test set across multiple runs
- You're benchmarking models or doing comprehensive evaluation
**How it works:**
- Uses K-fold cross-validation for systematic data splitting
- By varying `split_idx` from 0 to `n_splits-1`, every sample appears in the test set exactly once
- Provides systematic coverage of your entire dataset
- Default: 10 splits, i.e. a 10% test fold; the training portion is then sub-split again (9 sub-splits), giving roughly 11% validation and 79% train overall
**Example:**
```python
from tabkit import TableProcessorConfig
# Run 1: Use fold 0 as test
config = TableProcessorConfig(n_splits=5, split_idx=0) # 20% test, rest train+val
# Run 2: Use fold 1 as test
config = TableProcessorConfig(n_splits=5, split_idx=1) # Different 20% test
# ... Run 3-5 to cover all data in test set
```
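The fold mechanics can be pictured with scikit-learn's `StratifiedKFold`; the sketch below (`kfold_test_split` is a hypothetical helper, not tabkit's actual code) shows how a `split_idx` selects one fold as the test set:

```python
from sklearn.model_selection import StratifiedKFold

def kfold_test_split(X, y, n_splits=5, split_idx=0, seed=42):
    # Enumerate the folds deterministically, then pick the fold named
    # by split_idx as the held-out test set; the rest is train+val.
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    folds = list(skf.split(X, y))
    rest_idx, test_idx = folds[split_idx]
    return rest_idx, test_idx
```

Sweeping `split_idx` over `range(n_splits)` puts every sample in the test fold exactly once, which is the coverage property described above.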
### Which Mode is Used?
**Priority:** If both `test_ratio` and `val_ratio` are set, ratio-based splitting is used. Otherwise, K-fold splitting is used.
```python
# This uses RATIO mode
config = {"test_ratio": 0.2, "val_ratio": 0.1}
# This uses K-FOLD mode
config = {"n_splits": 10, "split_idx": 0}
# This also uses K-FOLD mode (ratios are None by default)
config = {} # Uses all defaults
```
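The priority rule boils down to a single check; here is a sketch of the assumed dispatch logic (not tabkit's actual code):

```python
def pick_split_mode(config: dict) -> str:
    # Ratio mode wins only when BOTH ratios are set;
    # anything else falls back to K-fold.
    if config.get("test_ratio") is not None and config.get("val_ratio") is not None:
        return "ratio"
    return "kfold"

assert pick_split_mode({"test_ratio": 0.2, "val_ratio": 0.1}) == "ratio"
assert pick_split_mode({"n_splits": 10, "split_idx": 0}) == "kfold"
assert pick_split_mode({}) == "kfold"
```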