# Table Toolkit (tabkit)
A Python library for consistent preprocessing of tabular data. It handles column
type inference, missing-value imputation, feature binning, stratified
splitting/sampling, and more, all in a configuration-driven manner. I built this
toolkit because I needed a reliable, reproducible way to preprocess and cache datasets.
## Installation
Stable release via PyPI:
```bash
pip install table-toolkit
```
Or install the latest development version directly from GitHub:
```bash
pip install git+https://github.com/inwonakng/tabkit.git@main
```
This package requires Python 3.10 or later.
## Quick Start
```python
from tabkit import TableProcessor, DatasetConfig, TableProcessorConfig

# Define your dataset and processing configs
dataset_config = DatasetConfig(
    dataset_name="my_dataset",
    data_source="disk",
    file_path="path/to/your/data.csv",
    file_type="csv",
    label_col="target"
)

processor_config = TableProcessorConfig(
    task_kind="classification",  # or "regression"
    n_splits=5,
    random_state=42
)

# Create processor
processor = TableProcessor(
    dataset_config=dataset_config,
    config=processor_config
)

# Prepare data (this caches results for future runs)
processor.prepare()

# Get splits
X_train, y_train = processor.get_split("train")
X_val, y_val = processor.get_split("val")
X_test, y_test = processor.get_split("test")

# Or get the raw dataframe
df = processor.get("raw_df")
```
**Note:** You can also use plain dictionaries instead of config classes - both work identically! See [Configuration Options](#configuration-options) below.
For more examples, see [examples/basic_usage.py](examples/basic_usage.py).
## Features
- **Automatic type inference**: Detects categorical, continuous, binary, and datetime columns
- **Flexible preprocessing pipelines**: Chain transforms like imputation, encoding, scaling, discretization
- **Smart caching**: Preprocessed data is cached based on a hash of the config - perfect for distributed training (see the sketch after this list)
- **Stratified splitting**: Automatically handles stratified train/val/test splits
- **Reproducible**: Same config always produces same results
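tabkit's cache layout is an internal detail, but the idea behind config-hash caching can be sketched in a few lines. Everything below (`cache_path`, `load_or_process`, the `.cache` directory) is a hypothetical illustration, not tabkit's API:

```python
import hashlib
import json
from pathlib import Path

import pandas as pd

def cache_path(config: dict, cache_dir: str = ".cache") -> Path:
    # Serialize the config deterministically so the same config always
    # hashes to the same cache key.
    key = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()[:16]
    return Path(cache_dir) / f"{key}.parquet"

def load_or_process(config: dict, process) -> pd.DataFrame:
    path = cache_path(config)
    if path.exists():  # cache hit: skip preprocessing entirely
        return pd.read_parquet(path)
    df = process(config)  # cache miss: run the pipeline once
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(path)
    return df
```

Because the key depends only on the config, many workers in a distributed job can share one cache without re-running the pipeline.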
## Configuration Options
Tabkit provides **type-safe configuration classes** with IDE autocomplete and inline documentation. You can also use plain dictionaries if you prefer - both approaches work identically.
### Using Config Classes (Recommended)
```python
from tabkit import DatasetConfig, TableProcessorConfig

# Dataset configuration with type hints and autocomplete
dataset_config = DatasetConfig(
    dataset_name="my_dataset",
    data_source="disk",  # "disk", "openml", "uci", "automm"
    file_path="data.csv",
    file_type="csv",     # "csv" or "parquet"
    label_col="target"
)

# Processor configuration
processor_config = TableProcessorConfig(
    task_kind="classification",  # or "regression"
    random_state=42,
    pipeline=[...],              # Custom pipeline (optional)
    exclude_columns=["id"],      # Columns to exclude (optional)

    # Splitting configuration - see next section
    test_ratio=0.2,  # For ratio-based splitting
    val_ratio=0.1,   # For ratio-based splitting
    # OR
    n_splits=10,     # For K-fold splitting
    split_idx=0      # For K-fold splitting
)
```
For detailed documentation on all available options, see the docstrings in `DatasetConfig` and `TableProcessorConfig`, or check the [config source](src/tabkit/data/data_config.py).
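For example, the docstrings are available directly in an interactive session:

```python
from tabkit import DatasetConfig, TableProcessorConfig

# Print the full docstring and field documentation for each config class.
help(DatasetConfig)
help(TableProcessorConfig)
```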
### Using Plain Dictionaries (Also supported)
```python
# Same functionality, dictionary-based
dataset_config = {
    "dataset_name": "my_dataset",
    "data_source": "disk",
    "file_path": "data.csv",
    "file_type": "csv",
    "label_col": "target"
}

processor_config = {
    "task_kind": "classification",
    "test_ratio": 0.2,
    "val_ratio": 0.1,
    "random_state": 42
}
```
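Since both forms work identically, the dictionaries above drop into `TableProcessor` exactly where the config classes would go:

```python
from tabkit import TableProcessor

processor = TableProcessor(
    dataset_config=dataset_config,  # plain dict from above
    config=processor_config         # plain dict from above
)
processor.prepare()
X_train, y_train = processor.get_split("train")
```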
## Data Splitting Modes
Tabkit supports two distinct approaches for splitting your data into train/validation/test sets. Choose based on your use case:
### Mode 1: Ratio-Based Splitting (Quick & Simple)
**When to use:**
- You want a simple percentage-based split (e.g., 70/15/15)
- You're doing quick prototyping or one-off experiments
- You don't need full dataset coverage
**How it works:**
- Performs a single random stratified split based on specified ratios
- Fast and intuitive
- Different random seeds give different splits, but no systematic coverage
**Example:**
```python
from tabkit import TableProcessorConfig

config = TableProcessorConfig(
    test_ratio=0.2,  # 20% test
    val_ratio=0.1,   # 10% validation
    random_state=42  # the remaining 70% is used for training
)
```
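Conceptually, this mode is a stratified two-stage split. A minimal scikit-learn sketch of the idea (illustrative only; tabkit's internals may differ, and `ratio_split` is a made-up helper):

```python
from sklearn.model_selection import train_test_split

def ratio_split(X, y, test_ratio=0.2, val_ratio=0.1, seed=42):
    # First carve out the test set, stratified on the labels.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_ratio, stratify=y, random_state=seed
    )
    # Then take the validation set out of the remainder. The ratio is
    # rescaled so that val_ratio stays a fraction of the FULL dataset.
    val_frac = val_ratio / (1 - test_ratio)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=val_frac, stratify=y_rest, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```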
### Mode 2: K-Fold Based Splitting (Robust & Reproducible)
**When to use:**
- You need robust cross-validation
- You want to ensure every sample appears in the test set across multiple runs
- You're benchmarking models or doing comprehensive evaluation
**How it works:**
- Uses K-fold cross-validation for systematic data splitting
- By varying `split_idx` from 0 to `n_splits-1`, every sample appears in the test set exactly once
- Provides systematic coverage of your entire dataset
- Default: 10 splits, i.e. a 10% test fold; the training portion is then sub-split again (9 sub-splits), giving roughly 11% validation and 79% train overall
**Example:**
```python
from tabkit import TableProcessorConfig
# Run 1: Use fold 0 as test
config = TableProcessorConfig(n_splits=5, split_idx=0) # 20% test, rest train+val
# Run 2: Use fold 1 as test
config = TableProcessorConfig(n_splits=5, split_idx=1) # Different 20% test
# ... Run 3-5 to cover all data in test set
```
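The fold mechanics can be pictured with scikit-learn's `StratifiedKFold`; the sketch below (`kfold_test_split` is a hypothetical helper, not tabkit's actual code) shows how a `split_idx` selects one fold as the test set:

```python
from sklearn.model_selection import StratifiedKFold

def kfold_test_split(X, y, n_splits=5, split_idx=0, seed=42):
    # Enumerate the folds deterministically, then pick the fold named
    # by split_idx as the held-out test set; the rest is train+val.
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    folds = list(skf.split(X, y))
    rest_idx, test_idx = folds[split_idx]
    return rest_idx, test_idx
```

Sweeping `split_idx` over `range(n_splits)` puts every sample in the test fold exactly once, which is the coverage property described above.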
### Which Mode is Used?
**Priority:** If both `test_ratio` and `val_ratio` are set, ratio-based splitting is used. Otherwise, K-fold splitting is used.
```python
# This uses RATIO mode
config = {"test_ratio": 0.2, "val_ratio": 0.1}
# This uses K-FOLD mode
config = {"n_splits": 10, "split_idx": 0}
# This also uses K-FOLD mode (ratios are None by default)
config = {} # Uses all defaults
```
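The priority rule boils down to a single check; here is a sketch of the assumed dispatch logic (not tabkit's actual code):

```python
def pick_split_mode(config: dict) -> str:
    # Ratio mode wins only when BOTH ratios are set;
    # anything else falls back to K-fold.
    if config.get("test_ratio") is not None and config.get("val_ratio") is not None:
        return "ratio"
    return "kfold"

assert pick_split_mode({"test_ratio": 0.2, "val_ratio": 0.1}) == "ratio"
assert pick_split_mode({"n_splits": 10, "split_idx": 0}) == "kfold"
assert pick_split_mode({}) == "kfold"
```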