smolvladataset 0.1.1

- Summary: Loader for SmolVLA robotics datasets with deterministic train/val/test splits; LeRobot‑compatible, cached locally, downloads precompiled bundles from the Hugging Face Hub or rebuilds from a CSV.
- Authors: franklintra, bonsainoodle
- Requires Python: >=3.11
- Keywords: dataset, robotics, dataset-loader, smolvla, vision-language-action, lerobot, huggingface, machine-learning, robotics-datasets, ai-datasets
- Uploaded: 2025-08-27 01:41:50
- GitHub: https://github.com/smolvladataset/smolvladataset
- Dataset: https://huggingface.co/datasets/SmolVLADataset/SmolVLADataset

# smolvladataset

Simple, reliable loader for SmolVLA robotics datasets with built‑in train/val/test splits and caching.

This library accompanies the SmolVLA paper to help the community inspect how the training dataset is composed, rebuild it from source repositories, and customize the composition or split policy. It can either download a precompiled bundle from the Hugging Face Hub or rebuild locally from a CSV list of dataset repositories.

## Features

- Reproducible train/val/test splits (deterministic seed)
- LeRobot‑compatible splits (`LeRobotDataset` interface)
- Automatic download and local caching (Hugging Face Hub)
- Optional precompiled dataset for fast startup
- Efficient Parquet storage with light schema normalization

## Installation

```bash
pip install smolvladataset
# or using uv
uv add smolvladataset
```

## Requirements

- Python 3.11+
- Core deps: `pandas`, `pyarrow`, `huggingface-hub`, `lerobot`, `datasets`

## Quick Start

```python
from smolvladataset import SmolVLADataset

# Returns (train, val, test) as LeRobot‑compatible datasets
train, val, test = SmolVLADataset()
print(len(train), len(val), len(test))

# Access a row (dict of columns)
sample = train[0]
```

## API

- `SmolVLADataset(csv_list=None, *, force_download=False, force_build=False, split_config=None)`

  - Loads a precompiled bundle (default) or builds from a CSV of source repos and returns a tuple `(train, val, test)`.
  - `csv_list`: Path to CSV whose first column lists HF dataset repo IDs (e.g. `org/name`). If omitted, a packaged default is used.
  - `force_download`: With a custom `csv_list`, rebuild from sources even if cached; otherwise re‑download the precompiled bundle.
  - `force_build`: Applies only when `csv_list` is omitted; build from the default list instead of downloading the precompiled bundle.
  - `split_config`: Optional `SplitConfig(train=..., val=..., test=..., seed=...)`.

- `SplitConfig(train=0.8, val=0.1, test=0.1, seed=<int>)`
  - Proportions must sum to 1.0. The seed controls deterministic shuffling.
  - If no `split_config` is provided, the default configuration matches the splits published on Hugging Face.
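The split semantics above can be sketched as follows. This is an illustrative reimplementation, not the library's actual code, and the helper name `split_indices` is hypothetical:

```python
import random

def split_indices(n, train=0.8, val=0.1, test=0.1, seed=42):
    """Deterministic proportional split of n row indices (illustrative)."""
    assert abs(train + val + test - 1.0) < 1e-9, "proportions must sum to 1.0"
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # same seed -> same shuffle order
    n_train, n_val = int(n * train), int(n * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

tr, va, te = split_indices(100, seed=123)  # 80 / 10 / 10 indices
```

Because the shuffle is driven by a seeded `random.Random` instance, repeated calls with the same `n` and `seed` always produce the same partition.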

## Advanced Usage

### Custom Dataset List

```python
# Use a CSV file with Hugging Face dataset repo IDs (a packaged default is used if omitted)
train, val, test = SmolVLADataset(csv_list="path/to/datasets.csv")
```

### Custom Split Configuration

```python
from smolvladataset import SmolVLADataset, SplitConfig

config = SplitConfig(train=0.7, val=0.15, test=0.15, seed=123)
train, val, test = SmolVLADataset(split_config=config)
```

### Force Rebuild or Re‑download

```python
# With a custom CSV, forces rebuild from sources
train, val, test = SmolVLADataset(csv_list="data/datasets.csv", force_download=True)

# With the default list, re‑download the precompiled bundle
train, val, test = SmolVLADataset(force_download=True)

# Build from default list instead of using the precompiled bundle
train, val, test = SmolVLADataset(force_build=True)
```

### Cache Directory

Artifacts are cached under `~/.cache/smolvladataset/<hash>/` by default, where `<hash>` depends on the dataset list.
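One plausible way such a content-derived hash could work is sketched below; the library's actual scheme is not documented here, so `cache_key` is a hypothetical illustration:

```python
import hashlib

def cache_key(repo_ids):
    # Hypothetical: stable, order-sensitive key over the repo ID list
    digest = hashlib.sha256("\n".join(repo_ids).encode("utf-8")).hexdigest()
    return digest[:16]  # short directory name under ~/.cache/smolvladataset/

key = cache_key(["org/dataset-1", "org/dataset-2"])
```

The point is only that the same dataset list always maps to the same cache directory, while any change to the list yields a fresh one.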

## Dataset List Format

The library expects a CSV file whose first column contains Hugging Face dataset repository IDs:

```csv
dataset_repo_id
org/dataset-1
org/dataset-2
```

Lines beginning with `#` are ignored.
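A minimal parser for this format might look like the following sketch (assuming a single header row, as in the example above; `read_repo_ids` is not part of the library's API):

```python
import csv

def read_repo_ids(text):
    """Parse the dataset-list CSV: skip blank and '#' comment lines,
    drop the header row, and keep the first column of each record."""
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    return [row[0].strip() for row in reader if row]

repo_ids = read_repo_ids("dataset_repo_id\norg/dataset-1\n# skipped\norg/dataset-2\n")
```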

## Cache Layout

- `merged.parquet` — Combined dataset from all sources (includes a `dataset` column)
- `stats.parquet` — Basic per‑source statistics
- `train.parquet`, `validation.parquet`, `test.parquet` — Split files (optional, for HF viewer convenience)

See `LICENSE` for details.

            
