# smolvladataset
Simple, reliable loader for SmolVLA robotics datasets with built‑in train/val/test splits and caching.
This library accompanies the SmolVLA paper to help the community inspect how the training dataset is composed, rebuild it from source repositories, and customize the composition or split policy. It can either download a precompiled bundle from the Hugging Face Hub or rebuild locally from a CSV list of dataset repositories.
## Features
- Reproducible train/val/test splits (deterministic seed)
- LeRobot‑compatible splits (`LeRobotDataset` interface)
- Automatic download and local caching (Hugging Face Hub)
- Optional precompiled dataset for fast startup
- Efficient Parquet storage with light schema normalization
## Installation
```bash
pip install smolvladataset
# or using uv
uv add smolvladataset
```
## Requirements
- Python 3.11+
- Core deps: `pandas`, `pyarrow`, `huggingface-hub`, `lerobot`, `datasets`
## Quick Start
```python
from smolvladataset import SmolVLADataset
# Returns (train, val, test) as LeRobot‑compatible datasets
train, val, test = SmolVLADataset()
print(len(train), len(val), len(test))
# Access a row (dict of columns)
sample = train[0]
```
## API
- `SmolVLADataset(csv_list=None, *, force_download=False, force_build=False, split_config=None)`
- Loads a precompiled bundle (default) or builds from a CSV of source repos and returns a tuple `(train, val, test)`.
- `csv_list`: Path to CSV whose first column lists HF dataset repo IDs (e.g. `org/name`). If omitted, a packaged default is used.
  - `force_download`: With a custom `csv_list`, rebuilds from the source repositories even if a cached build exists; with the default list, re‑downloads the precompiled bundle.
  - `force_build`: Only applies when `csv_list` is omitted; builds from the default list instead of downloading the precompiled bundle.
- `split_config`: Optional `SplitConfig(train=..., val=..., test=..., seed=...)`.
- `SplitConfig(train=0.8, val=0.1, test=0.1, seed=<int>)`
  - Proportions must sum to 1.0; the seed controls deterministic shuffling. A validation sketch follows this list.
- If no `split_config` is provided, the default configuration matches the splits published on Hugging Face.
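The sketch below illustrates checking a `SplitConfig` before use and verifying the resulting split fractions. It assumes the configured proportions are readable back as `train`, `val`, and `test` attributes, which is an assumption rather than a documented part of the API:

```python
from smolvladataset import SmolVLADataset, SplitConfig

# Illustrative sketch: assumes SplitConfig exposes its proportions as attributes.
cfg = SplitConfig(train=0.8, val=0.1, test=0.1, seed=42)
assert abs(cfg.train + cfg.val + cfg.test - 1.0) < 1e-9, "proportions must sum to 1.0"

train, val, test = SmolVLADataset(split_config=cfg)
total = len(train) + len(val) + len(test)
print(f"train/val/test fractions: {len(train)/total:.2f} / {len(val)/total:.2f} / {len(test)/total:.2f}")
```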
## Advanced Usage
### Custom Dataset List
```python
# Use a CSV file with Hugging Face dataset repo IDs (a packaged default is used if omitted)
train, val, test = SmolVLADataset(csv_list="path/to/datasets.csv")
```
### Custom Split Configuration
```python
from smolvladataset import SmolVLADataset, SplitConfig
config = SplitConfig(train=0.7, val=0.15, test=0.15, seed=123)
train, val, test = SmolVLADataset(split_config=config)
```
### Force Rebuild or Re‑download
```python
# With a custom CSV, forces rebuild from sources
train, val, test = SmolVLADataset(csv_list="data/datasets.csv", force_download=True)
# With the default list, re‑download the precompiled bundle
train, val, test = SmolVLADataset(force_download=True)
# Build from default list instead of using the precompiled bundle
train, val, test = SmolVLADataset(force_build=True)
```
### Cache Directory
Artifacts are cached under `~/.cache/smolvladataset/<hash>/` by default, where `<hash>` depends on the dataset list.
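To inspect or free up the cache manually, here is a minimal sketch using only the standard library; the hash directory names are opaque implementation details:

```python
from pathlib import Path

cache_root = Path.home() / ".cache" / "smolvladataset"

if cache_root.exists():
    # Each subdirectory corresponds to one dataset list (named by its hash).
    for bundle_dir in sorted(p for p in cache_root.iterdir() if p.is_dir()):
        size_mb = sum(f.stat().st_size for f in bundle_dir.rglob("*.parquet")) / 1e6
        print(f"{bundle_dir.name}: {size_mb:.1f} MB of Parquet data")
else:
    print("No smolvladataset cache yet")
```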
## Dataset List Format
The library expects a CSV file whose first column contains Hugging Face dataset repository IDs:
```csv
dataset_repo_id
org/dataset-1
org/dataset-2
```
Lines beginning with `#` are ignored.
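A custom list can also be written from Python before being passed to `SmolVLADataset`; the repo IDs and file name below are placeholders:

```python
from pathlib import Path

from smolvladataset import SmolVLADataset

# Placeholder repo IDs; replace with real Hugging Face dataset repositories.
repo_ids = ["org/dataset-1", "org/dataset-2"]

csv_path = Path("my_datasets.csv")
csv_path.write_text(
    "dataset_repo_id\n"
    "# lines beginning with '#' are ignored\n"
    + "\n".join(repo_ids)
    + "\n"
)

train, val, test = SmolVLADataset(csv_list=str(csv_path))
```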
## Cache Layout
- `merged.parquet` — Combined dataset from all sources (includes a `dataset` column)
- `stats.parquet` — Basic per‑source statistics
- `train.parquet`, `validation.parquet`, `test.parquet` — Split files (optional, for HF viewer convenience)
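Since `merged.parquet` carries a `dataset` column identifying each source repository, the composition can be inspected directly from the cache. A sketch with pandas, where `<hash>` is a placeholder for an actual bundle directory name:

```python
from pathlib import Path

import pandas as pd

# Replace "<hash>" with an actual bundle directory under the cache root.
bundle_dir = Path.home() / ".cache" / "smolvladataset" / "<hash>"

merged = pd.read_parquet(bundle_dir / "merged.parquet")
print(merged["dataset"].value_counts())  # rows contributed by each source repo

stats = pd.read_parquet(bundle_dir / "stats.parquet")
print(stats.head())  # basic per-source statistics
```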
## License
See `LICENSE` for details.