# CaptionFlow
[codecov](https://codecov.io/github/bghira/CaptionFlow) · [PyPI](https://badge.fury.io/py/caption-flow)
scalable, fault-tolerant **vLLM-powered image captioning**.
a fast websocket-based orchestrator pairs with lightweight gpu workers; requests are batched through vLLM for high throughput.
* **orchestrator**: hands out work in chunked shards, collects captions, checkpoints progress, and keeps simple stats.
* **workers (vLLM)**: connect to the orchestrator, stream in image samples, batch them, and generate 1..N captions per image using prompts supplied by the orchestrator.
* **config-driven**: all components read YAML config; flags can override.
> no conda. just `venv` + `pip`.
---
## install
```bash
python -m venv .venv
source .venv/bin/activate # windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -e . # installs the `caption-flow` command
```
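if you only need the released CLI (for example to join a community cluster, see below), installing from PyPI also works:

```bash
# install the latest release instead of an editable source checkout
pip install caption-flow
```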
## quickstart (single box)
1. copy + edit the sample configs
```bash
cp examples/orchestrator/local_image_files.yaml my-orchestrator.yaml
cp examples/worker.yaml my-worker.yaml
cp examples/monitor.yaml my-monitor.yaml # optional terminal interface
```
set a unique shared token in both `my-orchestrator.yaml` and `my-worker.yaml` (see `auth.worker_tokens` in the orchestrator config and `worker.token` in the worker config).
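a minimal sketch of how the two tokens line up, assuming the key layout named above (`auth.worker_tokens` and `worker.token`); the sample configs are the authoritative reference:

```yaml
# my-orchestrator.yaml (excerpt)
auth:
  worker_tokens:
    - "replace-with-a-long-random-token"

# my-worker.yaml (excerpt) -- must match one of the orchestrator's worker tokens
worker:
  token: "replace-with-a-long-random-token"
```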
if you use private hugging face datasets/models, export `HUGGINGFACE_HUB_TOKEN` before starting anything.
2. start the orchestrator
```bash
caption-flow orchestrator --config my-orchestrator.yaml
```
3. start one or more vLLM workers
```bash
# gpu 0 on the same host
caption-flow worker --config my-worker.yaml --gpu-id 0
# your second GPU
caption-flow worker --config my-worker.yaml --gpu-id 1
# on a remote host
caption-flow worker --config my-worker.yaml --server ws://your.hostname.address:8765
```
4. (optional) start the monitor
```bash
caption-flow monitor --config my-monitor.yaml
```
5. export the data
```bash
% caption-flow export --help
Usage: caption-flow export [OPTIONS]

  Export caption data to various formats.

Options:
  --format [jsonl|json|csv|txt|huggingface_hub|all]
                                  Export format (default: jsonl)
```
* **jsonl**: writes a single JSON Lines file to the specified `--output` path
* **csv**: writes CSV-compatible columns to the `--output` path (metadata is incomplete)
* **json**: writes one `.json` file per sample inside the `--output` directory with **complete** metadata; useful for webdatasets
* **txt**: writes one `.txt` file per sample inside the `--output` directory containing only the captions
* **huggingface_hub**: pushes a dataset to the Hugging Face Hub; add `--private` and `--nsfw` where needed
* **all**: writes every export format into the specified `--output` directory
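for example, using only the flags shown above (output paths are illustrative; the hub export may need extra repository options, check `--help`):

```bash
# a single JSON Lines file with every caption
caption-flow export --format jsonl --output captions.jsonl

# one .txt caption file per sample
caption-flow export --format txt --output ./captions-txt

# push to the Hugging Face Hub as a private dataset
caption-flow export --format huggingface_hub --private
```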
---
## how it’s wired
### orchestrator
* **websocket server** (default `0.0.0.0:8765`) with three client roles: workers, data-feeders, and admin.
* **dataset control**: the orchestrator centrally defines the dataset (`huggingface` or `local`) and version/name. it chunk-slices shards and assigns work.
* **data serving to remote workers**: local files are served to workers automatically (over HTTP), so remote workers without access to the same filesystem can still caption them.
* **vLLM config broadcast**: model, tensor-parallel size, dtype, max sequence length, memory targets, batching, sampling params, and **inference prompts** are all pushed to workers; workers can apply many changes without a model reload.
* **storage + checkpoints**: captions buffer to disk with periodic checkpoints. chunk state is tracked so restarts don’t double-work.
* **auth**: token lists for `worker`, `monitor`, and `admin` roles.
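a rough sketch of the orchestrator side, assuming the key names mentioned in this README (`auth.worker_tokens`, `vllm.batch_size`, `vllm.inference_prompts`); everything else is illustrative, so treat the example configs as the source of truth:

```yaml
# illustrative orchestrator config sketch -- not the exact schema
orchestrator:
  host: 0.0.0.0          # assumption: bind address for the websocket server
  port: 8765             # default port from this README
dataset:
  type: local            # or `huggingface`
  path: /data/my-images  # assumption: local folder or hub dataset name
vllm:
  batch_size: 8
  inference_prompts:
    - "describe this image in detail."
    - "write a short alt-text caption."
auth:
  worker_tokens:
    - "replace-with-a-long-random-token"
```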
### vLLM worker
* **one process per gpu**. select the device with `--gpu-id` (or `worker.gpu_id` in YAML).
* **gets its marching orders** from the orchestrator: dataset info, model, prompts, batch size, and sampling.
* **resilient**: detects disconnects, abandons the current chunk cleanly, clears queues, reconnects, and resumes.
* **batched generate()**: images are resized down for consistent batching; each image can get multiple captions (one per prompt).
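and the matching worker side, a sketch assuming the `worker.*` keys named above plus an illustrative `server` entry:

```yaml
# illustrative worker config sketch
worker:
  token: "replace-with-a-long-random-token"  # must match an entry in auth.worker_tokens
  gpu_id: 0                                  # or override with --gpu-id at launch
  server: wss://your.hostname.address:8765   # assumption: can also be set via --server
```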
---
## dataset formats
* URL-list datasets compatible with the `datasets` library, hosted on the hugging face hub or stored locally
* webdataset shards containing full image data; these can also be hosted on the hub
* a local folder of images; the orchestrator serves the image data to workers
## configuration path
### config discovery order
for any component, the CLI looks for config in this order (first match wins):
1. `--config /path/to/file.yaml`
2. `./<component>.yaml` (current directory)
3. `~/.caption-flow/<component>.yaml`
4. `$XDG_CONFIG_HOME/caption-flow/<component>.yaml`
5. `/etc/caption-flow/<component>.yaml`
6. any `$XDG_CONFIG_DIRS` entries under `caption-flow/`
7. `./examples/<component>.yaml` (fallback)
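so, for example, you can park a config in the per-user location once and start workers with no flags at all:

```bash
# put the worker config where discovery step 3 looks for it
mkdir -p ~/.caption-flow
cp my-worker.yaml ~/.caption-flow/worker.yaml

# no --config needed; ./worker.yaml would win if it existed, otherwise this one is used
caption-flow worker
```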
---
## tls / certificates
use the built-in helpers during development:
```bash
# self-signed certs for quick local testing
caption-flow generate_cert --self-signed --domain localhost --output-dir ./certs
# inspect any certificate file
caption-flow inspect_cert ./certs/fullchain.pem
```
then point the orchestrator at the resulting cert/key (or run `--no-ssl` for dev-only `ws://`).
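a hedged sketch of what that might look like in YAML; the key names here are purely illustrative and `privkey.pem` is an assumed file name, so check the example orchestrator configs for the real layout:

```yaml
# illustrative ssl sketch -- not the exact schema
orchestrator:
  ssl:
    cert: ./certs/fullchain.pem   # produced by `caption-flow generate_cert` above
    key: ./certs/privkey.pem      # assumption: typical private key file name
```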
---
## tips & notes
* **multi-gpu**: start one worker process per gpu (set `--gpu-id` or `worker.gpu_id`).
* **throughput**: tune `vllm.batch_size` in the orchestrator config (or override with `--batch-size` at worker start). higher isn’t always better; watch VRAM.
* **prompts**: add more strings under `vllm.inference_prompts` to get multiple captions per image; the worker returns only non-empty generations.
* **private HF**: if your dataset/model needs auth, export `HUGGINGFACE_HUB_TOKEN` before `caption-flow worker ...`.
* **self-signed ssl**: pass `--no-verify-ssl` to workers/monitors in dev.
* **recovery**: if you hard-crash mid-run, `caption-flow scan_chunks --fix` can reset abandoned chunks so the orchestrator can reissue them cleanly.
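a tiny launcher sketch tying the multi-gpu and recovery tips together (assumes two gpus and the quickstart config names):

```bash
# multi-gpu: one worker process per gpu, each pinned with --gpu-id
caption-flow worker --config my-worker.yaml --gpu-id 0 &
caption-flow worker --config my-worker.yaml --gpu-id 1 &

# recovery after a hard crash: reset abandoned chunks so the orchestrator can reissue them
caption-flow scan_chunks --fix
```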
---
## roadmap
* hot config reload via the admin websocket path.
* dedicated data-feeder clients (separate from gpu workers) that push samples into the orchestrator.
* richer monitor TUI.
PRs welcome. keep it simple and fast.
## architecture
```
┌─────────────┐      WebSocket      ┌─────────────┐
│   Worker    │◄───────────────────►│             │
│             │                     │             │      ┌──────────────┐
│             │◄────────────────────│             │─────►│Arrow/Parquet │
└─────────────┘   HTTP (img data)   │ Orchestrator│      │   Storage    │
                                    │             │      └──────────────┘
┌─────────────┐                     │             │
│   Worker    │◄───────────────────►│             │
│             │                     │             │
│             │◄────────────────────│             │
└─────────────┘   HTTP (img data)   └─────────────┘
                                           ▲
┌─────────────┐                            │
│   Monitor   │◄───────────────────────────┘
└─────────────┘
```
## Community Clusters
To contribute compute to a cluster:
1. Install caption-flow: `pip install caption-flow`
2. Get a worker token from the project maintainer
3. Run: `caption-flow worker --server wss://project.domain.com:8765 --token YOUR_TOKEN`
Your contributions will be tracked and attributed in the final dataset!
## License
AGPLv3