# CaptionFlow
[codecov](https://codecov.io/github/bghira/CaptionFlow) · [PyPI](https://badge.fury.io/py/caption-flow)
scalable, fault-tolerant **vLLM-powered image captioning**.
a fast websocket-based orchestrator pairs with lightweight gpu workers; requests are batched through vLLM for high throughput.
* **orchestrator**: hands out work in chunked shards, collects captions, checkpoints progress, and keeps simple stats.
* **workers (vLLM)**: connect to the orchestrator, stream in image samples, batch them, and generate 1..N captions per image using prompts supplied by the orchestrator.
* **config-driven**: all components read YAML config; flags can override.
> no conda. just `venv` + `pip`.
---
## install
```bash
python -m venv .venv
source .venv/bin/activate # windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -e . # installs the `caption-flow` command
```
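if you only need the released CLI (for example to join a community cluster, see below), installing from PyPI also works:

```bash
# install the latest release instead of an editable source checkout
pip install caption-flow
```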
## quickstart (single box)
1. copy + edit the sample configs
```bash
cp examples/orchestrator/local_image_files.yaml my-orchestrator.yaml
cp examples/worker.yaml my-worker.yaml
cp examples/monitor.yaml my-monitor.yaml # optional terminal interface
```
set a unique shared token in both `my-orchestrator.yaml` and `my-worker.yaml` (see `auth.worker_tokens` in the orchestrator config and `worker.token` in the worker config).
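a minimal sketch of how the two tokens line up, assuming the key layout named above (`auth.worker_tokens` and `worker.token`); the sample configs are the authoritative reference:

```yaml
# my-orchestrator.yaml (excerpt)
auth:
  worker_tokens:
    - "replace-with-a-long-random-token"

# my-worker.yaml (excerpt) -- must match one of the orchestrator's worker tokens
worker:
  token: "replace-with-a-long-random-token"
```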
if you use private hugging face datasets/models, export `HUGGINGFACE_HUB_TOKEN` before starting anything.
2. start the orchestrator
```bash
caption-flow orchestrator --config my-orchestrator.yaml
```
3. start one or more vLLM workers
```bash
# gpu 0 on the same host
caption-flow worker --config my-worker.yaml --gpu-id 0
# your second GPU
caption-flow worker --config my-worker.yaml --gpu-id 1
# on a remote host
caption-flow worker --config my-worker.yaml --server ws://your.hostname.address:8765
```
4. (optional) start the monitor
```bash
caption-flow monitor --config my-monitor.yaml
```
5. export the data
```bash
% caption-flow export --help
Usage: caption-flow export [OPTIONS]

  Export caption data to various formats.

Options:
  --format [jsonl|json|csv|txt|huggingface_hub|all]
                                  Export format (default: jsonl)
```
* **jsonl**: writes a single JSON Lines file to the specified `--output` path
* **csv**: writes CSV-compatible columns to the `--output` path (metadata is incomplete)
* **json**: writes one `.json` file per sample inside the `--output` directory with **complete** metadata; useful for webdatasets
* **txt**: writes one `.txt` file per sample inside the `--output` directory containing only the captions
* **huggingface_hub**: pushes a dataset to the Hugging Face Hub; add `--private` and `--nsfw` where needed
* **all**: writes every export format into the specified `--output` directory
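for example, using only the flags shown above (output paths are illustrative; the hub export may need extra repository options, check `--help`):

```bash
# a single JSON Lines file with every caption
caption-flow export --format jsonl --output captions.jsonl

# one .txt caption file per sample
caption-flow export --format txt --output ./captions-txt

# push to the Hugging Face Hub as a private dataset
caption-flow export --format huggingface_hub --private
```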
---
## how it’s wired
### orchestrator
* **websocket server** (default `0.0.0.0:8765`) with three client roles: workers, data-feeders, and admin.
* **dataset control**: the orchestrator centrally defines the dataset (`huggingface` or `local`) and version/name. it chunk-slices shards and assigns work.
* **data serving to remote workers**: local files are served to workers automatically (over HTTP), so remote workers without access to the same filesystem can still caption them.
* **vLLM config broadcast**: model, tensor-parallel size, dtype, max sequence length, memory targets, batching, sampling params, and **inference prompts** are all pushed to workers; workers can apply many changes without a model reload.
* **storage + checkpoints**: captions buffer to disk with periodic checkpoints. chunk state is tracked so restarts don’t double-work.
* **auth**: token lists for `worker`, `monitor`, and `admin` roles.
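a rough sketch of the orchestrator side, assuming the key names mentioned in this README (`auth.worker_tokens`, `vllm.batch_size`, `vllm.inference_prompts`); everything else is illustrative, so treat the example configs as the source of truth:

```yaml
# illustrative orchestrator config sketch -- not the exact schema
orchestrator:
  host: 0.0.0.0          # assumption: bind address for the websocket server
  port: 8765             # default port from this README
dataset:
  type: local            # or `huggingface`
  path: /data/my-images  # assumption: local folder or hub dataset name
vllm:
  batch_size: 8
  inference_prompts:
    - "describe this image in detail."
    - "write a short alt-text caption."
auth:
  worker_tokens:
    - "replace-with-a-long-random-token"
```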
### vLLM worker
* **one process per gpu**. select the device with `--gpu-id` (or `worker.gpu_id` in YAML).
* **gets its marching orders** from the orchestrator: dataset info, model, prompts, batch size, and sampling.
* **resilient**: detects disconnects, abandons the current chunk cleanly, clears queues, reconnects, and resumes.
* **batched generate()**: images are resized down for consistent batching; each image can get multiple captions (one per prompt).
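and the matching worker side, a sketch assuming the `worker.*` keys named above plus an illustrative `server` entry:

```yaml
# illustrative worker config sketch
worker:
  token: "replace-with-a-long-random-token"  # must match an entry in auth.worker_tokens
  gpu_id: 0                                  # or override with --gpu-id at launch
  server: wss://your.hostname.address:8765   # assumption: can also be set via --server
```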
---
## dataset formats
* URL-list datasets compatible with the `datasets` library, hosted on the hugging face hub or stored locally
* webdataset shards containing full image data; these can also be hosted on the hub
* a local folder of images; the orchestrator serves the image data to workers
## configuration path
### config discovery order
for any component, the CLI looks for config in this order (first match wins):
1. `--config /path/to/file.yaml`
2. `./<component>.yaml` (current directory)
3. `~/.caption-flow/<component>.yaml`
4. `$XDG_CONFIG_HOME/caption-flow/<component>.yaml`
5. `/etc/caption-flow/<component>.yaml`
6. any `$XDG_CONFIG_DIRS` entries under `caption-flow/`
7. `./examples/<component>.yaml` (fallback)
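so, for example, you can park a config in the per-user location once and start workers with no flags at all:

```bash
# put the worker config where discovery step 3 looks for it
mkdir -p ~/.caption-flow
cp my-worker.yaml ~/.caption-flow/worker.yaml

# no --config needed; ./worker.yaml would win if it existed, otherwise this one is used
caption-flow worker
```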
---
## tls / certificates
use the built-in helpers during development:
```bash
# self-signed certs for quick local testing
caption-flow generate_cert --self-signed --domain localhost --output-dir ./certs
# inspect any certificate file
caption-flow inspect_cert ./certs/fullchain.pem
```
then point the orchestrator at the resulting cert/key (or run `--no-ssl` for dev-only `ws://`).
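a hedged sketch of what that might look like in YAML; the key names here are purely illustrative and `privkey.pem` is an assumed file name, so check the example orchestrator configs for the real layout:

```yaml
# illustrative ssl sketch -- not the exact schema
orchestrator:
  ssl:
    cert: ./certs/fullchain.pem   # produced by `caption-flow generate_cert` above
    key: ./certs/privkey.pem      # assumption: typical private key file name
```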
---
## tips & notes
* **multi-gpu**: start one worker process per gpu (set `--gpu-id` or `worker.gpu_id`).
* **throughput**: tune `vllm.batch_size` in the orchestrator config (or override with `--batch-size` at worker start). higher isn’t always better; watch VRAM.
* **prompts**: add more strings under `vllm.inference_prompts` to get multiple captions per image; the worker returns only non-empty generations.
* **private HF**: if your dataset/model needs auth, export `HUGGINGFACE_HUB_TOKEN` before `caption-flow worker ...`.
* **self-signed ssl**: pass `--no-verify-ssl` to workers/monitors in dev.
* **recovery**: if you hard-crash mid-run, `caption-flow scan_chunks --fix` can reset abandoned chunks so the orchestrator can reissue them cleanly.
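a tiny launcher sketch tying the multi-gpu and recovery tips together (assumes two gpus and the quickstart config names):

```bash
# multi-gpu: one worker process per gpu, each pinned with --gpu-id
caption-flow worker --config my-worker.yaml --gpu-id 0 &
caption-flow worker --config my-worker.yaml --gpu-id 1 &

# recovery after a hard crash: reset abandoned chunks so the orchestrator can reissue them
caption-flow scan_chunks --fix
```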
---
## roadmap
* hot config reload via the admin websocket path.
* dedicated data-feeder clients (separate from gpu workers) that push samples into the orchestrator.
* richer monitor TUI.
PRs welcome. keep it simple and fast.
## architecture
```
┌─────────────┐      WebSocket      ┌─────────────┐
│   Worker    │◄───────────────────►│             │
│             │                     │             │      ┌──────────────┐
│             │◄────────────────────│             │─────►│Arrow/Parquet │
└─────────────┘   HTTP (img data)   │ Orchestrator│      │   Storage    │
                                    │             │      └──────────────┘
┌─────────────┐                     │             │
│   Worker    │◄───────────────────►│             │
│             │                     │             │
│             │◄────────────────────│             │
└─────────────┘   HTTP (img data)   └─────────────┘
                                           ▲
┌─────────────┐                            │
│   Monitor   │◄───────────────────────────┘
└─────────────┘
```
## Community Clusters
To contribute compute to a cluster:
1. Install caption-flow: `pip install caption-flow`
2. Get a worker token from the project maintainer
3. Run: `caption-flow worker --server wss://project.domain.com:8765 --token YOUR_TOKEN`
Your contributions will be tracked and attributed in the final dataset!
## License
AGPLv3