chatterbox-vllm

Name: chatterbox-vllm
Version: 0.1.2
Summary: Chatterbox TTS ported to vLLM for efficient and advanced inference tasks
Upload time: 2025-08-16 00:18:39
Requires Python: >=3.10
License: MIT License (Copyright (c) 2025 Resemble AI; Copyright (c) 2025 David Jia Wei Li)
Keywords: llm, gpt, cli, tts, chatterbox
# Chatterbox TTS on vLLM

This is a port of https://github.com/resemble-ai/chatterbox to vLLM. Why?

* Improved performance and more efficient use of GPU memory.
  * Early benchmarks show ~4x speedup in generation toks/s without batching, and over 10x with batching. This is a significant improvement over the original Chatterbox implementation, which was bottlenecked by unnecessary CPU-GPU sync/transfers within the HF transformers implementation.
  * More rigorous benchmarking is WIP, but will likely come after batching is fully fleshed out.
* Easier integration with state-of-the-art inference infrastructure.

DISCLAIMER: THIS IS A PERSONAL PROJECT and is not affiliated with my employer or any other corporate entity in any way. The project is based solely on publicly-available information. All opinions are my own and do not necessarily represent the views of my employer.

## Generation Samples

![Sample 1](docs/audio-sample-01.mp3)
<audio controls>
  <source src="docs/audio-sample-01.mp3" type="audio/mp3">
</audio>

![Sample 2](docs/audio-sample-02.mp3)
<audio controls>
  <source src="docs/audio-sample-02.mp3" type="audio/mp3">
</audio>

![Sample 3](docs/audio-sample-03.mp3)
<audio controls>
  <source src="docs/audio-sample-03.mp3" type="audio/mp3">
</audio>


# Project Status: Usable and with Benchmark-Topping Throughput

* ✅ Basic speech cloning with audio and text conditioning.
* ✅ Outputs match the quality of the original Chatterbox implementation.
* ✅ Classifier-Free Guidance (CFG) is implemented.
  * Due to a vLLM limitation, CFG cannot be tuned on a per-request basis and can only be configured via the `CHATTERBOX_CFG_SCALE` environment variable (see the sketch after this list).
* ✅ Exaggeration control is implemented.
* ✅ vLLM batching is implemented and produces a significant speedup.
* ℹ️ Project uses vLLM internal APIs and extremely hacky workarounds to get things done.
  * Refactoring to the idiomatic vLLM way of doing things is WIP, but will require some changes to vLLM.
  * Until then, this is a Rube Goldberg machine that will likely only work with vLLM 0.9.2.
  * Follow https://github.com/vllm-project/vllm/issues/21989 for updates.
* ℹ️ Substantial refactoring is needed to further clean up unnecessary workarounds and code paths.
* ℹ️ Server API is not implemented and will likely be out-of-scope for this project.
* ❌ Learned speech positional embeddings are not applied, pending support in vLLM. However, this does not appear to cause a noticeable degradation in quality.
* ❌ APIs are not yet stable and may change.
* ❌ Rigorous benchmarks and further performance optimizations are not yet implemented.
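
Because the CFG scale is set process-wide, a minimal sketch of selecting it via the environment variable might look like the following. This assumes the variable is read when the model is constructed (so it is set before building the model); the other parameter values mirror the example further down and are purely illustrative.

```python
import os

# Assumed: CHATTERBOX_CFG_SCALE is read during model setup, so set it before
# constructing the model. 0.5 matches the CFG value used in the benchmark runs below.
os.environ["CHATTERBOX_CFG_SCALE"] = "0.5"

from chatterbox_vllm.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(max_model_len=1000)
audio = model.generate(["Hello from Chatterbox on vLLM."])[0]
```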

# Installation

Prerequisites: `git` and [`uv`](https://pypi.org/project/uv/) must be installed.

```
git clone https://github.com/randombk/chatterbox-vllm.git
cd chatterbox-vllm
uv venv
source .venv/bin/activate
uv sync
```

The package should automatically download the correct model weights from the Hugging Face Hub.

If you encounter CUDA issues, try resetting the venv and using `uv pip install -e .` instead of `uv sync`.

# Example

[This example](https://github.com/randombk/chatterbox-vllm/blob/master/example-tts.py) can be run with `python example-tts.py` to generate audio samples for three different prompts using three different voices.

```python
import torchaudio as ta
from chatterbox_vllm.tts import ChatterboxTTS


if __name__ == "__main__":
    model = ChatterboxTTS.from_pretrained(
        gpu_memory_utilization = 0.4,
        max_model_len = 1000,

        # Disable CUDA graphs to reduce startup time for one-off generation.
        enforce_eager = True,
    )

    for i, audio_prompt_path in enumerate([None, "docs/audio-sample-01.mp3", "docs/audio-sample-03.mp3"]):
        prompts = [
            "You are listening to a demo of the Chatterbox TTS model running on VLLM.",
            "This is a separate prompt to test the batching implementation.",
            "And here is a third prompt. It's a bit longer than the first one, but not by much.",
        ]
    
        audios = model.generate(prompts, audio_prompt_path=audio_prompt_path, exaggeration=0.8)
        for audio_idx, audio in enumerate(audios):
            ta.save(f"test-{i}-{audio_idx}.mp3", audio, model.sr)
```
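
The `generate` call returns one waveform tensor per input prompt, so the loop above produces nine files in total (three voices × three prompts), each saved at the model's native sample rate via `model.sr`.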

# Benchmarks

To run a benchmark, tweak and run `benchmark.py`.
The following results were obtained with batching on a 6.6k-word input (`docs/benchmark-text-1.txt`), generating ~40min of audio.

Notes:
 * I'm not _entirely_ sure what the toks/s figures from vLLM are showing - the figures probably aren't directly comparable to others, but the results speak for themselves.
 * With vLLM, **the T3 model is no longer the bottleneck**
   * The vast majority of the time is now spent in the S3Gen model, which is not ported (and may not be portable) to vLLM. This currently uses the original reference implementation from the Chatterbox repo, so there's potential for integrating some of the other community optimizations here.
   * This also means the vLLM section of the model never fully ramps to its peak throughput in these benchmarks.
 * Benchmarks are done without CUDA graphs, as that is currently causing correctness issues.
 * There are some issues with my very rudimentary chunking logic, which causes occasional artifacts in output quality (a rough sketch of that style of chunking follows below).
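
The chunking used for these runs lives in `benchmark.py`. Purely as a hypothetical illustration of what rudimentary sentence-based chunking looks like (and how awkward boundaries can slip in), here is a sketch; the function name and word limit are made up and are not the project's actual implementation.

```python
import re

def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Hypothetical naive chunker: split on sentence-ending punctuation, then
    greedily pack sentences into chunks of at most max_words words.
    Overlong sentences are emitted as-is, which is one way awkward chunk
    boundaries (and audio artifacts at chunk joins) can arise."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```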

## Run 1: RTX 3090

System Specs:
 * RTX 3090: 24GB VRAM
 * AMD Ryzen 9 7900X @ 5.70GHz
 * 128GB DDR5 4800 MT/s

Settings & Results:
* Input text: `docs/benchmark-text-1.txt` (6.6k words)
* Input audio: `docs/audio-sample-03.mp3`
* Exaggeration: 0.5, CFG: 0.5, Temperature: 0.8
* CUDA graphs disabled, vLLM max memory utilization=0.6
* Generated output length: 39m50s
* Wall time: 2m30s
* Generation time (without model startup time): 133s
  * Time spent in T3 Llama token generation: 20.6s
  * Time spent in S3Gen waveform generation: 111s

Logs:
```
[BENCHMARK] Text chunked into 154 chunks
[config.py:1472] Using max model len 1200
[default_loader.py:272] Loading weights took 0.16 seconds
[gpu_model_runner.py:1801] Model loading took 1.0107 GiB and 0.215037 seconds
[gpu_model_runner.py:2238] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 241 conditionals items of the maximum feature size.
[gpu_worker.py:232] Available KV cache memory: 12.59 GiB
[kv_cache_utils.py:716] GPU KV cache size: 110,000 tokens
[kv_cache_utils.py:720] Maximum concurrency for 1,200 tokens per request: 91.67x
[BENCHMARK] Model loaded in 7.156545400619507 seconds
Adding requests: 100%|████| 40/40 [00:00<00:00, 1686.08it/s]
Processed prompts: 100%|████| 40/40 [00:05<00:00,  7.73it/s, est. speed input: 1487.13 toks/s, output: 3060.03 toks/s]
[T3] Speech Token Generation time: 5.20s
[S3Gen] Wavform Generation time: 29.09s
Adding requests: 100%|████| 40/40 [00:00<00:00, 1832.95it/s]
Processed prompts: 100%|████| 40/40 [00:05<00:00,  7.61it/s, est. speed input: 1522.47 toks/s, output: 3130.34 toks/s]
[T3] Speech Token Generation time: 5.28s
[S3Gen] Wavform Generation time: 30.40s
Adding requests: 100%|████| 40/40 [00:00<00:00, 1801.83it/s]
Processed prompts: 100%|████| 40/40 [00:05<00:00,  7.65it/s, est. speed input: 1326.87 toks/s, output: 2912.80 toks/s]
[T3] Speech Token Generation time: 5.25s
[S3Gen] Wavform Generation time: 28.37s
Adding requests: 100%|████| 34/34 [00:00<00:00, 1780.35it/s]
Processed prompts: 100%|████| 34/34 [00:04<00:00,  7.09it/s, est. speed input: 1274.34 toks/s, output: 2582.66 toks/s]
[T3] Speech Token Generation time: 4.82s
[S3Gen] Wavform Generation time: 23.74s
[BENCHMARK] Generation completed in 132.7742235660553 seconds
[BENCHMARK] Audio saved to benchmark.mp3
[BENCHMARK] Total time: 144.99638843536377 seconds

real	2m30.700s
user	2m54.372s
sys	0m2.205s
```


## Run 2: RTX 3060ti

System Specs:
 * RTX 3060ti: 8GB VRAM
 * Intel i7-7700K @ 4.20GHz
 * 32GB DDR4 2133 MT/s

Settings & Results:
* Input text: `docs/benchmark-text-1.txt` (6.6k words)
* Input audio: `docs/audio-sample-03.mp3`
* Exaggeration: 0.5, CFG: 0.5, Temperature: 0.8
* CUDA graphs disabled, vLLM max memory utilization=0.6
* Generated output length: 40m15s
* Wall time: 4m26s
* Generation time (without model startup time): 238s
  * Time spent in T3 Llama token generation: 36.4s
  * Time spent in S3Gen waveform generation: 201s

Logs:
```
[BENCHMARK] Text chunked into 154 chunks.
INFO [config.py:1472] Using max model len 1200
INFO [default_loader.py:272] Loading weights took 0.39 seconds
INFO [gpu_model_runner.py:1801] Model loading took 1.0107 GiB and 0.497231 seconds
INFO [gpu_model_runner.py:2238] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 241 conditionals items of the maximum feature size.
INFO [gpu_worker.py:232] Available KV cache memory: 3.07 GiB
INFO [kv_cache_utils.py:716] GPU KV cache size: 26,816 tokens
INFO [kv_cache_utils.py:720] Maximum concurrency for 1,200 tokens per request: 22.35x
Adding requests: 100%|████| 40/40 [00:00<00:00, 947.42it/s]
Processed prompts: 100%|████| 40/40 [00:09<00:00,  4.15it/s, est. speed input: 799.18 toks/s, output: 1654.94 toks/s]
[T3] Speech Token Generation time: 9.68s
[S3Gen] Wavform Generation time: 53.66s
Adding requests: 100%|████| 40/40 [00:00<00:00, 858.75it/s]
Processed prompts: 100%|████| 40/40 [00:08<00:00,  4.69it/s, est. speed input: 938.19 toks/s, output: 1874.97 toks/s]
[T3] Speech Token Generation time: 8.58s
[S3Gen] Wavform Generation time: 53.86s
Adding requests: 100%|████| 40/40 [00:00<00:00, 815.60it/s]
Processed prompts: 100%|████| 40/40 [00:09<00:00,  4.19it/s, est. speed input: 726.62 toks/s, output: 1531.24 toks/s]
[T3] Speech Token Generation time: 9.60s
[S3Gen] Wavform Generation time: 49.89s
Adding requests: 100%|████| 34/34 [00:00<00:00, 938.61it/s]
Processed prompts: 100%|████| 34/34 [00:08<00:00,  3.98it/s, est. speed input: 714.68 toks/s, output: 1439.42 toks/s]
[T3] Speech Token Generation time: 8.59s
[S3Gen] Wavform Generation time: 43.58s
[BENCHMARK] Generation completed in 238.42230987548828 seconds
[BENCHMARK] Audio saved to benchmark.mp3
[BENCHMARK] Total time: 259.1808190345764 seconds

real    4m26.803s
user    4m42.393s
sys     0m4.285s
```


# Chatterbox Architecture

I could not find an official explanation of the Chatterbox architecture, so below is my best reconstruction based on the codebase. Chatterbox broadly follows the [CosyVoice](https://funaudiollm.github.io/cosyvoice2/) architecture, applying intermediate-fusion multimodal conditioning to a 0.5B-parameter Llama model.

<div align="center">
  <img src="https://github.com/randombk/chatterbox-vllm/raw/refs/heads/master/docs/chatterbox-architecture.svg" alt="Chatterbox Architecture" width="100%" />
  <p><em>Chatterbox Architecture Diagram</em></p>
</div>
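
To make the two-stage flow concrete, here is a heavily simplified sketch of the pipeline as I understand it. Only the split between T3 (the Llama speech-token generator run inside vLLM) and S3Gen (the token-to-waveform stage) is taken from the repo; all names and signatures below are placeholders, not the project's actual APIs.

```python
# Hypothetical sketch of the Chatterbox generation pipeline. The component
# objects are passed in because their real interfaces are internal to the repo.

def synthesize(text, reference_audio, speech_encoder, text_tokenizer, t3_llama, s3gen):
    # 1. Conditioning: embed the reference voice and tokenize the text prompt.
    #    The conditioning embeddings are fused into the Llama input sequence
    #    ("intermediate fusion" multimodal conditioning).
    voice_cond = speech_encoder(reference_audio)   # speaker/voice conditioning
    text_tokens = text_tokenizer(text)             # text token ids

    # 2. T3: a ~0.5B-parameter Llama backbone autoregressively generates
    #    discrete speech tokens. This is the part chatterbox-vllm runs in vLLM.
    speech_tokens = t3_llama.generate(voice_cond, text_tokens)

    # 3. S3Gen: a separate stage decodes speech tokens into a waveform. It still
    #    uses the reference implementation and is currently the throughput bottleneck.
    return s3gen(speech_tokens, voice_cond)
```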

# Implementation Notes

## CFG Implementation Details

vLLM does not support CFG natively, so substantial hacks were needed to make it work. At a high level, we trick vLLM into thinking the model's hidden dimension is twice its actual size, then split and restack the hidden states to invoke Llama with double the original batch size. This does pose a risk that vLLM will underestimate the memory requirements of the model; more research is needed into whether vLLM's initial profiling pass will capture this nuance.


<div align="center">
  <img src="https://github.com/randombk/chatterbox-vllm/raw/refs/heads/master/docs/vllm-cfg-impl.svg" alt="vLLM CFG Implementation" width="100%" />
  <p><em>vLLM CFG Implementation</em></p>
</div>
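
As a minimal tensor-shape sketch of the trick described above (made-up names, not the project's actual code; the exact CFG blend Chatterbox uses may differ):

```python
import torch

def cfg_double_batch_step(llama, hidden_2x: torch.Tensor, cfg_scale: float):
    """Illustrative only: vLLM is told the hidden size is 2*H, but each [B, 2H]
    state is really a conditional half and an unconditional half side by side."""
    B, H2 = hidden_2x.shape
    H = H2 // 2

    # Split the doubled state and stack along the batch dim: [B, 2H] -> [2B, H].
    cond, uncond = hidden_2x[:, :H], hidden_2x[:, H:]
    out = llama(torch.cat([cond, uncond], dim=0))

    # Restack into the doubled layout vLLM expects: [2B, H] -> [B, 2H].
    out_cond, out_uncond = out[:B], out[B:]
    restacked = torch.cat([out_cond, out_uncond], dim=-1)

    # At sampling time the two halves are blended CFG-style; one common form is
    # uncond + scale * (cond - uncond). The project's exact formula may differ.
    guided_logits = out_uncond + cfg_scale * (out_cond - out_uncond)
    return restacked, guided_logits
```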

# Changelog

## `0.1.2`
* Update to `vllm 0.10.0`
* Fixed an error where batched requests sometimes got truncated or otherwise jumbled.
  * This also removes the need to double-apply batching when submitting requests. You can submit as many prompts as you'd like into the `generate` function, and `vllm` should perform the batching internally without issue. See changes to `benchmark.py` for details.
  * There is still a (very rare, theoretical) possibility that this issue can happen. If it does, file a ticket with repro steps, and tweak your max batch size or max token count as a workaround.


## `0.1.1`
* Misc minor cleanups
* API changes:
  * Use `max_batch_size` instead of `gpu_memory_utilization`
  * Use `compile=False` (default) instead of `enforce_eager=True`
  * Look at the latest examples to follow API changes (a rough sketch of the updated call follows below). As a reminder, I do not expect the API to become stable until `1.0.0`.
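
Based on the two items above, a hedged sketch of what a post-`0.1.1` call might look like; the parameter values are illustrative, and the repo's current examples are the authoritative reference.

```python
from chatterbox_vllm.tts import ChatterboxTTS

# Illustrative only: per the 0.1.1 notes, max_batch_size replaces
# gpu_memory_utilization, and compile=False (the default) replaces enforce_eager=True.
model = ChatterboxTTS.from_pretrained(
    max_batch_size=16,  # assumed value for illustration
    compile=False,
)
```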

## `0.1.0`
* Initial publication to PyPI
* Moved audio conditioning processing out of vLLM to avoid re-computing it for every request.

## `0.0.1`
* Initial release

            
