# Reverse Engineering of S3Tokenizer
<div align="center">
<img src="https://arxiv.org/html/2407.04051v2/x1.png" alt="Description" width="35%" />
<p><em>Supervised Semantic Speech Tokenizer (S3Tokenizer)</em></p>
</div>
S3Tokenizer was initially introduced in CosyVoice [[Paper]](https://arxiv.org/abs/2407.04051v2) [[Repo]](https://github.com/FunAudioLLM/CosyVoice). It is a Supervised Semantic Speech Tokenizer based on the pre-trained SenseVoice-Large model. It enhances the semantic relationship of extracted tokens to textual and paralinguistic information, is robust to data noise, and reduces the reliance on clean data collection, thereby enabling the use of a broader range of data for model training.
However, as indicated in this [[issue]](https://github.com/FunAudioLLM/CosyVoice/issues/70), the authors have no intention to open-source the PyTorch implementation of the S3Tokenizer, and only plan to release an ONNX file. Additionally, users aiming to fine-tune CosyVoice must extract speech codes offline, with the batch size restricted to 1, a process that is notably time-consuming (refer to [[cosyvoice/tools/extract_speech_token.py]](https://github.com/FunAudioLLM/CosyVoice/blob/main/tools/extract_speech_token.py)).
This repository reverse-engineers the S3Tokenizer and offers:
1. A pure PyTorch implementation of S3Tokenizer (see [[model.py]](https://github.com/xingchensong/S3Tokenizer/blob/main/s3tokenizer/model.py)), compatible with initializing weights from the released ONNX file (see [[utils.py::onnx2torch()]](https://github.com/xingchensong/S3Tokenizer/blob/main/s3tokenizer/utils.py)).
2. High-throughput (distributed) batch inference, achieving a ~790x speedup compared to the original inference pipeline in [[cosyvoice/tools/extract_speech_token.py]](https://github.com/FunAudioLLM/CosyVoice/blob/main/tools/extract_speech_token.py).
3. The capability to perform online speech code extraction during SpeechLLM training.
## Supported Models 🔥
- [x] [S3Tokenizer V1 50hz](https://modelscope.cn/models/iic/CosyVoice-300M)
- [x] [S3Tokenizer V1 25hz](https://modelscope.cn/models/iic/CosyVoice-300M-25Hz)
- [x] [S3Tokenizer V2 25hz](https://modelscope.cn/models/iic/CosyVoice2-0.5B)
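The `50hz` and `25hz` suffixes give the nominal number of speech tokens per second of audio. The sketch below estimates how many codes to expect for an utterance. The checkpoint names are the ones used in the usage sections of this README; the assumption that `speech_tokenizer_v1` is the 50 Hz variant and the helper `approx_num_codes` are illustrative only, and the exact count returned by the tokenizer may differ slightly because of framing and padding.

```python
# Nominal token rates read off the model names (assumption:
# "speech_tokenizer_v1" corresponds to the "V1 50hz" entry).
TOKEN_RATE_HZ = {
    "speech_tokenizer_v1": 50,
    "speech_tokenizer_v1_25hz": 25,
    "speech_tokenizer_v2_25hz": 25,
}


def approx_num_codes(duration_s: float, model_name: str) -> int:
    """Rough estimate of the code sequence length for `duration_s` seconds of audio."""
    return int(duration_s * TOKEN_RATE_HZ[model_name])


# A 4-second utterance yields roughly 200 codes at 50 Hz and 100 codes at 25 Hz.
print(approx_num_codes(4.0, "speech_tokenizer_v1"))
print(approx_num_codes(4.0, "speech_tokenizer_v2_25hz"))
```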
# Setup
```sh
pip install s3tokenizer
```
# Usage-1: Offline batch inference
```py
import s3tokenizer

# Other checkpoints: "speech_tokenizer_v1_25hz", "speech_tokenizer_v2_25hz"
tokenizer = s3tokenizer.load_model("speech_tokenizer_v1").cuda()

mels = []
wav_paths = ["s3tokenizer/assets/BAC009S0764W0121.wav", "s3tokenizer/assets/BAC009S0764W0122.wav"]
for wav_path in wav_paths:
    audio = s3tokenizer.load_audio(wav_path)
    mels.append(s3tokenizer.log_mel_spectrogram(audio))
# Pad the variable-length mel spectrograms into a single batch
mels, mels_lens = s3tokenizer.padding(mels)
codes, codes_lens = tokenizer.quantize(mels.cuda(), mels_lens.cuda())

# Strip the padding from each utterance before printing its codes
for i in range(len(wav_paths)):
    print(codes[i, :codes_lens[i].item()])
```
# Usage-2: Distributed offline batch inference via command-line tools
## 2.1 CPU batch inference
```sh
s3tokenizer --wav_scp xxx.scp \
            --device "cpu" \
            --output_dir "./" \
            --batch_size 32 \
            --model "speech_tokenizer_v1"  # or "speech_tokenizer_v1_25hz" / "speech_tokenizer_v2_25hz"
```
https://github.com/user-attachments/assets/d37d10fd-0e13-46a3-86b0-4cbec309086f
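The `--wav_scp` argument points to a list of audio files. A minimal sketch for generating such a list is shown below, assuming the Kaldi-style layout of one `<utt_id> <wav_path>` pair per line; the helper `write_wav_scp` and the choice of the file stem as utterance ID are hypothetical, so check the format the CLI actually expects before relying on it.

```python
from pathlib import Path


def write_wav_scp(wav_dir: str, scp_path: str) -> int:
    """Write one '<utt_id> <absolute_wav_path>' line per .wav file under wav_dir.

    Returns the number of entries written. Paths containing spaces are not
    handled, since the two fields are separated by a single space.
    """
    wav_files = sorted(Path(wav_dir).rglob("*.wav"))
    with open(scp_path, "w", encoding="utf-8") as f:
        for wav in wav_files:
            # Use the file stem as the utterance ID (illustrative convention).
            f.write(f"{wav.stem} {wav.resolve()}\n")
    return len(wav_files)
```

For example, `write_wav_scp("data/wavs", "xxx.scp")` would produce the file passed to `--wav_scp` above.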
## 2.2 (Multi) GPU batch inference (a.k.a. distributed inference)
```sh
torchrun --nproc_per_node=8 --nnodes=1 \
     --rdzv_id=2024 --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" \
    `which s3tokenizer` --wav_scp xxx.scp \
                --device "cuda" \
                --output_dir "./" \
                --batch_size 32 \
                --model "speech_tokenizer_v1"  # or "speech_tokenizer_v1_25hz" / "speech_tokenizer_v2_25hz"
```
https://github.com/user-attachments/assets/79a3fb11-7199-4ee2-8a35-9682a3b4d94a
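`torchrun` starts one process per GPU and exposes each process's index and the total process count through the `RANK` and `WORLD_SIZE` environment variables. The sketch below shows one simple way a list of utterances can be divided among those processes, using a strided split. It is an illustration of the general idea, not necessarily how the `s3tokenizer` CLI partitions its input, and `shard_for_rank` is a hypothetical helper.

```python
import os
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard_for_rank(items: Sequence[T], rank: int, world_size: int) -> List[T]:
    """Return the items handled by `rank`: every `world_size`-th item, starting at `rank`."""
    return list(items[rank::world_size])


# Under torchrun these variables are set per process; default to a single process otherwise.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

utterances = [f"utt_{i:03d}" for i in range(10)]
print(rank, shard_for_rank(utterances, rank, world_size))
```

With a strided split the shards differ in size by at most one item, so no process is left idle for long at the end of the run.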
## 2.3 Performance Benchmark
| Method | Time cost on Aishell test set | Relative speedup | Miss rate |
|:------:|:----------:|:--------------:|:-----:|
| [[cosyvoice/tools/extract_speech_token.py]](https://github.com/FunAudioLLM/CosyVoice/blob/main/tools/extract_speech_token.py), CPU | 9 h | 1x (baseline) | N/A (reference) |
| CPU, batch size 32 | 1.5 h | ~6x | 0.76% |
| 4 GPUs (3090), batch size 32 per GPU | 41 s | ~790x | 0.76% |
The miss rate represents the proportion of tokens that are inconsistent between the batch inference predictions and the ONNX (batch=1) inference predictions.
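The two derived columns can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions: `miss_rate` and `speedup` are hypothetical helpers, codes are plain lists of integers, and each batch-inference sequence is assumed to have the same length as its ONNX reference (how the benchmark treats length mismatches is not stated here).

```python
from typing import Sequence


def miss_rate(batch_codes: Sequence[Sequence[int]],
              ref_codes: Sequence[Sequence[int]]) -> float:
    """Fraction of tokens that differ between batch inference and the reference."""
    mismatched = 0
    total = 0
    for hyp, ref in zip(batch_codes, ref_codes):
        if len(hyp) != len(ref):
            raise ValueError("sequences must have equal length in this sketch")
        mismatched += sum(h != r for h, r in zip(hyp, ref))
        total += len(ref)
    return mismatched / total if total else 0.0


def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Relative speedup of a run over the baseline wall-clock time."""
    return baseline_seconds / new_seconds


# One of six tokens differs between the two toy outputs.
print(miss_rate([[1, 2, 3, 4], [5, 6]], [[1, 2, 0, 4], [5, 6]]))
# 9 hours versus 41 seconds, as in the table above.
print(round(speedup(9 * 3600, 41)))
```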
# Usage-3: Online speech code extraction
<table>
<tr>
<th>Before (extract code offline)</th>
<th>After (extract code online)</th>
</tr>
<tr>
<td>
<sub>
```py
class SpeechLLM(nn.Module):
    ...
    def __init__(self, ...):
        ...

    def forward(self, speech_codes: Tensor, text_ids: Tensor, ...):
        ...
```
</sub>
</td>
<td>
<sub>
```py
import s3tokenizer

class SpeechLLM(nn.Module):
    ...
    def __init__(self, ...):
        ...
        self.speech_tokenizer = s3tokenizer.load_model("speech_tokenizer_v1")  # or "speech_tokenizer_v1_25hz"
        self.speech_tokenizer.freeze()

    def forward(self, speech: Tensor, speech_lens: Tensor, text_ids: Tensor, ...):
        ...
        speech_codes, speech_codes_lens = self.speech_tokenizer.quantize(speech, speech_lens)
        speech_codes = speech_codes.clone()  # for backward compatibility
        speech_codes_lens = speech_codes_lens.clone()  # for backward compatibility
```
</sub>
</td>
</tr>
</table>
# TODO
- [x] Usage-1: Offline batch inference
- [x] Usage-2: Distributed offline batch inference via command-line tools
- [x] Usage-3: Online speech code extraction
Raw data
{
"_id": null,
"home_page": "https://github.com/xingchensong/S3Tokenizer",
"name": "s3tokenizer",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": null,
"author": "xingchensong",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/95/12/7ccc4834e41e693385c238548943ea4ef57dcc0a03e6ee3bd00f63f7db97/s3tokenizer-0.1.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache2.0",
"summary": "Reverse Engineering of Supervised Semantic Speech Tokenizer (S3Tokenizer) proposed in CosyVoice",
"version": "0.1.1",
"project_urls": {
"Homepage": "https://github.com/xingchensong/S3Tokenizer"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "95127ccc4834e41e693385c238548943ea4ef57dcc0a03e6ee3bd00f63f7db97",
"md5": "d159893d7f5e3f01f9898f31f34f00fd",
"sha256": "867191eebd80c8bd207fbd33764de0e8537e6ff89760ad5099d57eaafa4aa44b"
},
"downloads": -1,
"filename": "s3tokenizer-0.1.1.tar.gz",
"has_sig": false,
"md5_digest": "d159893d7f5e3f01f9898f31f34f00fd",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 217909,
"upload_time": "2024-12-22T03:20:33",
"upload_time_iso_8601": "2024-12-22T03:20:33.036309Z",
"url": "https://files.pythonhosted.org/packages/95/12/7ccc4834e41e693385c238548943ea4ef57dcc0a03e6ee3bd00f63f7db97/s3tokenizer-0.1.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-22 03:20:33",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "xingchensong",
"github_project": "S3Tokenizer",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "pre-commit",
"specs": []
},
{
"name": "numpy",
"specs": []
},
{
"name": "torch",
"specs": []
},
{
"name": "onnx",
"specs": []
},
{
"name": "tqdm",
"specs": []
},
{
"name": "torchaudio",
"specs": []
},
{
"name": "einops",
"specs": []
}
],
"lcname": "s3tokenizer"
}