# WhisperSpeech
<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->
[Open in Colab](https://colab.research.google.com/drive/1xxGlTbwBmaY6GKA24strRixTXGBOlyiw)
[Join us on the LAION Discord](https://discord.gg/FANw4rHD5E)
*If you have questions or want to help, you can find us in the
\#audio-generation channel on the LAION Discord server.*
An Open Source text-to-speech system built by inverting Whisper.
Previously known as **spear-tts-pytorch**.
We want this model to be like Stable Diffusion but for speech – both
powerful and easily customizable.
We work only with properly licensed speech recordings, and all the
code is Open Source, so the model will always be safe to use for
commercial applications.
Currently the models are trained on the English LibriLight dataset. In
the next release we want to target multiple languages (Whisper and
EnCodec are both multilingual).
Sample of the synthesized voice:
https://github.com/collabora/WhisperSpeech/assets/107984/aa5a1e7e-dc94-481f-8863-b022c7fd7434
## Progress update \[2024-01-29\]
We successfully trained a `tiny` S2A model on an en+pl+fr dataset and it
can do voice cloning in French:
https://github.com/collabora/WhisperSpeech/assets/107984/267f2602-7eec-4646-a43b-059ff91b574e
https://github.com/collabora/WhisperSpeech/assets/107984/fbf08e8e-0f9a-4b0d-ab5e-747ffba2ccb9
We were able to do this with frozen semantic tokens that were only
trained on English and Polish. This supports the idea that we will be
able to train a single semantic token model to support all the languages
in the world, quite likely even ones that are not currently well
supported by the Whisper model. Stay tuned for more updates on this
front. :)
## Progress update \[2024-01-18\]
We spent the last week optimizing inference performance. We integrated
`torch.compile`, added kv-caching, and tuned some of the layers – we now
run over 12x faster than real-time on a consumer 4090!
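
For intuition, here is a toy, self-contained sketch of the kv-caching
idea (not the actual WhisperSpeech internals): each decoding step
computes attention only for the newest token against cached keys and
values, and `torch.compile` fuses that hot step into optimized kernels.

``` python
import torch

dim = 64
wq, wk, wv = (torch.nn.Linear(dim, dim) for _ in range(3))

@torch.compile  # fuses the per-token step into optimized kernels
def decode_step(x, k_cache, v_cache):
    # append this token's key/value to the cache: O(1) new work per step
    k_cache = torch.cat([k_cache, wk(x)], dim=1)
    v_cache = torch.cat([v_cache, wv(x)], dim=1)
    # attend from the new token over everything generated so far
    att = (wq(x) @ k_cache.transpose(1, 2)) / dim**0.5
    return att.softmax(-1) @ v_cache, k_cache, v_cache

x = torch.randn(1, 1, dim)      # one new token embedding
k = v = torch.zeros(1, 0, dim)  # empty caches
for _ in range(4):
    x, k, v = decode_step(x, k, v)
```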
We can mix languages in a single sentence (here the highlighted English
project names are seamlessly mixed into Polish speech):
> To jest pierwszy test wielojęzycznego `Whisper Speech` modelu
> zamieniającego tekst na mowę, który `Collabora` i `Laion` nauczyli na
> superkomputerze `Jewels`.

(In English: "This is the first test of the multilingual `Whisper Speech`
text-to-speech model, which `Collabora` and `Laion` trained on the
`Jewels` supercomputer.")
https://github.com/collabora/WhisperSpeech/assets/107984/d7092ef1-9df7-40e3-a07e-fdc7a090ae9e
We also added an easy way to test voice cloning. Here is a sample voice
cloned from [a famous speech by Winston
Churchill](https://en.wikipedia.org/wiki/File:Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg)
(the radio static is a feature, not a bug ;) – it is part of the
reference recording):
https://github.com/collabora/WhisperSpeech/assets/107984/bd28110b-31fb-4d61-83f6-c997f560bc26
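
Outside of Colab, voice cloning boils down to handing the pipeline a
reference recording. A minimal sketch, assuming the `whisperspeech`
package's `Pipeline` API; the exact argument names (especially
`speaker=`) may differ between releases, so check the Colab notebook for
the current interface:

``` python
from whisperspeech.pipeline import Pipeline

pipe = Pipeline()  # downloads the default pre-trained models on first use
pipe.generate_to_file(
    "cloned.wav",
    "We shall never surrender.",
    speaker="reference.ogg",  # hypothetical path: any reference recording
)
```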
You can [test all of these on
Colab](https://colab.research.google.com/drive/1xxGlTbwBmaY6GKA24strRixTXGBOlyiw)
(we optimized the dependencies so installation now takes less than 30
seconds). A Hugging Face Space is coming soon.
## Progress update \[2024-01-10\]
We’ve pushed a new SD S2A model that is a lot faster while still
generating high-quality speech. We’ve also added an example of voice
cloning based on a reference audio file.
As always, you can [check out our
Colab](https://colab.research.google.com/drive/1xxGlTbwBmaY6GKA24strRixTXGBOlyiw)
to try it yourself!
## Progress update \[2023-12-10\]
Another trio of models, this time with support for multiple languages
(English and Polish). Here are two new samples for a sneak peek. You can
[check out our
Colab](https://colab.research.google.com/drive/1xxGlTbwBmaY6GKA24strRixTXGBOlyiw)
to try it yourself!
English speech, female voice (transferred from a Polish language
dataset):
https://github.com/collabora/WhisperSpeech/assets/107984/aa5a1e7e-dc94-481f-8863-b022c7fd7434
A Polish sample, male voice:
https://github.com/collabora/WhisperSpeech/assets/107984/4da14b03-33f9-4e2d-be42-f0fcf1d4a6ec
[Older progress updates are archived
here](https://github.com/collabora/WhisperSpeech/issues/23)
## Downloads
We encourage you to start with the Google Colab link above or run the
provided notebook locally. If you want to download manually or train the
models from scratch then both [the WhisperSpeech pre-trained
models](https://huggingface.co/collabora/whisperspeech) as well as [the
converted
datasets](https://huggingface.co/datasets/collabora/whisperspeech) are
available on HuggingFace.
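
For a minimal local quickstart, something along these lines should work;
the package name comes from PyPI, while the `s2a_ref` model filename
below is only an example, so check the Hugging Face repo for the
currently published files:

``` python
# pip install -U whisperspeech
from whisperspeech.pipeline import Pipeline

# the model reference is an example; see collabora/whisperspeech on
# Hugging Face for the current list of .model files
pipe = Pipeline(s2a_ref="collabora/whisperspeech:s2a-q4-tiny-en+pl.model")
pipe.generate_to_file("hello.wav", "Hello from WhisperSpeech!")
```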
## Roadmap
- [ ] [Gather a bigger emotive speech
dataset](https://github.com/collabora/spear-tts-pytorch/issues/11)
- [ ] Figure out a way to condition the generation on emotions and
prosody
- [ ] Create a community effort to gather freely licensed speech in
multiple languages
- [ ] [Train final multi-language
models](https://github.com/collabora/spear-tts-pytorch/issues/12)
## Architecture
The general architecture is similar to
[AudioLM](https://google-research.github.io/seanet/audiolm/examples/),
[SPEAR TTS](https://google-research.github.io/seanet/speartts/examples/)
from Google and [MusicGen](https://ai.honu.io/papers/musicgen/) from
Meta. We avoided the NIH (not invented here) syndrome and built it on top of powerful Open
Source models: [Whisper](https://github.com/openai/whisper) from OpenAI
to generate semantic tokens and perform transcription,
[EnCodec](https://github.com/facebookresearch/encodec) from Meta for
acoustic modeling and
[Vocos](https://github.com/charactr-platform/vocos) from Charactr Inc as
the high-quality vocoder.
We gave two presentations diving deeper into WhisperSpeech. The first
one covers the challenges of large-scale training:
[Tricks Learned from Scaling WhisperSpeech Models to 80k+ Hours of
Speech – video recording by Jakub Cłapa, Collabora](https://www.youtube.com/watch?v=6Fr-rq-yjXo)
The other one goes a bit more into the architectural choices we made:
[Open Source Text-To-Speech Projects: WhisperSpeech – In-Depth
Discussion](https://www.youtube.com/watch?v=1OBvf33S77Y)
### Whisper for modeling semantic tokens
We utilize the OpenAI Whisper encoder block to generate embeddings which
we then quantize to get semantic tokens.
If the language is already supported by Whisper then this process
requires only audio files (without ground truth transcriptions).
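
As a rough sketch of that process, assuming the `openai-whisper` package
(the random codebook here only stands in for the quantizer that
WhisperSpeech actually trains):

``` python
import torch
import whisper

model = whisper.load_model("base")
audio = whisper.pad_or_trim(whisper.load_audio("speech.wav"))
mel = whisper.log_mel_spectrogram(audio).unsqueeze(0)  # (1, 80, 3000)

with torch.no_grad():
    emb = model.encoder(mel)                           # (1, 1500, 512)

# nearest-neighbor quantization against a (stand-in) codebook
codebook = torch.randn(1024, emb.shape[-1])
semantic_tokens = torch.cdist(emb[0], codebook).argmin(-1)  # one per frame
```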

### EnCodec for modeling acoustic tokens
We use EnCodec to model the audio waveform. Out of the box it delivers
reasonable quality at 1.5 kbps, and we raise this to high quality by
using Vocos – a vocoder pretrained on EnCodec tokens.
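
A sketch of that acoustic path, assuming the `encodec` and `vocos`
packages; `bandwidth_id=0` is assumed to select the 1.5 kbps setting of
the Vocos EnCodec checkpoint:

``` python
import torch
from encodec import EncodecModel
from vocos import Vocos

encodec = EncodecModel.encodec_model_24khz()
encodec.set_target_bandwidth(1.5)  # two codebooks per frame
vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")

wav = torch.randn(1, 1, 24000)     # one second of (placeholder) audio
with torch.no_grad():
    frames = encodec.encode(wav)                          # [(codes, scale)]
    codes = torch.cat([c for c, _ in frames], dim=-1)[0]  # (n_q, T)
    features = vocos.codes_to_features(codes)
    audio = vocos.decode(features, bandwidth_id=torch.tensor([0]))
```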

## Appreciation
[<img height=80 src="https://user-images.githubusercontent.com/107984/229537027-a6d7462b-0c9c-4fd4-b69e-58e98c3ee63f.png" alt="Collabora logo">](https://www.collabora.com) [<img height=80 src="https://user-images.githubusercontent.com/107984/229535036-c741d775-4a9b-4193-89a0-9ddb89ecd011.png" alt="LAION logo">](https://laion.ai)
This work would not be possible without the generous sponsorship of:
- [Collabora](https://www.collabora.com) – code development and model
training
- [LAION](https://laion.ai) – community building and datasets
- [Jülich Supercomputing Centre](https://www.fz-juelich.de/en) – JUWELS
Booster supercomputer
We gratefully acknowledge the Gauss Centre for Supercomputing e.V.
(www.gauss-centre.eu) for funding part of this work by providing
computing time through the John von Neumann Institute for Computing
(NIC) on the GCS Supercomputer JUWELS Booster at Jülich Supercomputing
Centre (JSC), with access to compute provided via LAION cooperation on
foundation models research.
We’d also like to thank individual contributors for their great help in
building this model:
- [inevitable-2031](https://github.com/inevitable-2031) (`qwerty_qwer`
on Discord) for dataset curation
## Consulting
We are available to help you with both Open Source and proprietary AI
projects. You can reach us via the Collabora website or on Discord
([here](https://discordapp.com/users/270267134960074762) and
[here](https://discordapp.com/users/1088938086400016475)).
## Citations
We rely on many amazing Open Source projects and research papers:
``` bibtex
@article{SpearTTS,
  title = {Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision},
  url = {https://arxiv.org/abs/2302.03540},
  author = {Kharitonov, Eugene and Vincent, Damien and Borsos, Zalán and Marinier, Raphaël and Girgin, Sertan and Pietquin, Olivier and Sharifi, Matt and Tagliasacchi, Marco and Zeghidour, Neil},
  publisher = {arXiv},
  year = {2023},
}
```
``` bibtex
@article{MusicGen,
  title = {Simple and Controllable Music Generation},
  url = {https://arxiv.org/abs/2306.05284},
  author = {Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
  publisher = {arXiv},
  year = {2023},
}
```
``` bibtex
@article{Whisper,
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  publisher = {arXiv},
  year = {2022},
}
```
``` bibtex
@article{EnCodec,
  title = {High Fidelity Neural Audio Compression},
  url = {https://arxiv.org/abs/2210.13438},
  author = {Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  publisher = {arXiv},
  year = {2022},
}
```
``` bibtex
@article{Vocos,
  title = {Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  url = {https://arxiv.org/abs/2306.00814},
  author = {Hubert Siuzdak},
  publisher = {arXiv},
  year = {2023},
}
```