# Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
[Audio samples](https://charactr-platform.github.io/vocos/) |
Paper [[abs]](https://arxiv.org/abs/2306.00814) [[pdf]](https://arxiv.org/pdf/2306.00814.pdf)
Vocos is a fast neural vocoder that synthesizes audio waveforms from acoustic features. Trained with a Generative
Adversarial Network (GAN) objective, it generates waveforms in a single forward pass. Unlike typical
GAN-based vocoders, Vocos does not model audio samples in the time domain; instead, it predicts spectral
coefficients, enabling fast audio reconstruction through the inverse Fourier transform.
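As a rough illustration of that final step (a minimal sketch with made-up shapes, not the library's internals): once magnitude and phase coefficients are predicted, the waveform falls out of a single inverse STFT.
```python
import torch

# Illustrative sketch of Fourier-domain synthesis. The shapes and values here
# are hypothetical placeholders, not Vocos internals.
n_fft, hop_length, frames = 1024, 256, 200
mag = torch.rand(1, n_fft // 2 + 1, frames)               # stand-in predicted magnitudes
phase = torch.rand(1, n_fft // 2 + 1, frames) * torch.pi  # stand-in predicted phases
spec = mag * torch.exp(1j * phase)                        # complex spectrogram
audio = torch.istft(spec, n_fft=n_fft, hop_length=hop_length,
                    window=torch.hann_window(n_fft))
```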
## Installation
To use Vocos for inference only, install it with:
```bash
pip install vocos
```
If you wish to train the model, install it with additional dependencies:
```bash
pip install vocos[train]
```
## Usage
### Reconstruct audio from mel-spectrogram
```python
import torch
from vocos import Vocos
vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")
mel = torch.randn(1, 100, 256)  # (batch, n_mels, frames): random placeholder features
audio = vocos.decode(mel)
```
Copy-synthesis from a file:
```python
import torchaudio
y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)
y_hat = vocos(y)
```
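The one-call form above chains feature extraction and decoding. A hedged two-step equivalent, assuming the pretrained model exposes its mel extractor as `vocos.feature_extractor` (check the package if this attribute differs):
```python
# Two-step equivalent of vocos(y): extract mel features, then decode.
# Assumes `vocos.feature_extractor` is the model's own mel extractor.
mel = vocos.feature_extractor(y)  # (batch, n_mels, frames)
y_hat = vocos.decode(mel)
```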
### Reconstruct audio from EnCodec tokens
Additionally, you need to provide a `bandwidth_id`, the index of the target bandwidth in the list of
supported EnCodec bandwidths (in kbps): `[1.5, 3.0, 6.0, 12.0]`.
```python
vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")
audio_tokens = torch.randint(low=0, high=1024, size=(8, 200)) # 8 codebooks, 200 frames
features = vocos.codes_to_features(audio_tokens)
bandwidth_id = torch.tensor([2]) # 6 kbps
audio = vocos.decode(features, bandwidth_id=bandwidth_id)
```
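If you prefer selecting by bandwidth rather than by raw index, a tiny helper that derives the `bandwidth_id` from the list above:
```python
# Map a target bandwidth in kbps to its index in the supported list.
bandwidths = [1.5, 3.0, 6.0, 12.0]  # kbps, as listed above
bandwidth_id = torch.tensor([bandwidths.index(6.0)])  # tensor([2]) -> 6 kbps
```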
Copy-synthesis from a file extracts and quantizes features with EnCodec, then reconstructs them with Vocos, all in a
single forward pass.
```python
y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)
y_hat = vocos(y, bandwidth_id=bandwidth_id)
```
### Integrate with 🐶 [Bark](https://github.com/suno-ai/bark) text-to-audio model
See [example notebook](notebooks/Bark+Vocos.ipynb).
## Pre-trained models
| Model Name                                                                          | Dataset       | Training Iterations | Parameters |
|-------------------------------------------------------------------------------------|---------------|---------------------|------------|
| [charactr/vocos-mel-24khz](https://huggingface.co/charactr/vocos-mel-24khz)         | LibriTTS      | 1M                  | 13.5M      |
| [charactr/vocos-encodec-24khz](https://huggingface.co/charactr/vocos-encodec-24khz) | DNS Challenge | 2M                  | 7.9M       |
## Training
Prepare a filelist of audio files for the training and validation set:
```bash
find "$TRAIN_DATASET_DIR" -name "*.wav" > filelist.train
find "$VAL_DATASET_DIR" -name "*.wav" > filelist.val
```
Fill a config file, e.g. [vocos.yaml](configs/vocos.yaml), with your filelist paths and start training with:
```bash
python train.py -c configs/vocos.yaml
```
Refer to the [PyTorch Lightning documentation](https://lightning.ai/docs/pytorch/stable/) for details about customizing the
training pipeline.
## Citation
If this code contributes to your research, please cite our work:
```bibtex
@article{siuzdak2023vocos,
title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
author={Siuzdak, Hubert},
journal={arXiv preprint arXiv:2306.00814},
year={2023}
}
```
## License
The code in this repository is released under the MIT license as found in the
[LICENSE](LICENSE) file.