# Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
[Audio samples](https://charactr-platform.github.io/vocos/) |
Paper [[abs]](https://arxiv.org/abs/2306.00814) [[pdf]](https://arxiv.org/pdf/2306.00814.pdf)
## Installation
To use Vocos only in inference mode, install it using:
```bash
pip install vocos
```
If you wish to train the model, install it with additional dependencies:
```bash
pip install "vocos[train]"
```
## Usage
### Reconstruct audio from mel-spectrogram
```python
import torch

from vocos import Vocos

vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

# Random tensor standing in for a real mel-spectrogram
mel = torch.randn(1, 100, 256)  # B, C, T (batch, mel channels, frames)

with torch.no_grad():
    audio = vocos.decode(mel)
```
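As a rough sanity check on tensor shapes, the sketch below estimates how much audio a given number of mel frames covers. The hop length of 256 samples per frame is an assumption that this README does not state, so verify it against your model's configuration. The helper name `frames_to_seconds` is hypothetical. The 24 kHz sample rate is taken from the model name.

```python
SAMPLE_RATE = 24000  # taken from the model name ("24khz")
HOP_LENGTH = 256     # ASSUMPTION: samples per frame; check your model config


def frames_to_seconds(
    num_frames: int,
    hop_length: int = HOP_LENGTH,
    sample_rate: int = SAMPLE_RATE,
) -> float:
    """Approximate audio duration covered by `num_frames` spectrogram frames."""
    return num_frames * hop_length / sample_rate


# The example above uses T = 256 frames
print(f"{frames_to_seconds(256):.2f} s")  # 2.73 s
```

Under that assumption, the 256-frame example input would correspond to a little under three seconds of audio.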
Copy-synthesis from a file:
```python
import torchaudio

# YOUR_AUDIO_FILE is a placeholder for the path to your audio file
y, sr = torchaudio.load(YOUR_AUDIO_FILE)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)

with torch.no_grad():
    y_hat = vocos(y)
```
### Reconstruct audio from EnCodec
In addition to the features, you need to provide a `bandwidth_id`: the index into the list of supported bandwidths
(in kbps) `[1.5, 3.0, 6.0, 12.0]`, which selects the corresponding lookup embedding.
```python
vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")

# Random tensor standing in for real quantized EnCodec features
quantized_features = torch.randn(1, 128, 256)
bandwidth_id = torch.tensor([3])  # index 3 -> 12 kbps

with torch.no_grad():
    audio = vocos.decode(quantized_features, bandwidth_id=bandwidth_id)
```
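To avoid hard-coding the index, the mapping from bandwidth to `bandwidth_id` can be sketched as a small lookup. This is only an illustration: `BANDWIDTHS_KBPS` and `bandwidth_to_id` are hypothetical names, not part of the Vocos API, and the list order is taken from the description above.

```python
# Supported bandwidths in kbps, in the order listed above.
# The position in this list is the bandwidth_id.
BANDWIDTHS_KBPS = [1.5, 3.0, 6.0, 12.0]


def bandwidth_to_id(kbps: float) -> int:
    """Return the list index for `kbps` (hypothetical helper, not a Vocos function)."""
    try:
        return BANDWIDTHS_KBPS.index(float(kbps))
    except ValueError:
        raise ValueError(
            f"Unsupported bandwidth {kbps}; choose from {BANDWIDTHS_KBPS}"
        ) from None


print(bandwidth_to_id(12.0))  # 3
```

The resulting integer can then be wrapped as `torch.tensor([bandwidth_to_id(12.0)])` and passed as `bandwidth_id`.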
Copy-synthesis from a file. Extract and quantize features with EnCodec, reconstruct with Vocos:
```python
y, sr = torchaudio.load(YOUR_AUDIO_FILE)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)

with torch.no_grad():
    y_hat = vocos(y, bandwidth_id=bandwidth_id)
```
## Pre-trained models
The provided models were trained for up to 2.5 million generator iterations, which resulted in slightly better
objective scores than those reported in the paper.
| Model Name                                                                          | Dataset       | Training Iterations | Parameters |
|-------------------------------------------------------------------------------------|---------------|---------------------|------------|
| [charactr/vocos-mel-24khz](https://huggingface.co/charactr/vocos-mel-24khz)         | LibriTTS      | 2.5 M               | 13.5 M     |
| [charactr/vocos-encodec-24khz](https://huggingface.co/charactr/vocos-encodec-24khz) | DNS Challenge | 2.5 M               | 7.9 M      |
## Training
Prepare a filelist of audio files for the training and validation set:
```bash
# Quote the pattern so the shell does not expand *.wav before find sees it
find "$TRAIN_DATASET_DIR" -name "*.wav" > filelist.train
find "$VAL_DATASET_DIR" -name "*.wav" > filelist.val
```
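If `find` is not available (for example on Windows), the same filelists can be produced from Python. The sketch below is an assumed equivalent of the commands above, not a script shipped with the repository; `write_filelist` is a hypothetical helper, and the paths are sorted only to make the output reproducible.

```python
from pathlib import Path


def write_filelist(dataset_dir: str, output_path: str) -> int:
    """Write every .wav path under `dataset_dir` to `output_path`, one per line.

    Returns the number of files listed. Hypothetical helper for illustration.
    """
    # rglob searches recursively, like `find <dir> -name "*.wav"`
    wav_paths = sorted(str(p) for p in Path(dataset_dir).rglob("*.wav"))
    Path(output_path).write_text("".join(f"{p}\n" for p in wav_paths))
    return len(wav_paths)
```

For example, `write_filelist("data/train", "filelist.train")` would list every `.wav` file under a `data/train` directory, where that directory name is a placeholder for your own dataset location.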
Fill in a config file, e.g. [vocos.yaml](configs/vocos.yaml), with your filelist paths and start training with:
```bash
python train.py -c configs/vocos.yaml
```
Refer to the [PyTorch Lightning documentation](https://lightning.ai/docs/pytorch/stable/) for details about customizing
the training pipeline.
## Citation
If you use this code or results in your paper, please cite our work as:
```
@article{siuzdak2023vocos,
  title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  author={Siuzdak, Hubert},
  journal={arXiv preprint arXiv:2306.00814},
  year={2023}
}
```
## License
The code in this repository is released under the MIT license as found in the
[LICENSE](LICENSE) file.