vocos

- Name: vocos
- Version: 0.1.0
- Summary: Fourier-based neural vocoder for high-quality audio synthesis
- Author: Hubert Siuzdak
- Home page: https://github.com/charactr-platform/vocos
- Upload time: 2023-10-14 00:57:29
# Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

[Audio samples](https://charactr-platform.github.io/vocos/) |
Paper [[abs]](https://arxiv.org/abs/2306.00814) [[pdf]](https://arxiv.org/pdf/2306.00814.pdf)

Vocos is a fast neural vocoder designed to synthesize audio waveforms from acoustic features. Trained with a Generative
Adversarial Network (GAN) objective, Vocos can generate waveforms in a single forward pass. Unlike typical
GAN-based vocoders, Vocos does not model audio samples in the time domain. Instead, it generates spectral
coefficients, enabling fast audio reconstruction through the inverse Fourier transform.
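
To make that concrete, here is a minimal, self-contained sketch of the reconstruction step: a model head predicts per-frame magnitude and phase, and the waveform is recovered with a single inverse STFT. The random tensors and sizes below are illustrative stand-ins for the model's predictions, not the actual Vocos architecture:

```python
import torch

# Illustrative sizes only; Vocos's real head predicts these coefficients
# from acoustic features with a learned backbone.
n_fft, hop_length, frames = 1024, 256, 256
magnitude = torch.rand(1, n_fft // 2 + 1, frames)
phase = (torch.rand(1, n_fft // 2 + 1, frames) * 2 - 1) * torch.pi

# Combine into a complex spectrogram and invert it in one shot.
spec = magnitude * torch.exp(1j * phase)
audio = torch.istft(spec, n_fft=n_fft, hop_length=hop_length,
                    window=torch.hann_window(n_fft))
```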

## Installation

To use Vocos in inference mode only, install it with:

```bash
pip install vocos
```

If you wish to train the model, install it with additional dependencies:

```bash
pip install "vocos[train]"
```

## Usage

### Reconstruct audio from mel-spectrogram

```python
import torch

from vocos import Vocos

vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

mel = torch.randn(1, 100, 256)  # (batch, mel bins, frames); this model expects 100 mel bins
audio = vocos.decode(mel)
```
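
The decoded `audio` is a tensor of shape `(batch, samples)` at 24 kHz, so with a batch of one it can be written straight to disk. A minimal sketch (the output filename is arbitrary):

```python
import torchaudio

# `audio` from vocos.decode has shape (1, samples); torchaudio.save
# expects (channels, frames), which matches for a single-item batch.
torchaudio.save("reconstructed.wav", audio.cpu(), sample_rate=24000)
```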

Copy-synthesis from a file:

```python
import torchaudio

y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)
y_hat = vocos(y)
```
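
The one-liner `vocos(y)` bundles feature extraction and decoding. Assuming the forward pass is exactly those two steps (as the examples above suggest), it can also be written out explicitly, e.g. to inspect or modify the intermediate mel features:

```python
# Equivalent two-step form: extract mel features, then decode them back to audio.
features = vocos.feature_extractor(y)
y_hat = vocos.decode(features)
```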

### Reconstruct audio from EnCodec tokens

In addition to the audio tokens, you need to provide a `bandwidth_id`: the index of the target bandwidth (in kbps)
in the list `[1.5, 3.0, 6.0, 12.0]`.

```python
vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")

audio_tokens = torch.randint(low=0, high=1024, size=(8, 200))  # 8 codebooks, 200 frames
features = vocos.codes_to_features(audio_tokens)
bandwidth_id = torch.tensor([2])  # 6 kbps

audio = vocos.decode(features, bandwidth_id=bandwidth_id)
```
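
Since `bandwidth_id` is just an index into that list, it can be derived from the target bandwidth. Note that the number of codebooks has to be consistent with the chosen bandwidth (for 24 kHz EnCodec, roughly 2, 4, 8, and 16 codebooks at 1.5, 3, 6, and 12 kbps). A small illustration:

```python
bandwidths = [1.5, 3.0, 6.0, 12.0]  # kbps, in the order the model expects
bandwidth_id = torch.tensor([bandwidths.index(6.0)])  # -> tensor([2])
```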

Copy-synthesis from a file: the model extracts and quantizes features with EnCodec, then reconstructs them with Vocos
in a single forward pass.

```python
y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)

y_hat = vocos(y, bandwidth_id=bandwidth_id)
```
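
To run the token example on real audio rather than random codes, the discrete tokens can be obtained with Meta's `encodec` package (`pip install encodec`). The sketch below is an assumption about how the two libraries fit together, not part of the Vocos API:

```python
import torch
from encodec import EncodecModel

# Load the 24 kHz EnCodec model and pick the bandwidth matching bandwidth_id = 2.
encodec_model = EncodecModel.encodec_model_24khz()
encodec_model.set_target_bandwidth(6.0)  # kbps

with torch.no_grad():
    # encode() expects (batch, channels, samples) and returns a list of
    # (codes, scale) frames; the non-streaming 24 kHz model yields one frame.
    encoded_frames = encodec_model.encode(y.unsqueeze(0))
codes = encoded_frames[0][0].squeeze(0)  # (n_codebooks, frames)

features = vocos.codes_to_features(codes)
audio = vocos.decode(features, bandwidth_id=torch.tensor([2]))
```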

### Integrate with 🐶 [Bark](https://github.com/suno-ai/bark) text-to-audio model

See [example notebook](notebooks/Bark+Vocos.ipynb).

## Pre-trained models

| Model Name                                                                          | Dataset       | Training Iterations | Parameters |
|-------------------------------------------------------------------------------------|---------------|---------------------|------------|
| [charactr/vocos-mel-24khz](https://huggingface.co/charactr/vocos-mel-24khz)         | LibriTTS      | 1M                  | 13.5M      |
| [charactr/vocos-encodec-24khz](https://huggingface.co/charactr/vocos-encodec-24khz) | DNS Challenge | 2M                  | 7.9M       |

## Training

Prepare a filelist of audio files for the training and validation set:

```bash
find $TRAIN_DATASET_DIR -name '*.wav' > filelist.train
find $VAL_DATASET_DIR -name '*.wav' > filelist.val
```
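
Before launching a run, it can be worth checking that every listed file is actually readable. A small sanity check, not part of the Vocos tooling:

```python
import torchaudio

# Fails fast on unreadable or empty files instead of mid-training.
with open("filelist.train") as f:
    for path in map(str.strip, f):
        info = torchaudio.info(path)  # raises if the file cannot be opened
        assert info.num_frames > 0, f"empty audio file: {path}"
```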

Fill out a config file, e.g. [vocos.yaml](configs/vocos.yaml), with your filelist paths and start training with:

```bash
python train.py -c configs/vocos.yaml
```

Refer to the [PyTorch Lightning documentation](https://lightning.ai/docs/pytorch/stable/) for details about customizing the
training pipeline.

## Citation

If this code contributes to your research, please cite our work:

```bibtex
@article{siuzdak2023vocos,
  title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  author={Siuzdak, Hubert},
  journal={arXiv preprint arXiv:2306.00814},
  year={2023}
}
```

## License

The code in this repository is released under the MIT license as found in the
[LICENSE](LICENSE) file.

            
