diffwave

- Name: diffwave
- Version: 0.1.6
- Home page: https://www.lmnt.com
- Summary: diffwave
- Author: LMNT, Inc.
- License: Apache 2.0
- Keywords: diffwave, machine learning, neural vocoder, tts, speech
- Upload time: 2020-10-14 17:32:13

# DiffWave
![PyPI Release](https://img.shields.io/pypi/v/diffwave?label=release) [![License](https://img.shields.io/github/license/lmnt-com/diffwave)](https://github.com/lmnt-com/diffwave/blob/master/LICENSE)

DiffWave is a fast, high-quality neural vocoder and waveform synthesizer. It starts with Gaussian noise and converts it into speech via iterative refinement. The speech can be controlled by providing a conditioning signal (e.g. log-scaled Mel spectrogram). The model and architecture details are described in [DiffWave: A Versatile Diffusion Model for Audio Synthesis](https://arxiv.org/pdf/2009.09761.pdf).
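
Intuitively, synthesis runs the sampling loop of a denoising diffusion model (see the papers in the References section). The following is an illustrative sketch of that loop, not the package's actual implementation; the `model` signature and the `betas` noise schedule are placeholders:

```python
import torch

def reverse_diffusion(model, spectrogram, betas, audio_len):
    """Illustrative DDPM-style sampler: start from noise and denoise step by step."""
    alphas = 1.0 - betas                                      # betas: 1-D tensor noise schedule
    alpha_bar = torch.cumprod(alphas, dim=0)
    audio = torch.randn(spectrogram.shape[0], audio_len)      # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = model(audio, spectrogram, t)                    # network predicts the added noise
        audio = (audio - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:                                             # re-inject noise except at the final step
            audio += torch.sqrt(betas[t]) * torch.randn_like(audio)
    return audio
```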

## What's new (2020-10-14)
- new pretrained model trained for 1M steps
- updated audio samples with output from new model

## Status (2020-10-14)
- [x] stable training
- [x] high-quality synthesis
- [x] mixed-precision training
- [x] multi-GPU training
- [x] command-line inference
- [x] programmatic inference API
- [x] PyPI package
- [x] audio samples
- [x] pretrained models
- [ ] unconditional waveform synthesis

Big thanks to [Zhifeng Kong](https://github.com/FengNiMa) (lead author of DiffWave) for pointers and bug fixes.

## Audio samples
[22.05 kHz audio samples](https://lmnt.com/assets/diffwave)

## Pretrained models
[22.05 kHz pretrained model](https://lmnt.com/assets/diffwave/diffwave-ljspeech-22kHz-1000578.pt) (31 MB, SHA256: `d415d2117bb0bba3999afabdd67ed11d9e43400af26193a451d112e2560821a8`)

This pre-trained model synthesizes speech with a real-time factor of 0.87 (smaller is faster).
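
To confirm the download is intact, you can check it against the SHA256 above. A minimal sketch using only the Python standard library (the local filename is whatever you saved the download as):

```python
import hashlib

EXPECTED = 'd415d2117bb0bba3999afabdd67ed11d9e43400af26193a451d112e2560821a8'

# Read the downloaded checkpoint and compare its SHA256 digest.
with open('diffwave-ljspeech-22kHz-1000578.pt', 'rb') as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print('OK' if digest == EXPECTED else 'checksum mismatch')
```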

### Pre-trained model details
- trained on 4x 1080Ti
- default parameters
- single precision floating point (FP32)
- trained on LJSpeech dataset excluding LJ001* and LJ002*
- trained for 1000578 steps (1273 epochs)

## Install

Install using pip:
```
pip install diffwave
```

or from GitHub:
```
git clone https://github.com/lmnt-com/diffwave.git
cd diffwave
pip install .
```

### Training
Before you start training, you'll need to prepare a training dataset. The dataset can have any directory structure as long as the contained .wav files are 16-bit mono (e.g. [LJSpeech](https://keithito.com/LJ-Speech-Dataset/), [VCTK](https://pytorch.org/audio/_modules/torchaudio/datasets/vctk.html)). By default, this implementation assumes a sample rate of 22.05 kHz. If you need to change this value, edit [params.py](https://github.com/lmnt-com/diffwave/blob/master/src/diffwave/params.py).
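
If your source audio is not already 16-bit mono at the target sample rate, you can convert it first. A rough sketch, assuming a recent torchaudio (paths are placeholders, and 22050 should match the value in params.py):

```python
import torchaudio

# Load, downmix to mono, resample, and write back as 16-bit PCM.
waveform, sr = torchaudio.load('input.wav')              # [channels, samples]
waveform = waveform.mean(dim=0, keepdim=True)            # mono
waveform = torchaudio.transforms.Resample(sr, 22050)(waveform)
torchaudio.save('output.wav', waveform, 22050,
                encoding='PCM_S', bits_per_sample=16)
```

With the data in place, preprocessing and training are run as follows: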

```
python -m diffwave.preprocess /path/to/dir/containing/wavs
python -m diffwave /path/to/model/dir /path/to/dir/containing/wavs

# in another shell to monitor training progress:
tensorboard --logdir /path/to/model/dir --bind_all
```

You should expect to hear intelligible (but noisy) speech by ~8k steps (~1.5h on a 2080 Ti).

#### Multi-GPU training
By default, this implementation uses as many GPUs in parallel as returned by [`torch.cuda.device_count()`](https://pytorch.org/docs/stable/cuda.html#torch.cuda.device_count). You can specify which GPUs to use by setting the [`CUDA_VISIBLE_DEVICES`](https://developer.nvidia.com/blog/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/) environment variable before running the training module.

### Inference API
Basic usage:

```python
from diffwave.inference import predict as diffwave_predict

model_dir = '/path/to/model/dir'
spectrogram = ...  # get your hands on a spectrogram in [N,C,W] format
audio, sample_rate = diffwave_predict(spectrogram, model_dir)

# audio is a GPU tensor in [N,T] format.
```
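
For example, the returned tensor can be moved to the CPU and written to disk with torchaudio. A minimal sketch; loading the spectrogram from a `.pt` file is a placeholder, not part of the package API:

```python
import torch
import torchaudio

from diffwave.inference import predict as diffwave_predict

model_dir = '/path/to/model/dir'

# Placeholder: a spectrogram tensor prepared elsewhere, shaped [N, C, W].
spectrogram = torch.load('/path/to/spectrogram.pt')

audio, sample_rate = diffwave_predict(spectrogram, model_dir)

# audio is a GPU tensor in [N, T]; move it to the CPU before saving.
torchaudio.save('output.wav', audio.cpu(), sample_rate)
```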

### Inference CLI
```
python -m diffwave.inference /path/to/model /path/to/spectrogram -o output.wav
```

## References
- [DiffWave: A Versatile Diffusion Model for Audio Synthesis](https://arxiv.org/pdf/2009.09761.pdf)
- [Denoising Diffusion Probabilistic Models](https://arxiv.org/pdf/2006.11239.pdf)
- [Code for Denoising Diffusion Probabilistic Models](https://github.com/hojonathanho/diffusion)
            
