# PhaseAug
**PhaseAug: A Differentiable Augmentation for Speech Synthesis to Simulate One-to-Many Mapping**<br>
Junhyeok Lee, Seungu Han, Hyunjae Cho, Wonbin Jung @ [MINDsLab Inc.](https://github.com/mindslab-ai), SNU, KAIST
[![arXiv](https://img.shields.io/badge/arXiv-2211.04610-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2211.04610) [![GitHub Repo stars](https://img.shields.io/github/stars/mindslab-ai/phaseaug?color=yellow&label=PhaseAug&logo=github&style=flat-square)](https://github.com/mindslab-ai/phaseaug) [![githubio](https://img.shields.io/badge/GitHub.io-Audio_Samples-blue?logo=Github&style=flat-square)](https://mindslab-ai.github.io/phaseaug/)
**Abstract** : Previous generative adversarial network (GAN)-based neural vocoders are trained to reconstruct the exact ground truth waveform from the paired mel-spectrogram and do not consider the one-to-many relationship of speech synthesis. This conventional training causes overfitting for both the discriminators and the generator, leading to the periodicity artifacts in the generated audio signal. In this work, we present PhaseAug, the first differentiable augmentation for speech synthesis that rotates the phase of each frequency bin to simulate one-to-many mapping. With our proposed method, we outperform baselines without any architecture modification. Code and audio samples will be available at https://github.com/mindslab-ai/phaseaug.
Accepted to ICASSP 2023
![phasor](asset/phaseaug_phasor.png)
## TODO
- [x] PyTorch 2.0 is released; modify STFT and iSTFT for complex support (solved in `1.0.0`)
- [x] arXiv preprint updated
- [x] Fix erratum in Section 2.5 of the paper: transition band half-width 0.06 -> 0.012
- [x] Section 2.5: clarify that the rotation matrix is multiplied on the left side of F(x) -> transpose m, k to reduce ambiguity
- [x] Upload PhaseAug to [pypi](https://pypi.org/project/phaseaug/)
- [x] Upload [VITS](https://arxiv.org/abs/2106.06103)+PhaseAug samples to the demo page
- [x] Refactor code for packaging
## Use PhaseAug
- Install `alias-free-torch==0.0.6` and `phaseaug`
```bash
pip install alias-free-torch==0.0.6 phaseaug
```
- Insert PhaseAug in your code; see [train.py](./train.py) as an example.
```python
from phaseaug.phaseaug import PhaseAug
...
# define phaseaug
aug = PhaseAug()
...
# discriminator update phase
aug_y, aug_y_g = aug.forward_sync(y, y_g_hat.detach())
y_df_hat_r, y_df_hat_g, _, _ = mpd(aug_y, aug_y_g)
y_ds_hat_r, y_ds_hat_g, _, _ = msd(aug_y, aug_y_g)
...
# generator update phase
aug_y, aug_y_g = aug.forward_sync(y, y_g_hat)
y_df_hat_r, y_df_hat_g, fmap_f_r, fmap_f_g = mpd(aug_y, aug_y_g)
y_ds_hat_r, y_ds_hat_g, fmap_s_r, fmap_s_g = msd(aug_y, aug_y_g)
```
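For intuition, the core operation can be sketched outside the package: PhaseAug rotates the phase of each frequency bin while leaving the magnitude spectrum (and hence the mel-spectrogram target) unchanged. The NumPy sketch below is illustrative only and is not the package's API; the real implementation applies a differentiable, smoothed phase rotation via PyTorch STFT/iSTFT.

```python
# Conceptual sketch of PhaseAug's core idea in NumPy. Illustrative only --
# the actual package works on batched waveforms with a differentiable
# PyTorch STFT/iSTFT, not this exact API.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)              # a waveform frame

X = np.fft.rfft(x)                         # frequency bins
phi = rng.uniform(-np.pi, np.pi, X.shape)  # random rotation angle per bin
phi[0] = phi[-1] = 0.0                     # DC/Nyquist bins must stay real
x_aug = np.fft.irfft(X * np.exp(1j * phi), n=len(x))  # rotate phase, invert

# The magnitude spectrum is unchanged, so the mel-spectrogram condition is
# preserved while the waveform itself differs: a one-to-many mapping.
assert np.allclose(np.abs(np.fft.rfft(x_aug)), np.abs(X))
```

Because `forward_sync` applies the same rotation to the real and generated waveforms, the discriminator sees augmented pairs that remain consistent with the same mel-spectrogram.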
- If you are applying `torch.cuda.amp.autocast` as in [VITS](https://github.com/jaywalnut310/vits), wrap PhaseAug with `autocast(enabled=False)` to prevent the [ComplexHalf issue](https://github.com/jaywalnut310/vits/issues/15).
```python
from torch.cuda.amp import autocast
...
with autocast(enabled=True):
    # wrap PhaseAug with autocast(enabled=False)
    with autocast(enabled=False):
        aug_y, aug_y_g = aug.forward_sync(y, y_g_hat)
    # the net_d parts usually run inside autocast(enabled=True)
    y_df_hat_r, y_df_hat_g, fmap_f_r, fmap_f_g = net_d(aug_y, aug_y_g)
```
## Requirements
- [PyTorch>=1.7.0](https://pytorch.org/) for [alias-free-torch](https://github.com/junjun3518/alias-free-torch)
- PyTorch>=2.0.0 is supported
- The requirements are listed in [requirements.txt](./requirements.txt).
- We also provide a Docker setup: [Dockerfile](./Dockerfile).
```bash
docker build -t=phaseaug --build-arg USER_ID=$(id -u) --build-arg GROUP_ID=$(id -g) --build-arg USER_NAME=$USER .
```
- A clone of the [official HiFi-GAN repo](https://github.com/jik876/hifi-gan).
- The [LJ Speech Dataset](https://keithito.com/LJ-Speech-Dataset/), downloaded.
- (optional) The [MelGAN](https://github.com/descriptinc/melgan-neurips) generator
## Training
0. Clone this repository and copy the Python files into the hifi-gan folder
```bash
git clone --recursive https://github.com/mindslab-ai/phaseaug
cp ./phaseaug/*.py ./phaseaug/hifi-gan/
cd ./phaseaug/hifi-gan
```
- optional: MelGAN generator
```bash
cp ./phaseaug/config_v1_melgan.json ./phaseaug/hifi-gan/
```
- Change the generator to the MelGAN generator in `train.py`
```python
# in train.py, import MelGANGenerator as Generator
# from models import Generator  # remove the original Generator import
from models import MelGANGenerator as Generator
```
1. Modify the dataset paths in `train.py`
```python
parser.add_argument('--input_wavs_dir',
                    default='path/LJSpeech-1.1/wavs_22k')
parser.add_argument('--input_mels_dir',
                    default='path/LJSpeech-1.1/wavs_22k')
```
2. Run `train.py`
```
python train.py --config config_v1.json --aug --filter --data_ratio {0.01/0.1/1.} --name phaseaug_hifigan
```
```
python train.py --config config_v1_melgan.json --aug --filter --data_ratio {0.01/0.1/1.} --name phaseaug_melgan
```
## References
This implementation uses code from the following repositories:
- [Official HiFi-GAN implementation](https://github.com/jik876/hifi-gan)
- [Official MelGAN implementation](https://github.com/descriptinc/melgan-neurips)
- [Official CARGAN implementation](https://github.com/descriptinc/cargan)
- [alias-free-torch](https://github.com/junjun3518/alias-free-torch)
This README and the webpage for the audio samples are inspired by:
- [LJ Speech Dataset](https://keithito.com/LJ-Speech-Dataset/)
- [Official HiFi-GAN implementation](https://github.com/jik876/hifi-gan)
## Citation & Contact
If this repository is useful for your research, please consider citing!
```bib
@article{phaseaug,
  author={Lee, Junhyeok and Han, Seungu and Cho, Hyunjae and Jung, Wonbin},
  title={{PhaseAug: A Differentiable Augmentation for Speech Synthesis to Simulate One-to-Many Mapping}},
  journal={arXiv preprint arXiv:2211.04610},
  year={2022},
}
```
The BibTeX entry will be updated after ICASSP 2023.
If you have a question or any inquiries, please contact Junhyeok Lee at [jun3518@icloud.com](mailto:jun3518@icloud.com)