# AudioLDM 2
[arXiv](https://arxiv.org/abs/2308.05734) [Project Page](https://audioldm.github.io/audioldm2/) [Hugging Face Space](https://huggingface.co/spaces/haoheliu/audioldm2-text2audio-text2music)
This repo currently supports Text-to-Audio (including music) and Text-to-Speech generation.
<hr>
## Change Log
- 2023-08-27: Added two new checkpoints!
  - 🌟 **48kHz AudioLDM model**: We now support high-fidelity audio generation! Use this checkpoint simply by setting "--model_name audioldm_48k"
  - **16kHz improved AudioLDM model**: Trained with more data and an optimized model architecture.
## TODO
- [x] Add the text-to-speech checkpoint
- [ ] Open-source the AudioLDM training code.
- [x] Support the generation of longer audio (> 10s)
- [x] Optimize the inference speed of the model.
- [ ] Integration with the Diffusers library
## Web App
1. Prepare running environment
```shell
conda create -n audioldm python=3.8; conda activate audioldm
pip3 install git+https://github.com/haoheliu/AudioLDM2.git
git clone https://github.com/haoheliu/AudioLDM2; cd AudioLDM2
```
2. Start the web application (powered by Gradio)
```shell
python3 app.py
```
3. A link will be printed out. Click it to open the app in your browser.
## Command-line Usage
### Installation
Prepare running environment
```shell
# Optional
conda create -n audioldm python=3.8; conda activate audioldm
# Install AudioLDM
pip3 install git+https://github.com/haoheliu/AudioLDM2.git
```
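After installation, you can verify that the command-line tool is available by printing its help message (the full option list is also shown under "Other options" below):
```shell
audioldm2 -h
```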
If you plan to play around with text-to-speech generation, please also make sure you have installed [espeak](https://espeak.sourceforge.net/download.html). On Linux you can install it with:
```shell
sudo apt-get install espeak
```
### Run the model from the command line
- Generate a sound effect or music based on a text prompt
```shell
audioldm2 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
```
- Generate sound effects or music based on a list of text prompts (an example list file is shown below)
```shell
audioldm2 -tl batch.lst
```
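The text list is a plain-text file. For example, a hypothetical *batch.lst*, assuming one prompt per line, could look like:
```
A dog barking in the distance.
Rain falling on a tin roof and distant thunder.
An orchestra tuning up before a performance.
```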
- Generate speech based on (1) the transcription and (2) the description of the speaker
```shell
audioldm2 -t "A female reporter is speaking full of emotion" --transciption "Wish you have a good day"
audioldm2 -t "A female reporter is speaking" --transciption "Wish you have a good day"
```
Text-to-speech uses the *audioldm2-speech-gigaspeech* checkpoint by default. If you would like to run TTS with the LJSpeech pretrained checkpoint, simply set *--model_name audioldm2-speech-ljspeech*.
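For example, to run TTS with the LJSpeech checkpoint on the same sentence:
```shell
audioldm2 --model_name audioldm2-speech-ljspeech -t "A female reporter is speaking" --transcription "Wish you have a good day"
```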
## Random Seed Matters
The model may sometimes perform poorly (the output sounds weird or is low quality) when run on different hardware. In that case, please adjust the random seed to find one that works well for your hardware.
```shell
audioldm2 --seed 1234 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
```
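If you want to compare several seeds, one option is a simple shell loop that writes each run to its own folder via "-s" (a sketch; the seed values and folder names here are arbitrary):
```shell
# Try a few seeds and save each result to a separate folder
for seed in 0 42 1234 99999; do
  audioldm2 --seed $seed -s "./output/seed_$seed" \
    -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
done
```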
## Pretrained Models
You can choose a model checkpoint by setting "--model_name":
```shell
# CUDA
audioldm2 --model_name "audioldm_48k" --device cuda -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
# MPS
audioldm2 --model_name "audioldm_48k" --device mps -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
```
We have seven checkpoints you can choose from:

1. **audioldm_48k** (default): This checkpoint can generate high-fidelity sound effects and music.
2. **audioldm2-full**: Generates both sound effects and music with the AudioLDM 2 architecture.
3. **audioldm_16k_crossattn_t5**: The improved version of [AudioLDM 1.0](https://github.com/haoheliu/AudioLDM).
4. **audioldm2-full-large-1150k**: A larger version of audioldm2-full.
5. **audioldm2-music-665k**: Music generation.
6. **audioldm2-speech-gigaspeech** (default for TTS): Text-to-speech, trained on the GigaSpeech dataset.
7. **audioldm2-speech-ljspeech**: Text-to-speech, trained on the LJSpeech dataset.
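For instance, to generate music with the dedicated music checkpoint:
```shell
audioldm2 --model_name "audioldm2-music-665k" -t "A gentle piano melody accompanied by soft strings."
```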
We currently support 3 devices:
- cpu
- cuda
- mps (note that computation on MPS requires about 20 GB of RAM)
## Other options
```shell
usage: audioldm2 [-h] [-t TEXT] [-tl TEXT_LIST] [-s SAVE_PATH]
[--model_name {audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}] [-d DEVICE]
[-b BATCHSIZE] [--ddim_steps DDIM_STEPS] [-gs GUIDANCE_SCALE] [-n N_CANDIDATE_GEN_PER_TEXT]
[--seed SEED]
optional arguments:
-h, --help show this help message and exit
-t TEXT, --text TEXT Text prompt to the model for audio generation
--transcription TRANSCRIPTION
Transcription used for speech synthesis
-tl TEXT_LIST, --text_list TEXT_LIST
A file that contains text prompts for audio generation
-s SAVE_PATH, --save_path SAVE_PATH
The path to save model output
--model_name {audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}
The checkpoint you are going to use
-d DEVICE, --device DEVICE
The device for computation. If not specified, the script will automatically choose the device based on your environment. [cpu, cuda, mps, auto]
-b BATCHSIZE, --batchsize BATCHSIZE
How many samples to generate at the same time
--ddim_steps DDIM_STEPS
The number of sampling steps for DDIM
-gs GUIDANCE_SCALE, --guidance_scale GUIDANCE_SCALE
Guidance scale (large => better quality and relevance to the text; small => better diversity)
-n N_CANDIDATE_GEN_PER_TEXT, --n_candidate_gen_per_text N_CANDIDATE_GEN_PER_TEXT
Automatic quality control. This number controls the number of candidates (e.g., generate three audios and choose the best to show you). A larger value usually leads to better quality with
heavier computation
--seed SEED Changing this value (any integer) will lead to a different generation result.
```
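Putting several of these options together, a typical invocation might look like this (the parameter values here are for illustration only):
```shell
audioldm2 -t "Thunder rumbling over a quiet valley." \
  -s ./output \
  --ddim_steps 200 \
  -gs 3.5 \
  -n 3 \
  --seed 42
```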
## Cite this work
If you find this tool useful, please consider citing:
```bibtex
@article{liu2023audioldm2,
title={{AudioLDM 2}: Learning Holistic Audio Generation with Self-supervised Pretraining},
author={Haohe Liu and Qiao Tian and Yi Yuan and Xubo Liu and Xinhao Mei and Qiuqiang Kong and Yuping Wang and Wenwu Wang and Yuxuan Wang and Mark D. Plumbley},
journal={arXiv preprint arXiv:2308.05734},
year={2023}
}
```
```bibtex
@article{liu2023audioldm,
title={{AudioLDM}: Text-to-Audio Generation with Latent Diffusion Models},
author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
journal={Proceedings of the International Conference on Machine Learning},
year={2023}
}
```