# AudioLDM 2
[arXiv](https://arxiv.org/abs/2308.05734) [Project Page](https://audioldm.github.io/audioldm2/) [Hugging Face Space](https://huggingface.co/spaces/haoheliu/audioldm2-text2audio-text2music)
This repo currently supports Text-to-Audio (including music) and Text-to-Speech generation.
<hr>
## Change Log
- 2023-08-27: Added two new checkpoints!
  - 🌟 **48kHz AudioLDM model**: We now support high-fidelity audio generation! Use this checkpoint simply by setting "--model_name audioldm_48k"
  - **16kHz improved AudioLDM model**: Trained with more data and an optimized model architecture.
## TODO
- [x] Add the text-to-speech checkpoint
- [ ] Open-source the AudioLDM training code.
- [x] Support the generation of longer audio (> 10s)
- [x] Optimize the inference speed of the model.
- [ ] Integration with the Diffusers library
## Web App
1. Prepare running environment
```shell
conda create -n audioldm python=3.8; conda activate audioldm
pip3 install git+https://github.com/haoheliu/AudioLDM2.git
git clone https://github.com/haoheliu/AudioLDM2; cd AudioLDM2
```
2. Start the web application (powered by Gradio)
```shell
python3 app.py
```
3. A link will be printed out. Click it to open the app in your browser.
## Command-line Usage
### Installation
Prepare running environment
```shell
# Optional
conda create -n audioldm python=3.8; conda activate audioldm
# Install AudioLDM
pip3 install git+https://github.com/haoheliu/AudioLDM2.git
```
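After installation, you can verify that the command-line tool is available by printing its help message (the full option list is also shown under "Other options" below):
```shell
audioldm2 -h
```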
If you plan to play around with text-to-speech generation, please also make sure you have installed [espeak](https://espeak.sourceforge.net/download.html). On Linux you can install it with:
```shell
sudo apt-get install espeak
```
### Run the model from the command line
- Generate a sound effect or music based on a text prompt
```shell
audioldm2 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
```
- Generate sound effects or music based on a list of text prompts (an example list file is shown below)
```shell
audioldm2 -tl batch.lst
```
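The text list is a plain-text file. For example, a hypothetical *batch.lst*, assuming one prompt per line, could look like:
```
A dog barking in the distance.
Rain falling on a tin roof and distant thunder.
An orchestra tuning up before a performance.
```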
- Generate speech based on (1) the transcription and (2) the description of the speaker
```shell
audioldm2 -t "A female reporter is speaking full of emotion" --transciption "Wish you have a good day"
audioldm2 -t "A female reporter is speaking" --transciption "Wish you have a good day"
```
Text-to-speech uses the *audioldm2-speech-gigaspeech* checkpoint by default. If you would like to run TTS with the LJSpeech pretrained checkpoint, simply set *--model_name audioldm2-speech-ljspeech*.
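For example, to run TTS with the LJSpeech checkpoint on the same sentence:
```shell
audioldm2 --model_name audioldm2-speech-ljspeech -t "A female reporter is speaking" --transcription "Wish you have a good day"
```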
## Random Seed Matters
The model may sometimes perform poorly (the output sounds weird or is low quality) when run on different hardware. In that case, please adjust the random seed to find one that works well for your hardware.
```shell
audioldm2 --seed 1234 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
```
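If you want to compare several seeds, one option is a simple shell loop that writes each run to its own folder via "-s" (a sketch; the seed values and folder names here are arbitrary):
```shell
# Try a few seeds and save each result to a separate folder
for seed in 0 42 1234 99999; do
  audioldm2 --seed $seed -s "./output/seed_$seed" \
    -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
done
```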
## Pretrained Models
You can choose a model checkpoint by setting "--model_name":
```shell
# CUDA
audioldm2 --model_name "audioldm_48k" --device cuda -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
# MPS
audioldm2 --model_name "audioldm_48k" --device mps -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
```
We have seven checkpoints you can choose from:

1. **audioldm_48k** (default): This checkpoint can generate high-fidelity sound effects and music.
2. **audioldm2-full**: Generates both sound effects and music with the AudioLDM 2 architecture.
3. **audioldm_16k_crossattn_t5**: The improved version of [AudioLDM 1.0](https://github.com/haoheliu/AudioLDM).
4. **audioldm2-full-large-1150k**: A larger version of audioldm2-full.
5. **audioldm2-music-665k**: Music generation.
6. **audioldm2-speech-gigaspeech** (default for TTS): Text-to-speech, trained on the GigaSpeech dataset.
7. **audioldm2-speech-ljspeech**: Text-to-speech, trained on the LJSpeech dataset.
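For instance, to generate music with the dedicated music checkpoint:
```shell
audioldm2 --model_name "audioldm2-music-665k" -t "A gentle piano melody accompanied by soft strings."
```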
We currently support 3 devices:
- cpu
- cuda
- mps (note that computation on MPS requires about 20 GB of RAM)
## Other options
```shell
usage: audioldm2 [-h] [-t TEXT] [-tl TEXT_LIST] [-s SAVE_PATH]
[--model_name {audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}] [-d DEVICE]
[-b BATCHSIZE] [--ddim_steps DDIM_STEPS] [-gs GUIDANCE_SCALE] [-n N_CANDIDATE_GEN_PER_TEXT]
[--seed SEED]
optional arguments:
-h, --help show this help message and exit
-t TEXT, --text TEXT Text prompt to the model for audio generation
--transcription TRANSCRIPTION
Transcription used for speech synthesis
-tl TEXT_LIST, --text_list TEXT_LIST
A file that contains text prompts for audio generation
-s SAVE_PATH, --save_path SAVE_PATH
The path to save model output
--model_name {audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}
The checkpoint you are going to use
-d DEVICE, --device DEVICE
The device for computation. If not specified, the script will automatically choose the device based on your environment. [cpu, cuda, mps, auto]
-b BATCHSIZE, --batchsize BATCHSIZE
How many samples to generate at the same time
--ddim_steps DDIM_STEPS
The number of sampling steps for DDIM
-gs GUIDANCE_SCALE, --guidance_scale GUIDANCE_SCALE
Guidance scale (large => better quality and relevance to the text; small => better diversity)
-n N_CANDIDATE_GEN_PER_TEXT, --n_candidate_gen_per_text N_CANDIDATE_GEN_PER_TEXT
Automatic quality control. This number controls the number of candidates (e.g., generate three audios and choose the best to show you). A larger value usually leads to better quality with
heavier computation
--seed SEED Changing this value (any integer) will lead to a different generation result.
```
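Putting several of these options together, a typical invocation might look like this (the parameter values here are for illustration only):
```shell
audioldm2 -t "Thunder rumbling over a quiet valley." \
  -s ./output \
  --ddim_steps 200 \
  -gs 3.5 \
  -n 3 \
  --seed 42
```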
## Cite this work
If you find this tool useful, please consider citing:
```bibtex
@article{liu2023audioldm2,
title={{AudioLDM 2}: Learning Holistic Audio Generation with Self-supervised Pretraining},
author={Haohe Liu and Qiao Tian and Yi Yuan and Xubo Liu and Xinhao Mei and Qiuqiang Kong and Yuping Wang and Wenwu Wang and Yuxuan Wang and Mark D. Plumbley},
journal={arXiv preprint arXiv:2308.05734},
year={2023}
}
```
```bibtex
@article{liu2023audioldm,
title={{AudioLDM}: Text-to-Audio Generation with Latent Diffusion Models},
author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
journal={Proceedings of the International Conference on Machine Learning},
year={2023}
}
```