VALL-E-X

Name: VALL-E-X
Version: 0.0.2a1
Home page: https://github.com/korakoe/VALL-E-X
Summary: An open source implementation of Microsoft's VALL-E X zero-shot TTS
Upload time: 2023-11-04 10:47:52
Author: Plachtaa
License: MIT
Keywords: artificial intelligence, deep learning
Requirements: soundfile, numpy, torch, torchvision, torchaudio, tokenizers, encodec, langid, wget, unidecode, pyopenjtalk-prebuilt, pypinyin, inflect, cn2an, jieba, eng_to_ipa, openai-whisper, matplotlib, gradio, nltk, sudachipy, sudachidict_core, vocos
# VALL-E X: Multilingual Text-to-Speech Synthesis and Voice Cloning 🔊
[![Discord](https://img.shields.io/badge/Discord-%235865F2.svg?style=for-the-badge&logo=discord&logoColor=white)](https://discord.gg/qCBRmAnTxg)
<br>
English | [中文](README-ZH.md)
<br>
An open source implementation of Microsoft's [VALL-E X](https://arxiv.org/pdf/2303.03926) zero-shot TTS model.<br>
**We release our trained model to the public for research or application usage.**

![vallex-framework](/vallex/images/vallex_framework.jpg "VALL-E X framework")

VALL-E X is an amazing multilingual text-to-speech (TTS) model proposed by Microsoft. While Microsoft initially published it in their research paper, they did not release any code or pretrained models. Recognizing the potential and value of this technology, our team took on the challenge to reproduce the results and train our own model. We are glad to share our trained VALL-E X model with the community, allowing everyone to experience the power of next-generation TTS! 🎧
<br>
<br>
More details about the model are presented in [model card](./model-card.md).

# NEW!
Install as a library using:
```sh
pip install git+https://github.com/korakoe/VALL-E-X.git
```
or
```sh
pip install VALL-E-X
```
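
Once installed, the package is importable as `vallex`. A minimal smoke test (a sketch; it triggers the first-run model download described under Installation):

```python
# Minimal smoke test: verify the install and load the models.
# The first run downloads the checkpoints (see Installation below).
from vallex.utils.generation import preload_models

model, codec, vocos = preload_models()
print("Loaded:", type(model).__name__, type(codec).__name__, type(vocos).__name__)
```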

You can train using the repo below (it is compatible with this repo; view the source of `preload_models` for information on loading custom models):
https://github.com/0417keito/VALL-E-X-Trainer-by-CustomData
<br>
<br>
OR
<br>
<br>
Use my training Colab!
<br>
<a href="https://colab.research.google.com/github/korakoe/VALL-E-X/blob/main/finetune.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 📖 Quick Index
* [🚀 Updates](#-updates)
* [📢 Features](#-features)
* [💻 Installation](#-installation)
* [🎧 Demos](#-demos)
* [🐍 Usage](#-usage-in-python)
* [❓ FAQ](#-faq)
* [🧠 TODO](#-todo)

## 🚀 Updates
**2023.10.14**
- Better practices for loading models (by extension, allows loading custom models; undocumented)
- Allow mapping models to other devices
- Fix numerous other issues...

**2023.10.13**
- Turned into installable library

**2023.09.10**
- Added AR decoder batch decoding for more stable generation results.

**2023.08.30**
- Replaced EnCodec decoder with Vocos decoder, improved audio quality. (Thanks to [@v0xie](https://github.com/v0xie))

**2023.08.23**
- Added long text generation.

**2023.08.20**
- Added [Chinese README](README-ZH.md).

**2023.08.14**
- Pretrained VALL-E X checkpoint is now released. Download it [here](https://drive.google.com/file/d/10gdQWvP-K_e1undkvv0p2b7SU6I4Egyl/view?usp=sharing)

## 💻 Installation
### Install with pip, Python 3.10, CUDA 11.7 ~ 12.0, PyTorch 2.0+
```commandline
git clone https://github.com/Plachtaa/VALL-E-X.git
cd VALL-E-X
pip install -r requirements.txt
```

> Note: If you want to make a voice prompt, you need to install ffmpeg and add its folder to the PATH environment variable.
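
For example, on Debian/Ubuntu (on Windows, download an ffmpeg build and add its `bin` folder to PATH instead):

```commandline
sudo apt install ffmpeg
ffmpeg -version
```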

When you run the program for the first time, it will automatically download the required models.

If the download fails and reports an error, please follow the steps below to manually download the models (a command-line sketch follows the list).

(Please pay attention to the capitalization of folder names.)

1. Check whether there is a `checkpoints` folder in the installation directory. 
If not, manually create a `checkpoints` folder (`./checkpoints/`) in the installation directory.

2. Check whether there is a `vallex-checkpoint.pt` file in the `checkpoints` folder. 
If not, please manually download the `vallex-checkpoint.pt` file from [here](https://huggingface.co/Plachta/VALL-E-X/resolve/main/vallex-checkpoint.pt) and put it in the `checkpoints` folder.

3. Check whether there is a `whisper` folder in the installation directory. 
If not, manually create a `whisper` folder (`./whisper/`) in the installation directory.

4. Check whether there is a `medium.pt` file in the `whisper` folder. 
If not, please manually download the `medium.pt` file from [here](https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt) and put it in the `whisper` folder.

## 🎧 Demos
Not ready to set up the environment on your local machine just yet? No problem! We've got you covered with our online demos. You can try out VALL-E X directly on Hugging Face or Google Colab, experiencing the model's capabilities hassle-free!
<br>
[![Open in Spaces](https://img.shields.io/badge/🤗-Open%20in%20Spaces-blue.svg)](https://huggingface.co/spaces/Plachta/VALL-E-X)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1yyD_sz531QntLKowMHo-XxorsFBCfKul?usp=sharing)


## 📢 Features

VALL-E X comes packed with cutting-edge functionalities:

1. **Multilingual TTS**: Speak in three languages - English, Chinese, and Japanese - with natural and expressive speech synthesis.

2. **Zero-shot Voice Cloning**: Enroll a short 3~10 second recording of an unseen speaker, and watch VALL-E X create personalized, high-quality speech that sounds just like them!

<details>
  <summary><h5>see example</h5></summary>

[prompt.webm](https://github.com/Plachtaa/VALL-E-X/assets/112609742/a7baa51d-a53a-41cc-a03d-6970f25fcca7)


[output.webm](https://github.com/Plachtaa/VALL-E-X/assets/112609742/b895601a-d126-4138-beff-061aabdc7985)

</details>

3. **Speech Emotion Control**: Experience the power of emotions! VALL-E X can synthesize speech with the same emotion as the acoustic prompt provided, adding an extra layer of expressiveness to your audio.

<details>
  <summary><h5>see example</h5></summary>

https://github.com/Plachtaa/VALL-E-X/assets/112609742/56fa9988-925e-4757-82c5-83ecb0df6266


https://github.com/Plachtaa/VALL-E-X/assets/112609742/699c47a3-d502-4801-8364-bd89bcc0b8f1

</details>

4. **Zero-shot Cross-Lingual Speech Synthesis**: Take monolingual speakers on a linguistic journey! VALL-E X can produce personalized speech in another language without compromising fluency or accent. Below, a Japanese speaker talks in Chinese & English. 🇯🇵 🗣

<details>
  <summary><h5>see example</h5></summary>

[jp-prompt.webm](https://github.com/Plachtaa/VALL-E-X/assets/112609742/ea6e2ee4-139a-41b4-837e-0bd04dda6e19)


[en-output.webm](https://github.com/Plachtaa/VALL-E-X/assets/112609742/db8f9782-923f-425e-ba94-e8c1bd48f207)


[zh-output.webm](https://github.com/Plachtaa/VALL-E-X/assets/112609742/15829d79-e448-44d3-8965-fafa7a3f8c28)

</details>

5. **Accent Control**: Get creative with accents! VALL-E X allows you to experiment with different accents, like speaking Chinese with an English accent or vice versa. 🇨🇳 💬

<details>
  <summary><h5>see example</h5></summary>

[en-prompt.webm](https://github.com/Plachtaa/VALL-E-X/assets/112609742/f688d7f6-70ef-46ec-b1cc-355c31e78b3b)


[zh-accent-output.webm](https://github.com/Plachtaa/VALL-E-X/assets/112609742/be59c7ca-b45b-44ca-a30d-4d800c950ccc)


[en-accent-output.webm](https://github.com/Plachtaa/VALL-E-X/assets/112609742/8b4f4f9b-f299-4ea4-a548-137437b71738)

</details>

6. **Acoustic Environment Maintenance**: No need for perfectly clean audio prompts! VALL-E X adapts to the acoustic environment of the input, making speech generation feel natural and immersive.

<details>
  <summary><h5>see example</h5></summary>

[noise-prompt.webm](https://github.com/Plachtaa/VALL-E-X/assets/112609742/68986d88-abd0-4d1d-96e4-4f893eb9259e)


[noise-output.webm](https://github.com/Plachtaa/VALL-E-X/assets/112609742/96c4c612-4516-4683-8804-501b70938608)

</details>


Explore our [demo page](https://plachtaa.github.io/) for a lot more examples!

## 🐍 Usage in Python

<details open>
  <summary><h3>🪑 Basics</h3></summary>

```python
from vallex.utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from IPython.display import Audio

# download and load all models 
model, codec, vocos = preload_models()

# generate audio from text
text_prompt = """
Hello, my name is Nose. And uh, and I like hamburger. Hahaha... But I also have other interests such as playing tactic toast.
"""
audio_array = generate_audio(model, codec, vocos, text_prompt)

# save audio to disk
write_wav("vallex_generation.wav", SAMPLE_RATE, audio_array)

# play audio in notebook
Audio(audio_array, rate=SAMPLE_RATE)
```

[hamburger.webm](https://github.com/Plachtaa/VALL-E-X/assets/112609742/578d7bbe-cda9-483e-898c-29646edc8f2e)

</details>

<details open>
  <summary><h3>🌎 Foreign Language</h3></summary>
<br>
This VALL-E X implementation also supports Chinese and Japanese. All three languages have equally awesome performance!
<br>

```python
# model, codec and vocos are assumed to be loaded via preload_models(), as above
text_prompt = """
    チュソクは私のお気に入りの祭りです。 私は数日間休んで、友人や家族との時間を過ごすことができます。
"""
audio_array = generate_audio(model, codec, vocos, text_prompt)
```

[vallex_japanese.webm](https://github.com/Plachtaa/VALL-E-X/assets/112609742/ee57a688-3e83-4be5-b0fe-019d16eec51c)

*Note: VALL-E X controls accent perfectly even when synthesizing code-switched text. However, you need to manually denote the language of each sentence (since our g2p tool is rule-based).*
```python
text_prompt = """
    [EN]The Thirty Years' War was a devastating conflict that had a profound impact on Europe.[EN]
    [ZH]这是历史的开始。 如果您想听更多，请继续。[ZH]
"""
audio_array = generate_audio(model, codec, vocos, text_prompt, language='mix')
```

[vallex_codeswitch.webm](https://github.com/Plachtaa/VALL-E-X/assets/112609742/d8667abf-bd08-499f-a383-a861d852f98a)

</details>

<details open>
<summary><h3>📼 Voice Presets</h3></summary>

VALL-E X provides tens of speaker voices which you can use directly for inference! Browse all voices in the [code](/vallex/presets).

> VALL-E X tries to match the tone, pitch, emotion and prosody of a given preset. The model also attempts to preserve music, ambient noise, etc.

```python
text_prompt = """
I am an innocent boy with a smoky voice. It is a great honor for me to speak at the United Nations today.
"""
audio_array = generate_audio(model, codec, vocos, text_prompt, prompt="dingzhen")
```

[smoky.webm](https://github.com/Plachtaa/VALL-E-X/assets/112609742/d3f55732-b1cd-420f-87d6-eab60db14dc5)

</details>

<details open>
<summary><h3>🎙 Voice Cloning</h3></summary>
  
VALL-E X supports voice cloning! You can make a voice prompt from any person, character, or even your own voice, and use it like any other voice preset.<br>
To make a voice prompt, you need to provide a speech clip 3~10 seconds long, along with its transcript.
You can also leave the transcript blank to let the [Whisper](https://github.com/openai/whisper) model generate the transcript.
> VALL-E X tries to match the tone, pitch, emotion and prosody of a given prompt. The model also attempts to preserve music, ambient noise, etc.

```python
from vallex.utils.prompt_making import make_prompt

### Use given transcript
make_prompt(name="paimon", audio_prompt_path="paimon_prompt.wav",
            transcript="Just, what was that? Paimon thought we were gonna get eaten.")

### Alternatively, use whisper
make_prompt(name="paimon", audio_prompt_path="paimon_prompt.wav")
```
Now let's try out the prompt we've just made!

```python
from vallex.utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

# download and load all models 
model, codec, vocos = preload_models()

text_prompt = """
Hey, Traveler, Listen to this, This machine has taken my voice, and now it can talk just like me!
"""
audio_array = generate_audio(model, codec, vocos, text_prompt, prompt="paimon")

write_wav("paimon_cloned.wav", SAMPLE_RATE, audio_array)

```

[paimon_prompt.webm](https://github.com/Plachtaa/VALL-E-X/assets/112609742/e7922859-9d12-4e2a-8651-e156e4280311)


[paimon_cloned.webm](https://github.com/Plachtaa/VALL-E-X/assets/112609742/60d3b7e9-5ead-4024-b499-a897ce5f3d5e)


</details>


<details open>
<summary><h3>🎢 User Interface</h3></summary>

Not comfortable with code? No problem! We've also created a user-friendly graphical interface for VALL-E X. It allows you to interact with the model effortlessly, making voice cloning and multilingual speech synthesis a breeze.
<br>
You can launch the UI with the following command:
```commandline
python -X utf8 launch-ui.py
```
</details>

## 🛠️ Hardware and Inference Speed

VALL-E X works well on both CPU and GPU (`pytorch 2.0+`, CUDA 11.7 and CUDA 12.0).

6GB of GPU VRAM is enough for running VALL-E X without offloading.
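
Not sure whether your GPU qualifies? A quick check with plain `torch` (not part of the VALL-E X API):

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")  # want >= 6GB
else:
    print("No CUDA GPU detected; VALL-E X will run on CPU (slower).")
```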

## ⚙️ Details

VALL-E X is similar to [Bark](https://github.com/suno-ai/bark), [VALL-E](https://arxiv.org/abs/2301.02111) and [AudioLM](https://arxiv.org/abs/2209.03143), all of which generate audio in GPT style by predicting audio tokens quantized by [EnCodec](https://github.com/facebookresearch/encodec) (an illustrative token sketch follows the comparison below).
<br>
Compared to [Bark](https://github.com/suno-ai/bark):
- ✔ **Lightweight**: 3️⃣ ✖ smaller,
- ✔ **Efficient**: 4️⃣ ✖ faster,
- ✔ **Better quality on Chinese & Japanese**
- ✔ **Cross-lingual speech without foreign accent**
- ✔ **Easy voice cloning**
- ❌ **Fewer languages**
- ❌ **No special tokens for music / sound effects**

### Supported Languages

| Language | Status |
| --- | :---: |
| English (en) | ✅ |
| Japanese (ja) | ✅ |
| Chinese, simplified (zh) | ✅ |

## ❓ FAQ

#### Where is the code for training?
* [lifeiteng's vall-e](https://github.com/lifeiteng/vall-e) has almost everything. There is no plan to release our training code, because it does not differ meaningfully from lifeiteng's implementation.

#### Where can I download the model checkpoint?
* We use `wget` to download the model to directory `./checkpoints/` when you run the program for the first time.
* If the download fails on the first run, please manually download from [this link](https://huggingface.co/Plachta/VALL-E-X/resolve/main/vallex-checkpoint.pt), and put the file under directory `./checkpoints/`.

#### How much VRAM do I need?
* 6GB of GPU VRAM; most modern NVIDIA GPUs satisfy the requirement.

#### Why does the model fail to generate long text?
* A Transformer's computational complexity grows quadratically with sequence length, so all training samples are kept under 22 seconds. Please make sure the total length of the audio prompt plus the generated audio is less than 22 seconds to ensure acceptable performance.
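
Built-in long text generation (see Updates, 2023.08.23) already handles chunking for you, but if you want explicit control, here is a minimal manual-chunking sketch (assuming the `generate_audio` signature shown in the Basics section; `nltk` is already in the requirements):

```python
import nltk
import numpy as np
from scipy.io.wavfile import write as write_wav
from vallex.utils.generation import SAMPLE_RATE, generate_audio, preload_models

nltk.download("punkt", quiet=True)  # sentence-tokenizer data, fetched once
model, codec, vocos = preload_models()

long_text = "First sentence of a long passage. Second sentence. And so on."
# Synthesize sentence by sentence so every chunk stays well under 22 seconds,
# then concatenate the waveforms.
pieces = [generate_audio(model, codec, vocos, s) for s in nltk.sent_tokenize(long_text)]
write_wav("long_generation.wav", SAMPLE_RATE, np.concatenate(pieces))
```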


#### MORE TO BE ADDED...

## 🧠 TODO
- [x] Add Chinese README
- [x] Long text generation
- [x] Replace Encodec decoder with Vocos decoder
- [ ] Fine-tuning for better voice adaptation
- [ ] `.bat` scripts for non-python users
- [ ] To be added...

## 🙏 Appreciation
- [VALL-E X paper](https://arxiv.org/pdf/2303.03926) for the brilliant idea
- [lifeiteng's vall-e](https://github.com/lifeiteng/vall-e) for related training code
- [bark](https://github.com/suno-ai/bark) for its amazing pioneering work in neural codec TTS models

## ⭐️ Show Your Support

If you find VALL-E X interesting and useful, give us a star on GitHub! ⭐️ It encourages us to keep improving the model and adding exciting features.

## 📜 License

VALL-E X is licensed under the [MIT License](./LICENSE).

---

Have questions or need assistance? Feel free to [open an issue](https://github.com/Plachtaa/VALL-E-X/issues/new) or join our [Discord](https://discord.gg/qCBRmAnTxg).

Happy voice cloning! 🎤

            
