
# E2 TTS — MLX
Implementation of E2-TTS, [Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS](https://arxiv.org/abs/2406.18009v1), with the [MLX](https://github.com/ml-explore/mlx) framework.
This implementation is based on the [lucidrains implementation](https://github.com/lucidrains/e2-tts-pytorch) in PyTorch, which differs from the paper in that it uses a [multistream transformer](https://arxiv.org/abs/2107.10342) for text and audio, with conditioning applied at every transformer block.
## Installation
```bash
pip install mlx-e2-tts
```
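To confirm the install picked up both the package and MLX (a quick, optional sanity check):

```python
import e2_tts_mlx  # noqa: F401 -- just verifying the import works
import mlx.core as mx

print(mx.default_device())  # should report your Apple silicon device
```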
## Usage
```python
import mlx.core as mx

from e2_tts_mlx.model import E2TTS
from e2_tts_mlx.trainer import E2Trainer
from e2_tts_mlx.data import load_libritts_r

e2tts = E2TTS(
    tokenizer = "char-utf8",  # or "phoneme_en"
    cond_drop_prob = 0.25,
    frac_lengths_mask = (0.7, 0.9),
    transformer = dict(
        dim = 1024,
        depth = 24,
        heads = 16,
        text_depth = 12,
        text_heads = 8,
        text_ff_mult = 4,
        max_seq_len = 4096,
        dropout = 0.1
    )
)
mx.eval(e2tts.parameters())

batch_size = 128
max_duration = 30  # seconds

dataset = load_libritts_r(split = "dev-clean")  # or any audio/caption dataset

trainer = E2Trainer(model = e2tts, num_warmup_steps = 1000)

trainer.train(
    train_dataset = dataset,
    learning_rate = 7.5e-5,
    batch_size = batch_size,
    total_steps = 1_000_000
)
```
... after much training ...
```python
cond = ...      # reference speech prompt (audio to clone the voice from)
text = ...      # transcript of the prompt followed by the text to generate
duration = ...  # from a trained DurationPredictor or otherwise

generated_audio = e2tts.sample(
    cond = cond,
    text = text,
    duration = duration,
    steps = 32,
    cfg_strength = 1.0,  # if trained for cfg
    use_vocos = True  # set to False to get mel spectrograms instead of audio
)
```
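The sampled output is an MLX array; to listen to it, convert to NumPy and write it out with any audio library. A minimal sketch, assuming the Vocos vocoder's usual 24 kHz sample rate and using `soundfile` as the writer (neither is prescribed by this package):

```python
import numpy as np
import soundfile as sf

# convert the MLX array to NumPy and write a WAV file. 24 kHz is an
# assumption -- Vocos models commonly run at 24 kHz, but check the
# configuration your vocoder actually uses.
sf.write("generated.wav", np.array(generated_audio).squeeze(), samplerate=24_000)
```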
Note that the model size specified above (from the paper) is very large. See `train_example.py` for a more practically sized model you can train on your local device.
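For a sense of how to scale the model down, a configuration along these lines (illustrative values only, not necessarily the ones `train_example.py` uses) is small enough to iterate on quickly:

```python
# an illustrative smaller configuration -- the values are assumptions
# for this sketch, not the settings from train_example.py.
e2tts_small = E2TTS(
    tokenizer = "char-utf8",
    transformer = dict(
        dim = 384,
        depth = 8,
        heads = 6,
        text_depth = 4,
        text_heads = 4,
        text_ff_mult = 4,
        max_seq_len = 1024,
        dropout = 0.1
    )
)
```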
## Appreciation
[lucidrains](https://github.com/lucidrains) for the original implementation in PyTorch.
## Citations
```bibtex
@inproceedings{Eskimez2024E2TE,
    title = {E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS},
    author = {Sefik Emre Eskimez and Xiaofei Wang and Manthan Thakker and Canrun Li and Chung-Hsien Tsai and Zhen Xiao and Hemin Yang and Zirun Zhu and Min Tang and Xu Tan and Yanqing Liu and Sheng Zhao and Naoyuki Kanda},
    year = {2024},
    url = {https://api.semanticscholar.org/CorpusID:270738197}
}
```

```bibtex
@article{Burtsev2021MultiStreamT,
    title = {Multi-Stream Transformers},
    author = {Mikhail S. Burtsev and Anna Rumshisky},
    journal = {ArXiv},
    year = {2021},
    volume = {abs/2107.10342},
    url = {https://api.semanticscholar.org/CorpusID:236171087}
}
```
## License
The code in this repository is released under the MIT license as found in the
[LICENSE](LICENSE) file.