# E2 TTS — MLX
Implementation of E2-TTS, [Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS](https://arxiv.org/abs/2406.18009v1), with the [MLX](https://github.com/ml-explore/mlx) framework.
This implementation is based on the [lucidrains implementation](https://github.com/lucidrains/e2-tts-pytorch) in PyTorch, which differs from the paper in that it uses a [multistream transformer](https://arxiv.org/abs/2107.10342) for text and audio, with conditioning applied at every transformer block.
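The multistream idea is that text and audio are kept in separate streams, each with its own self-attention, and the audio stream is conditioned on the text stream at every block rather than only once at the input. The exact mechanism lives in the model code; this is a toy NumPy sketch of the concept, with all names illustrative:

```python
import numpy as np

def attention(q, k, v):
    # Single-head scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def multistream_block(audio, text):
    # Each stream attends over itself...
    audio = audio + attention(audio, audio, audio)
    text = text + attention(text, text, text)
    # ...then the audio stream cross-attends to the text stream,
    # so conditioning happens inside every block.
    audio = audio + attention(audio, text, text)
    return audio, text

audio = np.random.randn(100, 64)  # (audio frames, dim)
text = np.random.randn(20, 64)    # (text tokens, dim)
for _ in range(4):                # a small stack of blocks
    audio, text = multistream_block(audio, text)
```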
## Installation
```bash
pip install mlx-e2-tts
```
## Usage
```python
import mlx.core as mx
from e2_tts_mlx.model import E2TTS
from e2_tts_mlx.trainer import E2Trainer
from e2_tts_mlx.data import load_libritts_r
e2tts = E2TTS(
tokenizer = "char-utf8", # or "phoneme_en"
cond_drop_prob = 0.25,
frac_lengths_mask = (0.7, 0.9),
transformer = dict(
dim = 1024,
depth = 24,
heads = 16,
text_depth = 12,
text_heads = 8,
text_ff_mult = 4,
max_seq_len = 4096,
dropout = 0.1
)
)
mx.eval(e2tts.parameters())
batch_size = 128
max_duration = 30
dataset = load_libritts_r(split="dev-clean") # or any audio/caption dataset
trainer = E2Trainer(model = e2tts, num_warmup_steps = 1000)
trainer.train(
train_dataset = dataset,
learning_rate = 7.5e-5,
batch_size = batch_size
)
```
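During training, E2 TTS learns speech infilling: a contiguous span covering a random fraction of the audio frames is masked and reconstructed from the text plus the surrounding audio. The `frac_lengths_mask = (0.7, 0.9)` range above controls that fraction. A rough NumPy sketch of the sampling (function name and details are illustrative, not the library's API):

```python
import numpy as np

def sample_frac_lengths_mask(seq_len, frac_range=(0.7, 0.9), rng=np.random):
    # Mask one contiguous span covering a random fraction of the frames.
    frac = rng.uniform(*frac_range)
    span = int(seq_len * frac)
    start = rng.randint(0, seq_len - span + 1)
    mask = np.zeros(seq_len, dtype=bool)
    mask[start:start + span] = True  # True = frame is masked / to be predicted
    return mask

mask = sample_frac_lengths_mask(200)
print(mask.mean())  # roughly within [0.7, 0.9]
```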
... after much training ...
```python
cond = ...
text = ...
duration = ... # from a trained DurationPredictor or otherwise
generated_audio = e2tts.sample(
cond = cond,
text = text,
duration = duration,
steps = 32,
cfg_strength = 1.0, # if trained for cfg
use_vocos = True # set to False to get mel spectrograms instead of audio
)
```
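`cfg_strength` applies classifier-free guidance at each sampling step: because conditioning was randomly dropped during training (`cond_drop_prob`), the model can be evaluated both with and without the text condition, and the two predictions combined. A hedged sketch of one common convention (the exact formula in the code may differ):

```python
import numpy as np

def cfg_combine(pred_cond, pred_uncond, cfg_strength=1.0):
    # Extrapolate from the unconditional prediction toward (and past)
    # the conditional one; cfg_strength = 0 recovers the conditional
    # prediction unchanged.
    return pred_cond + cfg_strength * (pred_cond - pred_uncond)

cond = np.array([1.0, 2.0])
uncond = np.array([0.5, 1.0])
print(cfg_combine(cond, uncond, 1.0))  # [1.5 3. ]
```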
Note that the model size specified above (from the paper) is very large. See `train_example.py` for a more practically sized model you can train on your local device.
## Appreciation
[lucidrains](https://github.com/lucidrains) for the original implementation in PyTorch.
## Citations
```bibtex
@inproceedings{Eskimez2024E2TE,
title = {E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS},
author = {Sefik Emre Eskimez and Xiaofei Wang and Manthan Thakker and Canrun Li and Chung-Hsien Tsai and Zhen Xiao and Hemin Yang and Zirun Zhu and Min Tang and Xu Tan and Yanqing Liu and Sheng Zhao and Naoyuki Kanda},
year = {2024},
url = {https://api.semanticscholar.org/CorpusID:270738197}
}
```
```bibtex
@article{Burtsev2021MultiStreamT,
title = {Multi-Stream Transformers},
author = {Mikhail S. Burtsev and Anna Rumshisky},
journal = {ArXiv},
year = {2021},
volume = {abs/2107.10342},
url = {https://api.semanticscholar.org/CorpusID:236171087}
}
```
## License
The code in this repository is released under the MIT license as found in the
[LICENSE](LICENSE) file.