# E2 TTS — MLX
Implementation of E2-TTS, [Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS](https://arxiv.org/abs/2406.18009v1), with the [MLX](https://github.com/ml-explore/mlx) framework.
This implementation is based on the [lucidrains implementation](https://github.com/lucidrains/e2-tts-pytorch) in PyTorch, which differs from the paper in that it uses a [multistream transformer](https://arxiv.org/abs/2107.10342) for text and audio, with conditioning applied at every transformer block.
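The multistream idea is that text and audio are kept in separate streams, each with its own self-attention, and the audio stream is conditioned on the text stream at every block rather than only once at the input. The exact mechanism lives in the model code; this is a toy NumPy sketch of the concept, with all names illustrative:

```python
import numpy as np

def attention(q, k, v):
    # Single-head scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def multistream_block(audio, text):
    # Each stream attends over itself...
    audio = audio + attention(audio, audio, audio)
    text = text + attention(text, text, text)
    # ...then the audio stream cross-attends to the text stream,
    # so conditioning happens inside every block.
    audio = audio + attention(audio, text, text)
    return audio, text

audio = np.random.randn(100, 64)  # (audio frames, dim)
text = np.random.randn(20, 64)    # (text tokens, dim)
for _ in range(4):                # a small stack of blocks
    audio, text = multistream_block(audio, text)
```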
## Installation
```bash
pip install mlx-e2-tts
```
## Usage
```python
import mlx.core as mx
from e2_tts_mlx.model import E2TTS
from e2_tts_mlx.trainer import E2Trainer
from e2_tts_mlx.data import load_libritts_r
e2tts = E2TTS(
tokenizer = "char-utf8", # or "phoneme_en"
cond_drop_prob = 0.25,
frac_lengths_mask = (0.7, 0.9),
transformer = dict(
dim = 1024,
depth = 24,
heads = 16,
text_depth = 12,
text_heads = 8,
text_ff_mult = 4,
max_seq_len = 4096,
dropout = 0.1
)
)
mx.eval(e2tts.parameters())
batch_size = 128
max_duration = 30
dataset = load_libritts_r(split="dev-clean") # or any audio/caption dataset
trainer = E2Trainer(model = e2tts, num_warmup_steps = 1000)
trainer.train(
train_dataset = dataset,
learning_rate = 7.5e-5,
batch_size = batch_size
)
```
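During training, E2 TTS learns speech infilling: a contiguous span covering a random fraction of the audio frames is masked and reconstructed from the text plus the surrounding audio. The `frac_lengths_mask = (0.7, 0.9)` range above controls that fraction. A rough NumPy sketch of the sampling (function name and details are illustrative, not the library's API):

```python
import numpy as np

def sample_frac_lengths_mask(seq_len, frac_range=(0.7, 0.9), rng=np.random):
    # Mask one contiguous span covering a random fraction of the frames.
    frac = rng.uniform(*frac_range)
    span = int(seq_len * frac)
    start = rng.randint(0, seq_len - span + 1)
    mask = np.zeros(seq_len, dtype=bool)
    mask[start:start + span] = True  # True = frame is masked / to be predicted
    return mask

mask = sample_frac_lengths_mask(200)
print(mask.mean())  # roughly within [0.7, 0.9]
```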
... after much training ...
```python
cond = ...
text = ...
duration = ... # from a trained DurationPredictor or otherwise
generated_audio = e2tts.sample(
cond = cond,
text = text,
duration = duration,
steps = 32,
cfg_strength = 1.0, # if trained for cfg
use_vocos = True # set to False to get mel spectrograms instead of audio
)
```
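`cfg_strength` applies classifier-free guidance at each sampling step: because conditioning was randomly dropped during training (`cond_drop_prob`), the model can be evaluated both with and without the text condition, and the two predictions combined. A hedged sketch of one common convention (the exact formula in the code may differ):

```python
import numpy as np

def cfg_combine(pred_cond, pred_uncond, cfg_strength=1.0):
    # Extrapolate from the unconditional prediction toward (and past)
    # the conditional one; cfg_strength = 0 recovers the conditional
    # prediction unchanged.
    return pred_cond + cfg_strength * (pred_cond - pred_uncond)

cond = np.array([1.0, 2.0])
uncond = np.array([0.5, 1.0])
print(cfg_combine(cond, uncond, 1.0))  # [1.5 3. ]
```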
Note that the model size specified above (from the paper) is very large. See `train_example.py` for a more practically sized model you can train on your local device.
## Appreciation
[lucidrains](https://github.com/lucidrains) for the original implementation in PyTorch.
## Citations
```bibtex
@inproceedings{Eskimez2024E2TE,
title = {E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS},
author = {Sefik Emre Eskimez and Xiaofei Wang and Manthan Thakker and Canrun Li and Chung-Hsien Tsai and Zhen Xiao and Hemin Yang and Zirun Zhu and Min Tang and Xu Tan and Yanqing Liu and Sheng Zhao and Naoyuki Kanda},
year = {2024},
url = {https://api.semanticscholar.org/CorpusID:270738197}
}
```
```bibtex
@article{Burtsev2021MultiStreamT,
title = {Multi-Stream Transformers},
author = {Mikhail S. Burtsev and Anna Rumshisky},
journal = {ArXiv},
year = {2021},
volume = {abs/2107.10342},
url = {https://api.semanticscholar.org/CorpusID:236171087}
}
```
## License
The code in this repository is released under the MIT license as found in the
[LICENSE](LICENSE) file.