mlx-e2-tts

Name: mlx-e2-tts
Version: 0.0.4
Summary: E2-TTS - MLX
Author: Lucas Newman
Homepage: https://github.com/lucasnewman/e2-tts-mlx
Upload time: 2024-10-06 21:48:29
Requires Python: >=3.9
License: MIT
Keywords: artificial intelligence, asr, audio-generation, deep learning, transformers, text-to-speech
# E2 TTS — MLX

Implementation of E2-TTS, [Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS](https://arxiv.org/abs/2406.18009v1), with the [MLX](https://github.com/ml-explore/mlx) framework.

This implementation is based on the [lucidrains implementation](https://github.com/lucidrains/e2-tts-pytorch) in PyTorch, which differs from the paper in that it uses a [multistream transformer](https://arxiv.org/abs/2107.10342) for text and audio, with conditioning applied at every transformer block.

## Installation

```bash
pip install mlx-e2-tts
```

## Usage

```python
import mlx.core as mx

from e2_tts_mlx.model import E2TTS
from e2_tts_mlx.trainer import E2Trainer
from e2_tts_mlx.data import load_libritts_r

e2tts = E2TTS(
    tokenizer="char-utf8",  # or "phoneme_en"
    cond_drop_prob = 0.25,
    frac_lengths_mask = (0.7, 0.9),
    transformer = dict(
        dim = 1024,
        depth = 24,
        heads = 16,
        text_depth = 12,
        text_heads = 8,
        text_ff_mult = 4,
        max_seq_len = 4096,
        dropout = 0.1
    )
)
mx.eval(e2tts.parameters())

batch_size = 128
max_duration = 30

dataset = load_libritts_r(split="dev-clean")  # or any audio/caption dataset

trainer = E2Trainer(model = e2tts, num_warmup_steps = 1000)

trainer.train(
    train_dataset = dataset,
    learning_rate = 7.5e-5,
    batch_size = batch_size
)
```

... after much training ...

```python
cond = ...
text = ...
duration = ...  # from a trained DurationPredictor or otherwise

generated_audio = e2tts.sample(
    cond = cond,
    text = text,
    duration = duration,
    steps = 32,
    cfg_strength = 1.0,  # if trained for cfg
    use_vocos = True  # set to False to get mel spectrograms instead of audio
)
```
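The returned waveform can be written to disk once converted to 16-bit PCM. A minimal sketch, assuming a mono waveform at 24 kHz (the typical Vocos output rate — treat both as assumptions) and using a synthetic tone as a stand-in for `generated_audio`:

```python
import wave

import numpy as np

SAMPLE_RATE = 24_000  # assumption: Vocos decodes to 24 kHz mono

# Stand-in for the waveform returned by e2tts.sample(...): a 1 s, 440 Hz tone
t = np.linspace(0.0, 1.0, SAMPLE_RATE, endpoint=False)
audio = 0.1 * np.sin(2.0 * np.pi * 440.0 * t)

# Clip to [-1, 1], scale to 16-bit PCM, and write a WAV file
pcm = (np.clip(audio, -1.0, 1.0) * 32767.0).astype(np.int16)
with wave.open("output.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)  # 16-bit samples
    f.setframerate(SAMPLE_RATE)
    f.writeframes(pcm.tobytes())
```

For an actual MLX array, convert it first with `np.array(generated_audio)` before the PCM scaling step.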

Note that the model size specified above (from the paper) is very large. See `train_example.py` for a more practically sized model you can train on your local device.
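Without a trained `DurationPredictor`, one rough way to pick the `duration` argument for `sample` is to convert a target length in seconds into a mel-frame count. Whether `sample` expects frames or seconds depends on the implementation, and the constants below (24 kHz sample rate, hop length 256) are assumptions, not values confirmed by this package:

```python
SAMPLE_RATE = 24_000  # assumption: audio sample rate
HOP_LENGTH = 256      # assumption: mel-spectrogram hop size

def seconds_to_frames(seconds: float) -> int:
    """Rough mel-frame count for a target duration in seconds."""
    return int(seconds * SAMPLE_RATE / HOP_LENGTH)

duration = seconds_to_frames(5.0)  # 468 frames for 5 s of audio
```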

## Appreciation

[lucidrains](https://github.com/lucidrains) for the original implementation in PyTorch.

## Citations

```bibtex
@inproceedings{Eskimez2024E2TE,
    title   = {E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS},
    author  = {Sefik Emre Eskimez and Xiaofei Wang and Manthan Thakker and Canrun Li and Chung-Hsien Tsai and Zhen Xiao and Hemin Yang and Zirun Zhu and Min Tang and Xu Tan and Yanqing Liu and Sheng Zhao and Naoyuki Kanda},
    year    = {2024},
    url     = {https://api.semanticscholar.org/CorpusID:270738197}
}
```

```bibtex
@article{Burtsev2021MultiStreamT,
    title     = {Multi-Stream Transformers},
    author    = {Mikhail S. Burtsev and Anna Rumshisky},
    journal   = {ArXiv},
    year      = {2021},
    volume    = {abs/2107.10342},
    url       = {https://api.semanticscholar.org/CorpusID:236171087}
}
```

## License

The code in this repository is released under the MIT license as found in the
[LICENSE](LICENSE) file.

            
