f5-tts-mlx


| Field | Value |
| --- | --- |
| Name | f5-tts-mlx |
| Version | 0.2.6 |
| Summary | F5-TTS - MLX |
| Upload time | 2025-03-19 02:11:37 |
| Requires Python | >=3.9 |
| License | MIT |
| Keywords | artificial intelligence, asr, audio-generation, deep learning, transformers, text-to-speech |
| Requirements | einops, einx, jieba, huggingface_hub, mlx, numpy, pypinyin, setuptools, sounddevice, soundfile, tqdm, vocos-mlx |

![F5 TTS diagram](f5tts.jpg)

# F5 TTS — MLX

An implementation of [F5-TTS](https://arxiv.org/abs/2410.06885) using the [MLX](https://github.com/ml-explore/mlx) framework.

F5 TTS is a non-autoregressive, zero-shot text-to-speech system using a flow-matching mel spectrogram generator with a diffusion transformer (DiT).
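To give a feel for what sampling from a flow-matching generator involves, here is a toy sketch in plain Python. It is not the code used by this package: `euler_sample` and `make_toy_velocity` are hypothetical names, the "network" is a hand-written velocity field that points at a fixed target, and the integrator is plain fixed-step Euler. The shape of the computation is the same, though. Start from noise, then repeatedly follow a velocity field from t = 0 to t = 1 to arrive at the output, updating every element of the vector at once rather than one element after another.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector], x0: Vector, steps: int
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        # Every element is updated in the same step (no left-to-right decoding).
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Toy stand-in for a learned model: always points straight at `target`."""

    def velocity(x: Vector, t: float) -> Vector:
        # Velocity of the straight-line path that reaches `target` at t=1.
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

    return velocity


noise = [0.3, -1.2, 0.8]   # stands in for random starting noise
target = [1.0, 2.0, 3.0]   # stands in for one mel spectrogram frame
sample = euler_sample(make_toy_velocity(target), noise, steps=8)
```

In a real system the velocity field is predicted by the trained transformer and conditioned on the text and reference audio; the toy field here only illustrates the integration loop.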

You can listen to a [sample](https://s3.amazonaws.com/lucasnewman.datasets/f5tts/sample.wav) that was generated in about 4 seconds on an M3 Max MacBook Pro.

F5 is an evolution of [E2 TTS](https://arxiv.org/abs/2406.18009v2) and improves performance with ConvNeXt V2 blocks for the learned text alignment. This repository is based on the [original PyTorch implementation](https://github.com/SWivid/F5-TTS).

## Installation

```bash
pip install f5-tts-mlx
```

## Basic Usage

```bash
python -m f5_tts_mlx.generate --text "The quick brown fox jumped over the lazy dog."
```

You can also use a pipe to generate speech from the output of another process, for instance from a language model:

```bash
mlx_lm.generate --model mlx-community/Llama-3.2-1B-Instruct-4bit --verbose false \
 --temp 0 --max-tokens 512 --prompt "Write a concise paragraph explaining wavelets." \
| python -m f5_tts_mlx.generate
```
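The same pipe can be driven from a Python script. The sketch below assumes only what the shell example shows, namely that `python -m f5_tts_mlx.generate` reads its text from standard input when `--text` is not given. The helper name `speak` and its `command` parameter are hypothetical and exist only for illustration.

```python
import subprocess
import sys
from typing import List, Optional


def speak(text: str, command: Optional[List[str]] = None) -> int:
    """Send `text` to a command's standard input and return its exit code."""
    if command is None:
        # Same invocation as the shell pipe above, using the current interpreter.
        command = [sys.executable, "-m", "f5_tts_mlx.generate"]
    completed = subprocess.run(command, input=text, text=True)
    return completed.returncode
```

Calling `speak("Wavelets are short, localized oscillations.")` would then behave like the right-hand side of the pipe, with the text supplied by your own program instead of a language model.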

## Voice Matching

If you want to use your own reference audio sample, make sure it's a mono, 24 kHz WAV file of around 5-10 seconds:

```bash
python -m f5_tts_mlx.generate \
--text "The quick brown fox jumped over the lazy dog." \
--ref-audio /path/to/audio.wav \
--ref-text "This is the caption for the reference audio."
```

You can convert an audio file to the correct format with ffmpeg like this:

```bash
ffmpeg -i /path/to/audio.wav -ac 1 -ar 24000 -sample_fmt s16 -t 10 /path/to/output_audio.wav
```
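If you would like to check a file before using it as reference audio, a small script with Python's standard `wave` module can report the channel count, sample rate, and duration. This is only a sketch: `check_reference_wav` is a hypothetical helper, the hard 5 and 10 second bounds are an assumption drawn from the "around 5-10 seconds" guidance, and `wave` only reads uncompressed PCM WAV files.

```python
import wave
from typing import List


def check_reference_wav(
    path: str,
    expected_rate: int = 24000,
    min_seconds: float = 5.0,   # assumed lower bound
    max_seconds: float = 10.0,  # assumed upper bound
) -> List[str]:
    """Return a list of problems; an empty list means the file looks suitable."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate

    problems = []
    if channels != 1:
        problems.append(f"expected mono audio, found {channels} channels")
    if rate != expected_rate:
        problems.append(f"expected {expected_rate} Hz, found {rate} Hz")
    if not (min_seconds <= seconds <= max_seconds):
        problems.append(f"expected {min_seconds}-{max_seconds} s, found {seconds:.1f} s")
    return problems
```

A file that fails any of these checks can be converted with the ffmpeg command above.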

See the [`f5_tts_mlx` directory](./f5_tts_mlx) for more options to customize generation.

## Quantized Models

If you're in a bandwidth- or memory-limited environment, you can use the `--q` option to load a quantized version of the model. 4-bit and 8-bit variants are supported.

```bash
python -m f5_tts_mlx.generate --text "The quick brown fox jumped over the lazy dog." --q 4
```
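When calling the command-line tool from a script, it can help to assemble the arguments in one place. The sketch below uses only the flags shown in this README (`--text`, `--ref-audio`, `--ref-text`, `--q`); the function name `build_generate_command` and its parameter names are hypothetical, and the check for 4 or 8 bits simply mirrors the supported variants listed above.

```python
import sys
from typing import List, Optional


def build_generate_command(
    text: str,
    ref_audio: Optional[str] = None,
    ref_text: Optional[str] = None,
    quantization_bits: Optional[int] = None,
) -> List[str]:
    """Build an argument list for `python -m f5_tts_mlx.generate`."""
    command = [sys.executable, "-m", "f5_tts_mlx.generate", "--text", text]
    if ref_audio is not None:
        command += ["--ref-audio", ref_audio]
    if ref_text is not None:
        command += ["--ref-text", ref_text]
    if quantization_bits is not None:
        if quantization_bits not in (4, 8):
            raise ValueError("only 4-bit and 8-bit variants are supported")
        command += ["--q", str(quantization_bits)]
    return command
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell quoting problems when the text contains quotes or other special characters.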

## From Python

You can also generate speech with a pretrained model from Python:

```python
from f5_tts_mlx.generate import generate

audio = generate(text="Hello world.")  # pass further options as keyword arguments
```

Pretrained model weights are also available [on Hugging Face](https://huggingface.co/lucasnewman/f5-tts-mlx).

## Appreciation

Thanks to [Yushen Chen](https://github.com/SWivid) for the original PyTorch implementation of F5 TTS and the pretrained model.

Thanks to [Phil Wang](https://github.com/lucidrains) for the E2 TTS implementation that this model is based on.

## Citations

```bibtex
@article{chen-etal-2024-f5tts,
      title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching}, 
      author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
      journal={arXiv preprint arXiv:2410.06885},
      year={2024},
}
```

```bibtex
@inproceedings{Eskimez2024E2TE,
    title   = {E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS},
    author  = {Sefik Emre Eskimez and Xiaofei Wang and Manthan Thakker and Canrun Li and Chung-Hsien Tsai and Zhen Xiao and Hemin Yang and Zirun Zhu and Min Tang and Xu Tan and Yanqing Liu and Sheng Zhao and Naoyuki Kanda},
    year    = {2024},
    url     = {https://api.semanticscholar.org/CorpusID:270738197}
}
```

## License

The code in this repository is released under the MIT license as found in the
[LICENSE](LICENSE) file.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "f5-tts-mlx",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "artificial intelligence, asr, audio-generation, deep learning, transformers, text-to-speech",
    "author": null,
    "author_email": "Lucas Newman <lucasnewman@me.com>",
    "download_url": "https://files.pythonhosted.org/packages/98/be/37627863ed8c2ab7fdd09123c3b2299d637a9153f639b2c1bb84b4a065ac/f5_tts_mlx-0.2.6.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "F5-TTS - MLX",
    "version": "0.2.6",
    "project_urls": {
        "Homepage": "https://github.com/lucasnewman/f5-tts-mlx"
    },
    "split_keywords": [
        "artificial intelligence",
        " asr",
        " audio-generation",
        " deep learning",
        " transformers",
        " text-to-speech"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e50418740971b4602651c4caf3dfd44062df0a93aa345e4f40e10863df8c8111",
                "md5": "af23da22d0dff66711bcff51bfdb35c7",
                "sha256": "bfb7d4c5c9020436f8bb16a8051d8a42c8a83b940b634a71a127024bb6554ec4"
            },
            "downloads": -1,
            "filename": "f5_tts_mlx-0.2.6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "af23da22d0dff66711bcff51bfdb35c7",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 237700,
            "upload_time": "2025-03-19T02:11:35",
            "upload_time_iso_8601": "2025-03-19T02:11:35.887713Z",
            "url": "https://files.pythonhosted.org/packages/e5/04/18740971b4602651c4caf3dfd44062df0a93aa345e4f40e10863df8c8111/f5_tts_mlx-0.2.6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "98be37627863ed8c2ab7fdd09123c3b2299d637a9153f639b2c1bb84b4a065ac",
                "md5": "f9eec5c14aa11d359a4e2d0546ce534e",
                "sha256": "de94f7ba838cce4b44c4c926335a820b93bd629a5ad0f9f8ac98fc65c166529d"
            },
            "downloads": -1,
            "filename": "f5_tts_mlx-0.2.6.tar.gz",
            "has_sig": false,
            "md5_digest": "f9eec5c14aa11d359a4e2d0546ce534e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 237088,
            "upload_time": "2025-03-19T02:11:37",
            "upload_time_iso_8601": "2025-03-19T02:11:37.406222Z",
            "url": "https://files.pythonhosted.org/packages/98/be/37627863ed8c2ab7fdd09123c3b2299d637a9153f639b2c1bb84b4a065ac/f5_tts_mlx-0.2.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-03-19 02:11:37",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "lucasnewman",
    "github_project": "f5-tts-mlx",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "einops",
            "specs": []
        },
        {
            "name": "einx",
            "specs": []
        },
        {
            "name": "jieba",
            "specs": []
        },
        {
            "name": "huggingface_hub",
            "specs": []
        },
        {
            "name": "mlx",
            "specs": []
        },
        {
            "name": "numpy",
            "specs": []
        },
        {
            "name": "pypinyin",
            "specs": []
        },
        {
            "name": "setuptools",
            "specs": []
        },
        {
            "name": "sounddevice",
            "specs": []
        },
        {
            "name": "soundfile",
            "specs": []
        },
        {
            "name": "tqdm",
            "specs": []
        },
        {
            "name": "vocos-mlx",
            "specs": []
        }
    ],
    "lcname": "f5-tts-mlx"
}
        