![F5 Quantized so smol](f5quantized.png)
HF: [Model weights](https://huggingface.co/alandao/f5-tts-mlx-4bit)
# F5 TTS — MLX
_This repo is a fork of the original f5-tts-mlx implementation that ships a quantized flow-matching model only **223MB** in size. It is meant to be used as a component of my blog post on low-VRAM voice generation._
Implementation of [F5-TTS](https://arxiv.org/abs/2410.06885) with the [MLX](https://github.com/ml-explore/mlx) framework.
F5 TTS is a non-autoregressive, zero-shot text-to-speech system using a flow-matching mel spectrogram generator with a diffusion transformer (DiT).
This repo reduces the VRAM usage of the original model so that it can be deployed easily on any Apple device via MLX. As the demos below show, the result is still very usable.
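The size reduction comes from quantizing the model's weights to 4 bits. For context, below is a minimal sketch of how a model like this can be quantized with MLX's built-in post-training quantization; the `F5TTS.from_pretrained` import path and the output filename are assumptions for illustration, not the exact script used to produce this release.

```python
import mlx.core as mx
import mlx.nn as nn
from mlx.utils import tree_flatten

# Loader path assumed from the upstream f5-tts-mlx repo layout.
from f5_tts_mlx.cfm import F5TTS

# Load the full-precision model (~1.35GB of weights).
model = F5TTS.from_pretrained("lucasnewman/f5-tts-mlx")

# Quantize the linear layers in place to 4 bits with a group size of 64.
nn.quantize(model, group_size=64, bits=4)

# Flatten the parameter tree and save the quantized weights.
weights = dict(tree_flatten(model.parameters()))
mx.save_safetensors("f5_tts_4bit.safetensors", weights)
```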
## Demo
### 4-bit (223MB)
https://github.com/user-attachments/assets/406b4624-8f7c-48a4-a35d-2108fb081744
### Original (1.35GB)
https://github.com/user-attachments/assets/c8b6f7c0-65ab-4950-ac96-b10608954174
## Installation
```bash
pip install f5-tts-mlx-quantized
```
## Basic Usage
```bash
python -m f5_tts_mlx.generate --text "The quick brown fox jumped over the lazy dog."
```
You can also use a pipe to generate speech from the output of another process, for instance from a language model:
```bash
mlx_lm.generate --model mlx-community/Llama-3.2-1B-Instruct-4bit --verbose false \
--temp 0 --max-tokens 512 --prompt "Write a concise paragraph explaining wavelets." \
| python -m f5_tts_mlx.generate
```
## Voice Matching
If you want to use your own reference audio sample, make sure it is a mono, 24 kHz WAV file of around 5-10 seconds:
```bash
python -m f5_tts_mlx.generate \
--text "The quick brown fox jumped over the lazy dog." \
--ref-audio /path/to/audio.wav \
--ref-text "This is the caption for the reference audio."
```
You can convert an audio file to the correct format with ffmpeg like this:
```bash
ffmpeg -i /path/to/audio.wav -ac 1 -ar 24000 -sample_fmt s16 -t 10 /path/to/output_audio.wav
```
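If you prefer to check the file from Python, here is a small sketch that verifies the reference audio meets the requirements stated above, using `soundfile` (already a dependency of this package); the path is a placeholder.

```python
import soundfile as sf

# Inspect the reference audio and confirm it matches what the model expects:
# mono, 24 kHz, roughly 5-10 seconds long.
info = sf.info("/path/to/output_audio.wav")

assert info.channels == 1, f"expected mono audio, got {info.channels} channels"
assert info.samplerate == 24000, f"expected 24 kHz, got {info.samplerate} Hz"

if not 5.0 <= info.duration <= 10.0:
    print(f"warning: reference audio is {info.duration:.1f}s; 5-10s tends to work best")
```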
See [here](./f5_tts_mlx) for more options to customize generation.
## From Python
You can also load the pretrained model and generate speech from Python:
```python
from f5_tts_mlx.generate import generate
audio = generate(text="Hello world.", ...)
```
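As a slightly fuller sketch, the snippet below writes the generated audio to disk with `soundfile`. It assumes `generate` returns a 24 kHz waveform array and that the reference-audio keyword arguments mirror the CLI flags above; these names are assumptions, so check the signature in [`f5_tts_mlx/generate.py`](./f5_tts_mlx) for the exact parameters.

```python
import numpy as np
import soundfile as sf

from f5_tts_mlx.generate import generate

# Generate speech with a custom reference voice.
# NOTE: the reference-audio keyword names below are assumptions mirroring the
# CLI flags; verify them against generate()'s actual signature.
audio = generate(
    text="The quick brown fox jumped over the lazy dog.",
    ref_audio_path="/path/to/audio.wav",
    ref_text="This is the caption for the reference audio.",
)

# Assuming the result is a 24 kHz waveform (MLX or NumPy array), save it to disk.
sf.write("output.wav", np.array(audio), samplerate=24000)
```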
The original (non-quantized) pretrained weights are also available [on Hugging Face](https://huggingface.co/lucasnewman/f5-tts-mlx).
## Appreciation
[Lucas Newman](https://github.com/lucasnewman/f5-tts-mlx) for the original MLX implementation of F5 TTS.
[Yushen Chen](https://github.com/SWivid) for the original PyTorch implementation of F5 TTS and the pretrained model.
[Phil Wang](https://github.com/lucidrains) for the E2 TTS implementation that this model is based on.
## Citations
```bibtex
@article{chen-etal-2024-f5tts,
title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
journal={arXiv preprint arXiv:2410.06885},
year={2024},
}
```
```bibtex
@inproceedings{Eskimez2024E2TE,
title = {E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS},
author = {Sefik Emre Eskimez and Xiaofei Wang and Manthan Thakker and Canrun Li and Chung-Hsien Tsai and Zhen Xiao and Hemin Yang and Zirun Zhu and Min Tang and Xu Tan and Yanqing Liu and Sheng Zhao and Naoyuki Kanda},
year = {2024},
url = {https://api.semanticscholar.org/CorpusID:270738197}
}
```
## License
The code in this repository is released under the MIT license as found in the
[LICENSE](LICENSE) file.