# Moshi - PyTorch

See the [top-level README.md][main_repo] for more information on Moshi.

[Moshi][moshi] is a speech-text foundation model and full-duplex spoken dialogue framework.
It uses [Mimi][moshi], a state-of-the-art streaming neural audio codec. Mimi operates at 12.5 Hz and compresses
24 kHz audio down to 1.1 kbps in a fully streaming manner (a latency of 80 ms, the size of one frame), yet performs better than existing non-streaming codecs.
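
As a quick back-of-the-envelope check of that bitrate, here is a sketch assuming 8 codebooks per frame (as in the API example below) with 2048 entries each, i.e. 11 bits per token; the codebook size is our assumption, not stated here:

```python
# Rough bitrate arithmetic for Mimi. The 8 codebooks match the API example
# below; the 2048-entry codebook size (11 bits per token) is an assumption.
frame_rate = 12.5        # frames per second
num_codebooks = 8        # audio tokens per frame
bits_per_token = 11      # log2(2048), assumed codebook size

bitrate_bps = frame_rate * num_codebooks * bits_per_token
print(f"{bitrate_bps / 1000:.1f} kbps")  # -> 1.1 kbps
```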

This is the PyTorch implementation for Moshi and Mimi.


## Requirements

You will need at least Python 3.10. We kept a minimal set of dependencies for the current project.
It was tested with PyTorch 2.2 and 2.4. If you need a specific CUDA version, please make sure
PyTorch is properly installed before installing Moshi.

```bash
pip install -U moshi      # moshi PyTorch, from PyPI
# Or the bleeding-edge version of Moshi
pip install -U -e "git+https://git@github.com/kyutai-labs/moshi#egg=moshi&subdirectory=moshi"
```

While we hope that the present codebase will work on Windows, we do not provide official support for it.
At the moment, we do not support quantization for the PyTorch version, so you will need a GPU with a significant amount of memory (about 24 GB).
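
If you want to check your environment up front, here is a minimal sketch; the 24 GB figure simply mirrors the guideline above and is not enforced by the package:

```python
# Sanity-check Python, PyTorch and GPU memory before loading the models.
import sys
import torch

assert sys.version_info >= (3, 10), "moshi requires Python 3.10+"
print("PyTorch:", torch.__version__)

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name} ({total_gb:.1f} GB)")
    if total_gb < 24:
        print("Warning: under 24 GB, the full Moshi model may not fit.")
else:
    print("No CUDA GPU detected; only the CPU Mimi examples below will run.")
```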


## Usage

This package provides a streaming version of the audio tokenizer (Mimi) and the language model (Moshi).

To run in interactive mode, you need to start a server that runs the model;
you can then use either the web UI or a command-line client.

Start the server with:
```bash
python -m moshi.server [--gradio-tunnel]
```

Then access the web UI at [localhost:8998](http://localhost:8998). If your GPU is on a remote machine
with no direct access, `--gradio-tunnel` will create a tunnel with a URL accessible from anywhere.
Keep in mind that this tunnel goes through the US and can add significant latency (up to 500ms from Europe).
You can use `--gradio-tunnel-token` to set a fixed secret token and reuse the same address over time.
Alternatively, you might want to use SSH to redirect your connection.
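For instance, a standard port forward such as `ssh -L 8998:localhost:8998 user@remote` (with `user@remote` standing in for your own server) makes the UI reachable at the same local address.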

You can use `--hf-repo` to select a different pretrained model by pointing it at the proper Hugging Face repository.
See [the model list](https://github.com/kyutai-labs/moshi?tab=readme-ov-file#models) for a reference of the available models.
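
The same choice can be made programmatically with the API described below; a minimal sketch, where the repository name is only an illustration (check the model list for the actual names):

```python
# Download Mimi weights from an explicitly chosen Hugging Face repository.
# The repository name below is illustrative; see the model list linked above.
from huggingface_hub import hf_hub_download
from moshi.models import loaders

repo = "kyutai/moshika-pytorch-bf16"  # illustrative choice, see the model list
mimi_weight = hf_hub_download(repo, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device="cpu")
```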

Accessing a server that is not localhost over HTTP may cause issues with using
the microphone in the web UI (some browsers only allow microphone access over
HTTPS).

A local command-line client is also available:
```bash
python -m moshi.client [--url URL_TO_GRADIO]
```
Note, however, that unlike the web UI, this client is bare-bones: it does not perform any echo cancellation,
nor does it try to compensate for growing lag by skipping frames.


## API

You can use Mimi and Moshi programmatically as follows:
```python
from huggingface_hub import hf_hub_download
import torch

from moshi.models import loaders, LMGen

mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device='cpu')
mimi.set_num_codebooks(8)  # up to 32 for mimi, but limited to 8 for moshi.

wav = torch.randn(1, 1, 24000 * 10)  # should be [B, C=1, T]
with torch.no_grad():
    codes = mimi.encode(wav)  # [B, K = 8, T]
    decoded = mimi.decode(codes)

    # Supports streaming too.
    frame_size = int(mimi.sample_rate / mimi.frame_rate)
    all_codes = []
    with mimi.streaming(batch_size=1):
        for offset in range(0, wav.shape[-1], frame_size):
            frame = wav[:, :, offset: offset + frame_size]
            codes = mimi.encode(frame)
            assert codes.shape[-1] == 1, codes.shape
            all_codes.append(codes)

## WARNING: When streaming, make sure to always feed a total amount of audio that is a multiple
#           of the frame size (here 1920 samples), otherwise the last frame will be incomplete and
#           will not be encoded. For simplicity, we recommend always feeding audio in multiples of
#           the frame size, so that you always know how many time steps you get back in `codes`.

# Now if you have a GPU around.
mimi.cuda()
moshi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MOSHI_NAME)
moshi = loaders.get_moshi_lm(moshi_weight, device='cuda')
lm_gen = LMGen(moshi, temp=0.8, temp_text=0.7)  # this handles sampling params etc.
out_wav_chunks = []
# Now we will stream over both Moshi I/O, and decode on the fly with Mimi.
with torch.no_grad(), lm_gen.streaming(1), mimi.streaming(1):
    for idx, code in enumerate(all_codes):
        tokens_out = lm_gen.step(code.cuda())
        # tokens_out is [B, 1 + 8, 1], with tokens_out[:, 0] representing the text token.
        if tokens_out is not None:
            wav_chunk = mimi.decode(tokens_out[:, 1:])
            out_wav_chunks.append(wav_chunk)
        print(idx, end='\r')
out_wav = torch.cat(out_wav_chunks, dim=-1)
```
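
Following the warning above, here is a minimal sketch of padding input audio to a whole number of frames and writing the generated audio to disk. It continues the example above (`mimi` and `out_wav` are defined there); `torchaudio` is used only as one convenient writer and is not a dependency of this package:

```python
# Right-pad [B, C, T] audio with zeros so that T is a multiple of the Mimi
# frame size, then save the generated waveform from the example above.
import math

import torch
import torchaudio

frame_size = int(mimi.sample_rate / mimi.frame_rate)  # 1920 samples at 24 kHz

def pad_to_frame_multiple(wav: torch.Tensor) -> torch.Tensor:
    """Right-pad [B, C, T] audio with zeros up to a multiple of frame_size."""
    n_frames = math.ceil(wav.shape[-1] / frame_size)
    pad = n_frames * frame_size - wav.shape[-1]
    return torch.nn.functional.pad(wav, (0, pad))

wav = pad_to_frame_multiple(torch.randn(1, 1, 24000 * 10 + 123))
assert wav.shape[-1] % frame_size == 0

# Save the generated audio from the streaming loop above ([B, C, T] -> [C, T]).
torchaudio.save("moshi_out.wav", out_wav[0].cpu(), int(mimi.sample_rate))
```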

## Development

If you wish to install from a clone of this repository, for instance to further develop Moshi, you can do the following:
```bash
# From the current folder (e.g. `moshi/`)
pip install -e '.[dev]'
pre-commit install
```

Once locally installed, Mimi can be tested with the following commands, from **the root** of the repository:
```bash
wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
python scripts/mimi_streaming_test.py
```

Similarly, Moshi can be tested (a GPU is required) with:
```bash
python scripts/moshi_benchmark.py
```


## License

The present code is provided under the MIT license.
Note that parts of this code are based on [AudioCraft](https://github.com/facebookresearch/audiocraft), also released under
the MIT license.

## Citation

If you use either Mimi or Moshi, please cite the following paper:

```
@techreport{kyutai2024moshi,
    author = {Alexandre D\'efossez and Laurent Mazar\'e and Manu Orsini and Am\'elie Royer and
              Patrick P\'erez and Herv\'e J\'egou and Edouard Grave and Neil Zeghidour},
    title = {Moshi: a speech-text foundation model for real-time dialogue},
    institution = {Kyutai},
    year = {2024},
    month = {September},
    url = {http://kyutai.org/Moshi.pdf},
}
```

[moshi]: https://kyutai.org/Moshi.pdf
[main_repo]: https://github.com/kyutai-labs/moshi

            
