mlx-voxtral 0.0.2

Summary: Voxtral audio processing and model implementation for Apple Silicon using MLX
Author: Anchen Li
Requires Python: >=3.11
License: Personal Use Only - See LICENSE file
Keywords: mlx, voxtral, whisper, audio, speech, apple-silicon
Uploaded: 2025-07-25 07:59:49
# MLX Voxtral

MLX Voxtral is an optimized implementation of Mistral AI's Voxtral speech models for Apple Silicon, providing efficient audio transcription with support for model quantization and streaming processing.

## Features

- 🚀 **Optimized for Apple Silicon** - Leverages MLX framework for maximum performance on M1/M2/M3 chips
- 🗜️ **Model Quantization** - Reduce model size by 4.3x with minimal quality loss
- 🎙️ **Full Audio Pipeline** - Complete audio processing from file/URL to transcription
- 🔧 **CLI Tools** - Command-line utilities for transcription and quantization
- 📦 **Pre-quantized Models** - Ready-to-use quantized models available

## Installation

### Install from PyPI

```bash
# Install mlx-voxtral from PyPI
pip install mlx-voxtral

# Install transformers from GitHub (required)
pip install git+https://github.com/huggingface/transformers
```

### Install from Source

```bash
# Clone the repository
git clone https://github.com/mzbac/mlx.voxtral
cd mlx.voxtral

# Install in development mode
pip install -e .
```
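After either install, a quick sanity check confirms the package and its MLX dependency import cleanly. This snippet only assumes the standard-library `importlib.metadata` module and the MLX package itself:

```python
import importlib.metadata

import mlx.core as mx
import mlx_voxtral  # noqa: F401

# Report the installed package version and the MLX compute device
print("mlx-voxtral", importlib.metadata.version("mlx-voxtral"))
print("MLX default device:", mx.default_device())
```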

## Quick Start

### Simple Transcription

```python
from mlx_voxtral import VoxtralForConditionalGeneration, VoxtralProcessor

# Load model and processor
model = VoxtralForConditionalGeneration.from_pretrained("mistralai/Voxtral-Mini-3B-2507")
processor = VoxtralProcessor.from_pretrained("mistralai/Voxtral-Mini-3B-2507")

# Transcribe audio
inputs = processor.apply_transcrition_request(
    language="en",
    audio="speech.mp3"
)
outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.0)
transcription = processor.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(transcription)
```

### Command Line Usage

```bash
# Basic transcription
mlx-voxtral.generate --audio speech.mp3

# With custom parameters
mlx-voxtral.generate --model mistralai/Voxtral-Mini-3B-2507 --max-token 2048 --temperature 0.1 --audio speech.mp3

# From URL
mlx-voxtral.generate --audio https://example.com/podcast.mp3

# Using quantized model
mlx-voxtral.generate --model ./voxtral-mini-4bit --audio speech.mp3
```

## Model Quantization

MLX Voxtral includes quantization tools that reduce model size and improve inference performance:

### Quantization Tool

```bash
# Basic 4-bit quantization (recommended)
mlx-voxtral.quantize mistralai/Voxtral-Mini-3B-2507 -o ./voxtral-mini-4bit

# Mixed precision quantization (best quality)
mlx-voxtral.quantize mistralai/Voxtral-Mini-3B-2507 --output-dir ./voxtral-mini-mixed --mixed

# Custom quantization settings
mlx-voxtral.quantize mistralai/Voxtral-Mini-3B-2507 \
    --output-dir ./voxtral-mini-8bit \
    --bits 8 \
    --group-size 32
```
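What `--bits` and `--group-size` control is MLX's group-wise affine quantization: each group of consecutive weights in a row shares one scale and one bias. The CLI's internals are not shown here, but the MLX primitives it builds on can be demonstrated directly. This is a sketch using `mx.quantize`/`mx.dequantize` on a synthetic weight matrix, not the tool's actual code:

```python
import mlx.core as mx

# Synthetic weight matrix standing in for a model layer
w = mx.random.normal((4096, 4096)).astype(mx.float16)

# 4-bit quantization with 64-element groups: every 64 consecutive weights in a
# row share one scale and one bias, so storage drops to roughly 4 bits/weight.
w_q, scales, biases = mx.quantize(w, group_size=64, bits=4)

# Dequantize and measure the error the compression introduced
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=4)
print(f"max abs error: {mx.abs(w - w_hat).max().item():.4f}")
```

Smaller groups (e.g. `--group-size 32`) track the original weights more closely at the cost of extra scale/bias storage, which is the trade-off the custom settings above expose.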

### Using Quantized Models

```python
# Load pre-quantized model (same API as original)
model = VoxtralForConditionalGeneration.from_pretrained("mzbac/voxtral-mini-3b-4bit-mixed")
processor = VoxtralProcessor.from_pretrained(".mzbac/voxtral-mini-3b-4bit-mixed")

# Use exactly like the original model
transcription = model.transcribe("speech.mp3", processor)
```

## Audio Processing Pipeline

### Low-Level Audio Processing

```python
from mlx_voxtral import process_audio_for_voxtral

# Process audio file for direct model input
result = process_audio_for_voxtral("speech.mp3")

# Access processed features
mel_features = result["input_features"]  # Shape: [n_chunks, 128, 3000]
print(f"Audio duration: {result['duration_seconds']:.2f}s")
print(f"Number of 30s chunks: {result['n_chunks']}")
```

The audio processing pipeline (a code sketch follows the list):
1. **Audio Loading**: Supports files and URLs, resamples to 16kHz mono
2. **Chunking**: Splits into 30-second chunks with proper padding
3. **STFT**: 400-point FFT with 160 hop length
4. **Mel Spectrogram**: 128 mel bins covering 0-8000 Hz
5. **Normalization**: Log scale with global max normalization
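
To make those steps concrete, here is a minimal numpy sketch of the feature-extraction math, assuming a precomputed `mel_filters` filterbank of shape `[128, 201]`. It follows the Whisper-style recipe the list describes; the library's actual extractor may differ in padding and windowing details.

```python
import numpy as np

SAMPLE_RATE = 16_000
N_FFT, HOP_LENGTH, N_MELS = 400, 160, 128  # values from the list above
CHUNK_SAMPLES = SAMPLE_RATE * 30           # one 30-second chunk

def log_mel_spectrogram(audio: np.ndarray, mel_filters: np.ndarray) -> np.ndarray:
    """audio: mono float32 at 16 kHz; mel_filters: [128, 201] mel filterbank."""
    # 2. Chunking: pad/truncate to exactly 30 seconds
    audio = np.pad(audio[:CHUNK_SAMPLES], (0, max(0, CHUNK_SAMPLES - len(audio))))
    # 3. STFT: 400-point FFT with a 160-sample hop (no center padding here)
    window = np.hanning(N_FFT + 1)[:-1]
    frames = np.lib.stride_tricks.sliding_window_view(audio, N_FFT)[::HOP_LENGTH]
    power = np.abs(np.fft.rfft(frames * window, axis=-1).T) ** 2  # [201, n_frames]
    # 4. Mel projection onto 128 bins
    mel = mel_filters @ power                                     # [128, n_frames]
    # 5. Log scale with global max normalization (Whisper-style clamp and rescale)
    log_mel = np.log10(np.maximum(mel, 1e-10))
    log_mel = np.maximum(log_mel, log_mel.max() - 8.0)
    return (log_mel + 4.0) / 4.0
```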

## Advanced Usage

### Streaming Transcription

```python
# Process long audio files efficiently
for chunk in model.transcribe_stream("podcast.mp3", processor, chunk_length_s=30):
    print(chunk, end="", flush=True)
```

### Custom Generation Parameters

```python
inputs = processor.apply_transcrition_request(
    language="en",
    audio="speech.mp3"
)

outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    temperature=0.1,
    top_p=0.95,
    repetition_penalty=1.1
)
```

### Processing Multiple Files

```python
# Process multiple audio files sequentially
audio_files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]
transcriptions = []

for audio_file in audio_files:
    inputs = processor.apply_transcrition_request(language="en", audio=audio_file)
    outputs = model.generate(**inputs, max_new_tokens=1024)
    text = processor.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    transcriptions.append(text)
```

Note: The model processes one audio file at a time. For long audio files, it automatically splits them into 30-second chunks internally.
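As a rough illustration of that chunking arithmetic, the snippet below estimates how many 30-second chunks a file yields from its duration. It uses soundfile (listed under Requirements); the file name is a placeholder, and mp3 support depends on your libsndfile build.

```python
import math
import soundfile as sf

info = sf.info("podcast.mp3")               # placeholder file name
duration_s = info.frames / info.samplerate  # total length in seconds
n_chunks = math.ceil(duration_s / 30)       # 30-second chunks, last one padded
print(f"{duration_s:.1f}s of audio -> {n_chunks} chunk(s)")
```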

## Pre-quantized Models

For convenience, pre-quantized models are available:

```python
models = {
    "mzbac/voxtral-mini-3b-4bit-mixed": "3.2GB model with mixed precision",
    "mzbac/voxtral-mini-3b-8bit": "5.3GB model with 8-bit quantization"
}
```

## API Reference

### VoxtralProcessor

```python
processor = VoxtralProcessor.from_pretrained("mistralai/Voxtral-Mini-3B-2507")

# Apply transcription formatting
inputs = processor.apply_transcrition_request(
    language="en",  # or "fr", "de", etc.
    audio="path/to/audio.mp3",
    task="transcribe",  # or "translate"
)

# Decode model outputs
text = processor.decode(token_ids, skip_special_tokens=True)
```

### VoxtralForConditionalGeneration

```python
import mlx.core as mx

model = VoxtralForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Mini-3B-2507",
    dtype=mx.bfloat16  # Optional: specify dtype
)

# Generate transcription
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.0,
    do_sample=False
)
```

## Performance Tips

1. **Use Quantized Models**: 4-bit quantization provides the best balance of size and quality
2. **Temperature Settings**: Use `temperature=0.0` for deterministic transcription
3. **Chunk Size**: Default 30-second chunks are optimal for most use cases
4. **Long Audio**: The model automatically handles long audio by splitting into chunks

## Requirements

- **Python**: 3.11 or higher
- **Platform**: Apple Silicon Mac (M1/M2/M3)
- **Dependencies**:
  - MLX >= 0.26.5
  - mlx-lm >= 0.26.0
  - mistral-common >= 1.8.2
  - transformers (latest from GitHub)
  - Audio: soundfile, soxr, or ffmpeg

## TODO

- [ ] **Batch Processing Support**: Implement batched inference for processing multiple audio files simultaneously
- [ ] **Transformers Tokenizer Integration**: Add support for using Hugging Face Transformers tokenizers as an alternative to mistral-common
- [ ] **Swift Support**: Create a Swift library for Voxtral support

## License

See the LICENSE file for details.

## Acknowledgments

- This implementation is based on Mistral AI's Voxtral models and the Hugging Face Transformers implementation
- Built using Apple's MLX framework for optimized performance on Apple Silicon

            
