# MLX Voxtral
MLX Voxtral is an optimized implementation of Mistral AI's Voxtral speech models for Apple Silicon, providing efficient audio transcription with support for model quantization and streaming processing.
## Features
- 🚀 **Optimized for Apple Silicon** - Leverages MLX framework for maximum performance on M1/M2/M3 chips
- 🗜️ **Model Quantization** - Reduce model size by 4.3x with minimal quality loss
- 🎙️ **Full Audio Pipeline** - Complete audio processing from file/URL to transcription
- 🔧 **CLI Tools** - Command-line utilities for transcription and quantization
- 📦 **Pre-quantized Models** - Ready-to-use quantized models available
## Installation
### Install from PyPI
```bash
# Install mlx-voxtral from PyPI
pip install mlx-voxtral
# Install transformers from GitHub (required)
pip install git+https://github.com/huggingface/transformers
```
### Install from Source
```bash
# Clone the repository
git clone https://github.com/mzbac/mlx.voxtral
cd mlx.voxtral
# Install in development mode
pip install -e .
```
## Quick Start
### Simple Transcription
```python
from mlx_voxtral import VoxtralForConditionalGeneration, VoxtralProcessor
# Load model and processor
model = VoxtralForConditionalGeneration.from_pretrained("mistralai/Voxtral-Mini-3B-2507")
processor = VoxtralProcessor.from_pretrained("mistralai/Voxtral-Mini-3B-2507")
# Transcribe audio
inputs = processor.apply_transcrition_request(
    language="en",
    audio="speech.mp3"
)
outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.0)
transcription = processor.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(transcription)
```
### Command Line Usage
```bash
# Basic transcription
mlx-voxtral.generate --audio speech.mp3
# With custom parameters
mlx-voxtral.generate --model mistralai/Voxtral-Mini-3B-2507 --max-token 2048 --temperature 0.1 --audio speech.mp3
# From URL
mlx-voxtral.generate --audio https://example.com/podcast.mp3
# Using quantized model
mlx-voxtral.generate --model ./voxtral-mini-4bit --audio speech.mp3
```
## Model Quantization
MLX Voxtral includes quantization tooling to reduce model size and improve inference performance.
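As a rough, back-of-envelope illustration of where the savings come from (the group size of 64 and one fp16 scale/bias pair per group are assumptions based on MLX's defaults; real on-disk sizes will differ, especially for mixed-precision models that keep some layers at higher precision):

```python
# Illustrative size arithmetic for affine group quantization; not the
# tool's exact accounting. Assumes ~3B parameters and one fp16 scale
# plus one fp16 bias stored per group of 64 weights.
params, group_size = 3e9, 64
bf16_gb = params * 16 / 8 / 1e9            # 16-bit baseline, ~6 GB
for bits in (8, 4):
    eff_bits = bits + 2 * 16 / group_size  # weight bits + per-group overhead
    size_gb = params * eff_bits / 8 / 1e9
    print(f"{bits}-bit: ~{size_gb:.1f} GB ({bf16_gb / size_gb:.1f}x smaller)")
```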
### Quantization Tool
```bash
# Basic 4-bit quantization (recommended)
mlx-voxtral.quantize mistralai/Voxtral-Mini-3B-2507 -o ./voxtral-mini-4bit
# Mixed precision quantization (best quality)
mlx-voxtral.quantize mistralai/Voxtral-Mini-3B-2507 --output-dir ./voxtral-mini-mixed --mixed
# Custom quantization settings
mlx-voxtral.quantize mistralai/Voxtral-Mini-3B-2507 \
    --output-dir ./voxtral-mini-8bit \
    --bits 8 \
    --group-size 32
```
### Using Quantized Models
```python
# Load pre-quantized model (same API as original)
model = VoxtralForConditionalGeneration.from_pretrained("mzbac/voxtral-mini-3b-4bit-mixed")
processor = VoxtralProcessor.from_pretrained(".mzbac/voxtral-mini-3b-4bit-mixed")
# Use exactly like the original model
transcription = model.transcribe("speech.mp3", processor)
```
## Audio Processing Pipeline
### Low-Level Audio Processing
```python
from mlx_voxtral import process_audio_for_voxtral
# Process audio file for direct model input
result = process_audio_for_voxtral("speech.mp3")
# Access processed features
mel_features = result["input_features"] # Shape: [n_chunks, 128, 3000]
print(f"Audio duration: {result['duration_seconds']:.2f}s")
print(f"Number of 30s chunks: {result['n_chunks']}")
```
The audio processing pipeline, sketched in code after this list:
1. **Audio Loading**: Supports files and URLs, resamples to 16kHz mono
2. **Chunking**: Splits into 30-second chunks with proper padding
3. **STFT**: 400-point FFT with 160 hop length
4. **Mel Spectrogram**: 128 mel bins covering 0-8000 Hz
5. **Normalization**: Log scale with global max normalization
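Below is a minimal NumPy sketch of those five steps, for intuition only; it is not the library's implementation. Details such as Whisper-style centering and padding are omitted, so frame counts will differ slightly from the `[n_chunks, 128, 3000]` features above, and the clamp-and-rescale normalization at the end is an assumption borrowed from Whisper's feature extractor.

```python
import numpy as np

SR, N_FFT, HOP, N_MELS, F_MAX, CHUNK_S = 16000, 400, 160, 128, 8000.0, 30

def mel_filterbank() -> np.ndarray:
    """Triangular mel filters mapping the 201 rFFT bins to 128 mel bins."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(F_MAX), N_MELS + 2))
    bins = np.fft.rfftfreq(N_FFT, 1.0 / SR)            # 0..8000 Hz
    fb = np.zeros((N_MELS, bins.size))
    for i in range(N_MELS):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        fb[i] = np.maximum(0.0, np.minimum((bins - lo) / (mid - lo),
                                           (hi - bins) / (hi - mid)))
    return fb

def log_mel(audio: np.ndarray) -> np.ndarray:
    """16 kHz mono audio for one chunk -> [128, n_frames] features."""
    audio = np.pad(audio, (0, max(0, SR * CHUNK_S - audio.size)))  # step 2
    window = np.hanning(N_FFT)
    n_frames = 1 + (audio.size - N_FFT) // HOP
    frames = np.stack([audio[i * HOP:i * HOP + N_FFT] * window     # step 3
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=-1)) ** 2
    mel = mel_filterbank() @ power.T                               # step 4
    log_spec = np.log10(np.maximum(mel, 1e-10))                    # step 5
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
    return (log_spec + 4.0) / 4.0
```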
## Advanced Usage
### Streaming Transcription
```python
# Process long audio files efficiently
for chunk in model.transcribe_stream("podcast.mp3", processor, chunk_length_s=30):
    print(chunk, end="", flush=True)
```
### Custom Generation Parameters
```python
inputs = processor.apply_transcrition_request(
    language="en",
    audio="speech.mp3"
)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    temperature=0.1,
    top_p=0.95,
    repetition_penalty=1.1
)
```
### Processing Multiple Files
```python
# Process multiple audio files sequentially
audio_files = ["audio1.mp3", "audio2.mp3", "audio3.mp3"]
transcriptions = []
for audio_file in audio_files:
    inputs = processor.apply_transcrition_request(language="en", audio=audio_file)
    outputs = model.generate(**inputs, max_new_tokens=1024)
    text = processor.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    transcriptions.append(text)
```
Note: The model processes one audio file at a time. For long audio files, it automatically splits them into 30-second chunks internally.
## Pre-quantized Models
For convenience, pre-quantized models are available:
- `mzbac/voxtral-mini-3b-4bit-mixed`: 3.2 GB, mixed-precision 4-bit quantization
- `mzbac/voxtral-mini-3b-8bit`: 5.3 GB, 8-bit quantization
## API Reference
### VoxtralProcessor
```python
processor = VoxtralProcessor.from_pretrained("mistralai/Voxtral-Mini-3B-2507")
# Apply transcription formatting
inputs = processor.apply_transcrition_request(
    language="en",             # or "fr", "de", etc.
    audio="path/to/audio.mp3",
    task="transcribe",         # or "translate"
)
# Decode model outputs
text = processor.decode(token_ids, skip_special_tokens=True)
```
### VoxtralForConditionalGeneration
```python
import mlx.core as mx

model = VoxtralForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Mini-3B-2507",
    dtype=mx.bfloat16  # Optional: specify dtype
)
# Generate transcription
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.0,
    do_sample=False
)
```
## Performance Tips
1. **Use Quantized Models**: 4-bit quantization provides the best balance of size and quality
2. **Temperature Settings**: Use `temperature=0.0` for deterministic transcription
3. **Chunk Size**: Default 30-second chunks are optimal for most use cases
4. **Long Audio**: The model automatically handles long audio by splitting into chunks
## Requirements
- **Python**: 3.11 or higher
- **Platform**: Apple Silicon Mac (M1/M2/M3)
- **Dependencies**:
- MLX >= 0.26.5
- mlx-lm >= 0.26.0
- mistral-common >= 1.8.2
- transformers (latest from GitHub)
- Audio: soundfile, soxr, or ffmpeg
## TODO
- [ ] **Batch Processing Support**: Implement batched inference for processing multiple audio files simultaneously
- [ ] **Transformers Tokenizer Integration**: Add support for using Hugging Face Transformers tokenizers as an alternative to mistral-common
- [ ] **Swift Support**: Create a Swift library for Voxtral support
## License
Personal use only. See the LICENSE file for details.
## Acknowledgments
- This implementation is based on Mistral AI's Voxtral models and the Hugging Face Transformers implementation
- Built using Apple's MLX framework for optimized performance on Apple Silicon