# MLX-HuBERT
A pure MLX implementation of HuBERT (Hidden Unit BERT) for Apple Silicon, providing efficient speech representation learning and automatic speech recognition.
## Features
- 🚀 **Optimized for Apple Silicon** - Leverages MLX framework for efficient computation on M1/M2/M3 chips
- 🎯 **Compatible with HuggingFace** - Load pretrained HuBERT models from HuggingFace Hub
- 🔧 **Easy to use** - Simple API similar to Transformers
- 📊 **Efficient** - Faster inference than CPU-based implementations
- 🎤 **Speech Recognition** - Built-in CTC decoding for automatic speech recognition
## Installation
```bash
pip install mlx-hubert
```
Or install from source:
```bash
git clone https://github.com/mzbac/mlx-hubert.git
cd mlx-hubert
pip install -e .
```
## Quick Start
```python
import mlx.core as mx
from mlx_hubert import load_model, HubertProcessor
from datasets import load_dataset
# Load processor and model
processor = HubertProcessor(sampling_rate=16000)
model, config = load_model("mzbac/hubert-large-ls960-ft")
# Load audio dataset
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
# Process audio
inputs = processor(ds[0]["audio"]["array"])
input_values = inputs["input_values"]
# Generate transcription
logits = model(input_values).logits
predicted_ids = mx.argmax(logits, axis=-1)
transcription = processor.decode(predicted_ids[0])
print(transcription)
# Output: "A MAN SAID TO THE UNIVERSE SIR I EXIST"
```
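Under the hood, `processor.decode` performs greedy CTC decoding: it collapses consecutive repeated ids and removes blank tokens. The standalone sketch below is purely illustrative (it assumes a blank id of 0, which may differ in your vocabulary):

```python
def greedy_ctc_collapse(ids, blank_id=0):
    # Illustrative only: processor.decode already does this internally.
    # Collapse consecutive duplicates, then drop the blank token.
    collapsed, prev = [], None
    for i in ids:
        if i != prev and i != blank_id:
            collapsed.append(int(i))
        prev = i
    return collapsed
```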
## Model Architecture
MLX-HuBERT implements the full HuBERT architecture (a minimal sketch of the data flow follows this list):
- **Feature Encoder**: Convolutional layers that process raw audio waveforms
- **Feature Projection**: Projects CNN features to transformer dimension
- **Transformer Encoder**: Self-attention layers for learning representations
- **CTC Head**: Linear layer for character/token prediction (ASR models)
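To make the data flow concrete, here is a minimal, hypothetical sketch of how these stages compose in MLX. The class, layer sizes, and single attention layer are illustrative stand-ins, not the package's actual modules:

```python
import mlx.core as mx
import mlx.nn as nn

class HubertSketch(nn.Module):
    """Illustrative data flow only; not the real implementation."""

    def __init__(self, hidden_size=768, vocab_size=32):
        super().__init__()
        # Feature encoder: strided 1-D convolution over the raw waveform
        # (the real model stacks several such layers)
        self.feature_encoder = nn.Conv1d(1, 512, kernel_size=10, stride=5)
        # Feature projection: CNN channels -> transformer dimension
        self.feature_projection = nn.Linear(512, hidden_size)
        # Transformer encoder: one attention layer standing in for the stack
        self.attention = nn.MultiHeadAttention(hidden_size, num_heads=8)
        # CTC head: per-frame token logits (ASR models only)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def __call__(self, waveform):
        # waveform: (batch, samples, 1); MLX convolutions are channels-last
        x = self.feature_encoder(waveform)   # (batch, frames, 512)
        x = self.feature_projection(x)       # (batch, frames, hidden_size)
        x = x + self.attention(x, x, x)      # contextualized representations
        return self.lm_head(x)               # (batch, frames, vocab_size)
```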
## Supported Models
### Pre-converted Models on HuggingFace Hub
- `mzbac/hubert-large-ls960-ft` - Large model fine-tuned for ASR
### Converting Your Own Models
Use the included conversion script to convert any HuBERT model:
```bash
# Convert base model for feature extraction (automatically detected)
python convert_model.py --model facebook/hubert-base-ls960
# Convert CTC model for speech recognition (automatically detected)
python convert_model.py --model facebook/hubert-large-ls960-ft
```
The script automatically detects whether a model is a base model or CTC model from its configuration. The converted models will be saved in `./converted_models/` by default.
## Advanced Usage
### Batch Processing
```python
# Process multiple audio samples
audio_samples = [ds[i]["audio"]["array"] for i in range(4)]
inputs = processor(audio_samples, padding=True)
input_values = inputs["input_values"]
attention_mask = inputs["attention_mask"]
outputs = model(input_values, attention_mask=attention_mask)
predictions = mx.argmax(outputs.logits, axis=-1)
transcriptions = processor.batch_decode(predictions)
```
### Feature Extraction with Base Models
```python
# Load base model for feature extraction
model, config = load_model("./converted_models/hubert-base-ls960")
processor = HubertProcessor.from_pretrained("./converted_models/hubert-base-ls960")
# Process audio (audio_array is a 1-D waveform sampled at 16 kHz,
# e.g., ds[0]["audio"]["array"] from the Quick Start)
inputs = processor(audio_array)
input_values = inputs["input_values"]
# Extract features
outputs = model(input_values)
features = outputs.last_hidden_state # Shape: (batch, time, hidden_size)
# Get utterance-level embedding
utterance_embedding = mx.mean(features, axis=1) # Shape: (batch, hidden_size)
```
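These utterance-level embeddings can be compared directly, e.g. with cosine similarity. The helper below is an illustrative sketch, not part of the package (see `examples/base_model_example.py` for a fuller similarity example):

```python
import mlx.core as mx

def cosine_similarity(a: mx.array, b: mx.array) -> mx.array:
    # Cosine similarity between two embedding vectors
    return mx.sum(a * b) / (mx.linalg.norm(a) * mx.linalg.norm(b))

# Compare the first two utterances in a batch
score = cosine_similarity(utterance_embedding[0], utterance_embedding[1])
```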
### Custom Vocabulary
```python
# Define custom vocabulary
vocab_dict = {
    "<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3,
    " ": 4, "A": 5, "B": 6,  # ... etc.
}

processor = HubertProcessor(
    vocab_dict=vocab_dict,
    sampling_rate=16000,
)
```
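Rather than writing the mapping out by hand, a character-level vocabulary can also be built programmatically; this snippet is one illustrative way to do it, matching the ids above:

```python
import string

# Special tokens first, then space and the uppercase alphabet
tokens = ["<pad>", "<s>", "</s>", "<unk>", " "] + list(string.ascii_uppercase)
vocab_dict = {token: idx for idx, token in enumerate(tokens)}

processor = HubertProcessor(vocab_dict=vocab_dict, sampling_rate=16000)
```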
## Model Usage
### Using Pre-converted Models
The easiest way is to use models that have already been converted to safetensors format:
```python
from mlx_hubert import load_model, HubertProcessor
# Load from HuggingFace Hub (already converted)
model, config = load_model("mzbac/hubert-large-ls960-ft")
processor = HubertProcessor.from_pretrained("mzbac/hubert-large-ls960-ft")
# Or load from local path
model, config = load_model("./converted_ctc_models")
processor = HubertProcessor.from_pretrained("./converted_ctc_models")
```
### Converting HuggingFace Models
To convert a HuggingFace model to safetensors format for use with MLX:
#### Using the Command Line
```bash
# Convert a CTC model (auto-detects model type)
python convert_model.py --model facebook/hubert-large-ls960-ft
# Convert a base model
python convert_model.py --model facebook/hubert-base-ls960 --type base
# Convert to a specific directory
python convert_model.py --model facebook/hubert-large-ls960-ft --output ./my_model
# Convert without testing
python convert_model.py --model facebook/hubert-large-ls960-ft --no-test
```
#### Using the Python API
```python
from mlx_hubert import convert_from_transformers
# Convert a model programmatically
model_path, config_path = convert_from_transformers(
    "facebook/hubert-large-ls960-ft",
    "./converted_model",
    model_type="auto",  # or "ctc", "base"
)
# Then load the converted model
from mlx_hubert import load_model, HubertProcessor
model, config = load_model("./converted_model")
processor = HubertProcessor.from_pretrained("./converted_model")
```
### Direct PyTorch to MLX Conversion (Advanced)
For advanced users who want full control, you can convert weights directly from a PyTorch state dict in memory:
```python
from transformers import HubertForCTC as HFHubertForCTC
from mlx_hubert import HubertForCTC, HubertConfig
from mlx_hubert.utils import load_pytorch_weights
# Load HuggingFace model
hf_model = HFHubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")
# Create MLX config from HuggingFace config
config = HubertConfig.from_dict(hf_model.config.to_dict())
# Initialize MLX model
mlx_model = HubertForCTC(config)
# Load weights from PyTorch state dict
mlx_model = load_pytorch_weights(mlx_model, hf_model.state_dict(), config)
# Now you can use the model
mlx_model.eval()
```
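Unlike `convert_from_transformers`, this direct path does not write anything to disk. If you want to persist the converted weights, one option (a sketch, assuming the standard MLX utilities) is to flatten the parameter tree and save it as safetensors:

```python
import mlx.core as mx
from mlx.utils import tree_flatten

# Flatten the nested parameter tree into dotted names and save
weights = dict(tree_flatten(mlx_model.parameters()))
mx.save_safetensors("converted_model.safetensors", weights)
```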
## Examples
Check the `examples/` directory for:
- `simple_transcription.py` - Basic speech recognition
- `speech_recognition.py` - Advanced examples with batching and streaming
- `feature_extraction.py` - Extract speech representations
- `base_model_example.py` - Using base models for feature extraction and similarity
## Development
### Running Tests
```bash
pip install -e ".[dev]"
pytest tests/
```
### Code Style
```bash
black mlx_hubert/
isort mlx_hubert/
flake8 mlx_hubert/
```
## Citation
Original HuBERT paper:
```bibtex
@article{hsu2021hubert,
  title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
  author={Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2021}
}
```
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- Thanks to the MLX team at Apple for the excellent framework
- The HuggingFace team for the Transformers implementation
- Meta AI Research for the original HuBERT model