mlx_hubert

Name: mlx_hubert
Version: 0.1.0
Home page: https://github.com/ml-explore/mlx-hubert
Summary: HuBERT (Hidden Unit BERT) implementation in MLX for Apple Silicon
Upload time: 2025-07-27 13:24:02
Author: MLX Community
Requires Python: >=3.8
License: MIT
Keywords: mlx, hubert, speech-recognition, asr, apple-silicon, machine-learning
Requirements: none recorded
# MLX-HuBERT

A pure MLX implementation of HuBERT (Hidden Unit BERT) for Apple Silicon, providing efficient speech representation learning and automatic speech recognition.

## Features

- 🚀 **Optimized for Apple Silicon** - Leverages MLX framework for efficient computation on M1/M2/M3 chips
- 🎯 **Compatible with HuggingFace** - Load pretrained HuBERT models from HuggingFace Hub
- 🔧 **Easy to use** - Simple API similar to Transformers
- 📊 **Efficient** - Faster inference compared to CPU-based implementations
- 🎤 **Speech Recognition** - Built-in CTC decoding for automatic speech recognition

## Installation

```bash
pip install mlx-hubert
```

Or install from source:

```bash
git clone https://github.com/mzbac/mlx-hubert.git
cd mlx-hubert
pip install -e .
```

## Quick Start

```python
import mlx.core as mx
from mlx_hubert import load_model, HubertProcessor
from datasets import load_dataset

# Load processor and model
processor = HubertProcessor(sampling_rate=16000)
model, config = load_model("mzbac/hubert-large-ls960-ft")

# Load audio dataset
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# Process audio
inputs = processor(ds[0]["audio"]["array"])
input_values = inputs["input_values"]

# Generate transcription
logits = model(input_values).logits
predicted_ids = mx.argmax(logits, axis=-1)
transcription = processor.decode(predicted_ids[0])

print(transcription)
# Output: "A MAN SAID TO THE UNIVERSE SIR I EXIST"
```
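
The quick start pulls a sample from a HuggingFace dataset, but any local recording works the same way once it is a 16 kHz mono float array. Below is a minimal sketch, reusing the `processor` and `model` from the quick start and reading the file with `soundfile` (not a dependency of this package; the file name is a placeholder):

```python
import mlx.core as mx
import soundfile as sf

# Read a local recording; HuBERT expects 16 kHz mono audio
audio, sample_rate = sf.read("speech.wav")
assert sample_rate == 16000, "resample to 16 kHz before passing audio to the processor"

inputs = processor(audio)
logits = model(inputs["input_values"]).logits
predicted_ids = mx.argmax(logits, axis=-1)
print(processor.decode(predicted_ids[0]))
```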

## Model Architecture

MLX-HuBERT implements the full HuBERT architecture (see the frame-count sketch after this list):

- **Feature Encoder**: Convolutional layers that process raw audio waveforms
- **Feature Projection**: Projects CNN features to transformer dimension  
- **Transformer Encoder**: Self-attention layers for learning representations
- **CTC Head**: Linear layer for character/token prediction (ASR models)
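
Because the CTC head emits one prediction per feature-encoder frame, the output length is determined by the convolutional strides rather than the raw sample count. Here is a rough frame-count sketch, assuming the standard HuBERT/wav2vec2 feature-encoder geometry (verify the kernel sizes and strides against your model's config):

```python
# Assumed standard HuBERT/wav2vec2 feature-encoder geometry
KERNEL_SIZES = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)

def num_feature_frames(num_samples: int) -> int:
    """Number of frames the CNN encoder (and hence the transformer and CTC head) produces."""
    length = num_samples
    for kernel, stride in zip(KERNEL_SIZES, STRIDES):
        length = (length - kernel) // stride + 1
    return length

print(num_feature_frames(16000))  # one second of 16 kHz audio -> 49 frames (~20 ms per frame)
```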

## Supported Models

### Pre-converted Models on HuggingFace Hub

- `mzbac/hubert-large-ls960-ft` - Large model fine-tuned for ASR

### Converting Your Own Models

Use the included conversion script to convert any HuBERT model:

```bash
# Convert base model for feature extraction (automatically detected)
python convert_model.py --model facebook/hubert-base-ls960

# Convert CTC model for speech recognition (automatically detected) 
python convert_model.py --model facebook/hubert-large-ls960-ft
```

The script automatically detects whether a model is a base model or CTC model from its configuration. The converted models will be saved in `./converted_models/` by default.

## Advanced Usage

### Batch Processing

```python
# Process multiple audio samples
audio_samples = [ds[i]["audio"]["array"] for i in range(4)]
inputs = processor(audio_samples, padding=True)

input_values = inputs["input_values"]
attention_mask = inputs["attention_mask"]

outputs = model(input_values, attention_mask=attention_mask)
predictions = mx.argmax(outputs.logits, axis=-1)

transcriptions = processor.batch_decode(predictions)
```

### Feature Extraction with Base Models

```python
# Load base model for feature extraction
model, config = load_model("./converted_models/hubert-base-ls960")
processor = HubertProcessor.from_pretrained("./converted_models/hubert-base-ls960")

# Process audio
inputs = processor(audio_array)
input_values = inputs["input_values"]

# Extract features
outputs = model(input_values)
features = outputs.last_hidden_state  # Shape: (batch, time, hidden_size)

# Get utterance-level embedding
utterance_embedding = mx.mean(features, axis=1)  # Shape: (batch, hidden_size)
```
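
The utterance embeddings above can be compared directly, for example with cosine similarity (as in `base_model_example.py`). A minimal sketch using plain `mlx.core` ops; the two-utterance batch is assumed for illustration:

```python
import mlx.core as mx

def cosine_similarity(a: mx.array, b: mx.array) -> mx.array:
    """Cosine similarity between two 1-D embedding vectors."""
    return mx.sum(a * b) / (mx.sqrt(mx.sum(a * a)) * mx.sqrt(mx.sum(b * b)))

# Assuming `utterance_embedding` holds embeddings for at least two utterances
score = cosine_similarity(utterance_embedding[0], utterance_embedding[1])
print(score.item())  # closer to 1.0 for acoustically similar utterances
```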

### Custom Vocabulary

```python
# Define custom vocabulary
vocab_dict = {
    "<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3,
    " ": 4, "A": 5, "B": 6, # ... etc
}

processor = HubertProcessor(
    vocab_dict=vocab_dict,
    sampling_rate=16000
)
```
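
For a character-level English vocabulary, the mapping can also be built programmatically instead of written out by hand. A small sketch; the token set here is illustrative and must match whatever labels your CTC model was trained with:

```python
import string

# Special tokens first, then space, uppercase letters, and apostrophe
special_tokens = ["<pad>", "<s>", "</s>", "<unk>"]
characters = [" "] + list(string.ascii_uppercase) + ["'"]
vocab_dict = {token: idx for idx, token in enumerate(special_tokens + characters)}

processor = HubertProcessor(
    vocab_dict=vocab_dict,
    sampling_rate=16000,
)
```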

## Model Usage

### Using Pre-converted Models

The easiest way is to use models that have already been converted to safetensors format:

```python
from mlx_hubert import load_model, HubertProcessor

# Load from HuggingFace Hub (already converted)
model, config = load_model("mzbac/hubert-large-ls960-ft")
processor = HubertProcessor.from_pretrained("mzbac/hubert-large-ls960-ft")

# Or load from local path
model, config = load_model("./converted_ctc_models")
processor = HubertProcessor.from_pretrained("./converted_ctc_models")
```

### Converting HuggingFace Models

To convert a HuggingFace model to safetensors format for use with MLX:

#### Using the Command Line

```bash
# Convert a CTC model (auto-detects model type)
python convert_model.py --model facebook/hubert-large-ls960-ft

# Convert a base model
python convert_model.py --model facebook/hubert-base-ls960 --type base

# Convert to a specific directory
python convert_model.py --model facebook/hubert-large-ls960-ft --output ./my_model

# Convert without testing
python convert_model.py --model facebook/hubert-large-ls960-ft --no-test
```

#### Using the Python API

```python
from mlx_hubert import convert_from_transformers

# Convert a model programmatically
model_path, config_path = convert_from_transformers(
    "facebook/hubert-large-ls960-ft",
    "./converted_model",
    model_type="auto"  # or "ctc", "base"
)

# Then load the converted model
from mlx_hubert import load_model, HubertProcessor

model, config = load_model("./converted_model")
processor = HubertProcessor.from_pretrained("./converted_model")
```

### Direct PyTorch to MLX Conversion (Advanced)

For advanced users who want to map PyTorch weights into an MLX model in memory, without writing an intermediate converted checkpoint:

```python
from transformers import HubertForCTC as HFHubertForCTC
from mlx_hubert import HubertForCTC, HubertConfig
from mlx_hubert.utils import load_pytorch_weights

# Load HuggingFace model
hf_model = HFHubertForCTC.from_pretrained("facebook/hubert-large-ls960-ft")

# Create MLX config from HuggingFace config
config = HubertConfig.from_dict(hf_model.config.to_dict())

# Initialize MLX model
mlx_model = HubertForCTC(config)

# Load weights from PyTorch state dict
mlx_model = load_pytorch_weights(mlx_model, hf_model.state_dict(), config)

# The MLX model is now ready for inference
mlx_model.eval()
```

## Examples

Check the `examples/` directory for:

- `simple_transcription.py` - Basic speech recognition
- `speech_recognition.py` - Advanced examples with batching and streaming
- `feature_extraction.py` - Extract speech representations
- `base_model_example.py` - Using base models for feature extraction and similarity


## Development

### Running Tests

```bash
pip install -e ".[dev]"
pytest tests/
```

### Code Style

```bash
black mlx_hubert/
isort mlx_hubert/
flake8 mlx_hubert/
```

## Citation

Original HuBERT paper:

```bibtex
@article{hsu2021hubert,
  title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
  author={Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2021}
}
```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- Thanks to the MLX team at Apple for the excellent framework
- The HuggingFace team for the Transformers implementation
- Meta AI Research for the original HuBERT model

            
