# Parakeet MLX
An implementation of the Parakeet models - Nvidia's ASR (Automatic Speech Recognition) models - for Apple Silicon using MLX.
## Installation
> [!NOTE]
> Make sure you have `ffmpeg` installed on your system first, otherwise the CLI won't work properly.
Using [uv](https://docs.astral.sh/uv/) (recommended):
```bash
uv add parakeet-mlx -U
```
Or, for the CLI:
```bash
uv tool install parakeet-mlx -U
```
Using pip:
```bash
pip install parakeet-mlx -U
```
## CLI Quick Start
```bash
parakeet-mlx <audio_files> [OPTIONS]
```
## Arguments
- `audio_files`: One or more audio files to transcribe (WAV, MP3, etc.)
## Options
- `--model` (default: `mlx-community/parakeet-tdt-0.6b-v2`, env: `PARAKEET_MODEL`)
- Hugging Face repository of the model to use
- `--output-dir` (default: current directory)
- Directory to save transcription outputs
- `--output-format` (default: `srt`, env: `PARAKEET_OUTPUT_FORMAT`)
- Output format (txt/srt/vtt/json/all)
- `--output-template` (default: `{filename}`, env: `PARAKEET_OUTPUT_TEMPLATE`)
  - Template for output filenames; `{parent}`, `{filename}`, `{index}`, and `{date}` are supported.
- `--highlight-words` (default: False)
- Enable word-level timestamps in SRT/VTT outputs
- `--verbose` / `-v` (default: False)
- Print detailed progress information
- `--chunk-duration` (default: 120 seconds, env: `PARAKEET_CHUNK_DURATION`)
  - Chunking duration in seconds for long audio; `0` disables chunking
- `--overlap-duration` (default: 15 seconds, env: `PARAKEET_OVERLAP_DURATION`)
- Overlap duration in seconds if using chunking
- `--fp32` / `--bf16` (default: `bf16`, env: `PARAKEET_FP32` - boolean)
  - Precision to use for inference
- `--full-attention` / `--local-attention` (default: `full-attention`, env: `PARAKEET_LOCAL_ATTENTION` - boolean)
  - Use full attention or local attention (local attention reduces intermediate memory usage)
  - Intended for transcribing long audio without chunking
- `--local-attention-context-size` (default: 256, env: `PARAKEET_LOCAL_ATTENTION_CTX`)
  - Local attention context (window) size, in frames of the Parakeet model
## Examples
```bash
# Basic transcription
parakeet-mlx audio.mp3
# Multiple files as VTT subtitles with word-level timestamps
parakeet-mlx *.mp3 --output-format vtt --highlight-words
# Generate all output formats
parakeet-mlx audio.mp3 --output-format all
```
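A few more combinations of the documented flags and environment variables (file names and values are illustrative):

```bash
# Long audio without chunking, bounding memory with local attention
parakeet-mlx long_audio.mp3 --chunk-duration 0 --local-attention

# The same options can be set via environment variables
PARAKEET_OUTPUT_FORMAT=json PARAKEET_CHUNK_DURATION=300 parakeet-mlx audio.mp3
```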
## Python API Quick Start
Transcribe a file:
```py
from parakeet_mlx import from_pretrained
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v2")
result = model.transcribe("audio_file.wav")
print(result.text)
```
Check timestamps:
```py
from parakeet_mlx import from_pretrained
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v2")
result = model.transcribe("audio_file.wav")
print(result.sentences)
# [AlignedSentence(text="Hello World.", start=1.01, end=2.04, duration=1.03, tokens=[...])]
```
Do chunking:
```py
from parakeet_mlx import from_pretrained
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v2")
result = model.transcribe("audio_file.wav", chunk_duration=60 * 2.0, overlap_duration=15.0)
print(result.sentences)
```
Use local attention:
```py
from parakeet_mlx import from_pretrained
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v2")
model.encoder.set_attention_model(
    "rel_pos_local_attn",  # follows NeMo's naming convention
    (256, 256),
)
result = model.transcribe("audio_file.wav")
print(result.sentences)
```
## Timestamp Result
- `AlignedResult`: Top-level result containing the full text and sentences
- `text`: Full transcribed text
- `sentences`: List of `AlignedSentence`
- `AlignedSentence`: Sentence-level alignments with start/end times
- `text`: Sentence text
- `start`: Start time in seconds
- `end`: End time in seconds
  - `duration`: Duration in seconds (`end` - `start`)
- `tokens`: List of `AlignedToken`
- `AlignedToken`: Word/token-level alignments with precise timestamps
- `text`: Token text
- `start`: Start time in seconds
- `end`: End time in seconds
  - `duration`: Duration in seconds (`end` - `start`)
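As a sketch of how these fields compose, the snippet below formats sentence-level alignments as SRT cues by hand, using a `result` from the transcription examples above (the CLI's `srt` output format already does this for you):

```py
def srt_time(seconds: float) -> str:
    # SRT timestamps are formatted as HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1_000)
    return f"{hours:02}:{minutes:02}:{secs:02},{millis:03}"

for index, sentence in enumerate(result.sentences, start=1):
    print(index)
    print(f"{srt_time(sentence.start)} --> {srt_time(sentence.end)}")
    print(sentence.text.strip())
    print()
```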
## Streaming Transcription
For real-time transcription, use the `transcribe_stream` method, which creates a streaming context:
```py
from parakeet_mlx import from_pretrained
from parakeet_mlx.audio import load_audio
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v2")

# Create a streaming context
with model.transcribe_stream(
    context_size=(256, 256),  # (left_context, right_context) frames
) as transcriber:
    # Simulate real-time audio chunks
    audio_data = load_audio("audio_file.wav", model.preprocessor_config.sample_rate)
    chunk_size = model.preprocessor_config.sample_rate  # 1-second chunks

    for i in range(0, len(audio_data), chunk_size):
        chunk = audio_data[i : i + chunk_size]
        transcriber.add_audio(chunk)

        # Access the current transcription
        result = transcriber.result
        print(f"Current text: {result.text}")

        # Finalized and draft tokens are also available:
        # transcriber.finalized_tokens
        # transcriber.draft_tokens
```
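For live microphone input, any capture library that yields float32 PCM at the model's sample rate should work. Here is a minimal sketch using [sounddevice](https://python-sounddevice.readthedocs.io/), which is not a dependency of this package; the blocking one-second read loop and the conversion to an MLX array are illustrative assumptions, not part of this API:

```py
import mlx.core as mx
import sounddevice as sd

from parakeet_mlx import from_pretrained

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v2")
sample_rate = model.preprocessor_config.sample_rate

with model.transcribe_stream(context_size=(256, 256)) as transcriber:
    with sd.InputStream(samplerate=sample_rate, channels=1, dtype="float32") as stream:
        for _ in range(30):  # capture roughly 30 seconds
            block, _overflowed = stream.read(sample_rate)  # one second of audio
            # squeeze to mono shape (sample_rate,); wrap as an MLX array (assumption)
            transcriber.add_audio(mx.array(block.squeeze()))
            print(transcriber.result.text)
```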
### Streaming Parameters
- `context_size`: Tuple of (left_context, right_context) for attention windows
  - Controls how many frames the model attends to before and after the current position
- Default: (256, 256)
- `depth`: Number of encoder layers that preserve exact computation across chunks
  - Controls how many layers maintain exact equivalence with the non-streaming forward pass
  - `depth=1`: only the first encoder layer matches the non-streaming computation exactly
  - `depth=2`: the first two layers match exactly, and so on
  - `depth=N` (where N is the total layer count): full equivalence to the non-streaming forward pass
  - Higher depth means more computational consistency with non-streaming mode
  - Default: 1
- `keep_original_attention`: Whether to keep the original attention mechanism
  - `False`: switches to local attention for streaming (recommended)
  - `True`: keeps the original attention (less suitable for streaming)
  - Default: `False`
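Putting these together, a sketch that passes all three parameters explicitly (the values are illustrative, not recommendations):

```py
from parakeet_mlx import from_pretrained
from parakeet_mlx.audio import load_audio

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v2")
audio = load_audio("audio_file.wav", model.preprocessor_config.sample_rate)

with model.transcribe_stream(
    context_size=(128, 128),        # smaller attention window, lower memory use
    depth=2,                        # first two encoder layers match the non-streaming pass
    keep_original_attention=False,  # switch to local attention (recommended)
) as transcriber:
    transcriber.add_audio(audio)
    print(transcriber.result.text)
```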
## Low-Level API
To transcribe a log-mel spectrogram directly, you can do the following:
```python
import mlx.core as mx

from parakeet_mlx import from_pretrained
from parakeet_mlx.audio import get_logmel, load_audio

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v2")

# Load and preprocess audio manually
audio = load_audio("audio.wav", model.preprocessor_config.sample_rate)
mel = get_logmel(audio, model.preprocessor_config)

# Generate transcription with alignments
# `generate` accepts both [batch, sequence, feat] and [sequence, feat];
# it returns a list of AlignedResult either way
alignments = model.generate(mel)
```
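Since `generate` accepts a batch dimension, spectrograms of equal length can be stacked and decoded in a single call. A small sketch continuing from the block above (batching inputs of different lengths would need padding, which is not shown):

```py
# `mel` from above has shape [sequence, feat]; stack along a new batch axis
batch = mx.stack([mel, mel])     # [batch, sequence, feat]
results = model.generate(batch)  # one AlignedResult per batch element
print(results[0].text)
```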
## Todo
- [X] Add CLI for better usability
- [X] Add support for other Parakeet variants
- [X] Streaming input (real-time transcription with `transcribe_stream`)
- [ ] Option to enhance chosen words' accuracy
- [ ] Chunking with continuous context (partially achieved with streaming)
## Acknowledgments
- Thanks to [Nvidia](https://www.nvidia.com/) for training these awesome models, writing cool papers, and providing a nice implementation.
- Thanks to the [MLX](https://github.com/ml-explore/mlx) project for providing the framework that made this implementation possible.
- Thanks to [audiofile](https://github.com/audeering/audiofile), [audresample](https://github.com/audeering/audresample), [numpy](https://numpy.org), and [librosa](https://librosa.org) for audio processing.
- Thanks to [dacite](https://github.com/konradhalas/dacite) for config management.
## License
Apache 2.0