# Parakeet MLX
An implementation of the Parakeet models - Nvidia's ASR (Automatic Speech Recognition) models - for Apple Silicon using MLX.
## Installation
> [!NOTE]
> Make sure you have `ffmpeg` installed on your system first, otherwise the CLI won't work properly.
Using [uv](https://docs.astral.sh/uv/) (recommended):
```bash
uv add parakeet-mlx -U
```
Or, for the CLI:
```bash
uv tool install parakeet-mlx -U
```
Using pip:
```bash
pip install parakeet-mlx -U
```
## CLI Quick Start
```bash
parakeet-mlx <audio_files> [OPTIONS]
```
## Arguments
- `audio_files`: One or more audio files to transcribe (WAV, MP3, etc.)
## Options
- `--model` (default: `mlx-community/parakeet-tdt-0.6b-v2`, env: `PARAKEET_MODEL`)
- Hugging Face repository of the model to use
- `--output-dir` (default: current directory)
- Directory to save transcription outputs
- `--output-format` (default: `srt`, env: `PARAKEET_OUTPUT_FORMAT`)
- Output format (txt/srt/vtt/json/all)
- `--output-template` (default: `{filename}`, env: `PARAKEET_OUTPUT_TEMPLATE`)
  - Template for output filenames; `{parent}`, `{filename}`, `{index}`, and `{date}` are supported.
- `--highlight-words` (default: False)
- Enable word-level timestamps in SRT/VTT outputs
- `--verbose` / `-v` (default: False)
- Print detailed progress information
- `--chunk-duration` (default: 120 seconds, env: `PARAKEET_CHUNK_DURATION`)
  - Chunking duration in seconds for long audio; `0` disables chunking
- `--overlap-duration` (default: 15 seconds, env: `PARAKEET_OVERLAP_DURATION`)
- Overlap duration in seconds if using chunking
- `--fp32` / `--bf16` (default: `bf16`, env: `PARAKEET_FP32` - boolean)
  - Precision to use for inference
- `--full-attention` / `--local-attention` (default: `full-attention`, env: `PARAKEET_LOCAL_ATTENTION` - boolean)
  - Use full attention or local attention (local attention reduces intermediate memory usage)
  - Intended for transcribing long audio without chunking
- `--local-attention-context-size` (default: 256, env: `PARAKEET_LOCAL_ATTENTION_CTX`)
  - Local attention context (window) size, in frames of the Parakeet model
## Examples
```bash
# Basic transcription
parakeet-mlx audio.mp3
# Multiple files as VTT subtitles with word-level timestamps
parakeet-mlx *.mp3 --output-format vtt --highlight-words
# Generate all output formats
parakeet-mlx audio.mp3 --output-format all
```
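A few more combinations of the documented flags and environment variables (file names and values are illustrative):

```bash
# Long audio without chunking, bounding memory with local attention
parakeet-mlx long_audio.mp3 --chunk-duration 0 --local-attention

# The same options can be set via environment variables
PARAKEET_OUTPUT_FORMAT=json PARAKEET_CHUNK_DURATION=300 parakeet-mlx audio.mp3
```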
## Python API Quick Start
Transcribe a file:
```py
from parakeet_mlx import from_pretrained
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v2")
result = model.transcribe("audio_file.wav")
print(result.text)
```
Check timestamps:
```py
from parakeet_mlx import from_pretrained
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v2")
result = model.transcribe("audio_file.wav")
print(result.sentences)
# [AlignedSentence(text="Hello World.", start=1.01, end=2.04, duration=1.03, tokens=[...])]
```
Do chunking:
```py
from parakeet_mlx import from_pretrained
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v2")
result = model.transcribe("audio_file.wav", chunk_duration=60 * 2.0, overlap_duration=15.0)
print(result.sentences)
```
Use local attention:
```py
from parakeet_mlx import from_pretrained
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v2")
model.encoder.set_attention_model(
    "rel_pos_local_attn",  # follows NeMo's naming convention
    (256, 256),
)
result = model.transcribe("audio_file.wav")
print(result.sentences)
```
## Timestamp Result
- `AlignedResult`: Top-level result containing the full text and sentences
- `text`: Full transcribed text
- `sentences`: List of `AlignedSentence`
- `AlignedSentence`: Sentence-level alignments with start/end times
- `text`: Sentence text
- `start`: Start time in seconds
- `end`: End time in seconds
  - `duration`: Duration in seconds (`end` - `start`)
- `tokens`: List of `AlignedToken`
- `AlignedToken`: Word/token-level alignments with precise timestamps
- `text`: Token text
- `start`: Start time in seconds
- `end`: End time in seconds
  - `duration`: Duration in seconds (`end` - `start`)
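As a sketch of how these fields compose, the snippet below formats sentence-level alignments as SRT cues by hand, using a `result` from the transcription examples above (the CLI's `srt` output format already does this for you):

```py
def srt_time(seconds: float) -> str:
    # SRT timestamps are formatted as HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1_000)
    return f"{hours:02}:{minutes:02}:{secs:02},{millis:03}"

for index, sentence in enumerate(result.sentences, start=1):
    print(index)
    print(f"{srt_time(sentence.start)} --> {srt_time(sentence.end)}")
    print(sentence.text.strip())
    print()
```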
## Streaming Transcription
For real-time transcription, use the `transcribe_stream` method, which creates a streaming context:
```py
from parakeet_mlx import from_pretrained
from parakeet_mlx.audio import load_audio
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v2")

# Create a streaming context
with model.transcribe_stream(
    context_size=(256, 256),  # (left_context, right_context) frames
) as transcriber:
    # Simulate real-time audio chunks
    audio_data = load_audio("audio_file.wav", model.preprocessor_config.sample_rate)
    chunk_size = model.preprocessor_config.sample_rate  # 1-second chunks

    for i in range(0, len(audio_data), chunk_size):
        chunk = audio_data[i : i + chunk_size]
        transcriber.add_audio(chunk)

        # Access the current transcription
        result = transcriber.result
        print(f"Current text: {result.text}")

        # Finalized and draft tokens are also available:
        # transcriber.finalized_tokens
        # transcriber.draft_tokens
```
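For live microphone input, any capture library that yields float32 PCM at the model's sample rate should work. Here is a minimal sketch using [sounddevice](https://python-sounddevice.readthedocs.io/), which is not a dependency of this package; the blocking one-second read loop and the conversion to an MLX array are illustrative assumptions, not part of this API:

```py
import mlx.core as mx
import sounddevice as sd

from parakeet_mlx import from_pretrained

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v2")
sample_rate = model.preprocessor_config.sample_rate

with model.transcribe_stream(context_size=(256, 256)) as transcriber:
    with sd.InputStream(samplerate=sample_rate, channels=1, dtype="float32") as stream:
        for _ in range(30):  # capture roughly 30 seconds
            block, _overflowed = stream.read(sample_rate)  # one second of audio
            # squeeze to mono shape (sample_rate,); wrap as an MLX array (assumption)
            transcriber.add_audio(mx.array(block.squeeze()))
            print(transcriber.result.text)
```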
### Streaming Parameters
- `context_size`: Tuple of (left_context, right_context) for attention windows
  - Controls how many frames the model attends to before and after the current position
- Default: (256, 256)
- `depth`: Number of encoder layers that preserve exact computation across chunks
  - Controls how many layers maintain exact equivalence with the non-streaming forward pass
  - `depth=1`: only the first encoder layer matches the non-streaming computation exactly
  - `depth=2`: the first two layers match exactly, and so on
  - `depth=N` (where N is the total layer count): full equivalence to the non-streaming forward pass
  - Higher depth means more computational consistency with non-streaming mode
  - Default: 1
- `keep_original_attention`: Whether to keep the original attention mechanism
  - `False`: switches to local attention for streaming (recommended)
  - `True`: keeps the original attention (less suitable for streaming)
  - Default: `False`
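Putting these together, a sketch that passes all three parameters explicitly (the values are illustrative, not recommendations):

```py
from parakeet_mlx import from_pretrained
from parakeet_mlx.audio import load_audio

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v2")
audio = load_audio("audio_file.wav", model.preprocessor_config.sample_rate)

with model.transcribe_stream(
    context_size=(128, 128),        # smaller attention window, lower memory use
    depth=2,                        # first two encoder layers match the non-streaming pass
    keep_original_attention=False,  # switch to local attention (recommended)
) as transcriber:
    transcriber.add_audio(audio)
    print(transcriber.result.text)
```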
## Low-Level API
To transcribe a log-mel spectrogram directly, you can do the following:
```python
import mlx.core as mx

from parakeet_mlx import from_pretrained
from parakeet_mlx.audio import get_logmel, load_audio

model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v2")

# Load and preprocess audio manually
audio = load_audio("audio.wav", model.preprocessor_config.sample_rate)
mel = get_logmel(audio, model.preprocessor_config)

# Generate transcription with alignments
# `generate` accepts both [batch, sequence, feat] and [sequence, feat];
# it returns a list of AlignedResult either way
alignments = model.generate(mel)
```
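Since `generate` accepts a batch dimension, spectrograms of equal length can be stacked and decoded in a single call. A small sketch continuing from the block above (batching inputs of different lengths would need padding, which is not shown):

```py
# `mel` from above has shape [sequence, feat]; stack along a new batch axis
batch = mx.stack([mel, mel])     # [batch, sequence, feat]
results = model.generate(batch)  # one AlignedResult per batch element
print(results[0].text)
```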
## Todo
- [X] Add CLI for better usability
- [X] Add support for other Parakeet variants
- [X] Streaming input (real-time transcription with `transcribe_stream`)
- [ ] Option to enhance chosen words' accuracy
- [ ] Chunking with continuous context (partially achieved with streaming)
## Acknowledgments
- Thanks to [Nvidia](https://www.nvidia.com/) for training these awesome models, writing cool papers, and providing a nice implementation.
- Thanks to the [MLX](https://github.com/ml-explore/mlx) project for providing the framework that made this implementation possible.
- Thanks to [audiofile](https://github.com/audeering/audiofile), [audresample](https://github.com/audeering/audresample), [numpy](https://numpy.org), and [librosa](https://librosa.org) for audio processing.
- Thanks to [dacite](https://github.com/konradhalas/dacite) for config management.
## License
Apache 2.0