# chunkformer

- Name: chunkformer
- Version: 1.2.1
- Summary: ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription
- Upload time: 2025-10-10 05:06:39
- Requires Python: >=3.11
- License: CC-BY-4.0
- Keywords: speech-recognition, asr, chunkformer, transformer, conformer, pytorch, long-form-audio, machine-learning, deep-learning
# ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription
---
[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
[![Paper](https://img.shields.io/badge/Paper-ICASSP%202025-green)](https://arxiv.org/abs/2502.14673)

This repository contains the implementation and supplementary materials for our ICASSP 2025 paper, **"ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription"**, which was accepted with reviewer scores of **4/4/4**.

[![Ranked #1: Speech Recognition on Common Voice Vi](https://img.shields.io/badge/Ranked%20%231%3A%20Speech%20Recognition%20on%20Common%20Voice%20Vi-%F0%9F%8F%86%20SOTA-blueviolet?style=for-the-badge&logo=paperswithcode&logoColor=white)](https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi)
[![Ranked #1: Speech Recognition on VIVOS](https://img.shields.io/badge/Ranked%20%231%3A%20Speech%20Recognition%20on%20VIVOS-%F0%9F%8F%86%20SOTA-blueviolet?style=for-the-badge&logo=paperswithcode&logoColor=white)](https://paperswithcode.com/sota/speech-recognition-on-vivos)

## Table of Contents
- [Introduction](#introduction)
- [Key Features](#key-features)
- [Installation](#installation)
  - [Install from PyPI (Recommended)](#option-1-install-from-pypi-recommended)
  - [Install from source](#option-2-install-from-source)
  - [Pretrained Models](#pretrained-models)
- [Usage](#usage)
  - [Feature Extraction](#feature-extraction)
  - [Python API Transcription](#python-api-transcription)
  - [Command Line Transcription](#command-line-transcription)
- [Training](#training)
- [Citation](#citation)
- [Acknowledgments](#acknowledgments)

<a name = "introduction" ></a>
## Introduction
ChunkFormer is an ASR model designed for processing long audio inputs effectively on low-memory GPUs. It uses a **chunk-wise processing mechanism** with **relative right context** and employs the **Masked Batch technique** to minimize memory waste due to padding. The model is scalable, robust, and optimized for both streaming and non-streaming ASR scenarios.
![chunkformer_architecture](docs/chunkformer_architecture.png)
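
For intuition, here is a minimal sketch (not the library's internal code) of the chunk-wise idea: the input is split into fixed-size chunks, and each chunk attends only to a bounded window of left and right context. The `chunk_size` and context values mirror the hyperparameters used in the usage examples below.

```python
# Illustrative sketch of chunk-wise processing with bounded context.
# NOT the library's implementation; shown only to convey the idea.
import torch

def split_into_chunks(frames, chunk_size, left_context, right_context):
    """Yield (context_window, chunk_start, chunk_end) over the time axis."""
    num_frames = frames.size(0)
    for start in range(0, num_frames, chunk_size):
        end = min(start + chunk_size, num_frames)
        ctx_start = max(0, start - left_context)        # bounded left context
        ctx_end = min(num_frames, end + right_context)  # bounded right context
        yield frames[ctx_start:ctx_end], start, end

frames = torch.randn(1000, 80)  # dummy (time, feature) sequence
windows = list(split_into_chunks(frames, chunk_size=64,
                                 left_context=128, right_context=128))
print(len(windows), windows[0][0].shape)  # 16 chunks; first window: (192, 80)
```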

<a name = "key-features" ></a>
## Key Features
- **Transcribing Extremely Long Audio**: ChunkFormer can **transcribe audio recordings up to 16 hours** long, with accuracy comparable to existing models. It is currently the first model capable of handling this duration.
- **Efficient Decoding on Low-Memory GPUs**: ChunkFormer can **handle long-form transcription on GPUs with limited memory** without losing context or introducing a mismatch with the training configuration.
- **Masked Batching Technique**: ChunkFormer **removes the need for padding in batches with highly variable lengths**. For instance, **decoding a batch containing audio clips of 1 hour and 1 second costs only 1 hour + 1 second of compute and memory, instead of 2 hours with padding** (see the sketch after the table below).

| GPU Memory | Max Total Batch Duration (minutes) |
|---|---|
| 80GB | 980 |
| 24GB | 240 |
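
As a back-of-the-envelope illustration of why this matters (assuming compute and memory scale linearly with the audio processed), the 1-hour + 1-second example above works out as follows:

```python
# Hypothetical cost comparison: padded batching vs. masked batching.
# Assumes compute/memory scale linearly with the audio processed.
durations_s = [3600, 1]  # a 1-hour clip batched with a 1-second clip

padded_cost = max(durations_s) * len(durations_s)  # every clip padded to the longest
masked_cost = sum(durations_s)                     # only real audio is processed

print(f"padded: {padded_cost} s of audio processed")  # 7200 (2 hours)
print(f"masked: {masked_cost} s of audio processed")  # 3601 (1 hour + 1 second)
```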

<a name = "installation" ></a>
## Installation

### Option 1: Install from PyPI (Recommended)
```bash
pip install chunkformer
```

### Option 2: Install from source
```bash
# Clone the repository
git clone https://github.com/your-username/chunkformer.git
cd chunkformer

# Install in development mode
pip install -e .
```
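
Either way, a quick sanity check that the package is importable (uses only the standard library):

```python
# Verify the installation and report the installed version.
from importlib.metadata import version

import chunkformer  # raises ImportError if the install failed
print("chunkformer", version("chunkformer"))
```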

### Pretrained Models
| Language | Model |
|----------|-------|
| Vietnamese  | [![Hugging Face](https://img.shields.io/badge/HuggingFace-chunkformer--rnnt--large--vie-orange?logo=huggingface)](https://huggingface.co/khanhld/chunkformer-rnnt-large-vie) |
| Vietnamese  | [![Hugging Face](https://img.shields.io/badge/HuggingFace-chunkformer--ctc--large--vie-orange?logo=huggingface)](https://huggingface.co/khanhld/chunkformer-ctc-large-vie) |
| English   | [![Hugging Face](https://img.shields.io/badge/HuggingFace-chunkformer--ctc--large--en--libri--960h-orange?logo=huggingface)](https://huggingface.co/khanhld/chunkformer-large-en-libri-960h) |

<a name = "usage" ></a>
## Usage

### Feature Extraction
```python
from chunkformer import ChunkFormerModel
import torch

device = "cuda:0"

# Load a pre-trained model from Hugging Face or local directory
model = ChunkFormerModel.from_pretrained("khanhld/chunkformer-ctc-large-vie").to(device)
x, x_len = model._load_audio_and_extract_features("path/to/audio")  # x: (T, F), x_len: int
x = x.unsqueeze(0).to(device)
x_len = torch.tensor([x_len], device=device)

# Run the encoder in chunks to extract frame-level features
feature, feature_len = model.encode(
    xs=x,
    xs_lens=x_len,
    chunk_size=64,            # encoder frames per chunk
    left_context_size=128,    # left-context frames each chunk attends to
    right_context_size=128,   # right-context frames each chunk attends to
)

print("feature: ", feature.shape)
print("feature_len: ", feature_len)
```
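
The chunk and context sizes above are specified in encoder frames, not seconds. As a rough conversion (assuming a 10 ms frame shift and 8x convolutional subsampling, which is typical for Conformer encoders but should be verified against this model's actual configuration), the settings above correspond to:

```python
# ASSUMPTION: 10 ms frame shift and 8x subsampling, typical for Conformer
# encoders; check the model config for the actual values.
FRAME_SHIFT_MS = 10
SUBSAMPLING = 8

def frames_to_seconds(n_frames: int) -> float:
    return n_frames * SUBSAMPLING * FRAME_SHIFT_MS / 1000

print(frames_to_seconds(64))   # chunk_size=64          -> ~5.12 s per chunk
print(frames_to_seconds(128))  # left/right context=128 -> ~10.24 s of context
```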

### Python API Transcription
```python
from chunkformer import ChunkFormerModel

# Load a pre-trained encoder from Hugging Face or local directory
model = ChunkFormerModel.from_pretrained("khanhld/chunkformer-ctc-large-vie")

# For single long-form audio transcription
transcription = model.endless_decode(
    audio_path="path/to/long_audio.wav",
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    total_batch_duration=14400,  # in seconds
    return_timestamps=True
)
print(transcription)

# For batch processing of multiple audio files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
transcriptions = model.batch_decode(
    audio_paths=audio_files,
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    total_batch_duration=1800  # Total batch duration in seconds
)

for i, transcription in enumerate(transcriptions):
    print(f"Audio {i+1}: {transcription}")

```

### Command Line Transcription
#### Long-Form Audio Testing
To test the model with a single [long-form audio file](samples/audios/audio_1.wav), run the command below. Supported audio extensions are `.mp3`, `.wav`, `.flac`, `.m4a`, and `.aac`:
```bash
chunkformer-decode \
    --model_checkpoint path/to/hf/checkpoint/repo \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
Example Output:
```
[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
```
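
If you need to post-process this output, the timestamped lines follow a fixed `[start] - [end]: text` pattern, so a small parser is straightforward (a sketch against the sample output above, not an official API):

```python
# Parse "[HH:MM:SS.mmm] - [HH:MM:SS.mmm]: text" lines from the CLI output.
import re

SEGMENT = re.compile(
    r"\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]\s*-\s*"
    r"\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]:\s*(.*)"
)

def parse_segment(line):
    """Return (start_seconds, end_seconds, text) or None if no match."""
    m = SEGMENT.match(line.strip())
    if m is None:
        return None
    g = m.groups()
    to_s = lambda h, mi, s, ms: int(h) * 3600 + int(mi) * 60 + int(s) + int(ms) / 1000
    return to_s(*g[0:4]), to_s(*g[4:8]), g[8]

print(parse_segment("[00:00:01.200] - [00:00:02.400]: this is a transcription example"))
# (1.2, 2.4, 'this is a transcription example')
```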

#### Batch Transcription Testing
The [data.tsv](samples/data.tsv) file must have at least one column named **wav**. Optionally, a column named **txt** can be included to compute the **Word Error Rate (WER)**. Output will be saved to the same file.
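
A minimal data.tsv can be generated with pandas, assuming a plain tab-separated file with a header row matching the column names above (paths and transcripts here are placeholders):

```python
# Build a minimal data.tsv: required "wav" column, optional "txt" column
# used for WER computation. All values below are placeholders.
import pandas as pd

df = pd.DataFrame({
    "wav": ["audio1.wav", "audio2.wav"],
    "txt": ["reference transcript one", "reference transcript two"],
})
df.to_csv("data.tsv", sep="\t", index=False)
```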

```bash
chunkformer-decode \
    --model_checkpoint path/to/hf/checkpoint/repo \
    --audio_list path/to/data.tsv \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
Example Output:
```
WER: 0.1234
```
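
For reference, WER is the word-level edit distance between the hypothesis and the reference, normalized by the reference length. A minimal implementation of the metric (illustrative only, not the package's own scorer) looks like this:

```python
# Minimal word error rate: Levenshtein distance over words / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j
    # hypothesis words (insertions, deletions, substitutions all cost 1)
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(f"WER: {wer('testing the long form audio', 'testing a long form audio'):.4f}")
# WER: 0.2000  (one substitution over five reference words)
```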

---

<a name = "training" ></a>
## Training

See **[🚀 Training Guide 🚀](examples/)** for complete documentation.


<a name = "citation" ></a>
## Citation
If you use this work in your research, please cite:

```bibtex
@INPROCEEDINGS{10888640,
  author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  year={2025},
  pages={1-5},
  keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
  doi={10.1109/ICASSP49660.2025.10888640}
}
```

<a name = "acknowledgments" ></a>
## Acknowledgments
This implementation is based on the WeNet framework. We extend our gratitude to the WeNet development team for providing an excellent foundation for speech recognition research and development.

---

            
