# ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription
---
[License: CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
[Paper (arXiv:2502.14673)](https://arxiv.org/abs/2502.14673)
This repository contains the implementation and supplementary materials for our ICASSP 2025 paper, **"ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription"**, which was accepted with reviewer scores of **4/4/4**.
[Papers with Code: SOTA on Common Voice Vi](https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi)
[Papers with Code: SOTA on VIVOS](https://paperswithcode.com/sota/speech-recognition-on-vivos)
## Table of Contents
- [Introduction](#introduction)
- [Key Features](#key-features)
- [Installation](#installation)
  - [Install from PyPI (Recommended)](#option-1-install-from-pypi-recommended)
  - [Install from source](#option-2-install-from-source)
  - [Pretrained Models](#pretrained-models)
- [Usage](#usage)
  - [Feature Extraction](#feature-extraction)
  - [Python API Transcription](#python-api-transcription)
  - [Command Line Transcription](#command-line-transcription)
- [Training](#training)
- [Citation](#citation)
- [Acknowledgments](#acknowledgments)
<a name = "introduction" ></a>
## Introduction
ChunkFormer is an ASR model designed for processing long audio inputs effectively on low-memory GPUs. It uses a **chunk-wise processing mechanism** with **relative right context** and employs the **Masked Batch technique** to minimize memory waste due to padding. The model is scalable, robust, and optimized for both streaming and non-streaming ASR scenarios.
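To make the chunk-wise mechanism concrete, here is a minimal hand-written sketch (not the package's internal API) of how a frame sequence can be sliced into fixed-size chunks, each carrying extra left- and right-context frames; the defaults mirror the `chunk_size=64`, `left_context_size=128`, `right_context_size=128` settings used throughout this README:

```python
def split_into_chunks(frames, chunk_size=64, left_context=128, right_context=128):
    """Illustrative only: slice a list of feature vectors into chunks,
    extending each chunk with surrounding context frames."""
    T = len(frames)
    chunks = []
    for start in range(0, T, chunk_size):
        ctx_start = max(0, start - left_context)              # left context, clipped at 0
        ctx_end = min(T, start + chunk_size + right_context)  # right context, clipped at T
        chunks.append(frames[ctx_start:ctx_end])
    return chunks

frames = [[0.0] * 80 for _ in range(300)]  # 300 frames of 80-dim features
chunks = split_into_chunks(frames)
print(len(chunks))     # 5 chunks for 300 frames (ceil(300 / 64))
print(len(chunks[0]))  # 192 = 64 chunk frames + 128 right-context frames
```

Because each chunk only ever sees a bounded window, peak memory stays constant regardless of total audio length, which is what enables decoding multi-hour recordings on a small GPU.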

<a name = "key-features" ></a>
## Key Features
- **Transcribing Extremely Long Audio**: ChunkFormer can **transcribe audio recordings up to 16 hours long** with accuracy comparable to existing models, and is currently the first model capable of handling this duration.
- **Efficient Decoding on Low-Memory GPUs**: ChunkFormer can **handle long-form transcription on GPUs with limited memory** without losing context or introducing a mismatch with the training configuration.
- **Masked Batching Technique**: ChunkFormer **removes the need for padding in batches with highly variable lengths**. For instance, **decoding a batch containing a 1-hour clip and a 1-second clip costs only 1 hour + 1 second of computation and memory, instead of 2 hours with padding.**

| GPU Memory | Max Total Batch Duration (minutes) |
|---|---|
| 80GB | 980 |
| 24GB | 240 |
<a name = "installation" ></a>
## Installation
### Option 1: Install from PyPI (Recommended)
```bash
pip install chunkformer
```
### Option 2: Install from source
```bash
# Clone the repository
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
# Install in development mode
pip install -e .
```
### Pretrained Models
| Language | Model |
|----------|-------|
| Vietnamese | [khanhld/chunkformer-rnnt-large-vie](https://huggingface.co/khanhld/chunkformer-rnnt-large-vie) |
| Vietnamese | [khanhld/chunkformer-ctc-large-vie](https://huggingface.co/khanhld/chunkformer-ctc-large-vie) |
| English | [khanhld/chunkformer-large-en-libri-960h](https://huggingface.co/khanhld/chunkformer-large-en-libri-960h) |
<a name = "usage" ></a>
## Usage
### Feature Extraction
```python
from chunkformer import ChunkFormerModel
import torch
device = "cuda:0"
# Load a pre-trained model from Hugging Face or local directory
model = ChunkFormerModel.from_pretrained("khanhld/chunkformer-ctc-large-vie").to(device)
x, x_len = model._load_audio_and_extract_features("path/to/audio") # x: (T, F), x_len: int
x = x.unsqueeze(0).to(device)
x_len = torch.tensor([x_len], device=device)
# Extract feature
feature, feature_len = model.encode(
    xs=x,
    xs_lens=x_len,
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
)
print("feature: ", feature.shape)
print("feature_len: ", feature_len)
```
### Python API Transcription
```python
from chunkformer import ChunkFormerModel
# Load a pre-trained encoder from Hugging Face or local directory
model = ChunkFormerModel.from_pretrained("khanhld/chunkformer-ctc-large-vie")
# For single long-form audio transcription
transcription = model.endless_decode(
    audio_path="path/to/long_audio.wav",
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    total_batch_duration=14400,  # in seconds
    return_timestamps=True,
)
print(transcription)
# For batch processing of multiple audio files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
transcriptions = model.batch_decode(
    audio_paths=audio_files,
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    total_batch_duration=1800,  # total batch duration in seconds
)
for i, transcription in enumerate(transcriptions):
    print(f"Audio {i+1}: {transcription}")
```
### Command Line Transcription
#### Long-Form Audio Testing
To test the model with a single [long-form audio file](samples/audios/audio_1.wav), run the command below. The audio file extensions ".mp3", ".wav", ".flac", ".m4a", and ".aac" are accepted:
```bash
chunkformer-decode \
    --model_checkpoint path/to/hf/checkpoint/repo \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
Example Output:
```
[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
```
#### Batch Transcription Testing
The [data.tsv](samples/data.tsv) file must contain at least one column named **wav** with the audio paths. Optionally, a column named **txt** with reference transcripts can be included to compute the **Word Error Rate (WER)**. Decoding output is written back to the same file.
```bash
chunkformer-decode \
    --model_checkpoint path/to/hf/checkpoint/repo \
    --audio_list path/to/data.tsv \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```
Example Output:
```
WER: 0.1234
```
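A `data.tsv` in the expected shape can be generated with the standard library; this is a sketch using hypothetical file paths and transcripts, with the required **wav** column and the optional **txt** column:

```python
import csv

# Hypothetical rows: "wav" is required, "txt" is optional (enables WER).
rows = [
    {"wav": "samples/audios/audio_1.wav", "txt": "reference transcript one"},
    {"wav": "samples/audios/audio_2.wav", "txt": "reference transcript two"},
]

with open("data.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["wav", "txt"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)

print(open("data.tsv").read().splitlines()[0])  # wav	txt
```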
---
<a name = "training" ></a>
## Training
See **[🚀 Training Guide 🚀](examples/)** for complete documentation.
<a name = "citation" ></a>
## Citation
If you use this work in your research, please cite:
```bibtex
@INPROCEEDINGS{10888640,
  author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
  year={2025},
  pages={1-5},
  keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
  doi={10.1109/ICASSP49660.2025.10888640}}
```
<a name = "acknowledgments" ></a>
## Acknowledgments
This implementation is based on the WeNet framework. We extend our gratitude to the WeNet development team for providing an excellent foundation for speech recognition research and development.
---