# 🎯 Bournemouth Forced Aligner (BFA)
<div align="center">
**High-precision multi-lingual phoneme-level timestamp extraction from audio files**
> 🎯 **Find the exact time when any phoneme is spoken** - provided you have the audio and its text.
[🚀 Quick Start](#-getting-started) • [📚 Documentation](#-how-does-it-work) • [🔧 Installation](#-installation) • [💻 CLI](#-command-line-interface-cli) • [🤝 Contributing](https://github.com/tabahi/bournemouth-forced-aligner/issues)
</div>
---
## ✨ Overview
BFA is a lightning-fast Python library that extracts **phoneme/word timestamps** from audio files with millisecond precision. Built on [Contextless Universal Phoneme Encoder (CUPE)](https://github.com/tabahi/contexless-phonemes-CUPE), it delivers accurate forced alignment for speech analysis, linguistics research, and audio processing applications.
## 🌟 Key Features
<div align="center">
| Feature | Description | Performance |
|---------|-------------|-------------|
| ⚡ **Ultra-Fast** | CPU-optimized processing | 0.2s for 10s audio |
| 🎯 **Phoneme-Level** | Millisecond-precision timestamps | High accuracy alignment |
| 🌍 **Multi-Language** | Via espeak phonemization | 80+ Indo-European + related |
| 🔧 **Easy Integration** | JSON & TextGrid output | Praat compatibility |
</div>
**Words+Phonemes aligned to Mel-spectrum frames:**


Try [mel_spectrum_alignment.py](examples/mel_spectrum_alignment.py)
---
## 🚀 Installation
### 📦 From PyPI (Recommended)
```bash
# Install the package
pip install bournemouth-forced-aligner
# Alternatively, install the latest library directly from github:
# pip install git+https://github.com/tabahi/bournemouth-forced-aligner.git
# Install system dependencies
apt-get install espeak-ng ffmpeg
```
### ✅ Verify Installation
```bash
# Show help
balign --help
# Check version
balign --version
# Test installation
python -c "from bournemouth_aligner import PhonemeTimestampAligner; print('✅ Installation successful!')"
```
---
## 🎯 Getting Started
### 🔥 Quick Example
```python
import torch
import time
import json
from bournemouth_aligner import PhonemeTimestampAligner
# Configuration
text_sentence = "butterfly"
audio_path = "examples/samples/audio/109867__timkahn__butterfly.wav"
# Initialize aligner using language preset (recommended)
extractor = PhonemeTimestampAligner(
    preset="en-us",  # Automatically selects best English model
    duration_max=10,
    device='cpu'
)
# Alternative: explicit model selection
# extractor = PhonemeTimestampAligner(
#     model_name="en_libri1000_uj01d_e199_val_GER=0.2307.ckpt",
#     lang='en-us',
#     duration_max=10,
#     device='cpu'
# )
# Load and process
audio_wav = extractor.load_audio(audio_path)  # for a preloaded wav, apply RMS normalization instead: audio_wav = extractor._rms_normalize(audio_wav)
t0 = time.time()
timestamps = extractor.process_sentence(
    text_sentence,
    audio_wav,
    ts_out_path=None,
    extract_embeddings=False,
    vspt_path=None,
    do_groups=True,
    debug=True
)
t1 = time.time()
print("๐ฏ Timestamps:")
print(json.dumps(timestamps, indent=4, ensure_ascii=False))
print(f"โก Processing time: {t1 - t0:.2f} seconds")
```
### 🌐 Multi-Language Examples
```python
# German with MLS8 model
aligner_de = PhonemeTimestampAligner(preset="de")
# Hindi with Universal model
aligner_hi = PhonemeTimestampAligner(preset="hi")
# French with MLS8 model
aligner_fr = PhonemeTimestampAligner(preset="fr")
```
### 📊 Sample Output
<details>
<summary>📋 Click to see detailed JSON output</summary>
```json
{
    "segments": [
        {
            "start": 0.0,
            "end": 1.2588125,
            "text": "butterfly",
            "ph66": [29, 10, 58, 9, 43, 56, 23],
            "pg16": [7, 2, 14, 2, 8, 13, 5],
            "coverage_analysis": {
                "target_count": 7,
                "aligned_count": 7,
                "missing_count": 0,
                "extra_count": 0,
                "coverage_ratio": 1.0,
                "missing_phonemes": [],
                "extra_phonemes": []
            },
            "ipa": ["b", "ʌ", "ɾ", "ɚ", "f", "l", "aɪ"],
            "word_num": [0, 0, 0, 0, 0, 0, 0],
            "words": ["butterfly"],
            "phoneme_ts": [
                {"phoneme_idx": 29, "phoneme_label": "b", "start_ms": 33.56833267211914, "end_ms": 50.35249710083008, "confidence": 0.9849503040313721},
                {"phoneme_idx": 10, "phoneme_label": "ʌ", "start_ms": 100.70499420166016, "end_ms": 117.48916625976562, "confidence": 0.8435571193695068},
                {"phoneme_idx": 58, "phoneme_label": "ɾ", "start_ms": 134.27333068847656, "end_ms": 151.0574951171875, "confidence": 0.3894280791282654},
                {"phoneme_idx": 9, "phoneme_label": "ɚ", "start_ms": 285.3308410644531, "end_ms": 302.114990234375, "confidence": 0.3299962282180786},
                {"phoneme_idx": 43, "phoneme_label": "f", "start_ms": 369.2516784667969, "end_ms": 386.03582763671875, "confidence": 0.9150863289833069},
                {"phoneme_idx": 56, "phoneme_label": "l", "start_ms": 520.3091430664062, "end_ms": 553.8775024414062, "confidence": 0.9060741662979126},
                {"phoneme_idx": 23, "phoneme_label": "aɪ", "start_ms": 604.22998046875, "end_ms": 621.01416015625, "confidence": 0.21650740504264832}
            ],
            "group_ts": [
                {"group_idx": 7, "group_label": "voiced_stops", "start_ms": 33.56833267211914, "end_ms": 50.35249710083008, "confidence": 0.9911064505577087},
                {"group_idx": 2, "group_label": "central_vowels", "start_ms": 100.70499420166016, "end_ms": 117.48916625976562, "confidence": 0.8446590304374695},
                {"group_idx": 14, "group_label": "rhotics", "start_ms": 134.27333068847656, "end_ms": 151.0574951171875, "confidence": 0.28526052832603455},
                {"group_idx": 2, "group_label": "central_vowels", "start_ms": 285.3308410644531, "end_ms": 302.114990234375, "confidence": 0.7377423048019409},
                {"group_idx": 8, "group_label": "voiceless_fricatives", "start_ms": 352.4674987792969, "end_ms": 402.8199768066406, "confidence": 0.9877637028694153},
                {"group_idx": 13, "group_label": "laterals", "start_ms": 520.3091430664062, "end_ms": 553.8775024414062, "confidence": 0.9163824915885925},
                {"group_idx": 5, "group_label": "diphthongs", "start_ms": 604.22998046875, "end_ms": 621.01416015625, "confidence": 0.4117060899734497}
            ],
            "words_ts": [
                {
                    "word": "butterfly",
                    "start_ms": 33.56833267211914,
                    "end_ms": 621.01416015625,
                    "confidence": 0.6550856615815844,
                    "ph66": [29, 10, 58, 9, 43, 56, 23],
                    "ipa": ["b", "ʌ", "ɾ", "ɚ", "f", "l", "aɪ"]
                }
            ]
        }
    ]
}
```
</details>
### 🔑 Output Format Guide
| Key | Description | Format |
|-----|-------------|--------|
| `ph66` | Standardized 66 phoneme classes (including silence) | See [ph66_mapper.py](bournemouth_aligner/ipamappers/ph66_mapper.py) |
| `pg16` | 16 phoneme category groups (lateral, vowels, rhotics, etc.) | Grouped classifications |
| `ipa` | IPA sequences from espeak | Unicode IPA symbols |
| `words` | Word segmentation | Regex-based: `\b\w+\b` |
| `phoneme_ts` | Aligned phoneme timestamps | Millisecond precision |
| `group_ts` | Phoneme group timestamps | Often more accurate |
| `word_num` | Word index for each phoneme | Maps phonemes to words |
| `words_ts` | Word-level timestamps | Derived from phonemes |
| `coverage_analysis` | Alignment quality metrics | Insertions/deletions |
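For example, the per-phoneme entries can be consumed straight from the saved JSON. A minimal sketch based on the sample output above (the file path is illustrative):

```python
import json

with open("timestamps.json", "r", encoding="utf-8") as f:  # illustrative path
    data = json.load(f)

seg = data["segments"][0]
for ph, word_idx in zip(seg["phoneme_ts"], seg["word_num"]):
    # word_num maps each phoneme back to its word
    print(f'{seg["words"][word_idx]}: /{ph["phoneme_label"]}/ '
          f'{ph["start_ms"]:.1f}-{ph["end_ms"]:.1f} ms ({ph["confidence"]:.2f})')
```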
---
## 🛠️ Methods
### 🌍 Language Presets
BFA supports **80+ languages** through intelligent preset selection, focusing on Indo-European and closely related language families. Simply pass a language code as the `preset` parameter for automatic model and language configuration.
**⚠️ Note**: Tonal languages (Chinese, Vietnamese, Thai) and distant language families (Japanese, Korean, Bantu, etc.) are not supported through presets due to CUPE model limitations.
```python
# Using presets (recommended)
aligner = PhonemeTimestampAligner(preset="de") # German with MLS8 model
aligner = PhonemeTimestampAligner(preset="hi") # Hindi with Universal model
aligner = PhonemeTimestampAligner(preset="fr") # French with MLS8 model
```
#### 🎯 Parameter Priority
1. **Explicit `cupe_ckpt_path`** (highest priority)
2. **Explicit `model_name`**
3. **Preset values** (only if no explicit model specified)
4. **Default values**
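In practice this means an explicit model always wins over the preset, while `lang` still controls phonemization. An illustrative sketch using model names documented below:

```python
# The "de" preset would normally select the MLS8 model, but the explicit
# model_name takes priority; lang keeps espeak phonemizing German.
aligner = PhonemeTimestampAligner(
    preset="de",
    model_name="multi_mswc38_ug20_e59_val_GER=0.5611.ckpt",  # overrides the preset's model
    lang="de",
)
```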
#### 📋 Complete Preset Table
<details>
<summary>🔍 Click to view all 80+ supported language presets</summary>
| **Language** | **Preset Code** | **Model Used** | **Language Family** |
|--------------|-----------------|----------------|-------------------|
| **🇺🇸 ENGLISH VARIANTS** | | | |
| English (US) | `en-us`, `en` | English Model | West Germanic |
| English (UK) | `en-gb` | English Model | West Germanic |
| English (Caribbean) | `en-029` | English Model | West Germanic |
| English (Lancastrian) | `en-gb-x-gbclan` | English Model | West Germanic |
| English (RP) | `en-gb-x-rp` | English Model | West Germanic |
| English (Scottish) | `en-gb-scotland` | English Model | West Germanic |
| English (West Midlands) | `en-gb-x-gbcwmd` | English Model | West Germanic |
| **🇪🇺 EUROPEAN LANGUAGES (MLS8)** | | | |
| German | `de` | MLS8 Model | West Germanic |
| French | `fr` | MLS8 Model | Romance |
| French (Belgium) | `fr-be` | MLS8 Model | Romance |
| French (Switzerland) | `fr-ch` | MLS8 Model | Romance |
| Spanish | `es` | MLS8 Model | Romance |
| Spanish (Latin America) | `es-419` | MLS8 Model | Romance |
| Italian | `it` | MLS8 Model | Romance |
| Portuguese | `pt` | MLS8 Model | Romance |
| Portuguese (Brazil) | `pt-br` | MLS8 Model | Romance |
| Polish | `pl` | MLS8 Model | West Slavic |
| Dutch | `nl` | MLS8 Model | West Germanic |
| Danish | `da` | MLS8 Model | North Germanic |
| Swedish | `sv` | MLS8 Model | North Germanic |
| Norwegian Bokmรฅl | `nb` | MLS8 Model | North Germanic |
| Icelandic | `is` | MLS8 Model | North Germanic |
| Czech | `cs` | MLS8 Model | West Slavic |
| Slovak | `sk` | MLS8 Model | West Slavic |
| Slovenian | `sl` | MLS8 Model | South Slavic |
| Croatian | `hr` | MLS8 Model | South Slavic |
| Bosnian | `bs` | MLS8 Model | South Slavic |
| Serbian | `sr` | MLS8 Model | South Slavic |
| Macedonian | `mk` | MLS8 Model | South Slavic |
| Bulgarian | `bg` | MLS8 Model | South Slavic |
| Romanian | `ro` | MLS8 Model | Romance |
| Hungarian | `hu` | MLS8 Model | Uralic |
| Estonian | `et` | MLS8 Model | Uralic |
| Latvian | `lv` | MLS8 Model | Baltic |
| Lithuanian | `lt` | MLS8 Model | Baltic |
| Catalan | `ca` | MLS8 Model | Romance |
| Aragonese | `an` | MLS8 Model | Romance |
| Papiamento | `pap` | MLS8 Model | Romance |
| Haitian Creole | `ht` | MLS8 Model | Romance |
| Afrikaans | `af` | MLS8 Model | West Germanic |
| Luxembourgish | `lb` | MLS8 Model | West Germanic |
| Irish Gaelic | `ga` | MLS8 Model | Celtic |
| Scottish Gaelic | `gd` | MLS8 Model | Celtic |
| Welsh | `cy` | MLS8 Model | Celtic |
| **🌏 INDO-EUROPEAN LANGUAGES (Universal)** | | | |
| Russian | `ru` | Universal Model | East Slavic |
| Russian (Latvia) | `ru-lv` | Universal Model | East Slavic |
| Ukrainian | `uk` | Universal Model | East Slavic |
| Belarusian | `be` | Universal Model | East Slavic |
| Hindi | `hi` | Universal Model | Indic |
| Bengali | `bn` | Universal Model | Indic |
| Urdu | `ur` | Universal Model | Indic |
| Punjabi | `pa` | Universal Model | Indic |
| Gujarati | `gu` | Universal Model | Indic |
| Marathi | `mr` | Universal Model | Indic |
| Nepali | `ne` | Universal Model | Indic |
| Assamese | `as` | Universal Model | Indic |
| Oriya | `or` | Universal Model | Indic |
| Sinhala | `si` | Universal Model | Indic |
| Konkani | `kok` | Universal Model | Indic |
| Bishnupriya Manipuri | `bpy` | Universal Model | Indic |
| Sindhi | `sd` | Universal Model | Indic |
| Persian | `fa` | Universal Model | Iranian |
| Persian (Latin) | `fa-latn` | Universal Model | Iranian |
| Kurdish | `ku` | Universal Model | Iranian |
| Greek (Modern) | `el` | Universal Model | Greek |
| Greek (Ancient) | `grc` | Universal Model | Greek |
| Armenian (East) | `hy` | Universal Model | Indo-European |
| Armenian (West) | `hyw` | Universal Model | Indo-European |
| Albanian | `sq` | Universal Model | Indo-European |
| Latin | `la` | Universal Model | Italic |
| **🇹🇷 TURKIC LANGUAGES (Universal)** | | | |
| Turkish | `tr` | Universal Model | Turkic |
| Azerbaijani | `az` | Universal Model | Turkic |
| Kazakh | `kk` | Universal Model | Turkic |
| Kyrgyz | `ky` | Universal Model | Turkic |
| Uzbek | `uz` | Universal Model | Turkic |
| Tatar | `tt` | Universal Model | Turkic |
| Turkmen | `tk` | Universal Model | Turkic |
| Uyghur | `ug` | Universal Model | Turkic |
| Bashkir | `ba` | Universal Model | Turkic |
| Chuvash | `cu` | Universal Model | Turkic |
| Nogai | `nog` | Universal Model | Turkic |
| **🇫🇮 URALIC LANGUAGES (Universal)** | | | |
| Finnish | `fi` | Universal Model | Uralic |
| Lule Saami | `smj` | Universal Model | Uralic |
| **🕌 SEMITIC LANGUAGES (Universal)** | | | |
| Arabic | `ar` | Universal Model | Semitic |
| Hebrew | `he` | Universal Model | Semitic |
| Amharic | `am` | Universal Model | Semitic |
| Maltese | `mt` | Universal Model | Semitic |
| **🏝️ MALAYO-POLYNESIAN LANGUAGES (Universal)** | | | |
| Indonesian | `id` | Universal Model | Malayo-Polynesian |
| Malay | `ms` | Universal Model | Malayo-Polynesian |
| **🇮🇳 DRAVIDIAN LANGUAGES (Universal)** | | | |
| Tamil | `ta` | Universal Model | Dravidian |
| Telugu | `te` | Universal Model | Dravidian |
| Kannada | `kn` | Universal Model | Dravidian |
| Malayalam | `ml` | Universal Model | Dravidian |
| **🇬🇪 SOUTH CAUCASIAN LANGUAGES (Universal)** | | | |
| Georgian | `ka` | Universal Model | South Caucasian |
| **🗾 LANGUAGE ISOLATES & OTHERS (Universal)** | | | |
| Basque | `eu` | Universal Model | Language Isolate |
| Quechua | `qu` | Universal Model | Quechuan |
| **🛸 CONSTRUCTED LANGUAGES (Universal)** | | | |
| Esperanto | `eo` | Universal Model | Constructed |
| Interlingua | `ia` | Universal Model | Constructed |
| Ido | `io` | Universal Model | Constructed |
| Lingua Franca Nova | `lfn` | Universal Model | Constructed |
| Lojban | `jbo` | Universal Model | Constructed |
| Pyash | `py` | Universal Model | Constructed |
| Lang Belta | `qdb` | Universal Model | Constructed |
| Quenya | `qya` | Universal Model | Constructed |
| Klingon | `piqd` | Universal Model | Constructed |
| Sindarin | `sjn` | Universal Model | Constructed |
</details>
#### 🔧 Model Selection Guide
| **Model** | **Languages** | **Use Case** | **Performance** |
|-----------|---------------|--------------|-----------------|
| **English Model** | English variants | Best for English | Highest accuracy for English |
| **MLS8 Model** | 8 European + similar | European languages | High accuracy for European |
| **Universal Model** | 60+ Indo-European + related | Other supported languages | Good for Indo-European families |
**⚠️ Unsupported Language Types:**
- **Tonal languages**: Chinese (Mandarin, Cantonese), Vietnamese, Thai, Burmese
- **Distant families**: Japanese, Korean, most African languages (Swahili, etc.)
- **Indigenous languages**: Most Native American, Polynesian (except Indonesian/Malay)
- **Recommendation**: For unsupported languages, use an explicit `model_name` with caution, as sketched below
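A hedged sketch of that recommendation (alignment quality is not guaranteed; Japanese is outside the supported families and only serves as an example here):

```python
# Forcing the Universal model onto an unsupported language may still produce
# rough alignments, but treat the results with caution.
aligner = PhonemeTimestampAligner(
    model_name="multi_mswc38_ug20_e59_val_GER=0.5611.ckpt",
    lang="ja",  # espeak language code used for phonemization
)
```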
### Initialization
```python
PhonemeTimestampAligner(
    preset="en-us",            # Language preset (recommended)
    model_name=None,           # Optional: explicit model override
    cupe_ckpt_path=None,       # Optional: direct checkpoint path
    lang="en-us",              # Language for phonemization
    duration_max=10,
    output_frames_key="phoneme_idx",
    device="cpu",
    silence_anchors=10,
    boost_targets=True,
    enforce_minimum=True,
    enforce_all_targets=True,
    ignore_noise=True
)
```
**Parameters:**
- `preset`: **[NEW]** Language preset for automatic model and language selection. Use language codes like "de", "fr", "hi", etc. Supports 80+ languages with intelligent model selection.
- `model_name`: Name of the CUPE model (see [HuggingFace models](https://huggingface.co/Tabahi/CUPE-2i/tree/main/ckpt)). Overrides preset selection. Downloaded automatically if available.
- `cupe_ckpt_path`: Local path to model checkpoint. Highest priority - overrides both `preset` and `model_name`.
- `lang`: Language code for phonemization ([espeak lang codes](https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md)). Overridden by the preset only when left at its default.
- `duration_max`: Maximum segment duration (seconds, for batch padding). Best to keep <30 seconds.
- `output_frames_key`: Output key for frame assortment (`phoneme_idx`, `phoneme_label`, `group_idx`, `group_label`).
- `device`: Inference device (`cpu` or `cuda`).
- `silence_anchors`: Number of silent frames that anchor a pause (i.e., segments are split wherever at least `silence_anchors` consecutive frames are silent). Set `0` to disable. Default is `10`; lower values increase sensitivity to silences. Best to set `enforce_all_targets=True` when using this.
- `boost_targets`: Boost target phoneme probabilities for better alignment.
- `enforce_minimum`: Enforce a minimum probability for target phonemes.
- `enforce_all_targets`: Band-aid postprocessing patch that inserts phonemes missed by Viterbi decoding at their expected positions, based on target order.
- `ignore_noise`: Whether to ignore predicted "noise" in the alignment. If True, noise is skipped over; if False, long noisy/silent stretches are included as "noise" timestamps.
---
**Models:**
- `model_name="en_libri1000_uj01d_e199_val_GER=0.2307.ckpt"` for best performance on English. This model is trained on 1000 hours LibriSpeech.
- `model_name="en_libri1000_uj01d_e62_val_GER=0.2438.ckpt"` for best performance on heavy accented English speech. This is the same as above, just unsettled weights.
- `model_name="multi_MLS8_uh02_e36_val_GER=0.2334.ckpt"` for best performance on 8 european languages including English, German, French, Dutch, Italian, Spanish, Italian, Portuguese, Polish. This model's accuracy on English (buckeye corpus) is on par with the above (main) English model. We can only assume that the performance will be the same on the rest of the 7 languages.
- `model_name="multi_mswc38_ug20_e59_val_GER=0.5611.ckpt"` universal model for all non-tonal languages. This model is extremely acoustic, if it hears /i/, it will mark an /i/ regardless of the language.
- Models for tonal languages (Mandarin, Vietnamese, Thai) will have to wait.
Do not forget to set `lang="en-us"` parameter based on your model and [Language Identifier](https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md).
### Process SRT File
```python
PhonemeTimestampAligner.process_srt_file(
    srt_path,
    audio_path,
    ts_out_path=None,
    extract_embeddings=False,
    vspt_path=None,
    do_groups=True,
    debug=True
)
```
**Parameters:**
- `srt_path`: Path to input SRT file (Whisper-style JSON format).
- `audio_path`: Path to audio file.
- `ts_out_path`: Output path for timestamps (vs2 format).
- `extract_embeddings`: Extract embeddings.
- `vspt_path`: Path to save embeddings (`.pt` file).
- `do_groups`: Extract group timestamps.
- `debug`: Enable debug output.
**Returns:**
- `timestamps_dict`: Dictionary with extracted timestamps.
---
### Process text sentences
```python
PhonemeTimestampAligner.process_sentence(
    text,
    audio_wav,
    ts_out_path=None,
    extract_embeddings=False,
    vspt_path=None,
    do_groups=True,
    debug=False
)
```
**Parameters:**
- `text`: Sentence/text.
- `audio_wav`: Audio waveform tensor (`torch.Tensor`).
- `ts_out_path`: Output path for timestamps (optional).
- `extract_embeddings`: Extract embeddings (optional).
- `vspt_path`: Path to save embeddings (`.pt`, optional).
- `do_groups`: Extract group timestamps (optional).
- `debug`: Enable debug output (optional).
Returns: `timestamps_dict`
---
### 🗣️ Convert Text to Phonemes
Phonemization in BFA is powered by the [phonemizer](https://github.com/bootphon/phonemizer) package, using the [espeak-ng](https://github.com/espeak-ng/espeak-ng) backend for robust multi-language support.
```python
PhonemeTimestampAligner.phonemize_sentence(text)
```
**Optional:** Change the espeak language after initialization:
```python
PhonemeTimestampAligner.phonemizer.set_backend(language='en')
```
**Method Description:**
Phonemizes a sentence and returns a detailed mapping:
- `text`: Original input sentence
- `ipa`: List of phonemes in IPA format
- `ph66`: List of phoneme class indices (mapped to 66-class set)
- `pg16`: List of phoneme group indices (16 broad categories)
- `words`: List of words corresponding to phonemes
- `word_num`: Word indices for each phoneme
**Example Usage:**
```python
result = PhonemeTimestampAligner.phonemize_sentence("butterfly")
print(result["ipa"]) # ['b', 'ส', 'ษพ', 'ษ', 'f', 'l', 'aษช']
print(result["ph66"]) # [29, 10, 58, 9, 43, 56, 23]
print(result["pg16"]) # [7, 2, 14, 2, 8, 13, 5]
```
### Extract Timestamps from Segment
```python
PhonemeTimestampAligner.extract_timestamps_from_segment(
    wav,
    wav_len,
    phoneme_sequence,
    start_offset_time=0,
    group_sequence=None,
    extract_embeddings=True,
    do_groups=True,
    debug=True
)
```
**Parameters:**
- `wav`: Audio tensor for the segment, shape `[1, samples]`.
- `wav_len`: Length of the audio segment (samples).
- `phoneme_sequence`: List/tensor of phoneme indices (ph66).
- `start_offset_time`: Segment start offset (seconds).
- `group_sequence`: Optional group indices (pg16).
- `extract_embeddings`: Extract pooled phoneme embeddings.
- `do_groups`: Extract phoneme group timestamps.
- `debug`: Enable debug output.
**Returns:**
- `timestamp_dict`: Contains phoneme and group timestamps.
- `pooled_embeddings_phonemes`: Pooled phoneme embeddings or `None`.
- `pooled_embeddings_groups`: Pooled group embeddings or `None`.
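A minimal usage sketch, assuming an aligner `extractor` initialized as in the Quick Example (here the "segment" is simply the whole file):

```python
phonemized = extractor.phonemize_sentence("butterfly")
wav = extractor.load_audio("examples/samples/audio/109867__timkahn__butterfly.wav")

timestamp_dict, emb_ph, emb_gr = extractor.extract_timestamps_from_segment(
    wav,
    wav.shape[1],                       # segment length in samples
    phonemized["ph66"],                 # target phoneme indices
    start_offset_time=0,
    group_sequence=phonemized["pg16"],
    extract_embeddings=False,           # embeddings come back as None
    do_groups=True,
    debug=False
)
```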
---
### Convert to TextGrid
```python
PhonemeTimestampAligner.convert_to_textgrid(
    timestamps_dict,
    output_file=None,
    include_confidence=False
)
```
**Description:**
Converts VS2 timestamp data to [Praat TextGrid](https://www.fon.hum.uva.nl/praat/manual/TextGrid_file_format.html) format.
**Parameters:**
- `timestamps_dict`: Timestamp dictionary (from alignment).
- `output_file`: Path to save TextGrid file (optional).
- `include_confidence`: Include confidence values in output (optional).
**Returns:**
- `textgrid_content`: TextGrid file content as string.
---
## 🔧 Advanced Usage
### 🎙️ Mel-Spectrum Alignment
BFA provides advanced mel-spectrogram compatibility methods for audio synthesis workflows. These enable seamless integration with [HiFi-GAN](https://github.com/jik876/hifi-gan), [BigVGAN](https://github.com/NVIDIA/BigVGAN), and other mel-based audio processing pipelines.
See the full [example here](examples/mel_spectrum_alignment.py).
#### Extract Mel Spectrogram
```python
PhonemeTimestampAligner.extract_mel_spectrum(
    wav,
    wav_sample_rate,
    vocoder_config={
        'num_mels': 80, 'num_freq': 1025, 'n_fft': 1024, 'hop_size': 256,
        'win_size': 1024, 'sampling_rate': 22050, 'fmin': 0, 'fmax': 8000,
        'model': 'whatever_22khz_80band_fmax8k_256x'
    }
)
```
**Description:**
Extracts mel spectrogram from audio with vocoder compatibility.
**Parameters:**
- `wav`: Input waveform tensor of shape `(1, T)`
- `wav_sample_rate`: Sample rate of the input waveform
- `vocoder_config`: Configuration dictionary for HiFiGAN/BigVGAN vocoder compatibility.
**Returns:**
- `mel`: Mel spectrogram tensor of shape `(frames, mel_bins)` - transposed for easy frame-wise processing
#### Frame-wise Assortment
```python
PhonemeTimestampAligner.framewise_assortment(
    aligned_ts,
    total_frames,
    frames_per_second,
    gap_contraction=5,
    select_key="phoneme_idx"
)
```
**Description:**
Converts timestamp-based phoneme alignment to frame-wise labels matching mel-spectrogram frames.
**Parameters:**
- `aligned_ts`: List of timestamp dictionaries (from `phoneme_ts`, `group_ts`, or `word_ts`)
- `total_frames`: Total number of frames in the mel spectrogram
- `frames_per_second`: Frame rate of the mel spectrogram
- `gap_contraction`: Number of frames to fill silent gaps on either side of segments (default: 5)
- `select_key`: Key to extract from timestamps (`"phoneme_idx"`, `"group_idx"`, etc.)
**Returns:**
- List of frame labels with length `total_frames`
#### Frame Compression
```python
PhonemeTimestampAligner.compress_frames(frames_list)
```
**Description:**
Compresses consecutive identical frame values into run-length encoded format.
**Example:**
```python
frames = [0,0,0,0,1,1,1,1,3,4,5,4,5,2,2,2]
compressed = compress_frames(frames)
# Returns: [(0,4), (1,4), (3,1), (4,1), (5,1), (4,1), (5,1), (2,3)]
```
**Returns:**
- List of `(frame_value, count)` tuples
#### Frame Decompression
```python
PhonemeTimestampAligner.decompress_frames(compressed_frames)
```
**Description:**
Decompresses run-length encoded frames back to full frame sequence.
**Parameters:**
- `compressed_frames`: List of `(phoneme_id, count)` tuples
**Returns:**
- Decompressed list of frame labels
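Compression and decompression are exact inverses, so a round trip recovers the original frame labels (assuming an initialized aligner `extractor`):

```python
frames = [0, 0, 0, 0, 1, 1, 1, 1, 3, 4, 5, 4, 5, 2, 2, 2]
compressed = extractor.compress_frames(frames)
assert extractor.decompress_frames(compressed) == frames  # run-length coding is lossless
```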
<details>
<summary>📋 Complete mel-spectrum alignment example</summary>
```python
# pip install librosa
import torch
from bournemouth_aligner import PhonemeTimestampAligner
# Initialize aligner
extractor = PhonemeTimestampAligner(
    model_name="en_libri1000_uj01d_e199_val_GER=0.2307.ckpt",
    lang='en-us', duration_max=10, device='cpu'
)
# Process audio and get timestamps
audio_wav = extractor.load_audio("examples/samples/audio/109867__timkahn__butterfly.wav")
timestamps = extractor.process_sentence("butterfly", audio_wav)
# Extract mel spectrogram with vocoder compatibility
vocoder_config = {'num_mels': 80, 'hop_size': 256, 'sampling_rate': 22050}
segment_wav = audio_wav[:, :int(timestamps['segments'][0]['end'] * extractor.resampler_sample_rate)]
mel_spec = extractor.extract_mel_spectrum(segment_wav, extractor.resampler_sample_rate, vocoder_config)
# Create frame-wise phoneme alignment
total_frames = mel_spec.shape[0]
frames_per_second = total_frames / timestamps['segments'][0]['end']
frames_assorted = extractor.framewise_assortment(
    aligned_ts=timestamps['segments'][0]['phoneme_ts'],
    total_frames=total_frames,
    frames_per_second=frames_per_second
)
# Compress and visualize
compressed_frames = extractor.compress_frames(frames_assorted)
# Use provided plot_mel_phonemes() function to visualize
```
</details>
### 🔗 Integration Examples
<details>
<summary>🎙️ Whisper Integration</summary>
```python
# pip install git+https://github.com/openai/whisper.git
import whisper, json
from bournemouth_aligner import PhonemeTimestampAligner
# Transcribe and align
model = whisper.load_model("turbo")
result = model.transcribe("audio.wav")
with open("whisper_output.srt.json", "w") as f:
json.dump(result, f)
# Process with BFA
extractor = PhonemeTimestampAligner(model_name="en_libri1000_uj01d_e199_val_GER=0.2307.ckpt")
timestamps = extractor.process_srt_file("whisper_output.srt.json", "audio.wav", "timestamps.json")
```
</details>
<details>
<summary>🔬 Manual Processing Pipeline</summary>
```python
import torch
from bournemouth_aligner import PhonemeTimestampAligner
# Initialize and process
extractor = PhonemeTimestampAligner(model_name="en_libri1000_uj01d_e199_val_GER=0.2307.ckpt")
audio_wav = extractor.load_audio("audio.wav") # Handles resampling and normalization
timestamps = extractor.process_sentence("your text here", audio_wav)
# Export to Praat
extractor.convert_to_textgrid(timestamps, "output.TextGrid")
```
</details>
### 🤖 Machine Learning Integration
For phoneme embeddings in ML pipelines, check out our [embeddings example](examples/read_embeddings.py).
---
## 💻 Command Line Interface (CLI)
### 🚀 Quick CLI Usage
```bash
# Basic alignment
balign audio.wav transcription.srt.json output.json
# With debug output
balign audio.wav transcription.srt.json output.json --debug
# Extract embeddings
balign audio.wav transcription.srt.json output.json --embeddings embeddings.pt
```
### ⚙️ Command Syntax
```bash
balign [OPTIONS] AUDIO_PATH SRT_PATH OUTPUT_PATH
```
### 📋 Arguments & Options
<div align="center">
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `AUDIO_PATH` | **Required** | - | Audio file path (.wav, .mp3, .flac) |
| `SRT_PATH` | **Required** | - | SRT JSON file path |
| `OUTPUT_PATH` | **Required** | - | Output timestamps (.json) |
</div>
<details>
<summary>🔧 Advanced Options</summary>
| Option | Default | Description |
|--------|---------|-------------|
| `--model TEXT` | `en_libri1000_uj01d_e199_val_GER=0.2307.ckpt` | CUPE model from [HuggingFace](https://huggingface.co/Tabahi/CUPE-2i/tree/main/ckpt) |
| `--lang TEXT` | `en-us` | Language code ([espeak codes](https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md)) |
| `--device TEXT` | `cpu` | Processing device (`cpu` or `cuda`) |
| `--embeddings PATH` | None | Save phoneme embeddings (.pt file) |
| `--duration-max FLOAT` | `10.0` | Max segment duration (seconds) |
| `--debug / --no-debug` | `False` | Enable detailed output |
| `--boost-targets / --no-boost-targets` | `True` | Enable target phoneme boosting |
| `--help` | | Show help message |
| `--version` | | Show version info |
</details>
### 📚 CLI Examples
```bash
# Basic usage
balign audio.wav transcription.srt.json output.json
# With GPU and embeddings
balign audio.wav transcription.srt.json output.json --device cuda --embeddings embeddings.pt
# Multi-language (English and 8-European-language models available)
balign audio.wav transcription.srt.json output.json --lang es
# Batch processing
for audio in *.wav; do balign "$audio" "${audio%.wav}.srt" "${audio%.wav}.json"; done
```
### 📄 Input Format
SRT files must be in JSON format:
```json
{
    "segments": [
        {
            "start": 0.0,
            "end": 3.5,
            "text": "hello world this is a test"
        },
        {
            "start": 3.5,
            "end": 7.2,
            "text": "another segment of speech"
        }
    ]
}
```
### 🎯 Creating Input Files
Use Whisper for transcription (see [Integration Examples](#-integration-examples)) or create SRT JSON manually with the format shown above.
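A minimal sketch of writing such a file from Python (segment values are the ones shown above):

```python
import json

srt = {"segments": [
    {"start": 0.0, "end": 3.5, "text": "hello world this is a test"},
    {"start": 3.5, "end": 7.2, "text": "another segment of speech"},
]}
with open("transcription.srt.json", "w", encoding="utf-8") as f:
    json.dump(srt, f)
```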
### 🔍 Debug Mode
Enable comprehensive processing information:
```bash
balign audio.wav transcription.srt.json output.json --debug
```
<details>
<summary>📋 Sample debug output</summary>
```
🚀 Bournemouth Forced Aligner
🎵 Audio: audio.wav
📝 SRT: transcription.srt.json
💾 Output: output.json
🏷️ Language: en-us
🖥️ Device: cpu
🎯 Model: en_libri1000_uj01d_e199_val_GER=0.2307.ckpt
--------------------------------------------------
🔧 Initializing aligner...
Setting backend for language: en-us
✅ Aligner initialized successfully
🎵 Processing audio...
Loaded SRT file with 1 segments from transcription.srt.json
Resampling audio.wav from 22050Hz to 16000Hz
Expected phonemes: ['p', 'ɹ', 'ɪ', ...'ʃ', 'ə', 'n']
Target phonemes: 108, Expected: ['p', 'ɹ', 'ɪ', ..., 'ʃ', 'ə', 'n']
Spectral length: 600
Forced alignment took 135.305 ms
Aligned phonemes: 108
Target phonemes: 108
SUCCESS: All target phonemes were aligned!
============================================================
PROCESSING SUMMARY
============================================================
Total segments processed: 1
Perfect sequence matches: 1/1 (100.0%)
Total phonemes aligned: 108
Overall average confidence: 0.502
============================================================
Results saved to: output.json
✅ Timestamps extracted to output.json
📊 Processed 1 segments with 108 phonemes
🎉 Processing completed successfully!
```
</details>
---
## 🧠 How Does It Work?
Read the full paper: [BFA: REAL-TIME MULTILINGUAL TEXT-TO-SPEECH FORCED ALIGNMENT](https://arxiv.org/pdf/2509.23147)
### 🔄 Processing Pipeline
```mermaid
graph TD
A[Audio Input] --> B[RMS Normalization]
B --> C[Audio Windowing]
C --> D[CUPE Model Inference]
D --> E[Phoneme/Group Probabilities]
E --> F[Text Phonemization]
F --> G[Target Boosting]
G --> H[Viterbi Forced Alignment]
H --> I[Missing Phoneme Recovery]
I --> J[Confidence Calculation]
J --> K[Frame-to-Time Conversion]
K --> L[Output Generation]
```
**CTC Transition Rules** (see the sketch after this list):
- **Stay**: Remain in current state (repeat phoneme or blank)
- **Advance**: Move to next state in sequence
- **Skip**: Jump over blank to next phoneme (when consecutive phonemes differ)
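For intuition, here is a minimal, self-contained sketch of the standard CTC Viterbi recurrence these rules describe. It is illustrative only (plain Python over toy log-probabilities), not BFA's internal implementation:

```python
import math

def ctc_viterbi(log_probs, targets, blank=0):
    """Best CTC path through [blank, p1, blank, p2, ..., blank]."""
    # CTC path construction: interleave blank tokens between target phonemes.
    path = [blank]
    for p in targets:
        path += [p, blank]
    T, S = len(log_probs), len(path)
    NEG = -math.inf
    dp = [[NEG] * S for _ in range(T)]
    bp = [[0] * S for _ in range(T)]
    dp[0][0] = log_probs[0][path[0]]
    if S > 1:
        dp[0][1] = log_probs[0][path[1]]
    for t in range(1, T):
        for s in range(S):
            # Stay in the current state, or advance from the previous one.
            cands = [(dp[t - 1][s], s)]
            if s > 0:
                cands.append((dp[t - 1][s - 1], s - 1))
            # Skip over a blank when consecutive phonemes differ.
            if s > 1 and path[s] != blank and path[s] != path[s - 2]:
                cands.append((dp[t - 1][s - 2], s - 2))
            best, prev = max(cands)
            dp[t][s] = best + log_probs[t][path[s]]
            bp[t][s] = prev
    # End in the final blank or the final phoneme, whichever scored higher.
    s = S - 1 if S == 1 or dp[T - 1][S - 1] >= dp[T - 1][S - 2] else S - 2
    states = [s]
    for t in range(T - 1, 0, -1):
        s = bp[t][s]
        states.append(s)
    states.reverse()
    return [path[i] for i in states]  # per-frame labels (blank = 0)

# Toy run: 3 classes (0 = blank), 4 frames, target sequence [1, 2].
lp = [[-1.0, -0.1, -3.0], [-0.5, -0.7, -2.0], [-2.0, -3.0, -0.1], [-0.1, -2.0, -1.5]]
print(ctc_viterbi(lp, [1, 2]))  # -> [1, 0, 2, 0]
```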
**Core Components:**
1. **🎵 Audio Preprocessing**: RMS normalization and windowing (120ms windows, 80ms stride)
2. **🧠 CUPE Model**: Contextless Universal Phoneme Encoder extracts frame-level phoneme probabilities
3. **📝 Phonemization**: espeak-ng converts text to 66-class phoneme indices (ph66) and 16 phoneme groups (pg16)
4. **🎯 Target Boosting**: Enhances probabilities of expected phonemes for better alignment
5. **🔍 CTC-style Viterbi**: CTC-based forced alignment with minimum probability enforcement
6. **🛠️ Recovery Mechanism**: Ensures all target phonemes appear in alignment, even with low confidence
7. **📊 Confidence Scoring**: Frame-level probability averaging with adaptive thresholding
8. **⏱️ Timestamp Conversion**: Frame indices to millisecond timestamps with segment offset
### 🎛️ Key Alignment Parameters
BFA provides several unique control parameters not available in traditional aligners like MFA:
#### 🎯 `boost_targets` (Default: `True`)
Increases log-probabilities of expected phonemes by a fixed boost factor (typically +5.0) before Viterbi decoding. If the sentence is very long or contains every possible phoneme, boosting them all equally has little effect, because no phoneme stands out more than the others.
**When it helps:**
- **Cross-lingual scenarios**: Using English models on other languages where some phonemes are underrepresented
- **Noisy audio**: When target phonemes have very low confidence but should be present
- **Domain mismatch**: When model training data differs significantly from your audio
**Important caveat:** For monolingual sentences, boosting affects ALL phonemes in the target sequence equally, making it equivalent to no boosting. The real benefit comes when using multilingual models or when certain phonemes are systematically underrepresented.
#### 🛡️ `enforce_minimum` (Default: `True`)
Ensures every target phoneme has at least a minimum probability (default: 1e-8) at each frame, preventing complete elimination during alignment.
**Why this matters:**
- Prevents target phonemes from being "zeroed out" by the model
- Guarantees that even very quiet or unclear phonemes can be aligned
- Helps for highly noisy audio in which all phonemes, not just targets, have extremely low probabilities (see the sketch below).
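Conceptually, both `boost_targets` and `enforce_minimum` are simple pre-decoding operations on the frame-level log-probabilities. A minimal sketch with toy tensors (the tensor names and shapes are illustrative, not BFA's internals):

```python
import math
import torch

log_probs = torch.randn(600, 66).log_softmax(dim=-1)  # (frames, ph66 classes), toy values
targets = [29, 10, 58, 9, 43, 56, 23]                 # expected phonemes for "butterfly"

# boost_targets: raise the expected phonemes' log-probabilities by a fixed factor (+5.0).
log_probs[:, targets] += 5.0

# enforce_minimum: never let a target phoneme fall below a tiny probability floor (1e-8),
# so it cannot be zeroed out during Viterbi decoding.
log_probs[:, targets] = log_probs[:, targets].clamp(min=math.log(1e-8))
```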
#### 🔄 `enforce_all_targets` (Default: `True`)
**This is BFA's key differentiator from MFA.** After Viterbi decoding, BFA applies post-processing to guarantee that every target phoneme is present in the final alignment, even those with low acoustic probability. However, **downstream tasks can filter out these "forced" phonemes using their confidence scores**. For practical use, consider setting a confidence threshold (e.g., `timestamps["phoneme_ts"][p]["confidence"] < 0.05`) to exclude phonemes that were aligned with little to no acoustic evidence; see the sketch below.
**Recovery mechanism:**
1. Identifies any missing target phonemes after initial alignment
2. Finds frames with highest probability for each missing phoneme
3. Strategically inserts missing phonemes by:
- Replacing blank frames when possible
- Searching nearby frames within a small radius
- Force-replacing frames as last resort
**Use cases:**
- **Guaranteed coverage**: When you need every phoneme to be timestamped
- **Noisy environments**: Where some phonemes might be completely missed by standard Viterbi
- **Research applications**: When completeness is more important than probabilistic accuracy
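Downstream filtering is then a one-liner over the output dictionary; a sketch using the threshold suggested above, assuming the `timestamps` dict returned by `process_sentence`:

```python
segment = timestamps["segments"][0]
confident = [p for p in segment["phoneme_ts"] if p["confidence"] >= 0.05]
forced = [p for p in segment["phoneme_ts"] if p["confidence"] < 0.05]  # likely force-inserted
```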
#### ⚙️ Parameter Interaction Effects
| Scenario | Recommended Settings | Outcome |
|----------|---------------------|---------|
| **Clean monolingual audio** | All defaults | Standard high-quality alignment |
| **Cross-lingual/noisy** | `boost_targets=True` | Better phoneme recovery |
| **Research/completeness** | `enforce_all_targets=True` | 100% phoneme coverage |
| **Probabilistically strict** | `enforce_all_targets=False` | Only high-confidence alignments |
**Technical Details:**
- **Audio Processing**: 16kHz sampling, sliding window approach for long audio
- **Model Architecture**: Pre-trained CUPE-2i models from [HuggingFace](https://huggingface.co/Tabahi/CUPE-2i)
- **Alignment Strategy**: CTC path construction with blank tokens between phonemes
- **Quality Assurance**: Post-processing ensures 100% target phoneme coverage (when enabled)
> **Performance Note**: CPU-optimized implementation. The iterative Viterbi algorithm and windowing operations are designed for single-threaded efficiency. Most operations are vectorized where possible, so batch processing should be faster on GPUs.
---
### 📊 Alignment Error Analysis
**Alignment error histogram on [TIMIT](https://catalog.ldc.upenn.edu/LDC93S1):**
<div align="center">
<img src="examples/samples/images/BFA_vs_MFA_errors_on_TIMIT.png" alt="Alignment Error Histogram - TIMIT Dataset" width="600"/>
</div>
- Most phoneme boundaries are aligned within **ยฑ30ms** of ground truth.
- Errors above **100ms** are rare and typically due to ambiguous or noisy segments.
**For comparison:**
See [Montreal Forced Aligner](https://www.isca-archive.org/interspeech_2017/mcauliffe17_interspeech.pdf) for benchmark results on similar datasets.
> ⚠️ **Best Performance**: For optimal results, use audio segments **under 30 seconds**. For longer audio, segment first using Whisper or VAD. Audio longer than 60 seconds creates too many possibilities for the Viterbi algorithm to handle properly.
---
## 🔬 Comparison with MFA
Our alignment quality compared to Montreal Forced Aligner (MFA) using [Praat](https://www.fon.hum.uva.nl/praat/) TextGrid visualization:
<div align="center">
| Metric | BFA | MFA |
|--------|-----|-----|
| **Speed** | 0.2s per 10s audio | 10s per 2s audio |
| **Real-time potential** | ✅ Yes (contextless) | ❌ No |
| **Stop consonants** | ✅ Better (t,d,p,k) | ⚠️ Extends too much |
| **Tail endings** | ⚠️ Sometimes missed | ❌ Onset only |
| **Breathy sounds** | ⚠️ Misses h# | ✅ Captures |
| **Punctuation** | ✅ Silence aware | ❌ No punctuation |
</div>
### 📊 Sample Visualizations in Praat
**"In being comparatively modern..."** - LJ Speech Dataset
[🎵 Audio Sample](examples/samples/LJSpeech/LJ001-0002.wav)

---
### Citation
[Rehman, A., Cai, J., Zhang, J.-J., & Yang, X. (2025). BFA: Real-time multilingual text-to-speech forced alignment. *arXiv*. https://arxiv.org/abs/2509.23147](https://arxiv.org/pdf/2509.23147)
```bibtex
@misc{rehman2025bfa,
    title={BFA: Real-time Multilingual Text-to-speech Forced Alignment},
    author={Abdul Rehman and Jingyao Cai and Jian-Jun Zhang and Xiaosong Yang},
    year={2025},
    eprint={2509.23147},
    archivePrefix={arXiv},
    primaryClass={eess.AS},
    url={https://arxiv.org/abs/2509.23147},
}
```
---
<div align="center">
[⭐ Star us on GitHub](https://github.com/tabahi/bournemouth-forced-aligner) • [🐛 Report Issues](https://github.com/tabahi/bournemouth-forced-aligner/issues)
</div>
Raw data
{
"_id": null,
"home_page": "https://github.com/tabahi/bournemouth-forced-aligner",
"name": "bournemouth-forced-aligner",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "Tabahi <tabahi@duck.com>",
"keywords": "phoneme, alignment, speech, audio, timestamp, forced-alignment, bournemouth, CUPE, speech-recognition, linguistics",
"author": "Tabahi",
"author_email": "Tabahi <tabahi@duck.com>",
"download_url": "https://files.pythonhosted.org/packages/8a/4c/98ffa0344f506f7af1f791f402b8604583808dbb26bfc7213d36bac98915/bournemouth_forced_aligner-0.1.7.tar.gz",
"platform": null,
"description": "# \ud83c\udfaf Bournemouth Forced Aligner (BFA)\n\n<div align=\"center\">\n\n[](https://www.python.org/downloads/)\n[](https://badge.fury.io/py/bournemouth-forced-aligner)\n[](https://www.gnu.org/licenses/gpl-3.0)\n[](https://github.com/tabahi/bournemouth-forced-aligner/stargazers)\n\n**High-precision multi-lingual phoneme-level timestamp extraction from audio files**\n> \ud83c\udfaf **Find the exact time when any phoneme is spoken** - provided you have the audio and its text.\n\n\n[\ud83d\ude80 Quick Start](#-getting-started) \u2022 [\ud83d\udcda Documentation](#-how-does-it-work) \u2022 [\ud83d\udd27 Installation](#-installation) \u2022 [\ud83d\udcbb CLI](#-command-line-interface-cli) \u2022 [\ud83e\udd1d Contributing](https://github.com/tabahi/bournemouth-forced-aligner/issues)\n\n</div>\n\n---\n\n## \u2728 Overview\n\nBFA is a lightning-fast Python library that extracts **phoneme/word timestamps** from audio files with millisecond precision. Built on [Contextless Universal Phoneme Encoder (CUPE)](https://github.com/tabahi/contexless-phonemes-CUPE), it delivers accurate forced alignment for speech analysis, linguistics research, and audio processing applications.\n\n\n## \ud83c\udf1f Key Features\n\n<div align=\"center\">\n\n| Feature | Description | Performance |\n|---------|-------------|-------------|\n| \u26a1 **Ultra-Fast** | CPU-optimized processing | 0.2s for 10s audio |\n| \ud83c\udfaf **Phoneme-Level** | Millisecond-precision timestamps | High accuracy alignment |\n| \ud83c\udf0d **Multi-Language** | Via espeak phonemization | 80+ Indo-European + related |\n| \ud83d\udd27 **Easy Integration** | JSON & TextGrid output | Praat compatibility |\n\n\n\n</div>\n\n\n**Words+Phonemes aligned to Mel-spectrum frames:**\n\n\n\n\nTry [mel_spectrum_alignment.py](examples/mel_spectrum_alignment.py)\n\n\n---\n## \ud83d\ude80 Installation\n\n### \ud83d\udce6 From PyPI (Recommended)\n\n```bash\n# Install the package\npip install bournemouth-forced-aligner\n\n# Alternatively, install the latest library directly from github:\n# pip install git+https://github.com/tabahi/bournemouth-forced-aligner.git\n\n# Install system dependencies\napt-get install espeak-ng ffmpeg\n```\n\n### \u2705 Verify Installation\n\n```bash\n# Show help\nbalign --help\n\n# Check version\nbalign --version\n\n# Test installation\npython -c \"from bournemouth_aligner import PhonemeTimestampAligner; print('\u2705 Installation successful!')\"\n```\n\n---\n\n## \ud83c\udfaf Getting Started\n\n### \ud83d\udd25 Quick Example\n\n```python\nimport torch\nimport time\nimport json\nfrom bournemouth_aligner import PhonemeTimestampAligner\n\n# Configuration\ntext_sentence = \"butterfly\"\naudio_path = \"examples/samples/audio/109867__timkahn__butterfly.wav\"\n\n# Initialize aligner using language preset (recommended)\nextractor = PhonemeTimestampAligner(\n preset=\"en-us\", # Automatically selects best English model\n duration_max=10,\n device='cpu'\n)\n\n# Alternative: explicit model selection\n# extractor = PhonemeTimestampAligner(\n# model_name=\"en_libri1000_uj01d_e199_val_GER=0.2307.ckpt\",\n# lang='en-us',\n# duration_max=10,\n# device='cpu'\n# )\n\n# Load and process\naudio_wav = extractor.load_audio(audio_path) # use RMS normalization for preloaded wav `audio_wav = extractor._rms_normalize(audio_wav)`\n\nt0 = time.time()\ntimestamps = extractor.process_sentence(\n text_sentence,\n audio_wav,\n ts_out_path=None,\n extract_embeddings=False,\n vspt_path=None,\n do_groups=True,\n debug=True\n)\nt1 = 
time.time()\n\nprint(\"\ud83c\udfaf Timestamps:\")\nprint(json.dumps(timestamps, indent=4, ensure_ascii=False))\nprint(f\"\u26a1 Processing time: {t1 - t0:.2f} seconds\")\n```\n\n\n\n### \ud83c\udf10 Multi-Language Examples\n\n```python\n# German with MLS8 model\naligner_de = PhonemeTimestampAligner(preset=\"de\")\n\n# Hindi with Universal model\naligner_hi = PhonemeTimestampAligner(preset=\"hi\")\n\n# French with MLS8 model\naligner_fr = PhonemeTimestampAligner(preset=\"fr\")\n\n```\n\n### \ud83d\udcca Sample Output\n\n<details>\n<summary>\ud83d\udccb Click to see detailed JSON output</summary>\n\n```json\n{\n \"segments\": [\n {\n \"start\": 0.0,\n \"end\": 1.2588125,\n \"text\": \"butterfly\",\n \"ph66\": [\n 29,\n 10,\n 58,\n 9,\n 43,\n 56,\n 23\n ],\n \"pg16\": [\n 7,\n 2,\n 14,\n 2,\n 8,\n 13,\n 5\n ],\n \"coverage_analysis\": {\n \"target_count\": 7,\n \"aligned_count\": 7,\n \"missing_count\": 0,\n \"extra_count\": 0,\n \"coverage_ratio\": 1.0,\n \"missing_phonemes\": [],\n \"extra_phonemes\": []\n },\n \"ipa\": [\n \"b\",\n \"\u028c\",\n \"\u027e\",\n \"\u025a\",\n \"f\",\n \"l\",\n \"a\u026a\"\n ],\n \"word_num\": [\n 0,\n 0,\n 0,\n 0,\n 0,\n 0,\n 0\n ],\n \"words\": [\n \"butterfly\"\n ],\n \"phoneme_ts\": [\n {\n \"phoneme_idx\": 29,\n \"phoneme_label\": \"b\",\n \"start_ms\": 33.56833267211914,\n \"end_ms\": 50.35249710083008,\n \"confidence\": 0.9849503040313721\n },\n {\n \"phoneme_idx\": 10,\n \"phoneme_label\": \"\u028c\",\n \"start_ms\": 100.70499420166016,\n \"end_ms\": 117.48916625976562,\n \"confidence\": 0.8435571193695068\n },\n {\n \"phoneme_idx\": 58,\n \"phoneme_label\": \"\u027e\",\n \"start_ms\": 134.27333068847656,\n \"end_ms\": 151.0574951171875,\n \"confidence\": 0.3894280791282654\n },\n {\n \"phoneme_idx\": 9,\n \"phoneme_label\": \"\u025a\",\n \"start_ms\": 285.3308410644531,\n \"end_ms\": 302.114990234375,\n \"confidence\": 0.3299962282180786\n },\n {\n \"phoneme_idx\": 43,\n \"phoneme_label\": \"f\",\n \"start_ms\": 369.2516784667969,\n \"end_ms\": 386.03582763671875,\n \"confidence\": 0.9150863289833069\n },\n {\n \"phoneme_idx\": 56,\n \"phoneme_label\": \"l\",\n \"start_ms\": 520.3091430664062,\n \"end_ms\": 553.8775024414062,\n \"confidence\": 0.9060741662979126\n },\n {\n \"phoneme_idx\": 23,\n \"phoneme_label\": \"a\u026a\",\n \"start_ms\": 604.22998046875,\n \"end_ms\": 621.01416015625,\n \"confidence\": 0.21650740504264832\n }\n ],\n \"group_ts\": [\n {\n \"group_idx\": 7,\n \"group_label\": \"voiced_stops\",\n \"start_ms\": 33.56833267211914,\n \"end_ms\": 50.35249710083008,\n \"confidence\": 0.9911064505577087\n },\n {\n \"group_idx\": 2,\n \"group_label\": \"central_vowels\",\n \"start_ms\": 100.70499420166016,\n \"end_ms\": 117.48916625976562,\n \"confidence\": 0.8446590304374695\n },\n {\n \"group_idx\": 14,\n \"group_label\": \"rhotics\",\n \"start_ms\": 134.27333068847656,\n \"end_ms\": 151.0574951171875,\n \"confidence\": 0.28526052832603455\n },\n {\n \"group_idx\": 2,\n \"group_label\": \"central_vowels\",\n \"start_ms\": 285.3308410644531,\n \"end_ms\": 302.114990234375,\n \"confidence\": 0.7377423048019409\n },\n {\n \"group_idx\": 8,\n \"group_label\": \"voiceless_fricatives\",\n \"start_ms\": 352.4674987792969,\n \"end_ms\": 402.8199768066406,\n \"confidence\": 0.9877637028694153\n },\n {\n \"group_idx\": 13,\n \"group_label\": \"laterals\",\n \"start_ms\": 520.3091430664062,\n \"end_ms\": 553.8775024414062,\n \"confidence\": 0.9163824915885925\n },\n {\n \"group_idx\": 5,\n \"group_label\": \"diphthongs\",\n \"start_ms\": 
604.22998046875,\n \"end_ms\": 621.01416015625,\n \"confidence\": 0.4117060899734497\n }\n ],\n \"words_ts\": [\n {\n \"word\": \"butterfly\",\n \"start_ms\": 33.56833267211914,\n \"end_ms\": 621.01416015625,\n \"confidence\": 0.6550856615815844,\n \"ph66\": [\n 29,\n 10,\n 58,\n 9,\n 43,\n 56,\n 23\n ],\n \"ipa\": [\n \"b\",\n \"\u028c\",\n \"\u027e\",\n \"\u025a\",\n \"f\",\n \"l\",\n \"a\u026a\"\n ]\n }\n ],\n }\n ]\n}\n```\n\n</details>\n\n### \ud83d\udd11 Output Format Guide\n\n| Key | Description | Format |\n|-----|-------------|--------|\n| `ph66` | Standardized 66 phoneme classes (including silence) | See [ph66_mapper.py](bournemouth_aligner/ipamappers/ph66_mapper.py) |\n| `pg16` | 16 phoneme category groups (lateral, vowels, rhotics, etc.) | Grouped classifications |\n| `ipa` | IPA sequences from espeak | Unicode IPA symbols |\n| `words` | Word segmentation | Regex-based: `\\b\\w+\\b` |\n| `phoneme_ts` | Aligned phoneme timestamps | Millisecond precision |\n| `group_ts` | Phoneme group timestamps | Often more accurate |\n| `word_num` | Word index for each phoneme | Maps phonemes to words |\n| `words_ts` | Word-level timestamps | Derived from phonemes |\n| `coverage_analysis` | Alignment quality metrics | Insertions/deletions |\n\n---\n\n## \ud83d\udee0\ufe0f Methods\n\n### \ud83c\udf0d Language Presets\n\nBFA supports **80+ languages** through intelligent preset selection, focusing on Indo-European and closely related language families. Simply specify a language code as `preset` parameter for automatic model and language configuration.\n\n**\u26a0\ufe0f Note**: Tonal languages (Chinese, Vietnamese, Thai) and distant language families (Japanese, Korean, Bantu, etc.) are not supported through presets due to CUPE model limitations.\n\n```python\n# Using presets (recommended)\naligner = PhonemeTimestampAligner(preset=\"de\") # German with MLS8 model\naligner = PhonemeTimestampAligner(preset=\"hi\") # Hindi with Universal model\naligner = PhonemeTimestampAligner(preset=\"fr\") # French with MLS8 model\n```\n\n#### \ud83c\udfaf Parameter Priority\n1. **Explicit `cupe_ckpt_path`** (highest priority)\n2. **Explicit `model_name`**\n3. **Preset values** (only if no explicit model specified)\n4. 
**Default values**\n\n#### \ud83d\udccb Complete Preset Table\n\n<details>\n<summary>\ud83d\udd0d Click to view all 80+ supported language presets</summary>\n\n| **Language** | **Preset Code** | **Model Used** | **Language Family** |\n|--------------|-----------------|----------------|-------------------|\n| **\ud83c\uddfa\ud83c\uddf8 ENGLISH VARIANTS** | | |\n| English (US) | `en-us`, `en` | English Model | West Germanic |\n| English (UK) | `en-gb` | English Model | West Germanic |\n| English (Caribbean) | `en-029` | English Model | West Germanic |\n| English (Lancastrian) | `en-gb-x-gbclan` | English Model | West Germanic |\n| English (RP) | `en-gb-x-rp` | English Model | West Germanic |\n| English (Scottish) | `en-gb-scotland` | English Model | West Germanic |\n| English (West Midlands) | `en-gb-x-gbcwmd` | English Model | West Germanic |\n| **\ud83c\uddea\ud83c\uddfa EUROPEAN LANGUAGES (MLS8)** | | |\n| German | `de` | MLS8 Model | West Germanic |\n| French | `fr` | MLS8 Model | Romance |\n| French (Belgium) | `fr-be` | MLS8 Model | Romance |\n| French (Switzerland) | `fr-ch` | MLS8 Model | Romance |\n| Spanish | `es` | MLS8 Model | Romance |\n| Spanish (Latin America) | `es-419` | MLS8 Model | Romance |\n| Italian | `it` | MLS8 Model | Romance |\n| Portuguese | `pt` | MLS8 Model | Romance |\n| Portuguese (Brazil) | `pt-br` | MLS8 Model | Romance |\n| Polish | `pl` | MLS8 Model | West Slavic |\n| Dutch | `nl` | MLS8 Model | West Germanic |\n| Danish | `da` | MLS8 Model | North Germanic |\n| Swedish | `sv` | MLS8 Model | North Germanic |\n| Norwegian Bokm\u00e5l | `nb` | MLS8 Model | North Germanic |\n| Icelandic | `is` | MLS8 Model | North Germanic |\n| Czech | `cs` | MLS8 Model | West Slavic |\n| Slovak | `sk` | MLS8 Model | West Slavic |\n| Slovenian | `sl` | MLS8 Model | South Slavic |\n| Croatian | `hr` | MLS8 Model | South Slavic |\n| Bosnian | `bs` | MLS8 Model | South Slavic |\n| Serbian | `sr` | MLS8 Model | South Slavic |\n| Macedonian | `mk` | MLS8 Model | South Slavic |\n| Bulgarian | `bg` | MLS8 Model | South Slavic |\n| Romanian | `ro` | MLS8 Model | Romance |\n| Hungarian | `hu` | MLS8 Model | Uralic |\n| Estonian | `et` | MLS8 Model | Uralic |\n| Latvian | `lv` | MLS8 Model | Baltic |\n| Lithuanian | `lt` | MLS8 Model | Baltic |\n| Catalan | `ca` | MLS8 Model | Romance |\n| Aragonese | `an` | MLS8 Model | Romance |\n| Papiamento | `pap` | MLS8 Model | Romance |\n| Haitian Creole | `ht` | MLS8 Model | Romance |\n| Afrikaans | `af` | MLS8 Model | West Germanic |\n| Luxembourgish | `lb` | MLS8 Model | West Germanic |\n| Irish Gaelic | `ga` | MLS8 Model | Celtic |\n| Scottish Gaelic | `gd` | MLS8 Model | Celtic |\n| Welsh | `cy` | MLS8 Model | Celtic |\n| **\ud83c\udf0f INDO-EUROPEAN LANGUAGES (Universal)** | | |\n| Russian | `ru` | Universal Model | East Slavic |\n| Russian (Latvia) | `ru-lv` | Universal Model | East Slavic |\n| Ukrainian | `uk` | Universal Model | East Slavic |\n| Belarusian | `be` | Universal Model | East Slavic |\n| Hindi | `hi` | Universal Model | Indic |\n| Bengali | `bn` | Universal Model | Indic |\n| Urdu | `ur` | Universal Model | Indic |\n| Punjabi | `pa` | Universal Model | Indic |\n| Gujarati | `gu` | Universal Model | Indic |\n| Marathi | `mr` | Universal Model | Indic |\n| Nepali | `ne` | Universal Model | Indic |\n| Assamese | `as` | Universal Model | Indic |\n| Oriya | `or` | Universal Model | Indic |\n| Sinhala | `si` | Universal Model | Indic |\n| Konkani | `kok` | Universal Model | Indic |\n| Bishnupriya Manipuri | `bpy` | Universal Model 
| Indic |\n| Sindhi | `sd` | Universal Model | Indic |\n| Persian | `fa` | Universal Model | Iranian |\n| Persian (Latin) | `fa-latn` | Universal Model | Iranian |\n| Kurdish | `ku` | Universal Model | Iranian |\n| Greek (Modern) | `el` | Universal Model | Greek |\n| Greek (Ancient) | `grc` | Universal Model | Greek |\n| Armenian (East) | `hy` | Universal Model | Indo-European |\n| Armenian (West) | `hyw` | Universal Model | Indo-European |\n| Albanian | `sq` | Universal Model | Indo-European |\n| Latin | `la` | Universal Model | Italic |\n| **\ud83c\uddf9\ud83c\uddf7 TURKIC LANGUAGES (Universal)** | | |\n| Turkish | `tr` | Universal Model | Turkic |\n| Azerbaijani | `az` | Universal Model | Turkic |\n| Kazakh | `kk` | Universal Model | Turkic |\n| Kyrgyz | `ky` | Universal Model | Turkic |\n| Uzbek | `uz` | Universal Model | Turkic |\n| Tatar | `tt` | Universal Model | Turkic |\n| Turkmen | `tk` | Universal Model | Turkic |\n| Uyghur | `ug` | Universal Model | Turkic |\n| Bashkir | `ba` | Universal Model | Turkic |\n| Chuvash | `cu` | Universal Model | Turkic |\n| Nogai | `nog` | Universal Model | Turkic |\n| **\ud83c\uddeb\ud83c\uddee URALIC LANGUAGES (Universal)** | | |\n| Finnish | `fi` | Universal Model | Uralic |\n| Lule Saami | `smj` | Universal Model | Uralic |\n| **\ud83d\udd4c SEMITIC LANGUAGES (Universal)** | | |\n| Arabic | `ar` | Universal Model | Semitic |\n| Hebrew | `he` | Universal Model | Semitic |\n| Amharic | `am` | Universal Model | Semitic |\n| Maltese | `mt` | Universal Model | Semitic |\n| **\ud83c\udfdd\ufe0f MALAYO-POLYNESIAN LANGUAGES (Universal)** | | |\n| Indonesian | `id` | Universal Model | Malayo-Polynesian |\n| Malay | `ms` | Universal Model | Malayo-Polynesian |\n| **\ud83c\uddee\ud83c\uddf3 DRAVIDIAN LANGUAGES (Universal)** | | |\n| Tamil | `ta` | Universal Model | Dravidian |\n| Telugu | `te` | Universal Model | Dravidian |\n| Kannada | `kn` | Universal Model | Dravidian |\n| Malayalam | `ml` | Universal Model | Dravidian |\n| **\ud83c\uddec\ud83c\uddea SOUTH CAUCASIAN LANGUAGES (Universal)** | | |\n| Georgian | `ka` | Universal Model | South Caucasian |\n| **\ud83d\uddfe LANGUAGE ISOLATES & OTHERS (Universal)** | | |\n| Basque | `eu` | Universal Model | Language Isolate |\n| Quechua | `qu` | Universal Model | Quechuan |\n| **\ud83d\udef8 CONSTRUCTED LANGUAGES (Universal)** | | |\n| Esperanto | `eo` | Universal Model | Constructed |\n| Interlingua | `ia` | Universal Model | Constructed |\n| Ido | `io` | Universal Model | Constructed |\n| Lingua Franca Nova | `lfn` | Universal Model | Constructed |\n| Lojban | `jbo` | Universal Model | Constructed |\n| Pyash | `py` | Universal Model | Constructed |\n| Lang Belta | `qdb` | Universal Model | Constructed |\n| Quenya | `qya` | Universal Model | Constructed |\n| Klingon | `piqd` | Universal Model | Constructed |\n| Sindarin | `sjn` | Universal Model | Constructed |\n\n</details>\n\n#### \ud83d\udd27 Model Selection Guide\n\n| **Model** | **Languages** | **Use Case** | **Performance** |\n|-----------|---------------|--------------|-----------------|\n| **English Model** | English variants | Best for English | Highest accuracy for English |\n| **MLS8 Model** | 8 European + similar | European languages | High accuracy for European |\n| **Universal Model** | 60+ Indo-European + related | Other supported languages | Good for Indo-European families |\n\n**\u26a0\ufe0f Unsupported Language Types:**\n- **Tonal languages**: Chinese (Mandarin, Cantonese), Vietnamese, Thai, Burmese\n- **Distant families**: Japanese, 
Korean, most African languages (Swahili, etc.)\n- **Indigenous languages**: Most Native American, Polynesian (except Indonesian/Malay)\n- **Recommendation**: For unsupported languages, use explicit `model_name` parameter with caution\n\n### Initialization\n\n```python\nPhonemeTimestampAligner(\n preset=\"en-us\", # Language preset (recommended)\n model_name=None, # Optional: explicit model override\n cupe_ckpt_path=None, # Optional: direct checkpoint path\n lang=\"en-us\", # Language for phonemization\n duration_max=10,\n output_frames_key=\"phoneme_idx\",\n device=\"cpu\",\n boost_targets=True,\n enforce_minimum=True,\n enforce_all_targets=True,\n ignore_noise=True\n)\n```\n\n**Parameters:**\n- `preset`: **[NEW]** Language preset for automatic model and language selection. Use language codes like \"de\", \"fr\", \"hi\", \"ja\", etc. Supports 127+ languages with intelligent model selection.\n- `model_name`: Name of the CUPE model (see [HuggingFace models](https://huggingface.co/Tabahi/CUPE-2i/tree/main/ckpt)). Overrides preset selection. Downloaded automatically if available.\n- `cupe_ckpt_path`: Local path to model checkpoint. Highest priority - overrides both preset and model_name.\n- `lang`: Language code for phonemization ([espeak lang codes](https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md)). Only overridden by preset if using default.\n- `duration_max`: Maximum segment duration (seconds, for batch padding). Best to keep <30 seconds.\n- `output_frames_key`: Output key for frame assortment (`phoneme_idx`, `phoneme_label`, `group_idx`, `group_label`).\n- `device`: Inference device (`cpu` or `cuda`).\n- `silence_anchors`: Number of silent frames to anchor pauses (i.e., split segments when at least `silence_anchors` frames are silent). Set `0` to disable. Default is `10`. Set a lower value to increase sensitivity to silences. Best set `enforce_all_targets=True` when using this.\n- `boost_targets`: Boost target phoneme probabilities for better alignment.\n- `enforce_minimum`: Enforce minimum probability for target phonemes.\n- `enforce_all_targets`: Band-aid postprocessing patch. It will insert phonemes missed by viterbi decoding at their expected positions based on target positions.\n- `ignore_noise`: Whether to ignore the predicted \"noise\" in the alignment. If set to True, noise will be skipped over. If False, long noisy/silent segments will be included as \"noise\" timestamps.\n---\n\n**Models:**\n- `model_name=\"en_libri1000_uj01d_e199_val_GER=0.2307.ckpt\"` for best performance on English. This model is trained on 1000 hours LibriSpeech.\n- `model_name=\"en_libri1000_uj01d_e62_val_GER=0.2438.ckpt\"` for best performance on heavy accented English speech. This is the same as above, just unsettled weights.\n- `model_name=\"multi_MLS8_uh02_e36_val_GER=0.2334.ckpt\"` for best performance on 8 european languages including English, German, French, Dutch, Italian, Spanish, Italian, Portuguese, Polish. This model's accuracy on English (buckeye corpus) is on par with the above (main) English model. We can only assume that the performance will be the same on the rest of the 7 languages.\n- `model_name=\"multi_mswc38_ug20_e59_val_GER=0.5611.ckpt\"` universal model for all non-tonal languages. 
### Process SRT File

```python
PhonemeTimestampAligner.process_srt_file(
    srt_path,
    audio_path,
    ts_out_path=None,
    extract_embeddings=False,
    vspt_path=None,
    do_groups=True,
    debug=True
)
```

**Parameters:**
- `srt_path`: Path to the input SRT file (whisper JSON format).
- `audio_path`: Path to the audio file.
- `ts_out_path`: Output path for timestamps (vs2 format).
- `extract_embeddings`: Extract embeddings.
- `vspt_path`: Path to save embeddings (`.pt` file).
- `do_groups`: Extract group timestamps.
- `debug`: Enable debug output.

**Returns:**
- `timestamps_dict`: Dictionary with extracted timestamps.

---

### Process Text Sentences

```python
PhonemeTimestampAligner.process_sentence(
    text,
    audio_wav,
    ts_out_path=None,
    extract_embeddings=False,
    vspt_path=None,
    do_groups=True,
    debug=False
)
```

**Parameters:**
- `text`: Sentence/text.
- `audio_wav`: Audio waveform tensor (`torch.Tensor`).
- `ts_out_path`: Output path for timestamps (optional).
- `extract_embeddings`: Extract embeddings (optional).
- `vspt_path`: Path to save embeddings (`.pt`, optional).
- `do_groups`: Extract group timestamps (optional).
- `debug`: Enable debug output (optional).

**Returns:**
- `timestamps_dict`: Dictionary with extracted timestamps.

---

### 🗣️ Convert Text to Phonemes

Phonemization in BFA is powered by the [phonemizer](https://github.com/bootphon/phonemizer) package, using the [espeak-ng](https://github.com/espeak-ng/espeak-ng) backend for robust multi-language support.

```python
PhonemeTimestampAligner.phonemize_sentence(text)
```

**Optional:** Change the espeak language after initialization:
```python
PhonemeTimestampAligner.phonemizer.set_backend(language='en')
```

**Method Description:**

Phonemizes a sentence and returns a detailed mapping:

- `text`: Original input sentence
- `ipa`: List of phonemes in IPA format
- `ph66`: List of phoneme class indices (mapped to the 66-class set)
- `pg16`: List of phoneme group indices (16 broad categories)
- `words`: List of words corresponding to the phonemes
- `word_num`: Word index for each phoneme

**Example Usage:**
```python
result = PhonemeTimestampAligner.phonemize_sentence("butterfly")
print(result["ipa"])   # ['b', 'ʌ', 'ɾ', 'ɚ', 'f', 'l', 'aɪ']
print(result["ph66"])  # [29, 10, 58, 9, 43, 56, 23]
print(result["pg16"])  # [7, 2, 14, 2, 8, 13, 5]
```

### Extract Timestamps from Segment

```python
PhonemeTimestampAligner.extract_timestamps_from_segment(
    wav,
    wav_len,
    phoneme_sequence,
    start_offset_time=0,
    group_sequence=None,
    extract_embeddings=True,
    do_groups=True,
    debug=True
)
```

**Parameters:**
- `wav`: Audio tensor for the segment. Shape: `[1, samples]`
- `wav_len`: Length of the audio segment (samples).
- `phoneme_sequence`: List/tensor of phoneme indices (ph66).
- `start_offset_time`: Segment start offset (seconds).
- `group_sequence`: Optional group indices (pg16).
- `extract_embeddings`: Extract pooled phoneme embeddings.
- `do_groups`: Extract phoneme group timestamps.
- `debug`: Enable debug output.

**Returns:**
- `timestamp_dict`: Contains phoneme and group timestamps.
- `pooled_embeddings_phonemes`: Pooled phoneme embeddings or `None`.
- `pooled_embeddings_groups`: Pooled group embeddings or `None`.
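For lower-level control, `phonemize_sentence` can be wired directly into `extract_timestamps_from_segment`. A minimal sketch, assuming an initialized aligner and treating the whole file as one segment (exact scalar/tensor conventions for `wav_len` may differ):

```python
from bournemouth_aligner import PhonemeTimestampAligner

aligner = PhonemeTimestampAligner(preset="en-us", duration_max=10, device="cpu")
wav = aligner.load_audio("examples/samples/audio/109867__timkahn__butterfly.wav")

# Phonemize the transcript into ph66/pg16 index sequences
phonemized = aligner.phonemize_sentence("butterfly")

# Align the whole file as a single segment (wav shape: [1, samples])
timestamp_dict, ph_emb, grp_emb = aligner.extract_timestamps_from_segment(
    wav,
    wav.shape[1],                       # segment length in samples
    phonemized["ph66"],
    start_offset_time=0,                # segment starts at 0 s
    group_sequence=phonemized["pg16"],
    extract_embeddings=False,           # both embedding returns will be None
    do_groups=True,
    debug=False,
)
```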
---

### Convert to TextGrid

```python
PhonemeTimestampAligner.convert_to_textgrid(
    timestamps_dict,
    output_file=None,
    include_confidence=False
)
```

**Description:**
Converts VS2 timestamp data to the [Praat TextGrid](https://www.fon.hum.uva.nl/praat/manual/TextGrid_file_format.html) format.

**Parameters:**
- `timestamps_dict`: Timestamp dictionary (from alignment).
- `output_file`: Path to save the TextGrid file (optional).
- `include_confidence`: Include confidence values in the output (optional).

**Returns:**
- `textgrid_content`: TextGrid file content as a string.

---

## 🔧 Advanced Usage

### 🎙️ Mel-Spectrum Alignment

BFA provides mel-spectrogram compatibility methods for audio synthesis workflows. They enable seamless integration with [HiFi-GAN](https://github.com/jik876/hifi-gan), the [BigVGAN vocoder](https://github.com/NVIDIA/BigVGAN), and other mel-based audio processing pipelines.

See the full [example here](examples/mel_spectrum_alignment.py).

#### Extract Mel Spectrogram

```python
PhonemeTimestampAligner.extract_mel_spectrum(
    wav,
    wav_sample_rate,
    vocoder_config={'num_mels': 80, 'num_freq': 1025, 'n_fft': 1024, 'hop_size': 256, 'win_size': 1024, 'sampling_rate': 22050, 'fmin': 0, 'fmax': 8000, 'model': 'whatever_22khz_80band_fmax8k_256x'}
)
```

**Description:**
Extracts a mel spectrogram from audio with vocoder compatibility.

**Parameters:**
- `wav`: Input waveform tensor of shape `(1, T)`
- `wav_sample_rate`: Sample rate of the input waveform
- `vocoder_config`: Configuration dictionary for HiFi-GAN/BigVGAN vocoder compatibility.

**Returns:**
- `mel`: Mel spectrogram tensor of shape `(frames, mel_bins)`, transposed for easy frame-wise processing

#### Frame-wise Assortment

```python
PhonemeTimestampAligner.framewise_assortment(
    aligned_ts,
    total_frames,
    frames_per_second,
    gap_contraction=5,
    select_key="phoneme_idx"
)
```

**Description:**
Converts timestamp-based phoneme alignment into frame-wise labels matching the mel-spectrogram frames.

**Parameters:**
- `aligned_ts`: List of timestamp dictionaries (from `phoneme_ts`, `group_ts`, or `word_ts`)
- `total_frames`: Total number of frames in the mel spectrogram
- `frames_per_second`: Frame rate of the mel spectrogram
- `gap_contraction`: Number of frames used to fill silent gaps on either side of segments (default: 5)
- `select_key`: Key to extract from the timestamps (`"phoneme_idx"`, `"group_idx"`, etc.)

**Returns:**
- List of frame labels with length `total_frames`

#### Frame Compression

```python
PhonemeTimestampAligner.compress_frames(frames_list)
```

**Description:**
Compresses consecutive identical frame values into a run-length encoded format.

**Example:**
```python
frames = [0,0,0,0,1,1,1,1,3,4,5,4,5,2,2,2]
compressed = compress_frames(frames)
# Returns: [(0,4), (1,4), (3,1), (4,1), (5,1), (4,1), (5,1), (2,3)]
```

**Returns:**
- List of `(frame_value, count)` tuples

#### Frame Decompression

```python
PhonemeTimestampAligner.decompress_frames(compressed_frames)
```

**Description:**
Decompresses run-length encoded frames back into the full frame sequence.

**Parameters:**
- `compressed_frames`: List of `(phoneme_id, count)` tuples

**Returns:**
- Decompressed list of frame labels
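Compression and decompression are inverses, which is handy for storing long frame-label sequences compactly. A small round-trip sketch, assuming `aligner` is an initialized `PhonemeTimestampAligner`:

```python
frames = [0, 0, 0, 0, 1, 1, 1, 1, 3, 4, 5, 4, 5, 2, 2, 2]

# Run-length encode, then decode back to the original labels
compressed = aligner.compress_frames(frames)   # [(0,4), (1,4), (3,1), ...]
restored = aligner.decompress_frames(compressed)

assert restored == frames  # lossless round trip
```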
<details>
<summary>📊 Complete mel-spectrum alignment example</summary>

```python
# pip install librosa
import torch
from bournemouth_aligner import PhonemeTimestampAligner

# Initialize aligner
extractor = PhonemeTimestampAligner(model_name="en_libri1000_uj01d_e199_val_GER=0.2307.ckpt",
                                    lang='en-us', duration_max=10, device='cpu')

# Process audio and get timestamps
audio_wav = extractor.load_audio("examples/samples/audio/109867__timkahn__butterfly.wav")
timestamps = extractor.process_sentence("butterfly", audio_wav)

# Extract mel spectrogram with vocoder compatibility
vocoder_config = {'num_mels': 80, 'hop_size': 256, 'sampling_rate': 22050}
segment_wav = audio_wav[:, :int(timestamps['segments'][0]['end'] * extractor.resampler_sample_rate)]
mel_spec = extractor.extract_mel_spectrum(segment_wav, extractor.resampler_sample_rate, vocoder_config)

# Create frame-wise phoneme alignment
total_frames = mel_spec.shape[0]
frames_per_second = total_frames / timestamps['segments'][0]['end']
frames_assorted = extractor.framewise_assortment(
    aligned_ts=timestamps['segments'][0]['phoneme_ts'],
    total_frames=total_frames,
    frames_per_second=frames_per_second
)

# Compress and visualize
compressed_frames = extractor.compress_frames(frames_assorted)
# Use the provided plot_mel_phonemes() function to visualize
```

</details>

### 🔗 Integration Examples

<details>
<summary>🎙️ Whisper Integration</summary>

```python
# pip install git+https://github.com/openai/whisper.git
import whisper, json
from bournemouth_aligner import PhonemeTimestampAligner

# Transcribe and align
model = whisper.load_model("turbo")
result = model.transcribe("audio.wav")
with open("whisper_output.srt.json", "w") as f:
    json.dump(result, f)

# Process with BFA
extractor = PhonemeTimestampAligner(model_name="en_libri1000_uj01d_e199_val_GER=0.2307.ckpt")
timestamps = extractor.process_srt_file("whisper_output.srt.json", "audio.wav", "timestamps.json")
```

</details>

<details>
<summary>🔬 Manual Processing Pipeline</summary>

```python
import torch
from bournemouth_aligner import PhonemeTimestampAligner

# Initialize and process
extractor = PhonemeTimestampAligner(model_name="en_libri1000_uj01d_e199_val_GER=0.2307.ckpt")
audio_wav = extractor.load_audio("audio.wav")  # Handles resampling and normalization
timestamps = extractor.process_sentence("your text here", audio_wav)

# Export to Praat
extractor.convert_to_textgrid(timestamps, "output.TextGrid")
```

</details>

### 🤖 Machine Learning Integration

For phoneme embeddings in ML pipelines, check out our [embeddings example](examples/read_embeddings.py).

---

## 💻 Command Line Interface (CLI)

### 🚀 Quick CLI Usage

```bash
# Basic alignment
balign audio.wav transcription.srt.json output.json

# With debug output
balign audio.wav transcription.srt.json output.json --debug

# Extract embeddings
balign audio.wav transcription.srt.json output.json --embeddings embeddings.pt
```
### ⚙️ Command Syntax

```bash
balign [OPTIONS] AUDIO_PATH SRT_PATH OUTPUT_PATH
```

### 📝 Arguments & Options

<div align="center">

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `AUDIO_PATH` | **Required** | - | Audio file path (.wav, .mp3, .flac) |
| `SRT_PATH` | **Required** | - | SRT JSON file path |
| `OUTPUT_PATH` | **Required** | - | Output timestamps (.json) |

</div>

<details>
<summary>🔧 Advanced Options</summary>

| Option | Default | Description |
|--------|---------|-------------|
| `--model TEXT` | `en_libri1000_uj01d_e199_val_GER=0.2307.ckpt` | CUPE model from [HuggingFace](https://huggingface.co/Tabahi/CUPE-2i/tree/main/ckpt) |
| `--lang TEXT` | `en-us` | Language code ([espeak codes](https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md)) |
| `--device TEXT` | `cpu` | Processing device (`cpu` or `cuda`) |
| `--embeddings PATH` | None | Save phoneme embeddings (.pt file) |
| `--duration-max FLOAT` | `10.0` | Max segment duration (seconds) |
| `--debug / --no-debug` | `False` | Enable detailed output |
| `--boost-targets / --no-boost-targets` | `True` | Enable target phoneme boosting |
| `--help` | | Show help message |
| `--version` | | Show version info |

</details>

### 🌟 CLI Examples

```bash
# Basic usage
balign audio.wav transcription.srt.json output.json

# With GPU and embeddings
balign audio.wav transcription.srt.json output.json --device cuda --embeddings embeddings.pt

# Multi-language (English + 8 European languages model available)
balign audio.wav transcription.srt.json output.json --lang es

# Batch processing
for audio in *.wav; do balign "$audio" "${audio%.wav}.srt.json" "${audio%.wav}.json"; done
```

### 📊 Input Format

SRT files must be in JSON format:

```json
{
    "segments": [
        {
            "start": 0.0,
            "end": 3.5,
            "text": "hello world this is a test"
        },
        {
            "start": 3.5,
            "end": 7.2,
            "text": "another segment of speech"
        }
    ]
}
```

### 🎯 Creating Input Files

Use Whisper for transcription (see [Integration Examples](#-integration-examples)) or create the SRT JSON manually in the format shown above, e.g., with the script below.
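If you already have segment boundaries and text, a tiny script can emit the expected JSON without running Whisper. A sketch; the file name and segment values are just placeholders:

```python
import json

# Placeholder segments: replace with your own (start/end in seconds)
segments = [
    {"start": 0.0, "end": 3.5, "text": "hello world this is a test"},
    {"start": 3.5, "end": 7.2, "text": "another segment of speech"},
]

# balign expects a JSON file with a top-level "segments" list
with open("transcription.srt.json", "w") as f:
    json.dump({"segments": segments}, f, indent=2)
```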
### 🔍 Debug Mode

Enable comprehensive processing information:

```bash
balign audio.wav transcription.srt.json output.json --debug
```

<details>
<summary>📊 Sample debug output</summary>

```
🚀 Bournemouth Forced Aligner
🎵 Audio: audio.wav
📄 SRT: transcription.srt.json
💾 Output: output.json
🏷️ Language: en-us
🖥️ Device: cpu
🎯 Model: en_libri1000_uj01d_e199_val_GER=0.2307.ckpt
--------------------------------------------------
🔧 Initializing aligner...
Setting backend for language: en-us
✅ Aligner initialized successfully
🎵 Processing audio...
Loaded SRT file with 1 segments from transcription.srt.json
Resampling audio.wav from 22050Hz to 16000Hz
Expected phonemes: ['p', 'ɹ', 'ɪ', ...'ʃ', 'ə', 'n']
Target phonemes: 108, Expected: ['p', 'ɹ', 'ɪ', ..., 'ʃ', 'ə', 'n']
Spectral length: 600
Forced alignment took 135.305 ms
Aligned phonemes: 108
Target phonemes: 108
SUCCESS: All target phonemes were aligned!

============================================================
PROCESSING SUMMARY
============================================================
Total segments processed: 1
Perfect sequence matches: 1/1 (100.0%)
Total phonemes aligned: 108
Overall average confidence: 0.502
============================================================
Results saved to: output.json
✅ Timestamps extracted to output.json
📊 Processed 1 segments with 108 phonemes
🎉 Processing completed successfully!
```

</details>

---

## 🧠 How Does It Work?

Read the full paper: [BFA: Real-Time Multilingual Text-to-Speech Forced Alignment](https://arxiv.org/pdf/2509.23147)

### 🔄 Processing Pipeline

```mermaid
graph TD
    A[Audio Input] --> B[RMS Normalization]
    B --> C[Audio Windowing]
    C --> D[CUPE Model Inference]
    D --> E[Phoneme/Group Probabilities]
    E --> F[Text Phonemization]
    F --> G[Target Boosting]
    G --> H[Viterbi Forced Alignment]
    H --> I[Missing Phoneme Recovery]
    I --> J[Confidence Calculation]
    J --> K[Frame-to-Time Conversion]
    K --> L[Output Generation]
```

**CTC Transition Rules:**
- **Stay**: Remain in the current state (repeat phoneme or blank)
- **Advance**: Move to the next state in the sequence
- **Skip**: Jump over a blank to the next phoneme (when consecutive phonemes differ)

**Core Components:**

1. **🎵 Audio Preprocessing**: RMS normalization and windowing (120ms windows, 80ms stride)
2. **🧠 CUPE Model**: Contextless Universal Phoneme Encoder extracts frame-level phoneme probabilities
3. **📝 Phonemization**: espeak-ng converts text to 66-class phoneme indices (ph66) and 16 phoneme groups (pg16)
4. **🎯 Target Boosting**: Enhances the probabilities of expected phonemes for better alignment
5. **🔍 CTC-style Viterbi**: CTC-based forced alignment with minimum probability enforcement
6. **🛠️ Recovery Mechanism**: Ensures all target phonemes appear in the alignment, even with low confidence
7. **📊 Confidence Scoring**: Frame-level probability averaging with adaptive thresholding
8. **⏱️ Timestamp Conversion**: Frame indices to millisecond timestamps with segment offset

### 🎛️ Key Alignment Parameters

BFA provides several unique control parameters not available in traditional aligners like MFA:

#### 🎯 `boost_targets` (Default: `True`)
Increases the log-probabilities of expected phonemes by a fixed boost factor (typically +5.0) before Viterbi decoding. If the sentence is very long or contains every possible phoneme, boosting them all equally has little effect, because no phoneme stands out more than the others.

**When it helps:**
- **Cross-lingual scenarios**: Using English models on other languages where some phonemes are underrepresented
- **Noisy audio**: When target phonemes have very low confidence but should be present
- **Domain mismatch**: When the model's training data differs significantly from your audio

**Important caveat:** For monolingual sentences, boosting affects ALL phonemes in the target sequence equally, making it equivalent to no boosting. The real benefit comes when using multilingual models or when certain phonemes are systematically underrepresented.
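In code, the boosting step amounts to adding a constant to the target columns of the frame-by-phoneme log-probability matrix before decoding. A toy numpy sketch of the idea, using the +5.0 factor and shapes described above; this illustrates the technique, not the library's internal implementation:

```python
import numpy as np

# Toy inputs: 600 frames x 66 phoneme classes of log-probabilities,
# plus a ph66 target sequence for the sentence being aligned.
log_probs = np.log(np.random.dirichlet(np.ones(66), size=600))  # (frames, classes)
targets = [29, 10, 58, 9, 43, 56, 23]  # e.g., ph66 indices for "butterfly"

BOOST = 5.0  # fixed boost factor described above
boosted = log_probs.copy()
boosted[:, sorted(set(targets))] += BOOST  # raise every expected phoneme class

# `boosted` would then feed the CTC-style Viterbi decoder. If the targets
# cover every class, all columns rise equally and nothing stands out.
```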
#### 🛡️ `enforce_minimum` (Default: `True`)
Ensures every target phoneme has at least a minimum probability (default: 1e-8) at each frame, preventing its complete elimination during alignment.

**Why this matters:**
- Prevents target phonemes from being "zeroed out" by the model
- Guarantees that even very quiet or unclear phonemes can be aligned
- Helps with highly noisy audio in which all phonemes, not just the targets, have extremely low probabilities

#### 🔒 `enforce_all_targets` (Default: `True`)
**This is BFA's key differentiator from MFA.** After Viterbi decoding, BFA applies post-processing to guarantee that every target phoneme is present in the final alignment, even those with low acoustic probability. However, **downstream tasks can filter out these "forced" phonemes using their confidence scores**. For practical use, consider setting a confidence threshold (e.g., `timestamps["phoneme_ts"][p]["confidence"] < 0.05`) to exclude phonemes that were aligned with little to no acoustic evidence; see the filtering sketch at the end of this section.

**Recovery mechanism:**
1. Identifies any missing target phonemes after the initial alignment
2. Finds the frames with the highest probability for each missing phoneme
3. Strategically inserts missing phonemes by:
   - Replacing blank frames when possible
   - Searching nearby frames within a small radius
   - Force-replacing frames as a last resort

**Use cases:**
- **Guaranteed coverage**: When you need every phoneme to be timestamped
- **Noisy environments**: Where some phonemes might be completely missed by standard Viterbi
- **Research applications**: When completeness is more important than probabilistic accuracy

#### ⚖️ Parameter Interaction Effects

| Scenario | Recommended Settings | Outcome |
|----------|---------------------|---------|
| **Clean monolingual audio** | All defaults | Standard high-quality alignment |
| **Cross-lingual/noisy** | `boost_targets=True` | Better phoneme recovery |
| **Research/completeness** | `enforce_all_targets=True` | 100% phoneme coverage |
| **Probabilistically strict** | `enforce_all_targets=False` | Only high-confidence alignments |

**Technical Details:**

- **Audio Processing**: 16kHz sampling, sliding-window approach for long audio
- **Model Architecture**: Pre-trained CUPE-2i models from [HuggingFace](https://huggingface.co/Tabahi/CUPE-2i)
- **Alignment Strategy**: CTC path construction with blank tokens between phonemes
- **Quality Assurance**: Post-processing ensures 100% target phoneme coverage (when enabled)

> **Performance Note**: CPU-optimized implementation. The iterative Viterbi algorithm and windowing operations are designed for single-threaded efficiency. Most operations are vectorized where possible, so batch processing should be faster on GPUs.
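A minimal filtering sketch following the confidence threshold suggested above, assuming the `timestamps` dictionary returned by `process_sentence` with the `segments`/`phoneme_ts`/`confidence` keys shown in the earlier examples:

```python
CONF_THRESHOLD = 0.05  # suggested starting point; tune for your data

for segment in timestamps["segments"]:
    # Keep only phonemes backed by some acoustic evidence; forced
    # insertions from enforce_all_targets tend to score near zero.
    segment["phoneme_ts"] = [
        p for p in segment["phoneme_ts"]
        if p["confidence"] >= CONF_THRESHOLD
    ]
```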
---

### 📊 Alignment Error Analysis

**Alignment error histogram on [TIMIT](https://catalog.ldc.upenn.edu/LDC93S1):**

<div align="center">
  <img src="examples/samples/images/BFA_vs_MFA_errors_on_TIMIT.png" alt="Alignment Error Histogram - TIMIT Dataset" width="600"/>
</div>

- Most phoneme boundaries are aligned within **±30ms** of ground truth.
- Errors above **100ms** are rare and typically due to ambiguous or noisy segments.

**For comparison:**
See the [Montreal Forced Aligner](https://www.isca-archive.org/interspeech_2017/mcauliffe17_interspeech.pdf) for benchmark results on similar datasets.

> ⚠️ **Best Performance**: For optimal results, use audio segments **under 30 seconds**. For longer audio, segment first using Whisper or VAD. Audio longer than 60 seconds creates too many possibilities for the Viterbi algorithm to handle properly.

---

## 🔬 Comparison with MFA

Our alignment quality compared to the Montreal Forced Aligner (MFA), using [Praat](https://www.fon.hum.uva.nl/praat/) TextGrid visualization:

<div align="center">

| Metric | BFA | MFA |
|--------|-----|-----|
| **Speed** | 0.2s per 10s audio | 10s per 2s audio |
| **Real-time potential** | ✅ Yes (contextless) | ❌ No |
| **Stop consonants** | ✅ Better (t, d, p, k) | ⚠️ Extends too much |
| **Tail endings** | ⚠️ Sometimes missed | ❌ Onset only |
| **Breathy sounds** | ⚠️ Misses h# | ✅ Captures |
| **Punctuation** | ✅ Silence aware | ❌ No punctuation |

</div>

### 📊 Sample Visualizations in Praat

**"In being comparatively modern..."** - LJ Speech Dataset
[🎵 Audio Sample](examples/samples/LJSpeech/LJ001-0002.wav)

---

### Citation

[Rehman, A., Cai, J., Zhang, J.-J., & Yang, X. (2025). BFA: Real-time multilingual text-to-speech forced alignment. *arXiv*. https://arxiv.org/abs/2509.23147](https://arxiv.org/pdf/2509.23147)

```bibtex
@misc{rehman2025bfa,
      title={BFA: Real-time Multilingual Text-to-speech Forced Alignment},
      author={Abdul Rehman and Jingyao Cai and Jian-Jun Zhang and Xiaosong Yang},
      year={2025},
      eprint={2509.23147},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2509.23147},
}
```

---

<div align="center">

[⭐ Star us on GitHub](https://github.com/tabahi/bournemouth-forced-aligner) • [🐛 Report Issues](https://github.com/tabahi/bournemouth-forced-aligner/issues)

</div>
"bugtrack_url": null,
"license": "gplv3",
"summary": "Bournemouth Forced Aligner - Phoneme-level timestamp extraction",
"version": "0.1.7",
"project_urls": {
"Bug Tracker": "https://github.com/tabahi/bournemouth-forced-aligner/issues",
"CUPE Models": "https://huggingface.co/Tabahi/CUPE-2i",
"Changelog": "https://github.com/tabahi/bournemouth-forced-aligner/blob/main/CHANGELOG.md",
"Documentation": "https://github.com/tabahi/bournemouth-forced-aligner#readme",
"Homepage": "https://github.com/tabahi/bournemouth-forced-aligner",
"Repository": "https://github.com/tabahi/bournemouth-forced-aligner"
},
"split_keywords": [
"phoneme",
" alignment",
" speech",
" audio",
" timestamp",
" forced-alignment",
" bournemouth",
" cupe",
" speech-recognition",
" linguistics"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "4acb94d4f2b1504f7aac8ccbc51a84ece611286e6f02f8ae6e6de473b5cd2190",
"md5": "11a20b0262c10ed0b1b97f2c946affd6",
"sha256": "abd57023b21741eaa5bd1fe8f8a99decc10d51111cd5dab8374ecf2f32c78c92"
},
"downloads": -1,
"filename": "bournemouth_forced_aligner-0.1.7-py3-none-any.whl",
"has_sig": false,
"md5_digest": "11a20b0262c10ed0b1b97f2c946affd6",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 119212,
"upload_time": "2025-10-30T12:24:19",
"upload_time_iso_8601": "2025-10-30T12:24:19.233041Z",
"url": "https://files.pythonhosted.org/packages/4a/cb/94d4f2b1504f7aac8ccbc51a84ece611286e6f02f8ae6e6de473b5cd2190/bournemouth_forced_aligner-0.1.7-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "8a4c98ffa0344f506f7af1f791f402b8604583808dbb26bfc7213d36bac98915",
"md5": "be7b570433a48feb553d024568534b8a",
"sha256": "c4fbf79efbdff7d7573efe0a738638214988a024a7e6c209b70ef44eeb9d87d7"
},
"downloads": -1,
"filename": "bournemouth_forced_aligner-0.1.7.tar.gz",
"has_sig": false,
"md5_digest": "be7b570433a48feb553d024568534b8a",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 2403878,
"upload_time": "2025-10-30T12:24:23",
"upload_time_iso_8601": "2025-10-30T12:24:23.274567Z",
"url": "https://files.pythonhosted.org/packages/8a/4c/98ffa0344f506f7af1f791f402b8604583808dbb26bfc7213d36bac98915/bournemouth_forced_aligner-0.1.7.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-30 12:24:23",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "tabahi",
"github_project": "bournemouth-forced-aligner",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "torch",
"specs": [
[
">=",
"1.9.0"
]
]
},
{
"name": "torchaudio",
"specs": [
[
">=",
"0.9.0"
]
]
},
{
"name": "huggingface_hub",
"specs": [
[
">=",
"0.8.0"
]
]
},
{
"name": "numpy",
"specs": [
[
">=",
"1.19.0"
]
]
},
{
"name": "phonemizer",
"specs": [
[
">=",
"3.3.0"
]
]
},
{
"name": "librosa",
"specs": [
[
">=",
"0.8.0"
]
]
}
],
"lcname": "bournemouth-forced-aligner"
}