# Bournemouth Forced Aligner (BFA)
A Python library for extracting phoneme-level timestamps from audio files and transcriptions.
**URL:** https://github.com/tabahi/bournemouth-forced-aligner
- Find the exact time in milliseconds at which a phoneme/word is spoken in an audio clip, provided you already have the text for the audio.
- This project depends on pretrained models from Contextless Universal Phoneme Encoder (CUPE): https://github.com/tabahi/contexless-phonemes-CUPE
- Currently, only the English model has been pretrained to usable accuracy.
- Work in progress.
## Features
- **Fast alignment**: takes about 0.2 s for 10 s of audio on CPU.
- **Phoneme-level timestamp extraction** from audio with high accuracy
- **Viterbi algorithm** with confidence scoring and target boosting
- **Multi-language support** via espeak phonemization (*current acoustic model is English only)
- **Embedding extraction**: contextless, pure phoneme embeddings for downstream machine learning tasks
- **Word-level alignment** derived from phoneme timestamps
- **Command-line interface** for hands-off use
- **JSON output format** for easy integration with other tools
- **TextGrid output format** for importing timestamps into Praat
## Installation
### From PyPI (recommended)
```bash
pip install bournemouth-forced-aligner
# Dependencies
apt-get install espeak-ng
```
Check the installation:
```bash
# Show help
balign --help
# Show version
balign --version
# Test installation
python -c "from bournemouth_aligner import PhonemeTimestampAligner; print('Installation OK')"
```
## Getting Started
Start with `example.py`:
```python
import torch
import time
import json
from bournemouth_aligner import PhonemeTimestampAligner
transcription = "butterfly"
audio_path = "examples/samples/audio/109867__timkahn__butterfly.wav"
model_name = "en_libri1000_uj01d_e199_val_GER=0.2307.ckpt"
extractor = PhonemeTimestampAligner(model_name=model_name, lang='en-us', duration_max=10, device='cpu')
audio_wav = extractor.load_audio(audio_path) # can be replaced with a custom audio source
t0 = time.time()
timestamps = extractor.process_transcription(transcription, audio_wav, ts_out_path=None, extract_embeddings=False, vspt_path=None, do_groups=True, debug=True)
t1 = time.time()
print("Timestamps:")
print(json.dumps(timestamps, indent=4, ensure_ascii=False))
print(f"Processing time: {t1 - t0:.2f} seconds") # 0.2s
```
## Output
Sample output:
```json
{
"segments": [
{
"start": 0.0,
"end": 1.2588125,
"text": "butterfly", "ph66": [29, 10, 58, 9, 43, 56, 23], "pg16": [7, 2, 14, 2, 8, 13, 5], "coverage_analysis": {"target_count": 7, "aligned_count": 7, "missing_count": 0, "extra_count": 0, "coverage_ratio": 1.0, "missing_phonemes": [], "extra_phonemes": []}, "ipa": ["b", "ʌ", "ɾ", "ɚ", "f", "l", "aɪ"], "word_num": [0, 0, 0, 0, 0, 0, 0], "words": [
"butterfly"
],
"phoneme_ts": [
{
"phoneme_idx": 29,
"phoneme_label": "b",
"start_ms": 33.56833267211914,
"end_ms": 50.35249710083008,
"confidence": 0.9849503040313721
},
...,
{
"phoneme_idx": 23,
"phoneme_label": "aɪ",
"start_ms": 604.22998046875,
"end_ms": 621.01416015625,
"confidence": 0.21650740504264832
}
],
"group_ts": [
{
"group_idx": 7,
"group_label": "voiced_stops",
"start_ms": 33.56833267211914,
"end_ms": 50.35249710083008,
"confidence": 0.9911064505577087
},
...,
{
"group_idx": 5,
"group_label": "diphthongs",
"start_ms": 604.22998046875,
"end_ms": 621.01416015625,
"confidence": 0.4117060899734497
}
],
"words_ts": [
{
"word": "butterfly", "start_ms": 33.56833267211914, "end_ms": 621.01416015625, "confidence": 0.6550856615815844, "ph66": [29, 10, 58, 9, 43, 56, 23], "ipa": ["b", "ʌ", "ɾ", "ɚ", "f", "l", "aɪ"]
}
]
}
]
}
```
Output keys:
- `ph66`: standardized set of 66 phoneme classes, including silence. See more in [mapper66.py](bournemouth_aligner/mapper66.py)
- `pg16`: standardized set of 16 phoneme category groups such as laterals, lower front vowels, rhotics, etc. See the complete mapping in `phoneme_groups_index` in [mapper66.py](bournemouth_aligner/mapper66.py)
- `ipa`: list of IPA sequences generated by espeak. These can cause Unicode issues.
- `words`: list of words split by a simple regex: `re.findall(r"\b\w+\b|[.,!?;:]", "sentence")`
- `phoneme_ts`: aligned timestamps for phonemes (ph66).
- `group_ts`: aligned timestamps for phoneme groups (pg16). These can be more accurate than the phoneme timestamps.
- `word_num`: list of word indices, one per phoneme in `ph66`; each entry points to the word that the corresponding phoneme belongs to.
- `words_ts`: aligned timestamps for words, remapped from `phoneme_ts`.
- `coverage_analysis`: metrics for alignment quality. Reports insertions and deletions.
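As a quick illustration, the returned dictionary can be walked directly. The sketch below assumes the `timestamps` variable from the Getting Started example and relies only on the keys listed above.

```python
# Assumes `timestamps` is the dictionary returned by process_transcription() above.
for segment in timestamps["segments"]:
    for w in segment.get("words_ts", []):
        print(f'{w["word"]}: {w["start_ms"]:.1f}-{w["end_ms"]:.1f} ms '
              f'(confidence {w["confidence"]:.2f})')
    for p in segment.get("phoneme_ts", []):
        print(f'  {p["phoneme_label"]}: {p["start_ms"]:.1f}-{p["end_ms"]:.1f} ms')
```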
## Alignment Accuracy
Below is an example Praat TextGrid visualization of phoneme-level alignment produced by [BFA](https://github.com/tabahi/bournemouth-forced-aligner), compared with [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner).

Currently, there are no reliable accuracy metrics to report other than speed; the timing distance error is about 40 ms on TIMIT (an unreliable estimate). MFA takes around 10 seconds for 2 seconds of audio, which makes it difficult to use in real time. BFA has potential for real-time inference because it is contextless ([see CUPE](https://huggingface.co/Tabahi/CUPE-2i)). If you have any comparisons or suggestions for improvement, please let me know in the issues.
## How does it work?
The phoneme probabilities extracted from [CUPE](https://huggingface.co/Tabahi/CUPE-2i) are passed through the [Viterbi algorithm](https://en.wikipedia.org/wiki/Viterbi_algorithm), which traces forward and backward paths over the probabilities to pick the frames at which the expected phonemes begin and end. See the core code in [ViterbiDecoder](bournemouth_aligner/forced_alignment.py). Due to many for-loops, it runs better on CPU. There is potential for further optimization.
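For intuition only, here is a minimal, simplified sketch of forced alignment as a stay-or-advance Viterbi pass over per-frame phoneme log-probabilities. It is not the library's `ViterbiDecoder`; the function name, transition model, and fake data are illustrative assumptions.

```python
import numpy as np

def naive_forced_align(log_probs, targets):
    """Toy forced alignment.

    log_probs: (T, num_phonemes) array of per-frame log-probabilities.
    targets:   expected phoneme indices, in order (assumes T >= len(targets)).
    Returns a list of (start_frame, end_frame) spans, one per target phoneme.
    """
    T, S = log_probs.shape[0], len(targets)
    NEG = -1e9
    dp = np.full((T, S), NEG)                 # best path score ending at (frame, state)
    came_from_prev = np.zeros((T, S), bool)   # True if state s was entered at frame t
    dp[0, 0] = log_probs[0, targets[0]]
    for t in range(1, T):
        for s in range(S):
            stay = dp[t - 1, s]
            enter = dp[t - 1, s - 1] if s > 0 else NEG
            dp[t, s] = max(stay, enter) + log_probs[t, targets[s]]
            came_from_prev[t, s] = enter > stay
    # Backtrace: walk back from the last frame/state, cutting a span whenever a state was entered.
    spans, s, end = [], S - 1, T - 1
    for t in range(T - 1, 0, -1):
        if came_from_prev[t, s]:
            spans.append((t, end))
            end, s = t - 1, s - 1
    spans.append((0, end))                    # first target covers the remaining frames
    return list(reversed(spans))

# Tiny usage example: 3 target phonemes over 8 frames of fake probabilities.
rng = np.random.default_rng(0)
fake_log_probs = np.log(rng.dirichlet(np.ones(66), size=8))
print(naive_forced_align(fake_log_probs, targets=[29, 10, 58]))
```

Frame indices map to milliseconds via the acoustic model's frame rate; the library's decoder additionally reports confidences and can boost the expected target phonemes.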
## Advanced Usage
See [example_advanced.py](examples/example_advanced.py) for more advanced batch processing.
Extract timestamps directly from audio by first transcribing it with Whisper:
```python
# pip install git+https://github.com/openai/whisper.git
import whisper
import json
from bournemouth_aligner import PhonemeTimestampAligner
model = whisper.load_model("turbo")
audio_path = "audio.wav"
srt_path = "whisper_output.srt.json"
ts_out_path = "timestamps.vs2.json"
result = model.transcribe(audio_path)
# save whisper output
with open(srt_path, "w") as srt_file:
json.dump(result, srt_file)
extractor = PhonemeTimestampAligner(model_name="en_libri1000_uj01d_e199_val_GER=0.2307.ckpt", lang='en-us', duration_max=10, device='cpu')
timestamps_dict = extractor.process_srt_file(srt_path, audio_path, ts_out_path, extract_embeddings=False, vspt_path=None, debug=False)
with open(ts_out_path, "w") as ts_file:
json.dump(timestamps_dict, ts_file)
```
Build it step by step in a notebook:
```python
import torch
import torchaudio
from bournemouth_aligner import PhonemeTimestampAligner
# Step 1: Initialize PhonemeTimestampAligner
device = 'cpu' # CPU is faster for single-file processing
duration_max = 10 # only used for padding and clipping; set it to more than your expected segment duration
model_name = "en_libri1000_uj01d_e199_val_GER=0.2307.ckpt" # Find more models at: https://huggingface.co/Tabahi/CUPE-2i/tree/main/ckpt
lang = 'en-us' # Each CUPE model is trained on specific language(s)
extractor = PhonemeTimestampAligner(model_name=model_name, lang=lang, duration_max=duration_max, device=device)
# Step 2a: Load and preprocess audio - manually
audio_path = "examples/samples/audio/Schwa-What.wav"
audio_wav, sr = torchaudio.load(audio_path, normalize=True) # normalize=True is for torch dtype normalization, not for amplitude
# Stick with CUPE's sample rate of 16000 Hz. For consistency, use the same audio loading and resampling pipeline as CUPE's training preprocessing:
resampler = torchaudio.transforms.Resample(
    orig_freq=sr,
    new_freq=16000,
    lowpass_filter_width=64,
    rolloff=0.9475937167399596,
    resampling_method="sinc_interp_kaiser",
    beta=14.769656459379492,
)
audio_wav = resampler(audio_wav)
rms = torch.sqrt(torch.mean(audio_wav ** 2)) # rms normalize (better to have at least 75% voiced duration)
audio_wav = (audio_wav / rms) if rms > 0 else audio_wav
# Step 2b: Load and preprocess audio - streamlined
audio_wav = extractor.load_audio(audio_path)
# Step 3: Load/create the text transcription:
transcription = "ah What!"
# Step 4: Align
timestamps = extractor.process_transcription(transcription, audio_wav, ts_out_path=None, extract_embeddings=False, vspt_path=None, do_groups=True, debug=False)
# Step 5 (optional): Convert to TextGrid
extractor.convert_to_textgrid(timestamps, output_file="output_timestamps.TextGrid", include_confidence=False)
```
If you are interested in using the phoneme embeddings for machine learning then check out [this example](examples/read_embeddings.py).
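As a quick peek before diving into that example, here is a hedged sketch for inspecting a saved embeddings file (e.g. one written via the CLI `--embeddings speech_embeddings.pt` option). The exact structure of the file is defined by the library, so this only reports whatever is inside.

```python
import torch

# Inspect a saved embeddings file; structure depends on the library version.
emb = torch.load("speech_embeddings.pt", map_location="cpu")
if isinstance(emb, dict):
    for key, value in emb.items():
        shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
        print(key, shape)
elif hasattr(emb, "shape"):
    print("tensor of shape", tuple(emb.shape))
else:
    print(type(emb).__name__)
```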
# Command Line Interface (CLI)
Bournemouth Forced Aligner includes a command-line interface for batch processing and automation. After installation, the `balign` command is available in your terminal.
```bash
# Basic usage
balign audio.wav transcription.srt.json output.json
# With debug output
balign audio.wav transcription.srt.json output.json --debug
# Extract embeddings too
balign audio.wav transcription.srt.json output.json --embeddings embeddings.pt
```
## Command Syntax
```bash
balign [OPTIONS] AUDIO_PATH SRT_PATH OUTPUT_PATH
# example:
balign audio.wav transcription.srt.json output.json --device cuda:0 --embeddings output_embd.pt --duration-max 5
```
### Required Arguments
- **`AUDIO_PATH`**: Path to audio file (supports .wav, .mp3, .flac, etc.)
- **`SRT_PATH`**: Path to SRT file in JSON format (see [Input Format](#input-format))
- **`OUTPUT_PATH`**: Path for output timestamps file (.json)
### Options
| Option | Default | Description |
|--------|---------|-------------|
| `--model TEXT` | `en_libri1000_uj01d_e199_val_GER=0.2307.ckpt` | CUPE model name from [HuggingFace](https://huggingface.co/Tabahi/CUPE-2i/tree/main/ckpt) |
| `--lang TEXT` | `en-us` | Language code for phonemization ([espeak codes](https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md)) |
| `--device TEXT` | `cpu` | Device for inference (`cpu` or `cuda`) |
| `--embeddings PATH` | None | Path to save phoneme embeddings (.pt file) |
| `--duration-max FLOAT` | `10.0` | Maximum segment duration in seconds. A shorter maximum duration uses less CUDA memory and is faster. |
| `--debug / --no-debug` | `False` | Enable detailed debug output |
| `--boost-targets / --no-boost-targets` | `True` | Enable target phoneme boosting for better alignment |
| `--help` | | Show help message and exit |
| `--version` | | Show version and exit |
## Usage Examples
### Basic Phoneme Alignment
```bash
# Simple alignment with English audio
balign speech.wav transcription.srt.json timestamps.json
```
### With Embeddings Extraction
```bash
# Extract phoneme embeddings for downstream tasks
balign speech.wav transcription.srt.json timestamps.json --embeddings speech_embeddings.pt
```
### Multi-language Support (*planned)
```bash
# Spanish audio
balign spanish_audio.wav transcription.srt.json output.json --lang es
# French audio
balign french_audio.wav transcription.srt.json output.json --lang fr
# German audio
balign german_audio.wav transcription.srt.json output.json --lang de
```
### GPU Acceleration
```bash
# Use CUDA for faster processing
balign large_audio.wav transcription.srt.json output.json --device cuda
```
### Advanced Configuration
```bash
# Custom model with longer segments and debug output
balign audio.wav transcription.srt.json output.json \
--model "en_libri1000_uj01d_e199_val_GER=0.2307.ckpt" \
--duration-max 15 \
--debug \
--embeddings embeddings.pt
```
### Batch Processing
```bash
#!/bin/bash
# Process multiple files
for audio in *.wav; do
base=$(basename "$audio" .wav)
balign "$audio" "${base}.srt" "${base}_timestamps.json" --debug
done
```
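The same batch loop can also be done from Python, reusing a single aligner instance so the model checkpoint is loaded only once. This sketch uses `process_srt_file` as shown earlier; the `<name>.srt.json` naming convention next to each `<name>.wav` is an assumption for illustration.

```python
from pathlib import Path
from bournemouth_aligner import PhonemeTimestampAligner

# Load the model once and reuse it for every file.
extractor = PhonemeTimestampAligner(
    model_name="en_libri1000_uj01d_e199_val_GER=0.2307.ckpt",
    lang="en-us", duration_max=10, device="cpu",
)

for audio_path in sorted(Path(".").glob("*.wav")):
    srt_path = audio_path.with_name(audio_path.stem + ".srt.json")  # assumed naming convention
    out_path = audio_path.with_name(audio_path.stem + "_timestamps.json")
    if not srt_path.exists():
        print(f"Skipping {audio_path.name}: no {srt_path.name}")
        continue
    extractor.process_srt_file(str(srt_path), str(audio_path), str(out_path),
                               extract_embeddings=False, vspt_path=None, debug=False)
    print(f"Aligned {audio_path.name} -> {out_path.name}")
```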
## Input Format
The SRT file must be in JSON format with the following structure:
```json
{
"segments": [
{
"start": 0.0,
"end": 3.5,
"text": "hello world this is a test"
},
{
"start": 3.5,
"end": 7.2,
"text": "another segment of speech"
}
]
}
```
### Creating SRT Files
You can create SRT files using various methods:
**From Whisper output:**
```python
import whisper
import json
model = whisper.load_model("base")
result = model.transcribe("audio.wav")
# Convert to balign format
srt_data = {"segments": result["segments"]}
with open("transcription.srt.json", "w") as f:
json.dump(srt_data, f, indent=2)
```
**Manual SRT creation:**
```python
import json
srt_data = {
"segments": [
{
"start": 0.0,
"end": 2.5,
"text": "your transcribed text here"
}
]
}
with open("transcription.srt.json", "w") as f:
json.dump(srt_data, f, indent=2)
```
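**From a standard `.srt` subtitle file:** the converter below is not part of the library; it is a rough sketch that assumes the conventional SubRip block layout (index line, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text lines) and uses only the Python standard library.

```python
import json
import re

def srt_to_segments(srt_text):
    """Parse conventional SubRip blocks into the {"segments": [...]} layout above."""
    time_re = re.compile(
        r"(\d+):(\d+):(\d+)[,.](\d+)\s*-->\s*(\d+):(\d+):(\d+)[,.](\d+)")
    segments = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        match = None
        for line in lines:
            match = time_re.search(line)
            if match:
                break
        if not match:
            continue
        h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
        text = " ".join(l.strip() for l in lines
                        if not time_re.search(l) and not l.strip().isdigit())
        segments.append({
            "start": h1 * 3600 + m1 * 60 + s1 + ms1 / 1000.0,
            "end": h2 * 3600 + m2 * 60 + s2 + ms2 / 1000.0,
            "text": text.strip(),
        })
    return {"segments": segments}

with open("subtitles.srt", encoding="utf-8") as f:
    srt_data = srt_to_segments(f.read())
with open("transcription.srt.json", "w") as f:
    json.dump(srt_data, f, indent=2)
```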
## Output Format
The CLI generates a detailed JSON file with phoneme-level timestamps:
```json
{
"segments": [
{
"start": 0.0,
"end": 3.5,
"text": "hello world",
"ipa": "həloʊ wɜrld",
"phoneme_ts": [
{
"phoneme_idx": 23,
"phoneme_label": "h",
"start_ms": 0.0,
"end_ms": 120.5,
"confidence": 0.95
},
{
"phoneme_idx": 15,
"phoneme_label": "ə",
"start_ms": 120.5,
"end_ms": 200.3,
"confidence": 0.87
}
],
"words_ts": [
{
"word": "hello",
"start_ms": 0.0,
"end_ms": 650.2,
"confidence": 0.91,
"ph66": [23, 15, 31, 31, 45],
"ipa": ["h", "ə", "l", "l", "oʊ"]
},
{
"word": "world",
"start_ms": 650.2,
"end_ms": 1200.8,
"confidence": 0.89,
"ph66": [52, 15, 48, 31, 8],
"ipa": ["w", "ɜ", "r", "l", "d"]
}
]
}
]
}
```
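For downstream processing you can read this file directly. For instance, the following sketch flattens the word-level timestamps into a CSV, relying only on the keys shown above.

```python
import csv
import json

# Flatten word-level timestamps from output.json into a CSV.
with open("output.json", encoding="utf-8") as f:
    result = json.load(f)

with open("word_timestamps.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["word", "start_ms", "end_ms", "confidence"])
    for segment in result["segments"]:
        for w in segment.get("words_ts", []):
            writer.writerow([w["word"], w["start_ms"], w["end_ms"], w["confidence"]])
```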
## Debug Mode
Enable debug mode for detailed processing information:
```bash
balign audio.wav transcription.srt.json output.json --debug
```
Debug output includes:
- Model initialization status
- Audio processing details
- Phoneme sequence predictions
- Alignment coverage analysis
- Processing time statistics
- Confidence scores
Example debug output:
```
🚀 Bournemouth Forced Aligner
📁 Audio: audio.wav
📄 SRT: transcription.srt.json
💾 Output: output.json
🏷️ Language: en-us
🖥️ Device: cpu
🎯 Model: en_libri1000_uj01d_e199_val_GER=0.2307.ckpt
--------------------------------------------------
🔧 Initializing aligner...
Setting backend for language: en-us
✅ Aligner initialized successfully
🎵 Processing audio...
Loaded SRT file with 1 segments from transcription.srt.json
Resampling audio.wav from 22050Hz to 16000Hz
Expected phonemes: ['p', 'ɹ', 'ɪ', ...'ʃ', 'ə', 'n']
Target phonemes: 108, Expected: ['p', 'ɹ', 'ɪ', ..., 'ʃ', 'ə', 'n']
Spectral length: 600
Forced alignment took 135.305 ms
Aligned phonemes: 108
Target phonemes: 108
SUCCESS: All target phonemes were aligned!
Predicted phonemes 108
Predicted groups 108
start_offset_time 0.0
1: p, voiceless_stops -> (0.000 - 32.183), Confidence: 0.554
2: ɹ, rhotics -> (32.183 - 64.367), Confidence: 0.336
...
107: ə, central_vowels -> (9429.717 - 9445.809), Confidence: 0.434
108: n, nasals -> (9445.809 - 9477.992), Confidence: 0.824
Alignment Coverage Analysis:
Target phonemes: 29
Aligned phonemes: 29
Coverage ratio: 100.00%
============================================================
PROCESSING SUMMARY
============================================================
Total segments processed: 1
Perfect sequence matches: 1/1 (100.0%)
Total phonemes aligned: 108
Overall average confidence: 0.502
============================================================
Results saved to: output.json
✅ Timestamps extracted to output.json
📊 Processed 1 segments with 108 phonemes
🎉 Processing completed successfully!
```
For more help, [open an issue on GitHub](https://github.com/tabahi/bournemouth-forced-aligner/issues).