bournemouth-forced-aligner


Name: bournemouth-forced-aligner
Version: 0.1.0
Home page: https://github.com/tabahi/bournemouth-forced-aligner
Summary: Bournemouth Forced Aligner - Phoneme-level timestamp extraction
Upload time: 2025-08-19 09:25:26
Author: Tabahi
Requires Python: >=3.8
License: GPLv3
Keywords: phoneme, alignment, speech, audio, timestamp, forced-alignment, bournemouth, cupe, speech-recognition, linguistics
            
# Bournemouth Forced Aligner (BFA)


A Python library for extracting phoneme-level timestamps from audio files and transcriptions. 
**URL:** https://github.com/tabahi/bournemouth-forced-aligner

- Find the exact time in milliseconds at which each phoneme or word is spoken in an audio clip, provided you already have the text for the audio.

- This project depends on pretrained models from the Contextless Universal Phoneme Encoder (CUPE): https://github.com/tabahi/contexless-phonemes-CUPE
- Currently, only the English model has been pretrained to usable accuracy.
- Work in progress.

## Features
- **Fast alignment**: takes ~0.2 s for 10 s of audio on CPU
- **Phoneme-level timestamp extraction** from audio with high accuracy
- **Viterbi algorithm** with confidence scoring and target boosting
- **Multi-language support** via espeak phonemization (*current acoustic model is English only)
- **Embedding extraction**: contextless, pure phoneme embeddings for downstream machine learning tasks
- **Word-level alignment** derived from phoneme timestamps
- **Command-line interface** for hands-off use
- **JSON output format** for easy integration with other tools
- **TextGrid output format** for importing timestamps into Praat

## Installation

### From PyPI (recommended)
```bash
pip install bournemouth-forced-aligner

# System dependency (Debian/Ubuntu)
apt-get install espeak-ng
```

Check the installation:
```bash
# Show help
balign --help

# Show version
balign --version

# Test installation
python -c "from bournemouth_aligner import PhonemeTimestampAligner; print('Installation OK')"
```

## Getting Started

Start with `example.py`:

```python

import time
import json
from bournemouth_aligner import PhonemeTimestampAligner

transcription = "butterfly"
audio_path = "examples/samples/audio/109867__timkahn__butterfly.wav"

model_name = "en_libri1000_uj01d_e199_val_GER=0.2307.ckpt" 
extractor = PhonemeTimestampAligner(model_name=model_name, lang='en-us', duration_max=10, device='cpu')

audio_wav = extractor.load_audio(audio_path) # can replace it with custom audio source

t0 = time.time()

timestamps = extractor.process_transcription(transcription, audio_wav, ts_out_path=None, extract_embeddings=False, vspt_path=None, do_groups=True, debug=True)

t1 = time.time()
print("Timestamps:")
print(json.dumps(timestamps, indent=4, ensure_ascii=False))
print(f"Processing time: {t1 - t0:.2f} seconds") # 0.2s

```

## Output

Sample output:
```json
{
    "segments": [
        {
            "start": 0.0,
            "end": 1.2588125,
            "text": "butterfly", "ph66": [29, 10, 58, 9, 43, 56, 23], "pg16": [7, 2, 14, 2, 8, 13, 5], "coverage_analysis": {"target_count": 7, "aligned_count": 7, "missing_count": 0, "extra_count": 0, "coverage_ratio": 1.0, "missing_phonemes": [], "extra_phonemes": []}, "ipa": ["b", "ʌ", "ɾ", "ɚ", "f", "l", "aɪ"], "word_num": [0, 0, 0, 0, 0, 0, 0], "words": [
                "butterfly"
            ],
            "phoneme_ts": [
                {
                    "phoneme_idx": 29,
                    "phoneme_label": "b",
                    "start_ms": 33.56833267211914,
                    "end_ms": 50.35249710083008,
                    "confidence": 0.9849503040313721
                },
                ...,
                {
                    "phoneme_idx": 23,
                    "phoneme_label": "aɪ",
                    "start_ms": 604.22998046875,
                    "end_ms": 621.01416015625,
                    "confidence": 0.21650740504264832
                }
            ],
            "group_ts": [
                {
                    "group_idx": 7,
                    "group_label": "voiced_stops",
                    "start_ms": 33.56833267211914,
                    "end_ms": 50.35249710083008,
                    "confidence": 0.9911064505577087
                },
                ...,
                {
                    "group_idx": 5,
                    "group_label": "diphthongs",
                    "start_ms": 604.22998046875,
                    "end_ms": 621.01416015625,
                    "confidence": 0.4117060899734497
                }
            ],
            "words_ts": [
                {
                    "word": "butterfly", "start_ms": 33.56833267211914, "end_ms": 621.01416015625, "confidence": 0.6550856615815844, "ph66": [29, 10, 58, 9, 43, 56, 23], "ipa": ["b", "ʌ", "ɾ", "ɚ", "f", "l", "aɪ"]
                }
            ]
        }
    ]
}
```
Output keys:
- 'ph66': standardized 66 phoneme classes, including silence. See more in [mapper66.py](bournemouth_aligner/mapper66.py)
- 'pg16': standardized 16 phoneme category groups such as laterals, lower front vowels, rhotics, etc. See the complete mapping in `phoneme_groups_index` in [mapper66.py](bournemouth_aligner/mapper66.py)
- 'ipa': list of IPA sequences generated by espeak. These can cause unicode issues.
- 'words': list of words split by a simple regex: `re.findall(r"\b\w+\b|[.,!?;:]", "sentence")`
- 'phoneme_ts': aligned timestamps for phonemes (ph66).
- 'group_ts': aligned timestamps for phoneme groups (pg16). These can be more accurate than the phoneme timestamps.
- 'word_num': for each phoneme in 'ph66', the index of the word it belongs to (see the snippet after this list).
- 'words_ts': aligned timestamps for words, remapped from 'phoneme_ts'.
- 'coverage_analysis': alignment-quality metrics; reports insertions and deletions.
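
As a minimal example of consuming this output, the snippet below groups the aligned phonemes by 'word_num' to recover per-word phoneme timings. It assumes only the structure shown above, with 'word_num' and 'phoneme_ts' aligned index by index:

```python
import json
from collections import defaultdict

# Load a timestamps file produced by the aligner (structure as shown above).
with open("timestamps.json", encoding="utf-8") as f:
    result = json.load(f)

for seg in result["segments"]:
    # Map each word index to the phoneme timestamps that belong to it.
    by_word = defaultdict(list)
    for word_idx, ph in zip(seg["word_num"], seg["phoneme_ts"]):
        by_word[word_idx].append(ph)
    for word_idx, phones in sorted(by_word.items()):
        start, end = phones[0]["start_ms"], phones[-1]["end_ms"]
        print(f'{seg["words"][word_idx]}: {start:.0f}-{end:.0f} ms ({len(phones)} phonemes)')
```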




## Alignment Accuracy

Below is an example Praat TextGrid visualization of phoneme-level alignment produced by [BFA](https://github.com/tabahi/bournemouth-forced-aligner), compared with [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner)

![Praat Alignment Example](examples/samples/images/LJ02_praat.png)

Currently, there are no reliable metrics to report other than speed. The timing distance error is roughly 40 ms on TIMIT (an unreliable estimate). MFA takes about 10 seconds for 2 seconds of audio, which makes it difficult to use in real time; BFA has potential for real-time inference because it is contextless ([see CUPE](https://huggingface.co/Tabahi/CUPE-2i)). If you have any comparisons or suggestions for improvement, please share them in the issues.



## How does it work?

The phoneme probabilities extracted from [CUPE](https://huggingface.co/Tabahi/CUPE-2i) are passed through the [Viterbi algorithm](https://en.wikipedia.org/wiki/Viterbi_algorithm), which traces forward and backward paths over the probabilities to pick the frames at which the expected phonemes begin and end. See the core code in [ViterbiDecoder](bournemouth_aligner/forced_alignment.py). Due to many for-loops, it runs better on CPU. There is potential for further optimization.
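
For intuition, here is a minimal sketch of Viterbi forced alignment (an illustrative toy, not the library's actual `ViterbiDecoder`; `viterbi_align` is an assumed name): given per-frame phoneme log-probabilities of shape `[T, P]` and the expected phoneme index sequence, dynamic programming finds the monotonic segmentation with the highest total score.

```python
import torch

def viterbi_align(log_probs, targets):
    """log_probs: [T, P] per-frame phoneme log-probabilities;
    targets: expected phoneme indices (requires T >= len(targets)).
    Returns one (start_frame, end_frame) span per target phoneme."""
    T, S = log_probs.shape[0], len(targets)
    score = torch.full((T, S), float("-inf"))
    back = torch.zeros((T, S), dtype=torch.long)  # 0 = stay in phoneme, 1 = advance
    score[0, 0] = log_probs[0, targets[0]]
    for t in range(1, T):  # forward pass: best score ending at frame t in phoneme s
        for s in range(S):
            stay = score[t - 1, s]
            move = score[t - 1, s - 1] if s > 0 else float("-inf")
            best, came = (move, 1) if move > stay else (stay, 0)
            score[t, s] = best + log_probs[t, targets[s]]
            back[t, s] = came
    spans, s, end = [], S - 1, T - 1  # backtrace to recover segment boundaries
    for t in range(T - 1, 0, -1):
        if back[t, s] == 1:  # phoneme s began at frame t
            spans.append((t, end))
            end, s = t - 1, s - 1
    spans.append((0, end))  # the first phoneme covers the remaining frames
    return spans[::-1]

torch.manual_seed(0)
lp = torch.log_softmax(torch.randn(50, 66), dim=-1)  # 50 frames, 66 phoneme classes
print(viterbi_align(lp, [29, 10, 58, 9, 43, 56, 23]))  # ph66 indices for "butterfly"
```

The nested frame-by-phoneme loops are the for-loops mentioned above: each iteration is tiny, which is why the dynamic program tends to run faster on CPU than on GPU. Frame spans are then scaled by the model's frame duration to obtain milliseconds.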



## Advanced Usage

See [example_advanced.py](examples/example_advanced.py) for more advanced batch processing.

Extract timestamps directly from audio by first transcribing it with Whisper:

```python
# pip install git+https://github.com/openai/whisper.git 

import whisper
import json
from bournemouth_aligner import PhonemeTimestampAligner

model = whisper.load_model("turbo")

audio_path = "audio.wav"
srt_path = "whisper_output.srt.json"
ts_out_path = "timestamps.vs2.json"

result = model.transcribe(audio_path)

# save whisper output
with open(srt_path, "w") as srt_file:
    json.dump(result, srt_file)

extractor = PhonemeTimestampAligner(model_name="en_libri1000_uj01d_e199_val_GER=0.2307.ckpt", lang='en-us', duration_max=10, device='cpu')
timestamps_dict = extractor.process_srt_file(srt_path, audio_path, ts_out_path, extract_embeddings=False, vspt_path=None, debug=False)

with open(ts_out_path, "w") as ts_file:
    json.dump(timestamps_dict, ts_file)

```

Build it step by step in a notebook:
```python
import torch
import torchaudio
from bournemouth_aligner import PhonemeTimestampAligner



# Step 1: Initialize PhonemeTimestampAligner
device = 'cpu' # CPU is faster for single-file processing
duration_max = 10 # only used for padding and clipping; set it higher than your expected duration
model_name = "en_libri1000_uj01d_e199_val_GER=0.2307.ckpt" # find more models at: https://huggingface.co/Tabahi/CUPE-2i/tree/main/ckpt
lang = 'en-us' # each CUPE model is trained on specific language(s)
extractor = PhonemeTimestampAligner(model_name=model_name, lang=lang, duration_max=duration_max, device=device)





# Step 2a: Load and preprocess audio - manually

audio_path = "examples/samples/audio/Schwa-What.wav"
audio_wav, sr = torchaudio.load(audio_path, normalize=True) #  normalize=True is for torch dtype normalization, not for amplitude

# Stick with CUPE's sample rate of 16000. For consistency, use the same resampling pipeline as CUPE's training preprocessing:
resampler = torchaudio.transforms.Resample(
        orig_freq=sr,
        new_freq=16000,
        lowpass_filter_width=64,
        rolloff=0.9475937167399596,
        resampling_method="sinc_interp_kaiser",
        beta=14.769656459379492,
    )
audio_wav = resampler(audio_wav)

rms = torch.sqrt(torch.mean(audio_wav ** 2)) # rms normalize (better to have at least 75% voiced duration)
audio_wav = (audio_wav / rms) if rms > 0 else audio_wav





# Step 2b: Load and preprocess audio - streamlined (equivalent to Step 2a)
audio_wav = extractor.load_audio(audio_path)




# Step 3: Load/create the text transcription:
transcription = "ah What!"



# Step 4: Align
timestamps = extractor.process_transcription(transcription, audio_wav, ts_out_path=None, extract_embeddings=False, vspt_path=None, do_groups=True, debug=False)


# Step 5 (optional): Convert to TextGrid
extractor.convert_to_textgrid(timestamps, output_file="output_timestamps.TextGrid", include_confidence=False)
```


If you are interested in using the phoneme embeddings for machine learning, check out [this example](examples/read_embeddings.py).



# Command Line Interface (CLI)

Bournemouth Forced Aligner includes a command-line interface, `balign`, for batch processing and automation. After installation, the `balign` command is available in your terminal.


```bash
# Basic usage
balign audio.wav transcription.srt.json output.json

# With debug output
balign audio.wav transcription.srt.json output.json --debug

# Extract embeddings too
balign audio.wav transcription.srt.json output.json --embeddings embeddings.pt
```

## Command Syntax

```bash
balign [OPTIONS] AUDIO_PATH SRT_PATH OUTPUT_PATH

# example:
balign audio.wav transcription.srt.json output.json --device cuda:0 --embeddings output_embd.pt --duration-max 5
```

### Required Arguments

- **`AUDIO_PATH`**: Path to audio file (supports .wav, .mp3, .flac, etc.)
- **`SRT_PATH`**: Path to SRT file in JSON format (see [Input Format](#input-format))
- **`OUTPUT_PATH`**: Path for output timestamps file (.json)

### Options

| Option | Default | Description |
|--------|---------|-------------|
| `--model TEXT` | `en_libri1000_uj01d_e199_val_GER=0.2307.ckpt` | CUPE model name from [HuggingFace](https://huggingface.co/Tabahi/CUPE-2i/tree/main/ckpt) |
| `--lang TEXT` | `en-us` | Language code for phonemization ([espeak codes](https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md)) |
| `--device TEXT` | `cpu` | Device for inference (`cpu` or `cuda`) |
| `--embeddings PATH` | None | Path to save phoneme embeddings (.pt file) |
| `--duration-max FLOAT` | `10.0` | Maximum segment duration in seconds. A shorter maximum uses less CUDA memory and runs faster. |
| `--debug / --no-debug` | `False` | Enable detailed debug output |
| `--boost-targets / --no-boost-targets` | `True` | Enable target phoneme boosting for better alignment |
| `--help` | | Show help message and exit |
| `--version` | | Show version and exit |

## Usage Examples

### Basic Phoneme Alignment

```bash
# Simple alignment with English audio
balign speech.wav transcription.srt.json timestamps.json
```

### With Embeddings Extraction

```bash
# Extract phoneme embeddings for downstream tasks
balign speech.wav transcription.srt.json timestamps.json --embeddings speech_embeddings.pt
```

### Multi-language Support (*planned)

```bash
# Spanish audio
balign spanish_audio.wav transcription.srt.json output.json --lang es

# French audio  
balign french_audio.wav transcription.srt.json output.json --lang fr

# German audio
balign german_audio.wav transcription.srt.json output.json --lang de
```

### GPU Acceleration

```bash
# Use CUDA for faster processing
balign large_audio.wav transcription.srt.json output.json --device cuda
```

### Advanced Configuration

```bash
# Custom model with longer segments and debug output
balign audio.wav transcription.srt.json output.json \
    --model "en_libri1000_uj01d_e199_val_GER=0.2307.ckpt" \
    --duration-max 15 \
    --debug \
    --embeddings embeddings.pt
```

### Batch Processing

```bash
#!/bin/bash
# Process multiple files
for audio in *.wav; do
    base=$(basename "$audio" .wav)
    balign "$audio" "${base}.srt" "${base}_timestamps.json" --debug
done
```
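
For bigger batches, the Python API avoids reloading the model on every invocation. A hedged sketch, assuming per-file transcripts named `<name>.srt.json` next to each `.wav`:

```python
from pathlib import Path
from bournemouth_aligner import PhonemeTimestampAligner

# Load the model once and reuse it across files.
extractor = PhonemeTimestampAligner(model_name="en_libri1000_uj01d_e199_val_GER=0.2307.ckpt",
                                    lang='en-us', duration_max=10, device='cpu')

for audio in sorted(Path(".").glob("*.wav")):
    srt = audio.with_name(audio.stem + ".srt.json")
    out = audio.with_name(audio.stem + "_timestamps.json")
    if srt.exists():
        extractor.process_srt_file(str(srt), str(audio), str(out),
                                   extract_embeddings=False, vspt_path=None, debug=False)
```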

## Input Format

The SRT file must be in JSON format with the following structure:

```json
{
  "segments": [
    {
      "start": 0.0,
      "end": 3.5,
      "text": "hello world this is a test"
    },
    {
      "start": 3.5,
      "end": 7.2,
      "text": "another segment of speech"
    }
  ]
}
```
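
A quick way to sanity-check a file against this structure before running `balign` (the checks are illustrative, not part of the library):

```python
import json
import sys

# Validate that an input file matches the segments structure shown above.
path = sys.argv[1] if len(sys.argv) > 1 else "transcription.srt.json"
with open(path, encoding="utf-8") as f:
    data = json.load(f)

assert isinstance(data.get("segments"), list), "top-level 'segments' list is required"
for i, seg in enumerate(data["segments"]):
    assert isinstance(seg.get("text"), str) and seg["text"].strip(), f"segment {i}: missing 'text'"
    assert isinstance(seg.get("start"), (int, float)), f"segment {i}: missing numeric 'start'"
    assert isinstance(seg.get("end"), (int, float)), f"segment {i}: missing numeric 'end'"
    assert seg["end"] > seg["start"], f"segment {i}: 'end' must be after 'start'"
print(f"OK: {len(data['segments'])} segments look valid")
```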

### Creating SRT Files

You can create SRT files using various methods:

**From Whisper output:**
```python
import whisper
import json

model = whisper.load_model("base")
result = model.transcribe("audio.wav")

# Convert to balign format
srt_data = {"segments": result["segments"]}
with open("transcription.srt.json", "w") as f:
    json.dump(srt_data, f, indent=2)
```

**Manual SRT creation:**
```python
import json

srt_data = {
    "segments": [
        {
            "start": 0.0,
            "end": 2.5,
            "text": "your transcribed text here"
        }
    ]
}

with open("transcription.srt.json", "w") as f:
    json.dump(srt_data, f, indent=2)
```
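
**From a standard `.srt` subtitle file** (a hedged sketch: `srt_to_json` is an illustrative helper, not part of the library; it assumes the common `HH:MM:SS,mmm` timestamp format):
```python
import json
import re

def ts_to_seconds(ts):
    # "HH:MM:SS,mmm" -> seconds as float
    h, m, rest = ts.split(":")
    s, ms = rest.split(",")
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def srt_to_json(srt_path, json_path):
    text = open(srt_path, encoding="utf-8").read().replace("\r\n", "\n")
    # Capture "start --> end" plus the caption body up to the next blank line.
    pattern = re.compile(
        r"(\d{2}:\d{2}:\d{2},\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2},\d{3})\s*\n(.*?)(?:\n\n|\Z)",
        re.DOTALL,
    )
    segments = [
        {"start": ts_to_seconds(a), "end": ts_to_seconds(b),
         "text": body.replace("\n", " ").strip()}
        for a, b, body in pattern.findall(text)
    ]
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump({"segments": segments}, f, indent=2)

srt_to_json("subtitles.srt", "transcription.srt.json")
```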

## Output Format

The CLI generates a detailed JSON file with phoneme-level timestamps:

```json
{
  "segments": [
    {
      "start": 0.0,
      "end": 3.5,
      "text": "hello world",
      "ipa": "həloʊ wɜrld",
      "phoneme_ts": [
        {
          "phoneme_idx": 23,
          "phoneme_label": "h",
          "start_ms": 0.0,
          "end_ms": 120.5,
          "confidence": 0.95
        },
        {
          "phoneme_idx": 15,
          "phoneme_label": "ə",
          "start_ms": 120.5,
          "end_ms": 200.3,
          "confidence": 0.87
        }
      ],
      "words_ts": [
        {
          "word": "hello",
          "start_ms": 0.0,
          "end_ms": 650.2,
          "confidence": 0.91,
          "ph66": [23, 15, 31, 31, 45],
          "ipa": ["h", "ə", "l", "l", "oʊ"]
        },
        {
          "word": "world", 
          "start_ms": 650.2,
          "end_ms": 1200.8,
          "confidence": 0.89,
          "ph66": [52, 15, 48, 31, 8],
          "ipa": ["w", "ɜ", "r", "l", "d"]
        }
      ]
    }
  ]
}
```
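
Since the output is plain JSON, post-processing is straightforward. For example, flattening the word-level timestamps into a CSV (file names are placeholders):

```python
import csv
import json

# Flatten word-level timestamps from the aligner's output into a CSV.
with open("output.json", encoding="utf-8") as f:
    result = json.load(f)

with open("words.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["word", "start_ms", "end_ms", "confidence"])
    for seg in result["segments"]:
        for w in seg.get("words_ts", []):
            writer.writerow([w["word"], w["start_ms"], w["end_ms"], w["confidence"]])
```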

## Debug Mode

Enable debug mode for detailed processing information:

```bash
balign audio.wav transcription.srt.json output.json --debug
```

Debug output includes:
- Model initialization status
- Audio processing details
- Phoneme sequence predictions
- Alignment coverage analysis
- Processing time statistics
- Confidence scores

Example debug output:
```
🚀 Bournemouth Forced Aligner
📁 Audio: audio.wav
📄 SRT: transcription.srt.json
💾 Output: output.json
🏷️  Language: en-us
🖥️  Device: cpu
🎯 Model: en_libri1000_uj01d_e199_val_GER=0.2307.ckpt
--------------------------------------------------
🔧 Initializing aligner...
Setting backend for language: en-us
✅ Aligner initialized successfully
🎵 Processing audio...
Loaded SRT file with 1 segments from transcription.srt.json
Resampling audio.wav from 22050Hz to 16000Hz
Expected phonemes: ['p', 'ɹ', 'ɪ', ...'ʃ', 'ə', 'n']
Target phonemes: 108, Expected: ['p', 'ɹ', 'ɪ', ..., 'ʃ', 'ə', 'n']
Spectral length: 600
Forced alignment took 135.305 ms
Aligned phonemes: 108
Target phonemes: 108
SUCCESS: All target phonemes were aligned!
Predicted phonemes 108
Predicted groups 108
start_offset_time 0.0
 1:   p, voiceless_stops  -> (0.000 - 32.183), Confidence: 0.554
 2:   ɹ, rhotics  -> (32.183 - 64.367), Confidence: 0.336
 ...
107:   ə, central_vowels  -> (9429.717 - 9445.809), Confidence: 0.434
108:   n, nasals  -> (9445.809 - 9477.992), Confidence: 0.824
Alignment Coverage Analysis:
  Target phonemes: 29
  Aligned phonemes: 29
  Coverage ratio: 100.00%

============================================================
PROCESSING SUMMARY
============================================================
Total segments processed: 1
Perfect sequence matches: 1/1 (100.0%)
Total phonemes aligned: 108
Overall average confidence: 0.502
============================================================
Results saved to: output.json
✅ Timestamps extracted to output.json
📊 Processed 1 segments with 108 phonemes
🎉 Processing completed successfully!
```




For more help, [open an issue on GitHub](https://github.com/tabahi/bournemouth-forced-aligner/issues).

            
