# ForceAlign
ForceAlign is a Python library for forced alignment of English text to English audio. It can generate **word**-level or [**phoneme**](https://en.wikipedia.org/wiki/Phoneme)-level alignments, identifying when each word or phoneme was spoken within an audio recording. ForceAlign supports `.mp3` and `.wav` audio file formats.
For phoneme-level alignments, ForceAlign currently supports the [ARPABET](https://en.wikipedia.org/wiki/ARPABET) phonetic transcription encoding.
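
To make the encoding concrete: in ARPABET as used by the CMU Pronouncing Dictionary, vowels end in a stress digit (`0`, `1`, or `2`) and consonants carry none. The sketch below splits a symbol into its base phoneme and stress level. The helper name `split_stress` is hypothetical and not part of ForceAlign, and whether ForceAlign's phoneme strings keep the stress digit is an assumption to check against your own output.

```python
import re
from typing import Optional, Tuple

# ARPABET symbol: uppercase letters, optionally followed by a stress digit.
_ARPABET = re.compile(r"^([A-Z]+)([012])?$")


def split_stress(symbol: str) -> Tuple[str, Optional[int]]:
    """Split an ARPABET symbol such as 'AH0' into ('AH', 0).

    Hypothetical helper for illustration; not part of the ForceAlign API.
    """
    match = _ARPABET.match(symbol.strip().upper())
    if match is None:
        raise ValueError(f"not an ARPABET symbol: {symbol!r}")
    base, stress = match.groups()
    return base, (int(stress) if stress is not None else None)


# "hello" in CMUdict-style ARPABET: HH AH0 L OW1
print([split_stress(p) for p in ["HH", "AH0", "L", "OW1"]])
```

Dropping the stress digit is often convenient when grouping phonemes for analysis, since `AH0` and `AH1` are the same vowel with different stress.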
ForceAlign uses PyTorch's **Wav2Vec2** pretrained model for acoustic feature extraction and can run on both CPU and CUDA GPU devices. It also includes **automatic speech-to-text transcription**, so it can be used when a transcript is not readily available.
---
## Features
- Fast and accurate word- and phoneme-level forced alignment of text to audio.
- Includes **automatic speech transcription** if a transcript is not provided.
- Optimized for both CPU and GPU.
- OS-independent: compatible with macOS, Windows, and Linux.
- Supports `.mp3` and `.wav` audio file formats.
---
## Installation and Dependencies
1. Install ForceAlign:
```bash
pip3 install forcealign
```
2. Install `ffmpeg` (required for audio processing):
- **macOS**: `brew install ffmpeg`
- **Linux** (Debian/Ubuntu): `sudo apt install ffmpeg`
- **Windows**: Install from [ffmpeg.org](https://ffmpeg.org/download.html)
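
Since `ffmpeg` is installed separately from the Python package, it can help to confirm that it is discoverable before loading audio. The following is a minimal sketch using only the standard library; the function name `ffmpeg_available` is an illustrative assumption, not something ForceAlign provides.

```python
import shutil


def ffmpeg_available(binary: str = "ffmpeg") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(binary) is not None


if not ffmpeg_available():
    print("ffmpeg was not found on PATH; install it before loading audio files.")
```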
---
## Usage Examples
### Example 1: Getting Word-Level Text Alignments with a Provided Transcript
```python
from forcealign import ForceAlign
# Provide path to audio file and corresponding transcript
transcript = "The quick brown fox jumps over the lazy dog."
align = ForceAlign(audio_file='./speech.mp3', transcript=transcript)
# Run prediction and return alignment results
words = align.inference()
# Show predicted word-level alignments
for word in words:
    print(f"Word: {word.word}, Start: {word.time_start}s, End: {word.time_end}s")
```
---
### Example 2: Getting Word-Level Text Alignments with Automatic Speech Transcription
If a transcript is not provided, ForceAlign can automatically generate one using Wav2Vec2.
```python
from forcealign import ForceAlign
# Provide path to audio file; omit transcript
align = ForceAlign(audio_file='./speech.mp3')
# Automatically generate transcript and align words
words = align.inference()
# Show the generated transcript
print("Generated Transcript:")
print(align.raw_text)
# Show predicted word-level alignments
for word in words:
    print(f"Word: {word.word}, Start: {word.time_start}s, End: {word.time_end}s")
```
---
### Example 3: Getting Phoneme-Level Text Alignments
```python
from forcealign import ForceAlign
# Provide path to audio file and transcript
transcript = "The quick brown fox jumps over the lazy dog."
align = ForceAlign(audio_file='./speech.mp3', transcript=transcript)
# Run prediction and return alignment results
words = align.inference()
# Access predicted phoneme-level alignments
for word in words:
    print(f"Word: {word.word}")
    for phoneme in word.phonemes:
        print(f"Phoneme: {phoneme.phoneme}, Start: {phoneme.time_start}s, End: {phoneme.time_end}s")
```
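
For analysis it is often easier to work with a flat table than with nested word and phoneme objects. The sketch below flattens the alignments into CSV rows. It relies only on the attributes used in the example above (`word`, `phonemes`, `phoneme`, `time_start`, `time_end`); the function names and column headers are hypothetical choices, not part of ForceAlign.

```python
import csv
import io


def phoneme_rows(words):
    """Yield (word, phoneme, start, end, duration) for every aligned phoneme."""
    for word in words:
        for p in word.phonemes:
            duration = round(p.time_end - p.time_start, 3)
            yield (word.word, p.phoneme, p.time_start, p.time_end, duration)


def phonemes_to_csv(words) -> str:
    """Render all phoneme alignments as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["word", "phoneme", "start_s", "end_s", "duration_s"])
    writer.writerows(phoneme_rows(words))
    return buffer.getvalue()


# Usage with the `words` list from the example above:
# print(phonemes_to_csv(words))
```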
---
### Example 4: Reviewing Word-Level Alignments in Real-Time
```python
from forcealign import ForceAlign
# Provide path to audio file and transcript
transcript = "The quick brown fox jumps over the lazy dog."
align = ForceAlign(audio_file='./speech.mp3', transcript=transcript)
# Play the audio while printing word alignments in real-time
align.review_alignment()
```
---
## Where ForceAlign Works Well
ForceAlign excels in the following scenarios:
1. **Clear Audio Recordings**:
- Audio with minimal background noise, clear enunciation, and consistent speaking patterns.
2. **Short and Medium-Length Recordings**:
- Audio files up to ~30 minutes, where transcription and alignment can be processed efficiently.
3. **Standard English Pronunciation**:
- Recordings with native or near-native English pronunciation.
---
## Where ForceAlign May Struggle
1. **Noisy Audio**:
- Recordings with heavy background noise or overlapping speech may result in reduced transcription and alignment accuracy.
2. **Non-Standard English Accents**:
- Strong regional accents or dialects not represented in the Wav2Vec2 training data may lead to transcription errors.
3. **Long Audio Files**:
- For recordings exceeding ~1 hour, memory and processing time may become significant issues.
4. **Non-English Speech**:
- ForceAlign currently supports English only.
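
One possible workaround for long recordings is to split the audio into fixed-length windows, align each window separately, and shift the resulting timestamps back onto the full timeline. This is not a ForceAlign feature, and whether it reduces memory use in practice is an assumption. The sketch below covers only the bookkeeping; the helper names and the 600-second default are illustrative. Note that fixed boundaries can cut through a word, and each window needs its own matching transcript (or automatic transcription).

```python
def chunk_windows(total_s: float, chunk_s: float = 600.0):
    """Return (start, end) windows in seconds that cover [0, total_s]."""
    if chunk_s <= 0:
        raise ValueError("chunk_s must be positive")
    windows = []
    start = 0.0
    while start < total_s:
        windows.append((start, min(start + chunk_s, total_s)))
        start += chunk_s
    return windows


def to_global_time(local_s: float, window_start_s: float) -> float:
    """Convert a timestamp measured inside a window to the full timeline."""
    return round(local_s + window_start_s, 3)


# A 25-minute file split into 10-minute windows
for start, end in chunk_windows(25 * 60.0):
    print(f"{start:.0f}s to {end:.0f}s")
```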
---
## Use Cases
- **Subtitle Generation**:
- Generate timestamps for subtitles or closed captions for videos.
- **Phoneme Analysis**:
- Analyze phoneme-level details for language research, speech therapy, or pronunciation training.
- **Animated Lip Syncing**:
- Use phoneme alignments to synchronize animated character lip movements with audio.
- **Accessibility Tools**:
- Enhance accessibility by creating aligned captions or transcripts for audio recordings.
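
As an illustration of the subtitle use case, word-level alignments can be grouped into SubRip (`.srt`) cues. This is a minimal sketch, assuming word objects that expose `word`, `time_start`, and `time_end` as in the usage examples; the function names and the fixed seven-words-per-cue grouping are hypothetical choices, not part of ForceAlign.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words: int = 7) -> str:
    """Group aligned words into numbered SRT cues of up to max_words each."""
    words = list(words)
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        text = " ".join(w.word for w in group)
        start = srt_timestamp(group[0].time_start)
        end = srt_timestamp(group[-1].time_end)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Usage with alignment results:
# with open("speech.srt", "w", encoding="utf-8") as f:
#     f.write(words_to_srt(align.inference()))
```

Grouping by word count keeps the sketch short; splitting cues at pauses between words or at sentence punctuation would usually read better.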
---
## FAQ
**1. Does ForceAlign have speech-to-text capabilities?**
Yes! If you do not provide a transcript, ForceAlign will automatically generate one using Wav2Vec2. You can also provide your own transcript for better accuracy.
**2. Can ForceAlign be used with both CPU and GPU?**
Yes. ForceAlign is optimized for both CPU and CUDA-enabled GPU devices. Using a GPU significantly speeds up processing for longer recordings.
**3. Can ForceAlign handle non-English audio?**
No. Currently, ForceAlign supports English only. Support for additional languages may be added in future updates.
---
## Acknowledgements
This project is heavily based upon a demo from PyTorch by Moto Hira: [FORCED ALIGNMENT WITH WAV2VEC2](https://pytorch.org/audio/stable/tutorials/forced_alignment_tutorial.html).
Raw data
{
"_id": null,
"home_page": "https://github.com/lukerbs/forcealign",
"name": "forcealign",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "force align, forced alignment, audio segmentation, audio forced alignment, python forced alignment, phoneme, generate subtitles",
"author": "Luke Kerbs",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/da/e1/a226b66c525c51e73a3a626a57f98941e92bcdd4103ccf94cdbe20829021/forcealign-1.1.9.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "A Python library for forced alignment of English text to English audio.",
"version": "1.1.9",
"project_urls": {
"Homepage": "https://github.com/lukerbs/forcealign"
},
"split_keywords": [
"force align",
" forced alignment",
" audio segmentation",
" audio forced alignment",
" python forced alignment",
" phoneme",
" generate subtitles"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "187f7ff5b2e4bc8a01d22482952c16b5ff4931284d94cb364dbbb6e4c594a038",
"md5": "47d4c275e3c4daf99317247d18273ccb",
"sha256": "1281c11e8c8c5e96fe890037bd425b0eb427435c3eee0cfa429ecc0aa4d94460"
},
"downloads": -1,
"filename": "forcealign-1.1.9-py3-none-any.whl",
"has_sig": false,
"md5_digest": "47d4c275e3c4daf99317247d18273ccb",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 8633,
"upload_time": "2024-12-04T06:54:03",
"upload_time_iso_8601": "2024-12-04T06:54:03.892738Z",
"url": "https://files.pythonhosted.org/packages/18/7f/7ff5b2e4bc8a01d22482952c16b5ff4931284d94cb364dbbb6e4c594a038/forcealign-1.1.9-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "dae1a226b66c525c51e73a3a626a57f98941e92bcdd4103ccf94cdbe20829021",
"md5": "60576eeb37be27b187f95f35b0a85a2c",
"sha256": "a07418d13b33fe1a5375a933f78fb91ddbb97da5b757e7acdb30c5aa59f54a09"
},
"downloads": -1,
"filename": "forcealign-1.1.9.tar.gz",
"has_sig": false,
"md5_digest": "60576eeb37be27b187f95f35b0a85a2c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 8037,
"upload_time": "2024-12-04T06:54:05",
"upload_time_iso_8601": "2024-12-04T06:54:05.293486Z",
"url": "https://files.pythonhosted.org/packages/da/e1/a226b66c525c51e73a3a626a57f98941e92bcdd4103ccf94cdbe20829021/forcealign-1.1.9.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-04 06:54:05",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "lukerbs",
"github_project": "forcealign",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "click",
"specs": [
[
"==",
"8.1.7"
]
]
},
{
"name": "Distance",
"specs": [
[
"==",
"0.1.3"
]
]
},
{
"name": "filelock",
"specs": [
[
"==",
"3.16.1"
]
]
},
{
"name": "fsspec",
"specs": [
[
"==",
"2024.10.0"
]
]
},
{
"name": "g2p-en",
"specs": [
[
"==",
"2.1.0"
]
]
},
{
"name": "inflect",
"specs": [
[
"==",
"7.4.0"
]
]
},
{
"name": "Jinja2",
"specs": [
[
"==",
"3.1.4"
]
]
},
{
"name": "joblib",
"specs": [
[
"==",
"1.4.2"
]
]
},
{
"name": "MarkupSafe",
"specs": [
[
"==",
"3.0.2"
]
]
},
{
"name": "more-itertools",
"specs": [
[
"==",
"10.5.0"
]
]
},
{
"name": "mpmath",
"specs": [
[
"==",
"1.3.0"
]
]
},
{
"name": "networkx",
"specs": [
[
"==",
"3.4.2"
]
]
},
{
"name": "nltk",
"specs": [
[
"==",
"3.9.1"
]
]
},
{
"name": "numpy",
"specs": [
[
"==",
"2.1.3"
]
]
},
{
"name": "pydub",
"specs": [
[
"==",
"0.25.1"
]
]
},
{
"name": "regex",
"specs": [
[
"==",
"2024.11.6"
]
]
},
{
"name": "sympy",
"specs": [
[
"==",
"1.13.1"
]
]
},
{
"name": "torch",
"specs": [
[
"==",
"2.5.1"
]
]
},
{
"name": "torchaudio",
"specs": [
[
"==",
"2.5.1"
]
]
},
{
"name": "tqdm",
"specs": [
[
"==",
"4.67.1"
]
]
},
{
"name": "typeguard",
"specs": [
[
"==",
"4.4.1"
]
]
},
{
"name": "typing_extensions",
"specs": [
[
"==",
"4.12.2"
]
]
}
],
"lcname": "forcealign"
}