<h1 align="center">Python forced alignment</h1>
<div align="center">
[![PyPI](https://img.shields.io/pypi/v/pyfoal.svg)](https://pypi.python.org/pypi/pyfoal)
[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Downloads](https://static.pepy.tech/badge/pyfoal)](https://pepy.tech/project/pyfoal)
</div>
Forced alignment suite. Includes English grapheme-to-phoneme (G2P) conversion
and phoneme alignment via the following forced alignment tools.
- RAD-TTS [1]
- Montreal Forced Aligner (MFA) [2]
- Penn Phonetic Forced Aligner (P2FA) [3]
RAD-TTS is used by default. Alignments can be saved to disk or accessed via the
`pypar.Alignment` phoneme alignment representation. See
[`pypar`](https://github.com/maxrmorrison/pypar) for more details.
`pyfoal` also includes the following utilities (a usage sketch follows this list).
- Converting alignments to and from a categorical representation
  suitable for training machine learning models (`pyfoal.convert`)
- Natural interpolation of forced alignments for time-stretching speech
  (`pyfoal.interpolate`)
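
Below is a minimal sketch of both utilities. The function names
`pyfoal.convert.alignment_to_indices` and `pyfoal.interpolate.interpolate`,
and their signatures, are illustrative assumptions rather than the confirmed
API; consult the module source for the exact names.

```python
import pyfoal
import pypar

# Load a previously saved alignment (pypar supports .json and .TextGrid)
alignment = pypar.Alignment('alignment.json')

# Hypothetical: convert to a framewise categorical representation
# (one phoneme index per frame) suitable for model training
indices = pyfoal.convert.alignment_to_indices(alignment, hopsize=0.01)

# Hypothetical: time-stretch the alignment to twice its original duration
stretched = pyfoal.interpolate.interpolate(alignment, 2.)
```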
## Table of contents
- [Installation](#installation)
- [Inference](#inference)
* [Application programming interface](#application-programming-interface)
* [`pyfoal.from_text_and_audio`](#pyfoalfrom_text_and_audio)
* [`pyfoal.from_file`](#pyfoalfrom_file)
* [`pyfoal.from_file_to_file`](#pyfoalfrom_file_to_file)
* [`pyfoal.from_files_to_files`](#pyfoalfrom_files_to_files)
* [Command-line interface](#command-line-interface)
- [Training](#training)
* [Download](#download)
* [Preprocess](#preprocess)
* [Partition](#partition)
* [Train](#train)
* [Monitor](#monitor)
* [Evaluate](#evaluate)
- [References](#references)
## Installation
`pip install pyfoal`
MFA and P2FA both require additional installation steps found below.
### Montreal Forced Aligner (MFA)
`conda install -c conda-forge montreal-forced-aligner`
### Penn Phonetic Forced Aligner (P2FA)
P2FA depends on the
[Hidden Markov Model Toolkit (HTK)](http://htk.eng.cam.ac.uk/), which has been
tested on macOS and Linux using HTK version 3.4.0. Version 3.4.1 has known
issues on Linux. HTK is released under a license that prohibits redistribution,
so you must install HTK yourself and verify that the commands `HCopy` and
`HVite` are available as system-wide binaries. After downloading HTK, I use
the following for installation on Linux.
```
# Run from the root of the extracted HTK source directory
sudo apt-get install -y gcc-multilib libx11-dev
sudo chmod +x configure
./configure --disable-hslab
make all
sudo make install
```
For more help with HTK installation, see notes by
[Jaekoo Kang](https://github.com/jaekookang/p2fa_py3#install-htk) and
[Steve Rubin](https://github.com/ucbvislab/p2fa-vislab#install-htk-34-note-341-will-not-work-get-htk-here).
## Inference
### Force-align text and audio
```python
import pyfoal
# Load text
text = pyfoal.load.text(text_file)
# Load and resample audio
audio = pyfoal.load.audio(audio_file)
# Select an aligner. One of ['mfa', 'p2fa', 'radtts' (default)].
aligner = 'radtts'
# For RAD-TTS, select a model checkpoint
checkpoint = pyfoal.DEFAULT_CHECKPOINT
# Select a GPU to run inference on
gpu = 0
alignment = pyfoal.from_text_and_audio(
    text,
    audio,
    pyfoal.SAMPLE_RATE,
    aligner=aligner,
    checkpoint=checkpoint,
    gpu=gpu)
```
### Application programming interface
#### `pyfoal.from_text_and_audio`
```
"""Phoneme-level forced-alignment
Arguments
text : string
The speech transcript
audio : torch.tensor(shape=(1, samples))
The speech signal to process
sample_rate : int
The audio sampling rate
Returns
alignment : pypar.Alignment
The forced alignment
"""
```
#### `pyfoal.from_file`
```
"""Phoneme alignment from audio and text files
Arguments
text_file : Path
The corresponding transcript file
audio_file : Path
The audio file to process
aligner : str
The alignment method to use
checkpoint : Path
The checkpoint to use for neural methods
gpu : int
The index of the gpu to perform alignment on for neural methods
Returns
alignment : Alignment
The forced alignment
"""
```
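
A usage sketch (file names are placeholders; the result is a
`pypar.Alignment`):

```python
import pyfoal

# Align a single utterance from a transcript file and an audio file
alignment = pyfoal.from_file('speech.txt', 'speech.wav', aligner='radtts')

# Inspect the result via the pypar API
print(alignment.duration())
```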
#### `pyfoal.from_file_to_file`
```
"""Perform phoneme alignment from files and save to disk
Arguments
text_file : Path
The corresponding transcript file
audio_file : Path
The audio file to process
output_file : Path
The file to save the alignment
aligner : str
The alignment method to use
checkpoint : Path
The checkpoint to use for neural methods
gpu : int
The index of the gpu to perform alignment on for neural methods
"""
```
#### `pyfoal.from_files_to_files`
```
"""Perform parallel phoneme alignment from many files and save to disk
Arguments
text_files : list
The transcript files
audio_files : list
The corresponding speech audio files
output_files : list
The files to save the alignments
aligner : str
The alignment method to use
num_workers : int
Number of CPU cores to utilize. Defaults to all cores.
checkpoint : Path
The checkpoint to use for neural methods
gpu : int
The index of the gpu to perform alignment on for neural methods
"""
```
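
As a usage sketch (file names are placeholders), batch alignment with the
file-based API might look like the following.

```python
import pyfoal

# Placeholder file lists; any number of utterances can be aligned in parallel
text_files = ['speech-1.txt', 'speech-2.txt']
audio_files = ['speech-1.wav', 'speech-2.wav']
output_files = ['speech-1.json', 'speech-2.json']

# Align all files with the default RAD-TTS aligner, using every CPU core
pyfoal.from_files_to_files(
    text_files,
    audio_files,
    output_files,
    aligner='radtts')
```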
### Command-line interface
```
python -m pyfoal
    [-h]
    --text_files TEXT_FILES [TEXT_FILES ...]
    --audio_files AUDIO_FILES [AUDIO_FILES ...]
    --output_files OUTPUT_FILES [OUTPUT_FILES ...]
    [--aligner ALIGNER]
    [--num_workers NUM_WORKERS]
    [--checkpoint CHECKPOINT]
    [--gpu GPU]

Arguments:
  -h, --help
      show this help message and exit
  --text_files TEXT_FILES [TEXT_FILES ...]
      The speech transcript files
  --audio_files AUDIO_FILES [AUDIO_FILES ...]
      The speech audio files
  --output_files OUTPUT_FILES [OUTPUT_FILES ...]
      The files to save the alignments
  --aligner ALIGNER
      The alignment method to use
  --num_workers NUM_WORKERS
      Number of CPU cores to utilize. Defaults to all cores.
  --checkpoint CHECKPOINT
      The checkpoint to use for neural methods
  --gpu GPU
      The index of the GPU to use for inference. Defaults to CPU.
```
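
For example, aligning a single utterance from the shell (paths are
placeholders):

```
python -m pyfoal \
    --text_files speech.txt \
    --audio_files speech.wav \
    --output_files speech.json \
    --aligner radtts
```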
## Training
### Download
`python -m pyfoal.data.download`
Downloads and uncompresses the `arctic` and `libritts` datasets used for training.
### Preprocess
`python -m pyfoal.data.preprocess`
Converts each dataset to a common format on disk ready for training.
### Partition
`python -m pyfoal.partition`
Generates `train`, `valid`, and `test` partitions for `arctic` and `libritts`.
Partitioning is deterministic given the same random seed. You do not need to
run this step, as the original partitions are saved in
`pyfoal/assets/partitions`.
### Train
`python -m pyfoal.train --config <config> --gpus <gpus>`
Trains a model according to a given configuration on the `libritts`
dataset. Uses a list of GPU indices as an argument, and uses distributed
data parallelism (DDP) if more than one index is given. For example,
`--gpus 0 3` will train using DDP on GPUs `0` and `3`.
### Monitor
Run `tensorboard --logdir runs/`. If you are running training remotely, you
must create an SSH connection with port forwarding to view TensorBoard.
This can be done with `ssh -L 6006:localhost:6006 <user>@<server-ip-address>`.
Then, open `localhost:6006` in your browser.
### Evaluate
```
python -m pyfoal.evaluate \
    --config <config> \
    --checkpoint <checkpoint> \
    --gpu <gpu>
```
Evaluate a model. `<checkpoint>` is the checkpoint file to evaluate and `<gpu>`
is the GPU index.
## References
[1] R. Badlani, A. Łańcucki, K. J. Shih, R. Valle, W. Ping, and B.
Catanzaro, "One TTS Alignment to Rule Them All," International
Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
[2] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger,
"Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi,"
Interspeech, pp. 498-502, 2017.

[3] J. Yuan and M. Liberman, "Speaker identification on the SCOTUS
corpus," Journal of the Acoustical Society of America, vol. 123, p.
3878, 2008.