# PAFST
---
### A Library That Preprocesses Audio for TTS.
This library turns directories of raw audio files into a format suitable for TTS training data with a single call.

## Description
PAFST provides four features:
1. Separator and Denoiser
2. VAD
3. Diarization
4. STT
* Separator or Denoiser : Removes background music (MR) and noise from each audio file to isolate clean voice tracks.
* VAD : Detects the segments of each audio file where speech is present.
* Diarization : Separates speakers within each audio file, identifying distinct voices.
* STT : Extracts text from audio.
```
# before run()
path
├── TEST-1.wav # contains MR or noise
└── TEST-2.wav
# after run()
path
├── speaker_SPEAKER_00
│ ├── SPEAKER_00_1.wav # MR and noise removed
│ ├── SPEAKER_00_2.wav
│ └── SPEAKER_00_3.wav
├── speaker_SPEAKER_01
│ ├── SPEAKER_01_1.wav
│ └── SPEAKER_01_2.wav
├── speaker_SPEAKER_02
│ ├── SPEAKER_02_1.wav
│ └── SPEAKER_02_2.wav
├── asr.json
└── diarization.json
# diarization.json
[
{
"speaker_path": "/processed_audio/speaker_SPEAKER_00/SPEAKER_00_0.wav",
"audio_filepath": "processed_audio/TEST-1.wav", # the separated source audio
"start_time": 0.03,
"end_time": 3.81
},
...
]
# asr.json
[
{
"asr_text": " Let's talk about music. I often do you listen to music.",
"audio_filepath": "/processed_audio/speaker_SPEAKER_00/SPEAKER_00_0.wav",
"language": "en"
}
]
```
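The JSON manifests above can be consumed directly with the standard library. A minimal sketch (the record layout follows the example above; the inline sample data is illustrative — in practice you would `json.load` the file from the output directory):

```python
import json

# Sample records in the diarization.json layout shown above.
records = [
    {
        "speaker_path": "/processed_audio/speaker_SPEAKER_00/SPEAKER_00_0.wav",
        "audio_filepath": "processed_audio/TEST-1.wav",
        "start_time": 0.03,
        "end_time": 3.81,
    },
]

# In practice, load the manifest from disk instead:
# with open("processed_audio/diarization.json") as f:
#     records = json.load(f)

# Compute per-segment durations, e.g. to filter clips that are too short for TTS.
for rec in records:
    duration = rec["end_time"] - rec["start_time"]
    print(f"{rec['speaker_path']}: {duration:.2f}s")
```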
## Features
* Separator : Using the [UVR](https://github.com/Anjok07/ultimatevocalremovergui) project’s model and code for music source separation.
* Denoiser : DFNet3 and Facebook's `denoiser`
* VAD : Using [webrtcvad](https://github.com/wiseman/py-webrtcvad)
* Diarization : Using speaker diarization from [pyannote-audio](https://github.com/pyannote/pyannote-audio)
* STT : Using OpenAI's [whisper](https://github.com/openai/whisper) and `faster-whisper`
## Setup
This library was developed using Python 3.10, and we recommend using Python versions 3.8 to 3.10 for compatibility.
While the library is compatible with both Linux and Windows, all testing was conducted on Linux.
If you encounter any issues or errors, please feel free to open an issue.
Before running the library, please ensure the following are installed:
### PyTorch
We highly recommend using a GPU to optimize performance. For PyTorch installation, please follow the commands below to ensure compatibility with your GPU.
```
# Example for installing PyTorch with CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
### ffmpeg
[ffmpeg](https://ffmpeg.org/) is required for audio processing tasks within this library. Please ensure it is installed and accessible from your system’s PATH.
To install ffmpeg:
#### Windows
Download the latest FFmpeg release from [FFmpeg’s official website](https://ffmpeg.org/download.html), and add the bin folder to your system’s PATH.
#### Linux
Use the following command to install FFmpeg:
```
sudo apt update
sudo apt install ffmpeg
```
After installation, you can verify it by running:
```
ffmpeg -version
```
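You can also check for ffmpeg programmatically before starting a long preprocessing run. This is a small sketch using only the standard library (not part of PAFST's API):

```python
import shutil
import subprocess

def check_ffmpeg() -> bool:
    """Return True if ffmpeg is on PATH and executes successfully."""
    if shutil.which("ffmpeg") is None:
        return False
    result = subprocess.run(
        ["ffmpeg", "-version"], capture_output=True, text=True
    )
    return result.returncode == 0

print("ffmpeg available:", check_ffmpeg())
```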
### HuggingFace Access Token (required for diarization)
To enable diarization functionality, please complete the following steps:
1. Accept [`pyannote/segmentation-3.0`](https://huggingface.co/pyannote/segmentation-3.0) user conditions
2. Accept [`pyannote/speaker-diarization-3.1`](https://huggingface.co/pyannote/speaker-diarization-3.1) user conditions
3. Create access token at [`hf.co/settings/tokens`](https://huggingface.co/login?next=%2Fsettings%2Ftokens).
```
from pafst.pafts import PAFST

p = PAFST(
    path='your_audio_directory_path',
    output_path='output_path',
    hf_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE"
)
```
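Rather than hardcoding the token in your script, one option is to read it from an environment variable. The variable name `HF_TOKEN` here is only a common convention, not something PAFST requires:

```python
import os

# Assumes the token was exported beforehand, e.g. `export HF_TOKEN=hf_...`
hf_token = os.environ.get("HF_TOKEN", "")
if not hf_token:
    print("Warning: HF_TOKEN is not set; diarization will be unavailable.")
```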
After completing the setup steps above, you can install this library by running
```
pip install pafst
```
## Usage
```
from pafst import PAFST
p = PAFST(
    path='your_audio_directory_path',
    output_path='output_path',
    hf_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE"  # required if you use diarization
)
# Separator
p.separator() # or
p.denoiser(processor="dfn") # use "den" for facebook's denoiser
p.vad() # voice-activity-detection using webrtcvad
# Diarization
p.diarization()
# STT
p.stt(model_size='small')
# One-Click Process
p.run()
```
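Once `asr.json` has been produced, converting it into a training manifest is straightforward. The sketch below writes an LJSpeech-style `metadata.csv` (`clip_id|text`); PAFST does not ship this converter, and the inline sample entry mirrors the `asr.json` layout shown earlier:

```python
import csv
import json
from pathlib import Path

# Sample entries in the asr.json layout shown earlier.
# In practice: entries = json.load(open("processed_audio/asr.json"))
entries = [
    {
        "asr_text": "Let's talk about music.",
        "audio_filepath": "/processed_audio/speaker_SPEAKER_00/SPEAKER_00_0.wav",
        "language": "en",
    },
]

# Write a pipe-delimited metadata file (LJSpeech convention: id|text).
with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="|")
    for e in entries:
        clip_id = Path(e["audio_filepath"]).stem  # e.g. SPEAKER_00_0
        writer.writerow([clip_id, e["asr_text"].strip()])
```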
## TODO
- [ ] Command line
- [ ] Clean logging
- [ ] Separator with Model Selection
## References
* [PAFTS](https://github.com/harmlessman/PAFTS) for base code
* [Paper](https://arxiv.org/pdf/2409.05356) for DFNet3 use case
## License
The code of **PAFST** is [MIT-licensed](LICENSE).