| Field | Value |
|-------|-------|
| Name | vanpy |
| Version | 0.92.14 |
| home_page | None |
| Summary | VANPY - Voice Analysis framework in Python |
| upload_time | 2025-02-21 22:41:50 |
| maintainer | None |
| docs_url | None |
| author | Gregory Koushnir |
| requires_python | >=3.8 |
| license | Apache License v2.0 |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
# VANPY
**VANPY** (Voice Analysis Python) is a flexible and extensible framework for voice analysis, feature extraction, and classification. It provides a modular pipeline architecture for processing audio segments with near- and state-of-the-art deep learning models.

<!-- ## Quick Start
Try VANPY in Google Colab:
- [](https://colab.research.google.com/github/griko/VANPY/blob/main/examples/VANPY_example.ipynb)
Basic VANPY capabilities demo
- [](https://colab.research.google.com/github/griko/VANPY/blob/main/examples/using_VANPY_to_classify_emotions_on_RAVDESS_dataset.ipynb)
Emotion classification on RAVDESS dataset
-->
## Architecture
**VANPY** consists of three optional pipelines that can be used independently or in combination:
1. **Preprocessing Pipeline**: Handles audio format conversion and voice segment extraction
2. **Feature Extraction Pipeline**: Generates feature/latent vectors from voice segments
3. **Model Inference Pipeline**: Applies classification, regression, and clustering models to the extracted features
You can use these pipelines flexibly based on your needs:
- Use only preprocessing for voice separation
- Combine preprocessing and classification for direct audio analysis
- Use all pipelines for complete feature extraction and classification
## Models Trained as part of the VANPY project
<table>
<tr>
<th>Task</th>
<th>Dataset</th>
<th>Performance</th>
</tr>
<tr>
<td rowspan="3">Gender Identification (Accuracy)</td>
<td>VoxCeleb2</td>
<td>98.9%</td>
</tr>
<tr>
<td>Mozilla Common Voice v10.0</td>
<td>92.3%</td>
</tr>
<tr>
<td>TIMIT</td>
<td>99.6%</td>
</tr>
<tr>
<td rowspan="2">Emotion Recognition (Accuracy)</td>
<td>RAVDESS (8-class)</td>
<td>84.71%</td>
</tr>
<tr>
<td>RAVDESS (7-class)</td>
<td>86.24%</td>
</tr>
<tr>
<td rowspan="3">Age Estimation (MAE in years)</td>
<td>VoxCeleb2</td>
<td>7.88</td>
</tr>
<tr>
<td>TIMIT</td>
<td>4.95</td>
</tr>
<tr>
<td>Combined VoxCeleb2-TIMIT</td>
<td>6.93</td>
</tr>
<tr>
<td rowspan="2">Height Estimation (MAE in cm)</td>
<td>VoxCeleb2</td>
<td>6.01</td>
</tr>
<tr>
<td>TIMIT</td>
<td>6.02</td>
</tr>
</table>
All of the models can be used as part of the VANPY pipeline or standalone, and are available on 🤗 [HuggingFace](https://huggingface.co/griko).
## Configuration
### Environment Setup
1. Create a `pipeline.yaml` configuration file. You can use `src/pipeline.yaml` as a template.
2. For HuggingFace models (Pyannote components), create a `.env` file (a token-loading sketch follows these steps):
```
huggingface_ACCESS_TOKEN=<your_token>
```
3. Pipeline examples are available in `src/run.py`.
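At runtime the token from the `.env` file has to be made available to the Pyannote components. Below is a minimal sketch of reading it with the `python-dotenv` package; the package choice is an assumption, only the variable name shown above is prescribed.

```python
# Minimal sketch: read the HuggingFace token from the .env file created above.
# Using python-dotenv here is an assumption about tooling, not a VANPY requirement;
# the variable name matches the .env entry shown in step 2.
import os

from dotenv import load_dotenv

load_dotenv()  # loads .env from the current working directory
hf_token = os.getenv("huggingface_ACCESS_TOKEN")
if hf_token is None:
    raise RuntimeError("huggingface_ACCESS_TOKEN is not set in .env")
```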
## Components
Each component receives a `ComponentPayload` object as input and returns one as output.
Each component supports:
- Batch processing (if applicable)
- Progress tracking
- Performance monitoring and logging
- Incremental processing (skip already processed files)
- GPU acceleration where applicable
- Configurable parameters
### Preprocessing Components
| Component | Description |
|-----------|-------------|
| **Filelist-DataFrame Creator** | Initializes data pipeline by creating a DataFrame of audio file paths. Supports both directory scanning and loading from existing CSV files. Manages path metadata for downstream components. |
| **WAV Converter** | Standardizes audio format to WAV with configurable parameters including bit rate (default: 256k), channels (default: mono), sample rate (default: 16kHz), and codec (default: PCM 16-bit). Uses FFMPEG for robust conversion. A rough FFMPEG-equivalent command is sketched after this table. |
| **WAV Splitter** | Handles large audio files by splitting them into manageable segments based on either duration or file size limits. Maintains audio quality and creates properly labeled segments with original file references. |
| **INA Voice Separator** | Separates audio into voice and non-voice segments, distinguishing between male and female speakers. Filters out non-speech content while preserving speaker gender information. |
| **Pyannote VAD** | Performs Voice Activity Detection using Pyannote's state-of-the-art deep learning model. Identifies and extracts speech segments with configurable sensitivity. |
| **Silero VAD** | Alternative Voice Activity Detection using Silero's efficient model. Optimized for real-time performance with customizable parameters. |
| **Pyannote SD** | Speaker Diarization component that identifies and separates different speakers in audio. Creates individual segments for each speaker with timing information. Supports overlapping speech handling. |
| **MetricGAN SE** | Speech Enhancement using MetricGAN+ model from SpeechBrain. Reduces background noise and improves speech clarity. |
| **SepFormer SE** | Speech Enhancement using SepFormer model, specialized in separating speech from complex background noise. |
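For reference, the WAV Converter defaults listed above (mono, 16 kHz, 16-bit PCM, which works out to the 256k bit rate) can be reproduced standalone with FFMPEG. The snippet below is a hedged sketch of such a conversion, not the component's internal invocation; the file names are placeholders.

```python
# Sketch of an FFMPEG call matching the WAV Converter defaults described above:
# mono, 16 kHz sample rate, 16-bit PCM (16 bits * 16000 Hz * 1 channel = 256 kbit/s).
# This mirrors the documented defaults, not necessarily the component's exact command.
import subprocess


def to_wav(src: str, dst: str) -> None:
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", src,
            "-ac", "1",           # mono
            "-ar", "16000",       # 16 kHz sample rate
            "-c:a", "pcm_s16le",  # 16-bit PCM codec
            dst,
        ],
        check=True,
    )


to_wav("input.mp3", "output.wav")  # placeholder file names
```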
### Feature Extraction Components
| Component | Description |
|-----------|-------------|
| **Librosa Features Extractor** | Comprehensive audio feature extraction using the Librosa library. Supports multiple feature types including: MFCC (Mel-frequency cepstral coefficients), Delta-MFCC, zero-crossing rate, spectral features (centroid, bandwidth, contrast, flatness), fundamental frequency (F0), and tonnetz. A standalone sketch of these feature families follows this table. |
| **Pyannote Embedding** | Generates speaker embeddings using Pyannote's deep learning models. Uses sliding window analysis with configurable duration and step size. Outputs high-dimensional embeddings optimized for speaker differentiation. |
| **SpeechBrain Embedding** | Extracts neural embeddings using SpeechBrain's pretrained models, particularly the ECAPA-TDNN architecture (default: spkrec-ecapa-voxceleb). |
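To give a concrete picture of the feature families the Librosa Features Extractor covers, here is a standalone sketch using the `librosa` API directly. The input file, sample rate, and parameter values are placeholders; the component's own defaults and output layout may differ.

```python
# Standalone sketch of the feature families listed for the Librosa Features
# Extractor. File name and parameters are placeholders, not VANPY defaults.
import librosa
import numpy as np

y, sr = librosa.load("segment.wav", sr=16000)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # MFCC
delta_mfcc = librosa.feature.delta(mfcc)                    # Delta-MFCC
zcr = librosa.feature.zero_crossing_rate(y)                 # zero-crossing rate
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)    # spectral centroid
bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)  # spectral bandwidth
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)    # spectral contrast
flatness = librosa.feature.spectral_flatness(y=y)           # spectral flatness
f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)               # fundamental frequency (F0)
tonnetz = librosa.feature.tonnetz(y=y, sr=sr)               # tonnetz

# Collapse frame-level features into one fixed-length vector per segment,
# e.g. by averaging over time frames.
feature_vector = np.concatenate(
    [np.atleast_1d(f.mean(axis=-1)).ravel()
     for f in (mfcc, delta_mfcc, zcr, centroid, bandwidth, contrast, flatness, f0, tonnetz)]
)
```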
### Model Inference Components
| Component | Description |
|-----------|-------------|
| **VanpyGender Classifier** | SVM-based binary gender classification using speech embeddings. Supports two models: ECAPA-TDNN (192-dim) and XVECT (512-dim) embeddings from SpeechBrain. Trained on VoxCeleb2 dataset with optimized hyperparameters. Provides both verbal ('female'/'male') and numeric label options. |
| **VanpyAge Regressor** | Multi-architecture age estimation supporting SVR and ANN models. Features multiple variants: pure SpeechBrain embeddings (192-dim), combined SpeechBrain and Librosa features (233-dim), and dataset-specific models (VoxCeleb2/TIMIT). |
| **VanpyEmotion Classifier** | 7-class SVM emotion classifier trained on RAVDESS dataset using SpeechBrain embeddings. Classifies emotions into: angry, disgust, fearful, happy, neutral/calm, sad, surprised. |
| **IEMOCAP Emotion** | SpeechBrain-based emotion classifier trained on the IEMOCAP dataset. Uses Wav2Vec2 for feature extraction. Supports four emotion classes: angry, happy, neutral, sad. |
| **Wav2Vec2 ADV** | Advanced emotion analysis using Wav2Vec2, providing continuous scores for arousal, dominance, and valence dimensions. |
| **Wav2Vec2 STT** | Speech-to-text transcription using Facebook's Wav2Vec2 model. |
| **Whisper STT** | OpenAI's Whisper model for robust speech recognition. Supports multiple model sizes and languages. Includes automatic language detection. |
| **Cosine Distance Clusterer** | Clustering method that can be used for speaker diarization, based on cosine similarity between speaker embeddings. Groups speech segments by speaker identity; a standalone clustering sketch follows this table. |
| **GMM Clusterer** | Gaussian Mixture Model-based speaker clustering. |
| **Agglomerative Clusterer** | Hierarchical clustering for speaker diarization. Uses distance-based merging with configurable threshold and maximum clusters. |
| **YAMNet Classifier** | Google's YAMNet model for general audio classification. Supports 521 audio classes from AudioSet ontology. |
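As an illustration of the clustering-based diarization components above, the grouping step can be approximated standalone with scikit-learn: agglomerative clustering over a cosine distance between speaker embeddings. Random vectors stand in for real embeddings, and the threshold is a placeholder rather than a VANPY default.

```python
# Standalone sketch of cosine-distance agglomerative clustering over speaker
# embeddings, approximating the Cosine Distance / Agglomerative Clusterer idea.
# Random vectors stand in for real embeddings; the threshold is a placeholder.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 192))  # e.g. one ECAPA-TDNN-sized embedding per segment

clusterer = AgglomerativeClustering(
    n_clusters=None,         # let the distance threshold decide the number of speakers
    metric="cosine",         # use affinity="cosine" on scikit-learn < 1.2
    linkage="average",
    distance_threshold=0.5,  # placeholder merging threshold
)
speaker_labels = clusterer.fit_predict(embeddings)
print(speaker_labels)        # one cluster id (pseudo speaker label) per segment
```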
## ComponentPayload Structure
The `ComponentPayload` class manages data flow between pipeline components:
```python
class ComponentPayload:
    metadata: Dict    # Pipeline metadata
    df: pd.DataFrame  # Processing results
```
### Metadata fields
- `input_path`: Path to the input directory (required for `FilelistDataFrameCreator` if no `df` is provided)
- `paths_column`: Column name for audio file paths
- `all_paths_columns`: List of all path columns
- `feature_columns`: List of feature columns
- `meta_columns`: List of metadata columns
- `classification_columns`: List of classification columns
### df fields
- `df`: pd.DataFrame

  Accumulates all of the information collected throughout preprocessing and classification:
  - each preprocessor adds a column of paths pointing to the processed files
  - embedding/feature-extraction components add embedding/feature columns
  - each model adds a column with its results
### Key Methods
- `get_features_df()`: Extract a DataFrame of the feature columns
- `get_classification_df()`: Extract a DataFrame of the classification (model output) columns; see the usage sketch below
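Here is a hypothetical end-to-end sketch of building a payload and reading results back. The metadata keys mirror the fields documented above, but the import path and constructor signature are assumptions rather than the confirmed API; check the vanpy sources (e.g. `src/run.py`) for the authoritative usage.

```python
# Hypothetical sketch of constructing and consuming a ComponentPayload.
# The metadata keys mirror the documented fields; the import path and
# constructor signature are assumptions, not the confirmed VANPY API.
import pandas as pd
from vanpy.core.ComponentPayload import ComponentPayload  # assumed import path

df = pd.DataFrame({"audio_path": ["clips/a.wav", "clips/b.wav"]})  # placeholder paths
payload = ComponentPayload(
    metadata={
        "paths_column": "audio_path",
        "all_paths_columns": ["audio_path"],
        "feature_columns": [],
        "meta_columns": [],
        "classification_columns": [],
    },
    df=df,
)

# After preprocessing / feature-extraction / inference components have run,
# results can be pulled out with the documented key methods:
features_df = payload.get_features_df()
predictions_df = payload.get_classification_df()
```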
## Coming Soon
- Custom classifier integration guide
- Additional preprocessing components
- Extended model support
- Support for newer Python and dependency versions
## Citing VANPY
Please cite VANPY if you use it:
```bibtex
@misc{vanpy,
  title={VANPY: Voice Analysis Framework},
  author={Gregory Koushnir and Michael Fire and Galit Fuhrmann Alpert and Dima Kagan},
  year={2025},
  eprint={TBD},
  archivePrefix={arXiv},
  primaryClass={TBD},
  note={arXiv:TBD}
}
```
Raw data
{
"_id": null,
"home_page": null,
"name": "vanpy",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": null,
"author": "Gregory Koushnir",
"author_email": "koushgre@post.bgu.ac.il",
"download_url": "https://files.pythonhosted.org/packages/a6/b0/a49fe4a4f3f4e812fe55d15df86b629a42d91fc0d3426fb7cb73a8566025/vanpy-0.92.14.tar.gz",
"platform": "unix",
"description": "# VANPY \r\n**VANPY** (Voice Analysis Python) is a flexible and extensible framework for voice analysis, feature extraction, and classification. It provides a modular pipeline architecture for processing audio segments with near- and state-of-the-art deep learning models.\r\n\r\n\r\n\r\n<!-- ## Quick Start\r\nTry VANPY in Google Colab:\r\n\r\n- [](https://colab.research.google.com/github/griko/VANPY/blob/main/examples/VANPY_example.ipynb)\r\n \r\n Basic VANPY capabilities demo\r\n\r\n- [](https://colab.research.google.com/github/griko/VANPY/blob/main/examples/using_VANPY_to_classify_emotions_on_RAVDESS_dataset.ipynb)\r\n\r\n Emotion classification on RAVDESS dataset\r\n -->\r\n\r\n\r\n## Architecture\r\n**VANPY** consists of three optional pipelines that can be used independently or in combination:\r\n\r\n1. **Preprocessing Pipeline**: Handles audio format conversion and voice segment extraction\r\n2. **Feature Extraction Pipeline**: Generates feature/latent vectors from voice segments\r\n3. **Model Inference Pipeline**\r\n\r\nYou can use these pipelines flexibly based on your needs:\r\n\r\n- Use only preprocessing for voice separation\r\n- Combine preprocessing and classification for direct audio analysis\r\n- Use all pipelines for complete feature extraction and classification \r\n\r\n## Models Trained as part of the VANPY project\r\n\r\n<table>\r\n <tr>\r\n <th>Task</th>\r\n <th>Dataset</th>\r\n <th>Performance</th>\r\n </tr>\r\n <tr>\r\n <td rowspan=\"3\">Gender Identification (Accuracy)</td>\r\n <td>VoxCeleb2</td>\r\n <td>98.9%</td>\r\n </tr>\r\n <tr>\r\n <td>Mozilla Common Voice v10.0</td>\r\n <td>92.3%</td>\r\n </tr>\r\n <tr>\r\n <td>TIMIT</td>\r\n <td>99.6%</td>\r\n </tr>\r\n <tr>\r\n <td rowspan=\"2\">Emotion Recognition (Accuracy)</td>\r\n <td>RAVDESS (8-class)</td>\r\n <td>84.71%</td>\r\n </tr>\r\n <tr>\r\n <td>RAVDESS (7-class)</td>\r\n <td>86.24%</td>\r\n </tr>\r\n <tr>\r\n <td rowspan=\"3\">Age Estimation (MAE in years)</td>\r\n <td>VoxCeleb2</td>\r\n <td>7.88</td>\r\n </tr>\r\n <tr>\r\n <td>TIMIT</td>\r\n <td>4.95</td>\r\n </tr>\r\n <tr>\r\n <td>Combined VoxCeleb2-TIMIT</td>\r\n <td>6.93</td>\r\n </tr>\r\n <tr>\r\n <td rowspan=\"2\">Height Estimation (MAE in cm)</td>\r\n <td>VoxCeleb2</td>\r\n <td>6.01</td>\r\n </tr>\r\n <tr>\r\n <td>TIMIT</td>\r\n <td>6.02</td>\r\n </tr>\r\n</table>\r\n\r\nAll of the models can be used as a part of the VANPY pipeline or separately and are available on \u00f0\u0178\u00a4\u2014[HuggingFace](https://huggingface.co/griko)\r\n\r\n\r\n## Configuration\r\n### Environment Setup\r\n\r\n1. Create a `pipeline.yaml` configuration file. You can use the `src/pipeline.yaml` as a template.\r\n2. For HuggingFace models (Pyannote components), create a `.env` file:\r\n```\r\nhuggingface_ACCESS_TOKEN=<your_token>\r\n```\r\n3. Pipelines examples are available in `src/run.py`.\r\n\r\n## Components\r\nEach component expects as an input and returns as an output a `ComponentPayload` object.\r\n\r\nEach component supports:\r\n- Batch processing (if applicable)\r\n- Progress tracking\r\n- Performance monitoring and logging\r\n- Incremental processing (skip already processed files)\r\n- GPU acceleration where applicable\r\n- Configurable parameters\r\n\r\n### Preprocessing Components\r\n\r\n| Component | Description |\r\n|-----------|-------------|\r\n| **Filelist-DataFrame Creator** | Initializes data pipeline by creating a DataFrame of audio file paths. Supports both directory scanning and loading from existing CSV files. 
Manages path metadata for downstream components. |\r\n| **WAV Converter** | Standardizes audio format to WAV with configurable parameters including bit rate (default: 256k), channels (default: mono), sample rate (default: 16kHz), and codec (default: PCM 16-bit). Uses FFMPEG for robust conversion. |\r\n| **WAV Splitter** | Handles large audio files by splitting them into manageable segments based on either duration or file size limits. Maintains audio quality and creates properly labeled segments with original file references. |\r\n| **INA Voice Separator** | Separates audio into voice and non-voice segments, distinguishing between male and female speakers. Filters out non-speech content while preserving speaker gender information. |\r\n| **Pyannote VAD** | Performs Voice Activity Detection using Pyannote's state-of-the-art deep learning model. Identifies and extracts speech segments with configurable sensitivity.\r\n| **Silero VAD** | Alternative Voice Activity Detection using Silero's efficient model. Optimized for real-time performance with customizable parameters. |\r\n| **Pyannote SD** | Speaker Diarization component that identifies and separates different speakers in audio. Creates individual segments for each speaker with timing information. Supports overlapping speech handling. |\r\n| **MetricGAN SE** | Speech Enhancement using MetricGAN+ model from SpeechBrain. Reduces background noise and improves speech clarity. |\r\n| **SepFormer SE** | Speech Enhancement using SepFormer model, specialized in separating speech from complex background noise. |\r\n\r\n### Feature Extraction Components\r\n\r\n| Component | Description |\r\n|-----------|-------------|\r\n| **Librosa Features Extractor** | Comprehensive audio feature extraction using the Librosa library. Supports multiple feature types including: MFCC (Mel-frequency cepstral coefficients), Delta-MFCC, zero-crossing rate, spectral features (centroid, bandwidth, contrast, flatness), fundamental frequency (F0), and tonnetz. |\r\n| **Pyannote Embedding** | Generates speaker embeddings using Pyannote's deep learning models. Uses sliding window analysis with configurable duration and step size. Outputs high-dimensional embeddings optimized for speaker differentiation. |\r\n| **SpeechBrain Embedding** | Extracts neural embeddings using SpeechBrain's pretrained models, particularly the ECAPA-TDNN architecture (default: spkrec-ecapa-voxceleb). |\r\n\r\n### Model Inference Components\r\n\r\n| Component | Description |\r\n|-----------|-------------|\r\n| **VanpyGender Classifier** | SVM-based binary gender classification using speech embeddings. Supports two models: ECAPA-TDNN (192-dim) and XVECT (512-dim) embeddings from SpeechBrain. Trained on VoxCeleb2 dataset with optimized hyperparameters. Provides both verbal ('female'/'male') and numeric label options. |\r\n| **VanpyAge Regressor** | Multi-architecture age estimation supporting SVR and ANN models. Features multiple variants: pure SpeechBrain embeddings (192-dim), combined SpeechBrain and Librosa features (233-dim), and dataset-specific models (VoxCeleb2/TIMIT). |\r\n| **VanpyEmotion Classifier** | 7-class SVM emotion classifier trained on RAVDESS dataset using SpeechBrain embeddings. Classifies emotions into: angry, disgust, fearful, happy, neutral/calm, sad, surprised. |\r\n| **IEMOCAP Emotion** | SpeechBrain-based emotion classifier trained on the IEMOCAP dataset. Uses Wav2Vec2 for feature extraction. Supports four emotion classes: angry, happy, neutral, sad. 
|\r\n| **Wav2Vec2 ADV** | Advanced emotion analysis using Wav2Vec2, providing continuous scores for arousal, dominance, and valence dimensions. |\r\n| **Wav2Vec2 STT** | Speech-to-text transcription using Facebook's Wav2Vec2 model. |\r\n| **Whisper STT** | OpenAI's Whisper model for robust speech recognition. Supports multiple model sizes and languages. Includes automatic language detection. |\r\n| **Cosine Distance Clusterer** | a Clustering method that can be used for speaker diarization using cosine similarity metrics. Groups speech segments by speaker identity using embedding similarity. |\r\n| **GMM Clusterer** | Gaussian Mixture Model-based speaker clustering. |\r\n| **Agglomerative Clusterer** | Hierarchical clustering for speaker diarization. Uses distance-based merging with configurable threshold and maximum clusters. |\r\n| **YAMNet Classifier** | Google's YAMNet model for general audio classification. Supports 521 audio classes from AudioSet ontology. |\r\n\r\n\r\n## ComponentPayload Structure\r\nThe `ComponentPayload` class manages data flow between pipeline components:\r\n```\r\nclass ComponentPayload:\r\n metadata: Dict # Pipeline metadata\r\n df: pd.DataFrame # Processing results\r\n``` \r\n### Metadata fields\r\n- `input_path`: Path to the input directory (required for `FilelistDataFrameCreator` if no `df` is provided)\r\n- `paths_column`: Column name for audio file paths\r\n- `all_paths_columns`: List of all path columns\r\n- `feature_columns`: List of feature columns\r\n- `meta_columns`: List of metadata columns\r\n- `classification_columns`: List of classification columns\r\n\r\n### df fields\r\n- `df`: pd.DataFrame\r\n \r\n Includes all the collected information through the preprocessing and classification\r\n - each preprocessor adds a column of paths where the processed files are hold\r\n - embedding/feature extraction components add the embedding/features columns\r\n - each model adds a model-results column\r\n\r\n### Key Methods\r\n- `get_features_df()`: Extract features DataFrame\r\n- `get_classification_df()`: Extract classification results DataFrame\r\n\r\n\r\n## Coming Soon\r\n- Custom classifier integration guide\r\n- Additional preprocessing components\r\n- Extended model support\r\n- Newer python and dependencies version support\r\n\r\n## Citing VANPY\r\nPlease, cite VANPY if you use it\r\n\r\n```bibtex\r\n@misc{vanpy,\r\n title={VANPY: Voice Analysis Framework},\r\n author={Gregory Koushnir, Michael Fire, Galit Fuhrmann Alpert, Dima Kagan},\r\n year={2025},\r\n eprint={TBD},\r\n archivePrefix={arXiv},\r\n primaryClass={TBD},\r\n note={arXiv:TBD}\r\n}\r\n```\r\n",
"bugtrack_url": null,
"license": "Apache License v2.0",
"summary": "VANPY - Voice Analysis framework in Python",
"version": "0.92.14",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "f6e7b7e5d335a763957871359cd04001ba6e81a584fc131547b11fc0fbdaa152",
"md5": "bca64239e331d1b83d90171224de6262",
"sha256": "1360a32d6b972e0cd7e86d5460f3f1fcfaff0643766634265e82e21b16238860"
},
"downloads": -1,
"filename": "vanpy-0.92.14-py3-none-any.whl",
"has_sig": false,
"md5_digest": "bca64239e331d1b83d90171224de6262",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 115471,
"upload_time": "2025-02-21T22:41:48",
"upload_time_iso_8601": "2025-02-21T22:41:48.953198Z",
"url": "https://files.pythonhosted.org/packages/f6/e7/b7e5d335a763957871359cd04001ba6e81a584fc131547b11fc0fbdaa152/vanpy-0.92.14-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "a6b0a49fe4a4f3f4e812fe55d15df86b629a42d91fc0d3426fb7cb73a8566025",
"md5": "f577bd522a456d06a7267b5b72a5f3d8",
"sha256": "230abe36a67777674862928cf40da0d06aa3cd5d78f355814a6b0d0942e52d99"
},
"downloads": -1,
"filename": "vanpy-0.92.14.tar.gz",
"has_sig": false,
"md5_digest": "f577bd522a456d06a7267b5b72a5f3d8",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 48801,
"upload_time": "2025-02-21T22:41:50",
"upload_time_iso_8601": "2025-02-21T22:41:50.695632Z",
"url": "https://files.pythonhosted.org/packages/a6/b0/a49fe4a4f3f4e812fe55d15df86b629a42d91fc0d3426fb7cb73a8566025/vanpy-0.92.14.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-02-21 22:41:50",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "vanpy"
}