vanpy 0.92.14

- Summary: VANPY - Voice Analysis framework in Python
- Author: Gregory Koushnir
- License: Apache License v2.0
- Requires Python: >=3.8
- Uploaded: 2025-02-21 22:41:50
# VANPY
**VANPY** (Voice Analysis Python) is a flexible and extensible framework for voice analysis, feature extraction, and classification. It provides a modular pipeline architecture for processing audio segments with near- and state-of-the-art deep learning models.

![VANPY](https://github.com/griko/vanpy/raw/main/images/VANPY_architecture.png)

<!-- ## Quick Start
Try VANPY in Google Colab:

- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/griko/VANPY/blob/main/examples/VANPY_example.ipynb)
  
  Basic VANPY capabilities demo

- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/griko/VANPY/blob/main/examples/using_VANPY_to_classify_emotions_on_RAVDESS_dataset.ipynb)

  Emotion classification on RAVDESS dataset
 -->


## Architecture
**VANPY** consists of three optional pipelines that can be used independently or in combination:

1. **Preprocessing Pipeline**: Handles audio format conversion and voice segment extraction
2. **Feature Extraction Pipeline**: Generates feature/latent vectors from voice segments
3. **Model Inference Pipeline**: Applies pretrained models (classifiers, regressors, clusterers, and speech-to-text) to the extracted features or voice segments

You can use these pipelines flexibly based on your needs:

- Use only preprocessing for voice separation
- Combine preprocessing and classification for direct audio analysis
- Use all pipelines for complete feature extraction and classification 
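
For working end-to-end examples, see `src/run.py`. Purely as an illustrative sketch (the class, helper, and component names below are assumptions, not the verified API), assembling a combined pipeline might look like:

```python
# Illustrative sketch only -- class, helper, and component names are
# assumptions; consult src/run.py in the repository for working examples.
from vanpy.utils.utils import load_config                 # assumed helper
from vanpy.core.CombinedPipeline import CombinedPipeline  # assumed class

config = load_config('pipeline.yaml')

# Chain preprocessing -> feature extraction -> model inference components
pipeline = CombinedPipeline(
    ['file_mapper', 'wav_converter', 'pyannote_vad',  # preprocessing
     'speechbrain_embedding',                         # feature extraction
     'vanpy_gender'],                                 # model inference
    config=config,
)
payload = pipeline.process()  # returns a ComponentPayload (see below)
```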

## Models Trained as Part of the VANPY Project

<table>
  <tr>
    <th>Task</th>
    <th>Dataset</th>
    <th>Performance</th>
  </tr>
  <tr>
    <td rowspan="3">Gender Identification (Accuracy)</td>
    <td>VoxCeleb2</td>
    <td>98.9%</td>
  </tr>
  <tr>
    <td>Mozilla Common Voice v10.0</td>
    <td>92.3%</td>
  </tr>
  <tr>
    <td>TIMIT</td>
    <td>99.6%</td>
  </tr>
  <tr>
    <td rowspan="2">Emotion Recognition (Accuracy)</td>
    <td>RAVDESS (8-class)</td>
    <td>84.71%</td>
  </tr>
  <tr>
    <td>RAVDESS (7-class)</td>
    <td>86.24%</td>
  </tr>
  <tr>
    <td rowspan="3">Age Estimation (MAE in years)</td>
    <td>VoxCeleb2</td>
    <td>7.88</td>
  </tr>
  <tr>
    <td>TIMIT</td>
    <td>4.95</td>
  </tr>
  <tr>
    <td>Combined VoxCeleb2-TIMIT</td>
    <td>6.93</td>
  </tr>
  <tr>
    <td rowspan="2">Height Estimation (MAE in cm)</td>
    <td>VoxCeleb2</td>
    <td>6.01</td>
  </tr>
  <tr>
    <td>TIMIT</td>
    <td>6.02</td>
  </tr>
</table>

All of these models can be used as part of the VANPY pipeline or standalone, and are available on 🤗[HuggingFace](https://huggingface.co/griko).
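
One plausible way to use a trained model outside the pipeline is to fetch its artifact from the Hub. A sketch, with placeholder repo and file names and the assumption of a joblib-serialized scikit-learn model (check https://huggingface.co/griko for the actual repositories and formats):

```python
# Sketch only: repo_id and filename are placeholders, and the artifact is
# assumed to be a scikit-learn model serialized with joblib.
import joblib
import numpy as np
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="griko/<model-repo>", filename="<model-file>")
model = joblib.load(model_path)

dummy_embedding = np.zeros((1, 192))  # e.g., one 192-dim ECAPA-TDNN embedding
prediction = model.predict(dummy_embedding)
```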


## Configuration
### Environment Setup

1. Create a `pipeline.yaml` configuration file. You can use `src/pipeline.yaml` as a template.
2. For HuggingFace models (Pyannote components), create a `.env` file:
```
huggingface_ACCESS_TOKEN=<your_token>
```
3. Pipeline examples are available in `src/run.py`.
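
As an example of how the token can be read at runtime, here is a minimal sketch using `python-dotenv` (an assumed utility for this illustration, not a stated VANPY dependency):

```python
# Minimal sketch: read the HuggingFace token from .env with python-dotenv
# (an assumed utility, not necessarily what VANPY uses internally).
import os

from dotenv import load_dotenv

load_dotenv()  # populates os.environ from the .env file in the working directory
hf_token = os.getenv("huggingface_ACCESS_TOKEN")
```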

## Components
Each component takes a `ComponentPayload` object as input and returns one as output.

Each component supports:
- Batch processing (if applicable)
- Progress tracking
- Performance monitoring and logging
- Incremental processing (skip already processed files)
- GPU acceleration where applicable
- Configurable parameters

### Preprocessing Components

| Component | Description |
|-----------|-------------|
| **Filelist-DataFrame Creator** | Initializes data pipeline by creating a DataFrame of audio file paths. Supports both directory scanning and loading from existing CSV files. Manages path metadata for downstream components. |
| **WAV Converter** | Standardizes audio format to WAV with configurable parameters including bit rate (default: 256k), channels (default: mono), sample rate (default: 16kHz), and codec (default: PCM 16-bit). Uses FFMPEG for robust conversion. |
| **WAV Splitter** | Handles large audio files by splitting them into manageable segments based on either duration or file size limits. Maintains audio quality and creates properly labeled segments with original file references. |
| **INA Voice Separator** | Separates audio into voice and non-voice segments, distinguishing between male and female speakers. Filters out non-speech content while preserving speaker gender information. |
| **Pyannote VAD** | Performs Voice Activity Detection using Pyannote's state-of-the-art deep learning model. Identifies and extracts speech segments with configurable sensitivity. |
| **Silero VAD** | Alternative Voice Activity Detection using Silero's efficient model. Optimized for real-time performance with customizable parameters. |
| **Pyannote SD** | Speaker Diarization component that identifies and separates different speakers in audio. Creates individual segments for each speaker with timing information. Supports overlapping speech handling. |
| **MetricGAN SE** | Speech Enhancement using MetricGAN+ model from SpeechBrain. Reduces background noise and improves speech clarity. |
| **SepFormer SE** | Speech Enhancement using SepFormer model, specialized in separating speech from complex background noise. |
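
To illustrate what a preprocessing component wraps, a standalone Voice Activity Detection run with `pyannote.audio` looks roughly like this (a sketch of the underlying library call, not the component's actual code; it needs the HuggingFace token from the `.env` setup above):

```python
# Sketch of the underlying pyannote.audio call -- not VANPY component code.
from pyannote.audio import Pipeline

vad = Pipeline.from_pretrained(
    "pyannote/voice-activity-detection",
    use_auth_token="<your_token>",  # the HuggingFace token from .env
)
speech = vad("audio.wav")  # pyannote Annotation of detected speech regions
for segment, _, label in speech.itertracks(yield_label=True):
    print(f"{label}: {segment.start:.2f}s - {segment.end:.2f}s")
```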

### Feature Extraction Components

| Component | Description |
|-----------|-------------|
| **Librosa Features Extractor** | Comprehensive audio feature extraction using the Librosa library. Supports multiple feature types including: MFCC (Mel-frequency cepstral coefficients), Delta-MFCC, zero-crossing rate, spectral features (centroid, bandwidth, contrast, flatness), fundamental frequency (F0), and tonnetz. |
| **Pyannote Embedding** | Generates speaker embeddings using Pyannote's deep learning models. Uses sliding window analysis with configurable duration and step size. Outputs high-dimensional embeddings optimized for speaker differentiation. |
| **SpeechBrain Embedding** | Extracts neural embeddings using SpeechBrain's pretrained models, particularly the ECAPA-TDNN architecture (default: spkrec-ecapa-voxceleb). |
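
For a sense of what the Librosa Features Extractor computes, the equivalent direct `librosa` calls look roughly like this (a sketch of the underlying library calls; parameter values are illustrative):

```python
# Sketch of the underlying librosa calls -- parameter values are illustrative.
import librosa

y, sr = librosa.load("voice_segment.wav", sr=16000, mono=True)  # WAV Converter defaults

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # MFCC
d_mfcc = librosa.feature.delta(mfcc)                      # Delta-MFCC
zcr = librosa.feature.zero_crossing_rate(y)               # zero-crossing rate
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # spectral centroid
tonnetz = librosa.feature.tonnetz(y=y, sr=sr)             # tonnetz
f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)             # fundamental frequency (F0)
```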

### Model Inference Components

| Component | Description |
|-----------|-------------|
| **VanpyGender Classifier** | SVM-based binary gender classification using speech embeddings. Supports two models: ECAPA-TDNN (192-dim) and XVECT (512-dim) embeddings from SpeechBrain. Trained on VoxCeleb2 dataset with optimized hyperparameters. Provides both verbal ('female'/'male') and numeric label options. |
| **VanpyAge Regressor** | Multi-architecture age estimation supporting SVR and ANN models. Features multiple variants: pure SpeechBrain embeddings (192-dim), combined SpeechBrain and Librosa features (233-dim), and dataset-specific models (VoxCeleb2/TIMIT). |
| **VanpyEmotion Classifier** | 7-class SVM emotion classifier trained on RAVDESS dataset using SpeechBrain embeddings. Classifies emotions into: angry, disgust, fearful, happy, neutral/calm, sad, surprised. |
| **IEMOCAP Emotion** | SpeechBrain-based emotion classifier trained on the IEMOCAP dataset. Uses Wav2Vec2 for feature extraction. Supports four emotion classes: angry, happy, neutral, sad. |
| **Wav2Vec2 ADV** | Advanced emotion analysis using Wav2Vec2, providing continuous scores for arousal, dominance, and valence dimensions. |
| **Wav2Vec2 STT** | Speech-to-text transcription using Facebook's Wav2Vec2 model. |
| **Whisper STT** | OpenAI's Whisper model for robust speech recognition. Supports multiple model sizes and languages. Includes automatic language detection. |
| **Cosine Distance Clusterer** | Clustering method for speaker diarization based on cosine similarity. Groups speech segments by speaker identity using embedding similarity. |
| **GMM Clusterer** | Gaussian Mixture Model-based speaker clustering. |
| **Agglomerative Clusterer** | Hierarchical clustering for speaker diarization. Uses distance-based merging with configurable threshold and maximum clusters. |
| **YAMNet Classifier** | Google's YAMNet model for general audio classification. Supports 521 audio classes from AudioSet ontology. |
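
To illustrate the clustering approach shared by the clusterer components, speaker embeddings can be grouped by cosine distance with scikit-learn. This is a sketch under the assumption of 192-dim ECAPA-TDNN embeddings; the actual components configure their parameters via `pipeline.yaml`:

```python
# Sketch: agglomerative speaker clustering over cosine distance with
# scikit-learn; threshold and embedding source are illustrative assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

embeddings = np.random.rand(20, 192)  # stand-in for per-segment embeddings

clusterer = AgglomerativeClustering(
    n_clusters=None,          # infer the number of speakers
    distance_threshold=0.6,   # illustrative merge threshold
    metric="cosine",          # `affinity` in scikit-learn < 1.2
    linkage="average",
)
speaker_labels = clusterer.fit_predict(embeddings)  # one label per segment
```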


## ComponentPayload Structure
The `ComponentPayload` class manages data flow between pipeline components:
```python
class ComponentPayload:
    metadata: Dict  # Pipeline metadata
    df: pd.DataFrame  # Processing results
```
### Metadata fields
- `input_path`: Path to the input directory (required for `FilelistDataFrameCreator` if no `df` is provided)
- `paths_column`: Column name for audio file paths
- `all_paths_columns`: List of all path columns
- `feature_columns`: List of feature columns
- `meta_columns`: List of metadata columns
- `classification_columns`: List of classification columns

### df fields
- `df`: pd.DataFrame
  
  Includes all the information collected throughout preprocessing and classification:
  - each preprocessor adds a column of paths to where the processed files are stored
  - embedding/feature extraction components add embedding/feature columns
  - each model adds a column with its results

### Key Methods
- `get_features_df()`: Extract features DataFrame
- `get_classification_df()`: Extract classification results DataFrame
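
A minimal usage sketch based on the fields and methods documented above (the import path and constructor signature are assumptions):

```python
# Sketch based on the documented fields; the import path and constructor
# signature are assumptions, not verified API.
import pandas as pd
from vanpy.core.ComponentPayload import ComponentPayload  # assumed path

payload = ComponentPayload(
    metadata={
        "input_path": "data/audio",
        "paths_column": "audio_path",
        "all_paths_columns": ["audio_path"],
        "feature_columns": [],
        "meta_columns": [],
        "classification_columns": [],
    },
    df=pd.DataFrame({"audio_path": ["data/audio/sample.wav"]}),
)

features_df = payload.get_features_df()              # feature columns only
classification_df = payload.get_classification_df()  # model-result columns only
```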


## Coming Soon
- Custom classifier integration guide
- Additional preprocessing components
- Extended model support
- Support for newer Python and dependency versions

## Citing VANPY
Please cite VANPY if you use it:

```bibtex
@misc{vanpy,
  title={VANPY: Voice Analysis Framework},
  author={Gregory Koushnir and Michael Fire and Galit Fuhrmann Alpert and Dima Kagan},
  year={2025},
  eprint={TBD},
  archivePrefix={arXiv},
  primaryClass={TBD},
  note={arXiv:TBD}
}
```

            
