vggsounder

Name: vggsounder
Version: 0.1.2
Summary: A Python package for accessing VGGSounder dataset labels and metadata
Upload time: 2025-08-07 20:22:23
Requires Python: >=3.8
Keywords: audio, classification, dataset, machine-learning, vgg, video
Requirements: No requirements were recorded.

<h1 align="center"><a href="https://vggsounder.github.io/static/workshop_paper.pdf">
VGGSounder: Audio-Visual Evaluations for Foundation Models</a></h1>
<h5 align="center"> If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏</h5>


<h5 align="center">

<!-- [![arXiv](https://img.shields.io/badge/Arxiv-2501.13106-AD1C18.svg?logo=arXiv)](https://arxiv.org/abs/2501.13106)  -->
[![Project page](https://img.shields.io/badge/Project_page-https-blue)](https://vggsounder.github.io) 
<br>

[![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](https://github.com/DAMO-NLP-SG/VideoLLaMA3/blob/main/LICENSE) 
![Badge](https://hitscounter.dev/api/hit?url=https%3A%2F%2Fgithub.com%2FBizilizi%2Fvggsounder&label=HITs&icon=fire&color=%23198754)
[![GitHub issues](https://img.shields.io/github/issues/Bizilizi/vggsounder?color=critical&label=Issues)](https://github.com/Bizilizi/vggsounder/issues?q=is%3Aopen+is%3Aissue)
[![GitHub closed issues](https://img.shields.io/github/issues-closed/Bizilizi/vggsounder?color=success&label=Issues)](https://github.com/Bizilizi/vggsounder/issues?q=is%3Aissue+is%3Aclosed)
</h5>

## 📰 News

* **[11.06.2025]**  📃 Released the VGGSounder technical report, with a detailed discussion of how we built the first multimodal benchmark for video tagging with complete per-modality annotations for every class.


## 🌟 Introduction
**VGGSounder** is a re-annotated benchmark built upon the [VGGSound dataset](https://www.robots.ox.ac.uk/~vgg/data/vggsound/), designed to rigorously evaluate audio-visual foundation models and understand how they utilize modalities. VGGSounder introduces:

- 🔍 Per-label modality tags (audible / visible / both) for every class in a sample
- 🎵 Meta labels for background music, voice-over, and static images
- 📊 Multiple labels per sample


## 🚀 Installation

The VGGSounder dataset is now available as a Python package! Install it via pip:

```bash
pip install vggsounder
```

Or install from source using [uv](https://docs.astral.sh/uv/):

```bash
git clone https://github.com/bizilizi/vggsounder.git
cd vggsounder
uv build
pip install dist/vggsounder-*.whl
```

## 🐍 Python Package Usage

### Quick Start

```python
import vggsounder

# Load the dataset
labels = vggsounder.VGGSounder()

# Access video data by ID
video_data = labels["--U7joUcTCo_000000"]
print(video_data.labels)        # List of labels for this video
print(video_data.meta_labels)   # Metadata (background_music, static_image, voice_over)
print(video_data.modalities)    # Modality for each label (A, V, AV)

# Get dataset statistics
stats = labels.stats()
print(f"Total videos: {stats['total_videos']}")
print(f"Unique labels: {stats['unique_labels']}")

# Search functionality
piano_videos = labels.get_videos_with_labels("playing piano")
voice_over_videos = labels.get_videos_with_meta(voice_over=True)
```

### Advanced Usage

```python
# Dict-like interface
print(len(labels))                    # Number of videos
print("video_id" in labels)           # Check if video exists
for video_id in labels:               # Iterate over video IDs
    video_data = labels[video_id]

# Get all unique labels
all_labels = labels.get_all_labels()

# Complex queries
static_speech_videos = labels.get_videos_with_meta(
    static_image=True, voice_over=True
)
```
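
Using only the attributes shown above, quick corpus statistics are easy to compute. The sketch below assumes `video_data.modalities` is a list of `A`/`V`/`AV` strings and that `video_data.meta_labels` behaves like a dict of boolean flags, as the Quick Start comments suggest; adjust if the actual types differ.

```python
import collections

import vggsounder

labels = vggsounder.VGGSounder()

modality_counts = collections.Counter()  # how often each A/V/AV tag occurs
meta_counts = collections.Counter()      # how many videos carry each meta flag

for video_id in labels:
    video_data = labels[video_id]
    # Assumption: modalities is an iterable of "A"/"V"/"AV" strings.
    modality_counts.update(video_data.modalities)
    # Assumption: meta_labels maps flag names to booleans.
    meta_counts.update(k for k, v in video_data.meta_labels.items() if v)

print(modality_counts)
print(meta_counts)
```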

## 🏷️ Label Format

VGGSounder annotations are stored in a CSV file located at `data/vggsounder.csv`. Each row corresponds to a single label for a specific video sample. The dataset supports **multi-label**, **multi-modal** classification with additional **meta-information** for robust evaluation.

### Columns

- **`video_id`**: Unique identifier for a 10-second video clip.
- **`label`**: Human-readable label representing a sound or visual category (e.g. `male singing`, `playing timpani`).
- **`modality`**: The modality in which the label is perceivable:
  - `A` = Audible
  - `V` = Visible
  - `AV` = Both audible and visible
- **`background_music`**: `True` if the video contains background music.
- **`static_image`**: `True` if the video consists of a static image.
- **`voice_over`**: `True` if the video contains voice-over narration.

### Example

| video_id           | label             | modality | background_music | static_image | voice_over |
|--------------------|------------------|----------|------------------|--------------|------------|
| `---g-f_I2yQ_000001` | `male singing`     | A        | True             | False        | False      |
| `---g-f_I2yQ_000001` | `people crowd`     | AV       | True             | False        | False      |
| `---g-f_I2yQ_000001` | `playing timpani`  | A        | True             | False        | False      |
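
If you prefer working with the raw CSV directly instead of the Python package, a short pandas sketch like the following reproduces the per-video view. The `data/vggsounder.csv` path is relative to the repository root, and pandas is an extra dependency not required by the package.

```python
import pandas as pd

# Load the raw annotations (path relative to the repository root).
df = pd.read_csv("data/vggsounder.csv")

# The flag columns may load as booleans or as "True"/"False" strings
# depending on the CSV contents; normalise them to booleans either way.
flag_cols = ["background_music", "static_image", "voice_over"]
df[flag_cols] = df[flag_cols].astype(str).apply(lambda s: s.str.lower().eq("true"))

# All labels and modalities for one clip.
clip = df[df["video_id"] == "---g-f_I2yQ_000001"]
print(clip[["label", "modality"]])

# Video IDs with background music but no static image.
music_ids = df.loc[df["background_music"] & ~df["static_image"], "video_id"].unique()
print(len(music_ids))
```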

## 🧪 Benchmark Evaluation

VGGSounder provides a comprehensive benchmarking system to evaluate audio-visual foundation models across multiple modalities and metrics. The benchmark supports both discrete predictions and continuous logits-based evaluation.

### Supported Modalities

- **`a`**: Audio - includes samples with an audio component (A + AV)
- **`v`**: Visual - includes samples with a visual component (V + AV)
- **`av`**: Audio-Visual - samples with both modalities (AV only)
- **`a only`**: Audio-only - pure audio samples (excludes AV samples)
- **`v only`**: Visual-only - pure visual samples (excludes AV samples)
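
To make the relation between these evaluation modalities and the `A`/`V`/`AV` tags in the label file concrete, here is a small sketch of one way to select per-modality ground-truth labels for a video. It simply mirrors the subset definitions above and is not code from the benchmark itself.

```python
def ground_truth_for_modality(labels, modalities, eval_modality):
    """Select the labels counted as ground truth under one evaluation modality.

    `labels` and `modalities` are parallel lists for one video (e.g.
    `video_data.labels` and `video_data.modalities`); `eval_modality` is one of
    "a", "v", "av", "a only", "v only".
    """
    allowed = {
        "a": {"A", "AV"},    # any label with an audio component
        "v": {"V", "AV"},    # any label with a visual component
        "av": {"AV"},        # labels perceivable in both modalities
        "a only": {"A"},     # audio-only labels
        "v only": {"V"},     # visual-only labels
    }[eval_modality]
    return [label for label, modality in zip(labels, modalities) if modality in allowed]
```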

### Available Metrics

The benchmark computes a comprehensive set of metrics:
- **Top-k metrics**: `hit_rate@k`, `f1@k`, `accuracy@k`, `precision@k`, `recall@k`, `jaccard@k` (for k=1,3,5,10)
- **Aggregate metrics**: `f1`, `f1_macro`, `accuracy`, `precision`, `recall`, `jaccard`, `hit_rate`
- **AUC metrics**: `auc_roc`, `auc_pr` (ROC-AUC and Precision-Recall AUC)
- **Modality confusion**: `mu` (measures cases where a single modality succeeds but the combined audio-visual prediction fails; see the sketch below)
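
The modality-confusion statistic is the least standard of these metrics. The sketch below shows one plausible way such a value could be computed from discrete per-modality predictions; it is only meant to illustrate the idea, and the exact definition used by the benchmark may differ.

```python
def modality_confusion(predictions, ground_truth):
    """Fraction of videos where a single modality hits a ground-truth label
    while the audio-visual prediction misses all of them (illustrative only).

    `predictions` maps video_id -> {"a": [...], "v": [...], "av": [...]} and
    `ground_truth` maps video_id -> set of ground-truth labels.
    """
    confused = 0
    for video_id, gt_labels in ground_truth.items():
        preds = predictions.get(video_id, {})

        def hits(modality):
            return bool(gt_labels & set(preds.get(modality, [])))

        if (hits("a") or hits("v")) and not hits("av"):
            confused += 1
    return confused / max(len(ground_truth), 1)
```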

### Model Results Format

Model predictions should be saved as pickle files with the following structure:

```python
{
    "video_id": {
        "predictions": {  # Optional: discrete predictions
            "a": ["label1", "label2", ...],     # Audio predictions
            "v": ["label1", "label3", ...],     # Visual predictions
            "av": ["label1", "label2", ...]     # Audio-visual predictions
        },
        "logits": {      # Optional: continuous scores
            "a": [0.1, 0.8, 0.3, ...],         # Audio logits (310 classes)
            "v": [0.2, 0.1, 0.9, ...],         # Visual logits (310 classes)  
            "av": [0.4, 0.6, 0.2, ...]         # Audio-visual logits (310 classes)
        }
    },
    # ... more video_ids
}
```

**Note**: Either `predictions` or `logits` (or both) should be provided. Logits enable more detailed top-k and AUC analysis.
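
For reference, a results file in this format can be written with the standard `pickle` module. The video ID, labels, and file name below are placeholders; presumably each pickle in `models_path` is named after the corresponding model key used in `display_names` (see the benchmark example below).

```python
import pickle

results = {
    "---g-f_I2yQ_000001": {  # placeholder video ID
        "predictions": {
            "a": ["male singing", "playing timpani"],  # audio-only predictions
            "v": ["people crowd"],                     # visual-only predictions
            "av": ["male singing", "people crowd"],    # audio-visual predictions
        },
        # A "logits" entry could be added instead of (or in addition to)
        # "predictions": one score per class, in a fixed class order shared
        # across all videos.
    },
}

with open("my-model.pkl", "wb") as f:  # placeholder file name
    pickle.dump(results, f)
```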

### Running the Benchmark

#### Quick Start

```python
from vggsounder.benchmark import benchmark

# Define model display names
display_names = {
    "cav-mae": "CAV-MAE",
    "deepavfusion": "DeepAVFusion", 
    "equiav": "Equi-AV",
    "gemini-1.5-flash": "Gemini 1.5 Flash",
    "gemini-1.5-pro": "Gemini 1.5 Pro"
}

# Specify metrics and modalities to evaluate
metrics = [
    ("accuracy", ["a", "v", "av"]),
    ("f1", ["a", "v", "av", "a only", "v only"]), 
    ("hit_rate", ["a", "v", "av"]),
    ("mu", ["a", "v", "av"])  # Modality confusion
]

# Run benchmark
results_table = benchmark(
    models_path="path/to/model/pickles",
    display_names=display_names,
    metrics=metrics
)

print(results_table)
```

For a detailed example of how we generate the tables used in our paper, please see the [example notebook](https://github.com/Bizilizi/VGGSounder/blob/main/experiments/visualisations/metrics.ipynb).


## 📑 Citation

If you find VGGSounder useful for your research and applications, please consider citing us using this BibTeX:

```bibtex
@article{zverevwiedemer2025vggsounder,
  author    = {Daniil Zverev and Thaddäus Wiedemer and Ameya Prabhu and Matthias Bethge and Wieland Brendel and A. Sophia Koepke},
  title     = {VGGSounder: Audio-Visual Evaluations for Foundation Models},
  year      = {2025},
}
```

## ❤️ Acknowledgement
The authors would like to thank [Felix Förster](https://www.linkedin.com/in/felix-f%C3%B6rster-316010235/?trk=public_profile_browsemap&originalSubdomain=de), [Sayak Mallick](https://scholar.google.fr/citations?user=L_0KSXUAAAAJ&hl=en), and [Prasanna Mayilvahananan](https://scholar.google.fr/citations?user=3xq1YcYAAAAJ&hl=en) for their help with data annotation, as well as [Thomas Klein](https://scholar.google.de/citations?user=3WfC0yMAAAAJ&hl=en) and [Shyamgopal Karthik](https://scholar.google.co.in/citations?user=OiVCfscAAAAJ&hl=en) for their help in setting up MTurk. They also thank numerous MTurk workers for labelling. This work was in part supported by the [BMBF](https://www.bmbf.de/DE/Home/home_node.html) (FKZ: 01IS24060, 01I524085B), the [DFG](https://www.dfg.de/) (SFB 1233, TP A1, project number: 276693517), and the [Open Philanthropy Foundation](https://www.openphilanthropy.org/) funded by the [Good Ventures Foundation](https://www.goodventures.org/). The authors thank the IMPRS-IS for supporting TW.


## 👮 License

This project is released under the Apache 2.0 license as found in the LICENSE file. Please get in touch with us if you find any potential violations. 
            
