# PaSST package for HEAR 2021 NeurIPS Challenge: Holistic Evaluation of Audio Representations
This is an implementation of [Efficient Training of Audio Transformers with Patchout](https://arxiv.org/abs/2110.05069) for the HEAR 2021 NeurIPS Challenge: Holistic Evaluation of Audio Representations.
# CUDA version
This implementation was tested with CUDA version 11.1 and the following torch installation:
```shell
pip3 install torch==1.8.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
```
It should, however, also work with newer versions of CUDA and torch.
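To quickly verify that the installed torch build can see your GPU:

```python
import torch

print(torch.__version__)          # e.g. '1.8.1+cu111'
print(torch.cuda.is_available())  # should print True on a working CUDA setup
```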
# Installation
Install the latest version of this repo:
```shell
pip install hear21passt
```
The models follow the [common API](https://neuralaudio.ai/hear2021-holistic-evaluation-of-audio-representations.html#common-api) of HEAR 21:
```shell
hear-validator --model hear21passt.base.pt hear21passt.base
hear-validator --model noweights.txt hear21passt.base2levelF
hear-validator --model noweights.txt hear21passt.base2levelmel
```
There are three modules available: `hear21passt.base`, `hear21passt.base2level`, and `hear21passt.base2levelmel`:
```python
import torch
from hear21passt.base import load_model, get_scene_embeddings, get_timestamp_embeddings
model = load_model().cuda()
seconds = 15
# a batch of 3 example clips, 15 seconds each, at 32 kHz
audio = torch.ones((3, 32000 * seconds)) * 0.5
embed, time_stamps = get_timestamp_embeddings(audio, model)
print(embed.shape)
embed = get_scene_embeddings(audio, model)
print(embed.shape)
```
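Following the HEAR common API, `get_timestamp_embeddings` returns embeddings of shape `(n_sounds, n_timestamps, embedding_size)` together with the corresponding timestamps, while `get_scene_embeddings` returns a single `(n_sounds, embedding_size)` embedding per clip.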
# Getting the Logits/Class Labels
You can get the logits (before the sigmoid activation) for the 527 classes of AudioSet:
```python
import torch
from hear21passt.base import load_model

model = load_model(mode="logits").cuda()
# example input: one 10-second waveform at 32 kHz
wave_signal = torch.randn(1, 32000 * 10).cuda()
logits = model(wave_signal)
```
The class label indices can be found [here](https://github.com/qiuqiangkong/audioset_tagging_cnn/blob/master/metadata/class_labels_indices.csv).
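For example, to map the logits to human-readable AudioSet labels, you can parse that CSV (a sketch; the file has columns `index,mid,display_name`):

```python
import csv
import urllib.request

import torch

# download the AudioSet label map (columns: index, mid, display_name)
url = ("https://raw.githubusercontent.com/qiuqiangkong/audioset_tagging_cnn/"
       "master/metadata/class_labels_indices.csv")
with urllib.request.urlopen(url) as f:
    rows = list(csv.DictReader(f.read().decode("utf-8").splitlines()))
labels = [row["display_name"] for row in rows]

# top-5 predicted classes for the first clip in the batch
probs = torch.sigmoid(logits[0])
top = torch.topk(probs, 5)
for p, i in zip(top.values, top.indices):
    print(f"{labels[int(i)]}: {float(p):.3f}")
```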
You can also use different pre-trained models, for example, the model trained with Knowledge Distillation (KD), `passt_s_kd_p16_128_ap486`:
```python
import torch
from hear21passt.base import get_basic_model

model = get_basic_model(mode="logits", arch="passt_s_kd_p16_128_ap486")
wave_signal = torch.randn(1, 32000 * 10)  # example 10-second input at 32 kHz
logits = model(wave_signal)
```
# Supporting longer clips
For inputs longer than 10 seconds, the `get_scene_embeddings` method computes the average of the embeddings of overlapping 10-second windows.
Depending on the application, it may be useful to use a pre-trained model that can extract an embedding from 20 or 30 seconds of audio without averaging. These variants have time positional encodings pre-trained on 20/30-second clips:
```python
import torch

# from version 0.0.18, it's possible to use:
from hear21passt.base20sec import load_model  # up to 20 seconds of audio
# or
from hear21passt.base30sec import load_model  # up to 30 seconds of audio

model = load_model(mode="logits").cuda()
wave_signal = torch.randn(1, 32000 * 30).cuda()  # example 30-second input at 32 kHz
logits = model(wave_signal)
```
# Loading other pre-trained models for logits or fine-tuning
Each pre-trained model has a specific frequency/time positional encoding; it's necessary to select the correct input shape to be able to load the models. The important variables for loading are `input_tdim`, `fstride`, and `tstride`, which specify the number of spectrogram time frames, the patch stride over frequency, and the patch stride over time, respectively.
```python
import torch
from hear21passt.base import get_basic_model, get_model_passt

model = get_basic_model(mode="logits")
some_wave_signal = torch.randn(1, 32000 * 10)  # example 10-second waveform at 32 kHz
logits = model(some_wave_signal)

# Examples of other pre-trained models using the same spectrograms

# pre-trained on OpenMIC-2018
model.net = get_model_passt(arch="openmic", n_classes=20)
# pre-trained on FSD50K
model.net = get_model_passt(arch="fsd50k", n_classes=200)
# pre-trained on FSD50K without patch overlap (faster)
model.net = get_model_passt(arch="fsd50k-n", n_classes=200, fstride=16, tstride=16)

# These models are trained on 10-second clips from AudioSet but accept longer audio (20s or 30s);
# they are trained by sampling a 10-second sequence of time positional encodings
model.net = get_model_passt("passt_20sec", input_tdim=2000)
model.net = get_model_passt("passt_30sec", input_tdim=3000)
```
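For fine-tuning, the wrapped model behaves like any other PyTorch module. Below is a minimal sketch of a supervised fine-tuning loop; the `dataloader` yielding `(waveform, multi-hot target)` batches is an assumption, not part of this package:

```python
import torch
from hear21passt.base import get_basic_model, get_model_passt

model = get_basic_model(mode="logits")
model.net = get_model_passt(arch="fsd50k", n_classes=200)  # e.g. start from FSD50K weights
model.cuda().train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
criterion = torch.nn.BCEWithLogitsLoss()  # multi-label loss, as in AudioSet/FSD50K

for wave, target in dataloader:  # assumed: batches of 32 kHz waveforms and multi-hot labels
    optimizer.zero_grad()
    logits = model(wave.cuda())
    loss = criterion(logits, target.cuda())
    loss.backward()
    optimizer.step()
```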
If you provide the wrong spectrograms, the model may fail silently by generating low-quality embeddings and logits. Make sure you use the correct spectrogram config for the selected pre-trained model.
Models with a higher spectrogram resolution need the corresponding spectrogram config:
```python
from hear21passt.base import get_model_passt
from hear21passt.models.preprocess import AugmentMelSTFT

# high-res pre-trained on AudioSet (hopsize=160 for this pre-trained model)
model.net = get_model_passt("stfthop160", input_tdim=2000)
model.mel = AugmentMelSTFT(n_mels=128, sr=32000, win_length=800, hopsize=160,
                           n_fft=1024, freqm=48, timem=192, htk=False,
                           fmin=0.0, fmax=None, norm=1,
                           fmin_aug_range=10, fmax_aug_range=2000)

# higher-res pre-trained on AudioSet (hopsize=100 for this pre-trained model)
model.net = get_model_passt("stfthop100", input_tdim=3200)
model.mel = AugmentMelSTFT(n_mels=128, sr=32000, win_length=800, hopsize=100,
                           n_fft=1024, freqm=48, timem=192, htk=False,
                           fmin=0.0, fmax=None, norm=1,
                           fmin_aug_range=10, fmax_aug_range=2000)
```
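A quick way to sanity-check the config is to compare the number of mel frames the front end produces against the model's `input_tdim` (a sketch, assuming `AugmentMelSTFT` returns a `(batch, n_mels, frames)` tensor):

```python
import torch

sr, hopsize, input_tdim = 32000, 160, 2000  # the stfthop160 config from above
seconds = input_tdim * hopsize / sr         # 2000 * 160 / 32000 = 10 seconds
wave = torch.randn(1, int(sr * seconds))
mel = model.mel(wave)
print(mel.shape)  # expect roughly (1, 128, 2000), matching input_tdim
```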