conette

Name: conette
Version: 0.3.0
Summary: CoNeTTE is an audio captioning system that generates a short textual description of the sound events in any audio file.
Upload time: 2024-04-19 07:33:30
Requires Python: >=3.10
Keywords: audio, deep-learning, pytorch, captioning, audio-captioning
            <div align="center">

# CoNeTTE model for Audio Captioning

[![](<https://img.shields.io/badge/-Python 3.10+-blue?style=for-the-badge&logo=python&logoColor=white>)](https://www.python.org/)
[![](<https://img.shields.io/badge/-PyTorch 1.10.1+-ee4c2c?style=for-the-badge&logo=pytorch&logoColor=white>)](https://pytorch.org/get-started/locally/)
[![](https://img.shields.io/badge/code%20style-black-black.svg?style=for-the-badge&labelColor=gray)](https://black.readthedocs.io/en/stable/)
[![](https://img.shields.io/github/actions/workflow/status/Labbeti/conette-audio-captioning/inference.yaml?branch=main&style=for-the-badge&logo=github)](https://github.com/Labbeti/conette-audio-captioning/actions)

</div>

CoNeTTE is an audio captioning system that generates a short textual description of the sound events in any audio file. The architecture and training are explained in the [corresponding paper](https://arxiv.org/pdf/2309.00454.pdf). The model was developed by me ([Étienne Labbé](https://labbeti.github.io/)) during my PhD. A simple interface to test CoNeTTE is available on the [Hugging Face website](https://huggingface.co/spaces/Labbeti/conette).

## Training
### Requirements
- Intended for Ubuntu 20.04 only. Requires the **java** (< 1.13), **ffmpeg**, **yt-dlp**, and **zip** commands; see the quick check below.
- Recommended GPU: NVIDIA V100 with 32 GB of VRAM.
- The WavCaps dataset might require more than 2 TB of disk storage. The other datasets require less than 50 GB.
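
A minimal sketch to verify that the required external commands are available on your machine (the expected java version still has to be checked manually):
```bash
# Check that the external commands needed for data preparation are installed.
java -version
ffmpeg -version
yt-dlp --version
zip --version
```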

### Installation
By default, **only the pip inference requirements are installed for conette**. To install the training requirements, use the following command:
```bash
python -m pip install conette[train]
```
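Depending on your shell (e.g. zsh), you might need to quote the extra: `python -m pip install "conette[train]"`.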
If you have already installed conette for inference, it is **highly recommended to create another environment** before installing conette for training.

### Download external models and data
These steps might take a while (a few hours to download and prepare everything, depending on your CPU, GPU and SSD/HDD).

First, download the ConvNeXt, NLTK and spaCy models:
```bash
conette-prepare data=none default=true pack_to_hdf=false csum_in_hdf_name=false pann=false
```

Then download the 4 datasets used to train CoNeTTE:
```bash
common_args="data.download=true pack_to_hdf=true audio_t=resample_mean_convnext audio_t.pretrain_path=cnext_bl_75 post_hdf_name=bl pretag=cnext_bl_75"

conette-prepare data=audiocaps audio_t.src_sr=32000 ${common_args}
conette-prepare data=clotho audio_t.src_sr=44100 ${common_args}
conette-prepare data=macs audio_t.src_sr=48000 ${common_args}
conette-prepare data=wavcaps audio_t.src_sr=32000 ${common_args} datafilter.min_audio_size=0.1 datafilter.max_audio_size=30.0 datafilter.sr=32000
```

### Train a model
CNext-trans (baseline) on CL only (~3 hours on 1 GPU V100-32G)
```bash
conette-train expt=[clotho_cnext_bl] pl=baseline
```

CoNeTTE on AC+CL+MA+WC, specialized for CL (~4 hours on 1 GPU V100-32G)
```bash
conette-train expt=[camw_cnext_bl_for_c,task_ds_src_camw] pl=conette
```

CoNeTTE on AC+CL+MA+WC, specialized for AC (~3 hours on 1 GPU V100-32G)
```bash
conette-train expt=[camw_cnext_bl_for_a,task_ds_src_camw] pl=conette
```

Note 1: any training using AC data cannot be exactly reproduced, because part of this data has been deleted from the YouTube source and I cannot share my own audio files.
Note 2: the paper results are scores averaged over 5 seeds (1234-1238). The default training only uses seed 1234.

## Inference only (without training)

### Installation
```bash
python -m pip install conette[test]
```

### Usage with python
```py
from conette import CoNeTTEConfig, CoNeTTEModel

config = CoNeTTEConfig.from_pretrained("Labbeti/conette")
model = CoNeTTEModel.from_pretrained("Labbeti/conette", config=config)

path = "/your/path/to/audio.wav"
outputs = model(path)
candidate = outputs["cands"][0]
print(candidate)
```

The model can also accept several audio files at the same time (`list[str]`), or a list of pre-loaded audio files (`list[Tensor]`). In the second case, you also need to provide the sampling rate of these files:

```py
import torchaudio

path_1 = "/your/path/to/audio_1.wav"
path_2 = "/your/path/to/audio_2.wav"

audio_1, sr_1 = torchaudio.load(path_1)
audio_2, sr_2 = torchaudio.load(path_2)

outputs = model([audio_1, audio_2], sr=[sr_1, sr_2])
candidates = outputs["cands"]
print(candidates)
```
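
Passing several file paths at once works the same way (a minimal sketch reusing the `model` loaded above; the paths are placeholders):

```py
paths = ["/your/path/to/audio_1.wav", "/your/path/to/audio_2.wav"]

# One caption candidate is returned per input file.
outputs = model(paths)
print(outputs["cands"])
```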

The model can also produce different captions using a Task Embedding input, which indicates the dataset caption style. The default task is "clotho".

```py
outputs = model(path, task="clotho")
candidate = outputs["cands"][0]
print(candidate)

outputs = model(path, task="audiocaps")
candidate = outputs["cands"][0]
print(candidate)
```

### Usage with command line
Simply use the `conette-predict` command with the `--audio PATH1 PATH2 ...` option. You can also export the results to a CSV file using `--csv_export PATH`.

```bash
conette-predict --audio "/your/path/to/audio.wav"
```
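
For example, to caption two files and export the predictions to a CSV file (the paths below are placeholders):

```bash
conette-predict --audio "/your/path/to/audio_1.wav" "/your/path/to/audio_2.wav" --csv_export "predictions.csv"
```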

### Performance
The model has been trained on AudioCaps (AC), Clotho (CL), MACS (MA) and WavCaps (WC). The performance on the test subsets is:

| Test data | SPIDEr (%) | SPIDEr-FL (%) | FENSE (%) | Vocab | Outputs | Scores |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| AC-test | 44.14 | 43.98 | 60.81 | 309 | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/detailed_outputs/outputs_audiocaps_test.csv) | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/detailed_outputs/scores_audiocaps_test.yaml) |
| CL-eval | 30.97 | 30.87 | 51.72 | 636 | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/detailed_outputs/outputs_clotho_eval.csv) | [Link](https://github.com/Labbeti/conette-audio-captioning/blob/main/results/detailed_outputs/scores_clotho_eval.yaml) |

This model checkpoint has been trained with a focus on the Clotho dataset, but it can also reach good performance on AudioCaps with the "audiocaps" task.

### Limitations
- The model expects audio sampled at **32 kHz**. It automatically resamples input audio files up or down, but this might give worse results, especially for audio with lower sampling rates; you can also resample explicitly before inference, as sketched below.
- The model has been trained on audio lasting from **1 to 30 seconds**. It can handle longer audio files, but it might require more memory and give worse results.
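
A minimal sketch of explicit resampling with torchaudio, reusing the `model` loaded above (the path is a placeholder):

```py
import torchaudio
import torchaudio.functional as F

TARGET_SR = 32_000  # sampling rate the model was trained on

audio, sr = torchaudio.load("/your/path/to/audio.wav")
if sr != TARGET_SR:
    # Resample to 32 kHz before inference instead of relying on the model's internal resampling.
    audio = F.resample(audio, orig_freq=sr, new_freq=TARGET_SR)

outputs = model([audio], sr=[TARGET_SR])
print(outputs["cands"][0])
```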

## Citation
The preprint version of the paper describing CoNeTTE is available on arXiv: https://arxiv.org/pdf/2309.00454.pdf

```bibtex
@misc{labbe2023conette,
	title        = {CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding},
	author       = {Étienne Labbé and Thomas Pellegrini and Julien Pinquier},
	year         = 2023,
	journal      = {arXiv preprint arXiv:2309.00454},
	url          = {https://arxiv.org/pdf/2309.00454.pdf},
	eprint       = {2309.00454},
	archiveprefix = {arXiv},
	primaryclass = {cs.SD}
}
```

## Additional information
- CoNeTTE stands for **Co**nv**Ne**Xt-**T**ransformer with **T**ask **E**mbedding.
- Model weights are available on HuggingFace: https://huggingface.co/Labbeti/conette
- The weights of the encoder part of the architecture are based on a ConvNeXt model for audio classification, available here: https://zenodo.org/records/10987498 under the filename "convnext_tiny_465mAP_BL_AC_75kit.pth".

## Contact
Maintainer:
- [Étienne Labbé](https://labbeti.github.io/) "Labbeti": labbeti.pub@gmail.com

            
