Using `pyannote.audio` open-source toolkit in production?
Make the most of it thanks to our [consulting services](https://herve.niderb.fr/consulting.html).
# `pyannote.audio` speaker diarization toolkit
`pyannote.audio` is an open-source toolkit written in Python for speaker diarization. Based on the [PyTorch](https://pytorch.org) machine learning framework, it comes with state-of-the-art [pretrained models and pipelines](https://hf.co/pyannote) that can be further fine-tuned on your own data for even better performance.
<p align="center">
<a href="https://www.youtube.com/watch?v=37R_R82lfwA"><img src="https://img.youtube.com/vi/37R_R82lfwA/0.jpg" alt="pyannote.audio introduction video"></a>
</p>
## TL;DR
1. Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) with `pip install pyannote.audio`
2. Accept [`pyannote/segmentation-3.0`](https://hf.co/pyannote/segmentation-3.0) user conditions
3. Accept [`pyannote/speaker-diarization-3.1`](https://hf.co/pyannote/speaker-diarization-3.1) user conditions
4. Create an access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens)
```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# send pipeline to GPU (when available)
import torch
if torch.cuda.is_available():
    pipeline.to(torch.device("cuda"))

# apply pretrained pipeline
diarization = pipeline("audio.wav")

# print the result
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")
# start=0.2s stop=1.5s speaker_0
# start=1.8s stop=3.9s speaker_1
# start=4.2s stop=5.7s speaker_0
# ...
```
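The pipeline also accepts audio already loaded in memory and, when the number of speakers is known in advance, a speaker-count constraint; results can be saved to the standard RTTM format. A short sketch of these options (see the pipeline's model card for the full list; file names here are placeholders):

```python
import torchaudio
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# run on audio already loaded in memory (tensor of shape channel x samples)
waveform, sample_rate = torchaudio.load("audio.wav")
diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})

# constrain the number of speakers when it is known in advance
diarization = pipeline("audio.wav", num_speakers=2)

# save the result in the standard RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
```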
## Highlights
- :hugs: pretrained [pipelines](https://hf.co/models?other=pyannote-audio-pipeline) (and [models](https://hf.co/models?other=pyannote-audio-model)) on [:hugs: model hub](https://huggingface.co/pyannote)
- :exploding_head: state-of-the-art performance (see [Benchmark](#benchmark))
- :snake: Python-first API
- :zap: multi-GPU training with [pytorch-lightning](https://pytorchlightning.ai/) (minimal sketch below)
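A minimal, hedged sketch of what multi-GPU fine-tuning looks like, following the training tutorial linked below. It assumes a `pyannote.database` protocol has already been registered; the protocol name is hypothetical:

```python
import lightning.pytorch as pl
from pyannote.audio import Model
from pyannote.audio.tasks import SpeakerDiarization
from pyannote.database import registry

# hypothetical protocol name: replace with your own registered protocol
protocol = registry.get_protocol("MyDatabase.SpeakerDiarization.MyProtocol")

# attach a fine-tuning task to the pretrained segmentation model
model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
model.task = SpeakerDiarization(protocol)

# pytorch-lightning handles the multi-GPU (DDP) details
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp", max_epochs=10)
trainer.fit(model)
```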
## Documentation
- [Changelog](CHANGELOG.md)
- [Frequently asked questions](FAQ.md)
- Models
- Available tasks explained
- [Applying a pretrained model](tutorials/applying_a_model.ipynb)
- [Training, fine-tuning, and transfer learning](tutorials/training_a_model.ipynb)
- Pipelines
- Available pipelines explained
- [Applying a pretrained pipeline](tutorials/applying_a_pipeline.ipynb)
- [Adapting a pretrained pipeline to your own data](tutorials/adapting_pretrained_pipeline.ipynb)
- [Training a pipeline](tutorials/voice_activity_detection.ipynb)
- Contributing
- [Adding a new model](tutorials/add_your_own_model.ipynb)
- [Adding a new task](tutorials/add_your_own_task.ipynb)
- Adding a new pipeline
- Sharing pretrained models and pipelines
- Blog
- 2022-12-02 > ["How I reached 1st place at Ego4D 2022, 1st place at Albayzin 2022, and 6th place at VoxSRC 2022 speaker diarization challenges"](tutorials/adapting_pretrained_pipeline.ipynb)
- 2022-10-23 > ["One speaker segmentation model to rule them all"](https://herve.niderb.fr/fastpages/2022/10/23/One-speaker-segmentation-model-to-rule-them-all)
- 2021-08-05 > ["Streaming voice activity detection with pyannote.audio"](https://herve.niderb.fr/fastpages/2021/08/05/Streaming-voice-activity-detection-with-pyannote.html)
- Videos
- [Introduction to speaker diarization](https://umotion.univ-lemans.fr/video/9513-speech-segmentation-and-speaker-diarization/) / JSALT 2023 summer school / 90 min
- [Speaker segmentation model](https://www.youtube.com/watch?v=wDH2rvkjymY) / Interspeech 2021 / 3 min
- [First release of pyannote.audio](https://www.youtube.com/watch?v=37R_R82lfwA) / ICASSP 2020 / 8 min
- Community contributions (not maintained by the core team)
- 2024-04-05 > [Offline speaker diarization (speaker-diarization-3.1)](tutorials/community/offline_usage_speaker_diarization.ipynb) by [Simon Ottenhaus](https://github.com/simonottenhauskenbun)
## Benchmark
Out of the box, version 3.1 of the `pyannote.audio` speaker diarization [pipeline](https://hf.co/pyannote/speaker-diarization-3.1) is expected to be both more accurate and faster than v2.x.
The numbers below are diarization error rates (in %):
| Benchmark | [v2.1](https://hf.co/pyannote/speaker-diarization-2.1) | [v3.1](https://hf.co/pyannote/speaker-diarization-3.1) | [Premium](https://forms.office.com/e/GdqwVgkZ5C) |
| --------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------ | ------------------------------------------------ |
| [AISHELL-4](https://arxiv.org/abs/2104.03603) | 14.1 | 12.2 | 11.9 |
| [AliMeeting](https://www.openslr.org/119/) (channel 1) | 27.4 | 24.4 | 22.5 |
| [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) (IHM) | 18.9 | 18.8 | 16.6 |
| [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) (SDM) | 27.1 | 22.4 | 20.9 |
| [AVA-AVD](https://arxiv.org/abs/2111.14448) | 66.3 | 50.0 | 39.8 |
| [CALLHOME](https://catalog.ldc.upenn.edu/LDC2001S97) ([part 2](https://github.com/BUTSpeechFIT/CALLHOME_sublists/issues/1)) | 31.6 | 28.4 | 22.2 |
| [DIHARD 3](https://catalog.ldc.upenn.edu/LDC2022S14) ([full](https://arxiv.org/abs/2012.01477)) | 26.9 | 21.7 | 17.2 |
| [Earnings21](https://github.com/revdotcom/speech-datasets) | 17.0 | 9.4 | 9.0 |
| [Ego4D](https://arxiv.org/abs/2110.07058) (dev.) | 61.5 | 51.2 | 43.8 |
| [MSDWild](https://github.com/X-LANCE/MSDWILD) | 32.8 | 25.3 | 19.8 |
| [RAMC](https://www.openslr.org/123/) | 22.5 | 22.2 | 18.4 |
| [REPERE](https://www.islrn.org/resources/360-758-359-485-0/) (phase2) | 8.2 | 7.8 | 7.6 |
| [VoxConverse](https://github.com/joonson/voxconverse) (v0.3) | 11.2 | 11.3 | 9.4 |
[Diarization error rate](http://pyannote.github.io/pyannote-metrics/reference.html#diarization) (in %)
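The metric used throughout this benchmark is implemented in [`pyannote.metrics`](http://pyannote.github.io/pyannote-metrics/). A minimal sketch with toy reference and hypothesis annotations (segments and labels are made up for illustration):

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# toy ground truth: who speaks when
reference = Annotation()
reference[Segment(0.0, 1.5)] = "alice"
reference[Segment(1.5, 3.0)] = "bob"

# toy system output (labels need not match: DER finds the best mapping)
hypothesis = Annotation()
hypothesis[Segment(0.0, 1.4)] = "speaker_0"
hypothesis[Segment(1.4, 3.0)] = "speaker_1"

metric = DiarizationErrorRate()
print(f"DER = {metric(reference, hypothesis) * 100:.1f}%")
```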
## Citations
If you use `pyannote.audio`, please cite the following papers:
```bibtex
@inproceedings{Plaquet23,
author={Alexis Plaquet and Hervé Bredin},
title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
year=2023,
booktitle={Proc. INTERSPEECH 2023},
}
```
```bibtex
@inproceedings{Bredin23,
author={Hervé Bredin},
title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
year=2023,
booktitle={Proc. INTERSPEECH 2023},
}
```
## Development
The commands below set up pre-commit hooks and install the packages needed to develop the `pyannote.audio` library.
```bash
pip install -e ".[dev,testing]"
pre-commit install
```
## Test
```bash
pytest
```