# Introduction
Whisper command-line client compatible with the original [OpenAI client](https://github.com/openai/whisper), based on CTranslate2.
It uses the [CTranslate2](https://github.com/OpenNMT/CTranslate2/) and [Faster-Whisper](https://github.com/SYSTRAN/faster-whisper) Whisper implementation, which is up to 4 times faster than openai/whisper for the same accuracy while using less memory.
Goals of the project:
* Provide an easy way to use the CTranslate2 Whisper implementation
* Ease the migration for people using OpenAI Whisper CLI
# 🚀 **NEW PROJECT LAUNCHED!** 🚀
**Open dubbing** is an AI dubbing system that uses machine learning models to automatically translate and synchronize audio dialogue into different languages! 🎉
### **🔥 Check it out now: [*open-dubbing*](https://github.com/jordimas/open-dubbing) 🔥**
# Installation
To install the latest stable version, just type:

    pip install -U whisper-ctranslate2
Alternatively, if you are interested in the latest development (non-stable) version from this repository, just type:

    pip install git+https://github.com/Softcatala/whisper-ctranslate2
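If you prefer to keep the tool isolated from the rest of your Python packages, here is a minimal sketch of installing it inside a virtual environment (the environment name `whisper-env` is just an example):

    # create and activate an isolated environment (on Windows: whisper-env\Scripts\activate)
    python -m venv whisper-env
    source whisper-env/bin/activate
    # install the latest stable release into that environment
    pip install -U whisper-ctranslate2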
# CPU and GPU support
GPU and CPU support are provided by [CTranslate2](https://github.com/OpenNMT/CTranslate2/).
It is compatible with x86-64 and AArch64/ARM64 CPUs and integrates multiple backends that are optimized for these platforms: Intel MKL, oneDNN, OpenBLAS, Ruy, and Apple Accelerate.

GPU execution requires the NVIDIA libraries cuBLAS 11.x and cuDNN 8.x to be installed on the system. Please refer to the [CTranslate2 documentation](https://opennmt.net/CTranslate2/installation.html) for details.

By default, the best available hardware is selected for inference. You can use the `--device` and `--device_index` options to control the selection manually.
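For example, a hedged sketch of forcing execution on the first CUDA GPU (this assumes a CUDA-capable device and the NVIDIA libraries mentioned above are installed; use `--device cpu` to force CPU execution):

    # run inference on GPU 0
    whisper-ctranslate2 myfile.mp3 --model medium --device cuda --device_index 0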
# Usage
Same command line as OpenAI Whisper.
To transcribe:

    whisper-ctranslate2 inaguracio2011.mp3 --model medium
<img alt="image" src="https://user-images.githubusercontent.com/309265/226923541-8326c575-7f43-4bba-8235-2a4a8bdfb161.png">
To translate:

    whisper-ctranslate2 inaguracio2011.mp3 --model medium --task translate
<img alt="image" src="https://user-images.githubusercontent.com/309265/226923535-b6583536-2486-4127-b17b-c58d85cdb90f.png">
The Whisper translate task translates the transcription from the source language into English (the only supported target language).
Additionally, running:

    whisper-ctranslate2 --help

shows all the supported options with their help text.
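Because the interface mirrors the OpenAI Whisper CLI, the standard Whisper options also apply. A hedged example (the file name and option values are illustrative):

    # transcribe a Catalan recording and write an SRT subtitle file into the out/ directory
    whisper-ctranslate2 myfile.mp3 --model medium --language ca --output_format srt --output_dir out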
# CTranslate2 specific options
On top of the OpenAI Whisper command line options, there are some specific options provided by CTranslate2 or whisper-ctranslate2.
## Batched inference
Batched inference transcribes each segment independently, which can provide an additional 2x-4x speed increase:

    whisper-ctranslate2 inaguracio2011.mp3 --batched True
You can additionally use `--batch_size` to specify the maximum number of parallel requests sent to the model for decoding.

Batched inference uses the Voice Activity Detection (VAD) filter and ignores the following parameters: `compression_ratio_threshold`, `logprob_threshold`, `no_speech_threshold`, `condition_on_previous_text`, `prompt_reset_on_temperature`, `prefix`, and `hallucination_silence_threshold`.
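As a hedged sketch, batched decoding can be combined with an explicit batch size (the value 8 is illustrative and should be tuned to your hardware):

    # batched inference with at most 8 parallel requests to the model
    whisper-ctranslate2 inaguracio2011.mp3 --model medium --batched True --batch_size 8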
## Quantization
The `--compute_type` option indicates the type of [quantization](https://opennmt.net/CTranslate2/quantization.html) to use and accepts the values `default`, `auto`, `int8`, `int8_float16`, `int16`, `float16`, and `float32`. On CPU, `int8` gives the best performance:

    whisper-ctranslate2 myfile.mp3 --compute_type int8
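On a CUDA GPU, `float16` is usually a good choice; a hedged example (assumes a GPU is available):

    # GPU inference with 16-bit floating point weights
    whisper-ctranslate2 myfile.mp3 --device cuda --compute_type float16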
## Loading the model from a directory
The `--model_directory` option allows you to specify the directory from which to load a CTranslate2 Whisper model, for example your own quantized [Whisper model](https://opennmt.net/CTranslate2/conversion.html) or your own [fine-tuned Whisper](https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event) version. The model must be in CTranslate2 format.
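As a hedged sketch, a Hugging Face Whisper checkpoint can be converted with the `ct2-transformers-converter` tool that ships with CTranslate2 and then loaded from the resulting directory (the model name, output path, and quantization below are illustrative; see the conversion documentation linked above for the authoritative options):

    # convert a Hugging Face Whisper model to CTranslate2 format with int8 weights
    ct2-transformers-converter --model openai/whisper-medium --output_dir whisper-medium-ct2 \
        --copy_files tokenizer.json preprocessor_config.json --quantization int8
    # load the converted model from the local directory
    whisper-ctranslate2 myfile.mp3 --model_directory whisper-medium-ct2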
## Using Voice Activity Detection (VAD) filter
The `--vad_filter` option enables voice activity detection (VAD) to filter out parts of the audio without speech. This step uses the [Silero VAD model](https://github.com/snakers4/silero-vad):

    whisper-ctranslate2 myfile.mp3 --vad_filter True
The VAD filter accepts multiple additional options to determine the filter behavior:
    --vad_onset VALUE (float)

Probabilities above this value are considered speech.

    --vad_min_speech_duration_ms VALUE (int)

Final speech chunks shorter than `min_speech_duration_ms` are thrown out.

    --vad_max_speech_duration_s VALUE (int)

Maximum duration of speech chunks in seconds. Longer chunks will be split at the timestamp of the last silence.
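A hedged example combining these options (the values are illustrative, not recommended defaults):

    # keep only detected speech, drop chunks shorter than 250 ms, split chunks longer than 30 s
    whisper-ctranslate2 myfile.mp3 --vad_filter True --vad_onset 0.5 --vad_min_speech_duration_ms 250 --vad_max_speech_duration_s 30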
## Print colors
The `--print_colors True` option prints the transcribed text using an experimental color-coding strategy based on [whisper.cpp](https://github.com/ggerganov/whisper.cpp) to highlight words with high or low confidence:

    whisper-ctranslate2 myfile.mp3 --print_colors True
<img alt="image" src="https://user-images.githubusercontent.com/309265/228054378-48ac6af4-ce4b-44da-b4ec-70ce9f2f2a6c.png">
## Live transcribe from your microphone
The `--live_transcribe True` option activates live transcription mode from your microphone:

    whisper-ctranslate2 --live_transcribe True --language en
https://user-images.githubusercontent.com/309265/231533784-e58c4b92-e9fb-4256-b4cd-12f1864131d9.mov
## Diarization (speaker identification)
There is experimental diarization support using [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) to identify speakers. At the moment, the support is at segment level.
To enable diarization you need to follow these steps:
1. Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) with `pip install pyannote.audio`
2. Accept [`pyannote/segmentation-3.0`](https://hf.co/pyannote/segmentation-3.0) user conditions
3. Accept [`pyannote/speaker-diarization-3.1`](https://hf.co/pyannote/speaker-diarization-3.1) user conditions
4. Create an access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens).
Then run the tool passing the Hugging Face API token as a parameter to enable diarization:

    whisper-ctranslate2 --hf_token YOUR_HF_TOKEN

The speaker name is then added to the output files (e.g. JSON, VTT, and SRT files):
_[SPEAKER_00]: There is a lot of people in this room_
The `--speaker_name SPEAKER_NAME` option allows you to use your own string to identify the speaker.
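A hedged end-to-end example (the audio file name and speaker label are illustrative; `YOUR_HF_TOKEN` remains a placeholder for your own token):

    # transcribe with speaker identification and a custom speaker label in the output files
    whisper-ctranslate2 interview.mp3 --model medium --hf_token YOUR_HF_TOKEN --speaker_name GUEST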
# Need help?
Check our [frequently asked questions](FAQ.md) for common questions.
# Contact
Jordi Mas <jmas@softcatala.org>