# BanglaSpeech2Text (Bangla Speech to Text)
BanglaSpeech2Text is an open-source, offline speech-to-text package for the Bangla language, fine-tuned on the latest Whisper speech-to-text models for optimal performance. Transcribe speech to text, convert voice to text, and perform speech recognition in Python with ease, even without an internet connection.
## [Models](https://github.com/shhossain/BanglaSpeech2Text/blob/main/banglaspeech2text/utils/listed_models.json)
| Model | Size | Best WER |
| ------- | ---------- | --------- |
| `tiny` |100-200 MB | 74 |
| `base` |200-300 MB | 46 |
| `small` |1 GB | 18 |
| `large` |3-4 GB | 11 |
**NOTE**: Bigger models have better accuracy but slower inference. More models are available on the [HuggingFace Model Hub](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&language=bn&sort=likes).
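For reference, WER (word error rate) is the word-level edit distance between the model's output and the reference transcript, divided by the number of reference words. A minimal sketch of the metric (illustrative only, not the package's own evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the dog sat"))  # one substitution over three words
```

The table's numbers are percentages, so a WER of 11 means roughly one word in nine is wrong.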
## Pre-requisites
- Python 3.7 or higher
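A quick way to confirm your interpreter meets this requirement:

```python
import sys

# Raises if the running interpreter is older than 3.7.
assert sys.version_info >= (3, 7), "BanglaSpeech2Text requires Python 3.7+"
print("Python version OK:", sys.version.split()[0])
```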
## Test it in Google Colab
- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/shhossain/BanglaSpeech2Text/blob/main/banglaspeech2text_in_colab.ipynb)
## Installation
You can install the library using pip:
```bash
pip install banglaspeech2text
```
## Usage
### Model Initialization
To use the library, initialize the `Speech2Text` class with the desired model. By default it uses the `base` model, but you can choose from the pre-trained models `tiny`, `base`, `small`, `medium`, or `large`. Here's an example:
```python
from banglaspeech2text import Speech2Text
stt = Speech2Text(model="base")
# You can also omit the model name (the default model is "base")
stt = Speech2Text()
```
### Transcribing Audio Files
You can transcribe an audio file by calling the `recognize` method and passing the path to the audio file. It will return the transcribed text as a string. Here's an example:
```python
transcription = stt.recognize("audio.wav")
print(transcription)
```
### For longer audio files
Each model can only process a limited length of audio in one pass, so for longer files use the `generate` method or `recognize` with `split=True`. Here's an example:
```python
for text in stt.generate("audio.wav"):  # generates text as the chunks are processed
    print(text)

# or
texts = stt.recognize("audio.wav", split=True)  # probably faster than the generate method
for text in texts:
    print(text)

# or

# you can pass min_silence_length and silence_threshold to split_on_silence
text = stt.recognize("audio.wav", split=True, min_silence_length=1000, silence_threshold=-16)
print(text)
```
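To make `min_silence_length` and `silence_threshold` concrete: pydub's `split_on_silence` cuts the audio wherever a sufficiently long run of sufficiently quiet samples occurs. The following is only an illustrative sketch over raw amplitude values (pydub actually works in dBFS over `AudioSegment` objects):

```python
def split_on_silence_sketch(samples, min_silence_len, silence_thresh):
    """Split a list of amplitude values into chunks at runs of quiet samples.

    min_silence_len: minimum run of silent samples that triggers a split.
    silence_thresh: absolute amplitudes at or below this count as silence.
    (Illustrative only; pydub measures silence in dBFS, not raw amplitude.)
    """
    chunks, current, quiet_run = [], [], 0
    for s in samples:
        if abs(s) <= silence_thresh:
            quiet_run += 1
            current.append(s)
        else:
            if quiet_run >= min_silence_len and current:
                # A long-enough quiet run just ended: close the chunk here.
                chunks.append(current)
                current = []
            quiet_run = 0
            current.append(s)
    if current:
        chunks.append(current)
    return chunks

# Two bursts of speech separated by three silent samples -> two chunks.
chunks = split_on_silence_sketch([5, 5, 0, 0, 0, 5, 5], min_silence_len=2, silence_thresh=1)
print(len(chunks))  # 2
```

Each chunk is then transcribed separately, which is how long recordings fit within the model's per-pass limit.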
## Multiple Audio Formats
BanglaSpeech2Text supports the following audio formats for input:
- File Formats: WAV, MP3, FLAC, and all formats supported by FFmpeg.
- Bytes: raw audio data as a `bytes` object.
- NumPy Array: a NumPy array of audio data, preferably obtained via `librosa.load`.
- AudioData: audio captured with the `speech_recognition` library.
- AudioSegment: audio segment objects from the `pydub` library.
- BytesIO: audio data wrapped in a `BytesIO` object from the `io` module.
- Path: a `pathlib.Path` object pointing to an audio file.
Here's an example:
```python
transcription = stt.recognize("audio.mp3")
print(transcription)
```
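Several of these input types can be prepared with just the standard library. A sketch (the `stt.recognize` call is commented out and assumes a model is already loaded):

```python
import io
import struct
import wave
from pathlib import Path

# Build a short silent mono WAV (16 kHz, 16-bit) in memory for illustration.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<1600h", *([0] * 1600)))  # 0.1 s of silence

raw_bytes = buf.getvalue()        # bytes input
stream = io.BytesIO(raw_bytes)    # BytesIO input
path = Path("audio.wav")          # Path input (file assumed to exist)

# Any of these could then be passed to the recognizer, e.g.:
# transcription = stt.recognize(stream)
```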
### Use with SpeechRecognition
You can use the [SpeechRecognition](https://pypi.org/project/SpeechRecognition/) package to capture audio from a microphone and transcribe it. Here's an example:
```python
import speech_recognition as sr
from banglaspeech2text import Speech2Text
stt = Speech2Text(model="base")
r = sr.Recognizer()
with sr.Microphone() as source:
    print("Say something!")
    r.adjust_for_ambient_noise(source)
    audio = r.listen(source)
    output = stt.recognize(audio)

print(output)
```
### Use CPU
By default, a GPU is used for faster inference when one is available. To force CPU inference, pass `use_gpu=False`:
```python
stt = Speech2Text(model="base", use_gpu=False)
```
### Advanced GPU Usage
For more advanced GPU usage, you can use the `device` or `device_map` parameter. Here's an example:
```python
stt = Speech2Text(model="base", device="cuda:0")
```

```python
stt = Speech2Text(model="base", device_map="auto")
```
**NOTE**: Read more about [PyTorch devices](https://pytorch.org/docs/stable/tensor_attributes.html#torch.torch.device).
### Instantly Check with Gradio
You can quickly try the model with Gradio. Here's an example:
```python
from banglaspeech2text import Speech2Text
import gradio as gr

stt = Speech2Text(model="base", use_gpu=True)

# You can also open the shared URL on a mobile device
gr.Interface(
    fn=stt.recognize,
    inputs=gr.Audio(source="microphone", type="filepath"),
    outputs="text",
).launch(share=True)
```
## Some more usage examples
### Use a Hugging Face Model
```python
stt = Speech2Text(model="openai/whisper-tiny")
```
### Change the Model Save Location
```python
import os
os.environ["BANGLASPEECH2TEXT_CACHE"] = "/path/to/cache"
stt = Speech2Text(model="base")
```
### See current model info
```python
stt = Speech2Text(model="base")
print(stt.model_name) # the name of the model
print(stt.model_size) # the size of the model
print(stt.model_license) # the license of the model
print(stt.model_description) # the description of the model (in Markdown format)
print(stt.model_url) # the url of the model
print(stt.model_wer) # word error rate of the model
```
### CLI
You can use the library from the command line. Here's an example:
```bash
bnstt 'file.wav'
```
You can also use it with a microphone:
```bash
bnstt --mic
```
Other options:
```bash
usage: bnstt [-h] [-gpu] [-c CACHE] [-o OUTPUT] [-m MODEL] [-s]
             [-sm MIN_SILENCE_LENGTH] [-st SILENCE_THRESH] [-sp PADDING]
             [--list] [--info]
             [INPUT ...]

Bangla Speech to Text

positional arguments:
  INPUT                 input file(s) or list of files

options:
  -h, --help            show this help message and exit
  -gpu                  use gpu
  -c CACHE, --cache CACHE
                        cache directory
  -o OUTPUT, --output OUTPUT
                        output directory
  -m MODEL, --model MODEL
                        model name
  -s, --split           split audio file using pydub split_on_silence
  -sm MIN_SILENCE_LENGTH, --min_silence_length MIN_SILENCE_LENGTH
                        minimum length of silence to split on (in ms)
  -st SILENCE_THRESH, --silence_thresh SILENCE_THRESH
                        dBFS below reference to be considered silence
  -sp PADDING, --padding PADDING
                        padding to add to beginning and end of each split (in ms)
  --list                list available models
  --info                show model info
```
## Custom Use Cases and Support
If your business or project has specific speech-to-text requirements that go beyond the capabilities of the provided open-source package, I'm here to help! I understand that each use case is unique, and I'm open to collaborating on custom solutions that meet your needs. Whether you have longer audio files that need accurate transcription, require model fine-tuning, or need assistance in implementing the package effectively, I'm available for support.