scraibe

- **Name**: scraibe
- **Version**: 0.1.1
- **Home page**: https://github.com/JSchmie/ScAIbe
- **Summary**: Transcription tool for audio files based on Whisper and Pyannote
- **Upload time**: 2023-09-22 18:38:58
- **Author**: Jacob Schmieder
- **Requires Python**: >=3.8
- **License**: GPL-3
- **Keywords**: transcription, speech recognition, whisper, pyannote, audio, speech-to-text, speech-to-text transcription, speech-to-text recognition, voice-to-speech
# `ScrAIbe: Streamlined Conversation Recording with Automated Intelligence Based Environment`

`ScrAIbe` is a state-of-the-art, [PyTorch](https://pytorch.org/)-based multilingual speech-to-text framework for generating fully automated transcriptions.

Beyond transcription, ScrAIbe supports advanced functions, such as speaker diarization and speaker recognition.

Designed as a comprehensive AI toolkit, it uses multiple AI models:

- [whisper](https://github.com/openai/whisper): A general-purpose speech recognition model.
- [pyannote-audio](https://github.com/pyannote/pyannote-audio): An open-source toolkit for speaker diarization.

The framework uses a PyanNet-inspired pipeline, with the `Pyannote` library for speaker diarization and a `VoxCeleb`-trained model for speaker embedding.

After diarization, each audio segment is processed by OpenAI's `Whisper` model, which has a transformer encoder-decoder architecture. First, a CNN reduces noise and enhances speech. Before transcription, `VoxLingua` identifies the language of each segment, which enables Whisper to both transcribe and translate the text.

The following graphic illustrates the whole pipeline:

![Pipeline](Pictures/pipeline.png#gh-dark-mode-only) 
![Pipeline](Pictures/pipeline_light.png#gh-light-mode-only) 
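The order of the stages described above can be sketched in plain Python. This is only an illustration of the data flow under stated assumptions: `Segment`, `diarize`, `identify_language`, and `transcribe_segment` are hypothetical stand-ins that return canned values, not ScrAIbe's internal API, and the noise-reduction step is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """A speaker turn as produced by diarization (hypothetical structure)."""
    speaker: str
    start: float  # seconds
    end: float  # seconds


def diarize(audio_path: str) -> List[Segment]:
    # Stand-in for the pyannote diarization stage; returns fixed segments.
    return [Segment("SPEAKER_00", 0.0, 4.2), Segment("SPEAKER_01", 4.2, 9.8)]


def identify_language(segment: Segment) -> str:
    # Stand-in for per-segment language identification (VoxLingua in ScrAIbe).
    return "de"


def transcribe_segment(segment: Segment, language: str) -> str:
    # Stand-in for Whisper; a real model would decode this slice of audio.
    return f"<speech from {segment.start:.1f}s to {segment.end:.1f}s, language={language}>"


def run_pipeline(audio_path: str) -> List[str]:
    """Diarize first, then identify the language and transcribe each segment."""
    lines = []
    for seg in diarize(audio_path):  # 1. who spoke when
        language = identify_language(seg)  # 2. which language
        text = transcribe_segment(seg, language)  # 3. what was said
        lines.append(f"{seg.speaker}: {text}")
    return lines


for line in run_pipeline("audio.wav"):
    print(line)
```

Keeping diarization as the first stage means every transcribed piece of text already carries a speaker label and time span.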

## Install `ScrAIbe`

The following command installs the latest release of `ScrAIbe` from PyPI, along with its Python dependencies.

    pip install scraibe

- **Python version**: Python 3.8
- **PyTorch version**: PyTorch 1.11.0
- **CUDA version**: CUDA Toolkit 11.3.1


Important: To use the `Pyannote` model, you need to be granted access to it on Hugging Face.
Visit the [Pyannote model page](https://huggingface.co/pyannote/speaker-diarization) to request access to the model.

Additionally, you need to generate a [Hugging Face token](https://huggingface.co/docs/hub/security-tokens). 
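To avoid hard-coding the token in scripts, you can read it from an environment variable instead. A minimal sketch, assuming the token is stored in `HF_TOKEN` (the variable name the `huggingface_hub` library also reads); `get_hf_token` is a hypothetical helper, not part of ScrAIbe.

```python
import os


def get_hf_token(var_name: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token stored in the given environment variable."""
    token = os.environ.get(var_name, "").strip()
    if not token:
        # Fail early with a clear message instead of a later authentication error.
        raise RuntimeError(f"Environment variable {var_name} is not set or empty")
    return token
```

With the variable exported (for example `export HF_TOKEN=hf_yourhftoken`), the token can then be passed as `Scraibe(use_auth_token=get_hf_token())`.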

## Usage 

We've developed ScrAIbe with several entry points to suit different user needs.

### Python usage

The Python interface gives you full control over ScrAIbe's functionality and lets you customize the process.

```python
from scraibe import Scraibe

model = Scraibe(use_auth_token = "hf_yourhftoken")

text = model.autotranscribe("audio.wav")

print(f"Transcription: \n{text}")
```
The `Scraibe` class takes care of loading the models. You can choose a different [Whisper model](https://github.com/openai/whisper/blob/main/model-card.md) with the `whisper_model` keyword and a different `pyannote` diarization model with the `dia_model` keyword.


As input, `autotranscribe` accepts any format supported by [FFmpeg](https://ffmpeg.org/ffmpeg-formats.html), for example `.mp4`, `.mp3`, `.wav`, `.ogg`, and `.flac`.
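To transcribe a whole folder of recordings, you can first collect the files whose extensions match the formats listed above and then pass each one to `autotranscribe`. A minimal sketch: `find_audio_files` and the extension set are illustrative assumptions, and FFmpeg can decode many more formats than the five covered here.

```python
from pathlib import Path
from typing import List

# Extensions taken from the examples above; extend the set as needed.
AUDIO_EXTENSIONS = {".mp4", ".mp3", ".wav", ".ogg", ".flac"}


def find_audio_files(folder: str) -> List[Path]:
    """Return all files below `folder` (recursively) with a known audio/video extension."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )
```

Each path can then be transcribed in a loop, e.g. `for path in find_audio_files("recordings"): text = model.autotranscribe(str(path))`.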

To further control the `ScrAIbe` pipeline, you can pass almost any keyword argument that `whisper` or `pyannote` accept. If you need more options, check the documentation of these two frameworks; keywords documented there are likely to work here as well.
Here are some examples for the diarization step (which relies on the `pyannote` pipeline):

- `num_speakers`: Number of speakers in the audio file
- `min_speakers`: Minimum number of speakers in the audio file
- `max_speakers`: Maximum number of speakers in the audio file

The following arguments control the transcription step, which uses the `whisper` model:

- `language`: The language of the audio ([list of supported languages](https://github.com/openai/whisper/blob/main/language-breakdown.svg))
- `task`: Either `transcribe` or `translate`. If `translate` is selected, the transcribed audio is translated into English.
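Because the speaker-count and transcription options interact, it can help to validate them before starting a long run. The helper below is a hypothetical sketch, not part of ScrAIbe: it only checks the keyword names listed above and returns a dictionary that could be unpacked into `autotranscribe(**options)`. Rejecting `num_speakers` together with `min_speakers`/`max_speakers` is a design choice of this sketch, since a fixed count makes the bounds redundant.

```python
from typing import Any, Dict, Optional


def build_options(
    num_speakers: Optional[int] = None,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
    language: Optional[str] = None,
    task: str = "transcribe",
) -> Dict[str, Any]:
    """Validate and collect keyword arguments for a transcription run."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    if num_speakers is not None and (min_speakers is not None or max_speakers is not None):
        raise ValueError("use either num_speakers or min_speakers/max_speakers")
    if min_speakers is not None and max_speakers is not None and min_speakers > max_speakers:
        raise ValueError("min_speakers must not exceed max_speakers")
    counts = (("num_speakers", num_speakers), ("min_speakers", min_speakers), ("max_speakers", max_speakers))
    for name, value in counts:
        if value is not None and value < 1:
            raise ValueError(f"{name} must be at least 1")
    options: Dict[str, Any] = {"task": task}
    # Only forward the optional settings that were actually given.
    for name, value in counts + (("language", language),):
        if value is not None:
            options[name] = value
    return options
```

For example: `model.autotranscribe("audio.wav", **build_options(language="german", num_speakers=2))`.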

For example:

```python
text = model.autotranscribe("audio.wav", language="german", num_speakers=2)
```

`Scraibe` can also run only the transcription step:
```python
transcription = model.transcribe("audio.wav")
``` 
or only the diarization step:

```python
diarization = model.diarize("audio.wav")
```
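Diarization output is typically a sequence of time-stamped segments, each labelled with a speaker, and a common post-processing step is to merge consecutive segments from the same speaker into one turn. The sketch below assumes a plain list of `(speaker, start, end)` tuples; this is a hypothetical format for illustration, and the object returned by `model.diarize` may be structured differently.

```python
from typing import List, Tuple

SpeakerSegment = Tuple[str, float, float]  # (speaker label, start seconds, end seconds)


def merge_turns(segments: List[SpeakerSegment], max_gap: float = 0.5) -> List[SpeakerSegment]:
    """Merge consecutive segments of the same speaker separated by at most `max_gap` seconds."""
    merged: List[SpeakerSegment] = []
    for speaker, start, end in sorted(segments, key=lambda s: s[1]):
        if merged and merged[-1][0] == speaker and start - merged[-1][2] <= max_gap:
            # Same speaker after a short pause: extend the previous turn.
            prev_speaker, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_speaker, prev_start, max(prev_end, end))
        else:
            merged.append((speaker, start, end))
    return merged
```

`max_gap` trades off merging across short pauses against splitting genuinely separate turns; 0.5 s is an arbitrary default for this sketch.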

### Command-line usage

In addition to the Python interface, you can also run ScrAIbe from the command line:

    scraibe -f "audio.wav" --hf-token "hf_yourhftoken" --language "german" --num_speakers 2

For the full list of options, run:

    scraibe -h
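To call the command-line interface from another Python program (for example, in a batch job), you can build the argument list and run it with `subprocess`. A minimal sketch: the flag names are copied from the command shown above, `build_scraibe_command` and `run_scraibe` are hypothetical helpers, and `scraibe -h` remains the authoritative list of options.

```python
import subprocess
from typing import List, Optional


def build_scraibe_command(
    audio_path: str,
    hf_token: str,
    language: Optional[str] = None,
    num_speakers: Optional[int] = None,
) -> List[str]:
    """Assemble the scraibe CLI call as an argument list (no shell quoting needed)."""
    cmd = ["scraibe", "-f", audio_path, "--hf-token", hf_token]
    if language is not None:
        cmd += ["--language", language]
    if num_speakers is not None:
        cmd += ["--num_speakers", str(num_speakers)]
    return cmd


def run_scraibe(cmd: List[str]) -> str:
    """Run the CLI and return its standard output; raises CalledProcessError on failure."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Passing a list instead of a single string avoids shell-quoting problems with file names that contain spaces.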

### Gradio App

The Gradio App is a user-friendly web interface for ScrAIbe that lets you use the model without writing any code. You can run the app on your local machine and use it in your browser to upload audio files, or make it available to others on your network.

#### Running the Gradio App on your local machine

To run the Gradio App on your local machine, just use the following command:

```
scraibe --start_server --port 7860 --hf_token hf_yourhftoken
```

- `--start_server`: Starts the Gradio App.
- `--port`: Port on which the app is served.
- `--hf_token`: Your personal Hugging Face token.

Once the app is running, it prints the address at which you can access it.
The default address is http://127.0.0.1:7860 (or http://0.0.0.0:7860).
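If you start the server from a script, you may want to wait until the port accepts connections before opening the browser or sending requests. A minimal sketch using only the standard library; `wait_for_port` is a hypothetical helper, and its host and port default to the address shown above.

```python
import socket
import time


def wait_for_port(host: str = "127.0.0.1", port: int = 7860, timeout: float = 60.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # server not up yet; retry shortly
    return False
```

This only checks that something is listening on the port; it does not verify that the Gradio page itself has finished loading.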

After the app is running, you can upload your audio file and select the desired options.
An example is shown below:

![Gradio App](Pictures/gradio_app.png)


### Running a Docker container

You can also run ScrAIbe in a Docker container. This is especially useful if you want to run the model on a server or use the GPU without installing CUDA yourself.
After installing Docker, run the following commands in a terminal.

First, build the Docker image, passing your Hugging Face token as a build argument and choosing an image name (here `scraibe`):

```
docker build . --build-arg="hf_token=[enter your HuggingFace token]" -t scraibe
```

After the image is built, you can run the container with the following command:

```
sudo docker run -it -p 7860:7860 --name [container name] [image name] --hf_token [enter your HuggingFace token] --start_server
```
- `-p`: Maps the container's internal port to a port on your local machine.
- `--hf_token`: Passes your personal Hugging Face token to the container.
- `--start_server`: Starts the Gradio App.

Inside the container, ScrAIbe runs through its command-line interface, so the arguments after the image name accept the same options as the `scraibe` command.

#### Enabling GPU usage

To use the GPU, your Docker installation must support GPU access.
For details, see the NVIDIA Container Toolkit installation guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker
Then add the `--gpus` flag to the `docker run` command:

```
docker run -it -p 7860:7860 --gpus 'all,capabilities=utility' --name [container name] [image name] --hf_token [enter your HuggingFace token] --start_server
```

For further guidance, check: https://blog.roboflow.com/use-the-gpu-in-docker/ 

## Documentation 

For further insights, check the [documentation page]().

## Contributions

We welcome contributions and feedback. To share feedback, open an issue or feel free to contact us directly.

## Roadmap

The following milestones are planned for further releases of ScrAIbe:

- Model quantization  
Quantization to improve memory and computational efficiency.

- Model fine-tuning  
Fine-tuning to cover a wider variety of linguistic phenomena.
For example, ScrAIbe currently transcribes speech word by word but ignores filler words and speech pauses.
These phenomena could be addressed by fine-tuning on suitable data.

- Implementation of LLMs  
For example, a summarization or extraction model would enable ScrAIbe to automatically summarize a transcription or extract its key information, for instance to produce the minutes of a meeting.

- Executable for Windows

## Contact

For queries, contact [Jacob Schmieder](mailto:Jacob.Schmieder@dbfz.de).

## License 

ScrAIbe is licensed under the GNU General Public License v3 (GPL-3).

## Acknowledgments

Special thanks go to the KIDA project and the BMEL (Bundesministerium für Ernährung und Landwirtschaft), especially to the AI Consultancy Team.

![KIDA](Pictures/kida_dark.png#gh-dark-mode-only)         ![BMEL](Pictures/BMEL_dark.png#gh-dark-mode-only)      ![DBFZ](Pictures/DBFZ_dark.png#gh-dark-mode-only)             ![MRI](Pictures/MRI.png#gh-dark-mode-only)   

![KIDA](Pictures/kida.png#gh-light-mode-only)         ![BMEL](Pictures/BMEL.jpg#gh-light-mode-only)      ![DBFZ](Pictures/DBFZ.png#gh-light-mode-only)             ![MRI](Pictures/MRI.png#gh-light-mode-only)  

            
