# Shunya Labs
[](https://pypi.org/project/shunyalabs/)
[](https://www.python.org/downloads/)
[](LICENSE)
[](https://shunyalabs.ai)
[](https://huggingface.co/shunyalabs)
A professional speech transcription package by **Shunya Labs AI** supporting **ct2 (CTranslate2)** and **transformers** backends. Get superior transcription quality with unified API and advanced features.
## Overview
Shunya Labs provides a unified interface for transcribing audio files using state-of-the-art backends optimized by Shunya Labs. Whether you want the high-performance CTranslate2 optimization or the flexibility of Hugging Face transformers, Shunya Labs delivers exceptional results with the `shunyalabs/pingala-v1-en-verbatim` model.
## Features
- **Shunya Labs Optimized**: Built by Shunya Labs for superior performance
- **CT2 Backend**: High-performance CTranslate2 optimization (default)
- **Transformers Backend**: Hugging Face models and latest research
- **Auto-Detection**: Automatically selects the best backend for your model
- **Unified API**: Same interface across all backends
- **Word-Level Timestamps**: Precise timing for individual words
- **Confidence Scores**: Quality metrics for transcription segments and words
- **Voice Activity Detection (VAD)**: Filter out silence and background noise
- **Language Detection**: Automatic language identification
- **Multiple Output Formats**: Text, SRT subtitles, and WebVTT
- **Streaming Support**: Process segments as they are generated
- **Advanced Parameters**: Full control over all backend features
- **Rich CLI**: Command-line tool with comprehensive options
- **Error Handling**: Comprehensive error handling and validation
## Installation
### Standard Installation (All Backends Included)
```bash
pip install shunyalabs
```
This installs all dependencies including:
- **faster-whisper** ≥ 0.10.0 (CT2 backend)
- **transformers** == 4.52.4 (Transformers backend)
- **ctranslate2** == 4.4.0 (GPU acceleration)
- **librosa** ≥ 0.10.0 (Audio processing)
- **torch** ≥ 1.9.0 & **torchaudio** ≥ 0.9.0 (PyTorch)
- **datasets** ≥ 2.0.0 & **numpy** ≥ 1.21.0
### Development Installation
```bash
# Complete installation with development tools
pip install "shunyalabs[complete]"
```
This adds development tools: `pytest`, `black`, `flake8`, `mypy`
### Requirements
- Python 3.8 or higher
- CUDA-compatible GPU (recommended for optimal performance)
- PyTorch and torchaudio
Unlike other transcription tools, FFmpeg does **not** need to be installed on the system. The audio is decoded with the Python library [PyAV](https://github.com/PyAV-Org/PyAV) which bundles the FFmpeg libraries in its package.
### GPU Support and CUDA Installation
#### GPU Requirements
GPU execution requires the following NVIDIA libraries to be installed:
* [cuBLAS for CUDA 12](https://developer.nvidia.com/cublas)
* [cuDNN 9 for CUDA 12](https://developer.nvidia.com/cudnn)
**Important**: The latest versions of `ctranslate2` only support CUDA 12 and cuDNN 9. For CUDA 11 and cuDNN 8, downgrade to the `3.24.0` version of `ctranslate2`. For CUDA 12 and cuDNN 8, use `ctranslate2==4.4.0` (already included in shunyalabs):
```bash
pip install --force-reinstall ctranslate2==4.4.0
```
#### CUDA Installation Methods
<details>
<summary>Method 1: Docker (Recommended)</summary>
The easiest way is to use the official NVIDIA CUDA Docker image:
```bash
docker run --gpus all -it nvidia/cuda:12.3.2-cudnn9-runtime-ubuntu22.04
pip install shunyalabs
```
</details>
<details>
<summary>Method 2: pip Installation (Linux only)</summary>
Install CUDA libraries via pip:
```bash
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*
export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
```
</details>
<details>
<summary>Method 3: Manual Download (Windows & Linux)</summary>
Download pre-built CUDA libraries from [Purfview's repository](https://github.com/Purfview/whisper-standalone-win/releases/tag/libs). Extract and add to your system PATH.
</details>
### Performance Benchmarks
Based on faster-whisper benchmarks transcribing 13 minutes of audio:
#### GPU Performance (NVIDIA RTX 3070 Ti 8GB)
| Backend | Precision | Beam size | Time | VRAM Usage |
| --- | --- | --- | --- | --- |
| transformers (SDPA) | fp16 | 5 | 1m52s | 4960MB |
| **faster-whisper (CT2)** | fp16 | 5 | **1m03s** | 4525MB |
| **faster-whisper (CT2)** (`batch_size=8`) | fp16 | 5 | **17s** | 6090MB |
| **faster-whisper (CT2)** | int8 | 5 | 59s | 2926MB |
#### CPU Performance (Intel Core i7-12700K, 8 threads)
| Backend | Precision | Beam size | Time | RAM Usage |
| --- | --- | --- | --- | --- |
| **faster-whisper (CT2)** | fp32 | 5 | 2m37s | 2257MB |
| **faster-whisper (CT2)** (`batch_size=8`) | fp32 | 5 | **1m06s** | 4230MB |
| **faster-whisper (CT2)** | int8 | 5 | 1m42s | 1477MB |
*Shunya Labs delivers superior performance with optimized CTranslate2 backend and efficient memory usage.*
## Supported Backends
### ct2 (CTranslate2) - Default
- **Performance**: Fastest inference with CTranslate2 optimization
- **Features**: Full parameter control, VAD, streaming, GPU acceleration
- **Models**: All compatible models, optimized for Shunya Labs models
- **Best for**: Production use, real-time applications
### transformers
- **Performance**: Good performance with Hugging Face ecosystem
- **Features**: Access to latest models, easy fine-tuning integration
- **Models**: Any Seq2Seq model on Hugging Face Hub
- **Best for**: Research, latest models, custom transformer models
## Supported Models
### Default Model
- `shunyalabs/pingala-v1-en-verbatim` - High-quality English transcription model by Shunya Labs
### Shunya Labs Models
- `shunyalabs/pingala-v1-en-verbatim` - Optimized for English verbatim transcription
- More Shunya Labs models coming soon!
### Custom Models (Advanced Users)
- Any Hugging Face Seq2Seq model compatible with automatic-speech-recognition pipeline
- Local model paths supported
### Local Models
- `/path/to/local/model` - Local model directory or file
## Quick Start
### Basic Usage with Auto-Detection
```python
from shunyalabs import ShunyaTranscriber
# Initialize with default Shunya Labs model and auto-detected backend
transcriber = ShunyaTranscriber()
# Simple transcription
segments = transcriber.transcribe_file_simple("audio.wav")
for segment in segments:
print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```
### Backend Selection
```python
from shunyalabs import ShunyaTranscriber
# Explicitly choose backends with Shunya Labs model
transcriber_ct2 = ShunyaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim", backend="ct2")
transcriber_tf = ShunyaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim", backend="transformers")
# Auto-detection (recommended)
transcriber_auto = ShunyaTranscriber() # Uses default Shunya Labs model with ct2
```
### Advanced Usage with All Features
```python
from shunyalabs import ShunyaTranscriber
# Initialize with specific backend and settings
transcriber = ShunyaTranscriber(
model_name="shunyalabs/pingala-v1-en-verbatim",
backend="ct2",
device="cuda",
compute_type="float16"
)
# Advanced transcription with full metadata
segments, info = transcriber.transcribe_file(
"audio.wav",
beam_size=10, # Higher beam size for better accuracy
word_timestamps=True, # Enable word-level timestamps
temperature=0.0, # Deterministic output
compression_ratio_threshold=2.4, # Filter out low-quality segments
log_prob_threshold=-1.0, # Filter by probability
no_speech_threshold=0.6, # Silence detection threshold
initial_prompt="High quality audio recording", # Guide the model
hotwords="Python, machine learning, AI", # Boost specific words
vad_filter=True, # Enable voice activity detection
task="transcribe" # or "translate" for translation
)
# Print transcription info
model_info = transcriber.get_model_info()
print(f"Backend: {model_info['backend']}")
print(f"Model: {model_info['model_name']}")
print(f"Language: {info.language} (confidence: {info.language_probability:.3f})")
print(f"Duration: {info.duration:.2f} seconds")
# Process segments with all metadata
for segment in segments:
print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
if segment.confidence:
print(f"Confidence: {segment.confidence:.3f}")
# Word-level details
for word in segment.words:
print(f" '{word.word}' [{word.start:.2f}-{word.end:.2f}s] (conf: {word.probability:.3f})")
```
### Using Transformers Backend
```python
# Use Shunya Labs model with transformers backend
transcriber = ShunyaTranscriber(
model_name="shunyalabs/pingala-v1-en-verbatim",
backend="transformers"
)
segments = transcriber.transcribe_file_simple("audio.wav")
# Auto-detection will use ct2 by default for Shunya Labs models
transcriber = ShunyaTranscriber() # Uses ct2 backend (recommended)
```
## Command-Line Interface
The package includes a comprehensive CLI supporting both backends:
### Basic CLI Usage
```bash
# Basic transcription with auto-detected backend
shunyalabs audio.wav
# Specify backend explicitly
shunyalabs audio.wav --backend ct2
shunyalabs audio.wav --backend transformers
# Use Shunya Labs model with different backends
shunyalabs audio.wav --model shunyalabs/pingala-v1-en-verbatim --backend ct2
shunyalabs audio.wav --model shunyalabs/pingala-v1-en-verbatim --backend transformers
# Save to file
shunyalabs audio.wav --model shunyalabs/pingala-v1-en-verbatim -o transcript.txt
# Use CPU for processing
shunyalabs audio.wav --device cpu
```
### Advanced CLI Features
```bash
# Word-level timestamps with confidence scores (ct2)
shunyalabs audio.wav --model shunyalabs/pingala-v1-en-verbatim --word-timestamps --show-confidence --show-words
# Voice Activity Detection (ct2 only)
shunyalabs audio.wav --model shunyalabs/pingala-v1-en-verbatim --vad --verbose
# Language detection with different backends
shunyalabs audio.wav --model shunyalabs/pingala-v1-en-verbatim --detect-language --backend ct2
# SRT subtitles with word-level timing
shunyalabs audio.wav --model shunyalabs/pingala-v1-en-verbatim --format srt --word-timestamps -o subtitles.srt
# Transformers backend with Shunya Labs model
shunyalabs audio.wav --model shunyalabs/pingala-v1-en-verbatim --backend transformers --verbose
# Advanced parameters (ct2)
shunyalabs audio.wav --model shunyalabs/pingala-v1-en-verbatim \
--beam-size 10 \
--temperature 0.2 \
--compression-ratio-threshold 2.4 \
--log-prob-threshold -1.0 \
--initial-prompt "This is a technical presentation" \
--hotwords "Python,AI,machine learning"
```
### CLI Options Reference
| Option | Description | Backends | Default |
|--------|-------------|----------|---------|
| `--model` | Model name or path | All | shunyalabs/pingala-v1-en-verbatim |
| `--backend` | Backend selection | All | auto-detect |
| `--device` | Device: cuda, cpu, auto | All | cuda |
| `--compute-type` | Precision: float16, float32, int8 | All | float16 |
| `--beam-size` | Beam size for decoding | All | 5 |
| `--language` | Language code (e.g., 'en') | All | auto-detect |
| `--word-timestamps` | Enable word-level timestamps | ct2 | False |
| `--show-confidence` | Show confidence scores | All | False |
| `--show-words` | Show word-level details | All | False |
| `--vad` | Enable VAD filtering | ct2 | False |
| `--detect-language` | Language detection only | All | False |
| `--format` | Output format: text, srt, vtt | All | text |
| `--temperature` | Sampling temperature | All | 0.0 |
| `--compression-ratio-threshold` | Compression ratio filter | ct2 | 2.4 |
| `--log-prob-threshold` | Log probability filter | ct2 | -1.0 |
| `--no-speech-threshold` | No speech threshold | All | 0.6 |
| `--initial-prompt` | Initial prompt text | All | None |
| `--hotwords` | Hotwords to boost | ct2 | None |
| `--task` | Task: transcribe, translate | All | transcribe |
## Backend Comparison
| Feature | ct2 | transformers |
|---------|-----|--------------|
| **Performance** | Fastest | Good |
| **GPU Acceleration** | Optimized | Standard |
| **Memory Usage** | Lowest | Moderate |
| **Model Support** | Any model | Any HF model |
| **Word Timestamps** | Full support | Limited |
| **VAD Filtering** | Built-in | No |
| **Streaming** | True streaming | Batch only |
| **Advanced Params** | All features | Basic |
| **Latest Models** | Updated | Latest |
| **Custom Models** | CTranslate2 | Any format |
### Recommendations
- **Production/Performance**: Use `ct2` with Shunya Labs models
- **Latest Research Models**: Use `transformers`
- **Real-time Applications**: Use `ct2` with VAD
- **Custom Transformer Models**: Use `transformers`
## Performance Optimization
### Backend Selection Tips
```python
# Real-time/Production: Use ct2 with Shunya Labs model
transcriber = ShunyaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim", backend="ct2")
# Maximum accuracy: Use Shunya Labs model with ct2
transcriber = ShunyaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim", backend="ct2")
# Alternative backend: Use transformers with Shunya Labs model
transcriber = ShunyaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim", backend="transformers")
# Research/Latest models: Use transformers backend
transcriber = ShunyaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim", backend="transformers")
```
### Hardware Recommendations
| Use Case | Model | Backend | Hardware |
|----------|-------|---------|----------|
| Real-time | shunyalabs/pingala-v1-en-verbatim | ct2 | GPU 4GB+ |
| Production | shunyalabs/pingala-v1-en-verbatim | ct2 | GPU 6GB+ |
| Maximum Quality | shunyalabs/pingala-v1-en-verbatim | ct2 | GPU 8GB+ |
| Alternative | shunyalabs/pingala-v1-en-verbatim | transformers | GPU 4GB+ |
| CPU-only | shunyalabs/pingala-v1-en-verbatim | any | 8GB+ RAM |
### GPU Optimization
```python
# Maximum performance on GPU - FP16 precision
transcriber = ShunyaTranscriber(
model_name="shunyalabs/pingala-v1-en-verbatim",
device="cuda",
compute_type="float16" # Fastest GPU performance
)
# Memory constrained GPU - INT8 quantization
transcriber = ShunyaTranscriber(
model_name="shunyalabs/pingala-v1-en-verbatim",
device="cuda",
compute_type="int8_float16" # Lower memory usage
)
# Batched processing for multiple files
segments, info = transcriber.transcribe_file(
"audio.wav",
batch_size=8, # Process multiple segments in parallel
beam_size=5
)
```
### CPU Optimization
```python
# Optimized CPU settings
transcriber = ShunyaTranscriber(
model_name="shunyalabs/pingala-v1-en-verbatim",
device="cpu",
compute_type="int8" # Lower memory, faster on CPU
)
# Control CPU threads for best performance
import os
os.environ["OMP_NUM_THREADS"] = "4" # Adjust based on your CPU
```
### Memory Optimization Tips
- **GPU VRAM**: Use `int8_float16` compute type to reduce memory usage by ~40%
- **System RAM**: Use `int8` compute type on CPU to reduce memory usage
- **Batch Size**: Increase batch size if you have sufficient memory for faster processing
- **Model Size**: Consider smaller models for memory-constrained environments
### Performance Comparison Tips
When comparing against other implementations:
- Use same beam size (default is 5 in shunyalabs)
- Compare with similar Word Error Rate (WER)
- Set consistent thread count: `OMP_NUM_THREADS=4 python script.py`
- Ensure similar transcription quality metrics
## Troubleshooting
### Common CUDA Issues
**Issue**: `RuntimeError: No CUDA capable device found`
```bash
# Check CUDA availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
# If False, install CUDA toolkit or use CPU
transcriber = ShunyaTranscriber(device="cpu")
```
**Issue**: `CUDA out of memory`
```python
# Solution 1: Use INT8 quantization
transcriber = ShunyaTranscriber(compute_type="int8_float16")
# Solution 2: Reduce batch size
segments, info = transcriber.transcribe_file("audio.wav", batch_size=1)
# Solution 3: Use CPU
transcriber = ShunyaTranscriber(device="cpu")
```
**Issue**: `cuDNN/cuBLAS library not found`
```bash
# Install CUDA libraries via pip (Linux)
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*
# Or use Docker
docker run --gpus all -it nvidia/cuda:12.3.2-cudnn9-runtime-ubuntu22.04
```
**Issue**: `ctranslate2` version compatibility
```bash
# For CUDA 12 + cuDNN 8 (default in shunyalabs)
pip install ctranslate2==4.4.0
# For CUDA 11 + cuDNN 8
pip install ctranslate2==3.24.0
```
### Model Loading Issues
**Issue**: Model download fails
```python
# Use local model path
transcriber = ShunyaTranscriber(model_name="/path/to/local/model")
# Or specify different Shunya Labs model
transcriber = ShunyaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim")
```
**Issue**: Alignment heads error (word timestamps)
```python
# This is handled automatically with fallback to no word timestamps
# Word timestamps are supported with Shunya Labs models
transcriber = ShunyaTranscriber(model_name="shunyalabs/pingala-v1-en-verbatim")
segments, info = transcriber.transcribe_file("audio.wav", word_timestamps=True)
```
## Examples
See `example.py` for comprehensive examples:
```bash
# Run with default backend (auto-detected)
python example.py audio.wav
# Test specific backends with Shunya Labs model
python example.py audio.wav --backend ct2
python example.py audio.wav --backend transformers
# Test Shunya Labs model with different backends
python example.py audio.wav shunyalabs/pingala-v1-en-verbatim
```
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- Built by [Shunya Labs](https://shunyalabs.ai) for superior transcription quality
- Powered by CTranslate2 for optimized inference
- Supports [Hugging Face transformers](https://github.com/huggingface/transformers)
- Uses the Pingala model from [Shunya Labs](https://shunyalabs.ai)
## About Shunya Labs
Visit [Shunya Labs](https://shunyalabs.ai) to learn more about our AI research and products.
Contact us at [0@shunyalabs.ai](mailto:0@shunyalabs.ai) for questions or collaboration opportunities.
Raw data
{
"_id": null,
"home_page": null,
"name": "shunyalabs",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "speech-to-text, transcription, audio, ai, shunyalabs, ct2, transformers, enterprise",
"author": null,
"author_email": "Shunya Labs <0@shunyalabs.ai>",
"download_url": "https://files.pythonhosted.org/packages/ce/c4/f7fb75ce33f5319f42fba9e130abe02883563e0e3aa53f6371041183e208/shunyalabs-1.0.2.tar.gz",
"platform": null,
"description": "# Shunya Labs\n\n[](https://pypi.org/project/shunyalabs/)\n[](https://www.python.org/downloads/)\n[](LICENSE)\n[](https://shunyalabs.ai)\n[](https://huggingface.co/shunyalabs)\n\nA professional speech transcription package by **Shunya Labs AI** supporting **ct2 (CTranslate2)** and **transformers** backends. Get superior transcription quality with unified API and advanced features.\n\n## Overview\n\nShunya Labs provides a unified interface for transcribing audio files using state-of-the-art backends optimized by Shunya Labs. Whether you want the high-performance CTranslate2 optimization or the flexibility of Hugging Face transformers, Shunya Labs delivers exceptional results with the `shunyalabs/pingala-v1-en-verbatim` model.\n\n## Features\n\n- **Shunya Labs Optimized**: Built by Shunya Labs for superior performance\n- **CT2 Backend**: High-performance CTranslate2 optimization (default)\n- **Transformers Backend**: Hugging Face models and latest research\n- **Auto-Detection**: Automatically selects the best backend for your model\n- **Unified API**: Same interface across all backends\n- **Word-Level Timestamps**: Precise timing for individual words\n- **Confidence Scores**: Quality metrics for transcription segments and words\n- **Voice Activity Detection (VAD)**: Filter out silence and background noise\n- **Language Detection**: Automatic language identification\n- **Multiple Output Formats**: Text, SRT subtitles, and WebVTT\n- **Streaming Support**: Process segments as they are generated\n- **Advanced Parameters**: Full control over all backend features\n- **Rich CLI**: Command-line tool with comprehensive options\n- **Error Handling**: Comprehensive error handling and validation\n\n## Installation\n\n### Standard Installation (All Backends Included)\n```bash\npip install shunyalabs\n```\n\nThis installs all dependencies including:\n- **faster-whisper** \u2265 0.10.0 (CT2 backend)\n- **transformers** == 4.52.4 (Transformers backend)\n- **ctranslate2** == 4.4.0 (GPU acceleration)\n- **librosa** \u2265 0.10.0 (Audio processing)\n- **torch** \u2265 1.9.0 & **torchaudio** \u2265 0.9.0 (PyTorch)\n- **datasets** \u2265 2.0.0 & **numpy** \u2265 1.21.0\n\n### Development Installation\n\n```bash\n# Complete installation with development tools\npip install \"shunyalabs[complete]\"\n```\n\nThis adds development tools: `pytest`, `black`, `flake8`, `mypy`\n\n### Requirements\n\n- Python 3.8 or higher\n- CUDA-compatible GPU (recommended for optimal performance)\n- PyTorch and torchaudio\n\nUnlike other transcription tools, FFmpeg does **not** need to be installed on the system. The audio is decoded with the Python library [PyAV](https://github.com/PyAV-Org/PyAV) which bundles the FFmpeg libraries in its package.\n\n### GPU Support and CUDA Installation\n\n#### GPU Requirements\n\nGPU execution requires the following NVIDIA libraries to be installed:\n\n* [cuBLAS for CUDA 12](https://developer.nvidia.com/cublas)\n* [cuDNN 9 for CUDA 12](https://developer.nvidia.com/cudnn)\n\n**Important**: The latest versions of `ctranslate2` only support CUDA 12 and cuDNN 9. For CUDA 11 and cuDNN 8, downgrade to the `3.24.0` version of `ctranslate2`. For CUDA 12 and cuDNN 8, use `ctranslate2==4.4.0` (already included in shunyalabs):\n\n```bash\npip install --force-reinstall ctranslate2==4.4.0\n```\n\n#### CUDA Installation Methods\n\n<details>\n<summary>Method 1: Docker (Recommended)</summary>\n\nThe easiest way is to use the official NVIDIA CUDA Docker image:\n```bash\ndocker run --gpus all -it nvidia/cuda:12.3.2-cudnn9-runtime-ubuntu22.04\npip install shunyalabs\n```\n</details>\n\n<details>\n<summary>Method 2: pip Installation (Linux only)</summary>\n\nInstall CUDA libraries via pip:\n```bash\npip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*\n\nexport LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + \":\" + os.path.dirname(nvidia.cudnn.lib.__file__))'`\n```\n</details>\n\n<details>\n<summary>Method 3: Manual Download (Windows & Linux)</summary>\n\nDownload pre-built CUDA libraries from [Purfview's repository](https://github.com/Purfview/whisper-standalone-win/releases/tag/libs). Extract and add to your system PATH.\n</details>\n\n### Performance Benchmarks\n\nBased on faster-whisper benchmarks transcribing 13 minutes of audio:\n\n#### GPU Performance (NVIDIA RTX 3070 Ti 8GB)\n\n| Backend | Precision | Beam size | Time | VRAM Usage |\n| --- | --- | --- | --- | --- |\n| transformers (SDPA) | fp16 | 5 | 1m52s | 4960MB |\n| **faster-whisper (CT2)** | fp16 | 5 | **1m03s** | 4525MB |\n| **faster-whisper (CT2)** (`batch_size=8`) | fp16 | 5 | **17s** | 6090MB |\n| **faster-whisper (CT2)** | int8 | 5 | 59s | 2926MB |\n\n#### CPU Performance (Intel Core i7-12700K, 8 threads)\n\n| Backend | Precision | Beam size | Time | RAM Usage |\n| --- | --- | --- | --- | --- |\n| **faster-whisper (CT2)** | fp32 | 5 | 2m37s | 2257MB |\n| **faster-whisper (CT2)** (`batch_size=8`) | fp32 | 5 | **1m06s** | 4230MB |\n| **faster-whisper (CT2)** | int8 | 5 | 1m42s | 1477MB |\n\n*Shunya Labs delivers superior performance with optimized CTranslate2 backend and efficient memory usage.*\n\n## Supported Backends\n\n### ct2 (CTranslate2) - Default\n- **Performance**: Fastest inference with CTranslate2 optimization\n- **Features**: Full parameter control, VAD, streaming, GPU acceleration\n- **Models**: All compatible models, optimized for Shunya Labs models\n- **Best for**: Production use, real-time applications\n\n### transformers \n- **Performance**: Good performance with Hugging Face ecosystem\n- **Features**: Access to latest models, easy fine-tuning integration\n- **Models**: Any Seq2Seq model on Hugging Face Hub\n- **Best for**: Research, latest models, custom transformer models\n\n## Supported Models\n\n### Default Model\n- `shunyalabs/pingala-v1-en-verbatim` - High-quality English transcription model by Shunya Labs\n\n### Shunya Labs Models\n- `shunyalabs/pingala-v1-en-verbatim` - Optimized for English verbatim transcription\n- More Shunya Labs models coming soon!\n\n### Custom Models (Advanced Users)\n- Any Hugging Face Seq2Seq model compatible with automatic-speech-recognition pipeline\n- Local model paths supported\n\n### Local Models\n- `/path/to/local/model` - Local model directory or file\n\n## Quick Start\n\n### Basic Usage with Auto-Detection\n\n```python\nfrom shunyalabs import ShunyaTranscriber\n\n# Initialize with default Shunya Labs model and auto-detected backend\ntranscriber = ShunyaTranscriber()\n\n# Simple transcription\nsegments = transcriber.transcribe_file_simple(\"audio.wav\")\n\nfor segment in segments:\n print(f\"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}\")\n```\n\n### Backend Selection\n\n```python\nfrom shunyalabs import ShunyaTranscriber\n\n# Explicitly choose backends with Shunya Labs model\ntranscriber_ct2 = ShunyaTranscriber(model_name=\"shunyalabs/pingala-v1-en-verbatim\", backend=\"ct2\")\ntranscriber_tf = ShunyaTranscriber(model_name=\"shunyalabs/pingala-v1-en-verbatim\", backend=\"transformers\") \n\n# Auto-detection (recommended)\ntranscriber_auto = ShunyaTranscriber() # Uses default Shunya Labs model with ct2\n```\n\n### Advanced Usage with All Features\n\n```python\nfrom shunyalabs import ShunyaTranscriber\n\n# Initialize with specific backend and settings\ntranscriber = ShunyaTranscriber(\n model_name=\"shunyalabs/pingala-v1-en-verbatim\",\n backend=\"ct2\",\n device=\"cuda\", \n compute_type=\"float16\"\n)\n\n# Advanced transcription with full metadata\nsegments, info = transcriber.transcribe_file(\n \"audio.wav\",\n beam_size=10, # Higher beam size for better accuracy\n word_timestamps=True, # Enable word-level timestamps\n temperature=0.0, # Deterministic output\n compression_ratio_threshold=2.4, # Filter out low-quality segments\n log_prob_threshold=-1.0, # Filter by probability\n no_speech_threshold=0.6, # Silence detection threshold\n initial_prompt=\"High quality audio recording\", # Guide the model\n hotwords=\"Python, machine learning, AI\", # Boost specific words\n vad_filter=True, # Enable voice activity detection\n task=\"transcribe\" # or \"translate\" for translation\n)\n\n# Print transcription info\nmodel_info = transcriber.get_model_info()\nprint(f\"Backend: {model_info['backend']}\")\nprint(f\"Model: {model_info['model_name']}\")\nprint(f\"Language: {info.language} (confidence: {info.language_probability:.3f})\")\nprint(f\"Duration: {info.duration:.2f} seconds\")\n\n# Process segments with all metadata\nfor segment in segments:\n print(f\"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}\")\n if segment.confidence:\n print(f\"Confidence: {segment.confidence:.3f}\")\n \n # Word-level details\n for word in segment.words:\n print(f\" '{word.word}' [{word.start:.2f}-{word.end:.2f}s] (conf: {word.probability:.3f})\")\n```\n\n### Using Transformers Backend\n\n```python\n# Use Shunya Labs model with transformers backend\ntranscriber = ShunyaTranscriber(\n model_name=\"shunyalabs/pingala-v1-en-verbatim\",\n backend=\"transformers\"\n)\n\nsegments = transcriber.transcribe_file_simple(\"audio.wav\")\n\n# Auto-detection will use ct2 by default for Shunya Labs models\ntranscriber = ShunyaTranscriber() # Uses ct2 backend (recommended)\n```\n\n## Command-Line Interface\n\nThe package includes a comprehensive CLI supporting both backends:\n\n### Basic CLI Usage\n\n```bash\n# Basic transcription with auto-detected backend\nshunyalabs audio.wav\n\n# Specify backend explicitly \nshunyalabs audio.wav --backend ct2\nshunyalabs audio.wav --backend transformers\n\n# Use Shunya Labs model with different backends\nshunyalabs audio.wav --model shunyalabs/pingala-v1-en-verbatim --backend ct2\nshunyalabs audio.wav --model shunyalabs/pingala-v1-en-verbatim --backend transformers\n\n# Save to file\nshunyalabs audio.wav --model shunyalabs/pingala-v1-en-verbatim -o transcript.txt\n\n# Use CPU for processing\nshunyalabs audio.wav --device cpu\n```\n\n### Advanced CLI Features\n\n```bash\n# Word-level timestamps with confidence scores (ct2)\nshunyalabs audio.wav --model shunyalabs/pingala-v1-en-verbatim --word-timestamps --show-confidence --show-words\n\n# Voice Activity Detection (ct2 only)\nshunyalabs audio.wav --model shunyalabs/pingala-v1-en-verbatim --vad --verbose\n\n# Language detection with different backends\nshunyalabs audio.wav --model shunyalabs/pingala-v1-en-verbatim --detect-language --backend ct2\n\n# SRT subtitles with word-level timing\nshunyalabs audio.wav --model shunyalabs/pingala-v1-en-verbatim --format srt --word-timestamps -o subtitles.srt\n\n# Transformers backend with Shunya Labs model\nshunyalabs audio.wav --model shunyalabs/pingala-v1-en-verbatim --backend transformers --verbose\n\n# Advanced parameters (ct2)\nshunyalabs audio.wav --model shunyalabs/pingala-v1-en-verbatim \\\n --beam-size 10 \\\n --temperature 0.2 \\\n --compression-ratio-threshold 2.4 \\\n --log-prob-threshold -1.0 \\\n --initial-prompt \"This is a technical presentation\" \\\n --hotwords \"Python,AI,machine learning\"\n```\n\n### CLI Options Reference\n\n| Option | Description | Backends | Default |\n|--------|-------------|----------|---------|\n| `--model` | Model name or path | All | shunyalabs/pingala-v1-en-verbatim |\n| `--backend` | Backend selection | All | auto-detect |\n| `--device` | Device: cuda, cpu, auto | All | cuda |\n| `--compute-type` | Precision: float16, float32, int8 | All | float16 |\n| `--beam-size` | Beam size for decoding | All | 5 |\n| `--language` | Language code (e.g., 'en') | All | auto-detect |\n| `--word-timestamps` | Enable word-level timestamps | ct2 | False |\n| `--show-confidence` | Show confidence scores | All | False |\n| `--show-words` | Show word-level details | All | False |\n| `--vad` | Enable VAD filtering | ct2 | False |\n| `--detect-language` | Language detection only | All | False |\n| `--format` | Output format: text, srt, vtt | All | text |\n| `--temperature` | Sampling temperature | All | 0.0 |\n| `--compression-ratio-threshold` | Compression ratio filter | ct2 | 2.4 |\n| `--log-prob-threshold` | Log probability filter | ct2 | -1.0 |\n| `--no-speech-threshold` | No speech threshold | All | 0.6 |\n| `--initial-prompt` | Initial prompt text | All | None |\n| `--hotwords` | Hotwords to boost | ct2 | None |\n| `--task` | Task: transcribe, translate | All | transcribe |\n\n## Backend Comparison\n\n| Feature | ct2 | transformers |\n|---------|-----|--------------|\n| **Performance** | Fastest | Good |\n| **GPU Acceleration** | Optimized | Standard |\n| **Memory Usage** | Lowest | Moderate |\n| **Model Support** | Any model | Any HF model |\n| **Word Timestamps** | Full support | Limited |\n| **VAD Filtering** | Built-in | No |\n| **Streaming** | True streaming | Batch only |\n| **Advanced Params** | All features | Basic |\n| **Latest Models** | Updated | Latest |\n| **Custom Models** | CTranslate2 | Any format |\n\n### Recommendations\n\n- **Production/Performance**: Use `ct2` with Shunya Labs models\n- **Latest Research Models**: Use `transformers`\n- **Real-time Applications**: Use `ct2` with VAD\n- **Custom Transformer Models**: Use `transformers`\n\n## Performance Optimization\n\n### Backend Selection Tips\n\n```python\n# Real-time/Production: Use ct2 with Shunya Labs model\ntranscriber = ShunyaTranscriber(model_name=\"shunyalabs/pingala-v1-en-verbatim\", backend=\"ct2\")\n\n# Maximum accuracy: Use Shunya Labs model with ct2 \ntranscriber = ShunyaTranscriber(model_name=\"shunyalabs/pingala-v1-en-verbatim\", backend=\"ct2\")\n\n# Alternative backend: Use transformers with Shunya Labs model\ntranscriber = ShunyaTranscriber(model_name=\"shunyalabs/pingala-v1-en-verbatim\", backend=\"transformers\")\n\n# Research/Latest models: Use transformers backend\ntranscriber = ShunyaTranscriber(model_name=\"shunyalabs/pingala-v1-en-verbatim\", backend=\"transformers\")\n```\n\n### Hardware Recommendations\n\n| Use Case | Model | Backend | Hardware |\n|----------|-------|---------|----------|\n| Real-time | shunyalabs/pingala-v1-en-verbatim | ct2 | GPU 4GB+ |\n| Production | shunyalabs/pingala-v1-en-verbatim | ct2 | GPU 6GB+ |\n| Maximum Quality | shunyalabs/pingala-v1-en-verbatim | ct2 | GPU 8GB+ |\n| Alternative | shunyalabs/pingala-v1-en-verbatim | transformers | GPU 4GB+ |\n| CPU-only | shunyalabs/pingala-v1-en-verbatim | any | 8GB+ RAM |\n\n### GPU Optimization\n\n```python\n# Maximum performance on GPU - FP16 precision\ntranscriber = ShunyaTranscriber(\n model_name=\"shunyalabs/pingala-v1-en-verbatim\",\n device=\"cuda\",\n compute_type=\"float16\" # Fastest GPU performance\n)\n\n# Memory constrained GPU - INT8 quantization\ntranscriber = ShunyaTranscriber(\n model_name=\"shunyalabs/pingala-v1-en-verbatim\", \n device=\"cuda\",\n compute_type=\"int8_float16\" # Lower memory usage\n)\n\n# Batched processing for multiple files\nsegments, info = transcriber.transcribe_file(\n \"audio.wav\",\n batch_size=8, # Process multiple segments in parallel\n beam_size=5\n)\n```\n\n### CPU Optimization\n\n```python\n# Optimized CPU settings\ntranscriber = ShunyaTranscriber(\n model_name=\"shunyalabs/pingala-v1-en-verbatim\",\n device=\"cpu\",\n compute_type=\"int8\" # Lower memory, faster on CPU\n)\n\n# Control CPU threads for best performance\nimport os\nos.environ[\"OMP_NUM_THREADS\"] = \"4\" # Adjust based on your CPU\n```\n\n### Memory Optimization Tips\n\n- **GPU VRAM**: Use `int8_float16` compute type to reduce memory usage by ~40%\n- **System RAM**: Use `int8` compute type on CPU to reduce memory usage\n- **Batch Size**: Increase batch size if you have sufficient memory for faster processing\n- **Model Size**: Consider smaller models for memory-constrained environments\n\n### Performance Comparison Tips\n\nWhen comparing against other implementations:\n- Use same beam size (default is 5 in shunyalabs)\n- Compare with similar Word Error Rate (WER)\n- Set consistent thread count: `OMP_NUM_THREADS=4 python script.py`\n- Ensure similar transcription quality metrics\n\n## Troubleshooting\n\n### Common CUDA Issues\n\n**Issue**: `RuntimeError: No CUDA capable device found`\n```bash\n# Check CUDA availability\npython -c \"import torch; print(f'CUDA available: {torch.cuda.is_available()}')\"\n\n# If False, install CUDA toolkit or use CPU\ntranscriber = ShunyaTranscriber(device=\"cpu\")\n```\n\n**Issue**: `CUDA out of memory`\n```python\n# Solution 1: Use INT8 quantization\ntranscriber = ShunyaTranscriber(compute_type=\"int8_float16\")\n\n# Solution 2: Reduce batch size\nsegments, info = transcriber.transcribe_file(\"audio.wav\", batch_size=1)\n\n# Solution 3: Use CPU\ntranscriber = ShunyaTranscriber(device=\"cpu\")\n```\n\n**Issue**: `cuDNN/cuBLAS library not found`\n```bash\n# Install CUDA libraries via pip (Linux)\npip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*\n\n# Or use Docker\ndocker run --gpus all -it nvidia/cuda:12.3.2-cudnn9-runtime-ubuntu22.04\n```\n\n**Issue**: `ctranslate2` version compatibility\n```bash\n# For CUDA 12 + cuDNN 8 (default in shunyalabs)\npip install ctranslate2==4.4.0\n\n# For CUDA 11 + cuDNN 8\npip install ctranslate2==3.24.0\n```\n\n### Model Loading Issues\n\n**Issue**: Model download fails\n```python\n# Use local model path\ntranscriber = ShunyaTranscriber(model_name=\"/path/to/local/model\")\n\n# Or specify different Shunya Labs model\ntranscriber = ShunyaTranscriber(model_name=\"shunyalabs/pingala-v1-en-verbatim\")\n```\n\n**Issue**: Alignment heads error (word timestamps)\n```python\n# This is handled automatically with fallback to no word timestamps\n# Word timestamps are supported with Shunya Labs models\ntranscriber = ShunyaTranscriber(model_name=\"shunyalabs/pingala-v1-en-verbatim\")\nsegments, info = transcriber.transcribe_file(\"audio.wav\", word_timestamps=True)\n```\n\n## Examples\n\nSee `example.py` for comprehensive examples:\n\n```bash\n# Run with default backend (auto-detected)\npython example.py audio.wav\n\n# Test specific backends with Shunya Labs model\npython example.py audio.wav --backend ct2\npython example.py audio.wav --backend transformers \n\n# Test Shunya Labs model with different backends\npython example.py audio.wav shunyalabs/pingala-v1-en-verbatim\n```\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n## License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n\n## Acknowledgments\n\n- Built by [Shunya Labs](https://shunyalabs.ai) for superior transcription quality\n- Powered by CTranslate2 for optimized inference\n- Supports [Hugging Face transformers](https://github.com/huggingface/transformers) \n- Uses the Pingala model from [Shunya Labs](https://shunyalabs.ai)\n\n## About Shunya Labs\n\nVisit [Shunya Labs](https://shunyalabs.ai) to learn more about our AI research and products. \nContact us at [0@shunyalabs.ai](mailto:0@shunyalabs.ai) for questions or collaboration opportunities. \n",
"bugtrack_url": null,
"license": null,
"summary": "Professional speech transcription package by Shunya Labs with ct2 and transformers backends",
"version": "1.0.2",
"project_urls": {
"Bug Reports": "https://github.com/Shunyalabsai/shunyalabs/issues",
"Documentation": "https://shunyalabs.ai/shunyalabs",
"Homepage": "https://shunyalabs.ai",
"Source": "https://github.com/Shunyalabsai/shunyalabs"
},
"split_keywords": [
"speech-to-text",
" transcription",
" audio",
" ai",
" shunyalabs",
" ct2",
" transformers",
" enterprise"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "ad5b958b158d332d65dacfff8e2a42ff695c4507163ddf0f4b16847e72a3f3d3",
"md5": "1cdfc459b67bf46fda620010e9d743ba",
"sha256": "a2444f88e3ae24d43e4c7a19020bef6142f071d39b28fedb476553f68f871193"
},
"downloads": -1,
"filename": "shunyalabs-1.0.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1cdfc459b67bf46fda620010e9d743ba",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 19197,
"upload_time": "2025-11-07T11:45:36",
"upload_time_iso_8601": "2025-11-07T11:45:36.192123Z",
"url": "https://files.pythonhosted.org/packages/ad/5b/958b158d332d65dacfff8e2a42ff695c4507163ddf0f4b16847e72a3f3d3/shunyalabs-1.0.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "cec4f7fb75ce33f5319f42fba9e130abe02883563e0e3aa53f6371041183e208",
"md5": "36cf2dd1b02dbc3d529618f9fadf125b",
"sha256": "2b9959ce9164124db44e221b144796c397c20be4c6c5010900dc78e63ca5c69a"
},
"downloads": -1,
"filename": "shunyalabs-1.0.2.tar.gz",
"has_sig": false,
"md5_digest": "36cf2dd1b02dbc3d529618f9fadf125b",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 24765,
"upload_time": "2025-11-07T11:45:37",
"upload_time_iso_8601": "2025-11-07T11:45:37.582487Z",
"url": "https://files.pythonhosted.org/packages/ce/c4/f7fb75ce33f5319f42fba9e130abe02883563e0e3aa53f6371041183e208/shunyalabs-1.0.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-11-07 11:45:37",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Shunyalabsai",
"github_project": "shunyalabs",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "faster-whisper",
"specs": [
[
">=",
"0.10.0"
]
]
},
{
"name": "torch",
"specs": [
[
">=",
"1.9.0"
]
]
},
{
"name": "torchaudio",
"specs": [
[
">=",
"0.9.0"
]
]
},
{
"name": "transformers",
"specs": [
[
"==",
"4.52.4"
]
]
},
{
"name": "ctranslate2",
"specs": [
[
"==",
"4.4.0"
]
]
},
{
"name": "librosa",
"specs": [
[
">=",
"0.10.0"
]
]
},
{
"name": "datasets",
"specs": [
[
">=",
"2.0.0"
]
]
},
{
"name": "numpy",
"specs": [
[
">=",
"1.21.0"
]
]
}
],
"lcname": "shunyalabs"
}