| Name | salute-speech |
| Version | 2.0.0 |
| download | |
| home_page | None |
| Summary | Sber Salute Speech API |
| upload_time | 2025-08-30 17:11:01 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.8 |
| license | MIT License (Copyright (c) 2024 Maxim Moroz <maxim.moroz@gmail.com>) |
| keywords | speech |
| VCS | [github.com/mmua/salute_speech](https://github.com/mmua/salute_speech) |
| bugtrack_url | |
| requirements | Click (>=8.1.7), python-dotenv (>=1.0.1), requests (>=2.32.3), pydub (>=0.25.1), setuptools (>=69.2.0), numpy (>=1.20.0) |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# Sber Salute Speech Python API
A Python client for Sber's Salute Speech Recognition service with a simple, async-first API.
## Features
- OpenAI Whisper-like API for ease of use
- Asynchronous API for compatibility and better performance
- Comprehensive error handling
- Support for multiple audio formats
- Command-line interface for quick transcription
> Quick CLI
```bash
export SBER_SPEECH_API_KEY=your_key_here
ffmpeg -i input.mp3 -ac 1 -ar 16000 audio.wav
salute_speech transcribe-audio audio.wav -o transcript.txt
```
## Installation
```bash
pip install salute_speech
```
Prerequisites (recommended):
- ffmpeg (for best results and simple conversion to mono 16 kHz)
  - macOS (Homebrew): `brew install ffmpeg`
  - Ubuntu/Debian: `sudo apt-get install ffmpeg`
## Quick Start
```python
from salute_speech.speech_recognition import SaluteSpeechClient
import asyncio
import os
async def main():
    # Initialize the client (from environment variable)
    client = SaluteSpeechClient(client_credentials=os.getenv("SBER_SPEECH_API_KEY"))

    # Open and transcribe an audio file
    with open("audio.mp3", "rb") as audio_file:
        result = await client.audio.transcriptions.create(
            file=audio_file,
            language="ru-RU"
        )
        print(result.text)

# Run the async function
asyncio.run(main())
```
## API Reference
### SaluteSpeechClient
The main client class that provides access to the Sber Speech API.
```python
client = SaluteSpeechClient(client_credentials="your_credentials_here")
```
#### `client.audio.transcriptions.create()`
Creates a transcription for the given audio file.
**Parameters:**
- `file` (BinaryIO): An audio file opened in binary mode
- `language` (str, optional): Language code for transcription. Supported: `ru-RU`, `en-US`, `kk-KZ`. Defaults to "ru-RU"
- `poll_interval` (float, optional): Interval between status checks in seconds. Defaults to 1.0
- `config` (SpeechRecognitionConfig, optional): Advanced recognition tuning passed to the SberSpeech async API
- `prompt` (str, optional): Optional prompt to guide transcription (not yet supported)
- `response_format` (str, optional): Format of the response (not yet supported)
**Returns:**
- `TranscriptionResponse` object
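For instance, inside an async function (as in the Quick Start), the optional parameters can be spelled out explicitly; a sketch with placeholder values:
```python
# Sketch: documented optional parameters made explicit; file name and values are placeholders.
with open("audio.wav", "rb") as audio_file:
    result = await client.audio.transcriptions.create(
        file=audio_file,
        language="en-US",    # ru-RU, en-US or kk-KZ
        poll_interval=2.0,   # seconds between status checks
    )
print(result.text)
```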
### TranscriptionResponse
Shape of the result object returned by `client.audio.transcriptions.create()` (aligned with OpenAI's TranscriptionVerbose):
- `duration` (float): The duration of the input audio in seconds
- `language` (str): The language of the input audio (e.g., `ru`, `en`)
- `text` (str): The full transcribed text (concatenation of segment texts)
- `segments` (List[TranscriptionSegment] | None): Segments of the transcribed text with timestamps
- `status` (str): Sber-specific job status (e.g., `DONE`)
- `task_id` (str): Sber-specific task identifier
OpenAI alignment: This structure matches OpenAI Whisper's `TranscriptionVerbose` response format, with additional Sber-specific fields for internal tracking.
Programmatic usage (iterate segments):
```python
for seg in result.segments or []:
    print(f"[{seg.start:.2f} - {seg.end:.2f}] {seg.text}")
```
TranscriptionSegment:
- `id` (int): Segment index
- `start` (float): Start time in seconds
- `end` (float): End time in seconds
- `text` (str): Segment text
**Example:**
```python
with open("meeting.mp3", "rb") as audio_file:
    result = await client.audio.transcriptions.create(
        file=audio_file,
        language="ru-RU"
    )
    print(result.text)
```
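Because each segment carries `id`, `start`, `end`, and `text`, subtitle output can also be produced programmatically. Below is a minimal sketch that writes `result.segments` as SubRip (SRT) cues; the CLI's `srt` format covers the same need:
```python
# Sketch: render the documented segment fields as SRT cues.
def srt_time(seconds: float) -> str:
    ms = int(round(seconds * 1000))
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1_000)
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"

with open("subtitles.srt", "w", encoding="utf-8") as srt:
    for i, seg in enumerate(result.segments or [], start=1):
        srt.write(f"{i}\n{srt_time(seg.start)} --> {srt_time(seg.end)}\n{seg.text}\n\n")
```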
### Supported Audio Formats
The service supports the following audio formats:
| Format | Max Channels | Sample Rate Range |
|--------|-------------|-------------------|
| PCM_S16LE (WAV) | 8 | 8,000 - 96,000 Hz |
| OPUS | 1 | Any |
| MP3 | 2 | Any |
| FLAC | 8 | Any |
| ALAW | 8 | 8,000 - 96,000 Hz |
| MULAW | 8 | 8,000 - 96,000 Hz |
Audio parameters are automatically detected and validated using the `AudioValidator` class.
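If an input file falls outside these constraints, converting it up front is the simplest fix. A minimal pre-processing sketch using pydub (already a dependency; it relies on ffmpeg being installed, as recommended above):
```python
# Sketch: convert an arbitrary input to mono 16 kHz WAV before uploading.
from pydub import AudioSegment

audio = AudioSegment.from_file("input.mp3")
audio = audio.set_channels(1).set_frame_rate(16000)
audio.export("audio.wav", format="wav")
```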
### Error Handling
The client provides structured error handling with specific exception classes:
```python
try:
    result = await client.audio.transcriptions.create(file=audio_file)
except TokenRequestError as e:
    print(f"Authentication error: {e}")
except FileUploadError as e:
    print(f"Upload failed: {e}")
except TaskStatusResponseError as e:
    print(f"Transcription task failed: {e}")
except ValidationError as e:
    print(f"Audio validation failed: {e}")
except InvalidResponseError as e:
    print(f"Invalid API response: {e}")
except APIError as e:
    print(f"API error: {e}")
except SberSpeechError as e:
    print(f"General API error: {e}")
```
### Token Management
Authentication tokens are automatically managed by the `TokenManager` class, which:
- Caches tokens to minimize API requests
- Refreshes tokens when they expire
- Validates token format and expiration
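In practice this means a single client instance can be reused across many requests without any manual token handling. A sketch (file names are placeholders):
```python
# Sketch: one client reused for several files; tokens are cached and refreshed internally.
import asyncio
import os

from salute_speech.speech_recognition import SaluteSpeechClient

async def transcribe_many(paths):
    client = SaluteSpeechClient(client_credentials=os.getenv("SBER_SPEECH_API_KEY"))
    for path in paths:
        with open(path, "rb") as f:
            result = await client.audio.transcriptions.create(file=f, language="ru-RU")
            print(path, result.text)

asyncio.run(transcribe_many(["part1.wav", "part2.wav"]))
```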
## Command Line Interface
The package includes a command-line interface for quick transcription tasks:
```bash
# Set your API key as an environment variable
export SBER_SPEECH_API_KEY=your_key_here
```
**Basic Usage:**
```bash
salute_speech --help
```
**Transcribe to text:**
```bash
# Prepare audio (recommended: convert to mono)
ffmpeg -i video.mp4 -ac 1 -ar 16000 audio.wav
# Transcribe to text
salute_speech transcribe-audio audio.wav -o transcript.txt
```
**Transcribe to WebVTT:**
```bash
salute_speech transcribe-audio audio.wav -o transcript.vtt
```
Options:
- `--language` (`ru-RU` | `en-US` | `kk-KZ`) Default: `ru-RU`
- `--channels` Number of channels expected in the input (validated)
- `-f, --output_format` One of: `txt`, `vtt`, `srt`, `tsv`, `json` (usually inferred from `-o`)
- `-o, --output_file` Path to write the transcription
- `--debug_dump` Path or directory to dump raw JSON result (for debugging)
Examples:
```bash
# JSON (timed segments)
salute_speech transcribe-audio audio.wav -o transcript.json
# SRT subtitles
salute_speech transcribe-audio audio.wav -o subtitles.srt
# Dump raw Sber result for inspection
salute_speech transcribe-audio audio.wav -o transcript.txt --debug_dump res.json
```
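The `--channels` and `-f` options combine with the same command; an illustrative invocation (file names are placeholders):
```bash
# Declare a two-channel input and force plain-text output regardless of the extension
salute_speech transcribe-audio stereo_call.wav --channels 2 -f txt -o call.txt
```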
**Supported output formats:**
- `txt` - Plain text
- `vtt` - WebVTT subtitles
- `srt` - SubRip subtitles
- `tsv` - Tab-separated values
- `json` - JSON format with detailed information:
  - `text` (str)
  - `duration` (float)
  - `language` (str)
  - `segments` (array of `{id, start, end, text}`)
**Note:** Each audio channel is transcribed separately, so converting to mono is recommended for most cases.
### About the API flow
This library uses the asynchronous REST API flow (upload → create task → poll status → download result).
See Sber docs for details:
- Async recognition flow: [developers.sber.ru/docs/ru/salutespeech/rest/async-general](https://developers.sber.ru/docs/ru/salutespeech/rest/async-general)
## Advanced Configuration
For advanced use cases, you can customize the speech recognition parameters:
```python
from salute_speech.speech_recognition import SpeechRecognitionConfig
config = SpeechRecognitionConfig(
    hypotheses_count=3,                 # Number of transcription variants
    enable_profanity_filter=True,       # Filter out profanity
    max_speech_timeout="30s",           # Maximum timeout for speech segments
    # speaker_separation_options={...}  # Optional: see official docs for available options
)

result = await client.audio.transcriptions.create(
    file=audio_file,
    language="ru-RU",
    config=config
)
```
### Tuning parameters
You can fine‑tune recognition behavior through `SpeechRecognitionConfig` passed to the async API. The most relevant fields are:
- `hypotheses_count` (int): Number of transcription variants to generate (1–10)
- `enable_profanity_filter` (bool): Replace offensive words
- `max_speech_timeout` (str): Max duration for a single speech segment, e.g. `"20s"`
- `no_speech_timeout` (str): Timeout for no-speech detection, e.g. `"7s"`
- `hints` (dict): Domain terms/pronunciations to bias decoding
- `insight_models` (list): Enable extra models (if available for your account)
- `speaker_separation_options` (dict): Speaker separation-related options (service-dependent)
See the official SaluteSpeech async recognition documentation for the complete, up-to-date list of supported options and expected formats:
- Async recognition flow: [developers.sber.ru/docs/ru/salutespeech/rest/async-general](https://developers.sber.ru/docs/ru/salutespeech/rest/async-general)
Note: `language`, `audio_encoding`, `sample_rate`, and `channels_count` are set automatically from your input or via CLI and are not part of the `SpeechRecognitionConfig` object.