# Voice Agent Python client for Speechmatics Real-Time API
[License](https://github.com/speechmatics/speechmatics-python-voice/blob/master/LICENSE)
An SDK for working with the Speechmatics Real-Time API optimised for use in voice agents or transcription services.
## Overview
The Voice Agent SDK is a set of helper classes built on the Real-Time AsyncClient. It connects to the Speechmatics Real-Time API, processes the transcription results from the STT engine, and combines them into manageable segments of text. Taking advantage of speaker diarization, the transcription is grouped by individual speaker, with advanced options to focus on and/or ignore specific speakers.
```mermaid
graph TD
A[Client connects] --> B[Recognition started]
B --> C[Audio chunks sent]
C --> D[STT processing]
D --> E[Partial transcripts rx]
D --> F[Final transcripts rx]
E --> G[Process transcripts]
F --> G
G --> H[Accumulate speech fragments]
H --> I{Changes detected?}
I -->|No| J[Skip]
I -->|Yes| K[Create speaker segments]
K --> L{Segment type?}
L -->|Final| M[Emit final segments]
L -->|Partial| N[Emit partial segments]
M --> O[Trim fragments]
N --> P{End of utterance?}
O --> P
P -->|FIXED mode| Q[Wait for STT signal]
P -->|ADAPTIVE mode| R[Calculate timing]
Q --> S[Emit END_OF_TURN]
R --> S
style A fill:#e1f5fe
style B fill:#e8f5e8
style M fill:#fff3e0
style N fill:#fff3e0
style S fill:#fce4ec
```
## Installation
```bash
pip install speechmatics-voice
```
## Requirements
You must have a valid Speechmatics API key to use this SDK. You can get one from the [Speechmatics Portal](https://portal.speechmatics.com/).
Store this as the `SPEECHMATICS_API_KEY` environment variable in your `.env` file, or run `export SPEECHMATICS_API_KEY="your_api_key_here"` in your terminal.
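If you prefer to check for the key explicitly at startup, the following is a minimal sketch that reads it with `os.environ`; the optional `python-dotenv` import is an assumption for loading a local `.env` file, not a dependency of this SDK.

```python
import os

# Optional: load variables from a local .env file first.
# python-dotenv is an assumption here, not a dependency of this SDK.
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

api_key = os.environ.get("SPEECHMATICS_API_KEY")
if not api_key:
    raise RuntimeError("SPEECHMATICS_API_KEY is not set")
```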
## Quick Start
Below is a basic example of how to use the SDK to transcribe audio from a microphone.
```python
import asyncio
from speechmatics.rt import Microphone
from speechmatics.voice import (
    VoiceAgentClient,
    VoiceAgentConfig,
    EndOfUtteranceMode,
    AgentServerMessageType
)

async def main():
    # Configure the voice agent
    config = VoiceAgentConfig(
        enable_diarization=True,
        end_of_utterance_mode=EndOfUtteranceMode.FIXED,
    )

    # Initialize microphone
    mic = Microphone(
        sample_rate=16000,
        chunk_size=160,
    )

    if not mic.start():
        print("Microphone not available")
        return

    # Create client and register event handlers
    async with VoiceAgentClient(config=config) as client:

        # Handle interim transcription segments
        @client.on(AgentServerMessageType.ADD_PARTIAL_SEGMENT)
        def handle_interim_segments(message):
            segments = message["segments"]
            for segment in segments:
                print(f"[PARTIAL] Speaker {segment['speaker_id']}: {segment['text']}")

        # Handle finalized transcription segments
        @client.on(AgentServerMessageType.ADD_SEGMENT)
        def handle_final_segments(message):
            segments = message["segments"]
            for segment in segments:
                print(f"[FINAL] Speaker {segment['speaker_id']}: {segment['text']}")

        # Handle user started speaking event
        @client.on(AgentServerMessageType.SPEAKER_STARTED)
        def handle_speech_started(message):
            status = message["status"]
            print(f"User started speaking: {status}")

        # Handle user stopped speaking event
        @client.on(AgentServerMessageType.SPEAKER_ENDED)
        def handle_speech_ended(message):
            status = message["status"]
            print(f"User stopped speaking: {status}")

        # End of turn / utterance(s)
        @client.on(AgentServerMessageType.END_OF_TURN)
        def handle_end_of_turn(message):
            print("End of turn")

        # Connect and start processing audio
        await client.connect()

        while True:
            frame = await mic.read(160)
            await client.send_audio(frame)

if __name__ == "__main__":
    asyncio.run(main())
```
## Examples
The `examples/` directory contains practical demonstrations of the Voice Agent SDK. See the README in the `examples/voice` directory for more information.
## Client Configuration
The `VoiceAgentClient` can be configured with a number of options to control the behaviour of the client using the `VoiceAgentConfig` class.
### VoiceAgentConfig Parameters
#### Service Configuration
- **`operating_point`** (`OperatingPoint`): Operating point for transcription accuracy vs. latency tradeoff. It is recommended to use `OperatingPoint.ENHANCED` for most use cases. Defaults to `OperatingPoint.ENHANCED`.
- **`domain`** (`Optional[str]`): Domain for Speechmatics API. Defaults to `None`.
- **`language`** (`str`): Language code for transcription. Defaults to `"en"`.
- **`output_locale`** (`Optional[str]`): Output locale for transcription, e.g. `"en-GB"`. Defaults to `None`.
#### Timing and Latency Features
- **`max_delay`** (`float`): Maximum delay in seconds for transcription. This forces the STT engine to speed up the processing of transcribed words and reduces the interval between partial and final results. Lower values can have an impact on accuracy. Defaults to `0.7`.
- **`end_of_utterance_silence_trigger`** (`float`): Silence duration in seconds that triggers end of utterance. The client waits this long for any further transcribed words before emitting the final word frames. The value must be lower than `max_delay`. Defaults to `0.2`.
- **`end_of_utterance_max_delay`** (`float`): Maximum delay in seconds before end of utterance is forced. This caps how long the client waits for further transcribed words before emitting the final word frames. The value must be greater than `end_of_utterance_silence_trigger`. Defaults to `10.0`.
- **`end_of_utterance_mode`** (`EndOfUtteranceMode`): End of utterance delay mode. When `ADAPTIVE` is used, the delay is adjusted based on the content of what the most recent speaker has said, such as their rate of speech and whether they pause or use disfluencies. When `FIXED` is used, the delay is fixed to the value of `end_of_utterance_silence_trigger`. `EXTERNAL` disables end of utterance detection and relies on a fallback timer (see the sketch after this list). Defaults to `EndOfUtteranceMode.FIXED`.
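As a rough illustration of how these parameters interact, here is a hedged sketch contrasting a snappier `FIXED` setup with a more patient `ADAPTIVE` one; the specific values are illustrative rather than recommendations.

```python
from speechmatics.voice import EndOfUtteranceMode, VoiceAgentConfig

# Snappy turn-taking: a short fixed silence trigger ends the turn quickly.
fast_config = VoiceAgentConfig(
    max_delay=0.7,
    end_of_utterance_mode=EndOfUtteranceMode.FIXED,
    end_of_utterance_silence_trigger=0.2,   # must stay below max_delay
    end_of_utterance_max_delay=5.0,         # hard upper bound as a fallback
)

# Patient turn-taking: ADAPTIVE stretches the delay when the speaker pauses
# or uses disfluencies, bounded by end_of_utterance_max_delay.
patient_config = VoiceAgentConfig(
    max_delay=1.0,
    end_of_utterance_mode=EndOfUtteranceMode.ADAPTIVE,
    end_of_utterance_silence_trigger=0.5,
    end_of_utterance_max_delay=10.0,
)
```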
#### Language and Vocabulary Features
- **`additional_vocab`** (`list[AdditionalVocabEntry]`): List of additional vocabulary entries. Supplying these increases the weight of the listed words and helps the STT engine transcribe them more reliably. Defaults to `[]`.
- **`punctuation_overrides`** (`Optional[dict]`): Punctuation overrides. This allows you to override the punctuation in the STT engine. This is useful for languages that use different punctuation than English. See documentation for more information. Defaults to `None`.
#### Speaker Diarization
- **`enable_diarization`** (`bool`): Enable speaker diarization. When enabled, the STT engine will determine and attribute words to unique speakers. The `speaker_sensitivity` parameter can be used to adjust the sensitivity of diarization. Defaults to `False`.
- **`speaker_sensitivity`** (`float`): Diarization sensitivity. A higher value increases the sensitivity of diarization and helps when two or more speakers have similar voices. Defaults to `0.5`.
- **`max_speakers`** (`Optional[int]`): Maximum number of speakers to detect. This forces the STT engine to cluster words into a fixed number of speakers. It should not be used to limit the number of speakers, unless it is clear that there will only be a known number of speakers. Defaults to `None`.
- **`prefer_current_speaker`** (`bool`): Prefer current speaker ID. When set to true, groups of words close together are given extra weight to be identified as the same speaker. Defaults to `False`.
- **`speaker_config`** (`SpeakerFocusConfig`): Configuration specifying which speakers to focus on or ignore, and how to handle speakers that are not in focus (see the sketch after this list). Defaults to `SpeakerFocusConfig()`.
- **`known_speakers`** (`list[SpeakerIdentifier]`): List of known speaker labels and identifiers. If you supply a list of labels and identifiers for speakers, then the STT engine will use them to attribute any spoken words to that speaker. This is useful when you want to attribute words to a specific speaker, such as the assistant or a specific user. Labels and identifiers can be obtained from a running STT session and then used in subsequent sessions. Identifiers are unique to each Speechmatics account and cannot be used across accounts. Defaults to `[]`.
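Below is a hedged sketch of a diarization setup that focuses on one speaker and ignores another, using the `SpeakerFocusConfig` fields shown in the advanced example later in this README; the speaker labels and numeric values are illustrative.

```python
from speechmatics.voice import (
    SpeakerFocusConfig,
    SpeakerFocusMode,
    VoiceAgentConfig,
)

# Focus on the primary speaker and drop known cross-talk.
# Field names mirror the advanced example below; the labels "S1" and "S3"
# are illustrative placeholders for labels assigned by the STT engine.
diarized_config = VoiceAgentConfig(
    enable_diarization=True,
    speaker_sensitivity=0.6,        # slightly higher to separate similar voices
    prefer_current_speaker=True,    # bias short gaps towards the same speaker
    speaker_config=SpeakerFocusConfig(
        focus_speakers=["S1"],
        ignore_speakers=["S3"],
        focus_mode=SpeakerFocusMode.RETAIN,
    ),
)
```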
#### Advanced Features
- **`include_results`** (`bool`): Include word data in the response. This is useful for debugging and understanding the STT engine's behaviour. Defaults to `False`.
- **`enable_preview_features`** (`bool`): Enable preview features. Defaults to `False`.
#### Audio Configuration
- **`sample_rate`** (`int`): Audio sample rate for streaming. Defaults to `16000`.
- **`audio_encoding`** (`AudioEncoding`): Audio encoding format. Defaults to `AudioEncoding.PCM_S16LE`.
### Example Configuration
```python
from speechmatics.voice import (
    AdditionalVocabEntry,
    AudioEncoding,
    SpeakerFocusConfig,
    SpeakerFocusMode,
    EndOfUtteranceMode,
    OperatingPoint,
    VoiceAgentConfig
)

# Basic configuration
config = VoiceAgentConfig(
    language="en",
    enable_diarization=True,
    end_of_utterance_mode=EndOfUtteranceMode.FIXED,
)

# Advanced configuration with custom settings
advanced_config = VoiceAgentConfig(
    operating_point=OperatingPoint.ENHANCED,
    language="en",
    output_locale="en-GB",
    max_delay=0.7,
    end_of_utterance_silence_trigger=0.25,
    end_of_utterance_max_delay=8.0,
    end_of_utterance_mode=EndOfUtteranceMode.ADAPTIVE,
    enable_diarization=True,
    speaker_sensitivity=0.5,
    max_speakers=6,
    prefer_current_speaker=True,
    speaker_config=SpeakerFocusConfig(
        focus_speakers=["S1", "S2"],
        ignore_speakers=["S3"],
        focus_mode=SpeakerFocusMode.RETAIN,
    ),
    additional_vocab=[
        AdditionalVocabEntry(
            content="Speechmatics",
            sounds_like=["speech matters", "speech magic"]
        )
    ],
    include_results=True,
    sample_rate=16000,
    audio_encoding=AudioEncoding.PCM_S16LE
)
```
## Messages
The async client will return messages of type `AgentServerMessageType`. To register for a message, use the `on` method. Optionally, you can use the `once` method to register a callback that will be called once and then removed. Use the `off` method to remove a callback.
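The following is a minimal sketch of the three registration methods; the decorator form of `on` matches the Quick Start, while the exact signatures of `once` and `off` are assumptions based on the description above.

```python
from speechmatics.voice import (
    AgentServerMessageType,
    VoiceAgentClient,
    VoiceAgentConfig,
)

client = VoiceAgentClient(config=VoiceAgentConfig())

# `on`: called for every matching message (decorator form, as in the Quick Start).
@client.on(AgentServerMessageType.ADD_SEGMENT)
def handle_segments(message):
    print("segments:", message["segments"])

# `once`: assumed to mirror `on` but deregister itself after the first call.
@client.once(AgentServerMessageType.RECOGNITION_STARTED)
def handle_started(message):
    print("session id:", message.get("id"))

# `off`: assumed to take the message type and the previously registered handler.
client.off(AgentServerMessageType.ADD_SEGMENT, handle_segments)
```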
### `RECOGNITION_STARTED`
Emitted when the recognition has started and contains the session ID and base time for when transcription started. It also contains the language pack information for the model being used.
```json
{
  "message": "RecognitionStarted",
  "orchestrator_version": "2025.08.29127+289170c022.HEAD",
  "id": "a8779b0b-a238-43de-8211-c70f5fcbe191",
  "language_pack_info": {
    "adapted": false,
    "itn": true,
    "language_description": "English",
    "word_delimiter": " ",
    "writing_direction": "left-to-right"
  }
}
```
### `SPEAKER_STARTED`
Emitted when a speaker starts speaking. Contains the speaker ID and the VAD status. If there are multiple speakers, this will be emitted each time a speaker starts speaking. There will only be one active speaker at any given time. As speakers switch during a turn, separate `SPEAKER_ENDED` and `SPEAKER_STARTED` events will be emitted.
```json
{
  "message": "SpeakerStarted",
  "status": {
    "is_active": true,
    "speaker_id": "S1"
  }
}
```
### `SPEAKER_ENDED`
Emitted when a speaker stops speaking. Contains the speaker ID (if diarization is enabled) and the VAD status. If there are multiple speakers, this will be emitted each time a speaker stops speaking.
```json
{
  "message": "SpeakerEnded",
  "status": {
    "is_active": false,
    "speaker_id": "S1"
  }
}
```
### `ADD_PARTIAL_SEGMENT`
Emitted when a partial segment has been detected. Contains the speaker ID, if diarization is enabled. If there are multiple speakers, this will be emitted each time a speaker starts speaking. Words from different speakers will be grouped into segments.
If diarization is enabled and the client has been configured to focus on specific speakers, the `is_active` field indicates whether the contents are from focused speakers. Ignored speakers will not have their words emitted.
The `metadata` contains the start and end time for the segment. Each time the segment is updated as new partials are received, the whole segment is emitted again with updated `metadata`. The `annotation` field contains additional information about the contents of the segment.
```json
{
  "message": "AddPartialSegment",
  "segments": [
    {
      "speaker_id": "S1",
      "is_active": true,
      "timestamp": "2025-09-15T19:47:29.096+00:00",
      "language": "en",
      "text": "Welcome",
      "annotation": ["has_partial"]
    }
  ],
  "metadata": { "start_time": 0.36, "end_time": 0.92 }
}
```
### `ADD_SEGMENT`
Emitted when a final segment has been detected. Contains the speaker ID, if diarization is enabled. If there are multiple speakers, this will be emitted each time a speaker stops speaking.
The `metadata` contains the start and end time for the segment. The `annotation` field contains additional information about the contents of the segment.
```json
{
  "message": "AddSegment",
  "segments": [
    {
      "speaker_id": "S1",
      "is_active": true,
      "timestamp": "2025-09-15T19:47:29.096+00:00",
      "language": "en",
      "text": "Welcome to Speechmatics.",
      "annotation": [
        "has_final",
        "starts_with_final",
        "ends_with_final",
        "ends_with_eos",
        "ends_with_punctuation"
      ]
    }
  ],
  "metadata": { "start_time": 0.36, "end_time": 1.32 }
}
```
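To make the segment payload concrete, here is a hedged sketch of a handler that keeps a running transcript per speaker and uses the `annotation` values shown above to spot sentence-final segments; the `transcripts` dictionary and handler name are illustrative.

```python
from collections import defaultdict

# Running transcript per speaker; purely illustrative application state.
transcripts: dict = defaultdict(list)

def handle_final_segments(message):
    for segment in message["segments"]:
        if not segment["is_active"]:
            continue  # skip segments from out-of-focus speakers
        transcripts[segment["speaker_id"]].append(segment["text"])
        if "ends_with_eos" in segment["annotation"]:
            # The segment closes a sentence; a voice agent might hand the
            # accumulated text to its next stage at this point.
            print(segment["speaker_id"], " ".join(transcripts[segment["speaker_id"]]))

# Registered the same way as in the Quick Start:
#   @client.on(AgentServerMessageType.ADD_SEGMENT)
```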
### `END_OF_TURN`
Emitted when a turn has ended. This is a signal that the user has finished speaking and the system has finished processing the turn.
The message is emitted differently depending on the `EndOfUtteranceMode`:
- `FIXED` -> emitted when the fixed delay has elapsed, as signalled by the STT engine
- `ADAPTIVE` -> emitted when the adaptive delay has elapsed, as calculated by the client
- `EXTERNAL` -> emitted when an external trigger forces the end of turn
The `ADAPTIVE` mode takes a number of factors into consideration to determine whether the most recent speaker has completed their turn. These include:
- The speed of speech
- Whether they have been using disfluencies (e.g. "um", "er", "ah")
- If the words spoken are considered to be a complete sentence
The `end_of_utterance_silence_trigger` is used to calculate the baseline `FIXED` and `ADAPTIVE` delays. As a fallback, the `end_of_utterance_max_delay` is used to trigger the end of turn after a fixed amount of time, regardless of the content of what the most recent speaker has said.
When using `EXTERNAL` mode, call `client.finalize()` to force the end of turn.
```json
{
  "message": "EndOfTurn",
  "metadata": { "start_time": 9.16, "end_time": 11.08 }
}
```
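For `EXTERNAL` mode, the sketch below drives the end of turn from an application-side signal; `audio_frames` and `turn_is_over` are hypothetical stand-ins, and treating `finalize()` as awaitable is an assumption.

```python
from speechmatics.voice import (
    AgentServerMessageType,
    EndOfUtteranceMode,
    VoiceAgentClient,
    VoiceAgentConfig,
)

async def run_external_turns(audio_frames, turn_is_over):
    # `audio_frames` (an async iterator of PCM chunks) and `turn_is_over`
    # (a callable returning True when the application ends the turn, e.g. a
    # push-to-talk release) are illustrative stand-ins, not part of this SDK.
    config = VoiceAgentConfig(end_of_utterance_mode=EndOfUtteranceMode.EXTERNAL)

    async with VoiceAgentClient(config=config) as client:
        @client.on(AgentServerMessageType.END_OF_TURN)
        def handle_end_of_turn(message):
            print("End of turn:", message["metadata"])

        await client.connect()

        async for frame in audio_frames:
            await client.send_audio(frame)
            if turn_is_over():
                # In EXTERNAL mode the turn never ends on its own (beyond the
                # fallback timer), so finalize() forces END_OF_TURN.
                await client.finalize()
```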
## Environment Variables
- `SPEECHMATICS_API_KEY` - Your Speechmatics API key
- `SPEECHMATICS_RT_URL` - Custom WebSocket endpoint (optional)
- `SPEECHMATICS_DEBUG_MORE` - Enable verbose debugging
## Documentation
- **Speechmatics API**: https://docs.speechmatics.com
## License
[MIT](LICENSE)