litests

- Name: litests
- Version: 0.2.0
- Summary: A super lightweight Speech-to-Speech framework with modular VAD, STT, LLM and TTS components. 🧩
- Home page: https://github.com/uezo/litests
- Author / Maintainer: uezo <uezo@uezo.net>
- License: Apache v2
- Requirements: httpx==0.27.0, openai>=1.55.3
- Upload time: 2025-02-15 12:03:13

# LiteSTS

A super lightweight Speech-to-Speech framework with modular VAD, STT, LLM and TTS components. 🧩

## ✨ Features

- **🧩 Modular architecture**: VAD, STT, LLM, and TTS are like building blocks—just snap them together! Each one is super simple to integrate with a lightweight interface. Here's what we support out of the box (but feel free to add your own flair!):
    - VAD: Built-in Standard VAD (turn-end detection based on silence length)
    - STT: Google, Azure and OpenAI
    - LLM: ChatGPT, Gemini, Claude. Plus, with support for LiteLLM and Dify, you can use any LLM they support!
    - TTS: VOICEVOX / AivisSpeech, OpenAI, SpeechGateway (Yep, all the TTS supported by SpeechGateway, including Style-Bert-VITS2 and NijiVoice!)
- **🥰 Rich expression**: Supports text-based information exchange, enabling rich expressions such as facial expressions and motions on the front end. Voice styles seamlessly align with facial expressions to enhance overall expressiveness. It also supports techniques like Chain-of-Thought, letting you get the most out of the underlying model's capabilities.
- **🏎️ Super speed**: Speech synthesis and playback are performed in parallel with streaming responses from the LLM, enabling dramatically faster voice responses compared to simply connecting the components sequentially.


## 🎁 Installation

You can install it with a single `pip` command:

```sh
pip install git+https://github.com/uezo/litests
```

If you plan to use LiteSTS to handle microphone input or play audio on a local computer, make sure to install `PortAudio` and its Python binding, `PyAudio`, beforehand:

```sh
# Mac
brew install portaudio
pip install PyAudio
```


## 🚀 Quick start

It's super easy to create a Speech-to-Speech AI chatbot locally:

```python
import asyncio
from litests import LiteSTS
from litests.adapter.audiodevice import AudioDeviceAdapter

OPENAI_API_KEY = "YOUR_API_KEY"
GOOGLE_API_KEY = "YOUR_API_KEY"

async def quick_start_main():
    sts = LiteSTS(
        vad_volume_db_threshold=-30,    # Adjust microphone sensitivity (Gate)
        stt_google_api_key=GOOGLE_API_KEY,
        llm_openai_api_key=OPENAI_API_KEY,
        # Azure OpenAI
        # llm_model="azure",
        # llm_base_url="https://{your_resource_name}.openai.azure.com/openai/deployments/{your_deployment_name}/chat/completions?api-version={api_version}",
        debug=True
    )

    adapter = AudioDeviceAdapter(sts)
    await adapter.start_listening("session_id")

asyncio.run(quick_start_main())
```

Make sure the VOICEVOX server is running at http://127.0.0.1:50021, then save the script above (e.g. as `run.py`) and run it:

```sh
python run.py
```
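
If you are not sure whether the engine is reachable, you can probe it from Python with `httpx` (already a dependency of LiteSTS). The snippet below assumes the default port shown above and the VOICEVOX engine's `/version` endpoint:

```python
import httpx

# Prints the engine version if a VOICEVOX-compatible engine is listening on the default port
print(httpx.get("http://127.0.0.1:50021/version").text)
```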

Enjoy👍


## 🛠️ Customize pipeline

By instantiating modules for VAD, STT, LLM, TTS, and the Response Handler, and passing them to the LiteSTS constructor, you can fully customize the features of the pipeline.


```python
"""
Step 1. Create modular components
"""
# (1) VAD
from litests.vad import StandardSpeechDetector
vad = StandardSpeechDetector(...)

# (2) STT
from litests.stt.google import GoogleSpeechRecognizer
stt = GoogleSpeechRecognizer(...)

# (3) LLM
from litests.llm.chatgpt import ChatGPTService
llm = ChatGPTService(
    ...
    context_manager=PostgreSQLContextManager(...)   # <- Set if you use PostgreSQL
)

# (4) TTS
from litests.tts.voicevox import VoicevoxSpeechSynthesizer
tts = VoicevoxSpeechSynthesizer(...)

# (5) Performance Recorder
from litests.performance_recorder.postgres import PostgreSQLPerformanceRecorder
performance_recorder = PostgreSQLPerformanceRecorder(...)


"""
Step 2. Assign them to Speech-to-Speech pipeline
"""
sts = litests.LiteSTS(
    vad=vad,
    stt=stt,
    llm=llm,
    tts=tts,
    performance_recorder=performance_recorder,
    debug=True
)


"""
Step 3. Start Speech-to-Speech with adapter
"""
# Case 1: Microphone (PyAudio)
adapter = AudioDeviceAdapter(sts)
await adapter.start_listening("session_id")

# Case 2: WebSocket (Twilio)
class TwilioAdapter(WebSocketAdapter):
    # Implement adapter for twilio
    ...

adapter = TwilioAdapter(sts)
router = adapter.get_websocket_router(wss_base_url="wss://your_domain")
app = FastAPI()
app.include_router(router)
tts.audio_format = "mulaw"  # <- TTS service should support mulaw
```

See also `examples/local/llms.py`. For example, you can use Gemini with the following code:

```python
gemini = GeminiService(
    gemini_api_key=GEMINI_API_KEY
)

sts = litests.LiteSTS(
    vad=vad,
    stt=stt,
    llm=gemini,     # <- Set gemini here
    tts=tts,
    debug=True
)
```


## ⚡️ Function Calling

You can use Function Calling (Tool Call) by registering function specifications and their handlers through the `tool` decorator, as shown below. Functions are invoked automatically as needed.

**NOTE**: Currently, only ChatGPT is supported.

```python
# Create LLM service
llm = ChatGPTService(
    openai_api_key=OPENAI_API_KEY,
    system_prompt=SYSTEM_PROMPT
)

# Register tool
weather_tool_spec = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            },
        },
    }
}
@llm.tool(weather_tool_spec)    # NOTE: Gemini doesn't take spec as argument
async def get_weather(location: str = None):
    weather = await weather_api(location=location)
    return weather  # {"weather": "clear", "temperature": 23.4}
```
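
The handler above calls a `weather_api` coroutine that is not part of LiteSTS; for local testing you could stub it out with canned data, as in this hypothetical helper:

```python
# Hypothetical stand-in for a real weather API client used by the tool handler above
async def weather_api(location: str = None) -> dict:
    # A real implementation would call an external weather service here
    return {"weather": "clear", "temperature": 23.4}
```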


## ⛓️ Chain of Thought Prompting

Chain of Thought Prompting (CoT) is a popular technique for improving the quality of AI responses. By default, LiteSTS synthesizes the AI output directly, but it can also be configured to synthesize only the text inside specific XML tags.

For example, if you want the AI to output its thought process inside `<thinking>~</thinking>` and the final speech content inside `<answer>~</answer>`, you can use the following sample code:

```python
SYSTEM_PROMPT = """
Carefully consider the response first.
Output your thought process inside <thinking>~</thinking>.
Then, output the content to be spoken inside <answer>~</answer>.
"""

service = ChatGPTService(
    openai_api_key=OPENAI_API_KEY,
    system_prompt=SYSTEM_PROMPT,
    model="gpt-4o",
    temperature=0.5,
    voice_text_tag="answer" # <- Synthesize inner text of <answer> tag
)
```


## 🪄 Request Filter

You can validate and preprocess requests (text recognized from voice) before they are sent to the LLM.

```python
# Create LLM service
chatgpt = ChatGPTService(
    openai_api_key=OPENAI_API_KEY,
    system_prompt=SYSTEM_PROMPT,
    debug=True
)

# Set filter
@chatgpt.request_filter
def request_filter(text: str):
    return f"Here is the user's spoken input. Respond as if you're a cat, adding 'meow' to the end of each sentence.\n\nUser: {text}"
```

**System Prompt vs Request Filter:** While the system prompt is generally static and used to define overall behavior, the request filter can dynamically insert instructions based on the specific context. It can emphasize key points to prioritize in generating responses, helping stabilize the conversation and adapt to changing scenarios.
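
For example, a request filter can prepend per-request context that a static system prompt cannot carry, such as the current time. The sketch below reuses the decorator shown above (the prompt wording is illustrative):

```python
from datetime import datetime

@chatgpt.request_filter
def request_filter(text: str):
    # Inject dynamic context before the recognized text is sent to the LLM
    now = datetime.now().strftime("%Y-%m-%d %H:%M")
    return f"Current time: {now}. Keep the time of day in mind when responding.\n\nUser: {text}"
```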


## 🌏 Multi-language Support

You can dynamically switch the spoken language during a conversation.  
To enable this, configure the system prompt, set up `SpeechSynthesizer`, and add custom logic to the Speech-to-Speech pipeline as shown below:

```python
# System prompt
SYSTEM_PROMPT = """
You can speak the following languages:

- English (en-US)
- Chinese (zh-CN)

When responding in a language other than Japanese, always prepend `[lang:en-US]` or `[lang:zh-CN]` at the beginning of the response.  
Additionally, when switching back to Japanese from another language, always prepend `[lang:ja-JP]` at the beginning of the response.
"""

# Setup TTS and configure language-speaker map
tts = GoogleSpeechSynthesizer(
    google_api_key=GOOGLE_API_KEY,
    speaker="ja-JP-Standard-B",
)
tts.voice_map["en-US"] = "en-US-Standard-H"     # English
tts.voice_map["cmn-CN"] = "cmn-CN-Standard-D"   # Chinese

# Add parsing logic for language code
import re
@sts.process_llm_chunk
async def process_llm_chunk(chunk: STSResponse):
    match = re.search(r"\[lang:([a-zA-Z-]+)\]", chunk.text)
    if match:
        return {"language": match.group(1)}
    else:
        return {}
```


## 🥰 Voice Style

You can apply a specific voice style to synthesized speech when certain keywords are included in the response.

To use this feature, register the keywords and their corresponding speaker names or styles for each TTS component in a `style_mapper`.

```python
# VOICEVOX / AivisSpeech
from litests.tts.voicevox import VoicevoxSpeechSynthesizer
voicevox_tts = VoicevoxSpeechSynthesizer(
    # AivisSpeech
    base_url="http://127.0.0.1:10101",
    # Base speaker ID for the neutral style (Anneli / Neutral)
    speaker=888753761,
    # Define style mapper (Keyword in response : styled speaker)
    style_mapper={
        "[face:Joy]": "888753764",
        "[face:Angry]": "888753765",
        "[face:Sorrow]": "888753765",
        "[face:Fun]": "888753762",
        "[face:Surprised]": "888753762"
    },
    debug=True
)

# SpeechGateway
from litests.tts.speech_gateway import SpeechGatewaySpeechSynthesizer
tts = SpeechGatewaySpeechSynthesizer(
    tts_url="http://127.0.0.1:8000/tts",
    service_name="sbv2",
    speaker="0-0",
    # Define style mapper (Keyword in response : voice style)
    style_mapper={
        "[face:Joy]": "joy",
        "[face:Angry]": "angry",
        "[face:Sorrow]": "sorrow",
        "[face:Fun]": "fun",
        "[face:Surprised]": "surprised",
    },
    debug=True
)
```
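
The mapper only takes effect when the LLM actually emits the registered keywords, so the system prompt should ask for them. A minimal sketch (illustrative wording, not an official prompt) might look like this:

```python
SYSTEM_PROMPT = """
Prepend one of the following tags to each response to express your emotion:

[face:Joy], [face:Angry], [face:Sorrow], [face:Fun], [face:Surprised]
"""
```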


## 🧩 Make custom modules

By creating modules that inherit the interfaces for VAD, STT, LLM, TTS, and the Response Handler, you can integrate them into the pipeline. Below, only the interfaces are introduced; for implementation details, please refer to the existing modules included in the repository.


### VAD

Create a class that implements the `process_samples` and `process_stream` methods.

```python
class SpeechDetector(ABC):
    @abstractmethod
    async def process_samples(self, samples: bytes, session_id: str = None):
        pass

    @abstractmethod
    async def process_stream(self, input_stream: AsyncGenerator[bytes, None], session_id: str = None):
        pass
```

### STT

Create a class that implements just the `transcribe` method.

```python
class SpeechRecognizer(ABC):
    @abstractmethod
    async def transcribe(self, data: bytes) -> str:
        pass
```
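
As a minimal illustration (a hypothetical sketch, not a module bundled with LiteSTS), a custom recognizer that posts audio to your own HTTP STT endpoint could look like the following; the base-class import path and the endpoint URL are assumptions:

```python
import httpx
from litests.stt import SpeechRecognizer  # import path assumed


class MyHttpSpeechRecognizer(SpeechRecognizer):
    def __init__(self, endpoint_url: str = "http://127.0.0.1:9000/transcribe"):
        # Hypothetical endpoint that accepts raw audio bytes and returns {"text": "..."}
        self.endpoint_url = endpoint_url

    async def transcribe(self, data: bytes) -> str:
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                self.endpoint_url,
                content=data,
                headers={"Content-Type": "application/octet-stream"},
            )
            resp.raise_for_status()
            return resp.json().get("text", "")
```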

### LLM

Create a class that implements the `compose_messages`, `update_context`, and `get_llm_stream_response` methods.

```python
class LLMService(ABC):
    @abstractmethod
    async def compose_messages(self, context_id: str, text: str) -> List[Dict]:
        pass

    @abstractmethod
    async def update_context(self, context_id: str, messages: List[Dict], response_text: str):
        pass

    @abstractmethod
    async def get_llm_stream_response(self, context_id: str, messages: List[dict]) -> AsyncGenerator[str, None]:
        pass
```

### TTS

Create a class that implements just the `synthesize` method.

```python
class SpeechSynthesizer(ABC):
    @abstractmethod
    async def synthesize(self, text: str) -> bytes:
        pass
```

### Adapter

Create a class that implements the `handle_response` and `stop_response` methods.

```python
class Adapter(ABC):
    @abstractmethod
    async def handle_response(self, response: STSResponse):
        pass

    @abstractmethod
    async def stop_response(self, context_id: str):
        pass
```

### Context Manager

Create a class that implements the `get_histories` and `add_histories` methods.

```python
class ContextManager(ABC):
    @abstractmethod
    async def get_histories(self, context_id: str, limit: int = 100) -> List[Dict]:
        pass

    @abstractmethod
    async def add_histories(self, context_id: str, data_list: List[Dict], context_schema: str = None):
        pass
```
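
As another illustration (again a hypothetical sketch, with the base-class import path assumed), an in-memory context manager that keeps histories per `context_id` only needs the two methods above:

```python
from collections import defaultdict
from typing import Dict, List

from litests.llm.context_manager import ContextManager  # import path assumed


class InMemoryContextManager(ContextManager):
    def __init__(self):
        # context_id -> message dicts in insertion order (oldest first)
        self.histories: Dict[str, List[Dict]] = defaultdict(list)

    async def get_histories(self, context_id: str, limit: int = 100) -> List[Dict]:
        return self.histories[context_id][-limit:]

    async def add_histories(self, context_id: str, data_list: List[Dict], context_schema: str = None):
        self.histories[context_id].extend(data_list)
```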


## 🔌 WebSocket

Refer to `examples/websocket`.

- `server.py`: A WebSocket server program. Set your API keys and start this first.

    ```sh
    uvicorn server:app
    ```

- `client.py`: A WebSocket client program. Run this after starting the server, then start a conversation by saying something.

    ```sh
    python client.py
    ```

**NOTE**: To make the core mechanism easier to understand, exception handling and resource cleanup have been omitted. If you plan to use this in a production service, be sure to implement these as well.


## 📈 Performance Recorder

The performance recorder records the time taken by each component in the Speech-to-Speech pipeline, from invocation to completion.

The recorded metrics include:

- `stt_time`: Time taken for transcription by the Speech-to-Text service.
- `stop_response_time`: Time taken to stop the response of a previous request, if any.
- `llm_first_chunk_time`: Time taken to receive the first sentence from the LLM.
- `llm_first_voice_chunk_time`: Time taken to receive the first sentence from the LLM that is used for speech synthesis.
- `llm_time`: Time taken to receive the full response from the LLM.
- `tts_first_chunk_time`: Time taken to synthesize the first sentence for speech synthesis.
- `tts_time`: Time taken to complete the entire speech synthesis process.
- `total_time`: Total time taken for the entire pipeline to complete.

The key metric is `tts_first_chunk_time`, which measures the time between when the user finishes speaking and when the system begins its response.

By default, SQLite is used for storing data, but you can provide a custom recorder by implementing the `PerformanceRecorder` interface.

            
