# SpeechFlow
A unified Python TTS (Text-to-Speech) library that provides a simple interface for multiple TTS engines.
## Features
- **Multiple TTS Engine Support**:
  - OpenAI TTS
  - Google Gemini TTS
  - FishAudio TTS (Cloud-based, multi-voice)
  - Kokoro TTS (Multi-language, lightweight, local)
  - Style-Bert-VITS2 (Local, high-quality Japanese TTS)
- **Unified Interface**: Switch between different TTS engines without changing your code
- **Streaming Support**: Real-time audio streaming for supported engines
- **Decoupled Architecture**: Use TTS engines, audio players, and file writers independently
- **Audio Playback**: Synchronous audio player with streaming support
- **File Export**: Save synthesized speech to various audio formats
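The unified interface can be illustrated with a minimal sketch. It assumes only what the examples in this README show, namely that every engine exposes `get(text)`. The `SupportsGet` protocol, the `synthesize` helper, and the `FakeEngine` stand-in are hypothetical names used for illustration and are not part of SpeechFlow.

```python
from typing import Any, Protocol


class SupportsGet(Protocol):
    """Hypothetical protocol: anything with a get(text) method."""

    def get(self, text: str, **kwargs: Any) -> Any: ...


class FakeEngine:
    """Stand-in engine so the sketch runs without API keys or models."""

    def __init__(self, name: str) -> None:
        self.name = name

    def get(self, text: str, **kwargs: Any) -> bytes:
        # A real engine would return synthesized audio here.
        return f"{self.name}:{text}".encode("utf-8")


def synthesize(engine: SupportsGet, text: str) -> Any:
    """Call any engine through the same method, regardless of backend."""
    return engine.get(text)


# Swapping the engine does not change the calling code.
for engine in (FakeEngine("engine-a"), FakeEngine("engine-b")):
    audio = synthesize(engine, "Hello, world!")
    print(len(audio), "bytes from", engine.name)
```

In real use, a SpeechFlow engine instance would take the place of `FakeEngine`, while `synthesize` stays the same.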
## Installation
```bash
pip install speechflow
# or
uv add speechflow
```
### GPU Support for PyTorch
SpeechFlow includes PyTorch as a dependency for some TTS engines (Kokoro, Style-Bert-VITS2). By default, pip/uv installs whichever PyTorch build the default package index provides for your platform, which is often CPU-only.
**For GPU acceleration, install PyTorch BEFORE installing speechflow:**
**Option 1: Using pip**
```bash
# First install PyTorch with CUDA (example for CUDA 12.1)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Then install speechflow
pip install speechflow
```
**Option 2: Using uv**
```bash
# First add PyTorch with CUDA support
uv add torch torchvision torchaudio --index https://download.pytorch.org/whl/cu121
# Then add speechflow
uv add speechflow
```
**Note:**
- Replace `cu121` with your CUDA version (e.g., `cu118` for CUDA 11.8, `cu124` for CUDA 12.4)
- If you've already installed speechflow with CPU PyTorch, you'll need to reinstall PyTorch:
  ```bash
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 --upgrade --force-reinstall
  ```
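After installing, it can help to confirm which PyTorch build is active. The following is a minimal sketch; `describe_torch_backend` is a hypothetical helper name, and it relies only on the standard `torch.cuda.is_available()`, `torch.version.cuda`, and `torch.cuda.get_device_name()` calls. It also runs when PyTorch is not installed at all.

```python
import importlib.util


def describe_torch_backend() -> str:
    """Report whether PyTorch is installed and whether it can use CUDA."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"

    import torch  # imported lazily so the check works without PyTorch

    if torch.cuda.is_available():
        # torch.version.cuda is the CUDA version the wheel was built against.
        return f"cuda {torch.version.cuda} ({torch.cuda.get_device_name(0)})"
    return "cpu only"


print(describe_torch_backend())
```

If this prints `cpu only` on a machine with an NVIDIA GPU, the CPU build is probably installed and the reinstall command above applies.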
## Quick Start
### Basic Usage (Decoupled Components)
```python
from speechflow import OpenAITTSEngine, AudioPlayer, AudioWriter
# Initialize components
engine = OpenAITTSEngine(api_key="your-api-key")
player = AudioPlayer()
writer = AudioWriter()
# Generate audio
audio = engine.get("Hello, world!")
# Play audio
player.play(audio)
# Save to file
writer.save(audio, "output.wav")
```
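The example above passes the API key as a string literal. One common alternative is to read it from an environment variable, sketched below. The helper name `load_api_key` and the variable name `OPENAI_API_KEY` are assumptions for illustration; the only SpeechFlow detail relied on is the `api_key` constructor argument shown above.

```python
import os


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable, or raise."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Usage (commented out so the sketch runs without SpeechFlow installed):
# from speechflow import OpenAITTSEngine
# engine = OpenAITTSEngine(api_key=load_api_key())
```

Failing early with a clear message tends to be easier to debug than an authentication error raised on the first synthesis call.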
### Streaming Audio
**Important Notes on Streaming Behavior:**
- **OpenAI**: True streaming with multiple chunks. The first call may incur a 10-20 s cold-start delay. Uses PCM format for simplicity.
- **Gemini**: Returns complete audio in a single chunk (as of January 2025). This is a known limitation, not true streaming.
```python
from speechflow import OpenAITTSEngine, AudioPlayer, AudioWriter
# Initialize components
engine = OpenAITTSEngine(api_key="your-api-key")
player = AudioPlayer()
writer = AudioWriter()
# Warmup for OpenAI (recommended for production)
_ = list(engine.stream("Warmup"))
# Stream and play audio (returns combined AudioData)
combined_audio = player.play_stream(engine.stream("This is a long text that will be streamed..."))
# Save the combined audio to file
writer.save(combined_audio, "output.wav")
```
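To see whether an engine really streams or returns everything in one chunk, one option is to count the chunks and time the first one. The sketch below is an illustration only: `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` stands in for `engine.stream(...)` so the code runs without any engine or API key.

```python
import time
from typing import Any, Iterable, Iterator


def measure_stream(chunks: Iterable[Any]) -> dict:
    """Consume a chunk iterator and record chunk count and timings."""
    start = time.perf_counter()
    first_chunk_at = None
    count = 0
    for _ in chunks:
        if first_chunk_at is None:
            # Time until the first chunk arrives (the perceived latency).
            first_chunk_at = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return {"chunks": count, "first_chunk_s": first_chunk_at, "total_s": total}


def fake_stream(n: int, delay: float = 0.01) -> Iterator[bytes]:
    """Stand-in for engine.stream(...): yields n chunks with a small delay."""
    for _ in range(n):
        time.sleep(delay)
        yield b"\x00" * 1024


print(measure_stream(fake_stream(5)))  # several chunks: streaming-like
print(measure_stream(fake_stream(1)))  # one chunk: complete audio at once
```

With a real engine, a result of one chunk whose `first_chunk_s` is close to `total_s` would match the single-chunk behavior described for Gemini above.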
## Engine-Specific Features
### OpenAI TTS
```python
from speechflow import OpenAITTSEngine
engine = OpenAITTSEngine(api_key="your-api-key")
audio = engine.get(
    "Hello",
    voice="alloy",  # or: ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer
    model="gpt-4o-mini-tts",  # or: tts-1, tts-1-hd
    speed=1.0
)
# Streaming
for chunk in engine.stream("Long text..."):
    # Process audio chunks in real-time
    pass
```
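One way to process chunks in real time is to append them to a WAV file as they arrive. The sketch below uses only the standard-library `wave` module. It assumes each chunk is available as raw 16-bit mono PCM bytes at 24 kHz, which is the raw PCM format OpenAI documents; how to obtain raw bytes from the chunk objects that SpeechFlow yields is not covered in this README, so that step is left out. `write_pcm_chunks` is a hypothetical helper name.

```python
import os
import tempfile
import wave
from typing import Iterable


def write_pcm_chunks(chunks: Iterable[bytes], path: str,
                     sample_rate: int = 24000, channels: int = 1,
                     sample_width: int = 2) -> int:
    """Append raw PCM chunks to a WAV file as they arrive; return bytes written."""
    written = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)
            written += len(chunk)
    return written


# Demo with silence instead of real audio: 10 chunks of 0.1 s each.
demo_path = os.path.join(tempfile.gettempdir(), "speechflow_demo.wav")
silence = [b"\x00\x00" * 2400 for _ in range(10)]
print(write_pcm_chunks(silence, demo_path), "bytes written to", demo_path)
```

For most cases the `AudioWriter` shown in the Quick Start is the simpler choice; this sketch is mainly useful when chunks need to reach disk before the stream has finished.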
### Google Gemini TTS
```python
from speechflow import GeminiTTSEngine
engine = GeminiTTSEngine(api_key="your-api-key")
audio = engine.get(
    "Hello",
    model="gemini-2.5-flash-preview-tts",  # or: gemini-2.5-pro-preview-tts
    voice="Leda",  # or: Puck, Charon, Kore, Fenrir, Aoede, and many more
    speed=1.0
)
```
### FishAudio TTS
```python
from speechflow import FishAudioTTSEngine
engine = FishAudioTTSEngine(api_key="your-api-key")
audio = engine.get(
    "Hello world",
    model="s1",  # or: s1-mini, speech-1.6, speech-1.5, agent-x0
    voice="your-voice-id"  # Use your FishAudio voice ID
)
# Streaming
for chunk in engine.stream("Streaming text..."):
    # Process audio chunks
    pass
```
### Kokoro TTS
```python
from speechflow import KokoroTTSEngine
# Default: American English
engine = KokoroTTSEngine()
audio = engine.get(
    "Hello world",
    voice="af_heart"  # Multiple voices available
)
# Japanese (requires additional setup)
engine = KokoroTTSEngine(lang_code="j")
audio = engine.get(
    "こんにちは、世界",
    voice="af_heart"
)
```
**Note for Japanese support:**
The Japanese dictionary will be automatically downloaded on first use.
If you encounter errors, you can manually download it:
```bash
python -m unidic download
```
### Style-Bert-VITS2
```python
from speechflow import StyleBertTTSEngine
# Use pre-trained model (automatically downloads on first use)
engine = StyleBertTTSEngine(model_name="jvnv-F1-jp") # Female Japanese voice
audio = engine.get(
    "こんにちは、世界",
    style="Happy",  # Emotion: Neutral, Happy, Sad, Angry, Fear, Surprise, Disgust
    style_weight=5.0,  # Emotion strength (0.0-10.0)
    speed=1.0,  # Speech speed
    pitch=0.0  # Pitch shift in semitones
)
# Available pre-trained models:
# - jvnv-F1-jp, jvnv-F2-jp: Female voices (JP-Extra version)
# - jvnv-M1-jp, jvnv-M2-jp: Male voices (JP-Extra version)
# - jvnv-F1, jvnv-F2, jvnv-M1, jvnv-M2: Legacy versions
# Use custom model
engine = StyleBertTTSEngine(model_path="/path/to/your/model")
# Sentence-by-sentence streaming (not true streaming)
for audio_chunk in engine.stream("長い文章を文ごとに生成します。"):
    # Process each sentence's audio
    pass
```
**Note:** Style-Bert-VITS2 is optimized for Japanese text, and a GPU is recommended for best performance.
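Sentence-by-sentence streaming means the text is cut into sentences and each one is synthesized in turn. How SpeechFlow splits the text internally is not documented here, so the following is only a sketch of the general idea, using a hypothetical `split_sentences` helper that breaks on Japanese and ASCII sentence-ending punctuation.

```python
import re

# A sentence is a run of non-terminator characters followed by any
# sentence-ending punctuation (Japanese full-width or ASCII).
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences, keeping the trailing punctuation."""
    return [m.strip() for m in _SENTENCE.findall(text) if m.strip()]


for sentence in split_sentences("長い文章を文ごとに生成します。次の文です！"):
    # Each sentence would be synthesized and yielded as its own chunk.
    print(sentence)
```

Because every chunk is a complete sentence, playback can start once the first sentence is ready, but latency still depends on the length of that sentence.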
## Language Support
### Kokoro Languages
- 🇺🇸 American English (`a`)
- 🇬🇧 British English (`b`)
- 🇪🇸 Spanish (`e`)
- 🇫🇷 French (`f`)
- 🇮🇳 Hindi (`h`)
- 🇮🇹 Italian (`i`)
- 🇯🇵 Japanese (`j`) - requires unidic
- 🇧🇷 Brazilian Portuguese (`p`)
- 🇨🇳 Mandarin Chinese (`z`)
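The codes above are the values passed as `lang_code` when constructing `KokoroTTSEngine`. As a small sketch, they can be kept in a lookup table and checked before an engine is created. `KOKORO_LANG_CODES` and `validate_lang_code` are hypothetical names for illustration and are not part of SpeechFlow.

```python
# Language codes copied from the list above.
KOKORO_LANG_CODES = {
    "a": "American English",
    "b": "British English",
    "e": "Spanish",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "j": "Japanese",
    "p": "Brazilian Portuguese",
    "z": "Mandarin Chinese",
}


def validate_lang_code(code: str) -> str:
    """Return the code unchanged if it is in the table above, else raise."""
    if code not in KOKORO_LANG_CODES:
        valid = ", ".join(sorted(KOKORO_LANG_CODES))
        raise ValueError(f"Unknown lang_code {code!r}; expected one of: {valid}")
    return code


print(validate_lang_code("j"), "->", KOKORO_LANG_CODES["j"])

# Usage (commented out so the sketch runs without SpeechFlow installed):
# from speechflow import KokoroTTSEngine
# engine = KokoroTTSEngine(lang_code=validate_lang_code("j"))
```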
## License
MIT