# Vogent Turn
**Fast and accurate turn detection for voice AI**
Multimodal turn detection that combines audio intonation and text context to accurately determine when a speaker has finished their turn in a conversation.
[Technical Report](https://blog.vogent.ai/posts/voturn-80m-state-of-the-art-turn-detection-for-voice-agents)
[Model Weights](https://huggingface.co/vogent/Vogent-Turn-80M)
[HF Space](https://huggingface.co/spaces/vogent/vogent-turn-demo)
## Key Features
- **Multimodal**: Uses both audio (Whisper encoder) and text (SmolLM) for context-aware predictions
- **Fast**: Optimized with `torch.compile` for low-latency inference
- **Easy to Use**: Simple Python API with just a few lines of code
- **Production-Ready**: Batched inference, model caching, and comprehensive error handling
## Architecture
- **Audio Encoder**: Whisper-Tiny (processes up to 8 seconds of 16kHz audio)
- **Text Model**: SmolLM-135M (12 layers, ~80M parameters)
- **Classifier**: Binary classification (turn complete / turn incomplete)
The model projects audio embeddings into the LLM's input space and processes them together with conversation context for turn detection.
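As a rough illustration, the sketch below shows the shape of this fusion step with stand-in tensors. The dimensions (384 for Whisper-Tiny, 576 for SmolLM-135M) and module names are assumptions for the sketch, not the actual implementation (see `vogent_turn/smollm_whisper.py`):
```python
# Illustrative sketch of the audio/text fusion step, not the real model.
# AUDIO_DIM and TEXT_DIM are assumed from Whisper-Tiny / SmolLM-135M defaults.
import torch
import torch.nn as nn

AUDIO_DIM, TEXT_DIM = 384, 576

projector = nn.Linear(AUDIO_DIM, TEXT_DIM)      # audio -> LM input space
classifier = nn.Linear(TEXT_DIM, 2)             # [continue, endpoint] logits

audio_embeds = torch.randn(1, 1500, AUDIO_DIM)  # stand-in for Whisper encoder output
text_embeds = torch.randn(1, 24, TEXT_DIM)      # stand-in for tokenized context

# Project audio into the LM embedding space and concatenate with text.
# The real model feeds `fused` to SmolLM via inputs_embeds before classifying;
# here we mean-pool directly just to keep the sketch self-contained.
fused = torch.cat([projector(audio_embeds), text_embeds], dim=1)
logits = classifier(fused.mean(dim=1))
print(logits.shape)  # torch.Size([1, 2])
```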
---
## Installation
### Using `pip`
```bash
pip install vogent-turn
```
### Using `uv`
```bash
uv init
uv add vogent-turn
```
### From Source (Traditional)
```bash
git clone https://github.com/vogent/vogent-turn.git
cd vogent-turn
pip install -e .
```
### From Source (with `uv`, Recommended for Development)
[`uv`](https://github.com/astral-sh/uv) is a fast Python package manager. If you have it installed:
```bash
git clone https://github.com/vogent/vogent-turn.git
cd vogent-turn
# Create virtual environment and install
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv pip install -e .
```
### Requirements
- Python 3.10+
- See `pyproject.toml` for the full dependency list
---
## Quick Start
### Python Library
```python
from vogent_turn import TurnDetector
import soundfile as sf
import urllib.request
# Initialize detector
detector = TurnDetector(compile_model=True, warmup=True)
# Download and load audio
audio_url = "https://storage.googleapis.com/voturn-sample-recordings/incomplete_number_sample.wav"
urllib.request.urlretrieve(audio_url, "sample.wav")
audio, sr = sf.read("sample.wav")
# Run turn detection with conversational context
result = detector.predict(
    audio,
    prev_line="What is your phone number",
    curr_line="My number is 804",
    sample_rate=sr,
    return_probs=True,
)
print(f"Turn complete: {result['is_endpoint']}")
print(f"Confidence: {result['prob_endpoint']:.1%}")
```
### CLI Tool
```bash
# Basic usage (sample rate automatically detected from file)
vogent-turn-predict speech.wav \
    --prev "What is your phone number" \
    --curr "My number is 804"
```
**Note:** Sample rate is automatically detected from the audio file. Audio will be resampled to 16kHz internally if needed.
---
## API Reference
### `TurnDetector`
Main class for turn detection inference.
#### Constructor
```python
detector = TurnDetector(
    model_name="vogent/Vogent-Turn-80M",  # HuggingFace model ID
    revision="main",                      # Model revision
    device=None,                          # "cuda", "cpu", or None (auto)
    compile_model=True,                   # Use torch.compile for speed
    warmup=True,                          # Run a warmup inference at load (as in Quick Start)
)
```
#### `predict()`
Detect if the current speaker has finished their turn.
```python
result = detector.predict(
    audio,               # np.ndarray: (n_samples,) mono float32
    prev_line="",        # str: Previous speaker's text (optional)
    curr_line="",        # str: Current speaker's text (optional)
    sample_rate=None,    # int: Sample rate in Hz (recommended; 16kHz assumed if omitted)
    return_probs=False,  # bool: Return probabilities
)
```
**Note:** The model operates at 16kHz internally. If you provide audio at a different sample rate, it will be automatically resampled (requires `librosa`). If no sample rate is specified, 16kHz is assumed with a warning.
**Returns:**
- If `return_probs=False`: `bool` (True = turn complete, False = continue)
- If `return_probs=True`: `dict` with keys:
- `is_endpoint`: bool
- `prob_endpoint`: float (0-1)
- `prob_continue`: float (0-1)
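If you prefer to handle resampling yourself before calling `predict()`, here is a minimal sketch (assuming `librosa` and `soundfile` are installed; the file name is a placeholder):
```python
import librosa
import soundfile as sf

audio, sr = sf.read("speech_44k.wav", dtype="float32")  # placeholder file
if sr != 16000:
    # librosa.resample takes keyword-only orig_sr/target_sr in recent versions
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)

result = detector.predict(
    audio,
    prev_line="What is your phone number",
    curr_line="My number is 804",
    sample_rate=16000,
)
```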
#### `predict_batch()`
Process multiple audio samples efficiently in a single batch.
```python
results = detector.predict_batch(
    audio_batch,         # list[np.ndarray]: List of audio arrays
    context_batch=None,  # list[dict]: Context dicts with 'prev_line' and 'curr_line'
    sample_rate=None,    # int: Sample rate in Hz (applies to all audio)
    return_probs=False,  # bool: Return probabilities
)
```
**Note:** All audio samples in the batch must have the same sample rate. Audio will be automatically resampled to 16kHz if a different rate is specified.
**Returns:**
- List of predictions, one per input, each in the same format as `predict()` for the given `return_probs` setting
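For example (a sketch; the file names and contexts are placeholders, and both files are assumed to share one sample rate):
```python
import soundfile as sf

audio1, sr = sf.read("caller_one.wav", dtype="float32")  # placeholder files
audio2, _ = sf.read("caller_two.wav", dtype="float32")

results = detector.predict_batch(
    [audio1, audio2],
    context_batch=[
        {"prev_line": "What is your phone number", "curr_line": "My number is 804"},
        {"prev_line": "How are you doing today", "curr_line": "I'm doing great thanks"},
    ],
    sample_rate=sr,
    return_probs=True,
)
for r in results:
    print(r["is_endpoint"], f"{r['prob_endpoint']:.1%}")
```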
#### Audio Requirements
- **Sample rate**: 16kHz
- **Channels**: Mono
- **Format**: float32 numpy array
- **Range**: [-1.0, 1.0]
- **Duration**: Up to 8 seconds (longer audio will be truncated)
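A short sketch of coercing arbitrary input into this format (assumes `soundfile` and `numpy`; truncating from the end of the clip is an assumption here, and the library performs its own truncation regardless):
```python
import numpy as np
import soundfile as sf

audio, sr = sf.read("speech.wav", dtype="float32")  # float32 in [-1.0, 1.0]
if audio.ndim > 1:
    audio = audio.mean(axis=1)        # downmix multi-channel audio to mono
max_samples = 8 * 16000               # 8 seconds at 16kHz
if sr == 16000 and audio.shape[0] > max_samples:
    audio = audio[-max_samples:]      # keep the most recent 8 seconds
```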
### Text Context Format
The model uses conversation context to improve predictions:
- **`prev_line`**: What the previous speaker said (e.g., a question)
- **`curr_line`**: What the current speaker is saying (e.g., their response)
For best performance, do not include terminal punctuation (periods, etc.).
**Example:**
```python
result = detector.predict(
    audio,
    prev_line="How are you doing today",
    curr_line="I'm doing great thanks",
)
```
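A tiny helper for stripping terminal punctuation from transcripts before passing them as context (an illustrative heuristic, not part of the library):
```python
def clean_context(text: str) -> str:
    """Trim whitespace and terminal punctuation from a transcript line."""
    return text.strip().rstrip(".!?;:")

prev_line = clean_context("How are you doing today?")  # -> "How are you doing today"
curr_line = clean_context("I'm doing great thanks.")   # -> "I'm doing great thanks"
```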
---
## Model Details
### Multimodal Architecture
```
Audio (16kHz) ──> Whisper Encoder ──> Audio Embeddings (1500D)
                                                 │
                                                 v
                                          Audio Projector
                                                 │
Text Context ──> SmolLM Tokenizer ──> Text Embeddings (variable length)
                                                 │
                                                 v
                                   [Audio Embeds + Text Embeds]
                                                 │
                                                 v
                                              SmolLM
                                                 │
                                                 v
                                        Classification Head
                                                 │
                                                 v
                                       [Endpoint / Continue]
```
### Training Data
The model is trained on conversational audio with labeled turn boundaries. It learns to detect:
- **Prosodic cues**: Pitch, intonation, pauses
- **Semantic cues**: Completeness of thought, question-answer patterns
- **Contextual cues**: Conversation flow and expectations
---
## Examples
Sample scripts can be found in the `examples/` directory:
- `python3 examples/basic_usage.py` downloads an audio file and runs the turn detector.
- `python3 examples/batch_processing.py` downloads two audio files and runs the turn detector on a batched input.
- `examples/request_batcher.py` is a sample implementation of a thread that continuously receives and batches requests (e.g. in a production setting); a minimal sketch of the pattern follows.
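The sketch below illustrates that batching pattern under stated assumptions (queue-based handoff; the names, batch size, and timeout are illustrative). It is not the actual `request_batcher.py` implementation:
```python
import queue
import threading

# Each request: {"audio": np.ndarray, "context": dict, "reply": queue.Queue}
requests = queue.Queue()

def batch_worker(detector, max_batch=8, wait_s=0.01):
    while True:
        batch = [requests.get()]                  # block until the first request arrives
        try:
            while len(batch) < max_batch:         # then gather more for a short window
                batch.append(requests.get(timeout=wait_s))
        except queue.Empty:
            pass
        audios = [req["audio"] for req in batch]
        contexts = [req["context"] for req in batch]
        results = detector.predict_batch(audios, context_batch=contexts)
        for req, result in zip(batch, results):
            req["reply"].put(result)              # hand the result back to the caller

# threading.Thread(target=batch_worker, args=(detector,), daemon=True).start()
```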
---
## Development
### Project Structure
```
vogent-turn/              # Project root
├── pyproject.toml        # Package configuration and dependencies
├── vogent_turn/          # Python package
│   ├── __init__.py       # Package exports
│   ├── inference.py      # Main TurnDetector class
│   ├── predict.py        # CLI tool
│   ├── smollm_whisper.py # Model architecture
│   └── whisper.py        # Whisper components
└── examples/             # Usage examples
    ├── basic_usage.py
    └── batch_processing.py
```
### Contributing
Contributions are welcome! Please:
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
---
## Citation
If you use this library in your research, please cite:
```bibtex
@software{vogent_turn,
title = {Vogent Turn: Multimodal Turn Detection for Conversational AI},
author = {Vogent},
year = {2024},
url = {https://github.com/vogent/vogent-turn}
}
```
---
## License
Inference code is open-source under Apache 2.0. Model weights are under a modified Apache 2.0 license with stricter attribution requirements for certain types of usage.
---
## Support
- **Issues**: [GitHub Issues](https://github.com/vogent/vogent-turn/issues)
---
## Changelog
### v0.1.0 (2025-10-19)
- Initial release
- Multimodal turn detection with Whisper + SmolLM
- Python library and CLI tool
- `torch.compile` optimization for fast inference