vogent-turn 0.1.1

- Summary: Lightweight turn detection library for conversational AI
- Author: Vogent
- Homepage: https://github.com/vogent/vogent-turn
- Requires Python: >=3.10
- License: Apache-2.0
- Keywords: conversation-ai, multimodal, speech, turn-detection, voice, whisper
- Uploaded: 2025-10-28

# Vogent Turn

**Fast and accurate turn detection for voice AI**

Multimodal turn detection that combines audio intonation and text context to accurately determine when a speaker has finished their turn in a conversation.

[Technical Report](https://blog.vogent.ai/posts/voturn-80m-state-of-the-art-turn-detection-for-voice-agents)

[Model Weights](http://huggingface.co/vogent/Vogent-Turn-80M)

[HF Space](https://huggingface.co/spaces/vogent/vogent-turn-demo)


## Key Features

- **Multimodal**: Uses both audio (Whisper encoder) and text (SmolLM) for context-aware predictions
- **Fast**: Optimized with `torch.compile` for low-latency inference
- **Easy to Use**: Simple Python API with just a few lines of code
- **Production-Ready**: Batched inference, model caching, and comprehensive error handling

## Architecture

- **Audio Encoder**: Whisper-Tiny (processes up to 8 seconds of 16kHz audio)
- **Text Model**: SmolLM-135M (12 layers, ~80M parameters)  
- **Classifier**: Binary classification (turn complete / turn incomplete)

The model projects audio embeddings into the LLM's input space and processes them together with conversation context for turn detection.
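
As a rough illustration, the fusion step can be sketched as follows. This is a minimal sketch, assuming Whisper-Tiny's 384-dimensional encoder output and a 576-dimensional LLM embedding space (both assumptions for illustration); the actual module lives in `vogent_turn/smollm_whisper.py` and may differ:

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Hypothetical projector: maps Whisper encoder frames into the LLM space."""
    def __init__(self, audio_dim=384, llm_dim=576):
        super().__init__()
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_embeds):      # (batch, frames, audio_dim)
        return self.proj(audio_embeds)    # (batch, frames, llm_dim)

# Prepend projected audio frames to the text embeddings; the combined
# sequence then flows through the LLM and a binary classification head.
audio_embeds = torch.randn(1, 400, 384)  # dummy Whisper encoder output
text_embeds = torch.randn(1, 12, 576)    # dummy embedded conversation context
fused = torch.cat([AudioProjector()(audio_embeds), text_embeds], dim=1)
print(fused.shape)                       # torch.Size([1, 412, 576])
```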

---

## Installation

### Using `pip`

```bash
pip install vogent-turn
```

### Using `uv`

```bash
uv init
uv add vogent-turn
```

### From Source (Traditional)

```bash
git clone https://github.com/vogent/vogent-turn.git
cd vogent-turn
pip install -e .
```

### From Source (with UV - Recommended for Development)

[UV](https://github.com/astral-sh/uv) is a fast Python package manager. If you have UV installed:

```bash
git clone https://github.com/vogent/vogent-turn.git
cd vogent-turn

# Create virtual environment and install
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .
```

### Requirements

- See `pyproject.toml` for full list

---

## Quick Start

### Python Library

```python
from vogent_turn import TurnDetector
import soundfile as sf
import urllib.request

# Initialize detector
detector = TurnDetector(compile_model=True, warmup=True)

# Download and load audio
audio_url = "https://storage.googleapis.com/voturn-sample-recordings/incomplete_number_sample.wav"
urllib.request.urlretrieve(audio_url, "sample.wav")
audio, sr = sf.read("sample.wav")

# Run turn detection with conversational context
result = detector.predict(
    audio,
    prev_line="What is your phone number",
    curr_line="My number is 804",
    sample_rate=sr,
    return_probs=True,
)

print(f"Turn complete: {result['is_endpoint']}")
print(f"Confidence: {result['prob_endpoint']:.1%}")
```

### CLI Tool

```bash
# Basic usage (sample rate automatically detected from file)
vogent-turn-predict speech.wav \
  --prev "What is your phone number" \
  --curr "My number is 804"
```

**Note:** Sample rate is automatically detected from the audio file. Audio will be resampled to 16kHz internally if needed.

---

## API Reference

### `TurnDetector`

Main class for turn detection inference.

#### Constructor

```python
detector = TurnDetector(
    model_name="vogent/Vogent-Turn-80M",  # HuggingFace model ID
    revision="main",                     # Model revision
    device=None,                         # "cuda", "cpu", or None (auto)
    compile_model=True                   # Use torch.compile for speed
)
```

#### `predict()`

Detect if the current speaker has finished their turn.

```python
result = detector.predict(
    audio,                    # np.ndarray: (n_samples,) mono float32
    prev_line="",             # str: Previous speaker's text (optional)
    curr_line="",             # str: Current speaker's text (optional)
    sample_rate=None,         # int: Sample rate in Hz (recommended to specify, otherwise 16kHz is assumed)
    return_probs=False        # bool: Return probabilities
)
```

**Note:** The model operates at 16kHz internally. If you provide audio at a different sample rate, it will be automatically resampled (requires `librosa`). If no sample rate is specified, 16kHz is assumed with a warning.
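
If you prefer to resample ahead of time instead of relying on the automatic path, a minimal sketch (assuming `librosa` is installed, the same dependency the automatic path requires; the file name is a placeholder):

```python
import librosa
import soundfile as sf

audio, sr = sf.read("speech_44k.wav", dtype="float32")  # e.g. a 44.1kHz source
if sr != 16000:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
    sr = 16000

result = detector.predict(audio, sample_rate=sr)
```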

**Returns:**
- If `return_probs=False`: `bool` (True = turn complete, False = continue)
- If `return_probs=True`: `dict` with keys:
  - `is_endpoint`: bool
  - `prob_endpoint`: float (0-1)
  - `prob_continue`: float (0-1)
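
The probability form lets you trade responsiveness for caution by thresholding above 0.5. A minimal sketch, where the 0.7 threshold and the downstream handlers are assumptions, not part of the API:

```python
result = detector.predict(
    audio,
    prev_line="What is your phone number",
    curr_line="My number is 804",
    return_probs=True,
)

ENDPOINT_THRESHOLD = 0.7  # assumption: tune per application
if result["prob_endpoint"] >= ENDPOINT_THRESHOLD:
    handle_turn_complete()  # hypothetical downstream handler
else:
    keep_listening()        # hypothetical downstream handler
```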

#### `predict_batch()`

Process multiple audio samples efficiently in a single batch.

```python
results = detector.predict_batch(
    audio_batch,              # list[np.ndarray]: List of audio arrays
    context_batch=None,       # list[dict]: List of context dicts with 'prev_line' and 'curr_line'
    sample_rate=None,         # int: Sample rate in Hz (applies to all audio)
    return_probs=False        # bool: Return probabilities
)
```

**Note:** All audio samples in the batch must have the same sample rate. Audio will be automatically resampled to 16kHz if a different rate is specified.

**Returns:**
- List of predictions (same format as `predict()` depending on `return_probs`)
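
A minimal batched call might look like this (assuming `audio_a` and `audio_b` are mono float32 clips already loaded at the same rate):

```python
results = detector.predict_batch(
    [audio_a, audio_b],
    context_batch=[
        {"prev_line": "What is your phone number", "curr_line": "My number is 804"},
        {"prev_line": "How are you doing today", "curr_line": "I'm doing great thanks"},
    ],
    sample_rate=16000,
    return_probs=True,
)
for r in results:
    print(r["is_endpoint"], f"{r['prob_endpoint']:.2f}")
```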

#### Audio Requirements

- **Sample rate**: 16kHz
- **Channels**: Mono
- **Format**: float32 numpy array
- **Range**: [-1.0, 1.0]
- **Duration**: Up to 8 seconds (longer audio will be truncated)
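
A minimal sketch of coercing arbitrary input into this format (the file name is a placeholder):

```python
import soundfile as sf

audio, sr = sf.read("call.wav", dtype="float32")  # float32 in [-1.0, 1.0]
if audio.ndim == 2:             # stereo -> mono
    audio = audio.mean(axis=1)
audio = audio[: 8 * sr]         # keep at most 8 seconds; longer audio is truncated anyway
```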

### Text Context Format

The model uses conversation context to improve predictions:

- **`prev_line`**: What the previous speaker said (e.g., a question)
- **`curr_line`**: What the current speaker is saying (e.g., their response)

For best performance, do not include terminal punctuation (periods, etc.).

**Example:**
```python
result = detector.predict(
    audio,
    prev_line="How are you doing today",
    curr_line="I'm doing great thanks"
)
```

---

## Model Details

### Multimodal Architecture

```
Audio (16kHz) ─────> Whisper Encoder ─> Audio Embeddings (1500D)
                                              |
                                              v
                                        Audio Projector
                                              |
                                              v
Text Context ─────> SmolLM Tokenizer ─> Text Embeddings (variable length)
                                              |
                                              v
                              [Audio Embeds + Text Embeds] ─> SmolLM
                                              |
                                              v
                                      Classification Head
                                              |
                                              v
                                    [Endpoint / Continue]
```

### Training Data

The model is trained on conversational audio with labeled turn boundaries. It learns to detect:
- **Prosodic cues**: Pitch, intonation, pauses
- **Semantic cues**: Completeness of thought, question-answer patterns
- **Contextual cues**: Conversation flow and expectations

---

## Examples

Sample scripts can be found in the `examples/` directory (a sketch of the batching pattern follows below):

- `python3 examples/basic_usage.py` downloads an audio file and runs the turn detector.
- `python3 examples/batch_processing.py` downloads two audio files and runs the turn detector on a batched input.
- `request_batcher.py` is a sample implementation of a thread that continuously receives and batches requests (e.g., in a production setting).
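
The core pattern behind such a batcher is short enough to sketch. Below is a simplified, hypothetical version (queue layout, batch size, and linger timeout are assumptions; the real `request_batcher.py` differs):

```python
import queue
import threading

# Each request item: {"audio": ..., "context": ..., "reply": queue.Queue()}
requests = queue.Queue()

def batch_worker(detector, max_batch=8, linger_s=0.01):
    """Collect requests for up to `linger_s`, then run one batched inference."""
    while True:
        batch = [requests.get()]                 # block until a request arrives
        try:
            while len(batch) < max_batch:
                batch.append(requests.get(timeout=linger_s))
        except queue.Empty:
            pass                                 # linger expired; run what we have
        results = detector.predict_batch(
            [item["audio"] for item in batch],
            context_batch=[item["context"] for item in batch],
        )
        for item, result in zip(batch, results):
            item["reply"].put(result)            # hand the result back to the caller

# Run as a daemon thread alongside the request producers:
# threading.Thread(target=batch_worker, args=(detector,), daemon=True).start()
```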

---
## Development

### Project Structure

```
vogent-turn/                    # Project root
├── pyproject.toml              # Package configuration and dependencies
├── vogent_turn/                # Python package
│   ├── __init__.py             # Package exports
│   ├── inference.py            # Main TurnDetector class
│   ├── predict.py              # CLI tool
│   ├── smollm_whisper.py       # Model architecture
│   └── whisper.py              # Whisper components
└── examples/                   # Usage examples
    ├── basic_usage.py
    └── batch_processing.py
```

### Contributing

Contributions are welcome! Please:
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

---

## Citation

If you use this library in your research, please cite:

```bibtex
@software{vogent_turn,
  title = {Vogent Turn: Multimodal Turn Detection for Conversational AI},
  author = {Vogent},
  year = {2024},
  url = {https://github.com/vogent/vogent-turn}
}
```

---

## License

Inference code is open-source under Apache 2.0. Model weights are under a modified Apache 2.0 license with stricter attribution requirements for certain types of usage.

---

## Support

- **Issues**: [GitHub Issues](https://github.com/vogent/vogent-turn/issues)

---

## Changelog

### v0.1.0 (2025-10-19)
- Initial release
- Multimodal turn detection with Whisper + SmolLM
- Python library and CLI tool
- `torch.compile` optimization for fast inference

            
