# CaptionAlchemy
A Python package for creating intelligent closed captions with face detection and speaker recognition.
## Features
- **Audio Transcription**: Powered by OpenAI Whisper for high-quality speech-to-text
- **Speaker Diarization**: Identifies different speakers in audio
- **Face Recognition**: Links speakers to known faces for character identification
- **Multiple Output Formats**: Supports SRT, VTT, and SAMI caption formats
- **Voice Activity Detection**: Intelligently detects speech vs non-speech segments
- **GPU Acceleration**: Automatic CUDA support when available
## Installation
```bash
pip install captionalchemy
```
If you have a GPU and want to use hardware acceleration:
```bash
pip install "captionalchemy[cuda]"
```
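To verify the installation, the standard library's `importlib.metadata` can report the installed version (no package-specific API is assumed here):

```python
from importlib.metadata import version

# Query the installed distribution's version from its metadata.
print(version("captionalchemy"))  # e.g. 1.1.1
```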
### Prerequisites
- Python 3.10+
- FFmpeg (for video/audio processing)
- CUDA-capable GPU (optional, but strongly recommended for speaker diarization)
- Whisper.cpp (optional, for accelerated transcription on macOS)
If using Whisper.cpp on macOS, follow the installation instructions [here](https://github.com/ggml-org/whisper.cpp?tab=readme-ov-file#core-ml-support) and clone the whisper.cpp repository into your working directory.
## Quick Start
1. **Set up environment variables** (create a `.env` file):
```
HF_AUTH_TOKEN=your_huggingface_token_here
```
2. **Prepare known faces** (optional, for speaker identification):
Create `known_faces.json`:
```json
[
  {
    "name": "Speaker Name",
    "image_path": "path/to/speaker/photo.jpg"
  }
]
```
3. **Generate captions**:
```bash
captionalchemy video.mp4 -f srt -o my_captions
```
Or, in a Python script:
```python
from dotenv import load_dotenv
from captionalchemy import caption

load_dotenv()

caption.run_pipeline(
    video_url_or_path="path/to/your/video.mp4",  # video URL or local file path
    character_identification=False,              # True by default
    known_faces_json="path/to/known_faces.json",
    embed_faces_json="path/to/embed_faces.json", # output path for face embeddings
    caption_output_path="my_captions/output",    # writes output.srt (or .vtt/.smi)
    caption_format="srt",
)
```
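Before running the pipeline with face identification enabled, a quick sanity check of your inputs can save a failed run. This is a minimal pre-flight sketch (not part of the package's API), assuming the `known_faces.json` layout shown above:

```python
import json
import os
from pathlib import Path

from dotenv import load_dotenv

load_dotenv()

# The pyannote diarization models require a Hugging Face token.
if not os.getenv("HF_AUTH_TOKEN"):
    raise SystemExit("HF_AUTH_TOKEN is not set; add it to your .env file.")

# Verify that every image referenced in known_faces.json exists on disk.
known_faces = json.loads(Path("path/to/known_faces.json").read_text())
for entry in known_faces:
    if not Path(entry["image_path"]).is_file():
        raise SystemExit(f"Missing face image: {entry['image_path']}")
```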
## Usage
### Basic Usage
```bash
# Generate SRT captions from video file
captionalchemy video.mp4
# Generate VTT captions from YouTube URL
captionalchemy "https://youtube.com/watch?v=VIDEO_ID" -f vtt -o output
# Disable face recognition
captionalchemy video.mp4 --no-face-id
```
### Command Line Options
```
captionalchemy VIDEO [OPTIONS]

Arguments:
  VIDEO                Video file path or URL

Options:
  -f, --format         Caption format: srt, vtt, smi (default: srt)
  -o, --output         Output file base name (default: output_captions)
  --no-face-id         Disable face recognition
  --known-faces-json   Path to known faces JSON (default: example/known_faces.json)
  --embed-faces-json   Path to face embeddings JSON (default: example/embed_faces.json)
  -v, --verbose        Enable debug logging
```
## How It Works
1. **Face Embedding**: Pre-processes known faces into embeddings
2. **Audio Extraction**: Extracts audio from video files
3. **Voice Activity Detection**: Identifies speech segments
4. **Speaker Diarization**: Separates different speakers
5. **Transcription**: Converts speech to text using Whisper
6. **Face Recognition**: Matches speakers to known faces (if enabled)
7. **Caption Generation**: Creates timestamped captions with speaker names
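All of these stages run inside `run_pipeline`. As an illustration of step 2 only (this is a sketch, not CaptionAlchemy's internal code), audio is typically extracted with FFmpeg as 16 kHz mono WAV, the usual input format for Whisper and pyannote:

```python
import subprocess

def extract_audio(video_path: str, wav_path: str) -> None:
    """Extract a 16 kHz mono WAV track from a video via FFmpeg."""
    subprocess.run(
        [
            "ffmpeg",
            "-y",              # overwrite the output file if it exists
            "-i", video_path,  # input video
            "-vn",             # drop the video stream
            "-ac", "1",        # mix down to mono
            "-ar", "16000",    # resample to 16 kHz
            wav_path,
        ],
        check=True,  # raise if ffmpeg exits with an error
    )

extract_audio("video.mp4", "audio.wav")
```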
## Configuration
### Known Faces Setup
Create a `known_faces.json` file with speaker information:
```json
[
  {
    "name": "John Doe",
    "image_path": "photos/john_doe.jpg"
  },
  {
    "name": "Jane Smith",
    "image_path": "photos/jane_smith.png"
  }
]
```
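For context, step 1 of the pipeline turns each photo into an embedding. The sketch below shows the general idea using `insightface` (a key dependency); it is illustrative, not CaptionAlchemy's internal code, and the `ctx_id` choice is an assumption:

```python
import cv2
from insightface.app import FaceAnalysis

# Load insightface's default detection + recognition models
# (downloaded automatically on first use).
app = FaceAnalysis()
app.prepare(ctx_id=0)  # 0 = first GPU; use -1 to force CPU

img = cv2.imread("photos/john_doe.jpg")
faces = app.get(img)  # detect faces and compute embeddings
if faces:
    embedding = faces[0].normed_embedding  # L2-normalized identity vector
    print(embedding.shape)
```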
### Environment Variables
- `HF_AUTH_TOKEN`: Hugging Face token for accessing pyannote models
## Output Examples
### SRT Format
```
1
00:00:03,254 --> 00:00:06,890
John Doe: Welcome to our presentation on quantum computing.

2
00:00:07,120 --> 00:00:10,456
Jane Smith: Thanks John. Let's start with the basics.
```
### VTT Format
```
WEBVTT

00:03.254 --> 00:06.890
John Doe: Welcome to our presentation on quantum computing.

00:07.120 --> 00:10.456
Jane Smith: Thanks John. Let's start with the basics.
```
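The two formats differ mainly in the `WEBVTT` header and the millisecond separator (comma for SRT, period for VTT; VTT may omit the hours field, as above). A small standalone helper shows the conversion from seconds:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp: MM:SS.mmm (hours omitted)."""
    ms = round(seconds * 1000)
    m, rem = divmod(ms, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{m:02d}:{s:02d}.{ms:03d}"

print(srt_timestamp(3.254))  # 00:00:03,254
print(vtt_timestamp(3.254))  # 00:03.254
```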
## Development and Contributing
### Setup Development Environment
```bash
# Install in development mode
pip install -e ".[dev]"
```
### Running Tests
```bash
pytest
```
### Code Quality
```bash
# Linting
flake8
# Code formatting
black src/ tests/
```
## Requirements
See `requirements.txt` for the complete list of dependencies. Key packages include:
- `openai-whisper`: Speech transcription
- `pyannote.audio`: Speaker diarization
- `opencv-python`: Computer vision
- `insightface`: Face recognition
- `torch`: Deep learning framework
## License
MIT License - see LICENSE file for details.
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Run the test suite
6. Submit a pull request
## Troubleshooting
### Common Issues
- **CUDA out of memory**: Run in CPU-only mode or reduce batch sizes (see the device-selection sketch after this list)
- **Missing models**: Ensure whisper.cpp models are downloaded
- **Face recognition errors**: Verify image paths in known_faces.json
- **Audio extraction fails**: Check that FFmpeg is installed
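For the CUDA memory issue, the common workaround with `torch` is to hide the GPU so everything runs on CPU; a minimal sketch (this is generic PyTorch usage, not a CaptionAlchemy flag):

```python
import os

# Hiding all GPUs before torch initializes CUDA forces CPU-only execution.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on {device}")  # prints "cpu" when no GPU is visible
```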
### Getting Help
- Run with the `-v` flag to get detailed error information in the logs
- Ensure all dependencies are properly installed
- Verify video file format compatibility