captionalchemy

Name: captionalchemy
Version: 1.1.1
Summary: A Python package to create closed captions with face detection and recognition.
Upload time: 2025-07-09 15:55:02
Requires Python: >=3.10
License: MIT
Keywords: caption, closed captions, face detection, face recognition, video processing

# CaptionAlchemy

A Python package for creating intelligent closed captions with face detection and speaker recognition.

## Features

- **Audio Transcription**: Powered by OpenAI Whisper for high-quality speech-to-text
- **Speaker Diarization**: Identifies different speakers in audio
- **Face Recognition**: Links speakers to known faces for character identification
- **Multiple Output Formats**: Supports SRT, VTT, and SAMI caption formats
- **Voice Activity Detection**: Intelligently detects speech vs non-speech segments
- **GPU Acceleration**: Automatic CUDA support when available

## Installation

```bash
pip install captionalchemy
```

If you have a GPU and want to use hardware acceleration:

```bash
pip install captionalchemy[cuda]
```

### Prerequisites

- Python 3.10+
- FFmpeg (for video/audio processing)
- CUDA-capable GPU (optional, but highly recommended for speaker diarization)
- Whisper.cpp (optional, on macOS)

If using Whisper.cpp on macOS, follow the installation instructions [here](https://github.com/ggml-org/whisper.cpp?tab=readme-ov-file#core-ml-support) and clone the whisper.cpp repository into your working directory.

## Quick Start

1. **Set up environment variables** (create `.env` file):

   ```
   HF_AUTH_TOKEN=your_huggingface_token_here
   ```

2. **Prepare known faces** (optional, for speaker identification):
   Create `known_faces.json`:

   ```json
   [
     {
       "name": "Speaker Name",
       "image_path": "path/to/speaker/photo.jpg"
     }
   ]
   ```

3. **Generate captions**:

```bash
captionalchemy video.mp4 -f srt -o my_captions
```

Or in a Python script:

```python
from dotenv import load_dotenv
from captionalchemy import caption

load_dotenv()

caption.run_pipeline(
    video_url_or_path="path/to/your/video.mp4",         # this can be a video URL or local file
    character_identification=False,                      # True by default
    known_faces_json="path/to/known_faces.json",
    embed_faces_json="path/to/embed_faces.json",        # name of the output file
    caption_output_path="my_captions/output",           # will write output to output.srt (or .vtt/.smi)
    caption_format="srt"
)
```

## Usage

### Basic Usage

```bash
# Generate SRT captions from video file
captionalchemy video.mp4

# Generate VTT captions from YouTube URL
captionalchemy "https://youtube.com/watch?v=VIDEO_ID" -f vtt -o output

# Disable face recognition
captionalchemy video.mp4 --no-face-id
```

### Command Line Options

```
captionalchemy VIDEO [OPTIONS]

Arguments:
  VIDEO                Video file path or URL

Options:
  -f, --format         Caption format: srt, vtt, smi (default: srt)
  -o, --output         Output file base name (default: output_captions)
  --no-face-id         Disable face recognition
  --known-faces-json   Path to known faces JSON (default: example/known_faces.json)
  --embed-faces-json   Path to face embeddings JSON (default: example/embed_faces.json)
  -v, --verbose        Enable debug logging
```

## How It Works

1. **Face Embedding**: Pre-processes known faces into embeddings
2. **Audio Extraction**: Extracts audio from video files (see the sketch after this list)
3. **Voice Activity Detection**: Identifies speech segments
4. **Speaker Diarization**: Separates different speakers
5. **Transcription**: Converts speech to text using Whisper
6. **Face Recognition**: Matches speakers to known faces (if enabled)
7. **Caption Generation**: Creates timestamped captions with speaker names
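
The audio-extraction stage (step 2) is a good example of what these steps involve in practice: transcription pipelines generally need a mono, 16 kHz WAV track, which FFmpeg can produce from any video. The helper below is an illustrative sketch under that assumption, not captionalchemy's internal code.

```python
# Generic sketch of the audio-extraction stage: pull a mono, 16 kHz WAV
# track out of a video with FFmpeg (the sample rate Whisper expects).
# Illustrative only -- not necessarily the exact command captionalchemy runs.
import subprocess

def extract_audio(video_path: str, wav_path: str = "audio.wav") -> str:
    subprocess.run(
        [
            "ffmpeg", "-y",      # overwrite the output file if it exists
            "-i", video_path,    # input video
            "-vn",               # drop the video stream
            "-ac", "1",          # downmix to mono
            "-ar", "16000",      # resample to 16 kHz
            wav_path,
        ],
        check=True,              # raise if FFmpeg exits with an error
    )
    return wav_path
```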

## Configuration

### Known Faces Setup

Create a `known_faces.json` file with speaker information:

```json
[
  {
    "name": "John Doe",
    "image_path": "photos/john_doe.jpg"
  },
  {
    "name": "Jane Smith",
    "image_path": "photos/jane_smith.png"
  }
]
```
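
A broken `image_path` is one of the most common failure modes (see Troubleshooting below), so it can be worth validating this file before starting a long run. The helper below is a stdlib-only convenience sketch, not part of captionalchemy:

```python
# Sanity-check known_faces.json: every entry needs a "name" and an
# "image_path" that points at an existing file. Convenience sketch only.
import json
from pathlib import Path

def validate_known_faces(json_path: str = "known_faces.json") -> None:
    entries = json.loads(Path(json_path).read_text())
    for entry in entries:
        name = entry.get("name")
        image = entry.get("image_path")
        if not name or not image:
            raise ValueError(f"Entry missing 'name' or 'image_path': {entry}")
        if not Path(image).is_file():
            raise FileNotFoundError(f"Image for {name!r} not found: {image}")
    print(f"{len(entries)} known face(s) look valid.")

validate_known_faces()
```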

### Environment Variables

- `HF_AUTH_TOKEN`: Hugging Face token for accessing pyannote models
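
The Quick Start example already calls `load_dotenv()`; the short check below (an optional convenience, not required by the package) confirms the token is actually visible to the process before a long run starts:

```python
# Fail fast if HF_AUTH_TOKEN is missing, since the pyannote diarization
# models cannot be downloaded without it.
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current directory

if not os.getenv("HF_AUTH_TOKEN"):
    raise RuntimeError("HF_AUTH_TOKEN is not set; see Environment Variables.")
```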

## Output Examples

### SRT Format

```
1
00:00:03,254 --> 00:00:06,890
John Doe: Welcome to our presentation on quantum computing.

2
00:00:07,120 --> 00:00:10,456
Jane Smith: Thanks John. Let's start with the basics.
```

### VTT Format

```
WEBVTT

00:03.254 --> 00:06.890
John Doe: Welcome to our presentation on quantum computing.

00:07.120 --> 00:10.456
Jane Smith: Thanks John. Let's start with the basics.
```
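
The two formats differ mainly in timestamp notation: SRT uses `HH:MM:SS,mmm` with a comma, while VTT uses a dot (and may omit leading zero fields, as above). A minimal conversion helper, shown as an illustrative sketch rather than captionalchemy's own formatter:

```python
# Convert a time in seconds to an SRT- or VTT-style timestamp.
def to_timestamp(seconds: float, fmt: str = "srt") -> str:
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    sep = "," if fmt == "srt" else "."
    return f"{h:02d}:{m:02d}:{s:02d}{sep}{ms:03d}"

print(to_timestamp(3.254))          # 00:00:03,254
print(to_timestamp(6.890, "vtt"))   # 00:00:06.890
```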

## Development and Contributing

### Setup Development Environment

```bash
# Install in development mode
pip install -e ".[dev]"
```

### Running Tests

```bash
pytest
```

### Code Quality

```bash
# Linting
flake8

# Code formatting
black src/ tests/
```

## Requirements

See `requirements.txt` for the complete list of dependencies. Key packages include:

- `openai-whisper`: Speech transcription
- `pyannote.audio`: Speaker diarization
- `opencv-python`: Computer vision
- `insightface`: Face recognition
- `torch`: Deep learning framework

## License

MIT License - see LICENSE file for details.

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Run the test suite
6. Submit a pull request

## Troubleshooting

### Common Issues

- **CUDA out of memory**: Use CPU-only mode or reduce batch sizes
- **Missing models**: Ensure whisper.cpp models are downloaded
- **Face recognition errors**: Verify image paths in known_faces.json
- **Audio extraction fails**: Check that FFmpeg is installed (see the environment check below)
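
For the CUDA and FFmpeg issues above, a quick environment check using only `shutil` and the already-required `torch` can confirm the basics before a run:

```python
# Check two common failure points: FFmpeg on the PATH, and whether
# PyTorch can actually see a CUDA device.
import shutil
import torch

print("ffmpeg found:", shutil.which("ffmpeg") is not None)
print("CUDA available:", torch.cuda.is_available())
```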

### Getting Help

- Check the logs with `-v` flag for detailed error information
- Ensure all dependencies are properly installed (see the import smoke test below)
- Verify video file format compatibility
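
To check that the key dependencies are installed, a simple import smoke test works; if any line fails, reinstall the corresponding package from the Requirements section:

```python
# Import smoke test for the key dependencies listed under Requirements.
import cv2               # opencv-python
import insightface       # insightface
import pyannote.audio    # pyannote.audio
import torch             # torch
import whisper           # openai-whisper

print("All key dependencies imported successfully.")
```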
