# transcribe-anything

[![MacOS_Tests](https://github.com/zackees/transcribe-anything/actions/workflows/test_macos.yml/badge.svg)](https://github.com/zackees/transcribe-anything/actions/workflows/test_macos.yml)
[![Win_Tests](https://github.com/zackees/transcribe-anything/actions/workflows/test_win.yml/badge.svg)](https://github.com/zackees/transcribe-anything/actions/workflows/test_win.yml)
[![Ubuntu_Tests](https://github.com/zackees/transcribe-anything/actions/workflows/test_ubuntu.yml/badge.svg)](https://github.com/zackees/transcribe-anything/actions/workflows/test_ubuntu.yml)
[![Lint](https://github.com/zackees/transcribe-anything/actions/workflows/lint.yml/badge.svg)](https://github.com/zackees/transcribe-anything/actions/workflows/lint.yml)

![image](https://github.com/zackees/transcribe-anything/assets/6856673/94bdd1fe-3225-438a-ac1b-09c81f1d4108)

### USES WHISPER AI

Over 800 ⭐'s because this app just works! It works great on Windows and Mac. This whisper front-end is the only one that generates a `speaker.json` file, which partitions the conversation by who is doing the speaking.

[![Star History Chart](https://api.star-history.com/svg?repos=zackees/transcribe-anything&type=Date)](https://star-history.com/#zackees/transcribe-anything&Date)

### New in 3.2!

**Turbo Mac acceleration using the new [lightning-whisper-mlx](https://github.com/mustafaaljadery/lightning-whisper-mlx) backend.**

This is a community contribution by https://github.com/aj47. On behalf of all the Mac users, thank you!

#### MLX Backend details

  * 4x faster than the `mps` whisper backend.
  * Supports multiple languages (`mps` only supports English).
  * Supports custom vocabulary via `--initial_prompt`.
  
#### Usage


```bash
# Mac accelerated back-end
transcribe-anything https://www.youtube.com/watch?v=dQw4w9WgXcQ --device mlx
```

Special thanks to https://github.com/aj47!

### New in 3.1!

Mac acceleration option using the new [lightning-whisper-mlx](https://github.com/mustafaaljadery/lightning-whisper-mlx) backend. Enable with `--device mlx`. Now supports multiple languages, custom vocabulary via `--initial_prompt`, and both transcribe/translate tasks. 10x faster than Whisper CPP, 4x faster than previous MLX implementations!

**Model Storage:** MLX models are now stored in `~/.cache/whisper/mlx_models/` for consistency with other backends, instead of cluttering your current working directory.

**GPU Accelerated Dockerfile**

Recently added in 3.0.10 is a GPU accelerated [Dockerfile](Dockerfile).

If you are doing translations at scale, check out the sister project: [https://github.com/zackees/transcribe-everything](https://github.com/zackees/transcribe-everything).

You can pull the docker image like so:

`docker pull niteris/transcribe-anything`

## About

Easiest whisper implementation to install and use. Just install with `pip install transcribe-anything`. All whisper backends are executed in an isolated environment. GPU acceleration is _automatic_, using the _blazingly_ fast [insanely-fast-whisper](https://github.com/Vaibhavs10/insanely-fast-whisper) as the backend for `--device insane`. This is the only tool that can optionally produce a `speaker.json` file, representing speaker-assigned text that has been de-chunkified.

Hardware acceleration on Windows/Linux `--device insane`

MacArm acceleration when using `--device mlx` (now with multi-language support and custom vocabulary)

Input a local file or youtube/rumble url and this tool will transcribe it using Whisper AI into subtitle files and raw text.

Uses Whisper AI, so this is a state-of-the-art transcription and translation service - completely free. 🤯🤯🤯

Your data stays private and is not uploaded to any service.

The new version now has state of the art speed in transcriptions, thanks to the new backend `--device insane`, as well as producing a `speaker.json` file.

```bash
pip install transcribe-anything

# Basic usage - CPU mode (works everywhere, slower)
transcribe-anything https://www.youtube.com/watch?v=dQw4w9WgXcQ

# GPU accelerated (Windows/Linux)
transcribe-anything https://www.youtube.com/watch?v=dQw4w9WgXcQ --device insane

# Mac Apple Silicon accelerated
transcribe-anything https://www.youtube.com/watch?v=dQw4w9WgXcQ --device mlx

# Advanced options (see Advanced Options section below for full details)
transcribe-anything video.mp4 --device mlx --batch_size 16 --verbose
transcribe-anything video.mp4 --device insane --batch-size 8 --flash True
```

_python api_

```python
from transcribe_anything import transcribe_anything

transcribe_anything(
    url_or_file="https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    output_dir="output_dir",
    task="transcribe",
    model="large",
    device="cuda"
)

# Full function signature:
def transcribe(
    url_or_file: str,
    output_dir: Optional[str] = None,
    model: Optional[str] = None,              # tiny,small,medium,large
    task: Optional[str] = None,               # transcribe or translate
    language: Optional[str] = None,           # auto detected if none, "en" for english...
    device: Optional[str] = None,             # cuda,cpu,insane,mlx
    embed: bool = False,                      # Produces a video.mp4 with the subtitles burned in.
    hugging_face_token: Optional[str] = None, # If you want a speaker.json
    other_args: Optional[list[str]] = None,   # Other args to be passed to the whisper backend
    initial_prompt: Optional[str] = None,     # Custom prompt for better recognition of specific terms
) -> str:

```
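
For example, backend-specific flags can be forwarded through `other_args`, while `hugging_face_token` enables diarization. This is a sketch using only the parameters documented in the signature above; the token value is a placeholder:

```python
from transcribe_anything import transcribe_anything

# Sketch: GPU run with speaker diarization and a backend-specific batch size.
transcribe_anything(
    url_or_file="interview.mp4",
    output_dir="out",
    device="insane",
    hugging_face_token="YOUR_HF_TOKEN",  # enables speaker.json output
    other_args=["--batch-size", "8"],    # forwarded to the whisper backend
)
```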

#### Fastest Transcription - Use `insane` mode with model `large-v3` + `batching`

This is by far the fastest combination. It is experimental, and the text it produces tends to be lower quality:

- Higher chance for repeated text patterns.
- Timestamps in the vtt/srt files become unaligned.

It's unclear if this is due to batching or `large-v3` itself. More testing is needed. If you do this then please let us know the results by filing a bug in the issues page.

Large batch sizes require significantly more Nvidia GPU RAM. For a 12 GB card, it has been experimentally shown that `--batch-size 8` will work on all videos from an internally tested data lake.
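
For example, the combination described above might look like this on a 12 GB card (a sketch using flags documented elsewhere in this README):

```bash
# Experimental "fastest" combination: insane backend, large-v3, modest batch size.
transcribe-anything video.mp4 --device insane --model large-v3 --batch-size 8
```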

#### Insanely fast on `cuda` platforms

If you pass in `--device insane` on a cuda platform then this tool will use this state of the art version of whisper: https://github.com/Vaibhavs10/insanely-fast-whisper, which is MUCH faster and has a pipeline for speaker identification (diarization) using the `--hf_token` option.

Compatible with Python 3.10 and above. Backends use an isolated environment with pinned requirements and python version.

#### Speaker.json

When diarization is enabled via `--hf_token` (hugging face token), the output json will contain speaker info labeled as `SPEAKER_00`, `SPEAKER_01`, etc. For licensing agreement reasons, you must get your own hugging face token if you want to enable this feature. There is also an additional step: agree to the user policies for `pyannote.audio`, located here: https://huggingface.co/pyannote/segmentation-3.0. If you don't do this, you'll see runtime exceptions from `pyannote` when `--hf_token` is used.

What's special to this app is that we also generate a `speaker.json` which is a de-chunkified version of the output json speaker section.
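
A sketch of how you might produce it (the token value is a placeholder):

```bash
# Diarization requires --device insane and a HuggingFace token.
transcribe-anything interview.mp4 --device insane --hf_token YOUR_HF_TOKEN
```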

### speaker.json example:

```json
[
  {
    "speaker": "SPEAKER_00",
    "timestamp": [0.0, 7.44],
    "text": "for that. But welcome, Zach Vorhees. Great to have you back on. Thank you, Matt. Craving me back onto your show. Man, we got a lot to talk about.",
    "reason": "beginning"
  },
  {
    "speaker": "SPEAKER_01",
    "timestamp": [7.44, 33.52],
    "text": "Oh, we do. 2023 was the year that OpenAI released, you know, chat GPT-4, which I think most people would say has surpassed average human intelligence, at least in test taking, perhaps not in, you know, reasoning and things like that. But it was a major year for AI. I think that most people are behind the curve on this. What's your take of what just happened in the last 12 months and what it means for the future of human cognition versus machine cognition?",
    "reason": "speaker-switch"
  },
  {
    "speaker": "SPEAKER_00",
    "timestamp": [33.52, 44.08],
    "text": "Yeah. Well, you know, at the beginning of 2023, we had a pretty weak AI system, which was a chat GPT 3.5 turbo was the best that we had. And then between the beginning of last",
    "reason": "speaker-switch"
  }
]
```

Note that `speaker.json` is only generated when using `--device insane` and not for `--device cuda` nor `--device cpu`.

#### `cuda` vs `insane`

Insane mode eats up a lot of memory, and it's common to get out-of-memory errors while transcribing. For example, on a 12 GB Nvidia 3060, out-of-memory errors are common with long content. If you experience this, pass in `--batch-size 8` or smaller. Note that any arguments not recognized by `transcribe-anything` are passed on to the backend transcriber.

Also, please don't use `distil-whisper/distil-large-v2`: it produces extremely bad stuttering, and it's not entirely clear why. I've had to switch it out of production environments because it's so bad. It's also non-deterministic, so I think a fallback non-zero temperature is somehow being used, which produces these stutterings.

`cuda` is the original AI model supplied by OpenAI. It's more stable but MUCH slower. It also won't produce a `speaker.json` file.

Pass `--embed` to optionally embed ("burn") the subtitles directly into an output video.
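
For example (a sketch; per the changelog, `--embed` currently works on local mp4 files):

```bash
# Burn the generated subtitles into a copy of the input video.
transcribe-anything video.mp4 --embed
```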

# Install

This front end app for whisper boasts the easiest install in the whisper ecosystem thanks to [isolated-environment](https://pypi.org/project/isolated-environment/). You can simply install it with pip, like this:

```bash
pip install transcribe-anything
```

# Docker

We have a [Dockerfile](Dockerfile) that is decently fast at startup. It is tuned specifically for `device=insane`. If you have extremely large batches of data you'd like to convert all at once, consider using the sister project [transcribe-everything](https://github.com/zackees/transcribe-everything), which operates on entire remote path hierarchies.
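
A hypothetical invocation (assuming the image exposes the `transcribe-anything` CLI as its entrypoint and that the NVIDIA container toolkit is installed on the host):

```bash
# Mount the current directory and transcribe a local file with GPU acceleration.
docker run --rm --gpus all -v "$(pwd):/data" niteris/transcribe-anything \
  /data/video.mp4 --device insane --output_dir /data/out
```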

# GPU Acceleration

GPU acceleration will be automatically enabled for Windows and Linux. Mac users can use `--device mlx` for hardware acceleration on Apple Silicon. `--device insane` may also work on Mac M1+ but has been less tested.

Windows/Linux:

- Use `--device insane`

Mac:

- Use `--device mlx`

# Advanced Options and Backend-Specific Arguments

## Quick Reference

| Backend | Device Flag | Key Arguments | Best For |
|---------|-------------|---------------|----------|
| **MLX** | `--device mlx` | `--batch_size`, `--verbose`, `--initial_prompt` | Mac Apple Silicon |
| **Insanely Fast** | `--device insane` | `--batch-size`, `--hf_token`, `--flash`, `--timestamp` | Windows/Linux GPU |
| **CPU** | `--device cpu` | Standard whisper args | Universal compatibility |

> **Note:** Each backend has different capabilities. MLX is optimized for Apple Silicon with a focused feature set. Insanely Fast uses a transformer-based architecture with specific options. CPU backend supports the full range of standard OpenAI Whisper arguments.

## Custom Prompts and Vocabulary

Whisper supports custom prompts to improve transcription accuracy for domain-specific vocabulary, names, or technical terms. This is especially useful when transcribing content with:

- Technical terminology (AI, machine learning, programming terms)
- Proper names (people, companies, products)
- Medical or scientific terms
- Industry-specific jargon

### Using Custom Prompts

#### Command Line

```bash
# Direct prompt
transcribe-anything video.mp4 --initial_prompt "The speaker discusses artificial intelligence, machine learning, and neural networks."

# Load prompt from file
transcribe-anything video.mp4 --prompt_file my_custom_prompt.txt
```

#### Python API

```python
from transcribe_anything import transcribe

# Direct prompt
transcribe(
    url_or_file="video.mp4",
    initial_prompt="The speaker discusses AI, PyTorch, TensorFlow, and deep learning algorithms."
)

# Load prompt from file
with open("my_prompt.txt", "r") as f:
    prompt = f.read()

transcribe(
    url_or_file="video.mp4",
    initial_prompt=prompt
)
```

#### Best Practices

- Keep prompts concise but comprehensive for your domain
- Include variations of terms (e.g., "AI", "artificial intelligence")
- Focus on terms that Whisper commonly misrecognizes
- Test with and without prompts to measure improvement
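
For example, applying the "include variations of terms" tip above (a hypothetical prompt):

```bash
transcribe-anything talk.mp4 --initial_prompt "AI, artificial intelligence, LLM, large language model, PyTorch, TensorFlow"
```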

## MLX Backend Arguments (--device mlx)

The MLX backend supports additional arguments for fine-tuning performance:

### Available Options

```bash
# Adjust batch size for better performance/memory trade-off
transcribe-anything video.mp4 --device mlx --batch_size 24

# Enable verbose output for debugging
transcribe-anything video.mp4 --device mlx --verbose

# Use custom prompt for better recognition of specific terms
transcribe-anything video.mp4 --device mlx --initial_prompt "The speaker discusses AI, machine learning, and neural networks."
```

### MLX-Specific Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--batch_size` | int | 12 | Batch size for processing. Higher values use more memory but may be faster |
| `--verbose` | flag | false | Enable verbose output for debugging |
| `--initial_prompt` | string | None | Custom vocabulary/context prompt for better recognition |

### Supported Models

The MLX backend supports these whisper models optimized for Apple Silicon:
- `tiny`, `small`, `base`, `medium`, `large`, `large-v2`, `large-v3`
- Distilled models: `distil-small.en`, `distil-medium.en`, `distil-large-v2`, `distil-large-v3`
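
For example, selecting a distilled model for faster runs (a sketch; assumes the standard `--model` flag accepts these names for the MLX backend):

```bash
transcribe-anything video.mp4 --device mlx --model distil-large-v3
```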

> **Note:** The MLX backend uses the lightning-whisper-mlx library which has a focused feature set optimized for Apple Silicon. Advanced whisper options like `--temperature` and `--word_timestamps` are not currently supported by this backend.

## Insanely Fast Whisper Arguments (--device insane)

The insanely-fast-whisper backend supports these specific options:

### Performance Options

```bash
# Adjust batch size (critical for GPU memory management)
transcribe-anything video.mp4 --device insane --batch-size 8

# Use different model variants
transcribe-anything video.mp4 --device insane --model large-v3

# Enable Flash Attention 2 for faster processing
transcribe-anything video.mp4 --device insane --flash True
```

### Speaker Diarization Options

```bash
# Enable speaker diarization with HuggingFace token
transcribe-anything video.mp4 --device insane --hf_token your_token_here

# Specify exact number of speakers
transcribe-anything video.mp4 --device insane --hf_token your_token --num-speakers 3

# Set speaker range
transcribe-anything video.mp4 --device insane --hf_token your_token --min-speakers 2 --max-speakers 5
```

### Timestamp Options

```bash
# Choose timestamp granularity
transcribe-anything video.mp4 --device insane --timestamp chunk  # default
transcribe-anything video.mp4 --device insane --timestamp word   # word-level
```

### Insanely Fast Whisper Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--batch-size` | int | 24 | Batch size for processing. Critical for GPU memory management |
| `--flash` | bool | false | Use Flash Attention 2 for faster processing |
| `--timestamp` | choice | chunk | Timestamp granularity: "chunk" or "word" |
| `--hf_token` | string | None | HuggingFace token for speaker diarization |
| `--num-speakers` | int | None | Exact number of speakers (cannot use with min/max) |
| `--min-speakers` | int | None | Minimum number of speakers |
| `--max-speakers` | int | None | Maximum number of speakers |
| `--diarization_model` | string | pyannote/speaker-diarization | Diarization model to use |

> **Note:** The insanely-fast-whisper backend uses a different architecture than standard OpenAI Whisper. It does NOT support standard whisper arguments like `--temperature`, `--beam_size`, `--best_of`, etc. These are specific to the OpenAI implementation.

## CPU Backend Arguments (--device cpu)

The CPU backend uses the standard OpenAI Whisper implementation and supports many additional arguments:

### Standard Whisper Options

```bash
# Language and task options (also available as main arguments)
transcribe-anything video.mp4 --device cpu --language es --task translate

# Generation parameters
transcribe-anything video.mp4 --device cpu --temperature 0.1 --best_of 5 --beam_size 5

# Quality thresholds
transcribe-anything video.mp4 --device cpu --compression_ratio_threshold 2.4 --logprob_threshold -1.0

# Output formatting
transcribe-anything video.mp4 --device cpu --word_timestamps --highlight_words True

# Audio processing
transcribe-anything video.mp4 --device cpu --threads 4 --clip_timestamps "0,30"
```

> **Note:** The CPU backend supports most standard OpenAI Whisper arguments. These are passed through automatically and documented in the [OpenAI Whisper repository](https://github.com/openai/whisper).

### Batch Size Recommendations

**MLX Backend (`--device mlx`):**
- Default: 12
- Recommended range: 8-24
- Higher values for more VRAM, lower for less

**Insanely Fast Whisper (`--device insane`):**
- Default: 24
- Recommended for 8GB GPU: 4-8
- Recommended for 12GB GPU: 8-12
- Recommended for 24GB GPU: 16-24
- Use `--flash True` for better memory efficiency
- Start low and increase if no OOM errors

# Usage Examples

## Basic Usage

```bash
# Basic transcription
transcribe-anything https://www.youtube.com/watch?v=dQw4w9WgXcQ

# Local file
transcribe-anything video.mp4
```

## Backend-Specific Examples

### MLX Backend (Mac Apple Silicon)

```bash
# Basic MLX usage
transcribe-anything video.mp4 --device mlx

# MLX with custom batch size and verbose output
transcribe-anything video.mp4 --device mlx --batch_size 16 --verbose

# MLX with custom prompt for technical content
transcribe-anything lecture.mp4 --device mlx --initial_prompt "The speaker discusses machine learning, neural networks, PyTorch, and TensorFlow."

# MLX with multiple options (using main arguments for language/task)
transcribe-anything video.mp4 --device mlx --batch_size 20 --verbose --task translate --language es
```

### Insanely Fast Whisper (GPU)

```bash
# Basic insane mode
transcribe-anything video.mp4 --device insane

# Insane mode with custom batch size (important for GPU memory)
transcribe-anything video.mp4 --device insane --batch-size 8

# Insane mode with Flash Attention 2 for speed
transcribe-anything video.mp4 --device insane --batch-size 12 --flash True

# Insane mode with speaker diarization
transcribe-anything video.mp4 --device insane --hf_token your_huggingface_token

# Insane mode with word-level timestamps and speaker diarization
transcribe-anything video.mp4 --device insane --timestamp word --hf_token your_token --num-speakers 3

# High-performance setup with all optimizations
transcribe-anything video.mp4 --device insane --batch-size 16 --flash True --timestamp word
```

### CPU Backend (Universal)

```bash
# CPU mode (works everywhere, slower)
transcribe-anything video.mp4 --device cpu

# CPU with custom model and language
transcribe-anything video.mp4 --device cpu --model medium --language fr --task transcribe
```

## Troubleshooting Common Issues

### Out of Memory Errors

If you encounter GPU out-of-memory errors:

```bash
# Reduce batch size for MLX
transcribe-anything video.mp4 --device mlx --batch_size 8

# Reduce batch size for insane mode
transcribe-anything video.mp4 --device insane --batch-size 4

# Use smaller model
transcribe-anything video.mp4 --device insane --model small --batch-size 8
```

### Poor Quality Transcriptions

For better quality:

```bash
# Use larger model
transcribe-anything video.mp4 --device insane --model large-v3

# Enable Flash Attention 2 for better performance
transcribe-anything video.mp4 --device insane --flash True

# Use custom prompt for domain-specific content (works with all backends)
transcribe-anything video.mp4 --initial_prompt "Medical terminology: diagnosis, treatment, symptoms, patient care"

# For CPU backend, you can use standard whisper quality options
transcribe-anything video.mp4 --device cpu --compression_ratio_threshold 2.0 --logprob_threshold -0.5
```

### Performance Optimization

For faster processing:

```bash
# Increase batch size (if you have enough GPU memory)
transcribe-anything video.mp4 --device mlx --batch_size 24
transcribe-anything video.mp4 --device insane --batch-size 16

# Enable Flash Attention 2 for insane mode (significant speedup)
transcribe-anything video.mp4 --device insane --flash True --batch-size 16

# Use smaller model for speed
transcribe-anything video.mp4 --device insane --model small

# Use distilled models for even faster processing
transcribe-anything video.mp4 --device insane --model distil-whisper/large-v2 --flash True
```

Example output (from transcribing the YouTube URL used in the basic examples):

```
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:27.000]  We're no strangers to love, you know the rules, and so do I
[00:27.000 --> 00:31.000]  I've built commitments while I'm thinking of
[00:31.000 --> 00:35.000]  You wouldn't get this from any other guy
[00:35.000 --> 00:40.000]  I just wanna tell you how I'm feeling
[00:40.000 --> 00:43.000]  Gotta make you understand
[00:43.000 --> 00:45.000]  Never gonna give you up
[00:45.000 --> 00:47.000]  Never gonna let you down
[00:47.000 --> 00:51.000]  Never gonna run around and desert you
[00:51.000 --> 00:53.000]  Never gonna make you cry
[00:53.000 --> 00:55.000]  Never gonna say goodbye
[00:55.000 --> 00:58.000]  Never gonna tell a lie
[00:58.000 --> 01:00.000]  And hurt you
[01:00.000 --> 01:04.000]  We've known each other for so long
[01:04.000 --> 01:09.000]  Your heart's been aching but you're too shy to say it
[01:09.000 --> 01:13.000]  Inside we both know what's been going on
[01:13.000 --> 01:17.000]  We know the game and we're gonna play it
[01:17.000 --> 01:22.000]  And if you ask me how I'm feeling
[01:22.000 --> 01:25.000]  Don't tell me you're too much to see
[01:25.000 --> 01:27.000]  Never gonna give you up
[01:27.000 --> 01:29.000]  Never gonna let you down
[01:29.000 --> 01:33.000]  Never gonna run around and desert you
[01:33.000 --> 01:35.000]  Never gonna make you cry
[01:35.000 --> 01:38.000]  Never gonna say goodbye
[01:38.000 --> 01:40.000]  Never gonna tell a lie
[01:40.000 --> 01:42.000]  And hurt you
[01:42.000 --> 01:44.000]  Never gonna give you up
[01:44.000 --> 01:46.000]  Never gonna let you down
[01:46.000 --> 01:50.000]  Never gonna run around and desert you
[01:50.000 --> 01:52.000]  Never gonna make you cry
[01:52.000 --> 01:54.000]  Never gonna say goodbye
[01:54.000 --> 01:57.000]  Never gonna tell a lie
[01:57.000 --> 01:59.000]  And hurt you
[02:08.000 --> 02:10.000]  Never gonna give
[02:12.000 --> 02:14.000]  Never gonna give
[02:16.000 --> 02:19.000]  We've known each other for so long
[02:19.000 --> 02:24.000]  Your heart's been aching but you're too shy to say it
[02:24.000 --> 02:28.000]  Inside we both know what's been going on
[02:28.000 --> 02:32.000]  We know the game and we're gonna play it
[02:32.000 --> 02:37.000]  I just wanna tell you how I'm feeling
[02:37.000 --> 02:40.000]  Gotta make you understand
[02:40.000 --> 02:42.000]  Never gonna give you up
[02:42.000 --> 02:44.000]  Never gonna let you down
[02:44.000 --> 02:48.000]  Never gonna run around and desert you
[02:48.000 --> 02:50.000]  Never gonna make you cry
[02:50.000 --> 02:53.000]  Never gonna say goodbye
[02:53.000 --> 02:55.000]  Never gonna tell a lie
[02:55.000 --> 02:57.000]  And hurt you
[02:57.000 --> 02:59.000]  Never gonna give you up
[02:59.000 --> 03:01.000]  Never gonna let you down
[03:01.000 --> 03:05.000]  Never gonna run around and desert you
[03:05.000 --> 03:08.000]  Never gonna make you cry
[03:08.000 --> 03:10.000]  Never gonna say goodbye
[03:10.000 --> 03:12.000]  Never gonna tell a lie
[03:12.000 --> 03:14.000]  And hurt you
[03:14.000 --> 03:16.000]  Never gonna give you up
[03:16.000 --> 03:23.000]  If you want, never gonna let you down Never gonna run around and desert you
[03:23.000 --> 03:28.000]  Never gonna make you hide Never gonna say goodbye
[03:28.000 --> 03:42.000]  Never gonna tell you I ain't ready
```

## Api

```python
from transcribe_anything.api import transcribe

transcribe(
    url_or_file="https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    output_dir="output_dir",
)
```

## Develop

Works on Ubuntu/MacOS/Win32 (in git-bash).
The following will create a virtual environment:

```bash
> cd transcribe_anything
> ./install.sh
# Enter the environment:
> source activate.sh
```

The environment is now active, and the next step will install only to the local Python. If the terminal is closed, get back into the environment with `cd transcribe_anything` and `source activate.sh`.

## Required: Install to current python environment

- `pip install transcribe-anything`
  - The command `transcribe_anything` will magically become available.
- `transcribe_anything <YOUTUBE_URL>`

# Tech Stack

- OpenAI whisper
- insanely-fast-whisper
- yt-dlp: https://github.com/yt-dlp/yt-dlp
- static-ffmpeg
  - github: https://github.com/zackees/static_ffmpeg
  - pypi: https://pypi.org/project/static-ffmpeg/

# Testing

- Every commit is tested for standard linters and a batch of unit tests.

## Updated version 2.3.0

`transcribe-anything` now works much better across different configurations and is now much faster. Why? I switched the environment isolation from my own homespun version built on top of `venv` to the AMAZING `uv` system. The biggest improvement is the runtime speed of environment checks and re-installs; `uv` is insanely fast at checking the environment. It also turns out that `uv` has strict package dependency checking, which found a minor bug where a certain version of one of the `pytorch` dependencies was being constantly re-installed because of a dependency conflict that pip was apparently perfectly happy to never warn about. This manifested as certain packages being constantly re-installed with the previous version. `uv` identified this as an error immediately, and it was fixed.

The real reason behind `transcribe-anything`'s surprising popularity comes from the fact that it just works. And the reason for this is that I can isolate environments for different configurations and install them lazily. If you have the same problem then consider my other tool: https://github.com/zackees/iso-env

# Versions

- 3.0.7: Insane whisper mode no longer prints out the srt file upon transcription completion.
- 3.0.6: MacOS MLX mode fixed/improved
  - PR: https://github.com/zackees/transcribe-anything/pull/39
  - Thank you https://github.com/aj47!
- 3.0.5: A temp wav file was not being cleaned up, now it is.
- 3.1.0: Upgraded Mac-arm backend to [lightning-whisper-mlx](https://github.com/mustafaaljadery/lightning-whisper-mlx), enable with `--device mlx`. Now supports multiple languages, custom vocabulary via `--initial_prompt`, and both transcribe/translate tasks. 10x faster than Whisper CPP!
- 3.0.0: Implemented new Mac-arm accelerated [whisper-mps](https://github.com/AtomGradient/whisper-mps) backend, enable with `--device mps` (now `--device mlx`). Only does english, but is quite fast.
- 2.3.0: Swapped out the environment isolator. Now based on `uv`, should fix the missing dll's on some windows systems.
- 2.7.39: Fix `--hf-token` usage for insanely fast whisper backend.
- 2.7.37: Fixed breakage due to numpy 2.0 being released.
- 2.7.36: Fixed some ffmpeg dependencies.
- 2.7.35: All `ffmpeg` commands are now `static_ffmpeg` commands. Fixes issue.
- 2.7.34: Various fixes.
- 2.7.33: Fixes linux
- 2.7.32: Fixes mac m1 and m2.
- 2.7.31: Adds a warning if using python 3.12, which isn't supported yet in the backend.
- 2.7.30: adds --query-gpu-json-path
- 2.7.29: Made to json -> srt more robust for `--device insane`, bad entries will be skipped but warn.
- 2.7.28: Fixes bad title fetching with weird characters.
- 2.7.27: `pytorch-audio` upgrades broke this package. Upgrade to latest version to resolve.
- 2.7.26: Add model option `distil-whisper/distil-large-v2`
- 2.7.25: Windows (Linux/MacOS) bug with `--device insane` and python 3.11 installing wrong `insanely-fast-whisper` version.
- 2.7.22: Fixes `transcribe-anything` on Linux.
- 2.7.21: Tested that Mac Arm can run `--device insane`. Added tests to ensure this.
- 2.7.20: Fixes wrong type being returned when speaker.json happens to be empty.
- 2.7.19: speaker.json is now in plain json format instead of json5 format
- 2.7.18: Fixes tests
- 2.7.17: Fixes speaker.json nesting.
- 2.7.16: Adds `--save_hf_token`
- 2.7.15: Fixes 2.7.14 breakage.
- 2.7.14: (Broken) Now generates `speaker.json` when diarization is enabled.
- 2.7.13: Default diarization model is now pyannote/speaker-diarization-3.1
- 2.7.12: Adds srt_swap for line breaks and improved isolated_environment usage.
- 2.7.11: `--device insane` now generates a \*.vtt translation file
- 2.7.10: Better support for namespaced models. Trims text output in output json. Output json is now formatted with indents. SRT file is now printed out for `--device insane`
- 2.7.9: All SRT translation errors fixed for `--device insane`. All tests pass.
- 2.7.8: During error of `--device insane`, write out the error.json file into the destination.
- 2.7.7: Better error messages during failure.
- 2.7.6: Improved generation of out.txt, removes linebreaks.
- 2.7.5: `--device insane` now generates better conforming srt files.
- 2.7.3: Various fixes for the `insane` mode backend.
- 2.7.0: Introduces an `insanely-fast-whisper`, enable by using `--device insane`
- 2.6.0: GPU acceleration now happens automatically on Windows thanks to `isolated-environment`. This will also prevent
  interference with different versions of torch for other AI tools.
- 2.5.0: `--model large` now aliases to `--model large-v3`. Use `--model large-legacy` to use original large model.
- 2.4.0: pytorch updated to 2.1.2, gpu install script updated to same + cuda version is now 121.
- 2.3.9: Fallback to `cpu` device if `gpu` device is not compatible.
- 2.3.8: Fix --models arg which
- 2.3.7: Critical fix: fixes dependency breakage with open-ai. Fixes windows use of embedded tool.
- 2.3.6: Fixes typo in readme for installation instructions.
- 2.3.5: Now has `--embed` to burn the subtitles into the video itself. Only works on local mp4 files at the moment.
- 2.3.4: Removed `out.mp3` and instead use a temporary wav file, as that is faster to process. --no-keep-audio has now been removed.
- 2.3.3: Fix case where there are spaces in the name (happens on Windows)
- 2.3.2: Fix windows transcoding error
- 2.3.1: static-ffmpeg >= 2.5 now specified
- 2.3.0: Now uses the official version of whisper ai
- 2.2.1: "test\_" is now prepended to all the different output folder names.
- 2.2.0: Now explicitly setting a language will put the file in a folder with that language name, allowing multi-language passes without overwriting.
- 2.1.2: yt-dlp pinned to new minimum version. Fixes downloading issues from old lib. Adds audio normalization by default.
- 2.1.1: Updates keywords for easier pypi finding.
- 2.1.0: Unknown args are now assumed to be for whisper and passed to it as-is. Fixes https://github.com/zackees/transcribe-anything/issues/3
- 2.0.13: Now works with python 3.9
- 2.0.12: Adds --device to argument parameters. This will default to CUDA if available, else CPU.
- 2.0.11: Automatically deletes files in the out directory if they already exist.
- 2.0.10: fixes local file issue https://github.com/zackees/transcribe-anything/issues/2
- 2.0.9: fixes sanitization of path names for some youtube videos
- 2.0.8: fix `--output_dir` not being respected.
- 2.0.7: `install_cuda.sh` -> `install_cuda.py`
- 2.0.6: Fixes twitter video fetching. --keep-audio -> --no-keep-audio
- 2.0.5: Fix bad filename on trailing urls ending with /, adds --keep-audio
- 2.0.3: GPU support is now added. Run the `install_cuda.sh` script to enable.
- 2.0.2: Minor cleanup of file names (no more out.mp3.txt, it's now out.txt)
- 2.0.1: Fixes missing dependencies and adds whisper option.
- 2.0.0: New! Now a front end for Whisper ai!

## Notes:

- Insanely Fast whisper for GPU
  - https://github.com/Vaibhavs10/insanely-fast-whisper
- Fast Whisper for CPU
  - https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file
- A better whisper CLI that supports more options but has a manual install.
  - https://github.com/ochen1/insanely-fast-whisper-cli/blob/main/requirements.txt
- Subtitles translator:
  - https://github.com/TDHM/Subtitles-Translator
- Forum post on how to avoid stuttering
  - https://community.openai.com/t/how-to-avoid-hallucinations-in-whisper-transcriptions/125300/23
- More stable transcriptions:
  - https://github.com/jianfch/stable-ts?tab=readme-ov-file

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "transcribe-anything",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": "Zachary Vorhies <dont@email.me>",
    "keywords": "transcribe, openai, whisper",
    "author": null,
    "author_email": null,
    "download_url": null,
    "platform": null,
    "description": "# transcribe-anything\r\n\r\n[![MacOS_Tests](https://github.com/zackees/transcribe-anything/actions/workflows/test_macos.yml/badge.svg)](https://github.com/zackees/transcribe-anything/actions/workflows/test_macos.yml)\r\n[![Win_Tests](https://github.com/zackees/transcribe-anything/actions/workflows/test_win.yml/badge.svg)](https://github.com/zackees/transcribe-anything/actions/workflows/test_win.yml)\r\n[![Ubuntu_Tests](https://github.com/zackees/transcribe-anything/actions/workflows/test_ubuntu.yml/badge.svg)](https://github.com/zackees/transcribe-anything/actions/workflows/test_ubuntu.yml)\r\n[![Lint](https://github.com/zackees/transcribe-anything/actions/workflows/lint.yml/badge.svg)](https://github.com/zackees/transcribe-anything/actions/workflows/lint.yml)\r\n\r\n![image](https://github.com/zackees/transcribe-anything/assets/6856673/94bdd1fe-3225-438a-ac1b-09c81f1d4108)\r\n\r\n### USES WHISPER AI\r\n\r\nOver 800+\u2b50's because this program this app just works! Works great for windows and mac. This whisper front-end app is the only one to generate a `speaker.json` file which partitions the conversation by who doing the speaking.\r\n\r\n[![Star History Chart](https://api.star-history.com/svg?repos=zackees/transcribe-anything&type=Date)](https://star-history.com/#zackees/transcribe-anything&Date)\r\n\r\n### New in 3.2!\r\n\r\n**Turbo Mac acceleration using the new [lightning-whisper-mlx](https://github.com/mustafaaljadery/lightning-whisper-mlx) backend.**\r\n\r\nThis is a communinity contribution by https://github.com/aj47. On behalf of all the mac users, thank you!\r\n\r\n#### MLX Backend details\r\n\r\n  * 4x faster than the `mps` whisper backend.\r\n  * Supports multiple languages (`mps` only supports english).\r\n  * Supports custom vocabulary via `--initial_prompt`.\r\n  \r\n#### Usage\r\n\r\n\r\n```bash\r\n# Mac accelerated back-end\r\ntranscribe-anything https://www.youtube.com/watch?v=dQw4w9WgXcQ --device mlx\r\n```\r\n\r\nSpecial thank\r\n\r\n### New in 3.1!\r\n\r\nMac acceleration option using the new [lightning-whisper-mlx](https://github.com/mustafaaljadery/lightning-whisper-mlx) backend. Enable with `--device mlx`. Now supports multiple languages, custom vocabulary via `--initial_prompt`, and both transcribe/translate tasks. 10x faster than Whisper CPP, 4x faster than previous MLX implementations!\r\n\r\n**Model Storage:** MLX models are now stored in `~/.cache/whisper/mlx_models/` for consistency with other backends, instead of cluttering your current working directory.\r\n\r\n**GPU Accelerated Dockerfile**\r\n\r\nRecently added in 3.0.10 is a GPU accelerated [Dockerfile](Dockerfile).\r\n\r\nIf you are are doing translations at scale, check out the sister project: [https://github.com/zackees/transcribe-everything](https://github.com/zackees/transcribe-everything).\r\n\r\nYou can pull the docker image like so:\r\n\r\n`docker pull niteris/transcribe-anything`\r\n\r\n## About\r\n\r\nEasiest whisper implementation to install and use. Just install with `pip install transcribe-anything`. All whisper backends are executed in an isolated environment. GPU acceleration is _automatic_, using the _blazingly_ fast [insanely-fast-whisper](https://github.com/Vaibhavs10/insanely-fast-whisper) as the backend for `--device insane`. 
This is the only tool to optionally produces a `speaker.json` file, representing speaker-assigned text that has been de-chunkified.\r\n\r\nHardware acceleration on Windows/Linux `--device insane`\r\n\r\nMacArm acceleration when using `--device mlx` (now with multi-language support and custom vocabulary)\r\n\r\nInput a local file or youtube/rumble url and this tool will transcribe it using Whisper AI into subtitle files and raw text.\r\n\r\nUses whisper AI so this is state of the art translation service - completely free. \ud83e\udd2f\ud83e\udd2f\ud83e\udd2f\r\n\r\nYour data stays private and is not uploaded to any service.\r\n\r\nThe new version now has state of the art speed in transcriptions, thanks to the new backend `--device insane`, as well as producing a `speaker.json` file.\r\n\r\n```bash\r\npip install transcribe-anything\r\n\r\n# Basic usage - CPU mode (works everywhere, slower)\r\ntranscribe-anything https://www.youtube.com/watch?v=dQw4w9WgXcQ\r\n\r\n# GPU accelerated (Windows/Linux)\r\ntranscribe-anything https://www.youtube.com/watch?v=dQw4w9WgXcQ --device insane\r\n\r\n# Mac Apple Silicon accelerated\r\ntranscribe-anything https://www.youtube.com/watch?v=dQw4w9WgXcQ --device mlx\r\n\r\n# Advanced options (see Advanced Options section below for full details)\r\ntranscribe-anything video.mp4 --device mlx --batch_size 16 --verbose\r\ntranscribe-anything video.mp4 --device insane --batch-size 8 --flash True\r\n```\r\n\r\n_python api_\r\n\r\n```python\r\nfrom transcribe_anything import transcribe_anything\r\n\r\ntranscribe_anything(\r\n    url_or_file=\"https://www.youtube.com/watch?v=dQw4w9WgXcQ\",\r\n    output_dir=\"output_dir\",\r\n    task=\"transcribe\",\r\n    model=\"large\",\r\n    device=\"cuda\"\r\n)\r\n\r\n# Full function signiture:\r\ndef transcribe(\r\n    url_or_file: str,\r\n    output_dir: Optional[str] = None,\r\n    model: Optional[str] = None,              # tiny,small,medium,large\r\n    task: Optional[str] = None,               # transcribe or translate\r\n    language: Optional[str] = None,           # auto detected if none, \"en\" for english...\r\n    device: Optional[str] = None,             # cuda,cpu,insane,mlx\r\n    embed: bool = False,                      # Produces a video.mp4 with the subtitles burned in.\r\n    hugging_face_token: Optional[str] = None, # If you want a speaker.json\r\n    other_args: Optional[list[str]] = None,   # Other args to be passed to to the whisper backend\r\n    initial_prompt: Optional[str] = None,     # Custom prompt for better recognition of specific terms\r\n) -> str:\r\n\r\n```\r\n\r\n#### Fastest Transcription - Use `insane` mode with model `large-v3` + `batching`\r\n\r\nThis is by far the fastest combination. Experimental, it produces text that tends to be lower quality:\r\n\r\n- Higher chance for repeated text patterns.\r\n- Timestamps in the vtt/srt files become unaligned.\r\n\r\nIt's unclear if this is due to batching or `large-v3` itself. More testing is needed. If you do this then please let us know the results by filing a bug in the issues page.\r\n\r\nLarge batch sizes require more significant amounts of Nvidia GPU Ram. 
For a 12 GB card, it's been experimentally shown that batch-size=8 will work on all videos from an internally tested data lake.\r\n\r\n#### Insanely fast on `cuda` platforms\r\n\r\nIf you pass in `--device insane` on a cuda platform then this tool will use this state of the art version of whisper: https://github.com/Vaibhavs10/insanely-fast-whisper, which is MUCH faster and has a pipeline for speaker identification (diarization) using the `--hf_token` option.\r\n\r\nCompatible with Python 3.10 and above. Backends use an isolated environment with pinned requirements and python version.\r\n\r\n#### Speaker.json\r\n\r\nWhen diarization is enabled via `--hf_token` (hugging face token) then the output json will contain speaker info labeled as `SPEAKER_00`, `SPEAKER_01` etc. For licensing agreement reasons, you must get your own hugging face token if you want to enable this feature. Also there is an additional step to agree to the user policies for the `pyannote.audio` located here: https://huggingface.co/pyannote/segmentation-3.0. If you don't do this then you'll see runtime exceptions from `pyannote` when the `--hf_token` is used.\r\n\r\nWhat's special to this app is that we also generate a `speaker.json` which is a de-chunkified version of the output json speaker section.\r\n\r\n### speaker.json example:\r\n\r\n```json\r\n[\r\n  {\r\n    \"speaker\": \"SPEAKER_00\",\r\n    \"timestamp\": [0.0, 7.44],\r\n    \"text\": \"for that. But welcome, Zach Vorhees. Great to have you back on. Thank you, Matt. Craving me back onto your show. Man, we got a lot to talk about.\",\r\n    \"reason\": \"beginning\"\r\n  },\r\n  {\r\n    \"speaker\": \"SPEAKER_01\",\r\n    \"timestamp\": [7.44, 33.52],\r\n    \"text\": \"Oh, we do. 2023 was the year that OpenAI released, you know, chat GPT-4, which I think most people would say has surpassed average human intelligence, at least in test taking, perhaps not in, you know, reasoning and things like that. But it was a major year for AI. I think that most people are behind the curve on this. What's your take of what just happened in the last 12 months and what it means for the future of human cognition versus machine cognition?\",\r\n    \"reason\": \"speaker-switch\"\r\n  },\r\n  {\r\n    \"speaker\": \"SPEAKER_00\",\r\n    \"timestamp\": [33.52, 44.08],\r\n    \"text\": \"Yeah. Well, you know, at the beginning of 2023, we had a pretty weak AI system, which was a chat GPT 3.5 turbo was the best that we had. And then between the beginning of last\",\r\n    \"reason\": \"speaker-switch\"\r\n  }\r\n]\r\n```\r\n\r\nNote that `speaker.json `is only generated when using `--device insane` and not for `--device cuda` nor `--device cpu`.\r\n\r\n#### `cuda` vs `insane`\r\n\r\nInsane mode eats up a lot of memory and it's common to get out of memory errors while transcribing. For example a 3060 12GB nividia card produced out of memory errors are common for big content. If you experience this then pass in `--batch-size 8` or smaller. Note that any arguments not recognized by `transcribe-anything` are passed onto the backend transcriber.\r\n\r\nAlso, please don't use `distil-whisper/distil-large-v2`, it produces extremely bad stuttering and it's not entirely clear why this is. I've had to switch it out of production environments because it's so bad. It's also non-deterministic so I think that somehow a fallback non-zero temperature is being used, which produces these stutterings.\r\n\r\n`cuda` is the original AI model supplied by openai. It's more stable but MUCH slower. 
It also won't produce a `speaker.json` file which looks like this:\r\n\r\n`--embed`. This app will optionally embed subtitles directly \"burned\" into an output video.\r\n\r\n# Install\r\n\r\nThis front end app for whisper boasts the easiest install in the whisper ecosystem thanks to [isolated-environment](https://pypi.org/project/isolated-environment/). You can simply install it with pip, like this:\r\n\r\n```bash\r\npip install transcribe-anything\r\n```\r\n\r\n# Docker\r\n\r\nWe have a [Dockerfile](Dockerfile) that will be descently fast for startup. It is tuned specifically for `device=insane`. If you have extremely large batches of data you'd like to convert all at once then consider using the sister project [transcribe-everything](https://github.com/zackees/transcribe-everything) which operates on entire remote paths hierarchies.\r\n\r\n# GPU Acceleration\r\n\r\nGPU acceleration will be automatically enabled for windows and linux. Mac users can use `--device mlx` for hardware acceleration on Apple Silicon. `--device insane` may also work on Mac M1+ but has been less tested.\r\n\r\nWindows/Linux:\r\n\r\n- Use `--device insane`\r\n\r\nMac:\r\n\r\n- Use `--device mlx`\r\n\r\n# Advanced Options and Backend-Specific Arguments\r\n\r\n## Quick Reference\r\n\r\n| Backend | Device Flag | Key Arguments | Best For |\r\n|---------|-------------|---------------|----------|\r\n| **MLX** | `--device mlx` | `--batch_size`, `--verbose`, `--initial_prompt` | Mac Apple Silicon |\r\n| **Insanely Fast** | `--device insane` | `--batch-size`, `--hf_token`, `--flash`, `--timestamp` | Windows/Linux GPU |\r\n| **CPU** | `--device cpu` | Standard whisper args | Universal compatibility |\r\n\r\n> **Note:** Each backend has different capabilities. MLX is optimized for Apple Silicon with a focused feature set. Insanely Fast uses a transformer-based architecture with specific options. CPU backend supports the full range of standard OpenAI Whisper arguments.\r\n\r\n## Custom Prompts and Vocabulary\r\n\r\nWhisper supports custom prompts to improve transcription accuracy for domain-specific vocabulary, names, or technical terms. 
This is especially useful when transcribing content with:\r\n\r\n- Technical terminology (AI, machine learning, programming terms)\r\n- Proper names (people, companies, products)\r\n- Medical or scientific terms\r\n- Industry-specific jargon\r\n\r\n### Using Custom Prompts\r\n\r\n#### Command Line\r\n\r\n```bash\r\n# Direct prompt\r\ntranscribe-anything video.mp4 --initial_prompt \"The speaker discusses artificial intelligence, machine learning, and neural networks.\"\r\n\r\n# Load prompt from file\r\ntranscribe-anything video.mp4 --prompt_file my_custom_prompt.txt\r\n```\r\n\r\n#### Python API\r\n\r\n```python\r\nfrom transcribe_anything import transcribe\r\n\r\n# Direct prompt\r\ntranscribe(\r\n    url_or_file=\"video.mp4\",\r\n    initial_prompt=\"The speaker discusses AI, PyTorch, TensorFlow, and deep learning algorithms.\"\r\n)\r\n\r\n# Load prompt from file\r\nwith open(\"my_prompt.txt\", \"r\") as f:\r\n    prompt = f.read()\r\n\r\ntranscribe(\r\n    url_or_file=\"video.mp4\",\r\n    initial_prompt=prompt\r\n)\r\n```\r\n\r\n#### Best Practices\r\n\r\n- Keep prompts concise but comprehensive for your domain\r\n- Include variations of terms (e.g., \"AI\", \"artificial intelligence\")\r\n- Focus on terms that Whisper commonly misrecognizes\r\n- Test with and without prompts to measure improvement\r\n\r\n## MLX Backend Arguments (--device mlx)\r\n\r\nThe MLX backend supports additional arguments for fine-tuning performance:\r\n\r\n### Available Options\r\n\r\n```bash\r\n# Adjust batch size for better performance/memory trade-off\r\ntranscribe-anything video.mp4 --device mlx --batch_size 24\r\n\r\n# Enable verbose output for debugging\r\ntranscribe-anything video.mp4 --device mlx --verbose\r\n\r\n# Use custom prompt for better recognition of specific terms\r\ntranscribe-anything video.mp4 --device mlx --initial_prompt \"The speaker discusses AI, machine learning, and neural networks.\"\r\n```\r\n\r\n### MLX-Specific Arguments\r\n\r\n| Argument | Type | Default | Description |\r\n|----------|------|---------|-------------|\r\n| `--batch_size` | int | 12 | Batch size for processing. Higher values use more memory but may be faster |\r\n| `--verbose` | flag | false | Enable verbose output for debugging |\r\n| `--initial_prompt` | string | None | Custom vocabulary/context prompt for better recognition |\r\n\r\n### Supported Models\r\n\r\nThe MLX backend supports these whisper models optimized for Apple Silicon:\r\n- `tiny`, `small`, `base`, `medium`, `large`, `large-v2`, `large-v3`\r\n- Distilled models: `distil-small.en`, `distil-medium.en`, `distil-large-v2`, `distil-large-v3`\r\n\r\n> **Note:** The MLX backend uses the lightning-whisper-mlx library which has a focused feature set optimized for Apple Silicon. 
Advanced whisper options like `--temperature` and `--word_timestamps` are not currently supported by this backend.\r\n\r\n## Insanely Fast Whisper Arguments (--device insane)\r\n\r\nThe insanely-fast-whisper backend supports these specific options:\r\n\r\n### Performance Options\r\n\r\n```bash\r\n# Adjust batch size (critical for GPU memory management)\r\ntranscribe-anything video.mp4 --device insane --batch-size 8\r\n\r\n# Use different model variants\r\ntranscribe-anything video.mp4 --device insane --model large-v3\r\n\r\n# Enable Flash Attention 2 for faster processing\r\ntranscribe-anything video.mp4 --device insane --flash True\r\n```\r\n\r\n### Speaker Diarization Options\r\n\r\n```bash\r\n# Enable speaker diarization with HuggingFace token\r\ntranscribe-anything video.mp4 --device insane --hf_token your_token_here\r\n\r\n# Specify exact number of speakers\r\ntranscribe-anything video.mp4 --device insane --hf_token your_token --num-speakers 3\r\n\r\n# Set speaker range\r\ntranscribe-anything video.mp4 --device insane --hf_token your_token --min-speakers 2 --max-speakers 5\r\n```\r\n\r\n### Timestamp Options\r\n\r\n```bash\r\n# Choose timestamp granularity\r\ntranscribe-anything video.mp4 --device insane --timestamp chunk  # default\r\ntranscribe-anything video.mp4 --device insane --timestamp word   # word-level\r\n```\r\n\r\n### Insanely Fast Whisper Arguments\r\n\r\n| Argument | Type | Default | Description |\r\n|----------|------|---------|-------------|\r\n| `--batch-size` | int | 24 | Batch size for processing. Critical for GPU memory management |\r\n| `--flash` | bool | false | Use Flash Attention 2 for faster processing |\r\n| `--timestamp` | choice | chunk | Timestamp granularity: \"chunk\" or \"word\" |\r\n| `--hf_token` | string | None | HuggingFace token for speaker diarization |\r\n| `--num-speakers` | int | None | Exact number of speakers (cannot use with min/max) |\r\n| `--min-speakers` | int | None | Minimum number of speakers |\r\n| `--max-speakers` | int | None | Maximum number of speakers |\r\n| `--diarization_model` | string | pyannote/speaker-diarization | Diarization model to use |\r\n\r\n> **Note:** The insanely-fast-whisper backend uses a different architecture than standard OpenAI Whisper. It does NOT support standard whisper arguments like `--temperature`, `--beam_size`, `--best_of`, etc. These are specific to the OpenAI implementation.\r\n\r\n## CPU Backend Arguments (--device cpu)\r\n\r\nThe CPU backend uses the standard OpenAI Whisper implementation and supports many additional arguments:\r\n\r\n### Standard Whisper Options\r\n\r\n```bash\r\n# Language and task options (also available as main arguments)\r\ntranscribe-anything video.mp4 --device cpu --language es --task translate\r\n\r\n# Generation parameters\r\ntranscribe-anything video.mp4 --device cpu --temperature 0.1 --best_of 5 --beam_size 5\r\n\r\n# Quality thresholds\r\ntranscribe-anything video.mp4 --device cpu --compression_ratio_threshold 2.4 --logprob_threshold -1.0\r\n\r\n# Output formatting\r\ntranscribe-anything video.mp4 --device cpu --word_timestamps --highlight_words True\r\n\r\n# Audio processing\r\ntranscribe-anything video.mp4 --device cpu --threads 4 --clip_timestamps \"0,30\"\r\n```\r\n\r\n> **Note:** The CPU backend supports most standard OpenAI Whisper arguments. 
These are passed through automatically and documented in the [OpenAI Whisper repository](https://github.com/openai/whisper).\r\n\r\n### Batch Size Recommendations\r\n\r\n**MLX Backend (`--device mlx`):**\r\n- Default: 12\r\n- Recommended range: 8-24\r\n- Higher values for more VRAM, lower for less\r\n\r\n**Insanely Fast Whisper (`--device insane`):**\r\n- Default: 24\r\n- Recommended for 8GB GPU: 4-8\r\n- Recommended for 12GB GPU: 8-12\r\n- Recommended for 24GB GPU: 16-24\r\n- Use `--flash True` for better memory efficiency\r\n- Start low and increase if no OOM errors\r\n\r\n# Usage Examples\r\n\r\n## Basic Usage\r\n\r\n```bash\r\n# Basic transcription\r\ntranscribe-anything https://www.youtube.com/watch?v=dQw4w9WgXcQ\r\n\r\n# Local file\r\ntranscribe-anything video.mp4\r\n```\r\n\r\n## Backend-Specific Examples\r\n\r\n### MLX Backend (Mac Apple Silicon)\r\n\r\n```bash\r\n# Basic MLX usage\r\ntranscribe-anything video.mp4 --device mlx\r\n\r\n# MLX with custom batch size and verbose output\r\ntranscribe-anything video.mp4 --device mlx --batch_size 16 --verbose\r\n\r\n# MLX with custom prompt for technical content\r\ntranscribe-anything lecture.mp4 --device mlx --initial_prompt \"The speaker discusses machine learning, neural networks, PyTorch, and TensorFlow.\"\r\n\r\n# MLX with multiple options (using main arguments for language/task)\r\ntranscribe-anything video.mp4 --device mlx --batch_size 20 --verbose --task translate --language es\r\n```\r\n\r\n### Insanely Fast Whisper (GPU)\r\n\r\n```bash\r\n# Basic insane mode\r\ntranscribe-anything video.mp4 --device insane\r\n\r\n# Insane mode with custom batch size (important for GPU memory)\r\ntranscribe-anything video.mp4 --device insane --batch-size 8\r\n\r\n# Insane mode with Flash Attention 2 for speed\r\ntranscribe-anything video.mp4 --device insane --batch-size 12 --flash True\r\n\r\n# Insane mode with speaker diarization\r\ntranscribe-anything video.mp4 --device insane --hf_token your_huggingface_token\r\n\r\n# Insane mode with word-level timestamps and speaker diarization\r\ntranscribe-anything video.mp4 --device insane --timestamp word --hf_token your_token --num-speakers 3\r\n\r\n# High-performance setup with all optimizations\r\ntranscribe-anything video.mp4 --device insane --batch-size 16 --flash True --timestamp word\r\n```\r\n\r\n### CPU Backend (Universal)\r\n\r\n```bash\r\n# CPU mode (works everywhere, slower)\r\ntranscribe-anything video.mp4 --device cpu\r\n\r\n# CPU with custom model and language\r\ntranscribe-anything video.mp4 --device cpu --model medium --language fr --task transcribe\r\n```\r\n\r\n## Troubleshooting Common Issues\r\n\r\n### Out of Memory Errors\r\n\r\nIf you encounter GPU out-of-memory errors:\r\n\r\n```bash\r\n# Reduce batch size for MLX\r\ntranscribe-anything video.mp4 --device mlx --batch_size 8\r\n\r\n# Reduce batch size for insane mode\r\ntranscribe-anything video.mp4 --device insane --batch-size 4\r\n\r\n# Use smaller model\r\ntranscribe-anything video.mp4 --device insane --model small --batch-size 8\r\n```\r\n\r\n### Poor Quality Transcriptions\r\n\r\nFor better quality:\r\n\r\n```bash\r\n# Use larger model\r\ntranscribe-anything video.mp4 --device insane --model large-v3\r\n\r\n# Enable Flash Attention 2 for better performance\r\ntranscribe-anything video.mp4 --device insane --flash True\r\n\r\n# Use custom prompt for domain-specific content (works with all backends)\r\ntranscribe-anything video.mp4 --initial_prompt \"Medical terminology: diagnosis, treatment, symptoms, patient care\"\r\n\r\n# For 
CPU backend, you can use standard whisper quality options\r\ntranscribe-anything video.mp4 --device cpu --compression_ratio_threshold 2.0 --logprob_threshold -0.5\r\n```\r\n\r\n### Performance Optimization\r\n\r\nFor faster processing:\r\n\r\n```bash\r\n# Increase batch size (if you have enough GPU memory)\r\ntranscribe-anything video.mp4 --device mlx --batch_size 24\r\ntranscribe-anything video.mp4 --device insane --batch-size 16\r\n\r\n# Enable Flash Attention 2 for insane mode (significant speedup)\r\ntranscribe-anything video.mp4 --device insane --flash True --batch-size 16\r\n\r\n# Use smaller model for speed\r\ntranscribe-anything video.mp4 --device insane --model small\r\n\r\n# Use distilled models for even faster processing\r\ntranscribe-anything video.mp4 --device insane --model distil-whisper/large-v2 --flash True\r\n```\r\n\r\nWill output:\r\n\r\n```\r\nDetecting language using up to the first 30 seconds. Use `--language` to specify the language\r\nDetected language: English\r\n[00:00.000 --> 00:27.000]  We're no strangers to love, you know the rules, and so do I\r\n[00:27.000 --> 00:31.000]  I've built commitments while I'm thinking of\r\n[00:31.000 --> 00:35.000]  You wouldn't get this from any other guy\r\n[00:35.000 --> 00:40.000]  I just wanna tell you how I'm feeling\r\n[00:40.000 --> 00:43.000]  Gotta make you understand\r\n[00:43.000 --> 00:45.000]  Never gonna give you up\r\n[00:45.000 --> 00:47.000]  Never gonna let you down\r\n[00:47.000 --> 00:51.000]  Never gonna run around and desert you\r\n[00:51.000 --> 00:53.000]  Never gonna make you cry\r\n[00:53.000 --> 00:55.000]  Never gonna say goodbye\r\n[00:55.000 --> 00:58.000]  Never gonna tell a lie\r\n[00:58.000 --> 01:00.000]  And hurt you\r\n[01:00.000 --> 01:04.000]  We've known each other for so long\r\n[01:04.000 --> 01:09.000]  Your heart's been aching but you're too shy to say it\r\n[01:09.000 --> 01:13.000]  Inside we both know what's been going on\r\n[01:13.000 --> 01:17.000]  We know the game and we're gonna play it\r\n[01:17.000 --> 01:22.000]  And if you ask me how I'm feeling\r\n[01:22.000 --> 01:25.000]  Don't tell me you're too much to see\r\n[01:25.000 --> 01:27.000]  Never gonna give you up\r\n[01:27.000 --> 01:29.000]  Never gonna let you down\r\n[01:29.000 --> 01:33.000]  Never gonna run around and desert you\r\n[01:33.000 --> 01:35.000]  Never gonna make you cry\r\n[01:35.000 --> 01:38.000]  Never gonna say goodbye\r\n[01:38.000 --> 01:40.000]  Never gonna tell a lie\r\n[01:40.000 --> 01:42.000]  And hurt you\r\n[01:42.000 --> 01:44.000]  Never gonna give you up\r\n[01:44.000 --> 01:46.000]  Never gonna let you down\r\n[01:46.000 --> 01:50.000]  Never gonna run around and desert you\r\n[01:50.000 --> 01:52.000]  Never gonna make you cry\r\n[01:52.000 --> 01:54.000]  Never gonna say goodbye\r\n[01:54.000 --> 01:57.000]  Never gonna tell a lie\r\n[01:57.000 --> 01:59.000]  And hurt you\r\n[02:08.000 --> 02:10.000]  Never gonna give\r\n[02:12.000 --> 02:14.000]  Never gonna give\r\n[02:16.000 --> 02:19.000]  We've known each other for so long\r\n[02:19.000 --> 02:24.000]  Your heart's been aching but you're too shy to say it\r\n[02:24.000 --> 02:28.000]  Inside we both know what's been going on\r\n[02:28.000 --> 02:32.000]  We know the game and we're gonna play it\r\n[02:32.000 --> 02:37.000]  I just wanna tell you how I'm feeling\r\n[02:37.000 --> 02:40.000]  Gotta make you understand\r\n[02:40.000 --> 02:42.000]  Never gonna give you up\r\n[02:42.000 --> 02:44.000]  Never gonna let you 
Running the Basic Usage YouTube example above will output:

```
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:27.000]  We're no strangers to love, you know the rules, and so do I
[00:27.000 --> 00:31.000]  I've built commitments while I'm thinking of
[00:31.000 --> 00:35.000]  You wouldn't get this from any other guy
[00:35.000 --> 00:40.000]  I just wanna tell you how I'm feeling
[00:40.000 --> 00:43.000]  Gotta make you understand
[00:43.000 --> 00:45.000]  Never gonna give you up
[00:45.000 --> 00:47.000]  Never gonna let you down
[00:47.000 --> 00:51.000]  Never gonna run around and desert you
[00:51.000 --> 00:53.000]  Never gonna make you cry
[00:53.000 --> 00:55.000]  Never gonna say goodbye
[00:55.000 --> 00:58.000]  Never gonna tell a lie
[00:58.000 --> 01:00.000]  And hurt you
[01:00.000 --> 01:04.000]  We've known each other for so long
[01:04.000 --> 01:09.000]  Your heart's been aching but you're too shy to say it
[01:09.000 --> 01:13.000]  Inside we both know what's been going on
[01:13.000 --> 01:17.000]  We know the game and we're gonna play it
[01:17.000 --> 01:22.000]  And if you ask me how I'm feeling
[01:22.000 --> 01:25.000]  Don't tell me you're too much to see
[01:25.000 --> 01:27.000]  Never gonna give you up
[01:27.000 --> 01:29.000]  Never gonna let you down
[01:29.000 --> 01:33.000]  Never gonna run around and desert you
[01:33.000 --> 01:35.000]  Never gonna make you cry
[01:35.000 --> 01:38.000]  Never gonna say goodbye
[01:38.000 --> 01:40.000]  Never gonna tell a lie
[01:40.000 --> 01:42.000]  And hurt you
[01:42.000 --> 01:44.000]  Never gonna give you up
[01:44.000 --> 01:46.000]  Never gonna let you down
[01:46.000 --> 01:50.000]  Never gonna run around and desert you
[01:50.000 --> 01:52.000]  Never gonna make you cry
[01:52.000 --> 01:54.000]  Never gonna say goodbye
[01:54.000 --> 01:57.000]  Never gonna tell a lie
[01:57.000 --> 01:59.000]  And hurt you
[02:08.000 --> 02:10.000]  Never gonna give
[02:12.000 --> 02:14.000]  Never gonna give
[02:16.000 --> 02:19.000]  We've known each other for so long
[02:19.000 --> 02:24.000]  Your heart's been aching but you're too shy to say it
[02:24.000 --> 02:28.000]  Inside we both know what's been going on
[02:28.000 --> 02:32.000]  We know the game and we're gonna play it
[02:32.000 --> 02:37.000]  I just wanna tell you how I'm feeling
[02:37.000 --> 02:40.000]  Gotta make you understand
[02:40.000 --> 02:42.000]  Never gonna give you up
[02:42.000 --> 02:44.000]  Never gonna let you down
[02:44.000 --> 02:48.000]  Never gonna run around and desert you
[02:48.000 --> 02:50.000]  Never gonna make you cry
[02:50.000 --> 02:53.000]  Never gonna say goodbye
[02:53.000 --> 02:55.000]  Never gonna tell a lie
[02:55.000 --> 02:57.000]  And hurt you
[02:57.000 --> 02:59.000]  Never gonna give you up
[02:59.000 --> 03:01.000]  Never gonna let you down
[03:01.000 --> 03:05.000]  Never gonna run around and desert you
[03:05.000 --> 03:08.000]  Never gonna make you cry
[03:08.000 --> 03:10.000]  Never gonna say goodbye
[03:10.000 --> 03:12.000]  Never gonna tell a lie
[03:12.000 --> 03:14.000]  And hurt you
[03:14.000 --> 03:16.000]  Never gonna give you up
[03:16.000 --> 03:23.000]  If you want, never gonna let you down Never gonna run around and desert you
[03:23.000 --> 03:28.000]  Never gonna make you hide Never gonna say goodbye
[03:28.000 --> 03:42.000]  Never gonna tell you I ain't ready
```

## API

```python
from transcribe_anything.api import transcribe

transcribe(
    url_or_file="https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    output_dir="output_dir",
)
```
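Building on the call above, here is a slightly longer sketch that transcribes a local file and then lists whatever the chosen backend wrote into the output directory. It uses only the documented `url_or_file` and `output_dir` parameters and deliberately avoids assuming exact output filenames, since those vary by backend:

```python
from pathlib import Path

from transcribe_anything.api import transcribe

out_dir = Path("output_dir")

# Transcribe a local file (a URL would work the same way).
transcribe(
    url_or_file="video.mp4",
    output_dir=str(out_dir),
)

# Print whatever artifacts were produced (text, subtitles, json, ...).
for artifact in sorted(out_dir.iterdir()):
    print(artifact.name)
```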
## Develop

Works on Ubuntu/MacOS/Win32 (in git-bash). This will create a virtual environment:

```bash
> cd transcribe_anything
> ./install.sh
# Enter the environment:
> source activate.sh
```

The environment is now active, and the next step installs only into this local Python. If the terminal is closed, get back into the environment with `cd transcribe_anything` and `source activate.sh`.

## Required: Install to the current Python environment

- `pip install transcribe-anything`
  - The command `transcribe_anything` will magically become available.
- `transcribe_anything <YOUTUBE_URL>`
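If the command does not seem to be available after installing, a quick sanity check from Python can confirm that the console script landed on your PATH. This is just a convenience sketch; the `--help` flag is assumed to behave like a normal argparse CLI:

```python
import shutil
import subprocess

# The docs use both spellings of the command; check for either one.
exe = shutil.which("transcribe-anything") or shutil.which("transcribe_anything")
if exe is None:
    raise SystemExit("transcribe-anything not found on PATH; is the right environment active?")

# Assumed: the standard --help flag prints usage and available options.
subprocess.run([exe, "--help"], check=True)
```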
# Tech Stack

- OpenAI whisper
- insanely-fast-whisper
- yt-dlp: https://github.com/yt-dlp/yt-dlp
- static-ffmpeg
  - github: https://github.com/zackees/static_ffmpeg
  - pypi: https://pypi.org/project/static-ffmpeg/

# Testing

- Every commit is run against standard linters and a batch of unit tests.

## Updated version 2.3.0

`transcribe-anything` now works much better across different configurations and is much faster. Why? I switched the environment isolation from my own homespun version built on top of `venv` to the AMAZING `uv` system. The biggest improvements are runtime speed and re-installs; `uv` is just insanely fast at checking the environment. It also turns out that `uv` has strict dependency checking, which found a minor bug: a certain version of one of the `pytorch` dependencies was being constantly re-installed because of a dependency conflict that pip was apparently perfectly happy to never warn about. This manifested as certain packages being constantly re-installed with the previous version. `uv` identified this as an error immediately, and it was fixed.

The real reason behind `transcribe-anything`'s surprising popularity comes from the fact that it just works. And the reason for this is that I can isolate environments for different configurations and install them lazily. If you have the same problem, consider my other tool: https://github.com/zackees/iso-env

# Versions

- 3.0.7: Insane whisper mode no longer prints out the srt file during transcription completion.
- 3.0.6: MacOS MLX mode fixed/improved
  - PR: https://github.com/zackees/transcribe-anything/pull/39
  - Thank you https://github.com/aj47!
- 3.0.5: A temp wav file was not being cleaned up; now it is.
- 3.1.0: Upgraded Mac-arm backend to [lightning-whisper-mlx](https://github.com/mustafaaljadery/lightning-whisper-mlx), enable with `--device mlx`. Now supports multiple languages, custom vocabulary via `--initial_prompt`, and both transcribe/translate tasks. 10x faster than Whisper CPP!
- 3.0.0: Implemented new Mac-arm accelerated [whisper-mps](https://github.com/AtomGradient/whisper-mps) backend, enable with `--device mps` (now `--device mlx`). Only does English, but is quite fast.
- 2.3.0: Swapped out the environment isolator. Now based on `uv`; should fix the missing DLLs on some Windows systems.
- 2.7.39: Fix `--hf-token` usage for the insanely fast whisper backend.
- 2.7.37: Fixed breakage due to numpy 2.0 being released.
- 2.7.36: Fixed some ffmpeg dependencies.
- 2.7.35: All `ffmpeg` commands are now `static_ffmpeg` commands. Fixes issue.
- 2.7.34: Various fixes.
- 2.7.33: Fixes Linux.
- 2.7.32: Fixes Mac M1 and M2.
- 2.7.31: Adds a warning if using Python 3.12, which isn't supported yet in the backend.
- 2.7.30: Adds `--query-gpu-json-path`.
- 2.7.29: Made the json -> srt conversion more robust for `--device insane`; bad entries will be skipped with a warning.
- 2.7.28: Fixes bad title fetching with weird characters.
- 2.7.27: `pytorch-audio` upgrades broke this package. Upgrade to the latest version to resolve.
- 2.7.26: Add model option `distil-whisper/distil-large-v2`.
- 2.7.25: Windows (Linux/MacOS) bug with `--device insane` and Python 3.11 installing the wrong `insanely-fast-whisper` version.
- 2.7.22: Fixes `transcribe-anything` on Linux.
- 2.7.21: Tested that Mac Arm can run `--device insane`. Added tests to ensure this.
- 2.7.20: Fixes wrong type being returned when speaker.json happens to be empty.
- 2.7.19: speaker.json is now in plain json format instead of json5 format.
- 2.7.18: Fixes tests.
- 2.7.17: Fixes speaker.json nesting.
- 2.7.16: Adds `--save_hf_token`.
- 2.7.15: Fixes 2.7.14 breakage.
- 2.7.14: (Broken) Now generates `speaker.json` when diarization is enabled.
- 2.7.13: Default diarization model is now pyannote/speaker-diarization-3.1.
- 2.7.12: Adds srt_swap for line breaks and improved isolated_environment usage.
- 2.7.11: `--device insane` now generates a `*.vtt` translation file.
- 2.7.10: Better support for namespaced models. Trims text output in the output json. Output json is now formatted with indents. SRT file is now printed out for `--device insane`.
- 2.7.9: All SRT translation errors fixed for `--device insane`. All tests pass.
- 2.7.8: On error with `--device insane`, write out the error.json file into the destination.
- 2.7.7: Better error messages during failure.
- 2.7.6: Improved generation of out.txt, removes linebreaks.
- 2.7.5: `--device insane` now generates better conforming srt files.
- 2.7.3: Various fixes for the `insane` mode backend.
- 2.7.0: Introduces the `insanely-fast-whisper` backend, enable by using `--device insane`.
- 2.6.0: GPU acceleration now happens automatically on Windows thanks to `isolated-environment`. This will also prevent interference with different versions of torch for other AI tools.
- 2.5.0: `--model large` now aliases to `--model large-v3`. Use `--model large-legacy` to use the original large model.
- 2.4.0: pytorch updated to 2.1.2, GPU install script updated to match; CUDA version is now 121.
- 2.3.9: Fall back to the `cpu` device if the `gpu` device is not compatible.
- 2.3.8: Fix `--models` arg.
- 2.3.7: Critical fix: fixes dependency breakage with open-ai. Fixes Windows use of embedded tool.
- 2.3.6: Fixes typo in readme for installation instructions.
- 2.3.5: Now has `--embed` to burn the subtitles into the video itself. Only works on local mp4 files at the moment.
- 2.3.4: Removed `out.mp3` and instead use a temporary wav file, as that is faster to process. --no-keep-audio has now been removed.
- 2.3.3: Fix case where there are spaces in the name (happens on Windows).
- 2.3.2: Fix Windows transcoding error.
- 2.3.1: static-ffmpeg >= 2.5 now specified.
- 2.3.0: Now uses the official version of whisper ai.
- 2.2.1: "test_" is now prepended to all the different output folder names.
- 2.2.0: Now explicitly setting a language will put the file in a folder with that language name, allowing multi-language passes without overwriting.
- 2.1.2: yt-dlp pinned to new minimum version. Fixes downloading issues from old lib. Adds audio normalization by default.
- 2.1.1: Updates keywords for easier pypi finding.
- 2.1.0: Unknown args are now assumed to be for whisper and passed to it as-is. Fixes https://github.com/zackees/transcribe-anything/issues/3
- 2.0.13: Now works with Python 3.9.
- 2.0.12: Adds --device to argument parameters. This will default to CUDA if available, else CPU.
- 2.0.11: Automatically deletes files in the out directory if they already exist.
- 2.0.10: Fixes local file issue https://github.com/zackees/transcribe-anything/issues/2
- 2.0.9: Fixes sanitization of path names for some YouTube videos.
- 2.0.8: Fix `--output_dir` not being respected.
- 2.0.7: `install_cuda.sh` -> `install_cuda.py`
- 2.0.6: Fixes Twitter video fetching. --keep-audio -> --no-keep-audio
- 2.0.5: Fix bad filename on trailing urls ending with /, adds --keep-audio.
- 2.0.3: GPU support is now added. Run the `install_cuda.sh` script to enable.
- 2.0.2: Minor cleanup of file names (no more out.mp3.txt, it's now out.txt).
- 2.0.1: Fixes missing dependencies and adds whisper option.
- 2.0.0: New! Now a front end for Whisper ai!

## Notes:

- Insanely Fast Whisper for GPU
  - https://github.com/Vaibhavs10/insanely-fast-whisper
- Faster Whisper for CPU
  - https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file
- A better whisper CLI that supports more options but has a manual install.
  - https://github.com/ochen1/insanely-fast-whisper-cli/blob/main/requirements.txt
- Subtitles translator:
  - https://github.com/TDHM/Subtitles-Translator
- Forum post on how to avoid hallucinations/stuttering:
  - https://community.openai.com/t/how-to-avoid-hallucinations-in-whisper-transcriptions/125300/23
- More stable transcriptions:
  - https://github.com/jianfch/stable-ts?tab=readme-ov-file
    "bugtrack_url": null,
    "license": "BSD-3-Clause",
    "summary": "Uses Whisper AI to transcribe speech from video and audio files. Also accepts URLs for YouTube, Rumble, BitChute, clear file links, etc.",
    "version": "3.2.3",
    "project_urls": {
        "homepage": "https://github.com/zackees/transcribe-anything"
    },
    "split_keywords": [
        "transcribe",
        " openai",
        " whisper"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d0346e886fb5682da2e9796b5e56d9d8ecd1b77846890de112dfe6c519c36c6c",
                "md5": "a9ebec84915ab2db3513ea1d9d936e7a",
                "sha256": "5574d20ed5bf505601978b255c5fe21260a8db459eef3d3e82fc42c0d2606cbf"
            },
            "downloads": -1,
            "filename": "transcribe_anything-3.2.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a9ebec84915ab2db3513ea1d9d936e7a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 201005,
            "upload_time": "2025-07-27T00:22:49",
            "upload_time_iso_8601": "2025-07-27T00:22:49.422251Z",
            "url": "https://files.pythonhosted.org/packages/d0/34/6e886fb5682da2e9796b5e56d9d8ecd1b77846890de112dfe6c519c36c6c/transcribe_anything-3.2.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-27 00:22:49",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "zackees",
    "github_project": "transcribe-anything",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "tox": true,
    "lcname": "transcribe-anything"
}
        
Elapsed time: 1.12616s