# TurnVoice

A command-line tool to **transform voices** in (YouTube) videos with additional **translation** capabilities. [^1] 

https://github.com/KoljaB/TurnVoice/assets/7604638/f87759cc-0b3f-4d8f-864f-af99202d7312

<sup>(sorry for the bad video quality, it had to fit under 10MB file size because GitHub 🤷)</sup> [🎞️ HD version 🎞️](https://www.youtube.com/watch?v=Rl0WhIax2lM) 

## Features

- **Voice Transformation**  
  Turn voices with the free [Coqui TTS](#coqui-engine) at no operating cost <sup>*(supports voice cloning)*</sup>

- **Voice Variety**  
  Support for popular TTS engines like [Elevenlabs](#elevenlabs-engine), [OpenAI TTS](#openai-engine), or [Azure](#azure-engine) for more voices. [^7]

- **Translation**  
  Translates videos at zero cost, powered by the free deep-translator library.

- **Change Speaking Styles** <sup>*(AI powered)*</sup>  
  Have every spoken sentence delivered in a custom speaking style for a unique flair, using prompting. [^6]

- **Full Rendering Control**  
  Precise [rendering control](#workflow) by customizing the sentence text, timings, and voice selection.  
    
  <sup>*💡 Tip: the [Renderscript Editor](#renderscript-editor) makes this step easy*</sup>

- **Local Video Processing**  
  Process any local video files.

- **Background Audio Preservation**  
  Keeps the original background audio intact.

> *Discover more in the [release notes](https://github.com/KoljaB/TurnVoice/releases).*

## Prerequisites

- [Rubberband](https://breakfastquay.com/rubberband/) command-line utility installed [^2] 
- [ffmpeg](https://ffmpeg.org/download.html) command-line utility installed [^3]
  <details>
  <summary>To install ffmpeg with a package manager:</summary>

    - **On Ubuntu or Debian**:
        ```bash
        sudo apt update && sudo apt install ffmpeg
        ```

    - **On Arch Linux**:
        ```bash
        sudo pacman -S ffmpeg
        ```

    - **On macOS using Homebrew** ([https://brew.sh/](https://brew.sh/)):
        ```bash
        brew install ffmpeg
        ```

    - **On Windows using Chocolatey** ([https://chocolatey.org/](https://chocolatey.org/)):
        ```bash
        choco install ffmpeg
        ```

    - **On Windows using Scoop** ([https://scoop.sh/](https://scoop.sh/)):
        ```bash
        scoop install ffmpeg
        ```    
  </details>
- Huggingface conditions accepted for [Speaker Diarization](https://huggingface.co/pyannote/speaker-diarization-3.1) and [Segmentation](https://huggingface.co/pyannote/segmentation-3.0)
- Huggingface access token in env variable HF_ACCESS_TOKEN [^4]
> [!TIP]
> *Set your [HF token](https://huggingface.co/settings/tokens) with `setx HF_ACCESS_TOKEN "your_token_here"`*
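
On Linux or macOS, use `export` instead (per-session; add it to your shell profile to persist it):

```bash
export HF_ACCESS_TOKEN="your_token_here"
```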

## Installation 

```
pip install turnvoice
```

> [!TIP]
> For faster rendering with a GPU, prepare your [CUDA](https://pytorch.org/get-started/locally/) environment after installation:
> 
> ***For CUDA 11.8***  
> `pip install torch==2.1.1+cu118 torchaudio==2.1.1+cu118 --index-url https://download.pytorch.org/whl/cu118`  
>   
> ***For CUDA 12.1***  
> `pip install torch==2.1.1+cu121 torchaudio==2.1.1+cu121 --index-url https://download.pytorch.org/whl/cu121`  
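
To confirm the CUDA build of PyTorch is active afterwards, a quick sanity check should do:

```bash
python -c "import torch; print(torch.cuda.is_available())"
```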

## Usage

```bash
turnvoice [-i] <YouTube URL|ID|Local File> [-l] <Translation Language> -e <Engine(s)> -v <Voice(s)> -o <Output File>
```

Submit a string to the voice parameter for each speaker voice you wish to use. If you specify engines, the voices are assigned to those engines in the order they are listed. Should there be more voices than engines, the first engine is used for the excess voices. If no engine is specified, the Coqui engine is used as the default. If no voices are defined, a default voice is selected for each engine.
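
For instance, assuming the multi-value parameters take space-separated lists (see [Parameters Explained](#parameters-explained) below), a hypothetical two-engine setup could look like this:

```bash
# Hypothetical example: voice1.wav -> coqui, Giovanni -> elevenlabs;
# the excess voice2.wav falls back to the first engine (coqui)
turnvoice -i AmC9SmCBUj4 -e coqui elevenlabs -v voice1.wav Giovanni voice2.wav
```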

### Example Command:

Arthur Morgan narrating a cooking tutorial:

```bash
turnvoice -i AmC9SmCBUj4 -v arthur.wav -o cooking_with_arthur.mp4
```

> [!NOTE]
> *Requires the cloning voice file (e.g., arthur.wav or .json) in the same directory (you'll find an example in the tests directory).*

## Workflow

### Preparation

Prepare a script with transcription, speaker diarization (and optionally translation or prompting) using:

```bash
turnvoice https://www.youtube.com/watch?v=cOg4J1PxU0c --prepare
```

Translation and prompts should be applied in this preparation step. Engines or voices come later in the render step.
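
For example, a preparation run that also translates to Spanish and applies a speaking-style prompt (a sketch combining parameters documented below) might look like:

```bash
turnvoice https://www.youtube.com/watch?v=cOg4J1PxU0c --prepare -l es -p "speaking style of captain jack sparrow"
```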

### Renderscript Editor

1. **Open script**  
  Open the [editor.html](https://github.com/KoljaB/TurnVoice/blob/main/turnvoice/editor/editor.html) file. Click the file-open button and navigate to the folder you started turnvoice from, open the download folder, then the folder named after the video, and open the file full_script.txt.
2. **Edit**  
  The editor visualizes the transcript and speaker diarization results and starts playing the original video. While it plays, verify the texts, start times, and speaker assignments, and adjust them wherever the detection went wrong.
3. **Save**  
  Save the script. Remember the path to the file.

### Rendering

Render the refined script to generate the final video using:

```bash
turnvoice https://www.youtube.com/watch?v=cOg4J1PxU0c --render <path_to_script>
```

Adjust the path in the displayed CLI command (the editor cannot read the file's full path from within the browser).

Assign engines and voices to each speaker track with the -e and -v parameters.
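
For example, assuming two speaker tracks in the script, this sketch would give the first a Coqui clone and the second an Elevenlabs voice (space-separated multi-value syntax assumed):

```bash
turnvoice https://www.youtube.com/watch?v=cOg4J1PxU0c --render <path_to_script> -e coqui elevenlabs -v speaker1.wav Giovanni
```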

## Parameters Explained:

- `-i`, `--in`: Input video. Accepts a YouTube video URL or ID, or a path to a local video file.
- `-l`, `--language`: Language for translation. Coqui synthesis supports: en, es, fr, de, it, pt, pl, tr, ru, nl, cs, ar, zh, ja, hu, ko. Omit to retain the original video language.
- `-il`, `--input_language`: Language code for transcription, set if automatic detection fails.
- `-v`, `--voice`: Voices for synthesis. Accepts multiple values to replace more than one speaker.
- `-o`, `--output_video`: Filename for the final output video (default: 'final_cut.mp4').
- `-a`, `--analysis`: Print transcription and speaker analysis without synthesizing or rendering the video.
- `-from`: Time to start processing the video from.
- `-to`: Time to stop processing the video at.
- `-e`, `--engine`: Engine(s) to synthesize with. Can be coqui, elevenlabs, azure, openai or system. Accepts multiple values, linked to the submitted voices. 
- `-s`, `--speaker`: Speaker number to be transformed.
- `-snum`, `--num_speakers`: Helps diarization. Specify the exact number of speakers in the video if you know it in advance. 
- `-smin`, `--min_speakers`: Helps diarization. Specify the minimum number of speakers in the video if you know it in advance. 
- `-smax`, `--max_speakers`: Helps diarization. Specify the maximum number of speakers in the video if you know it in advance. 
- `-dd`, `--download_directory`: Directory for saving downloaded files (default: 'downloads').
- `-sd`, `--synthesis_directory`: Directory for saving synthesized audio files (default: 'synthesis').
- `-exoff`, `--extractoff`: Disables extraction of audio from the video file; audio and video are downloaded from the internet instead.
- `-c`, `--clean_audio`: Removes original audio from the final video, resulting in clean synthesis.
- `-tf`, `--timefile`: Define timestamp file(s) for processing (functions like multiple --from/--to commands).
- `-p`, `--prompt`: Define a prompt to apply a style change to sentences like "speaking style of captain jack sparrow" [^6]
- `-prep`, `--prepare`: Writes the full script with speaker analysis, sentence transformation, and translation, but performs no synthesis or rendering. Processing can be continued later with --render.
- `-r`, `--render`: Takes a full script and performs only synthesis and rendering on it; no speaker analysis, sentence transformation, or translation. 

> `-i` and `-l` can be used as both positional and optional arguments.
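
Putting several of these together: the sketch below processes only a short excerpt of a two-speaker video and writes to a custom file name. The time format for `-from`/`-to` shown here is an assumption; check `turnvoice --help` for the accepted format.

```bash
# Assumed mm:ss time format; -snum hints the diarization at exactly 2 speakers
turnvoice -i AmC9SmCBUj4 -from 0:00 -to 2:00 -snum 2 -o short_excerpt.mp4
```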

## Coqui Engine

The Coqui engine is the default if no other engine is specified with the -e parameter.

<details>
<summary>To use voices from Coqui:</summary>

#### Voices (-v parameter)

Submit the path(s) to one or more audio files containing 16-bit, 24 kHz mono source material as reference WAVs.

Example:
```
turnvoice https://www.youtube.com/watch?v=cOg4J1PxU0c -e coqui -v female.wav
```

#### The Art of Choosing a Reference Wav
- A 24000, 44100, or 22050 Hz, 16-bit mono WAV file of 10-30 seconds is your golden ticket. 
- 24 kHz, 16-bit mono is my default, but for some voices I found 44100 Hz, 32-bit to yield the best results.
- I test voices [with this tool](https://github.com/KoljaB/RealtimeTTS/blob/master/tests/coqui_test.py) before rendering.
- Audacity is your friend for adjusting sample rates (or use the ffmpeg sketch below). Experiment with sample rates for best results!
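
If you prefer the command line over Audacity, ffmpeg can resample a clip into the default reference format (a sketch; adjust the rates as you experiment):

```bash
# Convert any audio clip to a 24 kHz, 16-bit, mono WAV for use as a reference
ffmpeg -i source_clip.mp3 -ar 24000 -ac 1 -sample_fmt s16 reference.wav
```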

#### Fixed TTS Model Download Folder
Keep your models organized! Set `COQUI_MODEL_PATH` to your preferred folder.

Windows example:
```bash
setx COQUI_MODEL_PATH "C:\Downloads\CoquiModels"
```
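
Linux/macOS equivalent (per-session; add it to your shell profile to persist; the folder is just an example):

```bash
export COQUI_MODEL_PATH="$HOME/CoquiModels"
```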
</details>

## Elevenlabs Engine

> [!NOTE]
> To use Elevenlabs voices you need the [API Key](https://elevenlabs.io/docs/api-reference/text-to-speech#authentication) stored in env variable **ELEVENLABS_API_KEY**

All voices are synthesized with the multilingual-v1 model.

> [!CAUTION]
> Elevenlabs is a pricey API. Focus on short videos, and don't let a work-in-progress script like this run unattended on a pay-per-use API: a bug at the end of a long, expensive rendering process would be very annoying. 
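
One way to limit the risk is to render only a short excerpt first (the `-from`/`-to` time format is an assumption, as noted in the parameters section):

```bash
# Sanity-check the Elevenlabs voice on the first 30 seconds before a full run
turnvoice -i AmC9SmCBUj4 -e elevenlabs -v Giovanni -from 0:00 -to 0:30
```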

<details>
<summary>To use voices from Elevenlabs:</summary>

#### Voices (-v parameter)

Submit name(s) of either a generated or predefined voice.

Example:
```
turnvoice https://www.youtube.com/watch?v=cOg4J1PxU0c -e elevenlabs -v Giovanni
```

</details>  

> [!TIP]
> Test rendering with a free engine like Coqui before using pricey ones.

## OpenAI Engine

> [!NOTE]
> To use OpenAI TTS voices you need the [API Key](https://platform.openai.com/api-keys) stored in env variable **OPENAI_API_KEY**

<details>
<summary>To use voices from OpenAI:</summary>

#### Voice (-v parameter)

Submit the name of a voice: alloy, echo, fable, onyx, nova, or shimmer. Currently only a single OpenAI voice is supported.

Example:
```
turnvoice https://www.youtube.com/watch?v=cOg4J1PxU0c -e openai -v shimmer
```
</details>

## Azure Engine

> [!NOTE]
> To use Azure voices you need the [API Key](https://www.youtube.com/watch?v=HgYE2nJPaHA&t=57s) for SpeechService resource in **AZURE_SPEECH_KEY** and the [region identifier](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/regions) in **AZURE_SPEECH_REGION**

<details>
<summary>To use voices from Azure:</summary>

#### Voices (-v parameter)

Submit name(s) of either a generated or predefined voice.

Example:
```
turnvoice https://www.youtube.com/watch?v=BqnAeUoqFAM -e azure -v ChristopherNeural
```
</details>

## System Engine

<details>
<summary>To use system voices:</summary>

#### Voices (-v parameter)

Submit the name(s) of installed system voices as strings.

Example:
```
turnvoice https://www.youtube.com/watch?v=BqnAeUoqFAM -e system -v David
```
</details>

## What to expect

- early alpha / work-in-progress, so bugs might occur (please report them; I need to be aware of bugs to fix them)
- might not always achieve perfect lip synchronization, especially when translating to a different language
- speaker detection does not work that well; I'm probably doing something wrong, or perhaps the tech[^5] is not yet ready to be reliable
- the translation feature is currently an experimental prototype (powered by deep-translator) and still produces very imperfect results
- occasionally, the synthesis might introduce unexpected noises or distortions in the audio (artifact reduction got **way** better with the new v0.0.30 algorithm)
- spleeter might get confused when a spoken voice and background music with singing are present together in the source audio

## Source Quality

- delivers best results with YouTube videos featuring **clearly spoken** content (podcasts, educational videos)
- requires a high-quality, **clean** source WAV file for effective voice cloning 

## Pro Tips

### How to exchange a single speaker

First, perform a speaker analysis with the -a parameter:

```bash
turnvoice https://www.youtube.com/watch?v=2N3PsXPdkmM -a
```

Then select a speaker from the list with the -s parameter:

```bash
turnvoice https://www.youtube.com/watch?v=2N3PsXPdkmM -s 2
```
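
Combining both parameters then replaces only that speaker with the voice you supply, e.g.:

```bash
# Replace only speaker 2 with a cloned voice; other speakers stay untouched
turnvoice https://www.youtube.com/watch?v=2N3PsXPdkmM -s 2 -v arthur.wav
```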

## License

TurnVoice is proudly licensed under the [Coqui Public Model License 1.0.0](https://coqui.ai/cpml). 

# Contact 🤝

[Share](https://github.com/KoljaB/TurnVoice/discussions) your funniest or most creative TurnVoice creations with me! 

And if you've got a cool feature idea or just want to say hi, drop me a line on

- [Twitter](https://twitter.com/LonLigrin)  
- [Reddit](https://www.reddit.com/user/Lonligrin)  
- [EMail](mailto:kolja.beigel@web.de)  

If you like the repo, please leave a star  
✨ 🌟 ✨

[^1]: State is work-in-progress (early pre-alpha). Please expect CLI API changes to come, and sorry in advance if anything does not work as expected.  
  Developed on Python 3.11.4 under Win 10. 
[^2]: Rubberband is needed for pitch-preserving time-stretching of audio, to fit the synthesis into the available time window.
[^3]: ffmpeg is needed to convert mp3 files into wav.
[^4]: Huggingface access token is needed to download the speaker diarization model for identifying speakers with pyannote.audio.
[^5]: Speaker diarization is performed with the pyannote.audio default HF implementation on the vocals track split from the original audio.
[^6]: Generates costs. Uses gpt-4-1106-preview model and needs [OpenAI API Key](https://platform.openai.com/api-keys) stored in env variable **OPENAI_API_KEY**.
[^7]: Generates costs. [Elevenlabs](#elevenlabs-engine) is pricey; [OpenAI TTS](#openai-engine) and [Azure](#azure-engine) are affordable. Needs API keys stored in env variables; see the engine sections for details.

            
