duosubs

Name	duosubs
Version	0.2.0
home_page	None
Summary	Semantic subtitle aligner and merger for bilingual subtitle syncing.
upload_time	2025-07-23 07:48:09
maintainer	None
docs_url	None
author	CK-Explorer
requires_python	>=3.10
license	None
keywords	subtitles alignment merging sentence-transformers sentence-similarity bilingual nlp

# 🎬 DuoSubs

[![CI](https://github.com/CK-Explorer/DuoSubs/actions/workflows/ci.yml/badge.svg)](https://github.com/CK-Explorer/DuoSubs/actions/workflows/ci.yml)
[![PyPI version](https://img.shields.io/pypi/v/duosubs.svg)](https://pypi.org/project/duosubs/)
[![Python Versions](https://img.shields.io/pypi/pyversions/duosubs.svg)](https://pypi.org/project/duosubs/)
[![License: Apache-2.0](https://img.shields.io/badge/license-Apache--2.0-blueviolet.svg)](LICENSE)
[![Type Checked: Mypy](https://img.shields.io/badge/type%20checked-mypy-blue)](http://mypy-lang.org/)
[![Code Style: Ruff](https://img.shields.io/badge/code%20style-ruff-blue?logo=python&labelColor=gray)](https://github.com/astral-sh/ruff)
[![codecov](https://codecov.io/gh/CK-Explorer/DuoSubs/branch/main/graph/badge.svg)](https://codecov.io/gh/CK-Explorer/DuoSubs)
[![Documentation Status](https://readthedocs.org/projects/duosubs/badge/?version=latest)](https://duosubs.readthedocs.io/en/latest/?badge=latest)

Merging subtitles using only the nearest timestamp often leads to incorrect pairings
— lines may end up out of sync, duplicated, or mismatched.

This Python tool uses **semantic similarity** 
(via [Sentence Transformers](https://www.sbert.net/)) to align subtitle lines based on 
**meaning** instead of timestamps — making it possible to pair subtitles across 
**different languages**.
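
To make "aligning by meaning" concrete, here is a minimal sketch of similarity-based matching. It is not DuoSubs' implementation: the vectors are hypothetical 3-dimensional stand-ins for the embeddings a Sentence Transformer model would produce, and `cosine_similarity` and `best_match` are illustrative helper names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query: list[float], candidates: list[list[float]]) -> int:
    """Return the index of the candidate most similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy "embeddings" (hypothetical values, for illustration only).
primary_line = [0.9, 0.1, 0.0]
secondary_lines = [
    [0.0, 1.0, 0.2],  # points in a different direction: unrelated meaning
    [0.8, 0.2, 0.1],  # points in nearly the same direction: similar meaning
]
print(best_match(primary_line, secondary_lines))  # prints 1
```

With a multilingual model, sentences that mean the same thing in different languages are expected to land close together in the embedding space, which is what makes this kind of comparison usable across languages.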

---

## ✨ Features

- 📌 Aligns subtitle lines based on **meaning**, not timing
- 🌍 **Multilingual** support, determined by the **user-selected** 
[Sentence Transformer model](https://huggingface.co/models?library=sentence-transformers)
- 🧩 Easy-to-use **API** for integration
- 💻 **Command-line interface** with customizable options
- 📄 Flexible format support — works with **SRT**, **VTT**, **MPL2**, **TTML**, **ASS**, 
**SSA** files

---

## 🛠️ Installation

1. Install the correct version of PyTorch for your system by following the official 
instructions: https://pytorch.org/get-started/locally
2. Install this package via pip:
    ```bash
    pip install duosubs
    ```

---

## 🚀 Usage

With the [demo files](https://github.com/CK-Explorer/DuoSubs/blob/main/demo/) provided, here is the simplest way to get started:

- via command line

    ```bash
    duosubs -p demo/primary_sub.srt -s demo/secondary_sub.srt
    ```

- via Python API

    ```python
    from duosubs import MergeArgs, run_merge_pipeline

    # Store all arguments
    args = MergeArgs(
        primary="demo/primary_sub.srt",
        secondary="demo/secondary_sub.srt"
    )

    # Load, merge, and save subtitles.
    run_merge_pipeline(args, print)
    ```

Both approaches produce [primary_sub.zip](https://github.com/CK-Explorer/DuoSubs/blob/main/demo/primary_sub.zip), with the following structure:

```text
primary_sub.zip
├── primary_sub_combined.ass   # Merged subtitles
├── primary_sub_primary.ass    # Original primary subtitles
└── primary_sub_secondary.ass  # Time-shifted secondary subtitles
```

By default, the Sentence Transformer model used is 
[LaBSE](https://huggingface.co/sentence-transformers/LaBSE).

To experiment with a different model, pick one from
[🤗 Hugging Face](https://huggingface.co/models?library=sentence-transformers) 
or consult the
[leaderboard](https://huggingface.co/spaces/mteb/leaderboard)
for top-performing models.

For example, if the model chosen is 
[Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B), 
you can run:

- via command line

    ```bash
    duosubs -p demo/primary_sub.srt -s demo/secondary_sub.srt --model Qwen/Qwen3-Embedding-0.6B
    ```

- via Python API

    ```python
    from duosubs import MergeArgs, run_merge_pipeline

    # Store all arguments
    args = MergeArgs(
        primary="demo/primary_sub.srt",
        secondary="demo/secondary_sub.srt",
        model="Qwen/Qwen3-Embedding-0.6B"
    )

    # Load, merge, and save subtitles.
    run_merge_pipeline(args, print)
    ```

> ⚠️ **Warning**  
> - Some models, especially larger ones, may require significant RAM or GPU memory (VRAM) and might not run on all devices. 
> - Also, please ensure the selected model supports your desired language for reliable results.

To learn more about this tool, please see the 
[documentation](https://duosubs.readthedocs.io/en/latest/).

---

## 📚 Behind the Scenes

1. Parse subtitles and detect language.
2. Tokenize subtitle lines.
3. Extract and filter non-overlapping subtitles. *(Optional)*
4. Estimate tokenized subtitle pairings using DTW.
5. Refine alignment using a sliding window approach.
6. Combine aligned and non-overlapping subtitles.
7. Eliminate unnecessary newlines within subtitle lines.
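
To illustrate the DTW step, here is a minimal sketch of classic dynamic time warping over a precomputed cost matrix. It is not the project's code: the project credits `fastdtw`, an approximate variant, whereas this is the exact quadratic-time version, and the function name `dtw_path` and the cost values are made up for the example. A cost could be, for instance, one minus the cosine similarity of two token embeddings.

```python
def dtw_path(cost: list[list[float]]) -> list[tuple[int, int]]:
    """Return the minimum-cost monotonic alignment path through a cost matrix.

    cost[i][j] is the dissimilarity between item i of the first sequence
    and item j of the second sequence.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] = cheapest cumulative cost of aligning the first i items
    # of one sequence with the first j items of the other.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1],  # advance both sequences
                acc[i - 1][j],      # advance the first sequence only
                acc[i][j - 1],      # advance the second sequence only
            )
    # Backtrack from the bottom-right corner to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


# Hypothetical costs: 3 primary tokens vs. 4 secondary tokens, where the
# second primary token corresponds to two consecutive secondary tokens.
cost = [
    [0.1, 0.9, 0.9, 0.9],
    [0.9, 0.2, 0.1, 0.9],
    [0.9, 0.9, 0.9, 0.1],
]
print(dtw_path(cost))  # prints [(0, 0), (1, 1), (1, 2), (2, 3)]
```

Because the path only ever moves forward in both sequences, DTW preserves dialogue order while still allowing one item to pair with several items on the other side.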

---

## 🚫 Known Limitations

- The **accuracy** of the merging process **depends** on the 
[model](https://huggingface.co/models?library=sentence-transformers) selected.
- Some models may produce **unreliable results** for **unsupported** or low-resource **languages**.
- Some sentence **fragments** from secondary subtitles may be **misaligned** with the 
primary subtitle lines due to the tokenization algorithm used.
- **Secondary** subtitles might **contain extra whitespace** as a result of token-level merging.
- The algorithm may **not** work reliably if the **timestamps** of some matching lines
**don’t overlap** at all. See [special case](#-special-case).
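
To clarify what "don't overlap" means in the last item, the sketch below checks whether time intervals intersect. It is only an illustration under simple assumptions: cues are `(start_ms, end_ms)` tuples, and the names `overlaps` and `lines_without_overlap` are hypothetical. It does not reproduce the tool's actual filter.

```python
Cue = tuple[int, int]  # (start_ms, end_ms)


def overlaps(a: Cue, b: Cue) -> bool:
    """Return True if the two time intervals share any duration."""
    return a[0] < b[1] and b[0] < a[1]


def lines_without_overlap(primary: list[Cue], secondary: list[Cue]) -> list[int]:
    """Return indices of primary cues that overlap no secondary cue."""
    return [
        i
        for i, cue in enumerate(primary)
        if not any(overlaps(cue, other) for other in secondary)
    ]


# Hypothetical timings in milliseconds.
primary = [(0, 2000), (3000, 5000), (9000, 10000)]
secondary = [(500, 2500), (4000, 6000)]
print(lines_without_overlap(primary, secondary))  # prints [2]
```

Here the third primary cue has no time in common with any secondary cue, which is the situation the last limitation describes.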

---

## 🧩 Special Case

For the last known limitation, if both subtitle files are **known** to be 
**perfectly semantically aligned**, meaning:

* **matching dialogue contents**
* **no extra lines** like scene annotations or bonus Director’s Cut stuff.

Then, just **enable** the `ignore-non-overlap-filter` option in either: 

* CLI (`--ignore-non-overlap-filter`)
* Python API (see [documentation](https://duosubs.readthedocs.io/en/latest/))

to skip the overlap check — the merge should go smoothly from there.

⚠️ If the subtitle **timings** are **off** and the two subtitle files 
**don’t fully match in content**, the algorithm likely **won’t** produce great results. Still, 
you can try running it with `ignore-non-overlap-filter` **enabled**.

---

## 🙏 Acknowledgements

This project wouldn't be possible without the incredible work of the open-source community. 
Special thanks to:

- [sentence-transformers](https://github.com/UKPLab/sentence-transformers) — for the semantic 
embedding backbone
- [Hugging Face](https://huggingface.co/) — for hosting models and making them easy to use
- [PyTorch](https://pytorch.org/) — for providing the deep learning framework
- [fastdtw](https://github.com/slaypni/fastdtw) — for aligning the subtitles
- [lingua-py](https://github.com/pemistahl/lingua-py) — for detecting the subtitles' language codes
- [pysubs2](https://github.com/tkarabela/pysubs2) — for subtitle file I/O utilities
- [charset_normalizer](https://github.com/jawah/charset_normalizer) — for identifying the file 
encoding
- [typer](https://github.com/fastapi/typer) — for building the CLI application
- [tqdm](https://github.com/tqdm/tqdm) — for displaying progress bars
- [Tears of Steel](https://mango.blender.org/) — subtitles used for demo, testing and development 
purposes. Created by the 
[Blender Foundation](https://mango.blender.org/), licensed under 
[CC BY 3.0](http://creativecommons.org/licenses/by/3.0/).

---

## 🤝 Contributing

Contributions are welcome! If you'd like to submit a pull request, please check out the
 [contributing guidelines](https://github.com/CK-Explorer/DuoSubs/blob/main/CONTRIBUTING.md).

---

## 🔑 License

Apache-2.0 license - see the [LICENSE](https://github.com/CK-Explorer/DuoSubs/blob/main/LICENSE) file for details.

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "duosubs",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "subtitles, alignment, merging, sentence-transformers, sentence-similarity, bilingual, nlp",
    "author": "CK-Explorer ",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/5c/28/da5ee530b2ef709b511dda0fd70dade75ea4f8bac572cb6b02725aff01c2/duosubs-0.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Semantic subtitle aligner and merger for bilingual subtitle syncing.",
    "version": "0.2.0",
    "project_urls": {
        "Documentation": "https://duosubs.readthedocs.io/en/latest/",
        "Homepage": "https://github.com/CK-Explorer/DuoSubs",
        "Repository": "https://github.com/CK-Explorer/DuoSubs"
    },
    "split_keywords": [
        "subtitles",
        " alignment",
        " merging",
        " sentence-transformers",
        " sentence-similarity",
        " bilingual",
        " nlp"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1ad76783a74dac0d95d3febd8a7f976e2b47918ec95718a0464391e5690d0156",
                "md5": "e42215fa7adb54f65a1ba24f732eb490",
                "sha256": "ca0aecf0530aa7636a39aaa01d935d2c44c7e1047bd090810e5116a63cc34a66"
            },
            "downloads": -1,
            "filename": "duosubs-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e42215fa7adb54f65a1ba24f732eb490",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 34023,
            "upload_time": "2025-07-23T07:48:08",
            "upload_time_iso_8601": "2025-07-23T07:48:08.195042Z",
            "url": "https://files.pythonhosted.org/packages/1a/d7/6783a74dac0d95d3febd8a7f976e2b47918ec95718a0464391e5690d0156/duosubs-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "5c28da5ee530b2ef709b511dda0fd70dade75ea4f8bac572cb6b02725aff01c2",
                "md5": "1ea1465caf757ec438842cf84e970a0c",
                "sha256": "b43e77eb9cd897163a7327dff7dfc6ab3252cb2b7d2bf33a56f594e78131472b"
            },
            "downloads": -1,
            "filename": "duosubs-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "1ea1465caf757ec438842cf84e970a0c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 32107,
            "upload_time": "2025-07-23T07:48:09",
            "upload_time_iso_8601": "2025-07-23T07:48:09.302935Z",
            "url": "https://files.pythonhosted.org/packages/5c/28/da5ee530b2ef709b511dda0fd70dade75ea4f8bac572cb6b02725aff01c2/duosubs-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-23 07:48:09",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "CK-Explorer",
    "github_project": "DuoSubs",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "duosubs"
}
