# 🎬 DuoSubs
Merging subtitles using only the nearest timestamp often leads to incorrect pairings:
lines may end up out of sync, duplicated, or mismatched.
This Python tool uses **semantic similarity**
(via [Sentence Transformers](https://www.sbert.net/)) to align subtitle lines based on
**meaning** instead of timestamps, making it possible to pair subtitles across
**different languages**.
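
As a rough illustration of matching by meaning, the sketch below pairs each primary line with the secondary line whose embedding has the highest cosine similarity. The three-dimensional vectors are made-up stand-ins for real Sentence Transformer embeddings, and `cosine_similarity` and `best_match` are hypothetical helpers, not part of the DuoSubs API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query: list[float], candidates: list[list[float]]) -> int:
    """Return the index of the candidate most similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings (assumed values): lines with similar meaning point
# in similar directions, regardless of the language they are written in.
primary = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.6]]
secondary = [[0.1, 0.7, 0.7], [1.0, 0.2, 0.1]]

# Pair every primary line with its most similar secondary line.
pairs = [(i, best_match(vec, secondary)) for i, vec in enumerate(primary)]
print(pairs)  # [(0, 1), (1, 0)]
```
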
---
## ✨ Features
- 📌 Aligns subtitle lines based on **meaning**, not timing
- 🌍 **Multilingual** support, depending on the **user**-selected
[Sentence Transformer model](https://huggingface.co/models?library=sentence-transformers)
- 🧩 Easy-to-use **API** for integration
- 💻 **Command-line interface** with customizable options
- 📄 Flexible format support: works with **SRT**, **VTT**, **MPL2**, **TTML**, **ASS**, and
**SSA** files
---
## 🛠️ Installation
1. Install the correct version of PyTorch for your system by following the official
instructions: https://pytorch.org/get-started/locally
2. Install DuoSubs via pip:
```bash
pip install duosubs
```
---
## 🚀 Usage
With the [demo files](https://github.com/CK-Explorer/DuoSubs/blob/main/demo/) provided, here is the simplest way to get started:
- via command line
```bash
duosubs -p demo/primary_sub.srt -s demo/secondary_sub.srt
```
- via Python API
```python
from duosubs import MergeArgs, run_merge_pipeline
# Store all arguments
args = MergeArgs(
    primary="demo/primary_sub.srt",
    secondary="demo/secondary_sub.srt"
)
# Load, merge, and save subtitles.
run_merge_pipeline(args, print)
```
Either approach produces [primary_sub.zip](https://github.com/CK-Explorer/DuoSubs/blob/main/demo/primary_sub.zip), with the following structure:
```text
primary_sub.zip
├── primary_sub_combined.ass # Merged subtitles
├── primary_sub_primary.ass # Original primary subtitles
└── primary_sub_secondary.ass # Time-shifted secondary subtitles
```
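
To check the result programmatically, a short sketch using Python's standard `zipfile` module is given here. The helper names `list_archive_members` and `extract_member` are hypothetical, and the archive location is an assumption; adjust the path to wherever the ZIP file was written on your system.

```python
import zipfile


def list_archive_members(zip_path: str) -> list[str]:
    """Return the file names stored in a ZIP archive, sorted alphabetically."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())


def extract_member(zip_path: str, member: str, dest_dir: str) -> str:
    """Extract a single member into dest_dir and return the extracted path."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.extract(member, path=dest_dir)


# Example usage (assumes primary_sub.zip is in the current directory):
#   list_archive_members("primary_sub.zip")
#   extract_member("primary_sub.zip", "primary_sub_combined.ass", "output")
```
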
By default, the Sentence Transformer model used is
[LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
If you want to experiment with a different model, pick one from
[🤗 Hugging Face](https://huggingface.co/models?library=sentence-transformers)
or consult the
[leaderboard](https://huggingface.co/spaces/mteb/leaderboard)
for top-performing models.
For example, if the model chosen is
[Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B),
you can run:
- via command line
```bash
duosubs -p demo/primary_sub.srt -s demo/secondary_sub.srt --model Qwen/Qwen3-Embedding-0.6B
```
- via Python API
```python
from duosubs import MergeArgs, run_merge_pipeline
# Store all arguments
args = MergeArgs(
    primary="demo/primary_sub.srt",
    secondary="demo/secondary_sub.srt",
    model="Qwen/Qwen3-Embedding-0.6B"
)
# Load, merge, and save subtitles.
run_merge_pipeline(args, print)
```
> ⚠️ **Warning**
> - Some models, especially larger ones, may require significant RAM or GPU memory (VRAM) to run and might not be compatible with all devices.
> - Also, please ensure the selected model supports your desired language for reliable results.

To learn more about this tool, please see the
[documentation](https://duosubs.readthedocs.io/en/latest/).
---
## 📚 Behind the Scenes
1. Parse subtitles and detect language.
2. Tokenize subtitle lines.
3. Extract and filter non-overlapping subtitles. *(Optional)*
4. Estimate tokenized subtitle pairings using DTW.
5. Refine alignment using a sliding window approach.
6. Combine aligned and non-overlapping subtitles.
7. Eliminate unnecessary newlines within subtitle lines.
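
To give a feel for step 4, here is a minimal sketch of textbook dynamic time warping (DTW) over a cost matrix, where each cost could be, for example, one minus the semantic similarity of a primary and a secondary token. This is not the project's implementation (DuoSubs relies on fastdtw, per the acknowledgements); `dtw_path` and the cost values are illustrative assumptions.

```python
def dtw_path(cost: list[list[float]]) -> list[tuple[int, int]]:
    """Return the minimum-cost monotonic alignment path through a cost matrix."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j]: cheapest cumulative cost of aligning the first i rows
    # with the first j columns.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1],  # advance both sequences
                acc[i - 1][j],      # advance rows only
                acc[i][j - 1],      # advance columns only
            )
    # Backtrack from the bottom-right corner to recover the path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps, key=lambda step: step[0])
    path.reverse()
    return path


# Assumed toy costs: low values on the diagonal mean line k of one file
# matches line k of the other.
toy_cost = [
    [0.1, 0.9, 0.8],
    [0.9, 0.2, 0.7],
    [0.8, 0.9, 0.1],
]
print(dtw_path(toy_cost))  # [(0, 0), (1, 1), (2, 2)]
```
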
---
## 🚫 Known Limitations
- The **accuracy** of the merging process **depends** on the
[model](https://huggingface.co/models?library=sentence-transformers) selected.
- Some models may produce **unreliable results** for **unsupported** or low-resource **languages**.
- Some sentence **fragments** from secondary subtitles may be **misaligned** with the
primary subtitle lines due to the tokenization algorithm used.
- **Secondary** subtitles might **contain extra whitespace** as a result of token-level merging.
- The algorithm may **not** work reliably if the **timestamps** of some matching lines
**don't overlap** at all. See [special case](#-special-case).
---
## 🧩 Special Case
For the last known limitation, if both subtitle files are **known** to be
**perfectly semantically aligned**, meaning:
* **matching dialogue contents**
* **no extra lines** such as scene annotations or bonus Director's Cut material

then **enable** the `ignore-non-overlap-filter` option in either:
* CLI (`--ignore-non-overlap-filter`)
* Python API (see [documentation](https://duosubs.readthedocs.io/en/latest/))

to skip the overlap check; the merge should go smoothly from there.

⚠️ If the subtitle **timings** are **off** and the two subtitle files
**don't fully match in content**, the algorithm likely **won't** produce good results. Still,
you can try running it with `ignore-non-overlap-filter` **enabled**.
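
For intuition about what "timestamps don't overlap" means, the following sketch flags primary lines whose time span shares no duration with any secondary line. It only illustrates the idea; how DuoSubs implements its own overlap filter is not described here, and `intervals_overlap`, `lines_without_overlap`, and the millisecond timings are assumptions made for this example.

```python
def intervals_overlap(start_a: int, end_a: int, start_b: int, end_b: int) -> bool:
    """Return True if two time intervals (in milliseconds) share any duration."""
    return max(start_a, start_b) < min(end_a, end_b)


def lines_without_overlap(
    primary: list[tuple[int, int]], secondary: list[tuple[int, int]]
) -> list[int]:
    """Return indices of primary intervals that overlap no secondary interval."""
    return [
        index
        for index, (p_start, p_end) in enumerate(primary)
        if not any(
            intervals_overlap(p_start, p_end, s_start, s_end)
            for s_start, s_end in secondary
        )
    ]


# Assumed (start, end) timings in milliseconds for each subtitle line.
primary_times = [(0, 2000), (2500, 4000), (9000, 9500)]
secondary_times = [(100, 1900), (2600, 4200), (6000, 7000)]

# Only the third primary line has no overlapping secondary line.
print(lines_without_overlap(primary_times, secondary_times))  # [2]
```
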
---
## 🙏 Acknowledgements
This project wouldn't be possible without the incredible work of the open-source community.
Special thanks to:
- [sentence-transformers](https://github.com/UKPLab/sentence-transformers) - for the semantic
embedding backbone
- [Hugging Face](https://huggingface.co/) - for hosting models and making them easy to use
- [PyTorch](https://pytorch.org/) - for providing the deep learning framework
- [fastdtw](https://github.com/slaypni/fastdtw) - for aligning the subtitles
- [lingua-py](https://github.com/pemistahl/lingua-py) - for detecting the subtitles' language codes
- [pysubs2](https://github.com/tkarabela/pysubs2) - for subtitle file I/O utilities
- [charset_normalizer](https://github.com/jawah/charset_normalizer) - for identifying the file
encoding
- [typer](https://github.com/fastapi/typer) - for the CLI application
- [tqdm](https://github.com/tqdm/tqdm) - for displaying the progress bar
- [Tears of Steel](https://mango.blender.org/) - subtitles used for demo, testing and development
purposes. Created by the
[Blender Foundation](https://mango.blender.org/), licensed under
[CC BY 3.0](http://creativecommons.org/licenses/by/3.0/).
---
## 🤝 Contributing
Contributions are welcome! If you'd like to submit a pull request, please check out the
[contributing guidelines](https://github.com/CK-Explorer/DuoSubs/blob/main/CONTRIBUTING.md).
---
## 🔑 License
Apache-2.0 license - see the [LICENSE](https://github.com/CK-Explorer/DuoSubs/blob/main/LICENSE) file for details.