cleanit

  - Name: cleanit
  - Version: 0.4.8
  - Home page: <https://github.com/ratoaq2/cleanit>
  - Summary: Subtitles extremely clean
  - Upload time: 2024-06-23 06:19:14
  - Author: Rato
  - Requires Python: `<4.0.0,>=3.9.0`
  - License: Apache-2.0
  - Keywords: subtitle, subtitles, clean, denylist, replace, ocr, fix, tidy

# CleanIt

Subtitles extremely clean.

[![Latest
Version](https://img.shields.io/pypi/v/cleanit.svg)](https://pypi.python.org/pypi/cleanit)

[![tests](https://github.com/ratoaq2/cleanit/actions/workflows/test.yml/badge.svg)](https://github.com/ratoaq2/cleanit/actions/workflows/test.yml)

[![License](https://img.shields.io/github/license/ratoaq2/cleanit.svg)](https://github.com/ratoaq2/cleanit/blob/master/LICENSE)

  - Project page  
    <https://github.com/ratoaq2/cleanit>

**CleanIt** is a command-line tool that helps you keep your subtitles
clean. You can define your own rules to detect entries to remove or
patterns to replace, using either simple text matching or regular
expressions. It ships with these standard rules out of the box:

  - ocr: Fix common OCR errors
  - tidy: Fix common formatting issues (e.g. extra or missing spaces
    after punctuation)
  - no-sdh: Remove SDH descriptions
  - no-lyrics: Remove lyrics
  - no-spam: Remove ads and spam
  - no-style: Remove font style tags like \<i\> and \<b\>
  - minimal: includes only ocr and tidy rules
  - default: includes all rules except no-style
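
The two composite tags can be read as unions of the individual ones. The sketch below models that reading with plain Python sets. `BASE_TAGS`, `TAG_GROUPS` and `expand_tags` are hypothetical names used only for illustration; they are not part of CleanIt's API and may not reflect how CleanIt resolves tags internally.

```python
# Hypothetical model of the built-in tags; not CleanIt's actual implementation.
BASE_TAGS = {"ocr", "tidy", "no-sdh", "no-lyrics", "no-spam", "no-style"}

TAG_GROUPS = {
    "minimal": {"ocr", "tidy"},            # only ocr and tidy
    "default": BASE_TAGS - {"no-style"},   # everything except no-style
}


def expand_tags(tags):
    """Expand composite tags into the set of individual tags they cover."""
    expanded = set()
    for tag in tags:
        # Unknown or individual tags expand to themselves.
        expanded |= TAG_GROUPS.get(tag, {tag})
    return expanded


print(sorted(expand_tags(["minimal"])))
print(expand_tags(["default", "no-style"]) == BASE_TAGS)
```

Under this model, asking for `default` together with `no-style` covers every individual tag in the list.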

## Usage

### CLI

Clean subtitles:

    $ cat mysubtitle.en.srt
    1
    00:00:46,464 --> 00:00:48,549
    -And then what?
    -| don't know.
    
    2
    00:49:07,278 --> 00:49:09,363
    - If you cross the sea
    with an army you bought ...
    
    
    $ cleanit -t default mysubtitle.en.srt
    1 subtitle collected / 0 subtitle filtered out / 0 path ignored
    1 subtitle saved / 0 subtitle unchanged
    
    $ cat mysubtitle.en.srt
    1
    00:00:46,464 --> 00:00:48,549
    - And then what?
    - I don't know.
    
    2
    00:49:07,278 --> 00:49:09,363
    If you cross the sea
    with an army you bought...
    
    
    $ cleanit -t ocr -t no-sdh -t tidy -l en -l pt-BR ~/subtitles/
    423 subtitles collected / 107 subtitles filtered out / 0 path ignored
    Cleaning subtitles  [####################################]  100%
    268 subtitles saved / 155 subtitles unchanged

Using Docker:

    $ docker run -it --rm -v /medias:/medias -u $(id -u username):$(id -g username) ratoaq2/cleanit -t default /medias
    1072 subtitles collected / 0 subtitle filtered out / 0 path ignored
    Cleaning subtitles  [####################################]  100%
    980 subtitles saved / 92 subtitles unchanged
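
If you want to drive the CLI from a script, one option is to assemble the same arguments in Python. This is a minimal sketch that uses only the `-t` and `-l` options shown above; `build_cleanit_command` and `run_cleanit` are hypothetical helper names, and `run_cleanit` assumes the `cleanit` executable is available on `PATH`.

```python
import subprocess


def build_cleanit_command(path, tags=(), languages=()):
    """Build the argument list for a `cleanit` invocation."""
    command = ["cleanit"]
    for tag in tags:
        command += ["-t", tag]
    for language in languages:
        command += ["-l", language]
    command.append(str(path))
    return command


def run_cleanit(path, tags=(), languages=()):
    """Run cleanit and return its console output (needs cleanit on PATH)."""
    result = subprocess.run(
        build_cleanit_command(path, tags, languages),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Only builds the argument list; nothing is executed here.
print(build_cleanit_command("/medias", ["ocr", "no-sdh", "tidy"], ["en", "pt-BR"]))
```

Because the arguments are passed as a list without a shell, a path such as `~/subtitles/` is not expanded automatically; expand it first (for example with `os.path.expanduser`) if needed.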

### API

``` python
from cleanit import Config, Subtitle

sub = Subtitle('/subtitle/path/subtitle.en.srt')
cfg = Config.from_path('/config/path')
rules = cfg.select_rules(tags={'ocr'})
if sub.clean(rules):
    sub.save()
```
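
The same calls can be wrapped in a loop to clean a whole directory. The following is a sketch rather than a recommended pattern: `find_subtitles` and `clean_directory` are hypothetical helpers, the `*.<language>.srt` naming convention is an assumption taken from the file names in the examples, and `sub.clean(rules)` is assumed to return a truthy value when the subtitle changed, as the snippet above suggests.

```python
from pathlib import Path


def find_subtitles(root, language="en"):
    """Return sorted .srt paths under root whose name ends with .<language>.srt."""
    return sorted(Path(root).rglob(f"*.{language}.srt"))


def clean_directory(root, config_path, tags, language="en"):
    """Clean every matching subtitle and return the paths that were saved."""
    # Imported here so find_subtitles works even without cleanit installed.
    from cleanit import Config, Subtitle

    rules = Config.from_path(config_path).select_rules(tags=set(tags))
    changed = []
    for path in find_subtitles(root, language):
        sub = Subtitle(str(path))
        # Assumption: a truthy result means the subtitle was modified.
        if sub.clean(rules):
            sub.save()
            changed.append(path)
    return changed
```

A call such as `clean_directory('/medias', '/config/path', {'ocr'})` would then apply the `ocr` rules to every English subtitle found under `/medias`.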

### YAML Configuration file

``` yaml
templates:
  - &ocr
    tags:
      - ocr
      - minimal
      - default
    priority: 10000
    languages: en

rules:
  replace-l-to-I-character[ocr:en]:
    <<: *ocr
    patterns: '\bl\b'
    replacement: 'I'
    examples:
      ? |
        And if l refuse?
      : |
        And if I refuse?
```
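
To see what this rule does to a line of text, the pattern and replacement can be tried with Python's standard `re` module. This is only an illustration of the regular expression: the dictionaries mirror the YAML above, `apply_rule` is a hypothetical helper, and none of it is meant to describe how CleanIt applies rules internally.

```python
import re

# Template and rule mirrored from the YAML above as plain dicts.
ocr_template = {
    "tags": ["ocr", "minimal", "default"],
    "priority": 10000,
    "languages": "en",
}
rule = {
    **ocr_template,  # plays the role of the `<<: *ocr` merge key
    "patterns": r"\bl\b",
    "replacement": "I",
    # YAML block scalars (`|`) keep the trailing newline.
    "examples": {"And if l refuse?\n": "And if I refuse?\n"},
}


def apply_rule(rule, text):
    """Apply a single replacement rule with re.sub (illustrative only)."""
    return re.sub(rule["patterns"], rule["replacement"], text)


# Each example maps an input line to its expected cleaned form.
for before, after in rule["examples"].items():
    assert apply_rule(rule, before) == after

print(apply_rule(rule, "l think all is well."))  # prints: I think all is well.
```

The word boundaries (`\b`) on both sides are what keep the rule from touching the letter `l` inside words such as "all" or "well"; only a standalone `l` is replaced.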

            

## Raw data

{
    "_id": null,
    "home_page": "https://github.com/ratoaq2/cleanit",
    "name": "cleanit",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0.0,>=3.9.0",
    "maintainer_email": null,
    "keywords": "subtitle, subtitles, clean, denylist, replace, ocr, fix, tidy",
    "author": "Rato",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/57/e3/d08d7980c4a04f3e23c8adf33717cb92b0e009ac96f6c05e5867bca0edf1/cleanit-0.4.8.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "Subtitles extremely clean",
    "version": "0.4.8",
    "project_urls": {
        "Homepage": "https://github.com/ratoaq2/cleanit",
        "Repository": "https://github.com/ratoaq2/cleanit"
    },
    "split_keywords": [
        "subtitle",
        " subtitles",
        " clean",
        " denylist",
        " replace",
        " ocr",
        " fix",
        " tidy"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "79b9fcf9e3b833bff99e1d2d63c31dad1d10c1d650f29971b541846295d96513",
                "md5": "dcdf9c28fd79b49e63c9f061b55644be",
                "sha256": "8ae8853871a8664a8781f8f82940ac559322263058f9d94b245780c1750681f2"
            },
            "downloads": -1,
            "filename": "cleanit-0.4.8-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "dcdf9c28fd79b49e63c9f061b55644be",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0.0,>=3.9.0",
            "size": 26630,
            "upload_time": "2024-06-23T06:19:12",
            "upload_time_iso_8601": "2024-06-23T06:19:12.426240Z",
            "url": "https://files.pythonhosted.org/packages/79/b9/fcf9e3b833bff99e1d2d63c31dad1d10c1d650f29971b541846295d96513/cleanit-0.4.8-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "57e3d08d7980c4a04f3e23c8adf33717cb92b0e009ac96f6c05e5867bca0edf1",
                "md5": "0a0b5adf9cc322e6683f457fc51e5d41",
                "sha256": "1b19fe2dd2712695ebbf9d429c4d3366a1b51300738bb034c13ea221c84a6ae9"
            },
            "downloads": -1,
            "filename": "cleanit-0.4.8.tar.gz",
            "has_sig": false,
            "md5_digest": "0a0b5adf9cc322e6683f457fc51e5d41",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0.0,>=3.9.0",
            "size": 21625,
            "upload_time": "2024-06-23T06:19:14",
            "upload_time_iso_8601": "2024-06-23T06:19:14.038247Z",
            "url": "https://files.pythonhosted.org/packages/57/e3/d08d7980c4a04f3e23c8adf33717cb92b0e009ac96f6c05e5867bca0edf1/cleanit-0.4.8.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-06-23 06:19:14",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ratoaq2",
    "github_project": "cleanit",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "cleanit"
}
        