demeuk


 - Name: demeuk
 - Version: 4.2.0
 - Home page: https://github.com/NetherlandsForensicInstitute/demeuk
 - Summary: CLI tool to remove invalid chars from a corpus.
 - Upload time: 2023-12-21 07:19:39
 - Author: Netherlands Forensic Institute
 - Docs URL: None
 - Requirements: No requirements were recorded.
 - Travis-CI: No Travis.
 - Coveralls test coverage: No coveralls.

# Demeuk
[![Documentation Status](https://readthedocs.org/projects/demeuk/badge/?version=latest)](https://demeuk.readthedocs.io/en/latest/?badge=latest) [![Tests](https://github.com/NetherlandsForensicInstitute/demeuk/actions/workflows/test.yml/badge.svg)](https://github.com/NetherlandsForensicInstitute/demeuk/actions/workflows/test.yml)

Demeuk is a simple tool for cleaning up corpora (such as dictionaries) or any other dataset
containing plain-text strings. Example use cases include cleaning up language dictionaries,
password sets (for example RockYou), or any file or stdin stream containing plain-text strings.

Such corpora often contain encoding mistakes, or lines from which you want to remove some parts.
Instead of writing a huge bash one-liner, you can use demeuk to do all your cleaning.

Example usages:
 - Cutting
 - Length checking
 - Encoding fixing
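
The three operations listed above can be pictured with a small Python sketch. This is not demeuk's code or API: the helper names (`cut`, `length_ok`, `fix_encoding`, `clean`), the `:` delimiter, the length bounds, and the UTF-8 then Latin-1 fallback are all assumptions chosen for illustration.

```python
# Illustrative sketch only: these helpers are hypothetical and are NOT
# part of demeuk. They show what "cutting", "length checking" and
# "encoding fixing" can mean for a line-based corpus.

def cut(line: str, delimiter: str = ":") -> str:
    """Keep the text after the first delimiter, e.g. 'user:pass' -> 'pass'."""
    _, sep, rest = line.partition(delimiter)
    return rest if sep else line  # no delimiter found: keep the line as-is


def length_ok(line: str, min_len: int = 1, max_len: int = 64) -> bool:
    """Return True when the line length lies within the assumed bounds."""
    return min_len <= len(line) <= max_len


def fix_encoding(raw: bytes) -> str:
    """Decode as UTF-8; fall back to Latin-1 (an assumed fallback) on failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def clean(raw_lines):
    """Split raw byte lines into (kept, dropped) lists of cleaned strings."""
    kept, dropped = [], []
    for raw in raw_lines:
        line = cut(fix_encoding(raw.rstrip(b"\r\n")))
        (kept if length_ok(line) else dropped).append(line)
    return kept, dropped


kept, dropped = clean([b"alice:hunter2\n", b"caf\xe9\n", b"bob:\n"])
print(kept, dropped)
```

Here the second input line is not valid UTF-8, so it goes through the Latin-1 fallback, and the third line is dropped because nothing remains after cutting.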

Demeuk is written in Python 3, which means it is slower than tools such as cut.
However, Demeuk is multithreaded and can therefore use all your cores. Demeuk
can also easily be extended to match your needs.

This application is part of the CERBERUS project that has received
funding from the European Union's Internal Security Fund - Police under
grant agreement No. 82201.

Please read the docs for more information.

## Quick start
The recommended way to install demeuk is to install it in a virtual
environment.

```
# Create virtual environment
virtualenv <virtual environment name>
# Activate the virtual environment
source <virtual environment name>/bin/activate
pip3 install -r requirements.txt
```

Now you can run bin/demeuk.py:

Examples:
```
    demeuk -i inputfile.tmp -o outputfile.dict -l droppedfile.txt
    demeuk -i inputfile -o outputfile -j 24 -l logfile.log
    demeuk -i inputfile.tmp -o outputfile.dict -l droppedfile.txt --leak
    demeuk -i inputfile -o outputfile -j 24 -l logfile.log --leak-full
    demeuk -i inputdir/*.txt -o outputfile.dict -l logfile.log
    demeuk -o outputfile.dict -l logfile.log
```
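
If you want to drive these commands from a Python script, a minimal sketch could look like the following. The `build_demeuk_command` helper is hypothetical, and it only uses the flags that appear in the examples above (`-i`, `-o`, `-j`, `-l`). The reading of `-j` as a job count and `-l` as a log file is inferred from the example file names, so check the docs before relying on it.

```python
import shlex


def build_demeuk_command(output, log, input_path=None, jobs=None):
    """Assemble a demeuk argument list using only flags shown in the examples.

    Hypothetical helper: omitting input_path mirrors the last example,
    which passes no -i flag.
    """
    cmd = ["demeuk"]
    if input_path is not None:
        cmd += ["-i", input_path]
    cmd += ["-o", output]
    if jobs is not None:
        cmd += ["-j", str(jobs)]
    cmd += ["-l", log]
    return cmd


cmd = build_demeuk_command("outputfile", "logfile.log",
                           input_path="inputfile", jobs=24)
print(shlex.join(cmd))

# To actually run it (requires demeuk to be installed and on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.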

## Docs
The docs are available at: <http://demeuk.rtfd.io/>

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/NetherlandsForensicInstitute/demeuk",
    "name": "demeuk",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "",
    "author": "Netherlands Forensic Institute",
    "author_email": "holmesnl@users.noreply.github.com",
    "download_url": "https://files.pythonhosted.org/packages/26/e7/7fd14c15c76852209a2e539d4cf904ce3952d36823cb1babd1d9122caf94/demeuk-4.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "CLI tool to remove invalid chars from a corpus.",
    "version": "4.2.0",
    "project_urls": {
        "Homepage": "https://github.com/NetherlandsForensicInstitute/demeuk"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "12c2d3bf72efca0abc3a9fb99baa9418e45174db81f8c2e76d9cc9a45a0d9697",
                "md5": "ecd6b73d093e67000c88f610038290a0",
                "sha256": "1d86441f98387fd2f9095e91fcfb28e54c3f2525db2e4f486689291957215f2a"
            },
            "downloads": -1,
            "filename": "demeuk-4.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ecd6b73d093e67000c88f610038290a0",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 18428,
            "upload_time": "2023-12-21T07:19:38",
            "upload_time_iso_8601": "2023-12-21T07:19:38.242363Z",
            "url": "https://files.pythonhosted.org/packages/12/c2/d3bf72efca0abc3a9fb99baa9418e45174db81f8c2e76d9cc9a45a0d9697/demeuk-4.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "26e77fd14c15c76852209a2e539d4cf904ce3952d36823cb1babd1d9122caf94",
                "md5": "a14fc20fd094b74e7427648ded4bc51a",
                "sha256": "e2a87249842417d7b766e4cbe7ef57fd6fcbba66cb8944179ca74c506908aec8"
            },
            "downloads": -1,
            "filename": "demeuk-4.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "a14fc20fd094b74e7427648ded4bc51a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 19186,
            "upload_time": "2023-12-21T07:19:39",
            "upload_time_iso_8601": "2023-12-21T07:19:39.825095Z",
            "url": "https://files.pythonhosted.org/packages/26/e7/7fd14c15c76852209a2e539d4cf904ce3952d36823cb1babd1d9122caf94/demeuk-4.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-12-21 07:19:39",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "NetherlandsForensicInstitute",
    "github_project": "demeuk",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "tox": true,
    "lcname": "demeuk"
}
        