text-quality

Name	text-quality
Version	0.3.1
home_page	https://github.com/laHTeR/htr-quality-classifier
Summary	A package to determine the quality of a digitized text, from a handwritten script or scanned print (HTR/OCR output).
upload_time	2023-11-16 14:23:17
maintainer
docs_url	None
author	Carsten Schnober
requires_python	>=3.9
license
keywords	htr ocr
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# Text Quality

A package to determine the quality of a digitized text, from a handwritten script or scanned print (HTR/OCR output).

The current pipeline is tuned on (historic) Dutch and will not perform well on other languages.
However, the [underlying model](https://jdmdh.episciences.org/10239) has been used for other (Germanic) languages, and can be adapted and applied to texts of other languages and time periods.

<img src="./qrcode.svg" width=100 height=100>

## Examples

Good quality (not necessarily perfect):

```
Van
Malacca den 29 maart 1.
door zoo veel ruijmer handen te hebben,
[…]
Siac van waar op den 5=e deeser,
na onse verschijde adhortaties, is over
eeen gekomen
zoo meede van Siac
```

Bad quality:

```
uijtkoops --
winst suijverevense versis
e ee
,, 19
1 oe
na aftrek van
5 p:s C: Commiss:s
t 1a per 't geheel t p=s lb. off @'t geheeke
[…]
```

## What's Missing

- Pipelines for languages other than historic Dutch
- Automatic training procedure for creating and updating pipelines
- Additional features such as publication year.

See [this notebook](notebooks/quality.ipynb) for a semi-automated pipeline creation process.

## How to use text_quality

After [installation](#installation), use the [classify_text_quality.py](scripts/classify_text_quality.py) script to classify PageXML or plain text files.
For instance, to classify all `*.xml` files in the `page/` directory, use the `--glob` argument:

```shell
classify_text_quality.py --glob "page/*.xml" --output classifications.csv --output-scores
```
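Before running the script on a large collection, it can help to check which files a pattern would match. The following is a minimal sketch using Python's standard `glob` module. It assumes the script expands the pattern in the same way, which the documentation here does not confirm. `preview_matches` is a hypothetical helper and not part of `text_quality`.

```python
import glob


def preview_matches(pattern: str) -> list[str]:
    """Return the sorted list of paths matching a shell-style pattern."""
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # Same pattern as in the command above.
    for path in preview_matches("page/*.xml"):
        print(path)
```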

For each input file, the script writes one line in CSV format that contains the classification result, which is one of:

1. Good quality
2. Medium quality
3. Bad quality
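The resulting CSV can be post-processed with the Python standard library. The sketch below tallies how many files fall into each class. The column name `classification`, the `filename` column, and the sample rows are assumptions for illustration only; check the header of your own output file for the actual column names.

```python
import csv
import io
from collections import Counter


def tally_classes(csv_file, column: str) -> Counter:
    """Count how often each value occurs in the given column of a CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Made-up sample data; the real header may differ.
sample = io.StringIO(
    "filename,classification\n"
    "page/0001.xml,1\n"
    "page/0002.xml,3\n"
    "page/0003.xml,1\n"
)
counts = tally_classes(sample, column="classification")
print(counts)  # Counter({'1': 2, '3': 1})
```

For a real run, open the file with `open("classifications.csv", newline="")` and pass the file object instead of `sample`.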

All supported parameters:

```console
$ classify_text_quality.py --help
usage: Classify the quality of a (digitized) text. [-h] [--input [FILE ...]] [--pagexml [FILE ...]] [--pagexml-glob PATTERN] [--output FILE] [--output-scores]

options:
  -h, --help            show this help message and exit
  --output FILE, -o FILE
                        Output file; defaults to stdout.
  --output-scores       Output scores and text statistics.

Input:
  --input [FILE ...], -i [FILE ...]
                        Plain text file(s) to classify. Use '-' for stdin.
  --pagexml [FILE ...]  Input file(s) in PageXML format.
  --pagexml-glob PATTERN, --glob PATTERN
                        A pattern to find a set of PageXML files, e.g. 'pagexml/*.xml'.
```

### Notes

The pipeline might emit warnings like this:

```console
UserWarning: X does not have valid feature names, but MLPClassifier was fitted with feature names
```

This is due to the internals of the [Scikit-Learn Pipeline object](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html), and can safely be ignored.
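If the warning clutters your logs when you call the pipeline from your own Python code, it can be filtered with the standard `warnings` module. This is a minimal sketch: the filter matches on the message prefix shown above, and `noisy_call` is a hypothetical stand-in for whatever code triggers the warning.

```python
import warnings


def noisy_call() -> str:
    """Stand-in for a call that emits the Scikit-Learn feature-name warning."""
    warnings.warn(
        "X does not have valid feature names, "
        "but MLPClassifier was fitted with feature names",
        UserWarning,
    )
    return "done"


with warnings.catch_warnings(record=True) as caught:
    # Reset inherited filters so the result does not depend on global state.
    warnings.simplefilter("always")
    # Ignore only this specific message; other warnings still surface.
    warnings.filterwarnings(
        "ignore",
        message="X does not have valid feature names",
        category=UserWarning,
    )
    result = noisy_call()

print(result, len(caught))  # done 0
```

In a script, a single `warnings.filterwarnings("ignore", message=..., category=UserWarning)` call near the top is usually enough; the `catch_warnings` block here only serves to show that the warning is in fact suppressed.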

The dependencies are pinned to specific versions.
While this prevents implicit updates, even patch-level updates of required libraries, it also avoids misleading warnings emitted by differing Scikit-Learn versions.
Hence, you can change the pinned dependencies manually if you are aware of these issues.
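To see which versions an installed copy of the package actually pins, the standard `importlib.metadata` module can list its declared requirements. This is a sketch under the assumption that the distribution is installed under the name `text-quality`; `pinned_requirements` is a hypothetical helper, and the list may be empty if the distribution records no requirements.

```python
from importlib import metadata


def pinned_requirements(dist_name: str) -> list[str]:
    """List requirement strings of an installed distribution that pin with '=='."""
    try:
        requirements = metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        return []
    # Look only at the part before ';' so environment markers
    # such as 'extra == "dev"' are not mistaken for version pins.
    return [req for req in requirements if "==" in req.split(";", 1)[0]]


if __name__ == "__main__":
    for requirement in pinned_requirements("text-quality"):
        print(requirement)
```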

The project setup is documented in [project_setup.md](project_setup.md).

## Installation

To install the `text_quality` package:

```shell
pip install -U text-quality
```

Alternatively, install the package from the GitHub repository:

```shell
git clone https://github.com/LAHTeR/htr-quality-classifier.git
cd htr-quality-classifier
python3 -m pip install -U .
```

## Documentation

[Readthedocs](https://htr-quality-classifier.readthedocs.io/en/latest/)

## Software Architecture

This diagram shows the class design of the `text_quality` package.

![Software architecture](classes_text_quality.svg)

## Contributing

If you want to contribute to the development of text_quality,
have a look at the [contribution guidelines](CONTRIBUTING.md).

## Credits

Logic and implementation are based on [Nautilus-OCR](https://github.com/natliblux/nautilusocr).

This package was created with [Cookiecutter](https://github.com/audreyr/cookiecutter) and the [NLeSC/python-template](https://github.com/NLeSC/python-template).

## Badges

| fair-software.eu recommendations | |
| :-- | :--  |
| (1/5) code repository              | [![github repo badge](https://img.shields.io/badge/github-repo-000.svg?logo=github&labelColor=gray&color=blue)](https://github.com/laHTeR/htr-quality-classifier) |
| (2/5) license                      | [![github license badge](https://img.shields.io/github/license/laHTeR/htr-quality-classifier)](https://github.com/laHTeR/htr-quality-classifier) |
| (3/5) community registry           | [![RSD](https://img.shields.io/badge/rsd-text_quality-00a3e3.svg)](https://research-software-directory.org/projects/lahter) [![workflow pypi badge](https://img.shields.io/pypi/v/text_quality.svg?colorB=blue)](https://pypi.python.org/project/text_quality/) |
| (4/5) citation                     | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.8190017.svg)](https://doi.org/10.5281/zenodo.8190017) |
| (5/5) checklist                    | [![OpenSSF Best Practices](https://bestpractices.coreinfrastructure.org/projects/7672/badge)](https://bestpractices.coreinfrastructure.org/projects/7672) |
| howfairis                          | [![fair-software badge](https://img.shields.io/badge/fair--software.eu-%E2%97%8F%20%20%E2%97%8F%20%20%E2%97%8F%20%20%E2%97%8F%20%20%E2%97%8B-yellow)](https://fair-software.eu) |
| **Other best practices**           | &nbsp; |
| Static analysis                    | [![workflow scq badge](https://sonarcloud.io/api/project_badges/measure?project=LAHTeR_htr-quality-classifier&metric=alert_status)](https://sonarcloud.io/dashboard?id=LAHTeR_htr-quality-classifier) |
| Coverage                           | [![workflow scc badge](https://sonarcloud.io/api/project_badges/measure?project=LAHTeR_htr-quality-classifier&metric=coverage)](https://sonarcloud.io/dashboard?id=LAHTeR_htr-quality-classifier) |
| Documentation                      | [![Documentation Status](https://readthedocs.org/projects/htr-quality-classifier/badge/?version=latest)](https://htr-quality-classifier.readthedocs.io/en/latest/?badge=latest) |
| **GitHub Actions**                 | &nbsp; |
| Build                              | [![build](https://github.com/laHTeR/htr-quality-classifier/actions/workflows/build.yml/badge.svg)](https://github.com/laHTeR/htr-quality-classifier/actions/workflows/build.yml) |
| Citation data consistency               | [![cffconvert](https://github.com/laHTeR/htr-quality-classifier/actions/workflows/cffconvert.yml/badge.svg)](https://github.com/laHTeR/htr-quality-classifier/actions/workflows/cffconvert.yml) |
| SonarCloud                         | [![sonarcloud](https://github.com/laHTeR/htr-quality-classifier/actions/workflows/sonarcloud.yml/badge.svg)](https://github.com/laHTeR/htr-quality-classifier/actions/workflows/sonarcloud.yml) |
| MarkDown link checker              | [![markdown-link-check](https://github.com/laHTeR/htr-quality-classifier/actions/workflows/markdown-link-check.yml/badge.svg)](https://github.com/laHTeR/htr-quality-classifier/actions/workflows/markdown-link-check.yml) |

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/laHTeR/htr-quality-classifier",
    "name": "text-quality",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "",
    "keywords": "htr,ocr",
    "author": "Carsten Schnober",
    "author_email": "c.schnober@esciencecenter.nl",
    "download_url": "https://files.pythonhosted.org/packages/82/10/d93c0eb8930dcff73efe374c96d96060bdb04042728665b3fd5fb2ae478c/text_quality-0.3.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "A package to determine the quality of a digitized text, from a handwritten script or scanned print (HTR/OCR output).",
    "version": "0.3.1",
    "project_urls": {
        "Bug Tracker": "https://github.com/laHTeR/htr-quality-classifier/issues",
        "Homepage": "https://github.com/laHTeR/htr-quality-classifier"
    },
    "split_keywords": [
        "htr",
        "ocr"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "425a07c6cceadbbaed168b9fcd70912af1b81d5675ed5814aaff8e96a3612b8e",
                "md5": "453ff5702ec8cf8d6207ca7b6ad068b0",
                "sha256": "758f215ad5ab5922576ea368935326f16b13cbeee1448c5af08263798c6484eb"
            },
            "downloads": -1,
            "filename": "text_quality-0.3.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "453ff5702ec8cf8d6207ca7b6ad068b0",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 2472363,
            "upload_time": "2023-11-16T14:23:15",
            "upload_time_iso_8601": "2023-11-16T14:23:15.141913Z",
            "url": "https://files.pythonhosted.org/packages/42/5a/07c6cceadbbaed168b9fcd70912af1b81d5675ed5814aaff8e96a3612b8e/text_quality-0.3.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8210d93c0eb8930dcff73efe374c96d96060bdb04042728665b3fd5fb2ae478c",
                "md5": "faafda022968a20477f611e214c2f944",
                "sha256": "264a4a024ddabb59762bef3d0055012f9db487c75cf7df404ffa882cb72232a5"
            },
            "downloads": -1,
            "filename": "text_quality-0.3.1.tar.gz",
            "has_sig": false,
            "md5_digest": "faafda022968a20477f611e214c2f944",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 2474282,
            "upload_time": "2023-11-16T14:23:17",
            "upload_time_iso_8601": "2023-11-16T14:23:17.280565Z",
            "url": "https://files.pythonhosted.org/packages/82/10/d93c0eb8930dcff73efe374c96d96060bdb04042728665b3fd5fb2ae478c/text_quality-0.3.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-11-16 14:23:17",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "laHTeR",
    "github_project": "htr-quality-classifier",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "text-quality"
}

Carsten Schnober