# Text Quality
A package to determine the quality of a digitized text, from a handwritten script or scanned print (HTR/OCR output).
The current pipeline is tuned for (historic) Dutch and will not perform well on other languages.
However, the [underlying model](https://jdmdh.episciences.org/10239) has been used for other (Germanic) languages, and can be adapted and applied to texts of other languages and time periods.
<img src="./qrcode.svg" width=100 height=100>
## Examples
Good quality (not necessarily perfect):
```
Van
Malacca den 29 maart 1.
door zoo veel ruijmer handen te hebben,
[…]
Siac van waar op den 5=e deeser,
na onse verschijde adhortaties, is over
eeen gekomen
zoo meede van Siac
```
Bad quality:
```
uijtkoops --
winst suijverevense versis
e ee
,, 19
1 oe
na aftrek van
5 p:s C: Commiss:s
t 1a per 't geheel t p=s lb. off @'t geheeke
[…]
```
## What's Missing
- Pipelines for languages other than historic Dutch
- Automatic training procedure for creating and updating pipelines
- Additional features such as publication year
See [this notebook](notebooks/quality.ipynb) for a semi-automated pipeline creation process.
## How to use text_quality
After [installation](#installation), use the [classify_text_quality.py](scripts/classify_text_quality.py) script to classify PageXML or plain text files.
For instance, if you want to classify all `*.xml` files in the `pages/` directory, use the `--glob` argument:
```shell
classify_text_quality.py --glob "pages/*.xml" --output classifications.csv --output-scores
```
For each input file, the script writes one line in CSV format containing the classification result, which is one of:
1. Good quality
2. Medium quality
3. Bad quality
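As a rough sketch of how the results might be summarized, the snippet below maps the three classes to their labels and counts them. It assumes that the numbers in the list above (1, 2, 3) are the values that appear in the output; the helper name `summarize` and the sample values are hypothetical, and extracting the class column from the CSV is left out because the exact column layout is not described here.

```python
from collections import Counter

# Assumption: the numeric classes correspond to the list above.
QUALITY_LABELS = {1: "Good quality", 2: "Medium quality", 3: "Bad quality"}


def summarize(classes):
    """Count how often each quality class occurs (hypothetical helper).

    `classes` is any iterable of class values, given as ints or numeric strings.
    Values outside 1-3 are not included in the summary.
    """
    counts = Counter(int(value) for value in classes)
    return {label: counts.get(code, 0) for code, label in QUALITY_LABELS.items()}


# Made-up sample values, for illustration only.
print(summarize(["1", "3", "1", "2"]))
```
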
All supported parameters:
```console
$ classify_text_quality.py --help
usage: Classify the quality of a (digitized) text. [-h] [--input [FILE ...]] [--pagexml [FILE ...]] [--pagexml-glob PATTERN] [--output FILE] [--output-scores]

options:
  -h, --help            show this help message and exit
  --output FILE, -o FILE
                        Output file; defaults to stdout.
  --output-scores       Output scores and text statistics.

Input:
  --input [FILE ...], -i [FILE ...]
                        Plain text file(s) to classify. Use '-' for stdin.
  --pagexml [FILE ...]  Input file(s) in PageXML format.
  --pagexml-glob PATTERN, --glob PATTERN
                        A pattern to find a set of PageXML files, e.g. 'pagexml/*.xml'.
```
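If you want to call the script from Python, for example in a batch job, the command line can be assembled from the options listed above. This is a minimal sketch: the helper `build_command` and its parameter names are hypothetical, and only flags shown in the help output are used.

```python
import shlex


def build_command(pagexml_glob=None, pagexml_files=(), text_files=(),
                  output=None, output_scores=False):
    """Assemble an argument list for classify_text_quality.py (hypothetical helper)."""
    cmd = ["classify_text_quality.py"]
    if text_files:
        cmd += ["--input", *text_files]      # plain text files; '-' means stdin
    if pagexml_files:
        cmd += ["--pagexml", *pagexml_files]  # explicit PageXML files
    if pagexml_glob:
        cmd += ["--pagexml-glob", pagexml_glob]
    if output:
        cmd += ["--output", output]          # omitted: the script writes to stdout
    if output_scores:
        cmd.append("--output-scores")
    return cmd


cmd = build_command(pagexml_glob="pages/*.xml",
                    output="classifications.csv",
                    output_scores=True)
# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))
```

Once the package is installed and the script is on your `PATH`, the resulting list can be passed to `subprocess.run(cmd, check=True)`.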
### Notes
The pipeline might emit warnings like this:
```console
UserWarning: X does not have valid feature names, but MLPClassifier was fitted with feature names
```
This is due to the internals of the [Scikit-Learn Pipeline object](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html), and can safely be ignored.
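If the warning clutters your logs, it can be filtered with Python's standard `warnings` module. The sketch below assumes you run the pipeline from your own Python code; the function name `silence_feature_name_warning` is hypothetical, and the filter matches only messages that start with the text shown above.

```python
import warnings


def silence_feature_name_warning():
    """Ignore only the 'valid feature names' UserWarning (hypothetical helper)."""
    warnings.filterwarnings(
        "ignore",
        # The message is a regular expression matched against the start of the warning text.
        message="X does not have valid feature names",
        category=UserWarning,
    )


# Call this once, before running the classification pipeline.
silence_feature_name_warning()
```

When using the command-line script instead, setting the environment variable `PYTHONWARNINGS="ignore::UserWarning"` has a similar effect, but it silences all warnings of that category.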
The dependencies are pinned to specific versions.
While this prevents implicit updates, even patch-level updates of required libraries, it also avoids misleading warnings emitted by varying Scikit-Learn versions.
You can therefore change the pinned dependencies manually if you are aware of these issues.
The project setup is documented in [project_setup.md](project_setup.md).
## Installation
To install the `text_quality` package:
```shell
pip install -U text-quality
```
Alternatively, install the package from the GitHub repository:
```shell
git clone https://github.com/LAHTeR/htr-quality-classifier.git
cd htr-quality-classifier
python3 -m pip install -U .
```
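To check whether the installation succeeded, you can query the installed distribution with the standard library. This is a small sketch; the helper name `installed_version` is hypothetical, and `text-quality` is the distribution name used in the `pip` command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="text-quality"):
    """Return the installed version string, or None if the distribution is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version()
if found:
    print(f"text-quality {found} is installed")
else:
    print("text-quality is not installed in this environment")
```
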
## Documentation
[Readthedocs](https://htr-quality-classifier.readthedocs.io/en/latest/)
## Software Architecture
This diagram shows the class design of the `text_quality` package.
![Software architecture](classes_text_quality.svg)
## Contributing
If you want to contribute to the development of text_quality,
have a look at the [contribution guidelines](CONTRIBUTING.md).
## Credits
Logic and implementation are based on [Nautilus-OCR](https://github.com/natliblux/nautilusocr).
This package was created with [Cookiecutter](https://github.com/audreyr/cookiecutter) and the [NLeSC/python-template](https://github.com/NLeSC/python-template).
## Badges
| fair-software.eu recommendations | |
| :-- | :-- |
| (1/5) code repository | [![github repo badge](https://img.shields.io/badge/github-repo-000.svg?logo=github&labelColor=gray&color=blue)](https://github.com/laHTeR/htr-quality-classifier) |
| (2/5) license | [![github license badge](https://img.shields.io/github/license/laHTeR/htr-quality-classifier)](https://github.com/laHTeR/htr-quality-classifier) |
| (3/5) community registry | [![RSD](https://img.shields.io/badge/rsd-text_quality-00a3e3.svg)](https://research-software-directory.org/projects/lahter) [![workflow pypi badge](https://img.shields.io/pypi/v/text_quality.svg?colorB=blue)](https://pypi.python.org/project/text_quality/) |
| (4/5) citation | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.8190017.svg)](https://doi.org/10.5281/zenodo.8190017) |
| (5/5) checklist | [![OpenSSF Best Practices](https://bestpractices.coreinfrastructure.org/projects/7672/badge)](https://bestpractices.coreinfrastructure.org/projects/7672) |
| howfairis | [![fair-software badge](https://img.shields.io/badge/fair--software.eu-%E2%97%8F%20%20%E2%97%8F%20%20%E2%97%8F%20%20%E2%97%8F%20%20%E2%97%8B-yellow)](https://fair-software.eu) |
| **Other best practices** | |
| Static analysis | [![workflow scq badge](https://sonarcloud.io/api/project_badges/measure?project=LAHTeR_htr-quality-classifier&metric=alert_status)](https://sonarcloud.io/dashboard?id=LAHTeR_htr-quality-classifier) |
| Coverage | [![workflow scc badge](https://sonarcloud.io/api/project_badges/measure?project=LAHTeR_htr-quality-classifier&metric=coverage)](https://sonarcloud.io/dashboard?id=LAHTeR_htr-quality-classifier) |
| Documentation | [![Documentation Status](https://readthedocs.org/projects/htr-quality-classifier/badge/?version=latest)](https://htr-quality-classifier.readthedocs.io/en/latest/?badge=latest) |
| **GitHub Actions** | |
| Build | [![build](https://github.com/laHTeR/htr-quality-classifier/actions/workflows/build.yml/badge.svg)](https://github.com/laHTeR/htr-quality-classifier/actions/workflows/build.yml) |
| Citation data consistency | [![cffconvert](https://github.com/laHTeR/htr-quality-classifier/actions/workflows/cffconvert.yml/badge.svg)](https://github.com/laHTeR/htr-quality-classifier/actions/workflows/cffconvert.yml) |
| SonarCloud | [![sonarcloud](https://github.com/laHTeR/htr-quality-classifier/actions/workflows/sonarcloud.yml/badge.svg)](https://github.com/laHTeR/htr-quality-classifier/actions/workflows/sonarcloud.yml) |
| MarkDown link checker | [![markdown-link-check](https://github.com/laHTeR/htr-quality-classifier/actions/workflows/markdown-link-check.yml/badge.svg)](https://github.com/laHTeR/htr-quality-classifier/actions/workflows/markdown-link-check.yml) |
Raw data
{
"_id": null,
"home_page": "https://github.com/laHTeR/htr-quality-classifier",
"name": "text-quality",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": "",
"keywords": "htr,ocr",
"author": "Carsten Schnober",
"author_email": "c.schnober@esciencecenter.nl",
"download_url": "https://files.pythonhosted.org/packages/82/10/d93c0eb8930dcff73efe374c96d96060bdb04042728665b3fd5fb2ae478c/text_quality-0.3.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "A package to determine the quality of a a digitized text, from a handwritten script or scanned print (HTR/OCR output).",
"version": "0.3.1",
"project_urls": {
"Bug Tracker": "https://github.com/laHTeR/htr-quality-classifier/issues",
"Homepage": "https://github.com/laHTeR/htr-quality-classifier"
},
"split_keywords": [
"htr",
"ocr"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "425a07c6cceadbbaed168b9fcd70912af1b81d5675ed5814aaff8e96a3612b8e",
"md5": "453ff5702ec8cf8d6207ca7b6ad068b0",
"sha256": "758f215ad5ab5922576ea368935326f16b13cbeee1448c5af08263798c6484eb"
},
"downloads": -1,
"filename": "text_quality-0.3.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "453ff5702ec8cf8d6207ca7b6ad068b0",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 2472363,
"upload_time": "2023-11-16T14:23:15",
"upload_time_iso_8601": "2023-11-16T14:23:15.141913Z",
"url": "https://files.pythonhosted.org/packages/42/5a/07c6cceadbbaed168b9fcd70912af1b81d5675ed5814aaff8e96a3612b8e/text_quality-0.3.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "8210d93c0eb8930dcff73efe374c96d96060bdb04042728665b3fd5fb2ae478c",
"md5": "faafda022968a20477f611e214c2f944",
"sha256": "264a4a024ddabb59762bef3d0055012f9db487c75cf7df404ffa882cb72232a5"
},
"downloads": -1,
"filename": "text_quality-0.3.1.tar.gz",
"has_sig": false,
"md5_digest": "faafda022968a20477f611e214c2f944",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 2474282,
"upload_time": "2023-11-16T14:23:17",
"upload_time_iso_8601": "2023-11-16T14:23:17.280565Z",
"url": "https://files.pythonhosted.org/packages/82/10/d93c0eb8930dcff73efe374c96d96060bdb04042728665b3fd5fb2ae478c/text_quality-0.3.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-11-16 14:23:17",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "laHTeR",
"github_project": "htr-quality-classifier",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "text-quality"
}