dinglehopper
============
dinglehopper is an OCR evaluation tool that reads
[ALTO](https://github.com/altoxml),
[PAGE](https://github.com/PRImA-Research-Lab/PAGE-XML) and text files. It
compares a ground truth (GT) document page with an OCR result page to compute
metrics and a report of word/character differences. It also supports batch processing by
generating, aggregating and summarizing multiple reports.

[![Tests](https://github.com/qurator-spk/dinglehopper/actions/workflows/test.yml/badge.svg)](https://github.com/qurator-spk/dinglehopper/actions?query=workflow:"test")
[![GitHub tag](https://img.shields.io/github/tag/qurator-spk/dinglehopper?include_prereleases=&sort=semver&color=blue)](https://github.com/qurator-spk/dinglehopper/releases/)
[![License](https://img.shields.io/badge/License-Apache-blue)](#license)
[![issues - dinglehopper](https://img.shields.io/github/issues/qurator-spk/dinglehopper)](https://github.com/qurator-spk/dinglehopper/issues)

Goals
-----
* Useful
  * As a UI tool
  * For an automated evaluation
  * As a library
* Unicode support

Installation
------------
It's best to use pip to install the package from PyPI, e.g.:
```
pip install dinglehopper
```
Usage
-----
~~~
Usage: dinglehopper [OPTIONS] GT OCR [REPORT_PREFIX] [REPORTS_FOLDER]

  Compare the PAGE/ALTO/text document GT against the document OCR.

  dinglehopper detects if GT/OCR are ALTO or PAGE XML documents to extract
  their text and falls back to plain text if no ALTO or PAGE is detected.

  The files GT and OCR are usually a ground truth document and the result of
  an OCR software, but you may use dinglehopper to compare two OCR results. In
  that case, use --no-metrics to disable the then meaningless metrics and also
  change the color scheme from green/red to blue.

  The comparison report will be written to
  $REPORTS_FOLDER/$REPORT_PREFIX.{html,json}, where $REPORTS_FOLDER defaults
  to the current working directory and $REPORT_PREFIX defaults to "report".
  The reports include the character error rate (CER) and the word error rate
  (WER).

  By default, the text of PAGE files is extracted on 'region' level. You may
  use "--textequiv-level line" to extract from the level of TextLine tags.

Options:
  --metrics / --no-metrics  Enable/disable metrics and green/red
  --differences BOOLEAN     Enable reporting character and word level
                            differences
  --textequiv-level LEVEL   PAGE TextEquiv level to extract text from
  --progress                Show progress bar
  --help                    Show this message and exit.
~~~
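
The help text says that ALTO and PAGE XML are detected and that anything else is treated as plain text. As a rough illustration of that idea (not dinglehopper's actual detection code), the following sketch classifies a file by its XML root element. The function name `detect_format` is hypothetical; the root element names `alto` and `PcGts` come from the ALTO and PAGE formats themselves.

```python
import xml.etree.ElementTree as ET


def detect_format(path):
    """Classify a file as 'alto', 'page' or 'text' by its XML root element.

    Hypothetical helper for illustration; dinglehopper may detect formats
    differently.
    """
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        return "text"  # not well-formed XML: fall back to plain text
    # Tags of namespaced elements look like '{namespace}name'; keep only 'name'.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name == "alto":
        return "alto"
    if local_name == "PcGts":
        return "page"
    return "text"  # some other XML document: also treated as plain text here
```

Checking the root element name rather than one exact namespace keeps the sketch independent of the schema version a file declares.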
For example:
~~~
dinglehopper some-document.gt.page.xml some-document.ocr.alto.xml
~~~
This generates `report.html` and `report.json`.
![dinglehopper displaying metrics and character differences](.screenshots/dinglehopper.png?raw=true)
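
The reports include the character error rate (CER) and the word error rate (WER). A common way to define such rates is the edit distance between GT and OCR, divided by the length of the GT. The sketch below follows that definition in plain Python. It is an assumption-laden illustration, not dinglehopper's implementation: the function names are hypothetical, and splitting text with `list()` and `str.split()` is a simplification that may differ from the tool's Unicode-aware segmentation and normalization.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete item_a
                    current[j - 1] + 1,      # insert item_b
                    previous[j - 1] + cost,  # substitute, or keep on a match
                )
            )
        previous = current
    return previous[-1]


def error_rate(gt_items, ocr_items):
    """Edit distance normalized by the number of ground truth items."""
    if not gt_items:
        # Assumption: an empty GT yields 0 for an empty OCR, infinity otherwise.
        return 0.0 if not ocr_items else float("inf")
    return levenshtein(gt_items, ocr_items) / len(gt_items)


def cer(gt, ocr):
    """Character error rate over code points (simplified)."""
    return error_rate(list(gt), list(ocr))


def wer(gt, ocr):
    """Word error rate over whitespace-separated tokens (simplified)."""
    return error_rate(gt.split(), ocr.split())
```

For example, `cer("abcd", "abxd")` gives one substitution over four GT characters. Because the distance is divided by the GT length only, a rate above 1 is possible when the OCR text is much longer than the GT.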
To compare whole folders of GT and OCR files in one batch, pass the folders
instead of single files:
~~~
dinglehopper gt/ ocr/ report output_folder/
~~~
This assumes that you have files with the same name in both folders, e.g.
`gt/00000001.page.xml` and `ocr/00000001.alto.xml`.
The example generates a report for each pair of files, with the prefix `report`, in the
(automatically created) folder `output_folder/`.
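
The file names in the example above differ in their extensions (`.page.xml` and `.alto.xml`) but share the part before the first dot. One way to picture the pairing is the following sketch, which matches files on that shared part. This is an assumption about the matching rule made for illustration, with hypothetical function names; dinglehopper's own matching may work differently.

```python
from pathlib import Path


def stem_before_first_dot(path):
    """Return the file name up to its first dot, e.g. '00000001' for '00000001.page.xml'."""
    return path.name.split(".", 1)[0]


def pair_by_stem(gt_dir, ocr_dir):
    """Pair files from two folders that share the name before the first dot.

    Hypothetical helper: files without a counterpart in the other folder
    are silently skipped.
    """
    gt_files = {stem_before_first_dot(p): p for p in Path(gt_dir).iterdir() if p.is_file()}
    ocr_files = {stem_before_first_dot(p): p for p in Path(ocr_dir).iterdir() if p.is_file()}
    common = sorted(gt_files.keys() & ocr_files.keys())
    return [(gt_files[stem], ocr_files[stem]) for stem in common]
```

Running `pair_by_stem("gt", "ocr")` before a batch comparison can help spot files that have no counterpart, since they will be missing from the returned list.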
By default, the JSON report does not contain the character and word differences, only
the calculated metrics. If you want to include the differences, use the
`--differences` flag:
~~~
dinglehopper gt/ ocr/ report output_folder/ --differences
~~~
### dinglehopper-summarize
A set of (JSON) reports can be summarized into a single set of
reports. This is useful after having generated reports in batch.
Example:
~~~
dinglehopper-summarize output_folder/
~~~
This generates `summary.html` and `summary.json` in the same `output_folder`.
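
To give an idea of what aggregating a folder of JSON reports can look like, the sketch below averages one numeric field over all `*.json` files in a folder. It is not the code behind `dinglehopper-summarize`, and the field name `cer` is an assumption about the report layout; check one of your own `report.json` files for the actual keys.

```python
import json
from pathlib import Path
from statistics import mean


def average_metric(folder, key="cer", skip=("summary.json",)):
    """Average a numeric top-level field over all JSON files in a folder.

    Hypothetical helper: `key` is an assumed field name, and files listed in
    `skip` (by default an existing summary) are left out.
    """
    values = []
    for path in sorted(Path(folder).glob("*.json")):
        if path.name in skip:
            continue
        with path.open(encoding="utf-8") as f:
            report = json.load(f)
        value = report.get(key) if isinstance(report, dict) else None
        # Ignore reports where the field is missing or not a number.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            values.append(value)
    return mean(values) if values else None
```

Note that a plain mean weights every page equally regardless of its length, which is a design choice of this sketch and not necessarily how the generated summary is computed.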
If you are summarizing many reports and have used the `--differences` flag while
generating them, it may be useful to limit the number of differences reported by using
the `--occurrences-threshold` parameter. This will reduce the size of the generated HTML
report, making it easier to open and navigate. Note that the JSON report will still
contain all differences. Example:
~~~
dinglehopper-summarize output_folder/ --occurrences-threshold 10
~~~
### dinglehopper-line-dirs
You may also want to compare a directory of GT text files (e.g. `gt/line0001.gt.txt`)
with a directory of OCR text files (e.g. `ocr/line0001.some-ocr.txt`). A separate
command-line tool is available for this:
~~~
dinglehopper-line-dirs gt/ ocr/
~~~
### dinglehopper-extract
The tool `dinglehopper-extract` extracts the text of the given input file and writes it to
stdout, for example:
~~~
dinglehopper-extract --textequiv-level line OCR-D-GT-PAGE/00000024.page.xml
~~~
### OCR-D
As an OCR-D processor:
~~~
ocrd-dinglehopper -I OCR-D-GT-PAGE,OCR-D-OCR-TESS -O OCR-D-OCR-TESS-EVAL
~~~
This generates HTML and JSON reports in the `OCR-D-OCR-TESS-EVAL` filegroup.
The OCR-D processor has these parameters:
| Parameter | Meaning |
| ------------------------- | ------------------------------------------------------------------- |
| `-P metrics false` | Disable metrics and the green-red color scheme (default: enabled) |
| `-P textequiv_level line` | (PAGE) Extract text from TextLine level (default: TextRegion level) |
For example:
~~~
ocrd-dinglehopper -I ABBYY-FULLTEXT,OCR-D-OCR-CALAMARI -O OCR-D-OCR-COMPARE-ABBYY-CALAMARI -P metrics false
~~~
Developer information
---------------------
*Please refer to [README-DEV.md](README-DEV.md).*