digital-eval

Name: digital-eval
Version: 1.5.3
Summary: Evaluate Mass Digitalization Data
Upload time: 2023-06-14 12:49:03
Maintainer: Uwe Hartwig
Author: Universitäts- und Landesbibliothek Sachsen-Anhalt
Requires Python: >=3.8
# digital eval

![example workflow](https://github.com/ulb-sachsen-anhalt/digital-eval/actions/workflows/python-app.yml/badge.svg)
[![PyPi version](https://badgen.net/pypi/v/digital-eval/)](https://pypi.org/project/digital-eval) ![PyPI - Downloads](https://img.shields.io/pypi/dm/digital-eval) ![PyPI - License](https://img.shields.io/pypi/l/digital-eval) ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/digital-eval)

A Python3 tool to report evaluation outcomes from mass digitalization workflows.

## Features

* automatically match groundtruth (i.e. reference data) and candidates by filename
* use geometric information to evaluate only a specific frame (i.e. a specific column or region of a large page) of
  the candidates (requires ALTO or PAGE format)
* aggregate evaluation outcomes over a domain hierarchy (with multiple subdomains)
* choose from textual metrics based on characters or words, plus common Information Retrieval metrics
* choose between accuracy / error rate and different Unicode normalization forms
* formats: ALTO, PAGE or plain text for both groundtruth and candidates
* speed up evaluation with parallel execution
* additional OCR util:
  * filter custom areas of single OCR files

## Installation

```bash
pip install digital-eval
```

## Usage

### Metrics

Calculate similarity (`acc`) or difference (`err`) ratios between a single reference/groundtruth item and a test/candidate item.

#### Edit-Distance based

Character-based: the full text string without whitespace (`Cs`, `Characters`), or letter-based (`Ls`, `Letters`), which
additionally drops punctuation and digits.
Word/token-based: edit distance over single tokens split at whitespace.
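For intuition, character accuracy derived from an edit distance might be sketched as follows. This is an illustration only, not digital-eval's actual implementation; `levenshtein` and `char_accuracy` are hypothetical helper names:

```python
def levenshtein(ref, cand):
    """Plain edit distance: insert, delete and substitute each cost 1."""
    prev = list(range(len(cand) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, c in enumerate(cand, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != c)))  # substitution
        prev = curr
    return prev[-1]

def char_accuracy(ref, cand):
    """Similarity ratio in [0, 1]; the error rate would be 1 - accuracy."""
    ref = "".join(ref.split())    # drop whitespace, as for `Cs`
    cand = "".join(cand.split())
    if not ref:
        return 1.0 if not cand else 0.0
    return max(0.0, 1 - levenshtein(ref, cand) / len(ref))
```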

#### Set based

Calculate the union of token/word sets (`BoW`, `BagOfWords`).
Operate on token/word sets with respect to language-specific stopwords, using the [nltk](https://www.nltk.org/)
framework, for:

* Precision (`IRPre`, `Pre`, `Precision`): how many tokens of the candidate are also in the groundtruth reference?
* Recall (`IRRec`, `Rec`, `Recall`): how many tokens of the groundtruth reference does the candidate include?
* F-Measure (`IRFMeasure`, `FM`): weighted combination of Precision and Recall
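A minimal sketch of these set-based scores on whitespace tokens (illustration only; `ir_scores` is a hypothetical helper, and the language-specific stopword filtering via nltk is omitted):

```python
def ir_scores(reference, candidate):
    """Return (precision, recall, f-measure) over token sets."""
    ref = set(reference.split())
    cand = set(candidate.split())
    hits = len(ref & cand)                       # tokens in both sets
    precision = hits / len(cand) if cand else 0.0
    recall = hits / len(ref) if ref else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```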

### UTF-8 Normalisations

Uses Python's standard implementation of Unicode normalization forms (module `unicodedata`); default: `NFKD`.
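Normalization matters because visually identical strings can differ at the code-point level. `NFKD` decomposes characters, so composed and decomposed spellings compare equal:

```python
import unicodedata

composed = "Caf\u00e9"     # "é" as a single code point (U+00E9)
decomposed = "Cafe\u0301"  # "e" followed by a combining acute accent

# Without normalization the strings differ; after NFKD they are equal.
assert composed != decomposed
assert (unicodedata.normalize("NFKD", composed)
        == unicodedata.normalize("NFKD", decomposed))
```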

### Statistics

Statistics, calculated via [numpy](https://numpy.org/), include the arithmetic mean, the median and outlier detection
via the interquartile range; they are computed against the specific groundtruth/reference (ref) for each metric, i.e.
characters, letters or tokens.
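A sketch of the kind of statistics described above, using numpy's own routines (illustration only; the sample scores and the 1.5×IQR fence are assumptions, not digital-eval's exact code):

```python
import numpy as np

# Hypothetical per-page accuracy scores; one page is clearly degraded.
scores = np.array([95.2, 96.1, 94.8, 95.5, 60.0, 96.3])

mean = scores.mean()
median = np.median(scores)
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1                                  # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # common outlier fences
outliers = scores[(scores < lower) | (scores > upper)]
```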

### Evaluate treelike structures

To evaluate OCR candidate data batch-wise against existing groundtruth, please make sure that your directory
structures match this layout:

```bash
groundtruth root/
├── <domain>/ 
│    └── <subdomain>/
│         └── <page-01>.gt.xml
candidate root/
├── <domain>/ 
│    └── <subdomain>/
│         └── <page-01>.xml
```

Now call via:

```bash
digital-eval <path-candidate-root>/domain/ -ref <path-groundtruth>/domain/
```

for an aggregated overview on stdout. Increase verbosity via `-v` (or even `-vv`) to get detailed information about
each single data set that was evaluated.

Structured OCR is expected to contain valid geometrical and textual data at word level, although for recent PAGE
formats line level is also possible.

### Data problems

Inconsistent OCR groundtruth with empty texts (ALTO `String` elements missing `CONTENT`, or PAGE without `TextEquiv`)
or invalid geometrical coordinates (fewer than 3 points, or even empty) will lead to evaluation errors if geometry
must be respected.

## Additional OCR Utils

### Filter Area

You can filter a custom area of a page in an OCR file by providing the points of an arbitrary shape.
The format of the `-p, --points` argument is `<pt_1_x>,<pt_1_y> <pt_2_x>,<pt_2_y> <pt_3_x>,<pt_3_y> ... <pt_n_x>,<pt_n_y>`. Simple rectangular areas can also be expressed with just two points, the first being top left and the second bottom right: `<pt_top_left_x>,<pt_top_left_y> <pt_bottom_right_x>,<pt_bottom_right_y>`.

The following example filters a rectangular area of 600x400 pixels from a page described by an input ALTO file and saves the result to an output ALTO file:

```bash
ocr-util frame -i page_1.alto.xml -p "0,0 600,0 600,400 0,400" -o page_1_area.alto.xml
```

Short version with top left and bottom right:

```bash
ocr-util frame -i page_1.alto.xml -p "0,0 600,400" -o page_1_area.alto.xml
```

## Development

Platform: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz, 16 GB RAM, Ubuntu 20.04 LTS, Python 3.8.

```bash
# clone local
git clone <repository-url> <local-dir>
cd <local-dir>

# enable virtual python environment (linux)
# and install libraries
python3 -m venv venv
. venv/bin/activate
python -m pip install -U pip
python -m pip install -r requirements.txt

# install
pip install .

# optional:
# install additional development dependencies
pip install -r tests/test_requirements.txt
pytest -v

# run
digital-eval --help
```

## Contribute

Contributions, suggestions and proposals welcome!

## Licence

Under terms of the [MIT license](https://opensource.org/licenses/MIT).

**NOTE**: This software depends on other packages that _may_ be licensed under different open source licenses.

            
