# ocrd_tesserocr

> Crop, deskew, segment into regions / tables / lines / words, or recognize with tesserocr

[![image](https://circleci.com/gh/OCR-D/ocrd_tesserocr.svg?style=svg)](https://circleci.com/gh/OCR-D/ocrd_tesserocr)
[![image](https://img.shields.io/pypi/v/ocrd_tesserocr.svg)](https://pypi.org/project/ocrd_tesserocr/)
[![image](https://codecov.io/gh/OCR-D/ocrd_tesserocr/branch/master/graph/badge.svg)](https://codecov.io/gh/OCR-D/ocrd_tesserocr)
[![Docker Automated build](https://img.shields.io/docker/automated/ocrd/tesserocr.svg)](https://hub.docker.com/r/ocrd/tesserocr/tags/)

## Introduction

This package offers [OCR-D](https://ocr-d.de/en/spec) compliant [workspace processors](https://ocr-d.de/en/spec/cli) for (much of) the functionality of [Tesseract](https://github.com/tesseract-ocr) via its Python API wrapper [tesserocr](https://github.com/sirfz/tesserocr). (Each processor is a parameterizable step in a configurable [workflow](https://ocr-d.de/en/workflows) of the [OCR-D functional model](https://ocr-d.de/en/about). There are usually various alternative processor implementations for each step. Data is represented with [METS](https://ocr-d.de/en/spec/mets) and [PAGE](https://ocr-d.de/en/spec/page).)

It includes image preprocessing (cropping, binarization, deskewing), layout analysis (region, table, line, word segmentation), script identification, font style recognition and text recognition. 

Most processors can operate on different levels of the PAGE hierarchy, depending on the workflow configuration. In PAGE, image results are referenced (read and written) via `AlternativeImage`, text results via `TextEquiv`, font attributes via `TextStyle`, script via `@primaryScript`, deskewing via `@orientation`, cropping via `Border` and segmentation via `Region` / `TextLine` / `Word` elements with `Coords/@points`.
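
For instance, a minimal hedged sketch of the same processor applied at two different levels (the fileGrp names are illustrative; `operation_level` is the parameter mentioned under Usage below):

```sh
# deskew whole pages vs. individual regions (fileGrp names are examples)
ocrd-tesserocr-deskew -I OCR-D-BIN -O OCR-D-DESKEW-PAGE -P operation_level page
ocrd-tesserocr-deskew -I OCR-D-BIN -O OCR-D-DESKEW-REGION -P operation_level region
```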

## Installation

### With docker

This is the best option if you want to run the software in a container.

You need to have [Docker](https://docs.docker.com/install/linux/docker-ce/ubuntu/) installed:


    docker pull ocrd/tesserocr


To run with docker:


    docker run -v path/to/workspaces:/data ocrd/tesserocr ocrd-tesserocr-crop ...
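
For example, a hedged sketch of a complete invocation on one workspace (a directory containing a `mets.xml`) inside the mounted path; the workspace and fileGrp names are illustrative only:

```sh
docker run --rm -u $(id -u) -v /path/to/workspaces:/data -w /data/my-workspace ocrd/tesserocr \
    ocrd-tesserocr-crop -I OCR-D-IMG -O OCR-D-CROP
```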


### From PyPI and PPA

This is the best option if you want to use the stable, released version.

---

**NOTE**

ocrd_tesserocr requires **Tesseract >= 4.1.0**. The Tesseract packages
bundled with **Ubuntu < 19.10** are too old. If you are on Ubuntu 18.04 LTS,
please use [Alexander Pozdnyakov's PPA](https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr) repository,
which has up-to-date builds of Tesseract and its dependencies:

```sh
sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update
```

---

```sh
sudo apt-get install python3 python3-pip libtesseract-dev libleptonica-dev tesseract-ocr wget
pip install ocrd_tesserocr
```

### From git

Use this option if you want to change the source code or install the latest, unpublished changes.

We strongly recommend using [venv](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).
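
For example, a minimal sketch of creating and activating one (the directory name `venv` is arbitrary):

```sh
python3 -m venv venv
source venv/bin/activate
```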

```sh
git clone https://github.com/OCR-D/ocrd_tesserocr
cd ocrd_tesserocr
# install Tesseract:
sudo make deps-ubuntu # or manually from git or via ocrd_all
# install tesserocr and ocrd_tesserocr:
make deps        # or pip install -r requirements.txt
make install     # or pip install .
```

## Models

Tesseract comes with synthetically trained models for languages (`tesseract-ocr-{eng,deu,frk,...}`) or scripts (`tesseract-ocr-script-{latn,frak,...}`). In addition, various models [trained](https://github.com/tesseract-ocr/tesstrain) on scan data are available from the community.

Since all OCR-D processors must resolve file/data resources
in a [standardized way](https://ocr-d.de/en/spec/cli#processor-resources),
and we want to stay interoperable with standalone Tesseract
(which uses a single compile-time `tessdata` directory),
`ocrd-tesserocr-recognize` expects the recognition models to be installed
in its [module](https://ocr-d.de/en/spec/ocrd_tool#file-parameters) **resource location** only.
The `module` location is determined by the underlying Tesseract installation
(compile-time `tessdata` directory, or run-time `$TESSDATA_PREFIX` environment variable).
Other resource locations (data/system/cwd) will be ignored, and should not be used
when installing models with the **Resource Manager** (`ocrd resmgr download`).

For a full description of available commands for resource management, see:

    ocrd resmgr --help
    ocrd resmgr list-available --help
    ocrd resmgr download --help
    ocrd resmgr list-installed --help

(In previous versions, the resource locations of standalone Tesseract and the OCR-D wrapper were different.
 If you already have models under `$XDG_DATA_HOME/ocrd-resources/ocrd-tesserocr-recognize`,
 usually `~/.local/share/ocrd-resources/ocrd-tesserocr-recognize`, then consider moving them
 to the new default location (as printed by `ocrd-tesserocr-recognize -D`),
 usually `/usr/share/tesseract-ocr/4.00/tessdata`, _or_ alternatively overriding the module directory
 by setting `TESSDATA_PREFIX=$XDG_DATA_HOME/ocrd-resources/ocrd-tesserocr-recognize` in the environment.)

Cf. [OCR-D model guide](https://ocr-d.de/en/models).

Models always use the filename suffix `.traineddata`, but are just loaded by their basename.
You will need **at least** `eng` and `osd` installed (even for segmentation and deskewing),
probably also `Latin` and `Fraktur` etc.
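
For example, a minimal sketch of installing these via the Resource Manager into the module location (model names as shown by `ocrd resmgr list-available`):

```sh
ocrd resmgr download ocrd-tesserocr-recognize eng.traineddata
ocrd resmgr download ocrd-tesserocr-recognize osd.traineddata
# optionally, generic script models:
ocrd resmgr download ocrd-tesserocr-recognize Latin.traineddata
ocrd resmgr download ocrd-tesserocr-recognize Fraktur.traineddata
```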

As of v0.13.1, you can configure `ocrd-tesserocr-recognize` to select models **dynamically** segment by segment,
either via custom conditions on the PAGE-XML annotation (expressed as XPath rules),
or by automatically choosing the model with highest confidence.

## Usage

For details, see docstrings in the individual processors
and [ocrd-tool.json](ocrd_tesserocr/ocrd-tool.json) descriptions,
or simply `--help`.

Available [OCR-D processors](https://ocr-d.de/en/spec/cli) are (an example invocation follows the list):

- [ocrd-tesserocr-crop](ocrd_tesserocr/crop.py)
  (simplistic)
  - sets `Border` of pages and adds `AlternativeImage` files to the output fileGrp
- [ocrd-tesserocr-deskew](ocrd_tesserocr/deskew.py)
  (for skew and orientation; mind `operation_level`)
  - sets `@orientation` of regions or pages and adds `AlternativeImage` files to the output fileGrp
- [ocrd-tesserocr-binarize](ocrd_tesserocr/binarize.py)
  (Otsu – not recommended, unless already binarized and using `tiseg`)
  - adds `AlternativeImage` files to the output fileGrp
- [ocrd-tesserocr-recognize](ocrd_tesserocr/recognize.py)
  (optionally including segmentation; mind `segmentation_level` and `textequiv_level`)
  - adds `TextRegion`s, `TableRegion`s, `ImageRegion`s, `MathsRegion`s, `SeparatorRegion`s,
    `NoiseRegion`s, `ReadingOrder` and `AlternativeImage` to `Page` and sets their `@orientation` (optionally)
  - adds `TextRegion`s to `TableRegion`s and sets their `@orientation` (optionally)
  - adds `TextLine`s to `TextRegion`s (optionally)
  - adds `Word`s to `TextLine`s (optionally)
  - adds `Glyph`s to `Word`s (optionally)
  - adds `TextEquiv`
- [ocrd-tesserocr-segment](ocrd_tesserocr/segment.py)
  (all-in-one segmentation – recommended; delegates to `recognize`)
  - adds `TextRegion`s, `TableRegion`s, `ImageRegion`s, `MathsRegion`s, `SeparatorRegion`s,
    `NoiseRegion`s, `ReadingOrder` and `AlternativeImage` to `Page` and sets their `@orientation`
  - adds `TextRegion`s to `TableRegion`s and sets their `@orientation`
  - adds `TextLine`s to `TextRegion`s
  - adds `Word`s to `TextLine`s
  - adds `Glyph`s to `Word`s
- [ocrd-tesserocr-segment-region](ocrd_tesserocr/segment_region.py)
  (only regions – with overlapping bboxes; delegates to `recognize`)
  - adds `TextRegion`s, `TableRegion`s, `ImageRegion`s, `MathsRegion`s, `SeparatorRegion`s,
    `NoiseRegion`s and `ReadingOrder` to `Page` and sets their `@orientation`
- [ocrd-tesserocr-segment-table](ocrd_tesserocr/segment_table.py)
  (only table cells; delegates to `recognize`)
  - adds `TextRegion`s to `TableRegion`s
- [ocrd-tesserocr-segment-line](ocrd_tesserocr/segment_line.py)
  (only lines – from overlapping regions; delegates to `recognize`)
  - adds `TextLine`s to `TextRegion`s
- [ocrd-tesserocr-segment-word](ocrd_tesserocr/segment_word.py)
  (only words; delegates to `recognize`)
  - adds `Word`s to `TextLine`s
- [ocrd-tesserocr-fontshape](ocrd_tesserocr/fontshape.py)
  (only text style – via Tesseract 3 models)
  - adds `TextStyle` to `Word`s
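
For example, a hedged sketch combining two of these processors on an existing workspace (the fileGrp names and the `Fraktur` model are illustrative only):

```sh
# segment pages into regions/lines/words, then recognize down to glyph level
ocrd-tesserocr-segment   -I OCR-D-IMG -O OCR-D-SEG
ocrd-tesserocr-recognize -I OCR-D-SEG -O OCR-D-OCR -P model Fraktur -P textequiv_level glyph
```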

The text region `@type`s detected are (from Tesseract's [PolyBlockType](https://github.com/tesseract-ocr/tesseract/blob/11297c983ec7f5c9765d7fa4faa48f5150cf2d38/include/tesseract/publictypes.h#L52-L69)):
- `paragraph`: normal block (aligned with others in the column)
- `floating`: unaligned block (`is in a cross-column pull-out region`)
- `heading`: block that `spans more than one column`
- `caption`: block for `text that belongs to an image`

If you are unhappy with these choices, then consider post-processing
with a dedicated custom processor in Python, or by modifying the PAGE files directly
(e.g. `xmlstarlet ed --inplace -u '//pc:TextRegion/@type[.="floating"]' -v paragraph filegrp/*.xml`).

By default, all segmentation is currently done as **bounding boxes** only,
i.e. without precise polygonal outlines. For dense page layouts this means
that neighbouring regions and neighbouring text lines may overlap a lot.
If this is a problem for your workflow, try post-processing like so (see the sketch after this list):
- after line segmentation: use `ocrd-cis-ocropy-resegment` for polygonalization,
  or `ocrd-cis-ocropy-clip` on the line level
- after region segmentation: use `ocrd-segment-repair` with `plausibilize`
  (and `sanitize` after line segmentation)
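
For example, a hedged sketch of such a repair step after region and line segmentation (processors from ocrd_segment and ocrd_cis; fileGrp names are illustrative):

```sh
ocrd-segment-repair       -I OCR-D-SEG -O OCR-D-SEG-REPAIR -P plausibilize true -P sanitize true
ocrd-cis-ocropy-resegment -I OCR-D-SEG-REPAIR -O OCR-D-SEG-RESEG
```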

It also means that Tesseract should be allowed to segment across multiple hierarchy levels
at once, to avoid introducing inconsistent/duplicate text line assignments in text regions,
or word assignments in text lines. Hence (see the sketch after this list),
- prefer `ocrd-tesserocr-recognize` with `segmentation_level=region`  
  over `ocrd-tesserocr-segment` followed by `ocrd-tesserocr-recognize`,  
  if you want to do all in one with Tesseract,
- prefer `ocrd-tesserocr-recognize` with `segmentation_level=line`  
  over `ocrd-tesserocr-segment-line` followed by `ocrd-tesserocr-recognize`,  
  if you want to do everything but region segmentation with Tesseract,
- prefer `ocrd-tesserocr-segment` over `ocrd-tesserocr-segment-region`  
  followed by (`ocrd-tesserocr-segment-table` and) `ocrd-tesserocr-segment-line`,  
  if you want to do everything but recognition with Tesseract.
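
For instance, a hedged sketch of the first (all-in-one) recommendation (fileGrp names and model are illustrative):

```sh
# let Tesseract segment and recognize in one pass
ocrd-tesserocr-recognize -I OCR-D-IMG -O OCR-D-OCR \
    -P segmentation_level region -P textequiv_level glyph -P model Fraktur
```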

However, you can also run `ocrd-tesserocr-segment*` and `ocrd-tesserocr-recognize`
with `shrink_polygons=True` to get **polygons** by post-processing each segment,
shrinking to the convex hull of all its symbol outlines.
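
A minimal sketch of that option (parameter as named above; fileGrp names are illustrative):

```sh
ocrd-tesserocr-segment -I OCR-D-IMG -O OCR-D-SEG -P shrink_polygons true
```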

## Testing

```sh
make test
```

This downloads some test data from https://github.com/OCR-D/assets under `repo/assets`,
and runs some basic tests of the Python API as well as the CLIs.

Set `PYTEST_ARGS="-s --verbose"` to see log output (`-s`) and individual test results (`--verbose`).
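
For example (assuming the Makefile forwards the variable to pytest):

```sh
make test PYTEST_ARGS="-s --verbose"
```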

            
