hocr-tools-lib

Name	hocr-tools-lib
Version	1.1.0
home_page	None
Summary	Advanced tools for hOCR integration (library version)
upload_time	2024-07-23 03:47:51
maintainer	None
docs_url	None
author	Thomas Breuel, stefan6419846 (library version)
requires_python	<4,>=3.6
license	Apache-2.0
keywords	ocr hocr xhtml
requirements	No requirements were recorded.

# hocr-tools

## About

hOCR is a format for representing OCR output, including layout information,
character confidences, bounding boxes, and style information.
It embeds this information invisibly in standard HTML.
By building on standard HTML, it automatically inherits well-defined support
for most scripts, languages, and common layout options.
Furthermore, unlike previous OCR formats, the recognized text and OCR-related
information co-exist in the same file and survive editing and manipulation.
hOCR markup is independent of the presentation.

There is a [Public Specification](http://kba.github.io/hocr-spec/1.2/) for the hOCR Format.
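
To make the embedding concrete, here is a minimal sketch of how the layout data can be read back out of an hOCR file. It relies only on the Python standard library and on the hOCR convention of storing properties such as `bbox` in the `title` attribute. The names `parse_title` and `BBoxCollector` are made up for this illustration and are not part of this package's API, and the naive splitting on `;` assumes that no property value contains a semicolon.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Split an hOCR title attribute into a {property: [values]} dict."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if fields:
            props[fields[0]] = fields[1:]
    return props


class BBoxCollector(HTMLParser):
    """Collect (class, bbox) pairs for every element that carries a bbox."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        props = parse_title(attrs.get("title") or "")
        if "bbox" in props:
            bbox = tuple(int(value) for value in props["bbox"])
            self.boxes.append((attrs.get("class"), bbox))


# Hand-written sample markup, only for illustration.
sample = """
<div class='ocr_page' title='image "page.png"; bbox 0 0 800 600'>
  <span class='ocr_line' title='bbox 10 20 300 50'>
    <span class='ocrx_word' title='bbox 10 20 90 50; x_wconf 93'>Hello</span>
  </span>
</div>
"""

collector = BBoxCollector()
collector.feed(sample)
for css_class, bbox in collector.boxes:
    print(css_class, bbox)
```

Because the OCR data lives in ordinary attributes, a browser renders the sample as plain text while a parser like the one above can still recover the geometry.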

### About this fork

This repository contains my own fork of the package with a number of changes:

* Allow usage as a library, which also reduces code duplication.
* Migrate tests to plain *unittest*-based ones instead of some external framework.
* Remove some deprecated code to make it compatible with the latest Python 3 versions.
* Add type hints.

For now, I do not have any direct plans to send a corresponding PR. Unfortunately, as with
quite a few similar OCR-related tools, development is rather inactive (at least inside the
official GitHub repositories). Some deprecations, as well as the library support (which I
primarily need), have been discussed for a long time with no real progress.

## Installation

You can install hocr-tools along with its dependencies from
[PyPI](https://pypi.python.org/pypi/hocr-tools-lib):

```sh
pip install hocr-tools-lib
```

Or from the Git checkout:

```
pip install .
```

## Available Programs

Included command line programs:

### hocr-check

```
hocr-check file.html
```

Perform consistency checks on the hOCR file.

### hocr-combine

```
hocr-combine file1.html [file2.html ...]
```

Combine the OCR pages contained in each HTML file into a single document.
The document metadata is taken from the first file.

### hocr-cut

```
hocr-cut [-h] [-d] [file.html]
```

Cut a page (horizontally) into two pages in the middle
such that most of the bounding boxes are separated
nicely, e.g. when cutting double pages or double columns.
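
The idea of cutting "such that most bounding boxes are separated nicely" can be sketched as a search for the vertical line near the page middle that crosses the fewest boxes. The following is an illustration of that idea only, not the algorithm `hocr-cut` actually uses; the function name `best_cut` and the `search_fraction` parameter are assumptions made for this example.

```python
def best_cut(boxes, page_width, search_fraction=0.2):
    """Return (x, hits): the x position near the page middle crossing the fewest boxes.

    boxes is a list of (x0, y0, x1, y1) tuples; search_fraction is the share
    of the page width, centred on the middle, that is searched.
    """
    middle = page_width // 2
    radius = int(page_width * search_fraction / 2)
    best_x, best_hits = middle, None
    # Walk outwards from the middle so that ties favour central positions.
    for offset in range(radius + 1):
        for x in (middle - offset, middle + offset):
            hits = sum(1 for x0, _, x1, _ in boxes if x0 < x < x1)
            if best_hits is None or hits < best_hits:
                best_x, best_hits = x, hits
    return best_x, best_hits


# Two columns whose gutter is slightly right of the exact page middle.
boxes = [
    (50, 0, 480, 30), (50, 40, 510, 70),      # left column
    (530, 0, 950, 30), (520, 40, 950, 70),    # right column
]
print(best_cut(boxes, page_width=1000))  # (510, 0)
```

A cut at exactly x = 500 would slice through the second left-column box, while x = 510 separates all four boxes.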

### hocr-eval-lines

```
hocr-eval-lines [-v] true-lines.txt hocr-actual.html
```

Evaluate hOCR output against ASCII ground truth. This evaluation method
requires that the line breaks in true-lines.txt and the ocr_line elements
in hocr-actual.html agree (most ASCII output from OCR systems satisfies this
requirement).
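
As a rough illustration of this kind of line-wise evaluation, the sketch below pairs ground-truth lines with OCR lines by position and sums their edit distances. It is not the code behind `hocr-eval-lines`; `edit_distance` and `evaluate_lines` are hypothetical helper names, and raising an error on differing line counts is an assumption about how to handle the stated requirement.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def evaluate_lines(truth_lines, ocr_lines):
    """Return (total edit distance, total ground-truth characters)."""
    if len(truth_lines) != len(ocr_lines):
        raise ValueError("line counts differ; lines cannot be paired by position")
    errors = sum(edit_distance(t, o) for t, o in zip(truth_lines, ocr_lines))
    total = sum(len(t) for t in truth_lines)
    return errors, total


truth = ["hello world", "second line"]
ocr = ["hel1o world", "second lime"]
print(evaluate_lines(truth, ocr))  # (2, 22)
```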

### hocr-eval-geom

```
hocr-eval-geom [-e element-name] [-o overlap-threshold] hocr-truth hocr-actual
```

Compare the segmentations at the level of the element name (default: ocr_line).
Computes undersegmentation, oversegmentation, and missegmentation.

### hocr-eval

```
hocr-eval hocr-true.html hocr-actual.html
```

Evaluate the actual OCR with respect to the ground truth. This outputs
the number of OCR errors due to incorrect segmentation and the number
of OCR errors due to character recognition errors.

It works by aligning segmentation components geometrically, and for each
segmentation component that can be aligned, computing the string edit distance
of the text the segmentation component contains.
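
The geometric alignment step can be pictured as follows. For each ground-truth box, the sketch picks the OCR box with the largest overlap and accepts the pair only if the overlap covers enough of the ground-truth box. The greedy matching, the 0.5 threshold, and the names `overlap_area` and `align` are assumptions for this illustration rather than a description of what `hocr-eval` does internally. The text of each accepted pair could then be compared with a string edit distance.

```python
def area(box):
    """Area of an (x0, y0, x1, y1) box; degenerate boxes have area 0."""
    return max(box[2] - box[0], 0) * max(box[3] - box[1], 0)


def overlap_area(a, b):
    """Area of the intersection of two (x0, y0, x1, y1) boxes."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


def align(truth, actual, threshold=0.5):
    """Pair ground-truth components with OCR components by overlap.

    truth and actual are lists of (bbox, text). Returns (pairs, unmatched),
    where pairs holds (truth_text, actual_text) tuples and unmatched holds
    the ground-truth texts that have no sufficiently overlapping OCR box.
    """
    pairs, unmatched = [], []
    for t_box, t_text in truth:
        best = max(actual, key=lambda item: overlap_area(t_box, item[0]), default=None)
        matched = (
            best is not None
            and area(t_box) > 0
            and overlap_area(t_box, best[0]) >= threshold * area(t_box)
        )
        if matched:
            pairs.append((t_text, best[1]))
        else:
            unmatched.append(t_text)
    return pairs, unmatched


truth = [((0, 0, 100, 20), "first line"), ((0, 30, 100, 50), "second line")]
actual = [((2, 1, 98, 21), "first 1ine"), ((0, 60, 100, 80), "stray")]
print(align(truth, actual))
# ([('first line', 'first 1ine')], ['second line'])
```

In this picture, unmatched components correspond to segmentation errors, while differences inside matched pairs correspond to character recognition errors.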

### hocr-extract-g1000

Extract lines from the [Google 1000 book sample](http://commondatastorage.googleapis.com/books/icdar2007/README.txt).

### hocr-extract-images

```
hocr-extract-images [-b BASENAME] [-p PATTERN] [-e ELEMENT] [-P PADDING] [file]
```

Extract the images and texts within all the ocr_line elements within the hOCR file.
The `BASENAME` is the image directory, the default pattern is `line-%03d.png`,
the default element is `ocr_line` and there is no extra padding by default.

### hocr-lines

```
hocr-lines [FILE]
```

Extract the text within all the ocr_line elements within the hOCR file
given by FILE. If called without any file, `hocr-lines` reads
hOCR data from stdin.
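
For readers who want to see what "the text within all the ocr_line elements" means in practice, here is a small standard-library sketch. It is not the implementation of `hocr-lines`; the `LineExtractor` class and `extract_lines` function are hypothetical names, and collapsing runs of whitespace is a choice made for this example.

```python
from html.parser import HTMLParser

# Elements without a closing tag must not affect the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LineExtractor(HTMLParser):
    """Collect the text content of every element with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0   # nesting depth inside the current ocr_line, 0 = outside
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif "ocr_line" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._parts = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Join the collected text and collapse whitespace runs.
            self.lines.append(" ".join("".join(self._parts).split()))

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def extract_lines(html):
    parser = LineExtractor()
    parser.feed(html)
    return parser.lines


sample = """
<div class='ocr_page'>
 <span class='ocr_line' title='bbox 10 20 300 50'><span class='ocrx_word'>Hello</span> <span class='ocrx_word'>world</span></span>
 <span class='ocr_line'>Second   line</span>
</div>
"""
print(extract_lines(sample))  # ['Hello world', 'Second line']
```

To mimic the stdin behaviour described above, one could pass `sys.stdin.read()` to `extract_lines`.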

### hocr-merge-dc

```
hocr-merge-dc dc.xml hocr.html > hocr-new.html
```

Merges the Dublin Core metadata into the hOCR file by encoding the data in its header.

### hocr-pdf

```
hocr-pdf <imgdir> > out.pdf
hocr-pdf --savefile out.pdf <imgdir>
```

Create a searchable PDF from a pile of hOCR and JPEG files. The corresponding JPEG and hOCR files must share the same name, differing only in their file extension, and all of them should lie in one directory, which is passed as an argument to the command. For example, `hocr-pdf . > out.pdf` runs the command on the current directory and saves the output as `out.pdf`; alternatively, `hocr-pdf . --savefile out.pdf` avoids routing the output through the terminal.

### hocr-split

```
hocr-split file.html pattern
```

Split a multipage hOCR file into hOCR files containing one page each.
The pattern should be something like "base-%03d.html".
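
Such a pattern is a printf-style format string in which `%03d` is replaced by a zero-padded, three-digit number. The snippet below shows how it expands in Python; whether the page numbering of `hocr-split` starts at 0 or 1 is not stated here, so starting at 1 is an assumption of this example.

```python
pattern = "base-%03d.html"

# Expand the pattern for three hypothetical pages, numbered from 1.
names = [pattern % page for page in range(1, 4)]
print(names)  # ['base-001.html', 'base-002.html', 'base-003.html']
```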

### hocr-wordfreq

```
hocr-wordfreq [-h] [-i] [-n MAX] [-s] [-y] [file.html]
```

Outputs a list of the most frequent words in an hOCR file with their number of occurrences.
If called without any file, `hocr-wordfreq` reads hOCR data (for example from `hocr-combine`) from stdin.

By default, the first 10 words are shown, but any number can be requested with `-n`.
Use `-i` to ignore upper and lower case and `-s` to split on spaces only, which will
lead to words that also contain punctuation. With `-y`, the tool tries to dehyphenate the text
(rejoining words separated by a hyphen at a line break) before analysis.
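
The effect of these options can be sketched on plain text with the standard library. This is an illustration under assumptions, not the code of `hocr-wordfreq`: the function name `word_frequencies`, its keyword arguments, and the regular expressions used for tokenizing and dehyphenating are all invented for this example.

```python
import re
from collections import Counter


def word_frequencies(text, max_words=10, ignore_case=False,
                     spaces_only=False, dehyphenate=False):
    """Return the most frequent words as a list of (word, count) pairs."""
    if dehyphenate:
        # Rejoin words that were split with a hyphen at a line break.
        text = re.sub(r"-[ \t]*\n[ \t]*", "", text)
    if ignore_case:
        text = text.lower()
    if spaces_only:
        words = text.split()                # punctuation stays attached
    else:
        words = re.findall(r"\w+", text)    # punctuation is dropped
    return Counter(words).most_common(max_words)


text = "The quick fox. The lazy dog, the end.\nA hyphen-\nated word."
print(word_frequencies(text, max_words=1, ignore_case=True))  # [('the', 3)]
```

With `spaces_only=True`, tokens such as `fox.` keep their trailing punctuation, which mirrors the behaviour described for `-s`.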

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "hocr-tools-lib",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4,>=3.6",
    "maintainer_email": null,
    "keywords": "ocr, hocr, xhtml",
    "author": "Thomas Breuel, stefan6419846 (library version)",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/b5/28/5212b3228e854d2b9497e1e730af28b8087a85659c34c86b56f0b9b0855d/hocr_tools_lib-1.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "Advanced tools for hOCR integration (library version)",
    "version": "1.1.0",
    "project_urls": {
        "Changelog": "https://github.com/stefan6419846/hocr-tools/blob/next/CHANGELOG.md",
        "Documentation": "https://hocr-tools-lib.readthedocs.io",
        "Homepage": "https://github.com/stefan6419846/hocr-tools",
        "Issues": "https://github.com/stefan6419846/hocr-tools/issues",
        "Repository": "https://github.com/stefan6419846/hocr-tools"
    },
    "split_keywords": [
        "ocr",
        " hocr",
        " xhtml"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "edab92c6d91c0decb2d621fc2720893c154730eeb1e707f6d93543cfa515001e",
                "md5": "98c3bf450c7a5fe921eaf4fa67d80bf5",
                "sha256": "88a62521320e6a19dbcd0a2a6f314d312cf33c06a5b134c33a9f8fb450981094"
            },
            "downloads": -1,
            "filename": "hocr_tools_lib-1.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "98c3bf450c7a5fe921eaf4fa67d80bf5",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4,>=3.6",
            "size": 31246,
            "upload_time": "2024-07-23T03:47:49",
            "upload_time_iso_8601": "2024-07-23T03:47:49.948469Z",
            "url": "https://files.pythonhosted.org/packages/ed/ab/92c6d91c0decb2d621fc2720893c154730eeb1e707f6d93543cfa515001e/hocr_tools_lib-1.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "b5285212b3228e854d2b9497e1e730af28b8087a85659c34c86b56f0b9b0855d",
                "md5": "a7e7eb84c91680207bc44768b928d4d6",
                "sha256": "c4165a00fce22ccc0f4902b6a866a30fa9bc7a31f48b9be102415e6fca3cfb53"
            },
            "downloads": -1,
            "filename": "hocr_tools_lib-1.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "a7e7eb84c91680207bc44768b928d4d6",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4,>=3.6",
            "size": 26712,
            "upload_time": "2024-07-23T03:47:51",
            "upload_time_iso_8601": "2024-07-23T03:47:51.308604Z",
            "url": "https://files.pythonhosted.org/packages/b5/28/5212b3228e854d2b9497e1e730af28b8087a85659c34c86b56f0b9b0855d/hocr_tools_lib-1.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-07-23 03:47:51",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "stefan6419846",
    "github_project": "hocr-tools",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "hocr-tools-lib"
}
