# OCR-D/core
> Python modules implementing [OCR-D specs](https://github.com/OCR-D/spec) and related tools
<!-- BEGIN-MARKDOWN-TOC -->
* [Introduction](#introduction)
* [Installation](#installation)
* [Command line tools](#command-line-tools)
  * [`ocrd` CLI](#ocrd-cli)
  * [`ocrd-dummy` CLI](#ocrd-dummy-cli)
* [Configuration](#configuration)
* [Packages](#packages)
  * [ocrd_utils](#ocrd_utils)
  * [ocrd_models](#ocrd_models)
  * [ocrd_modelfactory](#ocrd_modelfactory)
  * [ocrd_validators](#ocrd_validators)
  * [ocrd_network](#ocrd_network)
  * [ocrd](#ocrd)
* [bash library](#bash-library)
* [Testing](#testing)
* [See Also](#see-also)
<!-- END-MARKDOWN-TOC -->
## Introduction
This repository contains the Python packages that form the base for tools within the
[OCR-D ecosphere](https://github.com/topics/ocr-d).
All packages are also published to [PyPI](https://pypi.org/search/?q=ocrd).
## Installation
**NOTE** Unless you want to contribute to OCR-D/core, we recommend installation
as part of [ocrd_all](https://github.com/OCR-D/ocrd_all) which installs a
complete stack of OCR-D-related software.
The easiest way to install is via `pip`:

    pip install ocrd

All Python software released by [OCR-D](https://github.com/OCR-D) requires Python 3.8 or higher.
> **NOTE** Some OCR-D tools (or even test cases) _might_ behave unexpectedly if your environment has specific modifications, such as:
* a custom build of [ImageMagick](https://github.com/ImageMagick/ImageMagick) whose format delegates differ from what OCR-D expects
* custom Python logging configurations in your personal account
## Command line tools
**NOTE:** All OCR-D CLI tools support a `--help` flag which shows usage and
supported flags, options and arguments.
### `ocrd` CLI
* [CLI usage](https://ocr-d.de/core/api/ocrd/ocrd.cli.html)
* [Introduction to `ocrd workspace`](https://github.com/OCR-D/ocrd-website/wiki/Intro-ocrd-workspace-CLI)
* [OCR-D user guide](https://ocr-d.de/en/use)
### `ocrd-dummy` CLI
A minimal [OCR-D processor](https://ocr-d.de/en/user_guide#using-the-ocr-d-processors) that copies from `-I/--input-file-grp` to `-O/--output-file-grp`.
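
As a minimal sketch of how such a call could be assembled from Python, the snippet below builds the argument list for `ocrd-dummy` without running it. The helper name `dummy_command` and the file group names `OCR-D-IMG` and `OCR-D-DUMMY` are hypothetical and chosen only for illustration; only the `-I` and `-O` options come from the description above.

```python
import shlex
from typing import List


def dummy_command(input_grp: str, output_grp: str) -> List[str]:
    """Build the argument list for an ocrd-dummy call (hypothetical helper)."""
    return ["ocrd-dummy", "-I", input_grp, "-O", output_grp]


# The file group names are illustrative assumptions, not fixed by OCR-D/core.
cmd = dummy_command("OCR-D-IMG", "OCR-D-DUMMY")

# Print a shell-quoted form that can be pasted into a terminal.
print(shlex.join(cmd))  # prints: ocrd-dummy -I OCR-D-IMG -O OCR-D-DUMMY
```

Passing the list to `subprocess.run(cmd, check=True)` from within a workspace directory would then start the processor, provided `ocrd-dummy` is found on `PATH`.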
## Configuration
Almost all behaviour of the OCR-D/core software is configured via CLI options and flags, which can be listed with the `--help` flag that all CLIs support.
Some parts of the software are configured via environment variables:
* `OCRD_PROFILE`: This variable configures the built-in CPU and memory profiling. If empty, no profiling is done. Otherwise, it is expected to contain any of the following tokens:
  * `CPU`: Enable CPU profiling of processor runs
  * `RSS`: Enable RSS memory profiling
  * `PSS`: Enable proportionate memory profiling
* `OCRD_PROFILE_FILE`: If set, the CPU profile is written to this file for later perusal with an analysis tool like [snakeviz](https://jiffyclub.github.io/snakeviz/).
* `PATH`: Search path for processor executables (affects `ocrd process` and `ocrd resmgr`).
* `HOME`: Directory to look for `ocrd_logging.conf`, fallback for unset XDG variables (see below).
* `XDG_CONFIG_HOME`: Directory to look for `./ocrd/resources.yml` (i.e. `ocrd resmgr` user database) – defaults to `$HOME/.config`.
* `XDG_DATA_HOME`: Directory to look for `./ocrd-resources/*` (i.e. `ocrd resmgr` data location) – defaults to `$HOME/.local/share`.
* `OCRD_DOWNLOAD_RETRIES`: Number of times to retry failed attempts for downloads of resources or workspace files.
* `OCRD_DOWNLOAD_TIMEOUT`: Timeout in seconds for connecting or reading (comma-separated) when downloading.
* `OCRD_MISSING_INPUT`: How to deal with missing input files (for some fileGrp/pageId) during processing:
  * `SKIP`: ignore and proceed with next page's input
  * `ABORT`: throw `MissingInputFile` exception
* `OCRD_MISSING_OUTPUT`: How to deal with missing output files (for some fileGrp/pageId) during processing:
  * `SKIP`: ignore and proceed processing next page
  * `COPY`: fall back to copying input PAGE to output fileGrp for page
  * `ABORT`: re-throw whatever caused processing to fail
* `OCRD_MAX_MISSING_OUTPUTS`: Maximal rate of skipped/fallback pages among all processed pages before aborting (decimal fraction, ignored if negative).
* `OCRD_EXISTING_OUTPUT`: How to deal with already existing output files (for some fileGrp/pageId) during processing:
  * `SKIP`: ignore and proceed processing next page
  * `OVERWRITE`: force writing result to output fileGrp for page
  * `ABORT`: re-throw `FileExistsError` exception
* `OCRD_METS_CACHING`: Whether to enable in-memory storage of OcrdMets data structures for speedup during processing or workspace operations.
* `OCRD_MAX_PROCESSOR_CACHE`: Maximum number of processor instances (for each set of parameters) to be kept in memory (including loaded models) for processing workers or processor servers.
* `OCRD_MAX_PARALLEL_PAGES`: Maximum number of processor threads for page-parallel processing (within each Processor's selected page range, independent of the number of Processing Workers or Processor Servers). If set `>1`, then a METS Server must be used for METS synchronisation.
* `OCRD_PROCESSING_PAGE_TIMEOUT`: Timeout in seconds for processing a single page. If set to a value `>0` and the timeout is exceeded, the page is handled the same way as configured by `OCRD_MISSING_OUTPUT`.
* `OCRD_NETWORK_SERVER_ADDR_PROCESSING`: Default address of Processing Server to connect to (for `ocrd network client processing`).
* `OCRD_NETWORK_SERVER_ADDR_WORKFLOW`: Default address of Workflow Server to connect to (for `ocrd network client workflow`).
* `OCRD_NETWORK_SERVER_ADDR_WORKSPACE`: Default address of Workspace Server to connect to (for `ocrd network client workspace`).
* `OCRD_NETWORK_RABBITMQ_CLIENT_CONNECT_ATTEMPTS`: Number of attempts for a worker to create its queue. Helpful if the rabbitmq-server needs time to be fully started.
* `OCRD_NETWORK_CLIENT_POLLING_SLEEP`: How many seconds to sleep before trying `ocrd network client` again.
* `OCRD_NETWORK_CLIENT_POLLING_TIMEOUT`: Timeout for a blocking `ocrd network client` (in seconds).
* `OCRD_NETWORK_SOCKETS_ROOT_DIR`: The root directory where all METS Server-related socket files are created.
* `OCRD_NETWORK_LOGS_ROOT_DIR`: The root directory where all `ocrd_network`-related log files are stored.
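
To make a few of these rules concrete, here is a minimal sketch of how the XDG fallbacks, the comma-separated download timeout and the missing-output rate could be interpreted. It is not the code used by OCR-D/core: the helper names `xdg_dir`, `parse_timeout` and `too_many_missing` are hypothetical, and treating a single timeout value as applying to both connecting and reading is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Tuple


def xdg_dir(env: Mapping[str, str], name: str, default_subdir: str) -> Path:
    """Return the directory named by an XDG variable, else a path under $HOME."""
    value = env.get(name)
    if value:
        return Path(value)
    return Path(env.get("HOME", str(Path.home()))) / default_subdir


def parse_timeout(raw: str) -> Tuple[float, float]:
    """Parse "connect,read" into seconds.

    Assumption: a single value is used for both connecting and reading.
    """
    parts = [float(part) for part in raw.split(",")]
    if len(parts) == 1:
        return parts[0], parts[0]
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError(f"expected one or two comma-separated numbers, got {raw!r}")


def too_many_missing(skipped: int, processed: int, max_rate: float) -> bool:
    """True if the rate of skipped/fallback pages exceeds max_rate.

    A negative max_rate disables the check, as described for
    OCRD_MAX_MISSING_OUTPUTS.
    """
    if max_rate < 0 or processed == 0:
        return False
    return skipped / processed > max_rate


# Location of the `ocrd resmgr` user database under the documented defaults.
resources_yml = xdg_dir(os.environ, "XDG_CONFIG_HOME", ".config") / "ocrd" / "resources.yml"
print(resources_yml)
```

Passing the environment as an explicit mapping keeps the helpers easy to test with a plain dictionary instead of the real `os.environ`.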
## Packages
### ocrd_utils
Contains utilities and constants, e.g. for logging, path normalization, coordinate calculation etc.
See [README for `ocrd_utils`](./README_ocrd_utils.md) for further information.
### ocrd_models
Contains file format wrappers for PAGE-XML, METS, EXIF metadata etc.
See [README for `ocrd_models`](./README_ocrd_models.md) for further information.
### ocrd_modelfactory
Code to instantiate [models](#ocrd_models) from existing data.
See [README for `ocrd_modelfactory`](./README_ocrd_modelfactory.md) for further information.
### ocrd_validators
Schemas and routines for validating BagIt, `ocrd-tool.json`, workspaces, METS, page, CLI parameters etc.
See [README for `ocrd_validators`](./README_ocrd_validators.md) for further information.
### ocrd_network
Components related to the OCR-D Web API.
See [README for `ocrd_network`](./README_ocrd_network.md) for further information.
### ocrd
Depends on all of the above, also contains decorators and classes for creating OCR-D processors and CLIs.
Also contains the command line tool `ocrd`.
See [README for `ocrd`](./README_ocrd.md) for further information.
## bash library
Builds a bash script that can be sourced by other bash scripts to create OCR-D-compliant CLIs.
See [README for `bashlib`](./README_bashlib.md) for further information.
## Testing
- Download assets: `make assets`
- Test with local files: `make test`
- Test with remote assets:
  - `make test OCRD_BASEURL='https://github.com/OCR-D/assets/raw/master/data/'`
## See Also
- [OCR-D Specifications](https://ocr-d.de/en/spec/) ([Repo](https://github.com/ocr-d/spec))
- [OCR-D core API documentation](https://ocr-d.de/core) (built here via `make docs`)
- [OCR-D Website](https://ocr-d.de) ([Repo](https://github.com/ocr-d/ocrd-website))
Raw data
{
"_id": null,
"home_page": null,
"name": "ocrd",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": null,
"author": null,
"author_email": "Konstantin Baierer <unixprog@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/cc/73/34cece39f7fc9b887b8ed6aa3cf2d3fad599004793746cf820e55832a577/ocrd-3.0.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache License 2.0",
"summary": "OCR-D framework",
"version": "3.0.2",
"project_urls": {
"Documentation": "https://ocr-d.de/core",
"Homepage": "https://ocr-d.de",
"Issues": "https://github.com/OCR-D/core/issues",
"Repository": "https://github.com/OCR-D/core"
},
"split_keywords": [],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "3b2332d560c90222af90d653b5f6a140dcc0205004b625fbbb5b943b293b33ed",
"md5": "f516793d4f9d5af4a0f4758d316476f0",
"sha256": "3261e8fbc62a7b0051814d83fdbd0d4be13ed90e91f956243e9f27895deddff0"
},
"downloads": -1,
"filename": "ocrd-3.0.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "f516793d4f9d5af4a0f4758d316476f0",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 374545,
"upload_time": "2025-02-12T11:09:33",
"upload_time_iso_8601": "2025-02-12T11:09:33.388698Z",
"url": "https://files.pythonhosted.org/packages/3b/23/32d560c90222af90d653b5f6a140dcc0205004b625fbbb5b943b293b33ed/ocrd-3.0.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "cc7334cece39f7fc9b887b8ed6aa3cf2d3fad599004793746cf820e55832a577",
"md5": "e7d20e768ee4ea8e186c01224009d992",
"sha256": "a787b5f32bbb51655e0ea9de285b7a2adca982382670ad88231ea8e6fe73319c"
},
"downloads": -1,
"filename": "ocrd-3.0.2.tar.gz",
"has_sig": false,
"md5_digest": "e7d20e768ee4ea8e186c01224009d992",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 363789,
"upload_time": "2025-02-12T11:09:39",
"upload_time_iso_8601": "2025-02-12T11:09:39.140619Z",
"url": "https://files.pythonhosted.org/packages/cc/73/34cece39f7fc9b887b8ed6aa3cf2d3fad599004793746cf820e55832a577/ocrd-3.0.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-02-12 11:09:39",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "OCR-D",
"github_project": "core",
"travis_ci": false,
"coveralls": true,
"github_actions": true,
"circle": true,
"requirements": [
{
"name": "atomicwrites",
"specs": [
[
">=",
"1.3.0"
]
]
},
{
"name": "beanie",
"specs": [
[
"~=",
"1.7"
]
]
},
{
"name": "click",
"specs": [
[
">=",
"7"
]
]
},
{
"name": "cryptography",
"specs": [
[
"<",
"43.0.0"
]
]
},
{
"name": "Deprecated",
"specs": [
[
"==",
"1.2.0"
]
]
},
{
"name": "docker",
"specs": []
},
{
"name": "elementpath",
"specs": []
},
{
"name": "fastapi",
"specs": [
[
">=",
"0.78.0"
]
]
},
{
"name": "filetype",
"specs": []
},
{
"name": "Flask",
"specs": []
},
{
"name": "frozendict",
"specs": [
[
">=",
"2.3.4"
]
]
},
{
"name": "gdown",
"specs": []
},
{
"name": "httpx",
"specs": [
[
">=",
"0.22.0"
]
]
},
{
"name": "importlib_metadata",
"specs": []
},
{
"name": "importlib_resources",
"specs": []
},
{
"name": "jsonschema",
"specs": [
[
">=",
"4"
]
]
},
{
"name": "loky",
"specs": []
},
{
"name": "lxml",
"specs": []
},
{
"name": "memory-profiler",
"specs": [
[
">=",
"0.58.0"
]
]
},
{
"name": "numpy",
"specs": []
},
{
"name": "ocrd-fork-bagit",
"specs": [
[
">=",
"1.8.1.post2"
]
]
},
{
"name": "ocrd-fork-bagit_profile",
"specs": [
[
">=",
"1.3.0.post1"
]
]
},
{
"name": "opencv-python-headless",
"specs": []
},
{
"name": "paramiko",
"specs": []
},
{
"name": "pika",
"specs": [
[
">=",
"1.2.0"
]
]
},
{
"name": "Pillow",
"specs": [
[
">=",
"7.2.0"
]
]
},
{
"name": "pydantic",
"specs": [
[
"==",
"1.*"
]
]
},
{
"name": "python-magic",
"specs": []
},
{
"name": "python-multipart",
"specs": []
},
{
"name": "pyyaml",
"specs": []
},
{
"name": "requests",
"specs": []
},
{
"name": "requests_unixsocket2",
"specs": []
},
{
"name": "shapely",
"specs": []
},
{
"name": "uvicorn",
"specs": []
},
{
"name": "uvicorn",
"specs": [
[
">=",
"0.17.6"
]
]
}
],
"tox": true,
"lcname": "ocrd"
}