browse-ocrd


- Name: browse-ocrd
- Version: 0.5.5
- Home page: https://github.com/hnesk/browse-ocrd
- Summary: An extensible viewer for OCR-D workspaces
- Upload time: 2023-04-27 21:03:14
- Author: Johannes Künsebeck
- Requires Python: >=3.7
- License: MIT License
- Keywords: ocr, ocr-d, mets, page xml
- Requirements: ocrd, Pillow, numpy, opencv-python-headless, PyGObject, python-magic, wheel, setuptools, lxml, Shapely, Deprecated, importlib_metadata, importlib_resources, pydantic

[![Unit tests](https://github.com/hnesk/browse-ocrd/workflows/Unit%20tests/badge.svg?branch=master)](https://github.com/hnesk/browse-ocrd/actions/workflows/unittest.yml)
[![Docker build](https://github.com/hnesk/browse-ocrd/actions/workflows/dockerhub.yml/badge.svg)](https://github.com/hnesk/browse-ocrd/actions/workflows/dockerhub.yml)
[![PyPI version](https://badge.fury.io/py/browse-ocrd.svg)](https://badge.fury.io/py/browse-ocrd)
# OCR-D Browser

An extensible viewer for [OCR-D](https://ocr-d.de/) [mets.xml](https://ocr-d.de/en/spec/mets) files

 * [Screenshot](#screenshot)
 * [Features](#features)
 * [Installation](#installation)
    * [Native](#native-tested-on-ubuntu-18042004)
       * [From source](#from-source)
       * [Via pip](#via-pip)
    * [Docker](#docker)
 * [Usage](#usage)
    * [Native GUI](#native-gui)
    * [Docker service](#docker-service)
 * [Configuration](#configuration)
    * [Configuration file locations](#configuration-file-locations)
    * [Configuration file syntax](#configuration-file-syntax)
    * [Configuration by environment variables](#configuration-by-environment-variables)
 
## Screenshot

![OCR-D Browser with PageView and XmlView](docs/screenshot.png)


## Features

- Browse fileGrps and pages, arranging views next to each other for comparison
- PageView: Show original or derived page images with [PAGE-XML](https://ocr-d.de/en/spec/page) annotations overlay, similar to [PageViewer](https://github.com/PRImA-Research-Lab/prima-page-viewer)
- ImageView: Show original or derived images (`AlternativeImage` on any level of the structural hierarchy)
- ImageView: Show multiple images at once for different pages (horizontally) or different segments (vertically), zooming freely
- XmlView: Show raw [PAGE-XML](https://ocr-d.de/en/spec/page) with syntax highlighting, open with [PageViewer](https://github.com/PRImA-Research-Lab/prima-page-viewer)
- TextView: Show concatenated [PAGE-XML](https://ocr-d.de/en/spec/page) text annotation
- DiffView: Show a simple diff comparison between text annotations from different fileGrps  
- HtmlView: Show rendered HTML comparison from [dinglehopper](https://github.com/qurator-spk/dinglehopper) evaluations

## Installation

OCR-D Browser requires Python 3.7 or higher.

### Native (tested on Ubuntu 18.04/20.04) 

The native installation requires [GTK 3](https://www.gtk.org/).

Either way, you need a [virtual environment](https://packaging.python.org/tutorials/installing-packages/#creating-virtual-environments) with a current `pip` version (>=20), preferably your existing OCR-D venv:

<details>
  <summary>Create a venv with a current pip:</summary>

```bash
sudo apt install python3-pip python3-venv 
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip setuptools wheel
```
</details>


#### From source
```bash
git clone https://github.com/hnesk/browse-ocrd.git 
cd browse-ocrd
sudo make deps-ubuntu
make install
```

#### Via pip

```bash
sudo apt install libcairo2-dev libgirepository1.0-dev
pip install browse-ocrd
```

### Docker

If you have installed [Docker](https://docs.docker.com/get-docker/), you can build OCR-D Browser as a **web service**:

    docker build -t ocrd_browser .

Or use a prebuilt image from Docker Hub:

    docker pull hnesk/ocrd_browser


## Usage

### Native GUI
Start the app with the filesystem path to the METS file of your [OCR-D workspace](https://ocr-d.de/en/spec/glossary#workspace):
```
browse-ocrd ./path/to/mets.xml
```

You can also open a different METS file later from within the UI.

### Docker service

When running the web service, you need to pass a directory `DATADIR` which (recursively) contains all the workspaces you want to serve.
The top entrypoint `http://localhost/` will show an index page with a link `http://localhost/browse/...` for each workspace path.
Each link will run `browse-ocrd` at that workspace in the background, and then redirect your browser to the internal [Broadway server](https://docs.gtk.org/gtk3/broadway.html), which renders the app in the web browser.

To start the service, run the following, replacing `DATADIR` with the path to your data directory:

    docker run -it --rm -v DATADIR:/data -p 8085:8085 -p 8080:8080 ocrd_browser
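
To picture what the index page lists, here is a minimal sketch of a recursive workspace search. It is an illustration under assumptions, not the service's actual code: it assumes a workspace is any directory containing a file named `mets.xml`, and that each link is the workspace path relative to `DATADIR` appended to `/browse/`. The helper names `find_workspaces` and `index_links` are hypothetical.

```python
import tempfile
from pathlib import Path


def find_workspaces(datadir):
    """Return workspace directories below datadir as sorted relative POSIX paths."""
    root = Path(datadir)
    return sorted(
        mets.parent.relative_to(root).as_posix() for mets in root.rglob("mets.xml")
    )


def index_links(datadir, base_url="http://localhost"):
    """Build one hypothetical /browse/ link per workspace found."""
    return [f"{base_url}/browse/{relative}" for relative in find_workspaces(datadir)]


# Demonstration with a throwaway data directory holding two workspaces
with tempfile.TemporaryDirectory() as datadir:
    for name in ("book1", "series/book2"):
        workspace = Path(datadir) / name
        workspace.mkdir(parents=True)
        (workspace / "mets.xml").write_text("<mets/>")
    links = index_links(datadir)
    print("\n".join(links))
```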


## Configuration

### Configuration file locations

At startup, the following directories are searched for a config file named `ocrd-browser.conf`:

```python
# directories and their default values under Ubuntu 20.04
GLib.get_system_config_dirs()  # '/etc/xdg/xdg-ubuntu/ocrd-browser.conf', '/etc/xdg/ocrd-browser.conf'
GLib.get_user_config_dir()     # '/home/jk/.config/ocrd-browser.conf'  
os.getcwd()                    # './ocrd-browser.conf'
```
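
To see which files would be candidates on your own machine without importing GLib, the lookup can be approximated with the standard library. This is a sketch under assumptions: on Linux, GLib derives these directories from the XDG variables `XDG_CONFIG_DIRS` and `XDG_CONFIG_HOME`, and the fallbacks below are the XDG defaults. The function name `candidate_config_files` is hypothetical, and the sketch says nothing about which file takes precedence.

```python
import os
from pathlib import Path

CONFIG_NAME = "ocrd-browser.conf"


def candidate_config_files(environ=None, cwd=None):
    """List candidate config files: system dirs, then user dir, then cwd."""
    environ = os.environ if environ is None else environ
    # XDG defaults apply when the variables are unset or empty
    system_dirs = environ.get("XDG_CONFIG_DIRS") or "/etc/xdg"
    user_dir = environ.get("XDG_CONFIG_HOME") or str(Path.home() / ".config")
    working_dir = Path.cwd() if cwd is None else Path(cwd)

    directories = [Path(entry) for entry in system_dirs.split(":") if entry]
    directories.append(Path(user_dir))
    directories.append(working_dir)
    return [directory / CONFIG_NAME for directory in directories]


for path in candidate_config_files():
    print(path, "(exists)" if path.is_file() else "(missing)")
```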

### Configuration file syntax

The `ocrd-browser.conf` file is an INI file with the following sections and keys:
```ini
[FileGroups]
# Preferred fileGrp names for thumbnail display in the Page Browser 
# Comma separated list of regular expressions
preferredImages = OCR-D-IMG, OCR-D-IMG.*, ORIGINAL

# Each Tool has a section header [Tool XYZ]
# At the moment the only defined tool is "PageViewer"  
[Tool PageViewer]
# shell commandline to execute with placeholders  
commandline = /usr/bin/java -jar /home/jk/bin/JPageViewer/JPageViewer.jar --resolve-dir {workspace.directory} {file.path.absolute}
```

> Note: You can get PRImA's PageViewer at [GitHub](https://github.com/PRImA-Research-Lab/prima-page-viewer/releases).
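
To illustrate how such a file could be consumed, the sketch below parses an equivalent configuration with Python's `configparser` and ranks fileGrp names by the first matching `preferredImages` pattern. This is an assumption-laden illustration, not OCR-D Browser's implementation: whether the application matches whole names (as `fullmatch` does here) or only prefixes is not stated above, and the helper name `rank` is hypothetical.

```python
import configparser
import re

CONFIG_TEXT = """
[FileGroups]
preferredImages = OCR-D-IMG, OCR-D-IMG.*, ORIGINAL

[Tool PageViewer]
commandline = /usr/bin/java -jar JPageViewer.jar {file.path.absolute}
"""

# interpolation=None keeps '%' and '{...}' in values untouched
parser = configparser.ConfigParser(interpolation=None)
parser.optionxform = str  # preserve the case of keys such as preferredImages
parser.read_string(CONFIG_TEXT)

# Comma-separated list of regular expressions, in order of preference
patterns = [
    re.compile(entry.strip())
    for entry in parser["FileGroups"]["preferredImages"].split(",")
    if entry.strip()
]


def rank(file_group):
    """Index of the first pattern matching the whole name; unmatched names go last."""
    for index, pattern in enumerate(patterns):
        if pattern.fullmatch(file_group):
            return index
    return len(patterns)


groups = ["OCR-D-SEG-LINE", "OCR-D-IMG-BIN", "OCR-D-IMG"]
ordered = sorted(groups, key=rank)
tools = [name for name in parser.sections() if name.startswith("Tool ")]
print(ordered)
print(tools)
```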


The `commandline` string is used as a Python format string with the following keyword arguments:

* `workspace`: The current `ocrd.Workspace`. All properties are shell-escaped automatically (with `shlex.quote`).
* `file`: The current `ocrd_models.OcrdFile`. All properties are shell-escaped automatically (with `shlex.quote`). An additional property `path` provides the properties `absolute` and `relative`, so `{file.path.absolute}` is replaced by the shell-quoted absolute path of the file.
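
The effect of this quoting can be sketched with a small wrapper whose attribute lookups return shell-quoted strings. This is only an illustration of the technique: the class `ShellQuoted`, the function `render_commandline`, and the `SimpleNamespace` stand-ins for `ocrd.Workspace` and `ocrd_models.OcrdFile` are all hypothetical, and the real implementation may differ.

```python
import shlex
from types import SimpleNamespace


class ShellQuoted:
    """Hypothetical wrapper: attribute lookups return shell-quoted strings."""

    def __init__(self, target):
        self._target = target

    def __getattr__(self, name):
        value = getattr(self._target, name)
        # Nested namespaces (such as file.path) are wrapped again,
        # so that their leaf values are quoted as well.
        if isinstance(value, SimpleNamespace):
            return ShellQuoted(value)
        return shlex.quote(str(value))


def render_commandline(template, workspace, file):
    """Fill the placeholders of a commandline template with quoted values."""
    return template.format(workspace=ShellQuoted(workspace), file=ShellQuoted(file))


# Stand-ins for the real objects, with paths that contain spaces
workspace = SimpleNamespace(directory="/data/my workspace")
file = SimpleNamespace(
    path=SimpleNamespace(
        absolute="/data/my workspace/OCR-D-SEG/page 1.xml",
        relative="OCR-D-SEG/page 1.xml",
    )
)

template = "ls --resolve-dir {workspace.directory} {file.path.absolute}"
command = render_commandline(template, workspace, file)
print(command)
```

Because every value is quoted before it is substituted, paths containing spaces stay single arguments when the resulting string is passed to a shell.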

### Configuration by environment variables

Configuration values can be set or overridden through environment variables. The variable names follow the structure `BROCRD__{SECTION}__{KEY}`, where `SECTION` and `KEY` are in upper snake case and separated by a double underscore (`__`). If the section title contains spaces, its individual words are also separated by `__`.

Some examples:
```shell
BROCRD__FILE_GROUPS__PREFERRED_IMAGES='THUMB'  
BROCRD__TOOL__PAGEVIEWER__COMMANDLINE='ls {file.path.absolute}'  

```
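
The naming scheme can be read in reverse: strip the prefix, split on `__`, and treat the last part as the key and everything before it as the words of the section title. The sketch below does only that. The function names are hypothetical, and mapping the upper-case parts back to the section and key names used in the INI file is left to the application, since the exact rules are not spelled out here.

```python
PREFIX = "BROCRD__"


def parse_override(name):
    """Split BROCRD__{SECTION}__{KEY} into (section_words, key), else None."""
    if not name.startswith(PREFIX):
        return None
    parts = name[len(PREFIX):].split("__")
    # At least one section word and a key are needed, and none may be empty
    if len(parts) < 2 or not all(parts):
        return None
    *section_words, key = parts
    return tuple(section_words), key


def collect_overrides(environ):
    """Collect all overrides from a mapping such as os.environ."""
    overrides = {}
    for name, value in environ.items():
        parsed = parse_override(name)
        if parsed is not None:
            overrides[parsed] = value
    return overrides


example_environ = {
    "HOME": "/home/jk",
    "BROCRD__FILE_GROUPS__PREFERRED_IMAGES": "THUMB",
    "BROCRD__TOOL__PAGEVIEWER__COMMANDLINE": "ls {file.path.absolute}",
}
overrides = collect_overrides(example_environ)
for (section_words, key), value in overrides.items():
    print(section_words, key, value)
```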


            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/hnesk/browse-ocrd",
    "name": "browse-ocrd",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "OCR,OCR-D,mets,PAGE Xml",
    "author": "Johannes K\u00fcnsebeck",
    "author_email": "kuensebeck@googlemail.com",
    "download_url": "https://files.pythonhosted.org/packages/c8/a8/75f071c13eea8056883322c3b0f114347acc0c2f51e16de370ceba80978b/browse-ocrd-0.5.5.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License",
    "summary": "An extensible viewer for OCR-D workspaces",
    "version": "0.5.5",
    "split_keywords": [
        "ocr",
        "ocr-d",
        "mets",
        "page xml"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "45dca7621e6f46837fc7749d6d731e59d59edd3a91de02a29da883da19a39075",
                "md5": "353f108c01b213c9d6b8ea87dce2ab01",
                "sha256": "ec1853070bea6b4abbae17ad2322c788eaac9d7e5233ee9b2eb206bfb0be6133"
            },
            "downloads": -1,
            "filename": "browse_ocrd-0.5.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "353f108c01b213c9d6b8ea87dce2ab01",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 98377,
            "upload_time": "2023-04-27T21:03:11",
            "upload_time_iso_8601": "2023-04-27T21:03:11.640035Z",
            "url": "https://files.pythonhosted.org/packages/45/dc/a7621e6f46837fc7749d6d731e59d59edd3a91de02a29da883da19a39075/browse_ocrd-0.5.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c8a875f071c13eea8056883322c3b0f114347acc0c2f51e16de370ceba80978b",
                "md5": "ec4f86220f669c157900a20454a0f54d",
                "sha256": "20ed6de3e330a1d3e9e2b66e86bb64159f877f0bed0e9c017263f0963e966949"
            },
            "downloads": -1,
            "filename": "browse-ocrd-0.5.5.tar.gz",
            "has_sig": false,
            "md5_digest": "ec4f86220f669c157900a20454a0f54d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 88133,
            "upload_time": "2023-04-27T21:03:14",
            "upload_time_iso_8601": "2023-04-27T21:03:14.078821Z",
            "url": "https://files.pythonhosted.org/packages/c8/a8/75f071c13eea8056883322c3b0f114347acc0c2f51e16de370ceba80978b/browse-ocrd-0.5.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-04-27 21:03:14",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "hnesk",
    "github_project": "browse-ocrd",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "ocrd",
            "specs": [
                [
                    ">=",
                    "2.43.0"
                ]
            ]
        },
        {
            "name": "Pillow",
            "specs": []
        },
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.20"
                ]
            ]
        },
        {
            "name": "opencv-python-headless",
            "specs": []
        },
        {
            "name": "PyGObject",
            "specs": [
                [
                    ">=",
                    "3.28"
                ]
            ]
        },
        {
            "name": "python-magic",
            "specs": []
        },
        {
            "name": "wheel",
            "specs": []
        },
        {
            "name": "setuptools",
            "specs": []
        },
        {
            "name": "lxml",
            "specs": []
        },
        {
            "name": "Shapely",
            "specs": []
        },
        {
            "name": "Deprecated",
            "specs": []
        },
        {
            "name": "importlib_metadata",
            "specs": [
                [
                    ">=",
                    "3.6"
                ]
            ]
        },
        {
            "name": "importlib_resources",
            "specs": []
        },
        {
            "name": "pydantic",
            "specs": [
                [
                    "~=",
                    "1.10"
                ]
            ]
        }
    ],
    "lcname": "browse-ocrd"
}
        