pagexml-tools

Name	pagexml-tools
Version	0.6.0
home_page	https://github.com/knaw-huc/pagexml
Summary	Utility functions for reading PageXML files
upload_time	2025-01-07 12:04:11
maintainer	None
docs_url	None
author	Marijn Koolen
requires_python	<4.0,>=3.9
license	MIT

# pagexml-tools

[![GitHub Actions](https://github.com/knaw-huc/pagexml/workflows/tests/badge.svg)](https://github.com/knaw-huc/pagexml/actions)
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![Documentation Status](https://readthedocs.org/projects/pagexml/badge/?version=latest)](https://pagexml.readthedocs.io/en/latest/?badge=latest)
[![PyPI](https://img.shields.io/pypi/v/pagexml-tools)](https://pypi.org/project/pagexml-tools/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pagexml-tools)](https://pypi.org/project/pagexml-tools/)

Utility functions for reading [PageXML](https://www.primaresearch.org/tools/PAGELibraries) files

## Installing

### Using poetry

```commandline
poetry add pagexml-tools
```

### Using pip

```commandline
pip install pagexml-tools
```

## Using

PageXML-tools contains functions for parsing and for a range of analysis tasks.

### Parsing PageXML files and the Physical Document model

There is a tutorial that demonstrates the [physical document model API](./notebooks/Demo-understanding-the-document-model.ipynb).

PageXML-tools provides basic functionality for parsing a PageXML file into
a document model that represents the content of the file. PageXML is generated
by an HTR/OCR process that recognises text in an image of a physical document.

```python
from pagexml.parser import parse_pagexml_file

pagexml_file = "path/to/pagexml_file.xml"

page_doc = parse_pagexml_file(pagexml_file)

# a page document has an ID
print(page_doc.id)

# print descriptive statistics
print(page_doc.stats)

# iterate over text regions and lines
for tr in page_doc.text_regions:
    # a text_region has an ID and a bounding box derived from its coordinates
    print(tr.id, tr.coords.box)
    # a text_region can have sub-text_regions and lines
    for line in tr.lines:
        # a line has an ID, coordinates and text
        print(line.id, line.coords.box, line.text)
```
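
To make the comment about bounding boxes concrete, here is a minimal standard-library sketch of what a PageXML file contains and how a bounding box can be derived from a `points` string. It does not use or reproduce the pagexml-tools implementation: the sample document, `points_to_box` and `read_lines` are hypothetical names for illustration, the `x`/`y`/`w`/`h` box layout is an illustrative choice, and the 2013-07-15 PAGE namespace is an assumption about the input.

```python
import xml.etree.ElementTree as ET

# Assumption: the file uses the 2013-07-15 PAGE schema namespace.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Hypothetical, hand-written sample with one text region and two lines.
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="scan_0001.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,300 100,300"/>
      <TextLine id="r1l1">
        <Coords points="110,120 880,118 882,190 112,192"/>
        <TextEquiv><Unicode>First line of text</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Coords points="110,200 700,200 700,280 110,280"/>
        <TextEquiv><Unicode>second line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def points_to_box(points: str) -> dict:
    """Turn a PAGE 'points' string ("x1,y1 x2,y2 ...") into a bounding box."""
    xy = [tuple(int(v) for v in point.split(",")) for point in points.split()]
    xs = [x for x, _ in xy]
    ys = [y for _, y in xy]
    return {"x": min(xs), "y": min(ys), "w": max(xs) - min(xs), "h": max(ys) - min(ys)}


def read_lines(xml_text: str) -> list[dict]:
    """Collect region ID, line ID, bounding box and text for every text line."""
    root = ET.fromstring(xml_text)
    lines = []
    for region in root.iterfind(".//pc:TextRegion", NS):
        # Only direct child lines, so lines of nested regions are not counted twice.
        for line in region.iterfind("pc:TextLine", NS):
            coords = line.find("pc:Coords", NS)
            unicode_el = line.find("pc:TextEquiv/pc:Unicode", NS)
            lines.append({
                "region_id": region.get("id"),
                "line_id": line.get("id"),
                "box": points_to_box(coords.get("points")),
                "text": unicode_el.text if unicode_el is not None else None,
            })
    return lines


lines = read_lines(SAMPLE)
for line in lines:
    print(line["region_id"], line["line_id"], line["box"], line["text"])
```

The document model returned by `parse_pagexml_file` spares you this bookkeeping; the sketch only shows the kind of information (IDs, coordinates, text) that the model exposes.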

### Other functionality

In addition to the basic parsing and handling of PageXML output, there is
functionality to support a range of tasks:

- reading sets of PageXML files from an archive (tar, zip) file ([tutorial](./notebooks/Demo-reading-pagexml-files-from-archive.ipynb)),
- searching in text ([keyword in context](./notebooks/Demo-text-search-simple.ipynb), [keywords or fuzzy search](./notebooks/Demo-text-search-in-pagexml-archive.ipynb)),
- classifying physical document types in a large set of PageXML documents ([tutorial](./notebooks/Demo-analysing-scan-characteristics.ipynb)),
- checking the quality of the HTR/OCR process ([tutorial](./notebooks/Demo-analysing-scan-characteristics-checking-quality.ipynb)),
- comparing subsets ([tutorial](./notebooks/Demo-analysing-scan-characteristics-comparing-subsets.ipynb)),
- identifying document sections in sequences of PageXML documents ([tutorial](./notebooks/Demo-analysing-scan-characteristics-book-sections.ipynb)),
- turning text lines into running text ([tutorial](./notebooks/Demo-from-lines-to-running-text.ipynb)),
- supporting different reading orders ([tutorial](./notebooks/Demo-sorting.ipynb)),
- reinterpreting and restructuring text regions and lines ([tutorial](./notebooks/Demo-restructuring-documents.ipynb)),
- turning physical structure into logical structure.
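
As a rough illustration of the first two tasks in the list above, the following standard-library sketch reads PageXML files from a zip archive and runs a simple keyword-in-context search over their text lines. It is not the pagexml-tools API (the linked tutorials show that); `make_page`, `build_demo_archive`, `iter_line_texts` and `keyword_in_context` are hypothetical helpers, the archive is built in memory with made-up content, and the 2013-07-15 PAGE namespace is an assumption.

```python
import io
import re
import xml.etree.ElementTree as ET
import zipfile

# Assumption: the files use the 2013-07-15 PAGE schema namespace.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
NS = {"pc": PAGE_NS}


def make_page(texts):
    """Build a minimal, hypothetical PageXML document with one line per text."""
    lines = "".join(
        f'<TextLine id="l{n}"><TextEquiv><Unicode>{text}</Unicode></TextEquiv></TextLine>'
        for n, text in enumerate(texts, 1)
    )
    return f'<PcGts xmlns="{PAGE_NS}"><Page><TextRegion id="r1">{lines}</TextRegion></Page></PcGts>'


def build_demo_archive() -> bytes:
    """Create a small in-memory zip archive with two pages and one non-XML file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("scans/page_0001.xml",
                         make_page(["The council met in the town hall", "to discuss the harbour tax"]))
        archive.writestr("scans/page_0002.xml", make_page(["No tax was levied that year"]))
        archive.writestr("scans/README.txt", "not a PageXML file")
    return buffer.getvalue()


def iter_line_texts(archive_bytes: bytes):
    """Yield (file name, line ID, text) for every text line in every .xml member."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in sorted(archive.namelist()):
            if not name.endswith(".xml"):
                continue
            root = ET.fromstring(archive.read(name))
            for line in root.iterfind(".//pc:TextLine", NS):
                unicode_el = line.find("pc:TextEquiv/pc:Unicode", NS)
                if unicode_el is not None and unicode_el.text:
                    yield name, line.get("id"), unicode_el.text


def keyword_in_context(archive_bytes: bytes, keyword: str, width: int = 12):
    """Return each case-insensitive match with up to `width` characters of context."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    hits = []
    for name, line_id, text in iter_line_texts(archive_bytes):
        for match in pattern.finditer(text):
            hits.append({
                "file": name,
                "line_id": line_id,
                "left": text[max(0, match.start() - width):match.start()],
                "match": match.group(),
                "right": text[match.end():match.end() + width],
            })
    return hits


hits = keyword_in_context(build_demo_archive(), "tax")
for hit in hits:
    print(f'{hit["file"]} {hit["line_id"]}: {hit["left"]}[{hit["match"]}]{hit["right"]}')
```

The context here is limited to a single line; searching across line breaks would first require turning lines into running text, which is one of the other tasks listed above.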

----

[USAGE](https://pagexml.readthedocs.io/en/latest/) |
[CONTRIBUTING](CONTRIBUTING.md) |
[LICENSE](LICENSE)

Raw data

{
    "_id": null,
    "home_page": "https://github.com/knaw-huc/pagexml",
    "name": "pagexml-tools",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.9",
    "maintainer_email": null,
    "keywords": null,
    "author": "Marijn Koolen",
    "author_email": "marijn.koolen@huygens.knaw.nl",
    "download_url": "https://files.pythonhosted.org/packages/32/d2/c5c42adeb06ca3162cbfe54f005c2944c3a9538f40ec74938fdf63d3a889/pagexml_tools-0.6.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Utility functions for reading PageXML files",
    "version": "0.6.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/knaw-huc/pagexml/issues",
        "Homepage": "https://github.com/knaw-huc/pagexml",
        "Repository": "https://github.com/knaw-huc/pagexml"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3eb712aa702b24afe5dc814a13c87e8db0a1a6a9127ee38db8aaf230d8d8bebe",
                "md5": "8621c5620151543debab4b8e98a80306",
                "sha256": "2a028c7ad00553a6b78687fca4d5e948788870d5de7529f86a7a916ca5e7ccbc"
            },
            "downloads": -1,
            "filename": "pagexml_tools-0.6.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8621c5620151543debab4b8e98a80306",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.9",
            "size": 54289,
            "upload_time": "2025-01-07T12:04:08",
            "upload_time_iso_8601": "2025-01-07T12:04:08.878792Z",
            "url": "https://files.pythonhosted.org/packages/3e/b7/12aa702b24afe5dc814a13c87e8db0a1a6a9127ee38db8aaf230d8d8bebe/pagexml_tools-0.6.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "32d2c5c42adeb06ca3162cbfe54f005c2944c3a9538f40ec74938fdf63d3a889",
                "md5": "5e785ddae096291c79c2c57a65485758",
                "sha256": "3985ccffde573153b708e3e48b8d5d0be0412ff43adf4e3d436ece4902dba3dc"
            },
            "downloads": -1,
            "filename": "pagexml_tools-0.6.0.tar.gz",
            "has_sig": false,
            "md5_digest": "5e785ddae096291c79c2c57a65485758",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.9",
            "size": 50405,
            "upload_time": "2025-01-07T12:04:11",
            "upload_time_iso_8601": "2025-01-07T12:04:11.530978Z",
            "url": "https://files.pythonhosted.org/packages/32/d2/c5c42adeb06ca3162cbfe54f005c2944c3a9538f40ec74938fdf63d3a889/pagexml_tools-0.6.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-07 12:04:11",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "knaw-huc",
    "github_project": "pagexml",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "pagexml-tools"
}
