pagexml-tools


| Field | Value |
| --- | --- |
| Name | pagexml-tools |
| Version | 0.5.0 |
| home_page | https://github.com/knaw-huc/pagexml |
| Summary | Utility functions for reading PageXML files |
| upload_time | 2024-03-18 13:50:31 |
| docs_url | None |
| author | Marijn Koolen |
| requires_python | >=3.8,<4.0 |
| license | MIT |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# pagexml-tools

[![GitHub Actions](https://github.com/knaw-huc/pagexml/workflows/tests/badge.svg)](https://github.com/knaw-huc/pagexml/actions)
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![Documentation Status](https://readthedocs.org/projects/pagexml/badge/?version=latest)](https://pagexml.readthedocs.io/en/latest/?badge=latest)
[![PyPI](https://img.shields.io/pypi/v/pagexml-tools)](https://pypi.org/project/pagexml-tools/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pagexml-tools)](https://pypi.org/project/pagexml-tools/)

Utility functions for reading [PageXML](https://www.primaresearch.org/tools/PAGELibraries) files

## Installing

### Using poetry

```commandline
poetry add pagexml-tools
```

### Using pip

```commandline
pip install pagexml-tools
```

## Using

PageXML-tools contains functions for parsing PageXML files and for a range of analysis tasks.

### Parsing PageXML files and the Physical Document model

There is a tutorial that demonstrates the [physical document model API](./notebooks/Demo-understanding-the-document-model.ipynb).

PageXML-tools contains basic functionality for parsing a PageXML file into a
document model that represents the content of the file. Because the HTR/OCR process
that generates PageXML recognises text in an image of a physical document, the model
describes that physical document: its text regions and lines, with their coordinates.

```python
from pagexml.parser import parse_pagexml_file

pagexml_file = "path/to/pagexml_file.xml"

page_doc = parse_pagexml_file(pagexml_file)

# a page document has an ID
print(page_doc.id)

# print descriptive statistics
print(page_doc.stats)

# iterate over text regions and lines
for tr in page_doc.text_regions:
    # a text_region has an ID and a bounding box derived from its coordinates
    print(tr.id, tr.coords.box)
    # a text_region can have sub-text_regions and lines
    for line in tr.lines:
        # a line has an ID, coordinates and text
        print(line.id, line.coords.box, line.text)
```
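
To make "a bounding box derived from its coordinates" concrete, the sketch below reads a small hand-written PageXML sample using only the Python standard library. It is not how pagexml-tools is implemented: the helper names `bounding_box` and `read_lines`, the sample document, and the `x`/`y`/`w`/`h` box layout are all assumptions made for illustration, and the namespace shown belongs to one version of the PAGE schema (2013-07-15).

```python
import xml.etree.ElementTree as ET

# Namespace of the 2013-07-15 PAGE schema; other schema versions use another date.
PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Minimal hand-written PageXML sample, for illustration only.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="scan.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="tr_1">
      <Coords points="100,200 900,200 900,400 100,400"/>
      <TextLine id="line_1">
        <Coords points="110,210 890,215 890,260 110,255"/>
        <TextEquiv><Unicode>Example line of text</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def bounding_box(points_attr):
    """Return the box enclosing a Coords 'points' string such as '1,2 3,4'."""
    points = [tuple(int(v) for v in pair.split(",")) for pair in points_attr.split()]
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Hypothetical box layout: top-left corner plus width and height.
    return {"x": min(xs), "y": min(ys), "w": max(xs) - min(xs), "h": max(ys) - min(ys)}


def read_lines(xml_text):
    """Return (region id, line id, bounding box, text) for every text line."""
    root = ET.fromstring(xml_text)
    rows = []
    for region in root.iterfind(".//pc:TextRegion", PAGE_NS):
        for line in region.iterfind("pc:TextLine", PAGE_NS):
            coords = line.find("pc:Coords", PAGE_NS)
            text = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=PAGE_NS)
            rows.append((region.get("id"), line.get("id"),
                         bounding_box(coords.get("points")), text))
    return rows


for region_id, line_id, box, text in read_lines(SAMPLE):
    print(region_id, line_id, box, text)
```

In practice `parse_pagexml_file` does this work and returns a richer document model; the sketch only shows the relationship between the polygon in `Coords` and the box that encloses it.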

### Other functionality

In addition to the basic parsing and handling of PageXML output, there is
functionality to support a range of tasks:

- reading sets of PageXML files from an archive file (tar, zip) ([tutorial](./notebooks/Demo-reading-pagexml-files-from-archive.ipynb)),
- searching in text ([keyword in context](./notebooks/Demo-text-search-simple.ipynb), [keywords or fuzzy search](./notebooks/Demo-text-search-in-pagexml-archive.ipynb)),
- classifying physical document types in a large set of PageXML documents ([tutorial](./notebooks/Demo-analysing-scan-characteristics.ipynb)),
- checking the quality of the HTR/OCR process ([tutorial](./notebooks/Demo-analysing-scan-characteristics-checking-quality.ipynb)),
- comparing subsets ([tutorial](./notebooks/Demo-analysing-scan-characteristics-comparing-subsets.ipynb)),
- identifying document sections in sequences of PageXML documents ([tutorial](./notebooks/Demo-analysing-scan-characteristics-book-sections.ipynb)),
- turning text lines into running text ([tutorial](./notebooks/Demo-from-lines-to-running-text.ipynb)),
- supporting different reading orders ([tutorial](./notebooks/Demo-sorting.ipynb)),
- reinterpreting and restructuring text regions and lines ([tutorial](./notebooks/Demo-restructuring-documents.ipynb)),
- turning physical structure into logical structure.
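
As a rough illustration of the first task in the list, reading PageXML files from an archive, the sketch below walks a zip archive with the standard library alone. The function name `iter_xml_members` and the file names are hypothetical; pagexml-tools ships its own archive reader, which the linked tutorial covers, and this sketch handles zip only (a tar archive would need `tarfile` instead).

```python
import io
import zipfile


def iter_xml_members(archive):
    """Yield (member name, decoded XML text) for each .xml file in a zip archive."""
    with zipfile.ZipFile(archive) as zf:
        for name in sorted(zf.namelist()):
            if name.lower().endswith(".xml"):
                yield name, zf.read(name).decode("utf-8")


# Build a tiny in-memory archive so the example runs without files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("scans/page_0001.xml", "<PcGts/>")
    zf.writestr("scans/page_0002.xml", "<PcGts/>")
    zf.writestr("scans/README.txt", "not a PageXML file")
buffer.seek(0)

for name, xml_text in iter_xml_members(buffer):
    print(name, len(xml_text))
```

Reading members one at a time avoids unpacking a large scan collection to disk; each XML string could then be handed to a parser.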

----

[USAGE](https://pagexml.readthedocs.io/en/latest/) |
[CONTRIBUTING](CONTRIBUTING.md) |
[LICENSE](LICENSE)

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/knaw-huc/pagexml",
    "name": "pagexml-tools",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8,<4.0",
    "maintainer_email": "",
    "keywords": "",
    "author": "Marijn Koolen",
    "author_email": "marijn.koolen@huygens.knaw.nl",
    "download_url": "https://files.pythonhosted.org/packages/b9/d8/2692da5cf57504e97b9dda07cd52efb02230f2fd54d38d5d9cbb6a20033a/pagexml_tools-0.5.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Utility functions for reading PageXML files",
    "version": "0.5.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/knaw-huc/pagexml/issues",
        "Homepage": "https://github.com/knaw-huc/pagexml",
        "Repository": "https://github.com/knaw-huc/pagexml"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "93adca93ed859aef4f3f2e8962bb8be049fb719968e7f61f02f16a7faeda6aa3",
                "md5": "8ceac659da37a26c9ec75c9573e14a95",
                "sha256": "1f04160d0ec197618db98776cd0ced37dc805da8d1d978907e8b8412cb4f5551"
            },
            "downloads": -1,
            "filename": "pagexml_tools-0.5.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8ceac659da37a26c9ec75c9573e14a95",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8,<4.0",
            "size": 54236,
            "upload_time": "2024-03-18T13:50:29",
            "upload_time_iso_8601": "2024-03-18T13:50:29.623937Z",
            "url": "https://files.pythonhosted.org/packages/93/ad/ca93ed859aef4f3f2e8962bb8be049fb719968e7f61f02f16a7faeda6aa3/pagexml_tools-0.5.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "b9d82692da5cf57504e97b9dda07cd52efb02230f2fd54d38d5d9cbb6a20033a",
                "md5": "d211734cdd8e2ff1b5682e4f990ed9c7",
                "sha256": "8fbda5e390a5c0199bb968217f77ddb7edc01cbdbc72c6beceac496128f413aa"
            },
            "downloads": -1,
            "filename": "pagexml_tools-0.5.0.tar.gz",
            "has_sig": false,
            "md5_digest": "d211734cdd8e2ff1b5682e4f990ed9c7",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8,<4.0",
            "size": 50183,
            "upload_time": "2024-03-18T13:50:31",
            "upload_time_iso_8601": "2024-03-18T13:50:31.702888Z",
            "url": "https://files.pythonhosted.org/packages/b9/d8/2692da5cf57504e97b9dda07cd52efb02230f2fd54d38d5d9cbb6a20033a/pagexml_tools-0.5.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-18 13:50:31",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "knaw-huc",
    "github_project": "pagexml",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "pagexml-tools"
}
        