# pagexml-tools
[![GitHub Actions](https://github.com/knaw-huc/pagexml/workflows/tests/badge.svg)](https://github.com/knaw-huc/pagexml/actions)
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![Documentation Status](https://readthedocs.org/projects/pagexml/badge/?version=latest)](https://pagexml.readthedocs.io/en/latest/?badge=latest)
[![PyPI](https://img.shields.io/pypi/v/pagexml-tools)](https://pypi.org/project/pagexml-tools/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pagexml-tools)](https://pypi.org/project/pagexml-tools/)
Utility functions for reading [PageXML](https://www.primaresearch.org/tools/PAGELibraries) files.
## Installing
### Using poetry
```commandline
poetry add pagexml-tools
```
### Using pip
```commandline
pip install pagexml-tools
```
## Using
PageXML-tools contains functions for parsing and for a range of analysis tasks.
### Parsing PageXML files and the Physical Document model
There is a tutorial that demonstrates the [physical document model API](./notebooks/Demo-understanding-the-document-model.ipynb).

PageXML-tools provides basic functionality for parsing a PageXML file into
a document model that represents the content of the file. The HTR/OCR process that generates
PageXML recognises text in an image of a physical document.
```python
from pagexml.parser import parse_pagexml_file

pagexml_file = "path/to/pagexml_file.xml"

page_doc = parse_pagexml_file(pagexml_file)

# a page document has an ID
print(page_doc.id)

# print descriptive statistics
print(page_doc.stats)

# iterate over text regions and lines
for tr in page_doc.text_regions:
    # a text_region has an ID and a bounding box derived from its coordinates
    print(tr.id, tr.coords.box)
    # a text_region can have sub-text_regions and lines
    for line in tr.lines:
        # a line has an ID, coordinates and text
        print(line.id, line.coords.box, line.text)
```
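
The bounding box printed by `coords.box` above is derived from the polygon points that PageXML stores in a `Coords` element. The following is a minimal, standard-library sketch of that derivation, not the library's own code: the helper names `parse_points` and `bounding_box` and the `x`/`y`/`w`/`h` dictionary layout are assumptions made for illustration and may differ from what PageXML-tools returns.

```python
def parse_points(points):
    """Parse a PageXML 'points' string such as '10,20 110,25' into (x, y) tuples."""
    coords = []
    for pair in points.split():
        x, y = pair.split(",")
        coords.append((int(x), int(y)))
    return coords


def bounding_box(coords):
    """Return the smallest axis-aligned box that contains all points."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return {
        "x": min(xs),
        "y": min(ys),
        "w": max(xs) - min(xs),
        "h": max(ys) - min(ys),
    }


# hypothetical polygon of a slightly skewed text line
points = "10,20 110,25 105,60 12,58"
print(bounding_box(parse_points(points)))
# {'x': 10, 'y': 20, 'w': 100, 'h': 40}
```

The box starts at the smallest x and y values of the polygon and spans the distance to the largest ones, so a skewed or irregular region is always fully contained.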
### Further functionality
In addition to the basic parsing and handling of PageXML output, there is
functionality to support a range of tasks:
- reading sets of PageXML files from an archive (tar, zip) file ([tutorial](./notebooks/Demo-reading-pagexml-files-from-archive.ipynb)),
- searching in text ([keyword in context](./notebooks/Demo-text-search-simple.ipynb), [keywords or fuzzy search](./notebooks/Demo-text-search-in-pagexml-archive.ipynb)),
- classifying physical document types in a large set of PageXML documents ([tutorial](./notebooks/Demo-analysing-scan-characteristics.ipynb)),
- checking the quality of the HTR/OCR process ([tutorial](./notebooks/Demo-analysing-scan-characteristics-checking-quality.ipynb)),
- comparing subsets ([tutorial](./notebooks/Demo-analysing-scan-characteristics-comparing-subsets.ipynb)),
- identifying document sections in sequences of PageXML documents ([tutorial](./notebooks/Demo-analysing-scan-characteristics-book-sections.ipynb)),
- turning text lines into running text ([tutorial](./notebooks/Demo-from-lines-to-running-text.ipynb)),
- supporting different reading orders ([tutorial](./notebooks/Demo-sorting.ipynb)),
- reinterpreting and restructuring text regions and lines ([tutorial](./notebooks/Demo-restructuring-documents.ipynb)),
- turning physical structure into logical structure.
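
To give an idea of what the first task in the list above involves, here is a rough sketch that reads PageXML files from a zip archive using only the Python standard library. It does not use or reproduce the PageXML-tools API (see the linked tutorial for that); the function names `iter_line_texts`, `read_pagexml_archive` and `build_demo_archive` are hypothetical, and the sketch only extracts line IDs and line-level text.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace from a tag like '{namespace}TextLine'."""
    return tag.rsplit("}", 1)[-1]


def iter_line_texts(xml_bytes):
    """Yield (line_id, text) for every TextLine in one PageXML document."""
    root = ET.fromstring(xml_bytes)
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # only look at the line's own TextEquiv, not word-level ones
        for child in elem:
            if local_name(child.tag) != "TextEquiv":
                continue
            for uni in child:
                if local_name(uni.tag) == "Unicode":
                    yield elem.get("id"), uni.text or ""


def read_pagexml_archive(archive):
    """Yield (filename, lines) for each .xml member of a zip archive."""
    with zipfile.ZipFile(archive) as zf:
        for name in sorted(zf.namelist()):
            if name.endswith(".xml"):
                yield name, list(iter_line_texts(zf.read(name)))


def build_demo_archive():
    """Build a tiny in-memory zip with one made-up PageXML file."""
    xml = (
        b'<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">'
        b'<Page><TextRegion id="r1"><TextLine id="l1">'
        b"<TextEquiv><Unicode>hello</Unicode></TextEquiv>"
        b"</TextLine></TextRegion></Page></PcGts>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("page1.xml", xml)
        zf.writestr("notes.txt", "not a PageXML file")
    buffer.seek(0)
    return buffer


for filename, lines in read_pagexml_archive(build_demo_archive()):
    print(filename, lines)
```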
----
[USAGE](https://pagexml.readthedocs.io/en/latest/) |
[CONTRIBUTING](CONTRIBUTING.md) |
[LICENSE](LICENSE)