# pagexml-tools
Utility functions for reading [PageXML](https://www.primaresearch.org/tools/PAGELibraries) files.
## Installing
### Using poetry
```commandline
poetry add pagexml-tools
```
### Using pip
```commandline
pip install pagexml-tools
```
## Using
PageXML-tools contains functions for parsing and for a range of analysis tasks.
### Parsing PageXML files and the Physical Document model
There is a tutorial that demonstrates the [physical document model API](./notebooks/Demo-understanding-the-document-model.ipynb).
PageXML-tools provides basic functionality for parsing a PageXML file into
a document model that represents the content of the file. The HTR/OCR process that generates
PageXML recognises text in an image of a physical document.
```python
from pagexml.parser import parse_pagexml_file
pagexml_file = "path/to/pagexml_file.xml"
page_doc = parse_pagexml_file(pagexml_file)
# a page document has an ID
print(page_doc.id)
# print descriptive statistics
print(page_doc.stats)
# iterate over text regions and lines
for tr in page_doc.text_regions:
    # a text_region has an ID and a bounding box derived from its coordinates
    print(tr.id, tr.coords.box)
    # a text_region can have sub-text_regions and lines
    for line in tr.lines:
        # a line has an ID, coordinates and text
        print(line.id, line.coords.box, line.text)
```
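Because a text region can contain sub-text regions as well as lines, a flat loop over `page_doc.text_regions` may miss lines that sit deeper in the hierarchy. The following is a minimal sketch of a recursive walk, not part of the PageXML-tools API. The helper name `iter_lines` is hypothetical, and the sketch assumes that nested regions are exposed through a `text_regions` attribute, as they are on the page. It yields a region's own lines before those of its sub-regions, which is an arbitrary choice and not necessarily the reading order. To keep the example runnable without a PageXML file, it uses `SimpleNamespace` stand-ins instead of a parsed document.

```python
from types import SimpleNamespace


def iter_lines(doc):
    """Yield every line below a page or text region, depth-first.

    Hypothetical helper: it only relies on the `lines` and `text_regions`
    attributes and treats a missing or empty attribute as "no children".
    """
    # lines attached directly to this element come first
    for line in getattr(doc, "lines", None) or []:
        yield line
    # then descend into nested text regions (assumed attribute name)
    for text_region in getattr(doc, "text_regions", None) or []:
        yield from iter_lines(text_region)


# Stand-in objects that mimic the shape of a parsed page document.
inner = SimpleNamespace(
    id="tr-1-1",
    lines=[SimpleNamespace(id="l-2", text="second line")],
    text_regions=[],
)
outer = SimpleNamespace(
    id="tr-1",
    lines=[SimpleNamespace(id="l-1", text="first line")],
    text_regions=[inner],
)
page = SimpleNamespace(id="page-1", text_regions=[outer])

# collect the text of all lines, skipping lines without recognised text
texts = [line.text for line in iter_lines(page) if line.text]
print(texts)
```

With a real document, the same function could be called on the result of `parse_pagexml_file`, provided the attribute names match the assumptions above.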
### Analysis tasks
In addition to the basic parsing and handling of PageXML output, there is
functionality to support a range of tasks:
- reading sets of PageXML files from an archive (tar, zip) file ([tutorial](./notebooks/Demo-reading-pagexml-files-from-archive.ipynb)),
- searching in text ([keyword in context](./notebooks/Demo-text-search-simple.ipynb), [keywords or fuzzy search](./notebooks/Demo-text-search-in-pagexml-archive.ipynb)),
- classifying physical document types in a large set of PageXML documents ([tutorial](./notebooks/Demo-analysing-scan-characteristics.ipynb)),
- checking the quality of the HTR/OCR process ([tutorial](./notebooks/Demo-analysing-scan-characteristics-checking-quality.ipynb)),
- comparing subsets ([tutorial](./notebooks/Demo-analysing-scan-characteristics-comparing-subsets.ipynb)),
- identifying document sections in sequences of PageXML documents ([tutorial](./notebooks/Demo-analysing-scan-characteristics-book-sections.ipynb)),
- turning text lines into running text ([tutorial](./notebooks/Demo-from-lines-to-running-text.ipynb)),
- supporting different reading orders ([tutorial](./notebooks/Demo-sorting.ipynb)),
- reinterpreting and restructuring text regions and lines ([tutorial](./notebooks/Demo-restructuring-documents.ipynb)),
- turning physical structure into logical structure.
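
To give a rough idea of what turning text lines into running text involves, here is a minimal sketch in plain Python. It is not the method used by PageXML-tools (the linked tutorial covers that). The function name `lines_to_running_text` and the `hyphen_chars` parameter are hypothetical. The sketch naively assumes that any line ending in a hyphen character marks a word broken across lines, which is wrong for genuine hyphens such as in "well-known".

```python
def lines_to_running_text(line_texts, hyphen_chars=("-", "\u00ac")):
    """Join line texts into one string, merging words split by a line break.

    Hypothetical helper: each entry in `hyphen_chars` must be a single
    character, and a trailing one is always treated as a line-break hyphen.
    """
    running_text = ""
    for text in line_texts:
        # lines without recognised text can be None or whitespace only
        if text is None:
            continue
        text = text.strip()
        if not text:
            continue
        if not running_text:
            running_text = text
        elif running_text.endswith(hyphen_chars):
            # drop the hyphen and glue the two word parts together
            running_text = running_text[:-1] + text
        else:
            running_text = running_text + " " + text
    return running_text


example_lines = ["The quick brown", "fox jum-", "ped over", None, "  the lazy dog.  "]
print(lines_to_running_text(example_lines))
```

The input could be the `line.text` values gathered while iterating over a parsed page document.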
----
[USAGE](https://pagexml.readthedocs.io/en/latest/) |
[CONTRIBUTING](CONTRIBUTING.md) |
[LICENSE](LICENSE)
Raw data
{
"_id": null,
"home_page": "https://github.com/knaw-huc/pagexml",
"name": "pagexml-tools",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.9",
"maintainer_email": null,
"keywords": null,
"author": "Marijn Koolen",
"author_email": "marijn.koolen@huygens.knaw.nl",
"download_url": "https://files.pythonhosted.org/packages/32/d2/c5c42adeb06ca3162cbfe54f005c2944c3a9538f40ec74938fdf63d3a889/pagexml_tools-0.6.0.tar.gz",
"platform": null,
"description": "# pagexml-tools\n\n[](https://github.com/knaw-huc/pagexml/actions)\n[](https://www.repostatus.org/#active)\n[](https://pagexml.readthedocs.io/en/latest/?badge=latest)\n[](https://pypi.org/project/pagexml-tools/)\n[](https://pypi.org/project/pagexml-tools/)\n\nUtility functions for reading [PageXML](https://www.primaresearch.org/tools/PAGELibraries) files\n\n## installing\n\n### using poetry\n\n```commandline\npoetry add pagexml-tools\n```\n\n### using pip\n\n```commandline\npip install pagexml-tools\n```\n\n## Using\n\nPageXML-tools contains functions for parsing and for a range of analysis tasks.\n\n### Parsing PageXML files and the Physical Document model\n\nThere is a tutorial that demonstrates the [physical document model API](./notebooks/Demo-understanding-the-document-model.ipynb)\n\nPageXML-tools contains basic functionality for parsing a PageXML file that returns\na document model representing the content of the file. The HTR/OCR process that generates\nPageXML, recognises text in an image of a physical document.\n\n```python\nfrom pagexml.parser import parse_pagexml_file\n\npagexml_file = \"path/to/pagexml_file.xml\"\n\npage_doc = parse_pagexml_file(pagexml_file)\n\n# a page document has an ID\nprint(page_doc.id)\n\n# print descriptive statistics\nprint(page_doc.stats)\n\n# iterative over text regions and lines\nfor tr in page_doc.text_regions:\n # a text_region has an ID and a bounding box derived from its coordinates\n print(tr.id, tr.coords.box)\n # a text_region can have sub-text_regions and lines\n for line in tr.lines:\n # a line has an ID, coordinates and text\n print(line.id, line.coords.box, line.text)\n```\n\n###\n\nIn addition to the basic parsing and handling of PageXML output, there is\nfunctionality to support a range of tasks:\n\n- reading sets of PageXML files from a archive (tar, zip) file ([tutorial](./notebooks/Demo-reading-pagexml-files-from-archive.ipynb)),\n- searching in text ([keyword in 
context](./notebooks/Demo-text-search-simple.ipynb), [keywords or fuzzy search](./notebooks/Demo-text-search-in-pagexml-archive.ipynb))\n- classifying physical document types in a large set of PageXML documents ([tutorial](./notebooks/Demo-analysing-scan-characteristics.ipynb)),\n- checking the quality of the HTR/OCR process ([tutorial](./notebooks/Demo-analysing-scan-characteristics-checking-quality.ipynb)),\n- comparing subsets ([tutorial](./notebooks/Demo-analysing-scan-characteristics-comparing-subsets.ipynb)),\n- identifying document sections in sequences of PageXML documents ([tutorial](./notebooks/Demo-analysing-scan-characteristics-book-sections.ipynb)),\n- turning text lines into running text ([tutorial](./notebooks/Demo-from-lines-to-running-text.ipynb)),\n- supporting different reading orders ([tutorial](./notebooks/Demo-sorting.ipynb)),\n- reinterpreting and restructuring text regions and lines ([tutorial](./notebooks/Demo-restructuring-documents.ipynb)),\n- turning physical structure into logical structure,\n\n----\n\n[USAGE](https://pagexml.readthedocs.io/en/latest/) |\n[CONTRIBUTING](CONTRIBUTING.md) |\n[LICENSE](LICENSE)\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "Utility functions for reading PageXML files",
"version": "0.6.0",
"project_urls": {
"Bug Tracker": "https://github.com/knaw-huc/pagexml/issues",
"Homepage": "https://github.com/knaw-huc/pagexml",
"Repository": "https://github.com/knaw-huc/pagexml"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "3eb712aa702b24afe5dc814a13c87e8db0a1a6a9127ee38db8aaf230d8d8bebe",
"md5": "8621c5620151543debab4b8e98a80306",
"sha256": "2a028c7ad00553a6b78687fca4d5e948788870d5de7529f86a7a916ca5e7ccbc"
},
"downloads": -1,
"filename": "pagexml_tools-0.6.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "8621c5620151543debab4b8e98a80306",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.9",
"size": 54289,
"upload_time": "2025-01-07T12:04:08",
"upload_time_iso_8601": "2025-01-07T12:04:08.878792Z",
"url": "https://files.pythonhosted.org/packages/3e/b7/12aa702b24afe5dc814a13c87e8db0a1a6a9127ee38db8aaf230d8d8bebe/pagexml_tools-0.6.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "32d2c5c42adeb06ca3162cbfe54f005c2944c3a9538f40ec74938fdf63d3a889",
"md5": "5e785ddae096291c79c2c57a65485758",
"sha256": "3985ccffde573153b708e3e48b8d5d0be0412ff43adf4e3d436ece4902dba3dc"
},
"downloads": -1,
"filename": "pagexml_tools-0.6.0.tar.gz",
"has_sig": false,
"md5_digest": "5e785ddae096291c79c2c57a65485758",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.9",
"size": 50405,
"upload_time": "2025-01-07T12:04:11",
"upload_time_iso_8601": "2025-01-07T12:04:11.530978Z",
"url": "https://files.pythonhosted.org/packages/32/d2/c5c42adeb06ca3162cbfe54f005c2944c3a9538f40ec74938fdf63d3a889/pagexml_tools-0.6.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-01-07 12:04:11",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "knaw-huc",
"github_project": "pagexml",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "pagexml-tools"
}