# PangoLine
PangoLine is a basic tool to render raw (horizontal) text into PDF documents
and create parallel ALTO files for each page containing baseline and bounding
box information.
It is intended to support the rendering of most of the world's writing systems
in order to create synthetic page-level training data for automatic text
recognition systems. Functionality is fairly basic for now. PDF output is
single-column, justified text without word breaking. Paragraphs are split
automatically once a page is full.
## Installation
You'll need PyGObject and the Pango/Cairo libraries on your system. As
PyGObject is only shipped in source form, this also requires a C compiler and
the usual build environment dependencies. An easier way is to use conda:

    ~> conda create --name pangoline-py3.11 -c conda-forge python=3.11
    ~> conda activate pangoline-py3.11
    ~> conda install -c conda-forge pygobject pango cairo click jinja2 rich pypdfium2 lxml pillow

Afterwards, either install from PyPI:

    ~> pip install pangoline-tool

or directly from the checked-out git repository:

    ~> pip install --no-deps .

## Usage
### Rendering
PangoLine first renders text into vector PDFs and ALTO facsimiles using
configurable "physical" dimensions:

    ~> pangoline render doc.txt
    Rendering ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

Various rendering options, such as page size, margins, language, and base
direction, can be set manually, for example:

    ~> pangoline render -p 216 279 -l en-us -f "Noto Sans 24" doc.txt

Text can also be styled with [Pango
Markup](https://docs.gtk.org/Pango/pango_markup.html). Parsing is disabled by
default but can be enabled with the `--markup` switch. You'll need to escape
any characters that are part of XML syntax, such as &, <, >, quotes, and
various control characters, using [HTML
entities](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references).

    ~> pangoline render --markup doc.txt

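As a minimal sketch of the escaping step, the Python standard library's `html.escape` covers the characters named above (`&`, `<`, `>`, and both quote styles). The helper name `escape_for_pango` is hypothetical and not part of PangoLine. It does not handle control characters, and it should only be applied to plain text content, not to markup tags you want Pango to interpret.

```python
from html import escape


def escape_for_pango(text: str) -> str:
    """Escape XML-special characters in plain text (hypothetical helper).

    html.escape replaces &, <, > and, with quote=True, also " and '.
    Control characters are NOT handled here.
    """
    return escape(text, quote=True)


if __name__ == "__main__":
    print(escape_for_pango('Fish & Chips <"fresh">'))
    # Fish &amp; Chips &lt;&quot;fresh&quot;&gt;
```

The escaped text can then be wrapped in whatever Pango markup is desired before being written to the input file.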
Random styling can also be applied to Unicode [word
segments](https://unicode.org/reports/tr29/#Word_Boundaries) in the text. One
or more styles are randomly selected from a configurable list of styles:

    ~> pangoline render --random-markup-probability 0.01 doc.txt

The value is the probability that at least one style is applied to any
particular segment. A subset of all available styles is enabled by default when
a probability greater than 0 is given. To change the list of possible styles:

    ~> pangoline render --random-markup-probability 0.01 --random-markup style_italic --random-markup variant_smallcaps doc.txt

The semantics of each value can be found in the [Pango documentation](https://docs.gtk.org/Pango/pango_markup.html).

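To illustrate the idea, here is a rough sketch of random styling in Python. It is not PangoLine's implementation: the `STYLES` mapping from style names to `<span>` attributes is an assumption, the function name is hypothetical, and splitting on spaces is only a stand-in for real Unicode word segmentation.

```python
import random
from html import escape

# Assumed mapping from style names to Pango <span> attributes (illustrative).
STYLES = {
    "style_italic": ("style", "italic"),
    "variant_smallcaps": ("variant", "smallcaps"),
}


def random_markup(text, probability, styles=STYLES, rng=None):
    """Wrap each word in a styled <span> with the given probability."""
    rng = rng or random.Random()
    out = []
    for word in text.split(" "):
        # Always escape, so the result is valid markup either way.
        piece = escape(word, quote=True)
        if word and rng.random() < probability:
            # Pick at least one style, possibly several.
            count = rng.randint(1, len(styles))
            chosen = rng.sample(sorted(styles), count)
            attrs = " ".join(
                f'{styles[name][0]}="{styles[name][1]}"' for name in chosen
            )
            piece = f"<span {attrs}>{piece}</span>"
        out.append(piece)
    return " ".join(out)


if __name__ == "__main__":
    print(random_markup("salt & pepper", 0.5, rng=random.Random(1)))
```

Passing a seeded `random.Random` instance makes the output reproducible, which helps when regenerating a training set.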
Styling with color is treated slightly differently than other styles. In
general, colors are selected with the `foreground_*` style. Because Pango knows
a large number of colors, the `foreground_random` alias enables all of them at
once:

    ~> pangoline render --random-markup-probability 0.01 --random-markup foreground_random doc.txt

When applying random styles to words, control characters in the source text
should *not* be escaped as pangoline internally escapes any characters that
require it.
### Rasterization
In a second step, those vector files can be rasterized into PNGs and the
coordinates in the ALTO files scaled to the selected resolution (300 dpi by
default):

    ~> pangoline rasterize doc.0.xml doc.1.xml ...
    Rasterizing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

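The coordinate scaling amounts to a unit conversion from a physical length to pixels. The sketch below shows that arithmetic only. It assumes the vector-stage coordinates are in PDF points (72 per inch), which the README does not confirm, and the function names are hypothetical.

```python
def to_pixels(value, dpi=300, units_per_inch=72.0):
    """Convert a physical length to pixels at the given resolution.

    units_per_inch=72.0 assumes PDF points; use 25.4 for millimetres.
    """
    return value * dpi / units_per_inch


def scale_box(box, dpi=300, units_per_inch=72.0):
    """Scale an (x, y, width, height) box and round to whole pixels."""
    return tuple(round(to_pixels(v, dpi, units_per_inch)) for v in box)


if __name__ == "__main__":
    print(scale_box((72, 144, 360, 18)))  # (300, 600, 1500, 75)
```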
Rasterized files and their ALTO files can be used as-is as ATR training data.
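For a quick look at the line-level data, an ALTO file can be read with Python's standard library. The sketch below assumes ALTO v4-style `TextLine` elements carrying `HPOS`, `VPOS`, `WIDTH`, `HEIGHT`, and a `BASELINE` attribute holding a list of point coordinates. The embedded sample is hand-written for illustration and is not actual PangoLine output.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed ALTO v4 layout (not real tool output).
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine HPOS="100" VPOS="200" WIDTH="800" HEIGHT="60"
              BASELINE="100 250 900 250">
      <String CONTENT="Hello world"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def parse_points(raw):
    """Turn 'x1 y1 x2 y2 ...' (spaces or commas) into [(x1, y1), ...]."""
    nums = [float(n) for n in raw.replace(",", " ").split()]
    return list(zip(nums[0::2], nums[1::2]))


def read_lines(xml_text):
    """Return text, bounding box, and baseline for every TextLine."""
    root = ET.fromstring(xml_text)
    lines = []
    # '{*}' matches any namespace, so the ALTO version does not matter.
    for line in root.findall(".//{*}TextLine"):
        x, y = float(line.get("HPOS")), float(line.get("VPOS"))
        w, h = float(line.get("WIDTH")), float(line.get("HEIGHT"))
        text = " ".join(
            s.get("CONTENT", "") for s in line.findall("./{*}String")
        )
        lines.append({
            "text": text,
            "bbox": (x, y, x + w, y + h),  # left, top, right, bottom
            "baseline": parse_points(line.get("BASELINE", "")),
        })
    return lines


if __name__ == "__main__":
    for entry in read_lines(SAMPLE):
        print(entry)
```

To read a file from disk, `ET.parse(path).getroot()` can replace `ET.fromstring`.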
To obtain slightly more realistic input images, it is possible to overlay the
rasterized text onto images of writing surfaces:

    ~> pangoline rasterize -w ~/background_1.jpg doc.0.xml doc.1.xml ...

Rasterization can be invoked with multiple background images in which case they
will be sampled randomly for each output page. A tarball with 70 empty paper
backgrounds of different origins, digitization qualities, and states of
preservation can be found [here](http://l.unchti.me/paper.tar).
For larger collections of texts it is advisable to parallelize processing,
especially for rasterization with overlays:

    ~> pangoline --workers 8 render *.txt
    ~> pangoline --workers 8 rasterize *.xml

## Limitations
In order to achieve proper typesetting quality, Pango requires placing the
whole text into a single layout before splitting it into individual pages by
translating each line of the layout onto a page surface. This approach limits
the maximum print space of a single text to 739.8 meters, roughly 3000 pages
depending on paper size and margins, before the 32-bit integer y-offset of the
baseline position overflows.
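
The 739.8 meter figure is consistent with Pango's internal units, which are 1/1024 of a point (`PANGO_SCALE`), stored in a signed 32-bit integer. The sketch below redoes that arithmetic. The A4 page height and 25 mm margins are illustrative assumptions, not PangoLine defaults.

```python
PANGO_SCALE = 1024        # Pango units per point
INT32_MAX = 2**31 - 1     # largest signed 32-bit value
MM_PER_POINT = 25.4 / 72  # one point is 1/72 inch

# Largest y-offset expressible in Pango units, converted to millimetres.
max_points = INT32_MAX / PANGO_SCALE
max_mm = max_points * MM_PER_POINT


def max_pages(page_height_mm, top_margin_mm, bottom_margin_mm):
    """Number of full pages that fit before the offset would overflow."""
    print_height = page_height_mm - top_margin_mm - bottom_margin_mm
    return int(max_mm // print_height)


if __name__ == "__main__":
    print(f"{max_mm / 1000:.1f} m")  # 739.8 m
    print(max_pages(297, 25, 25))    # 2995 (A4 with 25 mm margins, assumed)
```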
## Funding
<table border="0">
<tr>
<td> <img src="https://raw.githubusercontent.com/mittagessen/kraken/main/docs/_static/normal-reproduction-low-resolution.jpg" alt="Co-financed by the European Union" width="100"/></td>
<td>This project was funded in part by the European Union (ERC, MiDRASH, project number 101071829).</td>
</tr>
</table>
Raw data
{
"_id": null,
"home_page": "http://pangoline.github.io",
"name": "pangoline-tool",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "atr, document rendering, alto",
"author": "Benjamin Kiessling",
"author_email": "mittagessen@l.unchti.me",
"download_url": "https://files.pythonhosted.org/packages/84/d8/3e1b270df529b91f98cef4cf24056ccaa249e9961f56460609056bb15ff1/pangoline_tool-0.3.0.tar.gz",
"platform": null,
"description": "# PangoLine\n\nPangoLine is a basic tool to render raw (horizontal) text into PDF documents\nand create parallel ALTO files for each page containing baseline and bounding\nbox information. \n\nIt is intended to support the rendering of most of the world's writing systems\nin order to create synthetic page-level training data for automatic text\nrecognition systems. Functionality is fairly basic for now. PDF output is\nsingle column, justified text without word breaking. Paragraphs are split\nautomatically once a page is full.\n\n## Installation\n\nYou'll need PyGObject and the Pango/Cairo libraries on your system. As\nPyGObject is only shipped in source form this also requires a C compiler and\nthe usual build environment dependencies installed. An easier way is to use conda:\n\n ~> conda create --name pangoline-py3.11 -c conda-forge python=3.11\n ~> conda activate pangoline-py3.11\n ~> conda install -c conda-forge pygobject pango Cairo click jinja2 rich pypdfium2 lxml pillow\n\nAfterwards either install from pypi:\n\n ~> pip install pangoline-tool\n\nor directly from the checked out git repository:\n\n ~> pip install --no-deps .\n\n## Usage\n\n### Rendering\n\nPangoLine renders text first into vector PDFs and ALTO facsimiles using some\nconfigurable \"physical\" dimensions.\n\n ~> pangoline render doc.txt\n Rendering \u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501 100% 0:00:00\n\nVarious options to direct rendering such as page size, margins, language, and\nbase direction can be manually set, for example:\n\n ~> pangoline render -p 216 279 -l en-us -f \"Noto Sans 24\" doc.txt\n\nText can also be styled with [Pango\nMarkup](https://docs.gtk.org/Pango/pango_markup.html). Parsing is disabled per\ndefault but can be enabled with a switch. 
You'll need to escape any characters\nthat are part of XML such as &, <, >, quotes, and various control characters\nusing [HTML\nentities](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references).\n\n ~> pangoline render --markup doc.txt\n\nIt is possible to randomly insert stylization of Unicode [word\nsegments](https://unicode.org/reports/tr29/#Word_Boundaries) in the text. One\nor more styles will be randomly selected from a configurable list of styles:\n\n ~> pangoline render --random-markup-probability 0.01 doc.txt\n\nThe probability is the probability of at least one style being applied to any\nparticular segment. A subset of the total available number of styles is enabled\nby default when a probability greater than 0 is given. To change the list of\npossible styles:\n\n ~> pangoline render --random-markup-probability 0.01 --random-markup style_italic --random-markup variant_smallcaps doc.txt\n\nThe semantics of each value can be found in the [pango documentation](https://docs.gtk.org/Pango/pango_markup.html).\n\nStyling with color is treated slightly differently than other styles. In\ngeneral, colors are selected with the `foreground_*` style. 
As a large number\nof colors are known to Pango, the `foreground_random` alias exists that enables\nall possible colors:\n\n ~> pangoline render --random-markup-probability 0.01 --random-markup foreground_random doc.txt\n\nWhen applying random styles to words, control characters in the source text\nshould *not* be escaped as pangoline internally escapes any characters that\nrequire it.\n\n### Rasterization\n\nIn a second step those vector files can be rasterized into PNGs and the\ncoordinates in the ALTO files scaled to the selected resolution (per default\n300dpi):\n\n ~> pangoline rasterize doc.0.xml doc.1.xml ...\n Rasterizing \u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501 100% 0:00:00\n\nRasterized files and their ALTOs can be used as is as ATR training data.\n\nTo obtain slightly more realistic input images it is possible to overlay the\nrasterized text into images of writing surfaces.\n\n ~> pangoline rasterize -w ~/background_1.jpg doc.0.xml doc.1.xml ...\n\nRasterization can be invoked with multiple background images in which case they\nwill be sampled randomly for each output page. A tarball with 70 empty paper\nbackgrounds of different origins, digitization qualities, and states of\npreservation can be found [here](http://l.unchti.me/paper.tar).\n\nFor larger collections of texts it is advisable to parallelize processing,\nespecially for rasterization with overlays:\n\n ~> pangoline --workers 8 render *.txt\n ~> pangoline --workers 8 rasterize *.xml\n\n## Limitations\n\nIn order to achieve proper typesetting quality, Pango requires placing the\nwhole text into a single layout before splitting it into individual pages by\ntranslating each line of the layout onto a page surface. 
This approach limits to\nmaximum print space of a single text to 739.8 meters, roughly 3000 pages\ndepending on paper size and margins, before an overflow of the 32 bit integer\nbaseline position y-offset will occur.\n\n## Funding\n\n<table border=\"0\">\n <tr>\n <td> <img src=\"https://raw.githubusercontent.com/mittagessen/kraken/main/docs/_static/normal-reproduction-low-resolution.jpg\" alt=\"Co-financed by the European Union\" width=\"100\"/></td>\n <td>This project was funded in part by the European Union. (ERC, MiDRASH,project number 101071829).</td>\n </tr>\n</table>\n",
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "Synthetic document rendering with parallel ALTO output",
"version": "0.3.0",
"project_urls": {
"Homepage": "http://pangoline.github.io"
},
"split_keywords": [
"atr",
" document rendering",
" alto"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "52d93d528f99e533f873bd8a69ae21b29fa546e27ea73457b3da180ec9c9f332",
"md5": "2fc9b259cbcfb1f8a904fc6f25affcf1",
"sha256": "d81ef7523359cc434a433e72a5dbfeb3eafbe006cac4b1d28439e1274f6a95eb"
},
"downloads": -1,
"filename": "pangoline_tool-0.3.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "2fc9b259cbcfb1f8a904fc6f25affcf1",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 19296,
"upload_time": "2025-08-28T13:28:07",
"upload_time_iso_8601": "2025-08-28T13:28:07.919065Z",
"url": "https://files.pythonhosted.org/packages/52/d9/3d528f99e533f873bd8a69ae21b29fa546e27ea73457b3da180ec9c9f332/pangoline_tool-0.3.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "84d83e1b270df529b91f98cef4cf24056ccaa249e9961f56460609056bb15ff1",
"md5": "bf7718936be9ea19a34c240119d02cc4",
"sha256": "cb8d490c978c08d3cd278bb52fe504c64d3e8a82d1ec477e02d15e27a9fa49c1"
},
"downloads": -1,
"filename": "pangoline_tool-0.3.0.tar.gz",
"has_sig": false,
"md5_digest": "bf7718936be9ea19a34c240119d02cc4",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 19557,
"upload_time": "2025-08-28T13:28:08",
"upload_time_iso_8601": "2025-08-28T13:28:08.998036Z",
"url": "https://files.pythonhosted.org/packages/84/d8/3e1b270df529b91f98cef4cf24056ccaa249e9961f56460609056bb15ff1/pangoline_tool-0.3.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-28 13:28:08",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "pangoline-tool"
}