# old-doc
Easily create synthetic data for HTR (Handwritten Text Recognition) and OCR (Optical Character Recognition).
## Description
old-doc is a Python package designed to generate synthetic data for training and testing HTR and OCR models. This tool streamlines the process of creating diverse datasets for improving text recognition systems, allowing users to generate custom manuscript-like pages with various text styles, layouts, and effects.
## Installation
You can install old-doc using pip:
```bash
pip install old-doc
```
Note: old-doc requires Python 3.8 or later.
## Features
- Generate synthetic handwritten text images
- Create synthetic printed document images
- Customize text content, fonts, layouts, and degradation effects
- Support for curved text, drop caps, and marginalia
- Export data in image format and ALTO XML for HTR and OCR tasks
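One way to get the diverse datasets described above is to randomize the styling parameters for each generated page. The sketch below is only an illustration: the keyword names (`font_size`, `font_color`, `curve_amount`, `word_spacing`) are taken from the usage example in this README, while the helper `sample_text_block_kwargs` and the value ranges are assumptions, not part of old-doc.

```python
import random


def sample_text_block_kwargs(rng: random.Random) -> dict:
    """Draw one random set of styling keyword arguments for a text block.

    Hypothetical helper: the ranges below are illustrative guesses,
    not values recommended by old-doc.
    """
    ink = rng.randint(0, 60)  # dark ink, same value for R, G and B
    return {
        "font_size": rng.randint(12, 24),
        "font_color": (ink, ink, ink),
        "curve_amount": round(rng.uniform(0.0, 0.2), 3),
        "word_spacing": rng.randint(6, 14),
    }


rng = random.Random(42)  # fixed seed so the same variants are drawn on every run
variants = [sample_text_block_kwargs(rng) for _ in range(3)]
for kwargs in variants:
    print(kwargs)
```

Each dictionary could then be unpacked into a text block, for example `TextBlock(text, **kwargs)`, assuming `TextBlock` accepts these keywords as it does in the usage example.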
## Usage
Here's an example of how to use old-doc to create a sample manuscript page:
```python
from old_doc import TextBlock, Column, Row, Page

# Create text blocks
title = TextBlock("Simple Document", block_type="heading", font_size=40, font_color=(100, 0, 0))
content = TextBlock(
    "This is a sample text for our document. " * 5,
    font_size=16,
    font_color=(0, 0, 0),
    curve_amount=0.1,  # Slight curve to the text
    word_spacing=10,
)

# Create layout
header_row = Row([Column([title], width=800)], height=60)
content_row = Row([Column([content], width=800)], height=400)

# Create page
page = Page(
    [header_row, content_row],
    cell_padding=20,
    background_color=(250, 240, 230),  # Light parchment color
)

# Generate the page
image, alto = page.generate()

# Save the results
image.save("example.png")
page.save_alto_xml("example.alto.xml")

# Display the image (optional, requires matplotlib)
page.visualize_results()
```
This example creates a manuscript-like page with a heading and a body paragraph of slightly curved text on a parchment-colored background. It then generates the page, saves both the image and the ALTO XML output, and displays the result.
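To use the exported ALTO XML as ground truth for HTR or OCR training, the line transcriptions need to be read back out of the file. The following is a minimal sketch using only the standard library. It assumes the usual ALTO layout of `TextLine` elements containing `String` elements with a `CONTENT` attribute; the sample document is hand-written for illustration, and the exact namespace version and attributes that old-doc emits may differ. `local_name` and `read_alto_lines` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Hand-written minimal ALTO v4 sample (an assumption, not real old-doc output).
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="page_1" WIDTH="800" HEIGHT="460">
      <PrintSpace>
        <TextBlock ID="block_1">
          <TextLine ID="line_1" HPOS="20" VPOS="20" WIDTH="300" HEIGHT="40">
            <String CONTENT="Simple"/>
            <SP/>
            <String CONTENT="Document"/>
          </TextLine>
          <TextLine ID="line_2" HPOS="20" VPOS="80" WIDTH="400" HEIGHT="20">
            <String CONTENT="This is a sample text"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def read_alto_lines(xml_text: str):
    """Return a list of (line_id, text) pairs, one per TextLine element."""
    root = ET.fromstring(xml_text.strip())
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Join the CONTENT of every String child; SP (space) elements are skipped.
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append((element.get("ID", ""), " ".join(words)))
    return lines


for line_id, text in read_alto_lines(SAMPLE_ALTO):
    print(line_id, text)
```

For a file on disk, such as the `example.alto.xml` saved above, the same function can be called on the file's contents, for example `read_alto_lines(open("example.alto.xml", encoding="utf-8").read())`.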
## Raw data
{
"_id": null,
"home_page": "https://github.com/wjbmattingly/old-doc",
"name": "old-doc",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": null,
"author": "WJB Mattingly",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/4c/37/1c93ab377ab4a61a0508fca9537699e1faffba75206a080b7cc11bdc58fa/old_doc-0.0.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Easily create synthetic data for HTR and OCR",
"version": "0.0.3",
"project_urls": {
"Homepage": "https://github.com/wjbmattingly/old-doc"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "33984eb7716f96c70903901d172bd749a602d8fa894d3203fd1b9b5c96327d06",
"md5": "3316191a37801b071ba4e913df474581",
"sha256": "8a76dd90a137f3e1b7aea2a668e533b236e4b8c82621ed4ce9eb35bdbb5b1c09"
},
"downloads": -1,
"filename": "old_doc-0.0.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "3316191a37801b071ba4e913df474581",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 13481,
"upload_time": "2024-08-26T22:45:35",
"upload_time_iso_8601": "2024-08-26T22:45:35.820044Z",
"url": "https://files.pythonhosted.org/packages/33/98/4eb7716f96c70903901d172bd749a602d8fa894d3203fd1b9b5c96327d06/old_doc-0.0.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "4c371c93ab377ab4a61a0508fca9537699e1faffba75206a080b7cc11bdc58fa",
"md5": "eb3fc411719ca33aa0b78c2b31cca04e",
"sha256": "2df8f6cd2a252b66ec4ae4efd27d3572de3541d1110ba811c7757e361b04820d"
},
"downloads": -1,
"filename": "old_doc-0.0.3.tar.gz",
"has_sig": false,
"md5_digest": "eb3fc411719ca33aa0b78c2b31cca04e",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 6381,
"upload_time": "2024-08-26T22:45:37",
"upload_time_iso_8601": "2024-08-26T22:45:37.421462Z",
"url": "https://files.pythonhosted.org/packages/4c/37/1c93ab377ab4a61a0508fca9537699e1faffba75206a080b7cc11bdc58fa/old_doc-0.0.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-26 22:45:37",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "wjbmattingly",
"github_project": "old-doc",
"github_not_found": true,
"lcname": "old-doc"
}