# ushmm: A Python Library for Oral Testimonies at the USHMM
This README gives an overview of `ushmm`, a Python library for parsing and processing oral testimonies from the [United States Holocaust Memorial Museum](https://www.ushmm.org/). The library converts testimony PDFs into structured data that can be used for research and education.
## Introduction
The `ushmm` library streamlines work with the collection of oral testimonies available at the USHMM. The testimonies are distributed as PDFs; the library first extracts raw text from them and then converts that text into structured HTML. It wraps Tesseract (for OCR) and Poppler (for parsing PDFs).
![Original Testimony Image](https://raw.githubusercontent.com/wjbmattingly/ushmm/main/images/original.png)
## Installation
You can install the `ushmm` library directly using pip:
```shell
pip install ushmm
```
### Additional Dependencies
For macOS users:
1. Create a new Conda environment.
2. Install Tesseract and Poppler, either with Homebrew or from conda-forge. With conda-forge:
```shell
conda install -c conda-forge tesseract poppler
```
3. If necessary, uninstall `pdf2image` and reinstall it from conda-forge:
```shell
pip uninstall pdf2image
conda install -c conda-forge pdf2image
```
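
Before running the pipeline, it can help to confirm that the external binaries are actually discoverable. The following is a minimal sketch and is not part of `ushmm`; the helper name `find_external_tools` is hypothetical. It assumes Tesseract installs a `tesseract` executable and that Poppler provides `pdftoppm`, the command `pdf2image` uses to rasterize pages.

```python
import shutil

# Executables the pipeline relies on: Tesseract's CLI and Poppler's
# pdftoppm (used by pdf2image to rasterize PDF pages).
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def find_external_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_external_tools().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```

If either tool is reported as missing, activate the Conda environment in which it was installed and run the check again.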
## Usage
The `ushmm` library provides functions that convert PDF testimonies into images, remove unwanted elements such as footers, run OCR on the result, clean the extracted text, and convert it into structured HTML:
```python
from ushmm import pdf_to_images, images_to_text, clean_texts, remove_footers, process_testimony_texts
# Convert PDF to images
images = pdf_to_images("path/to/pdf", "path/to/images", save=True)
# Remove footers using OpenCV
cropped_images = remove_footers("path/to/images", "path/to/cropped_images", save=True)
# Perform OCR on the images
texts = images_to_text("path/to/cropped_images", "path/to/text", save=True)
# Clean the OCR output
cleaned_texts = clean_texts("path/to/text", "path/to/cleaned_text", save=True)
# Process the cleaned text into structured data
html_result = process_testimony_texts("path/to/cleaned_text", "output_file.html", save=True)
```
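
To give a sense of what cleaning OCR output typically involves, here is a minimal standard-library sketch. It is not the implementation behind `clean_texts`; the function name `clean_ocr_text` and the specific rules (Unicode normalization, rejoining words hyphenated across line breaks, collapsing whitespace) are assumptions chosen for illustration.

```python
import re
import unicodedata


def clean_ocr_text(text: str) -> str:
    """Apply a few common OCR clean-up rules to one page of text."""
    # Normalize compatibility characters (e.g. the "fi" ligature) to plain forms.
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words split by a hyphen at a line break: "testi-\nmony" -> "testimony".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Strip whitespace around each line and drop lines that end up empty.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Note that the hyphen rule in this sketch also joins genuine compounds that happen to break across lines (for example `well-` / `known`), so a real pipeline would likely need a dictionary check or a more conservative rule.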
## Features
- **PDF Conversion**: Converts PDF documents into a sequence of images.
- **Image Cropping**: Identifies and removes footers from images using OpenCV.
- **OCR Processing**: Applies Tesseract OCR to convert images into text.
- **Data Cleaning**: Cleans the OCR output to prepare it for structured data conversion.
- **Structured Data**: Parses raw text files and converts them into structured HTML documents.
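
As an illustration of the last feature, the sketch below turns cleaned transcript text into simple HTML. It is not how `process_testimony_texts` works internally. It assumes, purely for the example, that each speaker turn begins with a `Q:` or `A:` label; the names `parse_turns` and `turns_to_html` and the CSS class names are hypothetical.

```python
import html
import re

# Assumed convention: a new speaker turn starts with "Q:" or "A:".
SPEAKER_RE = re.compile(r"^(Q|A):\s*(.*)$")


def parse_turns(text):
    """Group lines into (speaker, utterance) pairs.

    Unlabeled lines are appended to the current turn; lines that appear
    before the first label are skipped.
    """
    turns = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SPEAKER_RE.match(line)
        if match:
            turns.append([match.group(1), match.group(2)])
        elif turns:
            turns[-1][1] = f"{turns[-1][1]} {line}".strip()
    return [(speaker, utterance) for speaker, utterance in turns]


def turns_to_html(turns):
    """Render each turn as a paragraph tagged with a speaker-specific class."""
    css = {"Q": "question", "A": "answer"}
    return "\n".join(
        f'<p class="{css[speaker]}">{html.escape(utterance)}</p>'
        for speaker, utterance in turns
    )
```

Escaping each utterance with `html.escape` keeps stray `<`, `>`, or `&` characters from OCR output from breaking the generated markup.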
## Data Accessibility
Making the data accessible is a crucial aspect of the `ushmm` library. With the provided functions, users can not only process the testimonies but also make them available for public access and research.
## Contributing
Contributions to the `ushmm` library are welcome. Please refer to the [contribution guidelines](https://github.com/wjbmattingly/ushmm/blob/main/CONTRIBUTING.md) for more information.
## License
The `ushmm` library is provided under the MIT License. See the [LICENSE](https://github.com/wjbmattingly/ushmm/blob/main/LICENSE) file for more details.
## Acknowledgments
This library was made possible by the collaborative efforts at the United States Holocaust Memorial Museum and contributions from the open-source community.
Raw data
{
"_id": null,
"home_page": "https://github.com/wjbmattingly/ushmm",
"name": "ushmm",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": "pdf image testimonies",
"author": "W.J.B. Mattingly",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/bc/76/6e60880a2c2f78cfd1ad8b338d6e4df240eb6b917a7024d345109f9e975c/ushmm-0.0.6.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "A suite of tools for working with data at the United States Holocaust Memorial Museum",
"version": "0.0.6",
"project_urls": {
"Homepage": "https://github.com/wjbmattingly/ushmm"
},
"split_keywords": [
"pdf",
"image",
"testimonies"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "be6509eddde5b7494f9d7ec0fe43c426af0e1c7ab131e175887f596abf44725e",
"md5": "b12175149a0c0079656e7331ac42273c",
"sha256": "728304c29a5aec669858cb8026361f4cdb21835b55ed45f1683f95e7adf43b08"
},
"downloads": -1,
"filename": "ushmm-0.0.6-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b12175149a0c0079656e7331ac42273c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 7861,
"upload_time": "2024-09-11T22:09:46",
"upload_time_iso_8601": "2024-09-11T22:09:46.645159Z",
"url": "https://files.pythonhosted.org/packages/be/65/09eddde5b7494f9d7ec0fe43c426af0e1c7ab131e175887f596abf44725e/ushmm-0.0.6-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "bc766e60880a2c2f78cfd1ad8b338d6e4df240eb6b917a7024d345109f9e975c",
"md5": "3c45c289391937968933299253bdab03",
"sha256": "41b69370a8bc4c5da70c0b9fcaa20c5cb31fab9bcc88a8e3600890d64b99017a"
},
"downloads": -1,
"filename": "ushmm-0.0.6.tar.gz",
"has_sig": false,
"md5_digest": "3c45c289391937968933299253bdab03",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 7830,
"upload_time": "2024-09-11T22:09:47",
"upload_time_iso_8601": "2024-09-11T22:09:47.570362Z",
"url": "https://files.pythonhosted.org/packages/bc/76/6e60880a2c2f78cfd1ad8b338d6e4df240eb6b917a7024d345109f9e975c/ushmm-0.0.6.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-09-11 22:09:47",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "wjbmattingly",
"github_project": "ushmm",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "pdf2image",
"specs": []
},
{
"name": "Pillow",
"specs": []
},
{
"name": "pytesseract",
"specs": []
},
{
"name": "unidecode",
"specs": []
},
{
"name": "opencv-python",
"specs": []
}
],
"lcname": "ushmm"
}