ushmm


- Name: ushmm
- Version: 0.0.6
- home_page: https://github.com/wjbmattingly/ushmm
- Summary: A suite of tools for working with data at the United States Holocaust Memorial Museum
- upload_time: 2024-09-11 22:09:47
- maintainer: None
- docs_url: None
- author: W.J.B. Mattingly
- requires_python: >=3.7
- license: None
- keywords: pdf, image, testimonies
- requirements: pdf2image, Pillow, pytesseract, unidecode, opencv-python

# ushmm: A Python Library for Oral Testimonies at the USHMM

This README provides an overview of the `ushmm` Python library, developed for parsing and processing oral testimonies from the [United States Holocaust Memorial Museum](https://www.ushmm.org/). The `ushmm` library is designed to facilitate the conversion of PDFs into structured data, which can then be used for various research and educational purposes.

## Introduction

The `ushmm` library streamlines work with the collection of oral testimonies available at the USHMM. These testimonies, which come in PDF format, are processed into raw text and then into structured data. The library wraps Tesseract (for OCR) and Poppler (for rendering PDF pages as images), and it converts the testimonies into structured HTML.

![Original Testimony Image](https://raw.githubusercontent.com/wjbmattingly/ushmm/main/images/original.png)

## Installation

You can install the `ushmm` library directly using pip:

```shell
pip install ushmm
```

### Additional Dependencies

For macOS users:

1. Create a new Conda environment.
2. Install Tesseract and Poppler using Homebrew or conda-forge. With conda-forge:

```shell
conda install -c conda-forge tesseract poppler
```

3. If necessary, uninstall the pip-installed `pdf2image` and reinstall it from conda-forge:

```shell
pip uninstall pdf2image
conda install -c conda-forge pdf2image
```

## Usage

The `ushmm` library provides functions that convert PDF testimonies into page images, crop footers from those images, run OCR on them, clean the resulting text, and build structured HTML from it:

```python
from ushmm import pdf_to_images, images_to_text, clean_texts, remove_footers, process_testimony_texts

# Convert PDF to images
images = pdf_to_images("path/to/pdf", "path/to/images", save=True)

# Remove footers using OpenCV
cropped_images = remove_footers("path/to/images", "path/to/cropped_images", save=True)

# Perform OCR on the images
texts = images_to_text("path/to/cropped_images", "path/to/text", save=True)

# Clean the OCR output
cleaned_texts = clean_texts("path/to/text", "path/to/cleaned_text", save=True)

# Process the cleaned text into structured data
html_result = process_testimony_texts("path/to/cleaned_text", "output_file.html", save=True)
```
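
Each step above reads from one folder and writes to another, so it may be convenient to derive all of the paths for a testimony in one place. The sketch below is one possible layout, not something the library prescribes: the function `plan_paths`, the stage folder names, and the example file name `interview_001.pdf` are all assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical stage names, mirroring the placeholder paths in the usage example.
STAGES = ("images", "cropped_images", "text", "cleaned_text")


def plan_paths(pdf_path, work_dir):
    """Map each pipeline stage to a folder under work_dir/<pdf stem>."""
    pdf_path = Path(pdf_path)
    base = Path(work_dir) / pdf_path.stem
    paths = {stage: base / stage for stage in STAGES}
    # The final step writes a single HTML file rather than a folder.
    paths["html"] = base / f"{pdf_path.stem}.html"
    return paths


paths = plan_paths("testimonies/interview_001.pdf", "work")
for stage, path in paths.items():
    print(f"{stage}: {path}")
```

The resulting values can be passed to the functions above as strings, for example `str(paths["images"])`. This README does not say whether the functions create missing output folders, so creating them beforehand with `Path.mkdir(parents=True, exist_ok=True)` is a safe default.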

## Features

- **PDF Conversion**: Converts PDF documents into a sequence of images.
- **Image Cropping**: Identifies and removes footers from images using OpenCV.
- **OCR Processing**: Applies Tesseract OCR to convert images into text.
- **Data Cleaning**: Cleans the OCR output to prepare it for structured data conversion.
- **Structured Data**: Parses raw text files and converts them into structured HTML documents.
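
To give a sense of what a data-cleaning step can involve, here is a minimal sketch of generic OCR clean-up rules written with the standard library. It is not the implementation of `clean_texts`, whose rules this README does not describe. The function name `clean_ocr_text` and the specific rules (Unicode normalization, rejoining hyphenated line breaks, collapsing whitespace) are assumptions chosen for illustration.

```python
import re
import unicodedata


def clean_ocr_text(text):
    """Apply a few generic OCR clean-up rules to one page of text.

    Hypothetical example; the rules used by ushmm itself may differ.
    """
    # Normalize Unicode so visually identical characters compare equal.
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words hyphenated across a line break: "testi-\nmony" -> "testimony".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Strip each line and drop the empty ones.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


sample = "The testi-\nmony   was\trecorded \n\n in 1990."
print(clean_ocr_text(sample))
# The testimony was recorded
# in 1990.
```

Note that the hyphen rule also joins genuine compounds that happen to break at the end of a line, so a production pipeline would likely need a dictionary check or a more conservative rule.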

## Data Accessibility

Making the data accessible is a crucial aspect of the `ushmm` library. With the provided functions, users can not only process the testimonies but also make them available for public access and research.

## Contributing

Contributions to the `ushmm` library are welcome. Please refer to the [contribution guidelines](https://github.com/wjbmattingly/ushmm/blob/main/CONTRIBUTING.md) for more information.

## License

The `ushmm` library is provided under the MIT License. See the [LICENSE](https://github.com/wjbmattingly/ushmm/blob/main/LICENSE) file for more details.

## Acknowledgments

This library was made possible by the collaborative efforts at the United States Holocaust Memorial Museum and contributions from the open-source community.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/wjbmattingly/ushmm",
    "name": "ushmm",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "pdf image testimonies",
    "author": "W.J.B. Mattingly",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/bc/76/6e60880a2c2f78cfd1ad8b338d6e4df240eb6b917a7024d345109f9e975c/ushmm-0.0.6.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A suite of tools for working with data at the United States Holocaust Memorial Museum",
    "version": "0.0.6",
    "project_urls": {
        "Homepage": "https://github.com/wjbmattingly/ushmm"
    },
    "split_keywords": [
        "pdf",
        "image",
        "testimonies"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "be6509eddde5b7494f9d7ec0fe43c426af0e1c7ab131e175887f596abf44725e",
                "md5": "b12175149a0c0079656e7331ac42273c",
                "sha256": "728304c29a5aec669858cb8026361f4cdb21835b55ed45f1683f95e7adf43b08"
            },
            "downloads": -1,
            "filename": "ushmm-0.0.6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "b12175149a0c0079656e7331ac42273c",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 7861,
            "upload_time": "2024-09-11T22:09:46",
            "upload_time_iso_8601": "2024-09-11T22:09:46.645159Z",
            "url": "https://files.pythonhosted.org/packages/be/65/09eddde5b7494f9d7ec0fe43c426af0e1c7ab131e175887f596abf44725e/ushmm-0.0.6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "bc766e60880a2c2f78cfd1ad8b338d6e4df240eb6b917a7024d345109f9e975c",
                "md5": "3c45c289391937968933299253bdab03",
                "sha256": "41b69370a8bc4c5da70c0b9fcaa20c5cb31fab9bcc88a8e3600890d64b99017a"
            },
            "downloads": -1,
            "filename": "ushmm-0.0.6.tar.gz",
            "has_sig": false,
            "md5_digest": "3c45c289391937968933299253bdab03",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 7830,
            "upload_time": "2024-09-11T22:09:47",
            "upload_time_iso_8601": "2024-09-11T22:09:47.570362Z",
            "url": "https://files.pythonhosted.org/packages/bc/76/6e60880a2c2f78cfd1ad8b338d6e4df240eb6b917a7024d345109f9e975c/ushmm-0.0.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-11 22:09:47",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "wjbmattingly",
    "github_project": "ushmm",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "pdf2image",
            "specs": []
        },
        {
            "name": "Pillow",
            "specs": []
        },
        {
            "name": "pytesseract",
            "specs": []
        },
        {
            "name": "unidecode",
            "specs": []
        },
        {
            "name": "opencv-python",
            "specs": []
        }
    ],
    "lcname": "ushmm"
}
        