visarchpy

Name	visarchpy
Version	1.0.4
home_page
Summary	Data pipelines for extraction, transformation and visualization of architectural visuals in Python.
upload_time	2024-02-02 00:44:01
maintainer
docs_url	None
author
requires_python	>=3.10
license	MIT
keywords	data pipelines visuals architecture pdf etl computer vision dino tu delft
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![PyPI](https://img.shields.io/pypi/v/visarchpy.svg)](https://pypi.python.org/pypi/visarchpy)
[![PyPI_versions](https://img.shields.io/pypi/pyversions/visarchpy.svg)](https://pypi.python.org/pypi/visarchpy)
[![PyPI_status](https://img.shields.io/pypi/status/visarchpy.svg)](https://pypi.python.org/pypi/visarchpy)
[![PyPI_format](https://img.shields.io/pypi/format/visarchpy.svg)](https://pypi.python.org/pypi/visarchpy)
![Unit Tests](https://github.com/AiDAPT-A/VisArchPy/actions/workflows/unit-tests.yml/badge.svg)
[![Docs](https://readthedocs.org/projects/visarchpy/badge/?version=latest)](https://visarchpy.readthedocs.io)

# VisArchPy

Data pipelines for extraction, transformation and visualization of architectural visuals in Python. It extracts images embedded in PDF files, collects relevant metadata, and extracts visual features using the DinoV2 model.
We aim to turn this package into an AI-powered tool with features for recognizing different types of architectural visuals (types of buildings, structures, etc.). The package is still in development, and we are working on adding more features and improving the existing ones. If you have any suggestions or questions, please open an issue in our [GitHub repository](https://github.com/AiDAPT-A/VisArchPy/issues).

## Main Features

#### Extraction pipelines

- **Layout:** pipeline for extracting metadata and visuals (images) from PDF files using a layout analysis. Layout analysis recursively checks elements in the PDF file and sorts them into images, text, and other elements.
- **OCR:** pipeline for extracting metadata and visuals from PDF files using OCR analysis. OCR analysis extracts images from PDF files using Tesseract OCR.
- **LayoutOCR:** pipeline for extracting metadata and visuals from PDF files that combines layout and OCR analysis.
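
The recursive sorting described for the Layout pipeline can be pictured with a small sketch. This is not VisArchPy's code: the `Element` class and its `kind` and `children` fields are hypothetical stand-ins for whatever objects a PDF parser returns, and the three buckets simply mirror the "images, text, and other elements" wording above.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical layout element; not a VisArchPy or PDF-library type."""
    kind: str  # e.g. "image", "text", "container", "line"
    children: list["Element"] = field(default_factory=list)


def sort_elements(element, buckets=None):
    """Recursively walk an element tree and sort its leaves into three buckets."""
    if buckets is None:
        buckets = {"images": [], "text": [], "other": []}
    if element.children:
        # Containers are not stored themselves; descend into their children.
        for child in element.children:
            sort_elements(child, buckets)
    elif element.kind == "image":
        buckets["images"].append(element)
    elif element.kind == "text":
        buckets["text"].append(element)
    else:
        buckets["other"].append(element)
    return buckets


# A made-up page: one text block, a nested container, and a top-level image.
page = Element("container", [
    Element("text"),
    Element("container", [Element("image"), Element("line")]),
    Element("image"),
])
result = sort_elements(page)
print({name: len(items) for name, items in result.items()})
```

The point of the recursion is that images nested inside containers are found as well as images placed directly on the page.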

#### Metadata Extraction
- Extraction of metadata for extracted images (document page, image size).
- Extraction of image captions based on proximity to images and *text analysis* using keywords.
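
As an illustration of caption matching by proximity and keywords, here is a minimal sketch. It is an assumption about how such a search could work, not the package's implementation: the `Box` class, the `find_caption` function, and the rule "prefer keyword matches, then the smallest gap" are all hypothetical. Coordinates are unitless here, with `y` growing downwards.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned box (hypothetical helper); y grows downwards."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""


def find_caption(image, text_boxes, offset=4.0,
                 keywords=("figure", "caption", "figuur")):
    """Return the text of the most likely caption below `image`, or None.

    A text box is a candidate when it overlaps the image horizontally and
    starts at most `offset` below it. Candidates that begin with a keyword
    are preferred; ties are broken by the smaller vertical gap.
    """
    candidates = []
    for box in text_boxes:
        overlaps = box.x0 < image.x1 and box.x1 > image.x0
        gap = box.y0 - image.y1  # vertical distance below the image
        if overlaps and 0 <= gap <= offset:
            has_keyword = box.text.strip().lower().startswith(tuple(keywords))
            candidates.append((not has_keyword, gap, box.text))
    candidates.sort()  # keyword matches first, then closest
    return candidates[0][2] if candidates else None


image = Box(10, 10, 110, 90)
texts = [
    Box(10, 91, 60, 95, "Source: city archive"),
    Box(10, 92, 110, 98, "Figure 3: Floor plan"),
    Box(10, 150, 110, 160, "Figure 4: Section"),  # too far below the image
]
print(find_caption(image, texts))  # Figure 3: Floor plan
```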

#### Transformation utilities
- **Dino:** pipeline for transforming images into visual features using self-supervised learning with [DinoV2](https://ai.meta.com/blog/dino-v2-computer-vision-self-supervised-learning/).

#### Visualization utilities
- **Viz:** a utility to create a *bounding box plot*. This plot provides an overview of the shapes and sizes of images in a data set.

    ![Example Bbox plot](docs/img/all-plot-heat.png)

## Dependencies

- Python 3.10 or 3.11
- [Tesseract v4.0 or later](https://tesseract-ocr.github.io/)
- [PyTorch v2.1 or later](https://pytorch.org/get-started/locally/)

## Installation

After installing the dependencies, install VisArchPy using `pip`.

```shell
pip install visarchpy
```

### Installing from source

1. Clone the repository.
    ```shell
    git clone https://github.com/AiDAPT-A/VisArchPy.git
    ```
2. Go to the root of the repository.
   ```shell
   cd VisArchPy/
   ```
3. Install the package using `pip`.

    ```shell
    pip install .
    ```

Developers who intend to modify the source code can install additional dependencies for testing and documentation as follows.

1. Go to the root directory of the repository, `VisArchPy/`.

2.  Run:

   ```shell
   pip install -e ".[dev]"
   ```

## Usage

VisArchPy provides a command line interface to access its functionality. If you want to use VisArchPy as a Python package, consult the [documentation](https://visarchpy.readthedocs.io).

1. To access the CLI:

```shell
visarch -h
```

2. To access a particular pipeline:

```shell
visarch [PIPELINE] [SUBCOMMAND]
```

For example, to run the `layout` pipeline using a single PDF file, do the following:

```shell
visarch layout from-file <path-to-pdf-file> <path-output-directory>
```

Use `visarch [PIPELINE] [SUBCOMMAND] -h` for help.
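
To process several PDF files from a script, the `from-file` command shown above can be called once per file. The sketch below is one possible wrapper, not part of VisArchPy: `layout_command` and `run_layout_on_folder` are hypothetical helper names, and only the `visarch layout from-file <pdf> <output-directory>` form from this README is used. It defaults to a dry run that prints the commands instead of executing them.

```python
import subprocess
from pathlib import Path


def layout_command(pdf_path, output_dir):
    """Build the argument list for `visarch layout from-file`."""
    return ["visarch", "layout", "from-file", str(pdf_path), str(output_dir)]


def run_layout_on_folder(pdf_dir, output_dir, dry_run=True):
    """Run the layout pipeline once per PDF file found in `pdf_dir`."""
    pdfs = sorted(Path(pdf_dir).glob("*.pdf"))
    commands = [layout_command(pdf, output_dir) for pdf in pdfs]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if visarch exits with an error.
            subprocess.run(cmd, check=True)
    return commands
```

The CLI may already offer subcommands for handling many files at once; `visarch layout -h` lists what is available, so a wrapper like this is only needed when more control is wanted.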

### Results

Results from the data extraction pipelines (Layout, OCR, LayoutOCR) are saved to the output directory. Results are organized as follows:

```shell
00000/  # results directory
├── pdf-001  # directory where images are saved to. One per PDF file
├── 00000-metadata.csv  # extracted metadata as CSV
├── 00000-metadata.json  # extracted metadata as JSON
├── 00000-settings.json  # settings used by pipeline
└── 00000.log  # log file
```
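
A results directory with this layout can be opened from Python using only the standard library. The sketch below assumes that every file name starts with the name of the results directory, as in the `00000` example above; `result_files` and `load_json` are hypothetical helpers. Because the structure of the metadata is not described here, the JSON files are loaded without assuming any particular keys.

```python
import json
from pathlib import Path


def result_files(results_dir):
    """Map the expected files of one results directory to their paths."""
    root = Path(results_dir)
    run_id = root.name  # e.g. "00000"
    return {
        "metadata_csv": root / f"{run_id}-metadata.csv",
        "metadata_json": root / f"{run_id}-metadata.json",
        "settings": root / f"{run_id}-settings.json",
        "log": root / f"{run_id}.log",
        # One sub-directory of extracted images per PDF file.
        "image_dirs": sorted(p for p in root.iterdir() if p.is_dir()),
    }


def load_json(path):
    """Read a JSON file such as the metadata or settings file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

For example, `load_json(result_files("00000")["settings"])` returns the settings that the pipeline recorded for that run.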

## Settings

The pipeline's settings determine how visual extraction from PDF files is performed. Settings must be passed as a JSON file on the CLI. The settings file must include all items listed below. The values shown below are the defaults.

<details>
  <summary>Available settings</summary>
  
```python
{
    "layout": { # settings for layout analysis
        "caption": { 
            "offset": [ # distance used to locate captions
                4,
                "mm"
            ],
            "direction": "down", # direction used to locate captions
            "keywords": [  # keywords used to find captions based on text analysis
                "figure",
                "caption",
                "figuur"
            ]
        },
        "image": { # images smaller than these dimensions will be ignored
            "width": 120,
            "height": 120
        }
    },
    "ocr": {  # settings for OCR analysis
        "caption": {
            "offset": [
                50,
                "px"
            ],
            "direction": "down",
            "keywords": [
                "figure",
                "caption",
                "figuur"
            ]
        },
        "image": {
            "width": 120,
            "height": 120
        },
        "resolution": 250, # dpi to convert PDF pages to images before OCR
        "resize": 30000,  # total pixels. Larger OCR inputs are downsized to this before OCR
        "tesseract" : "--psm 1 --oem 3"  # tesseract options
    }
}
```
</details>

\
When no settings are passed to a pipeline, the defaults are used. To print the default settings to the terminal, use:

```shell
visarch [PIPELINE] settings
```
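
The settings listing above is annotated with `#` comments, which plain JSON does not allow, so a settings file has to leave them out. Since every item must be present, one approach is to start from the full defaults and change only what is needed. The sketch below does that with the standard library; `with_overrides` and `write_settings` are hypothetical helpers, and the default values are copied from the listing above. How the resulting file is passed to a pipeline is not shown here; see `visarch [PIPELINE] [SUBCOMMAND] -h`.

```python
import copy
import json

# Defaults copied from the listing above, written as plain data without comments.
_CAPTION_KEYWORDS = ["figure", "caption", "figuur"]
DEFAULTS = {
    "layout": {
        "caption": {"offset": [4, "mm"], "direction": "down",
                    "keywords": list(_CAPTION_KEYWORDS)},
        "image": {"width": 120, "height": 120},
    },
    "ocr": {
        "caption": {"offset": [50, "px"], "direction": "down",
                    "keywords": list(_CAPTION_KEYWORDS)},
        "image": {"width": 120, "height": 120},
        "resolution": 250,
        "resize": 30000,
        "tesseract": "--psm 1 --oem 3",
    },
}


def with_overrides(base, overrides):
    """Return a deep copy of `base` with nested `overrides` merged in."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = with_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


def write_settings(path, settings):
    """Write the settings as a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(settings, fh, indent=4)


# Example: ignore images smaller than 200 x 200 in the layout analysis.
settings = with_overrides(DEFAULTS, {"layout": {"image": {"width": 200, "height": 200}}})
print(json.dumps(settings["layout"]["image"]))
```

Calling `write_settings("my-settings.json", settings)` then produces a complete file in which only the overridden values differ from the defaults.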

## Citation
Please cite this software as follows:

*Garcia Alvarez, M. G., Khademi, S., & Pohl, D. (2023). VisArchPy [Computer software]. https://github.com/AiDAPT-A/VisArchPy*

## Acknowledgements

- VisArchPy was developed thanks to the support provided by the [Digital Competence Centre](https://dcc.tudelft.nl), Delft University of Technology.
- Research Data Services, Delft University of Technology, The Netherlands.

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "visarchpy",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": "",
    "keywords": "data pipelines,visuals,architecture,pdf,ETL,computer vision,dino,TU Delft",
    "author": "",
    "author_email": "Manuel Garcia <m.g.garciaalvarez@tudelft.nl>",
    "download_url": "https://files.pythonhosted.org/packages/f6/d3/59d0edff4ac90338e90b0966462e6ea1639ba44cf9d8276d58ea20f1cd93/visarchpy-1.0.4.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Data pipelines for extraction, transformation and visualization of architectural visuals in Python.",
    "version": "1.0.4",
    "project_urls": {
        "Bug Tracker": "https://github.com/AiDAPT-A/VisArchPy/issues",
        "Documentation": "https://visarchpy.readthedocs.io",
        "Repository": "https://github.com/AiDAPT-A/VisArchPy.git"
    },
    "split_keywords": [
        "data pipelines",
        "visuals",
        "architecture",
        "pdf",
        "etl",
        "computer vision",
        "dino",
        "tu delft"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a93a2ba8695d78580de421ba65a209cc56464006b4dfadc903b72cc221da0bb3",
                "md5": "1069e62ae3759abefa382ff0b23ac71b",
                "sha256": "31f6da75819e67f2c7da9dc43532884d569debfed28309734916f38ebb624056"
            },
            "downloads": -1,
            "filename": "visarchpy-1.0.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "1069e62ae3759abefa382ff0b23ac71b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 41421,
            "upload_time": "2024-02-02T00:43:59",
            "upload_time_iso_8601": "2024-02-02T00:43:59.578790Z",
            "url": "https://files.pythonhosted.org/packages/a9/3a/2ba8695d78580de421ba65a209cc56464006b4dfadc903b72cc221da0bb3/visarchpy-1.0.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f6d359d0edff4ac90338e90b0966462e6ea1639ba44cf9d8276d58ea20f1cd93",
                "md5": "c8970eb26bac86f9b6e4321b577b72d7",
                "sha256": "5ddb95e7d862d65659c22af4879b2eebbe32485c37d9c7ee6de580a98d16cbff"
            },
            "downloads": -1,
            "filename": "visarchpy-1.0.4.tar.gz",
            "has_sig": false,
            "md5_digest": "c8970eb26bac86f9b6e4321b577b72d7",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 39801,
            "upload_time": "2024-02-02T00:44:01",
            "upload_time_iso_8601": "2024-02-02T00:44:01.409683Z",
            "url": "https://files.pythonhosted.org/packages/f6/d3/59d0edff4ac90338e90b0966462e6ea1639ba44cf9d8276d58ea20f1cd93/visarchpy-1.0.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-02 00:44:01",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "AiDAPT-A",
    "github_project": "VisArchPy",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "visarchpy"
}