opendataloader-pdf

Name	opendataloader-pdf
Version	0.0.16
home_page	https://github.com/opendataloader-project/opendataloader-pdf
Summary	A Python wrapper for the opendataloader-pdf Java CLI.
upload_time	2025-09-15 08:28:21
maintainer	None
docs_url	None
author	opendataloader-project
requires_python	>=3.7
license	MPL-2.0

# OpenDataLoader PDF

![Pre-release](https://img.shields.io/badge/Pre--release-FFA500&logo=github)
[![License](https://img.shields.io/pypi/l/opendataloader-pdf.svg)](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/LICENSE)
![Java](https://img.shields.io/badge/Java-11+-blue.svg)
![Python](https://img.shields.io/badge/Python-3.8+-blue.svg)
[![Maven Central](https://img.shields.io/maven-central/v/org.opendataloader/opendataloader-pdf-core.svg)](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core)
[![PyPI version](https://img.shields.io/pypi/v/opendataloader-pdf.svg)](https://pypi.org/project/opendataloader-pdf/)
[![npm version](https://img.shields.io/npm/v/@opendataloader/pdf.svg)](https://www.npmjs.com/package/@opendataloader/pdf)
[![GHCR Version](https://ghcr-badge.egpl.dev/opendataloader-project/opendataloader-pdf-cli/latest_tag?trim=major&label=docker-image)](https://github.com/opendataloader-project/opendataloader-pdf/pkgs/container/opendataloader-pdf-cli)
[![Coverage](https://codecov.io/gh/opendataloader-project/opendataloader-pdf/branch/main/graph/badge.svg)](https://app.codecov.io/gh/opendataloader-project/opendataloader-pdf)
[![CLA assistant](https://cla-assistant.io/readme/badge/opendataloader-project/opendataloader-pdf)](https://cla-assistant.io/opendataloader-project/opendataloader-pdf)

<br/>

**Safe, Open, High-Performance — PDF for AI**

OpenDataLoader-PDF converts PDFs into JSON, Markdown, or HTML, ready to feed into modern AI stacks (LLMs, vector search, and RAG).

It reconstructs document layout (headings, lists, tables, and reading order) so the content is easier to chunk, index, and query.
Powered by fast, heuristic, rule-based inference, it runs entirely on your local machine and delivers high-throughput processing for large document sets.
AI-safety is enabled by default and automatically filters likely prompt-injection content embedded in PDFs to reduce downstream risk.

<br/>

## 🌟 Key Features

- 🧾 **Rich, Structured Output** — JSON, Markdown, or HTML
- 🧩 **Layout Reconstruction** — Headings, Lists, Tables, Images, Reading Order
- ⚡ **Fast & Lightweight** — Rule-Based Heuristic, High-Throughput, No GPU
- 🔒 **Local-First Privacy** — Runs fully on your machine
- 🛡️ **AI-Safety** — Auto-Filters likely prompt-injection content - [Learn more about AI-Safety](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/docs/AI_SAFETY.md)
- 🖍️ **Annotated PDF Visualization** — See detected structures overlaid on the original

[Download Annotated PDF Sample](https://raw.githubusercontent.com/opendataloader-project/opendataloader-pdf/main/resources/1901.03003_annotated.pdf)

![Annotated PDF Preview](https://raw.githubusercontent.com/opendataloader-project/opendataloader-pdf/main/resources/example_annotated_pdf.png)

<br/>

## 🚀 Upcoming Features

- 🖨️ **OCR for scanned PDFs** — Extract data from image-only pages
- 🧠 **Table AI option** — Higher accuracy for tables with borderless or merged cells
- ⚡ **Performance Benchmarks** — Transparent evaluations with open datasets and metrics, reported regularly
- 🛡️ **AI Red Teaming** — Transparent adversarial benchmarks with datasets and metrics, reported regularly

<br/>

## Prerequisites

- Java 11 or higher must be installed and available in your system's PATH.
- Python 3.8+
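
Because the Python package wraps the Java CLI, a missing or outdated Java runtime is a likely cause of startup failures. The following is a minimal pre-flight check using only the standard library. The helper names `parse_java_major` and `java_major_version` are hypothetical and are not part of opendataloader-pdf.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy scheme: Java 8 reports itself as "1.8.0_xxx".
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_major_version() -> Optional[int]:
    """Return the major version of the `java` found on PATH, or None."""
    java = shutil.which("java")
    if java is None:
        return None
    # Most JDKs print the version banner to stderr, so check both streams.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    major = java_major_version()
    if major is None or major < 11:
        print("Java 11 or higher was not found on PATH.")
    else:
        print(f"Found Java {major}.")
```

Running such a check before the first conversion gives a clearer error than a failed subprocess call later on.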

<br/>

## Python

### Installation

```sh
pip install -U opendataloader-pdf
```

### Usage

- `input_path` can be either the path to a single document or the path to a folder.
- If you don’t specify an `output_folder`, the output data will be saved in the same directory as the input document.

```python
import opendataloader_pdf

opendataloader_pdf.run(
    input_path="path-to-document.pdf",
    output_folder="path-to-output",
    generate_markdown=True,
    generate_html=True,
    generate_annotated_pdf=True,
)
```

### Function: run()

The main function to process PDFs.

| Parameter                | Type   | Required | Default      | Description                                                                                                                         |
|--------------------------| ------ | -------- |--------------|-------------------------------------------------------------------------------------------------------------------------------------|
| `input_path`             | `str`  | ✅ Yes    | —            | Path to the input PDF file or folder.                                                                                               |
| `output_folder`          | `str`  | No       | input folder | Path to the output folder.                                                                                                          |
| `password`               | `str`  | No       | `None`       | Password for the PDF file.                                                                                                          |
| `replace_invalid_chars`  | `str`  | No       | `" "`       | Character to replace invalid or unrecognized characters (e.g., �, \u0000)                                                           |
| `content_safety_off`     | `str`  | No       | `None`       | Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page. |
| `generate_markdown`      | `bool` | No       | `False`      | If `True`, generates a Markdown output file.                                                                                        |
| `generate_html`          | `bool` | No       | `False`      | If `True`, generates an HTML output file.                                                                                           |
| `generate_annotated_pdf` | `bool` | No       | `False`      | If `True`, generates an annotated PDF output file.                                                                                  |
| `keep_line_breaks`       | `bool` | No       | `False`      | If `True`, keeps line breaks in the output.                                                                                         |
| `html_in_markdown`       | `bool` | No       | `False`      | If `True`, uses HTML in the Markdown output.                                                                                        |
| `add_image_to_markdown`  | `bool` | No       | `False`      | If `True`, adds images to the Markdown output.                                                                                      |
| `debug`                  | `bool` | No       | `False`      | If `True`, prints CLI messages to the console during execution.                                                                     |
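
The sketch below shows one way to combine the parameters from the table, in particular `content_safety_off`, which takes a comma-separated string rather than a list. The helpers `build_run_kwargs` and `convert` are hypothetical wrappers written for illustration; only `opendataloader_pdf.run()` and the parameter names come from the table above.

```python
from typing import Any, Dict, Iterable, Optional

# Filter names listed for `content_safety_off` in the table above.
KNOWN_FILTERS = {"all", "hidden-text", "off-page"}


def build_run_kwargs(
    input_path: str,
    output_folder: Optional[str] = None,
    disabled_filters: Iterable[str] = (),
    **flags: bool,
) -> Dict[str, Any]:
    """Assemble keyword arguments for opendataloader_pdf.run()."""
    filters = list(disabled_filters)
    unknown = set(filters) - KNOWN_FILTERS
    if unknown:
        raise ValueError(f"Unknown content safety filters: {sorted(unknown)}")

    kwargs: Dict[str, Any] = {"input_path": input_path, **flags}
    if output_folder is not None:
        kwargs["output_folder"] = output_folder
    if filters:
        # The parameter expects a comma-separated string of filter names.
        kwargs["content_safety_off"] = ",".join(filters)
    return kwargs


def convert(input_path: str, **options: Any) -> None:
    """Run the converter; needs opendataloader-pdf installed and Java 11+."""
    # Imported lazily so build_run_kwargs() can be used without the package.
    import opendataloader_pdf

    opendataloader_pdf.run(**build_run_kwargs(input_path, **options))
```

For example, `convert("docs/", output_folder="out", disabled_filters=["hidden-text"], generate_markdown=True)` would process every PDF in `docs/` with the hidden-text filter disabled. Disabling safety filters is only advisable for trusted documents.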

<br/>

## Node.js / NPM

**Note:** This package is a wrapper around a Java CLI and is intended for use in a Node.js backend environment. It cannot be used in a browser-based frontend.

### Prerequisites

- Java 11 or higher must be installed and available in your system's PATH.

### Installation

```sh
npm install @opendataloader/pdf
```

### Usage

- `inputPath` can be either the path to a single document or the path to a folder.
- If you don’t specify an `outputFolder`, the output data will be saved in the same directory as the input document.

```typescript
import { run } from '@opendataloader/pdf';

async function main() {
  try {
    const output = await run('path-to-document.pdf', {
      outputFolder: 'path-to-output',
      generateMarkdown: true,
      generateHtml: true,
      generateAnnotatedPdf: true,
      debug: true,
    });
    console.log('PDF processing complete.', output);
  } catch (error) {
    console.error('Error processing PDF:', error);
  }
}

main();
```

### Function: run()

`run(inputPath: string, options?: RunOptions): Promise<string>`

The main function to process PDFs.

**Parameters**

| Parameter   | Type     | Required | Description                           |
| ----------- | -------- | -------- | ------------------------------------- |
| `inputPath` | `string` | ✅ Yes    | Path to the input PDF file or folder. |
| `options`   | `RunOptions` | No       | Configuration options for the run.    |

**RunOptions**

| Property                | Type      | Default       | Description                                                                 |
| ----------------------- | --------- | ------------- | --------------------------------------------------------------------------- |
| `outputFolder`          | `string`  | `undefined`   | Path to the output folder. If not set, output is saved next to the input.   |
| `password`              | `string`  | `undefined`   | Password for the PDF file.                                                  |
| `replaceInvalidChars`   | `string`  | `" "`         | Character to replace invalid or unrecognized characters (e.g., �, \u0000).  |
| `content_safety_off`     | `string`  | `undefined`   | Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page. |
| `generateMarkdown`      | `boolean` | `false`       | If `true`, generates a Markdown output file.                                |
| `generateHtml`          | `boolean` | `false`       | If `true`, generates an HTML output file.                                   |
| `generateAnnotatedPdf`  | `boolean` | `false`       | If `true`, generates an annotated PDF output file.                          |
| `keepLineBreaks`        | `boolean` | `false`       | If `true`, keeps line breaks in the output.                                 |
| `htmlInMarkdown`        | `boolean` | `false`       | If `true`, uses HTML in the Markdown output.                                |
| `addImageToMarkdown`    | `boolean` | `false`       | If `true`, adds images to the Markdown output.                              |
| `debug`                 | `boolean` | `false`       | If `true`, prints CLI messages to the console during execution.             |

<br/>

## Java

For various example templates, including Gradle and Maven, please refer to https://github.com/opendataloader-project/opendataloader-pdf/tree/main/examples/java.

### Dependency

To include OpenDataLoader PDF in your Maven project, add the dependency below to your `pom.xml` file.

Check for the latest version on [Maven Central](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core).

```xml
<project>
    <!-- other configurations... -->

    <dependencies>
        <dependency>
            <groupId>org.opendataloader</groupId>
            <artifactId>opendataloader-pdf-core</artifactId>
            <version>0.0.15</version>
        </dependency>
    </dependencies>

    <repositories>
        <repository>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
            <id>vera-dev</id>
            <name>Vera development</name>
            <url>https://artifactory.openpreservation.org/artifactory/vera-dev</url>
        </repository>
    </repositories>
    <pluginRepositories>
        <pluginRepository>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
            <id>vera-dev</id>
            <name>Vera development</name>
            <url>https://artifactory.openpreservation.org/artifactory/vera-dev</url>
        </pluginRepository>
    </pluginRepositories>

    <!-- other configurations... -->
</project>
```


### Java code integration

To integrate the layout recognition API into your Java code, follow the sample below.

```java
import org.opendataloader.pdf.api.Config;
import org.opendataloader.pdf.api.OpenDataLoaderPDF;

import java.io.IOException;

public class Sample {

    public static void main(String[] args) {
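        // Configure the output location and the additional formats to generate.
        Config config = new Config();
        config.setOutputFolder("path/to/output");
        config.setGeneratePDF(true);
        config.setGenerateMarkdown(true);
        config.setGenerateHtml(true);

        try {
            OpenDataLoaderPDF.processFile("path/to/document.pdf", config);
        } catch (Exception exception) {
            // Report the failure instead of silently swallowing it.
            System.err.println("Error processing PDF: " + exception.getMessage());
        }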
    }
}
```

### API Documentation

The full API documentation is available at [javadoc](https://javadoc.io/doc/org.opendataloader/opendataloader-pdf-core/latest/).

<br/>

## Docker

Download a sample PDF:

```sh
curl -L -o 1901.03003.pdf https://arxiv.org/pdf/1901.03003
```

Run opendataloader-pdf in a Docker container:

```sh
docker run --rm -v "$PWD":/work ghcr.io/opendataloader-project/opendataloader-pdf-cli:latest /work/1901.03003.pdf --markdown --html --pdf
```

<br/>

## Developing with OpenDataLoader PDF

### Build

Build and package using Maven command:

```sh
mvn clean package -f java/pom.xml
```

If the build is successful, the resulting `jar` file will be created in the path below.

```sh
java/opendataloader-pdf-cli/target
```

### CLI usage

```sh
java -jar opendataloader-pdf-cli-<VERSION>.jar [options] <INPUT FILE OR FOLDER>
```

This generates a JSON file with layout recognition results in the specified output folder.
If the `--pdf`, `--markdown`, or `--html` options are specified, an annotated PDF with the recognized structures, a Markdown file, or an HTML file is generated as well.

By default, all line breaks and hyphenation characters are removed, and the Markdown output contains no images and no HTML.

- `--keep-line-breaks` preserves the original line breaks in the text content of the JSON and Markdown output.
- `--content-safety-off` disables one or more content safety filters. It accepts a comma-separated list of filter names.
- `--markdown-with-html` enables HTML in the Markdown output, which may improve the preview in processors that support HTML tags.
- `--markdown-with-images` adds image references to the Markdown output. The images are extracted from the PDF as individual files and stored in a subfolder next to the Markdown output.
- `--replace-invalid-chars` replaces invalid or unrecognized characters (e.g., �, \u0000) with the specified character.

#### Available options:

```
Options:
-o,--output-dir <arg>           Specifies the output directory for generated files
--keep-line-breaks              Preserves original line breaks in the extracted text
--content-safety-off <arg>      Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page
--markdown-with-html            Sets the data extraction output format to Markdown with rendering complex elements like tables as HTML for better structure
--markdown-with-images          Sets the data extraction output format to Markdown with extracting images from the PDF and includes them as links
--markdown                      Sets the data extraction output format to Markdown
--html                          Sets the data extraction output format to HTML
-p,--password <arg>             Specifies the password for an encrypted PDF
--pdf                           Generates a new PDF file where the extracted layout data is visualized as annotations
--replace-invalid-chars <arg>   Replaces invalid or unrecognized characters (e.g., �, \u0000) with the specified character
```
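
When calling the CLI from a script, it can help to build the argument list programmatically rather than by string concatenation. The sketch below covers a subset of the options listed above. The function `build_cli_args` and its parameter names are hypothetical; only the flags themselves come from the option list.

```python
from typing import List, Optional


def build_cli_args(
    jar_path: str,
    input_path: str,
    output_dir: Optional[str] = None,
    markdown: bool = False,
    html: bool = False,
    annotated_pdf: bool = False,
    keep_line_breaks: bool = False,
    password: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `java -jar <jar> [options] <input>`."""
    args = ["java", "-jar", jar_path]
    if output_dir is not None:
        args += ["--output-dir", output_dir]
    if markdown:
        args.append("--markdown")
    if html:
        args.append("--html")
    if annotated_pdf:
        args.append("--pdf")
    if keep_line_breaks:
        args.append("--keep-line-breaks")
    if password is not None:
        args += ["--password", password]
    # The input file or folder comes last, after all options.
    args.append(input_path)
    return args
```

The resulting list can be passed to `subprocess.run(args, check=True)`. Passing a list avoids shell quoting problems with paths that contain spaces.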

### Schema of the JSON output

Root JSON node

| Field             | Type    | Optional | Description                        |
|-------------------|---------|----------|------------------------------------|
| file name         | string  | no       | Name of processed PDF file         |
| number of pages   | integer | no       | Number of pages in PDF file        |
| author            | string  | no       | Author of PDF file                 |
| title             | string  | no       | Title of PDF file                  |
| creation date     | string  | no       | Creation date of PDF file          |
| modification date | string  | no       | Modification date of PDF file      |
| kids              | array   | no       | Array of detected content elements |

Common fields of content JSON nodes

| Field        | Type    | Optional | Description                                                                                                                                                                                           |
|--------------|---------|----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| id           | integer | yes      | Unique id of content element                                                                                                                                                                          |
| level        | string  | yes      | Level of content element                                                                                                                                                                              |
| type         | string  | no       | Type of content element<br/>Possible types: `footer`, `header`, `heading`, `line`, `table`, `table row`, `table cell`, `paragraph`, `list`, `list item`, `image`, `line art`, `caption`, `text block` |
| page number  | integer | no       | Page number of content element                                                                                                                                                                        |
| bounding box | array   | no       | Bounding box of content element                                                                                                                                                                       |

Specific fields of text content JSON nodes (`caption`, `heading`, `paragraph`)

| Field      | Type   | Optional | Description       |
|------------|--------|----------|-------------------|
| font       | string | no       | Font name of text |
| font size  | double | no       | Font size of text |
| text color | array  | no       | Color of text     |
| content    | string | no       | Text value        |

Specific fields of `table` JSON nodes

| Field             | Type    | Optional | Description                    |
|-------------------|---------|----------|--------------------------------|
| number of rows    | integer | no       | Number of table rows           |
| number of columns | integer | no       | Number of table columns        |
| rows              | array   | no       | Array of table rows            |
| previous table id | integer | yes      | Id of previous connected table |
| next table id     | integer | yes      | Id of next connected table     |

Specific fields of `table row` JSON nodes

| Field      | Type    | Optional | Description          |
|------------|---------|----------|----------------------|
| row number | integer | no       | Number of table row  |
| cells      | array   | no       | Array of table cells |

Specific fields of `table cell` JSON nodes

| Field         | Type    | Optional | Description                          |
|---------------|---------|----------|--------------------------------------|
| row number    | integer | no       | Row number of table cell             |
| column number | integer | no       | Column number of table cell          |
| row span      | integer | no       | Row span of table cell               |
| column span   | integer | no       | Column span of table cell            |
| kids          | array   | no       | Array of table cell content elements |
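
The table, table row, and table cell fields above nest as `rows` → `cells` → `kids`. As a rough illustration, the sketch below flattens a `table` node into a list of rows of cell text. The helper names are hypothetical, the sample node is hand-written and abbreviated (not real tool output), and row or column spans are ignored.

```python
from typing import Any, Dict, List


def node_text(node: Dict[str, Any]) -> str:
    """Join the `content` of a node with the text of everything under `kids`."""
    parts = []
    if "content" in node:
        parts.append(node["content"])
    for kid in node.get("kids", []):
        text = node_text(kid)
        if text:
            parts.append(text)
    return " ".join(parts)


def table_to_rows(table: Dict[str, Any]) -> List[List[str]]:
    """Convert a `table` node into rows of plain cell text (spans ignored)."""
    return [
        [node_text(cell) for cell in row.get("cells", [])]
        for row in table.get("rows", [])
    ]


# Hand-written, abbreviated sample that follows the field names above.
sample_table = {
    "type": "table",
    "rows": [
        {"type": "table row", "cells": [
            {"type": "table cell", "kids": [{"type": "paragraph", "content": "Name"}]},
            {"type": "table cell", "kids": [{"type": "paragraph", "content": "Score"}]},
        ]},
        {"type": "table row", "cells": [
            {"type": "table cell", "kids": [{"type": "paragraph", "content": "A"}]},
            {"type": "table cell", "kids": [{"type": "paragraph", "content": "1"}]},
        ]},
    ],
}

for row in table_to_rows(sample_table):
    print(" | ".join(row))
```

A fuller implementation would use `row span` and `column span` to place merged cells correctly.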

Specific fields of `heading` JSON nodes

| Field         | Type    | Optional | Description              |
|---------------|---------|----------|--------------------------|
| heading level | integer | no       | Heading level of heading |

Specific fields of `list` JSON nodes

| Field                | Type    | Optional | Description                         |
|----------------------|---------|----------|-------------------------------------|
| number of list items | integer | no       | Number of list items                |
| numbering style      | string  | no       | Numbering style of this list        |
| previous list id     | integer | yes      | Id of previous connected list       |
| next list id         | integer | yes      | Id of next connected list           |
| list items           | array   | no       | Array of list item content elements |

Specific fields of `list item` JSON nodes

| Field | Type  | Optional | Description                         |
|-------|-------|----------|-------------------------------------|
| kids  | array | no       | Array of list item content elements |

Specific fields of `header` and `footer` JSON nodes

| Field | Type  | Optional | Description                             |
|-------|-------|----------|-----------------------------------------|
| kids  | array | no       | Array of header/footer content elements |

Specific fields of `text block` JSON nodes

| Field | Type  | Optional | Description                          |
|-------|-------|----------|--------------------------------------|
| kids  | array | no       | Array of text block content elements |
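
Taken together, the schema describes a tree whose children sit under `kids`, `list items`, `rows`, or `cells`, depending on the node type. The sketch below walks such a tree and collects a heading outline, which can be useful when chunking content for retrieval. The function names are hypothetical and the sample document is hand-written for illustration, not actual tool output.

```python
from typing import Any, Dict, Iterator, List, Tuple

# Keys under which child nodes appear, per the schema tables above.
CHILD_KEYS = ("kids", "list items", "rows", "cells")


def iter_nodes(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield a node and all of its descendants in document order."""
    yield node
    for key in CHILD_KEYS:
        for child in node.get(key, []):
            yield from iter_nodes(child)


def heading_outline(root: Dict[str, Any]) -> List[Tuple[int, str]]:
    """Return (heading level, text) pairs for every `heading` node."""
    return [
        (node["heading level"], node["content"])
        for node in iter_nodes(root)
        if node.get("type") == "heading"
    ]


# Hand-written, abbreviated sample that follows the field names above.
sample_root = {
    "file name": "sample.pdf",
    "number of pages": 1,
    "kids": [
        {"type": "heading", "heading level": 1, "content": "Introduction"},
        {"type": "paragraph", "content": "Some text."},
        {"type": "text block", "kids": [
            {"type": "heading", "heading level": 2, "content": "Background"},
            {"type": "paragraph", "content": "More text."},
        ]},
    ],
}

for level, text in heading_outline(sample_root):
    print("  " * (level - 1) + text)
```

To apply this to real output, load the generated JSON file with `json.load()` and pass the resulting dictionary to `heading_outline()`.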


## 🤝 Contributing

We believe that great software is built together.

Your contributions are vital to the success of this project.

Please read [CONTRIBUTING.md](https://github.com/hancom-inc/opendataloader-pdf/blob/main/CONTRIBUTING.md) for details on how to contribute.

## 💖 Community & Support
Have questions or need a little help? We're here for you!🤗

- [GitHub Discussions](https://github.com/hancom-inc/opendataloader-pdf/discussions): For Q&A and general chats. Let's talk! 🗣️
- [GitHub Issues](https://github.com/hancom-inc/opendataloader-pdf/issues): Found a bug? 🐛 Please report it here so we can fix it.

## ✨ Our Branding and Trademarks 

We love our brand and want to protect it!

This project may contain trademarks, logos, or brand names for our products and services.

To ensure everyone is on the same page, please remember these simple rules:

- **Authorized Use**: You're welcome to use our logos and trademarks, but you must follow our official brand guidelines.
- **No Confusion**: When you use our trademarks in a modified version of this project, it should never cause confusion or imply that Hancom officially sponsors or endorses your version.
- **Third-Party Brands**: Any use of trademarks or logos from other companies must follow that company’s specific policies.

## ⚖️ License

This project is licensed under the [Mozilla Public License 2.0](https://www.mozilla.org/MPL/2.0/).

For the full license text, see [LICENSE](LICENSE).

For information on third-party libraries and components, see:
- [THIRD_PARTY_LICENSES](./THIRD_PARTY/THIRD_PARTY_LICENSES.md)
- [THIRD_PARTY_NOTICES](./THIRD_PARTY/THIRD_PARTY_NOTICES.md)
- [licenses/](./THIRD_PARTY/licenses/)

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/opendataloader-project/opendataloader-pdf",
    "name": "opendataloader-pdf",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": null,
    "author": "opendataloader-project",
    "author_email": "open.dataloader@hancom.com",
    "download_url": null,
    "platform": null,
    "description": "# OpenDataLoader PDF\n\n![Pre-release](https://img.shields.io/badge/Pre--release-FFA500&logo=github)\n[![License](https://img.shields.io/pypi/l/opendataloader-pdf.svg)](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/LICENSE)\n![Java](https://img.shields.io/badge/Java-11+-blue.svg)\n![Python](https://img.shields.io/badge/Python-3.8+-blue.svg)\n[![Maven Central](https://img.shields.io/maven-central/v/org.opendataloader/opendataloader-pdf-core.svg)](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core)\n[![PyPI version](https://img.shields.io/pypi/v/opendataloader-pdf.svg)](https://pypi.org/project/opendataloader-pdf/)\n[![npm version](https://img.shields.io/npm/v/@opendataloader/pdf.svg)](https://www.npmjs.com/package/@opendataloader/pdf)\n[![GHCR Version](https://ghcr-badge.egpl.dev/opendataloader-project/opendataloader-pdf-cli/latest_tag?trim=major&label=docker-image)](https://github.com/opendataloader-project/opendataloader-pdf/pkgs/container/opendataloader-pdf-cli)\n[![Coverage](https://codecov.io/gh/opendataloader-project/opendataloader-pdf/branch/main/graph/badge.svg)](https://app.codecov.io/gh/opendataloader-project/opendataloader-pdf)\n[![CLA assistant](https://cla-assistant.io/readme/badge/opendataloader-project/opendataloader-pdf)](https://cla-assistant.io/opendataloader-project/opendataloader-pdf)\n\n<br/>\n\n**Safe, Open, High-Performance \u2014 PDF for AI**\n\nOpenDataLoader-PDF converts PDFs into JSON, Markdown or Html \u2014 ready to feed into modern AI stacks (LLMs, vector search, and RAG).\n\nIt reconstructs document layout (headings, lists, tables, and reading order) so the content is easier to chunk, index, and query.\nPowered by fast, heuristic, rule-based inference, it runs entirely on your local machine and delivers high-throughput processing for large document sets.\nAI-safety is enabled by default and automatically filters likely prompt-injection content embedded in PDFs to 
reduce downstream risk.\n\n<br/>\n\n## \ud83c\udf1f Key Features\n\n- \ud83e\uddfe **Rich, Structured Output** \u2014 JSON, Markdown or Html\n- \ud83e\udde9 **Layout Reconstruction** \u2014 Headings, Lists, Tables, Images, Reading Order\n- \u26a1 **Fast & Lightweight** \u2014 Rule-Based Heuristic, High-Throughput, No GPU\n- \ud83d\udd12 **Local-First Privacy** \u2014 Runs fully on your machine\n- \ud83d\udee1\ufe0f **AI-Safety** \u2014 Auto-Filters likely prompt-injection content - [Learn more about AI-Safety](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/docs/AI_SAFETY.md)\n- \ud83d\udd8d\ufe0f **Annotated PDF Visualization** \u2014 See detected structures overlaid on the original\n\n[Download Annotated PDF Sample](https://raw.githubusercontent.com/opendataloader-project/opendataloader-pdf/main/resources/1901.03003_annotated.pdf)\n\n![Annotated PDF Preview](https://raw.githubusercontent.com/opendataloader-project/opendataloader-pdf/main/resources/example_annotated_pdf.png)\n\n<br/>\n\n## \ud83d\ude80 Upcoming Features\n\n- \ud83d\udda8\ufe0f **OCR for scanned PDFs** \u2014 Extract data from image-only pages\n- \ud83e\udde0 **Table AI option** \u2014 Higher accuracy for tables with borderless or merged cells\n- \u26a1 **Performance Benchmarks** \u2014 Transparent evaluations with open datasets and metrics, reported regularly\n- \ud83d\udee1\ufe0f **AI Red Teaming** \u2014 Transparent adversarial benchmarks with datasets and metrics, reported regularly\n\n<br/>\n\n## Prerequisites\n\n- Java 11 or higher must be installed and available in your system's PATH.\n- Python 3.8+\n\n<br/>\n\n## Python\n\n### Installation\n\n```sh\npip install -U opendataloader-pdf\n```\n\n### Usage\n\n- input_path can be either the path to a single document or the path to a folder.\n- If you don\u2019t specify an output_folder, the output data will be saved in the same directory as the input document.\n\n```python\nimport 
opendataloader_pdf\n\nopendataloader_pdf.run(\n    input_path=\"path-to-document.pdf\",\n    output_folder=\"path-to-output\",\n    generate_markdown=True,\n    generate_html=True,\n    generate_annotated_pdf=True,\n)\n```\n\n### Function: run()\n\nThe main function to process PDFs.\n\n| Parameter                | Type   | Required | Default      | Description                                                                                                                         |\n|--------------------------| ------ | -------- |--------------|-------------------------------------------------------------------------------------------------------------------------------------|\n| `input_path`             | `str`  | \u2705 Yes    | \u2014            | Path to the input PDF file or folder.                                                                                               |\n| `output_folder`          | `str`  | No       | input folder | Path to the output folder.                                                                                                          |\n| `password`               | `str`  | No       | `None`       | Password for the PDF file.                                                                                                          |\n| `replace_invalid_chars`  | `str`  | No       | `\" \"`       | Character to replace invalid or unrecognized characters (e.g., \ufffd, \\u0000)                                                           |\n| `content_safety_off`     | `str`  | No       | `None`       | Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page. |\n| `generate_markdown`      | `bool` | No       | `False`      | If `True`, generates a Markdown output file.                                                                                        
|\n| `generate_html`          | `bool` | No       | `False`      | If `True`, generates an HTML output file.                                                                                           |\n| `generate_annotated_pdf` | `bool` | No       | `False`      | If `True`, generates an annotated PDF output file.                                                                                  |\n| `keep_line_breaks`       | `bool` | No       | `False`      | If `True`, keeps line breaks in the output.                                                                                         |\n| `html_in_markdown`       | `bool` | No       | `False`      | If `True`, uses HTML in the Markdown output.                                                                                        |\n| `add_image_to_markdown`  | `bool` | No       | `False`      | If `True`, adds images to the Markdown output.                                                                                      |\n| `debug`                  | `bool` | No       | `False`      | If `True`, prints CLI messages to the console during execution.                                                                     |\n\n<br/>\n\n## Node.js / NPM\n\n**Note:** This package is a wrapper around a Java CLI and is intended for use in a Node.js backend environment. 
It cannot be used in a browser-based frontend.\n\n### Prerequisites\n\n- Java 11 or higher must be installed and available in your system's PATH.\n\n### Installation\n\n```sh\nnpm install @opendataloader/pdf\n```\n\n### Usage\n\n- `inputPath` can be either the path to a single document or the path to a folder.\n- If you don\u2019t specify an `outputFolder`, the output data will be saved in the same directory as the input document.\n\n```typescript\nimport { run } from '@opendataloader/pdf';\n\nasync function main() {\n  try {\n    const output = await run('path-to-document.pdf', {\n      outputFolder: 'path-to-output',\n      generateMarkdown: true,\n      generateHtml: true,\n      generateAnnotatedPdf: true,\n      debug: true,\n    });\n    console.log('PDF processing complete.', output);\n  } catch (error) {\n    console.error('Error processing PDF:', error);\n  }\n}\n\nmain();\n```\n\n### Function: run()\n\n`run(inputPath: string, options?: RunOptions): Promise<string>`\n\nThe main function to process PDFs.\n\n**Parameters**\n\n| Parameter   | Type     | Required | Description                           |\n| ----------- | -------- | -------- | ------------------------------------- |\n| `inputPath` | `string` | \u2705 Yes    | Path to the input PDF file or folder. |\n| `options`   | `RunOptions` | No       | Configuration options for the run.    |\n\n**RunOptions**\n\n| Property                | Type      | Default       | Description                                                                 |\n| ----------------------- | --------- | ------------- | --------------------------------------------------------------------------- |\n| `outputFolder`          | `string`  | `undefined`   | Path to the output folder. If not set, output is saved next to the input.   |\n| `password`              | `string`  | `undefined`   | Password for the PDF file.                                                  
|\n| `replaceInvalidChars`   | `string`  | `\" \"`         | Character to replace invalid or unrecognized characters (e.g., \ufffd, \\u0000). |\n| `content_safety_off`     | `string`  | `undefined`   | Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page. |\n| `generateMarkdown`      | `boolean` | `false`       | If `true`, generates a Markdown output file.                                |\n| `generateHtml`          | `boolean` | `false`       | If `true`, generates an HTML output file.                                   |\n| `generateAnnotatedPdf`  | `boolean` | `false`       | If `true`, generates an annotated PDF output file.                          |\n| `keepLineBreaks`        | `boolean` | `false`       | If `true`, keeps line breaks in the output.                                 |\n| `htmlInMarkdown`        | `boolean` | `false`       | If `true`, uses HTML in the Markdown output.                                |\n| `addImageToMarkdown`    | `boolean` | `false`       | If `true`, adds images to the Markdown output.                              |\n| `debug`                 | `boolean` | `false`       | If `true`, prints CLI messages to the console during execution.             |\n\n<br/>\n\n## Java\n\nFor various example templates, including Gradle and Maven, please refer to https://github.com/opendataloader-project/opendataloader-pdf/tree/main/examples/java.\n\n### Dependency\n\nTo include OpenDataLoader PDF in your Maven project, add the dependency below to your `pom.xml` file.\n\nCheck for the latest version on [Maven Central](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core).\n\n```xml\n<project>\n    <!-- other configurations... 
-->\n\n    <dependencies>\n        <dependency>\n            <groupId>org.opendataloader</groupId>\n            <artifactId>opendataloader-pdf-core</artifactId>\n            <version>0.0.15</version>\n        </dependency>\n    </dependencies>\n\n    <repositories>\n        <repository>\n            <snapshots>\n                <enabled>true</enabled>\n            </snapshots>\n            <id>vera-dev</id>\n            <name>Vera development</name>\n            <url>https://artifactory.openpreservation.org/artifactory/vera-dev</url>\n        </repository>\n    </repositories>\n    <pluginRepositories>\n        <pluginRepository>\n            <snapshots>\n                <enabled>false</enabled>\n            </snapshots>\n            <id>vera-dev</id>\n            <name>Vera development</name>\n            <url>https://artifactory.openpreservation.org/artifactory/vera-dev</url>\n        </pluginRepository>\n    </pluginRepositories>\n\n    <!-- other configurations... -->\n</project>\n```\n\n\n### Java code integration\n\nTo integrate Layout recognition API into Java code, one can follow the sample code below.\n\n```java\nimport org.opendataloader.pdf.api.Config;\nimport org.opendataloader.pdf.api.OpenDataLoaderPDF;\n\nimport java.io.IOException;\n\npublic class Sample {\n\n    public static void main(String[] args) {\n        Config config = new Config();\n        config.setOutputFolder(\"path/to/output\");\n        config.setGeneratePDF(true);\n        config.setGenerateMarkdown(true);\n        config.setGenerateHtml(true);\n\n        try {\n            OpenDataLoaderPDF.processFile(\"path/to/document.pdf\", config);\n        } catch (Exception exception) {\n            //exception during processing\n        }\n    }\n}\n```\n\n### API Documentation\n\nThe full API documentation is available at [javadoc](https://javadoc.io/doc/org.opendataloader/opendataloader-pdf-core/latest/)\n\n<br/>\n\n## Docker\n\nDownload sample PDF\n\n```sh\ncurl -L -o 1901.03003.pdf 
https://arxiv.org/pdf/1901.03003\n```\n\nRun opendataloader-pdf in a Docker container\n\n```sh\ndocker run --rm -v \"$PWD\":/work ghcr.io/opendataloader-project/opendataloader-pdf-cli:latest /work/1901.03003.pdf --markdown --html --pdf\n```\n\n<br/>\n\n## Developing with OpenDataLoader PDF\n\n### Build\n\nBuild and package with Maven:\n\n```sh\nmvn clean package -f java/pom.xml\n```\n\nIf the build is successful, the resulting `jar` file is created in the directory below.\n\n```sh\njava/opendataloader-pdf-cli/target\n```\n\n### CLI usage\n\n```sh\njava -jar opendataloader-pdf-cli-<VERSION>.jar [options] <INPUT FILE OR FOLDER>\n```\n\nThis generates a JSON file with layout recognition results in the specified output folder. \nAdditionally, an annotated PDF with recognized structures, Markdown, and HTML are generated if the options `--pdf`, `--markdown`, and `--html` are specified.\n\nBy default, all line breaks and hyphenation characters are removed, and the Markdown output neither includes images nor uses HTML.\n\nThe option `--keep-line-breaks` preserves the original line breaks of the text content in the JSON and Markdown output.\nThe option `--content-safety-off` disables one or more content safety filters. It accepts a comma-separated list of filter names.\nThe option `--markdown-with-html` enables the use of HTML in Markdown, which may improve the Markdown preview in processors that support HTML tags. \nThe option `--markdown-with-images` enables the inclusion of image references in the output Markdown. 
\nThe option `--replace-invalid-chars` replaces invalid or unrecognized characters (e.g., \ufffd, \\u0000) with the specified character.\nThe images are extracted from PDF as individual files and stored in a subfolder next to the Markdown output.\n\n#### Available options:\n\n```\nOptions:\n-o,--output-dir <arg>           Specifies the output directory for generated files\n--keep-line-breaks              Preserves original line breaks in the extracted text\n--content-safety-off <arg>      Disables one or more content safety filters. Accepts a comma-separated list of filter names. Arguments: all, hidden-text, off-page\n--markdown-with-html            Sets the data extraction output format to Markdown with rendering complex elements like tables as HTML for better structure\n--markdown-with-images          Sets the data extraction output format to Markdown with extracting images from the PDF and includes them as links\n--markdown                      Sets the data extraction output format to Markdown\n--html                          Sets the data extraction output format to HTML\n-p,--password <arg>             Specifies the password for an encrypted PDF\n--pdf                           Generates a new PDF file where the extracted layout data is visualized as annotations\n--replace-invalid-chars <arg>   Replaces invalid or unrecognized characters (e.g., \ufffd, \\u0000) with the specified character\n```\n\n### Schema of the JSON output\n\nRoot json node\n\n| Field             | Type    | Optional | Description                        |\n|-------------------|---------|----------|------------------------------------|\n| file name         | string  | no       | Name of processed pdf file         |\n| number of pages   | integer | no       | Number of pages in pdf file        |\n| author            | string  | no       | Author of pdf file                 |\n| title             | string  | no       | Title of pdf file                  |\n| creation date     | string  | no  
     | Creation date of pdf file          |\n| modification date | string  | no       | Modification date of pdf file      |\n| kids              | array   | no       | Array of detected content elements |\n\nCommon fields of content json nodes\n\n| Field        | Type    | Optional | Description                                                                                                                                                                                           |\n|--------------|---------|----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| id           | integer | yes      | Unique id of content element                                                                                                                                                                          |\n| level        | string  | yes      | Level of content element                                                                                                                                                                              |\n| type         | string  | no       | Type of content element<br/>Possible types: `footer`, `header`, `heading`, `line`, `table`, `table row`, `table cell`, `paragraph`, `list`, `list item`, `image`, `line art`, `caption`, `text block` |\n| page number  | integer | no       | Page number of content element                                                                                                                                                                        |\n| bounding box | array   | no       | Bounding box of content element                                                                                                                                                                       |\n\nSpecific fields of text content json nodes (`caption`, `heading`, 
`paragraph`)\n\n| Field      | Type   | Optional | Description       |\n|------------|--------|----------|-------------------|\n| font       | string | no       | Font name of text |\n| font size  | double | no       | Font size of text |\n| text color | array  | no       | Color of text     |\n| content    | string | no       | Text value        |\n\nSpecific fields of `table` json nodes\n\n| Field             | Type    | Optional | Description                    |\n|-------------------|---------|----------|--------------------------------|\n| number of rows    | integer | no       | Number of table rows           |\n| number of columns | integer | no       | Number of table columns        |\n| rows              | array   | no       | Array of table rows            |\n| previous table id | integer | yes      | Id of previous connected table |\n| next table id     | integer | yes      | Id of next connected table     |\n\nSpecific fields of `table row` json nodes\n\n| Field      | Type    | Optional | Description          |\n|------------|---------|----------|----------------------|\n| row number | integer | no       | Number of table row  |\n| cells      | array   | no       | Array of table cells |\n\nSpecific fields of `table cell` json nodes\n\n| Field         | Type    | Optional | Description                          |\n|---------------|---------|----------|--------------------------------------|\n| row number    | integer | no       | Row number of table cell             |\n| column number | integer | no       | Column number of table cell          |\n| row span      | integer | no       | Row span of table cell               |\n| column span   | integer | no       | Column span of table cell            |\n| kids          | array   | no       | Array of table cell content elements |\n\nSpecific fields of `heading` json nodes\n\n| Field         | Type    | Optional | Description              
|\n|---------------|---------|----------|--------------------------|\n| heading level | integer | no       | Heading level of heading |\n\nSpecific fields of `list` json nodes\n\n| Field                | Type    | Optional | Description                         |\n|----------------------|---------|----------|-------------------------------------|\n| number of list items | integer | no       | Number of list items                |\n| numbering style      | string  | no       | Numbering style of this list        |\n| previous list id     | integer | yes      | Id of previous connected list       |\n| next list id         | integer | yes      | Id of next connected list           |\n| list items           | array   | no       | Array of list item content elements |\n\nSpecific fields of `list item` json nodes\n\n| Field | Type  | Optional | Description                         |\n|-------|-------|----------|-------------------------------------|\n| kids  | array | no       | Array of list item content elements |\n\nSpecific fields of `header` and `footer` json nodes\n\n| Field | Type  | Optional | Description                             |\n|-------|-------|----------|-----------------------------------------|\n| kids  | array | no       | Array of header/footer content elements |\n\nSpecific fields of `text block` json nodes\n\n| Field | Type  | Optional | Description                          |\n|-------|-------|----------|--------------------------------------|\n| kids  | array | no       | Array of text block content elements |\n\n\n## \ud83e\udd1d Contributing\n\nWe believe that great software is built together.\n\nYour contributions are vital to the success of this project.\n\nPlease read [CONTRIBUTING.md](https://github.com/hancom-inc/opendataloader-pdf/blob/main/CONTRIBUTING.md) for details on how to contribute.\n\n## \ud83d\udc96 Community & Support\nHave questions or need a little help? 
We're here for you!\ud83e\udd17\n\n- [GitHub Discussions](https://github.com/hancom-inc/opendataloader-pdf/discussions): For Q&A and general chats. Let's talk! \ud83d\udde3\ufe0f\n- [GitHub Issues](https://github.com/hancom-inc/opendataloader-pdf/issues): Found a bug? \ud83d\udc1b Please report it here so we can fix it.\n\n## \u2728 Our Branding and Trademarks \n\nWe love our brand and want to protect it!\n\nThis project may contain trademarks, logos, or brand names for our products and services.\n\nTo ensure everyone is on the same page, please remember these simple rules:\n\n- **Authorized Use**: You're welcome to use our logos and trademarks, but you must follow our official brand guidelines.\n- **No Confusion**: When you use our trademarks in a modified version of this project, it should never cause confusion or imply that Hancom officially sponsors or endorses your version.\n- **Third-Party Brands**: Any use of trademarks or logos from other companies must follow that company\u2019s specific policies.\n\n## \u2696\ufe0f License\n\nThis project is licensed under the [Mozilla Public License 2.0](https://www.mozilla.org/MPL/2.0/).\n\nFor the full license text, see [LICENSE](LICENSE).\n\nFor information on third-party libraries and components, see:\n- [THIRD_PARTY_LICENSES](./THIRD_PARTY/THIRD_PARTY_LICENSES.md)\n- [THIRD_PARTY_NOTICES](./THIRD_PARTY/THIRD_PARTY_NOTICES.md)\n- [licenses/](./THIRD_PARTY/licenses/)\n",
    "bugtrack_url": null,
    "license": "MPL-2.0",
    "summary": "A Python wrapper for the opendataloader-pdf Java CLI.",
    "version": "0.0.16",
    "project_urls": {
        "Homepage": "https://github.com/opendataloader-project/opendataloader-pdf"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e5af122c062d33b08f5a912aec5e24b99e0526154fd253f0519d112d8e2b8992",
                "md5": "3efff58d64dad9b720ad259cbdfa7809",
                "sha256": "0092856af50034e57493845466d03777627b62f163f36dff0a9b552deb56babb"
            },
            "downloads": -1,
            "filename": "opendataloader_pdf-0.0.16-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3efff58d64dad9b720ad259cbdfa7809",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 20580602,
            "upload_time": "2025-09-15T08:28:21",
            "upload_time_iso_8601": "2025-09-15T08:28:21.978151Z",
            "url": "https://files.pythonhosted.org/packages/e5/af/122c062d33b08f5a912aec5e24b99e0526154fd253f0519d112d8e2b8992/opendataloader_pdf-0.0.16-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-15 08:28:21",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "opendataloader-project",
    "github_project": "opendataloader-pdf",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "opendataloader-pdf"
}
