rapidocr-pdf

Name	rapidocr-pdf
Version	0.1.0
home_page	https://github.com/RapidAI/RapidOCRPDF
Summary	Tools of extracting PDF content based on RapidOCR
upload_time	2024-04-27 12:47:52
maintainer	None
docs_url	None
author	SWHL
requires_python	<3.12,>=3.6
license	Apache-2.0
keywords	rapidocr_pdf rapidocr_onnxruntime ocr onnxruntime openvino
requirements	filetype pymupdf rapidocr_onnxruntime

## RapidOCRPDF
<p>
    <a href=""><img src="https://img.shields.io/badge/Python->=3.6,<3.12-aff.svg"></a>
    <a href=""><img src="https://img.shields.io/badge/OS-Linux%2C%20Win%2C%20Mac-pink.svg"></a>
    <a href="https://pypi.org/project/rapidocr-pdf/"><img alt="PyPI" src="https://img.shields.io/pypi/v/rapidocr-pdf"></a>
    <a href="https://pepy.tech/project/rapidocr-pdf"><img src="https://static.pepy.tech/personalized-badge/rapidocr-pdf?period=total&units=abbreviation&left_color=grey&right_color=blue&left_text=Downloads"></a>
    <a href="https://semver.org/"><img alt="SemVer2.0" src="https://img.shields.io/badge/SemVer-2.0-brightgreen"></a>
    <a href="https://github.com/psf/black"><img src="https://img.shields.io/badge/code%20style-black-000000.svg"></a>
    <a href="https://choosealicense.com/licenses/apache-2.0/"><img alt="GitHub" src="https://img.shields.io/github/license/RapidAI/RapidOCRPDF"></a>
</p>

- Built on [RapidOCR](https://github.com/RapidAI/RapidOCR), this package quickly extracts text from PDF files, including scanned and encrypted PDFs.
- Layout restoration is not supported yet.


### 1. Install from PyPI
   ```bash
   # Install with rapidocr_onnxruntime as the OCR backend
   pip install rapidocr_pdf[onnxruntime]

   # Install with rapidocr_openvino as the OCR backend
   pip install rapidocr_pdf[openvino]
   ```
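
The package metadata declares `requires_python` as `<3.12,>=3.6`, so it can help to check the interpreter before installing. The following is a minimal sketch of such a check; the helper name `is_supported_python` is hypothetical and not part of rapidocr-pdf.

```python
import sys

# Bounds copied from the package metadata: requires_python "<3.12,>=3.6".
MIN_SUPPORTED = (3, 6)   # inclusive lower bound
MAX_EXCLUSIVE = (3, 12)  # exclusive upper bound


def is_supported_python(version=None):
    """Return True if the (major, minor) version lies in [3.6, 3.12)."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_SUPPORTED <= major_minor < MAX_EXCLUSIVE


if not is_supported_python():
    print("This interpreter is outside the range declared by rapidocr-pdf 0.1.0.")
```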

### 2. Usage
- Run from a Python script.
    ```python
    from rapidocr_pdf import PDFExtracter

    # Create the extractor instance.
    pdf_extracter = PDFExtracter()

    # Call the instance with the PDF path to run extraction.
    pdf_path = 'tests/test_files/direct_and_image.pdf'
    texts = pdf_extracter(pdf_path)
    print(texts)
    ```
- Run from the command line.
    ```bash
    $ rapidocr_pdf -h
    usage: rapidocr_pdf [-h] [-path FILE_PATH]

    options:
    -h, --help            show this help message and exit
    -path FILE_PATH, --file_path FILE_PATH
                            File path, PDF or images

    $ rapidocr_pdf -path tests/test_files/direct_and_image.pdf
    ```
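
Both usage modes above handle one file at a time. To process a whole folder, one option is a small wrapper around the extractor call. This is only a sketch: `extract_folder` is a hypothetical helper, not part of rapidocr-pdf, and it takes the extractor as an argument so it works with any callable that accepts a path and returns a list.

```python
from pathlib import Path
from typing import Callable, Dict, List, Union


def extract_folder(
    extractor: Callable[[Path], List[list]],
    folder: Union[str, Path],
) -> Dict[str, List[list]]:
    """Run the extractor on every *.pdf file directly inside folder."""
    results = {}
    for pdf_file in sorted(Path(folder).glob("*.pdf")):
        # Each call returns that file's extraction result unchanged.
        results[pdf_file.name] = extractor(pdf_file)
    return results


# Intended use with the real package (requires rapidocr_pdf to be installed):
#   from rapidocr_pdf import PDFExtracter
#   all_texts = extract_folder(PDFExtracter(), "tests/test_files")
```

Passing the extractor in, rather than constructing it inside the helper, keeps the OCR models loaded once for the whole batch.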
### 3. Output format
   - **Input**: `Union[str, Path, bytes]`
   - **Output**: a `List` of `[page number, page content, score]` entries, for example:
       ```python
       [
           ['0', '达大学拉斯维加斯分校）的一次中文评测中获得最', '0.8969868'],
           ['1', 'ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network∗\nYuliang Liu‡†', '0.8969868'],
       ]
       ```
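
Since every field in the sample output is a string, downstream code usually needs to convert the page number and score before using them. A minimal sketch of that post-processing follows; `pages_to_dict` and the `min_score` threshold are illustrative assumptions, not part of the package.

```python
from typing import Dict, List


def pages_to_dict(results: List[List[str]], min_score: float = 0.0) -> Dict[int, str]:
    """Map page number -> text, keeping pages whose score is at least min_score."""
    pages = {}
    for page_num, text, score in results:
        # All three fields are strings in the sample output, so convert explicitly.
        if float(score) >= min_score:
            pages[int(page_num)] = text
    return pages


# Sample taken from the output format shown above.
sample = [
    ['0', '达大学拉斯维加斯分校）的一次中文评测中获得最', '0.8969868'],
    ['1', 'ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network∗\nYuliang Liu‡†', '0.8969868'],
]
pages = pages_to_dict(sample, min_score=0.5)
print(sorted(pages))  # page numbers that passed the threshold
```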

Raw data

{
    "_id": null,
    "home_page": "https://github.com/RapidAI/RapidOCRPDF",
    "name": "rapidocr-pdf",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.12,>=3.6",
    "maintainer_email": null,
    "keywords": "rapidocr_pdf, rapidocr_onnxruntime, ocr, onnxruntime, openvino",
    "author": "SWHL",
    "author_email": "liekkaskono@163.com",
    "download_url": null,
    "platform": "Any",
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "Tools of extracting PDF content based on RapidOCR",
    "version": "0.1.0",
    "project_urls": {
        "Homepage": "https://github.com/RapidAI/RapidOCRPDF"
    },
    "split_keywords": [
        "rapidocr_pdf",
        " rapidocr_onnxruntime",
        " ocr",
        " onnxruntime",
        " openvino"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c72d7e916675a28dd37dbdd6027260990e65459652acd000810df7e7f7cc517c",
                "md5": "cee5048c1da8cab69493eb6adf91c2c6",
                "sha256": "3c64fd950033e7ceb65fadffdb8f8c50e60b5667322302f677f460b73f6e751a"
            },
            "downloads": -1,
            "filename": "rapidocr_pdf-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "cee5048c1da8cab69493eb6adf91c2c6",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.12,>=3.6",
            "size": 9218,
            "upload_time": "2024-04-27T12:47:52",
            "upload_time_iso_8601": "2024-04-27T12:47:52.527274Z",
            "url": "https://files.pythonhosted.org/packages/c7/2d/7e916675a28dd37dbdd6027260990e65459652acd000810df7e7f7cc517c/rapidocr_pdf-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-04-27 12:47:52",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "RapidAI",
    "github_project": "RapidOCRPDF",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "filetype",
            "specs": [
                [
                    ">=",
                    "1.2.0"
                ]
            ]
        },
        {
            "name": "pymupdf",
            "specs": []
        },
        {
            "name": "rapidocr_onnxruntime",
            "specs": []
        }
    ],
    "lcname": "rapidocr-pdf"
}
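
The `digests` entry in the metadata can be used to check a downloaded wheel. The sketch below compares a local file against the listed `sha256` value; the helper names and the local file location are assumptions for illustration.

```python
import hashlib
from pathlib import Path

# sha256 digest listed for rapidocr_pdf-0.1.0-py3-none-any.whl in the metadata.
EXPECTED_SHA256 = "3c64fd950033e7ceb65fadffdb8f8c50e60b5667322302f677f460b73f6e751a"


def sha256_of(path, chunk_size=65536):
    """Hash the file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's sha256 equals the expected hex digest."""
    return sha256_of(path) == expected.lower()


# Assumed local download location; adjust to wherever the wheel was saved.
wheel = Path("rapidocr_pdf-0.1.0-py3-none-any.whl")
if wheel.exists():
    print("digest OK" if matches_expected(wheel) else "digest MISMATCH")
```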
