## RapidOCRPDF
<p>
<a href=""><img src="https://img.shields.io/badge/Python->=3.6,<3.12-aff.svg"></a>
<a href=""><img src="https://img.shields.io/badge/OS-Linux%2C%20Win%2C%20Mac-pink.svg"></a>
<a href="https://pypi.org/project/rapidocr-pdf/"><img alt="PyPI" src="https://img.shields.io/pypi/v/rapidocr-pdf"></a>
<a href="https://pepy.tech/project/rapidocr-pdf"><img src="https://static.pepy.tech/personalized-badge/rapidocr-pdf?period=total&units=abbreviation&left_color=grey&right_color=blue&left_text=Downloads"></a>
<a href="https://semver.org/"><img alt="SemVer2.0" src="https://img.shields.io/badge/SemVer-2.0-brightgreen"></a>
<a href="https://github.com/psf/black"><img src="https://img.shields.io/badge/code%20style-black-000000.svg"></a>
<a href="https://choosealicense.com/licenses/apache-2.0/"><img alt="GitHub" src="https://img.shields.io/github/license/RapidAI/RapidOCRPDF"></a>
</p>
- Built on [RapidOCR](https://github.com/RapidAI/RapidOCR), this package quickly extracts text from PDF files, including scanned and encrypted PDFs.
- Layout restoration is not supported yet.
### 1. Install the package from PyPI
```bash
# Based on rapidocr_onnxruntime
pip install rapidocr_pdf[onnxruntime]
# Based on rapidocr_openvino
pip install rapidocr_pdf[openvino]
```
### 2. Usage
- Run from a Python script.
```python
from rapidocr_pdf import PDFExtracter

# Create the extractor once and reuse it for every PDF
pdf_extracter = PDFExtracter()

# Extract the text of each page
pdf_path = 'tests/test_files/direct_and_image.pdf'
texts = pdf_extracter(pdf_path)
print(texts)
```
- Run from the command line.
```bash
$ rapidocr_pdf -h
usage: rapidocr_pdf [-h] [-path FILE_PATH]
options:
-h, --help show this help message and exit
-path FILE_PATH, --file_path FILE_PATH
File path, PDF or images
$ rapidocr_pdf -path tests/test_files/direct_and_image.pdf
```
### 3. Output format
- **Input**: `Union[str, Path, bytes]`
- **Output**: `List` of \[**Page num**, **Page content**, **score**\] entries, for example:
```python
[
['0', '达大学拉斯维加斯分校)的一次中文评测中获得最', '0.8969868'],
['1', 'ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network∗\nYuliang Liu‡†', '0.8969868'],
]
```
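Because every field in the example above is a string, downstream code usually needs to convert the page number and score before using them. A minimal sketch, assuming the output always has the three-string shape shown above (`pages_to_dict` and `min_score` are hypothetical names introduced for this example, not part of the package):

```python
from typing import Dict, List, Tuple


def pages_to_dict(
    results: List[List[str]], min_score: float = 0.0
) -> Dict[int, Tuple[str, float]]:
    """Map page number -> (content, score), dropping low-score pages.

    Assumes each entry is [page_num, page_content, score] with all
    three values given as strings, as in the README example.
    """
    pages: Dict[int, Tuple[str, float]] = {}
    for page_num, content, score in results:
        score_value = float(score)
        if score_value >= min_score:
            pages[int(page_num)] = (content, score_value)
    return pages


# Sample data copied from the output example above
sample = [
    ['0', '达大学拉斯维加斯分校)的一次中文评测中获得最', '0.8969868'],
    ['1', 'ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network∗\nYuliang Liu‡†', '0.8969868'],
]

pages = pages_to_dict(sample, min_score=0.5)
print(sorted(pages))  # page numbers that passed the threshold
```

The threshold is illustrative only; a suitable value depends on your documents.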
Raw data
{
"_id": null,
"home_page": "https://github.com/RapidAI/RapidOCRPDF",
"name": "rapidocr-pdf",
"maintainer": null,
"docs_url": null,
"requires_python": "<3.12,>=3.6",
"maintainer_email": null,
"keywords": "rapidocr_pdf, rapidocr_onnxruntime, ocr, onnxruntime, openvino",
"author": "SWHL",
"author_email": "liekkaskono@163.com",
"download_url": null,
"platform": "Any",
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "Tools of extracting PDF content based on RapidOCR",
"version": "0.1.0",
"project_urls": {
"Homepage": "https://github.com/RapidAI/RapidOCRPDF"
},
"split_keywords": [
"rapidocr_pdf",
" rapidocr_onnxruntime",
" ocr",
" onnxruntime",
" openvino"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "c72d7e916675a28dd37dbdd6027260990e65459652acd000810df7e7f7cc517c",
"md5": "cee5048c1da8cab69493eb6adf91c2c6",
"sha256": "3c64fd950033e7ceb65fadffdb8f8c50e60b5667322302f677f460b73f6e751a"
},
"downloads": -1,
"filename": "rapidocr_pdf-0.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "cee5048c1da8cab69493eb6adf91c2c6",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<3.12,>=3.6",
"size": 9218,
"upload_time": "2024-04-27T12:47:52",
"upload_time_iso_8601": "2024-04-27T12:47:52.527274Z",
"url": "https://files.pythonhosted.org/packages/c7/2d/7e916675a28dd37dbdd6027260990e65459652acd000810df7e7f7cc517c/rapidocr_pdf-0.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-04-27 12:47:52",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "RapidAI",
"github_project": "RapidOCRPDF",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "filetype",
"specs": [
[
">=",
"1.2.0"
]
]
},
{
"name": "pymupdf",
"specs": []
},
{
"name": "rapidocr_onnxruntime",
"specs": []
}
],
"lcname": "rapidocr-pdf"
}