English | [中文](README_CN.md)
# pdf2docx
![python-version](https://img.shields.io/badge/python->=3.6-green.svg)
[![codecov](https://codecov.io/gh/dothinking/pdf2docx/branch/master/graph/badge.svg)](https://codecov.io/gh/dothinking/pdf2docx)
[![pypi-version](https://img.shields.io/pypi/v/pdf2docx.svg)](https://pypi.python.org/pypi/pdf2docx/)
![license](https://img.shields.io/pypi/l/pdf2docx.svg)
![pypi-downloads](https://img.shields.io/pypi/dm/pdf2docx)
- Extract data from PDF with `PyMuPDF`, e.g. text, images and drawings
- Parse layout with rules, e.g. sections, paragraphs, images and tables
- Generate docx with `python-docx`
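The three steps above are driven through a single converter object. The following is a minimal sketch, assuming the package is importable as `pdf2docx` and exposes `Converter` with `convert()` and `close()` as described in the linked Quickstart; `default_docx_path` and `convert_pdf` are hypothetical helper names introduced only for this illustration.

```python
from pathlib import Path


def default_docx_path(pdf_path):
    """Return the input path with its suffix swapped for .docx."""
    return Path(pdf_path).with_suffix(".docx")


def convert_pdf(pdf_path, docx_path=None, start=0, end=None):
    """Convert pages [start, end) of a PDF to docx; end=None means up to the last page."""
    # Imported lazily so the path helper above works without the package installed.
    # Assumption: the import name is `pdf2docx`.
    from pdf2docx import Converter

    docx_path = Path(docx_path) if docx_path else default_docx_path(pdf_path)
    cv = Converter(str(pdf_path))
    try:
        cv.convert(str(docx_path), start=start, end=end)
    finally:
        cv.close()  # release the underlying PDF document
    return docx_path
```

For example, `convert_pdf("sample.pdf", start=0, end=2)` would be expected to write the first two pages to `sample.docx`.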
## Features
- Parse and re-create page layout
    - page margin
    - section and column (1 or 2 columns only)
    - page header and footer [TODO]
- Parse and re-create paragraph
    - OCR text [TODO]
    - text in horizontal/vertical direction: from left to right, from bottom to top
    - font style, e.g. font name, size, weight, italic and color
    - text format, e.g. highlight, underline, strike-through
    - list style [TODO]
    - external hyper link
    - paragraph horizontal alignment (left/right/center/justify) and vertical spacing
- Parse and re-create image
    - in-line image
    - image in Gray/RGB/CMYK mode
    - transparent image
    - floating image, i.e. picture behind text
- Parse and re-create table
    - border style, e.g. width, color
    - shading style, i.e. background color
    - merged cells
    - vertical direction cell
    - table with partly hidden borders
    - nested tables
- Parsing pages with multi-processing
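Multi-process parsing is switched on through conversion settings. The sketch below assumes a recent release in which `convert()` accepts `multi_processing` and `cpu_count` as keyword arguments (older releases may expect them in a different form); `parallel_settings` and `convert_parallel` are hypothetical helper names.

```python
import os


def parallel_settings(workers=None):
    """Build keyword settings for a multi-process conversion."""
    # Fall back to the machine's CPU count, and never go below one worker.
    workers = workers or os.cpu_count() or 1
    return {"multi_processing": True, "cpu_count": max(1, workers)}


def convert_parallel(pdf_path, docx_path, start=0, end=None, workers=None):
    """Convert pages [start, end) using several worker processes."""
    # Assumption: the import name is `pdf2docx`.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        cv.convert(docx_path, start=start, end=end, **parallel_settings(workers))
    finally:
        cv.close()
```

When calling `convert_parallel` from a script, it is safer to do so under an `if __name__ == "__main__":` guard, because worker processes may re-import the calling module on some platforms.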
*It can also be used as a tool to extract table contents, since both table content and format/style are parsed.*
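As a rough illustration of table extraction, the sketch below assumes `Converter.extract_tables()` returns a list of tables, each a list of rows of cell text, and that cells absorbed by a merge may come back as `None`. `table_to_csv` and `extract_tables` are hypothetical helper names.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (a list of rows of cell text) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Assumption: merged-away cells arrive as None; write them as empty fields.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


def extract_tables(pdf_path, start=0, end=None):
    """Return the tables found on pages [start, end) of a PDF."""
    # Assumption: the import name is `pdf2docx`.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        return cv.extract_tables(start=start, end=end)
    finally:
        cv.close()
```

A typical use would be `for table in extract_tables("sample.pdf", end=1): print(table_to_csv(table))`.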
## Limitations
- Text-based PDF files only (OCR for scanned pages is still TODO)
- Left-to-right languages only
- Normal reading direction only; transformed or rotated words are not supported
- The rule-based method cannot reproduce every PDF layout exactly
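Because only text-based PDFs are supported, it can help to check for a text layer before converting. This is a heuristic sketch, not part of the package: it assumes PyMuPDF is importable as `fitz` and that `page.get_text()` is available (recent PyMuPDF releases); `looks_text_based`, `read_page_texts` and the 20-character threshold are illustrative choices.

```python
def looks_text_based(page_texts, min_chars_per_page=20):
    """Guess whether a PDF carries a text layer, from per-page extracted text."""
    if not page_texts:
        return False
    # Average the non-whitespace-trimmed text length over all pages.
    total = sum(len(text.strip()) for text in page_texts)
    return total / len(page_texts) >= min_chars_per_page


def read_page_texts(pdf_path):
    """Extract plain text from every page with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the heuristic above stays standalone

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

A scanned document usually yields empty strings for every page, so `looks_text_based(read_page_texts("scan.pdf"))` would be expected to return `False` in that case.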
## Documentation
- [Installation](https://dothinking.github.io/pdf2docx/installation.html)
- [Quickstart](https://dothinking.github.io/pdf2docx/quickstart.html)
    - [Convert PDF](https://dothinking.github.io/pdf2docx/quickstart.convert.html)
    - [Extract table](https://dothinking.github.io/pdf2docx/quickstart.table.html)
    - [Command Line Interface](https://dothinking.github.io/pdf2docx/quickstart.cli.html)
    - [Graphic User Interface](https://dothinking.github.io/pdf2docx/quickstart.gui.html)
- [Technical Documentation (In Chinese)](https://dothinking.github.io/pdf2docx/techdoc.html)
- [API Documentation](https://dothinking.github.io/pdf2docx/modules.html)
## Sample
![sample_compare.png](https://s1.ax1x.com/2020/08/04/aDryx1.png)
Raw data
{
"_id": null,
"home_page": "https://github.com/dothinking/RGpdfconverter",
"name": "RGpdfconverter",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "pdf-to-word,pdf-to-docx",
"author": "dothinking",
"author_email": "train8808@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/3a/9b/9103bdc9025413fc09c7ce4f504c2b531b128e6c743fb051d7dc6f3b920e/RGpdfconverter-0.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "GPL v3",
"summary": "parse PDF files to docx",
"version": "0.3",
"project_urls": {
"Homepage": "https://github.com/dothinking/RGpdfconverter"
},
"split_keywords": [
"pdf-to-word",
"pdf-to-docx"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "015431d9bcdb43853c86aeee6fc449e419d9c995f95a4d26e6779f0228889404",
"md5": "9435d1fdc95145bb3077c8a1f6af4665",
"sha256": "55500d6cbcc7a42233e85d1145dd51a0faa474b85dee4c7fda03d002ece46edf"
},
"downloads": -1,
"filename": "RGpdfconverter-0.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "9435d1fdc95145bb3077c8a1f6af4665",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 129086,
"upload_time": "2023-10-30T19:53:18",
"upload_time_iso_8601": "2023-10-30T19:53:18.533252Z",
"url": "https://files.pythonhosted.org/packages/01/54/31d9bcdb43853c86aeee6fc449e419d9c995f95a4d26e6779f0228889404/RGpdfconverter-0.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "3a9b9103bdc9025413fc09c7ce4f504c2b531b128e6c743fb051d7dc6f3b920e",
"md5": "b0426a50ca2dab4cba5096b10f18b29f",
"sha256": "53d1110be1c03024e0e6723163b9cb12ed39bc1ebcf0ce1ecd1328751f9f0223"
},
"downloads": -1,
"filename": "RGpdfconverter-0.3.tar.gz",
"has_sig": false,
"md5_digest": "b0426a50ca2dab4cba5096b10f18b29f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 3062738,
"upload_time": "2023-10-30T19:53:21",
"upload_time_iso_8601": "2023-10-30T19:53:21.808767Z",
"url": "https://files.pythonhosted.org/packages/3a/9b/9103bdc9025413fc09c7ce4f504c2b531b128e6c743fb051d7dc6f3b920e/RGpdfconverter-0.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-10-30 19:53:21",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "dothinking",
"github_project": "RGpdfconverter",
"github_not_found": true,
"lcname": "rgpdfconverter"
}