pdf2image


- Name: pdf2image
- Version: 1.17.0
- Home page: https://github.com/Belval/pdf2image
- Summary: A wrapper around the pdftoppm and pdftocairo command line tools to convert PDF to a PIL Image list.
- Upload time: 2024-01-07 20:33:01
- Author: Edouard Belval
- License: MIT
- Keywords: pdf, image, png, jpeg, jpg, convert
- Requirements: none recorded

# pdf2image
[![CircleCI](https://circleci.com/gh/Belval/pdf2image/tree/master.svg?style=svg)](https://circleci.com/gh/Belval/pdf2image/tree/master) [![PyPI version](https://badge.fury.io/py/pdf2image.svg)](https://badge.fury.io/py/pdf2image) [![codecov](https://codecov.io/gh/Belval/pdf2image/branch/master/graph/badge.svg)](https://codecov.io/gh/Belval/pdf2image) [![Downloads](https://pepy.tech/badge/pdf2image/month)](https://pepy.tech/project/pdf2image) [![GitHub CI](https://github.com/Belval/pdf2image/actions/workflows/documentation.yml/badge.svg)](https://belval.github.io/pdf2image)

A Python (3.7+) module that wraps `pdftoppm` and `pdftocairo` to convert a PDF into a list of PIL Image objects.

## How to install

`pip install pdf2image`

### Windows

Windows users will have to build or download poppler for Windows. I recommend the [@oschwartz10612 version](https://github.com/oschwartz10612/poppler-windows/releases/), which is the most up-to-date. You will then have to add the `bin/` folder to [PATH](https://www.architectryan.com/2018/03/17/add-to-the-path-on-windows-10/) or pass `poppler_path=r"C:\path\to\poppler-xx\bin"` as an argument to `convert_from_path`.
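
As a minimal sketch of the `poppler_path` option, the helper below reads the folder from an environment variable and only passes the argument when it is set. The variable name `POPPLER_PATH` and the helper names `poppler_kwargs` and `convert` are assumptions made for this example; pdf2image itself only knows the `poppler_path` argument.

```python
import os


def poppler_kwargs(environ=None):
    """Return {'poppler_path': ...} if POPPLER_PATH is set, else {}.

    POPPLER_PATH is a hypothetical variable name chosen for this example.
    When it is unset, no argument is passed and the tools must be on PATH.
    """
    environ = os.environ if environ is None else environ
    bin_dir = environ.get("POPPLER_PATH")
    return {"poppler_path": bin_dir} if bin_dir else {}


def convert(pdf_path):
    """Convert a PDF, using POPPLER_PATH when it is defined."""
    # Imported here so poppler_kwargs can be used without pdf2image installed.
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path, **poppler_kwargs())
```

This keeps machine-specific paths out of the code, so the same script can run on systems where poppler is already on PATH.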

### Mac

Mac users will have to install [poppler](https://poppler.freedesktop.org/).

Installing using [Brew](https://brew.sh/):

```
brew install poppler
```

### Linux

Most distros ship with `pdftoppm` and `pdftocairo`. If they are not installed, use your package manager to install `poppler-utils`.

### Platform-independent (using `conda`)

1. Install poppler: `conda install -c conda-forge poppler`
2. Install pdf2image: `pip install pdf2image`
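
Whichever route you take, it can help to confirm that the poppler tools are discoverable before converting anything. The sketch below uses only the standard library; the helper name `find_poppler_tools` is hypothetical and not part of pdf2image.

```python
import shutil

# Command line tools that pdf2image relies on.
POPPLER_TOOLS = ("pdftoppm", "pdftocairo", "pdfinfo")


def find_poppler_tools(search_path=None):
    """Map each tool name to its resolved location, or None if not found.

    `search_path` is an optional PATH-style string; by default the
    current PATH environment variable is searched.
    """
    return {tool: shutil.which(tool, path=search_path) for tool in POPPLER_TOOLS}


if __name__ == "__main__":
    for tool, location in find_poppler_tools().items():
        print(f"{tool}: {location or 'NOT FOUND'}")
```

If a tool is reported as not found, either fix PATH or pass the folder explicitly through `poppler_path`.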

## How does it work?


```py
from pdf2image import convert_from_path, convert_from_bytes
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)
```

Then simply do:

```py
images = convert_from_path('/home/belval/example.pdf')
```

OR

```py
# Read the file inside a context manager so the handle is closed afterwards
with open('/home/belval/example.pdf', 'rb') as f:
    images = convert_from_bytes(f.read())
```

OR better yet

```py
import tempfile

with tempfile.TemporaryDirectory() as path:
    images = convert_from_path('/home/belval/example.pdf', output_folder=path)
    # Use the images inside this block: the folder is deleted on exit
```

`images` will be a list of PIL Image objects, one for each page of the PDF document.
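
Since each item is a PIL Image, the pages can be written to disk with the image's `save` method. This is a minimal sketch; the helper name `save_pages` and the `page_001.png` naming scheme are assumptions made for this example.

```python
from pathlib import Path


def save_pages(images, out_dir, stem="page"):
    """Save each page image as a numbered PNG and return the written paths.

    `images` is any iterable of objects with a PIL-style
    `save(path, format)` method, such as the list returned by
    `convert_from_path`.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, image in enumerate(images, start=1):
        target = out_dir / f"{stem}_{number:03d}.png"
        image.save(target, "PNG")
        written.append(target)
    return written
```

For example, `save_pages(convert_from_path('/home/belval/example.pdf'), 'out')` would write one PNG per page into `out/`.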

Here are the definitions:

`convert_from_path(pdf_path, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600, hide_attributes=False)`

`convert_from_bytes(pdf_file, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600, hide_attributes=False)`
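
The exceptions imported earlier can be used to wrap a conversion so that a missing poppler install or a bad file does not crash the caller. The sketch below is one possible pattern, not part of pdf2image: the helper name `safe_convert` and the error messages are made up for illustration, while the exception classes come from `pdf2image.exceptions`.

```python
def safe_convert(pdf_path, **kwargs):
    """Convert a PDF and return (images, error_message).

    Exactly one of the two values is None. Extra keyword arguments
    (dpi, timeout, first_page, ...) are forwarded to convert_from_path.
    """
    # Imported lazily so this helper can be defined without pdf2image installed.
    from pdf2image import convert_from_path
    from pdf2image.exceptions import (
        PDFInfoNotInstalledError,
        PDFPageCountError,
        PDFPopplerTimeoutError,
        PDFSyntaxError,
    )

    try:
        return convert_from_path(pdf_path, **kwargs), None
    except PDFInfoNotInstalledError:
        return None, "poppler not found: install it or pass poppler_path"
    except PDFPageCountError:
        return None, "could not read the page count of the file"
    except PDFSyntaxError:
        return None, "poppler reported a PDF syntax error"
    except PDFPopplerTimeoutError:
        return None, "conversion exceeded the timeout"
```

A call such as `images, error = safe_convert('/home/belval/example.pdf', dpi=150, timeout=60)` then leaves the decision about logging or retrying to the caller.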

## What's new?

- Allow users to hide attributes when using pdftoppm with `hide_attributes` (Thank you @StaticRocket)
- Fix console opening on Windows (Thank you @OhMyAgnes!)
- Add `timeout` parameter which raises `PDFPopplerTimeoutError` after the given number of seconds.
- Add `use_pdftocairo` parameter which forces `pdf2image` to use `pdftocairo`. This should improve performance.
- Fixed a bug where using `pdf2image` with multiple threads (but not multiple processes) would cause an exception
- `jpegopt` parameter allows for tuning of the output JPEG when using `fmt="jpeg"` (`-jpegopt` in pdftoppm CLI) (Thank you @abieler)
- `pdfinfo_from_path` and `pdfinfo_from_bytes` expose the output of the pdfinfo CLI
- `paths_only` parameter will return image paths instead of Image objects, to prevent OOM when converting a big PDF
- `size` parameter allows you to define the shape of the resulting images (`-scale-to` in pdftoppm CLI)
    - `size=400` will fit the image to a 400x400 box, preserving aspect ratio
    - `size=(400, None)` will make the image 400 pixels wide, preserving aspect ratio
    - `size=(500, 500)` will resize the image to 500x500 pixels, not preserving aspect ratio
- `grayscale` parameter allows you to convert images to grayscale (`-gray` in pdftoppm CLI)
- `single_file` parameter allows you to convert the first PDF page only, without adding digits at the end of the `output_file`
- Allow the user to specify poppler's installation path with `poppler_path`
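
The three `size` cases listed above can be worked through with a little arithmetic. The function below is only an illustration of the described behaviour, not code from pdf2image: the name `expected_size` is hypothetical, the `(None, height)` case is assumed by symmetry with `(400, None)`, and poppler's own rounding may differ by a pixel.

```python
def expected_size(page_width, page_height, size):
    """Approximate output (width, height) in pixels for a given `size`.

    `page_width` and `page_height` may be in any unit; only their
    ratio matters when the aspect ratio is preserved.
    """
    if isinstance(size, int):
        # Fit inside a size x size box: scale the longer side to `size`.
        scale = size / max(page_width, page_height)
        return round(page_width * scale), round(page_height * scale)
    width, height = size
    if width is None and height is None:
        raise ValueError("at least one of width and height must be set")
    if height is None:
        # Fixed width, height follows the aspect ratio.
        return width, round(page_height * width / page_width)
    if width is None:
        # Assumed symmetric case: fixed height, width follows.
        return round(page_width * height / page_height), height
    # Both given: exact dimensions, aspect ratio not preserved.
    return width, height
```

For a page twice as tall as it is wide, `size=400` would give roughly 200x400 pixels, `size=(400, None)` roughly 400x800, and `size=(500, 500)` exactly 500x500.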

## Performance tips

- Using an output folder is significantly faster if you are using an SSD. Otherwise, I/O usually becomes the bottleneck.
- Using multiple threads can give you some gains, but avoid more than 4, as this will cause an I/O bottleneck (even on my NVMe SSD!).
- If I/O is your bottleneck, using the JPEG format can lead to significant gains.
- The PNG format is pretty slow because of its compression.
- If you want to know the best settings (most settings will be fine anyway), you can clone the project and run `python tests.py` to get timings.
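
For a quick comparison on your own documents without cloning the project, a small timing helper can be enough. This is a rough sketch, not the project's benchmark: the names `time_call` and `compare_settings` are hypothetical, and a single run per setting is noisy, so treat the numbers as indicative only.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def compare_settings(pdf_path, candidates):
    """Time convert_from_path once per settings dict.

    Returns a list of (settings, seconds) tuples, fastest first.
    """
    # Imported lazily: pdf2image and poppler are only needed at call time.
    from pdf2image import convert_from_path

    timings = []
    for settings in candidates:
        _, seconds = time_call(convert_from_path, pdf_path, **settings)
        timings.append((settings, seconds))
    return sorted(timings, key=lambda item: item[1])
```

A call might look like `compare_settings('/home/belval/example.pdf', [{"fmt": "ppm"}, {"fmt": "jpeg"}, {"fmt": "png"}, {"fmt": "jpeg", "thread_count": 4}])`.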

## Limitations / known issues

- A relatively big PDF will use up all your memory and cause the process to be killed (unless you use an output folder).
- pdf2image sometimes fails to read PDFs signed using DocuSign. See the [solution for the DocuSign issue](docs/installation.md).
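
One way to keep memory bounded for large PDFs, besides an output folder, is to convert a few pages at a time with `first_page` and `last_page`. The sketch below assumes the dictionary returned by `pdfinfo_from_path` has a `"Pages"` entry; the helper names `page_ranges` and `convert_in_chunks` are made up for this example.

```python
def page_ranges(page_count, chunk_size):
    """Yield 1-based, inclusive (first_page, last_page) pairs."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for first in range(1, page_count + 1, chunk_size):
        yield first, min(first + chunk_size - 1, page_count)


def convert_in_chunks(pdf_path, chunk_size=10, **kwargs):
    """Yield lists of page images, at most `chunk_size` pages at a time."""
    # Imported lazily: pdf2image and poppler are only needed at call time.
    from pdf2image import convert_from_path, pdfinfo_from_path

    page_count = int(pdfinfo_from_path(pdf_path)["Pages"])
    for first, last in page_ranges(page_count, chunk_size):
        yield convert_from_path(
            pdf_path, first_page=first, last_page=last, **kwargs
        )
```

Because `convert_in_chunks` is a generator, each batch can be processed and released before the next one is rendered.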



            
