findimagespdf

Name	findimagespdf
Version	0.1.2
home_page	None
Summary	Search and save images from PDF files.
upload_time	2025-07-17 17:55:29
maintainer	None
docs_url	None
author	kurotom
requires_python	<4.0,>=3.9
license	MIT
keywords	pdf pdf-parser image pillow
requirements	pillow platformdirs

# findimagespdf

A tool that extracts images from PDF files, skips duplicates, and stores the images in a structured directory tree.

By default, the main directory is created on the desktop. It contains one subdirectory per PDF file, named after the file, which holds that file's images.

* Directory structure:

```bash
FindImagesPDF
├── file1
│   └── image1.jpeg
├── file2
│   ├── image1.jpeg
│   ├── image2.jpeg
│   └── image3.jpeg
└── file3
    ├── image1.jpeg
    ├── image2.jpeg
    └── image3.jpeg
```
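As a rough illustration of the layout above, the sketch below computes the folder in which a given PDF's images would land. `output_dir_for` and `ROOT_NAME` are hypothetical names, not part of findimagespdf, and the `~/Desktop` fallback is an assumption; the package lists `platformdirs` as a requirement, so the real desktop lookup is probably platform-aware. Whether a custom destination also receives the `FindImagesPDF` root folder is likewise assumed here.

```python
from pathlib import Path

ROOT_NAME = "FindImagesPDF"  # root folder name shown in the tree above


def output_dir_for(pdf_path, base=None):
    """Return the folder where images from `pdf_path` would be stored.

    Hypothetical helper for illustration; not part of findimagespdf.
    """
    # Assumption: the desktop lives at ~/Desktop; real locations vary by OS.
    base = Path(base) if base is not None else Path.home() / "Desktop"
    # One subdirectory per PDF, named after the file without its extension.
    return base / ROOT_NAME / Path(pdf_path).stem


print(output_dir_for("pdf_samples/file2.pdf", base="."))
```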


# Install

```bash
$ pip install findimagespdf
```


# CLI

```bash
$ findimagespdf --help
usage: findimagespdf [-h] -p PATH [-d DEST] [-v]

Extract images of PDF files.

options:
  -h, --help            show this help message and
                        exit
  -p PATH, --path PATH  Path of PDF file or
                        directory of PDF files.
  -d DEST, --dest DEST  Path of the directory to
                        store the images, by
                        default the directory is on
                        the desktop.
  -v, --verbose         Verbose

extracts images and stores them in a directory on
the desktop.
```

## Examples

```bash
$ findimagespdf --path pdf_samples

$ findimagespdf --path pdf_samples --dest .

$ findimagespdf --path pdf_samples --dest . --verbose
```
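To make the option semantics concrete, here is a minimal `argparse` sketch that reproduces the interface shown in the help text. It is a reconstruction from that output only, not the package's actual source; `build_parser` is a hypothetical name, and treating a missing `--dest` as `None` (meaning "use the desktop") is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options listed in `findimagespdf --help`."""
    parser = argparse.ArgumentParser(
        prog="findimagespdf",
        description="Extract images of PDF files.",
    )
    # --path is the only required option in the usage line.
    parser.add_argument("-p", "--path", required=True,
                        help="Path of PDF file or directory of PDF files.")
    # Assumption: None stands for the default directory on the desktop.
    parser.add_argument("-d", "--dest", default=None,
                        help="Path of the directory to store the images.")
    parser.add_argument("-v", "--verbose", action="store_true", help="Verbose")
    return parser


# Equivalent to the third example above.
args = build_parser().parse_args(["--path", "pdf_samples", "--dest", ".", "--verbose"])
print(args.path, args.dest, args.verbose)
```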


# Usage

* Using `with`

```python
from findimagespdf.pdffile import PDFFile

files = ["file1.pdf", "file2.pdf", "file3.pdf"]

for file in files:
    with PDFFile(path_or_bytes=file) as pdf:
        pdf.find_startxref()  # locate the xref table
        pdf.search_deep()     # scan the entire file for images (optional)
        pdf.search_images()   # search and filter images from the collected data
        pdf.get_images()      # extract and save the images
```

Using `with` is recommended because the file is closed automatically once processing finishes. You can also open and close the file manually.

* Manually

```python
from findimagespdf.pdffile import PDFFile

file = 'file1.pdf'

pdf = PDFFile(file)

pdf.open()            # open the PDF file

pdf.find_startxref()  # locate the xref table
pdf.search_deep()     # scan the entire file for images (optional)
pdf.search_images()   # search and filter images from the collected data
pdf.get_images()      # extract and save the images

pdf.close()           # close the PDF file
```
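When opening manually, an exception in any step would skip `close()`. A small sketch of one way to guard against that with `try`/`finally` follows; `extract_with_cleanup` is a hypothetical helper, and it assumes only the `open`, `close`, and processing methods shown above.

```python
def extract_with_cleanup(pdf):
    """Run the documented steps on an unopened PDFFile-like object.

    Hypothetical helper; `pdf` must provide the methods used in this README.
    """
    pdf.open()
    try:
        pdf.find_startxref()
        pdf.search_deep()
        pdf.search_images()
        pdf.get_images()
    finally:
        # Runs even if one of the steps above raised.
        pdf.close()
```

With the package installed, this would be called as `extract_with_cleanup(PDFFile("file1.pdf"))`.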


# Methods

The main PDF processing methods for image search are:

| Method | Description |
|-|-|
| `find_startxref` | Finds the xref table of the file. |
| `search_deep` | Scans the entire file. Optional, but necessary when the xref table is corrupted or when searching for hidden or unlinked images. |
| `search_images` | Searches and filters images from the collected data. |
| `get_images` | Extracts the images and saves them to the output directory (on the desktop by default). |
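Since `search_deep` is described as optional, the call order can be wrapped in a small function with a switch for the deep scan. This is only a sketch: `run_pipeline` and its `deep` parameter are hypothetical, and it assumes that the other three methods work when `search_deep` is skipped, which the table implies but does not spell out.

```python
def run_pipeline(pdf, deep=True):
    """Call the four documented methods in order on an open PDFFile-like object.

    Hypothetical helper; set deep=False to skip the full-file scan.
    """
    pdf.find_startxref()
    if deep:
        # Slower path: also finds hidden or unlinked images.
        pdf.search_deep()
    pdf.search_images()
    pdf.get_images()
```

Inside a `with PDFFile(...) as pdf:` block, `run_pipeline(pdf)` would then replace the four separate calls.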

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "findimagespdf",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.9",
    "maintainer_email": null,
    "keywords": "pdf, pdf-parser, image, pillow",
    "author": "kurotom",
    "author_email": "55354389+kurotom@users.noreply.github.com",
    "download_url": "https://files.pythonhosted.org/packages/39/97/2d9a455690cafef7d1b75db4becb173b7fb969656314bedf4c3e907e398a/findimagespdf-0.1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Search and save images from PDF files.",
    "version": "0.1.2",
    "project_urls": {
        "Bug Tracker": "https://github.com/kurotom/findimagespdf/issues"
    },
    "split_keywords": [
        "pdf",
        " pdf-parser",
        " image",
        " pillow"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "cd37fe2fc018c0103d547d0da881371ccc44009e157145d630843f4c21152d39",
                "md5": "4075241995e394e57165f0925873e986",
                "sha256": "f7bbb0c1036ecb9606d07ab9f98036bc85801c4ff678be4cdb40920f008233f6"
            },
            "downloads": -1,
            "filename": "findimagespdf-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "4075241995e394e57165f0925873e986",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.9",
            "size": 10153,
            "upload_time": "2025-07-17T17:55:28",
            "upload_time_iso_8601": "2025-07-17T17:55:28.986718Z",
            "url": "https://files.pythonhosted.org/packages/cd/37/fe2fc018c0103d547d0da881371ccc44009e157145d630843f4c21152d39/findimagespdf-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "39972d9a455690cafef7d1b75db4becb173b7fb969656314bedf4c3e907e398a",
                "md5": "68ca394f76e216da1995b7a452306d20",
                "sha256": "7b966d7c1e932f34432fd805c82b1e12b56b8cc63ea654d3f53d70dec86d2fce"
            },
            "downloads": -1,
            "filename": "findimagespdf-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "68ca394f76e216da1995b7a452306d20",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.9",
            "size": 8640,
            "upload_time": "2025-07-17T17:55:29",
            "upload_time_iso_8601": "2025-07-17T17:55:29.881766Z",
            "url": "https://files.pythonhosted.org/packages/39/97/2d9a455690cafef7d1b75db4becb173b7fb969656314bedf4c3e907e398a/findimagespdf-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-17 17:55:29",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "kurotom",
    "github_project": "findimagespdf",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "pillow",
            "specs": [
                [
                    "==",
                    "11.3.0"
                ]
            ]
        },
        {
            "name": "platformdirs",
            "specs": [
                [
                    "==",
                    "4.3.8"
                ]
            ]
        }
    ],
    "lcname": "findimagespdf"
}
