# findimagespdf
A tool that extracts images from PDF files, skips duplicates, and stores the images in a structured directory tree.
By default, the main directory is created on the desktop and contains one subdirectory per PDF file, named after the file, that holds its images.
* Directory structure:
```bash
FindImagesPDF
├── file1
│   └── image1.jpeg
├── file2
│   ├── image1.jpeg
│   ├── image2.jpeg
│   └── image3.jpeg
└── file3
    ├── image1.jpeg
    ├── image2.jpeg
    └── image3.jpeg
```
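The layout above amounts to mapping each PDF to a subdirectory named after the file. A minimal sketch of that mapping, assuming the subdirectory name is the PDF file name without its extension. `image_dir_for` and its `root` parameter are hypothetical names used for illustration and are not part of the package:

```python
from pathlib import Path


def image_dir_for(pdf_path, root):
    """Return the directory that would hold the images of one PDF."""
    # Assumption: the subdirectory is the PDF file name without ".pdf".
    return Path(root) / "FindImagesPDF" / Path(pdf_path).stem


# Print the three subdirectories shown in the tree above.
for name in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    print(image_dir_for(name, root="Desktop"))
```

Here `root` stands in for the desktop directory; how the package resolves the real desktop path is not shown.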
# Install
```bash
$ pip install findimagespdf
```
# CLI
```bash
$ findimagespdf --help
usage: findimagespdf [-h] -p PATH [-d DEST] [-v]

Extract images of PDF files.

options:
  -h, --help            show this help message and
                        exit
  -p PATH, --path PATH  Path of PDF file or
                        directory of PDF files.
  -d DEST, --dest DEST  Path of the directory to
                        store the images, by
                        default the directory is on
                        the desktop.
  -v, --verbose         Verbose

extracts images and stores them in a directory on
the desktop.
```
## Examples
```bash
$ findimagespdf --path pdf_samples
$ findimagespdf --path pdf_samples --dest .
$ findimagespdf --path pdf_samples --dest . --verbose
```
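To make the option semantics concrete, here is a sketch of an `argparse` parser that accepts the same flags as the help text above. It is reconstructed from that help output only and is not the package's actual source. The `build_parser` name and the `None` default for `--dest` are assumptions:

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented findimagespdf options."""
    parser = argparse.ArgumentParser(
        prog="findimagespdf",
        description="Extract images of PDF files.",
    )
    # -p/--path is required, as the usage line shows it without brackets.
    parser.add_argument("-p", "--path", required=True,
                        help="Path of PDF file or directory of PDF files.")
    # Assumption: None means "use the default directory on the desktop".
    parser.add_argument("-d", "--dest", default=None,
                        help="Path of the directory to store the images.")
    parser.add_argument("-v", "--verbose", action="store_true", help="Verbose")
    return parser


args = build_parser().parse_args(["--path", "pdf_samples", "--dest", ".", "--verbose"])
print(args.path, args.dest, args.verbose)  # prints: pdf_samples . True
```

Omitting `--path` makes `argparse` exit with a usage error, which matches the `-p PATH` entry being mandatory in the usage line.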
# Usage
* Using `with`
```python
from findimagespdf.pdffile import PDFFile

files = ["file1.pdf", "file2.pdf", "file3.pdf"]

for file in files:
    with PDFFile(path_or_bytes=file) as pdf:
        pdf.find_startxref()  # locates the xref table.
        pdf.search_deep()     # scans the entire file for images.
        pdf.search_images()   # searches and filters the images.
        pdf.get_images()      # extracts and saves the images.
```
Using `with` is recommended because the file is closed automatically once processing finishes. You can also open and close the file manually.
* Manually
```python
from findimagespdf.pdffile import PDFFile

file = "file1.pdf"

pdf = PDFFile(file)
pdf.open()            # opens the PDF file.

pdf.find_startxref()  # locates the xref table.
pdf.search_deep()     # scans the entire file for images.
pdf.search_images()   # searches and filters the images.
pdf.get_images()      # extracts and saves the images.

pdf.close()           # closes the PDF file.
```
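When opening the file manually, an exception in any step would skip `close()`. One way to guard against that is a `try`/`finally` block, sketched below. `extract_with_cleanup` is a hypothetical helper, not part of the package, and it only assumes that the object exposes the methods used in the example above:

```python
def extract_with_cleanup(pdf):
    """Run the extraction steps on an unopened PDFFile-like object.

    The file is closed even if one of the steps raises.
    """
    pdf.open()
    try:
        pdf.find_startxref()
        pdf.search_deep()
        pdf.search_images()
        pdf.get_images()
    finally:
        # Runs on success and on error, so the file handle is not leaked.
        pdf.close()
```

It would be called as `extract_with_cleanup(PDFFile("file1.pdf"))`. This is roughly what the `with` form is described as doing for you.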
# Methods
The main PDF processing methods for image search are:
| Method | Description |
|-|-|
| `find_startxref` | Finds the xref table of the file. |
| `search_deep` | Optional. Needed when the xref table is corrupted, or to find hidden or unlinked images. |
| `search_images` | Searches and filters images from the collected data. |
| `get_images` | Extracts the images and saves them to the directory on the desktop. |
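The methods above are called in a fixed order, with `search_deep` as the optional step. A minimal sketch that wraps this order in a helper for a single PDF or a directory of PDFs follows. `collect_pdfs`, `extract_all`, and the `pdf_cls` and `deep` parameters are hypothetical names for illustration. Skipping `search_deep` is assumed to be safe only because the table describes it as optional:

```python
from pathlib import Path


def collect_pdfs(path):
    """Return the PDF paths for a single PDF file or a directory of PDFs."""
    path = Path(path)
    if path.is_dir():
        return sorted(p for p in path.iterdir() if p.suffix.lower() == ".pdf")
    return [path]


def extract_all(path, pdf_cls=None, deep=True):
    """Run the extraction steps on every PDF under `path`.

    Returns the list of processed paths.
    """
    if pdf_cls is None:
        # Imported lazily so the helper can be tried with a stand-in class.
        from findimagespdf.pdffile import PDFFile
        pdf_cls = PDFFile
    processed = []
    for pdf_path in collect_pdfs(path):
        with pdf_cls(path_or_bytes=str(pdf_path)) as pdf:
            pdf.find_startxref()
            if deep:
                pdf.search_deep()  # optional step, per the table above
            pdf.search_images()
            pdf.get_images()
        processed.append(pdf_path)
    return processed
```

With the package installed, `extract_all("pdf_samples")` would process every `.pdf` file in that directory, and `extract_all("file1.pdf", deep=False)` would rely on the xref table alone.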
Raw data
{
"_id": null,
"home_page": null,
"name": "findimagespdf",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.9",
"maintainer_email": null,
"keywords": "pdf, pdf-parser, image, pillow",
"author": "kurotom",
"author_email": "55354389+kurotom@users.noreply.github.com",
"download_url": "https://files.pythonhosted.org/packages/39/97/2d9a455690cafef7d1b75db4becb173b7fb969656314bedf4c3e907e398a/findimagespdf-0.1.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Search and save images from PDF files.",
"version": "0.1.2",
"project_urls": {
"Bug Tracker": "https://github.com/kurotom/findimagespdf/issues"
},
"split_keywords": [
"pdf",
" pdf-parser",
" image",
" pillow"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "cd37fe2fc018c0103d547d0da881371ccc44009e157145d630843f4c21152d39",
"md5": "4075241995e394e57165f0925873e986",
"sha256": "f7bbb0c1036ecb9606d07ab9f98036bc85801c4ff678be4cdb40920f008233f6"
},
"downloads": -1,
"filename": "findimagespdf-0.1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "4075241995e394e57165f0925873e986",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.9",
"size": 10153,
"upload_time": "2025-07-17T17:55:28",
"upload_time_iso_8601": "2025-07-17T17:55:28.986718Z",
"url": "https://files.pythonhosted.org/packages/cd/37/fe2fc018c0103d547d0da881371ccc44009e157145d630843f4c21152d39/findimagespdf-0.1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "39972d9a455690cafef7d1b75db4becb173b7fb969656314bedf4c3e907e398a",
"md5": "68ca394f76e216da1995b7a452306d20",
"sha256": "7b966d7c1e932f34432fd805c82b1e12b56b8cc63ea654d3f53d70dec86d2fce"
},
"downloads": -1,
"filename": "findimagespdf-0.1.2.tar.gz",
"has_sig": false,
"md5_digest": "68ca394f76e216da1995b7a452306d20",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.9",
"size": 8640,
"upload_time": "2025-07-17T17:55:29",
"upload_time_iso_8601": "2025-07-17T17:55:29.881766Z",
"url": "https://files.pythonhosted.org/packages/39/97/2d9a455690cafef7d1b75db4becb173b7fb969656314bedf4c3e907e398a/findimagespdf-0.1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-17 17:55:29",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "kurotom",
"github_project": "findimagespdf",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "pillow",
"specs": [
[
"==",
"11.3.0"
]
]
},
{
"name": "platformdirs",
"specs": [
[
"==",
"4.3.8"
]
]
}
],
"lcname": "findimagespdf"
}