adf2pdf

Name	adf2pdf
Version	0.8.3
home_page	https://github.com/gsauthof/adf2pdf
Summary	Automate the workflow around ADF scanning, OCR and PDF creation
upload_time	2023-08-15 21:09:48
maintainer
docs_url	None
author	Georg Sauthoff
requires_python	>=3
license
keywords	adf scanning sane duplex-scanning ocr tesseract pdf

adf2pdf - a tool that turns a batch of paper pages into a PDF
with a text layer.  By default, it detects empty pages (as they
may easily occur during duplex scanning) and excludes them from
the OCR and the resulting PDF.
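
The empty-page check can be pictured roughly as follows. This is a
minimal sketch, not adf2pdf's actual algorithm: the function name
`is_empty_page`, the darkness threshold and the 0.5 % ink ratio are
assumptions chosen for illustration. It works on raw 8-bit grayscale
pixel data, which Pillow could supply via
`Image.open(path).convert("L").tobytes()`.

```python
def is_empty_page(pixels: bytes, dark_below: int = 128,
                  max_ink_ratio: float = 0.005) -> bool:
    """Return True if almost no pixel is darker than `dark_below`.

    `pixels` holds 8-bit grayscale values (0 = black, 255 = white).
    Both thresholds are illustrative assumptions, not adf2pdf defaults.
    """
    if not pixels:
        return True  # nothing scanned counts as empty
    dark = sum(1 for p in pixels if p < dark_below)
    return dark / len(pixels) <= max_ink_ratio


# A white page with a few specks of dust vs. a page with real content.
blank = bytes([255] * 9990 + [0] * 10)      # 0.1 % dark pixels
printed = bytes([255] * 9000 + [0] * 1000)  # 10 % dark pixels
print(is_empty_page(blank), is_empty_page(printed))
```

A ratio-based test like this tolerates a little scanner noise, which
matters for the blank back sides produced by duplex scanning.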

For that, it uses [Sane's][5] [scanimage][6] for the scanning,
[Tesseract][4] for the [optical character recognition][ocr] (OCR), and
the Python packages [img2pdf][9], [Pillow (PIL)][10] and
[PyPDF2][11] for some image-processing tasks and PDF mangling.


Example:

    $ adf2pdf contract-xyz.pdf

2017, Georg Sauthoff <mail@gms.tf>

## Features

- Automatic document feed (ADF) support
- Fast empty page detection
- Overlapping of scanning, image processing, OCR and PDF creation
  to minimize the total runtime
- Fast creation of small PDFs using the fine [img2pdf][9] package
- Only safe compression methods are used, i.e. no error-prone
  symbol-segmentation-style compression such as [JBIG2][12] (used in
  [Xerox photocopiers][12]) or JB2 (used in the DjVu format).
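
The idea of running the stages concurrently can be illustrated with a
small sketch. It is not taken from adf2pdf: `scan_pages` and
`ocr_page` are hypothetical stand-ins that merely sleep, and the
thread pool is one assumed way to let OCR of earlier pages run while
later pages are still being scanned.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def scan_pages(n):
    """Stand-in for the scanner: yields page names one by one."""
    for i in range(n):
        time.sleep(0.01)  # pretend the ADF needs time per page
        yield f"page-{i:03d}.tif"


def ocr_page(image_name):
    """Stand-in for an OCR run on a single page."""
    time.sleep(0.02)  # pretend OCR is slower than scanning
    return image_name.replace(".tif", ".pdf")


def scan_and_ocr(n, workers=4):
    """Submit each page for OCR as soon as it is scanned."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(ocr_page, name) for name in scan_pages(n)]
        # Collect in scan order so the final PDF keeps the page order.
        return [f.result() for f in futures]


print(scan_and_ocr(3))
```

Collecting the results in submission order keeps the pages in sequence
even when individual OCR jobs finish out of order.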

## Install Instructions

Adf2pdf can be directly installed with [`pip`][13], e.g.

    $ pip3 install --user adf2pdf

or

    $ pip3 install adf2pdf

See also the [PyPI adf2pdf project page][14].

Alternatively, the Python file `adf2pdf.py` can be directly
executed in a cloned repository, e.g.:

    $ ./adf2pdf.py report.pdf

In addition to that, one can install the development version from
a cloned work-tree like this:

    $ pip3 install --user .

## Hardware Requirements

A scanner with automatic document feed (ADF) that is supported by
Sane. For example, the [Fujitsu ScanSnap S1500][1] works
well. That model supports duplex scanning, which is quite
convenient.

## Example continued

Running _adf2pdf_ on a 7-sheet example document takes 150 seconds
on an i7-6600U (Intel Skylake, 2 cores, 4 threads) CPU (using the
ADF of the Fujitsu ScanSnap S1500). With the defaults, _adf2pdf_
calls `scanimage` for duplex scanning into 600 dpi lineart (black
and white) images, which yields 14 page images for the 7 sheets. In
this example, 6 of them are empty and thus automatically excluded,
i.e. the resulting PDF contains just 8 pages.
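
For orientation, a `scanimage` batch call matching these defaults
might be assembled as in the sketch below. The helper name
`scanimage_command` and the output pattern are made up for
illustration, and the `--mode`, `--source` and `--resolution` values
are backend-specific assumptions for a Fujitsu ADF scanner; they are
not necessarily the exact arguments adf2pdf passes.

```python
import shlex


def scanimage_command(out_pattern="page-%03d.tif", resolution=600,
                      mode="Lineart", source="ADF Duplex"):
    """Build (but do not run) a scanimage batch command line.

    The option values are assumptions for a Fujitsu ADF scanner and
    may differ for other SANE backends.
    """
    return [
        "scanimage",
        f"--batch={out_pattern}",      # one output file per page side
        "--format=tiff",
        f"--resolution={resolution}",  # dpi
        f"--mode={mode}",              # black and white
        f"--source={source}",          # scan both sides of each sheet
    ]


cmd = scanimage_command()
print(" ".join(shlex.quote(arg) for arg in cmd))
```

With a scanner attached, such a list could be handed to
`subprocess.run(cmd, check=True)`; `scanimage --help` lists the
options the selected device actually supports.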

The resulting PDF contains a text layer from the OCR so that
one can search and copy-and-paste text. It is 1.1 MiB in size,
i.e. a page is stored in 132 KiB, on average.

## Software Requirements

The script assumes Tesseract version 4 by default. Version 3 can
be used as well, but the [new neural network system in Tesseract
4][8] performs orders of magnitude better than the old OCR model.
Tesseract 4.0.0 was released in late 2018; thus, distributions
released in that time frame may still include only version 3 in
their repositories (e.g. Fedora 29, while Fedora 30 features version
4). Since version 4 is so much better at OCR, I can't recommend it
enough over the older version 3.
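
To find out which major version is installed, the output of
`tesseract --version` can be inspected. The helper below is a sketch
and not part of adf2pdf; it assumes the first output line looks like
`tesseract 4.1.1`, and the sample string is made up for illustration.

```python
import re


def tesseract_major_version(version_output: str) -> int:
    """Extract the major version from `tesseract --version` output.

    Assumes the first line looks like 'tesseract 4.1.1'.
    """
    first_line = version_output.splitlines()[0] if version_output else ""
    match = re.search(r"tesseract\s+v?(\d+)\.", first_line)
    if match is None:
        raise ValueError(f"unrecognized version line: {first_line!r}")
    return int(match.group(1))


sample = "tesseract 4.1.1\n leptonica-1.79.0\n"
print(tesseract_major_version(sample))  # prints 4
```

When feeding it real output, capturing both stdout and stderr of the
`tesseract --version` call is the safer choice, since the stream used
may vary between releases.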

Tesseract 4 notes (in case you need to build it from the sources):

- [Build instructions][2] - warning: if the `autoconf-archive`
  dependency is missing, you'll get confusing autoconf error
  messages
- [Data files][3] - you need the training data for your
  languages of choice and the OSD data

Python packages:

- [img2pdf][9] (Fedora package: python3-img2pdf)
- [Pillow (PIL)][10] (Fedora package: python3-pillow-devel)
- [PyPDF2][11] (Fedora package: python3-PyPDF2)
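
A quick way to verify that the requirements above are present is
sketched below. The function `missing_requirements` is hypothetical
and not shipped with adf2pdf; it only checks that the Python modules
can be found (Pillow is imported as `PIL`) and that the `scanimage`
and `tesseract` executables are on the `PATH`.

```python
import importlib.util
import shutil


def missing_requirements(modules=("img2pdf", "PIL", "PyPDF2"),
                         programs=("scanimage", "tesseract")):
    """Return the names of Python modules and executables not found."""
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    missing += [p for p in programs if shutil.which(p) is None]
    return missing


print(missing_requirements() or "all requirements found")
```

The check says nothing about versions or installed Tesseract language
data; it only reports what cannot be located at all.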

[1]: http://www.fujitsu.com/us/products/computing/peripheral/scanners/product/eol/s1500/
[2]: https://github.com/tesseract-ocr/tesseract/wiki/Compiling-–-GitInstallation
[3]: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
[4]: https://en.wikipedia.org/wiki/Tesseract_(software)
[5]: https://en.wikipedia.org/wiki/Scanner_Access_Now_Easy
[6]: http://www.sane-project.org/man/scanimage.1.html
[7]: https://en.wikipedia.org/wiki/Optical_character_recognition
[8]: https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00
[9]: https://pypi.org/project/img2pdf/
[10]: http://python-pillow.github.io/
[11]: https://github.com/mstamy2/PyPDF2
[12]: https://en.wikipedia.org/wiki/JBIG2
[13]: https://en.wikipedia.org/wiki/Pip_(package_manager)
[14]: https://pypi.org/project/adf2pdf/
[ocr]: https://en.wikipedia.org/wiki/Optical_character_recognition

