pdfminify


Namepdfminify JSON
Version 0.2.2 PyPI version JSON
download
home_pagehttps://github.com/johndoe31415/pdfminify
SummaryPDF minification tool
upload_time2024-09-14 08:45:37
maintainerNone
docs_urlNone
authorJohannes Bauer
requires_pythonNone
licensegpl-3.0
keywords pdf minifier
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI
coveralls test coverage No coveralls.
            # pdfminify
[![Build Status](https://travis-ci.com/johndoe31415/pdfminify.svg?branch=master)](https://travis-ci.com/johndoe31415/pdfminify)


pdfminify is intended to re-compress PDF images while operating directly at PDF
level (i.e., no re-compression using PostScript). It parses the PDF file,
hashes all image references, re-links resources that are duplicate (i.e., have
the same MD5 hash value) and also is able to re-compress images which are
stored with lossless compression in order to use JPEG. It tries to calculate
the physical extent of the images (which, depending on the inclusion method,
can be a bit messy) and then is able to calculate the actual image resolution.
If it exceeds a given target resolution, it can also re-sample images (i.e.,
re-scaling them using ImageMagick) before re-integrating them into the target
PDF.

In particular, I wrote this software because PDFs generated by the libcairo PDF
export are *huge*. Images that are used are included a dozen times and only
lossless compression is used. Therefore I use pdfminify to decrease the
filesize later.

Another use for pdfminify is that it is able to convert PDFs into PDF/A-1b
compliant PDF files. Since this is something that's really difficult to do,
there are no guarantees regarding the resulting PDF -- please check for
yourself if the results still behave identical to your source version.

Finally, pdfminify is able to digitally sign your PDF files. For this you will
need an X.509 certificate and corresponding keypair. The signature will be
included in the PDF in the form a banner with metadata and sophisticated PDF
readers will be able to verify that the PDF has not been tampered with.

## Requirements
pdfminify needs at least Python 3.5 and the [llpdf Python
package](https://github.com/johndoe31415/llpdf) at least v0.0.4. It furthermore
uses the "identify" and "convert" utilities of ImageMagick. It uses the former
to determine the width, height, colorspace and bits per component of image
files and the latter to convert images from PNM (the internal format that
pdfminify is capable of writing natively) to JPEG.

## Acknowledgments
pdfminify uses the Toy Parser Generator (TPG) of Christophe Delord
(http://cdsoft.fr/tpg/). It is included (tpg.py file) and licensed under the
GNU LGPL v2.1 or any later version. Despite its name, it is far from a toy. In
fact, it is the most beautiful parser generator I have ever worked with and
makes grammars and parsing exceptionally easy, even for people without any
previous parsing experience. If you need parsing and use Python, TPG is *the*
go-to solution I would recommend. Seriously, it's amazing. Check it out.
Copyright and license details can be found in EXTERNAL_LICENSES.md.

## Usage
<pre>
$ pdfminify
usage: pdfminify [-h] [-d dpi] [-j] [--jpeg-quality percent]
                 [--no-downscaling] [--cropbox x,y,w,h]
                 [--unit {cm,inch,mm,native}] [--one-bit-alpha]
                 [--remove-alpha] [--background-color color]
                 [--strip-metadata] [--saveimgdir path] [--raw-output]
                 [--pretty-pdf] [--no-xref-stream] [--no-object-streams]
                 [--pdfa-1b] [--color-profile iccfile] [--sign-cert certfile]
                 [--sign-key keyfile] [--sign-chain pemfile] [--signer name]
                 [--sign-location hostname] [--sign-contact-info infotext]
                 [--sign-reason reason] [--sign-page pageno]
                 [--sign-font pfbfile] [--sign-pos x,y] [--embed-payload path]
                 [--no-pdf-tagging] [--decompress-data] [--analyze]
                 [--dump-xref-table] [--no-filters] [-v]
                 pdf_in pdf_out

Minifies PDF files, can crop them, convert them to PDF/A-1b and sign them
cryptographically.

positional arguments:
  pdf_in                Input PDF file.
  pdf_out               Output PDF file.

optional arguments:
  -h, --help            show this help message and exit
  -d dpi, --target-dpi dpi
                        Default resoulution to which images will be resampled
                        at. Defaults to 150 dots per inch (dpi).
  -j, --jpeg-images     Convert images to JPEG format. This means that lossy
                        compression is used that however often yields a much
                        higher compression ratio.
  --jpeg-quality percent
                        When converting images to JPEG format, the parameter
                        gives the compression quality. It is an integer from
                        0-100 (higher is better, but creates also larger
                        output files).
  --no-downscaling      Do not apply downscaling filter on the PDF, take all
                        images as they are.
  --cropbox x,y,w,h     Crop pages by additionally adding a /CropBox to all
                        pages of the PDF file. Pages will be cropped at offset
                        (x, y) to a width (w, h). The unit in which offset,
                        width and height are given can be specified using the
                        --unit parameter.
  --unit {cm,inch,mm,native}
                        Specify the unit of measurement that is used for input
                        and output. Can be any of cm, inch, mm, native,
                        defaults to native. One native PDF unit equals 1/72th
                        of an inch.
  --one-bit-alpha       Force all alpha channels in images to use a color
                        depth of one bit. This will make transparent images
                        have rougher edges, but saves additional space.
  --remove-alpha        Entirely remove the alpha channel (i.e., transparency)
                        of all images. The color which with transparent areas
                        are replaced with can be specified using the
                        --background-color command line option.
  --background-color color
                        When removing alpha channels, specifies the color that
                        should be used as background. Defaults to white.
                        Hexadecimal values can be specified as well in the
                        format '#rrggbb'.
  --strip-metadata      Strip metadata inside PDF objects that is not strictly
                        required, such as /PTEX.* entries inside object
                        content.
  --saveimgdir path     When specified, save all handled images as individual
                        files into the specified directory. Useful for image
                        extraction from a PDF as well as debugging.
  --raw-output          When saving images externally, save them in exactly
                        the format in which they're also present inside the
                        PDF. Note that this will produce raw image files in
                        some cases which won't have any header (but just
                        contain pixel data). Less useful for image extraction,
                        but can make sense for debugging.
  --pretty-pdf          Write pretty PDF files, i.e., format all dictionaries
                        so they're well-readable regarding indentation.
                        Increases required file size a tiny bit and increases
                        generation time of the PDF a little, but produces
                        easily debuggable PDFs.
  --no-xref-stream      Do not write the XRef table as a XRef stream, but
                        instead write a classical PDF XRef table and trailer.
                        This will increase the file size a bit, but might
                        improve compatibility with old PDF readers (XRef
                        streams are supported only starting with PDF 1.5).
                        XRef-streams are a prerequisite to object stream
                        compression, so if XRef-streams are disabled, so will
                        also be object streams (e.g, --no-object-streams is
                        implied).
  --no-object-streams   Do not compress objects into object-streams. Object
                        stream compression is introduced with PDF 1.5 and
                        means that multiple simple objects (without any stream
                        data) are concatenated together and compressed
                        together into one large stream object.
  --pdfa-1b             Try to create a PDF/A-1b compliant PDF document.
                        Implies --no-xref-stream, --no-object-streams,
                        --remove-alpha, removes transpacency groups and adds a
                        PDF/A entry into XMP metadata.
  --color-profile iccfile
                        When creating a PDF/A-1b PDF, gives the Internal Color
                        Consortium (ICC) color profile that should be embedded
                        into the PDF as part of the output intent. When
                        omitted, it defaults to the sRGB IEC61966 v2 "black
                        scaled" profile which is included within pdfminify.
  --sign-cert certfile  pdfminify can additionally cryptographically sign your
                        result PDF file with an X.509 certificate and
                        corresponding key. This parameter specifies the
                        certificate filename.
  --sign-key keyfile    This parameter specifies the key filename, also in PEM
                        format.
  --sign-chain pemfile  When signing a PDF, this gives the PEM-formatted
                        certificate chain file. Can be omitted if this should
                        not be included in the PKCS#7 signature.
  --signer name         The name of the person responsible for signing the
                        document.
  --sign-location hostname
                        The location of the signing; usually this is the
                        hostname of the computer that the signature is
                        generated on.
  --sign-contact-info infotext
                        A contact information field under which the signer can
                        be reached. Usually a phone number of email address.
  --sign-reason reason  The reason why the document was signed.
  --sign-page pageno    Page number on which the signature should be
                        displayed. Defaults to 1.
  --sign-font pfbfile   To be able to include metadata text in the signature
                        form, a T1 font must be included into the PDF. This
                        gives the filename of the font that is to be used for
                        that purpose. Must be in PFB (PostScript Font Binary)
                        file format and will be included in the result PDF in
                        full (i.e., not reduced to the glyphs that are
                        actually needed). Defaults to the Bitstream Charter
                        Serif font that is included within pdfminify.
  --sign-pos x,y        Determines where the signature will be placed on the
                        page. Units are determined by the --unit variable and
                        the position is relative to lower left corner.
  --embed-payload path  Embed an opaque file as a payload into the PDF as a
                        valid PDF object. This is useful only if you want to
                        place an easter egg inside your PDF file.
  --no-pdf-tagging      Omit tagging the PDF file with a reference to
                        pdfminify and the used version.
  --decompress-data     Decompress all FlateDecode compressed data in the
                        file. Useful only for debugging.
  --analyze             Perform an analysis of the read PDF file and dump out
                        useful information about it.
  --dump-xref-table     Dump out the XRef table that was read from the input
                        PDF file. Mainly useful for debugging.
  --no-filters          Do not apply any filters on the source PDF whatsoever,
                        just read it in and write it back out. This is useful
                        to reformat a PDF and/or debug the PDF reader/writer
                        facilities without introducing other sources of
                        malformed PDF generation.
  -v, --verbose         Show verbose messages during conversation. Can be
                        specified multiple times to increase log level.

pdfminify version 0.2.1; llpdf version: 0.0.4
</pre>

## Bugs
PDF is an inherently messy format and parsing it really isn't pretty. I've
implemented only what I needed to implement in order to get my job done. I'm
sure there's dozens of examples in which pdfminify just plainly doesn't work or
creates broken output PDFs. Please feel free to fix these errors and send in a
pull request. I think it's a genuinely useful tool and therefore would be nice
to support a wider variety of PDFs than just those that I happen to generate.

If you encounter an issue but are unable to fix it because you don't know
enough about Python, PDF (or either), I will still look at your issue if you
report it on GitHub. However due to my lack of time, I cannot promise that I
can fix it -- to be honest, PDF is so complicated that I'm not even sure that I
can find what the issue is. In any case, be sure to include a minimal example
PDF that demonstrates the issue in the bug report.

## License
pdfminify is licensed under the GNU GPL v3 (except for external components as
TPG, which has its own license). Later versions of the GPL are explicitly
excluded.

TPG (Toy Parser Generator) falls under its respective license (see
EXTERNAL_LICENSES.md).

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/johndoe31415/pdfminify",
    "name": "pdfminify",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "pdf, minifier",
    "author": "Johannes Bauer",
    "author_email": "joe@johannes-bauer.com",
    "download_url": "https://files.pythonhosted.org/packages/a4/af/8ea6ebcf91433364d45f6b3b3d766e9c10b8c5b0e5c55982764e709fe651/pdfminify-0.2.2.tar.gz",
    "platform": null,
    "description": "# pdfminify\n[![Build Status](https://travis-ci.com/johndoe31415/pdfminify.svg?branch=master)](https://travis-ci.com/johndoe31415/pdfminify)\n\n\npdfminify is intended to re-compress PDF images while operating directly at PDF\nlevel (i.e., no re-compression using PostScript). It parses the PDF file,\nhashes all image references, re-links resources that are duplicate (i.e., have\nthe same MD5 hash value) and also is able to re-compress images which are\nstored with lossless compression in order to use JPEG. It tries to calculate\nthe physical extent of the images (which, depending on the inclusion method,\ncan be a bit messy) and then is able to calculate the actual image resolution.\nIf it exceeds a given target resolution, it can also re-sample images (i.e.,\nre-scaling them using ImageMagick) before re-integrating them into the target\nPDF.\n\nIn particular, I wrote this software because PDFs generated by the libcairo PDF\nexport are *huge*. Images that are used are included a dozen times and only\nlossless compression is used. Therefore I use pdfminify to decrease the\nfilesize later.\n\nAnother use for pdfminify is that it is able to convert PDFs into PDF/A-1b\ncompliant PDF files. Since this is something that's really difficult to do,\nthere are no guarantees regarding the resulting PDF -- please check for\nyourself if the results still behave identical to your source version.\n\nFinally, pdfminify is able to digitally sign your PDF files. For this you will\nneed an X.509 certificate and corresponding keypair. The signature will be\nincluded in the PDF in the form a banner with metadata and sophisticated PDF\nreaders will be able to verify that the PDF has not been tampered with.\n\n## Requirements\npdfminify needs at least Python 3.5 and the [llpdf Python\npackage](https://github.com/johndoe31415/llpdf) at least v0.0.4. It furthermore\nuses the \"identify\" and \"convert\" utilities of ImageMagick. It uses the former\nto determine the width, height, colorspace and bits per component of image\nfiles and the latter to convert images from PNM (the internal format that\npdfminify is capable of writing natively) to JPEG.\n\n## Acknowledgments\npdfminify uses the Toy Parser Generator (TPG) of Christophe Delord\n(http://cdsoft.fr/tpg/). It is included (tpg.py file) and licensed under the\nGNU LGPL v2.1 or any later version. Despite its name, it is far from a toy. In\nfact, it is the most beautiful parser generator I have ever worked with and\nmakes grammars and parsing exceptionally easy, even for people without any\nprevious parsing experience. If you need parsing and use Python, TPG is *the*\ngo-to solution I would recommend. Seriously, it's amazing. Check it out.\nCopyright and license details can be found in EXTERNAL_LICENSES.md.\n\n## Usage\n<pre>\n$ pdfminify\nusage: pdfminify [-h] [-d dpi] [-j] [--jpeg-quality percent]\n                 [--no-downscaling] [--cropbox x,y,w,h]\n                 [--unit {cm,inch,mm,native}] [--one-bit-alpha]\n                 [--remove-alpha] [--background-color color]\n                 [--strip-metadata] [--saveimgdir path] [--raw-output]\n                 [--pretty-pdf] [--no-xref-stream] [--no-object-streams]\n                 [--pdfa-1b] [--color-profile iccfile] [--sign-cert certfile]\n                 [--sign-key keyfile] [--sign-chain pemfile] [--signer name]\n                 [--sign-location hostname] [--sign-contact-info infotext]\n                 [--sign-reason reason] [--sign-page pageno]\n                 [--sign-font pfbfile] [--sign-pos x,y] [--embed-payload path]\n                 [--no-pdf-tagging] [--decompress-data] [--analyze]\n                 [--dump-xref-table] [--no-filters] [-v]\n                 pdf_in pdf_out\n\nMinifies PDF files, can crop them, convert them to PDF/A-1b and sign them\ncryptographically.\n\npositional arguments:\n  pdf_in                Input PDF file.\n  pdf_out               Output PDF file.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -d dpi, --target-dpi dpi\n                        Default resoulution to which images will be resampled\n                        at. Defaults to 150 dots per inch (dpi).\n  -j, --jpeg-images     Convert images to JPEG format. This means that lossy\n                        compression is used that however often yields a much\n                        higher compression ratio.\n  --jpeg-quality percent\n                        When converting images to JPEG format, the parameter\n                        gives the compression quality. It is an integer from\n                        0-100 (higher is better, but creates also larger\n                        output files).\n  --no-downscaling      Do not apply downscaling filter on the PDF, take all\n                        images as they are.\n  --cropbox x,y,w,h     Crop pages by additionally adding a /CropBox to all\n                        pages of the PDF file. Pages will be cropped at offset\n                        (x, y) to a width (w, h). The unit in which offset,\n                        width and height are given can be specified using the\n                        --unit parameter.\n  --unit {cm,inch,mm,native}\n                        Specify the unit of measurement that is used for input\n                        and output. Can be any of cm, inch, mm, native,\n                        defaults to native. One native PDF unit equals 1/72th\n                        of an inch.\n  --one-bit-alpha       Force all alpha channels in images to use a color\n                        depth of one bit. This will make transparent images\n                        have rougher edges, but saves additional space.\n  --remove-alpha        Entirely remove the alpha channel (i.e., transparency)\n                        of all images. The color which with transparent areas\n                        are replaced with can be specified using the\n                        --background-color command line option.\n  --background-color color\n                        When removing alpha channels, specifies the color that\n                        should be used as background. Defaults to white.\n                        Hexadecimal values can be specified as well in the\n                        format '#rrggbb'.\n  --strip-metadata      Strip metadata inside PDF objects that is not strictly\n                        required, such as /PTEX.* entries inside object\n                        content.\n  --saveimgdir path     When specified, save all handled images as individual\n                        files into the specified directory. Useful for image\n                        extraction from a PDF as well as debugging.\n  --raw-output          When saving images externally, save them in exactly\n                        the format in which they're also present inside the\n                        PDF. Note that this will produce raw image files in\n                        some cases which won't have any header (but just\n                        contain pixel data). Less useful for image extraction,\n                        but can make sense for debugging.\n  --pretty-pdf          Write pretty PDF files, i.e., format all dictionaries\n                        so they're well-readable regarding indentation.\n                        Increases required file size a tiny bit and increases\n                        generation time of the PDF a little, but produces\n                        easily debuggable PDFs.\n  --no-xref-stream      Do not write the XRef table as a XRef stream, but\n                        instead write a classical PDF XRef table and trailer.\n                        This will increase the file size a bit, but might\n                        improve compatibility with old PDF readers (XRef\n                        streams are supported only starting with PDF 1.5).\n                        XRef-streams are a prerequisite to object stream\n                        compression, so if XRef-streams are disabled, so will\n                        also be object streams (e.g, --no-object-streams is\n                        implied).\n  --no-object-streams   Do not compress objects into object-streams. Object\n                        stream compression is introduced with PDF 1.5 and\n                        means that multiple simple objects (without any stream\n                        data) are concatenated together and compressed\n                        together into one large stream object.\n  --pdfa-1b             Try to create a PDF/A-1b compliant PDF document.\n                        Implies --no-xref-stream, --no-object-streams,\n                        --remove-alpha, removes transpacency groups and adds a\n                        PDF/A entry into XMP metadata.\n  --color-profile iccfile\n                        When creating a PDF/A-1b PDF, gives the Internal Color\n                        Consortium (ICC) color profile that should be embedded\n                        into the PDF as part of the output intent. When\n                        omitted, it defaults to the sRGB IEC61966 v2 \"black\n                        scaled\" profile which is included within pdfminify.\n  --sign-cert certfile  pdfminify can additionally cryptographically sign your\n                        result PDF file with an X.509 certificate and\n                        corresponding key. This parameter specifies the\n                        certificate filename.\n  --sign-key keyfile    This parameter specifies the key filename, also in PEM\n                        format.\n  --sign-chain pemfile  When signing a PDF, this gives the PEM-formatted\n                        certificate chain file. Can be omitted if this should\n                        not be included in the PKCS#7 signature.\n  --signer name         The name of the person responsible for signing the\n                        document.\n  --sign-location hostname\n                        The location of the signing; usually this is the\n                        hostname of the computer that the signature is\n                        generated on.\n  --sign-contact-info infotext\n                        A contact information field under which the signer can\n                        be reached. Usually a phone number of email address.\n  --sign-reason reason  The reason why the document was signed.\n  --sign-page pageno    Page number on which the signature should be\n                        displayed. Defaults to 1.\n  --sign-font pfbfile   To be able to include metadata text in the signature\n                        form, a T1 font must be included into the PDF. This\n                        gives the filename of the font that is to be used for\n                        that purpose. Must be in PFB (PostScript Font Binary)\n                        file format and will be included in the result PDF in\n                        full (i.e., not reduced to the glyphs that are\n                        actually needed). Defaults to the Bitstream Charter\n                        Serif font that is included within pdfminify.\n  --sign-pos x,y        Determines where the signature will be placed on the\n                        page. Units are determined by the --unit variable and\n                        the position is relative to lower left corner.\n  --embed-payload path  Embed an opaque file as a payload into the PDF as a\n                        valid PDF object. This is useful only if you want to\n                        place an easter egg inside your PDF file.\n  --no-pdf-tagging      Omit tagging the PDF file with a reference to\n                        pdfminify and the used version.\n  --decompress-data     Decompress all FlateDecode compressed data in the\n                        file. Useful only for debugging.\n  --analyze             Perform an analysis of the read PDF file and dump out\n                        useful information about it.\n  --dump-xref-table     Dump out the XRef table that was read from the input\n                        PDF file. Mainly useful for debugging.\n  --no-filters          Do not apply any filters on the source PDF whatsoever,\n                        just read it in and write it back out. This is useful\n                        to reformat a PDF and/or debug the PDF reader/writer\n                        facilities without introducing other sources of\n                        malformed PDF generation.\n  -v, --verbose         Show verbose messages during conversation. Can be\n                        specified multiple times to increase log level.\n\npdfminify version 0.2.1; llpdf version: 0.0.4\n</pre>\n\n## Bugs\nPDF is an inherently messy format and parsing it really isn't pretty. I've\nimplemented only what I needed to implement in order to get my job done. I'm\nsure there's dozens of examples in which pdfminify just plainly doesn't work or\ncreates broken output PDFs. Please feel free to fix these errors and send in a\npull request. I think it's a genuinely useful tool and therefore would be nice\nto support a wider variety of PDFs than just those that I happen to generate.\n\nIf you encounter an issue but are unable to fix it because you don't know\nenough about Python, PDF (or either), I will still look at your issue if you\nreport it on GitHub. However due to my lack of time, I cannot promise that I\ncan fix it -- to be honest, PDF is so complicated that I'm not even sure that I\ncan find what the issue is. In any case, be sure to include a minimal example\nPDF that demonstrates the issue in the bug report.\n\n## License\npdfminify is licensed under the GNU GPL v3 (except for external components as\nTPG, which has its own license). Later versions of the GPL are explicitly\nexcluded.\n\nTPG (Toy Parser Generator) falls under its respective license (see\nEXTERNAL_LICENSES.md).\n",
    "bugtrack_url": null,
    "license": "gpl-3.0",
    "summary": "PDF minification tool",
    "version": "0.2.2",
    "project_urls": {
        "Download": "https://github.com/johndoe31415/pdfminify/archive/v0.2.2.tar.gz",
        "Homepage": "https://github.com/johndoe31415/pdfminify"
    },
    "split_keywords": [
        "pdf",
        " minifier"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a4af8ea6ebcf91433364d45f6b3b3d766e9c10b8c5b0e5c55982764e709fe651",
                "md5": "a4812667e66a83c7292d08339adf54c1",
                "sha256": "a56a1bc65d23a8a0639f9d8507aebe6e5ce55710590bb9af0a035ebf22261fe7"
            },
            "downloads": -1,
            "filename": "pdfminify-0.2.2.tar.gz",
            "has_sig": false,
            "md5_digest": "a4812667e66a83c7292d08339adf54c1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 102632,
            "upload_time": "2024-09-14T08:45:37",
            "upload_time_iso_8601": "2024-09-14T08:45:37.273067Z",
            "url": "https://files.pythonhosted.org/packages/a4/af/8ea6ebcf91433364d45f6b3b3d766e9c10b8c5b0e5c55982764e709fe651/pdfminify-0.2.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-14 08:45:37",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "johndoe31415",
    "github_project": "pdfminify",
    "travis_ci": true,
    "coveralls": false,
    "github_actions": false,
    "lcname": "pdfminify"
}
        
Elapsed time: 0.34126s