playa-pdf


- Name: playa-pdf
- Version: 0.2.5
- Summary: PLAYA ain't a LAYout Analyzer
- Upload time: 2024-12-15 18:08:52
- Requires Python: >=3.8
- Keywords: pdf parser, text mining

# **P**LAYA-PDF is a **LA**z**Y** **A**nalyzer for **PDF** 🏖️

## About

There are already too many PDF libraries, unfortunately none of which
does everything that everybody wants it to do, and we probably don't
need another one. It is not recommended that you use this library for
anything at all, but if you were going to use it for something, it
would be specifically one of these things and nothing else:

1. Accessing the document catalog, page tree, structure tree, content
   streams, cross-reference table, XObjects, and other low-level PDF
   metadata.
2. Obtaining the absolute position and attributes of every character,
   line, path, and image in every page of a PDF.
   
If you just want to extract text from a PDF, there are a lot of better
and faster tools and libraries out there, see [these
benchmarks](https://github.com/py-pdf/benchmarks) for a summary (TL;DR
pypdfium2 is probably what you want, but pdfplumber does a nice job of
converting PDF to ASCII art).

The purpose of PLAYA is to provide an efficient, pure-Python and
Pythonic (for its author's definition of the term), lazy interface to
the internals of PDF files.

## Installation

Installing it should be really simple as long as you have Python 3.8
or newer:

    pipx install playa-pdf

Yes, it's not just "playa".  Sorry about that.

## Usage

Do you want to get stuff out of a PDF?  You have come to the right
place!  Let's open up a PDF and see what's in it:

```python
import playa

pdf = playa.open("my_awesome_document.pdf")
raw_byte_stream = pdf.buffer
a_bunch_of_tokens = list(pdf.tokens)
a_bunch_of_indirect_objects = list(pdf)
```

The raw PDF tokens and objects are probably not terribly useful to
you, but you might find them interesting.  Note that these are
"indirect objects" where the actual object is accompanied by an object
number and generation number:

```python
for objid, genno, obj in pdf:
    ...
# or also
for obj in pdf:
    obj.objid, obj.genno, obj.obj
```

Also, these will only be the top-level objects and not those found
inside object streams (the streams are themselves indirect objects).
You can iterate over all indirect objects including object streams
using the `objects` property:

```python
for obj in pdf.objects:
    obj.objid, obj.genno, obj.obj
```

In this case it is possible you will encounter multiple objects with
the same `objid` due to the "incremental updates" feature of PDF.
Currently, iterating over the objects in a particular stream is
possible, but complicated.  It may be simplified in PLAYA 0.3.
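
To see which object numbers occur more than once, you could collect the objects by `objid`.  This is only a sketch: the triples below are made-up stand-ins for what iterating over `pdf.objects` gives you, and `by_objid` and `repeated` are hypothetical names.

```python
from collections import defaultdict

# Stand-in (objid, genno, obj) triples; with PLAYA these would come from
# iterating over `pdf.objects` (the values here are made up).
indirect_objects = [
    (1, 0, {"Type": "Catalog"}),
    (2, 0, {"Type": "Pages"}),
    (1, 0, {"Type": "Catalog", "Lang": "en"}),  # same objid, seen again
]

# Collect every occurrence of each object number
by_objid = defaultdict(list)
for objid, genno, obj in indirect_objects:
    by_objid[objid].append((genno, obj))

# Object numbers that occur more than once
repeated = sorted(
    objid for objid, versions in by_objid.items() if len(versions) > 1
)
print(repeated)
```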

You can also access indirect objects by number (this will return the
object with most recent generation number):

```python
a_particular_object = pdf[42]
```

Your PDF document probably has some pages.  How many?  What are their
numbers/labels?  (they could be things like "xviii", "a", or "42", for
instance)

```python
npages = len(pdf.pages)
page_numbers = [page.label for page in pdf.pages]
```

What's in the table of contents?  (NOTE: this API will likely change
in PLAYA 0.3 as it is not Lazy nor does it properly represent the
hierarchy of the document outline)

```python
for entry in pdf.outlines:
    level, title, dest, action, struct_element = entry
    # or
    entry.level, entry.title, entry.dest, entry.action, entry.se
    ...
```
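
Since the entries arrive as a flat sequence, you may want to rebuild the hierarchy from `level` yourself.  Here is a minimal sketch, assuming levels start at 1 and entries come in document order; the data and the `nest_outline` helper are hypothetical, not part of PLAYA.

```python
# Stand-in (level, title) pairs in document order; hypothetical data, with
# levels starting at 1 (an assumption, check what your PDF actually yields).
flat_outline = [
    (1, "Introduction"),
    (2, "Background"),
    (2, "Scope"),
    (1, "Methods"),
    (2, "Data"),
    (3, "Cleaning"),
]


def nest_outline(entries):
    """Turn flat (level, title) pairs into nested {"title", "children"} dicts."""
    root = {"title": None, "children": []}
    stack = [(0, root)]  # (level, node) for each currently open ancestor
    for level, title in entries:
        node = {"title": title, "children": []}
        # Close every open entry that is not an ancestor of this one
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root["children"]


tree = nest_outline(flat_outline)
for top in tree:
    print(top["title"], [child["title"] for child in top["children"]])
```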

If you are lucky it has a "logical structure tree".  The elements here
might even be referenced from the table of contents!  (or, they might
not... with PDF you never know).  (NOTE: this API will definitely
change in PLAYA 0.3 as it is not the least bit Lazy)

```python
structure = pdf.structtree
for element in structure:
   for child in element:
       ...
```

Now perhaps we want to look at a specific page.  Okay!  You can also
look at its contents, more on that in a bit:

```python
page = pdf.pages[0]        # they are numbered from 0
page = pdf.pages["xviii"]  # but you can get them by label (a string)
page = pdf.pages["42"]     # or "logical" page number (also a string)
print(f"Page {page.label} is {page.width} x {page.height}")
```

## Accessing content

What are these "contents" of which you speak, which were surely
created by a Content Creator?  Well, you can look at the stream of
tokens or mysterious PDF objects:

```python
for token in page.tokens:
    ...
for obj in page.contents:  # `obj` rather than `object`, to avoid shadowing the builtin
    ...
```

But that isn't very useful, so you can also access actual textual and
graphical objects (if you wanted to, for instance, do layout
analysis).

```python
for item in page:
    ...
```

Because it is quite inefficient to expand, calculate, and copy every
possible piece of information, PLAYA gives you some options here.
Wherever possible this information can be computed lazily, but this
involves some more work on the user's part.

## Dictionary-based API

If, on the other hand, **you** are lazy, then you can just use
`page.layout`, which will flatten everything for you into a friendly
dictionary representation (technically a
[`TypedDict`](https://typing.readthedocs.io/en/latest/spec/typeddict.html#typeddict)).
It looks a lot like what `pdfplumber` gives you, except possibly in
a different coordinate space, as defined
[below](#an-important-note-about-coordinate-spaces).

```python
for dic in page.layout:
    # f-strings are needed for the {...} fields to be interpolated
    print(f"it is a {dic['object_type']} at ({dic['x0']}, {dic['y0']})")
    print(f"    the color is {dic['stroking_color']}")
    print(f"    the text is {dic['text']}")
    print(f"    it is in MCS {dic['mcid']} which is a {dic['tag']}")
    print(f"    it is also in Form XObject {dic['xobjid']}")
```

This is for instance quite useful for doing "Artificial Intelligence",
or if you like wasting time and energy for no good reason, but I
repeat myself.  For instance, you can write the layout of a whole
document (`pdf.layout`) to a CSV file:

```python
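import csv

# "layout.csv" is just an example output path
with open("layout.csv", "w", newline="") as outfh:
    writer = csv.DictWriter(outfh, fieldnames=playa.fieldnames)
    writer.writeheader()
    for dic in pdf.layout:
        writer.writerow(dic)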
```

You can also create a Pandas DataFrame:

```python
import pandas

df = pandas.DataFrame.from_records(pdf.layout)
```

Or a Polars DataFrame or LazyFrame:

```python
import polars

df = polars.DataFrame(pdf.layout, schema=playa.schema)
```
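
Because the records are plain dictionaries, ordinary Python tools work on them too.  As a rough sketch, here is one way to collect text by marked content ID.  The records below are made-up stand-ins that use only a few of the keys shown above, and the `object_type` values are assumptions rather than a list of what PLAYA emits.

```python
from collections import defaultdict

# Stand-in records with a few of the keys shown above; values are made up.
records = [
    {"object_type": "char", "text": "H", "mcid": 0, "tag": "P"},
    {"object_type": "char", "text": "i", "mcid": 0, "tag": "P"},
    {"object_type": "rect", "text": None, "mcid": None, "tag": None},
    {"object_type": "char", "text": "!", "mcid": 1, "tag": "Span"},
]

# Concatenate the text of every record that has some, keyed by mcid
text_by_mcid = defaultdict(str)
for dic in records:
    if dic.get("text") is not None:
        text_by_mcid[dic["mcid"]] += dic["text"]

for mcid, text in text_by_mcid.items():
    print(f"MCID {mcid}: {text!r}")
```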

If you have more specific needs or want better performance, then read on.

## An important note about coordinate spaces

Wait, what is this "absolute position" of which you speak, and which
PLAYA gives you?  It's important to understand that there is no
definition of "device space" in the PDF standard, and I quote (PDF 1.7
sec 8.3.2.2):

> A particular device’s coordinate system is called its device
space. The origin of the device space on different devices can fall in
different places on the output page; on displays, the origin can vary
depending on the window system. Because the paper or other output
medium moves through different printers and imagesetters in different
directions, the axes of their device spaces may be oriented
differently.

You may immediately think of CSS when you hear the phrase "absolute
position" and this is exactly what PLAYA gives you as its default
device space, specifically:

- Units are default user space units (1/72 of an inch).
- `(0, 0)` is the top-left corner of the page, as defined by its
  `MediaBox` after rotation is applied.
- Coordinates increase from the top-left corner of the page towards
  the bottom-right corner.

However, for compatibility with `pdfminer.six`, you can also pass
`space="page"` to `playa.open`.  In this case, `(0, 0)` is the
bottom-left corner of the page as defined by the `MediaBox`, after
rotation, and coordinates increase from the bottom-left corner of the
page towards the top-right, as they do in PDF user space.

If you don't care about absolute positioning, you can use
`space="default"`, which may be somewhat faster in the future (currently
it isn't).  In this case, no translation or rotation of the default
user space is done (in other words any values of `MediaBox` or
`Rotate` in the page dictionary are simply ignored).  This is **definitely**
what you want if you wish to take advantage of the coordinates that
you may find in `outlines`, `dests`, tags and logical structure
elements.
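
The relationship between the default top-left space and `space="page"` can be sketched as a flip of the y axis.  This is a minimal sketch, assuming an unrotated page whose `MediaBox` starts at `(0, 0)`; `flip_bbox` and `page_height` are hypothetical names, not part of PLAYA.

```python
def flip_bbox(bbox, page_height):
    """Convert a normalized bbox between top-left and bottom-left origins.

    Only the y axis changes direction.  The two y values swap roles so
    that the result is still normalized (first y <= second y).
    """
    x0, y0, x1, y1 = bbox
    return (x0, page_height - y1, x1, page_height - y0)


# A US Letter page is 792 default user space units tall (11 in * 72)
top_left_bbox = (72, 100, 200, 120)
bottom_left_bbox = flip_bbox(top_left_bbox, 792)
print(bottom_left_bbox)
```

Applying the function twice gives back the original box, so the same helper converts in either direction.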

## Lazy object API

Fundamentally you may just want to know *what* is *where* on the page,
and PLAYA has you covered there (note that the bbox is normalized, and
in the aforementioned interpretation of "device space"):

```python
for obj in page:
    print(f"{obj.object_type} at {obj.bbox}")
    left, top, right, bottom = obj.bbox
    print(f"  top left is {left, top}")
    print(f"  bottom right is {right, bottom}")
```

Another important piece of information (which `pdfminer.six` does not
really handle) is the relationship between layout and logical
structure, done using *marked content sections*:

```python
for obj in page:
    print(f"{obj.object_type} is in marked content section {obj.mcs.mcid}")
    print(f"    which is tag {obj.mcs.tag.name}")
    print(f"    with properties {obj.mcs.tag.props}")
```

The `mcid` here is the same one referenced in elements of the
structure tree as shown above (but remember that `tag` has nothing to
do with the structure tree element, because Reasons).  A marked
content section does not necessarily have an `mcid` or `props`, but it
will *always* have a `tag`.

PDF also has the concept of "marked content points". PLAYA supports
these with objects of `object_type == "tag"`.  The tag name and
properties are also accessible via the `mcs` attribute.

### Form XObjects

A PDF page may also contain "Form XObjects" which are like tiny
embedded PDF documents (they have nothing to do with fillable forms).
The lazy API (because it is lazy) **will not expand these for you**
which may be a source of surprise.  You can identify them because they
have `object_type == "xobject"`.  The layout objects inside them are
accessible by iteration, as with pages (but **not** documents):

```python
for obj in page:
    if obj.object_type == "xobject":
        for item in obj:
            ...
```

You can also iterate over them in the page context with `page.xobjects`:

```python
for xobj in page.xobjects:
    for item in xobj:
        ...
```

Exceptionally, these have a few more features than the ordinary
`ContentObject` - you can look at their raw stream contents as well as
the tokens, and you can also see raw, mysterious PDF objects with
`contents`.

### Graphics state

You may also wish to know what color an object is, and other aspects of
what PDF refers to as the *graphics state*, which is accessible
through `obj.gstate`.  This is a mutable object, and since there are
quite a few parameters in the graphics state, PLAYA does not create a
copy of it for every object in the layout - you are responsible for
saving them yourself if you should so desire.  This is not
particularly onerous, because the parameters themselves are immutable:

```python
other_stuff = []
for obj in page:
    print(f"{obj.object_type} at {obj.bbox} is:")
    print(f"    {obj.gstate.scolor} stroking color")
    print(f"    {obj.gstate.ncolor} non-stroking color")
    print(f"    {obj.gstate.dash} dashing style")
    # save the (immutable) parameters, not the mutable gstate itself
    my_stuff = (obj.gstate.dash, obj.gstate.scolor, obj.gstate.ncolor)
    other_stuff.append(my_stuff)  # it's safe there
```
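
To see why saving the parameters matters, here is a small illustration of the pitfall.  `FakeGState` is a hypothetical stand-in for a mutable graphics state, not PLAYA's actual class; only the `scolor` and `ncolor` attribute names are taken from the example above.

```python
from dataclasses import dataclass


@dataclass
class FakeGState:
    """Hypothetical stand-in for a mutable graphics state."""

    scolor: tuple = (0,)
    ncolor: tuple = (0,)


gstate = FakeGState()
kept_reference = gstate                    # risky: same mutable object
snapshot = (gstate.scolor, gstate.ncolor)  # safe: immutable values

# Simulate the interpreter moving on and changing the stroking color
gstate.scolor = (1, 0, 0)

print(kept_reference.scolor)  # reflects the later change
print(snapshot[0])            # still the value at the time of the snapshot
```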

For compatibility with `pdfminer.six`, PLAYA, even though it is not a
layout analyzer, can do some basic interpretation of paths.  Again,
this is lazy.  If you don't care about them, you just get objects with
`object_type` of `"path"`, which you can ignore.  PLAYA won't even
compute the bounding box (which isn't all that slow, but still).  If
you *do* care, then you have some options.  You can look at the actual
path segments in user space (fast):

```python
for seg in path.raw_segments:
   print(f"segment: {seg}")
```

Or in PLAYA's "device space" (not so fast):

```python
for seg in path.segments:
   print(f"segment: {seg}")
```

This API doesn't try to interpret paths for you.  You only get
`PathSegment`s.  But for convenience you can get them grouped by
subpaths as created using the `m` or `re` operators:

```python
for subpath in path:
   for seg in subpath.segments:
       print(f"segment: {seg}")
```
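
The grouping rule itself is easy to picture.  The following is a sketch of the idea only, not PLAYA's implementation: segments are represented as hypothetical `(operator, points)` tuples, and a new subpath starts at every `m` or `re`.

```python
# Stand-in segments as (operator, points) tuples; hypothetical data that
# does not use PLAYA's `PathSegment` type.
segments = [
    ("m", (0, 0)), ("l", (10, 0)), ("l", (10, 10)), ("h", ()),
    ("re", (20, 20, 5, 5)),
    ("m", (50, 50)), ("l", (60, 60)),
]


def split_subpaths(segs):
    """Start a new group at each `m` or `re` operator."""
    subpaths = []
    for seg in segs:
        operator = seg[0]
        if operator in ("m", "re") or not subpaths:
            subpaths.append([])
        subpaths[-1].append(seg)
    return subpaths


subpaths = split_subpaths(segments)
for subpath in subpaths:
    print([operator for operator, _ in subpath])
```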

### Text Objects

Since most PDFs consist primarily of text, obviously you may wish to
know something about the actual text (or the `ActualText`, which you
can sometimes find in `obj.mcs.tag.props["ActualText"]`).  This is
more difficult than it looks, as fundamentally PDF just positions
arbitrarily numbered glyphs on a page, and the vast majority of PDFs
embed their own fonts, using *subsetting* to include only the glyphs
actually used.

Whereas `pdfminer.six` would break down text objects into their
individual glyphs (which might or might not correspond to characters),
this is not always what you want, and moreover it is computationally
quite expensive.  So PLAYA, by default, does not do this.  If you
don't need to know the actual bounding box of a text object, then
don't access `obj.bbox` and it won't be computed.  If you don't need
to know the position of each glyph but simply want the Unicode
characters, then just look at `obj.chars`.

It is important to understand that `obj.chars` may or may not correspond
to the actual text that a human will read on the page.  To
actually extract *text* from a PDF necessarily involves Heuristics
or Machine Learning (yes, capitalized, like that) and PLAYA does not do
either of those things.

This is because PDFs, especially ones produced by OCR, don't organize
text objects in any meaningful fashion, so you will want to actually
look at the glyphs.  This becomes a matter of iterating over the item,
giving you, well, more items, which are the individual glyphs:

```python
for glyph in item:
    print(f"Glyph has CID {glyph.cid} and Unicode {glyph.text}")
```

Following the PDF specification, PLAYA by default considers the
grouping of glyphs into strings irrelevant.  I *might*
consider separating the strings in the future.

PDF has the concept of a *text state* which determines some aspects of
how text is rendered.  You can obviously access this through
`glyph.textstate` - note that the text state, like the graphics state,
is mutable, so you will have to copy it or save individual parameters
that you might care about.  This may be a major footgun so watch out.

PLAYA doesn't guarantee that text objects come at you in anything
other than the order they occur in the file (but it does guarantee
that).
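
If you need reading order, you have to impose it yourself.  Below is a deliberately naive Heuristic, sketched under loud assumptions: glyphs are stand-in `(text, left, top)` tuples in a top-left-origin space, and two glyphs belong to the same line if their tops are within a fixed `tolerance`.  Real documents (columns, rotated text, superscripts) will defeat it.

```python
# Stand-in glyphs as (text, left, top); made-up values, deliberately
# listed out of reading order.
glyphs = [
    ("o", 20.0, 100.2), ("H", 10.0, 100.0), ("!", 30.0, 99.9),
    ("y", 20.0, 120.0), ("B", 10.0, 120.1),
]


def group_into_lines(glyphs, tolerance=2.0):
    """Group glyphs whose tops are close together, then sort left to right."""
    lines = []
    for text, left, top in sorted(glyphs, key=lambda g: g[2]):
        # Compare against the top of the first glyph in the current line
        if lines and abs(lines[-1]["top"] - top) <= tolerance:
            lines[-1]["glyphs"].append((left, text))
        else:
            lines.append({"top": top, "glyphs": [(left, text)]})
    return ["".join(t for _, t in sorted(line["glyphs"])) for line in lines]


lines = group_into_lines(glyphs)
print(lines)
```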

In some cases you might want to look at the abovementioned `ActualText`
attribute to reliably extract text, particularly if the PDF was
created by certain versions of LibreOffice, but in their infinite
wisdom, Adobe made `ActualText` a property of *marked content
sections* and not *text objects*, so you may be out of luck if you
want to actually match these characters to glyphs.  Sorry, I don't
write the standards.

## Conclusion

As mentioned earlier, if you really just want to do text extraction,
there's always pdfplumber, pymupdf, pypdfium2, pikepdf, pypdf, borb,
etc, etc, etc.

## Acknowledgement

This repository obviously includes code from `pdfminer.six`.  Original
license text is included in
[LICENSE](https://github.com/dhdaines/playa/blob/main/LICENSE).  The
license itself has not changed!

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "playa-pdf",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "pdf parser, text mining",
    "author": null,
    "author_email": "David Huggins-Daines <dhd@ecolingui.ca>",
    "download_url": "https://files.pythonhosted.org/packages/6c/18/a652ca7d4c8ed625814df27915120d313bbd48bfb5fe79323e4d5f8780de/playa_pdf-0.2.5.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "PLAYA ain't a LAYout Analyzer",
    "version": "0.2.5",
    "project_urls": {
        "Homepage": "https://dhdaines.github.io/playa"
    },
    "split_keywords": [
        "pdf parser",
        " text mining"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8277bdd83da492fb3e448a790d4c6616df97dafd9bdd271a60ad42b88cd4ee4d",
                "md5": "ea5e42664f8ef79b6017dcc8a6f7c392",
                "sha256": "519dc8996d39e454a869833c4f280a1fab7d7ac70af69a35307c04f1c106fe70"
            },
            "downloads": -1,
            "filename": "playa_pdf-0.2.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ea5e42664f8ef79b6017dcc8a6f7c392",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 5612189,
            "upload_time": "2024-12-15T18:08:48",
            "upload_time_iso_8601": "2024-12-15T18:08:48.432477Z",
            "url": "https://files.pythonhosted.org/packages/82/77/bdd83da492fb3e448a790d4c6616df97dafd9bdd271a60ad42b88cd4ee4d/playa_pdf-0.2.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6c18a652ca7d4c8ed625814df27915120d313bbd48bfb5fe79323e4d5f8780de",
                "md5": "a9041210e49dd9a15a61cedf4681b96e",
                "sha256": "c383da65460932a6448223c1ed8c048dbfd24a891e5ebbd994762cbe105dc874"
            },
            "downloads": -1,
            "filename": "playa_pdf-0.2.5.tar.gz",
            "has_sig": false,
            "md5_digest": "a9041210e49dd9a15a61cedf4681b96e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 7727200,
            "upload_time": "2024-12-15T18:08:52",
            "upload_time_iso_8601": "2024-12-15T18:08:52.359847Z",
            "url": "https://files.pythonhosted.org/packages/6c/18/a652ca7d4c8ed625814df27915120d313bbd48bfb5fe79323e4d5f8780de/playa_pdf-0.2.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-15 18:08:52",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "playa-pdf"
}
        