| Field | Value |
| --- | --- |
| Name | paves |
| Version | 0.6.1 |
| Summary | PDF, Analyse et Visualisation avancÉS |
| upload_time | 2025-07-16 01:30:38 |
| requires_python | >=3.8 |
| keywords | graphics, pdf |
| home_page | None |
| maintainer | None |
| docs_url | None |
| author | None |
| license | None |
| requirements | No requirements were recorded. |
# PAVÉS: Bajo los adoquines, la PLAYA 🏖️
[**PLAYA**](https://github.com/dhdaines/playa) is intended
to get objects out of PDF, with no
dependencies or further analysis. So, over top of **PLAYA** there is
**PAVÉS**: "**P**DF, **A**nalyse et **V**isualisation ... plus
avancé**ES**", I guess?
Anything that deviates from the core mission of "getting objects out
of PDF" goes here, so, hopefully, more interesting analysis and
extraction that may be useful for all of you AI Bros doing
"Partitioning" and "Retrieval-Assisted-Generation" and suchlike
things. But specifically, visualization stuff inspired by the "visual
debugging" features of `pdfplumber` but not specifically tied to its
data structures and algorithms.
There will be dependencies. Oh, there will be dependencies.
## Installation
```console
pip install paves
```
## Looking at Stuff in a PDF
When poking around in a PDF, it is useful not simply to read
descriptions of objects (text, images, etc) but also to visualise them
in the rendered document. `pdfplumber` is quite nice for this, though
it is oriented towards the particular set of objects that it can
extract from the PDF.
The primary goal of [PLAYA-PDF](https://dhdaines.github.io/playa)
is to give access to all the objects and
particularly the metadata in a PDF. One goal of PAVÉS (because there
are a few) is to give an easy way to visualise these objects and
metadata.
First, maybe you want to just look at a page in your Jupyter notebook.
Okay!
```python
import playa, paves.image as pi
pdf = playa.open("my_awesome.pdf")
page = pdf.pages[3]
pi.show(page)
```
If your PDF contains a logical structure tree, it is quite interesting
to look at the bounding boxes of the contents of its structure
elements for a given page:
```python
pi.box(pdf.structure.find_all(lambda el: el.page is page))
```

You can also look at the marked content sections, which are the
leaf-nodes of the structure tree:
```python
pi.box(page.structure)
```
Alternately, if you have annotations (such as links), you can look at
those too:
```python
pi.box(page.annotations)
```

You can of course draw boxes around individual PDF objects, or
one particular sort of object, or filter them with a generator
expression:
```python
pi.box(page) # outlines everything
pi.box(page.texts)
pi.box(page.images)
pi.box(t for t in page.texts if "spam" in t.chars)
```
Alternately you can "highlight" objects by overlaying them with a
semi-transparent colour, which otherwise works the same way:
```python
pi.mark(page.images)
```

If you wish you can give each type of object a different colour:
```python
pi.mark(page, color={"text": "red", "image": "blue", "path": "green"})
```

You can also add outlines and labels around the highlighting:
```python
pi.mark(page, outline=True, label=True,
        color={"text": "red", "image": "blue", "path": "green"})
```

By default, PAVÉS will assign a new colour to each distinct label based
on a colour cycle [borrowed from
Matplotlib](https://matplotlib.org/stable/gallery/color/color_cycle_default.html)
(no actual Matplotlib was harmed in the making of this library). You
can use Matplotlib's colour cycles if you like:
```python
import matplotlib
pi.box(page, color=matplotlib.color_sequences["Dark2"])
```

Or just any list (it must be a `list`) of color specifications (which
are either strings, 3-tuples of integers in the range `[0, 255]`, or
3-tuples of floats in the range `[0.0, 1.0]`):
```python
pi.mark(page, color=["blue", "magenta", (0.0, 0.5, 0.32), (233, 222, 111)], labelfunc=repr)
```

(yes, that just cycles through the colors for each new object)
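
To make the two ideas above concrete (colour specifications in several forms, and a colour cycle handing out one colour per distinct label), here is a minimal sketch in plain Python. It is not how PAVÉS implements this: `to_rgb255` and `assign_colours` are hypothetical helpers written only for illustration, and the sketch assumes that a tuple made entirely of integers is meant as 0-255 values.

```python
from itertools import cycle

# Hypothetical helper, not part of PAVÉS: scale a colour spec to 0-255 ints.
def to_rgb255(spec):
    if isinstance(spec, str):
        return spec  # named colours are passed through untouched
    if all(isinstance(c, int) for c in spec):
        return tuple(spec)  # assumed to already be in the range [0, 255]
    return tuple(round(c * 255) for c in spec)  # floats in [0.0, 1.0]

# Hypothetical helper: give each distinct label the next colour in the cycle.
def assign_colours(labels, colours):
    palette = cycle(to_rgb255(c) for c in colours)
    assigned = {}
    for label in labels:
        if label not in assigned:
            assigned[label] = next(palette)
    return assigned

colours = ["blue", "magenta", (0.0, 0.5, 0.32), (233, 222, 111)]
labels = ["text", "image", "text", "path", "form", "annot"]
print(assign_colours(labels, colours))
```

With four colours and five distinct labels, the fifth label wraps around to the first colour again, which is all that "cycling" means here.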
## Working in the PDF mine
`pdfminer.six` is widely used for text extraction and layout analysis
due to its liberal licensing terms. Unfortunately it is quite slow
and contains many bugs. Now you can use PAVÉS instead:
```python
from paves.miner import extract, LAParams
laparams = LAParams()
for page in extract(path, laparams):
    ...  # do something with each page
```
This is generally faster than `pdfminer.six`. You can often make it
even faster on large documents by running in parallel with the
`max_workers` argument, which is the same as the one you will find in
`concurrent.futures.ProcessPoolExecutor`. If you pass `None` it will
use all your CPUs, but due to some unavoidable overhead, it usually
doesn't help to use more than 2-4:
```python
for page in extract(path, laparams, max_workers=2):
    ...  # do something with each page
```
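
If you would rather not hard-code the number, one way to pick a value is sketched below. `pick_workers` is a hypothetical helper, not something PAVÉS provides, and the cap of 4 is only my reading of the "2-4" rule of thumb above, not a measured optimum.

```python
import os

# Hypothetical helper, not part of PAVÉS: choose a value for max_workers.
def pick_workers(n_pages, cap=4):
    cpus = os.cpu_count() or 1  # cpu_count() can return None
    # Never more workers than pages, CPUs, or the cap, and never fewer than 1.
    return max(1, min(n_pages, cpus, cap))

print(pick_workers(n_pages=500))
```

The result can then be passed as `max_workers` to `extract`.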
There are a few differences with `pdfminer.six` (some might call them
bug fixes):
- By default, if you do not pass the `laparams` argument to `extract`,
no layout analysis at all is done. This is different from
`extract_pages` in `pdfminer.six` which will set some default
parameters for you. If you don't see any `LTTextBox` items in your
`LTPage` then this is why!
- Rectangles are recognized correctly in some cases where
`pdfminer.six` thought they were "curves".
- Colours and colour spaces are the PLAYA versions, which do not
correspond to what `pdfminer.six` gives you, because what
`pdfminer.six` gives you is not useful and often wrong.
- You have access to the list of enclosing marked content sections in
every `LTComponent`, as the `mcstack` attribute.
- Bounding boxes of rotated glyphs are the glyphs' actual bounding boxes.
Probably more... but you didn't use any of that stuff anyway, you just
wanted to get `LTTextBox` objects to feed to your hallucination factories.
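
As for what an "actual bounding box" of a rotated glyph might mean: my reading is the smallest axis-aligned box that encloses the glyph once its transformation matrix has been applied. The sketch below shows that calculation using the usual PDF matrix convention `(a, b, c, d, e, f)`. It illustrates the geometry only. `transformed_bbox` is a made-up name, and the source does not say how PAVÉS computes its boxes.

```python
import math

# Illustrative only: none of these names come from PAVÉS or pdfminer.six.
def transformed_bbox(matrix, x0, y0, x1, y1):
    """Axis-aligned box enclosing a rectangle after applying a PDF matrix."""
    a, b, c, d, e, f = matrix
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    # PDF convention: x' = a*x + c*y + e, y' = b*x + d*y + f
    xs = [a * x + c * y + e for x, y in corners]
    ys = [b * x + d * y + f for x, y in corners]
    return min(xs), min(ys), max(xs), max(ys)

# A 10 x 20 glyph box rotated 90 degrees counter-clockwise about the origin.
theta = math.radians(90)
rot = (math.cos(theta), math.sin(theta), -math.sin(theta), math.cos(theta), 0, 0)
print(transformed_bbox(rot, 0, 0, 10, 20))
```

Up to floating-point noise, the rotated box runs from x = -20 to 0 and from y = 0 to 10: all four corners are transformed, not just two opposite ones.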
## PLAYA Bears
[PLAYA](https://github.com/dhdaines/playa) has a nice "lazy" API which
is efficient but does take a bit of work to use. If, on the other
hand, **you** are lazy, then you can use `paves.bears`, which will
flatten everything for you into a friendly dictionary representation
(but it is a
[`TypedDict`](https://typing.readthedocs.io/en/latest/spec/typeddict.html#typeddict))
which, um, looks a lot like what `pdfplumber` gives you, except
possibly in a different coordinate space, as defined [in the PLAYA
documentation](https://github.com/dhdaines/playa#an-important-note-about-coordinate-spaces).
```python
from paves.bears import extract
for dic in extract(path):
    print(f"it is a {dic['object_type']} at ({dic['x0']}, {dic['y0']})")
    print(f" the color is {dic['stroking_color']}")
    print(f" the text is {dic['text']}")
    print(f" it is in MCS {dic['mcid']} which is a {dic['tag']}")
    print(f" it is also in Form XObject {dic['xobjid']}")
```
This can be used to do machine learning of various sorts. For
instance, you can write the extracted objects to a CSV file:
```python
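from csv import DictWriter
from paves.bears import FIELDNAMES, extract
# `path` is your PDF; the output filename is up to you
with open("objects.csv", "w", newline="") as outfh:
    writer = DictWriter(outfh, fieldnames=FIELDNAMES)
    writer.writeheader()
    for dic in extract(path):
        writer.writerow(dic)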
```
You can also create a Pandas DataFrame:
```python
import pandas
df = pandas.DataFrame.from_records(extract(path))
```
Or a Polars DataFrame or LazyFrame:
```python
import polars
from paves.bears import SCHEMA
df = polars.DataFrame(extract(path), schema=SCHEMA)
```
As above, you can use multiple CPUs with `max_workers`, and this will
scale considerably better than `paves.miner`.
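
Because the records are flat dictionaries, ordinary Python is enough for a lot of post-processing. Here is a small sketch that joins the text of objects sharing a marked content section. The records are made-up stand-ins using a few of the keys from the example above (`object_type`, `mcid`, `tag`, `text`) so that it runs without a PDF, and `text_by_mcs` is a hypothetical helper, not part of `paves.bears`.

```python
from collections import defaultdict

# Stand-in records: made-up values using keys shown in the example above.
records = [
    {"object_type": "text", "mcid": 0, "tag": "H1", "text": "Title"},
    {"object_type": "text", "mcid": 1, "tag": "P", "text": "Hello"},
    {"object_type": "text", "mcid": 1, "tag": "P", "text": "world"},
    {"object_type": "path", "mcid": None, "tag": None, "text": None},
]

def text_by_mcs(dicts):
    """Join the text of objects sharing the same (mcid, tag) pair."""
    groups = defaultdict(list)
    for dic in dicts:
        if dic.get("text"):  # skip objects with no text
            groups[(dic["mcid"], dic["tag"])].append(dic["text"])
    return {key: " ".join(parts) for key, parts in groups.items()}

print(text_by_mcs(records))
```

On a real multi-page document you would also need to group by page, since MCIDs are only meaningful within a single page. The sketch leaves that out because the example above does not show which key identifies the page.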
## License
`PAVÉS` is distributed under the terms of the
[MIT](https://spdx.org/licenses/MIT.html) license.
Raw data
{
"_id": null,
"home_page": null,
"name": "paves",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "graphics, pdf",
"author": null,
"author_email": "David Huggins-Daines <dhd@ecolingui.ca>",
"download_url": "https://files.pythonhosted.org/packages/b6/52/111534d7bda42ad4c746865bfbf49b7360bfa6eddd00661075f08d25493f/paves-0.6.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "PDF, Analyse et Visualisation avanc\u00c9S",
"version": "0.6.1",
"project_urls": {
"Documentation": "https://github.com/dhdaines/paves#readme",
"Issues": "https://github.com/dhdaines/paves/issues",
"Source": "https://github.com/dhdaines/paves"
},
"split_keywords": [
"graphics",
" pdf"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "dc731b53f4bea65bc027d20a0b6a837c0e1a3625c9bbd90595f8f3c758f4975c",
"md5": "312eb4b1929d8ef65de8ff088a95761b",
"sha256": "b0d935a28d2d4c5dace5ee798b1bea8aa2289ff5358930eae01d76ae107e8164"
},
"downloads": -1,
"filename": "paves-0.6.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "312eb4b1929d8ef65de8ff088a95761b",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 27067,
"upload_time": "2025-07-16T01:30:36",
"upload_time_iso_8601": "2025-07-16T01:30:36.921359Z",
"url": "https://files.pythonhosted.org/packages/dc/73/1b53f4bea65bc027d20a0b6a837c0e1a3625c9bbd90595f8f3c758f4975c/paves-0.6.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "b652111534d7bda42ad4c746865bfbf49b7360bfa6eddd00661075f08d25493f",
"md5": "8820816d635239294b3c5611d37dfd24",
"sha256": "bc046b35313461bb2016d29c8d1d8a1fd0ca69eec338b3781e55a5b4ab4310e4"
},
"downloads": -1,
"filename": "paves-0.6.1.tar.gz",
"has_sig": false,
"md5_digest": "8820816d635239294b3c5611d37dfd24",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 1519472,
"upload_time": "2025-07-16T01:30:38",
"upload_time_iso_8601": "2025-07-16T01:30:38.774207Z",
"url": "https://files.pythonhosted.org/packages/b6/52/111534d7bda42ad4c746865bfbf49b7360bfa6eddd00661075f08d25493f/paves-0.6.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-16 01:30:38",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "dhdaines",
"github_project": "paves#readme",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "paves"
}