| Field | Value |
|---|---|
| Name | pdfsyntax |
| Version | 0.1.2 |
| home_page | None |
| Summary | A Python library to inspect and modify the internal structure of a PDF file |
| upload_time | 2024-09-07 18:32:55 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.8 |
| license | MIT license |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

PDFSyntax
=========
*A Python library to inspect and transform the internal structure of PDF files*
## Introduction
The project focuses on chapter 7 ("Syntax") of the Portable Document Format (PDF) Specification. It implements document structure management down to the byte level, for inspection and transformation use cases such as metadata access or page rotation.
- Internal functions are exposed as an API toolkit for PDF read/write operations.
- Some functions are also exposed as a command-line interface, for use in a terminal or a browser.
PDFSyntax is lightweight (no dependencies) and written from scratch in pure Python, with a focus on simplicity and immutability.
It favors the non-destructive edits allowed by the PDF Specification: by default, incremental updates are appended to the end of the original file (you may rewind, or squash all revisions into a single one).
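To make the incremental-update idea concrete: each revision written to a PDF file ends with an `%%EOF` marker, so counting markers gives a rough estimate of how many revisions a file holds. The sketch below is plain Python and does not use PDFSyntax; `count_eof_markers` is a hypothetical helper, and the count can be skewed when the marker bytes occur inside stream data or when the file is linearized.

```Python
def count_eof_markers(pdf_bytes):
    """Count %%EOF markers as a rough proxy for the number of revisions."""
    return pdf_bytes.count(b"%%EOF")


# Toy byte strings standing in for real files (they are not valid PDFs).
original = b"%PDF-1.7\n...objects...\n%%EOF\n"
# An incremental update only appends bytes after the original content.
updated = original + b"...changed objects...\n%%EOF\n"
```

An incremental update leaves the original bytes untouched, which is why `updated` still starts with `original`.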
## Project status
WORK IN PROGRESS! This is beta-quality software. The API may change at any time.
Next on the TO-DO list:
- Cut & append pages
- Lossless compression
- More filters
- Improve text extraction
- Augment text extraction with layout detection
## Installation
You can install from PyPI:

    pip install pdfsyntax

## CLI overview
Please refer to the [CLI README](https://github.com/desgeeko/pdfsyntax/blob/main/docs/cli.md) for details.
The general form of the CLI usage is:

    python3 -m pdfsyntax COMMAND FILE

You can get quick insights on a PDF file with these commands:
- `overview` outputs text data about the structure and the metadata.
- `disasm` outputs a dump of the file structure on the terminal.
- `text` outputs extracted text with its spatial layout preserved, as if the page had been scanned.
- `browse` outputs static HTML that lets you browse the internal structure of the PDF file: the PDF source is pretty-printed and augmented with hyperlinks.
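The commands above can also be driven from a script. This is a minimal sketch that relies only on the documented `python3 -m pdfsyntax COMMAND FILE` form; `build_cli_args` and `run_cli` are hypothetical helpers, and it assumes that every command writes its result to standard output.

```Python
import subprocess
import sys

# The four commands listed above.
COMMANDS = ("overview", "disasm", "text", "browse")


def build_cli_args(command, pdf_path):
    """Build the argument list for `python3 -m pdfsyntax COMMAND FILE`."""
    if command not in COMMANDS:
        raise ValueError(f"unknown command: {command!r}")
    # sys.executable stands in for `python3` so the current interpreter is used.
    return [sys.executable, "-m", "pdfsyntax", command, pdf_path]


def run_cli(command, pdf_path):
    """Run the CLI and return what it wrote to standard output (assumed)."""
    result = subprocess.run(
        build_cli_args(command, pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

For example, `run_cli("overview", "samples/simple_text_string.pdf")` would return the overview text as a string, provided pdfsyntax is installed and the file exists.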
## API overview
Please refer to the [API README](https://github.com/desgeeko/pdfsyntax/blob/main/docs/api.md) for details.
PDFSyntax is mostly made of simple functions. Example:
```Python
>>> from pdfsyntax import readfile, metadata
>>> doc = readfile("samples/simple_text_string.pdf")
>>> metadata(doc) #returns a Python dict whose keys are 'Title', 'Author', etc...
```
The `Doc` object is probably the only dedicated class you will need to handle. It is a black box that stores the internal state of a document:
- content that is cached/memoized from an original file,
- modifications that add, modify or delete content and that are tracked as incremental updates.
```Python
>>> doc
<PDF Doc in revision 1 with 0 modified object(s)>
```
This object also exposes `metadata` as a method, so you can get the same result with:
```Python
>>> doc.metadata() #returns a Python dict whose keys are 'Title', 'Author', etc...
```
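Since `metadata` returns a plain dict, ordinary dict handling applies. The sketch below assumes only the `readfile` and `metadata` functions shown above; `summarize_metadata` and `print_summary` are hypothetical helpers, not part of PDFSyntax, and the handling of missing or empty entries is an assumption about what the dict may contain.

```Python
def summarize_metadata(meta, keys=("Title", "Author")):
    """Return 'Key: value' lines for the requested keys (hypothetical helper)."""
    lines = []
    for key in keys:
        # Treat a missing dict, a missing key and a None value the same way.
        value = meta.get(key) if meta else None
        shown = value if value is not None else "(not set)"
        lines.append(f"{key}: {shown}")
    return lines


def print_summary(path):
    """Read a PDF with PDFSyntax and print a short metadata summary."""
    # Imported here so summarize_metadata stays usable without the library.
    from pdfsyntax import readfile, metadata

    doc = readfile(path)
    for line in summarize_metadata(metadata(doc)):
        print(line)
```

Keeping the formatting separate from the file access means the dict handling can be exercised without a PDF at hand.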
Low-level functions like `get_object` or `update_object` allow you to directly access and manipulate the inner objects of the document structure.
You may also use higher-level functions like `rotate`:
```Python
>>> from pdfsyntax import rotate, writefile
>>> doc180 = rotate(doc, 180) #rotate pages by 180°
```
The original object is unchanged: a new object is created with an incremental update (revision 2) that holds the pending rotation:
```Python
>>> doc180
<PDF Doc in revision 1 with 1 modified object(s)>
```
You can then write the modified PDF to disk. Note that the resulting file contains a new section appended to the original content. You may cut this section to revert the change.
```Python
>>> writefile(doc180, "rotated_doc.pdf")
```
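Cutting the appended section can be sketched in plain Python, without PDFSyntax. This assumes each revision ends with an `%%EOF` marker and that the marker bytes do not appear elsewhere (for example inside stream data); `strip_last_revision` is a hypothetical helper, not a library function, and is best tried on a copy of the file.

```Python
def strip_last_revision(pdf_bytes):
    """Drop everything after the second-to-last %%EOF marker (a sketch)."""
    marker = b"%%EOF"
    last = pdf_bytes.rfind(marker)
    if last == -1:
        raise ValueError("no %%EOF marker found")
    # Look for an earlier marker strictly before the last one.
    previous = pdf_bytes.rfind(marker, 0, last)
    if previous == -1:
        raise ValueError("only one revision found; nothing to cut")
    end = previous + len(marker)
    # Keep the end-of-line bytes that directly follow the marker, if any.
    if pdf_bytes[end:end + 1] == b"\r":
        end += 1
    if pdf_bytes[end:end + 1] == b"\n":
        end += 1
    return pdf_bytes[:end]
```

Applied to the bytes of `rotated_doc.pdf`, this would be expected to give back the content of the file as it was before the rotation, under the assumptions stated above.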
## Open-Source, not Open-Contribution yet
PDFSyntax is [MIT licensed](https://github.com/desgeeko/pdfsyntax/blob/main/LICENCE) but is currently closed to contributions.
> Personal note: this is a pet project of mine and my time is limited. First I need to focus on my roadmap (new features and refactoring), and then I will happily accept contributions once everything is a little more stabilised.
Raw data
{
"_id": null,
"home_page": null,
"name": "pdfsyntax",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": null,
"author": null,
"author_email": "\"Martin D.\" <desgeeko@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/47/1a/94b12b97ff1ffad4191adb853940293fa0b876982bcb1a930bf0edb6dc32/pdfsyntax-0.1.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT license",
"summary": "A Python library to inspect and modify the internal structure of a PDF file",
"version": "0.1.2",
"project_urls": {
"Bug Tracker": "https://github.com/desgeeko/pdfsyntax/issues",
"Homepage": "https://github.com/desgeeko/pdfsyntax"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "c2d73e7177bdeb057d40193ed98ae0bd1ba635c635ac4efae4581972f85ef246",
"md5": "3c8be411cc33962ebb759d73190e44f2",
"sha256": "97ca595cfcba497cf07e9a40c61e6c4a79bd64dccdfa7ce15639ca288d03137e"
},
"downloads": -1,
"filename": "pdfsyntax-0.1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "3c8be411cc33962ebb759d73190e44f2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 38134,
"upload_time": "2024-09-07T18:32:53",
"upload_time_iso_8601": "2024-09-07T18:32:53.707314Z",
"url": "https://files.pythonhosted.org/packages/c2/d7/3e7177bdeb057d40193ed98ae0bd1ba635c635ac4efae4581972f85ef246/pdfsyntax-0.1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "471a94b12b97ff1ffad4191adb853940293fa0b876982bcb1a930bf0edb6dc32",
"md5": "26db04cc0b099a5d6b9bf6aebc188e20",
"sha256": "f8570f34eb7feebf685ec8cd55fffeaaeb139c662157e48835c06fe324b9e56d"
},
"downloads": -1,
"filename": "pdfsyntax-0.1.2.tar.gz",
"has_sig": false,
"md5_digest": "26db04cc0b099a5d6b9bf6aebc188e20",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 38744,
"upload_time": "2024-09-07T18:32:55",
"upload_time_iso_8601": "2024-09-07T18:32:55.038994Z",
"url": "https://files.pythonhosted.org/packages/47/1a/94b12b97ff1ffad4191adb853940293fa0b876982bcb1a930bf0edb6dc32/pdfsyntax-0.1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-09-07 18:32:55",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "desgeeko",
"github_project": "pdfsyntax",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "pdfsyntax"
}