# Document Transformer
Document Transformer allows users to define and apply transformations to documents in a flexible and robust manner, ensuring traceability of each change made to the documents.
## Table of Contents
- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Contributing](#contributing)
- [License](#license)
- [Contact](#contact)
## Features
- Flexible document transformation
- Comprehensive traceability for each transformation
- Custom support for multiple document formats (e.g., JSON, XML, CSV)
- Easy integration with other tools and workflows
## Installation
Install Document Transformer from PyPI (Python 3.9 or later is required):
```sh
# Install using pip
pip install document-transformer
```
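To confirm that the install succeeded, you can query the installed version from Python. This is a small sketch that uses only the standard library; the helper name `installed_version` is hypothetical and not part of Document Transformer.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string, or None if the package is not installed
print(installed_version("document-transformer"))
```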
## Usage
Define custom Document classes:
```python
from document_transformer import Document


class PDFDocument(Document):
    """Custom class for PDF documents."""


class ImageDocument(Document):
    """Custom class for image documents."""

    def saver(self, path):
        # self.data holds the page image (a PIL image when produced by pdf2image)
        self.data.save(path)
        return self
```
Define the transformer, specifying the input and output Document types:
```python
from document_transformer import DocumentTransformer
import pdf2image  # install: pip install pdf2image
from typing import List
from pathlib import Path


class PDF2Images(DocumentTransformer):
    input: PDFDocument = None
    output: List[ImageDocument] = []

    def transformer(self) -> List[ImageDocument]:
        """Split the PDF document into pages"""
        images = pdf2image.convert_from_path(self.input.path)
        return [
            ImageDocument(
                metadata={'pdf_path': Path(self.input.path).name, 'page': i+1, 'size': image.size},
                data=image,
            )
            for i, image in enumerate(images)
        ]
```
Run the transformer:
```python
pdf_doc = PDFDocument(path="document.pdf")
images = PDF2Images(input=pdf_doc).run()

for image in images:
    image.save(path=f'images/pag_{image.metadata["page"]}.jpg')
    print(f"Image: {image.id}")
    print(f"Parents: {image.parents}")
    print(f"Metadata: {image.metadata}")
```
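The `id` and `parents` fields printed above are what make each output traceable back to its source. The sketch below illustrates the general idea with plain dataclasses. It is not the library's implementation, and `TracedDoc` and `split_pages` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List
from uuid import uuid4


@dataclass
class TracedDoc:
    """Toy document that remembers which documents it was derived from."""
    data: Any = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    parents: List[str] = field(default_factory=list)
    id: str = field(default_factory=lambda: uuid4().hex)


def split_pages(source: TracedDoc) -> List[TracedDoc]:
    """Create one child per page, recording the source id as its parent."""
    return [
        TracedDoc(data=page, metadata={"page": i + 1}, parents=[source.id])
        for i, page in enumerate(source.data)
    ]


# Hypothetical input: a "PDF" whose data is just a list of page strings
pdf = TracedDoc(data=["first page", "second page"])
pages = split_pages(pdf)
for page in pages:
    print(page.metadata["page"], page.parents == [pdf.id])
```

Because every child stores the id of the document it came from, the chain of transformations can be walked backwards from any output.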
Alternatively, run it as a pipeline and visualize the transformation graph:
```python
from document_transformer import Pipeline
from document_transformer.utils import plot_graph

# Define the pipeline; add more transformers as needed
pipeline = Pipeline(transformers=[
    PDF2Images(to="images/pag_{metadata[page]}.jpg"),
    # Images2Markdown(to="images/pag_{metadata[page]}.md"),
    # ...
])

# Define the input and get the output
pdf_doc = PDFDocument(path="document.pdf")
images = pipeline.run(input=pdf_doc)

# Plot the transformation graph
plot_graph(pipeline.get_traces())
```
![plot_graph.png](docs/static/plot_graph.png)
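The `to=` argument above looks like a standard Python format string in which `{metadata[page]}` indexes into a mapping. Assuming the library expands it with `str.format` (an assumption, not something this README confirms), the expansion would behave as in this sketch; `render_path` is a hypothetical helper.

```python
from typing import Any, Dict


def render_path(template: str, metadata: Dict[str, Any]) -> str:
    """Expand a path template; the key inside [] is written without quotes."""
    return template.format(metadata=metadata)


print(render_path("images/pag_{metadata[page]}.jpg", {"page": 3}))
# prints: images/pag_3.jpg
```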
## Contributing
We welcome contributions! Please read our Contributing Guide to learn how you can help.
## License
Document Transformer is licensed under the MIT License.
## Contact
If you have any questions or feedback, please feel free to reach out to us at johngonzalezv@gmail.com.