document-transformer


- Name: document-transformer
- Version: 0.1.2
- Home page: https://github.com/johngonzalez/document-transformer
- Summary: Organize your document transformation pipeline. Define your components; this tool ensures traceability and organization. Reuse your work easily in other projects.
- Upload time: 2024-06-01 19:26:06
- Maintainer: None
- Docs URL: None
- Author: John Gonzalez
- Requires Python: <4.0,>=3.9
- License: MIT
- Keywords: document, transformer, pipeline, components, reuse, traceability, organization
- Requirements: No requirements were recorded.
- Travis CI: none
- Coveralls test coverage: none

# Document Transformer

Document Transformer allows users to define and apply transformations to documents in a flexible and robust manner, ensuring traceability of each change made to the documents.

## Table of Contents

- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Contributing](#contributing)
- [License](#license)
- [Contact](#contact)

## Features

- Flexible document transformation
- Comprehensive traceability for each transformation
- Custom support for multiple document formats (e.g., JSON, XML, CSV)
- Easy integration with other tools and workflows

## Installation

Install Document Transformer from PyPI:

```sh
# Install using pip
pip install document-transformer
```
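
The package metadata declares `requires_python` as `<4.0,>=3.9`. The following is a minimal sketch for checking an interpreter version against that range before installing. The helper name `python_supported` is hypothetical and is not part of the package.

```python
import sys


def python_supported(version=None):
    """Return True if `version` (major, minor, ...) satisfies >=3.9,<4.0."""
    # Default to the interpreter that is running this script.
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return (3, 9) <= major_minor < (4, 0)


if __name__ == "__main__":
    current = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Python {current} supported: {python_supported()}")
```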

## Usage

Define custom Document classes:

```python
from document_transformer import Document

class PDFDocument(Document):
    """Custom class for PDF documents"""

class ImageDocument(Document):
    """Custom class for image documents"""
    def saver(self, path):
        self.data.save(path)
        return self
```

Define the transformer, specifying the input and output Document types:

```python
from document_transformer import DocumentTransformer
import pdf2image  # install: pip install pdf2image
from typing import List
from pathlib import Path

class PDF2Images(DocumentTransformer):
    input: PDFDocument = None
    output: List[ImageDocument] = []

    def transformer(self) -> List[ImageDocument]:
        """Split the PDF document into pages"""
        images = pdf2image.convert_from_path(self.input.path)
        return [
            ImageDocument(
                metadata={'pdf_path': Path(self.input.path).name, 'page': i+1, 'size': image.size},
                data=image,
            )
            for i, image in enumerate(images)
        ]
```

Run your implementation:

```python
pdf_doc = PDFDocument(path="document.pdf")
images = PDF2Images(input=pdf_doc).run()

for image in images:
    image.save(path=f'images/pag_{image.metadata["page"]}.jpg')
    print(f"Image: {image.id}")
    print(f"Parents: {image.parents}")
    print(f"Metadata: {image.metadata}")
```
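
The `id` and `parents` fields printed above are what make each output traceable to its source. The following standalone sketch illustrates the idea only and is not the library's implementation: `TracedDoc` and `split_pages` are hypothetical names, and the example does not import `document_transformer`.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TracedDoc:
    """Hypothetical stand-in for a document that records its lineage."""
    data: Any = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    parents: List[str] = field(default_factory=list)
    # Every document gets its own random identifier.
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


def split_pages(source: TracedDoc) -> List[TracedDoc]:
    """Create one child per page, each pointing back at the source id."""
    return [
        TracedDoc(data=page, metadata={"page": i + 1}, parents=[source.id])
        for i, page in enumerate(source.data)
    ]


pdf = TracedDoc(data=["page one", "page two"])
pages = split_pages(pdf)
for page in pages:
    print(page.id, page.parents, page.metadata)
```

Because each child stores its parent's identifier, any output can be followed back to the input that produced it.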

Or run it as a pipeline and visualize the transformation graph:

```python
from document_transformer import Pipeline
from document_transformer.utils import plot_graph

# Define Pipeline, add more transformers as you need
pipeline = Pipeline(transformers=[
    PDF2Images(to="images/pag_{metadata[page]}.jpg"),
    # Images2Markdown(to="images/pag_{metadata[page]}.md"),
    # ...
])

# Define input and get output
pdf_doc = PDFDocument(path="document.pdf")
images = pipeline.run(input=pdf_doc)

# Plot the transformation graph
plot_graph(pipeline.get_traces())
```

![plot_graph.png](docs/static/plot_graph.png)
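
The `to="images/pag_{metadata[page]}.jpg"` argument in the pipeline looks like a Python format string. The README does not say how the library expands it, so the sketch below assumes plain `str.format` with the document's metadata passed as a keyword argument. The helper `expand_target` is hypothetical.

```python
def expand_target(template: str, metadata: dict) -> str:
    """Fill a path template such as 'images/pag_{metadata[page]}.jpg'."""
    # str.format resolves {metadata[page]} by looking up the key 'page'
    # (written without quotes) in the `metadata` keyword argument.
    return template.format(metadata=metadata)


template = "images/pag_{metadata[page]}.jpg"
paths = [expand_target(template, {"page": n}) for n in (1, 2, 3)]
print(paths)  # ['images/pag_1.jpg', 'images/pag_2.jpg', 'images/pag_3.jpg']
```

Under this assumption, any metadata key set by a transformer (here `page`) can be used to build a distinct output path per document.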

## Contributing
We welcome contributions! Please read our Contributing Guide to learn how you can help.

## License
Document Transformer is licensed under the MIT License.

## Contact
If you have any questions or feedback, please feel free to reach out to us at johngonzalezv@gmail.com.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/johngonzalez/document-transformer",
    "name": "document-transformer",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.9",
    "maintainer_email": null,
    "keywords": "document, transformer, pipeline, components, reuse, traceability, organization",
    "author": "John Gonzalez",
    "author_email": "johngonzalezv@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/a6/ab/8fc522467b0c8ec39dd8e48693b6efe6aa6da42eceb6b2b5b769f199ea6e/document_transformer-0.1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Organize your document transformation pipeline. Define your components, this tool ensure traceability and organization. Reuse your work easily in other projects",
    "version": "0.1.2",
    "project_urls": {
        "Homepage": "https://github.com/johngonzalez/document-transformer",
        "Repository": "https://github.com/johngonzalez/document-transformer"
    },
    "split_keywords": [
        "document",
        " transformer",
        " pipeline",
        " components",
        " reuse",
        " traceability",
        " organization"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ea2a9fec764e7fe9593d6ebce3369f5ed3a2b203f91cb6ea2d8fc66773cb1a02",
                "md5": "d538085f30f6dcf87f6bdf276155dd1e",
                "sha256": "255d08627fa6d679d3e718a6494ec37f18fc35676b47cbbbe4cc4aa52385ef1a"
            },
            "downloads": -1,
            "filename": "document_transformer-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d538085f30f6dcf87f6bdf276155dd1e",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.9",
            "size": 6635,
            "upload_time": "2024-06-01T19:26:04",
            "upload_time_iso_8601": "2024-06-01T19:26:04.967212Z",
            "url": "https://files.pythonhosted.org/packages/ea/2a/9fec764e7fe9593d6ebce3369f5ed3a2b203f91cb6ea2d8fc66773cb1a02/document_transformer-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a6ab8fc522467b0c8ec39dd8e48693b6efe6aa6da42eceb6b2b5b769f199ea6e",
                "md5": "bb2b915d6c1f2f61003b2c9c2a4bad3b",
                "sha256": "817c9c0f588171f64c9167efad763b42f0f2d2b35d4932daee4a522e84fdacfb"
            },
            "downloads": -1,
            "filename": "document_transformer-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "bb2b915d6c1f2f61003b2c9c2a4bad3b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.9",
            "size": 5183,
            "upload_time": "2024-06-01T19:26:06",
            "upload_time_iso_8601": "2024-06-01T19:26:06.494495Z",
            "url": "https://files.pythonhosted.org/packages/a6/ab/8fc522467b0c8ec39dd8e48693b6efe6aa6da42eceb6b2b5b769f199ea6e/document_transformer-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-06-01 19:26:06",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "johngonzalez",
    "github_project": "document-transformer",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "document-transformer"
}
        