pdf-image-text9-test

Name	pdf-image-text9-test
Version	0.1.9
home_page
Summary	A python package to generate text from pdf and images
upload_time	2024-03-13 01:49:13
maintainer
docs_url	None
author	Devesh Singh
requires_python	>=3.8,<4.0
license	LICENCE
keywords	test dependencies documentation
VCS
bugtrack_url
requirements	No requirements were recorded.

# Pdf-Image-Text

A library for extracting text from standalone images and from images embedded in PDF files.
Powered by fitz (PyMuPDF) and OpenAI.

### Github - https://github.com/Bain/aag-pdf-image-text

### Description:
This library provides two classes:
1. PdfToText - Extracts the images on each PDF page, gets their transcriptions from OpenAI, extracts the page's plain text, and returns an object containing both the image and text data.
You can process the whole PDF file or a single page.
2. ImageToText - Returns the transcription of the provided image.

## Instructions

### 1. Installation

`pip install pdf-image-text`

### 2. Initialize

`pdf_to_text = PdfToText(open_ai_key='<>', model='<>')`

`image_to_text = ImageToText(open_ai_key='<>', model='<>')`


Note: Both parameters can be omitted if the environment variables 'OPEN_AI_KEY' and 'MODEL' are already set.
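
The fallback described in the note can be pictured with a small helper. This is only a sketch of the general pattern (explicit argument first, environment variable second); `resolve_setting` is a hypothetical name and is not part of this library, whose actual lookup logic is not shown here.

```python
import os
from typing import Mapping, Optional


def resolve_setting(explicit: Optional[str], env_name: str,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Return the explicit value if given, else fall back to an environment variable.

    Hypothetical helper: it only illustrates the "argument or environment"
    pattern, not the library's own implementation.
    """
    env = os.environ if env is None else env
    if explicit:
        return explicit
    value = env.get(env_name)
    if not value:
        raise ValueError(f"{env_name} is not set and no value was passed")
    return value


# Usage sketch: the variable names come from the note above.
# api_key = resolve_setting(None, "OPEN_AI_KEY")
# model = resolve_setting(None, "MODEL")
```

Failing early with a clear message when neither source provides a value tends to be easier to debug than an authentication error raised later by the API call.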

### 3. Load data

From file - 
##### Pdf2Text: `pdf_to_text.load_data(file_name='<Path to file>')`
##### Image2Text: `image_to_text.load_data(file_name='<Path to file>')`

From file object - 

##### Pdf2Text: `pdf_to_text.load_data(file_bytes_object='<file content>')`
##### Image2Text: `image_to_text.load_data(file_bytes_object='<file content>')`
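
When the file is not on disk under a stable path (for example an upload held in memory), its content has to be read into bytes first. The sketch below shows one way to do that with the standard library. `read_file_bytes` and `looks_like_pdf` are hypothetical helpers, and whether `file_bytes_object` expects raw `bytes` or a file-like object is an assumption that should be checked against the library.

```python
from pathlib import Path


def read_file_bytes(path: str) -> bytes:
    """Read a whole file into memory as bytes."""
    return Path(path).read_bytes()


def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: PDF files begin with the marker %PDF-."""
    return data.startswith(b"%PDF-")


# Usage sketch (assumes load_data accepts raw bytes):
# content = read_file_bytes("report.pdf")
# if looks_like_pdf(content):
#     pdf_to_text.load_data(file_bytes_object=content)
```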



### 4. Get output

##### Pdf2Text:
`image_filter = ImageFilter(lower_height=<int>, upper_height=<int>, lower_width=<int>, upper_width=<int>)`

`output = pdf_to_text.get_pdf_content(image_filter=image_filter, page_index=<optional field: int>, include_formatted_content=<optional field: bool>)`

##### Image2Text:
`output = image_to_text.get_image_transcription()`


### 5. Response Object

##### Pdf2Text: 

The output is a list of Page objects.
Each Page object has the following attributes -
1. image_content: A list of transcriptions of the images extracted from the current PDF page.
2. text_content: The plain text extracted from the current PDF page.
3. formatted_content [Optional]: A formatted string combining the plain text and the figure data (inside a FIGURE TRANSCRIPTIONS section). This is useful in knowledge bot applications.
This attribute is controlled by the include_formatted_content flag, which defaults to false.
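
To make the shape of the response concrete, here is a minimal sketch of a page record and of how a formatted string with a FIGURE TRANSCRIPTIONS section might be assembled. The attribute names come from the list above; the `Page` dataclass, the `build_formatted_content` function, and the exact layout of the formatted string are assumptions for illustration, not the library's actual classes or output format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    """Illustrative stand-in for the library's Page object."""
    image_content: List[str] = field(default_factory=list)
    text_content: str = ""
    formatted_content: Optional[str] = None


def build_formatted_content(page: Page) -> str:
    """Combine plain text and figure transcriptions into one string.

    The layout here is a guess; only the section title
    FIGURE TRANSCRIPTIONS is taken from the README.
    """
    lines = [page.text_content.strip()]
    if page.image_content:
        lines.append("")
        lines.append("FIGURE TRANSCRIPTIONS")
        for number, transcription in enumerate(page.image_content, start=1):
            lines.append(f"Figure {number}: {transcription}")
    return "\n".join(lines)


page = Page(image_content=["Bar chart of revenue by year"],
            text_content="Annual results")
page.formatted_content = build_formatted_content(page)
```

Keeping text and figure data in a single string is convenient when each page is indexed as one document for retrieval, which fits the knowledge bot use case mentioned above.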

##### Image2Text: 

The response contains a string representing the transcription of the provided image.
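
The `ImageFilter` bounds passed to `get_pdf_content` are not described further in this README. One plausible reading is that images whose pixel dimensions fall outside the bounds are skipped, which would avoid paying for transcriptions of icons or full-page backgrounds. The sketch below illustrates that reading only. `SizeFilter` is a hypothetical class, and the inclusive bounds and the treatment of missing bounds are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class SizeFilter:
    """Hypothetical stand-in for ImageFilter; None means no bound."""
    lower_height: Optional[int] = None
    upper_height: Optional[int] = None
    lower_width: Optional[int] = None
    upper_width: Optional[int] = None

    def accepts(self, width: int, height: int) -> bool:
        """Return True when both dimensions lie within the inclusive bounds."""
        if self.lower_height is not None and height < self.lower_height:
            return False
        if self.upper_height is not None and height > self.upper_height:
            return False
        if self.lower_width is not None and width < self.lower_width:
            return False
        if self.upper_width is not None and width > self.upper_width:
            return False
        return True


def keep_images(sizes: Iterable[Tuple[int, int]],
                size_filter: SizeFilter) -> List[Tuple[int, int]]:
    """Keep only the (width, height) pairs the filter accepts."""
    return [(w, h) for (w, h) in sizes if size_filter.accepts(w, h)]


size_filter = SizeFilter(lower_height=50, upper_height=1000,
                         lower_width=50, upper_width=1000)
kept = keep_images([(16, 16), (200, 300), (2000, 300)], size_filter)
```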

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "pdf-image-text9-test",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8,<4.0",
    "maintainer_email": "",
    "keywords": "test,dependencies,documentation",
    "author": "Devesh Singh",
    "author_email": "devesh.singh@bain.com",
    "download_url": "https://files.pythonhosted.org/packages/33/d7/e399c3d4dcfe6d4cef0f3b11733a2a7a08cf58da1def595f54a3f325d963/pdf_image_text9_test-0.1.9.tar.gz",
    "platform": null,
    "description": "# Pdf-Image-Text\n\nA library to get data from standalone images and images present inside pdf.\nPowered by fitz and OpenAI.\n\n### Github - https://github.com/Bain/aag-pdf-image-text\n\n### Description: \nThis library provides below two functionalities:\n1. PdfToText - A class which fetch the images present on pdf files, get image transcriptions from openAI, fetches plain text from the pdf page and return a object with image and text data.\nThe users can command to process full pdf file or specific page. \n2. ImageToText - A class which process the transcription of the provided image.\n\n## Instructions\n\n### 1. Installation\n\n`pip install pdf-image-text`\n\n### 2. Initialize\n\n`pdf_to_text = PdfToText(open_ai_key='<>', model='<>')`\n`image_to_text = ImageToText(open_ai_key='<>', model='<>')`\n\n\nNote: The parameters are not required if already present in environment variables as 'OPEN_AI_KEY' and 'MODEL'\n\n### 3. Load data\n\nFrom file - \n##### Pdf2Text: `pdf_to_text.load_data(file_name='<Path to file>')`\n##### Image2Text: `image_to_text.load_data(file_name='<Path to file>')`\n\nFrom file object - \n\n##### Pdf2Text: `pdf_to_text.load_data(file_bytes_object='<file content>')`\n##### Image2Text: `image_to_text.load_data(file_bytes_object='<file content>')`\n\n\n\n### 3. Get output\n\n##### Pdf2Text: \n`image_filter = ImageFilter(lower_height=<int>, upper_height=<int>, lower_width=<int>, upper_width=<int>)`\n\n`output = pdf_to_text.get_pdf_content(image_filter=image_filter, page_index=<optional field: int>, include_formatted_content=<optional field: bool>)`\n\n##### Image2Text: \n`output = image_to_text.get_image_transcription()`\n\n\n### 4. Response Object\n\n##### Pdf2Text: \n\nThe output response contains a list of Page object. \nThe page object consists of below attributes - \n1. image_content: A list of transcriptions for images fetched from the current pdf page.\n2. text_content: The plain text fetched from the current pdf page.\n3. 
formatted_content [Optional] : An optional attribute which contains the formatted string output containing the\nplain text and figure data (Inside FIGURE TRANSCRIPTIONS section). This is useful in knowledge bot applications.\nThe default value for this flag is false.\n\n##### Image2Text: \n\nThe response contains a string representing the transcription of the provided image. \n\n\n",
    "bugtrack_url": null,
    "license": "LICENCE",
    "summary": "A python package to generate text from pdf and images",
    "version": "0.1.9",
    "project_urls": null,
    "split_keywords": [
        "test",
        "dependencies",
        "documentation"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "06f4a61c649401c1a5cf511a4064aeb75a99eedc704be8f6956fddafd6d8b2b3",
                "md5": "c5337cabf598200a4e20c5ec2c2d795c",
                "sha256": "b9865a1e6bc5db356d88d851856a0f822e23770ae12282c0c4b164e915b3201e"
            },
            "downloads": -1,
            "filename": "pdf_image_text9_test-0.1.9-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c5337cabf598200a4e20c5ec2c2d795c",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8,<4.0",
            "size": 6783,
            "upload_time": "2024-03-13T01:49:12",
            "upload_time_iso_8601": "2024-03-13T01:49:12.572609Z",
            "url": "https://files.pythonhosted.org/packages/06/f4/a61c649401c1a5cf511a4064aeb75a99eedc704be8f6956fddafd6d8b2b3/pdf_image_text9_test-0.1.9-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "33d7e399c3d4dcfe6d4cef0f3b11733a2a7a08cf58da1def595f54a3f325d963",
                "md5": "97c75493aaf682d887f28084226c70f7",
                "sha256": "e33ed90873f22ea3b7daa55fe827e8d43979a9258420d9e3f6354a46ee93e7e3"
            },
            "downloads": -1,
            "filename": "pdf_image_text9_test-0.1.9.tar.gz",
            "has_sig": false,
            "md5_digest": "97c75493aaf682d887f28084226c70f7",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8,<4.0",
            "size": 5306,
            "upload_time": "2024-03-13T01:49:13",
            "upload_time_iso_8601": "2024-03-13T01:49:13.448115Z",
            "url": "https://files.pythonhosted.org/packages/33/d7/e399c3d4dcfe6d4cef0f3b11733a2a7a08cf58da1def595f54a3f325d963/pdf_image_text9_test-0.1.9.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-13 01:49:13",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "pdf-image-text9-test"
}

Devesh Singh