aipdf


Nameaipdf JSON
Version 0.0.4 PyPI version JSON
download
home_pageNone
SummaryA tool to extract PDF files to markdown, or any other format using AI
upload_time2024-10-14 23:59:27
maintainerNone
docs_urlNone
authorNone
requires_python>=3.7
licenseMIT
keywords pdf markdown ai conversion openai
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # AIPDF: Simple PDF OCR with GPT-like Multimodal Models

Screw traditional OCRs or heavy libraries to get data from PDFs, GenAI does a better job!

AIPDF is a stand-alone, minimalistic, yet powerful pure Python library that leverages multi-modal gen AI models (OpenAI, llama3 or compatible alternatives) to extract data from PDFs and convert it into various formats such as Markdown or JSON. 

## Installation

```bash
pip install aipdf
```

in macOS you will need to install poppler
```bash
brew install poppler 
```

## Quick Start



```python
from aipdf import ocr

# Your OpenAI API key   
api_key = 'your_openai_api_key'

file = open('somepdf.pdf', 'rb')
markdown_pages = ocr(file, api_key)

```

##  Ollama

You can use with any ollama multi-modal models 

```python
ocr(pdf_file, api_key='ollama', model="llama3.2", base_url= 'http://localhost:11434/v1', prompt=...)
```
## Any file system

We chose that you pass a file object, because that way it is flexible for you to use this with any type of file system, s3, localfiles, urls etc

### From url
```python

pdf_file = io.BytesIO(requests.get('https://arxiv.org/pdf/2410.02467').content)

# extract
pages = ocr(pdf_file, api_key, prompt="extract tables, return each table in json")

```
### From S3

```python

s3 = boto3.client('s3', config=Config(signature_version='s3v4'),
                  aws_access_key_id=access_token,
                  aws_secret_access_key='', # Not needed for token-based auth
                  aws_session_token=access_token)


pdf_file = io.BytesIO(s3.get_object(Bucket=bucket_name, Key=object_key)['Body'].read())
# extract 
pages = ocr(pdf_file, api_key, prompt="extract charts data, turn it into tables that represent the variables in the chart")
```


## Why AIPDF?

1. **Simplicity**: AIPDF provides a straightforward function, it requires minimal setup, dependencies and configuration.
2. **Flexibility**: Extract data into Markdown, JSON, HTML, YAML, whatever... file format and schema.
3. **Power of AI**: Leverages state-of-the-art multi modal models (gpt, llama, ..).
4. **Customizable**: Tailor the extraction process to your specific needs with custom prompts.
5. **Efficient**: Utilizes parallel processing for faster extraction of multi-page PDFs.

## Requirements

- Python 3.7+

We will keep this super clean, only 3 required libraries:

- openai library to talk to completion endpoints
- pdf2image library (for PDF to image conversion)
- Pillow (PIL) library

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Support

If you encounter any problems or have any questions, please open an issue on the GitHub repository.

---

AIPDF makes PDF data extraction simple, flexible, and powerful. Try it out and simplify your PDF processing workflow today!

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "aipdf",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "pdf, markdown, ai, conversion, openai",
    "author": null,
    "author_email": "Jorge Torres <support@mindsdb.com>",
    "download_url": "https://files.pythonhosted.org/packages/34/f9/85a89f59e8cc8d10a5b4ad5f7217cc2562388036548b8bcd21a9e0836433/aipdf-0.0.4.tar.gz",
    "platform": null,
    "description": "# AIPDF: Simple PDF OCR with GPT-like Multimodal Models\n\nScrew traditional OCRs or heavy libraries to get data from PDFs, GenAI does a better job!\n\nAIPDF is a stand-alone, minimalistic, yet powerful pure Python library that leverages multi-modal gen AI models (OpenAI, llama3 or compatible alternatives) to extract data from PDFs and convert it into various formats such as Markdown or JSON. \n\n## Installation\n\n```bash\npip install aipdf\n```\n\nin macOS you will need to install poppler\n```bash\nbrew install poppler \n```\n\n## Quick Start\n\n\n\n```python\nfrom aipdf import ocr\n\n# Your OpenAI API key   \napi_key = 'your_openai_api_key'\n\nfile = open('somepdf.pdf', 'rb')\nmarkdown_pages = ocr(file, api_key)\n\n```\n\n##  Ollama\n\nYou can use with any ollama multi-modal models \n\n```python\nocr(pdf_file, api_key='ollama', model=\"llama3.2\", base_url= 'http://localhost:11434/v1', prompt=...)\n```\n## Any file system\n\nWe chose that you pass a file object, because that way it is flexible for you to use this with any type of file system, s3, localfiles, urls etc\n\n### From url\n```python\n\npdf_file = io.BytesIO(requests.get('https://arxiv.org/pdf/2410.02467').content)\n\n# extract\npages = ocr(pdf_file, api_key, prompt=\"extract tables, return each table in json\")\n\n```\n### From S3\n\n```python\n\ns3 = boto3.client('s3', config=Config(signature_version='s3v4'),\n                  aws_access_key_id=access_token,\n                  aws_secret_access_key='', # Not needed for token-based auth\n                  aws_session_token=access_token)\n\n\npdf_file = io.BytesIO(s3.get_object(Bucket=bucket_name, Key=object_key)['Body'].read())\n# extract \npages = ocr(pdf_file, api_key, prompt=\"extract charts data, turn it into tables that represent the variables in the chart\")\n```\n\n\n## Why AIPDF?\n\n1. **Simplicity**: AIPDF provides a straightforward function, it requires minimal setup, dependencies and configuration.\n2. **Flexibility**: Extract data into Markdown, JSON, HTML, YAML, whatever... file format and schema.\n3. **Power of AI**: Leverages state-of-the-art multi modal models (gpt, llama, ..).\n4. **Customizable**: Tailor the extraction process to your specific needs with custom prompts.\n5. **Efficient**: Utilizes parallel processing for faster extraction of multi-page PDFs.\n\n## Requirements\n\n- Python 3.7+\n\nWe will keep this super clean, only 3 required libraries:\n\n- openai library to talk to completion endpoints\n- pdf2image library (for PDF to image conversion)\n- Pillow (PIL) library\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n## Support\n\nIf you encounter any problems or have any questions, please open an issue on the GitHub repository.\n\n---\n\nAIPDF makes PDF data extraction simple, flexible, and powerful. Try it out and simplify your PDF processing workflow today!\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A tool to extract PDF files to markdown, or any other format using AI",
    "version": "0.0.4",
    "project_urls": {
        "Homepage": "https://github.com/mindsdb/aipdf",
        "Repository": "https://github.com/mindsdb/aipdf.git"
    },
    "split_keywords": [
        "pdf",
        " markdown",
        " ai",
        " conversion",
        " openai"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d2af773140ec0643c11ad1d68ecb3c9fb942ffbfd46ce909bb3aa8aa2f0bc8df",
                "md5": "b3a8fdad17378f19893735e636ac1faf",
                "sha256": "65de523e9ce064982546525bf90021d217c01b8b9655ac3f855594ba52d8f271"
            },
            "downloads": -1,
            "filename": "aipdf-0.0.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "b3a8fdad17378f19893735e636ac1faf",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 5831,
            "upload_time": "2024-10-14T23:59:26",
            "upload_time_iso_8601": "2024-10-14T23:59:26.188791Z",
            "url": "https://files.pythonhosted.org/packages/d2/af/773140ec0643c11ad1d68ecb3c9fb942ffbfd46ce909bb3aa8aa2f0bc8df/aipdf-0.0.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "34f985a89f59e8cc8d10a5b4ad5f7217cc2562388036548b8bcd21a9e0836433",
                "md5": "5407e1757199c25b6dcf03761d1d2875",
                "sha256": "d0fce45c256b23c50ef13fb5095dae114364e994bf6f2ac9821aaad358fd7578"
            },
            "downloads": -1,
            "filename": "aipdf-0.0.4.tar.gz",
            "has_sig": false,
            "md5_digest": "5407e1757199c25b6dcf03761d1d2875",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 6752,
            "upload_time": "2024-10-14T23:59:27",
            "upload_time_iso_8601": "2024-10-14T23:59:27.119753Z",
            "url": "https://files.pythonhosted.org/packages/34/f9/85a89f59e8cc8d10a5b4ad5f7217cc2562388036548b8bcd21a9e0836433/aipdf-0.0.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-14 23:59:27",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "mindsdb",
    "github_project": "aipdf",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "aipdf"
}
        
Elapsed time: 0.38375s