Name | aipdf JSON |
Version |
0.0.4
JSON |
| download |
home_page | None |
Summary | A tool to extract PDF files to markdown, or any other format using AI |
upload_time | 2024-10-14 23:59:27 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.7 |
license | MIT |
keywords |
pdf
markdown
ai
conversion
openai
|
VCS |
|
bugtrack_url |
|
requirements |
No requirements were recorded.
|
Travis-CI |
No Travis.
|
coveralls test coverage |
No coveralls.
|
# AIPDF: Simple PDF OCR with GPT-like Multimodal Models
Screw traditional OCRs or heavy libraries to get data from PDFs, GenAI does a better job!
AIPDF is a stand-alone, minimalistic, yet powerful pure Python library that leverages multi-modal gen AI models (OpenAI, llama3 or compatible alternatives) to extract data from PDFs and convert it into various formats such as Markdown or JSON.
## Installation
```bash
pip install aipdf
```
in macOS you will need to install poppler
```bash
brew install poppler
```
## Quick Start
```python
from aipdf import ocr
# Your OpenAI API key
api_key = 'your_openai_api_key'
file = open('somepdf.pdf', 'rb')
markdown_pages = ocr(file, api_key)
```
## Ollama
You can use with any ollama multi-modal models
```python
ocr(pdf_file, api_key='ollama', model="llama3.2", base_url= 'http://localhost:11434/v1', prompt=...)
```
## Any file system
We chose that you pass a file object, because that way it is flexible for you to use this with any type of file system, s3, localfiles, urls etc
### From url
```python
pdf_file = io.BytesIO(requests.get('https://arxiv.org/pdf/2410.02467').content)
# extract
pages = ocr(pdf_file, api_key, prompt="extract tables, return each table in json")
```
### From S3
```python
s3 = boto3.client('s3', config=Config(signature_version='s3v4'),
aws_access_key_id=access_token,
aws_secret_access_key='', # Not needed for token-based auth
aws_session_token=access_token)
pdf_file = io.BytesIO(s3.get_object(Bucket=bucket_name, Key=object_key)['Body'].read())
# extract
pages = ocr(pdf_file, api_key, prompt="extract charts data, turn it into tables that represent the variables in the chart")
```
## Why AIPDF?
1. **Simplicity**: AIPDF provides a straightforward function, it requires minimal setup, dependencies and configuration.
2. **Flexibility**: Extract data into Markdown, JSON, HTML, YAML, whatever... file format and schema.
3. **Power of AI**: Leverages state-of-the-art multi modal models (gpt, llama, ..).
4. **Customizable**: Tailor the extraction process to your specific needs with custom prompts.
5. **Efficient**: Utilizes parallel processing for faster extraction of multi-page PDFs.
## Requirements
- Python 3.7+
We will keep this super clean, only 3 required libraries:
- openai library to talk to completion endpoints
- pdf2image library (for PDF to image conversion)
- Pillow (PIL) library
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Support
If you encounter any problems or have any questions, please open an issue on the GitHub repository.
---
AIPDF makes PDF data extraction simple, flexible, and powerful. Try it out and simplify your PDF processing workflow today!
Raw data
{
"_id": null,
"home_page": null,
"name": "aipdf",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": "pdf, markdown, ai, conversion, openai",
"author": null,
"author_email": "Jorge Torres <support@mindsdb.com>",
"download_url": "https://files.pythonhosted.org/packages/34/f9/85a89f59e8cc8d10a5b4ad5f7217cc2562388036548b8bcd21a9e0836433/aipdf-0.0.4.tar.gz",
"platform": null,
"description": "# AIPDF: Simple PDF OCR with GPT-like Multimodal Models\n\nScrew traditional OCRs or heavy libraries to get data from PDFs, GenAI does a better job!\n\nAIPDF is a stand-alone, minimalistic, yet powerful pure Python library that leverages multi-modal gen AI models (OpenAI, llama3 or compatible alternatives) to extract data from PDFs and convert it into various formats such as Markdown or JSON. \n\n## Installation\n\n```bash\npip install aipdf\n```\n\nin macOS you will need to install poppler\n```bash\nbrew install poppler \n```\n\n## Quick Start\n\n\n\n```python\nfrom aipdf import ocr\n\n# Your OpenAI API key \napi_key = 'your_openai_api_key'\n\nfile = open('somepdf.pdf', 'rb')\nmarkdown_pages = ocr(file, api_key)\n\n```\n\n## Ollama\n\nYou can use with any ollama multi-modal models \n\n```python\nocr(pdf_file, api_key='ollama', model=\"llama3.2\", base_url= 'http://localhost:11434/v1', prompt=...)\n```\n## Any file system\n\nWe chose that you pass a file object, because that way it is flexible for you to use this with any type of file system, s3, localfiles, urls etc\n\n### From url\n```python\n\npdf_file = io.BytesIO(requests.get('https://arxiv.org/pdf/2410.02467').content)\n\n# extract\npages = ocr(pdf_file, api_key, prompt=\"extract tables, return each table in json\")\n\n```\n### From S3\n\n```python\n\ns3 = boto3.client('s3', config=Config(signature_version='s3v4'),\n aws_access_key_id=access_token,\n aws_secret_access_key='', # Not needed for token-based auth\n aws_session_token=access_token)\n\n\npdf_file = io.BytesIO(s3.get_object(Bucket=bucket_name, Key=object_key)['Body'].read())\n# extract \npages = ocr(pdf_file, api_key, prompt=\"extract charts data, turn it into tables that represent the variables in the chart\")\n```\n\n\n## Why AIPDF?\n\n1. **Simplicity**: AIPDF provides a straightforward function, it requires minimal setup, dependencies and configuration.\n2. **Flexibility**: Extract data into Markdown, JSON, HTML, YAML, whatever... file format and schema.\n3. **Power of AI**: Leverages state-of-the-art multi modal models (gpt, llama, ..).\n4. **Customizable**: Tailor the extraction process to your specific needs with custom prompts.\n5. **Efficient**: Utilizes parallel processing for faster extraction of multi-page PDFs.\n\n## Requirements\n\n- Python 3.7+\n\nWe will keep this super clean, only 3 required libraries:\n\n- openai library to talk to completion endpoints\n- pdf2image library (for PDF to image conversion)\n- Pillow (PIL) library\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n## Support\n\nIf you encounter any problems or have any questions, please open an issue on the GitHub repository.\n\n---\n\nAIPDF makes PDF data extraction simple, flexible, and powerful. Try it out and simplify your PDF processing workflow today!\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "A tool to extract PDF files to markdown, or any other format using AI",
"version": "0.0.4",
"project_urls": {
"Homepage": "https://github.com/mindsdb/aipdf",
"Repository": "https://github.com/mindsdb/aipdf.git"
},
"split_keywords": [
"pdf",
" markdown",
" ai",
" conversion",
" openai"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "d2af773140ec0643c11ad1d68ecb3c9fb942ffbfd46ce909bb3aa8aa2f0bc8df",
"md5": "b3a8fdad17378f19893735e636ac1faf",
"sha256": "65de523e9ce064982546525bf90021d217c01b8b9655ac3f855594ba52d8f271"
},
"downloads": -1,
"filename": "aipdf-0.0.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b3a8fdad17378f19893735e636ac1faf",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 5831,
"upload_time": "2024-10-14T23:59:26",
"upload_time_iso_8601": "2024-10-14T23:59:26.188791Z",
"url": "https://files.pythonhosted.org/packages/d2/af/773140ec0643c11ad1d68ecb3c9fb942ffbfd46ce909bb3aa8aa2f0bc8df/aipdf-0.0.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "34f985a89f59e8cc8d10a5b4ad5f7217cc2562388036548b8bcd21a9e0836433",
"md5": "5407e1757199c25b6dcf03761d1d2875",
"sha256": "d0fce45c256b23c50ef13fb5095dae114364e994bf6f2ac9821aaad358fd7578"
},
"downloads": -1,
"filename": "aipdf-0.0.4.tar.gz",
"has_sig": false,
"md5_digest": "5407e1757199c25b6dcf03761d1d2875",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 6752,
"upload_time": "2024-10-14T23:59:27",
"upload_time_iso_8601": "2024-10-14T23:59:27.119753Z",
"url": "https://files.pythonhosted.org/packages/34/f9/85a89f59e8cc8d10a5b4ad5f7217cc2562388036548b8bcd21a9e0836433/aipdf-0.0.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-14 23:59:27",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "mindsdb",
"github_project": "aipdf",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "aipdf"
}