# lemonpdf
![PyPI - Downloads](https://img.shields.io/pypi/dm/lemonpdf)
![PyPI - License](https://img.shields.io/pypi/l/lemonpdf)
![GitHub Tag](https://img.shields.io/github/v/tag/JuanBindez/lemonpdf?include_prereleases)
<a href="https://pypi.org/project/lemonpdf/"><img src="https://img.shields.io/pypi/v/lemonpdf" /></a>
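### Python 3 library for extracting URLs from PDF files.
### Install
Install the system dependencies (Tesseract OCR and Poppler) first, then the package itself:

    sudo apt install tesseract-ocr poppler-utils
    pip install lemonpdf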
### Quickstart
### Command-line interface (CLI)
#### Get URLs

    lemonpdf -u file.pdf

#### Save the URL list to a text file

    lemonpdf -u file.pdf -o urls.txt -s

#### Get domains

    lemonpdf -d file.pdf

#### Save the domains to a text file

    lemonpdf -d file.pdf -o domains.txt -s
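
If you need to call the CLI from another Python program, the commands above can be assembled programmatically. The following is only a minimal sketch: the helper names `build_lemonpdf_command` and `run_lemonpdf` are hypothetical and not part of lemonpdf, and the sketch assumes the `lemonpdf` executable is on your `PATH` and prints its results to standard output. Only the flags shown above (`-u`, `-d`, `-o`, `-s`) are used.

```python
import subprocess
from typing import List, Optional


def build_lemonpdf_command(
    pdf_path: str,
    mode: str = "urls",
    output_path: Optional[str] = None,
) -> List[str]:
    """Build the argument list for one of the CLI invocations shown above.

    Hypothetical helper (not part of lemonpdf). `mode` is "urls" (-u) or
    "domains" (-d). When `output_path` is given, `-o <path> -s` is appended,
    mirroring the "save to a text file" examples.
    """
    flags = {"urls": "-u", "domains": "-d"}
    if mode not in flags:
        raise ValueError(f"mode must be one of {sorted(flags)}, got {mode!r}")

    command = ["lemonpdf", flags[mode], pdf_path]
    if output_path is not None:
        command += ["-o", output_path, "-s"]
    return command


def run_lemonpdf(
    pdf_path: str,
    mode: str = "urls",
    output_path: Optional[str] = None,
) -> str:
    """Run the CLI and return whatever it wrote to standard output.

    Assumes `lemonpdf` is installed and on PATH; raises CalledProcessError
    if the command exits with a non-zero status.
    """
    command = build_lemonpdf_command(pdf_path, mode, output_path)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.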
### Scripts
#### Get URLs and save them to a text file
```python
from lemonpdf import Extractor

# Input PDF and the text file the results are written to.
pdf_path = 'file.pdf'
output_txt_path = 'out_file.txt'

extractor = Extractor(pdf_path=pdf_path, output_txt_path=output_txt_path)

# save=True also writes the extracted URLs to output_txt_path.
urls = extractor.extract_urls(save=True)
print(urls)
```
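#### Get domains and save them to a text file
```python
from lemonpdf import Extractor

# Input PDF and the text file the results are written to.
pdf_path = 'file.pdf'
output_txt_path = 'domains.txt'

extractor = Extractor(pdf_path=pdf_path, output_txt_path=output_txt_path)

# save=True also writes the extracted domains to output_txt_path.
domains = extractor.extract_domains(save=True)
print(domains)
```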
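
Once you have the extracted URLs, you may want to deduplicate them and group them by host. The sketch below is not part of lemonpdf: `group_urls_by_domain` is a hypothetical helper built only on the standard library, and it assumes the URLs are available as an iterable of strings (the return type of `extract_urls` is not documented here, so adapt the input as needed).

```python
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def group_urls_by_domain(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Deduplicate URLs and group them by lowercase host name.

    Hypothetical helper (not part of lemonpdf). Entries without a
    recognisable host, such as plain text or scheme-less strings,
    are skipped. First-seen order is preserved.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    seen = set()

    for url in urls:
        url = url.strip()
        if not url or url in seen:
            continue  # skip blanks and exact duplicates
        seen.add(url)

        host = urlsplit(url).hostname  # lowercased by urllib; None if absent
        if host is None:
            continue
        grouped[host].append(url)

    return dict(grouped)


if __name__ == "__main__":
    sample = [
        "https://example.com/a",
        "https://example.com/a",
        "https://docs.python.org/3/",
    ]
    for host, host_urls in group_urls_by_domain(sample).items():
        print(host, host_urls)
```

Grouping by the parsed host name rather than by string prefix keeps `http://` and `https://` links to the same site together.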
Raw data
{
"_id": null,
"home_page": null,
"name": "lemonpdf",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": "PDF, Extractor, cli, tools",
"author": "zudefoque",
"author_email": "Juan Bindez <juanbindez780@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/c6/ab/511613ea8e3f8b638c2bfa0d7b8110cf78949d2ce10677b55670c60845de/lemonpdf-2.0.0.tar.gz",
"platform": null,
"description": "# lemonpdf\n\n![PyPI - Downloads](https://img.shields.io/pypi/dm/lemonpdf)\n![PyPI - License](https://img.shields.io/pypi/l/lemonpdf)\n![GitHub Tag](https://img.shields.io/github/v/tag/JuanBindez/lemonpdf?include_prereleases)\n<a href=\"https://pypi.org/project/lemonpdf/\"><img src=\"https://img.shields.io/pypi/v/lemonpdf\" /></a>\n\n### Python3 library to get urls from PDF files.\n\n\n### Install\n sudo apt install tesseract-ocr poppler-utils\n pip install lemonpdf\n\n### Quickstart\n\n\n### Command line interface use (CLI)\n\n#### get urls\n\n lemonpdf -u file.pdf\n\n#### save urls list in file txt\n\n lemonpdf -u file.pdf -o urls.txt -s\n\n#### get domains\n\n lemonpdf -d file.pdf\n\n#### save domains in file txt\n\n lemonpdf -d file.pdf -o domains.txt -s\n\n### scripts\n\n#### get urls and save file txt\n\n```python\n\nfrom lemonpdf import Extractor\n\npdf_path = 'file.pdf'\noutput_txt_path = 'out_file.txt'\n\nextractor = Extractor(pdf_path=pdf_path, output_txt_path=output_txt_path)\n\nurls = extractor.extract_urls(save=True)\n\nprint(urls)\n\n\n```\n\n#### get domains and save file txt\n\n```python\nfrom lemonpdf import Extractor\n\npdf_path = 'file.pdf'\noutput_txt_path = 'domains.txt'\n\nextractor = Extractor(pdf_path=pdf_path, output_txt_path=output_txt_path)\n\nurls = extractor.extract_domains(save=True)\n\nprint(urls)\n\n\n```\n",
"bugtrack_url": null,
"license": "MIT license",
"summary": "Python3 library to get urls from PDF files.",
"version": "2.0.0",
"project_urls": {
"Bug Reports": "https://github.com/juanbindez/lemonpdf/issues",
"Homepage": "https://github.com/juanbindez/lemonpdf",
"Read the Docs": "http://lemonpdf.readthedocs.io/"
},
"split_keywords": [
"pdf",
" extractor",
" cli",
" tools"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "5422f99684bf557dfe3bf5cc6e5d855acacd3cbd696635d72bbafbf2492b161b",
"md5": "eeb4d3aafdf1a726607295871cdb6d56",
"sha256": "c96d225be0257320209efb0beab58232e8021b570e0363498b6ca18099e853ea"
},
"downloads": -1,
"filename": "lemonpdf-2.0.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "eeb4d3aafdf1a726607295871cdb6d56",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 4500,
"upload_time": "2024-08-07T15:04:59",
"upload_time_iso_8601": "2024-08-07T15:04:59.262955Z",
"url": "https://files.pythonhosted.org/packages/54/22/f99684bf557dfe3bf5cc6e5d855acacd3cbd696635d72bbafbf2492b161b/lemonpdf-2.0.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "c6ab511613ea8e3f8b638c2bfa0d7b8110cf78949d2ce10677b55670c60845de",
"md5": "8dafc160abc42c0527880ff19ac9db36",
"sha256": "15852f44f492e9b5a2772349c7a7afa37e0c5bb25a10bad3bf343ab2b0b54a6b"
},
"downloads": -1,
"filename": "lemonpdf-2.0.0.tar.gz",
"has_sig": false,
"md5_digest": "8dafc160abc42c0527880ff19ac9db36",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 3877,
"upload_time": "2024-08-07T15:05:00",
"upload_time_iso_8601": "2024-08-07T15:05:00.985372Z",
"url": "https://files.pythonhosted.org/packages/c6/ab/511613ea8e3f8b638c2bfa0d7b8110cf78949d2ce10677b55670c60845de/lemonpdf-2.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-07 15:05:00",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "juanbindez",
"github_project": "lemonpdf",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "packaging",
"specs": []
},
{
"name": "pdf2image",
"specs": []
},
{
"name": "pillow",
"specs": []
},
{
"name": "PyMuPDF",
"specs": []
},
{
"name": "PyMuPDFb",
"specs": []
},
{
"name": "pytesseract",
"specs": []
}
],
"lcname": "lemonpdf"
}