Name | df-extract |
Version | 0.0.1 |
home_page | |
Summary | Utils for decisionforce extract |
upload_time | 2023-10-10 11:42:30 |
maintainer | |
docs_url | None |
author | DecisionForce |
requires_python | |
license | MIT |
keywords | |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# DF Extract Lib
## Requirements
Python 3.10+
## Installation
```shell
$ python -m pip install .
```
### 1. To extract content from `PDF`
```python
from df_extract.pdf import ExtractPDF

path = "/home/test/ABC.pdf"

extract_pdf = ExtractPDF(file_path=path)

# By default, the output is plain text
extract_pdf.extract()  # Output is written to `/home/test/ABC.pdf.txt`

# Output as JSON
extract_pdf.extract(as_json=True)  # Output is written to `/home/test/ABC.pdf.json`
```
> You can change the output directory by passing the `output_dir` parameter.
```python
from df_extract.pdf import ExtractPDF
path = "/home/test/ABC.pdf"
extract_pdf = ExtractPDF(file_path=path, output_dir="/home/test/output")
extract_pdf.extract()
```
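
The comments in the examples above suggest that the output file name is the input file name with `.txt` or `.json` appended. A minimal sketch of that naming rule, which may help when locating the result afterwards, is given below. `expected_output_path` is a hypothetical helper and not part of `df_extract`; the behaviour with `output_dir` (same file name, placed in that directory) is an assumption that this README does not confirm.

```python
from pathlib import Path


def expected_output_path(file_path, as_json=False, output_dir=None):
    """Guess where the extracted output lands (hypothetical helper, not part of df_extract)."""
    source = Path(file_path)
    suffix = ".json" if as_json else ".txt"
    # Assumption: with output_dir set, the file keeps its name and only the directory changes.
    target_dir = Path(output_dir) if output_dir is not None else source.parent
    return target_dir / (source.name + suffix)


print(expected_output_path("/home/test/ABC.pdf"))
print(expected_output_path("/home/test/ABC.pdf", as_json=True, output_dir="/home/test/output"))
```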
#### Extract content from `PDF` with image data
> This requires [`easyocr`](https://github.com/jaidedai/easyocr)
```python
from df_extract.base import ImageExtract
from df_extract.pdf import ExtractPDF
path = "/home/test/ABC.pdf"
image_extract = ImageExtract(model_download_enabled=True)
extract_pdf = ExtractPDF(file_path=path, image_extract=image_extract)
extract_pdf.extract()
```
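
OCR models are usually expensive to load, so when several PDFs are processed it may be worth creating the `ImageExtract` object once and passing it to every `ExtractPDF`. The sketch below assumes that such reuse is safe, which this README does not state. `extract_many` and its `make_extractor` parameter are hypothetical names introduced for illustration only.

```python
def extract_many(paths, make_extractor, as_json=False):
    """Run extract() on every path and return {path: error} for the ones that failed.

    make_extractor is any callable that takes a file path and returns an object
    with an extract() method (hypothetical parameter, for illustration).
    """
    failures = {}
    for path in paths:
        try:
            extractor = make_extractor(str(path))
            # Only the two call forms shown in the README are used here.
            if as_json:
                extractor.extract(as_json=True)
            else:
                extractor.extract()
        except Exception as exc:  # keep going so one bad file does not stop the batch
            failures[str(path)] = exc
    return failures


# Usage with df_extract (assumes the package and easyocr are installed):
# from pathlib import Path
# from df_extract.base import ImageExtract
# from df_extract.pdf import ExtractPDF
# image_extract = ImageExtract(model_download_enabled=True)
# failures = extract_many(
#     sorted(Path("/home/test").glob("*.pdf")),
#     lambda p: ExtractPDF(file_path=p, image_extract=image_extract),
# )
```

Collecting failures instead of raising on the first one is a design choice for batch runs; drop the `try` block if a hard stop is preferred.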
### 2. To extract content from `PPT` and `PPTx`
```python
from df_extract.pptx import ExtractPPTx

path = "/home/test/DEF.pptx"

extract_pptx = ExtractPPTx(file_path=path)

# By default, the output is plain text
extract_pptx.extract()  # Output is written to `/home/test/DEF.pptx.txt`

# Output as JSON
extract_pptx.extract(as_json=True)  # Output is written to `/home/test/DEF.pptx.json`
```
### 3. To extract content from `Doc` and `Docx`
```python
from df_extract.docx import ExtractDocx

path = "/home/test/GHI.docx"

extract_docx = ExtractDocx(file_path=path)

# By default, the output is plain text
extract_docx.extract()  # Output is written to `/home/test/GHI.docx.txt`

# Output as JSON
extract_docx.extract(as_json=True)  # Output is written to `/home/test/GHI.docx.json`
```
### 4. To extract content from `PNG`, `JPEG` and `JPG`
```python
from df_extract.image import ExtractImage

path = "/home/test/JKL.png"

extract_png = ExtractImage(file_path=path)

# By default, the output is plain text
extract_png.extract()  # Output is written to `/home/test/JKL.png.txt`

# Output as JSON
extract_png.extract(as_json=True)  # Output is written to `/home/test/JKL.png.json`
```
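
The four sections above each use a different class depending on the file type. When the type is only known at run time, the choice can be driven by the file suffix, as in the sketch below. The module and class names come from the import lines above; `EXTRACTORS`, `resolve_extractor` and `build_extractor` are hypothetical names, not part of `df_extract`. That every class accepts the same keyword arguments (for example `output_dir`) is an assumption, since the README only shows it for `ExtractPDF`.

```python
import importlib
from pathlib import Path

# Suffix -> (module, class), taken from the import lines in the sections above.
EXTRACTORS = {
    ".pdf": ("df_extract.pdf", "ExtractPDF"),
    ".ppt": ("df_extract.pptx", "ExtractPPTx"),
    ".pptx": ("df_extract.pptx", "ExtractPPTx"),
    ".doc": ("df_extract.docx", "ExtractDocx"),
    ".docx": ("df_extract.docx", "ExtractDocx"),
    ".png": ("df_extract.image", "ExtractImage"),
    ".jpeg": ("df_extract.image", "ExtractImage"),
    ".jpg": ("df_extract.image", "ExtractImage"),
}


def resolve_extractor(file_path):
    """Return the (module, class) names for a file, based on its lower-cased suffix."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"Unsupported file type: {str(file_path)!r}")
    return EXTRACTORS[suffix]


def build_extractor(file_path, **kwargs):
    """Import the matching class lazily and instantiate it for file_path."""
    module_name, class_name = resolve_extractor(file_path)
    extractor_cls = getattr(importlib.import_module(module_name), class_name)
    # Assumption: all extractor classes take file_path plus the same optional keywords.
    return extractor_cls(file_path=str(file_path), **kwargs)


# Usage (requires df_extract to be installed):
# build_extractor("/home/test/GHI.docx").extract(as_json=True)
```

The import happens inside `build_extractor`, so the lookup table can be inspected without `df_extract` installed.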
Raw data
{
"_id": null,
"home_page": "",
"name": "df-extract",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "",
"author": "DecisionForce",
"author_email": "info@decisionforce.io",
"download_url": "https://files.pythonhosted.org/packages/80/cf/bdd6e93f37e965ca2aa9af1757484257fb7b64599e2b5b1116d2057d6ad6/df_extract-0.0.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Utils for decisionforce extract",
"version": "0.0.1",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "4d1b62e699bbddd2debb41fa724b971540cd4e096d9b1da2c21ade5942738c6f",
"md5": "d747e2c3e366cac7b0f28f09714fcea0",
"sha256": "89ba15dc095515de942bf969fcc7eade533ad503716af42b47aa4de3b4b8e8a9"
},
"downloads": -1,
"filename": "df_extract-0.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "d747e2c3e366cac7b0f28f09714fcea0",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 13698,
"upload_time": "2023-10-10T11:42:28",
"upload_time_iso_8601": "2023-10-10T11:42:28.835208Z",
"url": "https://files.pythonhosted.org/packages/4d/1b/62e699bbddd2debb41fa724b971540cd4e096d9b1da2c21ade5942738c6f/df_extract-0.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "80cfbdd6e93f37e965ca2aa9af1757484257fb7b64599e2b5b1116d2057d6ad6",
"md5": "9cf186ab4593ba614405fc0526e470d4",
"sha256": "8e9807c1a19a43dc7b56a3b3f86c8d89098066d46a656f8c209c126536ac3ce7"
},
"downloads": -1,
"filename": "df_extract-0.0.1.tar.gz",
"has_sig": false,
"md5_digest": "9cf186ab4593ba614405fc0526e470d4",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 11071,
"upload_time": "2023-10-10T11:42:30",
"upload_time_iso_8601": "2023-10-10T11:42:30.584504Z",
"url": "https://files.pythonhosted.org/packages/80/cf/bdd6e93f37e965ca2aa9af1757484257fb7b64599e2b5b1116d2057d6ad6/df_extract-0.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-10-10 11:42:30",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "df-extract"
}