df-extract


| Field | Value |
|---|---|
| Name | df-extract |
| Version | 0.0.1 |
| Summary | Utils for decisionforce extract |
| Upload time | 2023-10-10 11:42:30 |
| Author | DecisionForce |
| License | MIT |
| Docs URL | None |
| Requirements | No requirements were recorded. |

# DF Extract Lib

## Requirements

Python 3.10+

## Installation

```shell
$ python -m pip install .
```

### 1. To extract content from `PDF`

```python
from df_extract.pdf import ExtractPDF


path = "/home/test/ABC.pdf"

extract_pdf = ExtractPDF(file_path=path)

# By default, the output is plain text
extract_pdf.extract()  # Output is written to `/home/test/ABC.pdf.txt`

# Output as JSON
extract_pdf.extract(as_json=True)  # Output is written to `/home/test/ABC.pdf.json`
```

> To change the output directory, pass the `output_dir` parameter.

```python
from df_extract.pdf import ExtractPDF


path = "/home/test/ABC.pdf"

extract_pdf = ExtractPDF(file_path=path, output_dir="/home/test/output")
extract_pdf.extract()
```
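
The comments in the examples above suggest that the output file keeps the full input file name and appends `.txt` or `.json`. The sketch below only illustrates that naming pattern; `expected_output_path` is a hypothetical helper, not part of `df_extract`, and the assumption that `output_dir` changes only the directory (not the file name) is inferred rather than confirmed.

```python
from pathlib import Path


def expected_output_path(file_path, as_json=False, output_dir=None):
    """Guess where the extracted output should land (hypothetical helper).

    Assumes the output keeps the full input file name and appends
    `.txt` or `.json`, as the comments in the examples above suggest.
    """
    source = Path(file_path)
    suffix = ".json" if as_json else ".txt"
    # Assumption: with `output_dir`, only the directory changes.
    target_dir = Path(output_dir) if output_dir else source.parent
    return target_dir / (source.name + suffix)


print(expected_output_path("/home/test/ABC.pdf"))
print(expected_output_path("/home/test/ABC.pdf", as_json=True))
print(expected_output_path("/home/test/ABC.pdf", output_dir="/home/test/output"))
```

A helper like this can be handy for locating the result after calling `extract()`, but check it against the files the library actually produces.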

#### Extract content from `PDF` with image data
> This requires [`easyocr`](https://github.com/jaidedai/easyocr)

```python
from df_extract.base import ImageExtract
from df_extract.pdf import ExtractPDF


path = "/home/test/ABC.pdf"

image_extract = ImageExtract(model_download_enabled=True)
extract_pdf = ExtractPDF(file_path=path, image_extract=image_extract)
extract_pdf.extract()
```
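
Since image extraction depends on `easyocr`, it may be worth checking that the package is importable before building an `ImageExtract`. The following is a minimal sketch using only the standard library; `ocr_available` is a hypothetical helper and not part of `df_extract`.

```python
import importlib.util


def ocr_available(module_name="easyocr"):
    """Return True if the named module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


if ocr_available():
    print("easyocr found: image text can be extracted")
else:
    print("easyocr missing: install it with `python -m pip install easyocr`")
```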

### 2. To extract content from `PPT` and `PPTx`

```python
from df_extract.pptx import ExtractPPTx


path = "/home/test/DEF.pptx"

extract_pptx = ExtractPPTx(file_path=path)

# By default, the output is plain text
extract_pptx.extract()  # Output is written to `/home/test/DEF.pptx.txt`

# Output as JSON
extract_pptx.extract(as_json=True)  # Output is written to `/home/test/DEF.pptx.json`
```

### 3. To extract content from `Doc` and `Docx`

```python
from df_extract.docx import ExtractDocx


path = "/home/test/GHI.docx"

extract_docx = ExtractDocx(file_path=path)

# By default, the output is plain text
extract_docx.extract()  # Output is written to `/home/test/GHI.docx.txt`

# Output as JSON
extract_docx.extract(as_json=True)  # Output is written to `/home/test/GHI.docx.json`
```

### 4. To extract content from `PNG`, `JPEG` and `JPG`

```python
from df_extract.image import ExtractImage


path = "/home/test/JKL.png"

extract_png = ExtractImage(file_path=path)

# By default, the output is plain text
extract_png.extract()  # Output is written to `/home/test/JKL.png.txt`

# Output as JSON
extract_png.extract(as_json=True)  # Output is written to `/home/test/JKL.png.json`
```
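
### Picking an extractor by file extension (sketch)

The four sections above follow the same pattern: import a class, pass `file_path`, call `extract()`. The sketch below shows one way to choose the class from the file extension. `EXTRACTORS`, `resolve_extractor` and `extract_any` are hypothetical names, not part of `df_extract`; the mapping of `.ppt` and `.doc` to the `PPTx` and `Docx` classes is an assumption based on the section titles.

```python
import importlib
from pathlib import Path

# Extension -> (module, class), taken from the examples above.
# Assumption: `.ppt` and `.doc` use the same classes as `.pptx` and `.docx`.
EXTRACTORS = {
    ".pdf": ("df_extract.pdf", "ExtractPDF"),
    ".ppt": ("df_extract.pptx", "ExtractPPTx"),
    ".pptx": ("df_extract.pptx", "ExtractPPTx"),
    ".doc": ("df_extract.docx", "ExtractDocx"),
    ".docx": ("df_extract.docx", "ExtractDocx"),
    ".png": ("df_extract.image", "ExtractImage"),
    ".jpeg": ("df_extract.image", "ExtractImage"),
    ".jpg": ("df_extract.image", "ExtractImage"),
}


def resolve_extractor(file_path):
    """Return the (module, class) names for a file, based on its extension."""
    suffix = Path(file_path).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or file_path}") from None


def extract_any(file_path, as_json=False):
    """Import the matching class lazily and run its `extract()` method."""
    module_name, class_name = resolve_extractor(file_path)
    extractor_cls = getattr(importlib.import_module(module_name), class_name)
    return extractor_cls(file_path=file_path).extract(as_json=as_json)
```

With this in place, `extract_any("/home/test/ABC.pdf", as_json=True)` would be equivalent to the PDF example in section 1, provided `df_extract` is installed.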

            

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "df-extract",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "",
    "author": "DecisionForce",
    "author_email": "info@decisionforce.io",
    "download_url": "https://files.pythonhosted.org/packages/80/cf/bdd6e93f37e965ca2aa9af1757484257fb7b64599e2b5b1116d2057d6ad6/df_extract-0.0.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Utils for decisionforce extract",
    "version": "0.0.1",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4d1b62e699bbddd2debb41fa724b971540cd4e096d9b1da2c21ade5942738c6f",
                "md5": "d747e2c3e366cac7b0f28f09714fcea0",
                "sha256": "89ba15dc095515de942bf969fcc7eade533ad503716af42b47aa4de3b4b8e8a9"
            },
            "downloads": -1,
            "filename": "df_extract-0.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d747e2c3e366cac7b0f28f09714fcea0",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 13698,
            "upload_time": "2023-10-10T11:42:28",
            "upload_time_iso_8601": "2023-10-10T11:42:28.835208Z",
            "url": "https://files.pythonhosted.org/packages/4d/1b/62e699bbddd2debb41fa724b971540cd4e096d9b1da2c21ade5942738c6f/df_extract-0.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "80cfbdd6e93f37e965ca2aa9af1757484257fb7b64599e2b5b1116d2057d6ad6",
                "md5": "9cf186ab4593ba614405fc0526e470d4",
                "sha256": "8e9807c1a19a43dc7b56a3b3f86c8d89098066d46a656f8c209c126536ac3ce7"
            },
            "downloads": -1,
            "filename": "df_extract-0.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "9cf186ab4593ba614405fc0526e470d4",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 11071,
            "upload_time": "2023-10-10T11:42:30",
            "upload_time_iso_8601": "2023-10-10T11:42:30.584504Z",
            "url": "https://files.pythonhosted.org/packages/80/cf/bdd6e93f37e965ca2aa9af1757484257fb7b64599e2b5b1116d2057d6ad6/df_extract-0.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-10-10 11:42:30",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "df-extract"
}
        