docp


- Name: docp
- Version: 0.2.0
- Summary: A basic document parsing and loading utility.
- Upload time: 2025-02-12 17:52:38
- Requires Python: >=3.7
- License: GNU GPL-3
- Keywords: document, library, parsing, utility, utilities
- Requirements: chromadb, langchain, langchain_community, langchain_huggingface, pandas, pdfplumber, torch, unidecode, utils4

# A basic document parsing and loading utility.

[![PyPI - Version](https://img.shields.io/pypi/v/docp?style=flat-square)](https://pypi.org/project/docp)
[![PyPI - Implementation](https://img.shields.io/pypi/implementation/docp?style=flat-square)](https://pypi.org/project/docp)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/docp?style=flat-square)](https://pypi.org/project/docp)
[![PyPI - Status](https://img.shields.io/pypi/status/docp?style=flat-square)](https://pypi.org/project/docp)
[![Static Badge](https://img.shields.io/badge/tests-pending-orange?style=flat-square)](https://pypi.org/project/docp)
[![Static Badge](https://img.shields.io/badge/code_coverage-pending-orange?style=flat-square)](https://pypi.org/project/docp)
[![Static Badge](https://img.shields.io/badge/pylint_analysis-100%25-brightgreen?style=flat-square)](https://pypi.org/project/docp)
[![Documentation Status](https://readthedocs.org/projects/docp/badge/?version=latest&style=flat-square)](https://docp.readthedocs.io/en/latest/)
[![PyPI - License](https://img.shields.io/pypi/l/docp?style=flat-square)](https://opensource.org/license/gpl-3-0)
[![PyPI - Wheel](https://img.shields.io/pypi/wheel/docp?style=flat-square)](https://pypi.org/project/docp)

In its simplest form, the ``docp`` project is a (doc)ument (p)arsing library.

Written in Python (for the CPython implementation), the project wraps various lower-level libraries to consolidate the parsing of binary document structures into a single library. It also includes [document loaders](#loaders), which load a parsed document's embeddings into a Chroma vector database for use with RAG-enabled LLMs.


## Installation
The easiest way to install ``docp`` is with ``pip``, *after* activating your virtual environment:
    
    pip install docp

Additional (older) releases can be found either at [PyPI](https://pypi.org/project/docp/#history) or in [GitHub Releases](https://github.com/s3dev/docp/releases).
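
To confirm which release ended up in the active environment, a small check along the following lines can help. This is a sketch only: the helper name `installed_version` is made up for illustration, and it relies on the standard library's `importlib.metadata`, which assumes Python 3.8 or later.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_version(package):
    """Return the installed version string of ``package``, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("docp")
    if version:
        print(f"docp {version} is installed")
    else:
        print("docp is not installed in this environment")
```

Returning `None` rather than raising keeps the check usable in setup scripts that need to decide whether to run ``pip install``.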

### A note on the installation of dependencies
To keep the installation dependencies to a minimum, only core libraries are required for installation. This means the parser-specific and loader libraries are *not* installed automatically as part of the ``pip install`` command.

If you import a parser whose required library is not installed, you'll be notified with an easy-to-read message listing the missing dependency or dependencies.

The rationale behind this design decision is that not all users will need the document *loading* capability, so ``torch``, ``langchain``, etc. should not be installed automatically. For example, if your project requires a simple PDF parser, you don't need to (and likely don't want to) 'clutter' your environment with something as heavy as ``torch``, nor make your project dependent on it.
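
The optional-dependency pattern described above can be sketched roughly as follows. This is not ``docp``'s actual implementation; the function names `missing_dependencies` and `require` are hypothetical, and the sketch assumes each dependency's import name matches the name passed in (which is not true for every package).

```python
import importlib.util


def missing_dependencies(required):
    """Return the top-level module names in ``required`` that cannot be found."""
    return [name for name in required if importlib.util.find_spec(name) is None]


def require(feature, required):
    """Raise an ImportError with a readable message if any dependency is missing."""
    missing = missing_dependencies(required)
    if missing:
        noun = "dependency" if len(missing) == 1 else "dependencies"
        raise ImportError(
            f"{feature} needs the following {noun}: {', '.join(missing)}. "
            f"Install with: pip install {' '.join(missing)}"
        )
```

A parser module could call something like `require("PDF parsing", ["pdfplumber"])` at import time, so that users who never touch that parser never need the library.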


## The Toolset

### Parsers
As of this release, parsers for the following binary document types are supported:

- PDF
- MS PowerPoint (PPTX)
- (more coming soon)
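
When a project handles both supported types, the parser can be chosen from the file suffix. The sketch below is an illustration under assumptions: `PARSER_BY_SUFFIX`, `parser_name_for` and `open_document` are hypothetical helpers, not part of ``docp``. Only the class names ``PDFParser`` and ``PPTXParser`` and the ``path`` argument are taken from the Quickstart.

```python
from pathlib import Path

# Lowercase file suffix -> name of the docp parser class used in the Quickstart.
PARSER_BY_SUFFIX = {".pdf": "PDFParser", ".pptx": "PPTXParser"}


def parser_name_for(path):
    """Return the docp parser class name matching the suffix of ``path``."""
    suffix = Path(path).suffix.lower()
    try:
        return PARSER_BY_SUFFIX[suffix]
    except KeyError:
        label = suffix or "files without a suffix"
        raise ValueError(f"No parser available for {label}") from None


def open_document(path):
    """Instantiate the matching parser (needs docp and that parser's dependencies)."""
    import docp  # imported lazily so the lookup above works without docp installed

    return getattr(docp, parser_name_for(path))(path=str(path))
```

Keeping the lookup separate from the instantiation means the mapping can be tested, or extended as new parsers arrive, without importing any heavy libraries.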

### Loaders
In addition to document parsing, document *loading* functionality is built in: documents can be loaded into a [Chroma](https://www.trychroma.com) vector database for RAG-enabled LLM ingestion.

For example, you may wish to load a series of PDF files into a vector database which serves as the backend for a RAG-enabled LLM chatbot. The ``ChromaLoader`` class is specifically designed for this. A single call to the class's loader method performs file retrieval, parsing, splitting, embedding and storage.
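
To make the splitting, embedding and storage stages more concrete, here is a toy, standard-library-only sketch. It is deliberately *not* how ``ChromaLoader`` works internally: the bag-of-words "embedding" and the `InMemoryStore` class are stand-ins invented for illustration, where a real loader would use an embedding model and a Chroma collection. The text is assumed to have been retrieved and parsed already.

```python
import math
import re
from collections import Counter


def split_text(text, chunk_size=200, overlap=50):
    """Split ``text`` into ``chunk_size``-character chunks overlapping by ``overlap``."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    stop = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


def embed(chunk):
    """Stand-in embedding: word counts (a real loader would call a model here)."""
    return Counter(re.findall(r"[a-z0-9]+", chunk.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryStore:
    """Toy replacement for a vector database: keeps chunks and vectors in a list."""

    def __init__(self):
        self.records = []

    def add(self, source, chunks):
        """Embed and store each chunk together with its source and position."""
        for index, chunk in enumerate(chunks):
            self.records.append(
                {"source": source, "index": index, "text": chunk, "vector": embed(chunk)}
            )

    def query(self, question, k=1):
        """Return the text of the ``k`` stored chunks most similar to ``question``."""
        vector = embed(question)
        ranked = sorted(
            self.records, key=lambda r: cosine(vector, r["vector"]), reverse=True
        )
        return [record["text"] for record in ranked[:k]]
```

The overlap between consecutive chunks is a common choice in RAG pipelines, since it reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.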

For further detail and usage examples, please refer to the project's [documentation](https://docp.readthedocs.io/).


## Using the Library
The documentation suite contains detailed explanations and example usage for each of the library's importable modules. For usage examples and links to the source code itself, please refer to the [Library API](https://docp.readthedocs.io/en/latest/library.html) page in the documentation.

### Quickstart
For convenience, here are a couple of examples of how to parse the supported document types.

**Extract text from a PDF file:**

    >>> from docp import PDFParser

    >>> pdf = PDFParser(path='/path/to/myfile.pdf')
    >>> pdf.extract_text()

    # Access the content of page 1.
    >>> pg1 = pdf.doc.pages[1].content

**Extract text from a PowerPoint presentation:**

    >>> from docp import PPTXParser

    >>> pptx = PPTXParser(path='/path/to/myfile.pptx')
    >>> pptx.extract_text()

    # Access the text on slide 1.
    >>> slide1 = pptx.doc.slides[1].content


            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "docp",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "document, library, parsing, utility, utilities",
    "author": null,
    "author_email": "The Developers <development@s3dev.uk>",
    "download_url": "https://files.pythonhosted.org/packages/5c/e1/ac074f5dc568c5fcbd57b86fbabc5dc4919e989bac85d31fabf0d9cf11ec/docp-0.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "GNU GPL-3",
    "summary": "A basic document parsing and loading utility.",
    "version": "0.2.0",
    "project_urls": {
        "Documentation": "https://docp.readthedocs.io",
        "Homepage": "https://github.com/s3dev/docp",
        "Repository": "https://github.com/s3dev/docp"
    },
    "split_keywords": [
        "document",
        " library",
        " parsing",
        " utility",
        " utilities"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ccb0e41a74b290a5cc17fdcb906e5f47c597a862596a81907595e9c71814ff95",
                "md5": "e23278c6a34656fab49b7f0b1812d50a",
                "sha256": "833dd9bb4ae5167bce5dae9b90467e9a4558467c80522758efbf9e6762ac7652"
            },
            "downloads": -1,
            "filename": "docp-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e23278c6a34656fab49b7f0b1812d50a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 68139,
            "upload_time": "2025-02-12T17:52:25",
            "upload_time_iso_8601": "2025-02-12T17:52:25.830073Z",
            "url": "https://files.pythonhosted.org/packages/cc/b0/e41a74b290a5cc17fdcb906e5f47c597a862596a81907595e9c71814ff95/docp-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5ce1ac074f5dc568c5fcbd57b86fbabc5dc4919e989bac85d31fabf0d9cf11ec",
                "md5": "787fb4c302aded6346de35d4cbc40391",
                "sha256": "6e3255de5a8b45de9e0a5e1ff0cc57c7198fc17acbd692b9c738c1a4aa5ae120"
            },
            "downloads": -1,
            "filename": "docp-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "787fb4c302aded6346de35d4cbc40391",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 8351207,
            "upload_time": "2025-02-12T17:52:38",
            "upload_time_iso_8601": "2025-02-12T17:52:38.234579Z",
            "url": "https://files.pythonhosted.org/packages/5c/e1/ac074f5dc568c5fcbd57b86fbabc5dc4919e989bac85d31fabf0d9cf11ec/docp-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-02-12 17:52:38",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "s3dev",
    "github_project": "docp",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "chromadb",
            "specs": [
                [
                    "==",
                    "0.6.3"
                ]
            ]
        },
        {
            "name": "langchain",
            "specs": [
                [
                    "==",
                    "0.3.17"
                ]
            ]
        },
        {
            "name": "langchain_community",
            "specs": [
                [
                    "==",
                    "0.3.16"
                ]
            ]
        },
        {
            "name": "langchain_huggingface",
            "specs": [
                [
                    "==",
                    "0.1.2"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    "==",
                    "2.2.3"
                ]
            ]
        },
        {
            "name": "pdfplumber",
            "specs": [
                [
                    "==",
                    "0.11.5"
                ]
            ]
        },
        {
            "name": "torch",
            "specs": [
                [
                    "==",
                    "2.6.0"
                ]
            ]
        },
        {
            "name": "unidecode",
            "specs": [
                [
                    "==",
                    "1.3.8"
                ]
            ]
        },
        {
            "name": "utils4",
            "specs": [
                [
                    "==",
                    "1.7.0"
                ]
            ]
        }
    ],
    "lcname": "docp"
}
        