powerscale-rag-connector

Name	powerscale-rag-connector
Version	1.0.9
home_page	None
Summary	An open-source Python library for Dell PowerScale storage that enhances RAG application performance during data ingestion by skipping files that have already been processed.
upload_time	2025-03-10 22:24:00
maintainer	None
docs_url	None
author	None
requires_python	>=3.8
license	None
keywords	dell onefs powerscale
requirements	elasticsearch dotenv

# PowerScale RAG Connector

The PowerScale RAG Connector is an open-source Python library designed to enhance RAG application performance during data ingestion by skipping files that have already been processed. It leverages PowerScale's unique MetadataIQ capability to identify changed files within the OneFS filesystem and publish this information in an easily consumable format via Elasticsearch.

Developers can integrate the PowerScale RAG Connector directly within a LangChain RAG application as a supported document loader or use it independently as a generic Python class.
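
The idea of skipping already-processed files can be illustrated without any PowerScale or Elasticsearch dependency. The following is a minimal sketch, not the connector's implementation: `FileRecord`, `files_changed_since`, and the notion of a numeric checkpoint are hypothetical names introduced here to show the filtering step only.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical stand-in for one metadata record about a file."""
    path: str
    change_time: float  # seconds since the epoch


def files_changed_since(records: Iterable[FileRecord], checkpoint: float) -> List[str]:
    """Return paths whose change time is newer than the last ingestion run."""
    return [r.path for r in records if r.change_time > checkpoint]


records = [
    FileRecord("/ifs/data/report.pdf", change_time=100.0),
    FileRecord("/ifs/data/notes.txt", change_time=250.0),
    FileRecord("/ifs/data/slides.pptx", change_time=300.0),
]

# Pretend the previous ingestion run finished at t=200.
to_ingest = files_changed_since(records, checkpoint=200.0)
print(to_ingest)
```

Only the two files changed after the checkpoint are selected, so an ingestion pipeline built this way would not re-chunk or re-embed `report.pdf`.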

## Workflow

![Workflow showing how the PowerScale RAG Connector integrates with LangChain and NVIDIA AI Enterprise software](powerscale-rag-connector-workflow.png)

*Figure 1: Workflow showing how the PowerScale RAG Connector integrates with LangChain and NVIDIA AI Enterprise software.*


## Audience

The intended audience for this document includes software developers, machine learning scientists, and AI developers who use files stored on PowerScale when developing a RAG application.

## Overview

This guide is divided into two sections: setting up the environment and using the connector. Note that system administration privileges are required for the initial configuration on PowerScale, which may need to be performed by PowerScale administrators.

## Terminology

| Term | Definition |
|------|------------|
| RAG | Retrieval-Augmented Generation. A technique that takes an off-the-shelf large language model (LLM) and supplies it with context from data it has no knowledge of. |
| LangChain | An open-source Python and JavaScript framework that helps developers create RAG applications. |
| NVIDIA NIM Services | Part of NVIDIA AI Enterprise, a set of microservices that can optionally be used to chunk and embed files efficiently on GPUs. Their output can be stored in a vector database for a RAG framework to use. |
| NV-Ingest | An NVIDIA NIM microservice that ingests complex office documents containing tables and figures and produces chunks and embeddings to be stored in a vector database. |
| Chunking | The process of splitting a source file into smaller, context-aware pieces that can be searched and converted into vectors. For example, each paragraph of a large office document could be one chunk. |
| Embedding | Turning a chunk of data into a vector on which vector operations, such as similarity, can be performed. |
| MetadataIQ | A new feature in PowerScale OneFS 9.10 that periodically saves filesystem metadata to an external database such as Elasticsearch. |
| PowerScale RAG Connector | An open-source connector that can integrate with LangChain to improve data ingestion when data resides on PowerScale. |
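
To make the chunking and embedding terms above concrete, here is a small sketch using only the standard library. It assumes the simplest possible chunker (one chunk per paragraph) and uses hand-written toy vectors in place of real embeddings; `chunk_by_paragraph` and `cosine_similarity` are illustrative helpers, not part of the connector or of NV-Ingest.

```python
import math
from typing import List, Sequence


def chunk_by_paragraph(text: str) -> List[str]:
    """Split text on blank lines and drop empty chunks."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


document = "PowerScale stores the files.\n\nThe connector lists changed files.\n\n"
chunks = chunk_by_paragraph(document)
print(len(chunks))

# Toy vectors standing in for real embeddings.
query_vec = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 3.0]
print(cosine_similarity(query_vec, same_direction))
print(cosine_similarity(query_vec, orthogonal))
```

In a real pipeline an embedding model would produce the vectors, and the vector database would perform the similarity search at scale.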

## Installation

```bash
pip install powerscale-rag-connector
```

## Installing NVIDIA Ingest Client

To use the NVIDIA Ingest client with the PowerScale RAG Connector, you'll need to install the NVIDIA Ingest client library. This code has been tested with nv-ingest v24.12.1.

For more detailed information about the NVIDIA Ingest client library, refer to the [official NVIDIA NV-Ingest client documentation](https://github.com/NVIDIA/nv-ingest/tree/main/client).


## Usage

The PowerScale RAG Connector can be used in two ways:

1. As a LangChain document loader
2. As a standalone Python class

### Using as a LangChain Document Loader

```python
from powerscale_rag_connector import PowerScaleDocumentLoader

# Initialize the loader
loader = PowerScaleDocumentLoader(
    es_host_url="http://elasticsearch:9200",
    es_index_name="metadataiq",
    es_api_key="your-api-key",
    folder_path="/ifs/data"
)

# Load documents
documents = loader.load()
```
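
Hard-coding the API key as in the snippet above is convenient for a demo but usually undesirable in practice. One possible approach, sketched here under assumptions, is to collect the four constructor arguments from environment variables. The variable names (`PSCALE_ES_HOST_URL` and so on) and the helper `loader_kwargs_from_env` are hypothetical; only the keyword names `es_host_url`, `es_index_name`, `es_api_key`, and `folder_path` come from the example above.

```python
import os
from typing import Dict, Mapping

# Hypothetical environment variable names; choose whatever fits your deployment.
ENV_TO_KWARG = {
    "PSCALE_ES_HOST_URL": "es_host_url",
    "PSCALE_ES_INDEX_NAME": "es_index_name",
    "PSCALE_ES_API_KEY": "es_api_key",
    "PSCALE_FOLDER_PATH": "folder_path",
}


def loader_kwargs_from_env(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect loader keyword arguments from the environment, failing fast on gaps."""
    missing = [name for name in ENV_TO_KWARG if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {kwarg: env[name] for name, kwarg in ENV_TO_KWARG.items()}


# Demonstration with an in-memory mapping instead of the real environment.
example_env = {
    "PSCALE_ES_HOST_URL": "http://elasticsearch:9200",
    "PSCALE_ES_INDEX_NAME": "metadataiq",
    "PSCALE_ES_API_KEY": "your-api-key",
    "PSCALE_FOLDER_PATH": "/ifs/data",
}
kwargs = loader_kwargs_from_env(example_env)
print(sorted(kwargs))
```

The resulting dictionary could then be unpacked into the loader, for example `PowerScaleDocumentLoader(**kwargs)`.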

### Using as a Standalone Path Loader

```python
from powerscale_rag_connector import PowerScalePathLoader

# Initialize the loader
loader = PowerScalePathLoader(
    es_host_url="http://elasticsearch:9200",
    es_index_name="metadataiq",
    es_api_key="your-api-key",
    folder_path="/ifs/data"
)

# Get changed files
changed_files = loader.lazy_load()
```
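
If `lazy_load()` yields items one at a time, as the name and the LangChain loader convention suggest (an assumption, not something this README states), the results can be consumed in fixed-size batches without holding every path in memory. The sketch below uses a stand-in generator, `fake_changed_files`, in place of the real loader, and `batched` is a hypothetical helper written for this illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items without materializing the whole input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_changed_files() -> Iterator[str]:
    """Hypothetical stand-in for the iterator returned by a lazy loader."""
    for i in range(5):
        yield f"/ifs/data/file_{i}.txt"


batches = list(batched(fake_changed_files(), size=2))
for batch in batches:
    print(batch)
```

Each batch could then be handed to a downstream ingestion step, such as a chunking and embedding service.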

## Examples

Check out the [examples directory](./examples) for complete usage examples:

- [Test Environment Configuration](./examples/config.py.example)
- [PowerScale NVIngest Integration](./examples/powerscale_nvingest_example.py)

## Components

The connector consists of several modules:

- [PowerScalePathLoader](./src/PowerScalePathLoader.py): Core module for identifying changed files
- [PowerScaleDocumentLoader](./src/PowerScaleDocumentLoader.py): Custom DocumentLoader for LangChain integration
- [PowerScaleUnstructuredLoader](./src/PowerScaleUnstructuredLoader.py): Custom Loader returning Documents processed by LangChain's UnstructuredFileLoader

## Requirements

- Python 3.8+
- Elasticsearch client
- PowerScale OneFS 9.10+ with MetadataIQ configured
- LangChain (optional, for LangChain integration)

## License

[MIT](https://github.com/dell/powerscale-rag-connector/blob/main/LICENSE)

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "powerscale-rag-connector",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "Dell, OneFS, PowerScale",
    "author": null,
    "author_email": "Adam Brenner <adam.brenner@dell.com>, Michael Horgan <mike.horgan1@dell.com>",
    "download_url": "https://files.pythonhosted.org/packages/94/3e/477e131fc356b8aec77567549a259c0afad995fd0bc5a43603814a7fa8cb/powerscale_rag_connector-1.0.9.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "An open-source python library designed to enhance RAG application performance during data ingestion by skipping files that have already been processed for Dell PowerScale storage.",
    "version": "1.0.9",
    "project_urls": {
        "Homepage": "https://github.com/dell/powerscale-rag-connector",
        "Issues": "https://github.com/dell/powerscale-rag-connector/issues",
        "Repository": "https://github.com/dell/powerscale-rag-connector.git"
    },
    "split_keywords": [
        "dell",
        " onefs",
        " powerscale"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "5fb3c492508adcacf9e7b4610337736259166b5e7eb16599c5ad830bcd39f05f",
                "md5": "71092bf6af44adde228cc075f4f38e11",
                "sha256": "da6786a36d7dcf285d586d165b9e688fecc48fa1597af41405549c29132592ce"
            },
            "downloads": -1,
            "filename": "powerscale_rag_connector-1.0.9-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "71092bf6af44adde228cc075f4f38e11",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 14005,
            "upload_time": "2025-03-10T22:23:59",
            "upload_time_iso_8601": "2025-03-10T22:23:59.309508Z",
            "url": "https://files.pythonhosted.org/packages/5f/b3/c492508adcacf9e7b4610337736259166b5e7eb16599c5ad830bcd39f05f/powerscale_rag_connector-1.0.9-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "943e477e131fc356b8aec77567549a259c0afad995fd0bc5a43603814a7fa8cb",
                "md5": "94a09f172d246104c8f1d89d5b316858",
                "sha256": "973ebcb45c7f676e9df1db9ebd97bf733262e66aa056dd9c8c00f147c41ec497"
            },
            "downloads": -1,
            "filename": "powerscale_rag_connector-1.0.9.tar.gz",
            "has_sig": false,
            "md5_digest": "94a09f172d246104c8f1d89d5b316858",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 90310,
            "upload_time": "2025-03-10T22:24:00",
            "upload_time_iso_8601": "2025-03-10T22:24:00.990702Z",
            "url": "https://files.pythonhosted.org/packages/94/3e/477e131fc356b8aec77567549a259c0afad995fd0bc5a43603814a7fa8cb/powerscale_rag_connector-1.0.9.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-03-10 22:24:00",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "dell",
    "github_project": "powerscale-rag-connector",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "elasticsearch",
            "specs": [
                [
                    "==",
                    "8.17.2"
                ]
            ]
        },
        {
            "name": "dotenv",
            "specs": [
                [
                    "==",
                    "0.9.9"
                ]
            ]
        }
    ],
    "lcname": "powerscale-rag-connector"
}
