embedd-all

Name: embedd-all
Version: 0.0.938
Home page: None
Summary: Embedd (docs, pdfs, excels, csv etc) -> RAG -> Query with LLMs
Upload time: 2024-09-05 06:26:56
Maintainer: None
Docs URL: None
Author: None
Requires Python: >=3.8
License: None
Keywords: None
Requirements: No requirements were recorded.
# embedd-all

`embedd-all` is a Python package that converts various document formats into text suitable for creating embedding vectors with embedding models. It extracts text from PDFs, summarizes data from Excel files, and includes functionality to build RAG (Retrieval-Augmented Generation) pipelines over documents using Voyage AI embedding models and a Pinecone vector database. Supported file formats include xlsx, csv, pdf, doc, and docx.

## Features

- **Multi-format Support**: Supports PDF, Excel (xlsx, csv), and Word (doc, docx) file processing.
- **PDF Processing**: Extracts text from each page of a PDF and returns it as an array.
- **Excel Processing**: Summarizes the data in each sheet by concatenating column names and their values into a new `df["summarized"]` column. If the Excel file contains multiple sheets, each sheet is processed and all summaries are returned.
- **RAG Creation**: Builds a RAG (Retrieval-Augmented Generation) index for documents in all supported formats using Voyage AI embedding models and stores the vectors in a Pinecone vector database.

## Installation

Install the package via pip:

```bash
pip install embedd-all
```

## Usage

### Import the package

```python
from embedd_all.embedd.index import modify_excel_for_embedding, process_pdf, pinecone_embeddings_with_voyage_ai, rag_query
```

### Example Usage

#### Processing an Excel File

The `modify_excel_for_embedding` function processes an Excel file, summarizes each row into a `summarized` column, and returns one DataFrame per sheet.

```python
import logging

from embedd_all.embedd.index import modify_excel_for_embedding

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

if __name__ == '__main__':
    # Path to the Excel file
    file_path = '/path/to/your/data.xlsx'
    context = "data"

    # Process the Excel file
    dfs = modify_excel_for_embedding(file_path=file_path, context=context)

    # Display the summarized data from the second sheet (if it exists)
    if len(dfs) > 1:
        logger.info(dfs[1].head(3))
```
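
The exact wording of the `summarized` column is internal to the package, but conceptually the per-row summary resembles the minimal pandas sketch below (the helper name and output format are illustrative assumptions, not the package's actual code):

```python
from typing import List

import pandas as pd

def summarize_sheets(file_path: str, context: str) -> List[pd.DataFrame]:
    """Hypothetical sketch of per-row summarization; the real format may differ."""
    sheets = pd.read_excel(file_path, sheet_name=None)  # dict: sheet name -> DataFrame
    dfs = []
    for _, df in sheets.items():
        # Concatenate "column: value" pairs for every row, prefixed with the extra context.
        df["summarized"] = df.apply(
            lambda row: f"{context} " + " ".join(f"{col}: {row[col]}" for col in row.index),
            axis=1,
        )
        dfs.append(df)
    return dfs
```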

#### Processing a PDF File

The `process_pdf` function extracts text from each page of a PDF file and returns it as an array.

```python
import logging

from embedd_all.embedd.index import process_pdf

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

if __name__ == '__main__':
    # Path to the PDF file
    file_path = '/path/to/your/document.pdf'

    # Process the PDF file
    texts = process_pdf(file_path)

    # Display the processed text
    logger.info("Number of pages processed: %d", len(texts))
    logger.info("Sample text from the first page: %s", texts[0])
```
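
The package does not document which PDF library it uses internally; a rough equivalent of the per-page extraction, sketched here with `pypdf` purely for illustration, would be:

```python
from pypdf import PdfReader  # illustrative only; embedd-all's internal PDF library is not documented

def pages_as_text(file_path: str):
    """Return one text string per page, mirroring what process_pdf is described to do."""
    reader = PdfReader(file_path)
    return [page.extract_text() or "" for page in reader.pages]
```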

#### Creating RAG for Documents

The `pinecone_embeddings_with_voyage_ai` function embeds documents with Voyage AI embedding models and stores the vectors in a Pinecone vector database, building the retrieval index used for RAG. It supports multiple file formats including xlsx, csv, pdf, doc, and docx.

```python
import os

from embedd_all.embedd.index import pinecone_embeddings_with_voyage_ai

# Provide your API keys, e.g. via environment variables
PINECONE_KEY = os.environ['PINECONE_KEY']
VOYAGE_API_KEY = os.environ['VOYAGE_API_KEY']

def create_rag_for_documents():
    paths = [
        '/Users/arnabbhattachargya/Desktop/flamingo_english_book.pdf',
        '/Users/arnabbhattachargya/Desktop/Data_Train.xlsx'
    ]
    vector_db_name = 'arnab-test'
    voyage_embed_model = 'voyage-2'
    embed_dimension = 1024
    pinecone_embeddings_with_voyage_ai(paths, PINECONE_KEY, VOYAGE_API_KEY, vector_db_name, voyage_embed_model, embed_dimension)

if __name__ == '__main__':
    create_rag_for_documents()
```

#### Querying with RAG

The `rag_query` function performs context-based querying using RAG (Retrieval-Augmented Generation).

```python
import os

from embedd_all.embedd.index import rag_query

# Provide your API keys, e.g. via environment variables
ANTHROPIC_API_KEY = os.environ['ANTHROPIC_API_KEY']
PINECONE_KEY = os.environ['PINECONE_KEY']
VOYAGE_API_KEY = os.environ['VOYAGE_API_KEY']

def execute_rag_query():
    CLAUDE_MODEL = "claude-3-5-sonnet-20240620"
    INDEX_NAME = 'arnab-test'
    TEMPERATURE = 0
    MAX_TOKENS = 4000
    QUERY = 'what all fuel types are there in cars?'
    SYSTEM_PROMPT = "You are a world-class document writer. Respond only with detailed descriptions and implementations. Use bullet points if necessary."
    VOYAGE_EMBED_MODEL = 'voyage-2'

    resp = rag_query(
        temperature=TEMPERATURE,
        max_tokens=MAX_TOKENS,
        anthropic_api_key=ANTHROPIC_API_KEY,
        claude_model=CLAUDE_MODEL,
        index_name=INDEX_NAME,
        pinecone_key=PINECONE_KEY,
        query=QUERY,
        system_prompt=SYSTEM_PROMPT,
        voyage_api_key=VOYAGE_API_KEY,
        voyage_embed_model=VOYAGE_EMBED_MODEL
    )

    # rag_query returns a list of text blocks from the model response
    for text_block in resp:
        print(text_block.text)

if __name__ == '__main__':
    execute_rag_query()
```

## Functions

### `modify_excel_for_embedding(file_path: str, context: str) -> list`

Processes an Excel file and summarizes the data in each sheet.

- **Parameters:**
  - `file_path` (str): Path to the Excel file.
  - `context` (str): Additional context to be added to each summary.

- **Returns:**
  - `list`: A list of DataFrames, each containing the summarized data for each sheet.

### `process_pdf(file_path: str) -> list`

Extracts text from each page of a PDF file.

- **Parameters:**
  - `file_path` (str): Path to the PDF file.

- **Returns:**
  - `list`: A list of strings, each representing the text extracted from a page.

### `pinecone_embeddings_with_voyage_ai(paths: list, PINECONE_KEY: str, VOYAGE_API_KEY: str, vector_db_name: str, voyage_embed_model: str, embed_dimension: int)`

Embeds documents with Voyage AI embedding models and stores the resulting vectors in a Pinecone vector database, creating the retrieval index for RAG. Supports various document formats including xlsx, csv, pdf, doc, and docx.

- **Parameters:**
  - `paths` (list): List of paths to documents.
  - `PINECONE_KEY` (str): Pinecone API key.
  - `VOYAGE_API_KEY` (str): Voyage AI API key.
  - `vector_db_name` (str): Name of the Pinecone vector database.
  - `voyage_embed_model` (str): Name of the Voyage AI embedding model to use.
  - `embed_dimension` (int): Dimension of the embedding vectors.
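
The internals of this function are not documented here; the sketch below shows the general embed-and-upsert flow with the Voyage AI and Pinecone clients. Chunk IDs, metadata layout, and the serverless index spec are illustrative assumptions, not the package's actual behaviour.

```python
# Rough sketch of an embed-and-upsert flow; not embedd-all's actual implementation.
import voyageai
from pinecone import Pinecone, ServerlessSpec

def embed_and_upsert(chunks, pinecone_key, voyage_api_key,
                     index_name="arnab-test", model="voyage-2", dimension=1024):
    vo = voyageai.Client(api_key=voyage_api_key)
    pc = Pinecone(api_key=pinecone_key)

    # Create the index if it does not exist yet (serverless spec is an assumption).
    if index_name not in pc.list_indexes().names():
        pc.create_index(name=index_name, dimension=dimension, metric="cosine",
                        spec=ServerlessSpec(cloud="aws", region="us-east-1"))
    index = pc.Index(index_name)

    # Embed the text chunks with Voyage AI and upsert them with their text as metadata.
    embeddings = vo.embed(chunks, model=model, input_type="document").embeddings
    vectors = [{"id": f"chunk-{i}", "values": emb, "metadata": {"text": chunk}}
               for i, (chunk, emb) in enumerate(zip(chunks, embeddings))]
    index.upsert(vectors=vectors)
```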

### `rag_query()`

Performs context-based querying using RAG (Retrieval-Augmented Generation).

- **Parameters:**
  - `temperature` (float): Sampling temperature.
  - `max_tokens` (int): Maximum number of tokens in the response.
  - `anthropic_api_key` (str): Anthropic API key.
  - `claude_model` (str): Name of the Claude model to use.
  - `index_name` (str): Name of the Pinecone index.
  - `pinecone_key` (str): Pinecone API key.
  - `query` (str): The query to perform.
  - `system_prompt` (str): The system prompt for guiding the model's response.
  - `voyage_api_key` (str): Voyage AI API key.
  - `voyage_embed_model` (str): Name of the Voyage AI embedding model to use.
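
Beyond its parameters, `rag_query` is not documented in detail; conceptually, a retrieve-then-generate query of this kind can be sketched as follows (the helper name, `top_k`, and the prompt layout are assumptions for illustration):

```python
# Rough sketch of retrieve-then-generate; not embedd-all's actual implementation.
import anthropic
import voyageai
from pinecone import Pinecone

def simple_rag_query(query, system_prompt, pinecone_key, voyage_api_key, anthropic_api_key,
                     index_name="arnab-test", claude_model="claude-3-5-sonnet-20240620",
                     embed_model="voyage-2", top_k=5, temperature=0, max_tokens=4000):
    # Embed the query and retrieve the most similar chunks from Pinecone.
    vo = voyageai.Client(api_key=voyage_api_key)
    query_vector = vo.embed([query], model=embed_model, input_type="query").embeddings[0]
    index = Pinecone(api_key=pinecone_key).Index(index_name)
    results = index.query(vector=query_vector, top_k=top_k, include_metadata=True)
    context = "\n\n".join((m.metadata or {}).get("text", "") for m in results.matches)

    # Ask Claude to answer the query using the retrieved context.
    client = anthropic.Anthropic(api_key=anthropic_api_key)
    message = client.messages.create(
        model=claude_model,
        max_tokens=max_tokens,
        temperature=temperature,
        system=system_prompt,
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}],
    )
    return message.content  # a list of text blocks, each with a .text attribute
```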

## License

This project is licensed under the MIT License - see the [LICENSE](https://github.com/Arnab28122000/embed-all/blob/main/LICENSE) file for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Contact

If you have any questions or suggestions, please open an issue or contact the maintainer.

---

Happy embedding with `embedd-all`!
            
