![Plot](https://github.com/louisbrulenaudet/ragoon/blob/main/thumbnail.png?raw=true)
# RAGoon : High level library for batched embeddings generation, blazingly-fast web-based RAG and quantized indexes processing ⚡
[![Python](https://img.shields.io/pypi/pyversions/tensorflow.svg)](https://badge.fury.io/py/tensorflow) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) ![Maintainer](https://img.shields.io/badge/maintainer-@louisbrulenaudet-blue)
RAGoon is a set of NLP utilities for multi-model embedding production and high-dimensional vector visualization. It also aims to improve language model performance by providing contextually relevant information through search-based querying, web scraping, and data augmentation techniques.
## Quick install
RAGoon is published on PyPI: [RAGoon](https://pypi.org/project/ragoon/).
```shell
pip install ragoon
```
## Usage
This section provides an overview of different code blocks that can be executed with RAGoon to enhance your NLP and language model projects.
### Embeddings production
The `EmbeddingsDataLoader` class loads a dataset from Hugging Face, adds embeddings computed with the specified models, and provides methods to save and upload the processed dataset.
```python
from ragoon import EmbeddingsDataLoader
from datasets import load_dataset

# Initialize the dataset loader with multiple models
loader = EmbeddingsDataLoader(
    token="hf_token",
    dataset=load_dataset("louisbrulenaudet/dac6-instruct", split="train"),  # If dataset is already loaded.
    # dataset_name="louisbrulenaudet/dac6-instruct",  # If you want to load the dataset from the class.
    model_configs=[
        {"model": "bert-base-uncased", "query_prefix": "Query:"},
        {"model": "distilbert-base-uncased", "query_prefix": "Query:"}
        # Add more model configurations as needed
    ]
)

# Uncomment this line if passing dataset_name instead of dataset.
# loader.load_dataset()

# Process the splits with all models loaded
loader.process(
    column="output",
    preload_models=True
)

# To access the processed dataset
processed_dataset = loader.get_dataset()
print(processed_dataset[0])
```
You can also embed a single text using multiple models:
```python
from ragoon import EmbeddingsDataLoader

# Initialize the dataset loader with multiple models
loader = EmbeddingsDataLoader(
    token="hf_token",
    model_configs=[
        {"model": "bert-base-uncased"},
        {"model": "distilbert-base-uncased"}
    ]
)

# Load models
loader.load_models()

# Embed a single text with all loaded models
text = "This is a single text for embedding."
embedding_result = loader.batch_encode(text)

# Output the embeddings
print(embedding_result)
```
### Similarity search and index creation
The `SimilaritySearch` class is instantiated with parameters that configure the embedding model and the search infrastructure. The chosen model, `louisbrulenaudet/tsdae-lemone-mbert-base`, is likely a multilingual BERT model fine-tuned with TSDAE (Transformer-based Sequential Denoising Auto-Encoder) on a custom dataset, which suggests a focus on multilingual capabilities and improved semantic representations.
The `cuda` device runs encoding on the GPU, which matters for efficient processing of large datasets. The embedding dimension of `768` is typical for BERT-base models, representing a balance between expressiveness and computational efficiency. The `ip` (inner product) metric is used for similarity comparisons: on normalized vectors it gives the same value as cosine similarity while skipping the per-comparison norm computation. The `i8` dtype indicates 8-bit integer quantization, which significantly reduces memory usage and speeds up similarity search at the cost of a small loss in accuracy.
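To make the point about the `ip` metric concrete, here is a minimal standard-library sketch showing that the inner product of unit-length vectors matches the cosine similarity of the original vectors. The helper names (`dot`, `l2_normalize`, `cosine`) and the toy vectors are illustrative assumptions, not part of RAGoon.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 4-dimensional vectors standing in for real embeddings
query = [0.2, -1.3, 0.7, 2.1]
doc = [1.0, 0.4, -0.5, 1.8]

q_unit, d_unit = l2_normalize(query), l2_normalize(doc)

ip_on_unit = dot(q_unit, d_unit)   # inner product after normalizing once
cos_on_raw = cosine(query, doc)    # cosine recomputes both norms every time

print(ip_on_unit, cos_on_raw)
```

Normalizing once at indexing time and then using the inner product avoids recomputing norms for every comparison, which is where the speed-up comes from.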
```python
import polars as pl
from ragoon import (
    dataset_loader,
    SimilaritySearch
)

# Load the dataset and keep a local copy
dataset = dataset_loader(
    name="louisbrulenaudet/dac6-instruct",
    streaming=False,
    split="train"
)

dataset.save_to_disk("dataset.hf")

instance = SimilaritySearch(
    model_name="louisbrulenaudet/tsdae-lemone-mbert-base",
    device="cuda",
    ndim=768,
    metric="ip",
    dtype="i8"
)

# Encode the corpus, then quantize the embeddings for each index type
embeddings = instance.encode(corpus=dataset["output"])

ubinary_embeddings = instance.quantize_embeddings(
    embeddings=embeddings,
    quantization_type="ubinary"
)

int8_embeddings = instance.quantize_embeddings(
    embeddings=embeddings,
    quantization_type="int8"
)

# Build and save both indexes
instance.create_usearch_index(
    int8_embeddings=int8_embeddings,
    index_path="./usearch_int8.index",
    save=True
)

instance.create_faiss_index(
    ubinary_embeddings=ubinary_embeddings,
    index_path="./faiss_ubinary.index",
    save=True
)

top_k_scores, top_k_indices = instance.search(
    query="Définir le rôle d'un intermédiaire concepteur conformément à l'article 1649 AE du Code général des Impôts.",
    top_k=10,
    rescore_multiplier=4
)

# Add a row index column; older Polars versions only provide with_row_count
try:
    dataframe = pl.from_arrow(dataset.data.table).with_row_index()
except AttributeError:
    dataframe = pl.from_arrow(dataset.data.table).with_row_count(
        name="index"
    )

scores_df = pl.DataFrame(
    {
        "index": top_k_indices,
        "score": top_k_scores
    }
).with_columns(
    pl.col("index").cast(pl.UInt32)
)

# Keep only the matched rows and attach their scores
search_results = dataframe.filter(
    pl.col("index").is_in(top_k_indices)
).join(
    scores_df,
    how="inner",
    on="index"
)

print(search_results)
```
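The two quantization types used above, `int8` and `ubinary`, can be illustrated with a small NumPy sketch. This is an assumption about the general technique (per-dimension linear scaling for `int8`, sign bits packed into bytes for `ubinary`), not RAGoon's actual `quantize_embeddings` implementation, which may differ. The function names and the random stand-in embeddings are hypothetical.

```python
import numpy as np

# Random float32 vectors standing in for real model embeddings
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 768)).astype(np.float32)


def quantize_int8(x):
    """Linearly map each dimension from its [min, max] range onto [-128, 127]."""
    lo = x.min(axis=0)
    hi = x.max(axis=0)
    steps = (hi - lo) / 255.0
    steps[steps == 0] = 1.0  # avoid division by zero on constant dimensions
    scaled = np.round((x - lo) / steps - 128)
    return np.clip(scaled, -128, 127).astype(np.int8)


def quantize_ubinary(x):
    """Keep only the sign of each dimension and pack 8 dimensions per byte."""
    return np.packbits(x > 0, axis=-1)


int8_embeddings = quantize_int8(embeddings)
ubinary_embeddings = quantize_ubinary(embeddings)

# Compare memory footprints of the three representations
print(embeddings.nbytes, int8_embeddings.nbytes, ubinary_embeddings.nbytes)
```

With 768-dimensional float32 inputs, the `int8` form is 4 times smaller and the packed binary form is 32 times smaller, which is the memory saving the section refers to.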
### Embeddings visualization
The `EmbeddingsVisualizer` class loads embeddings from a FAISS index, reduces their dimensionality using PCA and/or t-SNE, and visualizes them in an interactive 3D plot.
```python
from ragoon import EmbeddingsVisualizer
visualizer = EmbeddingsVisualizer(
    index_path="path/to/index",
    dataset_path="path/to/dataset"
)

visualizer.visualize(
    method="pca",
    save_html=True,
    html_file_name="embedding_visualization.html"
)
```
![Plot](https://github.com/louisbrulenaudet/ragoon/blob/main/assets/embeddings_visualization.gif?raw=true)
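For readers unfamiliar with the PCA step, the following sketch shows one common way to project high-dimensional vectors onto three axes before plotting. It uses a plain NumPy SVD and random stand-in vectors; the function name `pca_project` is hypothetical, and this is not necessarily how `EmbeddingsVisualizer` computes its projection.

```python
import numpy as np


def pca_project(vectors, n_components=3):
    """Project vectors onto their top principal components."""
    # PCA operates on mean-centered data
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Random 768-dimensional vectors standing in for real embeddings
rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 768))

points_3d = pca_project(vectors)
print(points_3d.shape)  # (200, 3)
```

Each row of `points_3d` can then be used as the x, y, z coordinates of one point in a 3D scatter plot.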
### Dynamic web search
RAGoon is a Python library that aims to improve the performance of language models by providing contextually relevant information through retrieval-based querying, web scraping, and data augmentation techniques. It integrates various APIs, enabling users to retrieve information from the web, enrich it with domain-specific knowledge, and feed it to language models for more informed responses.
RAGoon's core functionality revolves around the concept of few-shot learning, where language models are provided with a small set of high-quality examples to enhance their understanding and generate more accurate outputs. By curating and retrieving relevant data from the web, RAGoon equips language models with the necessary context and knowledge to tackle complex queries and generate insightful responses.
```python
from groq import Groq
# from openai import OpenAI
from ragoon import WebRAG
# Initialize RAGoon instance
ragoon = WebRAG(
    google_api_key="your_google_api_key",
    google_cx="your_google_cx",
    completion_client=Groq(api_key="your_groq_api_key")
)

# Search and get results
query = "I want to do a left join in Python Polars"
results = ragoon.search(
    query=query,
    completion_model="Llama3-70b-8192",
    max_tokens=512,
    temperature=1,
)
# Print results
print(results)
```
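The idea of feeding retrieved web content to a language model can be sketched as a simple prompt-assembly step. The function `build_prompt`, its `max_chars` budget, the prompt wording, and the hand-written snippets below are all illustrative assumptions; they do not describe how `WebRAG` builds its prompts internally.

```python
def build_prompt(query, snippets, max_chars=2000):
    """Place numbered retrieved snippets ahead of the user's question."""
    context_parts = []
    used = 0
    for i, snippet in enumerate(snippets, start=1):
        entry = f"[{i}] {snippet.strip()}"
        # Stop adding snippets once the character budget would be exceeded
        if used + len(entry) > max_chars:
            break
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Use the context below to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Hand-written stand-ins for text that would come from scraped pages
snippets = [
    "Polars joins are performed with DataFrame.join(other, on=..., how=...).",
    "Passing how='left' keeps every row of the left DataFrame.",
]

prompt = build_prompt("I want to do a left join in Python Polars", snippets)
print(prompt)
```

The resulting string would be sent as the user message to whichever completion client is configured.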
## Badge
Building something cool with RAGoon? Consider adding a badge to your project card.
```markdown
[<img src="https://raw.githubusercontent.com/louisbrulenaudet/ragoon/main/assets/badge.svg" alt="Built with RAGoon" width="200" height="32"/>](https://github.com/louisbrulenaudet/ragoon)
```
[<img src="https://raw.githubusercontent.com/louisbrulenaudet/ragoon/main/assets/badge.svg" alt="Built with RAGoon" width="200" height="32"/>](https://github.com/louisbrulenaudet/ragoon)
## Citing this project
If you use this code in your research, please use the following BibTeX entry.
```BibTeX
@misc{louisbrulenaudet2024,
	author = {Louis Brulé Naudet},
	title = {RAGoon : High level library for batched embeddings generation, blazingly-fast web-based RAG and quantized indexes processing},
	howpublished = {\url{https://github.com/louisbrulenaudet/ragoon}},
	year = {2024}
}
```
## Feedback
If you have any feedback, please reach out at [louisbrulenaudet@icloud.com](mailto:louisbrulenaudet@icloud.com).