# RAGStack Knowledge Store
A hybrid knowledge store combining vector similarity search with explicit edges between chunks.
## Usage
1. Pre-process your documents to populate `metadata` information.
1. Create a Hybrid `KnowledgeStore` and add your LangChain `Document`s.
1. Retrieve documents from the `KnowledgeStore`.
### Populate Metadata
The Knowledge Store makes use of the following metadata fields on each `Document`:
- `content_id`: If assigned, this specifies the unique ID of the `Document`.
If not assigned, one will be generated.
Set this if you may re-ingest the same document, so that it is overwritten rather than duplicated.
- `link_tags`: A set of `LinkTag`s indicating how this node should be linked to other nodes.
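The pre-processing step can be sketched as follows. The `Doc` dataclass here is a hypothetical stand-in for a LangChain `Document`; the point is assigning a stable `content_id` (falling back to a generated one) before ingestion:

```python
import uuid
from dataclasses import dataclass, field

# Hypothetical stand-in for a LangChain Document.
@dataclass
class Doc:
    page_content: str
    metadata: dict = field(default_factory=dict)

def assign_content_ids(docs):
    """Give each doc a stable content_id so re-ingestion overwrites it."""
    for doc in docs:
        if "content_id" not in doc.metadata:
            # Prefer a stable source identifier; otherwise generate one.
            doc.metadata["content_id"] = doc.metadata.get("source") or uuid.uuid4().hex
    return docs

docs = [
    Doc("Intro page", {"source": "https://example.com/intro"}),
    Doc("Untracked snippet"),
]
assign_content_ids(docs)
```

Documents loaded from a known URL keep that URL as their ID; anything else gets a generated one.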
#### Hyperlinks
To connect nodes based on hyperlinks, you can use the `HtmlLinkEdgeExtractor` as shown below:
```python
from ragstack_knowledge_store.langchain.extractors import HtmlLinkEdgeExtractor
html_link_extractor = HtmlLinkEdgeExtractor()
for doc in documents:
    doc.metadata["content_id"] = doc.metadata["source"]

    # Add link tags from the page_content to the metadata.
    # Should be passed the HTML content as a string or BeautifulSoup.
    html_link_extractor.extract_one(doc, doc.page_content)
```
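`HtmlLinkEdgeExtractor` turns the hyperlinks in a page into link tags. Conceptually it does something like this stdlib-only sketch; the real extractor's internals and its `LinkTag` representation may differ:

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect outgoing hrefs from <a> tags, roughly what link extraction needs."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

html = '<p>See <a href="/docs">docs</a> and <a href="https://example.com">home</a>.</p>'
parser = HrefCollector()
parser.feed(html)
print(parser.hrefs)  # the outgoing links that become edges
```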
### Store
```python
import cassio
from langchain_openai import OpenAIEmbeddings
from ragstack_knowledge_store import KnowledgeStore
cassio.init(auto=True)
knowledge_store = KnowledgeStore(embeddings=OpenAIEmbeddings())
# Store the documents
knowledge_store.add_documents(documents)
```
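The value of stable `content_id`s shows up here: re-running ingestion replaces existing rows instead of duplicating them. A toy dict-based upsert illustrates the behavior (the actual store persists to Cassandra via `cassio`):

```python
# Toy upsert keyed by content_id, mimicking re-ingestion semantics.
store = {}

def add(docs):
    for doc in docs:
        store[doc["metadata"]["content_id"]] = doc  # overwrite, don't duplicate

doc_v1 = {"page_content": "v1", "metadata": {"content_id": "page-1"}}
doc_v2 = {"page_content": "v2", "metadata": {"content_id": "page-1"}}

add([doc_v1])
add([doc_v2])  # same content_id: replaces v1, leaving a single entry
```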
### Retrieve
```python
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o")
# Retrieve and generate using the relevant snippets.
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
# Depth 0 - don't traverse edges; equivalent to vector-only retrieval.
# Depth 1 - vector search plus one level of edges.
retriever = knowledge_store.as_retriever(k=4, depth=1)
template = """You are a helpful technical support bot. You should provide complete answers explaining the options the user has available to address their problem. Answer the question based only on the following context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
def format_docs(docs):
    formatted = "\n\n".join(f"From {doc.metadata['content_id']}: {doc.page_content}" for doc in docs)
    return formatted
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
```
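The `depth` parameter can be pictured as a bounded breadth-first expansion: start from the top-k vector matches, then follow outgoing edges for up to `depth` hops. A stdlib-only sketch of that idea (the store's real traversal is its own implementation):

```python
from collections import deque

def traverse(seeds, edges, depth):
    """Collect nodes reachable from `seeds` in at most `depth` hops."""
    seen = set(seeds)
    queue = deque((node, 0) for node in seeds)
    while queue:
        node, dist = queue.popleft()
        if dist == depth:
            continue  # hop budget exhausted for this branch
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return seen

edges = {"a": ["b"], "b": ["c"]}
print(traverse(["a"], edges, depth=0))  # just the vector hits
print(traverse(["a"], edges, depth=1))  # plus one hop of links
```

With `depth=0` only the seed nodes come back; each extra level pulls in the next ring of linked chunks.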
## Development
```shell
poetry install --with=dev
# Run Tests
poetry run pytest
```