| Field | Value |
| --- | --- |
| Name | sparql-llm |
| Version | 0.1.0 |
| download | |
| home_page | None |
| Summary | Reusable components and complete chat system to improve Large Language Models (LLMs) capabilities when generating SPARQL queries for a given set of endpoints, using Retrieval-Augmented Generation (RAG) and SPARQL query validation from the endpoint schema. |
| upload_time | 2025-10-06 09:18:48 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.10 |
| license | MIT License Copyright (c) 2024-present SIB Swiss Institute of Bioinformatics Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
| keywords | chatbot, expasy, kgqa, llm, sparql |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
<div align="center">
# ✨ SPARQL query generation with LLMs 🦜

[PyPI](https://pypi.org/project/sparql-llm/)
[Tests](https://github.com/sib-swiss/sparql-llm/actions/workflows/test.yml)
</div>
This project provides tools to enhance the capabilities of Large Language Models (LLMs) in generating [SPARQL](https://www.w3.org/TR/sparql11-overview/) queries for specific endpoints:
- an **MCP server** exposing tools that help LLMs write SPARQL queries for a set of endpoints, available at https://chat.expasy.org/mcp
- a complete **chat web service**
- **reusable components** published as the [`sparql-llm`](https://pypi.org/project/sparql-llm/) pip package
The system integrates Retrieval-Augmented Generation (RAG) and SPARQL query validation through endpoint schemas to ensure more accurate and relevant query generation on large-scale knowledge graphs.
The components are designed to work either independently or as part of a full chat-based system that can be deployed for a set of SPARQL endpoints. It **requires endpoints to include metadata** such as [SPARQL query examples](https://github.com/sib-swiss/sparql-examples) and endpoint descriptions using the [Vocabulary of Interlinked Datasets (VoID)](https://www.w3.org/TR/void/), which can be automatically generated using the [void-generator](https://github.com/JervenBolleman/void-generator).
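For example, you can quickly probe an endpoint for a VoID description before wiring it in. Below is a minimal sketch using the `SPARQLWrapper` package (not part of sparql-llm; the endpoint URL is just an example):

```python
from SPARQLWrapper import JSON, SPARQLWrapper

sparql = SPARQLWrapper("https://sparql.uniprot.org/sparql/")
sparql.setQuery("""PREFIX void: <http://rdfs.org/ns/void#>
SELECT (COUNT(*) AS ?classPartitions)
WHERE { ?dataset void:classPartition ?partition }""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
# A count of 0 suggests the endpoint lacks the VoID metadata these tools rely on
print(results["results"]["bindings"][0]["classPartitions"]["value"])
```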
## 🌈 Features
- **Metadata Extraction**: Functions to extract and load relevant metadata from SPARQL endpoints. These loaders are compatible with [LangChain](https://python.langchain.com) but are flexible enough to be used independently, providing metadata as JSON for custom vector store integration.
- **SPARQL Query Validation**: A function to automatically parse and validate federated SPARQL queries against the VoID description of the target endpoints.
- **MCP server**: Tools to help LLMs write SPARQL queries for a set of endpoints.
- **Deployable Chat System**: A reusable and containerized system for deploying an LLM-based chat service with a web UI, API, and vector database. This system helps users write SPARQL queries by leveraging endpoint metadata (WIP).
- **Live Example**: Configuration for **[expasy.org/chat](https://expasy.org/chat)**, an LLM-powered chat system supporting SPARQL query generation for endpoints maintained by the [SIB](https://www.sib.swiss/).
> [!TIP]
>
> You can quickly check if an endpoint contains the expected metadata at [sib-swiss.github.io/sparql-editor/check](https://sib-swiss.github.io/sparql-editor/check)
## 🔌 MCP server
The server exposes a [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) endpoint to access [biodata resources](https://www.expasy.org/) at the [SIB](https://www.sib.swiss/) through their [SPARQL](https://www.w3.org/TR/sparql12-query/) endpoints, such as UniProt, Bgee, OMA, SwissLipids, and Cellosaurus. It is available at **[chat.expasy.org/mcp](https://chat.expasy.org/mcp)**.
Available tools are listed below (a minimal Python client sketch follows the list):
- **📝 Retrieve relevant documents** (query examples and classes schema) to help writing SPARQL queries to access SIB biodata resources
- Arguments:
- `question` (string): the user's question
- `potential_classes` (list[string]): high level concepts and potential classes that could be found in the SPARQL endpoints
- `steps` (list[string]): split the question into smaller standalone parts if relevant
- **🏷️ Retrieve relevant classes schema** to help writing SPARQL queries to access SIB biodata resources
- Arguments:
- `classes` (list[string]): high level concepts and potential classes that could be found in the SPARQL endpoints
- 📡 **Execute a SPARQL query** against a SPARQL endpoint
- Arguments:
- `query` (string): a valid SPARQL query string
- `endpoint` (string): the SPARQL endpoint URL to execute the query against
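If you want to call these tools programmatically rather than through a chat client, here is a minimal sketch using the official MCP Python SDK (`pip install mcp`); it only lists the available tools, since the exact tool names are whatever the server reports:

```python
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client


async def main() -> None:
    # Connect to the public Expasy MCP server over streamable HTTP
    async with streamablehttp_client("https://chat.expasy.org/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Discover the tools described above and their exact names
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name, "->", tool.description)


asyncio.run(main())
```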
### 🐙 Connect client to MCP
Follow the instructions of your client, and use the URL of the public server: **https://chat.expasy.org/mcp**
For example, to add a new MCP server for GitHub Copilot in VSCode through the VSCode UI:
- [x] Open side panel chat (`ctrl+shift+i` or `cmd+shift+i`), and make sure the mode is set to `Agent` in the bottom right
- [x] Open command palette (`ctrl+shift+p` or `cmd+shift+p`), and search for `MCP: Open User Configuration`; this will open a `mcp.json` file
In VSCode `mcp.json` you should have the following:
```json
{
  "servers": {
    "expasy-mcp-server": {
      "url": "https://chat.expasy.org/mcp",
      "type": "http"
    }
  },
  "inputs": []
}
```
> [!IMPORTANT]
>
> Click `Start`, shown just above `"expasy-mcp-server"`, to start the connection to the MCP server.
>
> You can click the wrench and screwdriver button 🛠️ (`Select Tools...`) to enable/disable specific tools.
> [!NOTE]
>
> Find more details in the [official docs](https://code.visualstudio.com/docs/copilot/chat/mcp-servers).
Alternatively, you can use it with the stdio transport:
```sh
uvx sparql-llm --stdio
```
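For clients that spawn stdio servers, the equivalent VSCode `mcp.json` entry would look roughly like this (a sketch; the server name is arbitrary):

```json
{
  "servers": {
    "sparql-llm-stdio": {
      "type": "stdio",
      "command": "uvx",
      "args": ["sparql-llm", "--stdio"]
    }
  }
}
```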
## 📦️ Reusable components
### Installation
> Requires Python >=3.10
```bash
pip install sparql-llm
```
Or with `uv`:
```sh
uv add sparql-llm
```
### SPARQL query examples loader
Load SPARQL query examples defined using the SHACL ontology from a SPARQL endpoint. See **[github.com/sib-swiss/sparql-examples](https://github.com/sib-swiss/sparql-examples)** for more details on how to define the examples.
```python
from sparql_llm import SparqlExamplesLoader
loader = SparqlExamplesLoader("https://sparql.uniprot.org/sparql/")
docs = loader.load()
print(len(docs))
print(docs[0].metadata)
```
You can provide the examples as a file if they are not integrated in the endpoint, e.g.:
```python
loader = SparqlExamplesLoader("https://sparql.uniprot.org/sparql/", examples_file="uniprot_examples.ttl")
```
> Refer to the [LangChain documentation](https://python.langchain.com/v0.2/docs/) to figure out how to best integrate document loaders into your system.
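As a starting point, here is a minimal sketch indexing the loaded documents into a local vector store with LangChain (assuming the `langchain-community`, `langchain-huggingface`, and `faiss-cpu` packages; the embedding model is an arbitrary choice):

```python
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

from sparql_llm import SparqlExamplesLoader

docs = SparqlExamplesLoader("https://sparql.uniprot.org/sparql/").load()

# Embed the query examples and index them in a local FAISS vector store
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
vectorstore = FAISS.from_documents(docs, embeddings)

# Retrieve the examples most relevant to a user question, to pass as LLM context
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
for doc in retriever.invoke("List human proteins and their diseases"):
    print(doc.metadata)
```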
> [!NOTE]
>
> You can check the completeness of your examples against the endpoint schema using [this notebook](https://github.com/sib-swiss/sparql-llm/blob/main/notebooks/compare_queries_examples_to_void.ipynb).
### SPARQL endpoint schema loader
Generate a human-readable schema using the ShEx format to describe all classes of a SPARQL endpoint, based on the [VoID description](https://www.w3.org/TR/void/) present in the endpoint. Ideally, the endpoint should also contain the ontology describing the classes, so that the `rdfs:label` and `rdfs:comment` of the classes can be used to generate embeddings and improve semantic matching.
> [!TIP]
>
> Check out the **[void-generator](https://github.com/JervenBolleman/void-generator)** project to automatically generate a VoID description for your endpoint.
```python
from sparql_llm import SparqlVoidShapesLoader
loader = SparqlVoidShapesLoader("https://sparql.uniprot.org/sparql/")
docs = loader.load()
print(len(docs))
print(docs[0].metadata)
```
You can provide the VoID description as a file if it is not integrated in the endpoint, e.g.:
```python
loader = SparqlVoidShapesLoader("https://sparql.uniprot.org/sparql/", void_file="uniprot_void.ttl")
```
> The generated shapes are well-suited for use with an LLM or a human, as they provide clear information about which predicates are available for a class, and the corresponding classes or datatypes those predicates point to. Each object property references a list of classes rather than another shape, making each shape self-contained and interpretable on its own, e.g. for a *Disease Annotation* in UniProt:
>
> ```turtle
> up:Disease_Annotation {
>   a [ up:Disease_Annotation ] ;
>   up:sequence [ up:Chain_Annotation up:Modified_Sequence ] ;
>   rdfs:comment xsd:string ;
>   up:disease IRI
> }
> ```
### Generate complete ShEx shapes from VoID description
You can also generate the complete ShEx shapes for a SPARQL endpoint with:
```python
from sparql_llm import get_shex_from_void
shex_str = get_shex_from_void("https://sparql.uniprot.org/sparql/")
print(shex_str)
```
### Validate a SPARQL query based on VoID description
This function takes a SPARQL query and validates that the predicates and types used comply with the VoID description present in the SPARQL endpoint the query is executed on.
This function supports:
* federated queries (VoID description will be automatically retrieved for each SERVICE call in the query),
* path patterns (e.g. `orth:organism/obo:RO_0002162/up:scientificName`)
This function requires at least one type to be defined in the query for each endpoint, but it can infer the types of subjects that are connected to a subject whose type is defined.
It returns a list of issues described in natural language, with hints on how to fix them (by listing the available classes/predicates), which can be passed to an LLM as context to help it figure out how to fix the query.
```python
from sparql_llm import validate_sparql_with_void
sparql_query = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX orth: <http://purl.org/net/orth#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX lscr: <http://purl.org/lscr#>
PREFIX genex: <http://purl.org/genex#>
PREFIX sio: <http://semanticscience.org/resource/>
SELECT DISTINCT ?humanProtein ?orthologRatProtein ?orthologRatGene
WHERE {
    ?humanProtein a orth:Protein ;
        lscr:xrefUniprot <http://purl.uniprot.org/uniprot/Q9Y2T1> .
    ?orthologRatProtein a orth:Protein ;
        sio:SIO_010078 ?orthologRatGene ;
        orth:organism/obo:RO_0002162/up:name 'Rattus norvegicus' .
    ?cluster a orth:OrthologsCluster .
    ?cluster orth:hasHomologousMember ?node1 .
    ?cluster orth:hasHomologousMember ?node2 .
    ?node1 orth:hasHomologousMember* ?humanProtein .
    ?node2 orth:hasHomologousMember* ?orthologRatProtein .
    FILTER(?node1 != ?node2)
    SERVICE <https://www.bgee.org/sparql/> {
        ?orthologRatGene a orth:Gene ;
            genex:expressedIn ?anatEntity ;
            orth:organism ?ratOrganism .
        ?anatEntity rdfs:label 'brain' .
        ?ratOrganism obo:RO_0002162 taxon:10116 .
    }
}"""
issues = validate_sparql_with_void(sparql_query, "https://sparql.omabrowser.org/sparql/")
print("\n".join(issues))
```
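The returned issues can then be passed back to an LLM to repair the query. A minimal sketch continuing from the example above, assuming the `openai` package with an `OPENAI_API_KEY` set in the environment (the model name is an arbitrary choice):

```python
from openai import OpenAI

client = OpenAI()
if issues:
    prompt = (
        "Fix the following SPARQL query so it only uses classes and predicates "
        "available in the endpoints it targets.\n\nQuery:\n" + sparql_query
        + "\n\nValidation issues:\n" + "\n".join(issues)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)
```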
## 🚀 Complete chat system
> [!WARNING]
>
> To deploy the complete chat system right now, you will need to fork/clone this repository, change the configuration in `src/expasy-agent/src/expasy_agent/config.py` and `compose.yml`, then deploy with docker/podman compose.
>
> It can easily be adapted to use any LLM served through an OpenAI-compatible API. We plan to make configuring and deploying the complete SPARQL LLM chat system easier in the future; let us know in the GitHub issues if you are interested!
Requirements: Docker, Node.js (to build the frontend), and optionally [`uv`](https://docs.astral.sh/uv/getting-started/installation/) if you want to run scripts outside of Docker.
1. Explore and change the system configuration in `src/expasy-agent/src/expasy_agent/config.py`
2. Create a `.env` file at the root of the repository to provide secrets and API keys:
```sh
CHAT_API_KEY=NOT_SO_SECRET_API_KEY_USED_BY_FRONTEND_TO_AVOID_SPAM_FROM_CRAWLERS
LOGS_API_KEY=SECRET_PASSWORD_TO_EASILY_ACCESS_LOGS_THROUGH_THE_API
OPENAI_API_KEY=sk-proj-YYY
GROQ_API_KEY=gsk_YYY
HUGGINGFACEHUB_API_TOKEN=
TOGETHER_API_KEY=
AZURE_INFERENCE_CREDENTIAL=
AZURE_INFERENCE_ENDPOINT=https://project-id.services.ai.azure.com/models
LANGFUSE_HOST=https://cloud.langfuse.com
LANGFUSE_PUBLIC_KEY=
LANGFUSE_SECRET_KEY=
```
3. Optionally, if you made changes to it, build the chat UI webpage:
```sh
cd chat-with-context
npm i
npm run build:demo
cd ..
```
> You can change the UI around the chat in `chat-with-context/demo/index.html`
4. **Start** the vector database and web server locally for development, with code from the `src` folder mounted in the container and automatic API reload on changes to the code:
```bash
docker compose -f compose.dev.yml up
```
* Chat web UI available at http://localhost:8000
* OpenAPI Swagger UI available at http://localhost:8000/docs
* Vector database dashboard UI available at http://localhost:6333/dashboard
In production, you will need to make some changes to the `compose.yml` file to adapt it to your server/proxy setup:
```bash
docker compose up
```
> All data from the containers is stored persistently in the `data` folder (e.g. vectordb indexes).
> [!NOTE]
>
> Query the chat API:
>
> ```sh
> curl -X POST http://localhost:8000/chat -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "What is the HGNC symbol for the P68871 protein?"}], "model": "mistralai/mistral-small-latest", "stream": true}'
> ```
> [!WARNING]
>
> **Experimental entities indexing**: generating embeddings for millions of entities can take a long time, so we recommend running the indexing script on a machine with a GPU (it does not need to be a powerful one; check the [fastembed GPU docs](https://qdrant.github.io/fastembed/examples/FastEmbed_GPU/) to install the GPU drivers and dependencies).
>
> ```sh
> docker compose -f compose.dev.yml up vectordb -d
> cd src/expasy-agent
> VECTORDB_URL=http://localhost:6334 nohup uv run --extra gpu src/expasy_agent/indexing/index_entities.py --gpu &
> ```
>
> Then move the entities collection containing the embeddings into `data/qdrant/collections/entities` before starting the stack.
### 🥇 Benchmarks
There are a few benchmarks available for the system:
- The `tests/benchmark.py` script runs a list of questions and compares their results to reference SPARQL queries, with and without query validation, against a list of LLM providers. Change the list of queries if you want to use it for different endpoints, and start the stack in development mode before running it:
```sh
uv run --env-file .env src/expasy-agent/tests/benchmark.py
```
> It takes time to run and will log the output and results in `data/benchmarks`
- Follow [these instructions](src/expasy-agent/tests/text2sparql/README.md) to run the `Text2SPARQL Benchmark`.
## 🧑‍🏫 Tutorial
A step-by-step tutorial showing how an LLM-based chat system for generating SPARQL queries can be built is available at https://sib-swiss.github.io/sparql-llm
## 🧑‍💻 Contributing
Check out the [`CONTRIBUTING.md`](https://github.com/sib-swiss/sparql-llm/blob/main/CONTRIBUTING.md) page.
## 🪶 How to cite this work
If you reuse any part of this work, please cite [the arXiv paper](https://arxiv.org/abs/2410.06062):
```bibtex
@misc{emonet2024llmbasedsparqlquerygeneration,
  title={LLM-based SPARQL Query Generation from Natural Language over Federated Knowledge Graphs},
  author={Vincent Emonet and Jerven Bolleman and Severine Duvaud and Tarcisio Mendes de Farias and Ana Claudia Sima},
  year={2024},
  eprint={2410.06062},
  archivePrefix={arXiv},
  primaryClass={cs.DB},
  url={https://arxiv.org/abs/2410.06062},
}
```