Name | sparql-llm |
Version | 0.0.7 |
home_page | None |
Summary | Reusable components and complete chat system to improve Large Language Models (LLMs) capabilities when generating SPARQL queries for a given set of endpoints, using Retrieval-Augmented Generation (RAG) and SPARQL query validation from the endpoint schema. |
upload_time | 2025-02-19 09:13:29 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.9 |
license | MIT License
Copyright (c) 2024-present SIB Swiss Institute of Bioinformatics
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE. |
keywords | expasy, kgqa, llm, sparql |
VCS | GitHub (sib-swiss/sparql-llm) |
bugtrack_url | None |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# ✨ SPARQL query generation with LLMs 🦜
This project provides reusable components and functions to enhance the capabilities of Large Language Models (LLMs) in generating [SPARQL](https://www.w3.org/TR/sparql11-overview/) queries for specific endpoints. By integrating Retrieval-Augmented Generation (RAG) and SPARQL query validation through endpoint schemas, it ensures more accurate and relevant query generation on large-scale knowledge graphs.
The components are designed to work either independently or as part of a full chat-based system that can be deployed for a set of SPARQL endpoints. It **requires endpoints to include metadata** such as [SPARQL query examples](https://github.com/sib-swiss/sparql-examples) and endpoint descriptions using the [Vocabulary of Interlinked Datasets (VoID)](https://www.w3.org/TR/void/), which can be automatically generated using the [void-generator](https://github.com/JervenBolleman/void-generator).
## 🌈 Features
- **Metadata Extraction**: Functions to extract and load relevant metadata from SPARQL endpoints. These loaders are compatible with [LangChain](https://python.langchain.com) but are flexible enough to be used independently, providing metadata as JSON for custom vector store integration.
- **SPARQL Query Validation**: A function to automatically parse and validate federated SPARQL queries against the VoID description of the target endpoints.
> [!TIP]
>
> You can quickly check if an endpoint contains the expected metadata at [sib-swiss.github.io/sparql-editor/check](https://sib-swiss.github.io/sparql-editor/check)
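As a rough programmatic alternative, here is a minimal sketch (using plain `requests`, which is not part of this package) that checks whether an endpoint exposes SHACL query examples and a VoID description:
```python
import requests

def endpoint_has_metadata(endpoint: str) -> dict:
    """Check whether an endpoint exposes SHACL query examples and a VoID description."""
    checks = {
        # Only checks for SELECT query examples; sh:ask and sh:construct also exist
        "query_examples": "ASK { ?example <http://www.w3.org/ns/shacl#select> ?query }",
        "void_description": "ASK { ?dataset <http://rdfs.org/ns/void#classPartition> ?partition }",
    }
    results = {}
    for name, ask_query in checks.items():
        resp = requests.get(
            endpoint,
            params={"query": ask_query},
            headers={"Accept": "application/sparql-results+json"},
            timeout=60,
        )
        resp.raise_for_status()
        # ASK queries return a JSON object with a top-level "boolean" field
        results[name] = resp.json().get("boolean", False)
    return results

print(endpoint_has_metadata("https://sparql.uniprot.org/sparql/"))
```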
## 📦️ Reusable components
### Installation
> Requires Python >=3.9
```bash
pip install sparql-llm
```
### SPARQL query examples loader
Load SPARQL query examples defined using the SHACL ontology from a SPARQL endpoint. See **[github.com/sib-swiss/sparql-examples](https://github.com/sib-swiss/sparql-examples)** for more details on how to define the examples.
```python
from sparql_llm import SparqlExamplesLoader
loader = SparqlExamplesLoader("https://sparql.uniprot.org/sparql/")
docs = loader.load()
print(len(docs))
print(docs[0].metadata)
```
You can provide the examples as a file if they are not integrated in the endpoint, e.g.:
```python
loader = SparqlExamplesLoader("https://sparql.uniprot.org/sparql/", examples_file="uniprot_examples.ttl")
```
> Refer to the [LangChain documentation](https://python.langchain.com/v0.2/docs/) to figure out how best to integrate document loaders into your system.
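For example, here is a minimal sketch of converting the loaded documents into plain dicts for a custom vector store (assuming each document exposes the usual LangChain `page_content` and `metadata` attributes, as in the example above):
```python
from sparql_llm import SparqlExamplesLoader

docs = SparqlExamplesLoader("https://sparql.uniprot.org/sparql/").load()

# Each document carries the text to embed and its metadata, so it can be passed
# to a LangChain vector store directly, or flattened for a custom indexing pipeline:
records = [{"text": doc.page_content, **doc.metadata} for doc in docs]
print(records[0])
```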
> [!NOTE]
>
> You can check the completeness of your examples against the endpoint schema using [this notebook](https://github.com/sib-swiss/sparql-llm/blob/main/notebooks/compare_queries_examples_to_void.ipynb).
### SPARQL endpoint schema loader
Generate a human-readable schema using the ShEx format to describe all classes of a SPARQL endpoint based on the [VoID description](https://www.w3.org/TR/void/) present in the endpoint. Ideally the endpoint should also contain the ontology describing the classes, so the `rdfs:label` and `rdfs:comment` of the classes can be used to generate embeddings and improve semantic matching.
> [!TIP]
>
> Check out the **[void-generator](https://github.com/JervenBolleman/void-generator)** project to automatically generate a VoID description for your endpoint.
```python
from sparql_llm import SparqlVoidShapesLoader
loader = SparqlVoidShapesLoader("https://sparql.uniprot.org/sparql/")
docs = loader.load()
print(len(docs))
print(docs[0].metadata)
```
You can provide the VoID description as a file if it is not integrated in the endpoint, e.g.:
```python
loader = SparqlVoidShapesLoader("https://sparql.uniprot.org/sparql/", void_file="uniprot_void.ttl")
```
> The generated shapes are well-suited for use with an LLM or a human, as they provide clear information about which predicates are available for a class, and the corresponding classes or datatypes those predicates point to. Each object property references a list of classes rather than another shape, making each shape self-contained and interpretable on its own, e.g. for a *Disease Annotation* in UniProt:
>
> ```turtle
> up:Disease_Annotation {
> a [ up:Disease_Annotation ] ;
> up:sequence [ up:Chain_Annotation up:Modified_Sequence ] ;
> rdfs:comment xsd:string ;
> up:disease IRI
> }
> ```
### Generate complete ShEx shapes from VoID description
You can also generate the complete ShEx shapes for a SPARQL endpoint with:
```python
from sparql_llm import get_shex_from_void
shex_str = get_shex_from_void("https://sparql.uniprot.org/sparql/")
print(shex_str)
```
### Validate a SPARQL query based on VoID description
This takes a SPARQL query and validates that the predicates/types used are compliant with the VoID description present in the SPARQL endpoint the query is executed on.
This function supports:
* federated queries (VoID description will be automatically retrieved for each SERVICE call in the query),
* path patterns (e.g. `orth:organism/obo:RO_0002162/up:scientificName`)
This function requires at least one type to be defined for each endpoint, but it can infer the types of subjects that are connected to a subject whose type is defined.
It will return a list of issues described in natural language, with hints on how to fix them (by listing the available classes/predicates), which can be passed to an LLM as context to help it figure out how to fix the query.
```python
from sparql_llm import validate_sparql_with_void
sparql_query = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX orth: <http://purl.org/net/orth#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX lscr: <http://purl.org/lscr#>
PREFIX genex: <http://purl.org/genex#>
PREFIX sio: <http://semanticscience.org/resource/>
SELECT DISTINCT ?humanProtein ?orthologRatProtein ?orthologRatGene
WHERE {
?humanProtein a orth:Protein ;
lscr:xrefUniprot <http://purl.uniprot.org/uniprot/Q9Y2T1> .
?orthologRatProtein a orth:Protein ;
sio:SIO_010078 ?orthologRatGene ;
orth:organism/obo:RO_0002162/up:name 'Rattus norvegicus' .
?cluster a orth:OrthologsCluster .
?cluster orth:hasHomologousMember ?node1 .
?cluster orth:hasHomologousMember ?node2 .
?node1 orth:hasHomologousMember* ?humanProtein .
?node2 orth:hasHomologousMember* ?orthologRatProtein .
FILTER(?node1 != ?node2)
SERVICE <https://www.bgee.org/sparql/> {
?orthologRatGene a orth:Gene ;
genex:expressedIn ?anatEntity ;
orth:organism ?ratOrganism .
?anatEntity rdfs:label 'brain' .
?ratOrganism obo:RO_0002162 taxon:10116 .
}
}"""
issues = validate_sparql_with_void(sparql_query, "https://sparql.omabrowser.org/sparql/")
print("\n".join(issues))
```
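Since the issues are plain natural-language strings, they can be fed back to an LLM to attempt an automatic repair. Below is a minimal sketch of such a loop (continuing from the example above, so `sparql_query` and `issues` are already defined), assuming an OpenAI client configured through the `OPENAI_API_KEY` environment variable and a hypothetical model choice, none of which are part of this package:
```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

if issues:
    # Ask the LLM to rewrite the query, passing the validation issues as context
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[
            {"role": "system", "content": "Fix the given SPARQL query based on the validation issues. Reply with the corrected query only."},
            {"role": "user", "content": f"Query:\n{sparql_query}\n\nValidation issues:\n" + "\n".join(issues)},
        ],
    )
    print(response.choices[0].message.content)
```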
## 🧑‍💻 Development
This section explains how to run the package and its reusable components in development, and how to get involved by making a code contribution.
> Requirements: [`uv`](https://docs.astral.sh/uv/getting-started/installation/) to easily handle scripts and virtual environments.
### 📥️ Clone
Clone the repository:
```bash
git clone https://github.com/sib-swiss/sparql-llm
cd sparql-llm
```
### ☑️ Run tests
Make sure the existing tests still work by running the test suite and linting checks. Note that any pull request to the sparql-llm repository on GitHub will automatically trigger the test suite.
```bash
cd packages/sparql-llm
uv run pytest
```
To display all logs when debugging:
```bash
uv run pytest -s
```
### 🧹 Format code
```bash
uvx ruff format
uvx ruff check --fix
```
### ♻️ Reset the environment
Upgrade `uv`:
```sh
uv self update
```
Clean `uv` cache:
```sh
uv cache clean
```
### 🏷️ New release process
Get a PyPI API token at [pypi.org/manage/account](https://pypi.org/manage/account).
1. Increment the `version` number in the `pyproject.toml` file.
```bash
uvx hatch version fix
```
2. Build and publish:
```bash
uv build
cd ../..
uv publish
```
> If `uv publish` is still broken:
>
> ```sh
> uvx hatch build
> uvx hatch publish
> ```
Raw data
{
"_id": null,
"home_page": null,
"name": "sparql-llm",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": "Vincent Emonet <vincent.emonet@gmail.com>",
"keywords": "Expasy, KGQA, LLM, SPARQL",
"author": null,
"author_email": "Vincent Emonet <vincent.emonet@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/64/96/0dce7210fd24232641b547f4887390406b3fc720c70a91bd85161a9a79e3/sparql_llm-0.0.7.tar.gz",
"platform": null,
"description": "# \u2728 SPARQL query generation with LLMs \ud83e\udd9c\n\n[](https://pypi.org/project/sparql-llm/)\n[](https://pypi.org/project/sparql-llm/)\n[](https://github.com/sib-swiss/sparql-llm/actions/workflows/test.yml)\n\n</div>\n\nThis project provides reusable components and functions to enhance the capabilities of Large Language Models (LLMs) in generating [SPARQL](https://www.w3.org/TR/sparql11-overview/) queries for specific endpoints. By integrating Retrieval-Augmented Generation (RAG) and SPARQL query validation through endpoint schemas, it ensures more accurate and relevant query generation on large scale knowledge graphs.\n\nThe components are designed to work either independently or as part of a full chat-based system that can be deployed for a set of SPARQL endpoints. It **requires endpoints to include metadata** such as [SPARQL query examples](https://github.com/sib-swiss/sparql-examples) and endpoint descriptions using the [Vocabulary of Interlinked Datasets (VoID)](https://www.w3.org/TR/void/), which can be automatically generated using the [void-generator](https://github.com/JervenBolleman/void-generator).\n\n## \ud83c\udf08 Features\n\n- **Metadata Extraction**: Functions to extract and load relevant metadata from SPARQL endpoints. These loaders are compatible with [LangChain](https://python.langchain.com) but are flexible enough to be used independently, providing metadata as JSON for custom vector store integration.\n- **SPARQL Query Validation**: A function to automatically parse and validate federated SPARQL queries against the VoID description of the target endpoints.\n\n> [!TIP]\n>\n> You can quickly check if an endpoint contains the expected metadata at [sib-swiss.github.io/sparql-editor/check](https://sib-swiss.github.io/sparql-editor/check)\n\n## \ud83d\udce6\ufe0f Reusable components\n\n### Installation\n\n> Requires Python >=3.9\n\n```bash\npip install sparql-llm\n```\n\n### SPARQL query examples loader\n\nLoad SPARQL query examples defined using the SHACL ontology from a SPARQL endpoint. See **[github.com/sib-swiss/sparql-examples](https://github.com/sib-swiss/sparql-examples)** for more details on how to define the examples.\n\n```python\nfrom sparql_llm import SparqlExamplesLoader\n\nloader = SparqlExamplesLoader(\"https://sparql.uniprot.org/sparql/\")\ndocs = loader.load()\nprint(len(docs))\nprint(docs[0].metadata)\n```\n\nYou can provide the examples as a file if it is not integrated in the endpoint, e.g.:\n\n```python\nloader = SparqlExamplesLoader(\"https://sparql.uniprot.org/sparql/\", examples_file=\"uniprot_examples.ttl\")\n```\n\n> Refer to the [LangChain documentation](https://python.langchain.com/v0.2/docs/) to figure out how to best integrate documents loaders to your system.\n\n> [!NOTE]\n>\n> You can check the completeness of your examples against the endpoint schema using [this notebook](https://github.com/sib-swiss/sparql-llm/blob/main/notebooks/compare_queries_examples_to_void.ipynb).\n\n### SPARQL endpoint schema loader\n\nGenerate a human-readable schema using the ShEx format to describe all classes of a SPARQL endpoint based on the [VoID description](https://www.w3.org/TR/void/) present in the endpoint. 
Ideally the endpoint should also contain the ontology describing the classes, so the `rdfs:label` and `rdfs:comment` of the classes can be used to generate embeddings and improve semantic matching.\n\n> [!TIP]\n>\n> Checkout the **[void-generator](https://github.com/JervenBolleman/void-generator)** project to automatically generate VoID description for your endpoint.\n\n```python\nfrom sparql_llm import SparqlVoidShapesLoader\n\nloader = SparqlVoidShapesLoader(\"https://sparql.uniprot.org/sparql/\")\ndocs = loader.load()\nprint(len(docs))\nprint(docs[0].metadata)\n```\n\nYou can provide the VoID description as a file if it is not integrated in the endpoint, e.g.:\n\n```python\nloader = SparqlVoidShapesLoader(\"https://sparql.uniprot.org/sparql/\", void_file=\"uniprot_void.ttl\")\n```\n\n> The generated shapes are well-suited for use with a LLM or a human, as they provide clear information about which predicates are available for a class, and the corresponding classes or datatypes those predicates point to. Each object property references a list of classes rather than another shape, making each shape self-contained and interpretable on its own, e.g. for a *Disease Annotation* in UniProt:\n>\n> ```turtle\n> up:Disease_Annotation {\n> a [ up:Disease_Annotation ] ;\n> up:sequence [ up:Chain_Annotation up:Modified_Sequence ] ;\n> rdfs:comment xsd:string ;\n> up:disease IRI\n> }\n> ```\n\n### Generate complete ShEx shapes from VoID description\n\nYou can also generate the complete ShEx shapes for a SPARQL endpoint with:\n\n```python\nfrom sparql_llm import get_shex_from_void\n\nshex_str = get_shex_from_void(\"https://sparql.uniprot.org/sparql/\")\nprint(shex_str)\n```\n\n### Validate a SPARQL query based on VoID description\n\nThis takes a SPARQL query and validates the predicates/types used are compliant with the VoID description present in the SPARQL endpoint the query is executed on.\n\nThis function supports:\n\n* federated queries (VoID description will be automatically retrieved for each SERVICE call in the query),\n* path patterns (e.g. 
`orth:organism/obo:RO_0002162/up:scientificName`)\n\nThis function requires that at least one type is defined for each endpoint, but it will be able to infer types of subjects that are connected to the subject for which the type is defined.\n\nIt will return a list of issues described in natural language, with hints on how to fix them (by listing the available classes/predicates), which can be passed to an LLM as context to help it figuring out how to fix the query.\n\n```python\nfrom sparql_llm import validate_sparql_with_void\n\nsparql_query = \"\"\"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\nPREFIX up: <http://purl.uniprot.org/core/>\nPREFIX taxon: <http://purl.uniprot.org/taxonomy/>\nPREFIX orth: <http://purl.org/net/orth#>\nPREFIX obo: <http://purl.obolibrary.org/obo/>\nPREFIX lscr: <http://purl.org/lscr#>\nPREFIX genex: <http://purl.org/genex#>\nPREFIX sio: <http://semanticscience.org/resource/>\nSELECT DISTINCT ?humanProtein ?orthologRatProtein ?orthologRatGene\nWHERE {\n ?humanProtein a orth:Protein ;\n lscr:xrefUniprot <http://purl.uniprot.org/uniprot/Q9Y2T1> .\n ?orthologRatProtein a orth:Protein ;\n sio:SIO_010078 ?orthologRatGene ;\n orth:organism/obo:RO_0002162/up:name 'Rattus norvegicus' .\n ?cluster a orth:OrthologsCluster .\n ?cluster orth:hasHomologousMember ?node1 .\n ?cluster orth:hasHomologousMember ?node2 .\n ?node1 orth:hasHomologousMember* ?humanProtein .\n ?node2 orth:hasHomologousMember* ?orthologRatProtein .\n FILTER(?node1 != ?node2)\n SERVICE <https://www.bgee.org/sparql/> {\n ?orthologRatGene a orth:Gene ;\n genex:expressedIn ?anatEntity ;\n orth:organism ?ratOrganism .\n ?anatEntity rdfs:label 'brain' .\n ?ratOrganism obo:RO_0002162 taxon:10116 .\n }\n}\"\"\"\n\nissues = validate_sparql_with_void(sparql_query, \"https://sparql.omabrowser.org/sparql/\")\nprint(\"\\n\".join(issues))\n```\n\n## \ud83e\uddd1\u200d\ud83d\udcbb Development\n\nThis section is for if you want to run the package and reusable components in development, and get involved by making a code contribution.\n\n> Requirements: [`uv`](https://docs.astral.sh/uv/getting-started/installation/) to easily handle scripts and virtual environments.\n\n### \ud83d\udce5\ufe0f Clone\n\nClone the repository:\n\n```bash\ngit clone https://github.com/sib-swiss/sparql-llm\ncd sparql-llm\n```\n\n### \u2611\ufe0f Run tests\n\nMake sure the existing tests still work by running the test suite and linting checks. Note that any pull requests to the fairworkflows repository on github will automatically trigger running of the test suite;\n\n```bash\ncd packages/sparql-llm\nuv run pytest\n```\n\nTo display all logs when debugging:\n\n```bash\nuv run test -s\n```\n\n### \ud83e\uddf9 Format code\n\n```bash\nuvx ruff format\nuvx ruff check --fix\n```\n\n### \u267b\ufe0f Reset the environment\n\nUpgrade `uv`:\n\n```sh\nuv self update\n```\n\nClean `uv` cache:\n\n```sh\nuv cache clean\n```\n\n### \ud83c\udff7\ufe0f New release process\n\nGet a PyPI API token at [pypi.org/manage/account](https://pypi.org/manage/account).\n\n1. Increment the `version` number in the `pyproject.toml` file.\n\n ```bash\n uvx hatch version fix\n ```\n\n2. Build and publish:\n\n ```bash\n uv build\n cd ../..\n uv publish\n ```\n\n> If `uv publish` is still broken:\n>\n> ```sh\n> uvx hatch build\n> uvx hatch publish\n> ```\n",
"bugtrack_url": null,
"license": "MIT License\n \n Copyright (c) 2024-present SIB Swiss Institute of Bioinformatics\n \n Permission is hereby granted, free of charge, to any person obtaining a copy\n of this software and associated documentation files (the \"Software\"), to deal\n in the Software without restriction, including without limitation the rights\n to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n copies of the Software, and to permit persons to whom the Software is\n furnished to do so, subject to the following conditions:\n \n The above copyright notice and this permission notice shall be included in all\n copies or substantial portions of the Software.\n \n THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n SOFTWARE.",
"summary": "Reusable components and complete chat system to improve Large Language Models (LLMs) capabilities when generating SPARQL queries for a given set of endpoints, using Retrieval-Augmented Generation (RAG) and SPARQL query validation from the endpoint schema.",
"version": "0.0.7",
"project_urls": {
"Documentation": "https://github.com/sib-swiss/sparql-llm",
"History": "https://github.com/sib-swiss/sparql-llm/releases",
"Homepage": "https://github.com/sib-swiss/sparql-llm",
"Source": "https://github.com/sib-swiss/sparql-llm",
"Tracker": "https://github.com/sib-swiss/sparql-llm/issues"
},
"split_keywords": [
"expasy",
" kgqa",
" llm",
" sparql"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "e50a77d1af233bbc3a671b389b9d267d0e0b61ad60b0efb28c2f431d595f41c4",
"md5": "e478c08adc4c2b5bb629efe5399f23ec",
"sha256": "af1a0862682f0a0109e138b156d99f2a5dce3d9535d8001a3504b0f812d1a4e0"
},
"downloads": -1,
"filename": "sparql_llm-0.0.7-py3-none-any.whl",
"has_sig": false,
"md5_digest": "e478c08adc4c2b5bb629efe5399f23ec",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 84500,
"upload_time": "2025-02-19T09:13:31",
"upload_time_iso_8601": "2025-02-19T09:13:31.751415Z",
"url": "https://files.pythonhosted.org/packages/e5/0a/77d1af233bbc3a671b389b9d267d0e0b61ad60b0efb28c2f431d595f41c4/sparql_llm-0.0.7-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "64960dce7210fd24232641b547f4887390406b3fc720c70a91bd85161a9a79e3",
"md5": "71571043c95db6302cf07a8dfc6ed145",
"sha256": "414716edf2e1414b3ee8f411a7b9bd00b315b6fa10f4e371e60f35541205a5dc"
},
"downloads": -1,
"filename": "sparql_llm-0.0.7.tar.gz",
"has_sig": false,
"md5_digest": "71571043c95db6302cf07a8dfc6ed145",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 193330,
"upload_time": "2025-02-19T09:13:29",
"upload_time_iso_8601": "2025-02-19T09:13:29.216766Z",
"url": "https://files.pythonhosted.org/packages/64/96/0dce7210fd24232641b547f4887390406b3fc720c70a91bd85161a9a79e3/sparql_llm-0.0.7.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-02-19 09:13:29",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "sib-swiss",
"github_project": "sparql-llm",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "sparql-llm"
}