# SRTK: Subgraph Retrieval Toolkit
[](https://pypi.org/project/srtk/)
[](https://srtk.readthedocs.io/en/latest/?badge=latest)
[](https://github.com/happen2me/subgraph-retrieval-toolkit/actions/workflows/pytest.yml)
[](https://opensource.org/licenses/MIT)
[](https://zenodo.org/badge/latestdoi/622648166)
**SRTK** is a toolkit for retrieving semantically relevant subgraphs from large-scale knowledge graphs. It currently supports Wikidata, Freebase and DBpedia.
A minimal walkthrough of the retrieval process:

<img width="400" src="https://i.imgur.com/jG7nZuo.png" alt="Visualized subgraph"/>
## Prerequisite
### Installations
```bash
pip install srtk
```
### Local Deployment of Knowledge Graphs
- [Setup Wikidata locally](https://srtk.readthedocs.io/en/latest/setups/setup_wikidata.html)
- [Setup Freebase locally](https://srtk.readthedocs.io/en/latest/setups/setup_freebase.html)
- [Setup DBpedia locally](https://github.com/dbpedia/virtuoso-sparql-endpoint-quickstart)
## Usage
SRTK provides five subcommands that cover the whole subgraph-retrieval pipeline.
For retrieval:
- `srtk link`: Link entity mentions in texts to a knowledge graph. Currently Wikidata and DBpedia are supported out of the box.
- `srtk retrieve`: Retrieve semantic-relevant subgraphs from a knowledge graph with a trained retriever. It can also be used to evaluate a trained retriever.
- `srtk visualize`: Visualize retrieved subgraphs using a graph visualization tool.
For training a retriever:
- `srtk preprocess`: Preprocess a dataset for training a subgraph retrieval model.
- `srtk train`: Train a subgraph retrieval model on a preprocessed dataset.
Use `srtk [subcommand] --help` to see the detailed usage of each subcommand.
## A Tour of SRTK
### Retrieve Subgraphs
#### Retrieve subgraphs with a trained scorer
```bash
srtk retrieve [-h] -i INPUT -o OUTPUT [-e SPARQL_ENDPOINT] -kg {freebase,wikidata,dbpedia}
-m SCORER_MODEL_PATH [--beam-width BEAM_WIDTH] [--max-depth MAX_DEPTH]
[--evaluate] [--include-qualifiers]
```
The `--scorer-model-path` argument accepts any Hugging Face pretrained encoder model. If it is a local
path, make sure the tokenizer is saved alongside the model.
#### Visualize retrieved subgraph
```bash
srtk visualize [-h] -i INPUT -o OUTPUT_DIR [-e SPARQL_ENDPOINT]
[-kg {wikidata,freebase}] [--max-output MAX_OUTPUT]
```
### Train a Retriever
A scorer is the model that navigates path expansion. At each expansion step, the relations scored highest by the scorer are picked as the relations for the next hop.
The score is the embedding similarity between a candidate relation and the query (the question plus the expansion path so far).
The model is trained with distant supervision: given the question entities and the answer entities, the shortest paths between them serve as the supervision signal.
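The expansion loop described above can be sketched in plain Python. This is a toy illustration, not SRTK's implementation: `get_relations` stands in for a SPARQL lookup of the relations reachable from a path, and `score` stands in for the trained encoder's embedding similarity.

```python
# Minimal sketch of scorer-guided path expansion (beam search).
def expand_paths(question, get_relations, score, beam_width=2, max_depth=2):
    """Keep the top-`beam_width` relation paths at each hop, up to `max_depth`."""
    beams = [()]  # start from the empty path
    for _ in range(max_depth):
        candidates = []
        for path in beams:
            for rel in get_relations(path):
                # Query = question + expansion path so far
                query = " ".join((question,) + path)
                candidates.append((score(query, rel), path + (rel,)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = [p for _, p in candidates[:beam_width]]
    return beams

# Toy knowledge graph: relations available after each path prefix (made-up names)
graph = {
    (): ["educated_at", "spouse"],
    ("educated_at",): ["located_in"],
    ("spouse",): [],
}
# Toy scorer: count relation words that appear in the query
score = lambda q, r: sum(w in q.lower().split() for w in r.split("_"))
paths = expand_paths("where was he educated", lambda p: graph.get(p, []), score,
                     beam_width=1)
```

The real scorer replaces the word-overlap heuristic with cosine similarity between encoder embeddings of the query and each candidate relation label.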
#### Preprocess a dataset
1. Prepare training samples where the question entities and answer entities are known.
The training data should be saved in a JSONL file (e.g. `data/grounded.jsonl`). Each training sample should follow this format:
```json
{
  "id": "sample-id",
  "question": "Which universities did Barack Obama graduate from?",
  "question_entities": [
    "Q76"
  ],
  "answer_entities": [
    "Q49122",
    "Q1346110",
    "Q4569677"
  ]
}
```
2. Preprocess the samples with the `srtk preprocess` command.
```bash
srtk preprocess [-h] -i INPUT -o OUTPUT [--intermediate-dir INTERMEDIATE_DIR]
-e SPARQL_ENDPOINT -kg {wikidata,freebase} [--search-path]
[--metric {jaccard,recall}] [--num-negative NUM_NEGATIVE]
[--positive-threshold POSITIVE_THRESHOLD]
```
Under the hood, it does four things:
1. Find the shortest paths between the question entities and the answer entities.
2. Score each searched path by the Jaccard similarity between the entities it reaches and the answer entities.
3. Sample negatives: at each expansion step, the negative samples are the relations connected to the tracked entities that do not lie on a positive path.
4. Generate the training dataset as a JSONL file.
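The first three steps can be sketched in plain Python. This is a toy illustration under made-up entity IDs and relations; the real implementation queries a SPARQL endpoint instead of an in-memory dict.

```python
from collections import deque

# Toy adjacency: entity -> list of (relation, neighbor) pairs (made-up IDs)
GRAPH = {
    "Q76": [("educated_at", "Q49122"), ("spouse", "Q13133")],
    "Q49122": [("located_in", "Q61")],
    "Q13133": [],
    "Q61": [],
}

def shortest_paths(start, answers, graph, max_depth=2):
    """Step 1: BFS for relation paths from a question entity to any answer."""
    found, frontier = [], deque([(start, ())])
    while frontier:
        node, path = frontier.popleft()
        if node in answers and path:
            found.append(path)
            continue
        if len(path) < max_depth:
            for rel, nxt in graph.get(node, []):
                frontier.append((nxt, path + (rel,)))
    return found

def follow(start, path, graph):
    """Entities reached by following `path` from `start`."""
    nodes = {start}
    for rel in path:
        nodes = {n for node in nodes for r, n in graph.get(node, []) if r == rel}
    return nodes

def jaccard(a, b):
    """Step 2: Jaccard similarity between reached entities and answers."""
    return len(a & b) / len(a | b) if a | b else 0.0

def negatives(start, positive_path, graph):
    """Step 3: first-hop relations that are not on the positive path."""
    return [r for r, _ in graph.get(start, []) if r != positive_path[0]]

answers = {"Q49122"}
paths = shortest_paths("Q76", answers, GRAPH)
scores = {p: jaccard(follow("Q76", p, GRAPH), answers) for p in paths}
```

Paths whose score exceeds `--positive-threshold` become positive training paths; step 4 then serializes the (query, positive relation, negative relations) triples to JSONL.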
#### Train a sentence encoder
The scorer should be initialized from a pretrained encoder on the Hugging Face hub. Here we use `intfloat/e5-small`, a compact BERT-style text encoder.
```bash
srtk train --data-file data/train.jsonl \
--model-name-or-path intfloat/e5-small \
--save-model-path artifacts/scorer
```
## Trained models
SRTK is compatible with any language encoder or encoder-decoder model from the [Hugging Face hub](https://huggingface.co/models). You only need to specify the model name or path for arguments like `--model-name-or-path` or `--scorer-model-path`.
Here we provide some trained models for subgraph retrieval.
| Model | Dataset | Base Model | Notes |
| --- | --- | --- | --- |
| [`drt/srtk-scorer`](https://huggingface.co/drt/srtk-scorer) | [WebQSP](https://www.microsoft.com/en-us/download/details.aspx?id=52763), [SimpleQuestionsWikidata](https://github.com/askplatypus/wikidata-simplequestions), [SimpleDBpediaQA](https://github.com/castorini/SimpleDBpediaQA) | `roberta-base` | Jointly trained for Wikidata, Freebase and DBpedia. |
## Tutorials
- [End-to-end Subgraph Retrieval](https://github.com/happen2me/subgraph-retrieval-toolkit/blob/main/tutorials/2.end_to_end_subgraph_retrieval.ipynb)
- [Train a Retriever on Wikidata with Weak Supervision](https://github.com/happen2me/subgraph-retrieval-toolkit/blob/main/tutorials/3.weak_train_wikidata.ipynb)
- [Train a Retriever on Freebase with Weak Supervision](https://github.com/happen2me/subgraph-retrieval-toolkit/blob/main/tutorials/4.weak_train_freebase.ipynb)
- [Supervised Training with Wikidata Simple Questions](https://github.com/happen2me/subgraph-retrieval-toolkit/blob/main/tutorials/5.supervised_train_wikidata.ipynb)
- [Extend SRTK to other Knowledge Graphs](https://github.com/happen2me/subgraph-retrieval-toolkit/blob/main/tutorials/6.extend_to_new_kg.ipynb)
## License
This project is licensed under the terms of the MIT license.