# KAPipe: A Modular Pipeline for Knowledge Acquisition
## Table of Contents
- [🤖 What is KAPipe?](#-what-is-kapipe)
- [📦 Installation](#-installation)
- [🧩 Triple Extraction](#-triple-extraction)
- [🕸️ Knowledge Graph Construction](#-knowledge-graph-construction)
- [🧱 Community Clustering](#-community-clustering)
- [📝 Report Generation](#-report-generation)
- [✂️ Chunking](#-chunking)
- [🔍 Passage Retrieval](#-passage-retrieval)
- [💬 Question Answering](#-question-answering)
## 🤖 What is KAPipe?
**KAPipe** is a modular pipeline for comprehensive **knowledge acquisition** from unstructured documents.
It supports **extraction**, **organization**, **retrieval**, and **utilization** of knowledge, serving as a core framework for building intelligent systems that reason over structured knowledge.
Currently, KAPipe provides the following functionalities:
- 🧩**Triple Extraction**
    - Extract facts in the form of (head entity, relation, tail entity) triples from raw text.
- 🕸️**Knowledge Graph Construction**
    - Build a symbolic knowledge graph from triples, optionally augmented with external ontologies or knowledge bases (e.g., Wikidata, UMLS).
- 🧱**Community Clustering**
    - Cluster the knowledge graph into semantically coherent subgraphs (*communities*).
- 📝**Report Generation**
    - Generate textual reports (or summaries) of graph communities.
- ✂️**Chunking**
    - Split text (e.g., community report) into fixed-size chunks based on a predefined token length (e.g., n=300).
- 🔍**Passage Retrieval**
    - Retrieve relevant chunks for given queries using lexical or dense retrieval.
- 💬**Question Answering**
    - Answer questions using retrieved chunks as context.
These components together form an implementation of **graph-based retrieval-augmented generation (GraphRAG)**, enabling question answering and reasoning grounded in structured knowledge.
## 📦 Installation
### Step 1: Set up a Python environment
```bash
python -m venv .env
source .env/bin/activate
pip install -U pip setuptools wheel
```
### Step 2: Install KAPipe
```bash
pip install kapipe
```
### Step 3: Download pretrained models and configurations
Pretrained models and configuration files can be downloaded from the following Google Drive folder:
📁 [KAPipe Release Files](https://drive.google.com/drive/folders/16ypMCoLYf5kDxglDD_NYoCNAfhTy4Qwp)
Download the latest release file named `release.YYYYMMDD.tar.gz`, then extract it to the `~/.kapipe` directory:
```bash
mkdir -p ~/.kapipe
mv release.YYYYMMDD.tar.gz ~/.kapipe
cd ~/.kapipe
tar -zxvf release.YYYYMMDD.tar.gz
```
If the extraction is successful, you should see a directory `~/.kapipe/download/`, which contains model resources.
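The check described above can also be done from Python. The following is a minimal sketch using only the standard library; `find_download_dir` is a hypothetical helper name, not part of KAPipe, and it only tests that the directory exists, not that its contents are complete.

```python
from pathlib import Path
from typing import Optional


def find_download_dir(root: Optional[Path] = None) -> Optional[Path]:
    """Return the extracted `download/` directory, or None if it is missing.

    `root` defaults to ~/.kapipe, the location used in the steps above.
    """
    root = root if root is not None else Path.home() / ".kapipe"
    download_dir = root / "download"
    return download_dir if download_dir.is_dir() else None


if __name__ == "__main__":
    found = find_download_dir()
    if found is None:
        print("~/.kapipe/download/ not found; re-run the extraction step")
    else:
        print(f"Model resources found in {found}")
```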
## 🧩 Triple Extraction
### Overview
The **Triple Extraction** module identifies relational facts from raw text in the form of (head entity, relation, tail entity) **triples**.
Specifically, this is achieved through the following cascade of subtasks:
1. **Named Entity Recognition (NER)**:
    - Detect entity mentions (spans) and classify their types.
1. **Entity Disambiguation Retrieval (ED-Retrieval)**:
    - Retrieve candidate concept IDs from a knowledge base for each mention.
1. **Entity Disambiguation Reranking (ED-Reranking)**:
    - Select the most probable concept ID from the retrieved candidates.
1. **Document-level Relation Extraction (DocRE)**:
    - Extract relational triples based on the disambiguated entity set.
### Input
This module takes as input:
1. ***Document***: a dictionary containing:
    - `doc_key` (str): Unique identifier for the document
    - `sentences` (list[str]): List of sentences, each a string of space-separated tokens
```json
{
"doc_key": "6794356",
"sentences": [
"Tricuspid valve regurgitation and lithium carbonate toxicity in a newborn infant .",
"A newborn with massive tricuspid regurgitation , atrial flutter , congestive heart failure , and a high serum lithium level is described .",
...
]
}
```
(See `experiments/data/examples/documents_without_triples.json` for more details.)
Each subtask takes a ***Document*** object as input, augments it with new fields, and returns it.
This allows custom metadata to persist throughout the pipeline.
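The pass-through behaviour described above can be pictured with a toy stage. This is a minimal sketch rather than KAPipe's implementation: `toy_ner_stage`, `run_stages`, and the `source` field are hypothetical names chosen for illustration, and the stage adds an empty `mentions` list where a real model would add predictions.

```python
from typing import Any, Callable

Document = dict[str, Any]


def toy_ner_stage(document: Document) -> Document:
    """Hypothetical stand-in for a subtask: adds `mentions`, keeps every other field."""
    augmented = dict(document)  # shallow copy, so the caller's dictionary is not mutated
    augmented["mentions"] = []  # a real NER model would fill this in
    return augmented


def run_stages(document: Document, stages: list[Callable[[Document], Document]]) -> Document:
    """Apply each stage in order, feeding the output of one into the next."""
    for stage in stages:
        document = stage(document)
    return document


doc = {
    "doc_key": "6794356",
    "sentences": ["Tricuspid valve regurgitation and lithium carbonate toxicity in a newborn infant ."],
    "source": "my_corpus",  # custom metadata, not part of the Document format
}
result = run_stages(doc, [toy_ner_stage])
print(sorted(result))  # ['doc_key', 'mentions', 'sentences', 'source']
```

Because each stage only adds keys, a field such as `source` is still present after the last stage.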
### Output
The output is a dictionary in the same format (***Document***), augmented with the extracted mentions, entities, and triples:
- `doc_key` (str): Same as input
- `sentences` (list[str]): Same as input
- `mentions` (list[dict]): Mentions, a list of dictionaries, each containing:
    - `span` (tuple[int,int]): Mention span
    - `name` (str): Mention string
    - `entity_type` (str): Entity type
    - `entity_id` (str): Concept ID
- `entities` (list[dict]): Entities, a list of dictionaries, each containing:
    - `mention_indices` (list[int]): Indices of mentions belonging to this entity
    - `entity_type` (str): Entity type
    - `entity_id` (str): Concept ID
- `relations` (list[dict]): Triples, a list of dictionaries, each containing:
    - `arg1` (int): Index of the head/subject entity in `entities`
    - `relation` (str): Semantic relation
    - `arg2` (int): Index of the tail/object entity in `entities`
```json
{
"doc_key": "6794356",
"sentences": [...],
"mentions": [
{
"span": [0, 2],
"name": "Tricuspid valve regurgitation",
"entity_type": "Disease",
"entity_id": "D014262"
},
...
],
"entities": [
{
"mention_indices": [0, 3, 7],
"entity_type": "Disease",
"entity_id": "D014262"
},
...
],
"relations": [
{
"arg1": 1,
"relation": "CID",
"arg2": 7
},
...
]
}
```
(See `experiments/data/examples/documents_with_triples.json` for more details.)
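Since `relations` stores entity indices rather than concept IDs, downstream code usually has to resolve them. The sketch below shows one way to do that, assuming `arg1` and `arg2` index into the document's `entities` list; `resolve_triples` is a hypothetical helper, not a KAPipe function, and the toy document uses made-up IDs.

```python
def resolve_triples(document: dict) -> list[tuple[str, str, str]]:
    """Map each relation's entity indices to (head_id, relation, tail_id)."""
    entities = document["entities"]
    return [
        (
            entities[rel["arg1"]]["entity_id"],
            rel["relation"],
            entities[rel["arg2"]]["entity_id"],
        )
        for rel in document["relations"]
    ]


# Toy document with made-up concept IDs, shaped like the output above
toy_document = {
    "doc_key": "toy",
    "entities": [
        {"mention_indices": [0], "entity_type": "Chemical", "entity_id": "C1"},
        {"mention_indices": [1], "entity_type": "Disease", "entity_id": "D1"},
    ],
    "relations": [{"arg1": 0, "relation": "CID", "arg2": 1}],
}
print(resolve_triples(toy_document))  # [('C1', 'CID', 'D1')]
```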
### How to Use
```python
import kapipe.triple_extraction
IDENTIFIER = "biaffinener_blink_blink_atlop_cdr"
# Load the Triple Extraction pipeline
pipe = kapipe.triple_extraction.load(
identifier=IDENTIFIER,
gpu_map={"ner": 0, "ed_retrieval": 0, "ed_reranking": 2, "docre": 3}
)
# We provide a utility for converting text (string) to Document format
# `title` is optional
document = pipe.text_to_document(doc_key=your_doc_key, text=your_text, title=your_title)
# Apply the pipeline to your input document
document = pipe(document)
```
(See `experiments/codes/run_triple_extraction.py` for specific examples.)
<!-- The `identifier` determines the specific models used for each subtask.
For example, `"biaffinener_blink_blink_atlop_cdr"` uses:
- **NER**: Biaffine-NER (trained on BC5CDR for Chemical and Disease types)
- **ED-Retrieval**: BLINK Bi-Encoder (trained on BC5CDR for MeSH 2015)
- **ED-Reranking**: BLINK Cross-Encoder (trained on BC5CDR for MeSH 2015)
- **DocRE**: ATLOP (trained on BC5CDR for Chemical-Induce-Disease (CID) relation) -->
### Supported Methods
#### Named Entity Recognition (NER)
- **Biaffine-NER** ([`Yu et al., 2020`](https://aclanthology.org/2020.acl-main.577/)): Span-based BERT model using biaffine scoring
- **LLM-NER**: A proprietary/open-source LLM using an NER-specific prompt template and few-shot examples
#### Entity Disambiguation Retrieval (ED-Retrieval)
- **BLINK Bi-Encoder** ([`Wu et al., 2020`](https://aclanthology.org/2020.emnlp-main.519/)): Dense retriever using BERT-based encoders and approximate nearest neighbor search
- **BM25**: Lexical matching
- **Levenshtein**: Edit distance matching
#### Entity Disambiguation Reranking (ED-Reranking)
- **BLINK Cross-Encoder** (Wu et al., 2020): Reranker using a BERT-based encoder for candidates from the Bi-Encoder
- **LLM-ED**: A proprietary/open-source LLM using an ED-specific prompt template and few-shot examples
#### Document-level Relation Extraction (DocRE)
- **ATLOP** ([`Zhou et al., 2021`](https://ojs.aaai.org/index.php/AAAI/article/view/17717)): BERT-based model for DocRE
- **MA-ATLOP** (Oumaima & Nishida et al., 2024): Mention-agnostic extension of ATLOP
- **MA-QA** (Oumaima & Nishida, 2024): Question-answering style DocRE model
- **LLM-DocRE**: A proprietary/open-source LLM using a DocRE-specific prompt template and few-shot examples
### Available Pipeline Identifiers
The following pipeline configurations are currently available:
| identifier | NER (Entity Types) | ED-Retrieval (Knowledge Base) | ED-Reranking (Knowledge Base) | DocRE (Relations) |
| --- | --- | --- | --- | --- |
| `biaffinener_blink_blink_atlop_cdr` | Biaffine-NER ({Chemical, Disease}) | BLINK Bi-Encoder (MeSH 2015) | BLINK Cross-Encoder (MeSH 2015) | ATLOP ({Chemical-Induce-Disease}) |
| `biaffinener_blink_blink_atlop_linked_docred` | Biaffine-NER ({Person, Organization, Location, Time, Number, Misc}) | BLINK Bi-Encoder (DBPedia 2020.02.01) | BLINK Cross-Encoder (DBPedia 2020.02.01) | ATLOP (DBPedia 96 relations) |
| `llmner_blink_llmed_llmdocre_cdr` | LLM-NER `gpt-4o-mini` ({Chemical, Disease}) | BLINK Bi-Encoder (MeSH 2015) | LLM-ED `gpt-4o-mini` (MeSH 2015) | LLM-DocRE `gpt-4o-mini` ({Chemical-Induce-Disease}) |
| `llmner_blink_llmed_llmdocre_linked_docred` | LLM-NER `gpt-4o-mini` ({Person, Organization, Location, Time, Number, Misc}) | BLINK Bi-Encoder (DBPedia 2020.02.01) | LLM-ED `gpt-4o-mini` (DBPedia 2020.02.01) | LLM-DocRE `gpt-4o-mini` (DBPedia 96 relations) |
## 🕸️ Knowledge Graph Construction
### Overview
The **Knowledge Graph Construction** module builds a **directed multi-relational graph** from a set of extracted triples.
- **Nodes** represent unique entities (i.e., concepts).
- **Edges** represent semantic relations between entities.
### Input
This module takes as input:
1. List of ***Document*** objects with triples, produced by the **Triple Extraction** module
2. (Optional) ***Additional Triples*** (e.g., from existing KBs): a list of dictionaries, each containing:
    - `head` (str): Entity ID of the subject
    - `relation` (str): Relation type (e.g., treats, causes)
    - `tail` (str): Entity ID of the object
```json
[
{
"head": "D000001",
"relation": "treats",
"tail": "D014262"
},
...
]
```
(See `experiments/data/examples/additional_triples.json` for more details.)
3. ***Entity Dictionary***: a list of dictionaries, each containing:
    - `entity_id` (str): Unique concept ID
    - `canonical_name` (str): Official name of the concept
    - `entity_type` (str): Type/category of the concept
    - `synonyms` (list[str]): A list of alternative names
    - `description` (str): Textual definition of the concept
```json
[
{
"entity_id": "C009166",
"entity_index": 252696,
"entity_type": null,
"canonical_name": "retinol acetate",
"synonyms": [
"retinyl acetate",
"vitamin A acetate"
],
"description": "",
"tree_numbers": []
},
{
"entity_id": "D000641",
"entity_index": 610,
"entity_type": "Chemical",
"canonical_name": "Ammonia",
"synonyms": [],
"description": "A colorless alkaline gas. It is formed in the body during decomposition of organic materials during a large number of metabolically important reactions. Note that the aqueous form of ammonia is referred to as AMMONIUM HYDROXIDE.",
"tree_numbers": [
"D01.362.075",
"D01.625.050"
]
},
...
]
```
(See `experiments/data/examples/entity_dict.json` for more details.)
### Output
The output is a `networkx.MultiDiGraph` object representing the knowledge graph.
Each node has the following attributes:
- `entity_id` (str): Concept ID (e.g., MeSH ID)
- `entity_type` (str): Type of entity (e.g., Disease, Chemical, Person, Location)
- `name` (str): Canonical name (from Entity Dictionary)
- `description` (str): Textual definition (from Entity Dictionary)
- `doc_key_list` (list[str]): List of document IDs where this entity appears
Each edge has the following attributes:
- `head` (str): Head entity ID
- `tail` (str): Tail entity ID
- `relation` (str): Type of semantic relation
- `doc_key_list` (list[str]): List of document IDs supporting this relation
(See `experiments/data/examples/graph.graphml` for more details.)
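To make the edge attributes concrete, the sketch below merges triples from several documents into one record per (head, relation, tail) and collects the supporting `doc_key` values. It uses plain dictionaries instead of `networkx`, and it is an illustration under assumptions, not the `GraphConstructor` logic: `collect_edges` is a hypothetical name, and giving additional triples an empty `doc_key_list` is a guess at a reasonable default.

```python
from collections import defaultdict


def collect_edges(documents, additional_triples=()):
    """Group triples by (head, relation, tail) and record the supporting doc_keys."""
    edges = defaultdict(list)
    for doc in documents:
        entities = doc["entities"]
        for rel in doc["relations"]:
            key = (
                entities[rel["arg1"]]["entity_id"],
                rel["relation"],
                entities[rel["arg2"]]["entity_id"],
            )
            if doc["doc_key"] not in edges[key]:
                edges[key].append(doc["doc_key"])
    # Assumption: triples from an external KB have no supporting documents
    for triple in additional_triples:
        edges.setdefault((triple["head"], triple["relation"], triple["tail"]), [])
    return [
        {"head": head, "relation": relation, "tail": tail, "doc_key_list": doc_keys}
        for (head, relation, tail), doc_keys in edges.items()
    ]


def _toy_doc(doc_key):
    """Build a toy Document with made-up concept IDs."""
    return {
        "doc_key": doc_key,
        "entities": [{"entity_id": "C1"}, {"entity_id": "D1"}],
        "relations": [{"arg1": 0, "relation": "CID", "arg2": 1}],
    }


toy_edges = collect_edges(
    [_toy_doc("a"), _toy_doc("b")],
    additional_triples=[{"head": "X1", "relation": "treats", "tail": "D1"}],
)
for edge in toy_edges:
    print(edge)
```

The same triple extracted from two documents yields a single edge whose `doc_key_list` contains both keys.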
### How to Use
```python
from kapipe.graph_construction import GraphConstructor
PATH_TO_DOCUMENTS = "./experiments/data/examples/documents.json"
PATH_TO_TRIPLES = "./experiments/data/examples/additional_triples.json" # Or set to None if unused
PATH_TO_ENTITY_DICT = "./experiments/data/examples/entity_dict.json"
# Initialize the knowledge graph constructor
constructor = GraphConstructor()
# Construct the knowledge graph
graph = constructor.construct_knowledge_graph(
path_documents_list=[PATH_TO_DOCUMENTS],
path_additional_triples=PATH_TO_TRIPLES, # Optional
path_entity_dict=PATH_TO_ENTITY_DICT
)
```
(See `experiments/codes/run_graph_construction.py` for specific examples.)
## 🧱 Community Clustering
### Overview
The **Community Clustering** module partitions the knowledge graph into **semantically coherent subgraphs**, referred to as *communities*.
Each community represents a localized set of closely related concepts and relations, and serves as a fundamental unit of structured knowledge.
### Input
This module takes as input:
1. Knowledge graph (`networkx.MultiDiGraph`) produced by the **Knowledge Graph Construction** module.
### Output
The output is a list of hierarchical community records (dictionaries), each containing:
- `community_id` (str): Unique ID for the community
- `nodes` (list[str]): List of entity IDs belonging to the community (null for ROOT)
- `level` (int): Depth in the hierarchy (ROOT=-1)
- `parent_community_id` (str): ID of the parent community (null for ROOT)
- `child_community_ids` (list[str]): List of child community IDs (empty for leaf communities)
```json
[
{
"community_id": "ROOT",
"nodes": null,
"level": -1,
"parent_community_id": null,
"child_community_ids": [
"0",
"1",
"2",
"3",
"4",
"5",
"6",
"7",
"8",
"9"
]
},
{
"community_id": "0",
"nodes": [
"D016651",
"D014262",
"D003866",
"D003490",
"D001145"
],
"level": 0,
"parent_community_id": "ROOT",
"child_community_ids": [...]
},
...
]
```
(See `experiments/data/examples/communities.json` for more details.)
This hierarchical structure enables multi-level organization of knowledge, particularly useful for coarse-to-fine report generation and scalable retrieval.
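The hierarchy can be navigated with a few lines of standard-library Python. The sketch below indexes the records by `community_id`, walks from a community up to `ROOT`, and lists the leaf communities; the helper names and the toy records (including their IDs and node names) are made up for illustration.

```python
def index_communities(communities):
    """Build a lookup table from community_id to its record."""
    return {c["community_id"]: c for c in communities}


def path_to_root(community_id, by_id):
    """Return the chain of community IDs from the given community up to ROOT."""
    path = [community_id]
    while by_id[path[-1]]["parent_community_id"] is not None:
        path.append(by_id[path[-1]]["parent_community_id"])
    return path


def leaf_communities(communities):
    """Return the records that have no child communities."""
    return [c for c in communities if not c["child_community_ids"]]


# Toy hierarchy with made-up IDs: ROOT -> {"0", "1"}, and "0" -> {"2"}
toy_communities = [
    {"community_id": "ROOT", "nodes": None, "level": -1,
     "parent_community_id": None, "child_community_ids": ["0", "1"]},
    {"community_id": "0", "nodes": ["A", "B", "C"], "level": 0,
     "parent_community_id": "ROOT", "child_community_ids": ["2"]},
    {"community_id": "1", "nodes": ["D"], "level": 0,
     "parent_community_id": "ROOT", "child_community_ids": []},
    {"community_id": "2", "nodes": ["A", "B"], "level": 1,
     "parent_community_id": "0", "child_community_ids": []},
]
by_id = index_communities(toy_communities)
print(path_to_root("2", by_id))  # ['2', '0', 'ROOT']
print([c["community_id"] for c in leaf_communities(toy_communities)])  # ['1', '2']
```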
### How to Use
```python
from kapipe.community_clustering import (
HierarchicalLeiden,
NeighborhoodAggregation,
TripleLevelFactorization
)
# Initialize the community clusterer
clusterer = HierarchicalLeiden()
# clusterer = NeighborhoodAggregation()
# clusterer = TripleLevelFactorization()
# Apply the community clusterer to the graph
communities = clusterer.cluster_communities(graph)
```
(See `experiments/codes/run_community_clustering.py` for specific examples.)
### Supported Methods
- **Hierarchical Leiden**
    - Recursively applies the Leiden algorithm (Traag et al., 2019) to optimize modularity. Large communities are subdivided until they satisfy a predefined size constraint (default: 10 nodes).
- **Neighborhood Aggregation**
    - Groups each node with its immediate neighbors to form local communities.
- **Triple-level Factorization**
    - Treats each individual (subject, relation, object) triple as an atomic community.
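As an illustration of the simplest of these ideas, the sketch below groups each node with its immediate neighbors and emits records in the community format shown earlier. It is a rough approximation under stated assumptions, not the `NeighborhoodAggregation` class: edge direction is ignored, every community is placed at level 0 under `ROOT`, and `neighborhood_communities` is a hypothetical name.

```python
from collections import defaultdict


def neighborhood_communities(edges):
    """Create one community per node, containing the node and its direct neighbors.

    `edges` is an iterable of (head, tail) entity-ID pairs; direction is ignored.
    """
    neighbors = defaultdict(set)
    for head, tail in edges:
        neighbors[head].add(tail)
        neighbors[tail].add(head)

    communities = []
    for index, node in enumerate(sorted(neighbors)):
        communities.append({
            "community_id": str(index),
            "nodes": [node] + sorted(neighbors[node] - {node}),
            "level": 0,
            "parent_community_id": "ROOT",
            "child_community_ids": [],
        })
    return communities


# Toy graph with made-up node names: A is connected to B and C
toy_neighborhoods = neighborhood_communities([("A", "B"), ("A", "C")])
for community in toy_neighborhoods:
    print(community["community_id"], community["nodes"])
```

Unlike modularity-based clustering, the resulting communities overlap: a node appears in its own community and in each neighbor's.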
## 📝 Report Generation
### Overview
The **Report Generation** module converts each community into a **natural language report**, making structured knowledge interpretable for both humans and language models.
### Input
This module takes as input:
1. Knowledge graph (`networkx.MultiDiGraph`) generated by the **Knowledge Graph Construction** module.
1. List of community records generated by the **Community Clustering** module.
### Output
The output is a `.jsonl` file, where each line corresponds to one ***Passage***, a dictionary containing:
- `title` (str): Concise topic summary of the community
- `text` (str): Full natural language description of the community's content
```json
{"title": "Lithium Carbonate and Related Health Conditions", "text": "This report examines the interconnections between Lithium Carbonate, ...."}
{"title": "Phenobarbital and Drug-Induced Dyskinesia", "text": "This report examines the relationship between Phenobarbital, ..."}
{"title": "Ammonia and Valproic Acid in Disorders of Excessive Somnolence", "text": "This report examines the relationship between ammonia and valproic acid, ..."}
...
```
(See `experiments/data/examples/reports.jsonl` for more details.)
The output format is fully compatible with the **Chunking** module, which accepts any dictionary containing `title` and `text` fields.
Thus, each community report can also be treated as a generic ***Passage***.
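Because the reports are stored one JSON object per line, they can be loaded with the standard `json` module. The following sketch is one way to do it; `read_passages` is a hypothetical helper, and the in-memory stream stands in for an open `reports.jsonl` file.

```python
import io
import json
from typing import Iterable, Iterator


def read_passages(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one Passage dictionary per non-empty line of a .jsonl stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# In practice, pass an open file: `with open(path, encoding="utf-8") as f: ...`
toy_stream = io.StringIO(
    '{"title": "Report A", "text": "First toy report."}\n'
    "\n"
    '{"title": "Report B", "text": "Second toy report."}\n'
)
toy_passages = list(read_passages(toy_stream))
print([p["title"] for p in toy_passages])  # ['Report A', 'Report B']
```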
### How to Use
```python
from kapipe.report_generation import (
LLMBasedReportGenerator,
TemplateBasedReportGenerator
)
PATH_TO_REPORTS = "./experiments/data/examples/reports.jsonl"
# Initialize the report generator
generator = LLMBasedReportGenerator()
# generator = TemplateBasedReportGenerator()
# Generate community reports
generator.generate_community_reports(
graph=graph,
communities=communities,
path_output=PATH_TO_REPORTS
)
```
(See `experiments/codes/run_report_generation.py` for specific examples.)
### Supported Methods
- **LLM-based Generation**
    - Uses a large language model (e.g., GPT-4o-mini), prompted with the content of a community, to generate a fluent summary.
- **Template-based Generation**
    - Uses a deterministic format that verbalizes each entity and triple and linearizes them:
        - Entity format: `"{name} | {type} | {definition}"`
        - Triple format: `"{subject} | {relation} | {object}"`
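The two templates above can be applied with ordinary string formatting. The sketch below is illustrative only: the function names are hypothetical, and joining entities first, then triples, one per line, is an assumption about how the pieces might be linearized rather than the behaviour of `TemplateBasedReportGenerator`.

```python
def verbalize_entity(name: str, entity_type: str, definition: str) -> str:
    """Render one entity as "{name} | {type} | {definition}"."""
    return f"{name} | {entity_type} | {definition}"


def verbalize_triple(subject: str, relation: str, obj: str) -> str:
    """Render one triple as "{subject} | {relation} | {object}"."""
    return f"{subject} | {relation} | {obj}"


def build_report_text(entities: list[dict], triples: list[tuple[str, str, str]]) -> str:
    """Assumed linearization: entity lines first, then triple lines."""
    lines = [
        verbalize_entity(e["name"], e["entity_type"], e["description"])
        for e in entities
    ]
    lines += [verbalize_triple(s, r, o) for s, r, o in triples]
    return "\n".join(lines)


# Toy community with placeholder definitions
toy_entities = [
    {"name": "Lithium Carbonate", "entity_type": "Chemical", "description": "(toy definition)"},
    {"name": "Cardiac Arrhythmias", "entity_type": "Disease", "description": "(toy definition)"},
]
toy_triples = [("Lithium Carbonate", "CID", "Cardiac Arrhythmias")]
toy_report = build_report_text(toy_entities, toy_triples)
print(toy_report)
```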
## ✂️ Chunking
### Overview
The **Chunking** module splits each input text into multiple **non-overlapping text chunks**, each constrained by a maximum token length (e.g., 100 tokens).
This module is essential for preparing context units that are compatible with downstream modules such as retrieval and question answering.
<!-- It supports any input that conforms to the following format:
- A dictionary containing a `"title"` and `"text"` field.
This makes the module applicable not only to **community reports**, but also to other types of *Passage* data with similar structure. -->
### Input
This module takes as input:
1. ***Passage***: a dictionary containing:
    - `title` (str): Title of the passage
    - `text` (str): Full natural language description of the passage
```json
{
"title": "Lithium Carbonate and Related Health Conditions",
"text": "This report examines the interconnections between Lithium Carbonate, ..."
}
```
### Output
The output is a list of ***Passage*** objects, each containing:
- `title` (str): Same as input
- `text` (str): Chunked portion of the original text, within the specified token window
- Other metadata (e.g., `community_id`) is carried over
```json
[
{
"title": "Lithium Carbonate and Related Health Conditions",
"text": "This report examines the interconnections between Lithium Carbonate, ..."
},
{
"title": "Lithium Carbonate and Related Health Conditions",
"text": "This duality necessitates careful monitoring of patients receiving Lithium treatment, ..."
},
{
"title": "Lithium Carbonate and Related Health Conditions",
"text": "Similarly, cardiac arrhythmias, which involve irregular heartbeats, can pose ..."
}
...
]
```
(See `experiments/data/examples/reports.chunked_w100.jsonl` for more details.)
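The chunking behaviour can be approximated without any dependencies. The sketch below splits on whitespace instead of using a spaCy tokenizer, so chunk boundaries will differ from what `Chunker` produces; `split_passage` is a hypothetical name, and the example only illustrates non-overlapping windows with metadata carried over.

```python
def split_passage(passage: dict, window_size: int) -> list[dict]:
    """Split `passage["text"]` into non-overlapping chunks of at most `window_size` tokens.

    Tokens are whitespace-separated words here (an assumption made for simplicity).
    Every other field of the passage is copied into each chunk.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    tokens = passage["text"].split()
    chunks = []
    for start in range(0, len(tokens), window_size):
        chunk = dict(passage)  # carries over title and any extra metadata
        chunk["text"] = " ".join(tokens[start:start + window_size])
        chunks.append(chunk)
    return chunks


toy_passage = {
    "title": "Toy Report",
    "text": "one two three four five six seven",
    "community_id": "0",  # extra metadata
}
toy_chunks = split_passage(toy_passage, window_size=3)
print([c["text"] for c in toy_chunks])  # ['one two three', 'four five six', 'seven']
```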
### How to Use
```python
from kapipe.chunking import Chunker
MODEL_NAME = "en_core_web_sm" # SpaCy tokenizer
WINDOW_SIZE = 100 # Max number of tokens per chunk
# Initialize the chunker
chunker = Chunker(model_name=MODEL_NAME)
# Chunk the passage
chunked_passages = chunker.split_passage_to_chunked_passages(
passage=passage,
window_size=WINDOW_SIZE
)
```
(See `experiments/codes/run_chunking.py` for specific examples.)
## 🔍 Passage Retrieval
### Overview
The **Passage Retrieval** module searches for the top-k most **relevant chunks** given a user query.
It scores each chunk against the query with either a lexical retriever (e.g., BM25), which matches terms, or a dense retriever (e.g., Contriever), which compares query and chunk embeddings.
### Input
**(1) Indexing**:
During the indexing phase, this module takes as input:
1. List of ***Passage*** objects
**(2) Search**:
During the search phase, this module takes as input:
1. ***Question***: a dictionary containing:
    - `question_key` (str): Unique identifier for the question
    - `question` (str): Natural language question
```json
{
"question_key": "question#123",
"question": "What does lithium carbonate induce?"
}
```
(See `experiments/data/examples/questions.json` for more details.)
### Output
**(1) Indexing**:
The indexing result is automatically saved to the path specified by `index_root` and `index_name`.
**(2) Search**:
The search result for each question is represented as a dictionary containing:
- `question_key` (str): Refers back to the original query
- `contexts` (list[***Passage***]): Top-k retrieved chunks sorted by relevance, each containing:
    - `title` (str): Chunk title
    - `text` (str): Chunk text
    - `score` (float): Similarity score computed by the retriever
    - `rank` (int): Rank of the chunk (1-based)
```json
{
"question_key": "question#123",
"contexts": [
{
"title": "Lithium Carbonate and Related Health Conditions",
"text": "This report examines the interconnections between Lithium Carbonate, ...",
(meta data, if exists)
"score": 1.5991605520248413,
"rank": 1
},
{
"title": "Lithium Carbonate and Related Health Conditions",
"text": "\n\nIn summary, while Lithium Carbonate is an effective treatment for mood disorders, ...",
(meta data, if exists)
"score": 1.51018488407135,
"rank": 2
},
...
]
}
```
(See `experiments/data/examples/questions.contexts.json` for more details.)
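The relationship between `score` and `rank` in this structure can be sketched as follows. The function assumes the retriever has already produced one score per passage; `build_contexts` is a hypothetical helper, not part of the KAPipe retrievers.

```python
def build_contexts(question: dict, scored_passages: list, top_k: int) -> dict:
    """Keep the `top_k` highest-scoring passages and attach `score` and 1-based `rank`.

    `scored_passages` is a list of (passage_dict, score) pairs.
    """
    ranked = sorted(scored_passages, key=lambda pair: pair[1], reverse=True)[:top_k]
    contexts = [
        {**passage, "score": float(score), "rank": rank}
        for rank, (passage, score) in enumerate(ranked, start=1)
    ]
    return {"question_key": question["question_key"], "contexts": contexts}


toy_question = {"question_key": "question#123", "question": "What does lithium carbonate induce?"}
toy_scored = [
    ({"title": "T1", "text": "chunk one"}, 0.2),
    ({"title": "T2", "text": "chunk two"}, 1.6),
    ({"title": "T3", "text": "chunk three"}, 0.9),
]
toy_result = build_contexts(toy_question, toy_scored, top_k=2)
print([(c["title"], c["rank"]) for c in toy_result["contexts"]])  # [('T2', 1), ('T3', 2)]
```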
### How to Use
**(1) Indexing**:
```python
from kapipe.passage_retrieval import Contriever
INDEX_ROOT = "./"
INDEX_NAME = "example"
# Initialize retriever
retriever = Contriever(
max_passage_length=512,
pooling_method="average",
normalize=False,
gpu_id=0,
metric="inner-product"
)
# Build index
retriever.make_index(
passages=passages,
index_root=INDEX_ROOT,
index_name=INDEX_NAME
)
```
(See `experiments/codes/run_passage_retrieval.py` for specific examples.)
**(2) Search**:
```python
# Load the index
retriever.load_index(index_root=INDEX_ROOT, index_name=INDEX_NAME)
# Search for top-k contexts
retrieved_passages = retriever.search(queries=[question], top_k=10)[0]
contexts_for_question = {
"question_key": question["question_key"],
"contexts": retrieved_passages
}
```
### Supported Methods
- **BM25**
    - A sparse lexical matching model based on term frequency and inverse document frequency.
- **Contriever** (Izacard et al., 2022)
    - A dual-encoder retriever trained with contrastive learning. Computes similarity between query and chunk embeddings.
- **ColBERTv2** (Santhanam et al., 2022)
    - A token-level late-interaction retriever for fine-grained semantic matching. Provides higher accuracy at increased inference cost.
    - Note: This method is currently unavailable due to an import error in the external `ragatouille` package ([issue #272](https://github.com/AnswerDotAI/RAGatouille/issues/272)).
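To show what "term frequency and inverse document frequency" means in practice, here is a compact Okapi BM25 scorer. It is a sketch under assumptions, not the retriever shipped with KAPipe: tokens are lowercased whitespace-separated words, the IDF uses the common `log(1 + ...)` smoothing, and `k1=1.5`, `b=0.75` are conventional defaults chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: str, passages: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one Okapi BM25 score per passage for the given query."""
    docs = [p.lower().split() for p in passages]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of passages containing each term
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


toy_chunks_text = [
    "lithium carbonate toxicity in a newborn infant",
    "phenobarbital and drug-induced dyskinesia",
    "ammonia and valproic acid",
]
toy_bm25 = bm25_scores("lithium carbonate", toy_chunks_text)
print(toy_bm25)
```

Only the first chunk shares terms with the query, so it is the only one with a non-zero score.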
## 💬 Question Answering
### Overview
The **Question Answering** module generates an answer for each user query, optionally conditioned on the retrieved context chunks.
It uses a large language model such as GPT-4o to produce factually grounded and context-aware answers in natural language.
### Input
This module takes as input:
1. ***Question***: a dictionary containing:
    - `question_key` (str): Unique identifier for the question
    - `question` (str): Natural language question string
(See `experiments/data/examples/questions.json` for more details.)
2. A dictionary containing:
    - `question_key` (str): Same identifier as in the ***Question***
    - `contexts` (list[***Passage***]): List of ***Passage*** objects
(See `experiments/data/examples/questions.contexts.json` for more details.)
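The two inputs are linked by `question_key`. The sketch below shows how they might be combined into a single prompt string; the prompt wording, the `max_contexts` parameter, and the `build_prompt` name are all hypothetical and do not reflect the template that `LLMQA` actually uses.

```python
def build_prompt(question: dict, contexts_for_question: dict, max_contexts: int = 5) -> str:
    """Combine a Question and its retrieved contexts into one prompt string."""
    if question["question_key"] != contexts_for_question["question_key"]:
        raise ValueError("question_key mismatch between question and contexts")
    contexts = contexts_for_question["contexts"][:max_contexts]
    blocks = [
        f"[{i}] {c['title']}\n{c['text']}"
        for i, c in enumerate(contexts, start=1)
    ]
    # Illustrative wording only
    return (
        "Answer the question using the contexts below.\n\n"
        + "\n\n".join(blocks)
        + f"\n\nQuestion: {question['question']}\nAnswer:"
    )


toy_qa_question = {"question_key": "question#123", "question": "What does lithium carbonate induce?"}
toy_qa_contexts = {
    "question_key": "question#123",
    "contexts": [
        {"title": "Lithium Carbonate and Related Health Conditions", "text": "(toy chunk text)"},
    ],
}
toy_prompt = build_prompt(toy_qa_question, toy_qa_contexts)
print(toy_prompt)
```

Checking the keys up front avoids silently answering a question with contexts retrieved for a different one.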
### Output
The answer is a dictionary containing:
- `question_key` (str): Same as input
- `question` (str): Original question text
- `output_answer` (str): Model-generated natural language answer
- `helpfulness_score` (float): Confidence score generated by the model
```json
{
"question_key": "question#123",
"question": "What does lithium carbonate induce?",
"output_answer": "Lithium carbonate can induce depressive disorder, cyanosis, and cardiac arrhythmias.",
"helpfulness_score": 1.0
}
```
(See `experiments/data/examples/answers.json` for more details.)
### How to Use
```python
from os.path import expanduser
from kapipe.qa import LLMQA
from kapipe import utils
# Initialize the QA module
answerer = LLMQA(path_snapshot=expanduser("~/.kapipe/download/results/qa/llmqa/openai_gpt4o"))
# Generate answer
answer = answerer.run(
question=question,
contexts_for_question=contexts_for_question
)
```
(See `experiments/codes/run_qa.py` for specific examples.)
Raw data
{
"_id": null,
"home_page": null,
"name": "kapipe",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": "NLP, knowledge acquisition, information extraction, knowledge graph, retrieval, question answering",
"author": null,
"author_email": "Noriki Nishida <norikinishida@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/2b/7c/31bc4ffeaa9a6a89d105994280668fa9a537bf4b4db5669eab4903169716/kapipe-0.0.7.tar.gz",
"platform": null,
"description": "# KAPipe: A Modular Pipeline for Knowledge Acquisition\n\n## Table of Contents\n\n- [\ud83e\udd16 What is KAPipe?](#what-is-kapipe)\n- [\ud83d\udce6 Installation](#-installation)\n- [\ud83e\udde9 Triple Extraction](#-triple-extraction)\n- [\ud83d\udd78\ufe0f Knowledge Graph Construction](#-knowledge-graph-construction)\n- [\ud83e\uddf1 Community Clustering](#-community-clustering)\n- [\ud83d\udcdd Report Generation](#-report-generation)\n- [\u2702\ufe0f Chunking](#-chunking)\n- [\ud83d\udd0d Passage Retrieval](#-passage-retrieval)\n- [\ud83d\udcac Question Answering](#-question-answering)\n\n## \ud83e\udd16 What is KAPipe?\n\n**KAPipe** is a modular pipeline for comprehensive **knowledge acquisition** from unstructured documents. \nIt supports **extraction**, **organization**, **retrieval**, and **utilization** of knowledge, serving as a core framework for building intelligent systems that reason over structured knowledge. \n\nCurrently, KAPipe provides the following functionalities:\n\n- \ud83e\udde9**Triple Extraction** \n - Extract facts in the form of (head entity, relation, tail entity) triples from raw text.\n\n- \ud83d\udd78\ufe0f**Knowledge Graph Construction** \n - Build a symbolic knowledge graph from triples, optionally augmented with external ontologies or knowledge bases (e.g., Wikidata, UMLS).\n\n- \ud83e\uddf1**Community Clustering** \n - Cluster the knowledge graph into semantically coherent subgraphs (*communities*).\n\n- \ud83d\udcdd**Report Generation** \n - Generate textual reports (or summaries) of graph communities.\n\n- \u2702\ufe0f**Chunking** \n - Split text (e.g., community report) into fixed-size chunks based on a predefined token length (e.g., n=300).\n\n- \ud83d\udd0d**Passage Retrieval** \n - Retrieve relevant chunks for given queries using lexical or dense retrieval.\n\n- \ud83d\udcac**Question Answering** \n - Answer questions using retrieved chunks as context.\n\nThese components together form an implementation of 
**graph-based retrieval-augmented generation (GraphRAG)**, enabling question answering and reasoning grounded in structured knowledge.\n\n## \ud83d\udce6 Installation\n\n### Step 1: Set up a Python environment\n```bash\npython -m venv .env\nsource .env/bin/activate\npip install -U pip setuptools wheel\n```\n\n### Step 2: Install KAPipe\n```bash\npip install kapipe\n```\n\n### Step 3: Download pretrained models and configurations\n\nPretrained models and configuration files can be downloaded from the following Google Drive folder:\n\n\ud83d\udcc1 [KAPipe Release Files](https://drive.google.com/drive/folders/16ypMCoLYf5kDxglDD_NYoCNAfhTy4Qwp)\n\nDownload the latest release file named release.YYYYMMDD.tar.gz, then extract it to the ~/.kapipe directory:\n\n```bash\nmkdir -p ~/.kapipe\nmv release.YYYYMMDD.tar.gz ~/.kapipe\ncd ~/.kapipe\ntar -zxvf release.YYYYMMDD.tar.gz\n```\n\nIf the extraction is successful, you should see a directory `~/.kapipe/download/`, which contains model resources.\n\n## \ud83e\udde9 Triple Extraction\n\n### Overview\n\nThe **Triple Extraction** module identifies relational facts from raw text in the form of (head entity, relation, tail entity) **triples**.\n\nSpecifically, this is achieved through the following cascade of subtasks:\n\n1. **Named Entity Recognition (NER):**\n - Detect entity mentions (spans) and classify their types.\n1. **Entity Disambiguation Retrieval (ED-Retrieval)**:\n - Retrieve candidate concept IDs from a knowledge base for each mention.\n1. **Entity Disambiguation Reranking (ED-Reranking)**:\n - Select the most probable concept ID from the retrieved candidates.\n1. **Document-level Relation Extraction (DocRE)**:\n - Extract relational triples based on the disambiguated entity set.\n\n### Input\n\nThis module takes as input:\n\n1. 
***Document***, or a dictionary containing\n - `doc_key` (str): Unique identifier for the document\n - `sentences` (list[str]): List of sentence strings (tokenized)\n\n```json\n{\n \"doc_key\": \"6794356\",\n \"sentences\": [\n \"Tricuspid valve regurgitation and lithium carbonate toxicity in a newborn infant .\",\n \"A newborn with massive tricuspid regurgitation , atrial flutter , congestive heart failure , and a high serum lithium level is described .\",\n ...\n ]\n}\n```\n(See `experiments/data/examples/documents_without_triples.json` for more details.)\n\nEach subtask takes a ***Document*** object as input, augments it with new fields, and returns it. \nThis allows custom metadata to persist throughout the pipeline.\n\n### Output\n\nThe output is also the same-format dictionary (***Document***), augmented with extracted entities and triples information:\n\n- `doc_key` (str): Same as input\n- `sentences` (list[str]): Same as input\n- `mentions` (list[dict]): Mentions, or a list of dictionaries, each containing:\n - `span` (tuple[int,int]): Mention span\n - `name` (str): Mention string\n - `entity_type` (str): Entity type\n - `entity_id` (str): Concept ID\n- `entities` (list[dict]): Entities, or a list of dictionaries, each containing\n - `mention_indices` (list[int]): Indices of mentions belonging to this entity\n - `entity_type` (str): Entity type \n - `entity_id` (str): Concept ID\n- `relations` (list[dict]): Triples, or a list dictionaries, each containing\n - `arg1` (int): Index of the head/subject entity\n - `relation` (str): Semantic relation\n - `arg2` (int): Index of the tail/object entity\n\n```json\n{\n \"doc_key\": \"6794356\",\n \"sentences\": [...],\n \"mentions\": [\n {\n \"span\": [0, 2],\n \"name\": \"Tricuspid valve regurgitation\",\n \"entity_type\": \"Disease\",\n \"entity_id\": \"D014262\"\n },\n ...\n ],\n \"entities\": [\n {\n \"mention_indices\": [0, 3, 7],\n \"entity_type\": \"Disease\",\n \"entity_id\": \"D014262\"\n },\n ...\n ],\n 
\"relations\": [\n {\n \"arg1\": 1,\n \"relation\": \"CID\",\n \"arg2\": 7\n },\n ...\n ]\n}\n```\n(See `experiments/data/examples/documents_with_triples.json` for more details.)\n\n### How to Use\n\n```python\nimport kapipe.triple_extraction\n\nIDENTIFIER = \"biaffinener_blink_blink_atlop_cdr\"\n\n# Load the Triple Extraction pipeline\npipe = kapipe.triple_extraction.load(\n identifier=IDENTIFIER,\n gpu_map={\"ner\": 0, \"ed_retrieval\": 0, \"ed_reranking\": 2, \"docre\": 3}\n)\n\n# We provide a utility for converting text (string) to Document format\n# `title` is optional\ndocument = pipe.text_to_document(doc_key=your_doc_key, text=your_text, title=your_title)\n\n# Apply the pipeline to your input document\ndocument = pipe(document)\n```\n(See `experiments/codes/run_triple_extraction.py` for specific examples.)\n\n<!-- The `identifier` determines the specific models used for each subtask. \nFor example, `\"biaffinener_blink_blink_atlop_cdr\"` uses:\n\n- **NER**: Biaffine-NER (trained on BC5CDR for Chemical and Disease types)\n- **ED-Retrieval**: BLINK Bi-Encoder (trained on BC5CDR for MeSH 2015)\n- **ED-Reranking**: BLINK Cross-Encoder (trained on BC5CDR for MeSH 2015)\n- **DocRE**: ATLOP (trained on BC5CDR for Chemical-Induce-Disease (CID) relation) -->\n\n### Supported Methods\n\n#### Named Entity Recognition (NER)\n- **Biaffine-NER** ([`Yu et al., 2020`](https://aclanthology.org/2020.acl-main.577/)): Span-based BERT model using biaffine scoring\n- **LLM-NER**: A proprietary/open-source LLM using a NER-specific prompt template and few-shot examples\n\n#### Entity Disambiguation Retrieval (ED-Retrieval)\n- **BLINK Bi-Encoder** ([`Wu et al., 2020`](https://aclanthology.org/2020.emnlp-main.519/)): Dense retriever using BERT-based encoders and approximate nearest neighbor search\n- **BM25**: Lexical matching\n- **Levenshtein**: Edit distance matching\n\n#### Entity Disambiguation Reranking (ED-Reranking)\n- **BLINK Cross-Encoder** (Wu et al., 2020): Reranker using 
a BERT-based encoder for candidates from the Bi-Encoder\n- **LLM-ED**: A proprietary/open-source LLM using an ED-specific prompt template and few-shot examples\n\n#### Document-level Relation Extraction (DocRE)\n- **ATLOP** ([`Zhou et al., 2021`](https://ojs.aaai.org/index.php/AAAI/article/view/17717)): BERT-based model for DocRE\n- **MA-ATLOP** (Oumaima & Nishida et al., 2024): Mention-agnostic extension of ATLOP\n- **MA-QA** (Oumaima & Nishida, 2024): Question-answering style DocRE model\n- **LLM-DocRE**: A proprietary/open-source LLM using a DocRE-specific prompt template and few-shot examples\n\n### Available Pipeline Identifiers\n\nThe following pipeline configurations are currently available:\n\n| identifier | NER (Entity Types) | ED-Retrieval (Knowledge Base) | ED-Reranking (Knowledge Base) | DocRE (Relations) |\n| --- | --- | --- | --- | --- |\n| `biaffinener_blink_blink_atlop_cdr` | Biaffine-NER ({Chemical, Disease}) | BLINK Bi-Encoder (MeSH 2015) | BLINK Cross-Encoder (MeSH 2015) | ATLOP ({Chemical-Induce-Disease}) |\n| `biaffinener_blink_blink_atlop_linked_docred` | Biaffine-NER ({Person, Organization, Location, Time, Number, Misc}) | BLINK Bi-Encoder (DBPedia 2020.02.01) | BLINK Cross-Encoder (DBPedia 2020.02.01) | ATLOP (DBPedia 96 relations) |\n| `llmner_blink_llmed_llmdocre_cdr` | LLM-NER `gpt-4o-mini` ({Chemical, Disease}) | BLINK Bi-Encoder (MeSH 2015) | LLM-ED `gpt-4o-mini` (MeSH 2015) | LLM-DocRE `gpt-4o-mini` ({Chemical-Induce-Disease}) |\n| `llmner_blink_llmed_llmdocre_linked_docred` | LLM-NER `gpt-4o-mini` ({Person, Organization, Location, Time, Number, Misc}) | BLINK Bi-Encoder (DBPedia 2020.02.01) | LLM-ED `gpt-4o-mini` (DBPedia 2020.02.01) | LLM-DocRE `gpt-4o-mini` (DBPedia 96 relations) |\n\n## \ud83d\udd78\ufe0f Knowledge Graph Construction\n\n### Overview\n\nThe **Knowledge Graph Construction** module builds a **directed multi-relational graph** from a set of extracted triples.\n\n- **Nodes** represent unique entities (i.e., concepts).\n- 
**Edges** represent semantic relations between entities.\n\n### Input\n\nThis module takes as input:\n\n1. List of ***Document*** objects with triples, produced by the **Triple Extraction** module\n\n2. (optional) ***Additional Triples*** (existing KBs), or a list of dictionaries, each containing:\n - `head` (str): Entity ID of the subject\n - `relation` (str): Relation type (e.g., treats, causes)\n - `tail` (str): Entity ID of the object\n```json\n[\n {\n \"head\": \"D000001\",\n \"relation\": \"treats\",\n \"tail\": \"D014262\"\n },\n ...\n]\n```\n(See `experiments/data/examples/additional_triples.json` for more details.)\n\n\n3. ***Entity Dictionary***, or a list of dictionaries, each containing:\n - `entity_id` (str): Unique concept ID\n - `canonical_name` (str): Official name of the concept\n - `entity_type` (str): Type/category of the concept\n - `synonyms` (list[str]): A list of alternative names\n - `description` (str): Textual definition of the concept\n```JSON\n[\n {\n \"entity_id\": \"C009166\",\n \"entity_index\": 252696,\n \"entity_type\": null,\n \"canonical_name\": \"retinol acetate\",\n \"synonyms\": [\n \"retinyl acetate\",\n \"vitamin A acetate\"\n ],\n \"description\": \"\",\n \"tree_numbers\": []\n },\n {\n \"entity_id\": \"D000641\",\n \"entity_index\": 610,\n \"entity_type\": \"Chemical\",\n \"canonical_name\": \"Ammonia\",\n \"synonyms\": [],\n \"description\": \"A colorless alkaline gas. It is formed in the body during decomposition of organic materials during a large number of metabolically important reactions. 
Note that the aqueous form of ammonia is referred to as AMMONIUM HYDROXIDE.\",\n \"tree_numbers\": [\n \"D01.362.075\",\n \"D01.625.050\"\n ]\n },\n ...\n]\n```\n(See `experiments/data/examples/entity_dict.json` for more details.)\n\n### Output\n\nThe output is a `networkx.MultiDiGraph` object representing the knowledge graph.\n\nEach node has the following attributes:\n\n- `entity_id` (str): Concept ID (e.g., MeSH ID)\n- `entity_type` (str): Type of entity (e.g., Disease, Chemical, Person, Location)\n- `name` (str): Canonical name (from Entity Dictionary)\n- `description` (str): Textual definition (from Entity Dictionary)\n- `doc_key_list` (list[str]): List of document IDs where this entity appears\n\nEach edge has the following attributes:\n\n- `head` (str): Head entity ID\n- `tail` (str): Tail entity ID\n- `relation` (str): Type of semantic relation\n- `doc_key_list` (list[str]): List of document IDs supporting this relation\n\n(See `experiments/data/examples/graph.graphml` for more details.)\n\n### How to Use\n\n```python\nfrom kapipe.graph_construction import GraphConstructor\n\nPATH_TO_DOCUMENTS = \"./experiments/data/examples/documents.json\"\nPATH_TO_TRIPLES = \"./experiments/data/examples/additional_triples.json\" # Or set to None if unused\nPATH_TO_ENTITY_DICT = \"./experiments/data/examples/entity_dict.json\"\n\n# Initialize the knowledge graph constructor\nconstructor = GraphConstructor()\n\n# Construct the knowledge graph\ngraph = constructor.construct_knowledge_graph(\n path_documents_list=[PATH_TO_DOCUMENTS],\n path_additional_triples=PATH_TO_TRIPLES, # Optional\n path_entity_dict=PATH_TO_ENTITY_DICT\n)\n```\n(See `experiments/codes/run_graph_construction.py` for specific examples.)\n\n## \ud83e\uddf1 Community Clustering\n\n### Overview\n\nThe **Community Clustering** module partitions the knowledge graph into **semantically coherent subgraphs**, referred to as *communities*. 
\nEach community represents a localized set of closely related concepts and relations, and serves as a fundamental unit of structured knowledge.\n\n### Input\n\nThis module takes as input:\n\n1. Knowledge graph (`networkx.MultiDiGraph`) produced by the **Knowledge Graph Construction** module.\n\n### Output\n\nThe output is a list of hierarchical community records (dictionaries), each containing:\n\n- `community_id` (str): Unique ID for the community\n- `nodes` (list[str]): List of entity IDs belonging to the community (null for ROOT)\n- `level` (int): Depth in the hierarchy (ROOT=-1)\n- `parent_community_id` (str): ID of the parent community (null for ROOT)\n- `child_community_ids` (list[str]): List of child community IDs (empty for leaf communities)\n\n```json\n[\n {\n \"community_id\": \"ROOT\",\n \"nodes\": null,\n \"level\": -1,\n \"parent_community_id\": null,\n \"child_community_ids\": [\n \"0\",\n \"1\",\n \"2\",\n \"3\",\n \"4\",\n \"5\",\n \"6\",\n \"7\",\n \"8\",\n \"9\"\n ]\n },\n {\n \"community_id\": \"0\",\n \"nodes\": [\n \"D016651\",\n \"D014262\",\n \"D003866\",\n \"D003490\",\n \"D001145\"\n ],\n \"level\": 0,\n \"parent_community_id\": \"ROOT\",\n \"child_community_ids\": [...]\n },\n ...\n]\n```\n(See `experiments/data/examples/communities.json` for more details.)\n\nThis hierarchical structure enables multi-level organization of knowledge, particularly useful for coarse-to-fine report generation and scalable retrieval.\n\n### How to Use\n\n```python\nfrom kapipe.community_clustering import (\n HierarchicalLeiden,\n NeighborhoodAggregation,\n TripleLevelFactorization\n)\n\n# Initialize the community clusterer\nclusterer = HierarchicalLeiden()\n# clusterer = NeighborhoodAggregation()\n# clusterer = TripleLevelFactorization()\n\n# Apply the community clusterer to the graph\ncommunities = clusterer.cluster_communities(graph)\n```\n(See `experiments/codes/run_community_clustering.py` for specific examples.)\n\n### Supported Methods\n\n- 
**Hierarchical Leiden**\n - Recursively applies the Leiden algorithm (Traag et al., 2019) to optimize modularity. Large communities are subdivided until they satisfy a predefined size constraint (default: 10 nodes).\n- **Neighborhood Aggregation**\n - Groups each node with its immediate neighbors to form local communities.\n- **Triple-level Factorization**\n - Treats each individual (subject, relation, object) triple as an atomic community.\n\n## \ud83d\udcdd Report Generation\n\n### Overview\n\nThe **Report Generation** module converts each community into a **natural language report**, making structured knowledge interpretable for both humans and language models. \n\n### Input\n\nThis module takes as input:\n\n1. Knowledge graph (`networkx.MultiDiGraph`) generated by the **Knowledge Graph Construction** module.\n2. List of community records generated by the **Community Clustering** module.\n\n### Output\n\nThe output is a `.jsonl` file, where each line corresponds to one ***Passage***, a dictionary containing:\n\n- `title` (str): Concise topic summary of the community\n- `text` (str): Full natural language description of the community's content\n\n```json\n{\"title\": \"Lithium Carbonate and Related Health Conditions\", \"text\": \"This report examines the interconnections between Lithium Carbonate, ....\"}\n{\"title\": \"Phenobarbital and Drug-Induced Dyskinesia\", \"text\": \"This report examines the relationship between Phenobarbital, ...\"}\n{\"title\": \"Ammonia and Valproic Acid in Disorders of Excessive Somnolence\", \"text\": \"This report examines the relationship between ammonia and valproic acid, ...\"}\n...\n```\n(See `experiments/data/examples/reports.jsonl` for more details.)\n\n\u2705 The output format is fully compatible with the **Chunking** module, which accepts any dictionary containing `title` and `text` fields. 
\nThus, each community report can also be treated as a generic ***Passage***.\n\n### How to Use\n\n```python\nfrom kapipe.report_generation import (\n LLMBasedReportGenerator,\n TemplateBasedReportGenerator\n)\n\nPATH_TO_REPORTS = \"./experiments/data/examples/reports.jsonl\"\n\n# Initialize the report generator\ngenerator = LLMBasedReportGenerator()\n# generator = TemplateBasedReportGenerator()\n \n# Generate community reports\ngenerator.generate_community_reports(\n graph=graph,\n communities=communities,\n path_output=PATH_TO_REPORTS\n)\n```\n(See `experiments/codes/run_report_generation.py` for specific examples.)\n\n### Supported Methods\n\n- **LLM-based Generation**\n - Uses a large language model (e.g., GPT-4o-mini) prompted with the content of a community to generate fluent summaries.\n- **Template-based Generation**\n - Uses a deterministic format that verbalizes each entity/triple and linearizes them:\n - Entity format: `\"{name} | {type} | {definition}\"`\n - Triple format: `\"{subject} | {relation} | {object}\"`\n\n## \u2702\ufe0f Chunking\n\n### Overview\n\nThe **Chunking** module splits each input text into multiple **non-overlapping text chunks**, each constrained by a maximum token length (e.g., 100 tokens). \nThis module is essential for preparing context units that are compatible with downstream modules such as retrieval and question answering. \n\n<!-- It supports any input that conforms to the following format:\n\n- A dictionary containing a `\"title\"` and `\"text\"` field. \n\nThis makes the module applicable not only to **community reports**, but also to other types of *Passage* data with similar structure. -->\n\n### Input\n\nThis module takes as input:\n\n1. 
***Passage***, or a dictionary containing `title` and `text` fields.\n\n - `title` (str): Title of the passage\n - `text` (str): Full natural language description of the passage\n\n```json\n{\n \"title\": \"Lithium Carbonate and Related Health Conditions\",\n \"text\": \"This report examines the interconnections between Lithium Carbonate, ...\"\n}\n```\n\n### Output\n\nThe output is a list of ***Passage*** objects, each containing:\n- `title` (str): Same as input\n- `text` (str): Chunked portion of the original text, within the specified token window\n- Other metadata (e.g., community_id) is carried over\n\n```json\n[\n {\n \"title\": \"Lithium Carbonate and Related Health Conditions\",\n \"text\": \"This report examines the interconnections between Lithium Carbonate, ...\"\n },\n {\n \"title\": \"Lithium Carbonate and Related Health Conditions\",\n \"text\": \"This duality necessitates careful monitoring of patients receiving Lithium treatment, ...\"\n },\n {\n \"title\": \"Lithium Carbonate and Related Health Conditions\",\n \"text\": \"Similarly, cardiac arrhythmias, which involve irregular heartbeats, can pose ...\"\n },\n ...\n]\n```\n(See `experiments/data/examples/reports.chunked_w100.jsonl` for more details.)\n\n### How to Use\n\n```python\nfrom kapipe.chunking import Chunker\n\nMODEL_NAME = \"en_core_web_sm\" # SpaCy tokenizer\nWINDOW_SIZE = 100 # Max number of tokens per chunk\n\n# Initialize the chunker\nchunker = Chunker(model_name=MODEL_NAME)\n\n# Chunk the passage\nchunked_passages = chunker.split_passage_to_chunked_passages(\n passage=passage,\n window_size=WINDOW_SIZE\n)\n```\n(See `experiments/codes/run_chunking.py` for specific examples.)\n\n## \ud83d\udd0d Passage Retrieval\n\n### Overview\n\nThe **Passage Retrieval** module searches for the top-k most **relevant chunks** given a user query. 
\nIt scores each chunk against the query with either a lexical retriever (e.g., BM25) or a dense, embedding-based retriever (e.g., Contriever).\n\n### Input\n\n**(1) Indexing**:\n\nDuring the indexing phase, this module takes as input:\n\n1. List of ***Passage*** objects\n\n**(2) Search**:\n\nDuring the search phase, this module takes as input:\n\n1. ***Question***, or a dictionary containing:\n - `question_key` (str): Unique identifier for the question\n - `question` (str): Natural language question\n\n```json\n{\n \"question_key\": \"question#123\",\n \"question\": \"What does lithium carbonate induce?\"\n}\n```\n(See `experiments/data/examples/questions.json` for more details.)\n\n### Output\n\n**(1) Indexing**:\n\nThe indexing result is automatically saved to the path specified by `index_root` and `index_name`.\n\n**(2) Search**:\n\nThe search result for each question is represented as a dictionary containing:\n- `question_key` (str): Refers back to the original query\n- `contexts` (list[***Passage***]): Top-k retrieved chunks sorted by relevance, each containing:\n - `title` (str): Chunk title\n - `text` (str): Chunk text\n - `score` (float): Similarity score computed by the retriever\n - `rank` (int): Rank of the chunk (1-based)\n\n```json\n{\n \"question_key\": \"question#123\",\n \"contexts\": [\n {\n \"title\": \"Lithium Carbonate and Related Health Conditions\",\n \"text\": \"This report examines the interconnections between Lithium Carbonate, ...\",\n (metadata, if any)\n \"score\": 1.5991605520248413,\n \"rank\": 1\n },\n {\n \"title\": \"Lithium Carbonate and Related Health Conditions\",\n \"text\": \"\\n\\nIn summary, while Lithium Carbonate is an effective treatment for mood disorders, ...\",\n (metadata, if any)\n \"score\": 1.51018488407135,\n \"rank\": 2\n },\n ...\n ]\n}\n```\n(See `experiments/data/examples/questions.contexts.json` for more details.)\n\n### How to Use\n\n**(1) Indexing**:\n\n```python\nfrom 
kapipe.passage_retrieval import Contriever\n\nINDEX_ROOT = \"./\"\nINDEX_NAME = \"example\"\n\n# Initialize retriever\nretriever = Contriever(\n max_passage_length=512,\n pooling_method=\"average\",\n normalize=False,\n gpu_id=0,\n metric=\"inner-product\"\n)\n\n# Build index\nretriever.make_index(\n passages=passages,\n index_root=INDEX_ROOT,\n index_name=INDEX_NAME\n)\n```\n(See `experiments/codes/run_passage_retrieval.py` for specific examples.)\n\n**(2) Search**:\n\n```python\n# Load the index\nretriever.load_index(index_root=INDEX_ROOT, index_name=INDEX_NAME)\n\n# Search for top-k contexts\nretrieved_passages = retriever.search(queries=[question], top_k=10)[0]\ncontexts_for_question = {\n \"question_key\": question[\"question_key\"],\n \"contexts\": retrieved_passages\n}\n```\n\n### Supported Methods\n\n- **BM25**\n - A sparse lexical matching model based on term frequency and inverse document frequency.\n- **Contriever** (Izacard et al., 2022)\n - A dual-encoder retriever trained with contrastive learning (Izacard et al., 2022). Computes similarity between query and chunk embeddings.\n- **ColBERTv2** (Santhanam et al., 2022)\n - A token-level late-interaction retriever for fine-grained semantic matching. Provides higher accuracy with increased inference cost.\n - Note: This method is currently unavailable due to an import error in the external `ragatouille` package ([here](https://github.com/AnswerDotAI/RAGatouille/issues/272)).\n\n## \ud83d\udcac Question Answering\n\n### Overview\n\nThe **Question Answering** module generates an answer for each user query, optionally conditioned on the retrieved context chunks. \nIt uses a large language model such as GPT-4o to produce factually grounded and context-aware answers in natural language.\n\n### Input\n\nThis module takes as input:\n\n1. 
***Question***, or a dictionary containing:\n - `question_key`: Unique identifier for the question\n - `question`: Natural language question string\n\n(See `experiments/data/examples/questions.json` for more details.)\n\n2. A dictionary containing:\n - `question_key`: The same identifier as in the ***Question***\n - `contexts`: List of ***Passage*** objects\n\n(See `experiments/data/examples/questions.contexts.json` for more details.)\n\n### Output\n\nThe answer is a dictionary containing:\n\n- `question_key` (str): Same as input\n- `question` (str): Original question text\n- `output_answer` (str): Model-generated natural language answer\n- `helpfulness_score` (float): Confidence score generated by the model\n\n```json\n{\n \"question_key\": \"question#123\",\n \"question\": \"What does lithium carbonate induce?\",\n \"output_answer\": \"Lithium carbonate can induce depressive disorder, cyanosis, and cardiac arrhythmias.\",\n \"helpfulness_score\": 1.0\n}\n```\n(See `experiments/data/examples/answers.json` for more details.)\n\n### How to Use\n\n```python\nfrom os.path import expanduser\nfrom kapipe.qa import LLMQA\nfrom kapipe import utils\n\n# Initialize the QA module\nanswerer = LLMQA(path_snapshot=expanduser(\"~/.kapipe/download/results/qa/llmqa/openai_gpt4o\"))\n\n# Generate answer\nanswer = answerer.run(\n question=question,\n contexts_for_question=contexts_for_question\n)\n\n```\n(See `experiments/codes/run_qa.py` for specific examples.)\n",
"bugtrack_url": null,
"license": "LICENSE",
"summary": "A modular pipeline for knowledge acquisition",
"version": "0.0.7",
"project_urls": {
"Homepage": "https://github.com/norikinishida/kapipe"
},
"split_keywords": [
"nlp",
" knowledge acquisition",
" information extraction",
" knowledge graph",
" retrieval",
" question answering"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "bd59a89d3f4e0b9daa0389e33ad5a1836834130afb5a39f5545b7340a88d587e",
"md5": "dfc73f06c8cbc2a52a02c36dc1779c46",
"sha256": "b1117d553dd1efaf7584c734849695198f51b89d2e3d3e659ebb5eb5db4c1df0"
},
"downloads": -1,
"filename": "kapipe-0.0.7-py3-none-any.whl",
"has_sig": false,
"md5_digest": "dfc73f06c8cbc2a52a02c36dc1779c46",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 183382,
"upload_time": "2025-07-10T10:49:02",
"upload_time_iso_8601": "2025-07-10T10:49:02.307804Z",
"url": "https://files.pythonhosted.org/packages/bd/59/a89d3f4e0b9daa0389e33ad5a1836834130afb5a39f5545b7340a88d587e/kapipe-0.0.7-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "2b7c31bc4ffeaa9a6a89d105994280668fa9a537bf4b4db5669eab4903169716",
"md5": "2c3da45a327a7589a35b756ba4b7a15c",
"sha256": "cecd658449e089376aeee8956910bf6d6c90872416f82eb67576edae4eab1572"
},
"downloads": -1,
"filename": "kapipe-0.0.7.tar.gz",
"has_sig": false,
"md5_digest": "2c3da45a327a7589a35b756ba4b7a15c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 137342,
"upload_time": "2025-07-10T10:49:04",
"upload_time_iso_8601": "2025-07-10T10:49:04.625820Z",
"url": "https://files.pythonhosted.org/packages/2b/7c/31bc4ffeaa9a6a89d105994280668fa9a537bf4b4db5669eab4903169716/kapipe-0.0.7.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-10 10:49:04",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "norikinishida",
"github_project": "kapipe",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "numpy",
"specs": [
[
">=",
"1.22.2"
]
]
},
{
"name": "scipy",
"specs": [
[
">=",
"1.10.1"
]
]
},
{
"name": "pandas",
"specs": [
[
">=",
"1.5.3"
]
]
},
{
"name": "spacy",
"specs": [
[
">=",
"3.7.1"
]
]
},
{
"name": "spacy-alignments",
"specs": [
[
">=",
"0.9.1"
]
]
},
{
"name": "scispacy",
"specs": []
},
{
"name": "torch",
"specs": [
[
">=",
"2.6.0"
]
]
},
{
"name": "torch-tensorrt",
"specs": []
},
{
"name": "torchdata",
"specs": []
},
{
"name": "torchtext",
"specs": []
},
{
"name": "torchvision",
"specs": []
},
{
"name": "opt-einsum",
"specs": [
[
">=",
"3.3.0"
]
]
},
{
"name": "transformers",
"specs": [
[
">=",
"4.46.0"
]
]
},
{
"name": "accelerate",
"specs": [
[
">=",
"1.0.1"
]
]
},
{
"name": "bitsandbytes",
"specs": [
[
">=",
"0.44.1"
]
]
},
{
"name": "openai",
"specs": [
[
">=",
"1.53.0"
]
]
},
{
"name": "tenacity",
"specs": [
[
">=",
"9.1.2"
]
]
},
{
"name": "faiss-gpu",
"specs": [
[
">=",
"1.7.2"
]
]
},
{
"name": "Levenshtein",
"specs": [
[
">=",
"0.25.0"
]
]
},
{
"name": "networkx",
"specs": [
[
">=",
"3.4.2"
]
]
},
{
"name": "neo4j",
"specs": [
[
">=",
"5.25.0"
]
]
},
{
"name": "graspologic",
"specs": [
[
">=",
"3.4.1"
]
]
},
{
"name": "future",
"specs": [
[
">=",
"1.0.0"
]
]
},
{
"name": "pyhocon",
"specs": [
[
">=",
"0.3.60"
]
]
},
{
"name": "tqdm",
"specs": [
[
">=",
"4.66.1"
]
]
},
{
"name": "jsonlines",
"specs": [
[
">=",
"4.0.0"
]
]
}
],
"lcname": "kapipe"
}