kapipe

Name	kapipe JSON
Version	0.0.7 JSON
	download
home_page	None
Summary	A modular pipeline for knowledge acquisition
upload_time	2025-07-10 10:49:04
maintainer	None
docs_url	None
author	None
requires_python	>=3.10
license	LICENSE
keywords	nlp knowledge acquisition information extraction knowledge graph retrieval question answering
VCS
bugtrack_url
requirements	numpy scipy pandas spacy spacy-alignments scispacy torch torch-tensorrt torchdata torchtext torchvision opt-einsum transformers accelerate bitsandbytes openai tenacity faiss-gpu Levenshtein networkx neo4j graspologic future pyhocon tqdm jsonlines
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            # KAPipe: A Modular Pipeline for Knowledge Acquisition

## Table of Contents

- [🤖 What is KAPipe?](#what-is-kapipe)
- [📦 Installation](#-installation)
- [🧩 Triple Extraction](#-triple-extraction)
- [🕸️ Knowledge Graph Construction](#-knowledge-graph-construction)
- [🧱 Community Clustering](#-community-clustering)
- [📝 Report Generation](#-report-generation)
- [✂️ Chunking](#-chunking)
- [🔍 Passage Retrieval](#-passage-retrieval)
- [💬 Question Answering](#-question-answering)

## 🤖 What is KAPipe?

**KAPipe** is a modular pipeline for comprehensive **knowledge acquisition** from unstructured documents.  
It supports **extraction**, **organization**, **retrieval**, and **utilization** of knowledge, serving as a core framework for building intelligent systems that reason over structured knowledge.  

Currently, KAPipe provides the following functionalities:

- 🧩**Triple Extraction**  
    - Extract facts in the form of (head entity, relation, tail entity) triples from raw text.

- 🕸️**Knowledge Graph Construction**  
    - Build a symbolic knowledge graph from triples, optionally augmented with external ontologies or knowledge bases (e.g., Wikidata, UMLS).

- 🧱**Community Clustering**  
    - Cluster the knowledge graph into semantically coherent subgraphs (*communities*).

- 📝**Report Generation**  
    - Generate textual reports (or summaries) of graph communities.

- ✂️**Chunking**  
    - Split text (e.g., community report) into fixed-size chunks based on a predefined token length (e.g., n=300).

- 🔍**Passage Retrieval**  
    - Retrieve relevant chunks for given queries using lexical or dense retrieval.

- 💬**Question Answering**  
    - Answer questions using retrieved chunks as context.

These components together form an implementation of **graph-based retrieval-augmented generation (GraphRAG)**, enabling question answering and reasoning grounded in structured knowledge.

## 📦 Installation

### Step 1: Set up a Python environment
```bash
python -m venv .env
source .env/bin/activate
pip install -U pip setuptools wheel
```

### Step 2: Install KAPipe
```bash
pip install kapipe
```

### Step 3: Download pretrained models and configurations

Pretrained models and configuration files can be downloaded from the following Google Drive folder:

📁 [KAPipe Release Files](https://drive.google.com/drive/folders/16ypMCoLYf5kDxglDD_NYoCNAfhTy4Qwp)

Download the latest release file named release.YYYYMMDD.tar.gz, then extract it to the ~/.kapipe directory:

```bash
mkdir -p ~/.kapipe
mv release.YYYYMMDD.tar.gz ~/.kapipe
cd ~/.kapipe
tar -zxvf release.YYYYMMDD.tar.gz
```

If the extraction is successful, you should see a directory `~/.kapipe/download/`, which contains model resources.

## 🧩 Triple Extraction

### Overview

The **Triple Extraction** module identifies relational facts from raw text in the form of (head entity, relation, tail entity) **triples**.

Specifically, this is achieved through the following cascade of subtasks:

1. **Named Entity Recognition (NER):**
    - Detect entity mentions (spans) and classify their types.
1. **Entity Disambiguation Retrieval (ED-Retrieval)**:
    - Retrieve candidate concept IDs from a knowledge base for each mention.
1. **Entity Disambiguation Reranking (ED-Reranking)**:
    - Select the most probable concept ID from the retrieved candidates.
1. **Document-level Relation Extraction (DocRE)**:
    - Extract relational triples based on the disambiguated entity set.

### Input

This module takes as input:

1. ***Document***, or a dictionary containing
    - `doc_key` (str): Unique identifier for the document
    - `sentences` (list[str]): List of sentence strings (tokenized)

```json
{
    "doc_key": "6794356",
    "sentences": [
        "Tricuspid valve regurgitation and lithium carbonate toxicity in a newborn infant .",
        "A newborn with massive tricuspid regurgitation , atrial flutter , congestive heart failure , and a high serum lithium level is described .",
        ...
    ]
}
```
(See `experiments/data/examples/documents_without_triples.json` for more details.)

Each subtask takes a ***Document*** object as input, augments it with new fields, and returns it.  
This allows custom metadata to persist throughout the pipeline.

### Output

The output is also the same-format dictionary (***Document***), augmented with extracted entities and triples information:

- `doc_key` (str): Same as input
- `sentences` (list[str]): Same as input
- `mentions` (list[dict]): Mentions, or a list of dictionaries, each containing:
    - `span` (tuple[int,int]): Mention span
    - `name` (str): Mention string
    - `entity_type` (str): Entity type
    - `entity_id` (str): Concept ID
- `entities` (list[dict]): Entities, or a list of dictionaries, each containing
    - `mention_indices` (list[int]): Indices of mentions belonging to this entity
    - `entity_type` (str): Entity type 
    - `entity_id` (str): Concept ID
- `relations` (list[dict]): Triples, or a list dictionaries, each containing
    - `arg1` (int): Index of the head/subject entity
    - `relation` (str): Semantic relation
    - `arg2` (int): Index of the tail/object entity

```json
{
    "doc_key": "6794356",
    "sentences": [...],
    "mentions": [
        {
            "span": [0, 2],
            "name": "Tricuspid valve regurgitation",
            "entity_type": "Disease",
            "entity_id": "D014262"
        },
        ...
    ],
    "entities": [
        {
            "mention_indices": [0, 3, 7],
            "entity_type": "Disease",
            "entity_id": "D014262"
        },
        ...
    ],
    "relations": [
        {
            "arg1": 1,
            "relation": "CID",
            "arg2": 7
        },
        ...
    ]
}
```
(See `experiments/data/examples/documents_with_triples.json` for more details.)

### How to Use

```python
import kapipe.triple_extraction

IDENTIFIER = "biaffinener_blink_blink_atlop_cdr"

# Load the Triple Extraction pipeline
pipe = kapipe.triple_extraction.load(
    identifier=IDENTIFIER,
    gpu_map={"ner": 0, "ed_retrieval": 0, "ed_reranking": 2, "docre": 3}
)

# We provide a utility for converting text (string) to Document format
# `title` is optional
document = pipe.text_to_document(doc_key=your_doc_key, text=your_text, title=your_title)

# Apply the pipeline to your input document
document = pipe(document)
```
(See `experiments/codes/run_triple_extraction.py` for specific examples.)

<!-- The `identifier` determines the specific models used for each subtask.  
For example, `"biaffinener_blink_blink_atlop_cdr"` uses:

- **NER**: Biaffine-NER (trained on BC5CDR for Chemical and Disease types)
- **ED-Retrieval**: BLINK Bi-Encoder (trained on BC5CDR for MeSH 2015)
- **ED-Reranking**: BLINK Cross-Encoder (trained on BC5CDR for MeSH 2015)
- **DocRE**: ATLOP (trained on BC5CDR for Chemical-Induce-Disease (CID) relation) -->

### Supported Methods

#### Named Entity Recognition (NER)
- **Biaffine-NER** ([`Yu et al., 2020`](https://aclanthology.org/2020.acl-main.577/)): Span-based BERT model using biaffine scoring
- **LLM-NER**: A proprietary/open-source LLM using a NER-specific prompt template and few-shot examples

#### Entity Disambiguation Retrieval (ED-Retrieval)
- **BLINK Bi-Encoder** ([`Wu et al., 2020`](https://aclanthology.org/2020.emnlp-main.519/)): Dense retriever using BERT-based encoders and approximate nearest neighbor search
- **BM25**: Lexical matching
- **Levenshtein**: Edit distance matching

#### Entity Disambiguation Reranking (ED-Reranking)
- **BLINK Cross-Encoder** (Wu et al., 2020): Reranker using a BERT-based encoder for candidates from the Bi-Encoder
- **LLM-ED**: A proprietary/open-source LLM using a ED-specific prompt template and few-shot examples

#### Document-level Relation Extraction (DocRE)
- **ATLOP** ([`Zhou et al., 2021`](https://ojs.aaai.org/index.php/AAAI/article/view/17717)): BERT-based model for DocRE
- **MA-ATLOP** (Oumaima & Nishida et al., 2024): Mention-agnostic extension of ATLOP
- **MA-QA** (Oumaima & Nishida, 2024): Question-answering style DocRE model
- **LLM-DocRE**: A proprietary/open-source LLM using a DocRE-specific prompt template and few-shot examples

### Available Pipeline Identifiers

The following pipeline configurations are currently available:

| identifier | NER (Entity Types) | ED-Retrieval (Knowledge Base) | ED-Reranking (Knowledge Base) | DocRE (Relations) |
| --- | --- | --- | --- | --- |
| `biaffinener_blink_blink_atlop_cdr` | Biaffine-NER ({Chemical, Disease}) | BLINK Bi-Encoder (MeSH 2015) | BLINK Cross-Encoder (MeSH 2015) | ATLOP ({Chemical-Induce-Disease}) |
| `biaffinener_blink_blink_atlop_linked_docred` | Biaffine-NER ({Person, Organization, Location, Time, Number, Misc}) | BLINK Bi-Encoder (DBPedia 2020.02.01) | BLINK Cross-Encoder (DBPedia 2020.02.01) | ATLOP (DBPedia 96 relations) |
| `llmner_blink_llmed_llmdocre_cdr` | LLM-NER `gpt-4o-mini` ({Chemical, Disease}) | BLINK Bi-Encoder (MeSH 2015) | LLM-ED `gpt-4o-mini` (MeSH 2015) | LLM-DocRE `gpt-4o-mini` ({Chemical-Induce-Disease}) |
| `llmner_blink_llmed_llmdocre_linked_docred` | LLM-NER `gpt-4o-mini` ({Person, Organization, Location, Time, Number, Misc}) | BLINK Bi-Encoder (DBPedia 2020.02.01) | LLM-ED `gpt-4o-mini` (DBPedia 2020.02.01) | LLM-DocRE `gpt-4o-mini` (DBPedia 96 relations) |

## 🕸️ Knowledge Graph Construction

### Overview

The **Knowledge Graph Construction** module builds a **directed multi-relational graph** from a set of extracted triples.

- **Nodes** represent unique entities (i.e., concepts).
- **Edges** represent semantic relations between entities.

### Input

This module takes as input:

1. List of ***Document*** objects with triples, produced by the **Triple Extraction** module

2. (optional) ***Additional Triples*** (existing KBs), or a list of dictionaries, each containing:
    - `head` (str): Entity ID of the subject
    - `relation` (str): Relation type (e.g., treats, causes)
    - `tail` (str): Entity ID of the object
```json
[
    {
        "head": "D000001",
        "relation": "treats",
        "tail": "D014262"
    },
    ...
]
```
(See `experiments/data/examples/additional_triples.json` for more details.)


3. ***Entity Dictionary***, or a list of dictionaries, each containing:
    - `entity_id` (str): Unique concept ID
    - `canonical_name` (str): Official name of the concept
    - `entity_type` (str): Type/category of the concept
    - `synonyms` (list[str]): A list of alternative names
    - `description` (str): Textual definition of the concept
```JSON
[
    {
        "entity_id": "C009166",
        "entity_index": 252696,
        "entity_type": null,
        "canonical_name": "retinol acetate",
        "synonyms": [
            "retinyl acetate",
            "vitamin A acetate"
        ],
        "description": "",
        "tree_numbers": []
    },
    {
        "entity_id": "D000641",
        "entity_index": 610,
        "entity_type": "Chemical",
        "canonical_name": "Ammonia",
        "synonyms": [],
        "description": "A colorless alkaline gas. It is formed in the body during decomposition of organic materials during a large number of metabolically important reactions. Note that the aqueous form of ammonia is referred to as AMMONIUM HYDROXIDE.",
        "tree_numbers": [
            "D01.362.075",
            "D01.625.050"
        ]
    },
    ...
]
```
(See `experiments/data/examples/entity_dict.json` for more details.)

### Output

The output is a `networkx.MultiDiGraph` object representing the knowledge graph.

Each node has the following attributes:

- `entity_id` (str): Concept ID (e.g., MeSH ID)
- `entity_type` (str): Type of entity (e.g., Disease, Chemical, Person, Location)
- `name` (str): Canonical name (from Entity Dictionary)
- `description` (str): Textual definition (from Entity Dictionary)
- `doc_key_list` (list[str]): List of document IDs where this entity appears

Each edge has the following attributes:

- `head` (str): Head entity ID
- `tail` (str): Tail entity ID
- `relation` (str): Type of semantic relation
- `doc_key_list` (list[str]): List of document IDs supporting this relation

(See `experiments/data/examples/graph.graphml` for more details.)

### How to Use

```python
from kapipe.graph_construction import GraphConstructor

PATH_TO_DOCUMENTS = "./experiments/data/examples/documents.json"
PATH_TO_TRIPLES = "./experiments/data/examples/additional_triples.json" # Or set to None if unused
PATH_TO_ENTITY_DICT = "./experiments/data/examples/entity_dict.json"

# Initialize the knowledge graph constructor
constructor = GraphConstructor()

# Construct the knowledge graph
graph = constructor.construct_knowledge_graph(
    path_documents_list=[PATH_TO_DOCUMENTS],
    path_additional_triples=PATH_TO_TRIPLES, # Optional
    path_entity_dict=PATH_TO_ENTITY_DICT
)
```
(See `experiments/codes/run_graph_construction.py` for specific examples.)

## 🧱 Community Clustering

### Overview

The **Community Clustering** module partitions the knowledge graph into **semantically coherent subgraphs**, referred to as *communities*.  
Each community represents a localized set of closely related concepts and relations, and serves as a fundamental unit of structured knowledge.

### Input

This module takes as input:

1. Knowledge graph (`networkx.MultiDiGraph`) produced by the **Knowledge Graph Construction** module.

### Output

The output is a list of hierarchical community records (dictionaries), each containing:

- `community_id` (str): Unique ID for the community
- `nodes` (list[str]): List of entity IDs belonging to the community (null for ROOT)
- `level` (int): Depth in the hierarchy (ROOT=-1)
- `parent_community_id` (str): ID of the parent community (null for ROOT)
- `child_community_ids` (list[str]): List of child community IDs (empty for leaf communities)

```json
[
    {
        "community_id": "ROOT",
        "nodes": null,
        "level": -1,
        "parent_community_id": null,
        "child_community_ids": [
            "0",
            "1",
            "2",
            "3",
            "4",
            "5",
            "6",
            "7",
            "8",
            "9"
        ]
    },
    {
        "community_id": "0",
        "nodes": [
            "D016651",
            "D014262",
            "D003866",
            "D003490",
            "D001145"
        ],
        "level": 0,
        "parent_community_id": "ROOT",
        "child_community_ids": [...]
    },
    ...
]
```
(See `experiments/data/examples/communities.json` for more details.)

This hierarchical structure enables multi-level organization of knowledge, particularly useful for coarse-to-fine report generation and scalable retrieval.

### How to Use

```python
from kapipe.community_clustering import (
    HierarchicalLeiden,
    NeighborhoodAggregation,
    TripleLevelFactorization
)

# Initialize the community clusterer
clusterer = HierarchicalLeiden()
# clusterer = NeighborhoodAggregation()
# clusterer = TripleLevelFactorization()

# Apply the community clusterer to the graph
communities = clusterer.cluster_communities(graph)
```
(See `experiments/codes/run_community_clustering.py` for specific examples.)

### Supported Methods

- **Hierarchical Leiden**
    - Recursively applies the Leiden algorithm (Traag et al., 2019) to optimize modularity. Large communities are subdivided until they satisfy a predefined size constraint (default: 10 nodes).
- **Neighborhood Aggregation**
    - Groups each node with its immediate neighbors to form local communities.
- **Triple-level Factorization**
    - Treats each individual (subject, relation, object) triple as an atomic community.

## 📝 Report Generation

### Overview

The **Report Generation** module converts each community into a **natural language report**, making structured knowledge interpretable for both humans and language models.  

### Input

This module takes as input:

1. Knowledge graph (`networkx.MultiDiGraph`) generated by the **Knowledge Graph Construction** module.
1. List of community records generated by the **Community Clustering** module.

### Output

The output is a `.jsonl` file, where each line corresponds to one ***Passage***, a dictionary containing:

- `title` (str): Concice topic summary of the community
- `text` (str): Full natural language description of the community's content

```json
{"title": "Lithium Carbonate and Related Health Conditions", "text": "This report examines the interconnections between Lithium Carbonate, ...."}
{"title": "Phenobarbital and Drug-Induced Dyskinesia", "text": "This report examines the relationship between Phenobarbital, ..."}
{"title": "Ammonia and Valproic Acid in Disorders of Excessive Somnolence", "text": "This report examines the relationship between ammonia and valproic acid, ..."}
...
```
(See `experiments/data/examples/reports.jsonl` for more details.)

✅ The output format is fully compatible with the **Chunking** module, which accepts any dictionary containing a `title` and `text` field.  
Thus, each community report can also be treated as a generic ***Passage***.

### How to Use

```python
from kapipe.report_generation import (
    LLMBasedReportGenerator,
    TemplateBasedReportGenerator
)

PATH_TO_REPORTS = "./experiments/data/examples/reports.jsonl"

# Initialize the report generator
generator = LLMBasedReportGenerator()
# generator = TemplateBasedReportGenerator()
 
# Generate community reports
generator.generate_community_reports(
    graph=graph,
    communities=communities,
    path_output=PATH_TO_REPORTS
)
```
(See `experiments/codes/run_report_generation.py` for specific examples.)

### Supported Methods

- **LLM-based Generation**
    - Uses a large language model (e.g., GPT-4o-mini) prompted with a community content to generate fluent summaries.
- **Template-based Generation**
    - Uses a deterministic format that verbalizes each entity/triple and linearizes them:
        - Entity format: `"{name} | {type} | {definition}"`
        - Triple format: `"{subject} | {relation} | {object}"`

## ✂️ Chunking

### Overview

The **Chunking** module splits each input text into multiple **non-overlapping text chunks**, each constrained by a maximum token length (e.g., 100 tokens).  
This module is essential for preparing context units that are compatible with downstream modules such as retrieval and question answering.  

<!-- It supports any input that conforms to the following format:

- A dictionary containing a `"title"` and `"text"` field.  

This makes the module applicable not only to **community reports**, but also to other types of *Passage* data with similar structure. -->

### Input

This module takes as input:

1. ***Passage***, or a dictionary containing `title` and `text` field.

    - `title` (str): Title of the passage
    - `text` (str): Full natural language description of the passage

```json
{
    "title": "Lithium Carbonate and Related Health Conditions",
    "text": "This report examines the interconnections between Lithium Carbonate, ..."
}
```

### Output

The output is a list of ***Passage*** objects, each containing:
- `title` (str): Same as input
- `text` (str): Chunked portion of the original text, within the specified token window
- Other metadata (e.g., community_id) is carried over

```json
[
    {
        "title": "Lithium Carbonate and Related Health Conditions",
        "text": "This report examines the interconnections between Lithium Carbonate, ..."
    },
    {
        "title": "Lithium Carbonate and Related Health Conditions",
        "text": "This duality necessitates careful monitoring of patients receiving Lithium treatment, ..."
    },
    {
        "title": "Lithium Carbonate and Related Health Conditions",
        "text": "Similarly, cardiac arrhythmias, which involve irregular heartbeats, can pose ..."
    }
    ...
]
```
(See `experiments/data/examples/reports.chunked_w100.jsonl` for more details.)

### How to Use

```python
from kapipe.chunking import Chunker

MODEL_NAME = "en_core_web_sm"  # SpaCy tokenizer
WINDOW_SIZE = 100  # Max number of tokens per chunk

# Initialize the chunker
chunker = Chunker(model_name=MODEL_NAME)

# Chunk the passage
chunked_passages = chunker.split_passage_to_chunked_passages(
    passage=passage,
    window_size=WINDOW_SIZE
)
```
(See `experiments/codes/run_chunking.py` for specific examples.)

## 🔍 Passage Retrieval

### Overview

The **Passage Retrieval** module searches for the top-k most **relevant chunks** given a user query.  
It uses lexical or dense retrievers (e.g., BM25, Contriever) to compute semantic similarity between queries and chunks using embedding-based methods.

### Input

**(1) Indexing**:

During the indexing phase, this module takes as input:

1. List of ***Passage*** objects

**(2) Search**:

During the search phase, this module takes as input:

1. ***Question***, or a dictionary containing:
    - `question_key` (str): Unique identifier for the question
    - `question` (str): Natural language question

```json
{
    "question_key": "question#123",
    "question": "What does lithium carbonate induce?"
}
```
(See `experiments/data/examples/questions.json` for more details.)

### Output

**(1) Indexing**:

The indexing result is automatically saved to the path specified by `index_root` and `index_name`.

**(2) Search**:

The search result for each question is represented as a dictionary containing:
- `question_key` (str): Refers back to the original query
- `contexts` (list[***Passage***]): Top-k retrieved chunks sorted by relevance, each containing:
    - `title` (str): Chunk title
    - `text` (str): Chunk text
    - `score` (float): Similarity score computed by the retriever
    - `rank` (int): Rank of the chunk (1-based)

```json
{
    "question_key": "question#123",
    "contexts": [
        {
            "title": "Lithium Carbonate and Related Health Conditions",
            "text": "This report examines the interconnections between Lithium Carbonate, ...",
            (meta data, if exists)
            "score": 1.5991605520248413,
            "rank": 1
        },
        {
            "title": "Lithium Carbonate and Related Health Conditions",
            "text": "\n\nIn summary, while Lithium Carbonate is an effective treatment for mood disorders, ...",
            (meta data, if exists)
            "score": 1.51018488407135,
            "rank": 2
        },
        ...
    ]
}
```
(See `experiments/data/examples/questions.contexts.json` for more details.)

### How to Use

**(1) Indexing**:

```python
from kapipe.passage_retrieval import Contriever

INDEX_ROOT = "./"
INDEX_NAME = "example"

# Initialize retriever
retriever = Contriever(
    max_passage_length=512,
    pooling_method="average",
    normalize=False,
    gpu_id=0,
    metric="inner-product"
)

# Build index
retriever.make_index(
    passages=passages,
    index_root=INDEX_ROOT,
    index_name=INDEX_NAME
)
```
(See `experiments/codes/run_passage_retrieval.py` for specific examples.)

**(2) Search**:

```python
# Load the index
retriever.load_index(index_root=INDEX_ROOT, index_name=INDEX_NAME)

# Search for top-k contexts
retrieved_passages = retriever.search(queries=[question], top_k=10)[0]
contexts_for_question = {
    "question_key": question["question_key"],
    "contexts": retrieved_passages
}
```

### Supported Methods

- **BM25**
    - A sparse lexical matching model based on term frequency and inverse document frequency.
- **Contriever** (Izacard et al., 2022)
    - A dual-encoder retriever trained with contrastive learning (Izacard et al., 2022). Computes similarity between query and chunk embeddings.
- **ColBERTv2** (Santhanam et al., 2022)
    - A token-level late-interaction retriever for fine-grained semantic matching. Provides higher accuracy with increased inference cost.
    - Note: This method is currently unavailable due to an import error in the external `ragatouille` package ([here](https://github.com/AnswerDotAI/RAGatouille/issues/272)).

## 💬 Question Answering

### Overview

The **Question Answering** module generates an answer for each user query, optionally conditioned on the retrieved context chunks.  
It uses a large language model such as GPT-4o to produce factually grounded and context-aware answers in natural language.

### Input

This module takes as input:

1. ***Question***, or a dictionary containing:
    - `question_key`: Unique identifier for the question
    - `question`: Natural language question string

(See `experiments/data/examples/questions.json` for more details.)

2. A dictionary containing:
    - `question_key`: The same identifier with the ***Question***
    - `contexts`: List of ***Passage*** objects

(See `experiments/data/examples/questions.contexts.json` for more details.)

### Output

The answer is a dictionary containing:

- `question_key` (str): Same as input
- `question` (str): Original question text
- `output_answer` (str): Model-generated natural language answer
- `helpfulness_score` (float): Confidence score generated by the model

```json
{
    "question_key": "question#123",
    "question": "What does lithium carbonate induce?",
    "output_answer": "Lithium carbonate can induce depressive disorder, cyanosis, and cardiac arrhythmias.",
    "helpfulness_score": 1.0
}
```
(See `experiments/data/examples/answers.json` for more details.)

### How to Use

```python
from os.path import expanduser
from kapipe.qa import LLMQA
from kapipe import utils

# Initialize the QA module
answerer = LLMQA(path_snapshot=expanduser("~/.kapipe/download/results/qa/llmqa/openai_gpt4o"))

# Generate answer
answer = answerer.run(
    question=question,
    contexts_for_question=contexts_for_question
)

```
(See `experiments/codes/run_qa.py` for specific examples.)

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "kapipe",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "NLP, knowledge acquisition, information extraction, knowledge graph, retrieval, question answering",
    "author": null,
    "author_email": "Noriki Nishida <norikinishida@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/2b/7c/31bc4ffeaa9a6a89d105994280668fa9a537bf4b4db5669eab4903169716/kapipe-0.0.7.tar.gz",
    "platform": null,
    "description": "# KAPipe: A Modular Pipeline for Knowledge Acquisition\n\n## Table of Contents\n\n- [\ud83e\udd16 What is KAPipe?](#what-is-kapipe)\n- [\ud83d\udce6 Installation](#-installation)\n- [\ud83e\udde9 Triple Extraction](#-triple-extraction)\n- [\ud83d\udd78\ufe0f Knowledge Graph Construction](#-knowledge-graph-construction)\n- [\ud83e\uddf1 Community Clustering](#-community-clustering)\n- [\ud83d\udcdd Report Generation](#-report-generation)\n- [\u2702\ufe0f Chunking](#-chunking)\n- [\ud83d\udd0d Passage Retrieval](#-passage-retrieval)\n- [\ud83d\udcac Question Answering](#-question-answering)\n\n## \ud83e\udd16 What is KAPipe?\n\n**KAPipe** is a modular pipeline for comprehensive **knowledge acquisition** from unstructured documents.  \nIt supports **extraction**, **organization**, **retrieval**, and **utilization** of knowledge, serving as a core framework for building intelligent systems that reason over structured knowledge.  \n\nCurrently, KAPipe provides the following functionalities:\n\n- \ud83e\udde9**Triple Extraction**  \n    - Extract facts in the form of (head entity, relation, tail entity) triples from raw text.\n\n- \ud83d\udd78\ufe0f**Knowledge Graph Construction**  \n    - Build a symbolic knowledge graph from triples, optionally augmented with external ontologies or knowledge bases (e.g., Wikidata, UMLS).\n\n- \ud83e\uddf1**Community Clustering**  \n    - Cluster the knowledge graph into semantically coherent subgraphs (*communities*).\n\n- \ud83d\udcdd**Report Generation**  \n    - Generate textual reports (or summaries) of graph communities.\n\n- \u2702\ufe0f**Chunking**  \n    - Split text (e.g., community report) into fixed-size chunks based on a predefined token length (e.g., n=300).\n\n- \ud83d\udd0d**Passage Retrieval**  \n    - Retrieve relevant chunks for given queries using lexical or dense retrieval.\n\n- \ud83d\udcac**Question Answering**  \n    - Answer questions using retrieved chunks as context.\n\nThese components together form an implementation of **graph-based retrieval-augmented generation (GraphRAG)**, enabling question answering and reasoning grounded in structured knowledge.\n\n## \ud83d\udce6 Installation\n\n### Step 1: Set up a Python environment\n```bash\npython -m venv .env\nsource .env/bin/activate\npip install -U pip setuptools wheel\n```\n\n### Step 2: Install KAPipe\n```bash\npip install kapipe\n```\n\n### Step 3: Download pretrained models and configurations\n\nPretrained models and configuration files can be downloaded from the following Google Drive folder:\n\n\ud83d\udcc1 [KAPipe Release Files](https://drive.google.com/drive/folders/16ypMCoLYf5kDxglDD_NYoCNAfhTy4Qwp)\n\nDownload the latest release file named release.YYYYMMDD.tar.gz, then extract it to the ~/.kapipe directory:\n\n```bash\nmkdir -p ~/.kapipe\nmv release.YYYYMMDD.tar.gz ~/.kapipe\ncd ~/.kapipe\ntar -zxvf release.YYYYMMDD.tar.gz\n```\n\nIf the extraction is successful, you should see a directory `~/.kapipe/download/`, which contains model resources.\n\n## \ud83e\udde9 Triple Extraction\n\n### Overview\n\nThe **Triple Extraction** module identifies relational facts from raw text in the form of (head entity, relation, tail entity) **triples**.\n\nSpecifically, this is achieved through the following cascade of subtasks:\n\n1. **Named Entity Recognition (NER):**\n    - Detect entity mentions (spans) and classify their types.\n1. **Entity Disambiguation Retrieval (ED-Retrieval)**:\n    - Retrieve candidate concept IDs from a knowledge base for each mention.\n1. **Entity Disambiguation Reranking (ED-Reranking)**:\n    - Select the most probable concept ID from the retrieved candidates.\n1. **Document-level Relation Extraction (DocRE)**:\n    - Extract relational triples based on the disambiguated entity set.\n\n### Input\n\nThis module takes as input:\n\n1. ***Document***, or a dictionary containing\n    - `doc_key` (str): Unique identifier for the document\n    - `sentences` (list[str]): List of sentence strings (tokenized)\n\n```json\n{\n    \"doc_key\": \"6794356\",\n    \"sentences\": [\n        \"Tricuspid valve regurgitation and lithium carbonate toxicity in a newborn infant .\",\n        \"A newborn with massive tricuspid regurgitation , atrial flutter , congestive heart failure , and a high serum lithium level is described .\",\n        ...\n    ]\n}\n```\n(See `experiments/data/examples/documents_without_triples.json` for more details.)\n\nEach subtask takes a ***Document*** object as input, augments it with new fields, and returns it.  \nThis allows custom metadata to persist throughout the pipeline.\n\n### Output\n\nThe output is also the same-format dictionary (***Document***), augmented with extracted entities and triples information:\n\n- `doc_key` (str): Same as input\n- `sentences` (list[str]): Same as input\n- `mentions` (list[dict]): Mentions, or a list of dictionaries, each containing:\n    - `span` (tuple[int,int]): Mention span\n    - `name` (str): Mention string\n    - `entity_type` (str): Entity type\n    - `entity_id` (str): Concept ID\n- `entities` (list[dict]): Entities, or a list of dictionaries, each containing\n    - `mention_indices` (list[int]): Indices of mentions belonging to this entity\n    - `entity_type` (str): Entity type \n    - `entity_id` (str): Concept ID\n- `relations` (list[dict]): Triples, or a list dictionaries, each containing\n    - `arg1` (int): Index of the head/subject entity\n    - `relation` (str): Semantic relation\n    - `arg2` (int): Index of the tail/object entity\n\n```json\n{\n    \"doc_key\": \"6794356\",\n    \"sentences\": [...],\n    \"mentions\": [\n        {\n            \"span\": [0, 2],\n            \"name\": \"Tricuspid valve regurgitation\",\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D014262\"\n        },\n        ...\n    ],\n    \"entities\": [\n        {\n            \"mention_indices\": [0, 3, 7],\n            \"entity_type\": \"Disease\",\n            \"entity_id\": \"D014262\"\n        },\n        ...\n    ],\n    \"relations\": [\n        {\n            \"arg1\": 1,\n            \"relation\": \"CID\",\n            \"arg2\": 7\n        },\n        ...\n    ]\n}\n```\n(See `experiments/data/examples/documents_with_triples.json` for more details.)\n\n### How to Use\n\n```python\nimport kapipe.triple_extraction\n\nIDENTIFIER = \"biaffinener_blink_blink_atlop_cdr\"\n\n# Load the Triple Extraction pipeline\npipe = kapipe.triple_extraction.load(\n    identifier=IDENTIFIER,\n    gpu_map={\"ner\": 0, \"ed_retrieval\": 0, \"ed_reranking\": 2, \"docre\": 3}\n)\n\n# We provide a utility for converting text (string) to Document format\n# `title` is optional\ndocument = pipe.text_to_document(doc_key=your_doc_key, text=your_text, title=your_title)\n\n# Apply the pipeline to your input document\ndocument = pipe(document)\n```\n(See `experiments/codes/run_triple_extraction.py` for specific examples.)\n\n<!-- The `identifier` determines the specific models used for each subtask.  \nFor example, `\"biaffinener_blink_blink_atlop_cdr\"` uses:\n\n- **NER**: Biaffine-NER (trained on BC5CDR for Chemical and Disease types)\n- **ED-Retrieval**: BLINK Bi-Encoder (trained on BC5CDR for MeSH 2015)\n- **ED-Reranking**: BLINK Cross-Encoder (trained on BC5CDR for MeSH 2015)\n- **DocRE**: ATLOP (trained on BC5CDR for Chemical-Induce-Disease (CID) relation) -->\n\n### Supported Methods\n\n#### Named Entity Recognition (NER)\n- **Biaffine-NER** ([`Yu et al., 2020`](https://aclanthology.org/2020.acl-main.577/)): Span-based BERT model using biaffine scoring\n- **LLM-NER**: A proprietary/open-source LLM using a NER-specific prompt template and few-shot examples\n\n#### Entity Disambiguation Retrieval (ED-Retrieval)\n- **BLINK Bi-Encoder** ([`Wu et al., 2020`](https://aclanthology.org/2020.emnlp-main.519/)): Dense retriever using BERT-based encoders and approximate nearest neighbor search\n- **BM25**: Lexical matching\n- **Levenshtein**: Edit distance matching\n\n#### Entity Disambiguation Reranking (ED-Reranking)\n- **BLINK Cross-Encoder** (Wu et al., 2020): Reranker using a BERT-based encoder for candidates from the Bi-Encoder\n- **LLM-ED**: A proprietary/open-source LLM using a ED-specific prompt template and few-shot examples\n\n#### Document-level Relation Extraction (DocRE)\n- **ATLOP** ([`Zhou et al., 2021`](https://ojs.aaai.org/index.php/AAAI/article/view/17717)): BERT-based model for DocRE\n- **MA-ATLOP** (Oumaima & Nishida et al., 2024): Mention-agnostic extension of ATLOP\n- **MA-QA** (Oumaima & Nishida, 2024): Question-answering style DocRE model\n- **LLM-DocRE**: A proprietary/open-source LLM using a DocRE-specific prompt template and few-shot examples\n\n### Available Pipeline Identifiers\n\nThe following pipeline configurations are currently available:\n\n| identifier | NER (Entity Types) | ED-Retrieval (Knowledge Base) | ED-Reranking (Knowledge Base) | DocRE (Relations) |\n| --- | --- | --- | --- | --- |\n| `biaffinener_blink_blink_atlop_cdr` | Biaffine-NER ({Chemical, Disease}) | BLINK Bi-Encoder (MeSH 2015) | BLINK Cross-Encoder (MeSH 2015) | ATLOP ({Chemical-Induce-Disease}) |\n| `biaffinener_blink_blink_atlop_linked_docred` | Biaffine-NER ({Person, Organization, Location, Time, Number, Misc}) | BLINK Bi-Encoder (DBPedia 2020.02.01) | BLINK Cross-Encoder (DBPedia 2020.02.01) | ATLOP (DBPedia 96 relations) |\n| `llmner_blink_llmed_llmdocre_cdr` | LLM-NER `gpt-4o-mini` ({Chemical, Disease}) | BLINK Bi-Encoder (MeSH 2015) | LLM-ED `gpt-4o-mini` (MeSH 2015) | LLM-DocRE `gpt-4o-mini` ({Chemical-Induce-Disease}) |\n| `llmner_blink_llmed_llmdocre_linked_docred` | LLM-NER `gpt-4o-mini` ({Person, Organization, Location, Time, Number, Misc}) | BLINK Bi-Encoder (DBPedia 2020.02.01) | LLM-ED `gpt-4o-mini` (DBPedia 2020.02.01) | LLM-DocRE `gpt-4o-mini` (DBPedia 96 relations) |\n\n## \ud83d\udd78\ufe0f Knowledge Graph Construction\n\n### Overview\n\nThe **Knowledge Graph Construction** module builds a **directed multi-relational graph** from a set of extracted triples.\n\n- **Nodes** represent unique entities (i.e., concepts).\n- **Edges** represent semantic relations between entities.\n\n### Input\n\nThis module takes as input:\n\n1. List of ***Document*** objects with triples, produced by the **Triple Extraction** module\n\n2. (optional) ***Additional Triples*** (existing KBs), or a list of dictionaries, each containing:\n    - `head` (str): Entity ID of the subject\n    - `relation` (str): Relation type (e.g., treats, causes)\n    - `tail` (str): Entity ID of the object\n```json\n[\n    {\n        \"head\": \"D000001\",\n        \"relation\": \"treats\",\n        \"tail\": \"D014262\"\n    },\n    ...\n]\n```\n(See `experiments/data/examples/additional_triples.json` for more details.)\n\n\n3. ***Entity Dictionary***, or a list of dictionaries, each containing:\n    - `entity_id` (str): Unique concept ID\n    - `canonical_name` (str): Official name of the concept\n    - `entity_type` (str): Type/category of the concept\n    - `synonyms` (list[str]): A list of alternative names\n    - `description` (str): Textual definition of the concept\n```JSON\n[\n    {\n        \"entity_id\": \"C009166\",\n        \"entity_index\": 252696,\n        \"entity_type\": null,\n        \"canonical_name\": \"retinol acetate\",\n        \"synonyms\": [\n            \"retinyl acetate\",\n            \"vitamin A acetate\"\n        ],\n        \"description\": \"\",\n        \"tree_numbers\": []\n    },\n    {\n        \"entity_id\": \"D000641\",\n        \"entity_index\": 610,\n        \"entity_type\": \"Chemical\",\n        \"canonical_name\": \"Ammonia\",\n        \"synonyms\": [],\n        \"description\": \"A colorless alkaline gas. It is formed in the body during decomposition of organic materials during a large number of metabolically important reactions. Note that the aqueous form of ammonia is referred to as AMMONIUM HYDROXIDE.\",\n        \"tree_numbers\": [\n            \"D01.362.075\",\n            \"D01.625.050\"\n        ]\n    },\n    ...\n]\n```\n(See `experiments/data/examples/entity_dict.json` for more details.)\n\n### Output\n\nThe output is a `networkx.MultiDiGraph` object representing the knowledge graph.\n\nEach node has the following attributes:\n\n- `entity_id` (str): Concept ID (e.g., MeSH ID)\n- `entity_type` (str): Type of entity (e.g., Disease, Chemical, Person, Location)\n- `name` (str): Canonical name (from Entity Dictionary)\n- `description` (str): Textual definition (from Entity Dictionary)\n- `doc_key_list` (list[str]): List of document IDs where this entity appears\n\nEach edge has the following attributes:\n\n- `head` (str): Head entity ID\n- `tail` (str): Tail entity ID\n- `relation` (str): Type of semantic relation\n- `doc_key_list` (list[str]): List of document IDs supporting this relation\n\n(See `experiments/data/examples/graph.graphml` for more details.)\n\n### How to Use\n\n```python\nfrom kapipe.graph_construction import GraphConstructor\n\nPATH_TO_DOCUMENTS = \"./experiments/data/examples/documents.json\"\nPATH_TO_TRIPLES = \"./experiments/data/examples/additional_triples.json\" # Or set to None if unused\nPATH_TO_ENTITY_DICT = \"./experiments/data/examples/entity_dict.json\"\n\n# Initialize the knowledge graph constructor\nconstructor = GraphConstructor()\n\n# Construct the knowledge graph\ngraph = constructor.construct_knowledge_graph(\n    path_documents_list=[PATH_TO_DOCUMENTS],\n    path_additional_triples=PATH_TO_TRIPLES, # Optional\n    path_entity_dict=PATH_TO_ENTITY_DICT\n)\n```\n(See `experiments/codes/run_graph_construction.py` for specific examples.)\n\n## \ud83e\uddf1 Community Clustering\n\n### Overview\n\nThe **Community Clustering** module partitions the knowledge graph into **semantically coherent subgraphs**, referred to as *communities*.  \nEach community represents a localized set of closely related concepts and relations, and serves as a fundamental unit of structured knowledge.\n\n### Input\n\nThis module takes as input:\n\n1. Knowledge graph (`networkx.MultiDiGraph`) produced by the **Knowledge Graph Construction** module.\n\n### Output\n\nThe output is a list of hierarchical community records (dictionaries), each containing:\n\n- `community_id` (str): Unique ID for the community\n- `nodes` (list[str]): List of entity IDs belonging to the community (null for ROOT)\n- `level` (int): Depth in the hierarchy (ROOT=-1)\n- `parent_community_id` (str): ID of the parent community (null for ROOT)\n- `child_community_ids` (list[str]): List of child community IDs (empty for leaf communities)\n\n```json\n[\n    {\n        \"community_id\": \"ROOT\",\n        \"nodes\": null,\n        \"level\": -1,\n        \"parent_community_id\": null,\n        \"child_community_ids\": [\n            \"0\",\n            \"1\",\n            \"2\",\n            \"3\",\n            \"4\",\n            \"5\",\n            \"6\",\n            \"7\",\n            \"8\",\n            \"9\"\n        ]\n    },\n    {\n        \"community_id\": \"0\",\n        \"nodes\": [\n            \"D016651\",\n            \"D014262\",\n            \"D003866\",\n            \"D003490\",\n            \"D001145\"\n        ],\n        \"level\": 0,\n        \"parent_community_id\": \"ROOT\",\n        \"child_community_ids\": [...]\n    },\n    ...\n]\n```\n(See `experiments/data/examples/communities.json` for more details.)\n\nThis hierarchical structure enables multi-level organization of knowledge, particularly useful for coarse-to-fine report generation and scalable retrieval.\n\n### How to Use\n\n```python\nfrom kapipe.community_clustering import (\n    HierarchicalLeiden,\n    NeighborhoodAggregation,\n    TripleLevelFactorization\n)\n\n# Initialize the community clusterer\nclusterer = HierarchicalLeiden()\n# clusterer = NeighborhoodAggregation()\n# clusterer = TripleLevelFactorization()\n\n# Apply the community clusterer to the graph\ncommunities = clusterer.cluster_communities(graph)\n```\n(See `experiments/codes/run_community_clustering.py` for specific examples.)\n\n### Supported Methods\n\n- **Hierarchical Leiden**\n    - Recursively applies the Leiden algorithm (Traag et al., 2019) to optimize modularity. Large communities are subdivided until they satisfy a predefined size constraint (default: 10 nodes).\n- **Neighborhood Aggregation**\n    - Groups each node with its immediate neighbors to form local communities.\n- **Triple-level Factorization**\n    - Treats each individual (subject, relation, object) triple as an atomic community.\n\n## \ud83d\udcdd Report Generation\n\n### Overview\n\nThe **Report Generation** module converts each community into a **natural language report**, making structured knowledge interpretable for both humans and language models.  \n\n### Input\n\nThis module takes as input:\n\n1. Knowledge graph (`networkx.MultiDiGraph`) generated by the **Knowledge Graph Construction** module.\n1. List of community records generated by the **Community Clustering** module.\n\n### Output\n\nThe output is a `.jsonl` file, where each line corresponds to one ***Passage***, a dictionary containing:\n\n- `title` (str): Concice topic summary of the community\n- `text` (str): Full natural language description of the community's content\n\n```json\n{\"title\": \"Lithium Carbonate and Related Health Conditions\", \"text\": \"This report examines the interconnections between Lithium Carbonate, ....\"}\n{\"title\": \"Phenobarbital and Drug-Induced Dyskinesia\", \"text\": \"This report examines the relationship between Phenobarbital, ...\"}\n{\"title\": \"Ammonia and Valproic Acid in Disorders of Excessive Somnolence\", \"text\": \"This report examines the relationship between ammonia and valproic acid, ...\"}\n...\n```\n(See `experiments/data/examples/reports.jsonl` for more details.)\n\n\u2705 The output format is fully compatible with the **Chunking** module, which accepts any dictionary containing a `title` and `text` field.  \nThus, each community report can also be treated as a generic ***Passage***.\n\n### How to Use\n\n```python\nfrom kapipe.report_generation import (\n    LLMBasedReportGenerator,\n    TemplateBasedReportGenerator\n)\n\nPATH_TO_REPORTS = \"./experiments/data/examples/reports.jsonl\"\n\n# Initialize the report generator\ngenerator = LLMBasedReportGenerator()\n# generator = TemplateBasedReportGenerator()\n \n# Generate community reports\ngenerator.generate_community_reports(\n    graph=graph,\n    communities=communities,\n    path_output=PATH_TO_REPORTS\n)\n```\n(See `experiments/codes/run_report_generation.py` for specific examples.)\n\n### Supported Methods\n\n- **LLM-based Generation**\n    - Uses a large language model (e.g., GPT-4o-mini) prompted with a community content to generate fluent summaries.\n- **Template-based Generation**\n    - Uses a deterministic format that verbalizes each entity/triple and linearizes them:\n        - Entity format: `\"{name} | {type} | {definition}\"`\n        - Triple format: `\"{subject} | {relation} | {object}\"`\n\n## \u2702\ufe0f Chunking\n\n### Overview\n\nThe **Chunking** module splits each input text into multiple **non-overlapping text chunks**, each constrained by a maximum token length (e.g., 100 tokens).  \nThis module is essential for preparing context units that are compatible with downstream modules such as retrieval and question answering.  \n\n<!-- It supports any input that conforms to the following format:\n\n- A dictionary containing a `\"title\"` and `\"text\"` field.  \n\nThis makes the module applicable not only to **community reports**, but also to other types of *Passage* data with similar structure. -->\n\n### Input\n\nThis module takes as input:\n\n1. ***Passage***, or a dictionary containing `title` and `text` field.\n\n    - `title` (str): Title of the passage\n    - `text` (str): Full natural language description of the passage\n\n```json\n{\n    \"title\": \"Lithium Carbonate and Related Health Conditions\",\n    \"text\": \"This report examines the interconnections between Lithium Carbonate, ...\"\n}\n```\n\n### Output\n\nThe output is a list of ***Passage*** objects, each containing:\n- `title` (str): Same as input\n- `text` (str): Chunked portion of the original text, within the specified token window\n- Other metadata (e.g., community_id) is carried over\n\n```json\n[\n    {\n        \"title\": \"Lithium Carbonate and Related Health Conditions\",\n        \"text\": \"This report examines the interconnections between Lithium Carbonate, ...\"\n    },\n    {\n        \"title\": \"Lithium Carbonate and Related Health Conditions\",\n        \"text\": \"This duality necessitates careful monitoring of patients receiving Lithium treatment, ...\"\n    },\n    {\n        \"title\": \"Lithium Carbonate and Related Health Conditions\",\n        \"text\": \"Similarly, cardiac arrhythmias, which involve irregular heartbeats, can pose ...\"\n    }\n    ...\n]\n```\n(See `experiments/data/examples/reports.chunked_w100.jsonl` for more details.)\n\n### How to Use\n\n```python\nfrom kapipe.chunking import Chunker\n\nMODEL_NAME = \"en_core_web_sm\"  # SpaCy tokenizer\nWINDOW_SIZE = 100  # Max number of tokens per chunk\n\n# Initialize the chunker\nchunker = Chunker(model_name=MODEL_NAME)\n\n# Chunk the passage\nchunked_passages = chunker.split_passage_to_chunked_passages(\n    passage=passage,\n    window_size=WINDOW_SIZE\n)\n```\n(See `experiments/codes/run_chunking.py` for specific examples.)\n\n## \ud83d\udd0d Passage Retrieval\n\n### Overview\n\nThe **Passage Retrieval** module searches for the top-k most **relevant chunks** given a user query.  \nIt uses lexical or dense retrievers (e.g., BM25, Contriever) to compute semantic similarity between queries and chunks using embedding-based methods.\n\n### Input\n\n**(1) Indexing**:\n\nDuring the indexing phase, this module takes as input:\n\n1. List of ***Passage*** objects\n\n**(2) Search**:\n\nDuring the search phase, this module takes as input:\n\n1. ***Question***, or a dictionary containing:\n    - `question_key` (str): Unique identifier for the question\n    - `question` (str): Natural language question\n\n```json\n{\n    \"question_key\": \"question#123\",\n    \"question\": \"What does lithium carbonate induce?\"\n}\n```\n(See `experiments/data/examples/questions.json` for more details.)\n\n### Output\n\n**(1) Indexing**:\n\nThe indexing result is automatically saved to the path specified by `index_root` and `index_name`.\n\n**(2) Search**:\n\nThe search result for each question is represented as a dictionary containing:\n- `question_key` (str): Refers back to the original query\n- `contexts` (list[***Passage***]): Top-k retrieved chunks sorted by relevance, each containing:\n    - `title` (str): Chunk title\n    - `text` (str): Chunk text\n    - `score` (float): Similarity score computed by the retriever\n    - `rank` (int): Rank of the chunk (1-based)\n\n```json\n{\n    \"question_key\": \"question#123\",\n    \"contexts\": [\n        {\n            \"title\": \"Lithium Carbonate and Related Health Conditions\",\n            \"text\": \"This report examines the interconnections between Lithium Carbonate, ...\",\n            (meta data, if exists)\n            \"score\": 1.5991605520248413,\n            \"rank\": 1\n        },\n        {\n            \"title\": \"Lithium Carbonate and Related Health Conditions\",\n            \"text\": \"\\n\\nIn summary, while Lithium Carbonate is an effective treatment for mood disorders, ...\",\n            (meta data, if exists)\n            \"score\": 1.51018488407135,\n            \"rank\": 2\n        },\n        ...\n    ]\n}\n```\n(See `experiments/data/examples/questions.contexts.json` for more details.)\n\n### How to Use\n\n**(1) Indexing**:\n\n```python\nfrom kapipe.passage_retrieval import Contriever\n\nINDEX_ROOT = \"./\"\nINDEX_NAME = \"example\"\n\n# Initialize retriever\nretriever = Contriever(\n    max_passage_length=512,\n    pooling_method=\"average\",\n    normalize=False,\n    gpu_id=0,\n    metric=\"inner-product\"\n)\n\n# Build index\nretriever.make_index(\n    passages=passages,\n    index_root=INDEX_ROOT,\n    index_name=INDEX_NAME\n)\n```\n(See `experiments/codes/run_passage_retrieval.py` for specific examples.)\n\n**(2) Search**:\n\n```python\n# Load the index\nretriever.load_index(index_root=INDEX_ROOT, index_name=INDEX_NAME)\n\n# Search for top-k contexts\nretrieved_passages = retriever.search(queries=[question], top_k=10)[0]\ncontexts_for_question = {\n    \"question_key\": question[\"question_key\"],\n    \"contexts\": retrieved_passages\n}\n```\n\n### Supported Methods\n\n- **BM25**\n    - A sparse lexical matching model based on term frequency and inverse document frequency.\n- **Contriever** (Izacard et al., 2022)\n    - A dual-encoder retriever trained with contrastive learning (Izacard et al., 2022). Computes similarity between query and chunk embeddings.\n- **ColBERTv2** (Santhanam et al., 2022)\n    - A token-level late-interaction retriever for fine-grained semantic matching. Provides higher accuracy with increased inference cost.\n    - Note: This method is currently unavailable due to an import error in the external `ragatouille` package ([here](https://github.com/AnswerDotAI/RAGatouille/issues/272)).\n\n## \ud83d\udcac Question Answering\n\n### Overview\n\nThe **Question Answering** module generates an answer for each user query, optionally conditioned on the retrieved context chunks.  \nIt uses a large language model such as GPT-4o to produce factually grounded and context-aware answers in natural language.\n\n### Input\n\nThis module takes as input:\n\n1. ***Question***, or a dictionary containing:\n    - `question_key`: Unique identifier for the question\n    - `question`: Natural language question string\n\n(See `experiments/data/examples/questions.json` for more details.)\n\n2. A dictionary containing:\n    - `question_key`: The same identifier with the ***Question***\n    - `contexts`: List of ***Passage*** objects\n\n(See `experiments/data/examples/questions.contexts.json` for more details.)\n\n### Output\n\nThe answer is a dictionary containing:\n\n- `question_key` (str): Same as input\n- `question` (str): Original question text\n- `output_answer` (str): Model-generated natural language answer\n- `helpfulness_score` (float): Confidence score generated by the model\n\n```json\n{\n    \"question_key\": \"question#123\",\n    \"question\": \"What does lithium carbonate induce?\",\n    \"output_answer\": \"Lithium carbonate can induce depressive disorder, cyanosis, and cardiac arrhythmias.\",\n    \"helpfulness_score\": 1.0\n}\n```\n(See `experiments/data/examples/answers.json` for more details.)\n\n### How to Use\n\n```python\nfrom os.path import expanduser\nfrom kapipe.qa import LLMQA\nfrom kapipe import utils\n\n# Initialize the QA module\nanswerer = LLMQA(path_snapshot=expanduser(\"~/.kapipe/download/results/qa/llmqa/openai_gpt4o\"))\n\n# Generate answer\nanswer = answerer.run(\n    question=question,\n    contexts_for_question=contexts_for_question\n)\n\n```\n(See `experiments/codes/run_qa.py` for specific examples.)\n",
    "bugtrack_url": null,
    "license": "LICENSE",
    "summary": "A modular pipeline for knowledge acquisition",
    "version": "0.0.7",
    "project_urls": {
        "Homepage": "https://github.com/norikinishida/kapipe"
    },
    "split_keywords": [
        "nlp",
        " knowledge acquisition",
        " information extraction",
        " knowledge graph",
        " retrieval",
        " question answering"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "bd59a89d3f4e0b9daa0389e33ad5a1836834130afb5a39f5545b7340a88d587e",
                "md5": "dfc73f06c8cbc2a52a02c36dc1779c46",
                "sha256": "b1117d553dd1efaf7584c734849695198f51b89d2e3d3e659ebb5eb5db4c1df0"
            },
            "downloads": -1,
            "filename": "kapipe-0.0.7-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "dfc73f06c8cbc2a52a02c36dc1779c46",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 183382,
            "upload_time": "2025-07-10T10:49:02",
            "upload_time_iso_8601": "2025-07-10T10:49:02.307804Z",
            "url": "https://files.pythonhosted.org/packages/bd/59/a89d3f4e0b9daa0389e33ad5a1836834130afb5a39f5545b7340a88d587e/kapipe-0.0.7-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "2b7c31bc4ffeaa9a6a89d105994280668fa9a537bf4b4db5669eab4903169716",
                "md5": "2c3da45a327a7589a35b756ba4b7a15c",
                "sha256": "cecd658449e089376aeee8956910bf6d6c90872416f82eb67576edae4eab1572"
            },
            "downloads": -1,
            "filename": "kapipe-0.0.7.tar.gz",
            "has_sig": false,
            "md5_digest": "2c3da45a327a7589a35b756ba4b7a15c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 137342,
            "upload_time": "2025-07-10T10:49:04",
            "upload_time_iso_8601": "2025-07-10T10:49:04.625820Z",
            "url": "https://files.pythonhosted.org/packages/2b/7c/31bc4ffeaa9a6a89d105994280668fa9a537bf4b4db5669eab4903169716/kapipe-0.0.7.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-10 10:49:04",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "norikinishida",
    "github_project": "kapipe",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.22.2"
                ]
            ]
        },
        {
            "name": "scipy",
            "specs": [
                [
                    ">=",
                    "1.10.1"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "1.5.3"
                ]
            ]
        },
        {
            "name": "spacy",
            "specs": [
                [
                    ">=",
                    "3.7.1"
                ]
            ]
        },
        {
            "name": "spacy-alignments",
            "specs": [
                [
                    ">=",
                    "0.9.1"
                ]
            ]
        },
        {
            "name": "scispacy",
            "specs": []
        },
        {
            "name": "torch",
            "specs": [
                [
                    ">=",
                    "2.6.0"
                ]
            ]
        },
        {
            "name": "torch-tensorrt",
            "specs": []
        },
        {
            "name": "torchdata",
            "specs": []
        },
        {
            "name": "torchtext",
            "specs": []
        },
        {
            "name": "torchvision",
            "specs": []
        },
        {
            "name": "opt-einsum",
            "specs": [
                [
                    ">=",
                    "3.3.0"
                ]
            ]
        },
        {
            "name": "transformers",
            "specs": [
                [
                    ">=",
                    "4.46.0"
                ]
            ]
        },
        {
            "name": "accelerate",
            "specs": [
                [
                    ">=",
                    "1.0.1"
                ]
            ]
        },
        {
            "name": "bitsandbytes",
            "specs": [
                [
                    ">=",
                    "0.44.1"
                ]
            ]
        },
        {
            "name": "openai",
            "specs": [
                [
                    ">=",
                    "1.53.0"
                ]
            ]
        },
        {
            "name": "tenacity",
            "specs": [
                [
                    ">=",
                    "9.1.2"
                ]
            ]
        },
        {
            "name": "faiss-gpu",
            "specs": [
                [
                    ">=",
                    "1.7.2"
                ]
            ]
        },
        {
            "name": "Levenshtein",
            "specs": [
                [
                    ">=",
                    "0.25.0"
                ]
            ]
        },
        {
            "name": "networkx",
            "specs": [
                [
                    ">=",
                    "3.4.2"
                ]
            ]
        },
        {
            "name": "neo4j",
            "specs": [
                [
                    ">=",
                    "5.25.0"
                ]
            ]
        },
        {
            "name": "graspologic",
            "specs": [
                [
                    ">=",
                    "3.4.1"
                ]
            ]
        },
        {
            "name": "future",
            "specs": [
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "pyhocon",
            "specs": [
                [
                    ">=",
                    "0.3.60"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    ">=",
                    "4.66.1"
                ]
            ]
        },
        {
            "name": "jsonlines",
            "specs": [
                [
                    ">=",
                    "4.0.0"
                ]
            ]
        }
    ],
    "lcname": "kapipe"
}

None