hierarchical-rag-retrieval

Name: hierarchical-rag-retrieval
Version: 0.1.2
Home page: https://github.com/arthur422tp/hierarchical
Summary: AI-Powered Legal Document Retrieval Engine based on Hierarchical Clustering & RAG
Upload time: 2025-08-13 04:08:32
Author: arthur422tp
Requires Python: >=3.8
License: MIT
Keywords: rag, retrieval, hierarchical, clustering, legal, nlp, ai, machine-learning
Requirements: fastapi, uvicorn, python-multipart, pydantic, sentence-transformers, faiss-cpu, langchain, langchain-openai, torch, numpy, scipy, fastcluster, scikit-learn, python-dotenv, langchain_community, pandas
# Legal Document Retrieval System Based on Hierarchical Clustering

<div align="center">

![Python](https://img.shields.io/badge/Python-3.8+-blue.svg)
![License](https://img.shields.io/badge/License-MIT-green.svg)
![FastAPI](https://img.shields.io/badge/FastAPI-0.104+-red.svg)
![Docker](https://img.shields.io/badge/Docker-Ready-blue.svg)

**An intelligent retrieval engine for legal texts, built on hierarchical clustering and RAG**

[📖 Quick Start](#-quick-start) • [📚 Usage Guide](#-detailed-usage-guide) • [🔧 API Reference](#-api-reference) • [📄 arXiv Paper](https://arxiv.org/abs/2506.13607)

</div>

## 📖 Introduction

This system is an intelligent retrieval engine that combines an AI legal assistant with statute lookup. Its core techniques are hierarchical clustering and cosine similarity, and it generates answers with OpenAI-based Retrieval-Augmented Generation (RAG). It is suited to smart law firms, legal chatbots, academic research, and multilingual statute indexing, and delivers accurate, explainable answers to legal queries.

### 🎯 Key Features

- **🌳 Hierarchical retrieval tree**: automatically builds a semantic multi-level index
- **🔍 Dual retrieval modes**: supports both direct retrieval and query extraction
- **🧠 RAG integration**: uses OpenAI GPT for answer generation
- **⚡ Efficient retrieval**: no manual k value; relevant texts are selected automatically
- **🎨 Modular design**: easy to integrate into existing projects
- **🌐 Full-stack solution**: built-in frontend UI + REST API
- **🐳 Docker support**: containerized deployment, one-command startup

## 🛠️ Technical Architecture

| Component | Tech Used |
|----------|------------|
| Frontend | HTML / JavaScript / Tailwind CSS |
| Backend | FastAPI |
| Embedding Model | `intfloat/multilingual-e5-large` |
| Retrieval Tree | Hierarchical Clustering + Cosine Similarity |
| LLM API | OpenAI GPT (ChatGPT API) |
| Containerization | Docker & Docker Compose |

### Core Components

```
hierarchical-rag-retrieval/
├── retrieval/          # Core retrieval modules
│   ├── RAGTree_function.py      # Hierarchical retrieval tree
│   ├── multi_level_search.py    # Multi-level index retrieval
│   └── generated_function.py    # Query extraction
├── utils/              # Utility modules
│   ├── word_embedding.py        # Word embeddings
│   ├── word_chunking.py         # Text chunking
│   └── query_retrieval.py       # FAISS retrieval
├── data_processing/    # Data processing modules
│   └── data_dealer.py           # Data formatting
└── app/                # Demo application
    ├── main.py                  # FastAPI entry point
    └── static/index.html        # Frontend UI
```

## 🚀 Quick Start

### 📦 Installation

```bash
pip install hierarchical-rag-retrieval
```

### 🎯 Basic Usage Example

```python
from src.retrieval import create_ahc_tree, tree_search
from src.utils import WordEmbedding

# 1. Initialize the embedding model
embedding_model = WordEmbedding()
model = embedding_model.load_model()

# 2. Prepare your text data
texts = [
    "The General Principles of the Civil Code provide that a natural person's legal capacity begins at birth and ends at death",
    "The Land Act provides that transfers of land ownership must be registered",
    "The Urban Planning Act regulates land-use zoning within urban planning areas",
    # ... more texts
]

# 3. Encode the texts into vectors
vectors = model.encode(texts)

# 4. Build the hierarchical retrieval tree
tree_root = create_ahc_tree(vectors, texts)

# 5. Run a search
query = "What is the procedure for transferring land ownership?"
results = tree_search(
    tree_root, 
    query, 
    model, 
    chunk_size=100, 
    chunk_overlap=20
)

# 6. Inspect the results
for i, result in enumerate(results, 1):
    print(f"{i}. {result}")
```

### 🔧 Environment Setup (Demo App)

#### Prerequisites

- Python 3.8+ or a Docker environment
- An OpenAI API key

#### Option 1: Traditional Deployment

```bash
# Install dependencies
pip install -r requirements.txt

# Set the environment variable
echo "OPENAI_API_KEY=your_openai_api_key" > .env

# Start the app
cd app && python main.py
```

#### Option 2: Docker Deployment (Recommended)

```bash
# Set the environment variable
echo "OPENAI_API_KEY=your_openai_api_key" > .env

# Start the services
docker-compose up -d

# Follow the logs
docker-compose logs -f
```

Once the app is running, open http://localhost:8000 in your browser.

## 📚 Detailed Usage Guide

### 1. Hierarchical Retrieval Tree (RAGTree)

The hierarchical retrieval tree is the core of the system: a clustering algorithm organizes the text vectors automatically.

```python
from src.retrieval import create_ahc_tree, tree_search, save_tree, load_tree

# Build the retrieval tree
tree_root = create_ahc_tree(vectors, texts)

# Save the tree for later reuse
save_tree(tree_root, "my_retrieval_tree.pkl")

# Load a previously saved tree
tree_root = load_tree("my_retrieval_tree.pkl")

# Run a search
results = tree_search(
    root=tree_root,
    query="your query here",
    model=embedding_model.load_model(),
    chunk_size=100,
    chunk_overlap=20,
    max_chunks=10
)
```

**Parameters:**
- `chunk_size`: size of each text chunk; larger values preserve more context
- `chunk_overlap`: overlap between adjacent chunks, so important information is not cut off at chunk boundaries
- `max_chunks`: maximum number of chunks, bounding processing cost
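To make the interplay of `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window chunker. This is an illustrative sketch only, not the package's `RagChunking` implementation:

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size chunks where consecutive chunks
    share chunk_overlap characters (illustrative only)."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Because of the overlap, a passage cut at one chunk boundary still appears whole in the neighboring chunk, which is why `chunk_overlap` guards against truncating important information.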

### 2. Multi-Level Index Retrieval (not implemented yet; the following describes the planned behavior)

A multi-level index system optimized for large corpora.

```python
from src.retrieval import (
    build_multi_level_index_from_files, 
    multi_level_tree_search,
    multi_level_extraction_tree_search
)

# Build a multi-level index from files
index = build_multi_level_index_from_files(
    embeddings_path="embeddings.pkl",
    texts_path="texts.pkl"
)

# Direct retrieval
results = multi_level_tree_search(
    index=index,
    query="your query",
    model=model,
    chunk_size=100,
    chunk_overlap=20
)

# Retrieval with query extraction (suited to complex questions)
results = multi_level_extraction_tree_search(
    index=index,
    query="a description of a complex legal question...",
    model=model,
    chunk_size=100,
    chunk_overlap=20,
    llm=openai_llm  # an OpenAI language model
)
```

### 3. Query Extraction and Optimization

For complex or verbose queries, the system can automatically extract the core question.

```python
from src.retrieval import extraction_tree_search
from langchain_openai import ChatOpenAI

# Configure the OpenAI model
llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    api_key="your-openai-api-key"
)

# Search with query extraction
complex_query = """
I want to understand the legal rules on buying and selling land, especially within urban planning areas.
If I purchase a plot of land for commercial use, which statutes should I pay attention to?
Also, what is the procedure for transferring land ownership?
"""

results = extraction_tree_search(
    root=tree_root,
    query=complex_query,
    model=model,
    chunk_size=100,
    chunk_overlap=20,
    llm=llm
)
```

### 4. Custom Text Processing

```python
from src.utils import WordEmbedding, RagChunking
from src.data_processing import DataDealer
import pickle

# Process custom text data
dealer = DataDealer()

# Prepare the texts
custom_texts = [
    "The content of your first document...",
    "The content of your second document...",
    # ... more texts
]

# Generate embedding vectors
embedding_model = WordEmbedding()
model = embedding_model.load_model()
vectors = model.encode(custom_texts)

# Save the processed data
with open('custom_texts.pkl', 'wb') as f:
    pickle.dump(custom_texts, f)
with open('custom_embeddings.pkl', 'wb') as f:
    pickle.dump(vectors, f)

# Chunk a long text
chunker = RagChunking("a long text...")
chunks = chunker.text_chunking(chunk_size=200, chunk_overlap=50)
```

## 🎨 Advanced Usage (non-core features; some are deprecated)

### 1. Custom Reranking (bring your own cross-encoder model)

```python
from src.retrieval import rerank_texts

# Re-rank the retrieved passages
query = "your query"
passages = ["doc 1", "doc 2", "doc 3"]
reranked_passages = rerank_texts(query, passages, model)
```

### 2. Batch Processing

```python
def batch_search(queries, tree_root, model):
    """Run multiple queries in one pass."""
    all_results = {}
    for query in queries:
        results = tree_search(tree_root, query, model, 100, 20)
        all_results[query] = results
    return all_results

queries = [
    "questions about the Land Act",
    "provisions of the General Principles of the Civil Code",
    "articles of the Urban Planning Act"
]

batch_results = batch_search(queries, tree_root, model)
```

### 3. Result Post-Processing

```python
def process_results(results, max_results=5):
    """Filter and clean up retrieval results."""
    # Deduplicate
    unique_results = list(set(results))

    # Drop very short results
    filtered_results = [r for r in unique_results if len(r.strip()) > 20]

    # Cap the number of results
    return filtered_results[:max_results]

processed_results = process_results(results)
```

## 📝 Release Notes (Recent Changes)

- Externalized parameters: settings can now be adjusted via environment variables, with no code changes.
  - LLM: `OPENAI_API_KEY`, `OPENAI_MODEL`, `OPENAI_TEMPERATURE`, `OPENAI_TOP_P`, `OPENAI_MAX_TOKENS`
  - Embedding: `EMBEDDING_MODEL_NAME`
  - Retrieval: `CHUNK_SIZE`, `CHUNK_OVERLAP`, `MAX_CHUNKS`, `MAX_RESULTS`, `TOP_K`
  - Rerank: `RERANKER_ENABLE_IN_PIPELINE` (default false), `RERANKER_USE_CROSS_ENCODER` (default false), `RERANKER_MODEL_NAME`
  - API: `CORS_ORIGINS`, `API_TITLE`

- Rerank pipeline switches:
  - When `RERANKER_ENABLE_IN_PIPELINE=true` and the number of retrieved results exceeds `MAX_RESULTS`, the system automatically re-ranks the candidates.
  - If `RERANKER_USE_CROSS_ENCODER=true` is also set, a Cross-Encoder scores query-passage pairs for the re-ranking (the model is selectable via `RERANKER_MODEL_NAME`).

- Example `.env`:

```bash
OPENAI_API_KEY=sk-xxxx
OPENAI_MODEL=gpt-4o-mini
OPENAI_TEMPERATURE=0.2
OPENAI_TOP_P=0.9
OPENAI_MAX_TOKENS=4096
EMBEDDING_MODEL_NAME=intfloat/multilingual-e5-large
CHUNK_SIZE=150
CHUNK_OVERLAP=50
MAX_CHUNKS=12
MAX_RESULTS=100
TOP_K=15
RERANKER_ENABLE_IN_PIPELINE=true
RERANKER_USE_CROSS_ENCODER=true
RERANKER_MODEL_NAME=cross-encoder/ms-marco-MiniLM-L-6-v2
CORS_ORIGINS=http://localhost:3000,https://your.domain
API_TITLE=Hierarchical RAG API
```
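At startup these variables are read from the environment (python-dotenv, which is in the requirements, loads the `.env` file). The sketch below shows the rerank gating described above; the `should_rerank` helper is hypothetical, not part of the package:

```python
import os

# For illustration, seed the environment as .env would
# (in the app, python-dotenv's load_dotenv() does this).
os.environ.setdefault("MAX_RESULTS", "100")
os.environ.setdefault("RERANKER_ENABLE_IN_PIPELINE", "true")

max_results = int(os.getenv("MAX_RESULTS", "100"))
rerank_enabled = os.getenv("RERANKER_ENABLE_IN_PIPELINE", "false").lower() == "true"

def should_rerank(num_candidates: int) -> bool:
    # Re-rank only when the pipeline switch is on and there are
    # more candidates than MAX_RESULTS.
    return rerank_enabled and num_candidates > max_results

print(should_rerank(150))  # True
print(should_rerank(80))   # False
```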

## 🔧 Configuration

### Embedding Model Configuration

```python
# The default is intfloat/multilingual-e5-large.
# Any other Sentence Transformers model can be used instead.

from sentence_transformers import SentenceTransformer

# Custom model
custom_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Use it for retrieval
results = tree_search(tree_root, query, custom_model, 100, 20)
```

### Tuning System Parameters

```python
# Suggested settings for different scenarios

# Precise retrieval (slower but more accurate)
results = tree_search(
    tree_root, query, model,
    chunk_size=50,      # smaller chunks
    chunk_overlap=10,   # less overlap
    max_chunks=5        # fewer chunks
)

# Fast retrieval (quicker but may miss details)
results = tree_search(
    tree_root, query, model,
    chunk_size=200,     # larger chunks
    chunk_overlap=40,   # more overlap
    max_chunks=15       # more chunks
)
```

## 📊 Performance Tips

### 1. Memory Management

```python
# For large corpora, process the texts in batches
# (note: `model` must already be loaded, as in the Quick Start)
def process_large_corpus(texts, batch_size=1000):
    trees = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        batch_vectors = model.encode(batch)
        tree = create_ahc_tree(batch_vectors, batch)
        trees.append(tree)
    return trees
```

### 2. Caching

```python
import os

from src.retrieval import create_ahc_tree, save_tree, load_tree

# Reuse a previously built retrieval tree if one exists
tree_file = "retrieval_tree.pkl"
if os.path.exists(tree_file):
    tree_root = load_tree(tree_file)
else:
    tree_root = create_ahc_tree(vectors, texts)
    save_tree(tree_root, tree_file)
```

## 🔍 Example Applications

### Legal Document Retrieval

```python
# A statute retrieval system
legal_texts = [
    "Civil Code, Article 1: Civil matters not provided for by statute are governed by custom...",
    "Criminal Code, Article 10: The terms 'above', 'below', 'within' and 'beyond'...",
    # ... more statutes
]

# Build the legal retrieval system
legal_vectors = model.encode(legal_texts)
legal_tree = create_ahc_tree(legal_vectors, legal_texts)

# Ask a legal question
legal_query = "rules on the legal effect of contracts"
legal_results = tree_search(legal_tree, legal_query, model, 100, 20)
```

### Academic Paper Retrieval

```python
# Academic literature retrieval
papers = [
    "This study investigates applications of machine learning in natural language processing...",
    "Recent advances of deep learning models in image recognition...",
    # ... more abstracts
]

academic_vectors = model.encode(papers)
academic_tree = create_ahc_tree(academic_vectors, papers)

research_query = "effectiveness of transformer models for text classification"
academic_results = tree_search(academic_tree, research_query, model, 150, 30)
```

## 🔬 API Reference

### Web App API

- `GET /`: main page
- `GET /available-texts`: list the available texts
- `POST /query`: submit a query
  - Request body (`use_extraction` is a boolean; `prompt_type` is either `"task_oriented"` or `"cot"`):
    ```json
    {
        "query": "your question",
        "use_extraction": true,
        "text_name": "name of the text",
        "prompt_type": "task_oriented"
    }
    ```
  - Response:
    ```json
    {
        "answer": "the system's answer",
        "retrieved_docs": ["retrieved document 1", "retrieved document 2"]
    }
    ```
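A minimal client for the endpoint above, using only the standard library. The `text_name` value is a placeholder; query `GET /available-texts` for the real names:

```python
import json
import urllib.request

payload = {
    "query": "What is the procedure for transferring land ownership?",
    "use_extraction": True,
    "text_name": "example_text",   # placeholder; see GET /available-texts
    "prompt_type": "task_oriented",
}

req = urllib.request.Request(
    "http://localhost:8000/query",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# With the demo app running locally, send the request:
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
#     print(result["answer"], result["retrieved_docs"])
print(req.get_method(), req.full_url)
```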

### Python API

#### Core Retrieval Functions

```python
# Main retrieval functions
from src.retrieval import create_ahc_tree, tree_search, save_tree, load_tree

# Multi-level retrieval functions
from src.retrieval import (
    build_multi_level_index_from_files,
    multi_level_tree_search,
    multi_level_extraction_tree_search
)

# Query extraction
from src.retrieval import extraction_tree_search

# Utilities
from src.utils import WordEmbedding, RagChunking
from src.data_processing import DataDealer
```

## 🐛 FAQ and Troubleshooting

### Q: Retrieval results are not precise enough?
**A:** Try tuning the parameters:
- Decrease `chunk_size` for higher precision
- Increase `max_chunks` to get more candidate results
- Use query extraction for complex questions

### Q: Processing is slow?
**A:** Suggestions:
- Increase `chunk_size` to reduce the number of chunks
- Decrease `max_chunks` to bound the work
- Use the multi-level index instead of a single retrieval tree

### Q: Memory usage is too high?
**A:** Memory management:
- Process large corpora in batches
- Delete variables you no longer need
- Use generators rather than lists for large collections
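The "generators rather than lists" advice can be sketched as a batch iterator that keeps only one batch of texts (and its embeddings) in memory at a time:

```python
def iter_batches(texts, batch_size=1000):
    """Yield successive slices of texts instead of materializing
    everything (and its embeddings) up front."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

# In practice: for batch in iter_batches(corpus): vectors = model.encode(batch)
sizes = [len(b) for b in iter_batches(list(range(2500)), batch_size=1000)]
print(sizes)  # [1000, 1000, 500]
```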

### Q: How do I handle texts in different languages?
**A:** Multilingual support:
- Use a multilingual embedding model (such as the default multilingual-e5-large)
- Keep the query language consistent with the corpus language
- Consider language-specific tokenization strategies

## 📄 How the System Works

### Retrieval Flow

The system offers two retrieval modes:

1. **Direct retrieval** - for simple, well-defined questions
   - Embeds the user input directly
   - Walks the retrieval tree to find similar text fragments
   - Generates an answer with the language model

2. **Extraction-based retrieval** - for complex or verbose questions
   - First uses the language model to extract the core legal issues and concepts
   - Embeds the extracted key points
   - Walks the retrieval tree to find relevant fragments
   - Generates a focused answer from the extracted points
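The two modes can be sketched as a single dispatch function. This is an illustrative wrapper only; the stub search callables stand in for the package's `tree_search` and `extraction_tree_search`:

```python
def answer_query(query, tree_root, model, direct_search, extraction_search,
                 llm=None, use_extraction=False):
    """Choose between direct and extraction-based retrieval."""
    if use_extraction and llm is not None:
        # Complex query: have the LLM extract core issues first, then retrieve.
        return extraction_search(tree_root, query, model, llm)
    # Simple query: embed the user input directly and walk the tree.
    return direct_search(tree_root, query, model)

# Stubs for demonstration; real code passes the package's functions.
direct = lambda root, q, m: [f"direct:{q}"]
extract = lambda root, q, m, llm: [f"extracted:{q}"]

print(answer_query("q1", None, None, direct, extract))
print(answer_query("q2", None, None, direct, extract,
                   llm=object(), use_extraction=True))
```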

### Answer Styles

#### Task-Oriented
- Traits: concise and direct; delivers answers quickly
- Best for: questions that need a clear statutory explanation or operational guidance

#### Chain of Thought (CoT)
- Traits: detailed analysis with an explicit reasoning process
- Best for: complex legal reasoning or questions requiring multi-step inference

## 📦 Deployment and Distribution

### Install from PyPI

```bash
pip install hierarchical-rag-retrieval
```

### Install the Development Version from GitHub

```bash
pip install git+https://github.com/arthur422tp/hierarchical.git
```

### Local Development Install

```bash
# Clone the project
git clone https://github.com/arthur422tp/hierarchical.git
cd hierarchical

# Install development dependencies
pip install -e .[dev]
```

## 📚 More Resources

- **GitHub Repository**: https://github.com/arthur422tp/hierarchical
- **arXiv Paper**: https://arxiv.org/abs/2506.13607
- **PyPI Package**: https://pypi.org/project/hierarchical-rag-retrieval/
- **Issue Tracker**: https://github.com/arthur422tp/hierarchical/issues

## 🤝 Contributing and Support

Issues and pull requests are welcome! If you run into any problems or have suggestions for improvement, feel free to get in touch.

### Development Environment Setup

```bash
# Clone the project
git clone https://github.com/arthur422tp/hierarchical.git
cd hierarchical

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install development dependencies
pip install -e .[dev,app]
```

## 📜 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 📞 Contact

- Author: arthur422tp
- Email: arthur422tp@gmail.com
- GitHub: https://github.com/arthur422tp

---

**Enjoy! If this system helps your project, please consider giving it a ⭐ Star!**



            

Request\uff01\u5982\u679c\u60a8\u5728\u4f7f\u7528\u904e\u7a0b\u4e2d\u9047\u5230\u4efb\u4f55\u554f\u984c\uff0c\u6216\u6709\u6539\u9032\u5efa\u8b70\uff0c\u8acb\u96a8\u6642\u806f\u7e6b\u6211\u5011\u3002\n\n### \u958b\u767c\u74b0\u5883\u8a2d\u7f6e\n\n```bash\n# \u514b\u9686\u5c08\u6848\ngit clone https://github.com/arthur422tp/hierarchical.git\ncd hierarchical\n\n# \u5efa\u7acb\u865b\u64ec\u74b0\u5883\npython -m venv venv\nsource venv/bin/activate  # Windows: venv\\Scripts\\activate\n\n# \u5b89\u88dd\u958b\u767c\u4f9d\u8cf4\npip install -e .[dev,app]\n```\n\n## \ud83d\udcdc License\n\n\u672c\u5c08\u6848\u63a1\u7528 MIT \u6388\u6b0a\u689d\u6b3e - \u8a73\u898b [LICENSE](LICENSE) \u6a94\u6848\u3002\n\n## \ud83d\udcde \u806f\u7e6b\u65b9\u5f0f\n\n- \u4f5c\u8005\uff1aarthur422tp\n- Email\uff1aarthur422tp@gmail.com\n- GitHub\uff1ahttps://github.com/arthur422tp\n\n---\n\n**\u795d\u60a8\u4f7f\u7528\u6109\u5feb\uff01\u5982\u679c\u9019\u500b\u7cfb\u7d71\u5c0d\u60a8\u7684\u5c08\u6848\u6709\u5e6b\u52a9\uff0c\u8acb\u8003\u616e\u7d66\u6211\u5011\u4e00\u500b \u2b50 Star\uff01**\n\n\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "AI-Powered Legal Document Retrieval Engine based on Hierarchical Clustering & RAG",
    "version": "0.1.2",
    "project_urls": {
        "Bug Tracker": "https://github.com/arthur422tp/hierarchical/issues",
        "Documentation": "https://github.com/arthur422tp/hierarchical#readme",
        "Homepage": "https://github.com/arthur422tp/hierarchical",
        "Repository": "https://github.com/arthur422tp/hierarchical",
        "arXiv Paper": "https://arxiv.org/abs/2506.13607"
    },
    "split_keywords": [
        "rag",
        " retrieval",
        " hierarchical",
        " clustering",
        " legal",
        " nlp",
        " ai",
        " machine-learning"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "43e5079ba412e33a4e3d4a6dcb8fd0c48b0ea76f3f81c35a02583ed48c0860c3",
                "md5": "ab09e799f11972fc092b5fd78d14e196",
                "sha256": "deac79430e09096584ac00d43580b2cb01a1245404a4c3ee23a9b26ab66845ae"
            },
            "downloads": -1,
            "filename": "hierarchical_rag_retrieval-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ab09e799f11972fc092b5fd78d14e196",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 23635,
            "upload_time": "2025-08-13T04:08:30",
            "upload_time_iso_8601": "2025-08-13T04:08:30.601237Z",
            "url": "https://files.pythonhosted.org/packages/43/e5/079ba412e33a4e3d4a6dcb8fd0c48b0ea76f3f81c35a02583ed48c0860c3/hierarchical_rag_retrieval-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "91d2f42d88f791601f5a6457aa88245d4c3bfda51068d73299cb27706a9784cc",
                "md5": "84ab7214478cc30445cc3e0270d08664",
                "sha256": "889d45f46f618daf99adab9a2ac9942d4451cb6bc4981b9db42bbf7714510af2"
            },
            "downloads": -1,
            "filename": "hierarchical_rag_retrieval-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "84ab7214478cc30445cc3e0270d08664",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 2419554,
            "upload_time": "2025-08-13T04:08:32",
            "upload_time_iso_8601": "2025-08-13T04:08:32.926250Z",
            "url": "https://files.pythonhosted.org/packages/91/d2/f42d88f791601f5a6457aa88245d4c3bfda51068d73299cb27706a9784cc/hierarchical_rag_retrieval-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-13 04:08:32",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "arthur422tp",
    "github_project": "hierarchical",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "fastapi",
            "specs": [
                [
                    "==",
                    "0.104.1"
                ]
            ]
        },
        {
            "name": "uvicorn",
            "specs": [
                [
                    "==",
                    "0.24.0"
                ]
            ]
        },
        {
            "name": "python-multipart",
            "specs": [
                [
                    "==",
                    "0.0.6"
                ]
            ]
        },
        {
            "name": "pydantic",
            "specs": [
                [
                    "==",
                    "2.4.2"
                ]
            ]
        },
        {
            "name": "sentence-transformers",
            "specs": [
                [
                    "==",
                    "2.2.2"
                ]
            ]
        },
        {
            "name": "faiss-cpu",
            "specs": [
                [
                    "==",
                    "1.7.4"
                ]
            ]
        },
        {
            "name": "langchain",
            "specs": [
                [
                    "==",
                    "0.3.26"
                ]
            ]
        },
        {
            "name": "langchain-openai",
            "specs": [
                [
                    "==",
                    "0.0.2"
                ]
            ]
        },
        {
            "name": "torch",
            "specs": [
                [
                    "==",
                    "2.1.1"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "==",
                    "2.3.1"
                ]
            ]
        },
        {
            "name": "scipy",
            "specs": [
                [
                    "==",
                    "1.16.0"
                ]
            ]
        },
        {
            "name": "fastcluster",
            "specs": [
                [
                    "==",
                    "1.3.0"
                ]
            ]
        },
        {
            "name": "scikit-learn",
            "specs": [
                [
                    "==",
                    "1.3.2"
                ]
            ]
        },
        {
            "name": "python-dotenv",
            "specs": [
                [
                    "==",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "langchain_community",
            "specs": [
                [
                    "==",
                    "0.0.10"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    "==",
                    "2.1.0"
                ]
            ]
        }
    ],
    "lcname": "hierarchical-rag-retrieval"
}
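
The FAQ in the description above recommends batching large corpora and storing large datasets with generators rather than lists. A minimal sketch of that pattern is shown below; `toy_encode` is an illustrative stand-in for a real embedding callable (e.g. a sentence-transformers model's `encode`), and all names here are assumptions, not part of the package's API:

```python
from typing import Callable, Iterator, List, Sequence


def batched(texts: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield successive slices of the corpus instead of one giant list."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]


def encode_streaming(
    texts: Sequence[str],
    encode: Callable[[List[str]], List[List[float]]],
    batch_size: int = 1000,
) -> Iterator[List[float]]:
    """Lazily encode a corpus batch by batch.

    Only one batch of vectors is held in memory at a time; downstream
    consumers pull vectors on demand.
    """
    for batch in batched(texts, batch_size):
        for vector in encode(list(batch)):
            yield vector


def toy_encode(batch: List[str]) -> List[List[float]]:
    """Toy stand-in for an embedding model: (length, space count) per text."""
    return [[float(len(t)), float(t.count(" "))] for t in batch]


if __name__ == "__main__":
    corpus = [f"document {i}" for i in range(2500)]
    vectors = encode_streaming(corpus, toy_encode, batch_size=1000)
    first = next(vectors)  # only the first batch has been encoded so far
    print(first)
```

With a real model, `toy_encode` would be replaced by something like `lambda batch: model.encode(batch)`; the streaming structure stays the same.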
        