# Legal Document Retrieval System Based on Hierarchical Clustering
## 階層式聚類法規文本檢索系統
<div align="center">




**An intelligent legal-text retrieval engine based on hierarchical clustering and RAG**

[📖 Quick Start](#-quick-start) • [📚 Usage Guide](#-detailed-usage-guide) • [🔧 API Reference](#-api-reference) • [📄 arXiv Paper](https://arxiv.org/abs/2506.13607)
</div>
## 📖 System Overview
This system is an intelligent retrieval engine that combines an AI legal assistant with statute lookup. Its core techniques are hierarchical clustering and cosine similarity, with answer generation handled through OpenAI-based Retrieval-Augmented Generation (RAG). It suits smart law firms, legal chatbots, academic research, and multilingual statute indexing, delivering accurate, explainable answers to legal queries.
### 🎯 Key Features
- **🌳 Hierarchical retrieval tree**: automatically builds a semantic, multi-level index structure
- **🔍 Dual retrieval modes**: supports both direct retrieval and query extraction
- **🧠 RAG integration**: uses OpenAI GPT for answer generation
- **⚡ Efficient retrieval**: no manual k value needed; relevant texts are selected automatically
- **🎨 Modular design**: easy to integrate into existing projects
- **🌐 Full-stack solution**: built-in frontend UI + REST API
- **🐳 Docker support**: containerized deployment with one-command startup
## 🛠️ Technical Architecture
| Component | Tech Used |
|----------|------------|
| Frontend | HTML / JavaScript / Tailwind CSS |
| Backend | FastAPI |
| Embedding Model | `intfloat/multilingual-e5-large` |
| Retrieval Tree | Hierarchical Clustering + Cosine Similarity |
| LLM API | OpenAI GPT (ChatGPT API) |
| Containerization | Docker & Docker Compose |
### Core Component Layout
```
hierarchical-rag-retrieval/
├── retrieval/                # core retrieval modules
│   ├── RAGTree_function.py   # hierarchical retrieval tree
│   ├── multi_level_search.py # multi-level index retrieval
│   └── generated_function.py # query extraction
├── utils/                    # utility modules
│   ├── word_embedding.py     # word-embedding handling
│   ├── word_chunking.py      # text chunking
│   └── query_retrieval.py    # FAISS retrieval
├── data_processing/          # data-processing modules
│   └── data_dealer.py        # data-format handling
└── app/                      # demo application
    ├── main.py               # FastAPI entry point
    └── static/index.html     # frontend UI
```
## 🚀 Quick Start
### 📦 Installation
```bash
pip install hierarchical-rag-retrieval
```
### 🎯 Basic Usage Example
```python
from src.retrieval import create_ahc_tree, tree_search
from src.utils import WordEmbedding

# 1. Initialize the word-embedding model
embedding_model = WordEmbedding()
model = embedding_model.load_model()

# 2. Prepare your text data (Chinese statute snippets shown as samples)
texts = [
    "民法總則規定自然人之權利能力始於出生終於死亡",
    "土地法規定土地所有權之移轉應辦理登記",
    "都市計畫法規定都市計畫區域內土地使用分區",
    # ... more texts
]

# 3. Encode the texts into vectors
vectors = model.encode(texts)

# 4. Build the hierarchical retrieval tree
tree_root = create_ahc_tree(vectors, texts)

# 5. Run a search ("What procedure does a land-ownership transfer require?")
query = "土地所有權移轉需要什麼程序?"
results = tree_search(
    tree_root,
    query,
    model,
    chunk_size=100,
    chunk_overlap=20
)

# 6. Inspect the results
for i, result in enumerate(results, 1):
    print(f"{i}. {result}")
```
### 🔧 Environment Setup (demo application)
#### Prerequisites
- Python 3.8+ or a Docker environment
- An OpenAI API key
#### Option 1: Traditional deployment
```bash
# Install dependencies
pip install -r requirements.txt

# Set environment variables
echo "OPENAI_API_KEY=your_openai_api_key" > .env

# Start the application
cd app && python main.py
```
#### Option 2: Docker deployment (recommended)
```bash
# Set environment variables
echo "OPENAI_API_KEY=your_openai_api_key" > .env

# Start the services
docker-compose up -d

# Follow the logs
docker-compose logs -f
```
Once the application is running, open http://localhost:8000 in your browser.
## 📚 Detailed Usage Guide
### 1. Hierarchical Retrieval Tree (RAGTree)
The hierarchical retrieval tree is the system's core: a clustering algorithm automatically organizes the text vectors.
```python
from src.retrieval import create_ahc_tree, tree_search, save_tree, load_tree

# Build the retrieval tree
tree_root = create_ahc_tree(vectors, texts)

# Save the tree for reuse
save_tree(tree_root, "my_retrieval_tree.pkl")

# Load a previously saved tree
tree_root = load_tree("my_retrieval_tree.pkl")

# Run a search
results = tree_search(
    root=tree_root,
    query="your query",
    model=embedding_model.load_model(),
    chunk_size=100,
    chunk_overlap=20,
    max_chunks=10
)
```
**Parameters:**
- `chunk_size`: chunk length; larger values preserve more context
- `chunk_overlap`: overlap between adjacent chunks, which keeps important information from being cut off at chunk boundaries
- `max_chunks`: maximum number of chunks, bounding the amount of work per query
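As a rough, standalone illustration of how `chunk_size` and `chunk_overlap` interact (a character-based sketch, not the library's actual chunker):

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Each chunk repeats the last 2 characters of the previous one, so a phrase
# cut at a boundary still appears whole in some chunk.
chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```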
### 2. Multi-Level Index Retrieval (not yet implemented; the interface below is the plan)
A multi-level index system optimized for large text corpora.
```python
from src.retrieval import (
    build_multi_level_index_from_files,
    multi_level_tree_search,
    multi_level_extraction_tree_search
)

# Build a multi-level index from files
index = build_multi_level_index_from_files(
    embeddings_path="embeddings.pkl",
    texts_path="texts.pkl"
)

# Direct retrieval
results = multi_level_tree_search(
    index=index,
    query="your query",
    model=model,
    chunk_size=100,
    chunk_overlap=20
)

# Retrieval with query extraction (suited to complex questions)
results = multi_level_extraction_tree_search(
    index=index,
    query="a long, complex legal question...",
    model=model,
    chunk_size=100,
    chunk_overlap=20,
    llm=openai_llm  # an OpenAI language model
)
```
### 3. Query Extraction and Refinement
For complex or verbose queries, the system can automatically extract the core question.
```python
from src.retrieval import extraction_tree_search
from langchain_openai import ChatOpenAI

# Configure the OpenAI model
llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    api_key="your-openai-api-key"
)

# Search with query extraction. The sample query asks, in Chinese, which
# statutes govern a commercial land purchase inside an urban-planning zone,
# and what the procedure for transferring land ownership is.
complex_query = """
我想了解關於土地買賣的法律規定,特別是在都市計畫區域內,
如果我要購買一塊土地用於商業用途,需要注意哪些法律條文?
另外,土地所有權的移轉程序是什麼?
"""

results = extraction_tree_search(
    root=tree_root,
    query=complex_query,
    model=model,
    chunk_size=100,
    chunk_overlap=20,
    llm=llm
)
```
### 4. Custom Text Processing
```python
from src.utils import WordEmbedding, RagChunking
from src.data_processing import DataDealer
import pickle

# Handler for custom text data
dealer = DataDealer()

# Prepare the text data
custom_texts = [
    "your first document...",
    "your second document...",
    # ... more texts
]

# Generate embedding vectors
embedding_model = WordEmbedding()
model = embedding_model.load_model()
vectors = model.encode(custom_texts)

# Persist the processed data
with open('custom_texts.pkl', 'wb') as f:
    pickle.dump(custom_texts, f)
with open('custom_embeddings.pkl', 'wb') as f:
    pickle.dump(vectors, f)

# Chunk a long text
chunker = RagChunking("a long text...")
chunks = chunker.text_chunking(chunk_size=200, chunk_overlap=50)
```
## 🎨 Advanced Usage (secondary features; some are deprecated)
### 1. Custom Reranking (bring your own cross-encoder model)
```python
from src.retrieval import rerank_texts

# Rerank the retrieved passages
query = "your query"
passages = ["doc 1", "doc 2", "doc 3"]
reranked_passages = rerank_texts(query, passages, model)
```
### 2. Batch Processing
```python
def batch_search(queries, tree_root, model):
    """Run multiple queries in one pass."""
    all_results = {}
    for query in queries:
        results = tree_search(tree_root, query, model, 100, 20)
        all_results[query] = results
    return all_results

queries = [
    "土地法相關問題",
    "民法總則規定",
    "都市計畫法條文"
]
batch_results = batch_search(queries, tree_root, model)
```
### 3. Result Post-Processing
```python
def process_results(results, max_results=5):
    """Deduplicate, filter, and cap the retrieved results."""
    # Deduplicate
    unique_results = list(set(results))
    # Drop very short fragments
    filtered_results = [r for r in unique_results if len(r.strip()) > 20]
    # Cap the number of results
    return filtered_results[:max_results]

processed_results = process_results(results)
```
## 📝 Release Notes (recent changes)
- Externalized parameters: these can now be tuned via environment variables, with no code changes.
  - LLM: `OPENAI_API_KEY`, `OPENAI_MODEL`, `OPENAI_TEMPERATURE`, `OPENAI_TOP_P`, `OPENAI_MAX_TOKENS`
  - Embedding: `EMBEDDING_MODEL_NAME`
  - Retrieval: `CHUNK_SIZE`, `CHUNK_OVERLAP`, `MAX_CHUNKS`, `MAX_RESULTS`, `TOP_K`
  - Rerank: `RERANKER_ENABLE_IN_PIPELINE` (default false), `RERANKER_USE_CROSS_ENCODER` (default false), `RERANKER_MODEL_NAME`
  - API: `CORS_ORIGINS`, `API_TITLE`
- Rerank pipeline switches:
  - When `RERANKER_ENABLE_IN_PIPELINE=true` and the number of retrieved results exceeds `MAX_RESULTS`, the system automatically reranks the candidates.
  - If `RERANKER_USE_CROSS_ENCODER=true` is also set, reranking switches to pairwise Cross-Encoder scoring (the model is selectable via `RERANKER_MODEL_NAME`).
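The switch logic can be sketched as follows (a minimal sketch, assuming the toggles are read straight from the environment; `rerank_fn` stands in for either `rerank_texts` or a Cross-Encoder scorer):

```python
import os

def maybe_rerank(query, candidates, rerank_fn):
    """Rerank only when the pipeline toggle is on and the candidate list is large."""
    enabled = os.getenv("RERANKER_ENABLE_IN_PIPELINE", "false").lower() == "true"
    max_results = int(os.getenv("MAX_RESULTS", "100"))
    if enabled and len(candidates) > max_results:
        return rerank_fn(query, candidates)
    return candidates

os.environ["RERANKER_ENABLE_IN_PIPELINE"] = "true"
os.environ["MAX_RESULTS"] = "2"
out = maybe_rerank("q", ["a", "b", "c"], lambda q, c: list(reversed(c)))
print(out)  # ['c', 'b', 'a']
```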
- Example `.env`:
```bash
OPENAI_API_KEY=sk-xxxx
OPENAI_MODEL=gpt-4o-mini
OPENAI_TEMPERATURE=0.2
OPENAI_TOP_P=0.9
OPENAI_MAX_TOKENS=4096
EMBEDDING_MODEL_NAME=intfloat/multilingual-e5-large
CHUNK_SIZE=150
CHUNK_OVERLAP=50
MAX_CHUNKS=12
MAX_RESULTS=100
TOP_K=15
RERANKER_ENABLE_IN_PIPELINE=true
RERANKER_USE_CROSS_ENCODER=true
RERANKER_MODEL_NAME=cross-encoder/ms-marco-MiniLM-L-6-v2
CORS_ORIGINS=http://localhost:3000,https://your.domain
API_TITLE=Hierarchical RAG API
```
## 🔧 Configuration
### Embedding Model
```python
# The default model is intfloat/multilingual-e5-large.
# Any other Sentence Transformers model also works.
from sentence_transformers import SentenceTransformer

# Custom model
custom_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Use it in retrieval
results = tree_search(tree_root, query, custom_model, 100, 20)
```
### System Parameter Tuning
```python
# Suggested settings for different scenarios

# Precise retrieval (slower but more accurate)
results = tree_search(
    tree_root, query, model,
    chunk_size=50,     # smaller chunks
    chunk_overlap=10,  # smaller overlap
    max_chunks=5       # fewer chunks
)

# Fast retrieval (quicker but may miss details)
results = tree_search(
    tree_root, query, model,
    chunk_size=200,    # larger chunks
    chunk_overlap=40,  # larger overlap
    max_chunks=15      # more chunks
)
```
## 📊 Performance Tips
### 1. Memory Management
```python
# For large corpora, build trees in batches
def process_large_corpus(texts, batch_size=1000):
    trees = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        batch_vectors = model.encode(batch)
        tree = create_ahc_tree(batch_vectors, batch)
        trees.append(tree)
    return trees
```
### 2. Caching
```python
import os

# Reuse a previously built retrieval tree if one exists
tree_file = "retrieval_tree.pkl"
if os.path.exists(tree_file):
    tree_root = load_tree(tree_file)
else:
    tree_root = create_ahc_tree(vectors, texts)
    save_tree(tree_root, tree_file)
```
## 🔍 Example Applications
### Legal Document Retrieval
```python
# Statute retrieval system (Chinese statute snippets as samples)
legal_texts = [
    "民法第一條:民事,法律所未規定者,依習慣...",
    "刑法第十條:稱以上、以下、以內、以外者...",
    # ... more statutes
]

# Build the legal retrieval tree
legal_vectors = model.encode(legal_texts)
legal_tree = create_ahc_tree(legal_vectors, legal_texts)

# Ask a legal question ("rules on the legal effect of contracts")
legal_query = "關於契約的法律效力規定"
legal_results = tree_search(legal_tree, legal_query, model, 100, 20)
```
### Academic Paper Retrieval
```python
# Academic literature retrieval (Chinese abstracts as samples)
papers = [
    "本研究探討機器學習在自然語言處理中的應用...",
    "深度學習模型在圖像識別領域的最新進展...",
    # ... more abstracts
]

academic_vectors = model.encode(papers)
academic_tree = create_ahc_tree(academic_vectors, papers)

research_query = "transformer模型在文本分類的效果"
academic_results = tree_search(academic_tree, research_query, model, 150, 30)
```
## 🔬 API Reference
### Web Application API
- `GET /`: main page
- `GET /available-texts`: list the available text corpora
- `POST /query`: submit a query
  - Request body (`use_extraction` is a boolean; `prompt_type` is either `"task_oriented"` or `"cot"`):
    ```json
    {
      "query": "your question",
      "use_extraction": true,
      "text_name": "corpus name",
      "prompt_type": "task_oriented"
    }
    ```
  - Response:
    ```json
    {
      "answer": "the system's answer",
      "retrieved_docs": ["retrieved doc 1", "retrieved doc 2", ...]
    }
    ```
### Python API
#### Core Retrieval Functions
```python
# Main retrieval functions
from src.retrieval import create_ahc_tree, tree_search, save_tree, load_tree

# Multi-level retrieval functions
from src.retrieval import (
    build_multi_level_index_from_files,
    multi_level_tree_search,
    multi_level_extraction_tree_search
)

# Query extraction function
from src.retrieval import extraction_tree_search

# Utilities
from src.utils import WordEmbedding, RagChunking
from src.data_processing import DataDealer
```
## 🐛 FAQ
### Q: Retrieval results are not precise enough?
**A:** Try adjusting the parameters:
- Decrease `chunk_size` for finer-grained matching
- Increase `max_chunks` to consider more candidates
- Use query extraction for complex questions
### Q: Retrieval is slow?
**A:** Optimization suggestions:
- Increase `chunk_size` to reduce the number of chunks
- Decrease `max_chunks` to bound the work per query
- Use the multi-level index instead of a single retrieval tree
### Q: Memory usage is too high?
**A:** Memory management:
- Process large corpora in batches
- Delete variables you no longer need
- Use generators instead of lists for large collections
### Q: How do I handle texts in different languages?
**A:** Multilingual support:
- Use a multilingual embedding model (such as the default multilingual-e5-large)
- Keep the query language consistent with the corpus language
- Consider language-specific tokenization strategies
## 📄 How the System Works
### Retrieval Flow
The system offers two retrieval modes:
1. **Direct retrieval** - suited to simple, well-defined questions
   - Embed the user's input directly
   - Find similar text fragments via the retrieval tree
   - Generate the answer with a language model
2. **Query-extraction retrieval** - suited to complex or verbose questions
   - First use a language model to extract the core legal questions and concepts
   - Embed the extracted key points
   - Find relevant fragments via the retrieval tree
   - Generate a focused answer from the extracted points
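The two modes reduce to a single branch in the pipeline. The schematic sketch below injects the retrieval, extraction, and generation steps as plain functions; the lambdas are toy stand-ins for the library's tree search, LLM extraction, and RAG answer steps, not the actual API:

```python
def answer_query(query, retrieve, extract=None, generate=None):
    """Route a query through the two retrieval modes (a schematic sketch).

    retrieve(q) -> docs models the tree search; extract(q) -> distilled query
    models the LLM extraction step; generate(q, docs) -> answer models the
    RAG step. All three are injected so the control flow stays visible.
    """
    if extract is not None:
        query = extract(query)  # Mode 2: distill the core question first
    docs = retrieve(query)      # shared: similarity search over the tree
    return generate(query, docs)

# Toy stand-ins to show the control flow:
answer = answer_query(
    "一段冗長的法律問題描述……核心:土地移轉程序?",
    retrieve=lambda q: [f"doc about {q}"],
    extract=lambda q: q.split("核心:")[-1],
    generate=lambda q, docs: f"{q} -> {len(docs)} docs",
)
print(answer)  # 土地移轉程序? -> 1 docs
```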
### Answer Styles
#### Task-Oriented
- Style: concise and direct; delivers the answer quickly
- Use for: questions that need a clear statute explanation or operational guidance
#### Chain of Thought (CoT)
- Style: detailed analysis with an explicit reasoning process
- Use for: complex legal analysis or questions requiring multi-step inference
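The choice between the two styles typically comes down to the prompt template. The templates below are hypothetical illustrations of that selection, not the system's actual prompts:

```python
# Hypothetical templates illustrating the two answer styles.
PROMPTS = {
    "task_oriented": (
        "Answer concisely and cite the relevant statutes.\n"
        "Context:\n{context}\nQuestion: {q}"
    ),
    "cot": (
        "Reason step by step through the legal issues before answering.\n"
        "Context:\n{context}\nQuestion: {q}"
    ),
}

def build_prompt(prompt_type, context, q):
    """Select and fill the template matching the requested answer style."""
    return PROMPTS[prompt_type].format(context=context, q=q)

prompt = build_prompt("cot", "民法第153條…", "契約何時成立?")
```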
## 📦 Deployment and Distribution
### Install from PyPI
```bash
pip install hierarchical-rag-retrieval
```
### Install the development version from GitHub
```bash
pip install git+https://github.com/arthur422tp/hierarchical.git
```
### Local development install
```bash
# Clone the project
git clone https://github.com/arthur422tp/hierarchical.git
cd hierarchical

# Install with development dependencies
pip install -e ".[dev]"
```
## 📚 More Resources
- **GitHub Repository**: https://github.com/arthur422tp/hierarchical
- **arXiv Paper**: https://arxiv.org/abs/2506.13607
- **PyPI Package**: https://pypi.org/project/hierarchical-rag-retrieval/
- **Issue Tracker**: https://github.com/arthur422tp/hierarchical/issues
## 🤝 Contributing and Support
Issues and pull requests are welcome! If you run into problems or have suggestions for improvement, please get in touch.
### Development Environment Setup
```bash
# Clone the project
git clone https://github.com/arthur422tp/hierarchical.git
cd hierarchical

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install with development dependencies
pip install -e ".[dev,app]"
```
## 📜 License
This project is released under the MIT License - see the [LICENSE](LICENSE) file for details.
## 📞 Contact
- Author: arthur422tp
- Email: arthur422tp@gmail.com
- GitHub: https://github.com/arthur422tp
---
**Enjoy! If this system helps your project, please consider giving it a ⭐ star!**
Raw data
{
"_id": null,
"home_page": "https://github.com/arthur422tp/hierarchical",
"name": "hierarchical-rag-retrieval",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "rag, retrieval, hierarchical, clustering, legal, nlp, ai, machine-learning",
"author": "arthur422tp",
"author_email": "arthur422tp <arthur422tp@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/91/d2/f42d88f791601f5a6457aa88245d4c3bfda51068d73299cb27706a9784cc/hierarchical_rag_retrieval-0.1.2.tar.gz",
"platform": null,
"description": "# Legal Document Retrieval System Based on Hierarchical Clustering\n## \u968e\u5c64\u5f0f\u805a\u985e\u6cd5\u898f\u6587\u672c\u6aa2\u7d22\u7cfb\u7d71\n\n<div align=\"center\">\n\n\n\n\n\n\n**\u57fa\u65bc\u968e\u5c64\u5f0f\u805a\u985e\u8207 RAG \u7684\u6cd5\u898f\u6587\u672c\u667a\u6167\u6aa2\u7d22\u5f15\u64ce**\n\n[\ud83d\udcd6 \u5feb\u901f\u958b\u59cb](#\ud83d\ude80-\u5feb\u901f\u958b\u59cb) \u2022 [\ud83d\udcda \u4f7f\u7528\u6307\u5357](#\ud83d\udcda-\u8a73\u7d30\u4f7f\u7528\u6307\u5357) \u2022 [\ud83d\udd27 API \u53c3\u8003](#\ud83d\udd2c-api-\u53c3\u8003) \u2022 [\ud83d\udcc4 arXiv \u8ad6\u6587](https://arxiv.org/abs/2506.13607)\n\n</div>\n\n## \ud83d\udcd6 \u7cfb\u7d71\u4ecb\u7d39\n\n\u672c\u7cfb\u7d71\u662f\u4e00\u500b\u7d50\u5408 AI \u6cd5\u5f8b\u52a9\u624b\u8207\u6cd5\u898f\u67e5\u8a62\u529f\u80fd\u7684\u667a\u6167\u6aa2\u7d22\u5f15\u64ce\uff0c\u6838\u5fc3\u6280\u8853\u70ba\u968e\u5c64\u5f0f\u805a\u985e\uff08Hierarchical Clustering\uff09\u8207\u9918\u5f26\u76f8\u4f3c\u5ea6\uff08Cosine Similarity\uff09\uff0c\u4e26\u900f\u904e OpenAI \u7684 Retrieval-Augmented Generation\uff08RAG\uff09\u6280\u8853\u9032\u884c\u7b54\u6848\u751f\u6210\u3002\u9069\u7528\u65bc\u667a\u6167\u5f8b\u6240\u3001\u6cd5\u5f8b\u804a\u5929\u6a5f\u5668\u4eba\u3001\u5b78\u8853\u7814\u7a76\u6216\u591a\u8a9e\u8a00\u6cd5\u689d\u7d22\u5f15\u5834\u666f\uff0c\u80fd\u6709\u6548\u63d0\u4f9b\u6e96\u78ba\u3001\u53ef\u89e3\u91cb\u7684\u6cd5\u898f\u67e5\u8a62\u56de\u8986\u3002\n\n### \ud83c\udfaf \u6838\u5fc3\u7279\u8272\n\n- **\ud83c\udf33 \u968e\u5c64\u5f0f\u6aa2\u7d22\u6a39**\uff1a\u81ea\u52d5\u5efa\u69cb\u8a9e\u610f\u5c64\u6b21\u7d22\u5f15\u7d50\u69cb\n- **\ud83d\udd0d \u96d9\u91cd\u6aa2\u7d22\u6a21\u5f0f**\uff1a\u652f\u63f4\u76f4\u63a5\u6aa2\u7d22\u8207\u67e5\u8a62\u63d0\u53d6\u5169\u7a2e\u65b9\u5f0f\n- **\ud83e\udde0 RAG \u6280\u8853\u6574\u5408**\uff1a\u7d50\u5408 OpenAI GPT \u9032\u884c\u667a\u80fd\u7b54\u6848\u751f\u6210\n- **\u26a1 
\u9ad8\u6548\u80fd\u6aa2\u7d22**\uff1a\u7121\u9808\u624b\u52d5\u8a2d\u5b9a k \u503c\uff0c\u81ea\u52d5\u7be9\u9078\u76f8\u95dc\u6587\u672c\n- **\ud83c\udfa8 \u6a21\u7d44\u5316\u8a2d\u8a08**\uff1a\u6613\u65bc\u6574\u5408\u5230\u73fe\u6709\u5c08\u6848\u4e2d\n- **\ud83c\udf10 \u5168\u7aef\u89e3\u6c7a\u65b9\u6848**\uff1a\u5167\u5efa\u524d\u7aef UI + REST API\n- **\ud83d\udc33 Docker \u652f\u63f4**\uff1a\u652f\u63f4\u5bb9\u5668\u5316\u90e8\u7f72\uff0c\u4e00\u9375\u555f\u52d5\n\n## \ud83d\udee0\ufe0f \u6280\u8853\u67b6\u69cb\n\n| Component | Tech Used |\n|----------|------------|\n| Frontend | HTML / JavaScript / Tailwind CSS |\n| Backend | FastAPI |\n| Embedding Model | `intfloat/multilingual-e5-large` |\n| Retrieval Tree | Hierarchical Clustering + Cosine Similarity |\n| LLM API | OpenAI GPT (ChatGPT API) |\n| Containerization | Docker & Docker Compose |\n\n### \u6838\u5fc3\u7d44\u4ef6\u67b6\u69cb\n\n```\nhierarchical-rag-retrieval/\n\u251c\u2500\u2500 retrieval/ # \u6aa2\u7d22\u6838\u5fc3\u6a21\u7d44\n\u2502 \u251c\u2500\u2500 RAGTree_function.py # \u968e\u5c64\u5f0f\u6aa2\u7d22\u6a39\n\u2502 \u251c\u2500\u2500 multi_level_search.py # \u591a\u5c64\u7d22\u5f15\u6aa2\u7d22\n\u2502 \u2514\u2500\u2500 generated_function.py # \u67e5\u8a62\u63d0\u53d6\u529f\u80fd\n\u251c\u2500\u2500 utils/ # \u5de5\u5177\u6a21\u7d44\n\u2502 \u251c\u2500\u2500 word_embedding.py # \u8a5e\u5d4c\u5165\u8655\u7406\n\u2502 \u251c\u2500\u2500 word_chunking.py # \u6587\u672c\u5206\u584a\n\u2502 \u2514\u2500\u2500 query_retrieval.py # FAISS \u6aa2\u7d22\n\u251c\u2500\u2500 data_processing/ # \u8cc7\u6599\u8655\u7406\u6a21\u7d44\n\u2502 \u2514\u2500\u2500 data_dealer.py # \u8cc7\u6599\u683c\u5f0f\u8655\u7406\n\u2514\u2500\u2500 app/ # \u6f14\u793a\u61c9\u7528\n \u251c\u2500\u2500 main.py # FastAPI \u4e3b\u7a0b\u5f0f\n \u2514\u2500\u2500 static/index.html # \u524d\u7aef\u4ecb\u9762\n```\n\n## \ud83d\ude80 \u5feb\u901f\u958b\u59cb\n\n### \ud83d\udce6 \u5957\u4ef6\u5b89\u88dd\n\n```bash\npip install 
hierarchical-rag-retrieval\n```\n\n### \ud83c\udfaf \u57fa\u672c\u4f7f\u7528\u7bc4\u4f8b\n\n```python\nfrom src.retrieval import create_ahc_tree, tree_search\nfrom src.utils import WordEmbedding\n\n# 1. \u521d\u59cb\u5316\u8a5e\u5d4c\u5165\u6a21\u578b\nembedding_model = WordEmbedding()\nmodel = embedding_model.load_model()\n\n# 2. \u6e96\u5099\u60a8\u7684\u6587\u672c\u8cc7\u6599\ntexts = [\n \"\u6c11\u6cd5\u7e3d\u5247\u898f\u5b9a\u81ea\u7136\u4eba\u4e4b\u6b0a\u5229\u80fd\u529b\u59cb\u65bc\u51fa\u751f\u7d42\u65bc\u6b7b\u4ea1\",\n \"\u571f\u5730\u6cd5\u898f\u5b9a\u571f\u5730\u6240\u6709\u6b0a\u4e4b\u79fb\u8f49\u61c9\u8fa6\u7406\u767b\u8a18\",\n \"\u90fd\u5e02\u8a08\u756b\u6cd5\u898f\u5b9a\u90fd\u5e02\u8a08\u756b\u5340\u57df\u5167\u571f\u5730\u4f7f\u7528\u5206\u5340\",\n # ... \u66f4\u591a\u6587\u672c\n]\n\n# 3. \u751f\u6210\u6587\u672c\u5411\u91cf\nvectors = model.encode(texts)\n\n# 4. \u5efa\u7acb\u968e\u5c64\u5f0f\u6aa2\u7d22\u6a39\ntree_root = create_ahc_tree(vectors, texts)\n\n# 5. \u9032\u884c\u6aa2\u7d22\nquery = \"\u571f\u5730\u6240\u6709\u6b0a\u79fb\u8f49\u9700\u8981\u4ec0\u9ebc\u7a0b\u5e8f\uff1f\"\nresults = tree_search(\n tree_root, \n query, \n model, \n chunk_size=100, \n chunk_overlap=20\n)\n\n# 6. \u67e5\u770b\u7d50\u679c\nfor i, result in enumerate(results, 1):\n print(f\"{i}. 
{result}\")\n```\n\n### \ud83d\udd27 \u74b0\u5883\u8a2d\u7f6e\uff08\u6f14\u793a\u61c9\u7528\uff09\n\n#### \u524d\u7f6e\u689d\u4ef6\n\n- Python 3.8+ \u6216 Docker \u74b0\u5883\n- OpenAI API \u91d1\u9470\n\n#### \u65b9\u6cd5\u4e00\uff1a\u50b3\u7d71\u90e8\u7f72\n\n```bash\n# \u5b89\u88dd\u4f9d\u8cf4\npip install -r requirements.txt\n\n# \u8a2d\u7f6e\u74b0\u5883\u8b8a\u6578\necho \"OPENAI_API_KEY=your_openai_api_key\" > .env\n\n# \u555f\u52d5\u61c9\u7528\ncd app && python main.py\n```\n\n#### \u65b9\u6cd5\u4e8c\uff1aDocker \u90e8\u7f72 (\u63a8\u85a6)\n\n```bash\n# \u8a2d\u7f6e\u74b0\u5883\u8b8a\u6578\necho \"OPENAI_API_KEY=your_openai_api_key\" > .env\n\n# \u555f\u52d5\u670d\u52d9\ndocker-compose up -d\n\n# \u67e5\u770b\u65e5\u8a8c\ndocker-compose logs -f\n```\n\n\u61c9\u7528\u555f\u52d5\u5f8c\uff0c\u700f\u89bd\u5668\u8a2a\u554f http://localhost:8000 \u4f7f\u7528\u7cfb\u7d71\u3002\n\n## \ud83d\udcda \u8a73\u7d30\u4f7f\u7528\u6307\u5357\n\n### 1. \u968e\u5c64\u5f0f\u6aa2\u7d22\u6a39 (RAGTree)\n\n\u968e\u5c64\u5f0f\u6aa2\u7d22\u6a39\u662f\u672c\u7cfb\u7d71\u7684\u6838\u5fc3\u529f\u80fd\uff0c\u901a\u904e\u805a\u985e\u7b97\u6cd5\u81ea\u52d5\u7d44\u7e54\u6587\u672c\u5411\u91cf\u3002\n\n```python\nfrom src.retrieval import create_ahc_tree, tree_search, save_tree, load_tree\n\n# \u5efa\u7acb\u6aa2\u7d22\u6a39\ntree_root = create_ahc_tree(vectors, texts)\n\n# \u5132\u5b58\u6aa2\u7d22\u6a39\uff08\u53ef\u91cd\u8907\u4f7f\u7528\uff09\nsave_tree(tree_root, \"my_retrieval_tree.pkl\")\n\n# \u8f09\u5165\u5df2\u5132\u5b58\u7684\u6aa2\u7d22\u6a39\ntree_root = load_tree(\"my_retrieval_tree.pkl\")\n\n# \u9032\u884c\u6aa2\u7d22\nresults = tree_search(\n root=tree_root,\n query=\"\u60a8\u7684\u67e5\u8a62\u554f\u984c\",\n model=embedding_model.load_model(),\n chunk_size=100,\n chunk_overlap=20,\n max_chunks=10\n)\n```\n\n**\u53c3\u6578\u8aaa\u660e\uff1a**\n- `chunk_size`: \u6587\u672c\u5206\u584a\u5927\u5c0f\uff0c\u8f03\u5927\u7684\u503c\u4fdd\u7559\u66f4\u591a\u4e0a\u4e0b\u6587\n- 
`chunk_overlap`: \u5206\u584a\u91cd\u758a\u5927\u5c0f\uff0c\u907f\u514d\u91cd\u8981\u8cc7\u8a0a\u88ab\u622a\u65b7\n- `max_chunks`: \u6700\u5927\u5206\u584a\u6578\u91cf\uff0c\u63a7\u5236\u8655\u7406\u6548\u7387\n\n### 2. \u591a\u5c64\u7d22\u5f15\u6aa2\u7d22(\u76ee\u524d\u5c1a\u672a\u5b8c\u6210\uff0c\u4ee5\u4e0b\u70ba\u9810\u8a08\u60c5\u5f62\uff5cNot done yet)\n\n\u91dd\u5c0d\u5927\u578b\u6587\u672c\u5eab\u512a\u5316\u7684\u591a\u5c64\u7d22\u5f15\u7cfb\u7d71\u3002\n\n```python\nfrom src.retrieval import (\n build_multi_level_index_from_files, \n multi_level_tree_search,\n multi_level_extraction_tree_search\n)\n\n# \u5f9e\u6a94\u6848\u5efa\u7acb\u591a\u5c64\u7d22\u5f15\nindex = build_multi_level_index_from_files(\n embeddings_path=\"embeddings.pkl\",\n texts_path=\"texts.pkl\"\n)\n\n# \u76f4\u63a5\u6aa2\u7d22\nresults = multi_level_tree_search(\n index=index,\n query=\"\u67e5\u8a62\u554f\u984c\",\n model=model,\n chunk_size=100,\n chunk_overlap=20\n)\n\n# \u4f7f\u7528\u67e5\u8a62\u63d0\u53d6\u7684\u6aa2\u7d22\uff08\u9069\u5408\u8907\u96dc\u554f\u984c\uff09\nresults = multi_level_extraction_tree_search(\n index=index,\n query=\"\u8907\u96dc\u7684\u6cd5\u5f8b\u554f\u984c\u63cf\u8ff0...\",\n model=model,\n chunk_size=100,\n chunk_overlap=20,\n llm=openai_llm # OpenAI \u8a9e\u8a00\u6a21\u578b\n)\n```\n\n### 3. 
\u67e5\u8a62\u63d0\u53d6\u8207\u512a\u5316\n\n\u5c0d\u65bc\u8907\u96dc\u6216\u5197\u9577\u7684\u67e5\u8a62\uff0c\u7cfb\u7d71\u53ef\u4ee5\u81ea\u52d5\u63d0\u53d6\u6838\u5fc3\u554f\u984c\u3002\n\n```python\nfrom src.retrieval import extraction_tree_search\nfrom langchain_openai import ChatOpenAI\n\n# \u8a2d\u5b9a OpenAI \u6a21\u578b\nllm = ChatOpenAI(\n model=\"gpt-3.5-turbo\",\n api_key=\"your-openai-api-key\"\n)\n\n# \u4f7f\u7528\u67e5\u8a62\u63d0\u53d6\u9032\u884c\u6aa2\u7d22\ncomplex_query = \"\"\"\n\u6211\u60f3\u4e86\u89e3\u95dc\u65bc\u571f\u5730\u8cb7\u8ce3\u7684\u6cd5\u5f8b\u898f\u5b9a\uff0c\u7279\u5225\u662f\u5728\u90fd\u5e02\u8a08\u756b\u5340\u57df\u5167\uff0c\n\u5982\u679c\u6211\u8981\u8cfc\u8cb7\u4e00\u584a\u571f\u5730\u7528\u65bc\u5546\u696d\u7528\u9014\uff0c\u9700\u8981\u6ce8\u610f\u54ea\u4e9b\u6cd5\u5f8b\u689d\u6587\uff1f\n\u53e6\u5916\uff0c\u571f\u5730\u6240\u6709\u6b0a\u7684\u79fb\u8f49\u7a0b\u5e8f\u662f\u4ec0\u9ebc\uff1f\n\"\"\"\n\nresults = extraction_tree_search(\n root=tree_root,\n query=complex_query,\n model=model,\n chunk_size=100,\n chunk_overlap=20,\n llm=llm\n)\n```\n\n### 4. \u81ea\u5b9a\u7fa9\u6587\u672c\u8655\u7406\n\n```python\nfrom src.utils import WordEmbedding, RagChunking\nfrom src.data_processing import DataDealer\nimport pickle\n\n# \u8655\u7406\u81ea\u5b9a\u7fa9\u6587\u672c\u8cc7\u6599\ndealer = DataDealer()\n\n# \u6e96\u5099\u6587\u672c\u8cc7\u6599\ncustom_texts = [\n \"\u60a8\u7684\u7b2c\u4e00\u500b\u6587\u6a94\u5167\u5bb9...\",\n \"\u60a8\u7684\u7b2c\u4e8c\u500b\u6587\u6a94\u5167\u5bb9...\",\n # ... 
\u66f4\u591a\u6587\u672c\n]\n\n# \u751f\u6210\u5d4c\u5165\u5411\u91cf\nembedding_model = WordEmbedding()\nmodel = embedding_model.load_model()\nvectors = model.encode(custom_texts)\n\n# \u5132\u5b58\u8655\u7406\u5f8c\u7684\u8cc7\u6599\nwith open('custom_texts.pkl', 'wb') as f:\n pickle.dump(custom_texts, f)\nwith open('custom_embeddings.pkl', 'wb') as f:\n pickle.dump(vectors, f)\n\n# \u6587\u672c\u5206\u584a\u8655\u7406\nchunker = RagChunking(\"\u9577\u6587\u672c\u5167\u5bb9...\")\nchunks = chunker.text_chunking(chunk_size=200, chunk_overlap=50)\n```\n\n## \ud83c\udfa8 \u9032\u968e\u7528\u6cd5\uff08\u975e\u4e3b\u8981\u529f\u80fd\uff0c\u6709\u4e9b\u5df2\u7d93\u5ee2\u68c4\uff09\n\n### 1. \u81ea\u5b9a\u7fa9\u91cd\u6392\u5e8f(\u9700\u81ea\u884c\u5c0e\u5165cross-encoder model)\n\n```python\nfrom src.retrieval import rerank_texts\n\n# \u5c0d\u6aa2\u7d22\u7d50\u679c\u9032\u884c\u91cd\u65b0\u6392\u5e8f\nquery = \"\u67e5\u8a62\u554f\u984c\"\npassages = [\"\u6587\u6a941\", \"\u6587\u6a942\", \"\u6587\u6a943\"]\nreranked_passages = rerank_texts(query, passages, model)\n```\n\n### 2. \u6279\u6b21\u8655\u7406\n\n```python\ndef batch_search(queries, tree_root, model):\n \"\"\"\u6279\u6b21\u8655\u7406\u591a\u500b\u67e5\u8a62\"\"\"\n all_results = {}\n for query in queries:\n results = tree_search(tree_root, query, model, 100, 20)\n all_results[query] = results\n return all_results\n\nqueries = [\n \"\u571f\u5730\u6cd5\u76f8\u95dc\u554f\u984c\",\n \"\u6c11\u6cd5\u7e3d\u5247\u898f\u5b9a\", \n \"\u90fd\u5e02\u8a08\u756b\u6cd5\u689d\u6587\"\n]\n\nbatch_results = batch_search(queries, tree_root, model)\n```\n\n### 3. 
\u7d50\u679c\u5f8c\u8655\u7406\n\n```python\ndef process_results(results, max_results=5):\n \"\"\"\u8655\u7406\u548c\u904e\u6ffe\u6aa2\u7d22\u7d50\u679c\"\"\"\n # \u53bb\u91cd\n unique_results = list(set(results))\n \n # \u9577\u5ea6\u904e\u6ffe\n filtered_results = [r for r in unique_results if len(r.strip()) > 20]\n \n # \u9650\u5236\u6578\u91cf\n return filtered_results[:max_results]\n\nprocessed_results = process_results(results)\n```\n\n## \ud83d\udcdd \u66f4\u65b0\u9644\u8a3b\uff08\u6700\u8fd1\u8b8a\u66f4\uff09\n\n- \u5916\u90e8\u5316\u53c3\u6578\uff1a\u73fe\u5728\u53ef\u900f\u904e\u74b0\u5883\u8b8a\u6578\u8abf\u6574\uff0c\u4e0d\u9700\u6539\u78bc\u3002\n - LLM\uff1a`OPENAI_API_KEY`\u3001`OPENAI_MODEL`\u3001`OPENAI_TEMPERATURE`\u3001`OPENAI_TOP_P`\u3001`OPENAI_MAX_TOKENS`\n - Embedding\uff1a`EMBEDDING_MODEL_NAME`\n - \u6aa2\u7d22\uff1a`CHUNK_SIZE`\u3001`CHUNK_OVERLAP`\u3001`MAX_CHUNKS`\u3001`MAX_RESULTS`\u3001`TOP_K`\n - Rerank\uff1a`RERANKER_ENABLE_IN_PIPELINE`\uff08\u9810\u8a2d false\uff09\u3001`RERANKER_USE_CROSS_ENCODER`\uff08\u9810\u8a2d false\uff09\u3001`RERANKER_MODEL_NAME`\n - API\uff1a`CORS_ORIGINS`\u3001`API_TITLE`\n\n- Rerank \u7ba1\u7dda\u958b\u95dc\uff1a\n - \u7576 `RERANKER_ENABLE_IN_PIPELINE=true` \u4e14\u6aa2\u7d22\u7d50\u679c\u6578\u91cf > `MAX_RESULTS` \u6642\uff0c\u7cfb\u7d71\u6703\u81ea\u52d5\u5c0d\u5019\u9078\u7d50\u679c\u9032\u884c\u91cd\u6392\u5e8f\u3002\n - \u82e5\u540c\u6642\u8a2d\u5b9a `RERANKER_USE_CROSS_ENCODER=true`\uff0c\u6703\u6539\u7528 Cross-Encoder \u9032\u884c\u914d\u5c0d\u6253\u5206\u91cd\u6392\uff08\u53ef\u900f\u904e `RERANKER_MODEL_NAME` \u6307\u5b9a\u6a21\u578b\uff09\u3002\n\n- .env 
\u7bc4\u4f8b\uff1a\n\n```bash\nOPENAI_API_KEY=sk-xxxx\nOPENAI_MODEL=gpt-4o-mini\nOPENAI_TEMPERATURE=0.2\nOPENAI_TOP_P=0.9\nOPENAI_MAX_TOKENS=4096\nEMBEDDING_MODEL_NAME=intfloat/multilingual-e5-large\nCHUNK_SIZE=150\nCHUNK_OVERLAP=50\nMAX_CHUNKS=12\nMAX_RESULTS=100\nTOP_K=15\nRERANKER_ENABLE_IN_PIPELINE=true\nRERANKER_USE_CROSS_ENCODER=true\nRERANKER_MODEL_NAME=cross-encoder/ms-marco-MiniLM-L-6-v2\nCORS_ORIGINS=http://localhost:3000,https://your.domain\nAPI_TITLE=Hierarchical RAG API\n```\n\n## \ud83d\udd27 \u914d\u7f6e\u53c3\u6578\n\n### \u8a5e\u5d4c\u5165\u6a21\u578b\u914d\u7f6e\n\n```python\n# \u9810\u8a2d\u4f7f\u7528 intfloat/multilingual-e5-large\n# \u60a8\u4e5f\u53ef\u4ee5\u4f7f\u7528\u5176\u4ed6 Sentence Transformers \u6a21\u578b\n\nfrom sentence_transformers import SentenceTransformer\n\n# \u81ea\u5b9a\u7fa9\u6a21\u578b\ncustom_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')\n\n# \u5728\u6aa2\u7d22\u4e2d\u4f7f\u7528\nresults = tree_search(tree_root, query, custom_model, 100, 20)\n```\n\n### \u7cfb\u7d71\u53c3\u6578\u8abf\u6574\n\n```python\n# \u91dd\u5c0d\u4e0d\u540c\u5834\u666f\u7684\u53c3\u6578\u5efa\u8b70\n\n# \u7cbe\u78ba\u6aa2\u7d22\uff08\u8f03\u6162\u4f46\u66f4\u6e96\u78ba\uff09\nresults = tree_search(\n tree_root, query, model,\n chunk_size=50, # \u8f03\u5c0f\u7684\u5206\u584a\n chunk_overlap=10, # \u8f03\u5c0f\u7684\u91cd\u758a\n max_chunks=5 # \u8f03\u5c11\u7684\u5206\u584a\u6578\n)\n\n# \u5feb\u901f\u6aa2\u7d22\uff08\u8f03\u5feb\u4f46\u53ef\u80fd\u907a\u6f0f\u7d30\u7bc0\uff09\nresults = tree_search(\n tree_root, query, model,\n chunk_size=200, # \u8f03\u5927\u7684\u5206\u584a\n chunk_overlap=40, # \u8f03\u5927\u7684\u91cd\u758a\n max_chunks=15 # \u8f03\u591a\u7684\u5206\u584a\u6578\n)\n```\n\n## \ud83d\udcca \u6548\u80fd\u512a\u5316\u5efa\u8b70\n\n### 1. 
\u8a18\u61b6\u9ad4\u7ba1\u7406\n\n```python\n# \u5c0d\u65bc\u5927\u578b\u6587\u672c\u5eab\uff0c\u5efa\u8b70\u5206\u6279\u8655\u7406\ndef process_large_corpus(texts, batch_size=1000):\n trees = []\n for i in range(0, len(texts), batch_size):\n batch = texts[i:i+batch_size]\n batch_vectors = model.encode(batch)\n tree = create_ahc_tree(batch_vectors, batch)\n trees.append(tree)\n return trees\n```\n\n### 2. \u5feb\u53d6\u6a5f\u5236\n\n```python\nimport os\n\n# \u6aa2\u67e5\u662f\u5426\u5df2\u6709\u5efa\u7acb\u597d\u7684\u6aa2\u7d22\u6a39\ntree_file = \"retrieval_tree.pkl\"\nif os.path.exists(tree_file):\n tree_root = load_tree(tree_file)\nelse:\n tree_root = create_ahc_tree(vectors, texts)\n save_tree(tree_root, tree_file)\n```\n\n## \ud83d\udd0d \u5be6\u969b\u61c9\u7528\u6848\u4f8b\n\n### \u6cd5\u5f8b\u6587\u4ef6\u6aa2\u7d22\n\n```python\n# \u6cd5\u5f8b\u689d\u6587\u6aa2\u7d22\u7cfb\u7d71\nlegal_texts = [\n \"\u6c11\u6cd5\u7b2c\u4e00\u689d\uff1a\u6c11\u4e8b\uff0c\u6cd5\u5f8b\u6240\u672a\u898f\u5b9a\u8005\uff0c\u4f9d\u7fd2\u6163...\",\n \"\u5211\u6cd5\u7b2c\u5341\u689d\uff1a\u7a31\u4ee5\u4e0a\u3001\u4ee5\u4e0b\u3001\u4ee5\u5167\u3001\u4ee5\u5916\u8005...\",\n # ... \u66f4\u591a\u6cd5\u689d\n]\n\n# \u5efa\u7acb\u6cd5\u5f8b\u6aa2\u7d22\u7cfb\u7d71\nlegal_vectors = model.encode(legal_texts)\nlegal_tree = create_ahc_tree(legal_vectors, legal_texts)\n\n# \u67e5\u8a62\u6cd5\u5f8b\u554f\u984c\nlegal_query = \"\u95dc\u65bc\u5951\u7d04\u7684\u6cd5\u5f8b\u6548\u529b\u898f\u5b9a\"\nlegal_results = tree_search(legal_tree, legal_query, model, 100, 20)\n```\n\n### \u5b78\u8853\u8ad6\u6587\u6aa2\u7d22\n\n```python\n# \u5b78\u8853\u6587\u737b\u6aa2\u7d22\npapers = [\n \"\u672c\u7814\u7a76\u63a2\u8a0e\u6a5f\u5668\u5b78\u7fd2\u5728\u81ea\u7136\u8a9e\u8a00\u8655\u7406\u4e2d\u7684\u61c9\u7528...\",\n \"\u6df1\u5ea6\u5b78\u7fd2\u6a21\u578b\u5728\u5716\u50cf\u8b58\u5225\u9818\u57df\u7684\u6700\u65b0\u9032\u5c55...\",\n # ... 
\u66f4\u591a\u8ad6\u6587\u6458\u8981\n]\n\nacademic_vectors = model.encode(papers)\nacademic_tree = create_ahc_tree(academic_vectors, papers)\n\nresearch_query = \"transformer\u6a21\u578b\u5728\u6587\u672c\u5206\u985e\u7684\u6548\u679c\"\nacademic_results = tree_search(academic_tree, research_query, model, 150, 30)\n```\n\n## \ud83d\udd2c API \u53c3\u8003\n\n### Web \u61c9\u7528 API\n\n- `GET /`: \u4e3b\u9801\u9762\n- `GET /available-texts`: \u7372\u53d6\u53ef\u7528\u7684\u6587\u672c\u5217\u8868\n- `POST /query`: \u63d0\u4ea4\u67e5\u8a62\u8acb\u6c42\n - \u8acb\u6c42\u9ad4\uff1a\n ```json\n {\n \"query\": \"\u60a8\u7684\u554f\u984c\",\n \"use_extraction\": true/false,\n \"text_name\": \"\u6587\u672c\u540d\u7a31\",\n \"prompt_type\": \"task_oriented\" | \"cot\"\n }\n ```\n - \u97ff\u61c9\uff1a\n ```json\n {\n \"answer\": \"\u7cfb\u7d71\u56de\u7b54\",\n \"retrieved_docs\": [\"\u6aa2\u7d22\u5230\u7684\u6587\u6a941\", \"\u6aa2\u7d22\u5230\u7684\u6587\u6a942\", ...]\n }\n ```\n\n### Python API\n\n#### \u6838\u5fc3\u6aa2\u7d22\u51fd\u6578\n\n```python\n# \u4e3b\u8981\u6aa2\u7d22\u51fd\u6578\nfrom src.retrieval import create_ahc_tree, tree_search, save_tree, load_tree\n\n# \u591a\u5c64\u6aa2\u7d22\u51fd\u6578\nfrom src.retrieval import (\n build_multi_level_index_from_files,\n multi_level_tree_search,\n multi_level_extraction_tree_search\n)\n\n# \u67e5\u8a62\u63d0\u53d6\u51fd\u6578\nfrom src.retrieval import extraction_tree_search\n\n# \u5de5\u5177\u51fd\u6578\nfrom src.utils import WordEmbedding, RagChunking\nfrom src.data_processing import DataDealer\n```\n\n## \ud83d\udc1b \u5e38\u898b\u554f\u984c\u8207\u89e3\u6c7a\u65b9\u6848\n\n### Q: \u6aa2\u7d22\u7d50\u679c\u4e0d\u5920\u7cbe\u78ba\uff1f\n**A:** \u5617\u8a66\u8abf\u6574\u53c3\u6578\uff1a\n- \u6e1b\u5c11 `chunk_size` \u63d0\u9ad8\u7cbe\u5ea6\n- \u589e\u52a0 `max_chunks` \u7372\u5f97\u66f4\u591a\u5019\u9078\u7d50\u679c\n- \u4f7f\u7528\u67e5\u8a62\u63d0\u53d6\u529f\u80fd\u8655\u7406\u8907\u96dc\u554f\u984c\n\n### Q: 
\u8655\u7406\u901f\u5ea6\u8f03\u6162\uff1f\n**A:** \u512a\u5316\u5efa\u8b70\uff1a\n- \u589e\u52a0 `chunk_size` \u6e1b\u5c11\u5206\u584a\u6578\u91cf\n- \u6e1b\u5c11 `max_chunks` \u9650\u5236\u8655\u7406\u7bc4\u570d\n- \u4f7f\u7528\u591a\u5c64\u7d22\u5f15\u4ee3\u66ff\u55ae\u4e00\u6aa2\u7d22\u6a39\n\n### Q: \u8a18\u61b6\u9ad4\u4f7f\u7528\u904e\u591a\uff1f\n**A:** \u8a18\u61b6\u9ad4\u7ba1\u7406\uff1a\n- \u5206\u6279\u8655\u7406\u5927\u578b\u6587\u672c\u5eab\n- \u5b9a\u671f\u6e05\u7406\u4e0d\u9700\u8981\u7684\u8b8a\u6578\n- \u4f7f\u7528\u751f\u6210\u5668\u800c\u975e\u5217\u8868\u5b58\u5132\u5927\u91cf\u8cc7\u6599\n\n### Q: \u5982\u4f55\u8655\u7406\u4e0d\u540c\u8a9e\u8a00\u7684\u6587\u672c\uff1f\n**A:** \u591a\u8a9e\u8a00\u652f\u63f4\uff1a\n- \u4f7f\u7528\u591a\u8a9e\u8a00\u5d4c\u5165\u6a21\u578b\uff08\u5982\u9810\u8a2d\u7684 multilingual-e5-large\uff09\n- \u78ba\u4fdd\u67e5\u8a62\u8a9e\u8a00\u8207\u6587\u672c\u8a9e\u8a00\u4e00\u81f4\n- \u8003\u616e\u4f7f\u7528\u8a9e\u8a00\u7279\u5b9a\u7684\u5206\u8a5e\u7b56\u7565\n\n## \ud83d\udcc4 \u7cfb\u7d71\u529f\u80fd\u8aaa\u660e\n\n### \u6aa2\u7d22\u6d41\u7a0b\n\n\u672c\u7cfb\u7d71\u63d0\u4f9b\u5169\u7a2e\u6aa2\u7d22\u6a21\u5f0f\uff1a\n\n1. **\u76f4\u63a5\u6aa2\u7d22** - \u9069\u5408\u7c21\u55ae\u660e\u78ba\u7684\u554f\u984c\n - \u5c07\u7528\u6236\u8f38\u5165\u76f4\u63a5\u5411\u91cf\u5316\n - \u901a\u904e\u6aa2\u7d22\u6a39\u5c0b\u627e\u76f8\u4f3c\u6587\u672c\u7247\u6bb5\n - \u4f7f\u7528\u8a9e\u8a00\u6a21\u578b\u751f\u6210\u7b54\u6848\n\n2. 
### Answering styles

#### Task-oriented
- Traits: concise and direct; provides an answer quickly
- Best for: questions that need a clear statutory interpretation or how-to guidance

#### Chain of Thought (CoT)
- Traits: detailed analysis with an explicit reasoning process
- Best for: complex legal reasoning or questions requiring multi-step inference

## 📦 Deployment & Distribution

### Install from PyPI

```bash
pip install hierarchical-rag-retrieval
```

### Install the development version from GitHub

```bash
pip install git+https://github.com/arthur422tp/hierarchical.git
```

### Local development install

```bash
# Clone the project
git clone https://github.com/arthur422tp/hierarchical.git
cd hierarchical

# Install development dependencies
pip install -e .[dev]
```

## 📚 More Resources

- **GitHub Repository**: https://github.com/arthur422tp/hierarchical
- **arXiv paper**: https://arxiv.org/abs/2506.13607
- **PyPI package**: https://pypi.org/project/hierarchical-rag-retrieval/
- **Issue tracker**: https://github.com/arthur422tp/hierarchical/issues

## 🤝 Contributing & Support
Issues and Pull Requests are welcome! If you run into any problems or have suggestions for improvement, please feel free to contact us.

### Development environment setup

```bash
# Clone the project
git clone https://github.com/arthur422tp/hierarchical.git
cd hierarchical

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install development dependencies
pip install -e .[dev,app]
```

## 📜 License

This project is released under the MIT License - see the [LICENSE](LICENSE) file for details.

## 📞 Contact

- Author: arthur422tp
- Email: arthur422tp@gmail.com
- GitHub: https://github.com/arthur422tp

---

**Enjoy! If this system helps your project, please consider giving it a ⭐ Star!**
"bugtrack_url": null,
"license": "MIT",
"summary": "AI-Powered Legal Document Retrieval Engine based on Hierarchical Clustering & RAG",
"version": "0.1.2",
"project_urls": {
"Bug Tracker": "https://github.com/arthur422tp/hierarchical/issues",
"Documentation": "https://github.com/arthur422tp/hierarchical#readme",
"Homepage": "https://github.com/arthur422tp/hierarchical",
"Repository": "https://github.com/arthur422tp/hierarchical",
"arXiv Paper": "https://arxiv.org/abs/2506.13607"
},
"split_keywords": [
"rag",
" retrieval",
" hierarchical",
" clustering",
" legal",
" nlp",
" ai",
" machine-learning"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "43e5079ba412e33a4e3d4a6dcb8fd0c48b0ea76f3f81c35a02583ed48c0860c3",
"md5": "ab09e799f11972fc092b5fd78d14e196",
"sha256": "deac79430e09096584ac00d43580b2cb01a1245404a4c3ee23a9b26ab66845ae"
},
"downloads": -1,
"filename": "hierarchical_rag_retrieval-0.1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "ab09e799f11972fc092b5fd78d14e196",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 23635,
"upload_time": "2025-08-13T04:08:30",
"upload_time_iso_8601": "2025-08-13T04:08:30.601237Z",
"url": "https://files.pythonhosted.org/packages/43/e5/079ba412e33a4e3d4a6dcb8fd0c48b0ea76f3f81c35a02583ed48c0860c3/hierarchical_rag_retrieval-0.1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "91d2f42d88f791601f5a6457aa88245d4c3bfda51068d73299cb27706a9784cc",
"md5": "84ab7214478cc30445cc3e0270d08664",
"sha256": "889d45f46f618daf99adab9a2ac9942d4451cb6bc4981b9db42bbf7714510af2"
},
"downloads": -1,
"filename": "hierarchical_rag_retrieval-0.1.2.tar.gz",
"has_sig": false,
"md5_digest": "84ab7214478cc30445cc3e0270d08664",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 2419554,
"upload_time": "2025-08-13T04:08:32",
"upload_time_iso_8601": "2025-08-13T04:08:32.926250Z",
"url": "https://files.pythonhosted.org/packages/91/d2/f42d88f791601f5a6457aa88245d4c3bfda51068d73299cb27706a9784cc/hierarchical_rag_retrieval-0.1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-13 04:08:32",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "arthur422tp",
"github_project": "hierarchical",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "fastapi",
"specs": [
[
"==",
"0.104.1"
]
]
},
{
"name": "uvicorn",
"specs": [
[
"==",
"0.24.0"
]
]
},
{
"name": "python-multipart",
"specs": [
[
"==",
"0.0.6"
]
]
},
{
"name": "pydantic",
"specs": [
[
"==",
"2.4.2"
]
]
},
{
"name": "sentence-transformers",
"specs": [
[
"==",
"2.2.2"
]
]
},
{
"name": "faiss-cpu",
"specs": [
[
"==",
"1.7.4"
]
]
},
{
"name": "langchain",
"specs": [
[
"==",
"0.3.26"
]
]
},
{
"name": "langchain-openai",
"specs": [
[
"==",
"0.0.2"
]
]
},
{
"name": "torch",
"specs": [
[
"==",
"2.1.1"
]
]
},
{
"name": "numpy",
"specs": [
[
"==",
"2.3.1"
]
]
},
{
"name": "scipy",
"specs": [
[
"==",
"1.16.0"
]
]
},
{
"name": "fastcluster",
"specs": [
[
"==",
"1.3.0"
]
]
},
{
"name": "scikit-learn",
"specs": [
[
"==",
"1.3.2"
]
]
},
{
"name": "python-dotenv",
"specs": [
[
"==",
"1.0.0"
]
]
},
{
"name": "langchain_community",
"specs": [
[
"==",
"0.0.10"
]
]
},
{
"name": "pandas",
"specs": [
[
"==",
"2.1.0"
]
]
}
],
"lcname": "hierarchical-rag-retrieval"
}