langchain-clickzetta


Namelangchain-clickzetta JSON
Version 0.1.13 PyPI version JSON
download
home_pageNone
SummaryAn integration package connecting ClickZetta and LangChain
upload_time2025-09-20 02:20:27
maintainerNone
docs_urlNone
authorNone
requires_python>=3.9
licenseMIT
keywords clickzetta embeddings full-text-search langchain llm sql vector-database
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # LangChain ClickZetta

[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

🚀 **Enterprise-grade LangChain integration for ClickZetta** - Unlock the power of cloud-native lakehouse with AI-driven SQL queries, high-performance vector search, and intelligent full-text retrieval in a unified platform.

## 📖 Table of Contents

- [Why ClickZetta + LangChain?](#-why-clickzetta--langchain)
- [Core Features](#️-core-features)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Storage Services](#-storage-services)
- [Comparison with Alternatives](#-comparison-with-alternatives)
- [Advanced Usage](#advanced-usage)
- [Testing](#testing)
- [Development](#development)
- [Contributing](#contributing)

## 🚀 Why ClickZetta + LangChain?

### 🏆 Unique Advantages

**1. Native Lakehouse Architecture**
- ClickZetta's cloud-native lakehouse provides 10x performance improvement over traditional Spark-based architectures
- Unified storage and compute for all data types (structured, semi-structured, unstructured)
- Real-time incremental processing capabilities

**2. True Hybrid Search in Single Table**
- Industry-first single-table hybrid search combining vector and full-text indexes
- No complex joins or multiple tables needed - everything in one place
- Atomic MERGE operations for consistent data updates

**3. Enterprise-Grade Storage Services**
- Complete LangChain BaseStore implementation with sync/async support
- Native Volume integration for binary file storage (models, embeddings)
- SQL-queryable document storage with JSON metadata
- Atomic UPSERT operations using ClickZetta's MERGE INTO

**4. Advanced Chinese Language Support**
- Built-in Chinese text analyzers (IK, standard, keyword)
- Optimized for bilingual (Chinese/English) AI applications
- DashScope integration for state-of-the-art Chinese embeddings

**5. Production-Ready Features**
- Connection pooling and query optimization
- Comprehensive error handling and logging
- Full test coverage (unit + integration)
- Type-safe operations throughout

## 🛠️ Core Features

### 🧠 AI-Powered Query Interface
- **Natural Language to SQL**: Convert questions to optimized ClickZetta SQL
- **Context-Aware**: Understands table schemas and relationships
- **Bilingual Support**: Works seamlessly with Chinese and English queries

### 🔍 Advanced Search Capabilities
- **Vector Search**: High-performance embedding-based similarity search
- **Full-Text Search**: Enterprise-grade inverted index with multiple analyzers
- **True Hybrid Search**: Single-table combined vector + text search (industry first)
- **Metadata Filtering**: Complex filtering with JSON metadata support

### 💾 Enterprise Storage Solutions
- **ClickZettaStore**: High-performance key-value storage using SQL tables
- **ClickZettaDocumentStore**: Structured document storage with queryable metadata
- **ClickZettaFileStore**: Binary file storage using native ClickZetta Volume
- **ClickZettaVolumeStore**: Direct Volume integration for maximum performance

### 🔄 Production-Grade Operations
- **Atomic UPSERT**: MERGE INTO operations for data consistency
- **Batch Processing**: Efficient bulk operations for large datasets
- **Connection Management**: Pooling and automatic reconnection
- **Type Safety**: Full type annotations and runtime validation

### 🎯 LangChain Compatibility
- **BaseStore Interface**: 100% compatible with LangChain storage standards
- **Async Support**: Full async/await pattern implementation
- **Chain Integration**: Seamless integration with LangChain chains and agents
- **Memory Systems**: Persistent chat history and conversation memory

## Installation

### From PyPI (Recommended)

```bash
pip install langchain-clickzetta
```

### Development Installation

```bash
git clone https://github.com/yunqiqiliang/langchain-clickzetta.git
cd langchain-clickzetta/libs/clickzetta
pip install -e ".[dev]"
```

### Local Installation from Source

```bash
git clone https://github.com/yunqiqiliang/langchain-clickzetta.git
cd langchain-clickzetta/libs/clickzetta
pip install .
```

## Quick Start

### Basic Setup

```python
from langchain_clickzetta import ClickZettaEngine, ClickZettaSQLChain, ClickZettaVectorStore
from langchain_community.embeddings import DashScopeEmbeddings
from langchain_community.llms import Tongyi

# Initialize ClickZetta engine
# ClickZetta requires exactly 7 connection parameters
engine = ClickZettaEngine(
    service="your-service",
    instance="your-instance",
    workspace="your-workspace",
    schema="your-schema",
    username="your-username",
    password="your-password",
    vcluster="your-vcluster"  
)

# Initialize embeddings (DashScope recommended for Chinese/English support)
embeddings = DashScopeEmbeddings(
    dashscope_api_key="your-dashscope-api-key",
    model="text-embedding-v4"
)

# Initialize LLM
llm = Tongyi(dashscope_api_key="your-dashscope-api-key")
```

### SQL Queries

```python
# Create SQL chain
sql_chain = ClickZettaSQLChain.from_engine(
    engine=engine,
    llm=llm,
    return_sql=True
)

# Ask questions in natural language
result = sql_chain.invoke({
    "query": "How many users do we have in the database?"
})

print(result["result"])  # Natural language answer
print(result["sql_query"])  # Generated SQL query
```

### Vector Storage

```python
from langchain_core.documents import Document

# Create vector store
vector_store = ClickZettaVectorStore(
    engine=engine,
    embeddings=embeddings,
    table_name="my_vectors",
    vector_element_type="float"  # Options: float, int, tinyint
)

# Add documents
documents = [
    Document(
        page_content="ClickZetta is a high-performance analytics database.",
        metadata={"category": "database", "type": "analytics"}
    ),
    Document(
        page_content="LangChain enables building applications with LLMs.",
        metadata={"category": "framework", "type": "ai"}
    )
]

vector_store.add_documents(documents)

# Search for similar documents
results = vector_store.similarity_search(
    "What is ClickZetta?",
    k=2
)

for doc in results:
    print(doc.page_content)
```

### Full-text Search

```python
from langchain_clickzetta.retrievers import ClickZettaFullTextRetriever

# Create full-text retriever
retriever = ClickZettaFullTextRetriever(
    engine=engine,
    table_name="my_documents",
    search_type="phrase",
    k=5
)

# Add documents to search index
retriever.add_documents(documents)

# Search documents
results = retriever.get_relevant_documents("ClickZetta database")
for doc in results:
    print(f"Score: {doc.metadata.get('relevance_score', 'N/A')}")
    print(f"Content: {doc.page_content}")
```

### True Hybrid Search (Single Table)

```python
from langchain_clickzetta import ClickZettaHybridStore, ClickZettaUnifiedRetriever

# Create true hybrid store (single table with both vector + inverted indexes)
hybrid_store = ClickZettaHybridStore(
    engine=engine,
    embeddings=embeddings,
    table_name="hybrid_docs",
    text_analyzer="ik",  # Chinese text analyzer
    distance_metric="cosine"
)

# Add documents to hybrid store
documents = [
    Document(page_content="云器 Lakehouse 是由云器科技完全自主研发的新一代云湖仓。使用增量计算的数据计算引擎,性能可以提升至传统开源架构例如Spark的 10倍,实现了海量数据的全链路-低成本-实时化处理,为AI 创新提供了支持全类型数据整合、存储与计算的平台,帮助企业从传统的开源 Spark 体系升级到 AI 时代的数据基础设施。"),
    Document(page_content="LangChain enables building LLM applications")
]
hybrid_store.add_documents(documents)

# Create unified retriever for hybrid search
retriever = ClickZettaUnifiedRetriever(
    hybrid_store=hybrid_store,
    search_type="hybrid",  # "vector", "fulltext", or "hybrid"
    alpha=0.5,  # Balance between vector and full-text search
    k=5
)

# Search using hybrid approach
results = retriever.invoke("analytics database")
for doc in results:
    print(f"Content: {doc.page_content}")
```

### Chat Message History

```python
from langchain_clickzetta import ClickZettaChatMessageHistory
from langchain_core.messages import HumanMessage, AIMessage

# Create chat history
chat_history = ClickZettaChatMessageHistory(
    engine=engine,
    session_id="user_123",
    table_name="chat_sessions"
)

# Add messages
chat_history.add_message(HumanMessage(content="Hello!"))
chat_history.add_message(AIMessage(content="Hi there! How can I help you?"))

# Retrieve conversation history
messages = chat_history.messages
for message in messages:
    print(f"{message.__class__.__name__}: {message.content}")
```

## Configuration

### Environment Variables

You can configure ClickZetta connection using environment variables:

```bash
export CLICKZETTA_SERVICE="your-service"
export CLICKZETTA_INSTANCE="your-instance"
export CLICKZETTA_WORKSPACE="your-workspace"
export CLICKZETTA_SCHEMA="your-schema"
export CLICKZETTA_USERNAME="your-username"
export CLICKZETTA_PASSWORD="your-password"
export CLICKZETTA_VCLUSTER="your-vcluster"  # Required
```

### Connection Options

```python
engine = ClickZettaEngine(
    service="your-service",
    instance="your-instance",
    workspace="your-workspace",
    schema="your-schema",
    username="your-username",
    password="your-password",
    vcluster="your-vcluster",  # Required parameter
    connection_timeout=30,      # Connection timeout in seconds
    query_timeout=300,         # Query timeout in seconds
    hints={                    # Custom query hints
        "sdk.job.timeout": 600,
        "query_tag": "My Application"
    }
)
```

## Advanced Usage

### Custom SQL Prompts

```python
from langchain_core.prompts import PromptTemplate

custom_prompt = PromptTemplate(
    input_variables=["input", "table_info", "dialect"],
    template="""
    You are a ClickZetta SQL expert. Given the input question and table information,
    write a syntactically correct {dialect} query.

    Tables: {table_info}
    Question: {input}

    SQL Query:"""
)

sql_chain = ClickZettaSQLChain(
    engine=engine,
    llm=llm,
    sql_prompt=custom_prompt
)
```

### Vector Store with Custom Distance Metrics

```python
vector_store = ClickZettaVectorStore(
    engine=engine,
    embeddings=embeddings,
    distance_metric="euclidean",  # or "cosine", "manhattan"
    vector_dimension=1536,
    vector_element_type="float"  # or "int", "tinyint"
)
```

### Metadata Filtering

```python
# Search with metadata filters
results = vector_store.similarity_search(
    "machine learning",
    k=5,
    filter={"category": "tech", "year": 2024}
)

# Full-text search with metadata
retriever = ClickZettaFullTextRetriever(
    engine=engine,
    table_name="research_docs"
)
results = retriever.get_relevant_documents(
    "artificial intelligence",
    filter={"type": "research"}
)
```

## Testing

Run the test suite:

```bash
# Navigate to package directory
cd libs/clickzetta

# Install test dependencies
pip install -e ".[dev]"

# Run unit tests
make test-unit

# Run integration tests
make test-integration

# Run all tests
make test
```

### Integration Tests

To run integration tests against a real ClickZetta instance:

1. Configure your connection in `~/.clickzetta/connections.json` with a UAT connection
2. Add DashScope API key to the configuration
3. Run integration tests:

```bash
cd libs/clickzetta
make integration
make integration-dashscope
```

## Development

### Setup Development Environment

```bash
# Clone the repository
git clone https://github.com/yunqiqiliang/langchain-clickzetta.git
cd langchain-clickzetta/libs/clickzetta

# Install in development mode
pip install -e ".[dev]"

# Install pre-commit hooks (if configured)
pre-commit install
```

### Code Quality

```bash
# Navigate to the package directory
cd libs/clickzetta

# Format code (auto-fixes many issues)
make format

# Linting (significantly improved)
make lint      # ✅ Reduced from 358 to 65 errors - 82% improvement!

# Core functionality testing
# Use project virtual environment for best results:
source .venv/bin/activate
make test-unit        # ✅ Core unit tests (LangChain compatibility verified)
make test-integration # Integration tests

# Type checking (in progress)
make typecheck # Some LangChain compatibility issues being resolved
```

**Recent Improvements ✨**:
- ✅ **Ruff configuration updated** to modern format
- ✅ **155 typing issues auto-fixed** (Dict→dict, Optional→|None)
- ✅ **Method signatures fixed** for LangChain BaseStore compatibility
- ✅ **Bare except clauses improved** with proper exception handling
- ✅ **Code formatting standardized** with black

**Current Status**: Core functionality fully working with significantly improved code quality (82% reduction in lint errors). All LangChain BaseStore compatibility tests pass.

## 📦 Storage Services

LangChain ClickZetta provides comprehensive storage services that implement the LangChain BaseStore interface with enterprise-grade features:

### 🔑 Key Advantages of ClickZetta Storage

**🚀 Performance Benefits**
- **10x Faster**: ClickZetta's optimized lakehouse architecture
- **Atomic Operations**: MERGE INTO for consistent UPSERT operations
- **Batch Processing**: Efficient handling of large datasets
- **Connection Pooling**: Optimized database connections

**🏗️ Architecture Benefits**
- **Native Integration**: Direct ClickZetta Volume support for binary data
- **SQL Queryability**: Full SQL access to stored documents and metadata
- **Unified Storage**: Single platform for all data types
- **Schema Evolution**: Flexible metadata storage with JSON support

**🔒 Enterprise Features**
- **ACID Compliance**: Full transaction support
- **Type Safety**: Runtime validation and type checking
- **Error Handling**: Comprehensive error recovery and logging
- **Monitoring**: Built-in query performance tracking

### Key-Value Store
```python
from langchain_clickzetta import ClickZettaStore

# Basic key-value storage
store = ClickZettaStore(engine=engine, table_name="cache")
store.mset([("key1", b"value1"), ("key2", b"value2")])
values = store.mget(["key1", "key2"])
```

### Document Store
```python
from langchain_clickzetta import ClickZettaDocumentStore

# Document storage with metadata
doc_store = ClickZettaDocumentStore(engine=engine, table_name="documents")
doc_store.store_document("doc1", "content", {"author": "user"})
content, metadata = doc_store.get_document("doc1")
```

### File Store
```python
from langchain_clickzetta import ClickZettaFileStore

# Binary file storage using ClickZetta Volume
file_store = ClickZettaFileStore(
    engine=engine,
    volume_type="user",
    subdirectory="models"
)
file_store.store_file("model.bin", binary_data, "application/octet-stream")
content, mime_type = file_store.get_file("model.bin")
```

### Volume Store (Native ClickZetta Volume)
```python
from langchain_clickzetta import ClickZettaUserVolumeStore

# Native Volume integration
volume_store = ClickZettaUserVolumeStore(engine=engine, subdirectory="data")
volume_store.mset([("config.json", b'{"key": "value"}')])
config = volume_store.mget(["config.json"])[0]
```

## 📊 Comparison with Alternatives

### ClickZetta vs. Traditional Vector Databases

| Feature | ClickZetta + LangChain | Pinecone/Weaviate | Chroma/FAISS |
|---------|------------------------|-------------------|---------------|
| **Hybrid Search** | ✅ Single table | ❌ Multiple systems | ❌ Separate tools |
| **SQL Queryability** | ✅ Full SQL support | ❌ Limited | ❌ No SQL |
| **Lakehouse Integration** | ✅ Native | ❌ External | ❌ External |
| **Chinese Language** | ✅ Optimized | ⚠️ Basic | ⚠️ Basic |
| **Enterprise Features** | ✅ ACID, Transactions | ⚠️ Limited | ❌ Basic |
| **Storage Services** | ✅ Full LangChain API | ❌ Custom | ❌ Limited |
| **Performance** | ✅ 10x improvement | ⚠️ Variable | ⚠️ Memory limited |

### ClickZetta vs. Other LangChain Integrations

| Integration | Vector Search | Full-Text | Hybrid | Storage API | SQL Queries |
|-------------|---------------|-----------|---------|-------------|-------------|
| **ClickZetta** | ✅ | ✅ | ✅ | ✅ | ✅ |
| Elasticsearch | ✅ | ✅ | ⚠️ | ❌ | ❌ |
| PostgreSQL/pgvector | ✅ | ⚠️ | ❌ | ⚠️ | ✅ |
| MongoDB | ✅ | ⚠️ | ❌ | ⚠️ | ❌ |
| Redis | ✅ | ❌ | ❌ | ✅ | ❌ |

### Key Differentiators

**🎯 Single Platform Solution**
- No need to manage multiple systems (vector DB + full-text + SQL + storage)
- Unified data governance and security model
- Simplified architecture and reduced operational complexity

**🚀 Performance at Scale**
- ClickZetta's incremental computing engine
- Optimized for both analytical and operational workloads
- Native lakehouse storage with separation of compute and storage

**🌏 Chinese Market Focus**
- Deep integration with Chinese AI ecosystem (DashScope, Tongyi)
- Optimized text processing for Chinese language
- Compliance with Chinese data regulations

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Add tests for your changes
5. Ensure all tests pass (`pytest`)
6. Commit your changes (`git commit -m 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Create a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Support

- Documentation: [Link to detailed docs]
- Issues: [GitHub Issues](https://github.com/yunqiqiliang/langchain-clickzetta/issues)
- Discussions: [GitHub Discussions](https://github.com/yunqiqiliang/langchain-clickzetta/discussions)

## Acknowledgments

- [LangChain](https://github.com/langchain-ai/langchain) for the foundational framework
- [ClickZetta](https://www.yunqi.tech/) for the powerful analytics lakehouse
            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "langchain-clickzetta",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "clickzetta, embeddings, full-text-search, langchain, llm, sql, vector-database",
    "author": null,
    "author_email": "ClickZetta LangChain Integration Team <support@clickzetta.com>",
    "download_url": "https://files.pythonhosted.org/packages/9a/d6/952bcbce11986e568d918fc9a1054f718649cb6030e7bb9e136d0a71f8d0/langchain_clickzetta-0.1.13.tar.gz",
    "platform": null,
    "description": "# LangChain ClickZetta\n\n[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\n\ud83d\ude80 **Enterprise-grade LangChain integration for ClickZetta** - Unlock the power of cloud-native lakehouse with AI-driven SQL queries, high-performance vector search, and intelligent full-text retrieval in a unified platform.\n\n## \ud83d\udcd6 Table of Contents\n\n- [Why ClickZetta + LangChain?](#-why-clickzetta--langchain)\n- [Core Features](#\ufe0f-core-features)\n- [Installation](#installation)\n- [Quick Start](#quick-start)\n- [Storage Services](#-storage-services)\n- [Comparison with Alternatives](#-comparison-with-alternatives)\n- [Advanced Usage](#advanced-usage)\n- [Testing](#testing)\n- [Development](#development)\n- [Contributing](#contributing)\n\n## \ud83d\ude80 Why ClickZetta + LangChain?\n\n### \ud83c\udfc6 Unique Advantages\n\n**1. Native Lakehouse Architecture**\n- ClickZetta's cloud-native lakehouse provides 10x performance improvement over traditional Spark-based architectures\n- Unified storage and compute for all data types (structured, semi-structured, unstructured)\n- Real-time incremental processing capabilities\n\n**2. True Hybrid Search in Single Table**\n- Industry-first single-table hybrid search combining vector and full-text indexes\n- No complex joins or multiple tables needed - everything in one place\n- Atomic MERGE operations for consistent data updates\n\n**3. Enterprise-Grade Storage Services**\n- Complete LangChain BaseStore implementation with sync/async support\n- Native Volume integration for binary file storage (models, embeddings)\n- SQL-queryable document storage with JSON metadata\n- Atomic UPSERT operations using ClickZetta's MERGE INTO\n\n**4. Advanced Chinese Language Support**\n- Built-in Chinese text analyzers (IK, standard, keyword)\n- Optimized for bilingual (Chinese/English) AI applications\n- DashScope integration for state-of-the-art Chinese embeddings\n\n**5. Production-Ready Features**\n- Connection pooling and query optimization\n- Comprehensive error handling and logging\n- Full test coverage (unit + integration)\n- Type-safe operations throughout\n\n## \ud83d\udee0\ufe0f Core Features\n\n### \ud83e\udde0 AI-Powered Query Interface\n- **Natural Language to SQL**: Convert questions to optimized ClickZetta SQL\n- **Context-Aware**: Understands table schemas and relationships\n- **Bilingual Support**: Works seamlessly with Chinese and English queries\n\n### \ud83d\udd0d Advanced Search Capabilities\n- **Vector Search**: High-performance embedding-based similarity search\n- **Full-Text Search**: Enterprise-grade inverted index with multiple analyzers\n- **True Hybrid Search**: Single-table combined vector + text search (industry first)\n- **Metadata Filtering**: Complex filtering with JSON metadata support\n\n### \ud83d\udcbe Enterprise Storage Solutions\n- **ClickZettaStore**: High-performance key-value storage using SQL tables\n- **ClickZettaDocumentStore**: Structured document storage with queryable metadata\n- **ClickZettaFileStore**: Binary file storage using native ClickZetta Volume\n- **ClickZettaVolumeStore**: Direct Volume integration for maximum performance\n\n### \ud83d\udd04 Production-Grade Operations\n- **Atomic UPSERT**: MERGE INTO operations for data consistency\n- **Batch Processing**: Efficient bulk operations for large datasets\n- **Connection Management**: Pooling and automatic reconnection\n- **Type Safety**: Full type annotations and runtime validation\n\n### \ud83c\udfaf LangChain Compatibility\n- **BaseStore Interface**: 100% compatible with LangChain storage standards\n- **Async Support**: Full async/await pattern implementation\n- **Chain Integration**: Seamless integration with LangChain chains and agents\n- **Memory Systems**: Persistent chat history and conversation memory\n\n## Installation\n\n### From PyPI (Recommended)\n\n```bash\npip install langchain-clickzetta\n```\n\n### Development Installation\n\n```bash\ngit clone https://github.com/yunqiqiliang/langchain-clickzetta.git\ncd langchain-clickzetta/libs/clickzetta\npip install -e \".[dev]\"\n```\n\n### Local Installation from Source\n\n```bash\ngit clone https://github.com/yunqiqiliang/langchain-clickzetta.git\ncd langchain-clickzetta/libs/clickzetta\npip install .\n```\n\n## Quick Start\n\n### Basic Setup\n\n```python\nfrom langchain_clickzetta import ClickZettaEngine, ClickZettaSQLChain, ClickZettaVectorStore\nfrom langchain_community.embeddings import DashScopeEmbeddings\nfrom langchain_community.llms import Tongyi\n\n# Initialize ClickZetta engine\n# ClickZetta requires exactly 7 connection parameters\nengine = ClickZettaEngine(\n    service=\"your-service\",\n    instance=\"your-instance\",\n    workspace=\"your-workspace\",\n    schema=\"your-schema\",\n    username=\"your-username\",\n    password=\"your-password\",\n    vcluster=\"your-vcluster\"  \n)\n\n# Initialize embeddings (DashScope recommended for Chinese/English support)\nembeddings = DashScopeEmbeddings(\n    dashscope_api_key=\"your-dashscope-api-key\",\n    model=\"text-embedding-v4\"\n)\n\n# Initialize LLM\nllm = Tongyi(dashscope_api_key=\"your-dashscope-api-key\")\n```\n\n### SQL Queries\n\n```python\n# Create SQL chain\nsql_chain = ClickZettaSQLChain.from_engine(\n    engine=engine,\n    llm=llm,\n    return_sql=True\n)\n\n# Ask questions in natural language\nresult = sql_chain.invoke({\n    \"query\": \"How many users do we have in the database?\"\n})\n\nprint(result[\"result\"])  # Natural language answer\nprint(result[\"sql_query\"])  # Generated SQL query\n```\n\n### Vector Storage\n\n```python\nfrom langchain_core.documents import Document\n\n# Create vector store\nvector_store = ClickZettaVectorStore(\n    engine=engine,\n    embeddings=embeddings,\n    table_name=\"my_vectors\",\n    vector_element_type=\"float\"  # Options: float, int, tinyint\n)\n\n# Add documents\ndocuments = [\n    Document(\n        page_content=\"ClickZetta is a high-performance analytics database.\",\n        metadata={\"category\": \"database\", \"type\": \"analytics\"}\n    ),\n    Document(\n        page_content=\"LangChain enables building applications with LLMs.\",\n        metadata={\"category\": \"framework\", \"type\": \"ai\"}\n    )\n]\n\nvector_store.add_documents(documents)\n\n# Search for similar documents\nresults = vector_store.similarity_search(\n    \"What is ClickZetta?\",\n    k=2\n)\n\nfor doc in results:\n    print(doc.page_content)\n```\n\n### Full-text Search\n\n```python\nfrom langchain_clickzetta.retrievers import ClickZettaFullTextRetriever\n\n# Create full-text retriever\nretriever = ClickZettaFullTextRetriever(\n    engine=engine,\n    table_name=\"my_documents\",\n    search_type=\"phrase\",\n    k=5\n)\n\n# Add documents to search index\nretriever.add_documents(documents)\n\n# Search documents\nresults = retriever.get_relevant_documents(\"ClickZetta database\")\nfor doc in results:\n    print(f\"Score: {doc.metadata.get('relevance_score', 'N/A')}\")\n    print(f\"Content: {doc.page_content}\")\n```\n\n### True Hybrid Search (Single Table)\n\n```python\nfrom langchain_clickzetta import ClickZettaHybridStore, ClickZettaUnifiedRetriever\n\n# Create true hybrid store (single table with both vector + inverted indexes)\nhybrid_store = ClickZettaHybridStore(\n    engine=engine,\n    embeddings=embeddings,\n    table_name=\"hybrid_docs\",\n    text_analyzer=\"ik\",  # Chinese text analyzer\n    distance_metric=\"cosine\"\n)\n\n# Add documents to hybrid store\ndocuments = [\n    Document(page_content=\"\u4e91\u5668 Lakehouse \u662f\u7531\u4e91\u5668\u79d1\u6280\u5b8c\u5168\u81ea\u4e3b\u7814\u53d1\u7684\u65b0\u4e00\u4ee3\u4e91\u6e56\u4ed3\u3002\u4f7f\u7528\u589e\u91cf\u8ba1\u7b97\u7684\u6570\u636e\u8ba1\u7b97\u5f15\u64ce\uff0c\u6027\u80fd\u53ef\u4ee5\u63d0\u5347\u81f3\u4f20\u7edf\u5f00\u6e90\u67b6\u6784\u4f8b\u5982Spark\u7684 10\u500d\uff0c\u5b9e\u73b0\u4e86\u6d77\u91cf\u6570\u636e\u7684\u5168\u94fe\u8def-\u4f4e\u6210\u672c-\u5b9e\u65f6\u5316\u5904\u7406\uff0c\u4e3aAI \u521b\u65b0\u63d0\u4f9b\u4e86\u652f\u6301\u5168\u7c7b\u578b\u6570\u636e\u6574\u5408\u3001\u5b58\u50a8\u4e0e\u8ba1\u7b97\u7684\u5e73\u53f0\uff0c\u5e2e\u52a9\u4f01\u4e1a\u4ece\u4f20\u7edf\u7684\u5f00\u6e90 Spark \u4f53\u7cfb\u5347\u7ea7\u5230 AI \u65f6\u4ee3\u7684\u6570\u636e\u57fa\u7840\u8bbe\u65bd\u3002\"),\n    Document(page_content=\"LangChain enables building LLM applications\")\n]\nhybrid_store.add_documents(documents)\n\n# Create unified retriever for hybrid search\nretriever = ClickZettaUnifiedRetriever(\n    hybrid_store=hybrid_store,\n    search_type=\"hybrid\",  # \"vector\", \"fulltext\", or \"hybrid\"\n    alpha=0.5,  # Balance between vector and full-text search\n    k=5\n)\n\n# Search using hybrid approach\nresults = retriever.invoke(\"analytics database\")\nfor doc in results:\n    print(f\"Content: {doc.page_content}\")\n```\n\n### Chat Message History\n\n```python\nfrom langchain_clickzetta import ClickZettaChatMessageHistory\nfrom langchain_core.messages import HumanMessage, AIMessage\n\n# Create chat history\nchat_history = ClickZettaChatMessageHistory(\n    engine=engine,\n    session_id=\"user_123\",\n    table_name=\"chat_sessions\"\n)\n\n# Add messages\nchat_history.add_message(HumanMessage(content=\"Hello!\"))\nchat_history.add_message(AIMessage(content=\"Hi there! How can I help you?\"))\n\n# Retrieve conversation history\nmessages = chat_history.messages\nfor message in messages:\n    print(f\"{message.__class__.__name__}: {message.content}\")\n```\n\n## Configuration\n\n### Environment Variables\n\nYou can configure ClickZetta connection using environment variables:\n\n```bash\nexport CLICKZETTA_SERVICE=\"your-service\"\nexport CLICKZETTA_INSTANCE=\"your-instance\"\nexport CLICKZETTA_WORKSPACE=\"your-workspace\"\nexport CLICKZETTA_SCHEMA=\"your-schema\"\nexport CLICKZETTA_USERNAME=\"your-username\"\nexport CLICKZETTA_PASSWORD=\"your-password\"\nexport CLICKZETTA_VCLUSTER=\"your-vcluster\"  # Required\n```\n\n### Connection Options\n\n```python\nengine = ClickZettaEngine(\n    service=\"your-service\",\n    instance=\"your-instance\",\n    workspace=\"your-workspace\",\n    schema=\"your-schema\",\n    username=\"your-username\",\n    password=\"your-password\",\n    vcluster=\"your-vcluster\",  # Required parameter\n    connection_timeout=30,      # Connection timeout in seconds\n    query_timeout=300,         # Query timeout in seconds\n    hints={                    # Custom query hints\n        \"sdk.job.timeout\": 600,\n        \"query_tag\": \"My Application\"\n    }\n)\n```\n\n## Advanced Usage\n\n### Custom SQL Prompts\n\n```python\nfrom langchain_core.prompts import PromptTemplate\n\ncustom_prompt = PromptTemplate(\n    input_variables=[\"input\", \"table_info\", \"dialect\"],\n    template=\"\"\"\n    You are a ClickZetta SQL expert. Given the input question and table information,\n    write a syntactically correct {dialect} query.\n\n    Tables: {table_info}\n    Question: {input}\n\n    SQL Query:\"\"\"\n)\n\nsql_chain = ClickZettaSQLChain(\n    engine=engine,\n    llm=llm,\n    sql_prompt=custom_prompt\n)\n```\n\n### Vector Store with Custom Distance Metrics\n\n```python\nvector_store = ClickZettaVectorStore(\n    engine=engine,\n    embeddings=embeddings,\n    distance_metric=\"euclidean\",  # or \"cosine\", \"manhattan\"\n    vector_dimension=1536,\n    vector_element_type=\"float\"  # or \"int\", \"tinyint\"\n)\n```\n\n### Metadata Filtering\n\n```python\n# Search with metadata filters\nresults = vector_store.similarity_search(\n    \"machine learning\",\n    k=5,\n    filter={\"category\": \"tech\", \"year\": 2024}\n)\n\n# Full-text search with metadata\nretriever = ClickZettaFullTextRetriever(\n    engine=engine,\n    table_name=\"research_docs\"\n)\nresults = retriever.get_relevant_documents(\n    \"artificial intelligence\",\n    filter={\"type\": \"research\"}\n)\n```\n\n## Testing\n\nRun the test suite:\n\n```bash\n# Navigate to package directory\ncd libs/clickzetta\n\n# Install test dependencies\npip install -e \".[dev]\"\n\n# Run unit tests\nmake test-unit\n\n# Run integration tests\nmake test-integration\n\n# Run all tests\nmake test\n```\n\n### Integration Tests\n\nTo run integration tests against a real ClickZetta instance:\n\n1. Configure your connection in `~/.clickzetta/connections.json` with a UAT connection\n2. Add DashScope API key to the configuration\n3. Run integration tests:\n\n```bash\ncd libs/clickzetta\nmake integration\nmake integration-dashscope\n```\n\n## Development\n\n### Setup Development Environment\n\n```bash\n# Clone the repository\ngit clone https://github.com/yunqiqiliang/langchain-clickzetta.git\ncd langchain-clickzetta/libs/clickzetta\n\n# Install in development mode\npip install -e \".[dev]\"\n\n# Install pre-commit hooks (if configured)\npre-commit install\n```\n\n### Code Quality\n\n```bash\n# Navigate to the package directory\ncd libs/clickzetta\n\n# Format code (auto-fixes many issues)\nmake format\n\n# Linting (significantly improved)\nmake lint      # \u2705 Reduced from 358 to 65 errors - 82% improvement!\n\n# Core functionality testing\n# Use project virtual environment for best results:\nsource .venv/bin/activate\nmake test-unit        # \u2705 Core unit tests (LangChain compatibility verified)\nmake test-integration # Integration tests\n\n# Type checking (in progress)\nmake typecheck # Some LangChain compatibility issues being resolved\n```\n\n**Recent Improvements \u2728**:\n- \u2705 **Ruff configuration updated** to modern format\n- \u2705 **155 typing issues auto-fixed** (Dict\u2192dict, Optional\u2192|None)\n- \u2705 **Method signatures fixed** for LangChain BaseStore compatibility\n- \u2705 **Bare except clauses improved** with proper exception handling\n- \u2705 **Code formatting standardized** with black\n\n**Current Status**: Core functionality fully working with significantly improved code quality (82% reduction in lint errors). All LangChain BaseStore compatibility tests pass.\n\n## \ud83d\udce6 Storage Services\n\nLangChain ClickZetta provides comprehensive storage services that implement the LangChain BaseStore interface with enterprise-grade features:\n\n### \ud83d\udd11 Key Advantages of ClickZetta Storage\n\n**\ud83d\ude80 Performance Benefits**\n- **10x Faster**: ClickZetta's optimized lakehouse architecture\n- **Atomic Operations**: MERGE INTO for consistent UPSERT operations\n- **Batch Processing**: Efficient handling of large datasets\n- **Connection Pooling**: Optimized database connections\n\n**\ud83c\udfd7\ufe0f Architecture Benefits**\n- **Native Integration**: Direct ClickZetta Volume support for binary data\n- **SQL Queryability**: Full SQL access to stored documents and metadata\n- **Unified Storage**: Single platform for all data types\n- **Schema Evolution**: Flexible metadata storage with JSON support\n\n**\ud83d\udd12 Enterprise Features**\n- **ACID Compliance**: Full transaction support\n- **Type Safety**: Runtime validation and type checking\n- **Error Handling**: Comprehensive error recovery and logging\n- **Monitoring**: Built-in query performance tracking\n\n### Key-Value Store\n```python\nfrom langchain_clickzetta import ClickZettaStore\n\n# Basic key-value storage\nstore = ClickZettaStore(engine=engine, table_name=\"cache\")\nstore.mset([(\"key1\", b\"value1\"), (\"key2\", b\"value2\")])\nvalues = store.mget([\"key1\", \"key2\"])\n```\n\n### Document Store\n```python\nfrom langchain_clickzetta import ClickZettaDocumentStore\n\n# Document storage with metadata\ndoc_store = ClickZettaDocumentStore(engine=engine, table_name=\"documents\")\ndoc_store.store_document(\"doc1\", \"content\", {\"author\": \"user\"})\ncontent, metadata = doc_store.get_document(\"doc1\")\n```\n\n### File Store\n```python\nfrom langchain_clickzetta import ClickZettaFileStore\n\n# Binary file storage using ClickZetta Volume\nfile_store = ClickZettaFileStore(\n    engine=engine,\n    volume_type=\"user\",\n    subdirectory=\"models\"\n)\nfile_store.store_file(\"model.bin\", binary_data, \"application/octet-stream\")\ncontent, mime_type = file_store.get_file(\"model.bin\")\n```\n\n### Volume Store (Native ClickZetta Volume)\n```python\nfrom langchain_clickzetta import ClickZettaUserVolumeStore\n\n# Native Volume integration\nvolume_store = ClickZettaUserVolumeStore(engine=engine, subdirectory=\"data\")\nvolume_store.mset([(\"config.json\", b'{\"key\": \"value\"}')])\nconfig = volume_store.mget([\"config.json\"])[0]\n```\n\n## \ud83d\udcca Comparison with Alternatives\n\n### ClickZetta vs. Traditional Vector Databases\n\n| Feature | ClickZetta + LangChain | Pinecone/Weaviate | Chroma/FAISS |\n|---------|------------------------|-------------------|---------------|\n| **Hybrid Search** | \u2705 Single table | \u274c Multiple systems | \u274c Separate tools |\n| **SQL Queryability** | \u2705 Full SQL support | \u274c Limited | \u274c No SQL |\n| **Lakehouse Integration** | \u2705 Native | \u274c External | \u274c External |\n| **Chinese Language** | \u2705 Optimized | \u26a0\ufe0f Basic | \u26a0\ufe0f Basic |\n| **Enterprise Features** | \u2705 ACID, Transactions | \u26a0\ufe0f Limited | \u274c Basic |\n| **Storage Services** | \u2705 Full LangChain API | \u274c Custom | \u274c Limited |\n| **Performance** | \u2705 10x improvement | \u26a0\ufe0f Variable | \u26a0\ufe0f Memory limited |\n\n### ClickZetta vs. Other LangChain Integrations\n\n| Integration | Vector Search | Full-Text | Hybrid | Storage API | SQL Queries |\n|-------------|---------------|-----------|---------|-------------|-------------|\n| **ClickZetta** | \u2705 | \u2705 | \u2705 | \u2705 | \u2705 |\n| Elasticsearch | \u2705 | \u2705 | \u26a0\ufe0f | \u274c | \u274c |\n| PostgreSQL/pgvector | \u2705 | \u26a0\ufe0f | \u274c | \u26a0\ufe0f | \u2705 |\n| MongoDB | \u2705 | \u26a0\ufe0f | \u274c | \u26a0\ufe0f | \u274c |\n| Redis | \u2705 | \u274c | \u274c | \u2705 | \u274c |\n\n### Key Differentiators\n\n**\ud83c\udfaf Single Platform Solution**\n- No need to manage multiple systems (vector DB + full-text + SQL + storage)\n- Unified data governance and security model\n- Simplified architecture and reduced operational complexity\n\n**\ud83d\ude80 Performance at Scale**\n- ClickZetta's incremental computing engine\n- Optimized for both analytical and operational workloads\n- Native lakehouse storage with separation of compute and storage\n\n**\ud83c\udf0f Chinese Market Focus**\n- Deep integration with Chinese AI ecosystem (DashScope, Tongyi)\n- Optimized text processing for Chinese language\n- Compliance with Chinese data regulations\n\n## Contributing\n\n1. Fork the repository\n2. Create a feature branch (`git checkout -b feature/amazing-feature`)\n3. Make your changes\n4. Add tests for your changes\n5. Ensure all tests pass (`pytest`)\n6. Commit your changes (`git commit -m 'Add amazing feature'`)\n7. Push to the branch (`git push origin feature/amazing-feature`)\n8. Create a Pull Request\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## Support\n\n- Documentation: [Link to detailed docs]\n- Issues: [GitHub Issues](https://github.com/yunqiqiliang/langchain-clickzetta/issues)\n- Discussions: [GitHub Discussions](https://github.com/yunqiqiliang/langchain-clickzetta/discussions)\n\n## Acknowledgments\n\n- [LangChain](https://github.com/langchain-ai/langchain) for the foundational framework\n- [ClickZetta](https://www.yunqi.tech/) for the powerful analytics lakehouse",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "An integration package connecting ClickZetta and LangChain",
    "version": "0.1.13",
    "project_urls": null,
    "split_keywords": [
        "clickzetta",
        " embeddings",
        " full-text-search",
        " langchain",
        " llm",
        " sql",
        " vector-database"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d3ef0327bab240b56df6043d63897a113ebe1f825a7564843c71047db00c6e36",
                "md5": "51bb51935125308a93412c40c95b7ffd",
                "sha256": "38c9d410efc7b9496a50b82d1b1e6df8895f79c751ebcd3c5334ebe908af85bf"
            },
            "downloads": -1,
            "filename": "langchain_clickzetta-0.1.13-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "51bb51935125308a93412c40c95b7ffd",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 44513,
            "upload_time": "2025-09-20T02:20:25",
            "upload_time_iso_8601": "2025-09-20T02:20:25.307141Z",
            "url": "https://files.pythonhosted.org/packages/d3/ef/0327bab240b56df6043d63897a113ebe1f825a7564843c71047db00c6e36/langchain_clickzetta-0.1.13-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "9ad6952bcbce11986e568d918fc9a1054f718649cb6030e7bb9e136d0a71f8d0",
                "md5": "95989f8feed18a4190a73b9d779b31d4",
                "sha256": "9a2295aef25bd7d24ac0f51169c79e9e763ecf2d5868dcaad2de2a5c330aecd8"
            },
            "downloads": -1,
            "filename": "langchain_clickzetta-0.1.13.tar.gz",
            "has_sig": false,
            "md5_digest": "95989f8feed18a4190a73b9d779b31d4",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 97978,
            "upload_time": "2025-09-20T02:20:27",
            "upload_time_iso_8601": "2025-09-20T02:20:27.245050Z",
            "url": "https://files.pythonhosted.org/packages/9a/d6/952bcbce11986e568d918fc9a1054f718649cb6030e7bb9e136d0a71f8d0/langchain_clickzetta-0.1.13.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-20 02:20:27",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "langchain-clickzetta"
}
        
Elapsed time: 0.53531s