---
title: KnowLangBot
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
---
# KnowLang: Comprehensive Understanding for Complex Codebase
KnowLang is an advanced codebase exploration tool that helps software engineers better understand complex codebases through semantic search and intelligent Q&A capabilities. Our first release focuses on providing RAG-powered search and Q&A for popular open-source libraries, with Hugging Face's repositories as our initial targets.
[Try the demo on Hugging Face Spaces](https://huggingface.co/spaces/gabykim/KnowLang_Transformers_Demo)
## Features
- 🔍 **Semantic Code Search**: Find relevant code snippets based on natural language queries
- 📚 **Contextual Q&A**: Get detailed explanations about code functionality and implementation details
- 🎯 **Smart Chunking**: Intelligent code parsing that preserves semantic meaning
- 🔄 **Multi-Stage Retrieval**: Combined embedding and semantic search for better results
- 🐍 **Python Support**: Currently optimized for Python codebases, with a roadmap for multi-language support
## How It Works
### Code Parsing Pipeline
```mermaid
flowchart TD
A[Git Repository] --> B[Code Files]
B --> C[Code Parser]
C --> D{Parse by Type}
D --> E[Class Definitions]
D --> F[Function Definitions]
D --> G[Other Code]
E --> H[Code Chunks]
F --> H
G --> H
H --> I[LLM Summarization]
    H --> J[Embeddings]
    I --> J
J --> K[(Vector Store)]
```
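The parse-by-type stage above can be sketched with Python's standard-library `ast` module. This is a simplified, Python-only illustration (KnowLang itself uses Tree-sitter for language-agnostic parsing, and the chunk shape here is hypothetical):

```python
import ast

def chunk_source(source: str) -> list[dict]:
    """Split Python source into class, function, and 'other' chunks."""
    tree = ast.parse(source)
    chunks = []
    covered = set()  # line numbers already claimed by a class/function chunk
    for node in tree.body:
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            chunks.append({
                "type": kind,
                "name": node.name,
                "code": ast.get_source_segment(source, node),
            })
            covered.update(range(node.lineno, node.end_lineno + 1))
    # Everything outside class/function bodies becomes an "other" chunk
    other = [
        line for i, line in enumerate(source.splitlines(), start=1)
        if i not in covered and line.strip()
    ]
    if other:
        chunks.append({"type": "other", "name": None, "code": "\n".join(other)})
    return chunks

source = '''
import os

def greet(name):
    return f"Hello, {name}"

class Greeter:
    def hi(self):
        return greet("world")
'''

for chunk in chunk_source(source):
    print(chunk["type"], chunk["name"])
```

Grouping by definition type keeps each chunk semantically whole, so a method never gets split away from its class when the chunk is later summarized and embedded.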
### RAG Chatbot Pipeline
```mermaid
flowchart LR
A[User Query] --> B[Query Embedding]
B --> C[Vector Search]
C --> D[Context Collection]
D --> E[LLM Response Generation]
E --> F[User Interface]
```
## Prerequisites
KnowLang uses [Ollama](https://ollama.com) as its default LLM and embedding provider. Before installing KnowLang:
1. Install Ollama:
```bash
# check the official download instructions from https://ollama.com/download
curl -fsSL https://ollama.com/install.sh | sh
```
2. Pull required models:
```bash
# For LLM responses
ollama pull llama3.2
# For code embeddings
ollama pull mxbai-embed-large
```
3. Verify Ollama is running:
```bash
ollama list
```
You should see both `llama3.2` and `mxbai-embed-large` in the list of available models.
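You can also check programmatically: Ollama serves a local REST API (default `http://localhost:11434`), and `GET /api/tags` lists the installed models. A minimal sketch using only the standard library (`missing_models` is a hypothetical helper for this example):

```python
import json
import urllib.request

REQUIRED = ("llama3.2", "mxbai-embed-large")

def missing_models(tags: dict, required=REQUIRED) -> list[str]:
    """Return required model names absent from an /api/tags response."""
    # Entries come back tagged, e.g. "llama3.2:latest"; compare base names only
    installed = {m["name"].split(":")[0] for m in tags.get("models", [])}
    return [name for name in required if name not in installed]

if __name__ == "__main__":
    try:
        with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=5) as resp:
            missing = missing_models(json.load(resp))
    except OSError as exc:
        print(f"Ollama does not appear to be running: {exc}")
    else:
        if missing:
            print("Missing models, run `ollama pull`:", ", ".join(missing))
        else:
            print("All required models are available.")
```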
Note: While Ollama is the default choice for easy setup, KnowLang supports other LLM providers through configuration. See our [Configuration Guide](configuration.md) for using alternative providers like OpenAI or Anthropic.
## Quick Start
### System Requirements
- **RAM**: Minimum 16GB recommended (Ollama models require significant memory)
- **Storage**: At least 10GB free space for model files
- **OS**:
- Linux (recommended)
- macOS 12+ (Intel or Apple Silicon)
- Windows 10+ with WSL2
- **Python**: 3.10 or higher
### Installation
```bash
pip install knowlang
```
### Basic Usage
1. First, parse and index your codebase:
```bash
# For a local codebase
knowlang parse ./my-project
# For verbose output
knowlang -v parse ./my-project
```
2. Then, launch the chat interface:
```bash
knowlang chat
```
That's it! The chat interface will open in your browser, ready to answer questions about your codebase.
### Advanced Usage
#### Custom Configuration
```bash
# Use custom configuration file
knowlang parse --config my_config.yaml ./my-project
# Output parsing results in JSON format
knowlang parse --output json ./my-project
```
#### Chat Interface Options
```bash
# Run on a specific port
knowlang chat --port 7860
# Create a shareable link
knowlang chat --share
# Run on custom server
knowlang chat --server-name localhost --server-port 8000
```
### Example Session
```bash
# Parse the transformers library
$ knowlang parse ./transformers
Found 1247 code chunks
Processing summaries... Done!
# Start chatting
$ knowlang chat
💡 Ask questions like:
- How is tokenization implemented?
- Explain the training pipeline
- Show me examples of custom model usage
```
## Architecture
KnowLang uses several key technologies:
- **Tree-sitter**: For robust, language-agnostic code parsing
- **ChromaDB**: For efficient vector storage and retrieval
- **PydanticAI**: For type-safe LLM interactions
- **Gradio**: For the interactive chat interface
## Technical Details
### Code Parsing
Our code parsing pipeline uses Tree-sitter to break down source code into meaningful chunks while preserving context:
1. Repository cloning and file identification
2. Semantic parsing with Tree-sitter
3. Smart chunking based on code structure
4. LLM-powered summarization
5. Embedding generation with mxbai-embed-large
6. Vector store indexing
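Step 5 can be sketched against Ollama's local REST API: `/api/embeddings` accepts a model name and a prompt and returns an `embedding` vector. This is an illustrative standalone snippet, not KnowLang's internal code (`build_payload` is a hypothetical helper):

```python
import json
import urllib.request

def build_payload(text: str, model: str = "mxbai-embed-large") -> bytes:
    """JSON body for Ollama's /api/embeddings endpoint."""
    return json.dumps({"model": model, "prompt": text}).encode()

def embed(text: str, host: str = "http://localhost:11434") -> list[float]:
    """Request an embedding vector for one code chunk from a local Ollama server."""
    req = urllib.request.Request(
        f"{host}/api/embeddings",
        data=build_payload(text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embedding"]

if __name__ == "__main__":
    try:
        vec = embed("def greet(name): return f'Hello, {name}!'")
        print(f"embedding dimension: {len(vec)}")
    except OSError as exc:
        print(f"Could not reach Ollama: {exc}")
```

In the real pipeline, both the raw chunk and its LLM-generated summary are embedded before being written to the vector store.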
### RAG Implementation
The RAG system uses a multi-stage retrieval process:
1. Query embedding generation
2. Initial vector similarity search
3. Context aggregation
4. LLM response generation with full context
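The four stages above can be sketched end to end with a toy in-memory store, using cosine similarity to stand in for the vector search (the names and chunk format here are illustrative, not KnowLang's API):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, store, top_k=2):
    """Stages 2-3: vector similarity search, then context aggregation."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    hits = ranked[:top_k]
    context = "\n\n".join(h["chunk"] for h in hits)
    return hits, context

# Toy store: in KnowLang this is ChromaDB holding mxbai-embed-large vectors
store = [
    {"chunk": "def tokenize(text): ...", "vec": [1.0, 0.0, 0.1]},
    {"chunk": "class Trainer: ...",      "vec": [0.0, 1.0, 0.0]},
    {"chunk": "def decode(ids): ...",    "vec": [0.9, 0.1, 0.2]},
]

query_vec = [1.0, 0.0, 0.0]  # stage 1: would come from the embedding model
hits, context = retrieve(query_vec, store)

# Stage 4: the aggregated context is handed to the LLM inside a prompt
prompt = f"Answer using this code context:\n{context}\n\nQuestion: How is tokenization implemented?"
print(hits[0]["chunk"])
```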
## Roadmap
- [ ] Inter-repository semantic search
- [ ] Support for additional programming languages
- [ ] Automatic documentation maintenance
- [ ] Integration with popular IDEs
- [ ] Custom embedding model training
- [ ] Enhanced evaluation metrics
## License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details. The Apache License 2.0 is a permissive license that enables broad use, modification, and distribution while providing patent rights and protecting trademark use.
## Citation
If you use KnowLang in your research, please cite:
```bibtex
@software{knowlang2025,
author = {KnowLang},
title = {KnowLang: Comprehensive Understanding for Complex Codebase},
year = {2025},
publisher = {GitHub},
url = {https://github.com/kimgb415/know-lang}
}
```
## Support
For support, please open an issue on GitHub or reach out through GitHub Discussions.