knowlang

Name: knowlang
Version: 0.2.0
Home page: https://github.com/kimgb415/know-lang
Summary: AI-powered code understanding assistant that helps developers explore and understand complex codebases through semantic search and intelligent Q&A
Upload time: 2025-02-09 00:58:42
Author: gabhyun kim
Requires Python: <3.13,>=3.10
License: Apache-2.0
Keywords: code-understanding, rag, llm, code-search, documentation, code-analysis, semantic-search, developer-tools
Requirements: none recorded
---
title: KnowLangBot
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
---

# KnowLang: Comprehensive Understanding for Complex Codebase

KnowLang is an advanced codebase exploration tool that helps software engineers better understand complex codebases through semantic search and intelligent Q&A capabilities. Our first release focuses on providing RAG-powered search and Q&A for popular open-source libraries, with Hugging Face's repositories as our initial targets.

[![Hugging Face Space](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-blue)](https://huggingface.co/spaces/gabykim/KnowLang_Transformers_Demo)

## Features

- 🔍 **Semantic Code Search**: Find relevant code snippets based on natural language queries
- 📚 **Contextual Q&A**: Get detailed explanations about code functionality and implementation details
- 🎯 **Smart Chunking**: Intelligent code parsing that preserves semantic meaning
- 🔄 **Multi-Stage Retrieval**: Combined embedding and semantic search for better results
- 🐍 **Python Support**: Currently optimized for Python codebases, with a roadmap for multi-language support

## How It Works

### Code Parsing Pipeline

```mermaid
flowchart TD
    A[Git Repository] --> B[Code Files]
    B --> C[Code Parser]
    C --> D{Parse by Type}
    D --> E[Class Definitions]
    D --> F[Function Definitions]
    D --> G[Other Code]
    E --> H[Code Chunks]
    F --> H
    G --> H
    H --> I[LLM Summarization]
    I --> J[Embeddings]
    H --> J
    J --> K[(Vector Store)]
```

### RAG Chatbot Pipeline

```mermaid
flowchart LR
    A[User Query] --> B[Query Embedding]
    B --> C[Vector Search]
    C --> D[Context Collection]
    D --> E[LLM Response Generation]
    E --> F[User Interface]
```


## Prerequisites

KnowLang uses [Ollama](https://ollama.com) as its default LLM and embedding provider. Before installing KnowLang:

1. Install Ollama:
```bash
# check the official download instructions from https://ollama.com/download
curl -fsSL https://ollama.com/install.sh | sh
```

2. Pull required models:
```bash
# For LLM responses
ollama pull llama3.2

# For code embeddings
ollama pull mxbai-embed-large
```

3. Verify Ollama is running:
```bash
ollama list
```

You should see both `llama3.2` and `mxbai-embed-large` in the list of available models.
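Beyond `ollama list`, you can sanity-check the embedding model from Python. The sketch below targets Ollama's `/api/embeddings` REST endpoint at the default address (`localhost:11434`); only the `embed` function touches the network, so the JSON helpers can be exercised without a running server.

```python
# Hedged sketch: fetch an embedding from a locally running Ollama server.
# Assumes Ollama's default address; only `embed` performs a network call.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"

def build_request(model: str, prompt: str) -> bytes:
    """Serialize the JSON body Ollama's embeddings endpoint expects."""
    return json.dumps({"model": model, "prompt": prompt}).encode("utf-8")

def parse_embedding(body: bytes) -> list[float]:
    """Pull the embedding vector out of Ollama's JSON response."""
    return json.loads(body)["embedding"]

def embed(text: str, model: str = "mxbai-embed-large") -> list[float]:
    """Round-trip one prompt through the embedding model (needs `ollama serve`)."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(model, text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return parse_embedding(resp.read())
```

If the call succeeds, the length of the returned vector should match the model's embedding dimension (1024 for `mxbai-embed-large`).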

Note: While Ollama is the default choice for easy setup, KnowLang supports other LLM providers through configuration. See our [Configuration Guide](configuration.md) for using alternative providers like OpenAI or Anthropic.

## Quick Start

### System Requirements

- **RAM**: Minimum 16GB recommended (Ollama models require significant memory)
- **Storage**: At least 10GB free space for model files
- **OS**: 
  - Linux (recommended)
  - macOS 12+ (Intel or Apple Silicon)
  - Windows 10+ with WSL2
- **Python**: 3.10 or higher


### Installation

```bash
pip install knowlang
```

### Basic Usage

1. First, parse and index your codebase:
```bash
# For a local codebase
knowlang parse ./my-project

# For verbose output
knowlang -v parse ./my-project
```

2. Then, launch the chat interface:
```bash
knowlang chat
```

That's it! The chat interface will open in your browser, ready to answer questions about your codebase.

### Advanced Usage

#### Custom Configuration
```bash
# Use custom configuration file
knowlang parse --config my_config.yaml ./my-project

# Output parsing results in JSON format
knowlang parse --output json ./my-project
```

#### Chat Interface Options
```bash
# Run on a specific port
knowlang chat --port 7860

# Create a shareable link
knowlang chat --share

# Run on custom server
knowlang chat --server-name localhost --server-port 8000
```

### Example Session

```bash
# Parse the transformers library
$ knowlang parse ./transformers
Found 1247 code chunks
Processing summaries... Done!

# Start chatting
$ knowlang chat

💡 Ask questions like:
- How is tokenization implemented?
- Explain the training pipeline
- Show me examples of custom model usage
```

## Architecture

KnowLang uses several key technologies:

- **Tree-sitter**: For robust, language-agnostic code parsing
- **ChromaDB**: For efficient vector storage and retrieval
- **PydanticAI**: For type-safe LLM interactions
- **Gradio**: For the interactive chat interface

## Technical Details

### Code Parsing

Our code parsing pipeline uses Tree-sitter to break down source code into meaningful chunks while preserving context:

1. Repository cloning and file identification
2. Semantic parsing with Tree-sitter
3. Smart chunking based on code structure
4. LLM-powered summarization
5. Embedding generation with mxbai-embed-large
6. Vector store indexing
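As a rough illustration of step 3, structure-aware chunking can be sketched as follows. Note this is not KnowLang's actual implementation (which uses Tree-sitter and is language-agnostic); the sketch substitutes Python's stdlib `ast` module, so it only handles Python source.

```python
# Illustrative only: split Python source into class/function chunks plus a
# remainder, mirroring the Class/Function/Other split in the pipeline diagram.
import ast

def chunk_source(source: str) -> list[dict]:
    """Return top-level class/function chunks, with leftover lines as 'Other'."""
    tree = ast.parse(source)
    chunks, covered = [], set()
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "type": type(node).__name__,
                "name": node.name,
                "code": ast.get_source_segment(source, node),
            })
            covered.update(range(node.lineno, node.end_lineno + 1))
    # Everything not inside a class/function body becomes one "Other" chunk.
    other = [
        line for i, line in enumerate(source.splitlines(), start=1)
        if i not in covered and line.strip()
    ]
    if other:
        chunks.append({"type": "Other", "name": None, "code": "\n".join(other)})
    return chunks

sample = "import os\n\ndef greet(name):\n    return 'hi ' + name\n"
chunks = chunk_source(sample)
```

Here `chunks` holds one `FunctionDef` chunk named `greet` and one `Other` chunk for the import; each chunk would then be summarized and embedded.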

### RAG Implementation

The RAG system uses a multi-stage retrieval process:

1. Query embedding generation
2. Initial vector similarity search
3. Context aggregation
4. LLM response generation with full context
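Stages 2 and 3 can be sketched with toy vectors standing in for real embeddings and ChromaDB (illustrative only, not KnowLang's implementation):

```python
# Toy multi-stage retrieval: rank stored chunks by cosine similarity to the
# query vector, then aggregate the top hits into one context string.
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def retrieve(query_vec: list[float], store: list[dict], top_k: int = 2) -> list[dict]:
    """Stage 2: vector similarity search over the store."""
    ranked = sorted(store, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:top_k]

def build_context(chunks: list[dict]) -> str:
    """Stage 3: aggregate retrieved chunks into a single LLM context."""
    return "\n---\n".join(c["text"] for c in chunks)

store = [
    {"text": "def tokenize(...): ...", "vec": [0.9, 0.1]},
    {"text": "class Trainer: ...", "vec": [0.1, 0.9]},
    {"text": "def decode(...): ...", "vec": [0.8, 0.2]},
]
context = build_context(retrieve([1.0, 0.0], store))
```

In the real pipeline the query vector comes from the embedding model, the store is ChromaDB, and `context` is handed to the LLM for stage 4.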


## Roadmap

- [ ] Inter-repository semantic search
- [ ] Support for additional programming languages
- [ ] Automatic documentation maintenance
- [ ] Integration with popular IDEs
- [ ] Custom embedding model training
- [ ] Enhanced evaluation metrics

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details. Apache 2.0 is a permissive license that allows broad use, modification, and distribution, grants an express patent license, and reserves trademark rights.

## Citation

If you use KnowLang in your research, please cite:

```bibtex
@software{knowlang2025,
  author = {KnowLang},
  title = {KnowLang: Comprehensive Understanding for Complex Codebase},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/kimgb415/know-lang}
}
```

## Support

For support, please open an issue on GitHub or reach out to us directly through discussions.
            
