# Tree-sitter Chunker
A high-performance semantic code chunker that leverages [Tree-sitter](https://tree-sitter.github.io/) parsers to intelligently split source code into meaningful chunks like functions, classes, and methods.
## ✨ Key Features
- 🎯 **Semantic Understanding** - Extracts functions, classes, and methods based on the AST
- 🚀 **Blazing Fast** - 11.9x speedup with intelligent AST caching
- 🌍 **Universal Language Support** - Auto-download and support for 100+ Tree-sitter grammars
- 🔌 **Plugin Architecture** - Built-in plugins for 29 languages, plus auto-download support for 100+ more
- 🎛️ **Flexible Configuration** - TOML/YAML/JSON config files with per-language settings
- 📊 **14 Export Formats** - JSON, JSONL, Parquet, CSV, XML, GraphML, Neo4j, DOT, SQLite, PostgreSQL, and more
- ⚡ **Parallel Processing** - Process entire codebases with configurable workers
- 🌊 **Streaming Support** - Handle files larger than memory
- 🎨 **Rich CLI** - Progress bars, batch processing, and filtering
- 🤖 **LLM-Ready** - Token counting, chunk optimization, and context-aware splitting
- 📝 **Text File Support** - Markdown, logs, config files with intelligent chunking
- 🔍 **Advanced Query** - Natural language search across your codebase
- 📈 **Graph Export** - Visualize code structure in yEd, Neo4j, or Graphviz
- 🐛 **Debug Tools** - AST visualization, chunk inspection, performance profiling
- 🔧 **Developer Tools** - Pre-commit hooks, CI/CD generation, quality metrics
- 📦 **Multi-Platform Distribution** - PyPI, Docker, Homebrew packages
- 🌐 **Zero-Configuration** - Automatic language detection and grammar download
## 📦 Installation
### Prerequisites
- Python 3.10+ (for Python usage)
- C compiler (for building Tree-sitter grammars)
- `uv` package manager (recommended) or pip
### Installation Methods
#### From PyPI
```bash
pip install treesitter-chunker
# With REST API support
pip install "treesitter-chunker[api]"
```
#### For Other Languages
See [Cross-Language Usage Guide](docs/cross-language-usage.md) for using from JavaScript, Go, Ruby, etc.
#### Using Docker
```bash
docker pull ghcr.io/consiliency/treesitter-chunker:latest
docker run -v $(pwd):/workspace treesitter-chunker chunk /workspace/example.py -l python
```
#### Using Homebrew (macOS/Linux)
```bash
brew tap consiliency/treesitter-chunker
brew install treesitter-chunker
```
#### For Debian/Ubuntu
```bash
# Download .deb package from releases
sudo dpkg -i python3-treesitter-chunker_1.0.0-1_all.deb
```
#### For Fedora/RHEL
```bash
# Download .rpm package from releases
sudo rpm -i python-treesitter-chunker-1.0.0-1.noarch.rpm
```
### Quick Install
```bash
# Clone the repository
git clone https://github.com/Consiliency/treesitter-chunker.git
cd treesitter-chunker
# Install with uv (recommended)
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv pip install -e ".[dev]"
uv pip install git+https://github.com/tree-sitter/py-tree-sitter.git
# Build language grammars
python scripts/fetch_grammars.py
python scripts/build_lib.py
# Verify installation
python -c "from chunker.parser import list_languages; print(list_languages())"
# Output: ['c', 'cpp', 'javascript', 'python', 'rust']
```
### Using prebuilt grammars (no local builds)
CI-built wheels bundle precompiled Tree-sitter grammars for common platforms. If a grammar isn't bundled yet, the library can build it on demand into your user cache.
To opt into building grammars once and reusing them:
```bash
export CHUNKER_GRAMMAR_BUILD_DIR="$HOME/.cache/treesitter-chunker/build"
```
Then build a language one time from Python:
```python
from pathlib import Path
from chunker.grammar.manager import TreeSitterGrammarManager
cache = Path.home() / ".cache" / "treesitter-chunker"
gm = TreeSitterGrammarManager(grammars_dir=cache / "grammars", build_dir=cache / "build")
gm.add_grammar("python", "https://github.com/tree-sitter/tree-sitter-python")
gm.fetch_grammar("python")
gm.build_grammar("python")
```
Now chunking with `language="python"` works without further setup.
## 🚀 Quick Start
### Python Usage
```python
from chunker import chunk_file, chunk_text, chunk_directory
# Extract chunks from a Python file
chunks = chunk_file("example.py", "python")
# Or chunk text directly
chunks = chunk_text(code_string, "javascript")
for chunk in chunks:
    print(f"{chunk.node_type} at lines {chunk.start_line}-{chunk.end_line}")
    print(f"  Context: {chunk.parent_context or 'module level'}")
```
### Incremental Processing
Efficiently detect changes after edits and update only what changed:
```python
from chunker import DefaultIncrementalProcessor, chunk_file
from pathlib import Path
processor = DefaultIncrementalProcessor()
file_path = Path("example.py")
old_chunks = chunk_file(file_path, "python")
processor.store_chunks(str(file_path), old_chunks)
# ... modify example.py ...
new_chunks = chunk_file(file_path, "python")
# API 1: file path + new chunks
diff = processor.compute_diff(str(file_path), new_chunks)
for added in diff.added:
    print("Added:", added.chunk_id)
# API 2: old chunks + new text + language
# diff = processor.compute_diff(old_chunks, file_path.read_text(), "python")
```
### Smart Context and Natural-Language Query (optional)
These advanced features rely on heavy optional dependencies (NumPy, PyArrow), so they are not required at import time; when they are available:
```python
from chunker import (
    TreeSitterSmartContextProvider,
    InMemoryContextCache,
    AdvancedQueryIndex,
    NaturalLanguageQueryEngine,
)
from chunker import chunk_file
chunks = chunk_file("api/server.py", "python")
# Semantic context
ctx = TreeSitterSmartContextProvider(cache=InMemoryContextCache(ttl=3600))
context, metadata = ctx.get_semantic_context(chunks[0])
# Query
index = AdvancedQueryIndex()
index.build_index(chunks)
engine = NaturalLanguageQueryEngine()
results = engine.search("API endpoints", chunks)
for r in results[:3]:
    print(r.score, r.chunk.node_type)
```
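Because these imports fail when the optional extras are missing, callers that want to degrade gracefully can guard them. A minimal sketch of that pattern (the `HAVE_QUERY` flag is our own naming, not part of the library's API):

```python
# Guarded import: the advanced-query classes sit behind optional heavy
# dependencies (NumPy/PyArrow), so fall back cleanly if they are absent.
try:
    from chunker import AdvancedQueryIndex, NaturalLanguageQueryEngine
    HAVE_QUERY = True
except ImportError:  # extras (or the package itself) not installed
    AdvancedQueryIndex = NaturalLanguageQueryEngine = None
    HAVE_QUERY = False

if HAVE_QUERY:
    index = AdvancedQueryIndex()
else:
    print("Advanced query unavailable; install the optional extras to enable it.")
```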
### Streaming Large Files
```python
from chunker import chunk_file_streaming
for chunk in chunk_file_streaming("big.sql", language="sql"):
    print(chunk.node_type, chunk.start_line, chunk.end_line)
```
### Cross-Language Usage
```bash
# CLI with JSON output (callable from any language)
treesitter-chunker chunk file.py --lang python --json
# REST API
curl -X POST http://localhost:8000/chunk/text \
  -H "Content-Type: application/json" \
  -d '{"content": "def hello(): pass", "language": "python"}'
```
See [Cross-Language Usage Guide](docs/cross-language-usage.md) for JavaScript, Go, and other language examples.
> **Note**: By default, chunks smaller than 3 lines are filtered out. Adjust `min_chunk_size` in configuration if needed.
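For example, to keep even one- and two-line chunks, lower the threshold in your config file (using the same keys shown in the Configuration System section):

```toml
# .chunkerrc — keep chunks of any size
min_chunk_size = 1
```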
### Zero-Configuration Usage (New!)
```python
from chunker.auto import ZeroConfigAPI
# Create API instance - no setup required!
api = ZeroConfigAPI()
# Automatically detects language and downloads grammar if needed
result = api.auto_chunk_file("example.rs")
for chunk in result.chunks:
    print(f"{chunk.node_type} at lines {chunk.start_line}-{chunk.end_line}")
# Preload languages for offline use
api.preload_languages(["python", "rust", "go", "typescript"])
```
### Using Plugins
```python
from chunker.core import chunk_file
from chunker.plugin_manager import get_plugin_manager
# Load built-in language plugins
manager = get_plugin_manager()
manager.load_built_in_plugins()
# Now chunking uses plugin-based rules
chunks = chunk_file("example.py", "python")
```
### Parallel Processing
```python
from chunker.parallel import chunk_files_parallel, chunk_directory_parallel
# Process multiple files in parallel
results = chunk_files_parallel(
    ["file1.py", "file2.py", "file3.py"],
    "python",
    max_workers=4,
    show_progress=True
)
# Process entire directory
results = chunk_directory_parallel(
    "src/",
    "python",
    pattern="**/*.py"
)
```
### Build Wheels (for contributors)
The build system supports environment flags to speed up or stabilize local builds:
```bash
# Limit grammars included in combined wheels (comma-separated subset)
export CHUNKER_WHEEL_LANGS=python,javascript,rust
# Verbose build logs
export CHUNKER_BUILD_VERBOSE=1
# Optional build timeout in seconds (per compilation unit)
export CHUNKER_BUILD_TIMEOUT=240
```
### Export Formats
```python
from chunker.core import chunk_file
from chunker.export.json_export import JSONExporter, JSONLExporter
from chunker.export.formatters import SchemaType
from chunker.exporters.parquet import ParquetExporter
chunks = chunk_file("example.py", "python")
# Export to JSON with nested schema
json_exporter = JSONExporter(schema_type=SchemaType.NESTED)
json_exporter.export(chunks, "chunks.json")
# Export to JSONL for streaming
jsonl_exporter = JSONLExporter()
jsonl_exporter.export(chunks, "chunks.jsonl")
# Export to Parquet for analytics
parquet_exporter = ParquetExporter(compression="snappy")
parquet_exporter.export(chunks, "chunks.parquet")
```
### CLI Usage
```bash
# Basic chunking
python cli/main.py chunk example.py -l python
# Process directory with progress bar
python cli/main.py batch src/ --recursive
# Export as JSON
python cli/main.py chunk example.py -l python --json > chunks.json
# With configuration file
python cli/main.py chunk src/ --config .chunkerrc
# Override exclude patterns (default excludes files with 'test' in name)
python cli/main.py batch src/ --exclude "*.tmp,*.bak" --include "*.py"
```
### Zero-Config CLI (auto-detection)
```bash
# Automatically detect language and chunk a file
python cli/main.py auto-chunk example.rs
# Auto-chunk a directory using detection + intelligent fallbacks
python cli/main.py auto-batch repo/
```
### AST Visualization
Generate Graphviz diagrams of the parse tree:
```bash
python scripts/visualize_ast.py example.py --lang python --out example.svg
```
### VS Code Extension
The Tree-sitter Chunker VS Code extension provides integrated chunking capabilities:
1. **Install the extension**: Search for "TreeSitter Chunker" in VS Code marketplace
2. **Commands available**:
- `TreeSitter Chunker: Chunk Current File` - Analyze the active file
- `TreeSitter Chunker: Chunk Workspace` - Process all supported files
- `TreeSitter Chunker: Show Chunks` - View chunks in a webview
- `TreeSitter Chunker: Export Chunks` - Export to JSON/JSONL/Parquet
3. **Features**:
- Visual chunk boundaries in the editor
- Context menu integration
- Configurable chunk types per language
- Progress tracking for large operations
## 🎯 Features
### Plugin Architecture
The chunker uses a flexible plugin system for language support:
- **Built-in Plugins**: 29 languages with dedicated plugins: Python, JavaScript (includes TypeScript/TSX), Rust, C, C++, Go, Ruby, Java, Dockerfile, SQL, MATLAB, R, Julia, OCaml, Haskell, Scala, Elixir, Clojure, Dart, Vue, Svelte, Zig, NASM, WebAssembly, XML, YAML, TOML
- **Auto-Download Support**: 100+ additional languages via automatic grammar download including PHP, Kotlin, C#, Swift, CSS, HTML, JSON, and many more
- **Custom Plugins**: Easy to add new languages using the TemplateGenerator
- **Configuration**: Per-language chunk types and rules
- **Hot Loading**: Load plugins from directories
### Performance Features
- **AST Caching**: 11.9x speedup for repeated processing
- **Parallel Processing**: Utilize multiple CPU cores
- **Streaming**: Process files larger than memory
- **Progress Tracking**: Rich progress bars with ETA
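The caching win is easy to illustrate outside the library: key parse results on a hash of the file content, so unchanged files never pay for a reparse. A language-agnostic sketch of that pattern (not the library's `ASTCache` API itself):

```python
import hashlib

class ContentKeyedCache:
    """Cache expensive parse results, keyed by a hash of the source text."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_parse(self, source: str, parse):
        # Identical content always maps to the same key, regardless of path.
        key = hashlib.sha256(source.encode()).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = parse(source)
        return self._store[key]

cache = ContentKeyedCache()
tree1 = cache.get_or_parse("def f(): pass", lambda s: ("parsed", s))
tree2 = cache.get_or_parse("def f(): pass", lambda s: ("parsed", s))  # cache hit
```

The second call returns the stored result without invoking the parser, which is where the repeated-processing speedup comes from.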
### Configuration System
Support for multiple configuration formats:
```toml
# .chunkerrc
min_chunk_size = 3
max_chunk_size = 300
[languages.python]
chunk_types = ["function_definition", "class_definition", "async_function_definition"]
min_chunk_size = 5
```
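The same settings can be expressed in the other supported formats; for instance, a YAML equivalent (assuming the key names carry over unchanged from the TOML form above; the filename is illustrative):

```yaml
# .chunkerrc.yaml (hypothetical filename)
min_chunk_size: 3
max_chunk_size: 300

languages:
  python:
    chunk_types: [function_definition, class_definition, async_function_definition]
    min_chunk_size: 5
```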
### Export Formats
- **JSON**: Human-readable, supports nested/flat/relational schemas
- **JSONL**: Line-delimited JSON for streaming
- **Parquet**: Columnar format for analytics with compression
### Recent Feature Additions
#### Phase 9 Features (Completed)
- **Token Integration**: Count tokens for LLM context windows
- **Chunk Hierarchy**: Build hierarchical chunk relationships
- **Metadata Extraction**: Extract TODOs, complexity metrics, etc.
- **Semantic Merging**: Intelligently merge related chunks
- **Custom Rules**: Define custom chunking rules per language
- **Repository Processing**: Process entire repositories efficiently
- **Overlapping Fallback**: Handle edge cases with smart fallbacks
- **Cross-Platform Packaging**: Distribute as wheels for all platforms
#### Phase 14: Universal Language Support (Completed)
- **Automatic Grammar Discovery**: Discovers 100+ Tree-sitter grammars from GitHub
- **On-Demand Download**: Downloads and compiles grammars automatically when needed
- **Zero-Configuration API**: Simple API that just works without setup
- **Smart Caching**: Local cache with 24-hour refresh for offline use
- **Language Detection**: Automatic language detection from file extensions
#### Phase 15: Production Readiness & Comprehensive Testing (Completed)
- **900+ Tests**: All tests passing across unit, integration, and language-specific test suites
- **Test Fixes**: Fixed fallback warnings, CSV header inclusion, and large file streaming
- **Comprehensive Methodology**: Full testing coverage for security, performance, reliability, and operations
- **36+ Languages**: Production-ready support across every built-in language
#### Phase 19: Comprehensive Language Expansion (Completed)
- **Template Generator**: Automated plugin and test generation with Jinja2
- **Grammar Manager**: Dynamic grammar source management with parallel compilation
- **36+ Built-in Languages**: Added 22 new language plugins across 4 tiers
- **Contract-Driven Development**: Clean component boundaries for parallel implementation
- **ExtendedLanguagePluginContract**: Enhanced contract for consistent plugin behavior
## 📚 API Overview
Tree-sitter Chunker exports 110+ APIs organized into logical groups:
### Core Functions
- `chunk_file()` - Extract chunks from a file
- `CodeChunk` - Data class representing a chunk
- `chunk_text()` - Chunk raw source text (convenience wrapper)
- `chunk_directory()` - Parallel directory chunking (convenience alias)
### Parser Management
- `get_parser()` - Get parser for a language
- `list_languages()` - List available languages
- `get_language_info()` - Get language metadata
- `return_parser()` - Return parser to pool
- `clear_cache()` - Clear parser cache
### Plugin System
- `PluginManager` - Manage language plugins
- `LanguagePlugin` - Base class for plugins
- `PluginConfig` - Plugin configuration
- `get_plugin_manager()` - Get global plugin manager
### Performance Features
- `chunk_files_parallel()` - Process files in parallel
- `chunk_directory_parallel()` - Process directories
- `chunk_file_streaming()` - Stream large files
- `ASTCache` - Cache parsed ASTs
- `StreamingChunker` - Streaming chunker class
- `ParallelChunker` - Parallel processing class
### Incremental Processing
- `DefaultIncrementalProcessor` - Compute diffs between old/new chunks
- `DefaultChangeDetector`, `DefaultChunkCache` - Helpers and caching
### Advanced Query (optional)
- `AdvancedQueryIndex` - Text/AST/embedding indexes