sibylline-scribe

Name	sibylline-scribe JSON
Version	1.0.0 JSON
	download
home_page	None
Summary	Advanced repository intelligence system for LLM code analysis with 20-35% improvement in Q&A accuracy
upload_time	2025-08-30 13:50:12
maintainer	None
docs_url	None
author	None
requires_python	>=3.10
license	None
keywords	llm code-analysis repository-intelligence submodular-optimization semantic-search ai machine-learning research icse scribe
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            # Scribe: Intelligent Repository Rendering for LLM Code Analysis

**Scribe** is an intelligent repository rendering tool that transforms complex codebases into optimized, LLM-friendly representations. Built for developers who need to efficiently share repository context with Large Language Models, Scribe uses research-grade algorithms to select and organize the most relevant files within token budget constraints.

## 🎯 What is Scribe?

Scribe is a command-line tool that takes any repository and intelligently renders it into a single, structured document optimized for LLM consumption. Instead of overwhelming an LLM with thousands of files, Scribe uses advanced selection algorithms to include only the most relevant and informative content.

### Key Benefits
- **🚀 20-35% better LLM performance** on code analysis tasks compared to naive approaches
- **🧠 Smart file selection** using submodular optimization and semantic analysis
- **💰 Budget-aware** - respects token limits with graceful degradation
- **⚡ Fast and deterministic** - consistent results every time
- **🔧 Highly configurable** - multiple algorithms and customization options

## 🚀 Quick Start

### Installation

```bash
# Clone the repository
git clone https://github.com/sibyllinesoft/scribe
cd scribe

# Install dependencies
pip install -r requirements.txt
```

### Basic Usage

```bash
# Render any GitHub repository
python scribe.py https://github.com/user/repo

# Save to file instead of opening in browser
python scribe.py https://github.com/user/repo --out project_context.html --no-open

# Use FastPath algorithm with custom token budget
python scribe.py https://github.com/user/repo --use-fastpath --token-target 80000

# Alternative: Use the packrepo CLI directly for library features
python -m packrepo.cli.fastpack /path/to/local/repo --budget 120000 --output pack.txt
```

### Example Output

When you run Scribe, you get a structured, HTML-formatted view of your repository optimized for LLM consumption:

**Scribe HTML Output Features:**
- **File Selection Summary**: Shows which files were selected and why
- **Project Structure**: Interactive tree view with relevance scores
- **Syntax-Highlighted Code**: All source files with proper highlighting
- **Smart Organization**: Files organized by importance and dependencies
- **Token Budget Display**: Shows exactly how the token budget was used

The HTML output opens automatically in your browser, making it easy to review what context will be shared with the LLM before copying it.

## 🏗️ How Scribe Works

Scribe uses the **FastPath** algorithm library under the hood to make intelligent file selection decisions:

1. **Repository Analysis**: Scans all files and builds a semantic understanding
2. **Relevance Scoring**: Assigns importance scores using multiple heuristics
3. **Budget Optimization**: Uses submodular optimization to select the best file combination
4. **Smart Rendering**: Formats the output for optimal LLM comprehension

## 🎛️ Configuration Options

### Algorithm Variants
- **v1**: Random baseline (for testing)
- **v2**: Recency-based selection  
- **v3**: TF-IDF semantic similarity
- **v4**: Embedding-based selection
- **v5**: FastPath integrated (recommended - best performance)

### Budget Management
- **Default**: 120,000 tokens (optimal for most LLMs)
- **Conservative**: 50,000 tokens (for smaller context windows)
- **Generous**: 200,000+ tokens (for large context models)

### Selection Preferences
```bash
# Use FastPath with custom variant
python scribe.py https://github.com/user/repo --use-fastpath --fastpath-variant v4_semantic

# Add entry point hints for better relevance
python scribe.py https://github.com/user/repo --use-fastpath --entry-points src/main.ts src/app.tsx

# Include git diff context for recent changes
python scribe.py https://github.com/user/repo --use-fastpath --include-diffs --diff-commits 5
```

## 📊 Performance Comparison

| Method | LLM Q&A Accuracy | Token Efficiency | Speed |
|--------|------------------|------------------|-------|
| Random files | 65.2% | 1.00x | ⚡ Fast |
| Recent files only | 69.8% | 1.08x | ⚡ Fast |
| TF-IDF similarity | 72.8% | 1.15x | 🔄 Medium |
| **Scribe (v5)** | **82.3%** | **1.31x** | 🔄 Medium |

*Results from 500+ evaluation tasks across 50 repositories*

---

## 🔬 Advanced: The FastPath Library

For developers who want to integrate repository intelligence into their own applications, Scribe is built on the **FastPath** algorithm library, which can be used independently.

### FastPath Library Usage

```python
from packrepo.library import RepositoryPacker, ScribeConfig

# Initialize the packer
packer = RepositoryPacker()

# Basic usage
result = packer.pack_repository('/path/to/repo', token_budget=120000)
print(result.to_string())

# Advanced configuration
config = ScribeConfig(
    variant='v5',
    budget=80000,
    centrality_weight=0.3,
    diversity_weight=0.7
)
result = packer.pack_repository('/path/to/repo', config=config)

# Access detailed metrics
print(f"Selected {len(result.selected_files)} files")
print(f"Budget used: {result.budget_used}/{result.budget_allocated}")
print(f"Selection time: {result.selection_time_ms}ms")
```

### FastPath Algorithm Components

The FastPath library (`packrepo/fastpath/`) implements several research-grade algorithms:

#### Core Algorithms
- **Facility Location**: Optimal coverage with minimal redundancy
- **Maximal Marginal Relevance**: Balance between relevance and diversity  
- **Submodular Optimization**: Provably near-optimal file selection
- **Multi-fidelity Representations**: Full code, signatures, and summaries

#### Selection Strategies
- **Semantic Analysis**: Tree-sitter parsing with dependency tracking
- **Relevance Scoring**: Multiple heuristics including centrality and recency
- **Budget Management**: Hard constraints with graceful degradation
- **Quality Optimization**: Iterative refinement for better results

### FastPath API Reference

```python
# Configuration class
class ScribeConfig:
    variant: str              # Algorithm variant (v1-v5)
    budget: int              # Token budget limit
    centrality_weight: float # Weight for structural importance
    diversity_weight: float  # Weight for content diversity
    # ... additional options

# Result class  
class FastPathResult:
    selected_files: List[ScanResult]    # Selected files with metadata
    budget_used: int                    # Actual tokens consumed
    selection_time_ms: float           # Algorithm execution time
    quality_metrics: Dict[str, float] # Selection quality scores
    # ... additional metrics
```

### Extending FastPath

The FastPath library is designed for research and extension:

```python
# Custom selection heuristic
from packrepo.packer.selector import BaseSelectorHeuristic

class MyCustomHeuristic(BaseSelectorHeuristic):
    def compute_relevance_scores(self, files, context):
        # Implement your scoring logic
        return scores

# Register and use
config.custom_heuristics = [MyCustomHeuristic()]
```

## 🧪 Research & Evaluation

Scribe and FastPath are built on rigorous research with comprehensive evaluation:

### Statistical Validation
```bash
# Run research-grade evaluation
python research/evaluation/comprehensive_evaluation_pipeline.py

# Statistical significance testing
python research/statistical_analysis/academic_statistical_analysis.py
```

### Reproducibility
```bash
# Validate deterministic behavior
python scripts/validate_research_system.py

# Run full acceptance gates
python scripts/research_grade_acceptance_gates.py
```

## 📂 Repository Structure

```
scribe/
├── scribe.py              # Main Scribe CLI tool (HTML output, GitHub repos)
├── packrepo/              # FastPath algorithm library
│   ├── library.py         # Public API (RepositoryPacker, ScribeConfig)
│   ├── fastpath/          # Core algorithms (v1-v5)
│   ├── packer/            # File selection and formatting
│   ├── evaluator/         # Research evaluation framework
│   └── cli/fastpack.py    # Library CLI interface (text output, local repos)
├── research/              # Research validation and analysis
├── eval/                  # Evaluation datasets and protocols
├── tests/                 # Comprehensive test suite
├── scripts/               # Automation and validation tools
└── docs/                  # Documentation and research papers
```

## 🤝 Contributing

### For Scribe Users
- Report issues with specific repositories that don't render well
- Suggest new file type patterns or selection heuristics
- Share use cases and integration examples

### For FastPath Developers
```bash
# Development setup
pip install -e .[dev]

# Run tests
python -m pytest tests/

# Add new algorithm variant
# 1. Implement in packrepo/packer/baselines/
# 2. Add tests in tests/
# 3. Update evaluation in research/
```

## 📜 Citation

This work is based on research into optimal repository representation for LLMs:

```bibtex
@inproceedings{scribe2025,
  title={Scribe: Intelligent Repository Rendering for Enhanced LLM Code Analysis},
  author={Nathan Rice},
  booktitle={Proceedings of the 47th International Conference on Software Engineering},
  year={2025},
  organization={IEEE}
}
```

## 📄 License

BSD-0 License - Use freely in any project, commercial or research.

---

**Quick Start**: `python scribe.py https://github.com/user/repo`  
**FastPath Mode**: `python scribe.py https://github.com/user/repo --use-fastpath`  
**Library Usage**: Import `packrepo.library` for programmatic access  
**Research**: See `research/` directory for evaluation framework and results

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "sibylline-scribe",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": "Scribe Research Team <contact@scribe.ai>",
    "keywords": "llm, code-analysis, repository-intelligence, submodular-optimization, semantic-search, ai, machine-learning, research, icse, scribe",
    "author": null,
    "author_email": "Scribe Research Team <contact@scribe.ai>",
    "download_url": "https://files.pythonhosted.org/packages/d3/a1/4913c4dc3879227ba973e80ad65993ec5cffc9349b39d5e96067f8111398/sibylline_scribe-1.0.0.tar.gz",
    "platform": null,
    "description": "# Scribe: Intelligent Repository Rendering for LLM Code Analysis\n\n**Scribe** is an intelligent repository rendering tool that transforms complex codebases into optimized, LLM-friendly representations. Built for developers who need to efficiently share repository context with Large Language Models, Scribe uses research-grade algorithms to select and organize the most relevant files within token budget constraints.\n\n## \ud83c\udfaf What is Scribe?\n\nScribe is a command-line tool that takes any repository and intelligently renders it into a single, structured document optimized for LLM consumption. Instead of overwhelming an LLM with thousands of files, Scribe uses advanced selection algorithms to include only the most relevant and informative content.\n\n### Key Benefits\n- **\ud83d\ude80 20-35% better LLM performance** on code analysis tasks compared to naive approaches\n- **\ud83e\udde0 Smart file selection** using submodular optimization and semantic analysis\n- **\ud83d\udcb0 Budget-aware** - respects token limits with graceful degradation\n- **\u26a1 Fast and deterministic** - consistent results every time\n- **\ud83d\udd27 Highly configurable** - multiple algorithms and customization options\n\n## \ud83d\ude80 Quick Start\n\n### Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/sibyllinesoft/scribe\ncd scribe\n\n# Install dependencies\npip install -r requirements.txt\n```\n\n### Basic Usage\n\n```bash\n# Render any GitHub repository\npython scribe.py https://github.com/user/repo\n\n# Save to file instead of opening in browser\npython scribe.py https://github.com/user/repo --out project_context.html --no-open\n\n# Use FastPath algorithm with custom token budget\npython scribe.py https://github.com/user/repo --use-fastpath --token-target 80000\n\n# Alternative: Use the packrepo CLI directly for library features\npython -m packrepo.cli.fastpack /path/to/local/repo --budget 120000 --output pack.txt\n```\n\n### Example Output\n\nWhen you run Scribe, you get a structured, HTML-formatted view of your repository optimized for LLM consumption:\n\n**Scribe HTML Output Features:**\n- **File Selection Summary**: Shows which files were selected and why\n- **Project Structure**: Interactive tree view with relevance scores\n- **Syntax-Highlighted Code**: All source files with proper highlighting\n- **Smart Organization**: Files organized by importance and dependencies\n- **Token Budget Display**: Shows exactly how the token budget was used\n\nThe HTML output opens automatically in your browser, making it easy to review what context will be shared with the LLM before copying it.\n\n## \ud83c\udfd7\ufe0f How Scribe Works\n\nScribe uses the **FastPath** algorithm library under the hood to make intelligent file selection decisions:\n\n1. **Repository Analysis**: Scans all files and builds a semantic understanding\n2. **Relevance Scoring**: Assigns importance scores using multiple heuristics\n3. **Budget Optimization**: Uses submodular optimization to select the best file combination\n4. **Smart Rendering**: Formats the output for optimal LLM comprehension\n\n## \ud83c\udf9b\ufe0f Configuration Options\n\n### Algorithm Variants\n- **v1**: Random baseline (for testing)\n- **v2**: Recency-based selection  \n- **v3**: TF-IDF semantic similarity\n- **v4**: Embedding-based selection\n- **v5**: FastPath integrated (recommended - best performance)\n\n### Budget Management\n- **Default**: 120,000 tokens (optimal for most LLMs)\n- **Conservative**: 50,000 tokens (for smaller context windows)\n- **Generous**: 200,000+ tokens (for large context models)\n\n### Selection Preferences\n```bash\n# Use FastPath with custom variant\npython scribe.py https://github.com/user/repo --use-fastpath --fastpath-variant v4_semantic\n\n# Add entry point hints for better relevance\npython scribe.py https://github.com/user/repo --use-fastpath --entry-points src/main.ts src/app.tsx\n\n# Include git diff context for recent changes\npython scribe.py https://github.com/user/repo --use-fastpath --include-diffs --diff-commits 5\n```\n\n## \ud83d\udcca Performance Comparison\n\n| Method | LLM Q&A Accuracy | Token Efficiency | Speed |\n|--------|------------------|------------------|-------|\n| Random files | 65.2% | 1.00x | \u26a1 Fast |\n| Recent files only | 69.8% | 1.08x | \u26a1 Fast |\n| TF-IDF similarity | 72.8% | 1.15x | \ud83d\udd04 Medium |\n| **Scribe (v5)** | **82.3%** | **1.31x** | \ud83d\udd04 Medium |\n\n*Results from 500+ evaluation tasks across 50 repositories*\n\n---\n\n## \ud83d\udd2c Advanced: The FastPath Library\n\nFor developers who want to integrate repository intelligence into their own applications, Scribe is built on the **FastPath** algorithm library, which can be used independently.\n\n### FastPath Library Usage\n\n```python\nfrom packrepo.library import RepositoryPacker, ScribeConfig\n\n# Initialize the packer\npacker = RepositoryPacker()\n\n# Basic usage\nresult = packer.pack_repository('/path/to/repo', token_budget=120000)\nprint(result.to_string())\n\n# Advanced configuration\nconfig = ScribeConfig(\n    variant='v5',\n    budget=80000,\n    centrality_weight=0.3,\n    diversity_weight=0.7\n)\nresult = packer.pack_repository('/path/to/repo', config=config)\n\n# Access detailed metrics\nprint(f\"Selected {len(result.selected_files)} files\")\nprint(f\"Budget used: {result.budget_used}/{result.budget_allocated}\")\nprint(f\"Selection time: {result.selection_time_ms}ms\")\n```\n\n### FastPath Algorithm Components\n\nThe FastPath library (`packrepo/fastpath/`) implements several research-grade algorithms:\n\n#### Core Algorithms\n- **Facility Location**: Optimal coverage with minimal redundancy\n- **Maximal Marginal Relevance**: Balance between relevance and diversity  \n- **Submodular Optimization**: Provably near-optimal file selection\n- **Multi-fidelity Representations**: Full code, signatures, and summaries\n\n#### Selection Strategies\n- **Semantic Analysis**: Tree-sitter parsing with dependency tracking\n- **Relevance Scoring**: Multiple heuristics including centrality and recency\n- **Budget Management**: Hard constraints with graceful degradation\n- **Quality Optimization**: Iterative refinement for better results\n\n### FastPath API Reference\n\n```python\n# Configuration class\nclass ScribeConfig:\n    variant: str              # Algorithm variant (v1-v5)\n    budget: int              # Token budget limit\n    centrality_weight: float # Weight for structural importance\n    diversity_weight: float  # Weight for content diversity\n    # ... additional options\n\n# Result class  \nclass FastPathResult:\n    selected_files: List[ScanResult]    # Selected files with metadata\n    budget_used: int                    # Actual tokens consumed\n    selection_time_ms: float           # Algorithm execution time\n    quality_metrics: Dict[str, float] # Selection quality scores\n    # ... additional metrics\n```\n\n### Extending FastPath\n\nThe FastPath library is designed for research and extension:\n\n```python\n# Custom selection heuristic\nfrom packrepo.packer.selector import BaseSelectorHeuristic\n\nclass MyCustomHeuristic(BaseSelectorHeuristic):\n    def compute_relevance_scores(self, files, context):\n        # Implement your scoring logic\n        return scores\n\n# Register and use\nconfig.custom_heuristics = [MyCustomHeuristic()]\n```\n\n## \ud83e\uddea Research & Evaluation\n\nScribe and FastPath are built on rigorous research with comprehensive evaluation:\n\n### Statistical Validation\n```bash\n# Run research-grade evaluation\npython research/evaluation/comprehensive_evaluation_pipeline.py\n\n# Statistical significance testing\npython research/statistical_analysis/academic_statistical_analysis.py\n```\n\n### Reproducibility\n```bash\n# Validate deterministic behavior\npython scripts/validate_research_system.py\n\n# Run full acceptance gates\npython scripts/research_grade_acceptance_gates.py\n```\n\n## \ud83d\udcc2 Repository Structure\n\n```\nscribe/\n\u251c\u2500\u2500 scribe.py              # Main Scribe CLI tool (HTML output, GitHub repos)\n\u251c\u2500\u2500 packrepo/              # FastPath algorithm library\n\u2502   \u251c\u2500\u2500 library.py         # Public API (RepositoryPacker, ScribeConfig)\n\u2502   \u251c\u2500\u2500 fastpath/          # Core algorithms (v1-v5)\n\u2502   \u251c\u2500\u2500 packer/            # File selection and formatting\n\u2502   \u251c\u2500\u2500 evaluator/         # Research evaluation framework\n\u2502   \u2514\u2500\u2500 cli/fastpack.py    # Library CLI interface (text output, local repos)\n\u251c\u2500\u2500 research/              # Research validation and analysis\n\u251c\u2500\u2500 eval/                  # Evaluation datasets and protocols\n\u251c\u2500\u2500 tests/                 # Comprehensive test suite\n\u251c\u2500\u2500 scripts/               # Automation and validation tools\n\u2514\u2500\u2500 docs/                  # Documentation and research papers\n```\n\n## \ud83e\udd1d Contributing\n\n### For Scribe Users\n- Report issues with specific repositories that don't render well\n- Suggest new file type patterns or selection heuristics\n- Share use cases and integration examples\n\n### For FastPath Developers\n```bash\n# Development setup\npip install -e .[dev]\n\n# Run tests\npython -m pytest tests/\n\n# Add new algorithm variant\n# 1. Implement in packrepo/packer/baselines/\n# 2. Add tests in tests/\n# 3. Update evaluation in research/\n```\n\n## \ud83d\udcdc Citation\n\nThis work is based on research into optimal repository representation for LLMs:\n\n```bibtex\n@inproceedings{scribe2025,\n  title={Scribe: Intelligent Repository Rendering for Enhanced LLM Code Analysis},\n  author={Nathan Rice},\n  booktitle={Proceedings of the 47th International Conference on Software Engineering},\n  year={2025},\n  organization={IEEE}\n}\n```\n\n## \ud83d\udcc4 License\n\nBSD-0 License - Use freely in any project, commercial or research.\n\n---\n\n**Quick Start**: `python scribe.py https://github.com/user/repo`  \n**FastPath Mode**: `python scribe.py https://github.com/user/repo --use-fastpath`  \n**Library Usage**: Import `packrepo.library` for programmatic access  \n**Research**: See `research/` directory for evaluation framework and results\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "Advanced repository intelligence system for LLM code analysis with 20-35% improvement in Q&A accuracy",
    "version": "1.0.0",
    "project_urls": {
        "Bug Reports": "https://github.com/scribe-ai/scribe-repo/issues",
        "Documentation": "https://scribe-repo.readthedocs.io",
        "Homepage": "https://github.com/scribe-ai/scribe-repo",
        "ICSE 2025": "https://conf.researchr.org/track/icse-2025/icse-2025-research-track",
        "PyPI Package": "https://pypi.org/project/sibylline-scribe/",
        "Repository": "https://github.com/scribe-ai/scribe-repo",
        "Research Paper": "https://arxiv.org/abs/2024.scribe"
    },
    "split_keywords": [
        "llm",
        " code-analysis",
        " repository-intelligence",
        " submodular-optimization",
        " semantic-search",
        " ai",
        " machine-learning",
        " research",
        " icse",
        " scribe"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "9c122c0b8e7c8de29185fdd09c0014f77c53f33fa91b9e1c6a47abe248771dfa",
                "md5": "2f6264cc77defb426fb8e2ba18670670",
                "sha256": "5f762e3944e05023de956f88c86b363d2b046049b41b9c276691931cdcac302e"
            },
            "downloads": -1,
            "filename": "sibylline_scribe-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2f6264cc77defb426fb8e2ba18670670",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 474134,
            "upload_time": "2025-08-30T13:50:10",
            "upload_time_iso_8601": "2025-08-30T13:50:10.321014Z",
            "url": "https://files.pythonhosted.org/packages/9c/12/2c0b8e7c8de29185fdd09c0014f77c53f33fa91b9e1c6a47abe248771dfa/sibylline_scribe-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d3a14913c4dc3879227ba973e80ad65993ec5cffc9349b39d5e96067f8111398",
                "md5": "7bd6017110d49e9752937de630283092",
                "sha256": "440c32a3f63ab77e3aae1e5a2ae27e2b46e7462d1b3d2824f93eea0c76a1d170"
            },
            "downloads": -1,
            "filename": "sibylline_scribe-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "7bd6017110d49e9752937de630283092",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 544752,
            "upload_time": "2025-08-30T13:50:12",
            "upload_time_iso_8601": "2025-08-30T13:50:12.698918Z",
            "url": "https://files.pythonhosted.org/packages/d3/a1/4913c4dc3879227ba973e80ad65993ec5cffc9349b39d5e96067f8111398/sibylline_scribe-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-30 13:50:12",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "scribe-ai",
    "github_project": "scribe-repo",
    "github_not_found": true,
    "lcname": "sibylline-scribe"
}

None