# Rust Crate Pipeline
A comprehensive system for gathering, enriching, and analyzing Rust crate metadata using AI-powered insights, web scraping, and dependency analysis.
## Overview
The Rust Crate Pipeline is designed to collect, process, and analyze Rust crates with a focus on **trust and security assessment**. It combines web scraping, AI-powered analysis, and cargo testing to provide comprehensive insights into Rust ecosystem packages, with an emphasis on identifying high-quality, trustworthy crates while flagging potentially problematic ones.
The pipeline implements a **three-tier trust verdict system**:
- **ALLOW**: High-quality, widely-used crates that meet strict criteria
- **DENY**: Clearly problematic crates with security issues or poor quality
- **DEFER**: Everything else requiring manual review
## Features
### Core Analysis
- **Trust Verdict System**: Three-tier assessment (ALLOW/DENY/DEFER) with strict criteria
- **Quality Scoring**: Comprehensive evaluation of documentation, usage, and community health
- **Security Analysis**: Automated detection of security advisories and vulnerabilities
- **Deprecation Detection**: Identifies deprecated or abandoned crates
### Data Collection
- **Enhanced Web Scraping**: Automated collection of crate metadata from crates.io using Crawl4AI with Playwright
- **Multi-Source Analysis**: Combines data from crates.io, docs.rs, lib.rs, and GitHub
- **Ecosystem Metrics**: Downloads, reverse dependencies, and community activity analysis
### AI & Processing
- **AI Enrichment**: Local and cloud AI-powered analysis of crate descriptions, features, and documentation
- **Multi-Provider LLM Support**: Unified LLM processor supporting OpenAI, Azure OpenAI, Ollama, LM Studio, and LiteLLM
- **Sentiment Analysis**: Community sentiment and trust assessment
- **License Analysis**: Automated detection of permissive vs restrictive licenses
### Technical Features
- **Cargo Testing**: Automated cargo build, test, and audit execution for comprehensive crate analysis
- **Dependency Analysis**: Deep analysis of crate dependencies and their relationships
- **Batch Processing**: Efficient processing of multiple crates with configurable batch sizes
- **Data Export**: Structured output in JSON format for further analysis
- **RAG Cache**: Intelligent caching with Rule Zero policies and architectural patterns
- **Optional Sanitization**: PII/secret stripping now opt-in via `Sanitizer(enabled=True)`—data remains unaltered by default
- **Robust Serialization**: Built-in helper converts all complex objects to JSON-safe formats (see the sketch after this list)
- **Docker Support**: Containerized deployment with optimized Docker configurations
- **Real-time Progress Monitoring**: CLI-based progress tracking with ASCII status indicators
- **Cross-platform Compatibility**: Full Unicode symbol replacement for better encoding support
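
The serialization helper mentioned above is easiest to picture as a recursive converter. The sketch below is illustrative only, using a hypothetical `to_json_safe` name; the helper shipped with the package may cover more types and edge cases.

```python
from dataclasses import asdict, is_dataclass
from datetime import date, datetime
from pathlib import Path
from typing import Any

def to_json_safe(obj: Any) -> Any:
    """Recursively convert common non-JSON types into JSON-safe values."""
    if obj is None or isinstance(obj, (bool, int, float, str)):
        return obj
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()            # dates become ISO-8601 strings
    if isinstance(obj, Path):
        return str(obj)
    if is_dataclass(obj) and not isinstance(obj, type):
        return to_json_safe(asdict(obj))  # dataclass instances become dicts
    if isinstance(obj, dict):
        return {str(k): to_json_safe(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        return [to_json_safe(v) for v in obj]
    return str(obj)                       # last resort: stringify
```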
## Requirements
- **Python 3.12+**: Required for modern type annotations and language features
- **Git**: For cloning repositories during analysis
- **Cargo**: For Rust crate testing and analysis
- **Playwright**: Automatically installed for enhanced web scraping
## Installation
```bash
# Clone the repository
git clone https://github.com/Superuser666-Sigil/SigilDERG-Data_Production.git
cd SigilDERG-Data_Production
# Install in development mode (includes all dependencies)
pip install -e .
# Install Playwright browsers for enhanced scraping
playwright install
```
### Automatic Dependency Installation
The package automatically installs all required dependencies, including:
- `crawl4ai` for web scraping
- `playwright` for enhanced browser automation
- `requests` for HTTP requests
- `aiohttp` for async operations
- And all other required packages
## Configuration
### Environment Variables
Set the following environment variables for full functionality:
```bash
# GitHub Personal Access Token (required for API access)
export GITHUB_TOKEN="your_github_token_here"
# Azure OpenAI (optional, for cloud AI processing)
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_API_KEY="your_azure_openai_key"
export AZURE_OPENAI_DEPLOYMENT_NAME="your_deployment_name"
export AZURE_OPENAI_API_VERSION="2024-02-15-preview"
# PyPI API Token (optional, for publishing)
export PYPI_API_TOKEN="your_pypi_token"
# LiteLLM Configuration (optional, for multi-provider LLM support)
export LITELLM_MODEL="deepseek-coder:33b"
export LITELLM_BASE_URL="http://localhost:11434" # For Ollama
```
### Configuration File
Create a `config.json` file for custom settings:
```json
{
"batch_size": 10,
"n_workers": 4,
"max_retries": 3,
"checkpoint_interval": 10,
"use_azure_openai": true,
"crawl4ai_config": {
"max_pages": 5,
"concurrency": 2
}
}
```
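
Assuming the JSON keys correspond to `PipelineConfig`'s keyword arguments (as in the programmatic example later in this README), loading the file can look like this:

```python
import json

from rust_crate_pipeline.config import PipelineConfig

# Read config.json and forward its keys to PipelineConfig.
# Assumes each JSON key matches a PipelineConfig keyword argument.
with open("config.json", "r", encoding="utf-8") as f:
    settings = json.load(f)

config = PipelineConfig(**settings)
```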
## Usage
### Command Line Interface
#### Basic Usage
```bash
# Run with default settings
python -m rust_crate_pipeline
# Run with custom batch size
python -m rust_crate_pipeline --batch-size 20
# Run with specific workers
python -m rust_crate_pipeline --workers 8
# Use configuration file
python -m rust_crate_pipeline --config-file config.json
```
#### Advanced Options
```bash
# Enable Azure OpenAI processing
python -m rust_crate_pipeline --enable-azure-openai
# Set custom model path for local AI
python -m rust_crate_pipeline --model-path /path/to/model.gguf
# Configure token limits
python -m rust_crate_pipeline --max-tokens 2048
# Set checkpoint interval
python -m rust_crate_pipeline --checkpoint-interval 5
# Enable verbose logging
python -m rust_crate_pipeline --log-level DEBUG
# Enable enhanced scraping with Playwright
python -m rust_crate_pipeline --enable-enhanced-scraping
# Set output directory for results
python -m rust_crate_pipeline --output-path ./results
```
#### Enhanced Scraping
The pipeline now supports enhanced web scraping using Playwright for better data extraction:
```bash
# Enable enhanced scraping (default)
python -m rust_crate_pipeline --enable-enhanced-scraping
# Use basic scraping only
python -m rust_crate_pipeline --disable-enhanced-scraping
# Configure scraping options
python -m rust_crate_pipeline --scraping-config '{"max_pages": 10, "concurrency": 3}'
```
#### Multi-Provider LLM Support
```bash
# Use OpenAI
python -m rust_crate_pipeline.unified_llm_processor --provider openai --model-name gpt-4
# Use Azure OpenAI
python -m rust_crate_pipeline.unified_llm_processor --provider azure --model-name gpt-4
# Use Ollama (local)
python -m rust_crate_pipeline.unified_llm_processor --provider ollama --model-name deepseek-coder:33b
# Use LM Studio
python -m rust_crate_pipeline.unified_llm_processor --provider openai --base-url http://localhost:1234/v1 --model-name local-model
# Use LiteLLM
python -m rust_crate_pipeline.unified_llm_processor --provider litellm --model-name deepseek-coder:33b
```
#### Production Mode
```bash
# Run production pipeline with optimizations
python run_production.py
# Run with Sigil Protocol integration
python -m rust_crate_pipeline --enable-sigil-protocol
```
### Programmatic Usage
```python
from rust_crate_pipeline import CrateDataPipeline
from rust_crate_pipeline.config import PipelineConfig
# Create configuration
config = PipelineConfig(
batch_size=10,
n_workers=4,
use_azure_openai=True
)
# Initialize pipeline
pipeline = CrateDataPipeline(config)
# Run pipeline
import asyncio
result = asyncio.run(pipeline.run())
```
## Sample Data
### Input: Crate List
The pipeline processes crates from `rust_crate_pipeline/crate_list.txt`:
```
tokio
serde
reqwest
actix-web
clap
```
### Output: Trust Analysis Results
```json
{
"name": "tokio",
"version": "1.35.1",
"description": "An asynchronous runtime for Rust",
"trust_verdict": "ALLOW",
"verdict_reason": "Auto-promoted: High quality with permissive license and high usage",
"quality_score": 8.5,
"ecosystem_metrics": {
"downloads_all_time": 125000000,
"reverse_deps_visible": 45000,
"reverse_deps_direct": 8500
},
"license_analysis": {
"license_type": "MIT",
"is_permissive": true
},
"security_analysis": {
"critical_advisories": 0,
"security_audit": false
},
"deprecation_status": {
"is_deprecated": false,
"last_release": "2024-01-15",
"maintenance_status": "Active"
},
"sentiment_analysis": {
"overall": "positive",
"community_health": "Excellent"
},
"cargo_test_results": {
"build_success": true,
"test_success": true,
"audit_clean": true,
"dependencies": 45
}
}
```
### Example: DEFER Case
```json
{
"name": "regex_macros",
"version": "0.2.0",
"description": "An implementation of statically compiled regular expressions for Rust",
"trust_verdict": "DEFER",
"verdict_reason": "Requires manual review - insufficient evidence for automatic decision",
"quality_score": 5.67,
"ecosystem_metrics": {
"downloads_all_time": 235853,
"reverse_deps_visible": 8,
"reverse_deps_direct": 3
},
"license_analysis": {
"license_type": "MIT/Apache-2.0",
"is_permissive": true
},
"deprecation_status": {
"is_deprecated": true,
"deprecation_reason": "Deprecated in favor of regex crate",
"last_release": "2017-01-01"
}
}
```
## Trust Verdict System
The pipeline implements a sophisticated trust assessment system that automatically evaluates Rust crates based on multiple criteria:
### Auto-ALLOW Criteria
A crate is automatically approved if **ALL** of the following are met:
- **Quality Score ≥ 8.0**: High documentation and code quality
- **Permissive License**: MIT, Apache-2.0, or Unlicense
- **High Usage**: ≥10M downloads AND ≥200 reverse dependencies
- **No Critical Advisories**: No open security vulnerabilities
- **Not Deprecated**: Active maintenance and recent releases
- **Recent Activity**: Development within the last 2 years
### Auto-DENY Criteria
A crate is automatically rejected if **ANY** of the following are true:
- **Critical Security Advisories**: Open security vulnerabilities
- **Extremely Low Quality**: Quality score < 4.0
- **Negative Sentiment + Low Quality**: Negative community sentiment with quality score < 6.0
- **Deprecated/Abandoned**: Marked as deprecated or no activity for 4+ years
- **Very Low Usage**: <1,000 downloads AND <5 reverse dependencies
### DEFER (Manual Review)
All other crates are deferred for manual review, ensuring that only truly high-quality crates receive automatic approval.
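
The criteria above translate into a short decision function. The following is an illustrative sketch, not the IRL engine's actual implementation: field names mirror the sample JSON output, the recency-of-activity checks are omitted, and the real engine weighs these signals with more nuance.

```python
def trust_verdict(crate: dict) -> str:
    """Illustrative three-tier verdict following the documented criteria."""
    quality = crate["quality_score"]
    metrics = crate["ecosystem_metrics"]
    downloads = metrics["downloads_all_time"]
    reverse_deps = metrics["reverse_deps_visible"]
    advisories = crate.get("security_analysis", {}).get("critical_advisories", 0)
    deprecated = crate.get("deprecation_status", {}).get("is_deprecated", False)
    sentiment = crate.get("sentiment_analysis", {}).get("overall", "neutral")
    permissive = crate.get("license_analysis", {}).get("is_permissive", False)

    # Auto-DENY: any single disqualifier rejects the crate.
    if (
        advisories > 0
        or quality < 4.0
        or (sentiment == "negative" and quality < 6.0)
        or deprecated
        or (downloads < 1_000 and reverse_deps < 5)
    ):
        return "DENY"

    # Auto-ALLOW: every criterion must hold (advisories and deprecation
    # are already excluded by the DENY branch above).
    if (
        quality >= 8.0
        and permissive
        and downloads >= 10_000_000
        and reverse_deps >= 200
    ):
        return "ALLOW"

    return "DEFER"  # everything else goes to manual review
```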
## Architecture
### Core Components
- **IRL Engine**: Implements the trust verdict logic and decision making
- **Pipeline Orchestrator**: Manages the overall data processing workflow
- **Web Scraper**: Collects crate metadata using Crawl4AI
- **AI Enricher**: Enhances data with local or cloud AI analysis
- **Cargo Analyzer**: Executes cargo commands for comprehensive testing
- **Data Exporter**: Outputs structured results in various formats
### Data Flow
1. **Input**: Crate names from `crate_list.txt`
2. **Scraping**: Web scraping of crates.io for metadata
3. **Enrichment**: AI-powered analysis and insights
4. **Testing**: Cargo build, test, and audit execution
5. **Output**: Structured JSON with comprehensive crate analysis
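
A minimal sketch of how these stages chain together, using hypothetical stub functions in place of the real scraping, enrichment, and cargo stages:

```python
import asyncio
from typing import Any

# Hypothetical stage stubs; the real pipeline performs actual scraping,
# LLM enrichment, and cargo runs, plus batching and checkpointing.
async def scrape_crates_io(name: str) -> dict[str, Any]:
    return {"name": name}                             # 2. scraping

async def enrich_with_llm(meta: dict[str, Any]) -> dict[str, Any]:
    return {**meta, "quality_score": None}            # 3. enrichment

async def run_cargo_checks(name: str) -> dict[str, Any]:
    return {"build_success": None}                    # 4. testing

async def analyze_crate(name: str) -> dict[str, Any]:
    meta = await scrape_crates_io(name)
    enriched = await enrich_with_llm(meta)
    tests = await run_cargo_checks(name)
    return {**enriched, "cargo_test_results": tests}  # 5. output

async def main() -> None:
    names = ["tokio", "serde"]  # 1. input (normally from crate_list.txt)
    results = await asyncio.gather(*(analyze_crate(n) for n in names))
    print(results)

if __name__ == "__main__":
    asyncio.run(main())
```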
## Development
### Prerequisites
- Python 3.12+ (required for modern type annotations)
- Git for version control
- Cargo for Rust crate testing
### Running Tests
```bash
# Run all tests
pytest tests/
# Run specific test module
pytest tests/test_main_integration.py
# Run with coverage
pytest --cov=rust_crate_pipeline tests/
# Run type checking
pyright rust_crate_pipeline/
# Run linting
flake8 rust_crate_pipeline/
```
### Code Quality
```bash
# Format code
black rust_crate_pipeline/
# Sort imports
isort rust_crate_pipeline/
# Type checking
pyright rust_crate_pipeline/
# Lint code
flake8 rust_crate_pipeline/
```
### Building and Publishing
```bash
# Build package
python -m build
# Upload to PyPI (requires PYPI_API_TOKEN)
python -m twine upload dist/*
# Create release
python scripts/create_release.py
```
### Docker Development
```bash
# Build Docker image
docker build -t rust-crate-pipeline .
# Run in Docker
docker run -it rust-crate-pipeline
# Run with volume mount for development
docker run -it -v $(pwd):/app rust-crate-pipeline
```
## Recent Improvements
### Version 1.4.6 (Latest)
- **Trust Verdict System**: Implemented sophisticated three-tier trust assessment (ALLOW/DENY/DEFER)
- **Stricter Auto-Promotion**: Quality threshold increased to 8.0, usage thresholds to 10M downloads/200 reverse deps
- **Auto-Deny Logic**: Automatic rejection of problematic crates with security issues or poor quality
- **Deprecation Detection**: Automated identification of deprecated or abandoned crates
- **Enhanced Sentiment Analysis**: Realistic community sentiment assessment
- **Fixed Parameter Ordering**: Corrected IRL engine parameter passing for accurate decision making
### Version 1.4.5
- **Sanitization Toggle** – default off; opt-in for redaction.
- **JSON Serializer** – universal helper prevents non-serializable errors.
- **Version bump & Docker update**
### Version 1.4.0
- **Security**: Robust Ed25519/RSA cryptographic signing and provenance
- **Automation**: Automated RAG and provenance workflows
- **CI/CD**: Improved GitHub Actions for validation and publishing
- **Docker**: Updated Docker image and compose for new version
- **Bug Fixes**: Workflow and validation fixes for Ed25519
### Version 1.3.6
- **Python 3.12+ Requirement**: Updated to use modern type annotations and language features
- **Type Safety**: Enhanced type annotations throughout the codebase with modern syntax
- **Build System**: Updated pyproject.toml and setup.py for better compatibility
### Version 1.3.5
- **Enhanced Web Scraping**: Added Playwright-based scraping for better data extraction
- **Unicode Compatibility**: Replaced all Unicode symbols with ASCII equivalents for better cross-platform support
- **Automatic Dependencies**: All required packages are now automatically installed
- **Real-time Progress**: Added CLI-based progress monitoring with ASCII status indicators
- **Docker Optimization**: Updated Dockerfile to include Playwright browser installation
### Version 1.3.4
- **PEP8 Compliance**: Fixed all Unicode emoji and symbols for better encoding support
- **Cross-platform Compatibility**: Improved compatibility across different operating systems
- **Type Safety**: Enhanced type annotations throughout the codebase
### Version 1.3.3
- **Real-time Progress Monitoring**: Added CLI-only progress tracking feature
- **Enhanced Logging**: Improved status reporting and error handling
### Version 1.3.2
- **Multi-Provider LLM Support**: Added support for OpenAI, Azure OpenAI, Ollama, LM Studio, and LiteLLM
- **Unified LLM Processor**: Centralized LLM processing with provider abstraction
- **Enhanced Error Handling**: Better error recovery and retry mechanisms
## License
MIT License - see LICENSE file for details.
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Submit a pull request
## Support
For issues and questions:
- GitHub Issues: https://github.com/Superuser666-Sigil/SigilDERG-Data_Production/issues
- Documentation: https://github.com/Superuser666-Sigil/SigilDERG-Data_Production#readme
## API Compliance & Attribution
### crates.io and GitHub API Usage
- This project accesses crates.io and GitHub APIs for data gathering and verification.
- **User-Agent:** All requests use:
`SigilDERG-Data-Production (Superuser666-Sigil; miragemodularframework@gmail.com; https://github.com/Superuser666-Sigil/SigilDERG-Data_Production)`
- **Contact:** miragemodularframework@gmail.com
- **GitHub:** [Superuser666-Sigil/SigilDERG-Data_Production](https://github.com/Superuser666-Sigil/SigilDERG-Data_Production)
- The project respects all rate limits and crawler policies. If you have questions or concerns, please contact us.
### Crawl4AI Attribution
This project uses [Crawl4AI](https://github.com/unclecode/crawl4ai) for web data extraction.
<!-- Badge Attribution (Disco Theme) -->
<a href="https://github.com/unclecode/crawl4ai">
<img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-disco.svg" alt="Powered by Crawl4AI" width="200"/>
</a>
Or, text attribution:
```
This project uses Crawl4AI (https://github.com/unclecode/crawl4ai) for web data extraction.
```
## 🚀 Unified, Cross-Platform, Multi-Provider LLM Support
This project supports **all major LLM providers** (cloud and local) on **Mac, Linux, and Windows** using a single, unified interface. All LLM calls are routed through the `UnifiedLLMProcessor` and `LLMConfig` abstractions, ensuring:
- **One code path for all providers:** Azure OpenAI, OpenAI, Anthropic, Google, Cohere, HuggingFace, Ollama, LM Studio, and any OpenAI-compatible endpoint.
- **Cross-platform compatibility:** Works out of the box on Mac, Linux, and Windows.
- **Configurable via CLI and config files:** Select provider, model, API key, endpoint, and provider-specific options at runtime.
- **Easy extensibility:** Add new providers by updating your config or CLI arguments, with no code changes needed (a hypothetical sketch follows this list).
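
As a rough picture of that unified interface, the sketch below constructs a config and processor. The class names come from this README, but the import path and constructor arguments are assumptions inferred from the CLI flags and environment variables shown elsewhere; check the package source for the real signatures.

```python
from rust_crate_pipeline.unified_llm_processor import LLMConfig, UnifiedLLMProcessor

# All argument names below are assumptions, mirroring --llm-provider,
# --llm-model, and LITELLM_BASE_URL; the actual fields may differ.
config = LLMConfig(
    provider="ollama",
    model="deepseek-coder:33b",
    base_url="http://localhost:11434",
)
processor = UnifiedLLMProcessor(config)
```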
### 📖 Provider Setup & Usage
- See [`README_LLM_PROVIDERS.md`](./README_LLM_PROVIDERS.md) for full details, setup instructions, and usage examples for every supported provider.
- Run `python run_pipeline_with_llm.py --help` for CLI options and provider-specific arguments.
### 🧩 Example Usage
```bash
# Azure OpenAI
python run_pipeline_with_llm.py --llm-provider azure --llm-model gpt-4o --crates tokio
# Ollama (local)
python run_pipeline_with_llm.py --llm-provider ollama --llm-model llama2 --crates serde
# OpenAI API
python run_pipeline_with_llm.py --llm-provider openai --llm-model gpt-4 --llm-api-key YOUR_KEY --crates tokio
# Anthropic Claude
python run_pipeline_with_llm.py --llm-provider anthropic --llm-model claude-3-sonnet --llm-api-key YOUR_KEY --crates serde
```
### 🔒 Security & Best Practices
- Store API keys as environment variables.
- Use local providers (Ollama, LM Studio) for full privacy—no data leaves your machine.
- All LLM calls are routed through a single, auditable interface for maximum maintainability and security.
### 🧪 Testing
- Run `python test_unified_llm.py` to verify provider support and configuration.
For more, see [`README_LLM_PROVIDERS.md`](./README_LLM_PROVIDERS.md) and the CLI help output.
## Public RAG Database Hash Verification
The canonical hash of the RAG SQLite database (`sigil_rag_cache.db`) is stored in the public file `sigil_rag_cache.hash`.
- **Purpose:** Anyone can verify the integrity of the RAG database by comparing its SHA256 hash to the value in `sigil_rag_cache.hash`.
- **How to verify:**
```sh
python audits/validate_db_hash.py --db sigil_rag_cache.db --expected-hash "$(cat sigil_rag_cache.hash)"
```
- **CI/CD:** The GitHub Actions workflow `.github/workflows/validate-db-hash.yml` automatically checks this on every push.
- **No secrets required:** The hash is public and verifiable by anyone.
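
The same check can be reproduced with only the Python standard library, assuming `sigil_rag_cache.hash` contains the bare hex digest:

```python
import hashlib
from pathlib import Path

# Recompute the database digest and compare it with the published hash.
actual = hashlib.sha256(Path("sigil_rag_cache.db").read_bytes()).hexdigest()
expected = Path("sigil_rag_cache.hash").read_text().strip()
assert actual == expected, "RAG database hash mismatch"
```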