# rust-crate-pipeline

- **Name:** rust-crate-pipeline
- **Version:** 1.4.6
- **Home page:** https://github.com/SigilDERG/rust-crate-pipeline
- **Summary:** A comprehensive pipeline for analyzing Rust crates with AI enrichment and enhanced scraping
- **Upload time:** 2025-07-12 20:10:00
- **Author:** SigilDERG Team
- **Requires Python:** >=3.12
- **License:** MIT
- **Keywords:** rust, crates, analysis, ai, pipeline, scraping
# Rust Crate Pipeline

A comprehensive system for gathering, enriching, and analyzing metadata for Rust crates using AI-powered insights, web scraping, and dependency analysis.

## Overview

The Rust Crate Pipeline is designed to collect, process, and analyze Rust crates with a focus on **trust and security assessment**. It combines web scraping, AI-powered analysis, and cargo testing to provide comprehensive insights into Rust ecosystem packages, with an emphasis on identifying high-quality, trustworthy crates while flagging potentially problematic ones.

The pipeline implements a **three-tier trust verdict system**:
- **ALLOW**: High-quality, widely-used crates that meet strict criteria
- **DENY**: Clearly problematic crates with security issues or poor quality
- **DEFER**: Everything else requiring manual review

## Features

### Core Analysis
- **Trust Verdict System**: Three-tier assessment (ALLOW/DENY/DEFER) with strict criteria
- **Quality Scoring**: Comprehensive evaluation of documentation, usage, and community health
- **Security Analysis**: Automated detection of security advisories and vulnerabilities
- **Deprecation Detection**: Identifies deprecated or abandoned crates

### Data Collection
- **Enhanced Web Scraping**: Automated collection of crate metadata from crates.io using Crawl4AI with Playwright
- **Multi-Source Analysis**: Combines data from crates.io, docs.rs, lib.rs, and GitHub
- **Ecosystem Metrics**: Downloads, reverse dependencies, and community activity analysis

### AI & Processing
- **AI Enrichment**: Local and cloud AI-powered analysis of crate descriptions, features, and documentation
- **Multi-Provider LLM Support**: Unified LLM processor supporting OpenAI, Azure OpenAI, Ollama, LM Studio, and LiteLLM
- **Sentiment Analysis**: Community sentiment and trust assessment
- **License Analysis**: Automated detection of permissive vs restrictive licenses

### Technical Features
- **Cargo Testing**: Automated cargo build, test, and audit execution for comprehensive crate analysis
- **Dependency Analysis**: Deep analysis of crate dependencies and their relationships
- **Batch Processing**: Efficient processing of multiple crates with configurable batch sizes
- **Data Export**: Structured output in JSON format for further analysis
- **RAG Cache**: Intelligent caching with Rule Zero policies and architectural patterns
- **Optional Sanitization**: PII/secret stripping is opt-in via `Sanitizer(enabled=True)`; data remains unaltered by default (see the sketch after this list)
- **Robust Serialization**: Built-in helper converts all complex objects to JSON-safe formats
- **Docker Support**: Containerized deployment with optimized Docker configurations
- **Real-time Progress Monitoring**: CLI-based progress tracking with ASCII status indicators
- **Cross-platform Compatibility**: Full Unicode symbol replacement for better encoding support
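
A minimal sketch of the sanitization toggle, for illustration only: `Sanitizer(enabled=True)` is the documented entry point, while the import path and the `sanitize()` method name here are assumptions.

```python
# Hedged sketch of the opt-in sanitization toggle. Only Sanitizer(enabled=True)
# is documented above; the import path and sanitize() method are assumptions.
from rust_crate_pipeline.sanitization import Sanitizer  # hypothetical module path

record = {"readme": "Maintainer e-mail: alice@example.com"}

default = Sanitizer()                # disabled: data passes through unchanged
redacting = Sanitizer(enabled=True)  # opt in to PII/secret stripping

clean = redacting.sanitize(record)   # hypothetical method name
```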

## Requirements

- **Python 3.12+**: Required for modern type annotations and language features
- **Git**: For cloning repositories during analysis
- **Cargo**: For Rust crate testing and analysis
- **Playwright**: Automatically installed for enhanced web scraping

## Installation

```bash
# Clone the repository
git clone https://github.com/Superuser666-Sigil/SigilDERG-Data_Production.git
cd SigilDERG-Data_Production

# Install in development mode (includes all dependencies)
pip install -e .

# Install Playwright browsers for enhanced scraping
playwright install
```

### Automatic Dependency Installation

The package automatically installs all required dependencies, including:
- `crawl4ai` for web scraping
- `playwright` for enhanced browser automation
- `requests` for HTTP requests
- `aiohttp` for async operations
- And all other required packages

## Configuration

### Environment Variables

Set the following environment variables for full functionality:

```bash
# GitHub Personal Access Token (required for API access)
export GITHUB_TOKEN="your_github_token_here"

# Azure OpenAI (optional, for cloud AI processing)
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_API_KEY="your_azure_openai_key"
export AZURE_OPENAI_DEPLOYMENT_NAME="your_deployment_name"
export AZURE_OPENAI_API_VERSION="2024-02-15-preview"

# PyPI API Token (optional, for publishing)
export PYPI_API_TOKEN="your_pypi_token"

# LiteLLM Configuration (optional, for multi-provider LLM support)
export LITELLM_MODEL="deepseek-coder:33b"
export LITELLM_BASE_URL="http://localhost:11434"  # For Ollama
```

### Configuration File

Create a `config.json` file for custom settings:

```json
{
    "batch_size": 10,
    "n_workers": 4,
    "max_retries": 3,
    "checkpoint_interval": 10,
    "use_azure_openai": true,
    "crawl4ai_config": {
        "max_pages": 5,
        "concurrency": 2
    }
}
```
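
A hedged sketch of loading this file programmatically: `batch_size` and `n_workers` appear as `PipelineConfig` keyword arguments in the programmatic example below; passing the remaining keys the same way is an assumption.

```python
# Load config.json into a PipelineConfig. Assumes the config class accepts
# these keys as keyword arguments (only batch_size and n_workers are
# confirmed by the programmatic example later in this README).
import json

from rust_crate_pipeline.config import PipelineConfig

with open("config.json", encoding="utf-8") as f:
    settings = json.load(f)

config = PipelineConfig(**settings)
print(config.batch_size)  # 10
```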

## Usage

### Command Line Interface

#### Basic Usage

```bash
# Run with default settings
python -m rust_crate_pipeline

# Run with custom batch size
python -m rust_crate_pipeline --batch-size 20

# Run with specific workers
python -m rust_crate_pipeline --workers 8

# Use configuration file
python -m rust_crate_pipeline --config-file config.json
```

#### Advanced Options

```bash
# Enable Azure OpenAI processing
python -m rust_crate_pipeline --enable-azure-openai

# Set custom model path for local AI
python -m rust_crate_pipeline --model-path /path/to/model.gguf

# Configure token limits
python -m rust_crate_pipeline --max-tokens 2048

# Set checkpoint interval
python -m rust_crate_pipeline --checkpoint-interval 5

# Enable verbose logging
python -m rust_crate_pipeline --log-level DEBUG

# Enable enhanced scraping with Playwright
python -m rust_crate_pipeline --enable-enhanced-scraping

# Set output directory for results
python -m rust_crate_pipeline --output-path ./results
```

#### Enhanced Scraping

The pipeline now supports enhanced web scraping using Playwright for better data extraction:

```bash
# Enable enhanced scraping (default)
python -m rust_crate_pipeline --enable-enhanced-scraping

# Use basic scraping only
python -m rust_crate_pipeline --disable-enhanced-scraping

# Configure scraping options
python -m rust_crate_pipeline --scraping-config '{"max_pages": 10, "concurrency": 3}'
```
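
For orientation, here is a standalone Crawl4AI invocation of the kind the scraper builds on. This is not the pipeline's internal scraper, just the public `AsyncWebCrawler` API applied to a crates.io page.

```python
# Standalone Crawl4AI sketch (not the pipeline's own scraper): fetch a
# crates.io crate page and print the extracted markdown.
import asyncio

from crawl4ai import AsyncWebCrawler

async def scrape(crate: str) -> str:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=f"https://crates.io/crates/{crate}")
        return str(result.markdown)

print(asyncio.run(scrape("tokio"))[:500])
```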

#### Multi-Provider LLM Support

```bash
# Use OpenAI
python -m rust_crate_pipeline.unified_llm_processor --provider openai --model-name gpt-4

# Use Azure OpenAI
python -m rust_crate_pipeline.unified_llm_processor --provider azure --model-name gpt-4

# Use Ollama (local)
python -m rust_crate_pipeline.unified_llm_processor --provider ollama --model-name deepseek-coder:33b

# Use LM Studio
python -m rust_crate_pipeline.unified_llm_processor --provider openai --base-url http://localhost:1234/v1 --model-name local-model

# Use LiteLLM
python -m rust_crate_pipeline.unified_llm_processor --provider litellm --model-name deepseek-coder:33b
```

#### Production Mode

```bash
# Run production pipeline with optimizations
python run_production.py

# Run with Sigil Protocol integration
python -m rust_crate_pipeline --enable-sigil-protocol
```

### Programmatic Usage

```python
from rust_crate_pipeline import CrateDataPipeline
from rust_crate_pipeline.config import PipelineConfig

# Create configuration
config = PipelineConfig(
    batch_size=10,
    n_workers=4,
    use_azure_openai=True
)

# Initialize pipeline
pipeline = CrateDataPipeline(config)

# Run pipeline
import asyncio
result = asyncio.run(pipeline.run())
```

## Sample Data

### Input: Crate List

The pipeline processes crates from `rust_crate_pipeline/crate_list.txt`:

```
tokio
serde
reqwest
actix-web
clap
```

### Output: Trust Analysis Results

```json
{
    "name": "tokio",
    "version": "1.35.1",
    "description": "An asynchronous runtime for Rust",
    "trust_verdict": "ALLOW",
    "verdict_reason": "Auto-promoted: High quality with permissive license and high usage",
    "quality_score": 8.5,
    "ecosystem_metrics": {
        "downloads_all_time": 125000000,
        "reverse_deps_visible": 45000,
        "reverse_deps_direct": 8500
    },
    "license_analysis": {
        "license_type": "MIT",
        "is_permissive": true
    },
    "security_analysis": {
        "critical_advisories": 0,
        "security_audit": false
    },
    "deprecation_status": {
        "is_deprecated": false,
        "last_release": "2024-01-15",
        "maintenance_status": "Active"
    },
    "sentiment_analysis": {
        "overall": "positive",
        "community_health": "Excellent"
    },
    "cargo_test_results": {
        "build_success": true,
        "test_success": true,
        "audit_clean": true,
        "dependencies": 45
    }
}
```

### Example: DEFER Case

```json
{
    "name": "regex_macros",
    "version": "0.2.0",
    "description": "An implementation of statically compiled regular expressions for Rust",
    "trust_verdict": "DEFER",
    "verdict_reason": "Requires manual review - insufficient evidence for automatic decision",
    "quality_score": 5.67,
    "ecosystem_metrics": {
        "downloads_all_time": 235853,
        "reverse_deps_visible": 8,
        "reverse_deps_direct": 3
    },
    "license_analysis": {
        "license_type": "MIT/Apache-2.0",
        "is_permissive": true
    },
    "deprecation_status": {
        "is_deprecated": true,
        "deprecation_reason": "Deprecated in favor of regex crate",
        "last_release": "2017-01-01"
    }
}
```

## Trust Verdict System

The pipeline implements a sophisticated trust assessment system that automatically evaluates Rust crates based on multiple criteria:

### Auto-ALLOW Criteria
A crate is automatically approved if **ALL** of the following are met:
- **Quality Score ≥ 8.0**: High documentation and code quality
- **Permissive License**: MIT, Apache-2.0, or Unlicense
- **High Usage**: ≥10M downloads AND ≥200 reverse dependencies
- **No Critical Advisories**: No open security vulnerabilities
- **Not Deprecated**: Active maintenance and recent releases
- **Recent Activity**: Development within the last 2 years

### Auto-DENY Criteria
A crate is automatically rejected if **ANY** of the following are true:
- **Critical Security Advisories**: Open security vulnerabilities
- **Extremely Low Quality**: Quality score < 4.0
- **Negative Sentiment + Low Quality**: Negative community sentiment with quality score < 6.0
- **Deprecated/Abandoned**: Marked as deprecated or no activity for 4+ years
- **Very Low Usage**: <1,000 downloads AND <5 reverse dependencies

### DEFER (Manual Review)
All other crates are deferred for manual review, ensuring that only truly high-quality crates receive automatic approval.
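The decision logic reduces to a short rule cascade. The sketch below is an illustrative reimplementation of the criteria listed above, not the IRL engine's actual code; the dict keys are modeled on the sample output fields and are assumptions.

```python
# Illustrative reimplementation of the verdict rules above; field names are
# modeled on the sample JSON output, not taken from the IRL engine itself.
def trust_verdict(c: dict) -> str:
    # Auto-DENY: any single red flag rejects the crate
    if (c["critical_advisories"] > 0
            or c["quality_score"] < 4.0
            or (c["sentiment"] == "negative" and c["quality_score"] < 6.0)
            or c["is_deprecated"]
            or c["years_since_activity"] >= 4
            or (c["downloads_all_time"] < 1_000 and c["reverse_deps"] < 5)):
        return "DENY"
    # Auto-ALLOW: every criterion must hold (advisories and deprecation are
    # already excluded by the DENY branch above)
    if (c["quality_score"] >= 8.0
            and c["is_permissive_license"]
            and c["downloads_all_time"] >= 10_000_000
            and c["reverse_deps"] >= 200
            and c["years_since_activity"] < 2):
        return "ALLOW"
    return "DEFER"  # everything else goes to manual review
```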

## Architecture

### Core Components

- **IRL Engine**: Implements the trust verdict logic and decision making
- **Pipeline Orchestrator**: Manages the overall data processing workflow
- **Web Scraper**: Collects crate metadata using Crawl4AI
- **AI Enricher**: Enhances data with local or cloud AI analysis
- **Cargo Analyzer**: Executes cargo commands for comprehensive testing
- **Data Exporter**: Outputs structured results in various formats

### Data Flow

1. **Input**: Crate names from `crate_list.txt`
2. **Scraping**: Web scraping of crates.io for metadata
3. **Enrichment**: AI-powered analysis and insights
4. **Testing**: Cargo build, test, and audit execution
5. **Output**: Structured JSON with comprehensive crate analysis
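
The flow can be pictured as the following skeleton. The stage functions are stubbed placeholders, assumptions standing in for the pipeline's internal components, not its real API.

```python
# Conceptual sketch of the five-stage flow with stubbed stages; the real
# pipeline's function names and signatures are internal and may differ.
import asyncio
import json

async def scrape_crates_io(name: str) -> dict:   # 2. Scraping (stub)
    return {"name": name}

async def enrich_with_llm(meta: dict) -> dict:   # 3. Enrichment (stub)
    return {**meta, "ai_summary": "..."}

def run_cargo_checks(name: str) -> dict:         # 4. Testing (stub)
    return {"build_success": True}

async def main() -> None:
    names = ["tokio", "serde"]                   # 1. Input (from crate_list.txt)
    results = []
    for name in names:
        meta = await scrape_crates_io(name)
        meta = await enrich_with_llm(meta)
        meta["cargo_test_results"] = run_cargo_checks(name)
        results.append(meta)
    print(json.dumps(results, indent=2))         # 5. Output: structured JSON

asyncio.run(main())
```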

## Development

### Prerequisites

- Python 3.12+ (required for modern type annotations)
- Git for version control
- Cargo for Rust crate testing

### Running Tests

```bash
# Run all tests
pytest tests/

# Run specific test module
pytest tests/test_main_integration.py

# Run with coverage
pytest --cov=rust_crate_pipeline tests/

# Run type checking
pyright rust_crate_pipeline/

# Run linting
flake8 rust_crate_pipeline/
```

### Code Quality

```bash
# Format code
black rust_crate_pipeline/

# Sort imports
isort rust_crate_pipeline/

# Type checking
pyright rust_crate_pipeline/

# Lint code
flake8 rust_crate_pipeline/
```

### Building and Publishing

```bash
# Build package
python -m build

# Upload to PyPI (requires PYPI_API_TOKEN)
python -m twine upload dist/*

# Create release
python scripts/create_release.py
```

### Docker Development

```bash
# Build Docker image
docker build -t rust-crate-pipeline .

# Run in Docker
docker run -it rust-crate-pipeline

# Run with volume mount for development
docker run -it -v $(pwd):/app rust-crate-pipeline
```

## Recent Improvements

### Version 1.4.6 (Latest)
- **Trust Verdict System**: Implemented sophisticated three-tier trust assessment (ALLOW/DENY/DEFER)
- **Stricter Auto-Promotion**: Quality threshold increased to 8.0, usage thresholds to 10M downloads/200 reverse deps
- **Auto-Deny Logic**: Automatic rejection of problematic crates with security issues or poor quality
- **Deprecation Detection**: Automated identification of deprecated or abandoned crates
- **Enhanced Sentiment Analysis**: Realistic community sentiment assessment
- **Fixed Parameter Ordering**: Corrected IRL engine parameter passing for accurate decision making

### Version 1.4.5
- **Sanitization Toggle** – default off; opt-in for redaction.
- **JSON Serializer** – universal helper prevents non-serializable errors.
- **Version bump & Docker update**

### Version 1.4.0
- **Security**: Robust Ed25519/RSA cryptographic signing and provenance
- **Automation**: Automated RAG and provenance workflows
- **CI/CD**: Improved GitHub Actions for validation and publishing
- **Docker**: Updated Docker image and compose for new version
- **Bug Fixes**: Workflow and validation fixes for Ed25519

### Version 1.3.6
- **Python 3.12+ Requirement**: Updated to use modern type annotations and language features
- **Type Safety**: Enhanced type annotations throughout the codebase with modern syntax
- **Build System**: Updated pyproject.toml and setup.py for better compatibility

### Version 1.3.5
- **Enhanced Web Scraping**: Added Playwright-based scraping for better data extraction
- **Unicode Compatibility**: Replaced all Unicode symbols with ASCII equivalents for better cross-platform support
- **Automatic Dependencies**: All required packages are now automatically installed
- **Real-time Progress**: Added CLI-based progress monitoring with ASCII status indicators
- **Docker Optimization**: Updated Dockerfile to include Playwright browser installation

### Version 1.3.4
- **PEP8 Compliance**: Fixed all Unicode emoji and symbols for better encoding support
- **Cross-platform Compatibility**: Improved compatibility across different operating systems
- **Type Safety**: Enhanced type annotations throughout the codebase

### Version 1.3.3
- **Real-time Progress Monitoring**: Added CLI-only progress tracking feature
- **Enhanced Logging**: Improved status reporting and error handling

### Version 1.3.2
- **Multi-Provider LLM Support**: Added support for OpenAI, Azure OpenAI, Ollama, LM Studio, and LiteLLM
- **Unified LLM Processor**: Centralized LLM processing with provider abstraction
- **Enhanced Error Handling**: Better error recovery and retry mechanisms

## License

MIT License - see LICENSE file for details.

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Submit a pull request

## Support

For issues and questions:
- GitHub Issues: https://github.com/Superuser666-Sigil/SigilDERG-Data_Production/issues
- Documentation: https://github.com/Superuser666-Sigil/SigilDERG-Data_Production#readme 

## API Compliance & Attribution

### crates.io and GitHub API Usage
- This project accesses crates.io and GitHub APIs for data gathering and verification.
- **User-Agent:** All requests use:
  
  `SigilDERG-Data-Production (Superuser666-Sigil; miragemodularframework@gmail.com; https://github.com/Superuser666-Sigil/SigilDERG-Data_Production)`
- **Contact:** miragemodularframework@gmail.com
- **GitHub:** [Superuser666-Sigil/SigilDERG-Data_Production](https://github.com/Superuser666-Sigil/SigilDERG-Data_Production)
- The project respects all rate limits and crawler policies. If you have questions or concerns, please contact us.
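
As a minimal sketch of this policy in practice, the documented User-Agent can be attached to a request against the public crates.io API (the endpoint shown is the standard `/api/v1/crates/{name}` route, not something specific to this project):

```python
# Attach the project's documented User-Agent to a crates.io API call.
import requests

USER_AGENT = (
    "SigilDERG-Data-Production (Superuser666-Sigil; "
    "miragemodularframework@gmail.com; "
    "https://github.com/Superuser666-Sigil/SigilDERG-Data_Production)"
)

resp = requests.get(
    "https://crates.io/api/v1/crates/tokio",
    headers={"User-Agent": USER_AGENT},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["crate"]["downloads"])
```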

### Crawl4AI Attribution
This project uses [Crawl4AI](https://github.com/unclecode/crawl4ai) for web data extraction.

<!-- Badge Attribution (Disco Theme) -->
<a href="https://github.com/unclecode/crawl4ai">
  <img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-disco.svg" alt="Powered by Crawl4AI" width="200"/>
</a>

Or, text attribution:

```
This project uses Crawl4AI (https://github.com/unclecode/crawl4ai) for web data extraction.
```

## 🚀 Unified, Cross-Platform, Multi-Provider LLM Support

This project supports **all major LLM providers** (cloud and local) on **Mac, Linux, and Windows** using a single, unified interface. All LLM calls are routed through the `UnifiedLLMProcessor` and `LLMConfig` abstractions, ensuring:

- **One code path for all providers:** Azure OpenAI, OpenAI, Anthropic, Google, Cohere, HuggingFace, Ollama, LM Studio, and any OpenAI-compatible endpoint.
- **Cross-platform compatibility:** Works out of the box on Mac, Linux, and Windows.
- **Configurable via CLI and config files:** Select provider, model, API key, endpoint, and provider-specific options at runtime.
- **Easy extensibility:** Add new providers by updating your config or CLI arguments—no code changes needed.
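
A hedged sketch of what this looks like in code: `UnifiedLLMProcessor` and `LLMConfig` are named above and the module path matches the CLI entry point, but the constructor parameters below are assumptions modeled on the CLI flags, not a confirmed signature.

```python
# Hedged sketch of the unified interface. Class names and module path come
# from this README; the constructor parameters are assumptions modeled on
# the CLI flags (--llm-provider, --llm-model, --base-url).
from rust_crate_pipeline.unified_llm_processor import LLMConfig, UnifiedLLMProcessor

config = LLMConfig(
    provider="ollama",                  # or "openai", "azure", "anthropic", ...
    model="deepseek-coder:33b",
    base_url="http://localhost:11434",  # local Ollama endpoint
)
processor = UnifiedLLMProcessor(config)  # one code path for every provider
```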

### 📖 Provider Setup & Usage
- See [`README_LLM_PROVIDERS.md`](./README_LLM_PROVIDERS.md) for full details, setup instructions, and usage examples for every supported provider.
- Run `python run_pipeline_with_llm.py --help` for CLI options and provider-specific arguments.

### 🧩 Example Usage
```bash
# Azure OpenAI
python run_pipeline_with_llm.py --llm-provider azure --llm-model gpt-4o --crates tokio

# Ollama (local)
python run_pipeline_with_llm.py --llm-provider ollama --llm-model llama2 --crates serde

# OpenAI API
python run_pipeline_with_llm.py --llm-provider openai --llm-model gpt-4 --llm-api-key YOUR_KEY --crates tokio

# Anthropic Claude
python run_pipeline_with_llm.py --llm-provider anthropic --llm-model claude-3-sonnet --llm-api-key YOUR_KEY --crates serde
```

### 🔒 Security & Best Practices
- Store API keys as environment variables.
- Use local providers (Ollama, LM Studio) for full privacy—no data leaves your machine.
- All LLM calls are routed through a single, auditable interface for maximum maintainability and security.

### 🧪 Testing
- Run `python test_unified_llm.py` to verify provider support and configuration.

For more, see [`README_LLM_PROVIDERS.md`](./README_LLM_PROVIDERS.md) and the CLI help output. 

## Public RAG Database Hash Verification

The canonical hash of the RAG SQLite database (`sigil_rag_cache.db`) is stored in the public file `sigil_rag_cache.hash`.

- **Purpose:** Anyone can verify the integrity of the RAG database by comparing its SHA256 hash to the value in `sigil_rag_cache.hash`.
- **How to verify:**

```sh
python audits/validate_db_hash.py --db sigil_rag_cache.db --expected-hash "$(cat sigil_rag_cache.hash)"
```

- **CI/CD:** The GitHub Actions workflow `.github/workflows/validate-db-hash.yml` automatically checks this on every push.
- **No secrets required:** The hash is public and verifiable by anyone.
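
An equivalent manual check, assuming `sigil_rag_cache.hash` holds the bare hex digest:

```python
# Hash the RAG database with SHA256 and compare against the published value.
import hashlib
from pathlib import Path

digest = hashlib.sha256(Path("sigil_rag_cache.db").read_bytes()).hexdigest()
expected = Path("sigil_rag_cache.hash").read_text().strip()
print("OK" if digest == expected else f"MISMATCH: {digest}")
```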

            
