# HalluNox
**Confidence-Aware Routing for Large Language Model Reliability Enhancement**
A Python package implementing a multi-signal approach to pre-generation hallucination mitigation for Large Language Models. HalluNox combines semantic alignment measurement, internal convergence analysis, and learned confidence estimation to produce unified confidence scores for proactive routing decisions.
## ✨ Features
- **🎯 Pre-generation Hallucination Detection**: Assess model reliability before generation begins
- **🔄 Confidence-Aware Routing**: Automatically route queries based on estimated confidence
- **🧠 Multi-Signal Approach**: Combines semantic alignment, internal convergence, and learned confidence
- **⚡ Multi-Model Support**: Llama-3.2-3B-Instruct and MedGemma-4B-IT architectures
- **🏥 Medical Domain Specialization**: Enhanced MedGemma 4B-IT support with medical-grade confidence thresholds
- **🖼️ Multimodal Capabilities**: Image analysis and response generation for MedGemma models
- **📊 Comprehensive Evaluation**: Built-in metrics and routing strategy analysis
- **🚀 Easy Integration**: Simple API for both training and inference
- **🏃‍♂️ Performance Optimizations**: Optional LLM loading for faster initialization and lower memory usage
- **📝 Enhanced Query-Context**: Improved accuracy with structured prompt formatting
- **🎛️ Adaptive Thresholds**: Dynamic confidence thresholds based on model type (0.62 for medical, 0.65 for general)
- **💬 Response Generation**: Built-in response generation with confidence-gated output
- **🔧 Automatic Model Management**: Auto-download and configuration for supported models
## 🔬 Research Foundation
Based on the research paper "Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation" by Nandakishor M (Convai Innovations).
The approach implements deterministic routing to appropriate response pathways (a minimal routing sketch follows the threshold lists below):
### General Models (Llama-3.2-3B)
- **High Confidence (≥0.65)**: Local generation
- **Medium Confidence (0.60-0.65)**: Retrieval-augmented generation
- **Low Confidence (0.40-0.60)**: Route to larger models
- **Very Low Confidence (<0.40)**: Human review required
### Medical Models (MedGemma-4B-IT)
- **High Medical Confidence (≥0.60)**: Local generation with medical validation
- **Medium Medical Confidence (0.55-0.60)**: Medical literature verification required
- **Low Medical Confidence (0.50-0.55)**: Professional medical verification required
- **Very Low Medical Confidence (<0.50)**: Seek professional medical advice
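As an illustration, these tiers could be expressed as a small routing helper. This is a minimal sketch: the function names and the undocumented action labels are hypothetical, while `LOCAL_GENERATION` and `RAG_RETRIEVAL` mirror routing actions that appear in the package's output format.
```python
def route_general(confidence: float) -> str:
    """Map a confidence score to a routing tier for general models (Llama-3.2-3B)."""
    if confidence >= 0.65:
        return "LOCAL_GENERATION"
    elif confidence >= 0.60:
        return "RAG_RETRIEVAL"              # retrieval-augmented generation
    elif confidence >= 0.40:
        return "ESCALATE_TO_LARGER_MODEL"   # hypothetical label
    return "HUMAN_REVIEW"                   # hypothetical label

def route_medical(confidence: float) -> str:
    """Map a confidence score to a routing tier for medical models (MedGemma-4B-IT)."""
    if confidence >= 0.60:
        return "LOCAL_GENERATION_WITH_MEDICAL_VALIDATION"   # hypothetical label
    elif confidence >= 0.55:
        return "MEDICAL_LITERATURE_VERIFICATION"            # hypothetical label
    elif confidence >= 0.50:
        return "PROFESSIONAL_MEDICAL_VERIFICATION"          # hypothetical label
    return "SEEK_PROFESSIONAL_MEDICAL_ADVICE"               # hypothetical label
```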
## 🆕 What's New in v0.6.3
### ✨ Enhanced Query-Context Support
- **🔗 Query-Context Pairs**: Full support for query_context_pairs in MedGemma models for enhanced context-aware responses
- **🎯 Improved Accuracy**: Better confidence scoring when context is provided
- **📝 Enhanced Response Generation**: Context-aware prompt formatting for more accurate medical responses
- **🔄 Batch Processing**: New `generate_response_with_context()` method for processing multiple query-context pairs
### 🏥 Medical Domain Enhancements
- **🩺 Context Integration**: Medical queries now benefit from patient context and clinical background
- **📊 Better Confidence**: Context helps improve confidence scoring for medical scenarios
- **🎛️ Flexible Usage**: Works with existing methods while providing new convenience functions
- **🔍 Example Implementation**: New query_context_example.py demonstrates usage patterns
### 🧹 Simplified Architecture
- **📱 Removed Dashboard**: Eliminated dashboard dependencies for cleaner core package
- **⚡ Streamlined Installation**: Faster installation without unnecessary web components
- **🎯 Focused Functionality**: Core hallucination detection without UI overhead
- **📦 Lightweight**: Reduced package size and dependencies
### 🔧 Technical Improvements
- **🔗 Enhanced Prompt Formatting**: Context gets properly integrated into medical prompts
- **🎯 Backward Compatibility**: All existing code continues to work unchanged
- **📝 Better Documentation**: Comprehensive examples for query-context usage
- **🛡️ Stable Performance**: Maintains all stability improvements from v0.6.3
## 🚀 Installation
### Requirements
- Python 3.8+
- PyTorch 1.13+
- CUDA-compatible GPU (recommended)
- At least 8GB GPU memory for inference (improved efficiency in v0.6.3+)
- 16GB RAM minimum (32GB recommended for training)
### Install from PyPI
```bash
pip install hallunox
```
### Install from Source
```bash
git clone https://github.com/convai-innovations/hallunox.git
cd hallunox
pip install -e .
```
### MedGemma Model Access
HalluNox uses the open-access `convaiinnovations/gemma-finetuned-4b-it` model, which doesn't require authentication. The model will be automatically downloaded on first use.
### Core Dependencies
HalluNox automatically installs:
- `torch>=1.13.0` - PyTorch framework
- `transformers>=4.21.0` - Hugging Face Transformers
- `FlagEmbedding>=1.2.0` - BGE-M3 embedding model
- `datasets>=2.0.0` - Dataset loading utilities
- `scikit-learn>=1.0.0` - Evaluation metrics
- `numpy>=1.21.0` - Numerical computations
- `tqdm>=4.64.0` - Progress bars
- `Pillow>=8.0.0` - Image processing for multimodal capabilities
- `bitsandbytes>=0.41.0` - 4-bit quantization for memory optimization
## 📖 Quick Start
### Basic Usage (Llama-3.2-3B)
```python
from hallunox import HallucinationDetector
# Initialize detector (downloads pre-trained model automatically)
detector = HallucinationDetector()
# Analyze text for hallucination risk
results = detector.predict([
"The capital of France is Paris.", # High confidence
"Your password is 12345678.", # Low confidence
"The Moon is made of cheese." # Very low confidence
])
# View results
for pred in results["predictions"]:
    print(f"Text: {pred['text']}")
    print(f"Confidence: {pred['confidence_score']:.3f}")
    print(f"Risk Level: {pred['risk_level']}")
    print(f"Routing Action: {pred['routing_action']}")
    print()
```
### 🏥 MedGemma Medical Domain Usage
For medical applications using MedGemma 4B-IT with multimodal capabilities:
```python
from hallunox import HallucinationDetector
from PIL import Image
# Initialize MedGemma detector (auto-downloads medical model)
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
confidence_threshold=0.60, # Medical-grade threshold
enable_response_generation=True, # Enable response generation
enable_inference=True,
mode="text" # Text-only mode (default)
)
# Medical text analysis
medical_results = detector.predict([
"Aspirin can help reduce heart attack risk when prescribed by a doctor.",
"Drinking bleach will cure COVID-19.", # Dangerous misinformation
"Type 2 diabetes requires insulin injections in all cases.", # Partially incorrect
])
for pred in medical_results["predictions"]:
    print(f"Medical Text: {pred['text'][:60]}...")
    print(f"Confidence: {pred['confidence_score']:.3f}")
    print(f"Risk Level: {pred['risk_level']}")
    print(f"Medical Action: {pred['routing_action']}")
    print(f"Description: {pred['description']}")
    print("-" * 50)
# Response generation with confidence checking
question = "What are the symptoms of pneumonia?"
response = detector.generate_response(question, check_confidence=True)
if response["should_generate"]:
    print(f"✅ Medical Response Generated (confidence: {response['confidence_score']:.3f})")
    print(f"Response: {response['response']}")
    print(f"Meets threshold: {response['meets_threshold']}")
    if response.get('forced_generation'):
        print("⚠️ Note: Response was generated despite low confidence")
else:
    print(f"⚠️ Response blocked (confidence: {response['confidence_score']:.3f})")
    print(f"Reason: {response['reason']}")
    print(f"Recommendation: {response['recommendation']}")
# Force generation for reference regardless of confidence
forced_response = detector.generate_response(
question,
check_confidence=True,
force_generate=True # Generate even if confidence is low
)
print(f"🔬 Reference Response (forced): {forced_response['response']}")
print(f"📊 Confidence: {forced_response['confidence_score']:.3f}")
print(f"🎯 Forced Generation: {forced_response['forced_generation']}")
# Multimodal image analysis (MedGemma 4B-IT only)
if detector.is_multimodal:
    print("\n🖼️ Multimodal Image Analysis")

    # Load medical image (replace with actual medical image)
    try:
        image = Image.open("chest_xray.jpg")
    except Exception:
        # Create demo image for testing
        image = Image.new('RGB', (224, 224), color='lightgray')

    # Analyze image confidence
    image_results = detector.predict_images([image], ["Chest X-ray"])

    for pred in image_results["predictions"]:
        print(f"Image: {pred['image_description']}")
        print(f"Confidence: {pred['confidence_score']:.3f}")
        print(f"Interpretation: {pred['interpretation']}")
        print(f"Risk Level: {pred['risk_level']}")

    # Generate image description
    description = detector.generate_image_response(
        image,
        "Describe the findings in this chest X-ray."
    )
    print(f"Generated Description: {description}")
```
### 🔧 Advanced Configuration
```python
from hallunox import HallucinationDetector
# Full configuration example
detector = HallucinationDetector(
# Model selection
llm_model_id="convaiinnovations/gemma-finetuned-4b-it", # or "unsloth/Llama-3.2-3B-Instruct"
embed_model_id="BAAI/bge-m3",
# Custom model weights (optional)
model_path="/path/to/custom/model.pt", # None = auto-download
# Hardware configuration
device="cuda", # or "cpu"
use_fp16=True, # Mixed precision for faster inference
# Sequence lengths
max_length=512, # LLM context length
bge_max_length=512, # BGE-M3 context length
# Feature toggles
load_llm=True, # Load LLM for embeddings
enable_inference=True, # Enable LLM inference
enable_response_generation=True, # Enable response generation
# Confidence settings
confidence_threshold=0.60, # Custom threshold (auto-detected by model type)
# Operation mode
mode="text", # "text", "image", "both", or "auto"
)
# Check model capabilities
print(f"Model type: {'Medical' if detector.is_medgemma_4b else 'General'}")
print(f"Multimodal support: {detector.is_multimodal}")
print(f"Operation mode: {detector.effective_mode} (requested: {detector.mode})")
print(f"Confidence threshold: {detector.confidence_threshold}")
```
### 🎛️ Operation Mode Configuration
The `mode` parameter controls what types of input the detector can process:
```python
from hallunox import HallucinationDetector
# Text mode (default) - processes text inputs only
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
mode="text" # Text-only processing (default)
)
# Auto mode - detects capabilities from model
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
mode="auto" # Auto: detects based on model capabilities
)
# Image-only mode - processes images only (requires multimodal model)
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
mode="image" # Image processing only
)
# Both mode - processes text and images (requires multimodal model)
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
mode="both" # Explicit multimodal mode
)
```
#### Mode Validation
- **Text mode**: Available for all models (default)
- **Image mode**: Requires multimodal model (e.g., convaiinnovations/gemma-finetuned-4b-it)
- **Both mode**: Requires multimodal model (e.g., convaiinnovations/gemma-finetuned-4b-it)
- **Auto mode**: Automatically selects based on model capabilities (see the sketch after this list)
  - Multimodal models → `effective_mode = "both"`
  - Text-only models → `effective_mode = "text"`
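A rough sketch of how this resolution could work (a hypothetical helper for illustration, not the package's internal code):
```python
def resolve_effective_mode(requested_mode: str, is_multimodal: bool) -> str:
    """Resolve the effective operation mode from the requested mode and model capability."""
    if requested_mode == "auto":
        # Auto mode picks the richest mode the model supports
        return "both" if is_multimodal else "text"
    if requested_mode in ("image", "both") and not is_multimodal:
        raise ValueError(f"Mode '{requested_mode}' requires a multimodal model")
    return requested_mode
```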
#### Error Examples
```python
# This will raise an error - image mode requires multimodal model
detector = HallucinationDetector(
llm_model_id="unsloth/Llama-3.2-3B-Instruct",
mode="image" # ❌ Error: Image mode requires multimodal model
)
# This will raise an error - calling image methods in text mode
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
mode="text"
)
detector.predict_images([image]) # ❌ Error: Current mode is 'text'
```
### ⚡ Performance Optimized Usage
For faster initialization when only doing embedding comparisons:
```python
from hallunox import HallucinationDetector
# Option 1: Factory method for embedding-only usage
detector = HallucinationDetector.for_embedding_only(
device="cuda",
use_fp16=True
)
# Option 2: Explicit parameter control
detector = HallucinationDetector(
load_llm=False, # Skip expensive LLM loading
enable_inference=False, # Disable inference capabilities
use_fp16=True # Use mixed precision
)
# Note: This configuration cannot perform predictions
# Use for preprocessing or embedding extraction only
```
### 🧠 Memory Optimization with Quantization
For GPUs with limited VRAM (8-16GB), use 4-bit quantization:
```python
from hallunox import HallucinationDetector
# Option 1: Auto-optimized for low memory (recommended)
detector = HallucinationDetector.for_low_memory(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it", # Or any supported model
device="cuda",
enable_response_generation=True, # Enable response generation for evaluation
verbose=True # Show loading progress (optional)
)
# Option 2: Manual quantization configuration
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
use_quantization=True, # Enable 4-bit quantization
enable_response_generation=True,
device="cuda"
)
# Option 3: Custom quantization settings
from transformers import BitsAndBytesConfig
import torch
custom_quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NF4 quantization type
bnb_4bit_use_double_quant=True, # Double quantization for extra savings
bnb_4bit_compute_dtype=torch.bfloat16 # Compute in bfloat16
)
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
quantization_config=custom_quant_config,
device="cuda"
)
print(f"✅ Memory optimized: {detector.use_quantization}")
print(f"🔧 Quantization: 4-bit NF4 with double quantization")
```
## 🤖 Response Generation & Evaluation
### Enabling Response Generation
When `enable_response_generation=True`, HalluNox can generate responses for evaluation and display the model's actual output alongside confidence scores:
```python
from hallunox import HallucinationDetector
# Enable response generation for evaluation
detector = HallucinationDetector.for_low_memory(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
device="cuda",
enable_response_generation=True, # Enable response generation
verbose=False # Clean logs for evaluation
)
# Test questions for evaluation
test_questions = [
"What are the symptoms of diabetes?",
"Drinking bleach will cure COVID-19.", # Dangerous misinformation
"How does aspirin help prevent heart attacks?",
"All vaccines cause autism in children.", # Medical misinformation
]
# Analyze with response generation
for question in test_questions:
    # The model will generate a response and analyze it
    results = detector.predict([question])
    prediction = results["predictions"][0]

    print(f"Question: {question}")
    print(f"Confidence: {prediction['confidence_score']:.3f}")
    print(f"Risk Level: {prediction['risk_level']}")
    print(f"Action: {prediction['medical_action']}")
    print(f"Description: {prediction['description']}")
    print("-" * 50)
```
### Response Generation Modes
```python
# Generate and analyze responses with confidence checking
response = detector.generate_response(
"What are the side effects of ibuprofen?",
check_confidence=True
)
if response["should_generate"]:
    print(f"✅ Generated Response: {response['response']}")
    print(f"Confidence: {response['confidence_score']:.3f}")
    print(f"Meets threshold: {response['meets_threshold']}")
else:
    print(f"⚠️ Response blocked (confidence: {response['confidence_score']:.3f})")
    print(f"Reason: {response['reason']}")
    print(f"Recommendation: {response['recommendation']}")
# Force generation for reference (useful for evaluation)
forced_response = detector.generate_response(
"What are the side effects of ibuprofen?",
check_confidence=True,
force_generate=True
)
print(f"🔬 Reference Response: {forced_response['response']}")
print(f"📊 Confidence: {forced_response['confidence_score']:.3f}")
print(f"🎯 Forced Generation: {forced_response['forced_generation']}")
```
### Evaluation Output Example
```
Question: What are the symptoms of diabetes?
Generated Response: Common symptoms of diabetes include increased thirst, frequent urination, excessive hunger, unexplained weight loss, fatigue, and blurred vision. It's important to consult a healthcare provider for proper diagnosis.
Confidence: 0.857
Risk Level: LOW_MEDICAL_RISK
Action: ✅ Information can be used as reference
--------------------------------------------------
Question: Drinking bleach will cure COVID-19.
Generated Response: [Response blocked - confidence too low]
Confidence: 0.123
Risk Level: VERY_HIGH_MEDICAL_RISK
Action: ⛔ Do not use - seek professional medical advice
--------------------------------------------------
```
### 💾 Memory Usage Comparison
| Configuration | Model Size | VRAM Usage | Performance |
|--------------|------------|------------|-------------|
| **Full Precision** | ~16GB | ~14GB | 100% speed |
| **FP16 Mixed Precision** | ~8GB | ~7GB | 95% speed |
| **4-bit Quantization** | ~4GB | ~3.5GB | 85-90% speed |
| **4-bit + Double Quant** | ~3.5GB | ~3GB | 85-90% speed |
**Recommendation**: Use `HallucinationDetector.for_low_memory()` for GPUs with 8GB or less VRAM.
### 📝 Enhanced Query-Context Support (NEW in v0.6.3!)
HalluNox now provides comprehensive support for query-context pairs, especially beneficial for medical applications:
```python
from hallunox import HallucinationDetector
# Initialize MedGemma detector for context-aware medical responses
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
enable_response_generation=True
)
# Medical query-context pairs for enhanced accuracy
medical_query_context_pairs = [
{
"query": "Is it safe to take ibuprofen daily?",
"context": "Patient has a history of gastric ulcers and is currently taking blood thinners for atrial fibrillation."
},
{
"query": "What's the recommended exercise routine?",
"context": "28-year-old pregnant patient at 30 weeks, previously sedentary, no complications."
},
{
"query": "How should I manage my diabetes medication?",
"context": "Type 2 diabetes patient, HbA1c 8.2%, currently on metformin 1000mg twice daily."
}
]
# Method 1: Confidence analysis with context
results = detector.predict_with_query_context(medical_query_context_pairs)
for pred in results["predictions"]:
    print(f"Query: {pred['text']}")
    print(f"Context-Enhanced Confidence: {pred['confidence_score']:.3f}")
    print(f"Medical Risk Level: {pred['risk_level']}")
    print(f"Recommendation: {pred['routing_action']}")
# Method 2: Response generation with context
responses = detector.generate_response_with_context(
medical_query_context_pairs,
max_length=300,
check_confidence=True
)
for i, response in enumerate(responses):
    pair = medical_query_context_pairs[i]
    print(f"\nQuery: {pair['query']}")
    print(f"Context: {pair['context'][:60]}...")

    if isinstance(response, dict) and "should_generate" in response:
        if response["should_generate"]:
            print(f"✅ Context-Aware Response: {response['response']}")
            print(f"Confidence: {response['confidence_score']:.3f}")
        else:
            print(f"⚠️ Blocked (confidence: {response['confidence_score']:.3f})")
            print(f"Recommendation: {response['recommendation']}")
# Method 3: Individual response with context
single_response = detector.generate_response(
prompt="Should I adjust my medication?",
query_context_pairs=[{
"query": "Should I adjust my medication?",
"context": "Patient experiencing mild side effects from current dosage"
}],
check_confidence=True
)
```
### Context Impact Analysis
```python
# Compare confidence with and without context
query = "Is this medication safe during pregnancy?"
# Without context
no_context = detector.predict([query])
print(f"Without context: {no_context['predictions'][0]['confidence_score']:.3f}")
# With context
with_context = detector.predict([query], query_context_pairs=[{
"query": query,
"context": "Patient is 12 weeks pregnant, no previous complications, taking prenatal vitamins"
}])
print(f"With context: {with_context['predictions'][0]['confidence_score']:.3f}")
# Context benefit
improvement = with_context['predictions'][0]['confidence_score'] - no_context['predictions'][0]['confidence_score']
print(f"Context improvement: {improvement:+.3f}")
```
## 🖥️ Command Line Interface
HalluNox provides a comprehensive CLI for various use cases:
### Interactive Mode
```bash
# General model interactive mode
hallunox-infer --interactive
# MedGemma medical interactive mode
hallunox-infer --llm_model_id convaiinnovations/gemma-finetuned-4b-it --interactive --show_generated_text
```
### Batch Processing
```bash
# Process file with general model
hallunox-infer --input_file medical_texts.txt --output_file results.json
# Process with MedGemma and medical settings
hallunox-infer \
--llm_model_id convaiinnovations/gemma-finetuned-4b-it \
--input_file medical_texts.txt \
--output_file medical_results.json \
--show_routing \
--show_generated_text
```
### Image Analysis (Multimodal models only)
```bash
# Single image analysis
hallunox-infer \
--llm_model_id convaiinnovations/gemma-finetuned-4b-it \
--image_path chest_xray.jpg \
--show_generated_text
# Batch image analysis
hallunox-infer \
--llm_model_id convaiinnovations/gemma-finetuned-4b-it \
--image_folder /path/to/medical/images \
--output_file image_analysis.json
```
### Demo Mode
```bash
# General demo
hallunox-infer --demo --show_routing
# Medical demo with MedGemma
hallunox-infer \
--llm_model_id convaiinnovations/gemma-finetuned-4b-it \
--demo \
--mode both \
--show_routing
# Text-only demo (faster initialization)
hallunox-infer \
--llm_model_id convaiinnovations/gemma-finetuned-4b-it \
--demo \
--mode text \
--show_routing
```
## 🔨 Training Your Own Model
### Quick Training
```python
from hallunox import Trainer, TrainingConfig
# Configure training
config = TrainingConfig(
# Model selection
model_id="convaiinnovations/gemma-finetuned-4b-it", # or "unsloth/Llama-3.2-3B-Instruct"
embed_model_id="BAAI/bge-m3",
# Training parameters
batch_size=8,
learning_rate=5e-4,
max_epochs=6,
warmup_steps=300,
# Dataset configuration
use_truthfulqa=True,
use_halueval=True,
use_fever=True,
max_samples_per_dataset=3000,
# Output
output_dir="./models/my_medical_model"
)
# Train model
trainer = Trainer(config)
trainer.train()
```
### Command Line Training
```bash
# Train general model
hallunox-train --batch_size 8 --learning_rate 5e-4 --max_epochs 6
# Train medical model
hallunox-train \
--model_id convaiinnovations/gemma-finetuned-4b-it \
--batch_size 4 \
--learning_rate 3e-4 \
--max_epochs 8 \
--output_dir ./models/custom_medgemma
```
## 🏗️ Model Architecture
HalluNox supports two main architectures:
### General Architecture (Llama-3.2-3B)
1. **LLM Component**: Llama-3.2-3B-Instruct
- Extracts internal hidden representations (3072D)
- Supports any Llama-architecture model
2. **Embedding Model**: BGE-M3 (fixed)
- Provides reference semantic embeddings
- 1024-dimensional dense vectors
3. **Projection Network**: Standard ProjectionHead (sketched after this list)
- Maps LLM hidden states to embedding space
- 3-layer MLP with ReLU activations and dropout
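An illustrative PyTorch sketch of such a projection head: the 3072→1024 mapping follows the dimensions described above, while the hidden width and dropout rate are assumptions rather than the package's exact values.
```python
import torch.nn as nn

class ProjectionHeadSketch(nn.Module):
    """3-layer MLP mapping LLM hidden states (3072D) into the BGE-M3 embedding space (1024D)."""
    def __init__(self, in_dim=3072, hidden_dim=2048, out_dim=1024, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, hidden_states):
        # Project hidden states so they can be compared against reference embeddings
        return self.net(hidden_states)
```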
### Medical Architecture (MedGemma-4B-IT)
1. **Unified Multimodal Model**:
- **Single Model**: AutoModelForImageTextToText handles both text and images
- **Memory Optimized**: Avoids double loading (saves ~8GB VRAM)
- **Fallback Support**: Graceful degradation to text-only if needed
2. **Embedding Model**: BGE-M3 (same as general)
- Enhanced with medical context formatting
3. **Projection Network**: UltraStableProjectionHead (sketched after this list)
- Ultra-stable architecture with heavy normalization
- Conservative weight initialization for medical precision
- Tanh activations for stability
- Enhanced dropout and layer normalization
4. **Multimodal Processor**: AutoProcessor
- Handles image + text inputs
- Supports chat template formatting
5. **Quantization Support**: 4-bit NF4 with double quantization
- Reduces memory usage by ~75%
- Maintains 85-90% performance
- Automatic fallback for CPU
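For comparison, a sketch of how the stability measures listed above (layer normalization, Tanh activations, conservative initialization) might fit together; the dimensions and initialization gain are illustrative assumptions, not the package's exact implementation.
```python
import torch.nn as nn

class UltraStableProjectionHeadSketch(nn.Module):
    """Projection head variant with heavy normalization and Tanh activations for stability."""
    def __init__(self, in_dim=2560, hidden_dim=1536, out_dim=1024, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(in_dim),
            nn.Linear(in_dim, hidden_dim),
            nn.Tanh(),
            nn.Dropout(dropout),
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, out_dim),
        )
        # Conservative (small-gain) initialization of the linear layers
        for module in self.net:
            if isinstance(module, nn.Linear):
                nn.init.xavier_uniform_(module.weight, gain=0.5)
                nn.init.zeros_(module.bias)

    def forward(self, hidden_states):
        return self.net(hidden_states)
```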
## 📊 API Reference
### HallucinationDetector
#### Constructor Parameters
```python
HallucinationDetector(
model_path: str = None, # Path to trained model (None = auto-download)
llm_model_id: str = "unsloth/Llama-3.2-3B-Instruct", # LLM model ID
embed_model_id: str = "BAAI/bge-m3", # Embedding model ID
device: str = None, # Device (None = auto-detect)
max_length: int = 512, # LLM sequence length
bge_max_length: int = 512, # BGE-M3 sequence length
use_fp16: bool = True, # Mixed precision
load_llm: bool = True, # Load LLM
enable_inference: bool = False, # Enable LLM inference
confidence_threshold: float = None, # Custom threshold (auto-detected)
enable_response_generation: bool = False, # Enable response generation
use_quantization: bool = False, # Enable 4-bit quantization for memory savings
quantization_config: BitsAndBytesConfig = None, # Custom quantization config
mode: str = "text", # Operation mode: "text", "image", "both", "auto" (default: "text")
)
```
#### Core Methods
**Text Analysis:**
- `predict(texts, query_context_pairs=None)` - Analyze texts for hallucination confidence
- `predict_with_query_context(query_context_pairs)` - Query-context prediction
- `batch_predict(texts, batch_size=16)` - Efficient batch processing
**Response Generation:**
- `generate_response(prompt, max_length=512, check_confidence=True, force_generate=False, query_context_pairs=None)` - Generate responses with confidence checking and optional context
- `generate_response_with_context(query_context_pairs, max_length=512, check_confidence=True, force_generate=False)` - Generate responses for multiple query-context pairs
**Multimodal (MedGemma only):**
- `predict_images(images, image_descriptions=None)` - Analyze image confidence
- `generate_image_response(image, prompt, max_length=200)` - Generate image descriptions
**Analysis:**
- `evaluate_routing_strategy(texts)` - Analyze routing decisions (see the usage sketch after this list)
**Factory Methods:**
- `for_embedding_only()` - Create embedding-only detector
- `for_low_memory()` - Create memory-optimized detector with 4-bit quantization
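A brief usage sketch of the batch and routing-analysis helpers. This assumes the detector is initialized as in the Quick Start and that `batch_predict()` returns the same structure as `predict()`; the exact shape of the routing report may differ.
```python
from hallunox import HallucinationDetector

detector = HallucinationDetector()

texts = [
    "The capital of France is Paris.",
    "The Moon is made of cheese.",
]

# Batched confidence scoring for larger workloads
batch_results = detector.batch_predict(texts, batch_size=16)
for pred in batch_results["predictions"]:
    print(pred["text"], pred["confidence_score"], pred["routing_action"])

# Aggregate view of how the same texts would be routed
routing_report = detector.evaluate_routing_strategy(texts)
print(routing_report)
```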
#### Response Format
```python
{
"predictions": [
{
"text": "input text",
"confidence_score": 0.85, # 0.0 to 1.0
"similarity_score": 0.92, # Cosine similarity
"interpretation": "HIGH_CONFIDENCE", # or HIGH_MEDICAL_CONFIDENCE
"risk_level": "LOW_RISK", # or LOW_MEDICAL_RISK
"routing_action": "LOCAL_GENERATION",
"description": "This response appears to be factual and reliable."
}
],
"summary": {
"total_texts": 1,
"avg_confidence": 0.85,
"high_confidence_count": 1,
"medium_confidence_count": 0,
"low_confidence_count": 0,
"very_low_confidence_count": 0
}
}
```
#### Response Generation Format
```python
{
"response": "Generated response text",
"confidence_score": 0.85,
"should_generate": True,
"meets_threshold": True,
"forced_generation": False, # True if generated despite low confidence
# Or when blocked:
"reason": "Confidence 0.45 below threshold 0.60",
"recommendation": "RAG_RETRIEVAL"
}
```
### Training Classes
- **`TrainingConfig`**: Configuration dataclass for training parameters
- **`Trainer`**: Main training class with dataset loading and model training
- **`MultiDatasetLoader`**: Loads and combines multiple hallucination detection datasets
### Utility Functions
- **`download_model()`**: Download general pre-trained model (see the usage sketch below)
- **`download_medgemma_model(model_name)`**: Download MedGemma medical model
- **`setup_logging(level)`**: Configure logging
- **`check_gpu_availability()`**: Check CUDA compatibility
- **`validate_model_requirements()`**: Verify dependencies
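A short sketch combining these helpers before creating a detector. It assumes the functions are importable from the package root, that `setup_logging()` accepts a standard logging level, and that `check_gpu_availability()` returns a truthy value when CUDA is usable.
```python
import logging

from hallunox import (
    HallucinationDetector,
    check_gpu_availability,
    setup_logging,
    validate_model_requirements,
)

setup_logging(logging.INFO)       # configure package logging
validate_model_requirements()     # verify that required dependencies are installed

# Fall back to CPU if no CUDA-capable GPU is available
device = "cuda" if check_gpu_availability() else "cpu"
detector = HallucinationDetector(device=device)
```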
## 📈 Performance
Our confidence-aware routing system demonstrates:
- **74% hallucination detection rate** (vs 42% baseline)
- **9% false positive rate** (vs 15% baseline)
- **40% reduction in computational cost** vs post-hoc methods
- **1.6x cost multiplier**, compared with 4.2x when always using expensive operations
### Medical Domain Performance (MedGemma)
- **Enhanced medical accuracy** with 0.62 confidence threshold
- **Multimodal capability** for medical image analysis
- **Safety-first approach** with conservative thresholds
- **Professional verification workflow** for low-confidence cases
## 🖥️ Hardware Requirements
### Minimum (Inference Only)
- **CPU**: Modern multi-core processor
- **RAM**: 16GB system memory
- **GPU**: 8GB VRAM (RTX 3070, RTX 4060 Ti+)
- **Storage**: 15GB free space
- **Models**: ~5GB each (Llama/MedGemma)
### Recommended (Inference)
- **CPU**: Intel i7/AMD Ryzen 7+
- **RAM**: 32GB system memory
- **GPU**: 12GB+ VRAM (RTX 4070, RTX 3080+)
- **Storage**: NVMe SSD, 25GB+ free
- **CUDA**: 11.8+ compatible driver
### Training Requirements
- **CPU**: High-performance multi-core (i9/Ryzen 9)
- **RAM**: 64GB+ system memory
- **GPU**: 24GB+ VRAM (RTX 4090, A100, H100)
- **Storage**: 200GB+ NVMe SSD
- Model checkpoints: ~10GB per epoch
- Training datasets: ~30GB
- Logs and outputs: ~50GB
- **Network**: High-speed internet for downloads
### MedGemma Specific
- **Additional storage**: +10GB for multimodal models
- **Image processing**: PIL/Pillow for image capabilities
- **Memory**: +4GB RAM for image processing pipeline
### CPU-Only Mode
- **RAM**: 32GB minimum (64GB recommended)
- **Performance**: 10-50x slower than GPU
- **Not recommended**: For production medical applications
## 🔒 Safety Considerations
### Medical Applications
- **Professional oversight required**: HalluNox is a research tool, not medical advice
- **Validation needed**: All medical outputs should be verified by qualified professionals
- **Conservative thresholds**: 0.62 threshold ensures high precision for medical content
- **Clear disclaimers**: Always include appropriate medical disclaimers in applications
### General Use
- **Confidence-based routing**: Use routing recommendations for appropriate escalation
- **Human oversight**: Very low confidence predictions require human review
- **Regular evaluation**: Monitor performance on your specific use cases
## 🛠️ Troubleshooting
### Common Issues and Solutions
#### CUDA Out of Memory Error
```
OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB...
```
**Solution**: Use 4-bit quantization
```python
detector = HallucinationDetector.for_low_memory()
```
#### Deprecated torch_dtype Warning
```
`torch_dtype` is deprecated! Use `dtype` instead!
```
**Solution**: Already fixed in HalluNox v0.3.2+ - the package now uses the correct `dtype` parameter.
#### Double Model Loading (MedGemma)
```
Loading checkpoint shards: 100% 2/2 [00:37<00:00, 18.20s/it]
Loading checkpoint shards: 100% 2/2 [00:36<00:00, 17.88s/it]
```
**Solution**: Already optimized in HalluNox v0.3.2+ - MedGemma now uses a unified model approach that avoids double loading.
#### Accelerate Warning
```
WARNING:accelerate.big_modeling:Some parameters are on the meta device...
```
**Solution**: This is normal with quantization - parameters are automatically moved to GPU during inference.
#### Dependency Version Conflict (AutoProcessor)
```
⚠️ Could not load AutoProcessor: module 'requests' has no attribute 'exceptions'
AttributeError: module 'requests' has no attribute 'exceptions'
```
**Solution**: This is a compatibility issue between transformers and requests versions.
```bash
pip install --upgrade transformers requests huggingface_hub
# Or force reinstall
pip install --force-reinstall transformers>=4.45.0 requests>=2.31.0
```
**Fallback**: HalluNox automatically falls back to text-only mode when this occurs.
#### Model Hidden States NaN/Inf Issues ✅ RESOLVED
```
⚠️ Warning: NaN/Inf detected in model hidden states
Hidden shape: torch.Size([3, 16, 2560])
NaN count: 122880
```
**✅ FIXED in HalluNox v0.6.3+**: This issue has been completely resolved by adopting the proven approach from our working inference pipeline:
**Root Cause**: 4-bit quantization was causing numerical instabilities with certain model architectures.
**Solution Applied**:
- **Disabled Quantization**: Removed 4-bit quantization that was causing NaN issues
- **Simplified Model Loading**: Now uses the same approach as our proven `inference_gemma.py`
- **Clean Architecture**: Removed complex stability measures that were interfering
- **Stable Precision**: Uses `torch.bfloat16` for optimal performance without instabilities
#### Repetitive Text and Unwanted Artifacts ✅ RESOLVED
```
🔬 Reference Response (forced): I am programmed to be a harmless AI assistant...
g
I am programmed to be a harmless AI assistant...
g
[repetitive output continues...]
```
**✅ FIXED in HalluNox v0.6.3+**: Repetitive text generation and unwanted artifacts have been completely resolved:
**Root Cause**: Improper message formatting and sampling parameters causing the model to not understand conversation boundaries.
**Solution Applied** (sketched below):
- **Deterministic Generation**: Changed from `do_sample=True` to `do_sample=False` matching Jupyter notebook approach
- **Proper Chat Templates**: Adopted exact message formatting from working Jupyter notebook implementation
- **Removed Sampling Parameters**: Eliminated `temperature`, `top_p`, `repetition_penalty` that were causing issues
- **Clean Tokenization**: Uses `tokenizer.apply_chat_template()` with proper parameters for conversation structure
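In plain Hugging Face terms, the generation path now resembles the following sketch. It uses the standard `transformers` chat-template APIs (with the AutoProcessor/AutoModelForImageTextToText classes named in the architecture section and a recent library version assumed); HalluNox's internal implementation may differ in details.
```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "convaiinnovations/gemma-finetuned-4b-it"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": [{"type": "text", "text": "What are the symptoms of pneumonia?"}]}
]

# Proper chat-template formatting so the model sees correct conversation boundaries
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

# Deterministic generation: no sampling parameters, avoiding repetitive artifacts
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```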
**Current Recommended Usage** (v0.6.3+):
```python
# Standard usage - now stable by default
detector = HallucinationDetector.for_low_memory(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
device="cuda"
)
# Both NaN issues and repetitive text are now automatically resolved
```
**Migration from v0.4.9 and earlier**: No code changes needed - existing code will automatically use the stable approach.
#### Environment Optimization
For better memory management, set:
```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```
### Memory Requirements by Configuration
| GPU VRAM | Recommended Configuration | Expected Performance |
|----------|--------------------------|---------------------|
| **4-6GB** | `for_low_memory()` + reduce batch size | Basic functionality |
| **8-12GB** | `for_low_memory()` | Full functionality |
| **16GB+** | Standard configuration | Optimal performance |
| **24GB+** | Multiple models + training | Development/research |
## 📄 License
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).
## 📚 Citation
If you use HalluNox in your research, please cite:
```bibtex
@article{nandakishor2024hallunox,
title={Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation},
author={Nandakishor M},
journal={AI Safety Research},
year={2024},
organization={Convai Innovations}
}
```
## 🤝 Contributing
We welcome contributions! Please see our contributing guidelines and submit pull requests to our repository.
### Development Setup
```bash
git clone https://github.com/convai-innovations/hallunox.git
cd hallunox
pip install -e ".[dev]"
```
## 📞 Support
For technical support and questions:
- **Email**: support@convaiinnovations.com
- **Issues**: [GitHub Issues](https://github.com/convai-innovations/hallunox/issues)
- **Documentation**: Full API docs available online
## 👨‍💻 Author
**Nandakishor M**
AI Safety Research
Convai Innovations Pvt. Ltd.
Email: support@convaiinnovations.com
---
**Disclaimer**: HalluNox is a research tool for hallucination detection and should not be used as the sole basis for critical decisions, especially in medical contexts. Always seek professional advice for medical applications.
Raw data
{
"_id": null,
"home_page": "https://github.com/convai-innovations/hallunox",
"name": "hallunox",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "\"Convai Innovations Pvt. Ltd.\" <support@convaiinnovations.com>",
"keywords": "hallucination-detection, llm, confidence-estimation, model-reliability, uncertainty-quantification, ai-safety",
"author": "Nandakishor M",
"author_email": "Nandakishor M <support@convaiinnovations.com>",
"download_url": "https://files.pythonhosted.org/packages/2c/76/f15afa53d796d647f9a651381152717bc7951aa9cb638e9e4fedbcccfc9b/hallunox-0.6.3.tar.gz",
"platform": null,
"description": "# HalluNox\r\n\r\n**Confidence-Aware Routing for Large Language Model Reliability Enhancement**\r\n\r\nA Python package implementing a multi-signal approach to pre-generation hallucination mitigation for Large Language Models. HalluNox combines semantic alignment measurement, internal convergence analysis, and learned confidence estimation to produce unified confidence scores for proactive routing decisions.\r\n\r\n## \u2728 Features\r\n\r\n- **\ud83c\udfaf Pre-generation Hallucination Detection**: Assess model reliability before generation begins\r\n- **\ud83d\udd04 Confidence-Aware Routing**: Automatically route queries based on estimated confidence\r\n- **\ud83e\udde0 Multi-Signal Approach**: Combines semantic alignment, internal convergence, and learned confidence\r\n- **\u26a1 Multi-Model Support**: Llama-3.2-3B-Instruct and MedGemma-4B-IT architectures\r\n- **\ud83c\udfe5 Medical Domain Specialization**: Enhanced MedGemma 4B-IT support with medical-grade confidence thresholds\r\n- **\ud83d\uddbc\ufe0f Multimodal Capabilities**: Image analysis and response generation for MedGemma models\r\n- **\ud83d\udcca Comprehensive Evaluation**: Built-in metrics and routing strategy analysis\r\n- **\ud83d\ude80 Easy Integration**: Simple API for both training and inference\r\n- **\ud83c\udfc3\u200d\u2642\ufe0f Performance Optimizations**: Optional LLM loading for faster initialization and lower memory usage\r\n- **\ud83d\udcdd Enhanced Query-Context**: Improved accuracy with structured prompt formatting\r\n- **\ud83c\udf9b\ufe0f Adaptive Thresholds**: Dynamic confidence thresholds based on model type (0.62 for medical, 0.65 for general)\r\n- **\ud83d\udcac Response Generation**: Built-in response generation with confidence-gated output\r\n- **\ud83d\udd27 Automatic Model Management**: Auto-download and configuration for supported models\r\n\r\n## \ud83d\udd2c Research Foundation\r\n\r\nBased on the research paper \"Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation\" by Nandakishor M (Convai Innovations).\r\n\r\nThe approach implements deterministic routing to appropriate response pathways:\r\n\r\n### General Models (Llama-3.2-3B)\r\n- **High Confidence (\u22650.65)**: Local generation \r\n- **Medium Confidence (0.60-0.65)**: Retrieval-augmented generation\r\n- **Low Confidence (0.4-0.60)**: Route to larger models\r\n- **Very Low Confidence (<0.4)**: Human review required\r\n\r\n### Medical Models (MedGemma-4B-IT)\r\n- **High Medical Confidence (\u22650.60)**: Local generation with medical validation\r\n- **Medium Medical Confidence (0.55-0.60)**: Medical literature verification required\r\n- **Low Medical Confidence (0.50-0.55)**: Professional medical verification required\r\n- **Very Low Medical Confidence (<0.50)**: Seek professional medical advice\r\n\r\n## \ud83c\udd95 What's New in v0.6.3\r\n\r\n### \u2728 Enhanced Query-Context Support\r\n- **\ud83d\udd17 Query-Context Pairs**: Full support for query_context_pairs in MedGemma models for enhanced context-aware responses\r\n- **\ud83c\udfaf Improved Accuracy**: Better confidence scoring when context is provided\r\n- **\ud83d\udcdd Enhanced Response Generation**: Context-aware prompt formatting for more accurate medical responses\r\n- **\ud83d\udd04 Batch Processing**: New `generate_response_with_context()` method for processing multiple query-context pairs\r\n\r\n### \ud83c\udfe5 Medical Domain Enhancements\r\n- **\ud83e\ude7a Context 
Integration**: Medical queries now benefit from patient context and clinical background\r\n- **\ud83d\udcca Better Confidence**: Context helps improve confidence scoring for medical scenarios\r\n- **\ud83c\udf9b\ufe0f Flexible Usage**: Works with existing methods while providing new convenience functions\r\n- **\ud83d\udd0d Example Implementation**: New query_context_example.py demonstrates usage patterns\r\n\r\n### \ud83e\uddf9 Simplified Architecture\r\n- **\ud83d\udcf1 Removed Dashboard**: Eliminated dashboard dependencies for cleaner core package\r\n- **\u26a1 Streamlined Installation**: Faster installation without unnecessary web components\r\n- **\ud83c\udfaf Focused Functionality**: Core hallucination detection without UI overhead\r\n- **\ud83d\udce6 Lightweight**: Reduced package size and dependencies\r\n\r\n### \ud83d\udd27 Technical Improvements\r\n- **\ud83d\udd17 Enhanced Prompt Formatting**: Context gets properly integrated into medical prompts\r\n- **\ud83c\udfaf Backward Compatibility**: All existing code continues to work unchanged\r\n- **\ud83d\udcdd Better Documentation**: Comprehensive examples for query-context usage\r\n- **\ud83d\udee1\ufe0f Stable Performance**: Maintains all stability improvements from v0.6.3\r\n\r\n## \ud83d\ude80 Installation\r\n\r\n### Requirements\r\n\r\n- Python 3.8+\r\n- PyTorch 1.13+\r\n- CUDA-compatible GPU (recommended)\r\n- At least 8GB GPU memory for inference (improved efficiency in v0.6.3+)\r\n- 16GB RAM minimum (32GB recommended for training)\r\n\r\n### Install from PyPI\r\n\r\n```bash\r\npip install hallunox\r\n```\r\n\r\n### Install from Source\r\n\r\n```bash\r\ngit clone https://github.com/convai-innovations/hallunox.git\r\ncd hallunox\r\npip install -e .\r\n```\r\n\r\n### MedGemma Model Access\r\n\r\nHalluNox uses the open-access `convaiinnovations/gemma-finetuned-4b-it` model, which doesn't require authentication. 
The model will be automatically downloaded on first use.\r\n\r\n### Core Dependencies\r\n\r\nHalluNox automatically installs:\r\n\r\n- `torch>=1.13.0` - PyTorch framework\r\n- `transformers>=4.21.0` - Hugging Face Transformers\r\n- `FlagEmbedding>=1.2.0` - BGE-M3 embedding model\r\n- `datasets>=2.0.0` - Dataset loading utilities\r\n- `scikit-learn>=1.0.0` - Evaluation metrics\r\n- `numpy>=1.21.0` - Numerical computations\r\n- `tqdm>=4.64.0` - Progress bars\r\n- `Pillow>=8.0.0` - Image processing for multimodal capabilities\r\n- `bitsandbytes>=0.41.0` - 4-bit quantization for memory optimization\r\n\r\n## \ud83d\udcd6 Quick Start\r\n\r\n### Basic Usage (Llama-3.2-3B)\r\n\r\n```python\r\nfrom hallunox import HallucinationDetector\r\n\r\n# Initialize detector (downloads pre-trained model automatically)\r\ndetector = HallucinationDetector()\r\n\r\n# Analyze text for hallucination risk\r\nresults = detector.predict([\r\n \"The capital of France is Paris.\", # High confidence\r\n \"Your password is 12345678.\", # Low confidence \r\n \"The Moon is made of cheese.\" # Very low confidence\r\n])\r\n\r\n# View results\r\nfor pred in results[\"predictions\"]:\r\n print(f\"Text: {pred['text']}\")\r\n print(f\"Confidence: {pred['confidence_score']:.3f}\")\r\n print(f\"Risk Level: {pred['risk_level']}\")\r\n print(f\"Routing Action: {pred['routing_action']}\")\r\n print()\r\n```\r\n\r\n### \ud83c\udfe5 MedGemma Medical Domain Usage\r\n\r\nFor medical applications using MedGemma 4B-IT with multimodal capabilities:\r\n\r\n```python\r\nfrom hallunox import HallucinationDetector\r\nfrom PIL import Image\r\n\r\n# Initialize MedGemma detector (auto-downloads medical model)\r\ndetector = HallucinationDetector(\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\",\r\n confidence_threshold=0.60, # Medical-grade threshold\r\n enable_response_generation=True, # Enable response generation\r\n enable_inference=True,\r\n mode=\"text\" # Text-only mode (default)\r\n)\r\n\r\n# Medical text analysis\r\nmedical_results = detector.predict([\r\n \"Aspirin can help reduce heart attack risk when prescribed by a doctor.\",\r\n \"Drinking bleach will cure COVID-19.\", # Dangerous misinformation\r\n \"Type 2 diabetes requires insulin injections in all cases.\", # Partially incorrect\r\n])\r\n\r\nfor pred in medical_results[\"predictions\"]:\r\n print(f\"Medical Text: {pred['text'][:60]}...\")\r\n print(f\"Confidence: {pred['confidence_score']:.3f}\")\r\n print(f\"Risk Level: {pred['risk_level']}\")\r\n print(f\"Medical Action: {pred['routing_action']}\")\r\n print(f\"Description: {pred['description']}\")\r\n print(\"-\" * 50)\r\n\r\n# Response generation with confidence checking\r\nquestion = \"What are the symptoms of pneumonia?\"\r\nresponse = detector.generate_response(question, check_confidence=True)\r\n\r\nif response[\"should_generate\"]:\r\n print(f\"\u2705 Medical Response Generated (confidence: {response['confidence_score']:.3f})\")\r\n print(f\"Response: {response['response']}\")\r\n print(f\"Meets threshold: {response['meets_threshold']}\")\r\n if response.get('forced_generation'):\r\n print(\"\u26a0\ufe0f Note: Response was generated despite low confidence\")\r\nelse:\r\n print(f\"\u26a0\ufe0f Response blocked (confidence: {response['confidence_score']:.3f})\")\r\n print(f\"Reason: {response['reason']}\")\r\n print(f\"Recommendation: {response['recommendation']}\")\r\n\r\n# Force generation for reference regardless of confidence\r\nforced_response = detector.generate_response(\r\n question, \r\n 
check_confidence=True, \r\n force_generate=True # Generate even if confidence is low\r\n)\r\nprint(f\"\ud83d\udd2c Reference Response (forced): {forced_response['response']}\")\r\nprint(f\"\ud83d\udcca Confidence: {forced_response['confidence_score']:.3f}\")\r\nprint(f\"\ud83c\udfaf Forced Generation: {forced_response['forced_generation']}\")\r\n\r\n# Multimodal image analysis (MedGemma 4B-IT only)\r\nif detector.is_multimodal:\r\n print(\"\\n\ud83d\uddbc\ufe0f Multimodal Image Analysis\")\r\n \r\n # Load medical image (replace with actual medical image)\r\n try:\r\n image = Image.open(\"chest_xray.jpg\")\r\n except:\r\n # Create demo image for testing\r\n image = Image.new('RGB', (224, 224), color='lightgray')\r\n \r\n # Analyze image confidence\r\n image_results = detector.predict_images([image], [\"Chest X-ray\"])\r\n \r\n for pred in image_results[\"predictions\"]:\r\n print(f\"Image: {pred['image_description']}\")\r\n print(f\"Confidence: {pred['confidence_score']:.3f}\")\r\n print(f\"Interpretation: {pred['interpretation']}\")\r\n print(f\"Risk Level: {pred['risk_level']}\")\r\n \r\n # Generate image description\r\n description = detector.generate_image_response(\r\n image, \r\n \"Describe the findings in this chest X-ray.\"\r\n )\r\n print(f\"Generated Description: {description}\")\r\n```\r\n\r\n### \ud83d\udd27 Advanced Configuration\r\n\r\n```python\r\nfrom hallunox import HallucinationDetector\r\n\r\n# Full configuration example\r\ndetector = HallucinationDetector(\r\n # Model selection\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\", # or \"unsloth/Llama-3.2-3B-Instruct\"\r\n embed_model_id=\"BAAI/bge-m3\",\r\n \r\n # Custom model weights (optional)\r\n model_path=\"/path/to/custom/model.pt\", # None = auto-download\r\n \r\n # Hardware configuration\r\n device=\"cuda\", # or \"cpu\"\r\n use_fp16=True, # Mixed precision for faster inference\r\n \r\n # Sequence lengths\r\n max_length=512, # LLM context length\r\n bge_max_length=512, # BGE-M3 context length\r\n \r\n # Feature toggles\r\n load_llm=True, # Load LLM for embeddings\r\n enable_inference=True, # Enable LLM inference\r\n enable_response_generation=True, # Enable response generation\r\n \r\n # Confidence settings\r\n confidence_threshold=0.60, # Custom threshold (auto-detected by model type)\r\n \r\n # Operation mode\r\n mode=\"text\", # \"text\", \"image\", \"both\", or \"auto\"\r\n)\r\n\r\n# Check model capabilities\r\nprint(f\"Model type: {'Medical' if detector.is_medgemma_4b else 'General'}\")\r\nprint(f\"Multimodal support: {detector.is_multimodal}\")\r\nprint(f\"Operation mode: {detector.effective_mode} (requested: {detector.mode})\")\r\nprint(f\"Confidence threshold: {detector.confidence_threshold}\")\r\n```\r\n\r\n### \ud83c\udf9b\ufe0f Operation Mode Configuration\r\n\r\nThe `mode` parameter controls what types of input the detector can process:\r\n\r\n```python\r\nfrom hallunox import HallucinationDetector\r\n\r\n# Text mode (default) - processes text inputs only\r\ndetector = HallucinationDetector(\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\",\r\n mode=\"text\" # Text-only processing (default)\r\n)\r\n\r\n# Auto mode - detects capabilities from model\r\ndetector = HallucinationDetector(\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\",\r\n mode=\"auto\" # Auto: detects based on model capabilities\r\n)\r\n\r\n# Image-only mode - processes images only (requires multimodal model)\r\ndetector = HallucinationDetector(\r\n 
llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\",\r\n mode=\"image\" # Image processing only\r\n)\r\n\r\n# Both mode - processes text and images (requires multimodal model)\r\ndetector = HallucinationDetector(\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\",\r\n mode=\"both\" # Explicit multimodal mode\r\n)\r\n```\r\n\r\n#### Mode Validation\r\n\r\n- **Text mode**: Available for all models (default)\r\n- **Image mode**: Requires multimodal model (e.g., convaiinnovations/gemma-finetuned-4b-it)\r\n- **Both mode**: Requires multimodal model (e.g., convaiinnovations/gemma-finetuned-4b-it)\r\n- **Auto mode**: Automatically selects based on model capabilities\r\n - Multimodal models \u2192 `effective_mode = \"both\"`\r\n - Text-only models \u2192 `effective_mode = \"text\"`\r\n\r\n#### Error Examples\r\n\r\n```python\r\n# This will raise an error - image mode requires multimodal model\r\ndetector = HallucinationDetector(\r\n llm_model_id=\"unsloth/Llama-3.2-3B-Instruct\",\r\n mode=\"image\" # \u274c Error: Image mode requires multimodal model\r\n)\r\n\r\n# This will raise an error - calling image methods in text mode\r\ndetector = HallucinationDetector(\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\",\r\n mode=\"text\"\r\n)\r\ndetector.predict_images([image]) # \u274c Error: Current mode is 'text'\r\n```\r\n\r\n### \u26a1 Performance Optimized Usage\r\n\r\nFor faster initialization when only doing embedding comparisons:\r\n\r\n```python\r\nfrom hallunox import HallucinationDetector\r\n\r\n# Option 1: Factory method for embedding-only usage\r\ndetector = HallucinationDetector.for_embedding_only(\r\n device=\"cuda\",\r\n use_fp16=True\r\n)\r\n\r\n# Option 2: Explicit parameter control\r\ndetector = HallucinationDetector(\r\n load_llm=False, # Skip expensive LLM loading\r\n enable_inference=False, # Disable inference capabilities\r\n use_fp16=True # Use mixed precision\r\n)\r\n\r\n# Note: This configuration cannot perform predictions\r\n# Use for preprocessing or embedding extraction only\r\n```\r\n\r\n### \ud83e\udde0 Memory Optimization with Quantization\r\n\r\nFor GPUs with limited VRAM (8-16GB), use 4-bit quantization:\r\n\r\n```python\r\nfrom hallunox import HallucinationDetector\r\n\r\n# Option 1: Auto-optimized for low memory (recommended)\r\ndetector = HallucinationDetector.for_low_memory(\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\", # Or any supported model\r\n device=\"cuda\",\r\n enable_response_generation=True, # Enable response generation for evaluation\r\n verbose=True # Show loading progress (optional)\r\n)\r\n\r\n# Option 2: Manual quantization configuration\r\ndetector = HallucinationDetector(\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\",\r\n use_quantization=True, # Enable 4-bit quantization\r\n enable_response_generation=True,\r\n device=\"cuda\"\r\n)\r\n\r\n# Option 3: Custom quantization settings\r\nfrom transformers import BitsAndBytesConfig\r\nimport torch\r\n\r\ncustom_quant_config = BitsAndBytesConfig(\r\n load_in_4bit=True,\r\n bnb_4bit_quant_type=\"nf4\", # NF4 quantization type\r\n bnb_4bit_use_double_quant=True, # Double quantization for extra savings\r\n bnb_4bit_compute_dtype=torch.bfloat16 # Compute in bfloat16\r\n)\r\n\r\ndetector = HallucinationDetector(\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\",\r\n quantization_config=custom_quant_config,\r\n device=\"cuda\"\r\n)\r\n\r\nprint(f\"\u2705 Memory optimized: {detector.use_quantization}\")\r\nprint(f\"\ud83d\udd27 Quantization: 4-bit 
NF4 with double quantization\")\r\n```\r\n\r\n## \ud83e\udd16 Response Generation & Evaluation\r\n\r\n### Enabling Response Generation\r\n\r\nWhen `enable_response_generation=True`, HalluNox can generate responses for evaluation and display the model's actual output alongside confidence scores:\r\n\r\n```python\r\nfrom hallunox import HallucinationDetector\r\n\r\n# Enable response generation for evaluation\r\ndetector = HallucinationDetector.for_low_memory(\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\",\r\n device=\"cuda\",\r\n enable_response_generation=True, # Enable response generation\r\n verbose=False # Clean logs for evaluation\r\n)\r\n\r\n# Test questions for evaluation\r\ntest_questions = [\r\n \"What are the symptoms of diabetes?\",\r\n \"Drinking bleach will cure COVID-19.\", # Dangerous misinformation\r\n \"How does aspirin help prevent heart attacks?\",\r\n \"All vaccines cause autism in children.\", # Medical misinformation\r\n]\r\n\r\n# Analyze with response generation\r\nfor question in test_questions:\r\n # The model will generate a response and analyze it\r\n results = detector.predict([question])\r\n prediction = results[\"predictions\"][0]\r\n \r\n print(f\"Question: {question}\")\r\n print(f\"Confidence: {prediction['confidence_score']:.3f}\")\r\n print(f\"Risk Level: {prediction['risk_level']}\")\r\n print(f\"Action: {prediction['medical_action']}\")\r\n print(f\"Description: {prediction['description']}\")\r\n print(\"-\" * 50)\r\n```\r\n\r\n### Response Generation Modes\r\n\r\n```python\r\n# Generate and analyze responses with confidence checking\r\nresponse = detector.generate_response(\r\n \"What are the side effects of ibuprofen?\", \r\n check_confidence=True\r\n)\r\n\r\nif response[\"should_generate\"]:\r\n print(f\"\u2705 Generated Response: {response['response']}\")\r\n print(f\"Confidence: {response['confidence_score']:.3f}\")\r\n print(f\"Meets threshold: {response['meets_threshold']}\")\r\nelse:\r\n print(f\"\u26a0\ufe0f Response blocked (confidence: {response['confidence_score']:.3f})\")\r\n print(f\"Reason: {response['reason']}\")\r\n print(f\"Recommendation: {response['recommendation']}\")\r\n\r\n# Force generation for reference (useful for evaluation)\r\nforced_response = detector.generate_response(\r\n \"What are the side effects of ibuprofen?\", \r\n check_confidence=True, \r\n force_generate=True\r\n)\r\nprint(f\"\ud83d\udd2c Reference Response: {forced_response['response']}\")\r\nprint(f\"\ud83d\udcca Confidence: {forced_response['confidence_score']:.3f}\")\r\nprint(f\"\ud83c\udfaf Forced Generation: {forced_response['forced_generation']}\")\r\n```\r\n\r\n### Evaluation Output Example\r\n\r\n```\r\nQuestion: What are the symptoms of diabetes?\r\nGenerated Response: Common symptoms of diabetes include increased thirst, frequent urination, excessive hunger, unexplained weight loss, fatigue, and blurred vision. 
It's important to consult a healthcare provider for proper diagnosis.\r\nConfidence: 0.857\r\nRisk Level: LOW_MEDICAL_RISK\r\nAction: \u2705 Information can be used as reference\r\n--------------------------------------------------\r\nQuestion: Drinking bleach will cure COVID-19.\r\nGenerated Response: [Response blocked - confidence too low]\r\nConfidence: 0.123\r\nRisk Level: VERY_HIGH_MEDICAL_RISK\r\nAction: \u26d4 Do not use - seek professional medical advice\r\n--------------------------------------------------\r\n```\r\n\r\n### \ud83d\udcbe Memory Usage Comparison\r\n\r\n| Configuration | Model Size | VRAM Usage | Performance |\r\n|--------------|------------|------------|-------------|\r\n| **Full Precision** | ~16GB | ~14GB | 100% speed |\r\n| **FP16 Mixed Precision** | ~8GB | ~7GB | 95% speed |\r\n| **4-bit Quantization** | ~4GB | ~3.5GB | 85-90% speed |\r\n| **4-bit + Double Quant** | ~3.5GB | ~3GB | 85-90% speed |\r\n\r\n**Recommendation**: Use `HallucinationDetector.for_low_memory()` for GPUs with 8GB or less VRAM.\r\n\r\n### \ud83d\udcdd Enhanced Query-Context Support (NEW in v0.6.3!)\r\n\r\nHalluNox now provides comprehensive support for query-context pairs, especially beneficial for medical applications:\r\n\r\n```python\r\nfrom hallunox import HallucinationDetector\r\n\r\n# Initialize MedGemma detector for context-aware medical responses\r\ndetector = HallucinationDetector(\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\",\r\n enable_response_generation=True\r\n)\r\n\r\n# Medical query-context pairs for enhanced accuracy\r\nmedical_query_context_pairs = [\r\n {\r\n \"query\": \"Is it safe to take ibuprofen daily?\",\r\n \"context\": \"Patient has a history of gastric ulcers and is currently taking blood thinners for atrial fibrillation.\"\r\n },\r\n {\r\n \"query\": \"What's the recommended exercise routine?\",\r\n \"context\": \"28-year-old pregnant patient at 30 weeks, previously sedentary, no complications.\"\r\n },\r\n {\r\n \"query\": \"How should I manage my diabetes medication?\",\r\n \"context\": \"Type 2 diabetes patient, HbA1c 8.2%, currently on metformin 1000mg twice daily.\"\r\n }\r\n]\r\n\r\n# Method 1: Confidence analysis with context\r\nresults = detector.predict_with_query_context(medical_query_context_pairs)\r\nfor pred in results[\"predictions\"]:\r\n print(f\"Query: {pred['text']}\")\r\n print(f\"Context-Enhanced Confidence: {pred['confidence_score']:.3f}\")\r\n print(f\"Medical Risk Level: {pred['risk_level']}\")\r\n print(f\"Recommendation: {pred['routing_action']}\")\r\n\r\n# Method 2: Response generation with context\r\nresponses = detector.generate_response_with_context(\r\n medical_query_context_pairs,\r\n max_length=300,\r\n check_confidence=True\r\n)\r\n\r\nfor i, response in enumerate(responses):\r\n pair = medical_query_context_pairs[i]\r\n print(f\"\\nQuery: {pair['query']}\")\r\n print(f\"Context: {pair['context'][:60]}...\")\r\n \r\n if isinstance(response, dict) and \"should_generate\" in response:\r\n if response[\"should_generate\"]:\r\n print(f\"\u2705 Context-Aware Response: {response['response']}\")\r\n print(f\"Confidence: {response['confidence_score']:.3f}\")\r\n else:\r\n print(f\"\u26a0\ufe0f Blocked (confidence: {response['confidence_score']:.3f})\")\r\n print(f\"Recommendation: {response['recommendation']}\")\r\n\r\n# Method 3: Individual response with context\r\nsingle_response = detector.generate_response(\r\n prompt=\"Should I adjust my medication?\",\r\n query_context_pairs=[{\r\n \"query\": \"Should I adjust my 
### 📝 Enhanced Query-Context Support (NEW in v0.6.3!)

HalluNox now provides comprehensive support for query-context pairs, especially beneficial for medical applications:

```python
from hallunox import HallucinationDetector

# Initialize MedGemma detector for context-aware medical responses
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    enable_response_generation=True
)

# Medical query-context pairs for enhanced accuracy
medical_query_context_pairs = [
    {
        "query": "Is it safe to take ibuprofen daily?",
        "context": "Patient has a history of gastric ulcers and is currently taking blood thinners for atrial fibrillation."
    },
    {
        "query": "What's the recommended exercise routine?",
        "context": "28-year-old pregnant patient at 30 weeks, previously sedentary, no complications."
    },
    {
        "query": "How should I manage my diabetes medication?",
        "context": "Type 2 diabetes patient, HbA1c 8.2%, currently on metformin 1000mg twice daily."
    }
]

# Method 1: Confidence analysis with context
results = detector.predict_with_query_context(medical_query_context_pairs)
for pred in results["predictions"]:
    print(f"Query: {pred['text']}")
    print(f"Context-Enhanced Confidence: {pred['confidence_score']:.3f}")
    print(f"Medical Risk Level: {pred['risk_level']}")
    print(f"Recommendation: {pred['routing_action']}")

# Method 2: Response generation with context
responses = detector.generate_response_with_context(
    medical_query_context_pairs,
    max_length=300,
    check_confidence=True
)

for i, response in enumerate(responses):
    pair = medical_query_context_pairs[i]
    print(f"\nQuery: {pair['query']}")
    print(f"Context: {pair['context'][:60]}...")

    if isinstance(response, dict) and "should_generate" in response:
        if response["should_generate"]:
            print(f"✅ Context-Aware Response: {response['response']}")
            print(f"Confidence: {response['confidence_score']:.3f}")
        else:
            print(f"⚠️ Blocked (confidence: {response['confidence_score']:.3f})")
            print(f"Recommendation: {response['recommendation']}")

# Method 3: Individual response with context
single_response = detector.generate_response(
    prompt="Should I adjust my medication?",
    query_context_pairs=[{
        "query": "Should I adjust my medication?",
        "context": "Patient experiencing mild side effects from current dosage"
    }],
    check_confidence=True
)
```

### Context Impact Analysis

```python
# Compare confidence with and without context
query = "Is this medication safe during pregnancy?"

# Without context
no_context = detector.predict([query])
print(f"Without context: {no_context['predictions'][0]['confidence_score']:.3f}")

# With context
with_context = detector.predict([query], query_context_pairs=[{
    "query": query,
    "context": "Patient is 12 weeks pregnant, no previous complications, taking prenatal vitamins"
}])
print(f"With context: {with_context['predictions'][0]['confidence_score']:.3f}")

# Context benefit
improvement = with_context['predictions'][0]['confidence_score'] - no_context['predictions'][0]['confidence_score']
print(f"Context improvement: {improvement:+.3f}")
```

## 🖥️ Command Line Interface

HalluNox provides a comprehensive CLI for various use cases:

### Interactive Mode
```bash
# General model interactive mode
hallunox-infer --interactive

# MedGemma medical interactive mode
hallunox-infer --llm_model_id convaiinnovations/gemma-finetuned-4b-it --interactive --show_generated_text
```

### Batch Processing
```bash
# Process file with general model
hallunox-infer --input_file medical_texts.txt --output_file results.json

# Process with MedGemma and medical settings
hallunox-infer \
  --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
  --input_file medical_texts.txt \
  --output_file medical_results.json \
  --show_routing \
  --show_generated_text
```

### Image Analysis (Multimodal models only)
```bash
# Single image analysis
hallunox-infer \
  --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
  --image_path chest_xray.jpg \
  --show_generated_text

# Batch image analysis
hallunox-infer \
  --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
  --image_folder /path/to/medical/images \
  --output_file image_analysis.json
```

### Demo Mode
```bash
# General demo
hallunox-infer --demo --show_routing

# Medical demo with MedGemma
hallunox-infer \
  --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
  --demo \
  --mode both \
  --show_routing

# Text-only demo (faster initialization)
hallunox-infer \
  --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
  --demo \
  --mode text \
  --show_routing
```

## 🔨 Training Your Own Model

### Quick Training

```python
from hallunox import Trainer, TrainingConfig

# Configure training
config = TrainingConfig(
    # Model selection
    model_id="convaiinnovations/gemma-finetuned-4b-it",  # or "unsloth/Llama-3.2-3B-Instruct"
    embed_model_id="BAAI/bge-m3",

    # Training parameters
    batch_size=8,
    learning_rate=5e-4,
    max_epochs=6,
    warmup_steps=300,

    # Dataset configuration
    use_truthfulqa=True,
    use_halueval=True,
    use_fever=True,
    max_samples_per_dataset=3000,

    # Output
    output_dir="./models/my_medical_model"
)

# Train model
trainer = Trainer(config)
trainer.train()
```

### Command Line Training
```bash
# Train general model
hallunox-train --batch_size 8 --learning_rate 5e-4 --max_epochs 6

# Train medical model
hallunox-train \
  --model_id convaiinnovations/gemma-finetuned-4b-it \
  --batch_size 4 \
  --learning_rate 3e-4 \
  --max_epochs 8 \
  --output_dir ./models/custom_medgemma
```
## 🏗️ Model Architecture

HalluNox supports two main architectures:

### General Architecture (Llama-3.2-3B)
1. **LLM Component**: Llama-3.2-3B-Instruct
   - Extracts internal hidden representations (3072D)
   - Supports any Llama-architecture model

2. **Embedding Model**: BGE-M3 (fixed)
   - Provides reference semantic embeddings
   - 1024-dimensional dense vectors

3. **Projection Network**: Standard ProjectionHead
   - Maps LLM hidden states to embedding space
   - 3-layer MLP with ReLU activations and dropout (see the sketch after this section)

### Medical Architecture (MedGemma-4B-IT)
1. **Unified Multimodal Model**:
   - **Single Model**: AutoModelForImageTextToText handles both text and images
   - **Memory Optimized**: Avoids double loading (saves ~8GB VRAM)
   - **Fallback Support**: Graceful degradation to text-only if needed

2. **Embedding Model**: BGE-M3 (same as general)
   - Enhanced with medical context formatting

3. **Projection Network**: UltraStableProjectionHead
   - Ultra-stable architecture with heavy normalization
   - Conservative weight initialization for medical precision
   - Tanh activations for stability
   - Enhanced dropout and layer normalization

4. **Multimodal Processor**: AutoProcessor
   - Handles image + text inputs
   - Supports chat template formatting

5. **Quantization Support**: 4-bit NF4 with double quantization
   - Reduces memory usage by ~75%
   - Maintains 85-90% performance
   - Automatic fallback for CPU
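For intuition, the standard ProjectionHead is essentially a small MLP from the LLM hidden size (3072 for Llama-3.2-3B, per the notes above) to the BGE-M3 embedding size (1024), whose output can then be compared against reference embeddings by cosine similarity. The sketch below is illustrative rather than the package's exact implementation; the intermediate width and dropout rate are assumptions.

```python
import torch
import torch.nn as nn


class ProjectionHeadSketch(nn.Module):
    """Illustrative 3-layer MLP with ReLU and dropout (dims from the architecture notes)."""

    def __init__(self, llm_dim: int = 3072, embed_dim: int = 1024,
                 hidden_dim: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(llm_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Map pooled LLM hidden states into the BGE-M3 embedding space
        return self.net(hidden_states)


# Example: project a batch of pooled hidden states and score semantic alignment
proj = ProjectionHeadSketch()
pooled_hidden = torch.randn(4, 3072)   # e.g. pooled LLM hidden states
reference = torch.randn(4, 1024)       # BGE-M3 embeddings of the same texts
similarity = torch.nn.functional.cosine_similarity(proj(pooled_hidden), reference)
print(similarity.shape)  # torch.Size([4])
```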
## 📊 API Reference

### HallucinationDetector

#### Constructor Parameters

```python
HallucinationDetector(
    model_path: str = None,                    # Path to trained model (None = auto-download)
    llm_model_id: str = "unsloth/Llama-3.2-3B-Instruct",  # LLM model ID
    embed_model_id: str = "BAAI/bge-m3",       # Embedding model ID
    device: str = None,                        # Device (None = auto-detect)
    max_length: int = 512,                     # LLM sequence length
    bge_max_length: int = 512,                 # BGE-M3 sequence length
    use_fp16: bool = True,                     # Mixed precision
    load_llm: bool = True,                     # Load LLM
    enable_inference: bool = False,            # Enable LLM inference
    confidence_threshold: float = None,        # Custom threshold (auto-detected)
    enable_response_generation: bool = False,  # Enable response generation
    use_quantization: bool = False,            # Enable 4-bit quantization for memory savings
    quantization_config: BitsAndBytesConfig = None,  # Custom quantization config
    mode: str = "text",                        # Operation mode: "text", "image", "both", "auto" (default: "text")
)
```

#### Core Methods

**Text Analysis:**
- `predict(texts, query_context_pairs=None)` - Analyze texts for hallucination confidence
- `predict_with_query_context(query_context_pairs)` - Query-context prediction
- `batch_predict(texts, batch_size=16)` - Efficient batch processing

**Response Generation:**
- `generate_response(prompt, max_length=512, check_confidence=True, force_generate=False, query_context_pairs=None)` - Generate responses with confidence checking and optional context
- `generate_response_with_context(query_context_pairs, max_length=512, check_confidence=True, force_generate=False)` - Generate responses for multiple query-context pairs

**Multimodal (MedGemma only):**
- `predict_images(images, image_descriptions=None)` - Analyze image confidence
- `generate_image_response(image, prompt, max_length=200)` - Generate image descriptions

**Analysis:**
- `evaluate_routing_strategy(texts)` - Analyze routing decisions (usage sketch below)

**Factory Methods:**
- `for_embedding_only()` - Create embedding-only detector
- `for_low_memory()` - Create memory-optimized detector with 4-bit quantization
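A brief usage sketch for the analysis methods listed above. It assumes `batch_predict` returns the same `predictions` structure as `predict`; the return value of `evaluate_routing_strategy` is not documented here, so it is simply printed.

```python
from hallunox import HallucinationDetector

detector = HallucinationDetector()  # defaults as documented above

texts = [
    "Aspirin irreversibly inhibits platelet cyclooxygenase.",
    "The capital of Australia is Sydney.",  # deliberately dubious statement
]

# Batch scoring (assumed to mirror predict()'s output structure)
batch_results = detector.batch_predict(texts, batch_size=16)
for pred in batch_results["predictions"]:
    print(pred["confidence_score"], pred["routing_action"])

# Routing analysis over the same texts
routing_report = detector.evaluate_routing_strategy(texts)
print(routing_report)
```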
#### Response Format

```python
{
    "predictions": [
        {
            "text": "input text",
            "confidence_score": 0.85,             # 0.0 to 1.0
            "similarity_score": 0.92,             # Cosine similarity
            "interpretation": "HIGH_CONFIDENCE",  # or HIGH_MEDICAL_CONFIDENCE
            "risk_level": "LOW_RISK",             # or LOW_MEDICAL_RISK
            "routing_action": "LOCAL_GENERATION",
            "description": "This response appears to be factual and reliable."
        }
    ],
    "summary": {
        "total_texts": 1,
        "avg_confidence": 0.85,
        "high_confidence_count": 1,
        "medium_confidence_count": 0,
        "low_confidence_count": 0,
        "very_low_confidence_count": 0
    }
}
```

#### Response Generation Format

```python
{
    "response": "Generated response text",
    "confidence_score": 0.85,
    "should_generate": True,
    "meets_threshold": True,
    "forced_generation": False,  # True if generated despite low confidence
    # Or when blocked:
    "reason": "Confidence 0.45 below threshold 0.60",
    "recommendation": "RAG_RETRIEVAL"
}
```

### Training Classes

- **`TrainingConfig`**: Configuration dataclass for training parameters
- **`Trainer`**: Main training class with dataset loading and model training
- **`MultiDatasetLoader`**: Loads and combines multiple hallucination detection datasets

### Utility Functions

- **`download_model()`**: Download general pre-trained model
- **`download_medgemma_model(model_name)`**: Download MedGemma medical model
- **`setup_logging(level)`**: Configure logging
- **`check_gpu_availability()`**: Check CUDA compatibility
- **`validate_model_requirements()`**: Verify dependencies
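A short sketch of how these utility helpers might be combined at start-up. It assumes the helpers are importable from the top-level `hallunox` package, that `setup_logging` accepts a standard logging level name, and that `check_gpu_availability` returns a truthy value when a compatible GPU is found; verify against the installed version before relying on it.

```python
from hallunox import (
    HallucinationDetector,
    setup_logging,
    check_gpu_availability,
    validate_model_requirements,
)

# Configure logging first (level name is an assumption)
setup_logging("INFO")

# Verify the environment before loading any models
if not check_gpu_availability():
    print("No compatible CUDA GPU detected; expect CPU-only performance.")
validate_model_requirements()

detector = HallucinationDetector.for_low_memory(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it"
)
```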
## 📈 Performance

Our confidence-aware routing system demonstrates:

- **74% hallucination detection rate** (vs 42% baseline)
- **9% false positive rate** (vs 15% baseline)
- **40% reduction in computational cost** vs post-hoc methods
- **1.6x average cost multiplier**, compared with 4.2x when expensive operations are always used

### Medical Domain Performance (MedGemma)
- **Enhanced medical accuracy** with 0.62 confidence threshold
- **Multimodal capability** for medical image analysis
- **Safety-first approach** with conservative thresholds
- **Professional verification workflow** for low-confidence cases

## 🖥️ Hardware Requirements

### Minimum (Inference Only)
- **CPU**: Modern multi-core processor
- **RAM**: 16GB system memory
- **GPU**: 8GB VRAM (RTX 3070, RTX 4060 Ti+)
- **Storage**: 15GB free space
- **Models**: ~5GB each (Llama/MedGemma)

### Recommended (Inference)
- **CPU**: Intel i7/AMD Ryzen 7+
- **RAM**: 32GB system memory
- **GPU**: 12GB+ VRAM (RTX 4070, RTX 3080+)
- **Storage**: NVMe SSD, 25GB+ free
- **CUDA**: 11.8+ compatible driver

### Training Requirements
- **CPU**: High-performance multi-core (i9/Ryzen 9)
- **RAM**: 64GB+ system memory
- **GPU**: 24GB+ VRAM (RTX 4090, A100, H100)
- **Storage**: 200GB+ NVMe SSD
  - Model checkpoints: ~10GB per epoch
  - Training datasets: ~30GB
  - Logs and outputs: ~50GB
- **Network**: High-speed internet for downloads

### MedGemma Specific
- **Additional storage**: +10GB for multimodal models
- **Image processing**: PIL/Pillow for image capabilities
- **Memory**: +4GB RAM for image processing pipeline

### CPU-Only Mode
- **RAM**: 32GB minimum (64GB recommended)
- **Performance**: 10-50x slower than GPU
- **Not recommended**: For production medical applications

## 🔒 Safety Considerations

### Medical Applications
- **Professional oversight required**: HalluNox is a research tool, not medical advice
- **Validation needed**: All medical outputs should be verified by qualified professionals
- **Conservative thresholds**: 0.62 threshold ensures high precision for medical content
- **Clear disclaimers**: Always include appropriate medical disclaimers in applications

### General Use
- **Confidence-based routing**: Use routing recommendations for appropriate escalation
- **Human oversight**: Very low confidence predictions require human review
- **Regular evaluation**: Monitor performance on your specific use cases

## 🛠️ Troubleshooting

### Common Issues and Solutions

#### CUDA Out of Memory Error
```
OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB...
```
**Solution**: Use 4-bit quantization
```python
detector = HallucinationDetector.for_low_memory()
```

#### Deprecated torch_dtype Warning
```
`torch_dtype` is deprecated! Use `dtype` instead!
```
**Solution**: Already fixed in HalluNox v0.3.2+ - the package now uses the correct `dtype` parameter.

#### Double Model Loading (MedGemma)
```
Loading checkpoint shards: 100% 2/2 [00:37<00:00, 18.20s/it]
Loading checkpoint shards: 100% 2/2 [00:36<00:00, 17.88s/it]
```
**Solution**: Already optimized in HalluNox v0.3.2+ - MedGemma now uses a unified model approach that avoids double loading.

#### Accelerate Warning
```
WARNING:accelerate.big_modeling:Some parameters are on the meta device...
```
**Solution**: This is normal with quantization - parameters are automatically moved to GPU during inference.

#### Dependency Version Conflict (AutoProcessor)
```
⚠️ Could not load AutoProcessor: module 'requests' has no attribute 'exceptions'
AttributeError: module 'requests' has no attribute 'exceptions'
```
**Solution**: This is a compatibility issue between transformers and requests versions.
```bash
pip install --upgrade transformers requests huggingface_hub
# Or force reinstall
pip install --force-reinstall transformers>=4.45.0 requests>=2.31.0
```
**Fallback**: HalluNox automatically falls back to text-only mode when this occurs.

#### Model Hidden States NaN/Inf Issues ✅ RESOLVED
```
⚠️ Warning: NaN/Inf detected in model hidden states
   Hidden shape: torch.Size([3, 16, 2560])
   NaN count: 122880
```
**✅ FIXED in HalluNox v0.6.3+**: This issue has been completely resolved by adopting the proven approach from our working inference pipeline:

**Root Cause**: 4-bit quantization was causing numerical instabilities with certain model architectures.

**Solution Applied**:
- **Disabled Quantization**: Removed 4-bit quantization that was causing NaN issues
- **Simplified Model Loading**: Now uses the same approach as our proven `inference_gemma.py`
- **Clean Architecture**: Removed complex stability measures that were interfering
- **Stable Precision**: Uses `torch.bfloat16` for optimal performance without instabilities

#### Repetitive Text and Unwanted Artifacts ✅ RESOLVED
```
🔬 Reference Response (forced): I am programmed to be a harmless AI assistant...
g
I am programmed to be a harmless AI assistant...
g
[repetitive output continues...]
```
**✅ FIXED in HalluNox v0.6.3+**: Repetitive text generation and unwanted artifacts have been completely resolved:

**Root Cause**: Improper message formatting and sampling parameters causing the model to not understand conversation boundaries.

**Solution Applied**:
- **Deterministic Generation**: Changed from `do_sample=True` to `do_sample=False`, matching the Jupyter notebook approach
- **Proper Chat Templates**: Adopted exact message formatting from the working Jupyter notebook implementation
- **Removed Sampling Parameters**: Eliminated `temperature`, `top_p`, `repetition_penalty` that were causing issues
- **Clean Tokenization**: Uses `tokenizer.apply_chat_template()` with proper parameters for conversation structure

**Current Recommended Usage** (v0.6.3+):
```python
# Standard usage - now stable by default
detector = HallucinationDetector.for_low_memory(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    device="cuda"
)

# Both NaN issues and repetitive text are now automatically resolved
```

**Migration from v0.4.9 and earlier**: No code changes needed - existing code will automatically use the stable approach.

#### Environment Optimization
For better memory management, set:
```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

### Memory Requirements by Configuration

| GPU VRAM | Recommended Configuration | Expected Performance |
|----------|--------------------------|---------------------|
| **4-6GB** | `for_low_memory()` + reduce batch size | Basic functionality |
| **8-12GB** | `for_low_memory()` | Full functionality |
| **16GB+** | Standard configuration | Optimal performance |
| **24GB+** | Multiple models + training | Development/research |

## 📄 License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

## 📚 Citation

If you use HalluNox in your research, please cite:

```bibtex
@article{nandakishor2024hallunox,
  title={Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation},
  author={Nandakishor M},
  journal={AI Safety Research},
  year={2024},
  organization={Convai Innovations}
}
```
## 🤝 Contributing

We welcome contributions! Please see our contributing guidelines and submit pull requests to our repository.

### Development Setup
```bash
git clone https://github.com/convai-innovations/hallunox.git
cd hallunox
pip install -e ".[dev]"
```

## 📞 Support

For technical support and questions:
- **Email**: support@convaiinnovations.com
- **Issues**: [GitHub Issues](https://github.com/convai-innovations/hallunox/issues)
- **Documentation**: Full API docs available online

## 👨‍💻 Author

**Nandakishor M**  
AI Safety Research  
Convai Innovations Pvt. Ltd.  
Email: support@convaiinnovations.com

---

**Disclaimer**: HalluNox is a research tool for hallucination detection and should not be used as the sole basis for critical decisions, especially in medical contexts. Always seek professional advice for medical applications.
"bugtrack_url": null,
"license": "AGPL-3.0",
"summary": "A confidence-aware routing system for LLM hallucination detection using multi-signal approach",
"version": "0.6.3",
"project_urls": {
"Bug Reports": "https://github.com/convai-innovations/hallunox/issues",
"Documentation": "https://hallunox.readthedocs.io",
"Homepage": "https://convaiinnovations.com",
"Repository": "https://github.com/convai-innovations/hallunox",
"Source Code": "https://github.com/convai-innovations/hallunox"
},
"split_keywords": [
"hallucination-detection",
" llm",
" confidence-estimation",
" model-reliability",
" uncertainty-quantification",
" ai-safety"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "3d1715a5b7b1a88a22dfa7cfe9637e558e2f3f6add8c1d850235f59dd87c6cb7",
"md5": "533c4b5190880399506fd14d2c0fa1a3",
"sha256": "6261ab99835db50428edb43953c1b26c8abf052cccd6853c86d8142a6746e8ec"
},
"downloads": -1,
"filename": "hallunox-0.6.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "533c4b5190880399506fd14d2c0fa1a3",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 63640,
"upload_time": "2025-10-06T18:07:53",
"upload_time_iso_8601": "2025-10-06T18:07:53.859392Z",
"url": "https://files.pythonhosted.org/packages/3d/17/15a5b7b1a88a22dfa7cfe9637e558e2f3f6add8c1d850235f59dd87c6cb7/hallunox-0.6.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "2c76f15afa53d796d647f9a651381152717bc7951aa9cb638e9e4fedbcccfc9b",
"md5": "4c8b49e98068611d02d576b20b6cbf92",
"sha256": "75755724d9845caff0cb8f9dc39ee5e5e8c53d266f686ed35056167dce9cd6a7"
},
"downloads": -1,
"filename": "hallunox-0.6.3.tar.gz",
"has_sig": false,
"md5_digest": "4c8b49e98068611d02d576b20b6cbf92",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 83983,
"upload_time": "2025-10-06T18:07:58",
"upload_time_iso_8601": "2025-10-06T18:07:58.218139Z",
"url": "https://files.pythonhosted.org/packages/2c/76/f15afa53d796d647f9a651381152717bc7951aa9cb638e9e4fedbcccfc9b/hallunox-0.6.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-06 18:07:58",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "convai-innovations",
"github_project": "hallunox",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "hallunox"
}