# HalluNox
**Confidence-Aware Routing for Large Language Model Reliability Enhancement**
A Python package implementing a multi-signal approach to pre-generation hallucination mitigation for Large Language Models. HalluNox combines semantic alignment measurement, internal convergence analysis, and learned confidence estimation to produce unified confidence scores for proactive routing decisions.
## ✨ Features
- **🎯 Pre-generation Hallucination Detection**: Assess model reliability before generation begins
- **🔄 Confidence-Aware Routing**: Automatically route queries based on estimated confidence
- **🧠 Multi-Signal Approach**: Combines semantic alignment, internal convergence, and learned confidence
- **⚡ Multi-Model Support**: Llama-3.2-3B-Instruct and MedGemma-4B-IT architectures
- **🏥 Medical Domain Specialization**: Enhanced MedGemma 4B-IT support with medical-grade confidence thresholds
- **🖼️ Multimodal Capabilities**: Image analysis and response generation for MedGemma models
- **📊 Comprehensive Evaluation**: Built-in metrics and routing strategy analysis
- **🚀 Easy Integration**: Simple API for both training and inference
- **🏃‍♂️ Performance Optimizations**: Optional LLM loading for faster initialization and lower memory usage
- **📝 Enhanced Query-Context**: Improved accuracy with structured prompt formatting
- **🎛️ Adaptive Thresholds**: Dynamic confidence thresholds based on model type (0.62 for medical, 0.65 for general)
- **💬 Response Generation**: Built-in response generation with confidence-gated output
- **🔧 Automatic Model Management**: Auto-download and configuration for supported models
## 🔬 Research Foundation
Based on the research paper "Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation" by Nandakishor M (Convai Innovations).
The approach implements deterministic routing to appropriate response pathways (a minimal routing sketch follows the threshold lists below):
### General Models (Llama-3.2-3B)
- **High Confidence (≥0.65)**: Local generation
- **Medium Confidence (0.60-0.65)**: Retrieval-augmented generation
- **Low Confidence (0.40-0.60)**: Route to larger models
- **Very Low Confidence (<0.40)**: Human review required
### Medical Models (MedGemma-4B-IT)
- **High Medical Confidence (≥0.60)**: Local generation with medical validation
- **Medium Medical Confidence (0.55-0.60)**: Medical literature verification required
- **Low Medical Confidence (0.50-0.55)**: Professional medical verification required
- **Very Low Medical Confidence (<0.50)**: Seek professional medical advice
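As an illustration, these tiers could be expressed as a small routing helper. This is a minimal sketch: the function names and the undocumented action labels are hypothetical, while `LOCAL_GENERATION` and `RAG_RETRIEVAL` mirror routing actions that appear in the package's output format.
```python
def route_general(confidence: float) -> str:
    """Map a confidence score to a routing tier for general models (Llama-3.2-3B)."""
    if confidence >= 0.65:
        return "LOCAL_GENERATION"
    elif confidence >= 0.60:
        return "RAG_RETRIEVAL"              # retrieval-augmented generation
    elif confidence >= 0.40:
        return "ESCALATE_TO_LARGER_MODEL"   # hypothetical label
    return "HUMAN_REVIEW"                   # hypothetical label

def route_medical(confidence: float) -> str:
    """Map a confidence score to a routing tier for medical models (MedGemma-4B-IT)."""
    if confidence >= 0.60:
        return "LOCAL_GENERATION_WITH_MEDICAL_VALIDATION"   # hypothetical label
    elif confidence >= 0.55:
        return "MEDICAL_LITERATURE_VERIFICATION"            # hypothetical label
    elif confidence >= 0.50:
        return "PROFESSIONAL_MEDICAL_VERIFICATION"          # hypothetical label
    return "SEEK_PROFESSIONAL_MEDICAL_ADVICE"               # hypothetical label
```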
## 🆕 What's New in v0.6.3
### ✨ Enhanced Query-Context Support
- **🔗 Query-Context Pairs**: Full support for query_context_pairs in MedGemma models for enhanced context-aware responses
- **🎯 Improved Accuracy**: Better confidence scoring when context is provided
- **📝 Enhanced Response Generation**: Context-aware prompt formatting for more accurate medical responses
- **🔄 Batch Processing**: New `generate_response_with_context()` method for processing multiple query-context pairs
### 🏥 Medical Domain Enhancements
- **🩺 Context Integration**: Medical queries now benefit from patient context and clinical background
- **📊 Better Confidence**: Context helps improve confidence scoring for medical scenarios
- **🎛️ Flexible Usage**: Works with existing methods while providing new convenience functions
- **🔍 Example Implementation**: New query_context_example.py demonstrates usage patterns
### 🧹 Simplified Architecture
- **📱 Removed Dashboard**: Eliminated dashboard dependencies for cleaner core package
- **⚡ Streamlined Installation**: Faster installation without unnecessary web components
- **🎯 Focused Functionality**: Core hallucination detection without UI overhead
- **📦 Lightweight**: Reduced package size and dependencies
### 🔧 Technical Improvements
- **🔗 Enhanced Prompt Formatting**: Context gets properly integrated into medical prompts
- **🎯 Backward Compatibility**: All existing code continues to work unchanged
- **📝 Better Documentation**: Comprehensive examples for query-context usage
- **🛡️ Stable Performance**: Maintains all stability improvements from v0.6.3
## 🚀 Installation
### Requirements
- Python 3.8+
- PyTorch 1.13+
- CUDA-compatible GPU (recommended)
- At least 8GB GPU memory for inference (improved efficiency in v0.6.3+)
- 16GB RAM minimum (32GB recommended for training)
### Install from PyPI
```bash
pip install hallunox
```
### Install from Source
```bash
git clone https://github.com/convai-innovations/hallunox.git
cd hallunox
pip install -e .
```
### MedGemma Model Access
HalluNox uses the open-access `convaiinnovations/gemma-finetuned-4b-it` model, which doesn't require authentication. The model will be automatically downloaded on first use.
### Core Dependencies
HalluNox automatically installs:
- `torch>=1.13.0` - PyTorch framework
- `transformers>=4.21.0` - Hugging Face Transformers
- `FlagEmbedding>=1.2.0` - BGE-M3 embedding model
- `datasets>=2.0.0` - Dataset loading utilities
- `scikit-learn>=1.0.0` - Evaluation metrics
- `numpy>=1.21.0` - Numerical computations
- `tqdm>=4.64.0` - Progress bars
- `Pillow>=8.0.0` - Image processing for multimodal capabilities
- `bitsandbytes>=0.41.0` - 4-bit quantization for memory optimization
## 📖 Quick Start
### Basic Usage (Llama-3.2-3B)
```python
from hallunox import HallucinationDetector
# Initialize detector (downloads pre-trained model automatically)
detector = HallucinationDetector()
# Analyze text for hallucination risk
results = detector.predict([
"The capital of France is Paris.", # High confidence
"Your password is 12345678.", # Low confidence
"The Moon is made of cheese." # Very low confidence
])
# View results
for pred in results["predictions"]:
    print(f"Text: {pred['text']}")
    print(f"Confidence: {pred['confidence_score']:.3f}")
    print(f"Risk Level: {pred['risk_level']}")
    print(f"Routing Action: {pred['routing_action']}")
    print()
```
### 🏥 MedGemma Medical Domain Usage
For medical applications using MedGemma 4B-IT with multimodal capabilities:
```python
from hallunox import HallucinationDetector
from PIL import Image
# Initialize MedGemma detector (auto-downloads medical model)
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
confidence_threshold=0.60, # Medical-grade threshold
enable_response_generation=True, # Enable response generation
enable_inference=True,
mode="text" # Text-only mode (default)
)
# Medical text analysis
medical_results = detector.predict([
"Aspirin can help reduce heart attack risk when prescribed by a doctor.",
"Drinking bleach will cure COVID-19.", # Dangerous misinformation
"Type 2 diabetes requires insulin injections in all cases.", # Partially incorrect
])
for pred in medical_results["predictions"]:
    print(f"Medical Text: {pred['text'][:60]}...")
    print(f"Confidence: {pred['confidence_score']:.3f}")
    print(f"Risk Level: {pred['risk_level']}")
    print(f"Medical Action: {pred['routing_action']}")
    print(f"Description: {pred['description']}")
    print("-" * 50)
# Response generation with confidence checking
question = "What are the symptoms of pneumonia?"
response = detector.generate_response(question, check_confidence=True)
if response["should_generate"]:
    print(f"✅ Medical Response Generated (confidence: {response['confidence_score']:.3f})")
    print(f"Response: {response['response']}")
    print(f"Meets threshold: {response['meets_threshold']}")
    if response.get('forced_generation'):
        print("⚠️ Note: Response was generated despite low confidence")
else:
    print(f"⚠️ Response blocked (confidence: {response['confidence_score']:.3f})")
    print(f"Reason: {response['reason']}")
    print(f"Recommendation: {response['recommendation']}")
# Force generation for reference regardless of confidence
forced_response = detector.generate_response(
question,
check_confidence=True,
force_generate=True # Generate even if confidence is low
)
print(f"🔬 Reference Response (forced): {forced_response['response']}")
print(f"📊 Confidence: {forced_response['confidence_score']:.3f}")
print(f"🎯 Forced Generation: {forced_response['forced_generation']}")
# Multimodal image analysis (MedGemma 4B-IT only)
if detector.is_multimodal:
    print("\n🖼️ Multimodal Image Analysis")

    # Load medical image (replace with actual medical image)
    try:
        image = Image.open("chest_xray.jpg")
    except Exception:
        # Create demo image for testing
        image = Image.new('RGB', (224, 224), color='lightgray')

    # Analyze image confidence
    image_results = detector.predict_images([image], ["Chest X-ray"])

    for pred in image_results["predictions"]:
        print(f"Image: {pred['image_description']}")
        print(f"Confidence: {pred['confidence_score']:.3f}")
        print(f"Interpretation: {pred['interpretation']}")
        print(f"Risk Level: {pred['risk_level']}")

    # Generate image description
    description = detector.generate_image_response(
        image,
        "Describe the findings in this chest X-ray."
    )
    print(f"Generated Description: {description}")
```
### 🔧 Advanced Configuration
```python
from hallunox import HallucinationDetector
# Full configuration example
detector = HallucinationDetector(
# Model selection
llm_model_id="convaiinnovations/gemma-finetuned-4b-it", # or "unsloth/Llama-3.2-3B-Instruct"
embed_model_id="BAAI/bge-m3",
# Custom model weights (optional)
model_path="/path/to/custom/model.pt", # None = auto-download
# Hardware configuration
device="cuda", # or "cpu"
use_fp16=True, # Mixed precision for faster inference
# Sequence lengths
max_length=512, # LLM context length
bge_max_length=512, # BGE-M3 context length
# Feature toggles
load_llm=True, # Load LLM for embeddings
enable_inference=True, # Enable LLM inference
enable_response_generation=True, # Enable response generation
# Confidence settings
confidence_threshold=0.60, # Custom threshold (auto-detected by model type)
# Operation mode
mode="text", # "text", "image", "both", or "auto"
)
# Check model capabilities
print(f"Model type: {'Medical' if detector.is_medgemma_4b else 'General'}")
print(f"Multimodal support: {detector.is_multimodal}")
print(f"Operation mode: {detector.effective_mode} (requested: {detector.mode})")
print(f"Confidence threshold: {detector.confidence_threshold}")
```
### 🎛️ Operation Mode Configuration
The `mode` parameter controls what types of input the detector can process:
```python
from hallunox import HallucinationDetector
# Text mode (default) - processes text inputs only
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
mode="text" # Text-only processing (default)
)
# Auto mode - detects capabilities from model
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
mode="auto" # Auto: detects based on model capabilities
)
# Image-only mode - processes images only (requires multimodal model)
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
mode="image" # Image processing only
)
# Both mode - processes text and images (requires multimodal model)
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
mode="both" # Explicit multimodal mode
)
```
#### Mode Validation
- **Text mode**: Available for all models (default)
- **Image mode**: Requires multimodal model (e.g., convaiinnovations/gemma-finetuned-4b-it)
- **Both mode**: Requires multimodal model (e.g., convaiinnovations/gemma-finetuned-4b-it)
- **Auto mode**: Automatically selects based on model capabilities (see the sketch after this list)
  - Multimodal models → `effective_mode = "both"`
  - Text-only models → `effective_mode = "text"`
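A rough sketch of how this resolution could work (a hypothetical helper for illustration, not the package's internal code):
```python
def resolve_effective_mode(requested_mode: str, is_multimodal: bool) -> str:
    """Resolve the effective operation mode from the requested mode and model capability."""
    if requested_mode == "auto":
        # Auto mode picks the richest mode the model supports
        return "both" if is_multimodal else "text"
    if requested_mode in ("image", "both") and not is_multimodal:
        raise ValueError(f"Mode '{requested_mode}' requires a multimodal model")
    return requested_mode
```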
#### Error Examples
```python
# This will raise an error - image mode requires multimodal model
detector = HallucinationDetector(
llm_model_id="unsloth/Llama-3.2-3B-Instruct",
mode="image" # ❌ Error: Image mode requires multimodal model
)
# This will raise an error - calling image methods in text mode
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
mode="text"
)
detector.predict_images([image]) # ❌ Error: Current mode is 'text'
```
### ⚡ Performance Optimized Usage
For faster initialization when only doing embedding comparisons:
```python
from hallunox import HallucinationDetector
# Option 1: Factory method for embedding-only usage
detector = HallucinationDetector.for_embedding_only(
device="cuda",
use_fp16=True
)
# Option 2: Explicit parameter control
detector = HallucinationDetector(
load_llm=False, # Skip expensive LLM loading
enable_inference=False, # Disable inference capabilities
use_fp16=True # Use mixed precision
)
# Note: This configuration cannot perform predictions
# Use for preprocessing or embedding extraction only
```
### 🧠 Memory Optimization with Quantization
For GPUs with limited VRAM (8-16GB), use 4-bit quantization:
```python
from hallunox import HallucinationDetector
# Option 1: Auto-optimized for low memory (recommended)
detector = HallucinationDetector.for_low_memory(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it", # Or any supported model
device="cuda",
enable_response_generation=True, # Enable response generation for evaluation
verbose=True # Show loading progress (optional)
)
# Option 2: Manual quantization configuration
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
use_quantization=True, # Enable 4-bit quantization
enable_response_generation=True,
device="cuda"
)
# Option 3: Custom quantization settings
from transformers import BitsAndBytesConfig
import torch
custom_quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NF4 quantization type
bnb_4bit_use_double_quant=True, # Double quantization for extra savings
bnb_4bit_compute_dtype=torch.bfloat16 # Compute in bfloat16
)
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
quantization_config=custom_quant_config,
device="cuda"
)
print(f"✅ Memory optimized: {detector.use_quantization}")
print(f"🔧 Quantization: 4-bit NF4 with double quantization")
```
## 🤖 Response Generation & Evaluation
### Enabling Response Generation
When `enable_response_generation=True`, HalluNox can generate responses for evaluation and display the model's actual output alongside confidence scores:
```python
from hallunox import HallucinationDetector
# Enable response generation for evaluation
detector = HallucinationDetector.for_low_memory(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
device="cuda",
enable_response_generation=True, # Enable response generation
verbose=False # Clean logs for evaluation
)
# Test questions for evaluation
test_questions = [
"What are the symptoms of diabetes?",
"Drinking bleach will cure COVID-19.", # Dangerous misinformation
"How does aspirin help prevent heart attacks?",
"All vaccines cause autism in children.", # Medical misinformation
]
# Analyze with response generation
for question in test_questions:
    # The model will generate a response and analyze it
    results = detector.predict([question])
    prediction = results["predictions"][0]

    print(f"Question: {question}")
    print(f"Confidence: {prediction['confidence_score']:.3f}")
    print(f"Risk Level: {prediction['risk_level']}")
    print(f"Action: {prediction['medical_action']}")
    print(f"Description: {prediction['description']}")
    print("-" * 50)
```
### Response Generation Modes
```python
# Generate and analyze responses with confidence checking
response = detector.generate_response(
"What are the side effects of ibuprofen?",
check_confidence=True
)
if response["should_generate"]:
    print(f"✅ Generated Response: {response['response']}")
    print(f"Confidence: {response['confidence_score']:.3f}")
    print(f"Meets threshold: {response['meets_threshold']}")
else:
    print(f"⚠️ Response blocked (confidence: {response['confidence_score']:.3f})")
    print(f"Reason: {response['reason']}")
    print(f"Recommendation: {response['recommendation']}")
# Force generation for reference (useful for evaluation)
forced_response = detector.generate_response(
"What are the side effects of ibuprofen?",
check_confidence=True,
force_generate=True
)
print(f"🔬 Reference Response: {forced_response['response']}")
print(f"📊 Confidence: {forced_response['confidence_score']:.3f}")
print(f"🎯 Forced Generation: {forced_response['forced_generation']}")
```
### Evaluation Output Example
```
Question: What are the symptoms of diabetes?
Generated Response: Common symptoms of diabetes include increased thirst, frequent urination, excessive hunger, unexplained weight loss, fatigue, and blurred vision. It's important to consult a healthcare provider for proper diagnosis.
Confidence: 0.857
Risk Level: LOW_MEDICAL_RISK
Action: ✅ Information can be used as reference
--------------------------------------------------
Question: Drinking bleach will cure COVID-19.
Generated Response: [Response blocked - confidence too low]
Confidence: 0.123
Risk Level: VERY_HIGH_MEDICAL_RISK
Action: ⛔ Do not use - seek professional medical advice
--------------------------------------------------
```
### 💾 Memory Usage Comparison
| Configuration | Model Size | VRAM Usage | Performance |
|--------------|------------|------------|-------------|
| **Full Precision** | ~16GB | ~14GB | 100% speed |
| **FP16 Mixed Precision** | ~8GB | ~7GB | 95% speed |
| **4-bit Quantization** | ~4GB | ~3.5GB | 85-90% speed |
| **4-bit + Double Quant** | ~3.5GB | ~3GB | 85-90% speed |
**Recommendation**: Use `HallucinationDetector.for_low_memory()` for GPUs with 8GB or less VRAM.
### 📝 Enhanced Query-Context Support (NEW in v0.6.3!)
HalluNox now provides comprehensive support for query-context pairs, especially beneficial for medical applications:
```python
from hallunox import HallucinationDetector
# Initialize MedGemma detector for context-aware medical responses
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
enable_response_generation=True
)
# Medical query-context pairs for enhanced accuracy
medical_query_context_pairs = [
{
"query": "Is it safe to take ibuprofen daily?",
"context": "Patient has a history of gastric ulcers and is currently taking blood thinners for atrial fibrillation."
},
{
"query": "What's the recommended exercise routine?",
"context": "28-year-old pregnant patient at 30 weeks, previously sedentary, no complications."
},
{
"query": "How should I manage my diabetes medication?",
"context": "Type 2 diabetes patient, HbA1c 8.2%, currently on metformin 1000mg twice daily."
}
]
# Method 1: Confidence analysis with context
results = detector.predict_with_query_context(medical_query_context_pairs)
for pred in results["predictions"]:
    print(f"Query: {pred['text']}")
    print(f"Context-Enhanced Confidence: {pred['confidence_score']:.3f}")
    print(f"Medical Risk Level: {pred['risk_level']}")
    print(f"Recommendation: {pred['routing_action']}")
# Method 2: Response generation with context
responses = detector.generate_response_with_context(
medical_query_context_pairs,
max_length=300,
check_confidence=True
)
for i, response in enumerate(responses):
    pair = medical_query_context_pairs[i]
    print(f"\nQuery: {pair['query']}")
    print(f"Context: {pair['context'][:60]}...")

    if isinstance(response, dict) and "should_generate" in response:
        if response["should_generate"]:
            print(f"✅ Context-Aware Response: {response['response']}")
            print(f"Confidence: {response['confidence_score']:.3f}")
        else:
            print(f"⚠️ Blocked (confidence: {response['confidence_score']:.3f})")
            print(f"Recommendation: {response['recommendation']}")
# Method 3: Individual response with context
single_response = detector.generate_response(
prompt="Should I adjust my medication?",
query_context_pairs=[{
"query": "Should I adjust my medication?",
"context": "Patient experiencing mild side effects from current dosage"
}],
check_confidence=True
)
```
### Context Impact Analysis
```python
# Compare confidence with and without context
query = "Is this medication safe during pregnancy?"
# Without context
no_context = detector.predict([query])
print(f"Without context: {no_context['predictions'][0]['confidence_score']:.3f}")
# With context
with_context = detector.predict([query], query_context_pairs=[{
"query": query,
"context": "Patient is 12 weeks pregnant, no previous complications, taking prenatal vitamins"
}])
print(f"With context: {with_context['predictions'][0]['confidence_score']:.3f}")
# Context benefit
improvement = with_context['predictions'][0]['confidence_score'] - no_context['predictions'][0]['confidence_score']
print(f"Context improvement: {improvement:+.3f}")
```
## 🖥️ Command Line Interface
HalluNox provides a comprehensive CLI for various use cases:
### Interactive Mode
```bash
# General model interactive mode
hallunox-infer --interactive
# MedGemma medical interactive mode
hallunox-infer --llm_model_id convaiinnovations/gemma-finetuned-4b-it --interactive --show_generated_text
```
### Batch Processing
```bash
# Process file with general model
hallunox-infer --input_file medical_texts.txt --output_file results.json
# Process with MedGemma and medical settings
hallunox-infer \
--llm_model_id convaiinnovations/gemma-finetuned-4b-it \
--input_file medical_texts.txt \
--output_file medical_results.json \
--show_routing \
--show_generated_text
```
### Image Analysis (Multimodal models only)
```bash
# Single image analysis
hallunox-infer \
--llm_model_id convaiinnovations/gemma-finetuned-4b-it \
--image_path chest_xray.jpg \
--show_generated_text
# Batch image analysis
hallunox-infer \
--llm_model_id convaiinnovations/gemma-finetuned-4b-it \
--image_folder /path/to/medical/images \
--output_file image_analysis.json
```
### Demo Mode
```bash
# General demo
hallunox-infer --demo --show_routing
# Medical demo with MedGemma
hallunox-infer \
--llm_model_id convaiinnovations/gemma-finetuned-4b-it \
--demo \
--mode both \
--show_routing
# Text-only demo (faster initialization)
hallunox-infer \
--llm_model_id convaiinnovations/gemma-finetuned-4b-it \
--demo \
--mode text \
--show_routing
```
## 🔨 Training Your Own Model
### Quick Training
```python
from hallunox import Trainer, TrainingConfig
# Configure training
config = TrainingConfig(
# Model selection
model_id="convaiinnovations/gemma-finetuned-4b-it", # or "unsloth/Llama-3.2-3B-Instruct"
embed_model_id="BAAI/bge-m3",
# Training parameters
batch_size=8,
learning_rate=5e-4,
max_epochs=6,
warmup_steps=300,
# Dataset configuration
use_truthfulqa=True,
use_halueval=True,
use_fever=True,
max_samples_per_dataset=3000,
# Output
output_dir="./models/my_medical_model"
)
# Train model
trainer = Trainer(config)
trainer.train()
```
### Command Line Training
```bash
# Train general model
hallunox-train --batch_size 8 --learning_rate 5e-4 --max_epochs 6
# Train medical model
hallunox-train \
--model_id convaiinnovations/gemma-finetuned-4b-it \
--batch_size 4 \
--learning_rate 3e-4 \
--max_epochs 8 \
--output_dir ./models/custom_medgemma
```
## 🏗️ Model Architecture
HalluNox supports two main architectures:
### General Architecture (Llama-3.2-3B)
1. **LLM Component**: Llama-3.2-3B-Instruct
- Extracts internal hidden representations (3072D)
- Supports any Llama-architecture model
2. **Embedding Model**: BGE-M3 (fixed)
- Provides reference semantic embeddings
- 1024-dimensional dense vectors
3. **Projection Network**: Standard ProjectionHead (sketched after this list)
- Maps LLM hidden states to embedding space
- 3-layer MLP with ReLU activations and dropout
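An illustrative PyTorch sketch of such a projection head: the 3072→1024 mapping follows the dimensions described above, while the hidden width and dropout rate are assumptions rather than the package's exact values.
```python
import torch.nn as nn

class ProjectionHeadSketch(nn.Module):
    """3-layer MLP mapping LLM hidden states (3072D) into the BGE-M3 embedding space (1024D)."""
    def __init__(self, in_dim=3072, hidden_dim=2048, out_dim=1024, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, hidden_states):
        # Project hidden states so they can be compared against reference embeddings
        return self.net(hidden_states)
```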
### Medical Architecture (MedGemma-4B-IT)
1. **Unified Multimodal Model**:
- **Single Model**: AutoModelForImageTextToText handles both text and images
- **Memory Optimized**: Avoids double loading (saves ~8GB VRAM)
- **Fallback Support**: Graceful degradation to text-only if needed
2. **Embedding Model**: BGE-M3 (same as general)
- Enhanced with medical context formatting
3. **Projection Network**: UltraStableProjectionHead (sketched after this list)
- Ultra-stable architecture with heavy normalization
- Conservative weight initialization for medical precision
- Tanh activations for stability
- Enhanced dropout and layer normalization
4. **Multimodal Processor**: AutoProcessor
- Handles image + text inputs
- Supports chat template formatting
5. **Quantization Support**: 4-bit NF4 with double quantization
- Reduces memory usage by ~75%
- Maintains 85-90% performance
- Automatic fallback for CPU
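For comparison, a sketch of how the stability measures listed above (layer normalization, Tanh activations, conservative initialization) might fit together; the dimensions and initialization gain are illustrative assumptions, not the package's exact implementation.
```python
import torch.nn as nn

class UltraStableProjectionHeadSketch(nn.Module):
    """Projection head variant with heavy normalization and Tanh activations for stability."""
    def __init__(self, in_dim=2560, hidden_dim=1536, out_dim=1024, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(in_dim),
            nn.Linear(in_dim, hidden_dim),
            nn.Tanh(),
            nn.Dropout(dropout),
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, out_dim),
        )
        # Conservative (small-gain) initialization of the linear layers
        for module in self.net:
            if isinstance(module, nn.Linear):
                nn.init.xavier_uniform_(module.weight, gain=0.5)
                nn.init.zeros_(module.bias)

    def forward(self, hidden_states):
        return self.net(hidden_states)
```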
## 📊 API Reference
### HallucinationDetector
#### Constructor Parameters
```python
HallucinationDetector(
model_path: str = None, # Path to trained model (None = auto-download)
llm_model_id: str = "unsloth/Llama-3.2-3B-Instruct", # LLM model ID
embed_model_id: str = "BAAI/bge-m3", # Embedding model ID
device: str = None, # Device (None = auto-detect)
max_length: int = 512, # LLM sequence length
bge_max_length: int = 512, # BGE-M3 sequence length
use_fp16: bool = True, # Mixed precision
load_llm: bool = True, # Load LLM
enable_inference: bool = False, # Enable LLM inference
confidence_threshold: float = None, # Custom threshold (auto-detected)
enable_response_generation: bool = False, # Enable response generation
use_quantization: bool = False, # Enable 4-bit quantization for memory savings
quantization_config: BitsAndBytesConfig = None, # Custom quantization config
mode: str = "text", # Operation mode: "text", "image", "both", "auto" (default: "text")
)
```
#### Core Methods
**Text Analysis:**
- `predict(texts, query_context_pairs=None)` - Analyze texts for hallucination confidence
- `predict_with_query_context(query_context_pairs)` - Query-context prediction
- `batch_predict(texts, batch_size=16)` - Efficient batch processing
**Response Generation:**
- `generate_response(prompt, max_length=512, check_confidence=True, force_generate=False, query_context_pairs=None)` - Generate responses with confidence checking and optional context
- `generate_response_with_context(query_context_pairs, max_length=512, check_confidence=True, force_generate=False)` - Generate responses for multiple query-context pairs
**Multimodal (MedGemma only):**
- `predict_images(images, image_descriptions=None)` - Analyze image confidence
- `generate_image_response(image, prompt, max_length=200)` - Generate image descriptions
**Analysis:**
- `evaluate_routing_strategy(texts)` - Analyze routing decisions (see the usage sketch after this list)
**Factory Methods:**
- `for_embedding_only()` - Create embedding-only detector
- `for_low_memory()` - Create memory-optimized detector with 4-bit quantization
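A brief usage sketch of the batch and routing-analysis helpers. This assumes the detector is initialized as in the Quick Start and that `batch_predict()` returns the same structure as `predict()`; the exact shape of the routing report may differ.
```python
from hallunox import HallucinationDetector

detector = HallucinationDetector()

texts = [
    "The capital of France is Paris.",
    "The Moon is made of cheese.",
]

# Batched confidence scoring for larger workloads
batch_results = detector.batch_predict(texts, batch_size=16)
for pred in batch_results["predictions"]:
    print(pred["text"], pred["confidence_score"], pred["routing_action"])

# Aggregate view of how the same texts would be routed
routing_report = detector.evaluate_routing_strategy(texts)
print(routing_report)
```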
#### Response Format
```python
{
"predictions": [
{
"text": "input text",
"confidence_score": 0.85, # 0.0 to 1.0
"similarity_score": 0.92, # Cosine similarity
"interpretation": "HIGH_CONFIDENCE", # or HIGH_MEDICAL_CONFIDENCE
"risk_level": "LOW_RISK", # or LOW_MEDICAL_RISK
"routing_action": "LOCAL_GENERATION",
"description": "This response appears to be factual and reliable."
}
],
"summary": {
"total_texts": 1,
"avg_confidence": 0.85,
"high_confidence_count": 1,
"medium_confidence_count": 0,
"low_confidence_count": 0,
"very_low_confidence_count": 0
}
}
```
#### Response Generation Format
```python
{
"response": "Generated response text",
"confidence_score": 0.85,
"should_generate": True,
"meets_threshold": True,
"forced_generation": False, # True if generated despite low confidence
# Or when blocked:
"reason": "Confidence 0.45 below threshold 0.60",
"recommendation": "RAG_RETRIEVAL"
}
```
### Training Classes
- **`TrainingConfig`**: Configuration dataclass for training parameters
- **`Trainer`**: Main training class with dataset loading and model training
- **`MultiDatasetLoader`**: Loads and combines multiple hallucination detection datasets
### Utility Functions
- **`download_model()`**: Download general pre-trained model (see the usage sketch below)
- **`download_medgemma_model(model_name)`**: Download MedGemma medical model
- **`setup_logging(level)`**: Configure logging
- **`check_gpu_availability()`**: Check CUDA compatibility
- **`validate_model_requirements()`**: Verify dependencies
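A short sketch combining these helpers before creating a detector. It assumes the functions are importable from the package root, that `setup_logging()` accepts a standard logging level, and that `check_gpu_availability()` returns a truthy value when CUDA is usable.
```python
import logging

from hallunox import (
    HallucinationDetector,
    check_gpu_availability,
    setup_logging,
    validate_model_requirements,
)

setup_logging(logging.INFO)       # configure package logging
validate_model_requirements()     # verify that required dependencies are installed

# Fall back to CPU if no CUDA-capable GPU is available
device = "cuda" if check_gpu_availability() else "cpu"
detector = HallucinationDetector(device=device)
```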
## 📈 Performance
Our confidence-aware routing system demonstrates:
- **74% hallucination detection rate** (vs 42% baseline)
- **9% false positive rate** (vs 15% baseline)
- **40% reduction in computational cost** vs post-hoc methods
- **1.6x cost multiplier**, compared with 4.2x when always using expensive operations
### Medical Domain Performance (MedGemma)
- **Enhanced medical accuracy** with 0.62 confidence threshold
- **Multimodal capability** for medical image analysis
- **Safety-first approach** with conservative thresholds
- **Professional verification workflow** for low-confidence cases
## 🖥️ Hardware Requirements
### Minimum (Inference Only)
- **CPU**: Modern multi-core processor
- **RAM**: 16GB system memory
- **GPU**: 8GB VRAM (RTX 3070, RTX 4060 Ti+)
- **Storage**: 15GB free space
- **Models**: ~5GB each (Llama/MedGemma)
### Recommended (Inference)
- **CPU**: Intel i7/AMD Ryzen 7+
- **RAM**: 32GB system memory
- **GPU**: 12GB+ VRAM (RTX 4070, RTX 3080+)
- **Storage**: NVMe SSD, 25GB+ free
- **CUDA**: 11.8+ compatible driver
### Training Requirements
- **CPU**: High-performance multi-core (i9/Ryzen 9)
- **RAM**: 64GB+ system memory
- **GPU**: 24GB+ VRAM (RTX 4090, A100, H100)
- **Storage**: 200GB+ NVMe SSD
- Model checkpoints: ~10GB per epoch
- Training datasets: ~30GB
- Logs and outputs: ~50GB
- **Network**: High-speed internet for downloads
### MedGemma Specific
- **Additional storage**: +10GB for multimodal models
- **Image processing**: PIL/Pillow for image capabilities
- **Memory**: +4GB RAM for image processing pipeline
### CPU-Only Mode
- **RAM**: 32GB minimum (64GB recommended)
- **Performance**: 10-50x slower than GPU
- **Not recommended**: For production medical applications
## 🔒 Safety Considerations
### Medical Applications
- **Professional oversight required**: HalluNox is a research tool, not medical advice
- **Validation needed**: All medical outputs should be verified by qualified professionals
- **Conservative thresholds**: 0.62 threshold ensures high precision for medical content
- **Clear disclaimers**: Always include appropriate medical disclaimers in applications
### General Use
- **Confidence-based routing**: Use routing recommendations for appropriate escalation
- **Human oversight**: Very low confidence predictions require human review
- **Regular evaluation**: Monitor performance on your specific use cases
## 🛠️ Troubleshooting
### Common Issues and Solutions
#### CUDA Out of Memory Error
```
OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB...
```
**Solution**: Use 4-bit quantization
```python
detector = HallucinationDetector.for_low_memory()
```
#### Deprecated torch_dtype Warning
```
`torch_dtype` is deprecated! Use `dtype` instead!
```
**Solution**: Already fixed in HalluNox v0.3.2+ - the package now uses the correct `dtype` parameter.
#### Double Model Loading (MedGemma)
```
Loading checkpoint shards: 100% 2/2 [00:37<00:00, 18.20s/it]
Loading checkpoint shards: 100% 2/2 [00:36<00:00, 17.88s/it]
```
**Solution**: Already optimized in HalluNox v0.3.2+ - MedGemma now uses a unified model approach that avoids double loading.
#### Accelerate Warning
```
WARNING:accelerate.big_modeling:Some parameters are on the meta device...
```
**Solution**: This is normal with quantization - parameters are automatically moved to GPU during inference.
#### Dependency Version Conflict (AutoProcessor)
```
⚠️ Could not load AutoProcessor: module 'requests' has no attribute 'exceptions'
AttributeError: module 'requests' has no attribute 'exceptions'
```
**Solution**: This is a compatibility issue between transformers and requests versions.
```bash
pip install --upgrade transformers requests huggingface_hub
# Or force reinstall
pip install --force-reinstall transformers>=4.45.0 requests>=2.31.0
```
**Fallback**: HalluNox automatically falls back to text-only mode when this occurs.
#### Model Hidden States NaN/Inf Issues ✅ RESOLVED
```
⚠️ Warning: NaN/Inf detected in model hidden states
Hidden shape: torch.Size([3, 16, 2560])
NaN count: 122880
```
**✅ FIXED in HalluNox v0.6.3+**: This issue has been completely resolved by adopting the proven approach from our working inference pipeline:
**Root Cause**: 4-bit quantization was causing numerical instabilities with certain model architectures.
**Solution Applied**:
- **Disabled Quantization**: Removed 4-bit quantization that was causing NaN issues
- **Simplified Model Loading**: Now uses the same approach as our proven `inference_gemma.py`
- **Clean Architecture**: Removed complex stability measures that were interfering
- **Stable Precision**: Uses `torch.bfloat16` for optimal performance without instabilities
#### Repetitive Text and Unwanted Artifacts ✅ RESOLVED
```
🔬 Reference Response (forced): I am programmed to be a harmless AI assistant...
g
I am programmed to be a harmless AI assistant...
g
[repetitive output continues...]
```
**✅ FIXED in HalluNox v0.6.3+**: Repetitive text generation and unwanted artifacts have been completely resolved:
**Root Cause**: Improper message formatting and sampling parameters causing the model to not understand conversation boundaries.
**Solution Applied** (sketched below):
- **Deterministic Generation**: Changed from `do_sample=True` to `do_sample=False` matching Jupyter notebook approach
- **Proper Chat Templates**: Adopted exact message formatting from working Jupyter notebook implementation
- **Removed Sampling Parameters**: Eliminated `temperature`, `top_p`, `repetition_penalty` that were causing issues
- **Clean Tokenization**: Uses `tokenizer.apply_chat_template()` with proper parameters for conversation structure
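In plain Hugging Face terms, the generation path now resembles the following sketch. It uses the standard `transformers` chat-template APIs (with the AutoProcessor/AutoModelForImageTextToText classes named in the architecture section and a recent library version assumed); HalluNox's internal implementation may differ in details.
```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "convaiinnovations/gemma-finetuned-4b-it"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": [{"type": "text", "text": "What are the symptoms of pneumonia?"}]}
]

# Proper chat-template formatting so the model sees correct conversation boundaries
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

# Deterministic generation: no sampling parameters, avoiding repetitive artifacts
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```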
**Current Recommended Usage** (v0.6.3+):
```python
# Standard usage - now stable by default
detector = HallucinationDetector.for_low_memory(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
device="cuda"
)
# Both NaN issues and repetitive text are now automatically resolved
```
**Migration from v0.4.9 and earlier**: No code changes needed - existing code will automatically use the stable approach.
#### Environment Optimization
For better memory management, set:
```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```
### Memory Requirements by Configuration
| GPU VRAM | Recommended Configuration | Expected Performance |
|----------|--------------------------|---------------------|
| **4-6GB** | `for_low_memory()` + reduce batch size | Basic functionality |
| **8-12GB** | `for_low_memory()` | Full functionality |
| **16GB+** | Standard configuration | Optimal performance |
| **24GB+** | Multiple models + training | Development/research |
## 📄 License
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).
## 📚 Citation
If you use HalluNox in your research, please cite:
```bibtex
@article{nandakishor2024hallunox,
title={Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation},
author={Nandakishor M},
journal={AI Safety Research},
year={2024},
organization={Convai Innovations}
}
```
## 🤝 Contributing
We welcome contributions! Please see our contributing guidelines and submit pull requests to our repository.
### Development Setup
```bash
git clone https://github.com/convai-innovations/hallunox.git
cd hallunox
pip install -e ".[dev]"
```
## 📞 Support
For technical support and questions:
- **Email**: support@convaiinnovations.com
- **Issues**: [GitHub Issues](https://github.com/convai-innovations/hallunox/issues)
- **Documentation**: Full API docs available online
## 👨‍💻 Author
**Nandakishor M**
AI Safety Research
Convai Innovations Pvt. Ltd.
Email: support@convaiinnovations.com
---
**Disclaimer**: HalluNox is a research tool for hallucination detection and should not be used as the sole basis for critical decisions, especially in medical contexts. Always seek professional advice for medical applications.
Raw data
{
"_id": null,
"home_page": "https://github.com/convai-innovations/hallunox",
"name": "hallunox",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "\"Convai Innovations Pvt. Ltd.\" <support@convaiinnovations.com>",
"keywords": "hallucination-detection, llm, confidence-estimation, model-reliability, uncertainty-quantification, ai-safety",
"author": "Nandakishor M",
"author_email": "Nandakishor M <support@convaiinnovations.com>",
"download_url": "https://files.pythonhosted.org/packages/2c/76/f15afa53d796d647f9a651381152717bc7951aa9cb638e9e4fedbcccfc9b/hallunox-0.6.3.tar.gz",
"platform": null,
"description": "# HalluNox\r\n\r\n**Confidence-Aware Routing for Large Language Model Reliability Enhancement**\r\n\r\nA Python package implementing a multi-signal approach to pre-generation hallucination mitigation for Large Language Models. HalluNox combines semantic alignment measurement, internal convergence analysis, and learned confidence estimation to produce unified confidence scores for proactive routing decisions.\r\n\r\n## \u2728 Features\r\n\r\n- **\ud83c\udfaf Pre-generation Hallucination Detection**: Assess model reliability before generation begins\r\n- **\ud83d\udd04 Confidence-Aware Routing**: Automatically route queries based on estimated confidence\r\n- **\ud83e\udde0 Multi-Signal Approach**: Combines semantic alignment, internal convergence, and learned confidence\r\n- **\u26a1 Multi-Model Support**: Llama-3.2-3B-Instruct and MedGemma-4B-IT architectures\r\n- **\ud83c\udfe5 Medical Domain Specialization**: Enhanced MedGemma 4B-IT support with medical-grade confidence thresholds\r\n- **\ud83d\uddbc\ufe0f Multimodal Capabilities**: Image analysis and response generation for MedGemma models\r\n- **\ud83d\udcca Comprehensive Evaluation**: Built-in metrics and routing strategy analysis\r\n- **\ud83d\ude80 Easy Integration**: Simple API for both training and inference\r\n- **\ud83c\udfc3\u200d\u2642\ufe0f Performance Optimizations**: Optional LLM loading for faster initialization and lower memory usage\r\n- **\ud83d\udcdd Enhanced Query-Context**: Improved accuracy with structured prompt formatting\r\n- **\ud83c\udf9b\ufe0f Adaptive Thresholds**: Dynamic confidence thresholds based on model type (0.62 for medical, 0.65 for general)\r\n- **\ud83d\udcac Response Generation**: Built-in response generation with confidence-gated output\r\n- **\ud83d\udd27 Automatic Model Management**: Auto-download and configuration for supported models\r\n\r\n## \ud83d\udd2c Research Foundation\r\n\r\nBased on the research paper \"Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation\" by Nandakishor M (Convai Innovations).\r\n\r\nThe approach implements deterministic routing to appropriate response pathways:\r\n\r\n### General Models (Llama-3.2-3B)\r\n- **High Confidence (\u22650.65)**: Local generation \r\n- **Medium Confidence (0.60-0.65)**: Retrieval-augmented generation\r\n- **Low Confidence (0.4-0.60)**: Route to larger models\r\n- **Very Low Confidence (<0.4)**: Human review required\r\n\r\n### Medical Models (MedGemma-4B-IT)\r\n- **High Medical Confidence (\u22650.60)**: Local generation with medical validation\r\n- **Medium Medical Confidence (0.55-0.60)**: Medical literature verification required\r\n- **Low Medical Confidence (0.50-0.55)**: Professional medical verification required\r\n- **Very Low Medical Confidence (<0.50)**: Seek professional medical advice\r\n\r\n## \ud83c\udd95 What's New in v0.6.3\r\n\r\n### \u2728 Enhanced Query-Context Support\r\n- **\ud83d\udd17 Query-Context Pairs**: Full support for query_context_pairs in MedGemma models for enhanced context-aware responses\r\n- **\ud83c\udfaf Improved Accuracy**: Better confidence scoring when context is provided\r\n- **\ud83d\udcdd Enhanced Response Generation**: Context-aware prompt formatting for more accurate medical responses\r\n- **\ud83d\udd04 Batch Processing**: New `generate_response_with_context()` method for processing multiple query-context pairs\r\n\r\n### \ud83c\udfe5 Medical Domain Enhancements\r\n- **\ud83e\ude7a Context 
Integration**: Medical queries now benefit from patient context and clinical background\r\n- **\ud83d\udcca Better Confidence**: Context helps improve confidence scoring for medical scenarios\r\n- **\ud83c\udf9b\ufe0f Flexible Usage**: Works with existing methods while providing new convenience functions\r\n- **\ud83d\udd0d Example Implementation**: New query_context_example.py demonstrates usage patterns\r\n\r\n### \ud83e\uddf9 Simplified Architecture\r\n- **\ud83d\udcf1 Removed Dashboard**: Eliminated dashboard dependencies for cleaner core package\r\n- **\u26a1 Streamlined Installation**: Faster installation without unnecessary web components\r\n- **\ud83c\udfaf Focused Functionality**: Core hallucination detection without UI overhead\r\n- **\ud83d\udce6 Lightweight**: Reduced package size and dependencies\r\n\r\n### \ud83d\udd27 Technical Improvements\r\n- **\ud83d\udd17 Enhanced Prompt Formatting**: Context gets properly integrated into medical prompts\r\n- **\ud83c\udfaf Backward Compatibility**: All existing code continues to work unchanged\r\n- **\ud83d\udcdd Better Documentation**: Comprehensive examples for query-context usage\r\n- **\ud83d\udee1\ufe0f Stable Performance**: Maintains all stability improvements from v0.6.3\r\n\r\n## \ud83d\ude80 Installation\r\n\r\n### Requirements\r\n\r\n- Python 3.8+\r\n- PyTorch 1.13+\r\n- CUDA-compatible GPU (recommended)\r\n- At least 8GB GPU memory for inference (improved efficiency in v0.6.3+)\r\n- 16GB RAM minimum (32GB recommended for training)\r\n\r\n### Install from PyPI\r\n\r\n```bash\r\npip install hallunox\r\n```\r\n\r\n### Install from Source\r\n\r\n```bash\r\ngit clone https://github.com/convai-innovations/hallunox.git\r\ncd hallunox\r\npip install -e .\r\n```\r\n\r\n### MedGemma Model Access\r\n\r\nHalluNox uses the open-access `convaiinnovations/gemma-finetuned-4b-it` model, which doesn't require authentication. 
The model will be automatically downloaded on first use.\r\n\r\n### Core Dependencies\r\n\r\nHalluNox automatically installs:\r\n\r\n- `torch>=1.13.0` - PyTorch framework\r\n- `transformers>=4.21.0` - Hugging Face Transformers\r\n- `FlagEmbedding>=1.2.0` - BGE-M3 embedding model\r\n- `datasets>=2.0.0` - Dataset loading utilities\r\n- `scikit-learn>=1.0.0` - Evaluation metrics\r\n- `numpy>=1.21.0` - Numerical computations\r\n- `tqdm>=4.64.0` - Progress bars\r\n- `Pillow>=8.0.0` - Image processing for multimodal capabilities\r\n- `bitsandbytes>=0.41.0` - 4-bit quantization for memory optimization\r\n\r\n## \ud83d\udcd6 Quick Start\r\n\r\n### Basic Usage (Llama-3.2-3B)\r\n\r\n```python\r\nfrom hallunox import HallucinationDetector\r\n\r\n# Initialize detector (downloads pre-trained model automatically)\r\ndetector = HallucinationDetector()\r\n\r\n# Analyze text for hallucination risk\r\nresults = detector.predict([\r\n \"The capital of France is Paris.\", # High confidence\r\n \"Your password is 12345678.\", # Low confidence \r\n \"The Moon is made of cheese.\" # Very low confidence\r\n])\r\n\r\n# View results\r\nfor pred in results[\"predictions\"]:\r\n print(f\"Text: {pred['text']}\")\r\n print(f\"Confidence: {pred['confidence_score']:.3f}\")\r\n print(f\"Risk Level: {pred['risk_level']}\")\r\n print(f\"Routing Action: {pred['routing_action']}\")\r\n print()\r\n```\r\n\r\n### \ud83c\udfe5 MedGemma Medical Domain Usage\r\n\r\nFor medical applications using MedGemma 4B-IT with multimodal capabilities:\r\n\r\n```python\r\nfrom hallunox import HallucinationDetector\r\nfrom PIL import Image\r\n\r\n# Initialize MedGemma detector (auto-downloads medical model)\r\ndetector = HallucinationDetector(\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\",\r\n confidence_threshold=0.60, # Medical-grade threshold\r\n enable_response_generation=True, # Enable response generation\r\n enable_inference=True,\r\n mode=\"text\" # Text-only mode (default)\r\n)\r\n\r\n# Medical text analysis\r\nmedical_results = detector.predict([\r\n \"Aspirin can help reduce heart attack risk when prescribed by a doctor.\",\r\n \"Drinking bleach will cure COVID-19.\", # Dangerous misinformation\r\n \"Type 2 diabetes requires insulin injections in all cases.\", # Partially incorrect\r\n])\r\n\r\nfor pred in medical_results[\"predictions\"]:\r\n print(f\"Medical Text: {pred['text'][:60]}...\")\r\n print(f\"Confidence: {pred['confidence_score']:.3f}\")\r\n print(f\"Risk Level: {pred['risk_level']}\")\r\n print(f\"Medical Action: {pred['routing_action']}\")\r\n print(f\"Description: {pred['description']}\")\r\n print(\"-\" * 50)\r\n\r\n# Response generation with confidence checking\r\nquestion = \"What are the symptoms of pneumonia?\"\r\nresponse = detector.generate_response(question, check_confidence=True)\r\n\r\nif response[\"should_generate\"]:\r\n print(f\"\u2705 Medical Response Generated (confidence: {response['confidence_score']:.3f})\")\r\n print(f\"Response: {response['response']}\")\r\n print(f\"Meets threshold: {response['meets_threshold']}\")\r\n if response.get('forced_generation'):\r\n print(\"\u26a0\ufe0f Note: Response was generated despite low confidence\")\r\nelse:\r\n print(f\"\u26a0\ufe0f Response blocked (confidence: {response['confidence_score']:.3f})\")\r\n print(f\"Reason: {response['reason']}\")\r\n print(f\"Recommendation: {response['recommendation']}\")\r\n\r\n# Force generation for reference regardless of confidence\r\nforced_response = detector.generate_response(\r\n question, \r\n 
check_confidence=True, \r\n force_generate=True # Generate even if confidence is low\r\n)\r\nprint(f\"\ud83d\udd2c Reference Response (forced): {forced_response['response']}\")\r\nprint(f\"\ud83d\udcca Confidence: {forced_response['confidence_score']:.3f}\")\r\nprint(f\"\ud83c\udfaf Forced Generation: {forced_response['forced_generation']}\")\r\n\r\n# Multimodal image analysis (MedGemma 4B-IT only)\r\nif detector.is_multimodal:\r\n print(\"\\n\ud83d\uddbc\ufe0f Multimodal Image Analysis\")\r\n \r\n # Load medical image (replace with actual medical image)\r\n try:\r\n image = Image.open(\"chest_xray.jpg\")\r\n except:\r\n # Create demo image for testing\r\n image = Image.new('RGB', (224, 224), color='lightgray')\r\n \r\n # Analyze image confidence\r\n image_results = detector.predict_images([image], [\"Chest X-ray\"])\r\n \r\n for pred in image_results[\"predictions\"]:\r\n print(f\"Image: {pred['image_description']}\")\r\n print(f\"Confidence: {pred['confidence_score']:.3f}\")\r\n print(f\"Interpretation: {pred['interpretation']}\")\r\n print(f\"Risk Level: {pred['risk_level']}\")\r\n \r\n # Generate image description\r\n description = detector.generate_image_response(\r\n image, \r\n \"Describe the findings in this chest X-ray.\"\r\n )\r\n print(f\"Generated Description: {description}\")\r\n```\r\n\r\n### \ud83d\udd27 Advanced Configuration\r\n\r\n```python\r\nfrom hallunox import HallucinationDetector\r\n\r\n# Full configuration example\r\ndetector = HallucinationDetector(\r\n # Model selection\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\", # or \"unsloth/Llama-3.2-3B-Instruct\"\r\n embed_model_id=\"BAAI/bge-m3\",\r\n \r\n # Custom model weights (optional)\r\n model_path=\"/path/to/custom/model.pt\", # None = auto-download\r\n \r\n # Hardware configuration\r\n device=\"cuda\", # or \"cpu\"\r\n use_fp16=True, # Mixed precision for faster inference\r\n \r\n # Sequence lengths\r\n max_length=512, # LLM context length\r\n bge_max_length=512, # BGE-M3 context length\r\n \r\n # Feature toggles\r\n load_llm=True, # Load LLM for embeddings\r\n enable_inference=True, # Enable LLM inference\r\n enable_response_generation=True, # Enable response generation\r\n \r\n # Confidence settings\r\n confidence_threshold=0.60, # Custom threshold (auto-detected by model type)\r\n \r\n # Operation mode\r\n mode=\"text\", # \"text\", \"image\", \"both\", or \"auto\"\r\n)\r\n\r\n# Check model capabilities\r\nprint(f\"Model type: {'Medical' if detector.is_medgemma_4b else 'General'}\")\r\nprint(f\"Multimodal support: {detector.is_multimodal}\")\r\nprint(f\"Operation mode: {detector.effective_mode} (requested: {detector.mode})\")\r\nprint(f\"Confidence threshold: {detector.confidence_threshold}\")\r\n```\r\n\r\n### \ud83c\udf9b\ufe0f Operation Mode Configuration\r\n\r\nThe `mode` parameter controls what types of input the detector can process:\r\n\r\n```python\r\nfrom hallunox import HallucinationDetector\r\n\r\n# Text mode (default) - processes text inputs only\r\ndetector = HallucinationDetector(\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\",\r\n mode=\"text\" # Text-only processing (default)\r\n)\r\n\r\n# Auto mode - detects capabilities from model\r\ndetector = HallucinationDetector(\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\",\r\n mode=\"auto\" # Auto: detects based on model capabilities\r\n)\r\n\r\n# Image-only mode - processes images only (requires multimodal model)\r\ndetector = HallucinationDetector(\r\n 
llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\",\r\n mode=\"image\" # Image processing only\r\n)\r\n\r\n# Both mode - processes text and images (requires multimodal model)\r\ndetector = HallucinationDetector(\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\",\r\n mode=\"both\" # Explicit multimodal mode\r\n)\r\n```\r\n\r\n#### Mode Validation\r\n\r\n- **Text mode**: Available for all models (default)\r\n- **Image mode**: Requires multimodal model (e.g., convaiinnovations/gemma-finetuned-4b-it)\r\n- **Both mode**: Requires multimodal model (e.g., convaiinnovations/gemma-finetuned-4b-it)\r\n- **Auto mode**: Automatically selects based on model capabilities\r\n - Multimodal models \u2192 `effective_mode = \"both\"`\r\n - Text-only models \u2192 `effective_mode = \"text\"`\r\n\r\n#### Error Examples\r\n\r\n```python\r\n# This will raise an error - image mode requires multimodal model\r\ndetector = HallucinationDetector(\r\n llm_model_id=\"unsloth/Llama-3.2-3B-Instruct\",\r\n mode=\"image\" # \u274c Error: Image mode requires multimodal model\r\n)\r\n\r\n# This will raise an error - calling image methods in text mode\r\ndetector = HallucinationDetector(\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\",\r\n mode=\"text\"\r\n)\r\ndetector.predict_images([image]) # \u274c Error: Current mode is 'text'\r\n```\r\n\r\n### \u26a1 Performance Optimized Usage\r\n\r\nFor faster initialization when only doing embedding comparisons:\r\n\r\n```python\r\nfrom hallunox import HallucinationDetector\r\n\r\n# Option 1: Factory method for embedding-only usage\r\ndetector = HallucinationDetector.for_embedding_only(\r\n device=\"cuda\",\r\n use_fp16=True\r\n)\r\n\r\n# Option 2: Explicit parameter control\r\ndetector = HallucinationDetector(\r\n load_llm=False, # Skip expensive LLM loading\r\n enable_inference=False, # Disable inference capabilities\r\n use_fp16=True # Use mixed precision\r\n)\r\n\r\n# Note: This configuration cannot perform predictions\r\n# Use for preprocessing or embedding extraction only\r\n```\r\n\r\n### \ud83e\udde0 Memory Optimization with Quantization\r\n\r\nFor GPUs with limited VRAM (8-16GB), use 4-bit quantization:\r\n\r\n```python\r\nfrom hallunox import HallucinationDetector\r\n\r\n# Option 1: Auto-optimized for low memory (recommended)\r\ndetector = HallucinationDetector.for_low_memory(\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\", # Or any supported model\r\n device=\"cuda\",\r\n enable_response_generation=True, # Enable response generation for evaluation\r\n verbose=True # Show loading progress (optional)\r\n)\r\n\r\n# Option 2: Manual quantization configuration\r\ndetector = HallucinationDetector(\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\",\r\n use_quantization=True, # Enable 4-bit quantization\r\n enable_response_generation=True,\r\n device=\"cuda\"\r\n)\r\n\r\n# Option 3: Custom quantization settings\r\nfrom transformers import BitsAndBytesConfig\r\nimport torch\r\n\r\ncustom_quant_config = BitsAndBytesConfig(\r\n load_in_4bit=True,\r\n bnb_4bit_quant_type=\"nf4\", # NF4 quantization type\r\n bnb_4bit_use_double_quant=True, # Double quantization for extra savings\r\n bnb_4bit_compute_dtype=torch.bfloat16 # Compute in bfloat16\r\n)\r\n\r\ndetector = HallucinationDetector(\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\",\r\n quantization_config=custom_quant_config,\r\n device=\"cuda\"\r\n)\r\n\r\nprint(f\"\u2705 Memory optimized: {detector.use_quantization}\")\r\nprint(f\"\ud83d\udd27 Quantization: 4-bit 
NF4 with double quantization\")\r\n```\r\n\r\n## \ud83e\udd16 Response Generation & Evaluation\r\n\r\n### Enabling Response Generation\r\n\r\nWhen `enable_response_generation=True`, HalluNox can generate responses for evaluation and display the model's actual output alongside confidence scores:\r\n\r\n```python\r\nfrom hallunox import HallucinationDetector\r\n\r\n# Enable response generation for evaluation\r\ndetector = HallucinationDetector.for_low_memory(\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\",\r\n device=\"cuda\",\r\n enable_response_generation=True, # Enable response generation\r\n verbose=False # Clean logs for evaluation\r\n)\r\n\r\n# Test questions for evaluation\r\ntest_questions = [\r\n \"What are the symptoms of diabetes?\",\r\n \"Drinking bleach will cure COVID-19.\", # Dangerous misinformation\r\n \"How does aspirin help prevent heart attacks?\",\r\n \"All vaccines cause autism in children.\", # Medical misinformation\r\n]\r\n\r\n# Analyze with response generation\r\nfor question in test_questions:\r\n # The model will generate a response and analyze it\r\n results = detector.predict([question])\r\n prediction = results[\"predictions\"][0]\r\n \r\n print(f\"Question: {question}\")\r\n print(f\"Confidence: {prediction['confidence_score']:.3f}\")\r\n print(f\"Risk Level: {prediction['risk_level']}\")\r\n print(f\"Action: {prediction['medical_action']}\")\r\n print(f\"Description: {prediction['description']}\")\r\n print(\"-\" * 50)\r\n```\r\n\r\n### Response Generation Modes\r\n\r\n```python\r\n# Generate and analyze responses with confidence checking\r\nresponse = detector.generate_response(\r\n \"What are the side effects of ibuprofen?\", \r\n check_confidence=True\r\n)\r\n\r\nif response[\"should_generate\"]:\r\n print(f\"\u2705 Generated Response: {response['response']}\")\r\n print(f\"Confidence: {response['confidence_score']:.3f}\")\r\n print(f\"Meets threshold: {response['meets_threshold']}\")\r\nelse:\r\n print(f\"\u26a0\ufe0f Response blocked (confidence: {response['confidence_score']:.3f})\")\r\n print(f\"Reason: {response['reason']}\")\r\n print(f\"Recommendation: {response['recommendation']}\")\r\n\r\n# Force generation for reference (useful for evaluation)\r\nforced_response = detector.generate_response(\r\n \"What are the side effects of ibuprofen?\", \r\n check_confidence=True, \r\n force_generate=True\r\n)\r\nprint(f\"\ud83d\udd2c Reference Response: {forced_response['response']}\")\r\nprint(f\"\ud83d\udcca Confidence: {forced_response['confidence_score']:.3f}\")\r\nprint(f\"\ud83c\udfaf Forced Generation: {forced_response['forced_generation']}\")\r\n```\r\n\r\n### Evaluation Output Example\r\n\r\n```\r\nQuestion: What are the symptoms of diabetes?\r\nGenerated Response: Common symptoms of diabetes include increased thirst, frequent urination, excessive hunger, unexplained weight loss, fatigue, and blurred vision. 
It's important to consult a healthcare provider for proper diagnosis.\r\nConfidence: 0.857\r\nRisk Level: LOW_MEDICAL_RISK\r\nAction: \u2705 Information can be used as reference\r\n--------------------------------------------------\r\nQuestion: Drinking bleach will cure COVID-19.\r\nGenerated Response: [Response blocked - confidence too low]\r\nConfidence: 0.123\r\nRisk Level: VERY_HIGH_MEDICAL_RISK\r\nAction: \u26d4 Do not use - seek professional medical advice\r\n--------------------------------------------------\r\n```\r\n\r\n### \ud83d\udcbe Memory Usage Comparison\r\n\r\n| Configuration | Model Size | VRAM Usage | Performance |\r\n|--------------|------------|------------|-------------|\r\n| **Full Precision** | ~16GB | ~14GB | 100% speed |\r\n| **FP16 Mixed Precision** | ~8GB | ~7GB | 95% speed |\r\n| **4-bit Quantization** | ~4GB | ~3.5GB | 85-90% speed |\r\n| **4-bit + Double Quant** | ~3.5GB | ~3GB | 85-90% speed |\r\n\r\n**Recommendation**: Use `HallucinationDetector.for_low_memory()` for GPUs with 8GB or less VRAM.\r\n\r\n### \ud83d\udcdd Enhanced Query-Context Support (NEW in v0.6.3!)\r\n\r\nHalluNox now provides comprehensive support for query-context pairs, especially beneficial for medical applications:\r\n\r\n```python\r\nfrom hallunox import HallucinationDetector\r\n\r\n# Initialize MedGemma detector for context-aware medical responses\r\ndetector = HallucinationDetector(\r\n llm_model_id=\"convaiinnovations/gemma-finetuned-4b-it\",\r\n enable_response_generation=True\r\n)\r\n\r\n# Medical query-context pairs for enhanced accuracy\r\nmedical_query_context_pairs = [\r\n {\r\n \"query\": \"Is it safe to take ibuprofen daily?\",\r\n \"context\": \"Patient has a history of gastric ulcers and is currently taking blood thinners for atrial fibrillation.\"\r\n },\r\n {\r\n \"query\": \"What's the recommended exercise routine?\",\r\n \"context\": \"28-year-old pregnant patient at 30 weeks, previously sedentary, no complications.\"\r\n },\r\n {\r\n \"query\": \"How should I manage my diabetes medication?\",\r\n \"context\": \"Type 2 diabetes patient, HbA1c 8.2%, currently on metformin 1000mg twice daily.\"\r\n }\r\n]\r\n\r\n# Method 1: Confidence analysis with context\r\nresults = detector.predict_with_query_context(medical_query_context_pairs)\r\nfor pred in results[\"predictions\"]:\r\n print(f\"Query: {pred['text']}\")\r\n print(f\"Context-Enhanced Confidence: {pred['confidence_score']:.3f}\")\r\n print(f\"Medical Risk Level: {pred['risk_level']}\")\r\n print(f\"Recommendation: {pred['routing_action']}\")\r\n\r\n# Method 2: Response generation with context\r\nresponses = detector.generate_response_with_context(\r\n medical_query_context_pairs,\r\n max_length=300,\r\n check_confidence=True\r\n)\r\n\r\nfor i, response in enumerate(responses):\r\n pair = medical_query_context_pairs[i]\r\n print(f\"\\nQuery: {pair['query']}\")\r\n print(f\"Context: {pair['context'][:60]}...\")\r\n \r\n if isinstance(response, dict) and \"should_generate\" in response:\r\n if response[\"should_generate\"]:\r\n print(f\"\u2705 Context-Aware Response: {response['response']}\")\r\n print(f\"Confidence: {response['confidence_score']:.3f}\")\r\n else:\r\n print(f\"\u26a0\ufe0f Blocked (confidence: {response['confidence_score']:.3f})\")\r\n print(f\"Recommendation: {response['recommendation']}\")\r\n\r\n# Method 3: Individual response with context\r\nsingle_response = detector.generate_response(\r\n prompt=\"Should I adjust my medication?\",\r\n query_context_pairs=[{\r\n \"query\": \"Should I adjust my 
### 📝 Enhanced Query-Context Support (NEW in v0.6.3!)

HalluNox now provides comprehensive support for query-context pairs, especially beneficial for medical applications:

```python
from hallunox import HallucinationDetector

# Initialize MedGemma detector for context-aware medical responses
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    enable_response_generation=True
)

# Medical query-context pairs for enhanced accuracy
medical_query_context_pairs = [
    {
        "query": "Is it safe to take ibuprofen daily?",
        "context": "Patient has a history of gastric ulcers and is currently taking blood thinners for atrial fibrillation."
    },
    {
        "query": "What's the recommended exercise routine?",
        "context": "28-year-old pregnant patient at 30 weeks, previously sedentary, no complications."
    },
    {
        "query": "How should I manage my diabetes medication?",
        "context": "Type 2 diabetes patient, HbA1c 8.2%, currently on metformin 1000mg twice daily."
    }
]

# Method 1: Confidence analysis with context
results = detector.predict_with_query_context(medical_query_context_pairs)
for pred in results["predictions"]:
    print(f"Query: {pred['text']}")
    print(f"Context-Enhanced Confidence: {pred['confidence_score']:.3f}")
    print(f"Medical Risk Level: {pred['risk_level']}")
    print(f"Recommendation: {pred['routing_action']}")

# Method 2: Response generation with context
responses = detector.generate_response_with_context(
    medical_query_context_pairs,
    max_length=300,
    check_confidence=True
)

for i, response in enumerate(responses):
    pair = medical_query_context_pairs[i]
    print(f"\nQuery: {pair['query']}")
    print(f"Context: {pair['context'][:60]}...")

    if isinstance(response, dict) and "should_generate" in response:
        if response["should_generate"]:
            print(f"✅ Context-Aware Response: {response['response']}")
            print(f"Confidence: {response['confidence_score']:.3f}")
        else:
            print(f"⚠️ Blocked (confidence: {response['confidence_score']:.3f})")
            print(f"Recommendation: {response['recommendation']}")

# Method 3: Individual response with context
single_response = detector.generate_response(
    prompt="Should I adjust my medication?",
    query_context_pairs=[{
        "query": "Should I adjust my medication?",
        "context": "Patient experiencing mild side effects from current dosage"
    }],
    check_confidence=True
)
```

### Context Impact Analysis

```python
# Compare confidence with and without context
query = "Is this medication safe during pregnancy?"

# Without context
no_context = detector.predict([query])
print(f"Without context: {no_context['predictions'][0]['confidence_score']:.3f}")

# With context
with_context = detector.predict([query], query_context_pairs=[{
    "query": query,
    "context": "Patient is 12 weeks pregnant, no previous complications, taking prenatal vitamins"
}])
print(f"With context: {with_context['predictions'][0]['confidence_score']:.3f}")

# Context benefit
improvement = with_context['predictions'][0]['confidence_score'] - no_context['predictions'][0]['confidence_score']
print(f"Context improvement: {improvement:+.3f}")
```

## 🖥️ Command Line Interface

HalluNox provides a comprehensive CLI for various use cases:

### Interactive Mode
```bash
# General model interactive mode
hallunox-infer --interactive

# MedGemma medical interactive mode
hallunox-infer --llm_model_id convaiinnovations/gemma-finetuned-4b-it --interactive --show_generated_text
```

### Batch Processing
```bash
# Process file with general model
hallunox-infer --input_file medical_texts.txt --output_file results.json

# Process with MedGemma and medical settings
hallunox-infer \
  --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
  --input_file medical_texts.txt \
  --output_file medical_results.json \
  --show_routing \
  --show_generated_text
```

### Image Analysis (Multimodal models only)
```bash
# Single image analysis
hallunox-infer \
  --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
  --image_path chest_xray.jpg \
  --show_generated_text

# Batch image analysis
hallunox-infer \
  --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
  --image_folder /path/to/medical/images \
  --output_file image_analysis.json
```

### Demo Mode
```bash
# General demo
hallunox-infer --demo --show_routing

# Medical demo with MedGemma
hallunox-infer \
  --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
  --demo \
  --mode both \
  --show_routing

# Text-only demo (faster initialization)
hallunox-infer \
  --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
  --demo \
  --mode text \
  --show_routing
```

## 🔨 Training Your Own Model

### Quick Training

```python
from hallunox import Trainer, TrainingConfig

# Configure training
config = TrainingConfig(
    # Model selection
    model_id="convaiinnovations/gemma-finetuned-4b-it",  # or "unsloth/Llama-3.2-3B-Instruct"
    embed_model_id="BAAI/bge-m3",

    # Training parameters
    batch_size=8,
    learning_rate=5e-4,
    max_epochs=6,
    warmup_steps=300,

    # Dataset configuration
    use_truthfulqa=True,
    use_halueval=True,
    use_fever=True,
    max_samples_per_dataset=3000,

    # Output
    output_dir="./models/my_medical_model"
)

# Train model
trainer = Trainer(config)
trainer.train()
```

### Command Line Training
```bash
# Train general model
hallunox-train --batch_size 8 --learning_rate 5e-4 --max_epochs 6

# Train medical model
hallunox-train \
  --model_id convaiinnovations/gemma-finetuned-4b-it \
  --batch_size 4 \
  --learning_rate 3e-4 \
  --max_epochs 8 \
  --output_dir ./models/custom_medgemma
```
## 🏗️ Model Architecture

HalluNox supports two main architectures:

### General Architecture (Llama-3.2-3B)
1. **LLM Component**: Llama-3.2-3B-Instruct
   - Extracts internal hidden representations (3072D)
   - Supports any Llama-architecture model

2. **Embedding Model**: BGE-M3 (fixed)
   - Provides reference semantic embeddings
   - 1024-dimensional dense vectors

3. **Projection Network**: Standard ProjectionHead
   - Maps LLM hidden states to embedding space
   - 3-layer MLP with ReLU activations and dropout (see the sketch after this section)

### Medical Architecture (MedGemma-4B-IT)
1. **Unified Multimodal Model**:
   - **Single Model**: AutoModelForImageTextToText handles both text and images
   - **Memory Optimized**: Avoids double loading (saves ~8GB VRAM)
   - **Fallback Support**: Graceful degradation to text-only if needed

2. **Embedding Model**: BGE-M3 (same as general)
   - Enhanced with medical context formatting

3. **Projection Network**: UltraStableProjectionHead
   - Ultra-stable architecture with heavy normalization
   - Conservative weight initialization for medical precision
   - Tanh activations for stability
   - Enhanced dropout and layer normalization

4. **Multimodal Processor**: AutoProcessor
   - Handles image + text inputs
   - Supports chat template formatting

5. **Quantization Support**: 4-bit NF4 with double quantization
   - Reduces memory usage by ~75%
   - Maintains 85-90% performance
   - Automatic fallback for CPU
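For intuition, the standard ProjectionHead is essentially a small MLP from the LLM hidden size (3072 for Llama-3.2-3B, per the notes above) to the BGE-M3 embedding size (1024), whose output can then be compared against reference embeddings by cosine similarity. The sketch below is illustrative rather than the package's exact implementation; the intermediate width and dropout rate are assumptions.

```python
import torch
import torch.nn as nn


class ProjectionHeadSketch(nn.Module):
    """Illustrative 3-layer MLP with ReLU and dropout (dims from the architecture notes)."""

    def __init__(self, llm_dim: int = 3072, embed_dim: int = 1024,
                 hidden_dim: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(llm_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Map pooled LLM hidden states into the BGE-M3 embedding space
        return self.net(hidden_states)


# Example: project a batch of pooled hidden states and score semantic alignment
proj = ProjectionHeadSketch()
pooled_hidden = torch.randn(4, 3072)   # e.g. pooled LLM hidden states
reference = torch.randn(4, 1024)       # BGE-M3 embeddings of the same texts
similarity = torch.nn.functional.cosine_similarity(proj(pooled_hidden), reference)
print(similarity.shape)  # torch.Size([4])
```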
## 📊 API Reference

### HallucinationDetector

#### Constructor Parameters

```python
HallucinationDetector(
    model_path: str = None,                    # Path to trained model (None = auto-download)
    llm_model_id: str = "unsloth/Llama-3.2-3B-Instruct",  # LLM model ID
    embed_model_id: str = "BAAI/bge-m3",       # Embedding model ID
    device: str = None,                        # Device (None = auto-detect)
    max_length: int = 512,                     # LLM sequence length
    bge_max_length: int = 512,                 # BGE-M3 sequence length
    use_fp16: bool = True,                     # Mixed precision
    load_llm: bool = True,                     # Load LLM
    enable_inference: bool = False,            # Enable LLM inference
    confidence_threshold: float = None,        # Custom threshold (auto-detected)
    enable_response_generation: bool = False,  # Enable response generation
    use_quantization: bool = False,            # Enable 4-bit quantization for memory savings
    quantization_config: BitsAndBytesConfig = None,  # Custom quantization config
    mode: str = "text",                        # Operation mode: "text", "image", "both", "auto" (default: "text")
)
```

#### Core Methods

**Text Analysis:**
- `predict(texts, query_context_pairs=None)` - Analyze texts for hallucination confidence
- `predict_with_query_context(query_context_pairs)` - Query-context prediction
- `batch_predict(texts, batch_size=16)` - Efficient batch processing

**Response Generation:**
- `generate_response(prompt, max_length=512, check_confidence=True, force_generate=False, query_context_pairs=None)` - Generate responses with confidence checking and optional context
- `generate_response_with_context(query_context_pairs, max_length=512, check_confidence=True, force_generate=False)` - Generate responses for multiple query-context pairs

**Multimodal (MedGemma only):**
- `predict_images(images, image_descriptions=None)` - Analyze image confidence
- `generate_image_response(image, prompt, max_length=200)` - Generate image descriptions

**Analysis:**
- `evaluate_routing_strategy(texts)` - Analyze routing decisions (usage sketch below)

**Factory Methods:**
- `for_embedding_only()` - Create embedding-only detector
- `for_low_memory()` - Create memory-optimized detector with 4-bit quantization
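A brief usage sketch for the analysis methods listed above. It assumes `batch_predict` returns the same `predictions` structure as `predict`; the return value of `evaluate_routing_strategy` is not documented here, so it is simply printed.

```python
from hallunox import HallucinationDetector

detector = HallucinationDetector()  # defaults as documented above

texts = [
    "Aspirin irreversibly inhibits platelet cyclooxygenase.",
    "The capital of Australia is Sydney.",  # deliberately dubious statement
]

# Batch scoring (assumed to mirror predict()'s output structure)
batch_results = detector.batch_predict(texts, batch_size=16)
for pred in batch_results["predictions"]:
    print(pred["confidence_score"], pred["routing_action"])

# Routing analysis over the same texts
routing_report = detector.evaluate_routing_strategy(texts)
print(routing_report)
```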
#### Response Format

```python
{
    "predictions": [
        {
            "text": "input text",
            "confidence_score": 0.85,             # 0.0 to 1.0
            "similarity_score": 0.92,             # Cosine similarity
            "interpretation": "HIGH_CONFIDENCE",  # or HIGH_MEDICAL_CONFIDENCE
            "risk_level": "LOW_RISK",             # or LOW_MEDICAL_RISK
            "routing_action": "LOCAL_GENERATION",
            "description": "This response appears to be factual and reliable."
        }
    ],
    "summary": {
        "total_texts": 1,
        "avg_confidence": 0.85,
        "high_confidence_count": 1,
        "medium_confidence_count": 0,
        "low_confidence_count": 0,
        "very_low_confidence_count": 0
    }
}
```

#### Response Generation Format

```python
{
    "response": "Generated response text",
    "confidence_score": 0.85,
    "should_generate": True,
    "meets_threshold": True,
    "forced_generation": False,  # True if generated despite low confidence
    # Or when blocked:
    "reason": "Confidence 0.45 below threshold 0.60",
    "recommendation": "RAG_RETRIEVAL"
}
```

### Training Classes

- **`TrainingConfig`**: Configuration dataclass for training parameters
- **`Trainer`**: Main training class with dataset loading and model training
- **`MultiDatasetLoader`**: Loads and combines multiple hallucination detection datasets

### Utility Functions

- **`download_model()`**: Download general pre-trained model
- **`download_medgemma_model(model_name)`**: Download MedGemma medical model
- **`setup_logging(level)`**: Configure logging
- **`check_gpu_availability()`**: Check CUDA compatibility
- **`validate_model_requirements()`**: Verify dependencies
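A short sketch of how these utility helpers might be combined at start-up. It assumes the helpers are importable from the top-level `hallunox` package, that `setup_logging` accepts a standard logging level name, and that `check_gpu_availability` returns a truthy value when a compatible GPU is found; verify against the installed version before relying on it.

```python
from hallunox import (
    HallucinationDetector,
    setup_logging,
    check_gpu_availability,
    validate_model_requirements,
)

# Configure logging first (level name is an assumption)
setup_logging("INFO")

# Verify the environment before loading any models
if not check_gpu_availability():
    print("No compatible CUDA GPU detected; expect CPU-only performance.")
validate_model_requirements()

detector = HallucinationDetector.for_low_memory(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it"
)
```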
## 📈 Performance

Our confidence-aware routing system demonstrates:

- **74% hallucination detection rate** (vs 42% baseline)
- **9% false positive rate** (vs 15% baseline)
- **40% reduction in computational cost** vs post-hoc methods
- **1.6x average cost multiplier**, compared with 4.2x when expensive operations are always used

### Medical Domain Performance (MedGemma)
- **Enhanced medical accuracy** with 0.62 confidence threshold
- **Multimodal capability** for medical image analysis
- **Safety-first approach** with conservative thresholds
- **Professional verification workflow** for low-confidence cases

## 🖥️ Hardware Requirements

### Minimum (Inference Only)
- **CPU**: Modern multi-core processor
- **RAM**: 16GB system memory
- **GPU**: 8GB VRAM (RTX 3070, RTX 4060 Ti+)
- **Storage**: 15GB free space
- **Models**: ~5GB each (Llama/MedGemma)

### Recommended (Inference)
- **CPU**: Intel i7/AMD Ryzen 7+
- **RAM**: 32GB system memory
- **GPU**: 12GB+ VRAM (RTX 4070, RTX 3080+)
- **Storage**: NVMe SSD, 25GB+ free
- **CUDA**: 11.8+ compatible driver

### Training Requirements
- **CPU**: High-performance multi-core (i9/Ryzen 9)
- **RAM**: 64GB+ system memory
- **GPU**: 24GB+ VRAM (RTX 4090, A100, H100)
- **Storage**: 200GB+ NVMe SSD
  - Model checkpoints: ~10GB per epoch
  - Training datasets: ~30GB
  - Logs and outputs: ~50GB
- **Network**: High-speed internet for downloads

### MedGemma Specific
- **Additional storage**: +10GB for multimodal models
- **Image processing**: PIL/Pillow for image capabilities
- **Memory**: +4GB RAM for image processing pipeline

### CPU-Only Mode
- **RAM**: 32GB minimum (64GB recommended)
- **Performance**: 10-50x slower than GPU
- **Not recommended**: For production medical applications

## 🔒 Safety Considerations

### Medical Applications
- **Professional oversight required**: HalluNox is a research tool, not medical advice
- **Validation needed**: All medical outputs should be verified by qualified professionals
- **Conservative thresholds**: 0.62 threshold ensures high precision for medical content
- **Clear disclaimers**: Always include appropriate medical disclaimers in applications

### General Use
- **Confidence-based routing**: Use routing recommendations for appropriate escalation
- **Human oversight**: Very low confidence predictions require human review
- **Regular evaluation**: Monitor performance on your specific use cases

## 🛠️ Troubleshooting

### Common Issues and Solutions

#### CUDA Out of Memory Error
```
OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB...
```
**Solution**: Use 4-bit quantization
```python
detector = HallucinationDetector.for_low_memory()
```

#### Deprecated torch_dtype Warning
```
`torch_dtype` is deprecated! Use `dtype` instead!
```
**Solution**: Already fixed in HalluNox v0.3.2+ - the package now uses the correct `dtype` parameter.

#### Double Model Loading (MedGemma)
```
Loading checkpoint shards: 100% 2/2 [00:37<00:00, 18.20s/it]
Loading checkpoint shards: 100% 2/2 [00:36<00:00, 17.88s/it]
```
**Solution**: Already optimized in HalluNox v0.3.2+ - MedGemma now uses a unified model approach that avoids double loading.

#### Accelerate Warning
```
WARNING:accelerate.big_modeling:Some parameters are on the meta device...
```
**Solution**: This is normal with quantization - parameters are automatically moved to GPU during inference.

#### Dependency Version Conflict (AutoProcessor)
```
⚠️ Could not load AutoProcessor: module 'requests' has no attribute 'exceptions'
AttributeError: module 'requests' has no attribute 'exceptions'
```
**Solution**: This is a compatibility issue between transformers and requests versions.
```bash
pip install --upgrade transformers requests huggingface_hub
# Or force reinstall
pip install --force-reinstall transformers>=4.45.0 requests>=2.31.0
```
**Fallback**: HalluNox automatically falls back to text-only mode when this occurs.

#### Model Hidden States NaN/Inf Issues ✅ RESOLVED
```
⚠️ Warning: NaN/Inf detected in model hidden states
   Hidden shape: torch.Size([3, 16, 2560])
   NaN count: 122880
```
**✅ FIXED in HalluNox v0.6.3+**: This issue has been completely resolved by adopting the proven approach from our working inference pipeline:

**Root Cause**: 4-bit quantization was causing numerical instabilities with certain model architectures.

**Solution Applied**:
- **Disabled Quantization**: Removed 4-bit quantization that was causing NaN issues
- **Simplified Model Loading**: Now uses the same approach as our proven `inference_gemma.py`
- **Clean Architecture**: Removed complex stability measures that were interfering
- **Stable Precision**: Uses `torch.bfloat16` for optimal performance without instabilities

#### Repetitive Text and Unwanted Artifacts ✅ RESOLVED
```
🔬 Reference Response (forced): I am programmed to be a harmless AI assistant...
g
I am programmed to be a harmless AI assistant...
g
[repetitive output continues...]
```
**✅ FIXED in HalluNox v0.6.3+**: Repetitive text generation and unwanted artifacts have been completely resolved:

**Root Cause**: Improper message formatting and sampling parameters causing the model to not understand conversation boundaries.

**Solution Applied**:
- **Deterministic Generation**: Changed from `do_sample=True` to `do_sample=False`, matching the Jupyter notebook approach
- **Proper Chat Templates**: Adopted exact message formatting from the working Jupyter notebook implementation
- **Removed Sampling Parameters**: Eliminated `temperature`, `top_p`, `repetition_penalty` that were causing issues
- **Clean Tokenization**: Uses `tokenizer.apply_chat_template()` with proper parameters for conversation structure

**Current Recommended Usage** (v0.6.3+):
```python
# Standard usage - now stable by default
detector = HallucinationDetector.for_low_memory(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    device="cuda"
)

# Both NaN issues and repetitive text are now automatically resolved
```

**Migration from v0.4.9 and earlier**: No code changes needed - existing code will automatically use the stable approach.

#### Environment Optimization
For better memory management, set:
```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

### Memory Requirements by Configuration

| GPU VRAM | Recommended Configuration | Expected Performance |
|----------|--------------------------|---------------------|
| **4-6GB** | `for_low_memory()` + reduce batch size | Basic functionality |
| **8-12GB** | `for_low_memory()` | Full functionality |
| **16GB+** | Standard configuration | Optimal performance |
| **24GB+** | Multiple models + training | Development/research |

## 📄 License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

## 📚 Citation

If you use HalluNox in your research, please cite:

```bibtex
@article{nandakishor2024hallunox,
  title={Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation},
  author={Nandakishor M},
  journal={AI Safety Research},
  year={2024},
  organization={Convai Innovations}
}
```
## 🤝 Contributing

We welcome contributions! Please see our contributing guidelines and submit pull requests to our repository.

### Development Setup
```bash
git clone https://github.com/convai-innovations/hallunox.git
cd hallunox
pip install -e ".[dev]"
```

## 📞 Support

For technical support and questions:
- **Email**: support@convaiinnovations.com
- **Issues**: [GitHub Issues](https://github.com/convai-innovations/hallunox/issues)
- **Documentation**: Full API docs available online

## 👨‍💻 Author

**Nandakishor M**  
AI Safety Research  
Convai Innovations Pvt. Ltd.  
Email: support@convaiinnovations.com

---

**Disclaimer**: HalluNox is a research tool for hallucination detection and should not be used as the sole basis for critical decisions, especially in medical contexts. Always seek professional advice for medical applications.
"bugtrack_url": null,
"license": "AGPL-3.0",
"summary": "A confidence-aware routing system for LLM hallucination detection using multi-signal approach",
"version": "0.6.3",
"project_urls": {
"Bug Reports": "https://github.com/convai-innovations/hallunox/issues",
"Documentation": "https://hallunox.readthedocs.io",
"Homepage": "https://convaiinnovations.com",
"Repository": "https://github.com/convai-innovations/hallunox",
"Source Code": "https://github.com/convai-innovations/hallunox"
},
"split_keywords": [
"hallucination-detection",
" llm",
" confidence-estimation",
" model-reliability",
" uncertainty-quantification",
" ai-safety"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "3d1715a5b7b1a88a22dfa7cfe9637e558e2f3f6add8c1d850235f59dd87c6cb7",
"md5": "533c4b5190880399506fd14d2c0fa1a3",
"sha256": "6261ab99835db50428edb43953c1b26c8abf052cccd6853c86d8142a6746e8ec"
},
"downloads": -1,
"filename": "hallunox-0.6.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "533c4b5190880399506fd14d2c0fa1a3",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 63640,
"upload_time": "2025-10-06T18:07:53",
"upload_time_iso_8601": "2025-10-06T18:07:53.859392Z",
"url": "https://files.pythonhosted.org/packages/3d/17/15a5b7b1a88a22dfa7cfe9637e558e2f3f6add8c1d850235f59dd87c6cb7/hallunox-0.6.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "2c76f15afa53d796d647f9a651381152717bc7951aa9cb638e9e4fedbcccfc9b",
"md5": "4c8b49e98068611d02d576b20b6cbf92",
"sha256": "75755724d9845caff0cb8f9dc39ee5e5e8c53d266f686ed35056167dce9cd6a7"
},
"downloads": -1,
"filename": "hallunox-0.6.3.tar.gz",
"has_sig": false,
"md5_digest": "4c8b49e98068611d02d576b20b6cbf92",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 83983,
"upload_time": "2025-10-06T18:07:58",
"upload_time_iso_8601": "2025-10-06T18:07:58.218139Z",
"url": "https://files.pythonhosted.org/packages/2c/76/f15afa53d796d647f9a651381152717bc7951aa9cb638e9e4fedbcccfc9b/hallunox-0.6.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-06 18:07:58",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "convai-innovations",
"github_project": "hallunox",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "hallunox"
}