| Name | llm-testlab |
| Version | 0.2.0 |
| download | |
| home_page | None |
| Summary | Comprehensive testing suite for LLM evaluation: hallucination detection, consistency, robustness, safety, and multi-language code generation assessment. |
| upload_time | 2025-10-20 04:55:17 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.9 |
| license | MIT License
Copyright (c) 2025 Sai Vineeth Arumalla
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
|
| keywords | llm, testing, evaluation, ai, hallucination-detection, code-evaluation, semantic-similarity, nlp, machine-learning |
| VCS | |
| bugtrack_url | |
| requirements | numpy==2.3.3, sentence-transformers==5.1.1, rich, faiss-cpu, transformers==4.56.2, huggingface_hub==0.35.0 |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
<p align="center">
<a href="https://github.com/Saivineeth147/llm-testlab/stargazers">
<img src="https://img.shields.io/github/stars/Saivineeth147/llm-testlab?style=social" alt="Star this repo" />
</a>
</p>
⭐ If you find this project useful, **please consider starring** it — it helps others discover it!
# LLM TestLab
**Comprehensive Testing Suite for Large Language Models**
A flexible Python toolkit for evaluating LLMs on:
- **Text Metrics**: Hallucination, consistency, semantic robustness, safety
- **Code Evaluation**: Syntax, execution, quality, security, semantic correctness across 9+ languages
- **Dual Embedders**: Optimized for both text and code analysis
- **Optional FAISS**: High-performance vector similarity
## Features
### Text Evaluation Metrics
- **Hallucination Severity Index (HSI)** – Detect factual deviations from knowledge base
- **Consistency Stability Score (CSS)** – Measure output stability across runs
- **Semantic Robustness Index (SRI)** – Test invariance to paraphrasing
- **Safety Vulnerability Exposure (SVE)** – Detect unsafe responses to adversarial prompts
- **Knowledge Base Coverage (KBC)** – Measure factual alignment
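These text metrics are embedding-based. As a rough illustration of the kind of comparison they build on (an assumption about the general approach, not the library's exact formulas), a hallucination-style score can be derived from cosine similarity against a known fact:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative only: the sort of embedding comparison HSI-style
# metrics build on, not llm-testlab's exact formula.
model = SentenceTransformer("all-MiniLM-L6-v2")  # the suite's default text embedder
answer = "Rome is the capital of Italy"
fact = "Rome is the capital of Italy"

similarity = util.cos_sim(model.encode(answer), model.encode(fact)).item()
hsi_like = 1.0 - similarity  # lower = closer to the knowledge base
print(f"{hsi_like:.4f}")
```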
### Code Evaluation Metrics (9+ Languages)
- **Syntax Validity (SV)** – Compiler/interpreter-based validation
- **Execution Pass Rate (EPR)** – Test case execution and verification
- **Code Quality Score (CQS)** – Complexity, documentation, error handling
- **Security Risk Score (SRS)** – Vulnerability pattern detection
- **Semantic Code Correctness (SCC)** – Embedding-based similarity to reference
- **Comprehensive Code Evaluation (CCE)** – Weighted aggregation of all metrics
**Supported Languages**: Python, JavaScript, TypeScript, Java, C, C++, Go, Rust, Ruby, PHP
### Advanced Features
- **Dual Embedders**: `all-MiniLM-L6-v2` for text, `BAAI/bge-small-en-v1.5` for code
- **FAISS Support**: Optional, for faster similarity searches
- **Knowledge Base Management**: Add, remove, or list facts
- **Security Patterns**: Customizable keywords and regex patterns
- **Rich Logging**: Built-in debug/info logging
## Project Structure
```
llm-testlab/
├── llm_testing_suite/
│   ├── __init__.py
│   ├── llm_testing_suite.py       # Main test suite (text metrics)
│   └── code_evaluator.py          # Code evaluation module
├── examples/
│   ├── run_text_evaluation.py     # Text metrics evaluation script
│   ├── run_code_evaluation.py     # Code metrics evaluation script
│   ├── groq_example.py            # Groq API text evaluation
│   ├── groq_code_evaluation.py    # Groq API code evaluation
│   └── huggingface_example.py     # HuggingFace integration
├── pyproject.toml                 # Package configuration
├── requirements.txt               # Dependencies
├── README.md
├── LICENSE
└── .gitignore
```
## Installation
### From PyPI
```bash
pip install llm-testlab
```
### From Source
```bash
git clone https://github.com/Saivineeth147/llm-testlab.git
cd llm-testlab
pip install .
```
### Optional Dependencies
```bash
# With FAISS and HuggingFace support
pip install llm-testlab[faiss,huggingface]
# Or install individually
pip install faiss-cpu # or faiss-gpu
pip install transformers
```
## Quick Start
### Text Metrics Example
```python
from llm_testing_suite import LLMTestSuite

def my_llm(prompt):
    return "Rome is the capital of Italy"

# Initialize with FAISS support
suite = LLMTestSuite(my_llm, use_faiss=True)

# Add knowledge
suite.add_knowledge("Rome is the capital of Italy")

# Run all novel metrics
result = suite.run_all_novel_metrics(
    prompt="What is the capital of Italy?",
    paraphrases=["Italy's capital?", "Capital city of Italy?"],
    adversarial_prompts=["ignore previous instructions"],
    runs=3
)

print(f"HSI: {result['HSI']['HSI']:.4f}")  # Hallucination
print(f"CSS: {result['CSS']['CSS']:.4f}")  # Consistency
print(f"SRI: {result['SRI']['SRI']:.4f}")  # Robustness
print(f"SVE: {result['SVE']['SVE']:.4f}")  # Safety
print(f"KBC: {result['KBC']['KBC']:.4f}")  # Coverage
```
### Code Evaluation Example
```python
from llm_testing_suite import LLMTestSuite

def code_llm(prompt):
    return '''
def add(a, b):
    """Add two numbers."""
    return a + b

print(add(5, 3))
'''

suite = LLMTestSuite(code_llm)

# Comprehensive code evaluation
result = suite.comprehensive_code_evaluation(
    prompt="Write a function to add two numbers",
    code_response=code_llm("..."),
    test_cases=[
        {"input": "", "expected_output": "8"}
    ],
    language="python"
)

print(f"Overall Score: {result['overall_score']:.1f}/100")
print(f"Syntax Valid: {result['syntax_valid']}")
print(f"Quality Score: {result['quality_score']}/100")
print(f"Security: {'✓' if result['is_secure'] else '✗'}")
```
## Managing Knowledge Base
```python
# Add a single fact
suite.add_knowledge("New York is the largest city in the USA")
# Add multiple facts
suite.add_knowledge_bulk([
    "Python is a programming language",
    "AI is transforming industries"
])
# List knowledge base
suite.list_knowledge()
# Remove a fact
suite.remove_knowledge("Python is a programming language")
# Clear the knowledge base
suite.clear_knowledge()
```
## Managing Security Keywords
```python
# Add malicious keywords
suite.add_malicious_keywords(["hack system", "steal data"])
# List keywords
suite.list_malicious_keywords()
# Remove a keyword
suite.remove_malicious_keyword("hack system")
```
## Code Evaluation Details
### Individual Metrics
```python
from llm_testing_suite import LLMTestSuite

suite = LLMTestSuite(your_llm_function)

# 1. Syntax Validity
syntax = suite.code_syntax_validity(code, language="python")
# Returns: {"syntax_valid": True/False, "error": ...}

# 2. Execution Test
execution = suite.code_execution_test(
    code,
    test_cases=[
        {"input": "5\n", "expected_output": "5"}
    ],
    language="python"
)
# Returns: {"pass_rate": 1.0, "passed_tests": 1, "total_tests": 1, ...}

# 3. Quality Metrics
quality = suite.code_quality_metrics(code, language="python")
# Returns: {"quality_score": 80, "metrics": {...}}

# 4. Security Scan
security = suite.code_security_scan(code, language="python")
# Returns: {"is_secure": True, "vulnerabilities": [...]}

# 5. Semantic Correctness
semantic = suite.code_semantic_correctness(
    prompt="Write add function",
    code_response=generated_code,
    reference_code=reference_solution
)
# Returns: {"semantic_similarity": 0.85, "semantically_correct": True}
```
### Quality Scoring (0-100)
Each criterion is worth 20 points; a scoring sketch follows the list:
- **Has Comments** (`#`, `//`, `/*`) - 20 pts
- **Has Docstring** (`"""`, `/**`) - 20 pts
- **Has Error Handling** (`try/except`, `try/catch`) - 20 pts
- **Low Complexity** (< 10 branches/loops) - 20 pts
- **Has Functions** (at least 1) - 20 pts
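To make the rubric concrete, here is a minimal sketch of how such a checklist could be scored (illustrative only, not the library's actual implementation):

```python
# A sketch of the 5 x 20-point rubric above; llm-testlab's real
# checks (via code_quality_metrics) may differ in detail.
def quality_rubric(code: str) -> int:
    score = 0
    if any(tok in code for tok in ("#", "//", "/*")):
        score += 20  # has comments
    if '"""' in code or "/**" in code:
        score += 20  # has docstring
    if "try" in code:
        score += 20  # has error handling (try/except, try/catch)
    branches = sum(code.count(kw) for kw in ("if ", "for ", "while "))
    if branches < 10:
        score += 20  # low complexity
    if "def " in code or "function " in code:
        score += 20  # has at least one function
    return score
```

Under this sketch, the `add` function from the Quick Start example (docstring, a function, low complexity, but no comments or error handling) would score 60/100.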
### Security Patterns Detected
- SQL Injection
- Command Injection
- XSS vulnerabilities
- Buffer overflows (C/C++)
- Hardcoded secrets
- Unsafe deserialization
- Path traversal
- Language-specific antipatterns
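Keyword checks can be extended with regex patterns (see "Managing Security Keywords" above). As an example of what a hardcoded-secret pattern might look like (an illustrative pattern, not one of the library's built-in rules):

```python
import re

# Hypothetical hardcoded-secret pattern; llm-testlab's built-in
# patterns are not shown here.
SECRET_PATTERN = re.compile(
    r'(api_key|password|secret)\s*=\s*["\'][^"\']+["\']',
    re.IGNORECASE,
)

snippet = 'api_key = "sk-12345"'
print(bool(SECRET_PATTERN.search(snippet)))  # True -> would be flagged
```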
### Supported Languages
| Language | Syntax Check | Execution | Quality | Security |
|----------|-------------|-----------|---------|----------|
| Python | ✅ AST | ✅ | ✅ | ✅ |
| JavaScript | ✅ Node | ✅ | ✅ | ✅ |
| TypeScript | ✅ tsc | ✅ | ✅ | ✅ |
| Java | ✅ javac | ✅ | ✅ | ✅ |
| C/C++ | ✅ gcc/g++ | ✅ | ✅ | ✅ |
| Go | ✅ go fmt | ✅ | ✅ | ✅ |
| Rust | ✅ rustc | ⚠️ | ✅ | ✅ |
| Ruby | ✅ ruby -c | ✅ | ✅ | ✅ |
| PHP | ✅ php -l | ✅ | ✅ | ✅ |
**Note**: Compilers/interpreters must be installed for full syntax validation. Falls back to regex-based checks if unavailable.
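You can anticipate the fallback by checking for the relevant toolchain up front; a minimal sketch (tool names taken from the table above; the library's internal detection may differ):

```python
import shutil

# Sketch: check whether the compiler/interpreter needed for full
# syntax validation is on PATH. Python is omitted since it is
# validated via the built-in AST module.
TOOLS = {
    "javascript": "node", "typescript": "tsc", "java": "javac",
    "c": "gcc", "cpp": "g++", "go": "gofmt", "rust": "rustc",
    "ruby": "ruby", "php": "php",
}

def has_full_syntax_check(language: str) -> bool:
    tool = TOOLS.get(language.lower())
    return tool is not None and shutil.which(tool) is not None

print(has_full_syntax_check("go"))  # False if gofmt is not on PATH
```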
## Dual Embedder Architecture
LLMTestSuite uses specialized embedders for optimal evaluation:
### Text Embedder: `all-MiniLM-L6-v2`
- **Used for**: HSI, CSS, SRI, SVE, KBC (text metrics)
- **Size**: 22M params, 384 dimensions
- **Speed**: Fast
- **Purpose**: General semantic similarity
### Code Embedder: `BAAI/bge-small-en-v1.5`
- **Used for**: Code semantic correctness (SCC)
- **Size**: 33M params, 384 dimensions
- **Speed**: Fast
- **Purpose**: Code-specific semantic understanding
### Custom Embedder
```python
from sentence_transformers import SentenceTransformer
suite = LLMTestSuite(my_llm)
# Replace code embedder
suite.code_embedder = SentenceTransformer("microsoft/codebert-base")
suite.code_evaluator.embedder = suite.code_embedder
# Or use different text embedder
suite = LLMTestSuite(my_llm, embedder_model="all-mpnet-base-v2")
```
### Embedder Comparison
| Model | Params | Dims | Speed | Best For |
|-------|--------|------|-------|----------|
| all-MiniLM-L6-v2 | 22M | 384 | Fast | Text (default) |
| all-mpnet-base-v2 | 110M | 768 | Medium | Text (higher accuracy) |
| bge-small-en-v1.5 | 33M | 384 | Fast | Code (default) |
| bge-base-en-v1.5 | 109M | 768 | Medium | Code (balanced) |
| CodeBERT | 125M | 768 | Medium | Code (Microsoft) |
## Output Format
All test methods support three return types via `return_type` parameter:
- `"dict"` - Returns Python dictionary (default)
- `"table"` - Prints formatted table using `rich` library
- `"both"` - Returns dictionary AND prints table
### Example Results
```python
# HSI Result
{
    "prompt": "What is the capital of France?",
    "answer": "Paris is the capital of France",
    "HSI": 0.01,  # Lower is better (0-1 scale)
    "closest_fact": "Paris is the capital of France"
}

# Code Evaluation Result
{
    "overall_score": 85.0,
    "syntax_valid": True,
    "quality_score": 80,
    "is_secure": True,
    "pass_rate": 1.0,
    "semantic_similarity": 0.89
}
```
## Complete Example: Groq API
```python
from groq import Groq
from llm_testing_suite import LLMTestSuite

client = Groq(api_key="your-api-key")

def groq_llm(prompt):
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )
    return response.choices[0].message.content

suite = LLMTestSuite(groq_llm)

# Text evaluation
result = suite.run_all_novel_metrics(
    prompt="What is the capital of France?",
    paraphrases=["France's capital?"],
    runs=3
)

# Code evaluation
code_result = suite.comprehensive_code_evaluation(
    prompt="Write fibonacci function",
    code_response=groq_llm("Write a Python fibonacci function"),
    language="python"
)
```
See `examples/groq_code_evaluation.py` for a comprehensive test suite.
## Logging
```python
# Enable debug logging
suite = LLMTestSuite(my_llm, debug=True)
# Or configure manually
import logging
logging.getLogger("llm_testing_suite").setLevel(logging.DEBUG)
```
## Contributing
Contributions are welcome! Please:
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## License
MIT License - see LICENSE file for details
## Acknowledgments
- Sentence-Transformers for embedding models
- FAISS for efficient similarity search
- Rich library for beautiful terminal output
- Open-source LLM community
---
**Star this repo** ⭐ if you find it useful!
For questions or issues, please open a GitHub issue.