<p align="center">
<a href="https://pypi.org/project/rubric/">
<img src="https://img.shields.io/pypi/v/rubric" alt="PyPI version" />
</a>
<a href="https://pypi.org/project/rubric/">
<img src="https://img.shields.io/pypi/pyversions/rubric" alt="Python versions" />
</a>
<a href="https://github.com/The-LLM-Data-Company/rubric/blob/main/LICENSE">
<img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="License" />
</a>
</p>
# Rubric
A Python library for LLM-based evaluation using weighted rubrics.
## Installation
```bash
uv add rubric
```
## Usage
1. **Set up environment variables:**
```bash
export OPENAI_API_KEY=your_api_key_here
```
2. **Run the example below**
```python
import asyncio
import os
from openai import AsyncOpenAI
from rubric import Rubric
from rubric.autograders import PerCriterionGrader
async def generate_with_async_openai(system_prompt: str, user_prompt: str) -> str:
    client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        max_tokens=400,
        temperature=0.0,
    )
    return response.choices[0].message.content or ""

async def main():
    rubric = Rubric.from_dict([
        {"weight": 10.0, "requirement": "States Q4 2023 base margin as 17.2%"},
        {"weight": 8.0, "requirement": "Explicitly uses Shapley attribution for decomposition"},
        {"weight": -15.0, "requirement": "Uses total deliveries instead of cash-only deliveries"}
    ])

    grader = PerCriterionGrader(
        generate_fn=generate_with_async_openai,
        system_prompt="This overrides the default system prompt",
    )

    result = await rubric.grade(
        to_grade="Your text to evaluate...",
        autograder=grader
    )

    print(f"Score: {result.score}/100")
    for criterion in result.report:
        print(f"  {criterion.verdict}: {criterion.requirement}")

asyncio.run(main())
```
## Autograder Strategies
- `PerCriterionGrader` - Evaluates each criterion with its own LLM call, issued in parallel
- `PerCriterionOneShotGrader` - Evaluates all criteria in a single LLM call
- `RubricAsJudgeGrader` - Holistic evaluation; the LLM returns the final score directly
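The three strategies are interchangeable at the call site. Assuming `PerCriterionOneShotGrader` and `RubricAsJudgeGrader` accept the same `generate_fn` (and optional `system_prompt`) arguments shown for `PerCriterionGrader`, switching strategies is a one-line change; this sketch reuses `generate_with_async_openai` and `rubric` from the example above:

```python
from rubric.autograders import (
    PerCriterionGrader,
    PerCriterionOneShotGrader,
    RubricAsJudgeGrader,
)

# One judgment per criterion, issued as parallel LLM calls.
parallel_grader = PerCriterionGrader(generate_fn=generate_with_async_openai)

# All criteria judged in a single LLM call.
one_shot_grader = PerCriterionOneShotGrader(generate_fn=generate_with_async_openai)

# The LLM reads the whole rubric and returns one overall score directly.
judge_grader = RubricAsJudgeGrader(generate_fn=generate_with_async_openai)

# Then, inside an async function, grade with whichever strategy you picked:
result = await rubric.grade(to_grade="Your text to evaluate...", autograder=one_shot_grader)
```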
### Default System Prompts
Each autograder uses a specialized system prompt optimized for its evaluation approach:
**PerCriterionGrader** - Detailed criterion-by-criterion evaluation with strict JSON formatting requirements. The prompt instructs the LLM to evaluate each criterion independently, handling both positive and negative criteria with specific response formats.
**PerCriterionOneShotGrader** - Streamlined prompt for evaluating all criteria in a single response. Focuses on providing verdicts (MET/UNMET) and explanations for each criterion in a structured JSON format.
**RubricAsJudgeGrader** - Holistic evaluation prompt that asks the LLM to consider the output as a whole and provide a single overall score from 0-100, taking into account the weights of all criteria.
You can view the complete default prompts in the source files:
- [`per_criterion_grader.py`](src/rubric/autograders/per_criterion_grader.py#L10-L55)
- [`per_criterion_one_shot_grader.py`](src/rubric/autograders/per_criterion_one_shot_grader.py#L11-L31)
- [`rubric_as_judge_grader.py`](src/rubric/autograders/rubric_as_judge_grader.py#L11-L22)
**Customizing System Prompts:** You can override the default system prompt by passing a `system_prompt` parameter to any autograder:
```python
grader = PerCriterionGrader(
    generate_fn=your_function,
    system_prompt="Your custom system prompt here"
)
```
## Customization
You can customize grading at multiple levels:
**1. Custom `generate_fn` (most common)**
Pass any function that takes `(system_prompt, user_prompt)` and returns a string; async functions, like the one in the quick-start example, work too. Any LLM provider can sit behind it (OpenAI, Anthropic, local models, etc.):
```python
grader = PerCriterionGrader(generate_fn=your_custom_function)
```
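As a concrete illustration, here is a hedged sketch of a `generate_fn` backed by Anthropic's SDK; the model name and token limit are placeholders, and only the `(system_prompt, user_prompt) -> str` contract matters:

```python
import os

from anthropic import AsyncAnthropic

async def generate_with_anthropic(system_prompt: str, user_prompt: str) -> str:
    client = AsyncAnthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
    response = await client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder; use whichever model you prefer
        max_tokens=400,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}],
    )
    # Concatenate text blocks in case the response contains more than one.
    return "".join(block.text for block in response.content if block.type == "text")

grader = PerCriterionGrader(generate_fn=generate_with_anthropic)
```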
**2. Override specific methods**
Subclass any autograder and override any of the following (a hedged sketch follows this list):
- `judge()` - Orchestrates LLM calls to evaluate criteria and parse responses into structured results
- `generate()` - Wraps your `generate_fn` to customize how prompts are sent to the LLM
- `aggregation()` - Transforms individual criterion results into a final score and optional report
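For example, a subclass might wrap `generate()` with simple retry logic around each LLM call. This is only a sketch: the method name comes from the list above, but its signature is an assumption here (mirroring the `(system_prompt, user_prompt)` shape of `generate_fn`) and may differ in the actual base class.

```python
import asyncio

from rubric.autograders import PerCriterionGrader

class RetryingGrader(PerCriterionGrader):
    # Signature assumed for illustration; check the base class before relying on it.
    async def generate(self, system_prompt: str, user_prompt: str) -> str:
        for attempt in range(2):  # two quick retries before the final attempt
            try:
                return await super().generate(system_prompt, user_prompt)
            except Exception:
                await asyncio.sleep(2 ** attempt)  # brief backoff, then try again
        return await super().generate(system_prompt, user_prompt)  # last attempt; errors propagate
```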
**3. Full control**
Override the entire `grade()` method for complete end-to-end control over the grading process.
## Loading Rubrics
```python
from rubric import Rubric, Criterion  # Criterion is assumed to be exported alongside Rubric

# Direct construction
rubric = Rubric([
    Criterion(weight=10.0, requirement="States Q4 2023 base margin as 17.2%"),
    Criterion(weight=8.0, requirement="Explicitly uses Shapley attribution for decomposition"),
    Criterion(weight=-15.0, requirement="Uses total deliveries instead of cash-only deliveries")
])

# From a list of dictionaries
rubric = Rubric.from_dict([
    {"weight": 10.0, "requirement": "States Q4 2023 base margin as 17.2%"},
    {"weight": 8.0, "requirement": "Explicitly uses Shapley attribution for decomposition"},
    {"weight": -15.0, "requirement": "Uses total deliveries instead of cash-only deliveries"}
])

# From a JSON string
rubric = Rubric.from_json('[{"weight": 10.0, "requirement": "Example requirement"}]')

# From a YAML string
yaml_data = '''
- weight: 10.0
  requirement: "Example requirement"
'''
rubric = Rubric.from_yaml(yaml_data)

# From files
rubric = Rubric.from_file('rubric.json')
rubric = Rubric.from_file('rubric.yaml')
```
### JSON Format
```json
[
  {
    "weight": 10.0,
    "requirement": "States Q4 2023 base margin as 17.2%"
  },
  {
    "weight": 8.0,
    "requirement": "Explicitly uses Shapley attribution for decomposition"
  },
  {
    "weight": -15.0,
    "requirement": "Uses total deliveries instead of cash-only deliveries"
  }
]
```
### YAML Format
```yaml
- weight: 10.0
  requirement: "States Q4 2023 base margin as 17.2%"
- weight: 8.0
  requirement: "Explicitly uses Shapley attribution for decomposition"
- weight: -15.0
  requirement: "Uses total deliveries instead of cash-only deliveries"
```
## Requirements
- Python 3.11+
- Access to an LLM API (e.g., OpenAI, Anthropic, OpenRouter), with the appropriate API key set as an environment variable
## License
MIT License - see LICENSE file for details.