How to Evaluate AI Agents: A Practical Framework for 2025
Learn how leading teams evaluate AI agents for production. This guide covers offline evals, LLM-as-judge, trajectory analysis, and the metrics that actually matter.
By Fruxon Team
January 10, 2025
7 min read
Your AI agent works perfectly on 10 test cases. Then production traffic hits and everything breaks.
This is the evaluation gap—the difference between "it looks good" and "it actually works." According to recent industry data, only 52% of organizations run systematic offline evaluations before deploying agents. The other 48% learn about problems from users.
This guide covers how to evaluate AI agents properly, drawing from practices at companies shipping agents at scale.
Why Agent Evaluation is Different
AI agents aren't just LLMs with prompts. They're multi-step systems that:
- Take actions: Call APIs, modify databases, send emails
- Make decisions: Choose which tools to use and when
- Maintain state: Track context across conversation turns
- Compound errors: One bad decision cascades into more
Traditional LLM evaluation (checking if the output is "good") misses most failure modes. You need to evaluate the entire trajectory—every step the agent takes to reach its goal.
The Three Layers of Agent Evaluation
Layer 1: Component-Level Evaluation
Test individual capabilities in isolation before testing the full system.
Retrieval quality (if using RAG):
# Measure retrieval precision and recall
def evaluate_retrieval(query, retrieved_docs, relevant_docs):
    precision = len(set(retrieved_docs) & set(relevant_docs)) / len(retrieved_docs)
    recall = len(set(retrieved_docs) & set(relevant_docs)) / len(relevant_docs)
    return {"precision": precision, "recall": recall}
Tool selection accuracy:
# Does the agent pick the right tool?
test_cases = [
    {"input": "What's the weather in Tokyo?", "expected_tool": "get_weather"},
    {"input": "Book a flight to Paris", "expected_tool": "search_flights"},
    {"input": "What's 15% tip on $47?", "expected_tool": None},  # No tool needed
]
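A minimal sketch of scoring these cases, assuming a hypothetical `predict_tool` wrapper around your agent that returns the name of the selected tool (or `None` when no tool was used):

def tool_selection_accuracy(test_cases, predict_tool):
    # predict_tool(user_input) -> tool name or None; a stand-in for your agent wrapper
    correct = sum(
        1 for case in test_cases
        if predict_tool(case["input"]) == case["expected_tool"]
    )
    return correct / len(test_cases)

# Example (hypothetical agent object): tool_selection_accuracy(test_cases, my_agent.select_tool)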
Response quality: score each intermediate step, not just the final output.
Layer 2: Trajectory Evaluation
This is where most teams fail. Evaluating trajectories means examining the full sequence of decisions:
User: "Cancel my subscription and get a refund"
Trajectory A (good):
1. lookup_customer() → Found user
2. get_subscription() → Active, 3 days old
3. check_refund_policy() → Eligible for full refund
4. cancel_subscription() → Success
5. process_refund() → Success
6. Response: "Done. Full refund processed."
Trajectory B (bad):
1. cancel_subscription() → Failed (no customer context)
2. Response: "I couldn't cancel. Can you provide your email?"
Both trajectories might produce a helpful-sounding response, but only one actually solves the problem.
Key trajectory metrics:
| Metric | What It Measures |
|---|---|
| Task success rate | Did the agent achieve the goal? |
| Step efficiency | Minimum steps vs actual steps taken |
| Error recovery | Did it recover from failed tool calls? |
| Policy compliance | Did it follow business rules? |
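A minimal sketch of computing two of these metrics, assuming each run is logged as a dict with the tool calls the agent made, the reference (minimum) call sequence, and a success flag; the record format here is an assumption, not a fixed schema:

def trajectory_metrics(records):
    # Each record is assumed to look like:
    # {"calls": ["lookup_customer", ...], "reference_calls": [...], "succeeded": True}
    successes = sum(1 for r in records if r["succeeded"])
    # Step efficiency: minimum (reference) steps divided by steps actually taken
    efficiencies = [
        len(r["reference_calls"]) / max(len(r["calls"]), 1) for r in records
    ]
    return {
        "task_success_rate": successes / len(records),
        "avg_step_efficiency": sum(efficiencies) / len(efficiencies),
    }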
Layer 3: End-to-End Evaluation
Simulate real user interactions with complete scenarios:
scenario: refund_request_edge_case
user_profile:
  subscription_status: cancelled_yesterday
  refund_history: 2_previous_refunds
conversation:
  - user: "I want a refund for this month"
  - expected_behavior:
      - must_check: refund_eligibility
      - must_not: auto_approve_refund
      - should_explain: refund_policy_limits
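A minimal sketch of turning a scenario like this into a pass/fail check, assuming a run is summarized as the set of tools the agent called; treating `must_check` and `must_not` as lists of tool names is an assumption:

def check_expected_behavior(tools_called, expected):
    # expected mirrors the YAML above, e.g.
    # {"must_check": ["refund_eligibility"], "must_not": ["auto_approve_refund"]}
    failures = []
    for tool in expected.get("must_check", []):
        if tool not in tools_called:
            failures.append(f"missing required check: {tool}")
    for tool in expected.get("must_not", []):
        if tool in tools_called:
            failures.append(f"forbidden action taken: {tool}")
    return failures  # an empty list means the scenario passed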
LLM-as-Judge: Done Right
Using an LLM to evaluate another LLM sounds circular, but it works when done properly.
The naive approach (don't do this):
# Too vague, inconsistent results
prompt = "Rate this response 1-10 for quality"
The structured approach:
evaluation_rubric = """
Evaluate the agent response on these specific criteria:
1. TASK_COMPLETION (0 or 1)
- Did the agent fully complete the user's request?
- Partial completion = 0
2. FACTUAL_ACCURACY (0 or 1)
- Is all information provided verifiably correct?
- Any hallucinated details = 0
3. TOOL_USAGE (0, 0.5, or 1)
- 1: Used correct tools in optimal sequence
- 0.5: Completed task but inefficiently
- 0: Used wrong tools or skipped necessary ones
4. POLICY_ADHERENCE (0 or 1)
- Did the agent follow all business rules?
- Reference: {policy_document}
Provide scores and one-sentence justifications for each.
"""
Calibration is essential. Run your LLM-as-judge on 50-100 examples that humans have already scored. Measure agreement. Adjust rubric until agreement exceeds 85%.
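Measuring agreement can be as simple as an exact-match rate over per-criterion scores (swap in Cohen's kappa if you want to correct for chance agreement); `judge_scores` and `human_scores` are assumed to be parallel lists of dicts keyed by criterion name:

def judge_human_agreement(judge_scores, human_scores, criteria):
    matches, total = 0, 0
    for judged, labeled in zip(judge_scores, human_scores):
        for criterion in criteria:
            matches += judged[criterion] == labeled[criterion]
            total += 1
    return matches / total  # tune the rubric until this exceeds 0.85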
Building Your Evaluation Dataset
A good eval dataset has these properties:
Coverage: Include all critical user journeys
- Happy paths (80% of cases)
- Edge cases (15%)
- Adversarial inputs (5%)
Diversity: Vary along multiple dimensions
- Query length and complexity
- User expertise level
- Context requirements
Difficulty calibration: Include easy, medium, and hard cases. If your agent scores 100%, your dataset is too easy.
Example dataset structure:
{
  "id": "refund-003",
  "category": "refund_requests",
  "difficulty": "hard",
  "input": "I bought this 6 months ago and it broke. I want my money back.",
  "context": {
    "order_date": "6_months_ago",
    "refund_window": "30_days",
    "product_warranty": "1_year"
  },
  "expected_trajectory": [
    "lookup_order",
    "check_warranty_status",
    "initiate_warranty_claim"
  ],
  "expected_outcome": "warranty_replacement_not_refund",
  "grading_notes": "Agent should recognize warranty applies, not refund"
}
Offline vs Online Evaluation
Offline evaluation (before deployment):
- Run against static dataset
- Deterministic, reproducible
- Catches obvious regressions
- Limitation: Can't predict all production scenarios
Online evaluation (in production):
- Real user interactions
- Catches distribution shift
- Measures actual business impact
- Requires safety guardrails
The right balance: Use offline evals as a gate (block bad deploys) and online evals for learning (improve over time).
PR opened
↓
Offline eval suite runs
↓
[Pass threshold?] → No → Block merge
↓ Yes
Deploy to 5% traffic
↓
Online metrics monitored
↓
[Regression detected?] → Yes → Auto-rollback
↓ No
Gradual rollout to 100%
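A minimal sketch of the offline gate step in that pipeline, assuming the eval suite writes its results to a JSON file; the file name, metric key, and threshold are illustrative:

# ci_eval_gate.py -- exit nonzero so CI blocks the merge on a failed gate
import json
import sys

THRESHOLD = 0.90  # illustrative; set it from your current baseline

with open("eval_results.json") as f:
    results = json.load(f)

rate = results["goal_completion_rate"]
if rate < THRESHOLD:
    print(f"Eval gate failed: {rate:.2%} < {THRESHOLD:.0%}")
    sys.exit(1)
print("Eval gate passed.")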
Metrics That Actually Matter
Stop tracking vanity metrics. Focus on these:
Primary Metrics (business impact)
| Metric | Definition | Why It Matters |
|---|---|---|
| Goal completion rate | User's intent fully satisfied | The only metric that truly matters |
| Escalation rate | Handed off to human | High rate means agent not ready |
| Cost per resolution | Total tokens and API calls | Sustainability check |
Secondary Metrics (diagnostic)
| Metric | Definition | What It Reveals |
|---|---|---|
| Steps per task | Actions taken to complete | Efficiency problems |
| Tool call failure rate | Failed API and function calls | Integration issues |
| Retry rate | Same request re-attempted | Confusion or failures |
| Latency P95 | 95th percentile response time | User experience |
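A minimal sketch of deriving a few of these from logged runs; the record fields (`steps`, `tool_calls`, `latency_ms`) are assumptions about your logging format, and the P95 uses a simple nearest-rank approximation:

def secondary_metrics(runs):
    # Each run is assumed to look like:
    # {"steps": 7, "tool_calls": [{"ok": True}, ...], "latency_ms": 2300}
    calls = [c for run in runs for c in run["tool_calls"]]
    latencies = sorted(run["latency_ms"] for run in runs)
    p95 = latencies[max(int(len(latencies) * 0.95) - 1, 0)]
    return {
        "avg_steps_per_task": sum(run["steps"] for run in runs) / len(runs),
        "tool_call_failure_rate": sum(not c["ok"] for c in calls) / max(len(calls), 1),
        "latency_p95_ms": p95,
    }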
Safety Metrics (non-negotiable)
| Metric | Threshold | Action if Breached |
|---|---|---|
| Hallucination rate | Below 1% | Block deployment |
| Policy violation rate | 0% | Immediate rollback |
| PII exposure | 0 incidents | Incident response |
Common Evaluation Mistakes
1. Testing only happy paths
Your eval dataset is too clean. Real users send typos, irrelevant context, and adversarial inputs.
2. Optimizing for eval scores instead of user outcomes
If you tune prompts to ace your evals, you're overfitting. Rotate eval sets regularly.
3. Ignoring trajectory quality
An agent that stumbles through 15 steps to do a 3-step task has problems, even if it eventually succeeds.
4. Skipping human baselines
How would a human handle this request? If you don't know, you can't evaluate the agent meaningfully.
5. Evaluating once, not continuously
Model behavior drifts. User patterns change. Set up continuous evaluation, not one-time audits.
Getting Started This Week
Days 1-2: Audit what you have
- List all agent capabilities
- Identify the 5 most critical user journeys
- Document current failure modes from logs
Days 3-4: Build initial dataset
- Create 10 test cases per critical journey (50 total)
- Include 2-3 edge cases per journey
- Define expected trajectories, not just outputs
Day 5: Set up evaluation pipeline
- Run eval suite on current agent
- Establish baseline metrics
- Add to CI pipeline
Ongoing: Iterate
- Add cases for every production bug
- Review false positives and negatives weekly
- Expand dataset as capabilities grow
Tools and Frameworks
The evaluation tooling space is maturing rapidly:
- Braintrust, LangSmith, Arize - Full-featured eval platforms
- DeepEval, RAGAS - Open-source evaluation frameworks
- OpenTelemetry GenAI conventions - Standardizing trace formats
The tool matters less than the practice. Start with a spreadsheet if needed. Graduate to platforms when scale demands it.
Evaluation is a Discipline, Not a Checklist
The teams shipping reliable agents treat evaluation as a core engineering practice, not an afterthought. They:
- Block deploys on eval failures
- Investigate every production incident
- Expand test coverage continuously
- Measure what matters to users, not what's easy to measure
Your evaluation system is only as good as the questions you ask. Start with "does this actually work for users?" and work backwards.