

Evaluation · Testing · AI Agents · LLM · Production

How to Evaluate AI Agents: A Practical Framework for 2025

Learn how leading teams evaluate AI agents for production. This guide covers offline evals, LLM-as-judge, trajectory analysis, and the metrics that actually matter.

By Fruxon Team

January 10, 2025

7 min read


Your AI agent works perfectly on 10 test cases. Then production traffic hits and everything breaks.

This is the evaluation gap—the difference between "it looks good" and "it actually works." According to recent industry data, only 52% of organizations run systematic offline evaluations before deploying agents. The other 48% learn about problems from users.

This guide covers how to evaluate AI agents properly, drawing from practices at companies shipping agents at scale.

Why Agent Evaluation is Different

AI agents aren't just LLMs with prompts. They're multi-step systems that:

  • Take actions: Call APIs, modify databases, send emails
  • Make decisions: Choose which tools to use and when
  • Maintain state: Track context across conversation turns
  • Compound errors: One bad decision cascades into more

Traditional LLM evaluation (checking if the output is "good") misses most failure modes. You need to evaluate the entire trajectory—every step the agent takes to reach its goal.

The Three Layers of Agent Evaluation

Layer 1: Component-Level Evaluation

Test individual capabilities in isolation before testing the full system.

Retrieval quality (if using RAG):

# Measure retrieval precision and recall against a labeled set of relevant docs
def evaluate_retrieval(query, retrieved_docs, relevant_docs):
    hits = set(retrieved_docs) & set(relevant_docs)
    precision = len(hits) / len(retrieved_docs) if retrieved_docs else 0.0
    recall = len(hits) / len(relevant_docs) if relevant_docs else 0.0
    return {"precision": precision, "recall": recall}
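
For a quick sanity check, you can call it with a handful of hand-labeled examples; the document IDs below are purely illustrative:

# Hypothetical example: 2 of the 3 retrieved docs are relevant, and 2 of the 4 relevant docs were found
scores = evaluate_retrieval(
    query="How do I get a refund?",
    retrieved_docs=["refund_policy", "billing_faq", "shipping_info"],
    relevant_docs=["refund_policy", "billing_faq", "warranty_terms", "cancellation_flow"],
)
print(scores)  # precision ≈ 0.67, recall = 0.5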

Tool selection accuracy:

# Does the agent pick the right tool?
test_cases = [
    {"input": "What's the weather in Tokyo?", "expected_tool": "get_weather"},
    {"input": "Book a flight to Paris", "expected_tool": "search_flights"},
    {"input": "What's 15% tip on $47?", "expected_tool": None},  # No tool needed
]
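
A minimal sketch for scoring these cases, assuming your agent exposes a hook that returns the tool it would call (agent.select_tool here is a hypothetical stand-in, not a standard API):

# Score tool-selection accuracy over the test cases above.
# select_tool() is assumed to return a tool name, or None when no tool is needed.
def tool_selection_accuracy(agent, test_cases):
    correct = 0
    for case in test_cases:
        chosen = agent.select_tool(case["input"])
        if chosen == case["expected_tool"]:
            correct += 1
    return correct / len(test_cases)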

Response quality: score the output of each intermediate step, not just the final answer.

Layer 2: Trajectory Evaluation

This is where most teams fail. Evaluating trajectories means examining the full sequence of decisions:

User: "Cancel my subscription and get a refund"

Trajectory A (good):
1. lookup_customer() → Found user
2. get_subscription() → Active, 3 days old
3. check_refund_policy() → Eligible for full refund
4. cancel_subscription() → Success
5. process_refund() → Success
6. Response: "Done. Full refund processed."

Trajectory B (bad):
1. cancel_subscription() → Failed (no customer context)
2. Response: "I couldn't cancel. Can you provide your email?"

Both trajectories might produce a "helpful-sounding" response, but only one actually solves the problem.

Key trajectory metrics:

Metric            | What It Measures
------------------|----------------------------------------
Task success rate | Did the agent achieve the goal?
Step efficiency   | Minimum steps vs actual steps taken
Error recovery    | Did it recover from failed tool calls?
Policy compliance | Did it follow business rules?
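
A rough sketch of the first two metrics, assuming each trajectory is logged as an ordered list of tool-call names (the helper and its inputs are illustrative):

# Task success and step efficiency from logged tool calls.
def trajectory_metrics(actual_steps, expected_steps):
    # Task success: every expected step appears in the actual trajectory, in order
    remaining = iter(actual_steps)
    success = all(step in remaining for step in expected_steps)
    # Step efficiency: minimum (expected) step count over steps actually taken;
    # only meaningful when the task succeeded
    efficiency = len(expected_steps) / len(actual_steps) if actual_steps else 0.0
    return {"task_success": success, "step_efficiency": efficiency}

# Trajectory A above scores task_success=True, step_efficiency=1.0;
# Trajectory B fails the in-order check at lookup_customer.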

Layer 3: End-to-End Evaluation

Simulate real user interactions with complete scenarios:

scenario: refund_request_edge_case
user_profile:
  subscription_status: cancelled_yesterday
  refund_history: 2_previous_refunds

conversation:
  - user: "I want a refund for this month"
  - expected_behavior:
      - must_check: refund_eligibility
      - must_not: auto_approve_refund
      - should_explain: refund_policy_limits
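
One way to turn the must_check / must_not constraints into an automated assertion, assuming you record which tools the agent called during the simulated conversation (the dict below mirrors the YAML; nothing here is a standard API):

# Check a scenario's behavioral constraints against the tools the agent actually called.
expected_behavior = {
    "must_check": ["refund_eligibility"],
    "must_not": ["auto_approve_refund"],
}

def check_expected_behavior(expected_behavior, tool_calls):
    # tool_calls: tool names invoked by the agent during the simulated conversation
    results = {}
    for tool in expected_behavior.get("must_check", []):
        results[f"must_check:{tool}"] = tool in tool_calls
    for tool in expected_behavior.get("must_not", []):
        results[f"must_not:{tool}"] = tool not in tool_calls
    return results

The should_explain constraint is about the response text rather than tool calls, so it typically falls to an LLM-as-judge check like the one in the next section.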

LLM-as-Judge: Done Right

Using an LLM to evaluate another LLM sounds circular, but it works when done properly.

The naive approach (don't do this):

# Too vague, inconsistent results
prompt = "Rate this response 1-10 for quality"

The structured approach:

evaluation_rubric = """
Evaluate the agent response on these specific criteria:

1. TASK_COMPLETION (0 or 1)
   - Did the agent fully complete the user's request?
   - Partial completion = 0

2. FACTUAL_ACCURACY (0 or 1)
   - Is all information provided verifiably correct?
   - Any hallucinated details = 0

3. TOOL_USAGE (0, 0.5, or 1)
   - 1: Used correct tools in optimal sequence
   - 0.5: Completed task but inefficiently
   - 0: Used wrong tools or skipped necessary ones

4. POLICY_ADHERENCE (0 or 1)
   - Did the agent follow all business rules?
   - Reference: {policy_document}

Provide scores and one-sentence justifications for each.
"""

Calibration is essential. Run your LLM-as-judge on 50-100 examples that humans have already scored, measure agreement, and adjust the rubric until agreement exceeds 85%.
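
A minimal calibration check, assuming you have human scores and judge scores for the same examples (this is simple percent agreement per criterion; swap in a chance-corrected statistic like Cohen's kappa if you prefer):

# Per-criterion agreement between human labels and LLM-judge labels.
def judge_agreement(human_scores, judge_scores, criteria):
    # human_scores / judge_scores: parallel lists of dicts, e.g. {"TASK_COMPLETION": 1, ...}
    agreement = {}
    for criterion in criteria:
        matches = sum(h[criterion] == j[criterion] for h, j in zip(human_scores, judge_scores))
        agreement[criterion] = matches / len(human_scores)
    return agreement

# Tune the rubric until every criterion clears the 85% bar.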

Building Your Evaluation Dataset

A good eval dataset has these properties:

Coverage: Include all critical user journeys

  • Happy paths (80% of cases)
  • Edge cases (15%)
  • Adversarial inputs (5%)

Diversity: Vary along multiple dimensions

  • Query length and complexity
  • User expertise level
  • Context requirements

Difficulty calibration: Include easy, medium, and hard cases. If your agent scores 100%, your dataset is too easy.

Example dataset structure:

{
  "id": "refund-003",
  "category": "refund_requests",
  "difficulty": "hard",
  "input": "I bought this 6 months ago and it broke. I want my money back.",
  "context": {
    "order_date": "6_months_ago",
    "refund_window": "30_days",
    "product_warranty": "1_year"
  },
  "expected_trajectory": [
    "lookup_order",
    "check_warranty_status",
    "initiate_warranty_claim"
  ],
  "expected_outcome": "warranty_replacement_not_refund",
  "grading_notes": "Agent should recognize warranty applies, not refund"
}
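
A lightweight validator for cases shaped like the one above can catch malformed entries before they skew your numbers (the required fields mirror the JSON keys shown; the rest is an assumption about your own schema):

# Validate an eval case against the structure shown above.
REQUIRED_FIELDS = {"id", "category", "difficulty", "input", "expected_trajectory", "expected_outcome"}
DIFFICULTIES = {"easy", "medium", "hard"}

def validate_case(case):
    errors = []
    missing = REQUIRED_FIELDS - case.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if case.get("difficulty") not in DIFFICULTIES:
        errors.append(f"unknown difficulty: {case.get('difficulty')!r}")
    if not case.get("expected_trajectory"):
        errors.append("expected_trajectory is empty")
    return errors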

Offline vs Online Evaluation

Offline evaluation (before deployment):

  • Run against static dataset
  • Deterministic, reproducible
  • Catches obvious regressions
  • Limitation: Can't predict all production scenarios

Online evaluation (in production):

  • Real user interactions
  • Catches distribution shift
  • Measures actual business impact
  • Requires safety guardrails

The right balance: Use offline evals as a gate (block bad deploys) and online evals for learning (improve over time).

PR opened
    ↓
Offline eval suite runs
    ↓
[Pass threshold?] → No → Block merge
    ↓ Yes
Deploy to 5% traffic
    ↓
Online metrics monitored
    ↓
[Regression detected?] → Yes → Auto-rollback
    ↓ No
Gradual rollout to 100%
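
The "offline eval suite" gate in that flow can be a small script in CI that fails the build when scores fall below your thresholds; a sketch, with placeholder metric names and numbers rather than recommendations:

# CI gate: exit non-zero (blocking the merge) if offline eval results miss the thresholds.
import sys

THRESHOLDS = {"task_success_rate": 0.90, "policy_violation_rate": 0.0}

def gate(results):
    failures = []
    if results["task_success_rate"] < THRESHOLDS["task_success_rate"]:
        failures.append("task success rate below threshold")
    if results["policy_violation_rate"] > THRESHOLDS["policy_violation_rate"]:
        failures.append("policy violations detected")
    return failures

if __name__ == "__main__":
    results = {"task_success_rate": 0.93, "policy_violation_rate": 0.0}  # would come from your eval runner
    failures = gate(results)
    if failures:
        print("\n".join(failures))
        sys.exit(1)  # most CI systems treat a non-zero exit as a failed check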

Metrics That Actually Matter

Stop tracking vanity metrics. Focus on these:

Primary Metrics (business impact)

Metric               | Definition                    | Why It Matters
---------------------|-------------------------------|------------------------------------
Goal completion rate | User's intent fully satisfied | The only metric that truly matters
Escalation rate      | Handed off to human           | High rate means agent not ready
Cost per resolution  | Total tokens and API calls    | Sustainability check
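
These usually come straight from session logs; a sketch, assuming each session records whether the goal was met, whether it escalated, and what it cost (field names are illustrative):

# Primary metrics from a non-empty list of logged sessions.
def primary_metrics(sessions):
    n = len(sessions)
    resolved = sum(s["goal_met"] for s in sessions)
    return {
        "goal_completion_rate": resolved / n,
        "escalation_rate": sum(s["escalated"] for s in sessions) / n,
        # total spend divided by resolved sessions (avoid div-by-zero when nothing resolved)
        "cost_per_resolution": sum(s["cost_usd"] for s in sessions) / max(resolved, 1),
    }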

Secondary Metrics (diagnostic)

Metric                 | Definition                    | What It Reveals
-----------------------|-------------------------------|-----------------------
Steps per task         | Actions taken to complete     | Efficiency problems
Tool call failure rate | Failed API and function calls | Integration issues
Retry rate             | Same request re-attempted     | Confusion or failures
Latency P95            | 95th percentile response time | User experience

Safety Metrics (non-negotiable)

Metric                | Threshold   | Action if Breached
----------------------|-------------|--------------------
Hallucination rate    | Below 1%    | Block deployment
Policy violation rate | 0%          | Immediate rollback
PII exposure          | 0 incidents | Incident response

Common Evaluation Mistakes

1. Testing only happy paths

Your eval dataset is too clean. Real users send typos, irrelevant context, and adversarial inputs.

2. Optimizing for eval scores instead of user outcomes

If you tune prompts to ace your evals, you're overfitting. Rotate eval sets regularly.

3. Ignoring trajectory quality

An agent that stumbles through 15 steps to do a 3-step task has problems, even if it eventually succeeds.

4. Skipping human baselines

How would a human handle this request? If you don't know, you can't evaluate the agent meaningfully.

5. Evaluating once, not continuously

Model behavior drifts. User patterns change. Set up continuous evaluation, not one-time audits.

Getting Started This Week

Days 1-2: Audit what you have

  • List all agent capabilities
  • Identify the 5 most critical user journeys
  • Document current failure modes from logs

Days 3-4: Build initial dataset

  • Create 10 test cases per critical journey (50 total)
  • Include 2-3 edge cases per journey
  • Define expected trajectories, not just outputs

Day 5: Set up evaluation pipeline

  • Run eval suite on current agent
  • Establish baseline metrics
  • Add to CI pipeline

Ongoing: Iterate

  • Add cases for every production bug
  • Review false positives and negatives weekly
  • Expand dataset as capabilities grow

Tools and Frameworks

The evaluation tooling space is maturing rapidly:

  • Braintrust, LangSmith, Arize - Full-featured eval platforms
  • DeepEval, RAGAS - Open-source evaluation frameworks
  • OpenTelemetry GenAI conventions - Standardizing trace formats

The tool matters less than the practice. Start with a spreadsheet if needed. Graduate to platforms when scale demands it.

Evaluation is a Discipline, Not a Checklist

The teams shipping reliable agents treat evaluation as a core engineering practice, not an afterthought. They:

  • Block deploys on eval failures
  • Investigate every production incident
  • Expand test coverage continuously
  • Measure what matters to users, not what's easy to measure

Your evaluation system is only as good as the questions you ask. Start with "does this actually work for users?" and work backwards.
