Evaluation Dimensions Overview
Evaluating an Agent is more complex than evaluating a traditional classification or generation system, because several dimensions matter at once:
Task Success
- End-to-end: Does the Agent achieve the intended goal for a given task?
- Sub-steps: In multi-step tasks, are individual steps executed correctly?
- Measures: Human scoring, comparison to reference answers, LLM-as-Judge scores
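One of the measures above, comparison to reference answers, can be sketched as a simple normalized exact-match scorer. The normalization rules here (lowercasing, stripping punctuation) are illustrative assumptions, not a standard:

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace (illustrative rules)."""
    text = re.sub(r"[^\w\s]", "", text.lower().strip())
    return re.sub(r"\s+", " ", text)

def exact_match(agent_answer: str, reference: str) -> bool:
    """End-to-end success check: does the answer match the reference?"""
    return normalize(agent_answer) == normalize(reference)

def success_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (answer, reference) pairs that match."""
    if not pairs:
        return 0.0
    return sum(exact_match(a, r) for a, r in pairs) / len(pairs)
```

Exact match is strict; for free-form answers you would typically fall back to human scoring or LLM-as-Judge instead.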
Tool Use
- Correctness: Are the right tools called with the right parameters?
- Necessity: Are unnecessary or erroneous tool calls avoided?
- Efficiency: Are call count and number of rounds reasonable?
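The three tool-use checks above can be scored over a recorded trace of tool calls. The `ToolCall` record shape and the idea of comparing against an expected call set are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    params: dict

def _key(call: ToolCall) -> tuple:
    """Hashable identity for a call: name plus sorted parameters."""
    return (call.name, tuple(sorted(call.params.items())))

def check_tool_use(calls: list[ToolCall],
                   expected: list[ToolCall],
                   max_calls: int) -> dict:
    """Score a trace on correctness, necessity, and efficiency."""
    actual_set = {_key(c) for c in calls}
    expected_set = {_key(c) for c in expected}
    return {
        # Correctness: every expected tool was called with the right parameters
        "correct": expected_set <= actual_set,
        # Necessity: no calls beyond the expected ones
        "no_extra_calls": actual_set <= expected_set,
        # Efficiency: call count within budget
        "within_budget": len(calls) <= max_calls,
    }
```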
Safety & Compliance
- Overreach: Does the Agent perform actions it should not?
- Hallucination: Does it invent non-existent information?
- Sensitive data: Does it leak information it should not?
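The sensitive-data check can be approximated with pattern matching over Agent output. The two patterns below (email addresses, 16-digit card-like numbers) are illustrative assumptions, not a complete scanner:

```python
import re

# Illustrative patterns only; a real deployment needs a proper DLP scanner.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "card_number": re.compile(r"\b(?:\d[ -]?){16}\b"),
}

def find_leaks(output: str) -> list[str]:
    """Return the names of sensitive-data patterns found in the output."""
    return [name for name, pattern in SENSITIVE_PATTERNS.items()
            if pattern.search(output)]
```

A run can then be flagged whenever `find_leaks` returns a non-empty list.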
Efficiency & Cost
- Latency: End-to-end response time
- Token usage: Total tokens and cost
- Call count: LLM and tool call counts
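These three metrics can be aggregated from a per-step run trace; the `Step` record shape here is an assumption about what your tracing layer exposes:

```python
from dataclasses import dataclass

@dataclass
class Step:
    kind: str         # "llm" or "tool" (assumed labels)
    latency_s: float  # wall-clock time for this step
    tokens: int       # 0 for tool calls

def summarize(steps: list[Step]) -> dict:
    """Aggregate latency, token usage, and call counts for one run."""
    return {
        "latency_s": round(sum(s.latency_s for s in steps), 3),
        "total_tokens": sum(s.tokens for s in steps),
        "llm_calls": sum(s.kind == "llm" for s in steps),
        "tool_calls": sum(s.kind == "tool" for s in steps),
    }
```

Summing step latencies assumes sequential execution; with parallel tool calls you would measure end-to-end wall-clock time instead.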
Benchmarks & Datasets
- SWE-bench: Bug-fixing tasks drawn from real GitHub issues; evaluates software-engineering ability
- WebArena: Web interaction tasks on simulated sites; evaluates browsing and tool-use ability
- AgentBench: Multi-environment evaluation
- Custom: Build domain-specific test sets for your use case
LangSmith in Practice
LangSmith is LangChain's observability and evaluation platform:
Enable tracing by setting the environment variables LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY.
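A minimal setup sketch: set the two environment variables from the text before running your chain. The LANGCHAIN_PROJECT variable is an optional, commonly used addition for grouping runs; the API-key value is a placeholder to fill in from your LangSmith account:

```python
import os

# Enable LangSmith tracing before importing/running your chain or Agent.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"  # placeholder
os.environ["LANGCHAIN_PROJECT"] = "agent-eval-demo"  # optional run grouping
```

With these set, LangChain runs are traced automatically with no code changes to the chain itself.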
Testing Strategy
- Golden dataset: a curated set of tasks with reference answers, kept fixed as the benchmark
- Regression tests: rerun the golden dataset after every prompt, model, or tool change to catch quality drops
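A golden-dataset regression test, as called for in the summary, can be sketched like this. Both the dataset entries and `run_agent` are hypothetical stand-ins; in practice `run_agent` would call your real Agent:

```python
# Illustrative golden set: tasks with fixed reference answers.
GOLDEN_SET = [
    {"task": "2 + 2", "reference": "4"},
    {"task": "capital of France", "reference": "Paris"},
]

def run_agent(task: str) -> str:
    """Placeholder for your Agent entry point; stubbed for illustration."""
    return {"2 + 2": "4", "capital of France": "Paris"}[task]

def regression_pass_rate(golden: list[dict], agent) -> float:
    """Fraction of golden tasks whose answers still match the reference."""
    hits = sum(agent(ex["task"]).strip().lower() == ex["reference"].lower()
               for ex in golden)
    return hits / len(golden)
```

A CI job can then gate releases on the pass rate, e.g. fail the build if it drops below the previous baseline.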
Summary
Agent evaluation must cover task success, tool use, safety, and efficiency. Platforms like LangSmith support observability and automated evaluation. A golden dataset and regression tests are essential for keeping Agent quality stable as you iterate.