Evaluation Dimensions Overview
Evaluating an Agent is more complex than evaluating a traditional classification or generation system, because several dimensions matter at once:
Task Success
- End-to-end: Does the Agent achieve the intended goal for a given task?
- Sub-steps: In multi-step tasks, are individual steps executed correctly?
- Measures: Human scoring, comparison to reference answers, LLM-as-Judge scores
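One of the measures above, comparison to reference answers, can be sketched as a simple normalized exact-match scorer. The normalization rules here (lowercasing, stripping punctuation) are illustrative assumptions, not a standard:

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace (illustrative rules)."""
    text = re.sub(r"[^\w\s]", "", text.lower().strip())
    return re.sub(r"\s+", " ", text)

def exact_match(agent_answer: str, reference: str) -> bool:
    """End-to-end success check: does the answer match the reference?"""
    return normalize(agent_answer) == normalize(reference)

def success_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (answer, reference) pairs that match."""
    if not pairs:
        return 0.0
    return sum(exact_match(a, r) for a, r in pairs) / len(pairs)
```

Exact match is strict; for free-form answers you would typically fall back to human scoring or LLM-as-Judge instead.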
Tool Use
- Correctness: Are the right tools called with the right parameters?
- Necessity: Are unnecessary or erroneous tool calls avoided?
- Efficiency: Are call count and number of rounds reasonable?
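The three tool-use checks above can be scored over a recorded trace of tool calls. The `ToolCall` record shape and the idea of comparing against an expected call set are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    params: dict

def _key(call: ToolCall) -> tuple:
    """Hashable identity for a call: name plus sorted parameters."""
    return (call.name, tuple(sorted(call.params.items())))

def check_tool_use(calls: list[ToolCall],
                   expected: list[ToolCall],
                   max_calls: int) -> dict:
    """Score a trace on correctness, necessity, and efficiency."""
    actual_set = {_key(c) for c in calls}
    expected_set = {_key(c) for c in expected}
    return {
        # Correctness: every expected tool was called with the right parameters
        "correct": expected_set <= actual_set,
        # Necessity: no calls beyond the expected ones
        "no_extra_calls": actual_set <= expected_set,
        # Efficiency: call count within budget
        "within_budget": len(calls) <= max_calls,
    }
```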
Safety & Compliance
- Overreach: Does the Agent perform actions it should not?
- Hallucination: Does it invent non-existent information?
- Sensitive data: Does it leak information it should not?
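The sensitive-data check can be approximated with pattern matching over Agent output. The two patterns below (email addresses, 16-digit card-like numbers) are illustrative assumptions, not a complete scanner:

```python
import re

# Illustrative patterns only; a real deployment needs a proper DLP scanner.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "card_number": re.compile(r"\b(?:\d[ -]?){16}\b"),
}

def find_leaks(output: str) -> list[str]:
    """Return the names of sensitive-data patterns found in the output."""
    return [name for name, pattern in SENSITIVE_PATTERNS.items()
            if pattern.search(output)]
```

A run can then be flagged whenever `find_leaks` returns a non-empty list.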
Efficiency & Cost
- Latency: End-to-end response time
- Token usage: Total tokens and cost
- Call count: LLM and tool call counts
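These three metrics can be aggregated from a per-step run trace; the `Step` record shape here is an assumption about what your tracing layer exposes:

```python
from dataclasses import dataclass

@dataclass
class Step:
    kind: str         # "llm" or "tool" (assumed labels)
    latency_s: float  # wall-clock time for this step
    tokens: int       # 0 for tool calls

def summarize(steps: list[Step]) -> dict:
    """Aggregate latency, token usage, and call counts for one run."""
    return {
        "latency_s": round(sum(s.latency_s for s in steps), 3),
        "total_tokens": sum(s.tokens for s in steps),
        "llm_calls": sum(s.kind == "llm" for s in steps),
        "tool_calls": sum(s.kind == "tool" for s in steps),
    }
```

Summing step latencies assumes sequential execution; with parallel tool calls you would measure end-to-end wall-clock time instead.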
Benchmarks & Datasets
- SWE-bench: Bug-fixing tasks drawn from real GitHub issues; evaluates software-engineering ability
- WebArena: Web interaction tasks on simulated sites; evaluates browsing and tool-use ability
- AgentBench: Multi-environment evaluation
- Custom: Build domain-specific test sets for your use case
LangSmith in Practice
LangSmith is LangChain's observability and evaluation platform:
Enable tracing by setting the environment variables LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY.
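A minimal setup sketch: set the two environment variables from the text before running your chain. The LANGCHAIN_PROJECT variable is an optional, commonly used addition for grouping runs; the API-key value is a placeholder to fill in from your LangSmith account:

```python
import os

# Enable LangSmith tracing before importing/running your chain or Agent.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"  # placeholder
os.environ["LANGCHAIN_PROJECT"] = "agent-eval-demo"  # optional run grouping
```

With these set, LangChain runs are traced automatically with no code changes to the chain itself.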
Testing Strategy
- Golden dataset: a curated set of tasks with reference answers, kept fixed as the benchmark
- Regression tests: rerun the golden dataset after every prompt, model, or tool change to catch quality drops
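A golden-dataset regression test, as called for in the summary, can be sketched like this. Both the dataset entries and `run_agent` are hypothetical stand-ins; in practice `run_agent` would call your real Agent:

```python
# Illustrative golden set: tasks with fixed reference answers.
GOLDEN_SET = [
    {"task": "2 + 2", "reference": "4"},
    {"task": "capital of France", "reference": "Paris"},
]

def run_agent(task: str) -> str:
    """Placeholder for your Agent entry point; stubbed for illustration."""
    return {"2 + 2": "4", "capital of France": "Paris"}[task]

def regression_pass_rate(golden: list[dict], agent) -> float:
    """Fraction of golden tasks whose answers still match the reference."""
    hits = sum(agent(ex["task"]).strip().lower() == ex["reference"].lower()
               for ex in golden)
    return hits / len(golden)
```

A CI job can then gate releases on the pass rate, e.g. fail the build if it drops below the previous baseline.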
Summary
Agent evaluation must cover task success, tool use, safety, and efficiency. Platforms like LangSmith support observability and automated evaluation. A golden dataset and regression tests are essential for keeping Agent quality stable as you iterate.