Rollout status: Evals is currently being rolled out progressively, starting with Enterprise customers. If you’re an Enterprise customer and don’t see this feature yet, reach out to your account manager to discuss access.
Workforce Evals lets you test and evaluate multi-agent systems as a whole — not just individual agents. You can run scenario-based tests against your entire workforce and score how well agents collaborate to complete tasks, or score existing workforce task results without re-running them. The same evaluator types and scoring logic used for agent evals apply to workforce evals. See the agent evals documentation for full details on evaluator types, creating test suites, and understanding results.

Evaluation modes

Workforce Evals supports two modes depending on whether you want to generate new task results or evaluate existing ones.

Generate-and-score mode

The workforce runs a test scenario from scratch and the result is scored against your evaluators. Use this mode when you want to:
  • Test a new workforce configuration before deploying it
  • Run regression tests after making changes to agent instructions or connections
  • Simulate specific scenarios to check how agents hand off work to each other

Score-only mode

Existing workforce task results are passed to your evaluators without re-running the workforce. Use this mode when you want to:
  • Evaluate production workforce runs after the fact
  • Analyze historical task performance across a batch of results
  • Score results from tasks that are expensive or slow to re-run
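The difference between the two modes can be sketched in a few lines of Python. This is a minimal, hypothetical illustration: the names (`TaskResult`, `run_workforce`, `score_result`) and the result shape are assumptions for the sketch, not part of the product API.

```python
from dataclasses import dataclass, field

@dataclass
class TaskResult:
    """Hypothetical container for a workforce task's outcome."""
    output: str
    tools_used: list = field(default_factory=list)

def run_workforce(scenario: str) -> TaskResult:
    # Stand-in for a real workforce run triggered from its entry point.
    return TaskResult(output=f"handled: {scenario}")

def score_result(result: TaskResult, evaluators: dict) -> dict:
    # Apply each evaluator to the task result and collect pass/fail scores.
    return {name: check(result) for name, check in evaluators.items()}

evaluators = {"mentions_refund": lambda r: "refund" in r.output}

# Generate-and-score: run the scenario from scratch, then score the fresh result.
fresh = run_workforce("customer asks for a refund")
print(score_result(fresh, evaluators))

# Score-only: score an existing result without re-running the workforce.
historical = TaskResult(output="issued refund of $40")
print(score_result(historical, evaluators))
```

Either way, the same evaluators see the same result shape; the modes differ only in where the result comes from.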

Evaluators

Workforce Evals uses the same evaluator types as agent evals:

LLM Judge

Uses an LLM to assess task results against criteria you define in a prompt.

String Contains

Checks whether the output includes specific text.

String Equals

Checks whether the output exactly matches an expected value.

Tool Usage

Checks whether a specific tool was used during the task.
For full evaluator configuration details — including how to create global evaluators, configure LLM Judge prompts, and set pass thresholds — see the agent evals documentation.
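The rule-based evaluator types above amount to simple predicates over a task result. The sketch below shows that logic under stated assumptions: the function names, the result dictionary shape, and the 2-of-3 pass threshold are all illustrative, not the product's actual configuration.

```python
def string_contains(output: str, expected: str) -> bool:
    # String Contains: does the output include the expected text?
    return expected in output

def string_equals(output: str, expected: str) -> bool:
    # String Equals: does the output exactly match the expected value?
    return output == expected

def tool_usage(tools_used: list, tool_name: str) -> bool:
    # Tool Usage: was a specific tool called during the task?
    return tool_name in tools_used

# Hypothetical task result shape for illustration.
result = {"output": "Ticket escalated to billing.", "tools_used": ["crm_lookup"]}

checks = [
    string_contains(result["output"], "billing"),
    string_equals(result["output"], "Done"),
    tool_usage(result["tools_used"], "crm_lookup"),
]
# An illustrative pass threshold: require at least 2 of the 3 checks to pass.
passed = sum(checks) >= 2
print(passed)
```

An LLM Judge evaluator replaces these exact-match predicates with a model call that grades the output against your prompt criteria, but it plugs into the same pass/fail scoring.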

Key differences from agent evals

Workforce evals evaluate multi-agent collaboration across an entire workflow, not the behavior of a single agent. This means:
  • Evaluation scope: Evaluators assess the combined output of all agents involved in a task, including handoffs, tool calls across agents, and final results.
  • Test scenarios: Scenarios simulate end-to-end workforce tasks rather than single-agent conversations. The simulated input triggers the workforce from its entry point.
  • Score-only mode: Unlike agent evals, workforce evals include a score-only mode for evaluating existing task results without re-running the workforce.

When to use each mode

Scenario                                   Recommended mode
Testing a new workforce configuration      Generate-and-score
Regression testing after agent changes     Generate-and-score
Evaluating production task results         Score-only
Analyzing historical performance           Score-only
Checking agent handoff quality             Generate-and-score
Auditing a batch of completed tasks        Score-only

Related resources

  • Agent evals — Full documentation on evaluator types, test suites, and scoring
  • Workforce task view — Monitor live workforce task performance and review task results
  • Workforces — Overview of how workforces and multi-agent systems work