# How evaluations work & what they measure

### Overview

The Evaluate section in Agentic Studio provides a dedicated environment for validating Digital Worker behavior before deployment and tracking performance over time. It is designed to support both fast, iterative testing of individual scenarios and structured, repeatable evaluation across larger sets of test cases.

Evaluation in Agentic Studio helps answer questions that configuration alone cannot: whether the Digital Worker behaves correctly in a specific scenario, how it performs across a broader range of inputs, whether a recent change improved performance or introduced regressions, and which dimensions of quality need attention. The goal is to make testing observable, structured, and repeatable rather than dependent on informal spot checks.

### Two Modes of Evaluation

Agentic Studio supports two evaluation modes that serve different purposes and work best at different stages of development.

#### Interactive Testing

Interactive testing is used for individual scenario validation, rapid iteration, and ad hoc exploration. It is the fastest way to see how a Digital Worker responds to a specific input, which makes it useful when validating a new configuration, testing the effect of a prompt change, exploring edge cases, or confirming that a particular workflow path behaves as expected.

Within interactive testing, users can engage with the Digital Worker directly through a conversational interface, observing how it responds to real inputs in context. This simulates how the worker will behave in actual use and allows evaluation of tone, clarity, response structure, tool usage, and instruction adherence in a natural interaction flow.

Interactive tests also serve as a source of reusable test cases. Individual interactions can be saved and added to a dataset, connecting exploratory testing directly to structured regression coverage.

Agentic Studio supports two types of interactive tests. An event simulation test seeds the interaction with a real event payload to validate trigger behavior and the worker's response. A behavioral test evaluates agent logic, output format, or edge-case handling independently of any specific trigger event.
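The difference between the two test types can be pictured as a small data model. The sketch below is purely illustrative: `InteractiveTest`, its fields, and the `kind` property are hypothetical names, not part of any Agentic Studio API. It assumes only what the paragraph above states, namely that an event simulation test carries an event payload and a behavioral test does not.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class InteractiveTest:
    """Hypothetical model of one interactive test (not an Agentic Studio type)."""

    name: str
    user_input: str
    # Present only when the test is seeded with a trigger event.
    event_payload: Optional[Dict[str, Any]] = None

    @property
    def kind(self) -> str:
        # A test seeded with a payload simulates an event; otherwise it
        # exercises behavior independently of any trigger.
        if self.event_payload is not None:
            return "event_simulation"
        return "behavioral"


seeded = InteractiveTest(
    name="new ticket trigger",
    user_input="My invoice is wrong.",
    event_payload={"type": "ticket.created", "priority": "high"},
)
unseeded = InteractiveTest(name="empty input", user_input="")

print(seeded.kind)    # event_simulation
print(unseeded.kind)  # behavioral
```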

#### Dataset Testing

Dataset testing is used for broader, repeatable evaluation across many test cases at once. A dataset is a saved collection of test cases that can be run against a Digital Worker and reused as the worker evolves. Rather than recreating the same scenarios each time a change is made, datasets provide a persistent evaluation asset that can be applied consistently across versions.

Datasets are particularly useful for regression testing, release readiness checks, and performance tracking over time. Running a dataset after a configuration change makes it possible to see whether behavior has improved, degraded, or remained stable across the full range of cases, not just the ones that were manually tested in the current session.
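As a rough illustration of what "improved, degraded, or remained stable" means at the case level, the sketch below compares two runs of the same dataset. It assumes each run can be reduced to a mapping from a case identifier to a pass/fail flag; that shape, and the `compare_runs` helper itself, are assumptions made for the example rather than a documented Agentic Studio export format.

```python
from typing import Dict, List


def compare_runs(
    baseline: Dict[str, bool], candidate: Dict[str, bool]
) -> Dict[str, List[str]]:
    """Classify each shared case by how its pass/fail result changed."""
    report: Dict[str, List[str]] = {"improved": [], "regressed": [], "unchanged": []}
    # Only cases present in both runs can be compared.
    for case_id in sorted(baseline.keys() & candidate.keys()):
        before, after = baseline[case_id], candidate[case_id]
        if before == after:
            report["unchanged"].append(case_id)
        elif after:
            report["improved"].append(case_id)   # failed before, passes now
        else:
            report["regressed"].append(case_id)  # passed before, fails now
    return report


before_change = {"refund-basic": True, "refund-partial": True, "angry-user": False}
after_change = {"refund-basic": True, "refund-partial": False, "angry-user": True}

report = compare_runs(before_change, after_change)
print(report["regressed"])  # ['refund-partial']
```

A regression in a case that was not touched during the current session is exactly the kind of result that manual spot checks tend to miss.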

Datasets can be built from scratch or assembled by adding individual interactive test cases into a collection over time. This creates a natural workflow where exploratory testing generates the raw material for structured evaluation.
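The workflow of promoting saved interactions into a dataset might look like the following sketch. `SavedCase`, `Dataset`, and the de-duplication rule are all hypothetical and chosen for illustration; the source does not describe how Agentic Studio stores datasets internally.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class SavedCase:
    """Hypothetical record of one saved interactive test."""

    name: str
    user_input: str
    expected_behavior: str


@dataclass
class Dataset:
    """Hypothetical reusable collection of saved cases."""

    name: str
    cases: List[SavedCase] = field(default_factory=list)

    def add(self, case: SavedCase) -> bool:
        """Add a case unless one with the same name is already present."""
        if any(existing.name == case.name for existing in self.cases):
            return False
        self.cases.append(case)
        return True


regression_set = Dataset(name="refund-regression")
case = SavedCase(
    name="partial refund",
    user_input="I only want to return one of the two items.",
    expected_behavior="Offers a partial refund and asks which item.",
)

print(regression_set.add(case))   # True
print(regression_set.add(case))   # False, already saved
print(len(regression_set.cases))  # 1
```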

### Evaluation Metrics

Both interactive tests and dataset runs surface evaluation metrics that make Digital Worker performance visible and comparable. The metrics available include accuracy, relevance, helpfulness, test success rate, failed tests, and average response time.

In dataset testing, these metrics are aggregated across the full test suite, giving a summary view of performance at the collection level rather than requiring review of individual interactions. This makes it easier to assess overall readiness and identify patterns across a large number of cases.
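To make the idea of collection-level aggregation concrete, here is a minimal sketch that rolls per-case results up into a summary. The `CaseResult` shape, the 0 to 1 score range, and the use of a simple arithmetic mean are assumptions for the example; the source does not specify how Agentic Studio computes its aggregates.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class CaseResult:
    """Hypothetical per-case result; scores are assumed to lie in [0, 1]."""

    passed: bool
    response_seconds: float
    accuracy: float


def summarize(results: List[CaseResult]) -> Dict[str, float]:
    """Aggregate per-case results into dataset-level metrics."""
    if not results:
        raise ValueError("cannot summarize an empty run")
    failed = sum(1 for r in results if not r.passed)
    return {
        "test_success_rate": (len(results) - failed) / len(results),
        "failed_tests": failed,
        "average_response_time": mean(r.response_seconds for r in results),
        "accuracy": mean(r.accuracy for r in results),
    }


run = [
    CaseResult(passed=True, response_seconds=1.0, accuracy=1.0),
    CaseResult(passed=True, response_seconds=2.0, accuracy=0.5),
    CaseResult(passed=False, response_seconds=3.0, accuracy=0.0),
    CaseResult(passed=True, response_seconds=2.0, accuracy=0.5),
]

summary = summarize(run)
print(summary["test_success_rate"])      # 0.75
print(summary["failed_tests"])           # 1
print(summary["average_response_time"])  # 2.0
```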

### Out-of-Box and Custom Evaluations

Agentic Studio provides out-of-box evaluations that are available for all Digital Workers without additional configuration, as well as the ability to define custom evaluations for specific use cases. Custom evaluations allow teams to assess performance against criteria that are particular to their workflows, outputs, or quality standards.
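A custom evaluation can be thought of as a team-specific check applied to each response. The sketch below is one hypothetical way to express such a check in plain Python; the `evaluate_reply` function, the `CASE-<number>` reference format, and the word limit are invented for illustration and do not reflect how custom evaluations are actually configured in Agentic Studio.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class EvalResult:
    """Hypothetical outcome of one custom check."""

    passed: bool
    reasons: List[str]


def evaluate_reply(reply: str, max_words: int = 120) -> EvalResult:
    """Check a reply against two example team-specific criteria."""
    reasons: List[str] = []
    # Criterion 1 (assumed): every reply must cite a case reference.
    if not re.search(r"\bCASE-\d+\b", reply):
        reasons.append("missing case reference")
    # Criterion 2 (assumed): replies must stay within a word budget.
    if len(reply.split()) > max_words:
        reasons.append("reply exceeds word limit")
    return EvalResult(passed=not reasons, reasons=reasons)


good = evaluate_reply("Your request CASE-1042 has been escalated to billing.")
bad = evaluate_reply("We are looking into it.")

print(good.passed)  # True
print(bad.reasons)  # ['missing case reference']
```

Returning the reasons alongside the pass/fail flag keeps a failed check actionable, since a reviewer can see which criterion was missed without rereading the full response.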

An evaluation dashboard provides visibility into evaluation results over time, supporting ongoing quality monitoring beyond individual test runs.

### Evaluation in the Development Lifecycle

Evaluation is not a single gate before deployment. It is a continuous part of the Digital Worker development lifecycle. Interactive testing supports rapid iteration during configuration. Dataset testing supports structured validation before release. Both feed into an ongoing loop where changes are tested, results are reviewed, and the worker is refined before any version reaches production.

This approach keeps the path from development to deployment controlled and evidence-based, reducing the risk that a change that degrades performance goes undetected until it affects live users.


