Automated Testing

Validate agent behavior at scale with synthetic conversations

Automated testing lets you validate your agent's behavior by generating synthetic conversations at scale. An LLM simulates a user talking to your agent, and a separate LLM evaluates whether the agent responded correctly. This two-step loop can run across many sessions, covering everything from scenario handling and politeness to hallucination prevention and adversarial attacks.

Unlike traditional software, an agent can respond differently to the same input every time. A single manual conversation only tells you what happened once, not how the agent performs on average. Running the same test many times reveals your actual pass rate.

Go to Observatory > Testing to get started. The Tests page is where you create, configure, and run tests. The Results page shows outcomes across all runs and lets you inspect individual sessions.

How Testing Works

Testing is a critical part of the agent development lifecycle. Catching issues before deployment prevents poor experiences for real users. A bot that occasionally hallucinates, goes off-topic, or breaks tone can damage trust quickly, and the risk grows with every conversation it handles.

Every test involves three components:

Synthetic User

An LLM-powered persona that simulates a real user. You define the scenario, behavior, and goals through a prompt.

Agent

The agent being tested. The synthetic user converses with it like a real user would.

Evaluator

A separate LLM that reads the full conversation transcript and judges whether the agent met your success criteria, returning a pass/fail verdict with an explanation.

The test repeats this process across multiple sessions, each producing a separate conversation. Because the agent can respond differently each time, repeated sessions surface failures that a single test would miss.
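Conceptually, a run can be sketched in Python. The functions `call_synthetic_user`, `call_agent`, and `call_evaluator` below are hypothetical stand-ins for the platform's internal LLM calls, with stub behavior purely for illustration:

```python
import random

# Hypothetical stand-ins for the platform's internal LLM calls; the stub
# behavior below is invented purely for illustration.
def call_synthetic_user(transcript):
    # A real implementation prompts an LLM with the persona and transcript.
    return "[+] Thanks, that solved it." if random.random() < 0.5 else "It still does not work."

def call_agent(user_msg):
    return "Let me look into that for you."

def call_evaluator(transcript):
    # A real evaluator LLM reads the transcript and returns a JSON verdict.
    satisfied = any("[+]" in msg for role, msg in transcript if role == "user")
    return {"passed": satisfied, "reason": "user satisfied" if satisfied else "user gave up"}

def run_test(sessions: int, max_turns: int) -> float:
    """One run: generate `sessions` conversations, evaluate each, return the pass rate."""
    passed = 0
    for _ in range(sessions):
        transcript = []
        for _ in range(max_turns):
            user_msg = call_synthetic_user(transcript)
            transcript.append(("user", user_msg))
            transcript.append(("agent", call_agent(user_msg)))
            if "[+]" in user_msg or "[-]" in user_msg:  # early-stop signals
                break
        verdict = call_evaluator(transcript)
        passed += verdict["passed"]
    return passed / sessions
```

The outer loop is what makes the pass rate meaningful: one session is a single sample, while many sessions estimate how the agent behaves on average.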


Prerequisites

The Automated Agent Testing plugin must be activated. Go to Designer > Plugins and enable it. Currently, only OpenAI integrations are supported.

Creating a Test

1. Go to Observatory > Testing > Tests and click Add Test.

2. Configure General Settings. Enter a Test Name and an optional Description. Select the Agent, Language, and the start Flow to test. Set the number of Sessions to generate.

3. Define the Synthetic User. Select a Model and write a Prompt that defines the synthetic user's persona and scenario. See the Synthetic User section for guidance on writing effective prompts.

4. Set Up the Evaluator. Select a Model and write a Prompt that defines your success criteria and output format. See the Evaluator section for details on the expected JSON response.

5. Configure Session Settings. Set Max Session Turns to control conversation length. A turn is one user message plus the agent's response. Optionally add Session Tags to label the generated sessions for filtering in Observatory > Sessions.

6. Click Create Test to save, or Create & Run to save and execute immediately.

The Tests Dashboard

The tests dashboard is the central hub for managing all your test configurations. Go to Observatory > Testing > Tests to access it. The table lists all saved tests along with their attributes:

Test Name

The internal name of the test. Click the column header to sort alphabetically or search by name using the search bar above the table.

Agent

The agent corresponding to the test.

Created at

Timestamp of when the test was created. Sortable by clicking the column header.

Last Run

Timestamp of the most recent execution. Shows N/A for tests that have not been run yet. Sortable by clicking the column header.

Sessions Passed

Pass rate of the last run as a percentage. Shows the result of the most recent run only. You can sort and filter this column to quickly find failing tests. Shows - for tests that have not been run yet.

Actions

Each row includes action buttons to run the test, view its results, clone it to create a variant, or permanently delete it.


Use the Agents filter above the table to narrow down tests by a specific agent

Editing a Test

Click any row in the Tests table or the edit (pen) icon to open the edit dialog. This is the same form used during creation. After making changes, click Save Changes to update the configuration without running, or Save & Run to save and execute immediately.

Cloning a Test

Click the clone (copy) icon to duplicate an existing test. This creates a new test with the same configuration, letting you quickly build variations. For example, duplicate a persuasion test and swap the synthetic user prompt to test a different persona, or assign the cloned test to a different agent to compare how multiple agents handle the same scenario.

Deleting a Test

Click the delete (trash) icon to permanently remove a test.


Define the Synthetic User

The synthetic user is an LLM-powered persona that simulates a real user talking to your agent. You configure it with a model and a prompt.

Synthetic user messages are simpler to generate than evaluations, so smaller models work well. A mini variant like gpt-4.1-mini handles most scenarios reliably and keeps costs low at scale.

The prompt defines everything about the simulated user: who they are, what situation they are in, how they behave, and what they are trying to achieve. The more specific the prompt, the more realistic and useful the test conversations.

A good synthetic user prompt covers:

Persona

Who the user is. A frustrated customer, a first-time user, or a technical expert will each interact with your agent differently and test different capabilities.

Scenario

The situation and context driving the conversation. For example, reporting an unauthorized charge, requesting a password reset, or asking about a product feature.

Behavior

How the user communicates. Define tone, patience level, and verbosity. A calm and cooperative user tests different agent skills than an impatient one who sends short, demanding messages.

Goals

What the user wants to achieve by the end of the conversation. Clear goals help the evaluator determine whether the agent successfully resolved the request.

Early Stopping

The synthetic user can end conversations before reaching the max turn limit by sending termination signals in its response:

  • [+] when satisfied (positive outcome)

  • [-] when giving up or dissatisfied (negative outcome)

Include these instructions in your prompt for variable-length conversations. Omit them if you want every session to run for the full turn count.
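As an illustration only (the persona, scenario, and goals below are all invented), a prompt that covers the four elements and opts into early stopping might look like:

```python
# Hypothetical synthetic user prompt; every detail is invented for illustration.
SYNTHETIC_USER_PROMPT = """\
You are Dana, a customer of an online bank.
Scenario: this morning you noticed an unauthorized $150 charge on your account.
Behavior: impatient and direct; send short, demanding messages, but soften once
the agent makes concrete progress.
Goals: get the charge disputed and a replacement card ordered.
End your final message with [+] once both goals are met, or with [-] if you
decide to give up.
"""
```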

Writing Effective Synthetic User Prompts

Scenario Specificity

Match specificity to your testing goals:

  • Too generic: "You have a banking problem" (hard to evaluate)

  • Too specific: "Dispute transaction #12345 from March 3rd at 2:47 PM" (may not match agent capabilities)

  • Right balance: "You noticed an unauthorized $150 charge and want to understand next steps"

Behavior Variation

Test the same scenario with different user personas to stress-test your agent:

  • Patient and cooperative

  • Frustrated and demanding

  • Confused and non-technical

  • Adversarial and manipulative

Synthetic User Prompt Template

Define the Evaluator

The evaluator is a separate LLM that reads the full conversation transcript after a session ends and judges whether the agent met your success criteria. As with the synthetic user, you configure it with a model and a prompt.

Evaluation is more demanding than generating user messages. It requires nuanced understanding of conversation context and consistent application of criteria across sessions. Use a capable model like gpt-4.1 or newer for reliable verdicts.

The prompt tells the evaluator what to look for, how strictly to judge, and what format to return. A vague evaluator produces inconsistent results; a specific one gives you feedback you can act on across hundreds of sessions. A good evaluator prompt covers:

Success Criteria

Define 3-5 specific, measurable conditions the agent must meet. For example, "verify identity before sharing account details" rather than "be thorough".

Strictness Level

Decide whether all criteria must pass or if partial success counts. High-risk scenarios like security or compliance should fail on any single violation.

Output Format

Always require JSON output with a passed boolean and a reason string. The platform parses this to display verdicts and explanations in the UI.

Examples

Show the evaluator what a pass and a fail look like for your use case. This anchors its judgment and produces more consistent verdicts across sessions.

The evaluator prompt must instruct the LLM to return a JSON object with two fields:

  • passed (boolean): Whether the agent met your criteria

  • reason (string): Explanation of the verdict, displayed in the Detailed Results table
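Because the platform parses this output, the evaluator must emit strictly valid JSON in exactly this shape. A defensive parse of a verdict might look like this sketch (the function name is ours, not a platform API):

```python
import json

def parse_verdict(raw: str) -> tuple[bool, str]:
    """Parse an evaluator verdict, failing loudly if the shape is wrong."""
    data = json.loads(raw)
    if not isinstance(data.get("passed"), bool) or not isinstance(data.get("reason"), str):
        raise ValueError('evaluator must return {"passed": <bool>, "reason": "<string>"}')
    return data["passed"], data["reason"]

# Example:
# parse_verdict('{"passed": true, "reason": "Agent verified identity first."}')
# -> (True, "Agent verified identity first.")
```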


Writing Effective Evaluator Prompts

  • Be specific about criteria: "Agent must verify identity before discussing account details" instead of "Agent should be thorough"

  • Define 3-5 key criteria: Too many makes debugging hard, too few misses important issues

  • Include examples: Show the evaluator what a pass and fail look like for your specific use case

  • Calibrate before scaling: Test your evaluator on 5-10 sessions first to verify it produces consistent, accurate verdicts
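Putting these points together, an evaluator prompt might read like the following sketch (the criteria and wording are invented for a hypothetical banking agent):

```python
# Hypothetical evaluator prompt; criteria and examples are invented.
EVALUATOR_PROMPT = """\
You are judging a conversation between a banking support agent and a customer.
The agent PASSES only if ALL of the following hold:
1. It verified the customer's identity before discussing account details.
2. It filed the dispute or clearly explained how the customer can file it.
3. It never invented policies, fees, or timelines.
Pass example: the agent asks for the last four card digits, then files the dispute.
Fail example: the agent quotes a "48-hour refund guarantee" not in its instructions.
Respond with JSON only: {"passed": <boolean>, "reason": "<one-sentence explanation>"}
"""
```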

Evaluator Prompt Template

Running Tests

Click the run (bolt) icon in the Actions column to execute a test. While a test is running, the button turns into a progress indicator. Hover over it to check progress or cancel the run. The platform notifies you when the run finishes, so you can navigate away and come back when ready.

Each run generates the configured number of sessions. The synthetic user and agent converse until the max turn limit is reached or the synthetic user sends a termination signal.

Individual Run

Run a single test in one of three ways:

  • Click the run icon for any test

  • Click Create & Run when creating a new test

  • Click Save & Run after editing an existing test

Bulk Run

Run multiple tests at once by selecting them with the checkboxes and clicking Bulk Run. This is useful for running a full regression suite after updating your agent.

Review Your Tests

Results are available as soon as a run finishes. The Tests dashboard displays the pass rate of the last run directly in the table, so you can spot failures without leaving the page. For a full run history with per-session breakdowns, switch to Observatory > Testing > Results.

The Results Page

The Results page collects every run across all tests in one place. Use the date range and agent filters at the top of the table to narrow down results and sort them by date, name, or pass rate.


Hover over the info-circle icon next to a test name to see a quick summary of the test configuration.

Each test run generates the number of sessions you configured during test creation. Click any row to open a details panel and explore what happened in each run and where the agent passed or failed. The panel is divided into three sections:

Overall

The percentage of sessions that passed across the entire run.

Test Details

A summary of the test configuration including name, description, language, agent, workflow, and run timestamp.

Detailed Results

A per-session breakdown showing the verdict (Passed or Failed) and the evaluator's explanation for each session.

Unit Tests vs End-to-End Tests

You can design tests at two levels depending on what you want to validate.

Unit Tests

A unit test targets a single agent response. Set Max Session Turns to 1 and write a synthetic user prompt that sends one specific message. The evaluator then judges that single reply.

Use unit tests to verify isolated behaviors: greeting quality, knowledge base coverage, whether the agent asks for identification, or how it handles an off-topic question.

End-to-End Tests

An end-to-end test simulates a full conversation across multiple turns. Set a higher turn limit and let the synthetic user interact naturally with the agent until the scenario reaches a conclusion.

Use end-to-end tests to validate complete workflows: troubleshooting flows, onboarding sequences, escalation handling, or multi-step processes where earlier responses affect later ones.


Scaling Your Tests

LLM-powered agents are non-deterministic. The same input can produce different responses each time, which means an agent might fail a task only 1 in 100 or 1 in 1,000 times. Manual testing cannot catch these rare failures, but they add up when your agent handles thousands of conversations in production.

Failure Rate | Impact at 10,000 Conversations/Month
------------ | -------------------------------------
1/100        | ~100 failures per month, likely noticed by users
1/1,000      | ~10 failures per month, hard to detect manually
1/10,000     | ~1 failure per month, virtually invisible without automated testing

Start Small, Then Scale

  • 1-3 sessions: Confirm your synthetic user and evaluator prompts work as expected. Check that the evaluator produces consistent, accurate verdicts

  • 10-30 sessions: Look for patterns in the results. Is the evaluator too strict or too lenient? Adjust criteria before committing to a large run

  • 50+ sessions: Get statistically meaningful pass rates. At this volume, you can distinguish between a 99% and 99.9% success rate
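These session counts follow from simple probability: if the agent fails a scenario with rate p, the chance that n independent sessions surface at least one failure is 1 - (1 - p)^n. A quick sketch:

```python
def detection_probability(failure_rate: float, sessions: int) -> float:
    """Probability that at least one of `sessions` runs hits a failure of the given rate."""
    return 1 - (1 - failure_rate) ** sessions

# A 1-in-100 failure shows up in roughly 40% of 50-session runs,
# but in only 1% of single manual conversations.
print(round(detection_probability(0.01, 50), 2))  # -> 0.39
```

This is why rare failures demand large runs: halving the failure rate roughly doubles the number of sessions needed to observe it.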

Adversarial Testing

One of the most valuable uses of automated testing is red-teaming your agent. Create synthetic users that actively try to break your agent's behavior:

  • Jailbreak attempts: Users trying to make the agent ignore instructions or reveal system prompts

  • Social engineering: Users manipulating the agent through persuasion tactics (authority, urgency, reciprocity)

  • Boundary testing: Rapid topic changes, contradictory requests, or attempts to push the agent off-script

  • Safety testing: Users trying to elicit offensive, biased, or inappropriate responses

Example: Social Engineering Test

Best Practices

  • Start small, then scale: Validate prompts with 5-10 sessions before running hundreds. Calibrate your evaluator to avoid discovering it was too strict or lenient after 100 runs.

  • Use the right model for the job: Smaller models (e.g., gpt-4.1-mini) for synthetic users, more capable models (e.g., gpt-4.1) for evaluators.

  • Test adversarial scenarios: Create synthetic users that try to jailbreak, socially engineer, or push your agent past its boundaries. These red-team tests catch safety issues before real users do.

  • Tag your sessions: Use session tags to filter and group test results in Observatory. Organize tests by type (functional, safety, compliance) and risk level.

  • Calibrate evaluator strictness: High-risk scenarios (security, compliance) need strict pass/fail criteria. General quality tests can be more forgiving.

  • Re-run after changes: Execute tests after every agent update to catch regressions. Compare pass rates across runs to track improvement or degradation.
