Automated Testing
Validate agent behavior at scale with synthetic conversations
Automated testing lets you validate your agent's behavior by generating synthetic conversations at scale. An LLM simulates a user talking to your agent, and a separate LLM evaluates whether the agent responded correctly. This two-step loop can run across many sessions, covering everything from scenario handling and politeness to hallucination prevention and adversarial attacks.
Unlike traditional software, an agent can respond differently to the same input every time. A single manual conversation only tells you what happened once, not how the agent performs on average. Running the same test many times reveals your actual pass rate.
Go to Observatory > Testing to get started. The Tests page is where you create, configure, and run tests. The Results page shows outcomes across all runs and lets you inspect individual sessions.

How Testing Works
Testing is a critical part of the agent development lifecycle. Catching issues before deployment prevents poor experiences for real users. A bot that occasionally hallucinates, goes off-topic, or breaks tone can damage trust quickly, and the risk grows with every conversation it handles.
Every test involves three components: a synthetic user (an LLM that plays a persona you define), the agent under test, and an evaluator (a separate LLM that judges the resulting transcript against your success criteria).
The test runs this process multiple times (different sessions), each generating a separate conversation. Because the agent can respond differently each time, repeated sessions surface failures that a single conversation would miss.
A failure rate of 1/100 is invisible in manual testing but critical when your agent handles thousands of conversations monthly. Scale your session count to match the reliability level you need.
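To make the arithmetic concrete: if a single session fails with probability p, the chance that a run of n sessions surfaces at least one failure is 1 − (1 − p)^n. A minimal sketch (illustrative, not part of the platform):

```python
# Chance that at least one of n independent sessions hits a failure
# that occurs with per-session probability p.
def detection_probability(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

# A 1-in-100 failure is easy to miss in a handful of manual sessions
# but very likely to appear somewhere in a large automated run.
print(round(detection_probability(0.01, 5), 2))    # 0.05
print(round(detection_probability(0.01, 300), 2))  # 0.95
```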
Prerequisites
The Automated Agent Testing plugin must be activated. Go to Designer > Plugins and enable it. Currently, only OpenAI integrations are supported.
Creating a Test
Define the Synthetic User
Select a Model and write a Prompt that defines the synthetic user's persona and scenario. See Define the Synthetic User below for guidance on writing effective prompts.
Set Up the Evaluator
Select a Model and write a Prompt that defines your success criteria and output format. See Define the Evaluator below for details on the expected JSON response.
The Tests Dashboard
The tests dashboard is the central hub for managing all your test configurations. Go to Observatory > Testing > Tests to access it. The table lists all saved tests along with their attributes:

Test Name
The internal name of the test. Click the column header to sort alphabetically or search by name using the search bar above the table.
Last Run
Timestamp of the most recent execution. Shows N/A for tests that have not been run yet. Sortable by clicking the column header.
Sessions Passed
Pass rate of the last run as a percentage. Shows the result of the most recent run only. You can sort and filter this column to quickly find failing tests. Shows - for tests that have not been run yet.
Actions
Each row includes action buttons to run the test, view its results, clone it to create a variant, or permanently delete it.
Use the Agents filter above the table to narrow down tests by a specific agent.
Editing a Test
Click any row in the Tests table or the edit icon to open the edit dialog. This is the same form used during creation. After making changes, click Save Changes to update the configuration without running, or Save & Run to save and execute immediately.
Cloning a Test
Click the clone icon to duplicate an existing test. This creates a new test with the same configuration, letting you quickly build variations. For example, duplicate a persuasion test and swap the synthetic user prompt to test a different persona, or assign the cloned test to a different agent to compare how multiple agents handle the same scenario.
Deleting a Test
Click the delete icon to permanently remove a test.
All associated results in Observatory > Testing > Results are also deleted. The generated chat sessions remain available in Observatory > Sessions.
Define the Synthetic User
The synthetic user is an LLM-powered persona that simulates a real user talking to your agent. You configure it with a model and a prompt.
Synthetic user messages are simpler to generate than evaluations, so smaller models work well. A mini variant like gpt-4.1-mini handles most scenarios reliably and keeps costs low at scale.
The prompt defines everything about the simulated user: who they are, what situation they are in, how they behave, and what they are trying to achieve. The more specific the prompt, the more realistic and useful the test conversations.
A good synthetic user prompt covers all four: who the user is, the situation they are in, how they behave, and what they are trying to achieve.
Early Stopping
The synthetic user can end conversations before reaching the max turn limit by sending termination signals in its response:
[+] when satisfied (positive outcome)
[-] when giving up or dissatisfied (negative outcome)
Include these instructions in your prompt for variable-length conversations. Omit them if you want every session to run for the full turn count.
Writing Effective Synthetic User Prompts
Scenario Specificity
Match specificity to your testing goals:
Too generic: "You have a banking problem" (hard to evaluate)
Too specific: "Dispute transaction #12345 from March 3rd at 2:47 PM" (may not match agent capabilities)
Right balance: "You noticed an unauthorized $150 charge and want to understand next steps"
Behavior Variation
Test the same scenario with different user personas to stress-test your agent:
Patient and cooperative
Frustrated and demanding
Confused and non-technical
Adversarial and manipulative
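One way to organize persona variation is to keep the scenario fixed and swap in a persona paragraph per variant. A sketch with illustrative wording (the scenario text and persona descriptions are hypothetical examples, not platform defaults):

```python
# Build one synthetic user prompt per persona from a shared scenario.
# The scenario and persona descriptions below are illustrative only.
SCENARIO = (
    "You noticed an unauthorized $150 charge on your account "
    "and want to understand next steps."
)

PERSONAS = {
    "cooperative": "You are patient and answer the agent's questions directly.",
    "frustrated": "You are angry, demand immediate answers, and threaten to leave.",
    "non_technical": "You are confused by jargon and need plain-language help.",
    "adversarial": "You try to pressure the agent into skipping verification steps.",
}

def build_prompts(scenario: str, personas: dict[str, str]) -> dict[str, str]:
    return {name: f"{scenario} {style}" for name, style in personas.items()}

prompts = build_prompts(SCENARIO, PERSONAS)
print(len(prompts))  # 4 -- one test variant per persona
```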
Define the Evaluator
The evaluator is a separate LLM that reads the full conversation transcript after a session ends and judges whether the agent met your success criteria. As with the synthetic user, you configure it with a model and a prompt.
Evaluation is more demanding than generating user messages. It requires nuanced understanding of conversation context and consistent application of criteria across sessions. Use a capable model like gpt-4.1 or newer for reliable verdicts.
The prompt tells the evaluator what to look for, how strictly to judge, and what format to return. A vague evaluator produces inconsistent results; a specific one gives you feedback you can act on across hundreds of sessions. A good evaluator prompt covers your success criteria, the strictness of judgment, and the required output format.
The evaluator prompt must instruct the LLM to return a JSON object with two fields:
passed (boolean): Whether the agent met your criteria
reason (string): Explanation of the verdict, displayed in the Detailed Results table
The evaluator must return a valid JSON object. If the response is not valid JSON, the test session may fail to complete.
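On the consuming side, a runner can validate that contract before recording a verdict. A defensive-parsing sketch (the function name is ours, not a platform API):

```python
import json

# Validate the evaluator's verdict: a JSON object with a boolean
# "passed" and a string "reason", per the contract above.
def parse_verdict(raw: str) -> dict:
    verdict = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(verdict.get("passed"), bool):
        raise ValueError("'passed' must be a boolean")
    if not isinstance(verdict.get("reason"), str):
        raise ValueError("'reason' must be a string")
    return verdict

v = parse_verdict('{"passed": true, "reason": "Agent verified identity first."}')
print(v["passed"])  # True
```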
Writing Effective Evaluator Prompts
Be specific about criteria: "Agent must verify identity before discussing account details" instead of "Agent should be thorough"
Define 3-5 key criteria: Too many criteria make debugging hard; too few miss important issues
Include examples: Show the evaluator what a pass and fail look like for your specific use case
Calibrate before scaling: Test your evaluator on 5-10 sessions first to verify it produces consistent, accurate verdicts
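Putting those practices together, an evaluator prompt might look like the following (illustrative only; adapt the criteria to your own agent):

```
You are evaluating a transcript between a banking support agent and a customer.

The agent PASSES only if all criteria hold:
1. Verifies the user's identity before discussing account details.
2. Answers only from known account and policy information; never invents details.
3. Ends with clear next steps for the user.

Example pass: the agent asks for verification, explains the dispute process,
and summarizes next steps.
Example fail: the agent discusses the charge before verifying identity.

Return a JSON object: {"passed": <boolean>, "reason": "<one-sentence explanation>"}
```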
Running Tests
Click the run icon in the Actions column to execute a test. While a test is running, the button turns into a progress indicator. Hover over it to check progress or cancel the run. The platform notifies you when the run finishes, so you can navigate away and come back when ready.
Each run generates the configured number of sessions. The synthetic user and agent converse until the max turn limit is reached or the synthetic user sends a termination signal.
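The session loop described above can be pictured as follows. This is a minimal sketch: `synthetic_user_reply` and `agent_reply` stand in for the two LLM calls and are hypothetical, and early stopping uses the [+] / [-] signals:

```python
# One test session: alternate synthetic-user and agent turns until the
# turn limit is hit or the synthetic user sends a termination signal.
def run_session(synthetic_user_reply, agent_reply, max_turns: int):
    transcript = []  # list of (speaker, message) pairs
    for _ in range(max_turns):
        user_msg = synthetic_user_reply(transcript)
        transcript.append(("user", user_msg))
        if "[+]" in user_msg or "[-]" in user_msg:
            break  # early stop: the synthetic user ended the session
        transcript.append(("agent", agent_reply(transcript)))
    return transcript

# Tiny stand-ins to show the shape of a run:
def fake_user(transcript):
    return "Thanks, that helps! [+]" if transcript else "Hi, I need help."

def fake_agent(transcript):
    return "Happy to help. What seems to be the problem?"

print(len(run_session(fake_user, fake_agent, max_turns=5)))  # 3
```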
Review Your Tests
Results are available as soon as a run finishes. The Tests dashboard displays the pass rate of the last run directly in the table, so you can spot failures without leaving the page. For a full run history with per-session breakdowns, switch to Observatory > Testing > Results.

The Results Page
The Results page collects every run across all tests in one place. Use the date range and agent filters at the top of the table to narrow down results and sort them by date, name or pass rate.
Hover over the icon next to a test name to see a quick summary of the test configuration.
Each test run generates the number of sessions you configured during test creation. Click any row to explore what happened in each run and where the agent passed or failed. The detail panel is divided into three sections.

Unit Tests vs End-to-End Tests
You can design tests at two levels depending on what you want to validate.
Unit Tests
A unit test targets a single agent response. Set Max Session Turns to 1 and write a synthetic user prompt that sends one specific message. The evaluator then judges that single reply.
Use unit tests to verify isolated behaviors: greeting quality, knowledge base coverage, whether the agent asks for identification, or how it handles an off-topic question.
End-to-End Tests
An end-to-end test simulates a full conversation across multiple turns. Set a higher turn limit and let the synthetic user interact naturally with the agent until the scenario reaches a conclusion.
Use end-to-end tests to validate complete workflows: troubleshooting flows, onboarding sequences, escalation handling, or multi-step processes where earlier responses affect later ones.
Combine both approaches for full coverage. Unit tests catch regressions in specific responses quickly, while end-to-end tests reveal issues that only surface across a full conversation.
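The two shapes differ mainly in the turn limit and prompt scope. Illustrative configurations (these dicts mirror the ideas above, not the platform's actual schema):

```python
# Unit test: a single exchange, judged in isolation.
unit_test = {
    "name": "greeting-quality",
    "max_session_turns": 1,  # one user message, one agent reply
    "synthetic_user_prompt": "Open with: 'Hi, what can you help me with?'",
    "evaluator_prompt": "Pass only if the greeting is warm and on-brand.",
}

# End-to-end test: a full multi-turn workflow.
end_to_end_test = {
    "name": "unauthorized-charge-flow",
    "max_session_turns": 10,  # earlier turns shape later ones
    "synthetic_user_prompt": "You want to dispute an unauthorized charge.",
    "evaluator_prompt": "Pass only if identity is verified and a dispute is opened.",
}

print(unit_test["max_session_turns"], end_to_end_test["max_session_turns"])
```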
Scaling Your Tests
LLM-powered agents are non-deterministic. The same input can produce different responses each time, which means an agent might fail a task only 1 in 100 or 1 in 1,000 times. Manual testing cannot catch these rare failures, but they add up when your agent handles thousands of conversations in production.
At 10,000 monthly conversations, for example:
1/100: ~100 failures per month, likely noticed by users
1/1,000: ~10 failures per month, hard to detect manually
1/10,000: ~1 failure per month, virtually invisible without automated testing
Start Small, Then Scale
1-3 sessions: Confirm your synthetic user and evaluator prompts work as expected. Check that the evaluator produces consistent, accurate verdicts
10-30 sessions: Look for patterns in the results. Is the evaluator too strict or too lenient? Adjust criteria before committing to a large run
50+ sessions: Get statistically meaningful pass rates. At this volume, you can distinguish between a 99% and 99.9% success rate
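The statistical intuition behind those thresholds: the 95% margin of error on an observed pass rate shrinks with the square root of the session count. A rough normal-approximation sketch (illustrative, not a platform feature):

```python
import math

# 95% margin of error for an observed pass rate p_hat over n sessions
# (normal approximation to the binomial).
def margin_of_error(p_hat: float, n: int) -> float:
    return 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)

# With few sessions, the uncertainty dwarfs the gap between reliability
# tiers; larger runs narrow it enough to tell them apart.
print(round(margin_of_error(0.99, 10), 3))   # 0.062
print(round(margin_of_error(0.99, 500), 3))  # 0.009
```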
Adversarial Testing
One of the most valuable uses of automated testing is red-teaming your agent. Create synthetic users that actively try to break your agent's behavior:
Jailbreak attempts: Users trying to make the agent ignore instructions or reveal system prompts
Social engineering: Users manipulating the agent through persuasion tactics (authority, urgency, reciprocity)
Boundary testing: Rapid topic changes, contradictory requests, or attempts to push the agent off-script
Safety testing: Users trying to elicit offensive, biased, or inappropriate responses
Best Practices
Start small, then scale: Validate prompts with 5-10 sessions before running hundreds. Calibrate your evaluator to avoid discovering it was too strict or lenient after 100 runs.
Use the right model for the job: Smaller models (e.g., gpt-4.1-mini) for synthetic users, more capable models (e.g., gpt-4.1) for evaluators.
Test adversarial scenarios: Create synthetic users that try to jailbreak, socially engineer, or push your agent past its boundaries. These red-team tests catch safety issues before real users do.
Tag your sessions: Use session tags to filter and group test results in Observatory. Organize tests by type (functional, safety, compliance) and risk level.
Calibrate evaluator strictness: High-risk scenarios (security, compliance) need strict pass/fail criteria. General quality tests can be more forgiving.
Re-run after changes: Execute tests after every agent update to catch regressions. Compare pass rates across runs to track improvement or degradation.
You now know how to create automated tests, configure synthetic users and evaluators, and interpret results. Create your first test and start validating your agent at scale.

