
# AI Safety

Conversational AI introduces a category of risk that traditional software does not have. Agents can invent answers, follow instructions hidden in user input, drift across long conversations, and produce content nobody can trace back to a source. This page covers the controls the platform provides to address each of these risks.

### Our Responsible AI Principles

These principles shape every product decision and every safeguard described on this page.

<table data-column-title-hidden data-view="cards"><thead><tr><th>Title</th><th>Description</th></tr></thead><tbody><tr><td><h4><i class="fa-book-open">:book-open:</i></h4><h4>Grounded Answers</h4></td><td>Agents speak from your verified sources, with citations on every response</td></tr><tr><td><h4><i class="fa-vector-square">:vector-square:</i></h4><h4>Bounded Behavior</h4></td><td>Agents stay within their defined role, topic, and audience</td></tr><tr><td><h4><i class="fa-user-shield">:user-shield:</i></h4><h4>Privacy by Design</h4></td><td>Personal data is anonymized, encrypted, and retained only as long as configured</td></tr><tr><td><h4><i class="fa-eye">:eye:</i></h4><h4>Observable Model Calls</h4></td><td>Every LLM input and output is logged and reviewable in Observatory</td></tr><tr><td><h4><i class="fa-headset">:headset:</i></h4><h4>Human Oversight</h4></td><td>Agents can escalate or hand off when they should not answer alone</td></tr><tr><td><h4><i class="fa-vials">:vials:</i></h4><h4>Automated Testing</h4></td><td>Synthetic conversations stress-test agent behavior at scale</td></tr></tbody></table>

### Risks We Address

AI risk is different from traditional software risk because the output is generated, not retrieved. That single property opens up failure modes that do not exist elsewhere: invented facts, hijacked instructions, unsafe content, and leaked data. Each has a specific control on the platform.

<table><thead><tr><th width="280">Risk</th><th>How it is Handled</th></tr></thead><tbody><tr><td>Hallucinated or out-of-scope answers</td><td>RAG+C pipeline enforces answers from retrieved sources, with citations</td></tr><tr><td>Prompt injection and jailbreak attempts</td><td>Input guardrails inspect user messages before they reach the model</td></tr><tr><td>Inappropriate or unsafe model outputs</td><td>Output guardrails filter responses before delivery</td></tr><tr><td>Sensitive data leaking to the model</td><td>User messages are optionally anonymized per agent, before reaching storage or the model</td></tr><tr><td>Unbounded behavior in multi-step agents</td><td>Low-code workflow design, automated end-to-end testing, and human escalation</td></tr></tbody></table>

### Grounded Answers With RAG+C

The single largest source of AI risk in production is the model inventing a confident but incorrect answer. Helvia.ai addresses this with a Retrieval Augmented Generation with Citation (RAG+C) toolkit: a knowledge base of indexed sources, a semantic search node that retrieves the relevant ones, and LLM nodes that rewrite queries and write cited answers from the retrieved context.

How those tools come together is the builder's call, with the agent template's workflow as a starting point. The builder connects the sources, tunes how queries are rewritten, adds ranking or filtering, and decides how citations are presented. Grounding is a workflow you shape.

{% stepper %}
{% step %}

#### Sources Become Citable Chunks

When a knowledge source is connected to an agent, the platform parses it into self-contained, human-readable segments. Each segment is small enough to serve as a citation and large enough to carry its own meaning, so every answer can point back to the exact passage that produced it.
{% endstep %}

{% step %}

#### Retrieval Filters the Context

Before any text reaches the language model, a semantic search node shortlists the relevant segments from the connected knowledge base. The LLM only ever sees content selected by retrieval, never the full corpus and never arbitrary internet content.
{% endstep %}

{% step %}

#### Selection Is Enforced

The workflow's prompt and selection logic keep the model answering strictly from the retrieved segments. If nothing relevant was retrieved, the agent says so rather than improvising.
{% endstep %}
{% endstepper %}
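
The three steps above can be pictured with a minimal Python sketch. It is an illustration only, not the platform's implementation: the names `Segment`, `chunk`, `retrieve`, and `answer` are hypothetical, paragraph splitting stands in for the platform's parser, and a toy word-overlap score stands in for semantic search.

```python
from dataclasses import dataclass

# Hypothetical stopword list, used only by the toy retrieval below.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "on", "in", "to", "do", "how", "what"}


@dataclass(frozen=True)
class Segment:
    source: str  # assumed source identifier, e.g. a file name
    index: int   # position of the segment within its source
    text: str


def chunk(source: str, document: str) -> list[Segment]:
    """Step 1: split a document into paragraph-sized, citable segments."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    return [Segment(source, i, p) for i, p in enumerate(paragraphs)]


def tokens(text: str) -> set[str]:
    """Lowercased words with punctuation and stopwords removed."""
    return {w.strip(".,?!").lower() for w in text.split()} - STOPWORDS


def retrieve(question: str, segments: list[Segment], top_k: int = 2) -> list[Segment]:
    """Step 2: shortlist segments (word overlap stands in for semantic search)."""
    q = tokens(question)
    scored = [(len(q & tokens(s.text)), s) for s in segments]
    relevant = [pair for pair in scored if pair[0] > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in relevant[:top_k]]


def answer(question: str, segments: list[Segment]) -> dict:
    """Step 3: answer only from retrieved segments, or say nothing was found."""
    context = retrieve(question, segments)
    if not context:
        return {"answer": "I could not find this in the connected sources.", "citations": []}
    # A real workflow would call an LLM node here with only `context` in its prompt.
    # This sketch simply returns the top segment verbatim.
    return {
        "answer": context[0].text,
        "citations": [f"{s.source}#{s.index}" for s in context],
    }
```

Calling `answer("How long do refunds take?", kb)` on a knowledge base built with `chunk` returns the matching passage together with a citation such as `returns.md#0`, while an unrelated question returns the fallback message with an empty citation list.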

Two additional mechanisms reinforce grounding:

* **Citations on every response:** users can see which source produced each answer and verify it
* **Query rewriting:** ambiguous or under-specified questions are rewritten before retrieval, so the right context reaches the model

```mermaid
graph LR
    KB[(Knowledge Base)] <-.-> R
    Q([Question]) --> QR[Query Rewrite] --> R[Retriever]

    subgraph RC[Ranked Context]
        direction TB
        C1[Segment]
        C2[Segment]
        C3[...]
        C4[Segment]
        C5[Segment]
    end

    R --> C1
    R --> C2
    R --> C3
    R --> C4
    R --> C5
    C1 --> G[Generator]
    C2 --> G
    C3 --> G
    C4 --> G
    C5 --> G
    G --> BC[Best Context]
    G --> A[Answer]

    style Q fill:#E5E7EB,stroke:#9CA3AF,color:#1F2937
    style QR fill:#615DEC,stroke:#615DEC,color:#fff
    style R fill:#615DEC,stroke:#615DEC,color:#fff
    style KB fill:#DBEAFE,stroke:#3B82F6,color:#1E3A8A  
    style C1 fill:#DBEAFE,stroke:#3B82F6,color:#1E3A8A  
    style C2 fill:#DBEAFE,stroke:#3B82F6,color:#1E3A8A  
    style C3 fill:#DBEAFE,stroke:#3B82F6,color:#1E3A8A  
    style C4 fill:#DBEAFE,stroke:#3B82F6,color:#1E3A8A  
    style C5 fill:#DBEAFE,stroke:#3B82F6,color:#1E3A8A  
    style G fill:#615DEC,stroke:#615DEC,color:#fff
    style BC fill:#E5E7EB,stroke:#9CA3AF,color:#1F2937
    style A fill:#E5E7EB,stroke:#9CA3AF,color:#1F2937
    style RC fill:none,stroke:#9CA3AF
```

### Workflow Guardrails

Guardrails are built into the workflow. Place an LLM node upstream of the main LLM to inspect user messages, or downstream to check responses before they reach the user. You define what the LLM flags, blocks, or rewrites.

{% tabs %}
{% tab title="Input Guardrails" %}
Add an LLM node upstream of the main LLM, with a prompt that inspects the user's message before it reaches the model. Depending on what the node returns, the workflow:

* Blocks the message and returns a safe refusal
* Forwards it untouched
* Forwards a sanitized version

Common uses include screening for prompt injection patterns, jailbreak attempts, or instructions that conflict with the agent's defined role.
{% endtab %}

{% tab title="Output Guardrails" %}
Add an LLM node downstream of the main LLM, with a prompt that inspects the agent's response before it reaches the user. Depending on what the node returns, the workflow:

* Blocks the response and returns a safe alternative
* Forwards it untouched
* Forwards a rewritten version

Common uses include checking for off-policy content, sensitive information that should not appear in customer-facing responses, or tone violations.
{% endtab %}
{% endtabs %}
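
The block, forward, and rewrite outcomes described in both tabs can be sketched in Python as follows. This is a hedged illustration rather than the platform's node API: `Verdict`, `apply_verdict`, `run_turn`, and the two demo guards are hypothetical names, and the keyword checks only stand in for the LLM nodes a builder would configure.

```python
from dataclasses import dataclass
from typing import Callable, Optional

REFUSAL = "Sorry, I can't help with that request."
SAFE_ALTERNATIVE = "I'm not able to share that. Is there anything else I can help with?"


@dataclass
class Verdict:
    action: str                 # "block", "forward", or "rewrite"
    text: Optional[str] = None  # replacement text when action == "rewrite"


Guardrail = Callable[[str], Verdict]


def apply_verdict(verdict: Verdict, original: str) -> Optional[str]:
    """Return the text to pass on, or None when the guardrail blocks."""
    if verdict.action == "block":
        return None
    if verdict.action == "rewrite" and verdict.text is not None:
        return verdict.text
    return original


def run_turn(message: str, input_guard: Guardrail,
             main_llm: Callable[[str], str], output_guard: Guardrail) -> str:
    """Run one turn: input guardrail, main LLM, then output guardrail."""
    checked = apply_verdict(input_guard(message), message)
    if checked is None:
        return REFUSAL
    response = main_llm(checked)
    delivered = apply_verdict(output_guard(response), response)
    return SAFE_ALTERNATIVE if delivered is None else delivered


def demo_input_guard(message: str) -> Verdict:
    # Stand-in for an LLM node. A keyword check is NOT a real injection detector.
    if "ignore previous instructions" in message.lower():
        return Verdict("block")
    return Verdict("forward")


def demo_output_guard(response: str) -> Verdict:
    # Stand-in for an LLM node that rewrites responses mentioning internal material.
    if "internal" in response.lower():
        return Verdict("rewrite", "I can't share that detail.")
    return Verdict("forward")
```

Keeping the verdict separate from the routing logic means the same `apply_verdict` step serves both the upstream and the downstream guardrail.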

{% hint style="success" %}
**Apply guardrails to agents:** Input validation is the recommended baseline when the agent is exposed to untrusted users.
{% endhint %}

### Safe LLM Integration

Every LLM call the platform makes runs through the same controls:

<table><thead><tr><th width="240">Control</th><th>How It Works</th></tr></thead><tbody><tr><td>Encrypted transmission</td><td>All traffic to LLM providers is encrypted with TLS, end-to-end</td></tr><tr><td>Anonymization before invocation</td><td>Optional PII detection runs on user messages and replaces sensitive entities before they reach the model</td></tr><tr><td>Customer-owned accounts</td><td>Calls run through your own provider credentials, so data-use terms are governed by your contract</td></tr><tr><td>Full input and output logging</td><td>Every prompt sent and response received is logged at session level and reviewable in Observatory</td></tr><tr><td>Vetted providers</td><td>Each LLM provider is reviewed against recognized standards (ISO 27001, SOC 2) before integration</td></tr></tbody></table>
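
To make the anonymization control in the table above concrete, here is a minimal Python sketch of replacing sensitive entities before a message is sent to a model. It assumes two simple regular expressions and a `[LABEL]` placeholder format. The platform's actual PII detection, entity types, and placeholder format are not described on this page, so everything below is hypothetical.

```python
import re

# Hypothetical patterns. Real PII detection covers many more entity types
# and typically does not rely on regular expressions alone.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def anonymize(message: str) -> str:
    """Replace each detected entity with a placeholder naming its type."""
    # Emails are replaced first so digits inside an address are not
    # mistaken for a phone number.
    for label, pattern in PATTERNS.items():
        message = pattern.sub(f"[{label}]", message)
    return message
```

With these patterns, `anonymize("Contact me at jane.doe@example.com or +30 210 123 4567.")` returns `"Contact me at [EMAIL] or [PHONE]."`, and a message without matching entities passes through unchanged.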

### Testing and Evaluation

Agents can respond differently to the same input from one run to the next, so a single manual test only tells you what happened once. The platform's testing framework runs synthetic conversations at scale and judges each one against your success criteria, so non-deterministic failures surface as a pass rate instead of slipping through.

Every test involves three roles:

* **Synthetic user:** an LLM-powered persona that simulates a real user, following the scenario and goals you define in its prompt
* **Agent under test:** the agent being evaluated, talking to the synthetic user as it would to a real one
* **Evaluator:** a separate LLM that reads the full transcript and returns a pass-or-fail verdict with an explanation, judged against your success criteria

Each test runs across many sessions, so behavior you would otherwise see only intermittently surfaces as a statistical pattern. Configure and run tests to cover hallucination prevention, scenario handling, tone, adversarial inputs, and any other behavior worth validating before deployment.
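
The three roles and the pass rate can be sketched in Python as shown below. This is an illustration under stated assumptions, not the platform's testing framework: `run_session`, `pass_rate`, and the three stub roles are hypothetical, and plain functions stand in for the LLM-backed synthetic user, agent, and evaluator.

```python
from itertools import cycle

Transcript = list[tuple[str, str]]  # (speaker, text) pairs
Verdict = tuple[bool, str]          # (passed, explanation)


def run_session(user, agent, evaluator, turns: int = 1) -> Verdict:
    """Play one synthetic conversation, then have the evaluator judge it."""
    transcript: Transcript = []
    for _ in range(turns):
        transcript.append(("user", user(transcript)))
        transcript.append(("agent", agent(transcript)))
    return evaluator(transcript)


def pass_rate(sessions: int, user, agent, evaluator) -> float:
    """Fraction of sessions that the evaluator marks as passed."""
    verdicts = [run_session(user, agent, evaluator) for _ in range(sessions)]
    return sum(passed for passed, _ in verdicts) / sessions


# Stand-ins for the three LLM-backed roles.
def synthetic_user(transcript: Transcript) -> str:
    return "How long do refunds take?"


# Simulated non-determinism: every fourth reply contradicts the policy.
_replies = cycle(["Refunds take 14 days."] * 3 + ["Refunds are instant."])


def flaky_agent(transcript: Transcript) -> str:
    return next(_replies)


def evaluator(transcript: Transcript) -> Verdict:
    ok = all("14 days" in text for speaker, text in transcript if speaker == "agent")
    explanation = "policy stated correctly" if ok else "agent contradicted the refund policy"
    return ok, explanation
```

A single session with `flaky_agent` would most likely pass and hide the problem. Running `pass_rate(20, synthetic_user, flaky_agent, evaluator)` from a fresh state gives `0.75`, which exposes the intermittent failure as a number you can track.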

### Agent Behavior Boundaries

Several behavior boundaries are set by the builder during agent construction rather than applied automatically by the platform.

<details>

<summary><strong>Personality and role variable</strong></summary>

Each agent has a configurable text variable that describes how it should respond and what role it plays. Update it to refine the agent's behavior without touching the rest of the workflow.

</details>

<details>

<summary><strong>Escalation paths</strong></summary>

Agents do not escalate automatically. The builder decides when and how a conversation should hand off to a human, and wires those conditions into the workflow as explicit nodes. Common triggers include user intent, sentiment, or repeated unanswered questions.

</details>
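
As a rough picture of an escalation condition, the Python sketch below combines two of the triggers mentioned above: explicit user intent and repeated unanswered questions. In the platform these conditions are wired as workflow nodes. The names `ConversationState`, `HANDOFF_PHRASES`, and `should_escalate`, the phrase list, and the threshold of two are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical phrases that signal the user wants a human.
HANDOFF_PHRASES = ("talk to a human", "speak to a person")


@dataclass
class ConversationState:
    unanswered_streak: int = 0  # consecutive turns the agent could not answer


def should_escalate(state: ConversationState, user_message: str,
                    answered: bool, max_unanswered: int = 2) -> bool:
    """Update the streak for this turn and decide whether to hand off."""
    state.unanswered_streak = 0 if answered else state.unanswered_streak + 1
    asked_for_human = any(p in user_message.lower() for p in HANDOFF_PHRASES)
    return asked_for_human or state.unanswered_streak >= max_unanswered
```

In this sketch a conversation hands off on the second consecutive unanswered question, or immediately when the user asks for a human.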

{% hint style="success" %}
You now have a clear view of how Helvia.ai keeps AI agents grounded, how it bounds their behavior, how it filters misuse, and how it tests them at scale.
{% endhint %}

