Runtime Firewall: Protect Agents from Prompt Injection

The Trident firewall intercepts prompts before they reach your LLM and scans outputs before they leave your agent — blocking prompt injection, jailbreaks, canary leaks, and other attacks in real time. It runs a two-stage decision: your project’s custom deny rules fire first (populated automatically from confirmed findings), and then the LLM Guard ensemble takes over for anything that gets through. You have two ways to integrate it: route traffic through the gateway proxy (zero code change) or call trident.scan() directly in your agent code.

How the two-stage scan works

Stage 1 — Tenant deny rules: Trident checks the prompt against your project’s custom rule bank. This bank is automatically populated when you confirm a finding in the dashboard — confirmed attacks become deny rules within 5 minutes. This stage is fast (pure regex/substring matching, no network hop) and blocks known-bad patterns your agent has already encountered.
Stage 2 — LLM Guard firewall: If no tenant rule fires, the prompt is forwarded to the LLM Guard ensemble for deeper analysis. This stage runs the full scanner suite and returns a per-scanner verdict.

A blocked prompt returns is_valid: false with the matched rule or scanner result included in the response so you can surface a meaningful error to the user.

Integration option 1 — Gateway proxy (recommended)

The gateway proxy is the fastest way to add firewall coverage. Change your LLM client’s baseURL to the Trident gateway endpoint. Every request your agent makes to OpenAI or Anthropic is automatically scanned before being forwarded upstream.

TypeScript — OpenAI
TypeScript — Anthropic
Python — OpenAI
Python — Anthropic

import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://app.tryvouch.ai/api/public/gateway/openai/v1",
  defaultHeaders: {
    "Authorization": `Basic ${Buffer.from(
      `${process.env.TRIDENT_PROJECT_PUBLIC_KEY}:${process.env.TRIDENT_PROJECT_SECRET_KEY}`
    ).toString("base64")}`,
  },
});

// All chat completions are now scanned automatically.
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: userMessage }],
});

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
  baseURL: "https://app.tryvouch.ai/api/public/gateway/anthropic/v1",
  defaultHeaders: {
    "Authorization": `Basic ${Buffer.from(
      `${process.env.TRIDENT_PROJECT_PUBLIC_KEY}:${process.env.TRIDENT_PROJECT_SECRET_KEY}`
    ).toString("base64")}`,
  },
});

// All messages are now scanned automatically.
const response = await anthropic.messages.create({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  messages: [{ role: "user", content: userMessage }],
});

from openai import OpenAI
import base64
import os

credentials = base64.b64encode(
    f"{os.environ['TRIDENT_PROJECT_PUBLIC_KEY']}:{os.environ['TRIDENT_PROJECT_SECRET_KEY']}".encode()
).decode()

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://app.tryvouch.ai/api/public/gateway/openai/v1",
    default_headers={"Authorization": f"Basic {credentials}"},
)

# All chat completions are now scanned automatically.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": user_message}],
)

import anthropic
import base64
import os

credentials = base64.b64encode(
    f"{os.environ['TRIDENT_PROJECT_PUBLIC_KEY']}:{os.environ['TRIDENT_PROJECT_SECRET_KEY']}".encode()
).decode()

client = anthropic.Anthropic(
    api_key=os.environ["ANTHROPIC_API_KEY"],
    base_url="https://app.tryvouch.ai/api/public/gateway/anthropic/v1",
    default_headers={"Authorization": f"Basic {credentials}"},
)

# All messages are now scanned automatically.
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": user_message}],
)

Blocked requests return HTTP 400 with a JSON body explaining why the prompt was rejected:

{
  "error": "blocked",
  "is_valid": false,
  "source": "trident.tenantRule",
  "matched_rule": {
    "id": "rule_01HX...",
    "label": "Indirect injection via document",
    "kind": "substring",
    "scope": "project",
    "snippet": "ignore previous instructions",
    "severity": "HIGH"
  }
}

Integration option 2 — Direct scan

Call trident.scan() before passing a prompt to your LLM. Use this when you need fine-grained control over which inputs are scanned, or when the gateway proxy is not suitable for your architecture.

TypeScript
Python

import { trident } from "@vouch-ai/sdk";

// Initialize once at startup (see Tracing docs).
trident.init({ projectPk: "pk-...", projectSk: "sk-..." });

async function handleUserMessage(userMessage: string) {
  const verdict = await trident.scan({
    prompt: userMessage,
    agentId: "prod-rag-bot", // optional
  });

  if (!verdict.ok || !verdict.is_valid) {
    return "I'm unable to process that request.";
  }

  // Safe to proceed — call your LLM here.
  return await callLLM(userMessage);
}

import vouch_sdk
import os
import urllib.request
import json
import base64

def scan_prompt(prompt: str) -> bool:
    """Returns True if the prompt is safe, False if it should be blocked."""
    pk = os.environ["TRIDENT_PROJECT_PUBLIC_KEY"]
    sk = os.environ["TRIDENT_PROJECT_SECRET_KEY"]
    auth = base64.b64encode(f"{pk}:{sk}".encode()).decode()

    payload = json.dumps({"prompt": prompt}).encode()
    req = urllib.request.Request(
        "https://app.tryvouch.ai/api/public/trident/scan",
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Basic {auth}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=8) as resp:
        result = json.loads(resp.read())
        return result.get("is_valid", True)

def handle_user_message(user_message: str) -> str:
    if not scan_prompt(user_message):
        return "I'm unable to process that request."
    return call_llm(user_message)

The scan() call returns a result object with:

Field	Description
`ok`	`true` if the scan completed (even if the prompt was blocked); `false` on a network or auth error
`is_valid`	`true` = safe to proceed, `false` = block the prompt
`source`	Which stage made the decision: `"trident.tenantRule"` (a project-level deny rule matched), `"trident.orgRule"` (an organisation-wide policy matched), or `"trident.firewall"` (the LLM Guard ensemble ran)
`scanners`	Per-scanner verdicts from LLM Guard (present when stage 2 ran)
`matched_rule`	The deny rule that matched, if `source` is `"trident.tenantRule"` or `"trident.orgRule"`
`latencyMs`	End-to-end scan latency

What the firewall detects

Trident’s firewall runs a suite of specialised scanners on every prompt: Structural prompt injection — detects instruction-shaped text masquerading as data, including fake system: / [INST] / <<SYS>> role headers smuggled into retrieved content, AgentDojo-style <INFORMATION> blocks containing imperatives, and precondition tricks like “before you can answer you must…”. These structural patterns survive re-wording and evade phrase-list filters. Indirect injection in retrieved context — scans tool outputs, RAG documents, and other ingested content for injected instructions before the agent’s reasoning step processes them. Canary token leaks in outputs — if you embed secret canary strings in your system prompt (e.g. TRIDENT-CANARY-7f3a), the firewall blocks any model response that echoes them. A leaked canary indicates either system-prompt exfiltration or a successful injection that coerced the model to repeat hidden context. Jailbreak patterns — LLM Guard’s prompt injection model (deberta-v3-base-prompt-injection-v2) catches known jailbreak families including DAN, role-play bypasses, and hypothetical framing. Custom ban rules — your project’s deny bank, auto-populated from confirmed findings. When you confirm a finding in the dashboard, the attacker’s payload is added as a deny rule and takes effect within 5 minutes.

Confirmed findings → ban rules

Every time you confirm a finding in the Findings inbox, Trident extracts the attack payload and adds it to your project’s deny rule bank. The next scan call that matches the pattern is blocked at stage 1 — before the LLM Guard ensemble even runs. This creates a feedback loop: the more findings you confirm, the faster and more precise the firewall becomes for your specific agent.

Scan modes

Mode	Latency	When to use
Fast	< 50 ms	High-throughput agents where latency budget is tight. Uses regex/keyword matching and structural pattern detection only.
Full	150–300 ms	Production agents where security coverage matters more than raw speed. Runs the complete LLM Guard ensemble including the DeBERTa prompt injection model.

The gateway proxy and trident.scan() both run full mode by default. Fast mode can be enabled at the project level from the dashboard’s Firewall settings.

Viewing firewall events

Open the Firewall tab in the Trident dashboard to see a real-time log of all scanned requests. You can filter by:

Verdict — show only blocked requests, or all requests
Scanner — drill into which scanner triggered a block
Agent — scope the view to a specific agent
Time range — narrow to an incident window

Each firewall event links to the corresponding trace (if tracing is enabled) so you can see the full context around a blocked prompt.

The gateway proxy routes your LLM API calls through Trident’s infrastructure before forwarding them to OpenAI or Anthropic. Review your data residency and compliance requirements before enabling it. If you cannot route traffic through a third-party proxy, use the direct scan integration instead.

​How the two-stage scan works

​Integration option 1 — Gateway proxy (recommended)

​Integration option 2 — Direct scan

​What the firewall detects

​Confirmed findings → ban rules

​Scan modes

​Viewing firewall events

How the two-stage scan works

Integration option 1 — Gateway proxy (recommended)

Integration option 2 — Direct scan

What the firewall detects

Confirmed findings → ban rules

Scan modes

Viewing firewall events