Automated Red-Teaming for AI Agents

Trident’s red-teaming engine automatically attacks your AI agent the way a skilled adversary would — probing for prompt injection flaws, jailbreaks, tool-call hijacks, data exfiltration paths, and more — then scores every finding against the OWASP Agentic Top-10. Use red-team campaigns as pre-deployment gates to catch vulnerabilities before users do, or run them periodically against production agents to detect regressions when you update a system prompt, add new tools, or upgrade models.

Attack categories

Trident’s attack library covers 200+ distinct vectors across 10 categories. Each category maps directly to an OWASP Agentic Top-10 risk.

Category	Description
Prompt injection	Direct instructions embedded in user input that attempt to override the agent’s system prompt
Jailbreaks	16+ patterns that attempt to disable safety guardrails through persona hijacking, hypothetical framing, and authority spoofing
Encoding bypass	15+ obfuscation techniques — Base64, ROT13, Unicode homoglyphs, zero-width characters — designed to evade keyword-based filters
RAG poisoning	Adversarial documents injected into retrieval corpora to redirect the agent’s reasoning at query time
Multi-turn social engineering	Gradual escalation sequences that build rapport and authority across many turns before attempting exploitation
Tool-call hijack	Payloads that coerce the agent into calling a tool with attacker-controlled arguments
MCP exploitation	Attacks targeting Model Context Protocol tool descriptions and server responses
Sandbox escape via tools	Attempts to execute unintended system commands or access out-of-scope resources through tool interfaces
Indirect prompt injection	Instructions embedded in retrieved content (web pages, documents, email bodies, database rows) that the agent processes as trusted data
Resource exhaustion	Inputs designed to cause excessive token consumption, infinite loops, or cost amplification

Run a campaign from the dashboard

Open the Red-Team tab

Navigate to the Red-Team tab in the Trident dashboard.

Select your agent

Choose the agent you want to test from the agent selector. If you have already called trident.init() with an agentId, the agent appears in the list automatically.

Configure the attack scope

Choose a scan mode:

Quick — a targeted subset of high-signal attack vectors, suitable for fast feedback during development
Standard — broad coverage across all 10 categories, recommended for pre-deployment gates
Deep — exhaustive multi-phase assessment with up to 300 turns per attack class
Exhaustive — every skill in the library, audit-grade coverage for quarterly reviews

Optionally restrict the run to specific attack categories, or add context such as canary tokens and high-blast tool names.

Run

Click Run campaign. The campaign is queued and results stream into the Findings inbox as the attacks complete.

Trigger a campaign via API

You can enqueue a red-team campaign programmatically using the REST API. This is the recommended approach for CI/CD pipelines. Start a campaign:

curl -X POST https://app.tryvouch.ai/api/public/trident/redteam/campaign \
  -H "Authorization: Basic $(echo -n 'pk-...:sk-...' | base64 | tr -d '\n')" \
  -H "Content-Type: application/json" \
  -d '{
    "agentId": "prod-rag-bot",
    "scanMode": "standard",
    "target": {
      "kind": "openai-chat",
      "baseUrl": "https://your-agent.example.com/api/chat",
      "systemPrompt": "You are a helpful assistant..."
    }
  }'

Response:

{
  "jobId": "a1b2c3d4-...",
  "statusUrl": "/api/public/trident/redteam/campaign/a1b2c3d4-...",
  "findingsUrl": "/api/public/trident/findings?redteamRunId=a1b2c3d4-...&sinceDays=1",
  "forecast": {
    "mode": "standard",
    "expectedCostUsd": 2.40,
    "costRangeUsd": { "low": 1.68, "high": 3.12 },
    "expectedDuration": "8–12 minutes",
    "hardCostCapUsd": 10.00,
    "cacheHitRate": 0.35,
    "derivation": "6 skills × avg $0.40/skill"
  }
}
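For pipelines that are scripted in Python rather than shell, the same request can be sketched with the standard library alone. This mirrors the curl call above and nothing more: the function names (`basic_auth_header`, `build_campaign_request`, `start_campaign`) are illustrative and not part of any Trident SDK, and the 30-second timeout is an assumption.

```python
import base64
import json
import urllib.request

BASE_URL = "https://app.tryvouch.ai"
CAMPAIGN_PATH = "/api/public/trident/redteam/campaign"


def basic_auth_header(public_key: str, secret_key: str) -> str:
    """Return the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{public_key}:{secret_key}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def build_campaign_request(public_key, secret_key, agent_id, scan_mode, target):
    """Build (but do not send) the POST request that enqueues a campaign."""
    body = {"agentId": agent_id, "scanMode": scan_mode, "target": target}
    return urllib.request.Request(
        BASE_URL + CAMPAIGN_PATH,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": basic_auth_header(public_key, secret_key),
            "Content-Type": "application/json",
        },
        method="POST",
    )


def start_campaign(request: urllib.request.Request) -> dict:
    """Send the request and return the parsed JSON response (jobId, forecast, ...)."""
    with urllib.request.urlopen(request, timeout=30) as response:  # timeout is an assumption
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect the headers and body in a unit test without touching the network. The `forecast.expectedCostUsd` value in the response could then be compared against a budget before the pipeline proceeds.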

Poll campaign status

Use the jobId returned by the POST endpoint to check the campaign’s progress:

curl https://app.tryvouch.ai/api/public/trident/redteam/campaign/a1b2c3d4-... \
  -H "Authorization: Basic $(echo -n 'pk-...:sk-...' | base64 | tr -d '\n')"

Response:

{
  "jobId": "a1b2c3d4-...",
  "state": "active",
  "enqueuedAt": 1748390400000,
  "startedAt": 1748390405000,
  "finishedAt": null,
  "progress": { "skillsComplete": 3, "skillsTotal": 6 },
  "failedReason": null,
  "findingCount": 2
}

The state field progresses through queued → active → completed (or failed). Once the state is completed, query the findingsUrl to retrieve all findings from the run.
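The polling loop implied by that state progression might look like the sketch below. It takes any callable that returns the status JSON, so the HTTP call itself is left to the caller. The name `wait_for_campaign`, the 15-second poll interval, and the one-hour timeout are assumptions chosen for illustration, not documented defaults.

```python
import time
from typing import Callable

TERMINAL_STATES = {"completed", "failed"}


def wait_for_campaign(
    fetch_status: Callable[[], dict],
    poll_seconds: float = 15.0,      # assumed interval, tune for your pipeline
    timeout_seconds: float = 3600.0,  # assumed upper bound
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until the campaign reaches a terminal state and return the final status.

    Raises RuntimeError if the campaign fails and TimeoutError if it does not
    finish before the deadline.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status()
        state = status["state"]
        if state == "failed":
            raise RuntimeError(f"campaign failed: {status.get('failedReason')}")
        if state in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"campaign still '{state}' after {timeout_seconds}s")
        sleep(poll_seconds)
```

Injecting `sleep` and `clock` keeps the loop testable with a scripted sequence of status payloads.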

CI/CD integration

You can gate deployments on red-team results by calling the campaign API in your pipeline and polling until the run completes. For a full example with GitHub Actions and a pass/fail threshold, see the CI/CD integration guide.
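As a rough illustration of a pass/fail threshold, the sketch below blocks a deployment when any finding is at or above a chosen severity. It assumes each finding is a dictionary with a `severity` key holding one of LOW, MEDIUM, HIGH, or CRITICAL; the actual shape of the findings response is not shown on this page, so treat the field name and the `gate` helper as hypothetical.

```python
from typing import Iterable, List, Tuple

# Ordered from least to most severe, matching the labels used in campaign results.
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


def gate(findings: Iterable[dict], fail_at: str = "HIGH") -> Tuple[bool, List[dict]]:
    """Return (passed, blocking_findings) for the given severity threshold.

    A finding blocks the deployment when its severity is at or above `fail_at`.
    The "severity" key is an assumed field name.
    """
    threshold = SEVERITY_ORDER.index(fail_at)
    blocking = [
        finding
        for finding in findings
        if SEVERITY_ORDER.index(finding["severity"]) >= threshold
    ]
    return (not blocking, blocking)
```

In a pipeline step, the boolean would typically become the process exit code, for example `sys.exit(0 if passed else 1)`.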

Campaign results

Every finding from a red-team campaign appears in the Findings inbox with the following fields:

Attack transcript — the full multi-turn conversation between the Trident attacker and your agent
Severity — AIVSS score mapped to LOW, MEDIUM, HIGH, or CRITICAL
OWASP category — the Agentic Top-10 code (e.g. LLM01, LLM06) that the finding maps to
Skill ID — the specific attack vector that triggered the finding

Orchestrators and judging

Trident uses three orchestration strategies depending on the attack category:

Crescendo — gradually escalates prompts across multiple turns, building context and authority before attempting exploitation. Effective for social engineering and multi-turn attacks.
Converter pipeline — applies deterministic encoding and obfuscation transforms to payloads. Covers the full encoding bypass category.
Multi-agent campaign — coordinates independent attacker agents working different angles simultaneously.
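To make the converter idea concrete, here is a minimal sketch of deterministic transforms applied to a single payload, along with a naive keyword filter that the transformed variants slip past. It is not Trident's implementation: the converter names, the sample payload, and the filter are all illustrative, and only three of the techniques named in the attack categories (Base64, ROT13, zero-width characters) are shown.

```python
import base64
import codecs

ZERO_WIDTH_SPACE = "\u200b"


def to_base64(text: str) -> str:
    """Encode the payload as Base64."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def to_rot13(text: str) -> str:
    """Rotate ASCII letters by 13 places."""
    return codecs.encode(text, "rot13")


def with_zero_width(text: str) -> str:
    """Insert a zero-width space between every pair of characters."""
    return ZERO_WIDTH_SPACE.join(text)


CONVERTERS = {
    "base64": to_base64,
    "rot13": to_rot13,
    "zero_width": with_zero_width,
}


def keyword_filter_blocks(text: str, blocklist=("ignore previous instructions",)) -> bool:
    """Naive filter: block when any blocklisted phrase appears verbatim."""
    lowered = text.lower()
    return any(term in lowered for term in blocklist)


payload = "Ignore previous instructions"
variants = {name: convert(payload) for name, convert in CONVERTERS.items()}
```

The raw payload is caught by the filter, while none of the three variants contain the blocklisted phrase verbatim, which is why keyword-only defenses tend to be insufficient against this category.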

Every agent response is evaluated by a 3-judge ensemble to minimise false positives and false negatives:

Deterministic judge — regex and keyword rules that flag known-bad patterns with high precision
Tool-oracle judge — inspects actual tool call arguments to detect hijacks that a text-only judge would miss
LLM judge — a language model that evaluates whether the agent’s response constitutes a successful exploitation

A finding is filed only when the ensemble reaches a majority verdict, that is, when at least two of the three judges agree that the attack succeeded.
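The majority rule can be sketched in a few lines. The `Verdict` type and the judge labels below are hypothetical and only illustrate the voting logic; how each judge actually reaches its verdict is not covered here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Verdict:
    judge: str       # e.g. "deterministic", "tool-oracle", "llm" (illustrative labels)
    exploited: bool  # True when this judge considers the attack successful


def majority_exploited(verdicts: Sequence[Verdict]) -> bool:
    """Return True when strictly more than half of the judges report exploitation."""
    votes = sum(1 for verdict in verdicts if verdict.exploited)
    return votes * 2 > len(verdicts)
```

With three judges, two agreeing votes are enough to file a finding, and a single dissenting or flagging judge cannot decide the outcome alone.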

Run a red-team campaign every time you update your agent’s system prompt or add new tools. A new tool expands the attack surface — what passed before may not pass now.
