Trident’s red-teaming engine automatically attacks your AI agent the way a skilled adversary would — probing for prompt injection flaws, jailbreaks, tool-call hijacks, data exfiltration paths, and more — then scores every finding against the OWASP Agentic Top-10. Use red-team campaigns as pre-deployment gates to catch vulnerabilities before users do, or run them periodically against production agents to detect regressions when you update a system prompt, add new tools, or upgrade models.

Attack categories

Trident’s attack library covers 200+ distinct vectors across 10 categories. Each category maps directly to an OWASP Agentic Top-10 risk.
| Category | Description |
| --- | --- |
| Prompt injection | Direct instructions embedded in user input that attempt to override the agent’s system prompt |
| Jailbreaks | 16+ patterns that attempt to disable safety guardrails through persona hijacking, hypothetical framing, and authority spoofing |
| Encoding bypass | 15+ obfuscation techniques (Base64, ROT13, Unicode homoglyphs, zero-width characters) designed to evade keyword-based filters |
| RAG poisoning | Adversarial documents injected into retrieval corpora to redirect the agent’s reasoning at query time |
| Multi-turn social engineering | Gradual escalation sequences that build rapport and authority across many turns before attempting exploitation |
| Tool-call hijack | Payloads that coerce the agent into calling a tool with attacker-controlled arguments |
| MCP exploitation | Attacks targeting Model Context Protocol tool descriptions and server responses |
| Sandbox escape via tools | Attempts to execute unintended system commands or access out-of-scope resources through tool interfaces |
| Indirect prompt injection | Instructions embedded in retrieved content (web pages, documents, email bodies, database rows) that the agent processes as trusted data |
| Resource exhaustion | Inputs designed to cause excessive token consumption, infinite loops, or cost amplification |

Run a campaign from the dashboard

  1. Open the Red-Team tab in the Trident dashboard.
  2. Select your agent. Choose the agent you want to test from the agent selector. If you have already called trident.init() with an agentId, the agent appears in the list automatically.
  3. Configure the attack scope. Choose a scan mode:
       • Quick: a targeted subset of high-signal attack vectors, suitable for fast feedback during development
       • Standard: broad coverage across all 10 categories, recommended for pre-deployment gates
       • Deep: exhaustive multi-phase assessment with up to 300 turns per attack class
       • Exhaustive: every skill in the library, audit-grade coverage for quarterly reviews
     Optionally restrict the run to specific attack categories, or add context such as canary tokens and high-blast tool names.
  4. Run. Click Run campaign. The campaign is queued, and results stream into the Findings inbox as the attacks complete.

Trigger a campaign via API

You can enqueue a red-team campaign programmatically using the REST API. This is the recommended approach for CI/CD pipelines.
Start a campaign:
curl -X POST https://app.tryvouch.ai/api/public/trident/redteam/campaign \
  -H "Authorization: Basic $(echo -n 'pk-...:sk-...' | base64)" \
  -H "Content-Type: application/json" \
  -d '{
    "agentId": "prod-rag-bot",
    "scanMode": "standard",
    "target": {
      "kind": "openai-chat",
      "baseUrl": "https://your-agent.example.com/api/chat",
      "systemPrompt": "You are a helpful assistant..."
    }
  }'
Response:
{
  "jobId": "a1b2c3d4-...",
  "statusUrl": "/api/public/trident/redteam/campaign/a1b2c3d4-...",
  "findingsUrl": "/api/public/trident/findings?redteamRunId=a1b2c3d4-...&sinceDays=1",
  "forecast": {
    "mode": "standard",
    "expectedCostUsd": 2.40,
    "costRangeUsd": { "low": 1.68, "high": 3.12 },
    "expectedDuration": "8–12 minutes",
    "hardCostCapUsd": 10.00,
    "cacheHitRate": 0.35,
    "derivation": "6 skills × avg $0.40/skill"
  }
}
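The same request can be assembled from a script. The following is a minimal sketch using only the Python standard library; it builds the request without sending it, so you can inspect it first. The endpoint, headers, and body fields are taken from the curl example above. The helper names (basic_auth_header, build_campaign_request, within_budget) and the idea of comparing the forecast against a budget before proceeding are illustrative assumptions, not part of a Trident SDK.

```python
import base64
import json
import urllib.request

CAMPAIGN_URL = "https://app.tryvouch.ai/api/public/trident/redteam/campaign"


def basic_auth_header(public_key: str, secret_key: str) -> str:
    """Encode 'public:secret' the same way the curl example does."""
    token = base64.b64encode(f"{public_key}:{secret_key}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def build_campaign_request(public_key, secret_key, agent_id, scan_mode, target):
    """Build (but do not send) the POST request from the curl example."""
    body = json.dumps(
        {"agentId": agent_id, "scanMode": scan_mode, "target": target}
    ).encode("utf-8")
    return urllib.request.Request(
        CAMPAIGN_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": basic_auth_header(public_key, secret_key),
            "Content-Type": "application/json",
        },
    )


def within_budget(forecast: dict, max_cost_usd: float) -> bool:
    """Hypothetical check: does the high end of the forecast fit the budget?"""
    return forecast["costRangeUsd"]["high"] <= max_cost_usd
```

To send the request, pass the returned object to urllib.request.urlopen and decode the JSON body of the response. With the sample response above, within_budget(response["forecast"], 5.0) is true because the high end of the range is 3.12.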

Poll campaign status

Use the jobId returned by the POST endpoint to check the campaign’s progress:
curl https://app.tryvouch.ai/api/public/trident/redteam/campaign/a1b2c3d4-... \
  -H "Authorization: Basic $(echo -n 'pk-...:sk-...' | base64)"
Response:
{
  "jobId": "a1b2c3d4-...",
  "state": "active",
  "enqueuedAt": 1748390400000,
  "startedAt": 1748390405000,
  "finishedAt": null,
  "progress": { "skillsComplete": 3, "skillsTotal": 6 },
  "failedReason": null,
  "findingCount": 2
}
The state field progresses through queued → active → completed (or failed). Once the state is completed, query the findingsUrl to retrieve all findings from the run.
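A polling loop around the status endpoint might look like the sketch below. It assumes only the state values described above (queued, active, completed, failed). The function name, the 15-second interval, and the 30-minute timeout are illustrative choices. The HTTP call is passed in as fetch_status so the loop stays independent of any particular client.

```python
import time
from typing import Callable

# States after which the campaign will not change again.
TERMINAL_STATES = {"completed", "failed"}


def wait_for_campaign(
    fetch_status: Callable[[], dict],
    poll_seconds: float = 15.0,       # assumed interval, tune to your pipeline
    timeout_seconds: float = 1800.0,  # assumed ceiling of 30 minutes
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch_status until the campaign is completed or failed."""
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status()
        if status["state"] in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"campaign still {status['state']!r} after {timeout_seconds}s"
            )
        sleep(poll_seconds)
```

Here fetch_status would be a small function that issues the GET request shown above and returns the parsed JSON. When the returned state is failed, the failedReason field in the status response says why.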

CI/CD integration

You can gate deployments on red-team results by calling the campaign API in your pipeline and polling until the run completes. For a full example with GitHub Actions and a pass/fail threshold, see the CI/CD integration guide.
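As a rough illustration of a pass/fail threshold, the sketch below turns a finished campaign into a process exit code. It assumes each finding is a dictionary with a severity field holding one of LOW, MEDIUM, HIGH, or CRITICAL. The severity labels come from this page, but the field name, the function names, and the exit-code convention are assumptions, so check them against the actual findings response.

```python
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


def blocking_findings(findings, fail_at="HIGH"):
    """Return the findings whose severity is at or above the threshold."""
    threshold = SEVERITY_ORDER.index(fail_at)
    return [
        finding
        for finding in findings
        if SEVERITY_ORDER.index(finding["severity"].upper()) >= threshold
    ]


def gate_exit_code(status, findings, fail_at="HIGH"):
    """Map a campaign result to an exit code for a CI step.

    0: campaign completed with no blocking findings
    1: campaign completed with at least one blocking finding
    2: campaign did not complete, so the result cannot be trusted
    """
    if status["state"] != "completed":
        return 2
    return 1 if blocking_findings(findings, fail_at) else 0
```

A pipeline step could end with sys.exit(gate_exit_code(status, findings)) so that any HIGH or CRITICAL finding fails the build, while a failed campaign is reported separately from a vulnerable agent.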

Campaign results

All findings from a red-team campaign appear in the Findings inbox with:
  • Attack transcript — the full multi-turn conversation between the Trident attacker and your agent
  • Severity — AIVSS score mapped to LOW, MEDIUM, HIGH, or CRITICAL
  • OWASP category — the Agentic Top-10 code (e.g. LLM01, LLM06) that the finding maps to
  • Skill ID — the specific attack vector that triggered the finding

Orchestrators and judging

Trident uses three orchestration strategies depending on the attack category:
  • Crescendo — gradually escalates prompts across multiple turns, building context and authority before attempting exploitation. Effective for social engineering and multi-turn attacks.
  • Converter pipeline — applies deterministic encoding and obfuscation transforms to payloads. Covers the full encoding bypass category.
  • Multi-agent campaign — coordinates independent attacker agents working different angles simultaneously.
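To make the converter pipeline idea concrete, the sketch below applies three of the transforms named in the encoding bypass category (Base64, ROT13, zero-width characters) to a harmless probe string and shows why a naive keyword filter misses the encoded variants. This is not Trident's implementation. The function names, the probe text, and the filter are all hypothetical.

```python
import base64
import codecs

ZERO_WIDTH_SPACE = "\u200b"


def to_base64(text: str) -> str:
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def to_rot13(text: str) -> str:
    return codecs.encode(text, "rot_13")


def with_zero_width(text: str) -> str:
    # Insert an invisible character between every pair of characters.
    return ZERO_WIDTH_SPACE.join(text)


# Deterministic transforms, applied one at a time to the same payload.
CONVERTERS = {
    "base64": to_base64,
    "rot13": to_rot13,
    "zero_width": with_zero_width,
}


def run_pipeline(payload: str) -> dict:
    """Return one obfuscated variant of the payload per converter."""
    return {name: convert(payload) for name, convert in CONVERTERS.items()}


def naive_keyword_filter(text: str, blocked=("system prompt",)) -> bool:
    """Return True when the text contains a blocked phrase verbatim."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in blocked)
```

The filter flags the plain probe but none of its variants, which is the gap this category tests for. A common mitigation is to normalise and decode input before filtering.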
Every agent response is evaluated by a 3-judge ensemble to minimise false positives and false negatives:
  1. Deterministic judge — regex and keyword rules that flag known-bad patterns with high precision
  2. Tool-oracle judge — inspects actual tool call arguments to detect hijacks that a text-only judge would miss
  3. LLM judge — a language model that evaluates whether the agent’s response constitutes a successful exploitation
A finding is filed only when a majority of the ensemble (at least two of the three judges) agrees that the attack succeeded.
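The majority rule can be sketched in a few lines. The example assumes each judge returns a plain yes/no verdict. The Verdict type and function name are hypothetical, and how Trident handles abstentions or weighting is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    judge: str       # e.g. "deterministic", "tool-oracle", "llm"
    exploited: bool  # True when this judge considers the attack successful


def should_file_finding(verdicts) -> bool:
    """File a finding only when more than half of the judges agree."""
    votes = sum(1 for verdict in verdicts if verdict.exploited)
    return votes > len(verdicts) / 2
```

With three judges, two agreeing verdicts are enough to file a finding, and a single dissenting judge cannot file or suppress one on its own.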
Run a red-team campaign every time you update your agent’s system prompt or add new tools. A new tool expands the attack surface — what passed before may not pass now.