# Guardian AI

Guardian AI is a secondary language model that evaluates every incoming message for safety before the main agent processes it.

## How It Works

  1. A user sends a message through any channel
  2. Before the main agent sees it, Guardian AI evaluates the message
  3. The message is classified into a threat level
  4. Messages at HIGH or above are blocked with an explanation

```
User Message → Guardian AI → Threat Assessment → Allow / Block
                                                   │
                                                   ▼
                                        Main Agent (if allowed)
```
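
That gate can be expressed in a few lines. A minimal sketch, assuming a `guardian` instance of the `GuardianAI` class shown under Implementation, the `ThreatLevel` enum sketched under Threat Levels, and a hypothetical `main_agent.respond` call (none of this wiring is confirmed by the page):

```python
async def handle_message(message: str) -> str:
    # Steps 2-3: Guardian AI classifies the message before the main agent sees it.
    assessment = await guardian.check(message)
    # Step 4: HIGH and CRITICAL messages are blocked with an explanation.
    if assessment.level >= ThreatLevel.HIGH:
        return f"Message blocked by Guardian AI: {assessment.rationale}"
    # Anything below HIGH passes through to the main agent.
    return await main_agent.respond(message)
```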

## Threat Levels

| Level    | Action                  | Example                     |
|----------|-------------------------|-----------------------------|
| NONE     | Allow                   | "What's the weather today?" |
| LOW      | Allow (logged)          | "How do firewalls work?"    |
| MEDIUM   | Allow (logged, flagged) | "Explain SQL injection"     |
| HIGH     | Block                   | Requests to create malware  |
| CRITICAL | Block                   | Attempts to harm systems    |
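
The page doesn't show how threat levels are represented in code. A minimal sketch, assuming an ordered enum with a configurable blocking threshold (`ThreatLevel` and `should_block` are illustrative names, not confirmed API):

```python
from enum import IntEnum


class ThreatLevel(IntEnum):
    """Levels ordered so that comparisons mean 'at least this severe'."""
    NONE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def should_block(level: ThreatLevel, threshold: ThreatLevel = ThreatLevel.HIGH) -> bool:
    """Block messages at or above the threshold (HIGH by default, per the table)."""
    return level >= threshold
```

Using an `IntEnum` keeps the ordering explicit, so "HIGH or above" is a plain `>=` comparison, and the threshold becomes a single adjustable parameter.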

## Implementation

Guardian AI uses AsyncAnthropic directly (not the main agent’s LLM router):

```python
from anthropic import AsyncAnthropic


class GuardianAI:
    def __init__(self, api_key: str):
        # A dedicated client, bypassing the main agent's LLM router.
        self.client = AsyncAnthropic(api_key=api_key)

    async def check(self, message: str) -> ThreatAssessment:
        # Ask the classifier model to rate the incoming message.
        response = await self.client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=1024,  # max_tokens is required by the Messages API
            system="You are a safety classifier...",
            messages=[{"role": "user", "content": message}],
        )
        # ThreatAssessment and _parse_assessment are defined elsewhere.
        return self._parse_assessment(response)
```
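
The page doesn't show `_parse_assessment` or the `ThreatAssessment` type. One plausible shape, assuming the system prompt asks the model to reply as `LEVEL: rationale` (the response format, the dataclass fields, and writing it as a free function rather than a method are all assumptions):

```python
from dataclasses import dataclass

from anthropic.types import Message


@dataclass
class ThreatAssessment:
    level: ThreatLevel  # the ThreatLevel enum sketched above
    rationale: str


def _parse_assessment(response: Message) -> ThreatAssessment:
    # Assumes the classifier replies like "MEDIUM: asks how SQL injection works".
    text = response.content[0].text
    name, _, rationale = text.partition(":")
    return ThreatAssessment(
        level=ThreatLevel[name.strip().upper()],
        rationale=rationale.strip(),
    )
```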

## Configuration

Guardian AI uses the same Anthropic API key as the main agent:

```sh
export POCKETCLAW_ANTHROPIC_API_KEY="sk-ant-..."
```
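
Wiring the key into the classifier is then a single environment lookup; a minimal sketch (variable names are illustrative):

```python
import os

# Guardian AI and the main agent share the same Anthropic credentials.
api_key = os.environ["POCKETCLAW_ANTHROPIC_API_KEY"]
guardian = GuardianAI(api_key=api_key)
```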
> **Note:** Guardian AI adds a small amount of latency to each message (one additional API call). For latency-sensitive deployments, the threat-level threshold at which messages are blocked can be adjusted.