# Guardian AI

Guardian AI is a secondary language model that evaluates every incoming message for safety before the main agent processes it.

## How It Works

  1. A user sends a message through any channel
  2. Before the main agent sees it, Guardian AI evaluates the message
  3. The message is classified into a threat level
  4. Messages at HIGH or above are blocked with an explanation

```
User Message → Guardian AI → Threat Assessment → Allow / Block
                                                   │
                                                   ▼
                                        Main Agent (if allowed)
```
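
That gate can be expressed in a few lines. A minimal sketch, assuming a `guardian` instance of the `GuardianAI` class shown under Implementation, the `ThreatLevel` enum sketched under Threat Levels, and a hypothetical `main_agent.respond` call (none of this wiring is confirmed by the page):

```python
async def handle_message(message: str) -> str:
    # Steps 2-3: Guardian AI classifies the message before the main agent sees it.
    assessment = await guardian.check(message)
    # Step 4: HIGH and CRITICAL messages are blocked with an explanation.
    if assessment.level >= ThreatLevel.HIGH:
        return f"Message blocked by Guardian AI: {assessment.rationale}"
    # Anything below HIGH passes through to the main agent.
    return await main_agent.respond(message)
```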

## Threat Levels

| Level    | Action                  | Example                     |
|----------|-------------------------|-----------------------------|
| NONE     | Allow                   | "What's the weather today?" |
| LOW      | Allow (logged)          | "How do firewalls work?"    |
| MEDIUM   | Allow (logged, flagged) | "Explain SQL injection"     |
| HIGH     | Block                   | Requests to create malware  |
| CRITICAL | Block                   | Attempts to harm systems    |
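
The page doesn't show how threat levels are represented in code. A minimal sketch, assuming an ordered enum with a configurable blocking threshold (`ThreatLevel` and `should_block` are illustrative names, not confirmed API):

```python
from enum import IntEnum


class ThreatLevel(IntEnum):
    """Levels ordered so that comparisons mean 'at least this severe'."""
    NONE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def should_block(level: ThreatLevel, threshold: ThreatLevel = ThreatLevel.HIGH) -> bool:
    """Block messages at or above the threshold (HIGH by default, per the table)."""
    return level >= threshold
```

Using an `IntEnum` keeps the ordering explicit, so "HIGH or above" is a plain `>=` comparison, and the threshold becomes a single adjustable parameter.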

## Implementation

Guardian AI uses AsyncAnthropic directly (not the main agent’s LLM router):

```python
from anthropic import AsyncAnthropic


class GuardianAI:
    def __init__(self, api_key: str):
        # A dedicated client, bypassing the main agent's LLM router.
        self.client = AsyncAnthropic(api_key=api_key)

    async def check(self, message: str) -> ThreatAssessment:
        # Ask the classifier model to rate the incoming message.
        response = await self.client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=1024,  # max_tokens is required by the Messages API
            system="You are a safety classifier...",
            messages=[{"role": "user", "content": message}],
        )
        # ThreatAssessment and _parse_assessment are defined elsewhere.
        return self._parse_assessment(response)
```
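
The page doesn't show `_parse_assessment` or the `ThreatAssessment` type. One plausible shape, assuming the system prompt asks the model to reply as `LEVEL: rationale` (the response format, the dataclass fields, and writing it as a free function rather than a method are all assumptions):

```python
from dataclasses import dataclass

from anthropic.types import Message


@dataclass
class ThreatAssessment:
    level: ThreatLevel  # the ThreatLevel enum sketched above
    rationale: str


def _parse_assessment(response: Message) -> ThreatAssessment:
    # Assumes the classifier replies like "MEDIUM: asks how SQL injection works".
    text = response.content[0].text
    name, _, rationale = text.partition(":")
    return ThreatAssessment(
        level=ThreatLevel[name.strip().upper()],
        rationale=rationale.strip(),
    )
```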

## Configuration

Guardian AI uses the same Anthropic API key as the main agent:

```sh
export POCKETCLAW_ANTHROPIC_API_KEY="sk-ant-..."
```
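
Wiring the key into the classifier is then a single environment lookup; a minimal sketch (variable names are illustrative):

```python
import os

# Guardian AI and the main agent share the same Anthropic credentials.
api_key = os.environ["POCKETCLAW_ANTHROPIC_API_KEY"]
guardian = GuardianAI(api_key=api_key)
```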
> **Note:** Guardian AI adds a small amount of latency to each message (one additional API call). For latency-sensitive deployments, the threat-level threshold at which messages are blocked can be adjusted.