OCR

The OCR tool extracts text from images using a two-tier approach: GPT-4o Vision as the primary engine with pytesseract as a fallback.

How It Works

Primary: Sends the image to GPT-4o Vision for intelligent text extraction
Fallback: If GPT-4o is unavailable, falls back to pytesseract (local OCR)

GPT-4o Vision generally gives better results on complex layouts, handwriting, and non-standard text. Pytesseract runs locally and works offline, but it is limited to printed text.
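The two-tier flow described above can be sketched as a small wrapper that tries one engine and falls back to another on failure. This is a minimal illustration, not the tool's actual implementation: the function name `extract_text`, the `OcrEngine` type, and the convention that an unavailable engine raises an exception are all assumptions made for the example.

```python
from typing import Callable, Optional, Tuple

# Hypothetical engine signature: takes an image path, returns the extracted text.
OcrEngine = Callable[[str], str]


def extract_text(
    image_path: str,
    primary: OcrEngine,
    fallback: Optional[OcrEngine] = None,
) -> Tuple[str, str]:
    """Try the primary engine; if it raises, try the fallback.

    Returns (text, label) where label is "primary" or "fallback".
    """
    try:
        return primary(image_path), "primary"
    except Exception as primary_error:
        # Assumption: "unavailable" surfaces as an exception from the primary engine.
        if fallback is None:
            raise RuntimeError(
                f"primary OCR failed and no fallback is configured: {primary_error}"
            ) from primary_error
        return fallback(image_path), "fallback"
```

In a real setup the primary callable would wrap a GPT-4o Vision request, and the fallback could be something like `lambda p: pytesseract.image_to_string(Image.open(p))`. Passing the engines in as callables keeps the fallback logic testable without network access or a local Tesseract install.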

Setup

# For GPT-4o Vision (primary)
export POCKETCLAW_OPENAI_API_KEY="sk-..."

# For pytesseract fallback (optional)
# Install tesseract-ocr system package
sudo apt install tesseract-ocr  # Ubuntu/Debian
brew install tesseract           # macOS
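
To confirm the setup before running the tool, a quick check like the one below may help. It is only a sketch: the function name `available_engines` and the engine labels are made up for this example, and it only checks that the API key variable is set and that a `tesseract` binary is on the PATH, not that the key is valid or that the pytesseract Python package is importable.

```python
import os
import shutil


def available_engines(env=None, which=shutil.which):
    """Report which OCR engines look usable in the current environment."""
    env = os.environ if env is None else env
    engines = []
    # The primary engine needs the API key from the Setup section.
    if env.get("POCKETCLAW_OPENAI_API_KEY"):
        engines.append("gpt-4o-vision")
    # The fallback needs the tesseract system binary on the PATH.
    if which("tesseract") is not None:
        engines.append("pytesseract")
    return engines


if __name__ == "__main__":
    print(available_engines())
```

An empty list means neither engine is configured, so an OCR request would have nothing to run on.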

Usage

User: What does this image say? /path/to/screenshot.png
Agent: [uses ocr tool] → "The image contains the following text..."

Tool Schema

{
  "name": "ocr",
  "description": "Extract text from an image file",
  "input_schema": {
    "type": "object",
    "properties": {
      "image_path": {
        "type": "string",
        "description": "Path to the image file"
      }
    },
    "required": ["image_path"]
  }
}
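
As an illustration of what the schema accepts, the sketch below checks a set of tool arguments against it by hand. The helper `validate_ocr_input` is hypothetical and covers only the parts of JSON Schema this tool uses (required fields and string types); it is not a general validator and not how the agent itself necessarily validates calls.

```python
# The input schema from above, as a Python dict.
OCR_INPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "image_path": {"type": "string", "description": "Path to the image file"}
    },
    "required": ["image_path"],
}


def validate_ocr_input(arguments):
    """Return a list of problems; an empty list means the arguments fit the schema."""
    if not isinstance(arguments, dict):
        return ["arguments must be a JSON object"]
    problems = []
    for key in OCR_INPUT_SCHEMA["required"]:
        if key not in arguments:
            problems.append(f"missing required field: {key}")
    for key, value in arguments.items():
        spec = OCR_INPUT_SCHEMA["properties"].get(key)
        if spec is not None and spec["type"] == "string" and not isinstance(value, str):
            problems.append(f"field {key} must be a string")
    return problems
```

For example, `validate_ocr_input({"image_path": "/path/to/screenshot.png"})` returns an empty list, while `validate_ocr_input({})` reports the missing `image_path` field.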

Policy Group

Belongs to group:media.

Last updated: February 12, 2026
