OCR

The OCR tool extracts text from images using a two-tier approach: GPT-4o Vision as the primary engine with pytesseract as a fallback.

How It Works

  1. Primary: Sends the image to GPT-4o Vision for intelligent text extraction
  2. Fallback: If GPT-4o is unavailable, falls back to pytesseract (local OCR)

GPT-4o Vision produces superior results for complex layouts, handwriting, and non-standard text. Pytesseract works offline but is limited to printed text.

Setup

Terminal window
# For GPT-4o Vision (primary)
export POCKETCLAW_OPENAI_API_KEY="sk-..."
# For pytesseract fallback (optional)
# Install tesseract-ocr system package
sudo apt install tesseract-ocr # Ubuntu/Debian
brew install tesseract # macOS

Usage

User: What does this image say? /path/to/screenshot.png
Agent: [uses ocr tool] → "The image contains the following text..."

Tool Schema

{
"name": "ocr",
"description": "Extract text from an image file",
"input_schema": {
"type": "object",
"properties": {
"image_path": {
"type": "string",
"description": "Path to the image file"
}
},
"required": ["image_path"]
}
}

Policy Group

Belongs to group:media.