OCR
The OCR tool extracts text from images using a two-tier approach: GPT-4o Vision as the primary engine with pytesseract as a fallback.
How It Works
- Primary: Sends the image to GPT-4o Vision for intelligent text extraction
- Fallback: If GPT-4o is unavailable, falls back to pytesseract (local OCR)
GPT-4o Vision produces superior results for complex layouts, handwriting, and non-standard text. Pytesseract works offline but is limited to printed text.
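The two-tier flow described above can be sketched as an ordered chain of engines, where the first one that succeeds wins. This is a minimal illustration, not the tool's actual implementation: `Engine`, `extract_text`, and `pytesseract_engine` are hypothetical names, and treating any exception as "unavailable" is an assumption. The primary engine is left to the caller so the sketch does not guess at how the vision request is made.

```python
from typing import Callable, Sequence

# Hypothetical type: an engine takes an image path and returns extracted text.
Engine = Callable[[str], str]


def extract_text(image_path: str, engines: Sequence[Engine]) -> str:
    """Try each engine in order and return the first successful result."""
    errors = []
    for engine in engines:
        try:
            return engine(image_path)
        except Exception as exc:
            # Assumption: any failure means "unavailable", so move to the next tier.
            name = getattr(engine, "__name__", repr(engine))
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all OCR engines failed: " + "; ".join(errors))


def pytesseract_engine(image_path: str) -> str:
    """Local fallback tier. Imports lazily so this module loads without pytesseract."""
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path))
```

A caller would pass its vision-backed engine first and `pytesseract_engine` second, for example `extract_text(path, [vision_engine, pytesseract_engine])`, where `vision_engine` is whatever function wraps the GPT-4o Vision request. Keeping the engines as plain callables makes the order explicit and lets the chain be tested with stubs.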
Setup
    # For GPT-4o Vision (primary)
    export POCKETCLAW_OPENAI_API_KEY="sk-..."

    # For pytesseract fallback (optional)
    # Install tesseract-ocr system package
    sudo apt install tesseract-ocr   # Ubuntu/Debian
    brew install tesseract           # macOS
Usage
    User: What does this image say? /path/to/screenshot.png
    Agent: [uses ocr tool] → "The image contains the following text..."
Tool Schema
    {
      "name": "ocr",
      "description": "Extract text from an image file",
      "input_schema": {
        "type": "object",
        "properties": {
          "image_path": {
            "type": "string",
            "description": "Path to the image file"
          }
        },
        "required": ["image_path"]
      }
    }
Policy Group
Belongs to group:media.