Version: 1.0-draft | Status: Open RFC | Published: May 2026 Reference implementation: farscry
Every agent framework handles visual context differently. Devin sends raw images. Claude Code re-processes every step. No standard exists for how agents represent or share visual UI state.
Vision APIs describe. VASP gives coordinates.
A typed, deterministic, offline interchange format for screenshot-derived visual context. Works without app cooperation, without VLM, without GPU.
vasp_version: "1.0"
state_id: phash:<16-char-hex>
screen_type: error | config | terminal | conversation | ui | unknown
confidence: high | medium | low | none
lang: string
agent_context: string (one-line summary for the agent)
delta_from: phash:<hex> | null
context_similarity: 0.0-1.0 | null
context_changed: bool | null
phash:<16-char-lowercase-hex>| Value | Description |
|---|---|
error |
Error messages, stack traces, failed states |
config |
Forms, settings panels, configuration screens |
terminal |
Shell output, command prompts, build logs |
conversation |
Chat interfaces, message threads |
ui |
General UI (buttons, nav, dashboards) |
unknown |
Unclassified |
Compute token overlap between before/after OCR outputs.
If overlap < 0.20: emit context_changed: true, skip diff.
Prevents diffing unrelated UIs.
=== farscry visual context ===
source: screenshot.png
screen_type: config
state_id: phash:8f4a2c9d1e3b7f6a
confidence: high
lang: eng
agent_context: "Payment settings - Save available"
---
[top-center] heading "Payment Settings"
[middle-left] label "Max Value:"
[middle-center] input value="1500" editable:true
[middle-right] button "Save Changes" enabled:true
[bottom] error "Value must be <= 10000"
affordances:
click -> "Save Changes" at (400,300) enabled:true
type -> "Max Value" at (200,120) current:"1500"
=== farscry diff ===
delta_from: phash:8f4a2c9d1e3b7f6a
state_id: phash:3d9b1e4f2a8c7b5e
context_similarity: 0.923
context_changed: false
---
appeared: error "Card declined" at (20,350)
changed: button "Submit" -> "Processing..." disabled:true
unchanged: [3 elements]
33% of macOS apps have broken or missing AX trees. VASP works on any screenshot, any app, any platform.
Vision APIs return prose. Agents guess coordinates. VASP returns typed coordinates. Agents act with precision. ~9x fewer tokens on 1080p. ~16x fewer on 4K. $0. Offline.
Open RFC. Comments welcome. Open an issue: https://github.com/vasp-protocol/spec/issues
Apache 2.0