Vision-Based Web Automation: Why Screenshots Are Replacing Selectors
Traditional browser automation scripts break constantly. A website updates its CSS class names, renames a button ID, or restructures its HTML, and your Selenium script fails at 3am. Vision-based automation takes a different approach: instead of parsing DOM elements, it looks at the screen and decides what to click.
The Selector Problem
DOM-based automation relies on identifying elements through XPath, CSS selectors, or element IDs:
```python
# Traditional approach - breaks when HTML changes
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.find_element(By.XPATH, "//button[@class='submit-btn-primary-v2']")
driver.find_element(By.ID, "checkout-form-submit")
```
This works until:
- The site renames `submit-btn-primary-v2` to `cta-button-main`
- A React update changes the component hierarchy
- The site switches from server-rendered HTML to client-side JavaScript
- A/B testing shows different layouts to different users
The same button exists visually. Users click it without noticing any change. But your automation fails because it depends on invisible implementation details.
How Vision-Based Automation Works
Vision models analyze screenshots the way humans process pages: by recognizing visual patterns, reading text, and understanding spatial relationships.
| Component | Role |
|---|---|
| Screenshot capture | Takes image of current viewport |
| Vision model | Identifies buttons, forms, links by appearance |
| Coordinate mapping | Translates “the blue Submit button” to pixel coordinates |
| Action execution | Clicks at those coordinates via Playwright or similar |
The process runs in a loop: capture screen, analyze with vision model, decide action, execute, repeat.
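Here is a minimal sketch of that loop using Playwright's sync API. The `ask_vision_model` function is a placeholder rather than a real library call; it's assumed to return an action dict such as `{"type": "click", "x": 640, "y": 320}` or `{"type": "done"}`:

```python
# Minimal sketch of the capture -> analyze -> act loop using Playwright's sync API.
# ask_vision_model() is a placeholder, not a real library call: it is assumed to
# return an action dict like {"type": "click", "x": 640, "y": 320} or {"type": "done"}.
from playwright.sync_api import sync_playwright

def ask_vision_model(screenshot: bytes, goal: str) -> dict:
    """Send the screenshot and goal to a multimodal model and parse its reply.
    The implementation depends entirely on your provider."""
    raise NotImplementedError

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://example.com")

    goal = "Click the Submit button"
    for _ in range(10):                            # cap the number of steps
        shot = page.screenshot()                   # 1. capture the current viewport
        action = ask_vision_model(shot, goal)      # 2. let the model decide the next action
        if action["type"] == "done":
            break
        if action["type"] == "click":              # 3. execute at pixel coordinates
            page.mouse.click(action["x"], action["y"])
        page.wait_for_load_state("networkidle")    # let the page settle before the next shot
```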
Skyvern, built by Suchintan Singh, uses this approach for enterprise RPA workflows. The system recognizes a submit button whether it’s styled as a green rectangle, blue pill, or custom graphic. It doesn’t care that the underlying HTML changed from `<button>` to `<div role="button">`.
Vision vs DOM vs Hybrid
Three approaches exist in the current browser agent ecosystem:
| Approach | How it works | Strength | Weakness |
|---|---|---|---|
| DOM-only | Parse HTML, find elements by selectors | Fast, precise coordinates | Breaks on layout changes, misses JS content |
| Vision-only | Screenshot analysis, click by pixel | Adapts to any visual design | Slower, sometimes hallucinates positions |
| Hybrid | DOM parsing + screenshot verification | Best accuracy | Higher token cost, more complexity |
Browser Use, created by Gregor Zunic and Magnus Muller, uses the hybrid approach. It extracts HTML structure for reliable element identification, then uses vision to verify the page looks correct before acting. This catches cases where the DOM says an element exists but it’s hidden or obscured.
From the Browser Use funding announcement:
“A lot of agents rely on vision-based methods to ‘see’ websites and try to work their way through them. But such techniques are slow and expensive, and they don’t always work very well.”
The hybrid approach addresses this by using vision selectively rather than for every action.
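As a sketch of that idea (not Browser Use's actual internals), an agent can resolve an element through the DOM first and only reach for a screenshot when the DOM result looks suspect. Here `page` is an open Playwright page and `ask_vision_model` is again a hypothetical helper:

```python
# Sketch of the hybrid idea (not Browser Use's actual internals): locate through the
# DOM first, verify the element is genuinely visible, and only reach for a screenshot
# when the DOM result looks wrong.
def click_submit_hybrid(page) -> bool:
    button = page.locator("button:has-text('Submit')")
    # The DOM says the button exists -- but is it visible, or hidden behind a modal?
    if button.count() > 0 and button.first.is_visible():
        button.first.click()
        return True
    # Fall back to vision: capture a screenshot and ask a multimodal model for
    # coordinates (ask_vision_model is a placeholder for your own helper).
    shot = page.screenshot()
    # coords = ask_vision_model(shot, "Find the Submit button")
    return False
```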
The Self-Driving Car Parallel
The vision-versus-structured-data debate mirrors the one in self-driving cars. Tesla bet on vision only (cameras). Waymo uses both vision and structured data (lidar + cameras). Complex automation tasks perform better with more information, not less.
Ken Acquah, an engineer who works on browser agents, described this parallel:
“Like any LLM problem, computer use performance improves with context, and the best teams are leveraging both.”
For browser automation, the DOM provides structural context (this is a form, these are the input fields, this button submits). Vision provides verification (the form is visible, not blocked by a modal, the button text says “Submit Order”).
Key Players
| Tool | Approach | Benchmark Score | Funding |
|---|---|---|---|
| Skyvern | Vision-first with DOM fallback | 85.8% WebVoyager | $2.7M (YC S23) |
| Browser Use | Hybrid DOM + vision | 89.1% WebVoyager | $17M (YC W25) |
| OpenAI Operator | Vision-first (CUA model) | Proprietary | N/A |
| Anthropic Computer Use | Vision-first | Proprietary | N/A |
| Google Mariner | Vision-first | Comparable to Skyvern | N/A |
The open-source tools (Skyvern, Browser Use) provide transparency into their approaches. OpenAI’s Computer-Using Agent (CUA) and Anthropic’s Computer Use are proprietary but use similar vision-first principles.
When to Use Vision-Based Automation
Vision-based approaches work best for:
- Unpredictable sites: Automating across many different websites without custom code per site
- Frequently changing UIs: Sites that update their frontend often
- Complex workflows: Multi-step tasks where traditional scripts become unmaintainable
- Research and data gathering: Low-stakes tasks where occasional failures are acceptable
Stick with traditional DOM automation for:
- Single stable site: You control the site or it rarely changes
- Speed-critical tasks: Vision adds latency (screenshot capture + model inference)
- High-stakes transactions: Financial operations where reliability matters more than flexibility
Implementation Example
Using browser-use, a hybrid agent runs in a few lines:
```python
import asyncio

from browser_use import Agent
from browser_use.llm import ChatOpenAI

async def main():
    agent = Agent(
        task="Go to Amazon and add an iPhone 16 case to cart",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    result = await agent.run()

asyncio.run(main())
```
The agent:
- Captures the page screenshot and HTML
- Uses the LLM to understand both the visual layout and DOM structure
- Identifies the search box, enters the query
- Recognizes product listings visually
- Clicks “Add to Cart” by appearance, verified against DOM
No XPath. No CSS selectors. The automation survives Amazon’s constant UI experiments.
Common Mistakes
| Mistake | Why it fails | Fix |
|---|---|---|
| Vision-only without DOM | Hallucinates click coordinates | Use hybrid approach with DOM verification |
| DOM-only without vision | Misses dynamically loaded content | Add screenshot verification for JS-heavy sites |
| No validation step | Clicks wrong elements silently | Implement a validator that confirms actions succeeded |
| Trusting single screenshot | Page may still be loading | Wait for network idle before capture (see the sketch below) |
| Generic task descriptions | Agent doesn’t know when to stop | Be specific: “first 5 results” not “get results” |
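The last two rows translate into a few lines of Playwright. This is a sketch; the checkout URL and the "Order confirmed" text are hypothetical:

```python
# Sketch of two of the fixes above with Playwright's sync API. The URL and the
# "Order confirmed" text are hypothetical.
from playwright.sync_api import sync_playwright, TimeoutError as PWTimeout

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://example.com/checkout")

    page.wait_for_load_state("networkidle")   # don't screenshot a half-loaded page
    shot = page.screenshot()

    page.locator("button:has-text('Submit Order')").click()

    # Validate that the click actually succeeded instead of silently continuing.
    try:
        page.wait_for_selector("text=Order confirmed", timeout=5000)
    except PWTimeout:
        raise RuntimeError("Submit click did not produce a confirmation page")
```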
Token Economics
Vision-based automation consumes more tokens than traditional scripts:
| Operation | Approximate cost (GPT-4o) |
|---|---|
| Single screenshot analysis | 1,000-2,000 tokens |
| Full page with HTML context | 3,000-5,000 tokens |
| 10-step workflow | 30,000-50,000 tokens |
At current pricing, a 10-step workflow costs roughly $0.15-0.25 in API calls. A traditional Selenium script, by contrast, is effectively free to run once written.
The trade-off: you pay per execution instead of paying in maintenance time. For automations that break weekly, vision wins. For stable scripts running thousands of times daily, DOM wins.
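For a rough sense of where the $0.15-0.25 figure comes from, here is a back-of-envelope estimate. The per-token prices are assumptions based on typical GPT-4o list pricing; check your provider's current rates:

```python
# Back-of-envelope estimate using the per-step figures from the table above.
# The per-token prices are assumptions -- check your provider's current pricing.
INPUT_PRICE_PER_1M = 2.50     # USD per 1M input tokens (assumed GPT-4o list price)
OUTPUT_PRICE_PER_1M = 10.00   # USD per 1M output tokens (assumed)

steps = 10
input_tokens_per_step = 4_000   # screenshot + HTML context (mid-range from the table)
output_tokens_per_step = 500    # the model's chosen action and reasoning (assumed)

cost = steps * (
    input_tokens_per_step / 1_000_000 * INPUT_PRICE_PER_1M
    + output_tokens_per_step / 1_000_000 * OUTPUT_PRICE_PER_1M
)
print(f"Estimated cost for a {steps}-step workflow: ${cost:.2f}")  # lands around $0.15
```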
What’s Next
Vision-based automation is becoming the default for AI agents. As multimodal models get faster and cheaper, the economics shift further toward vision. Websites are designed for human eyes, not parsers. Tools that see pages like humans do can automate anything humans can click.
Next: Build Your First Browser Agent