Vision-Based Web Automation: Why Screenshots Are Replacing Selectors
Traditional browser automation scripts break constantly. A website updates its CSS class names, renames a button ID, or restructures its HTML, and your Selenium script fails at 3am. Vision-based automation takes a different approach: instead of parsing DOM elements, it looks at the screen and decides what to click.
The Selector Problem
DOM-based automation relies on identifying elements through XPath, CSS selectors, or element IDs:
```python
# Traditional approach - breaks when HTML changes
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.find_element(By.XPATH, "//button[@class='submit-btn-primary-v2']")
driver.find_element(By.ID, "checkout-form-submit")
```
This works until:
- The site renames `submit-btn-primary-v2` to `cta-button-main`
- A React update changes the component hierarchy
- The site switches from server-rendered HTML to client-side JavaScript
- A/B testing shows different layouts to different users
The same button exists visually. Users click it without noticing any change. But your automation fails because it depends on invisible implementation details.
How Vision-Based Automation Works
Vision models analyze screenshots the way humans process pages: by recognizing visual patterns, reading text, and understanding spatial relationships.
| Component | Role |
|---|---|
| Screenshot capture | Takes image of current viewport |
| Vision model | Identifies buttons, forms, links by appearance |
| Coordinate mapping | Translates “the blue Submit button” to pixel coordinates |
| Action execution | Clicks at those coordinates via Playwright or similar |
The process runs in a loop: capture screen, analyze with vision model, decide action, execute, repeat.
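Here is a minimal sketch of that loop using Playwright's sync API. The `ask_vision_model` function is a placeholder rather than a real library call; it's assumed to return an action dict such as `{"type": "click", "x": 640, "y": 320}` or `{"type": "done"}`:

```python
# Minimal sketch of the capture -> analyze -> act loop using Playwright's sync API.
# ask_vision_model() is a placeholder, not a real library call: it is assumed to
# return an action dict like {"type": "click", "x": 640, "y": 320} or {"type": "done"}.
from playwright.sync_api import sync_playwright

def ask_vision_model(screenshot: bytes, goal: str) -> dict:
    """Send the screenshot and goal to a multimodal model and parse its reply.
    The implementation depends entirely on your provider."""
    raise NotImplementedError

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://example.com")

    goal = "Click the Submit button"
    for _ in range(10):                            # cap the number of steps
        shot = page.screenshot()                   # 1. capture the current viewport
        action = ask_vision_model(shot, goal)      # 2. let the model decide the next action
        if action["type"] == "done":
            break
        if action["type"] == "click":              # 3. execute at pixel coordinates
            page.mouse.click(action["x"], action["y"])
        page.wait_for_load_state("networkidle")    # let the page settle before the next shot
```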
Skyvern, built by Suchintan Singh, uses this approach for enterprise RPA workflows. The system recognizes a submit button whether it’s styled as a green rectangle, blue pill, or custom graphic. It doesn’t care that the underlying HTML changed from `<button>` to `<div role="button">`.
Vision vs DOM vs Hybrid
Three approaches exist in the current browser agent ecosystem:
| Approach | How it works | Strength | Weakness |
|---|---|---|---|
| DOM-only | Parse HTML, find elements by selectors | Fast, precise coordinates | Breaks on layout changes, misses JS content |
| Vision-only | Screenshot analysis, click by pixel | Adapts to any visual design | Slower, sometimes hallucinates positions |
| Hybrid | DOM parsing + screenshot verification | Best accuracy | Higher token cost, more complexity |
Browser Use, created by Gregor Zunic and Magnus Muller, uses the hybrid approach. It extracts HTML structure for reliable element identification, then uses vision to verify the page looks correct before acting. This catches cases where the DOM says an element exists but it’s hidden or obscured.
From the Browser Use funding announcement:
“A lot of agents rely on vision-based methods to ‘see’ websites and try to work their way through them. But such techniques are slow and expensive, and they don’t always work very well.”
The hybrid approach addresses this by using vision selectively rather than for every action.
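As a sketch of that idea (not Browser Use's actual internals), an agent can resolve an element through the DOM first and only reach for a screenshot when the DOM result looks suspect. Here `page` is an open Playwright page and `ask_vision_model` is again a hypothetical helper:

```python
# Sketch of the hybrid idea (not Browser Use's actual internals): locate through the
# DOM first, verify the element is genuinely visible, and only reach for a screenshot
# when the DOM result looks wrong.
def click_submit_hybrid(page) -> bool:
    button = page.locator("button:has-text('Submit')")
    # The DOM says the button exists -- but is it visible, or hidden behind a modal?
    if button.count() > 0 and button.first.is_visible():
        button.first.click()
        return True
    # Fall back to vision: capture a screenshot and ask a multimodal model for
    # coordinates (ask_vision_model is a placeholder for your own helper).
    shot = page.screenshot()
    # coords = ask_vision_model(shot, "Find the Submit button")
    return False
```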
The Self-Driving Car Parallel
The vision-versus-structured-data debate mirrors the one in self-driving cars. Tesla bet on vision only (cameras). Waymo uses both vision and structured data (lidar + cameras). Complex automation tasks perform better with more information, not less.
Ken Acquah, an engineer who works on browser agents, described this parallel:
“Like any LLM problem, computer use performance improves with context, and the best teams are leveraging both.”
For browser automation, the DOM provides structural context (this is a form, these are the input fields, this button submits). Vision provides verification (the form is visible, not blocked by a modal, the button text says “Submit Order”).
Key Players
| Tool | Approach | Benchmark Score | Funding |
|---|---|---|---|
| Skyvern | Vision-first with DOM fallback | 85.8% WebVoyager | $2.7M (YC S23) |
| Browser Use | Hybrid DOM + vision | 89.1% WebVoyager | $17M (YC W25) |
| OpenAI Operator | Vision-first (CUA model) | Proprietary | N/A |
| Anthropic Computer Use | Vision-first | Proprietary | N/A |
| Google Mariner | Vision-first | Comparable to Skyvern | N/A |
The open-source tools (Skyvern, Browser Use) provide transparency into their approaches. OpenAI’s Computer-Using Agent (CUA) and Anthropic’s Computer Use are proprietary but use similar vision-first principles.
When to Use Vision-Based Automation
Vision-based approaches work best for:
- Unpredictable sites: Automating across many different websites without custom code per site
- Frequently changing UIs: Sites that update their frontend often
- Complex workflows: Multi-step tasks where traditional scripts become unmaintainable
- Research and data gathering: Low-stakes tasks where occasional failures are acceptable
Stick with traditional DOM automation for:
- Single stable site: You control the site or it rarely changes
- Speed-critical tasks: Vision adds latency (screenshot capture + model inference)
- High-stakes transactions: Financial operations where reliability matters more than flexibility
Implementation Example
Using browser-use, a hybrid agent runs in a few lines:
```python
import asyncio

from browser_use import Agent
from browser_use.llm import ChatOpenAI

async def main():
    agent = Agent(
        task="Go to Amazon and add an iPhone 16 case to cart",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    result = await agent.run()

asyncio.run(main())
```
The agent:
- Captures the page screenshot and HTML
- Uses the LLM to understand both the visual layout and DOM structure
- Identifies the search box, enters the query
- Recognizes product listings visually
- Clicks “Add to Cart” by appearance, verified against DOM
No XPath. No CSS selectors. The automation survives Amazon’s constant UI experiments.
Common Mistakes
| Mistake | Why it fails | Fix |
|---|---|---|
| Vision-only without DOM | Hallucinates click coordinates | Use hybrid approach with DOM verification |
| DOM-only without vision | Misses dynamically loaded content | Add screenshot verification for JS-heavy sites |
| No validation step | Clicks wrong elements silently | Implement a validator that confirms actions succeeded |
| Trusting single screenshot | Page may still be loading | Wait for network idle before capture (see the sketch below) |
| Generic task descriptions | Agent doesn’t know when to stop | Be specific: “first 5 results” not “get results” |
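The last two rows translate into a few lines of Playwright. This is a sketch; the checkout URL and the "Order confirmed" text are hypothetical:

```python
# Sketch of two of the fixes above with Playwright's sync API. The URL and the
# "Order confirmed" text are hypothetical.
from playwright.sync_api import sync_playwright, TimeoutError as PWTimeout

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://example.com/checkout")

    page.wait_for_load_state("networkidle")   # don't screenshot a half-loaded page
    shot = page.screenshot()

    page.locator("button:has-text('Submit Order')").click()

    # Validate that the click actually succeeded instead of silently continuing.
    try:
        page.wait_for_selector("text=Order confirmed", timeout=5000)
    except PWTimeout:
        raise RuntimeError("Submit click did not produce a confirmation page")
```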
Token Economics
Vision-based automation consumes more tokens than traditional scripts:
| Operation | Approximate cost (GPT-4o) |
|---|---|
| Single screenshot analysis | 1,000-2,000 tokens |
| Full page with HTML context | 3,000-5,000 tokens |
| 10-step workflow | 30,000-50,000 tokens |
At current pricing, a 10-step workflow costs roughly $0.15-0.25 in API calls. A traditional Selenium script, by contrast, is effectively free to run once written.
The trade-off: you pay per execution instead of paying in maintenance time. For automations that break weekly, vision wins. For stable scripts running thousands of times daily, DOM wins.
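For a rough sense of where the $0.15-0.25 figure comes from, here is a back-of-envelope estimate. The per-token prices are assumptions based on typical GPT-4o list pricing; check your provider's current rates:

```python
# Back-of-envelope estimate using the per-step figures from the table above.
# The per-token prices are assumptions -- check your provider's current pricing.
INPUT_PRICE_PER_1M = 2.50     # USD per 1M input tokens (assumed GPT-4o list price)
OUTPUT_PRICE_PER_1M = 10.00   # USD per 1M output tokens (assumed)

steps = 10
input_tokens_per_step = 4_000   # screenshot + HTML context (mid-range from the table)
output_tokens_per_step = 500    # the model's chosen action and reasoning (assumed)

cost = steps * (
    input_tokens_per_step / 1_000_000 * INPUT_PRICE_PER_1M
    + output_tokens_per_step / 1_000_000 * OUTPUT_PRICE_PER_1M
)
print(f"Estimated cost for a {steps}-step workflow: ${cost:.2f}")  # lands around $0.15
```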
What’s Next
Vision-based automation is becoming the default for AI agents. As multimodal models get faster and cheaper, the economics shift further toward vision. Websites are designed for human eyes, not parsers. Tools that see pages like humans do can automate anything humans can click.
Next: Build Your First Browser Agent