Vision-Based Web Automation: Why Screenshots Are Replacing Selectors

Traditional browser automation scripts break constantly. A website updates its CSS class names, renames a button ID, or restructures its HTML, and your Selenium script fails at 3am. Vision-based automation takes a different approach: instead of parsing DOM elements, it looks at the screen and decides what to click.

The Selector Problem

DOM-based automation relies on identifying elements through XPath, CSS selectors, or element IDs:

# Traditional approach - breaks when HTML changes
driver.find_element(By.XPATH, "//button[@class='submit-btn-primary-v2']")
driver.find_element(By.ID, "checkout-form-submit")

This works until:

  - The CSS class name changes in a redesign
  - The button ID is renamed
  - The page structure is reorganized

In every case, the same button still exists visually. Users click it without noticing any change. But your automation fails because it depends on invisible implementation details.

How Vision-Based Automation Works

Vision models analyze screenshots the way humans process pages: by recognizing visual patterns, reading text, and understanding spatial relationships.

| Component | Role |
| --- | --- |
| Screenshot capture | Takes an image of the current viewport |
| Vision model | Identifies buttons, forms, links by appearance |
| Coordinate mapping | Translates “the blue Submit button” to pixel coordinates |
| Action execution | Clicks at those coordinates via Playwright or similar |

The process runs in a loop: capture screen, analyze with vision model, decide action, execute, repeat.
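
A minimal sketch of that loop, assuming Playwright for the browser side and a hypothetical analyze_screenshot helper that wraps whichever vision model you use (the helper and its action format are illustrative, not a real library API):

from playwright.sync_api import sync_playwright

def analyze_screenshot(image: bytes, task: str) -> dict:
    # Hypothetical helper: send the screenshot and task description to a vision
    # model and return an action such as {"type": "click", "x": 412, "y": 305},
    # or {"type": "done"} once the task is complete.
    raise NotImplementedError

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")

    for _ in range(10):                       # cap iterations so the loop can't run forever
        image = page.screenshot()             # 1. capture screen
        action = analyze_screenshot(image, "submit the signup form")  # 2. analyze and decide
        if action["type"] == "done":
            break
        page.mouse.click(action["x"], action["y"])  # 3. execute, then repeat

    browser.close()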

Skyvern, built by Suchintan Singh, uses this approach for enterprise RPA workflows. The system recognizes a submit button whether it’s styled as a green rectangle, blue pill, or custom graphic. It doesn’t care that the underlying HTML changed from <button> to <div role="button">.

Vision vs DOM vs Hybrid

Three approaches exist in the current browser agent ecosystem:

| Approach | How it works | Strength | Weakness |
| --- | --- | --- | --- |
| DOM-only | Parse HTML, find elements by selectors | Fast, precise coordinates | Breaks on layout changes, misses JS content |
| Vision-only | Screenshot analysis, click by pixel | Adapts to any visual design | Slower, sometimes hallucinates positions |
| Hybrid | DOM parsing + screenshot verification | Best accuracy | Higher token cost, more complexity |

Browser Use, created by Gregor Zunic and Magnus Muller, uses the hybrid approach. It extracts HTML structure for reliable element identification, then uses vision to verify the page looks correct before acting. This catches cases where the DOM says an element exists but it’s hidden or obscured.

From the Browser Use funding announcement:

“A lot of agents rely on vision-based methods to ‘see’ websites and try to work their way through them. But such techniques are slow and expensive, and they don’t always work very well.”

The hybrid approach addresses this by using vision selectively rather than for every action.
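
A rough sketch of that selective pattern, assuming a Playwright page and a hypothetical verify_with_vision helper (not Browser Use's actual internals): the DOM handles the cheap, precise element lookup, and the screenshot check only runs right before the action is committed.

from playwright.sync_api import sync_playwright

def verify_with_vision(screenshot: bytes, claim: str) -> bool:
    # Hypothetical helper: ask a vision model whether the claim holds for this
    # screenshot, e.g. "the Submit Order button is visible and not covered by a modal".
    raise NotImplementedError

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/checkout")

    # DOM pass: fast, precise element lookup
    button = page.get_by_role("button", name="Submit Order")

    # Vision pass: used selectively, only before committing the click
    if button.is_visible() and verify_with_vision(
        page.screenshot(),
        "the Submit Order button is visible and not obscured by a modal",
    ):
        button.click()

    browser.close()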

The Self-Driving Car Parallel

The vision-versus-structured-data debate mirrors the one in self-driving cars: Tesla bet on vision-only (cameras), while Waymo used both vision and structured data (lidar plus cameras). The same logic applies to browser agents: complex automation tasks perform better with more information, not less.

Ken Acquah, an engineer who works on browser agents, described this parallel:

“Like any LLM problem, computer use performance improves with context, and the best teams are leveraging both.”

For browser automation, the DOM provides structural context (this is a form, these are the input fields, this button submits). Vision provides verification (the form is visible, not blocked by a modal, the button text says “Submit Order”).

Key Players

| Tool | Approach | Benchmark score | Funding |
| --- | --- | --- | --- |
| Skyvern | Vision-first with DOM fallback | 85.8% WebVoyager | $2.7M (YC S23) |
| Browser Use | Hybrid DOM + vision | 89.1% WebVoyager | $17M (YC W25) |
| OpenAI Operator | Vision-first (CUA model) | Proprietary | N/A |
| Anthropic Computer Use | Vision-first | Proprietary | N/A |
| Google Mariner | Vision-first | Comparable to Skyvern | N/A |

The open-source tools (Skyvern, Browser Use) provide transparency into their approaches. OpenAI’s Computer-Using Agent (CUA) and Anthropic’s Computer Use are proprietary but use similar vision-first principles.

When to Use Vision-Based Automation

Vision-based approaches work best for:

  - Sites you don't control, where selectors and layouts change without notice
  - Interfaces that are frequently redesigned or A/B-tested
  - JavaScript-heavy pages where content loads dynamically and the DOM alone misleads

Stick with traditional DOM automation for:

  - Stable internal tools whose markup you control
  - High-volume scripts that run thousands of times daily, where per-call API cost adds up
  - Tasks that need exact, deterministic element targeting

Implementation Example

Using browser-use, a hybrid agent runs in a few lines:

import asyncio

from browser_use import Agent
from browser_use.llm import ChatOpenAI

async def main():
    agent = Agent(
        task="Go to Amazon and add an iPhone 16 case to cart",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    # Agent.run() is a coroutine, so it needs an event loop
    result = await agent.run()
    print(result)

asyncio.run(main())

The agent:

  1. Captures the page screenshot and HTML
  2. Uses the LLM to understand both the visual layout and DOM structure
  3. Identifies the search box, enters the query
  4. Recognizes product listings visually
  5. Clicks “Add to Cart” by appearance, verified against DOM

No XPath. No CSS selectors. The automation survives Amazon’s constant UI experiments.

Common Mistakes

| Mistake | Why it fails | Fix |
| --- | --- | --- |
| Vision-only without DOM | Hallucinates click coordinates | Use a hybrid approach with DOM verification |
| DOM-only without vision | Misses dynamically loaded content | Add screenshot verification for JS-heavy sites |
| No validation step | Clicks wrong elements silently | Implement a validator that confirms actions succeeded |
| Trusting a single screenshot | Page may still be loading | Wait for network idle before capture (see the snippet below) |
| Generic task descriptions | Agent doesn’t know when to stop | Be specific: “first 5 results” not “get results” |
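
A quick guard against the half-loaded-page mistake, assuming Playwright (the function name here is just illustrative):

from playwright.sync_api import Page

def capture_when_settled(page: Page) -> bytes:
    # Wait until no network requests have been in flight for a moment,
    # so the vision model never reasons over a half-rendered page.
    page.wait_for_load_state("networkidle")
    return page.screenshot()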

Token Economics

Vision-based automation consumes more tokens than traditional scripts:

| Operation | Approximate cost (GPT-4o) |
| --- | --- |
| Single screenshot analysis | 1,000-2,000 tokens |
| Full page with HTML context | 3,000-5,000 tokens |
| 10-step workflow | 30,000-50,000 tokens |

At current pricing, a 10-step workflow costs roughly $0.15-0.25 in API calls. Traditional Selenium: effectively free after development time.
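
As a rough check, assuming GPT-4o's list price of about $2.50 per million input tokens, a 50,000-token workflow that is mostly input comes to roughly 50,000 × $2.50 / 1,000,000 ≈ $0.13; output tokens and the occasional retry push it into that $0.15-0.25 band.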

The trade-off: you pay per execution instead of paying in maintenance time. For automations that break weekly, vision wins. For stable scripts running thousands of times daily, DOM wins.
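
The break-even is easy to estimate from those numbers: at roughly $0.20 per run, a workflow executed 100 times a day costs about $600 a month in API calls, which is only worth it if it saves more than that in selector maintenance; run the same workflow 10,000 times a day and the DOM script's near-zero marginal cost wins easily.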

What’s Next

Vision-based automation is becoming the default for AI agents. As multimodal models get faster and cheaper, the economics shift further toward vision. Websites are designed for human eyes, not parsers. Tools that see pages like humans do can automate anything humans can click.


Next: Build Your First Browser Agent
