scraper-toolkit

Playwright web scraping best practices and patterns learned from production scraping

Author Nathan Vale
Namespace @nathanvale/side-quest-marketplace
Category development
Version 1.0.0
Stars 1
Downloads 3
self.md verified

Installation

npx claude-plugins install @nathanvale/side-quest-marketplace/scraper-toolkit

Contents

Folders: commands, hooks, skills

Files: package.json

Included Skills

This plugin includes 1 skill definition:

playwright-scraper

Playwright Web Scraper

Production-proven web scraping patterns using Playwright, with a selector-first approach and robust error handling.


Core Principles

1. Selector-First Approach

Always prefer semantic locators over CSS selectors:

// Locators are created synchronously, so no await is needed until you act on them.

// ✅ BEST: Semantic locators (accessible, maintainable)
page.getByRole('button', { name: 'Submit' })
page.getByText('Welcome')
page.getByLabel('Email')

// ⚠️ ACCEPTABLE: Text patterns for dynamic content
page.locator('text=/\\$\\d+\\.\\d{2}/')

// ❌ AVOID: Brittle CSS selectors
page.locator('.btn-primary')
page.locator('#submit-button')
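
As a rough illustration of the selector-first idea in use, the sketch below fills a field and clicks a button using only semantic locators. The `submitEmail` helper and the `PageLike` / `LocatorLike` interfaces are hypothetical names introduced here so the snippet compiles without a browser; they are narrowed to the `getByLabel`, `getByRole`, `fill`, and `click` calls, and a real Playwright `Page` is expected to satisfy them structurally.

```typescript
// Hypothetical minimal interfaces covering only the Playwright calls used below.
interface LocatorLike {
  fill(value: string): Promise<void>;
  click(): Promise<void>;
}

interface PageLike {
  getByLabel(text: string): LocatorLike;
  getByRole(role: "button", options: { name: string }): LocatorLike;
}

// Hypothetical helper: the 'Email' label and 'Submit' button name are
// assumptions carried over from the locator examples above.
async function submitEmail(page: PageLike, email: string): Promise<void> {
  // Creating a locator is synchronous; only the action is awaited.
  await page.getByLabel("Email").fill(email);
  await page.getByRole("button", { name: "Submit" }).click();
}
```

Because nothing here depends on class names or ids, a restyle of the page should leave the helper working as long as the label and the button's accessible name stay the same.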

2. Page Text Extraction

Critical difference between textContent and innerText:

// ❌ WRONG: Returns ALL text in the DOM, including hidden elements and inline <script>/<style> contents
const rawText = await page.textContent("body");

// ✅ CORRECT: Returns only rendered text (roughly what users see)
const pageText = await page.innerText("body");
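
One way to package the `innerText` approach is a small wrapper that also tidies whitespace before any regex runs. This is a sketch under assumptions: `getVisibleText` and `VisibleTextSource` are hypothetical names, the interface only mirrors Playwright's `page.innerText(selector)` call, and the whitespace normalization is an optional extra rather than something the skill prescribes.

```typescript
// Hypothetical structural type: anything with an innerText(selector) method,
// which a Playwright Page is expected to satisfy.
interface VisibleTextSource {
  innerText(selector: string): Promise<string>;
}

// Hypothetical helper: fetch rendered text, then normalize it line by line.
async function getVisibleText(
  page: VisibleTextSource,
  selector: string = "body",
): Promise<string> {
  const raw = await page.innerText(selector);
  return raw
    .split(/\r?\n/)
    // Collapse runs of spaces, tabs, and non-breaking spaces, then trim.
    .map((line) => line.replace(/[ \t\u00a0]+/g, " ").trim())
    // Drop lines that are empty after trimming.
    .filter((line) => line.length > 0)
    .join("\n");
}
```

Newlines are kept on purpose, so extraction patterns still have to tolerate line breaks between a label and its value.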

Use case for each:

3. Regex Patterns for Extraction

Handle newlines and whitespace in HTML:

// ❌ FAILS: . doesn't match newlines, so .* can't bridge a line break between label and price
const match = pageText.match(/ADULT.*(\$\d+\.\d{2})/);

// ✅ WORKS: [\s\S]{0,10} matches up to 10 characters of any kind, including newlines
const match = pageText.match(/ADULT[\s\S]{0,10}(\$\d+\.\d{2})/);
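
The bounded `[\s\S]{0,10}` pattern can be generalized into a reusable function. The sketch below is one possible shape, not the skill's own code: `extractPriceAfter`, its `label` and `maxGap` parameters, and the sample text are all hypothetical.

```typescript
// Hypothetical helper: return the first "$D.DD" price that appears within
// `maxGap` characters after `label`, or null if there is none.
function extractPriceAfter(
  text: string,
  label: string,
  maxGap: number = 10,
): string | null {
  // Escape regex metacharacters so the label is matched literally.
  const escaped = label.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // [\s\S] matches any character including newlines; the bound keeps the
  // match from wandering to an unrelated price further down the page.
  const pattern = new RegExp(
    escaped + "[\\s\\S]{0," + maxGap + "}?(\\$\\d+\\.\\d{2})",
  );
  const match = text.match(pattern);
  return match ? match[1] : null;
}

// Made-up sample resembling innerText output, with the price on its own line.
const sampleText = "ADULT\n  $12.50\nCHILD\n$8.00";
const adultPrice = extractPriceAfter(sampleText, "ADULT");
const childPrice = extractPriceAfter(sampleText, "CHILD");
```

The gap limit is a trade-off: too small and a price separated by extra whitespace is missed, too large and the match may land on the next ticket type's price.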

Common patterns:

…(truncated)

Tags: development, scraper, playwright, web-scraping, best-practices