scraper-toolkit
Playwright web scraping best practices and patterns learned from production scraping
Installation
npx claude-plugins install @nathanvale/side-quest-marketplace/scraper-toolkit
Contents
Folders: commands, hooks, skills
Files: package.json
Included Skills
This plugin includes 1 skill definition:
playwright-scraper
Playwright Web Scraper
Production-proven web scraping patterns using Playwright with selector-first approach and robust error handling.
Core Principles
1. Selector-First Approach
Always prefer semantic locators over CSS selectors:
// ✅ BEST: Semantic locators (accessible, maintainable)
await page.getByRole('button', { name: 'Submit' }).click();
await page.getByText('Welcome').waitFor();
await page.getByLabel('Email').fill('user@example.com');
// ⚠️ ACCEPTABLE: Text patterns for dynamic content
await page.locator('text=/\\$\\d+\\.\\d{2}/').innerText();
// ❌ AVOID: Brittle CSS selectors
await page.locator('.btn-primary').click();
await page.locator('#submit-button').click();
2. Page Text Extraction
The critical difference between textContent and innerText:
// ❌ WRONG: Returns ALL text nodes, including hidden elements and
// inline <script>/<style> contents
const rawText = await page.textContent("body");
// ✅ CORRECT: Returns only VISIBLE text (what users actually see)
const pageText = await page.innerText("body");
Use case for each:
innerText("body") - Extract visible page content for regex matching
textContent(selector) - Get text from a specific element
3. Regex Patterns for Extraction
Handle newlines and whitespace in HTML:
// ❌ FAILS: "." doesn't match newlines, so the pattern breaks when a
// line break separates the label from the price
const match = pageText.match(/ADULT.{0,10}(\$\d+\.\d{2})/);
// ✅ WORKS: [\s\S] matches any character, including newlines;
// {0,10} keeps the match close to the label
const match = pageText.match(/ADULT[\s\S]{0,10}(\$\d+\.\d{2})/);
Common patterns:
…(truncated)