scraper-toolkit
Playwright web scraping best practices and patterns learned from production scraping
Installation
npx claude-plugins install @nathanvale/side-quest-marketplace/scraper-toolkit
Contents
Folders: commands, hooks, skills
Files: package.json
Included Skills
This plugin includes 1 skill definition:
playwright-scraper
Playwright Web Scraper
Production-proven web scraping patterns using Playwright with selector-first approach and robust error handling.
Core Principles
1. Selector-First Approach
Always prefer semantic locators over CSS selectors:
// ✅ BEST: Semantic locators (accessible, maintainable)
await page.getByRole('button', { name: 'Submit' }).click();
await page.getByText('Welcome').waitFor();
await page.getByLabel('Email').fill('user@example.com');
// ⚠️ ACCEPTABLE: Text patterns for dynamic content
await page.locator('text=/\\$\\d+\\.\\d{2}/').innerText();
// ❌ AVOID: Brittle CSS selectors
await page.locator('.btn-primary').click();
await page.locator('#submit-button').click();
2. Page Text Extraction
The critical difference between textContent and innerText:
// ❌ WRONG: Returns ALL text nodes, including hidden elements and
// inline <script>/<style> contents
const rawText = await page.textContent("body");
// ✅ CORRECT: Returns only VISIBLE text (what users actually see)
const pageText = await page.innerText("body");
Use case for each:
innerText("body") - Extract visible page content for regex matching
textContent(selector) - Get text from a specific element
3. Regex Patterns for Extraction
Handle newlines and whitespace in HTML:
// ❌ FAILS: "." doesn't match newlines, so the pattern breaks when a
// line break separates the label from the price
const match = pageText.match(/ADULT.{0,10}(\$\d+\.\d{2})/);
// ✅ WORKS: [\s\S] matches any character, including newlines;
// {0,10} keeps the match close to the label
const match = pageText.match(/ADULT[\s\S]{0,10}(\$\d+\.\d{2})/);
Common patterns:
…(truncated)