Requirements
- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Complete web scraping methodology — legal compliance, architecture design, anti-detection, data pipelines, and production operations. Use when building scrap...
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Then review README.md for any prerequisites, environment setup, or post-install checks. Tell me what you changed and call out any manual steps you could not complete.
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Then review README.md for any prerequisites, environment setup, or post-install checks. Summarize what changed and any follow-up checks I should run.
Score your scraping operation (2 points each):

| Signal | Healthy | Unhealthy |
|---|---|---|
| Legal compliance | robots.txt checked, ToS reviewed | Scraping blindly |
| Architecture | Tool matches site complexity | Using Puppeteer for static HTML |
| Anti-detection | Rotation, delays, fingerprint diversity | Single IP, no delays |
| Data quality | Validation + dedup pipeline | Raw dumps, no cleaning |
| Error handling | Retry logic, circuit breakers | Crashes on first 403 |
| Monitoring | Success rates tracked, alerts set | No visibility |
| Storage | Structured, deduplicated, versioned | Flat files, duplicates |
| Scheduling | Appropriate frequency, off-peak | Hammering during business hours |

Score: /16 → 12+: Production-ready | 8-11: Needs work | <8: Stop and redesign
```yaml
compliance_brief:
  target_domain: ""
  date_assessed: ""
  robots_txt:
    checked: false
    target_paths_allowed: false
    crawl_delay_specified: ""
    ai_bot_rules: ""          # Many sites now block AI crawlers specifically
  terms_of_service:
    reviewed: false
    scraping_mentioned: false
    scraping_prohibited: false
    api_available: false
    api_sufficient: false
  data_classification:
    type: ""                  # public-factual | public-personal | behind-auth | copyrighted
    contains_pii: false
    pii_types: []             # name, email, phone, address, photo
    gdpr_applies: false       # EU residents' data
    ccpa_applies: false       # California residents' data
  legal_risk: ""              # low | medium | high | do-not-scrape
  decision: ""                # proceed | use-api | request-permission | abandon
  justification: ""
```
| Scenario | Risk Level | Key Case Law |
|---|---|---|
| Public data, no login, robots.txt allows | LOW | hiQ v. LinkedIn (2022) |
| Public data, robots.txt disallows | MEDIUM | Meta v. Bright Data (2024) |
| Behind authentication | HIGH | Van Buren v. US (2021), CFAA |
| Personal data without consent | HIGH | GDPR Art. 6, CCPA §1798.100 |
| Republishing copyrighted content | HIGH | Copyright Act §106 |
| Price/product comparison | LOW | eBay v. Bidder's Edge (fair use) |
| Academic/research use | LOW-MEDIUM | Varies by jurisdiction |
| Bypassing anti-bot measures | HIGH | CFAA "exceeds authorized access" |
- API exists and covers your needs? → Use the API. Always.
- robots.txt disallows your target? → Respect it unless you have written permission.
- Data behind login? → Do not scrape without explicit authorization.
- Contains PII? → GDPR/CCPA compliance required before collection.
- Copyrighted content? → Extract facts/data points only, never full content.
- Site explicitly prohibits scraping? → Request permission or find an alternative source.
Many sites now specifically block AI-related crawlers:

```
# Common AI bot blocks in robots.txt
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: Google-Extended
User-agent: CCBot
User-agent: anthropic-ai
User-agent: ClaudeBot
User-agent: Bytespider
User-agent: PerplexityBot
```

Rule: If collecting data for AI training, check for these specific blocks.
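A quick programmatic check is possible with Python's standard-library `urllib.robotparser` — a minimal sketch, with the user agent, domain, and path as placeholders:

```python
from urllib.robotparser import RobotFileParser

def check_robots(base_url: str, target_path: str, user_agent: str = "MyScraperBot"):
    """Return (allowed, crawl_delay) for a target path, per robots.txt."""
    rp = RobotFileParser()
    rp.set_url(f"{base_url.rstrip('/')}/robots.txt")
    rp.read()  # fetches and parses robots.txt
    allowed = rp.can_fetch(user_agent, f"{base_url.rstrip('/')}{target_path}")
    delay = rp.crawl_delay(user_agent)  # None if no Crawl-delay directive
    return allowed, delay

# Example (hypothetical domain):
# allowed, delay = check_robots("https://example.com", "/products")
```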
| Tool/Approach | Best For | Speed | JS Support | Complexity | Cost |
|---|---|---|---|---|---|
| HTTP client (requests/axios) | Static HTML, APIs | ⚡⚡⚡ | ❌ | Low | Free |
| Beautiful Soup / Cheerio | Static HTML parsing | ⚡⚡⚡ | ❌ | Low | Free |
| Scrapy | Large-scale structured crawling | ⚡⚡⚡ | Plugin | Medium | Free |
| Playwright / Puppeteer | JS-rendered, SPAs, interactions | ⚡ | ✅ | Medium | Free |
| Selenium | Legacy, browser automation | ⚡ | ✅ | High | Free |
| Crawlee | Hybrid (HTTP + browser fallback) | ⚡⚡ | ✅ | Medium | Free |
| Firecrawl / ScrapingBee | Managed, anti-bot bypass | ⚡⚡ | ✅ | Low | Paid |
| Bright Data / Oxylabs | Enterprise, proxy + browser | ⚡⚡ | ✅ | Low | Paid |
```
Is the content in the initial HTML source?
├── YES → Is the site structure consistent?
│   ├── YES → Static scraper (requests + BeautifulSoup/Cheerio)
│   └── NO  → Scrapy with custom parsers
└── NO → Does the page require user interaction?
    ├── YES → Playwright/Puppeteer with interaction scripts
    └── NO  → Playwright in non-interactive mode
        └── At scale (>10K pages)? → Crawlee (hybrid mode)
            └── Heavy anti-bot? → Managed service (Firecrawl/ScrapingBee)
```
```yaml
scraping_project:
  name: ""
  objective: ""               # What data, why, how often
  targets:
    - domain: ""
      pages_estimated: 0
      rendering: ""           # static | javascript | spa
      anti_bot: ""            # none | basic | cloudflare | advanced
      rate_limit: ""          # requests per second safe limit
  tool_selected: ""
  justification: ""
  data_schema:
    fields: []
    output_format: ""         # json | csv | database
  schedule:
    frequency: ""             # once | hourly | daily | weekly
    preferred_time: ""        # off-peak for target timezone
  infrastructure:
    proxy_needed: false
    proxy_type: ""            # residential | datacenter | mobile
    storage: ""
    monitoring: ""
```
```python
# Python example — production request pattern
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

# Retry strategy
retry = Retry(
    total=3,
    backoff_factor=1,  # 1s, 2s, 4s
    status_forcelist=[429, 500, 502, 503, 504],
    respect_retry_after_header=True,
)
session.mount("https://", HTTPAdapter(max_retries=retry))

# Realistic headers
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Cache-Control": "no-cache",
})
```
Rotate these to avoid fingerprinting:

| Header | Rotation Pool Size | Notes |
|---|---|---|
| User-Agent | 20-50 real browser UAs | Match OS distribution |
| Accept-Language | 5-10 locale combos | Match proxy geo |
| Sec-Ch-Ua | Match User-Agent | Chrome/Edge/Brave |
| Referer | Vary per request | Previous page or search engine |
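A minimal rotation sketch — the pools below are tiny placeholders, and in practice Sec-Ch-Ua values should stay paired with the matching User-Agent:

```python
import random

# Placeholder pools — populate with real browser values and matching distributions
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
]
ACCEPT_LANGUAGES = ["en-US,en;q=0.9", "en-GB,en;q=0.8"]

def rotated_headers() -> dict:
    """Pick a header set per request from the pools."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": random.choice(ACCEPT_LANGUAGES),
        "Referer": "https://www.google.com/",
    }

# resp = session.get(url, headers=rotated_headers())
```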
| Site Type | Safe Delay | Aggressive (risky) |
|---|---|---|
| Small business site | 5-10 seconds | 2-3 seconds |
| Medium site | 2-5 seconds | 1-2 seconds |
| Large platform (Amazon, etc.) | 3-5 seconds | 1 second |
| API endpoint | Per API docs | Never exceed |
| robots.txt crawl-delay | Respect exactly | Never below |

Rules:
- Always respect Crawl-delay in robots.txt
- Add random jitter (±30%) to avoid pattern detection
- Slow down during business hours for smaller sites
- Respect Retry-After headers — they mean it
- Watch for 429s — back off exponentially (2x each time)
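A small sketch of the jitter and backoff rules above (delay values are illustrative):

```python
import random
import time

def polite_delay(base_seconds: float) -> None:
    """Sleep for base_seconds ±30% random jitter."""
    jitter = base_seconds * random.uniform(-0.3, 0.3)
    time.sleep(base_seconds + jitter)

def backoff_delay(base_seconds: float, attempt: int, retry_after: float | None = None) -> None:
    """Exponential backoff on 429s; honor Retry-After when present."""
    if retry_after is not None:
        time.sleep(retry_after)
    else:
        time.sleep(base_seconds * (2 ** attempt))

# polite_delay(5)                 # small business site, safe tier
# backoff_delay(5, attempt=2)     # third retry after repeated 429s
```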
1. Data attributes → `[data-product-id]`, `[data-price]` (most stable)
2. Semantic IDs → `#product-title`, `#price` (stable but can change)
3. ARIA attributes → `[aria-label="Price"]` (accessibility, fairly stable)
4. Semantic HTML → `article`, `main`, `nav` (structural, stable)
5. Class names → `.product-card` (can change with redesigns)
6. XPath position → `//div[3]/span[2]` (FRAGILE — last resort)
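A fallback-chain sketch built on this priority order; the selectors are placeholders for whatever the target page actually exposes:

```python
from bs4 import BeautifulSoup

# Ordered from most to least stable — illustrative selectors only
PRICE_SELECTORS = [
    "[data-price]",
    "#price",
    '[aria-label="Price"]',
    ".product-card .price",
]

def first_match(soup: BeautifulSoup, selectors: list[str]) -> str | None:
    """Try selectors in priority order, return the first non-empty text."""
    for sel in selectors:
        node = soup.select_one(sel)
        if node and node.get_text(strip=True):
            return node.get_text(strip=True)
    return None
```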
Structured data first — check before writing CSS selectors:

```python
# 1. Check JSON-LD (best source — structured, clean)
import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
for script in soup.find_all('script', type='application/ld+json'):
    data = json.loads(script.string)
    # Often contains: Product, Article, Organization, etc.

# 2. Check Open Graph meta tags
og_title = soup.find('meta', property='og:title')
og_price = soup.find('meta', property='product:price:amount')

# 3. Check microdata
items = soup.find_all(itemtype=True)

# 4. Fall back to CSS selectors only if the above are empty
```

Table extraction pattern:

```python
import pandas as pd

# Quick table extraction
tables = pd.read_html(html)  # Returns list of DataFrames

# For complex tables with merged cells
def extract_table(soup, selector):
    table = soup.select_one(selector)
    headers = [th.get_text(strip=True) for th in table.select('thead th')]
    rows = []
    for tr in table.select('tbody tr'):
        cells = [td.get_text(strip=True) for td in tr.select('td')]
        rows.append(dict(zip(headers, cells)))
    return rows
```

Pagination handling:

```python
from urllib.parse import urljoin

# Pattern 1: Next button
while True:
    # ... scrape current page ...
    next_link = soup.select_one('a.next-page, [rel="next"], .pagination .next a')
    if not next_link or not next_link.get('href'):
        break
    url = urljoin(base_url, next_link['href'])

# Pattern 2: API pagination (infinite scroll sites)
page = 1
while True:
    resp = session.get(f"{api_url}?page={page}&limit=50")
    data = resp.json()
    if not data.get('results'):
        break
    # ... process results ...
    page += 1

# Pattern 3: Cursor-based
cursor = None
while True:
    params = {"limit": 50}
    if cursor:
        params["cursor"] = cursor
    resp = session.get(api_url, params=params)
    data = resp.json()
    # ... process ...
    cursor = data.get('next_cursor')
    if not cursor:
        break
```
```python
# Playwright pattern for JS-rendered pages
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent="Mozilla/5.0 ...",
    )
    page = context.new_page()

    # Block unnecessary resources (speed + stealth)
    page.route("**/*.{png,jpg,jpeg,gif,svg,woff,woff2}", lambda route: route.abort())

    page.goto(url, wait_until="networkidle")

    # Wait for specific content (better than arbitrary sleep)
    page.wait_for_selector('[data-product-id]', timeout=10000)

    # Extract after JS rendering
    content = page.content()
    # ... parse with BeautifulSoup/Cheerio ...

    browser.close()
```
| Signal | Detection Method | Mitigation |
|---|---|---|
| IP reputation | IP blacklists, datacenter ranges | Residential proxies |
| Request rate | Requests/min from same IP | Rate limiting + jitter |
| TLS fingerprint | JA3/JA4 hash matching | Use real browser or curl-impersonate |
| Browser fingerprint | Canvas, WebGL, fonts | Playwright with stealth plugin |
| JavaScript challenges | Cloudflare Turnstile, hCaptcha | Managed browser services |
| Cookie/session behavior | Missing cookies, no history | Full session management |
| Navigation pattern | Direct URL hits, no referrer | Simulate natural browsing |
| Mouse/keyboard events | No interaction telemetry | Event simulation (Playwright) |
| Header consistency | Mismatched headers vs UA | Header sets that match |
```yaml
proxy_strategy:
  # Tier 1: Free/Datacenter (for non-protected sites)
  basic:
    type: "datacenter"
    cost: "$1-5/GB"
    success_rate: "60-80%"
    use_for: "APIs, small sites, no anti-bot"

  # Tier 2: Residential (for most protected sites)
  standard:
    type: "residential"
    cost: "$5-15/GB"
    success_rate: "90-95%"
    use_for: "Cloudflare, major platforms"
    rotation: "per-request or sticky 10min"

  # Tier 3: Mobile/ISP (for maximum stealth)
  premium:
    type: "mobile"
    cost: "$15-30/GB"
    success_rate: "95-99%"
    use_for: "Aggressive anti-bot, social media"

  rules:
    - Start with cheapest tier, escalate only on blocks
    - Match proxy geo to target audience geo
    - Rotate on 403/429, not every request
    - Use sticky sessions for multi-page scrapes
    - Monitor proxy health — remove slow/blocked IPs
```
```python
# Essential stealth for Playwright
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        args=[
            '--disable-blink-features=AutomationControlled',
            '--disable-features=IsolateOrigins,site-per-process',
        ],
    )
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        locale="en-US",
        timezone_id="America/New_York",
        geolocation={"latitude": 40.7128, "longitude": -74.0060},
        permissions=["geolocation"],
    )

    # Remove automation indicators
    page = context.new_page()
    page.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
        Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3]});
    """)
```
```
Cloudflare detected?
├── JS Challenge only   → Playwright with stealth + residential proxy
├── Turnstile CAPTCHA   → Managed service (ScrapingBee/Bright Data)
├── Under Attack Mode   → Wait, try later, or managed service
└── WAF blocking        → Different approach needed
    ├── Check for API endpoints (network tab)
    ├── Check for mobile app API
    └── Consider if data is available elsewhere
```
```python
# Validation pattern — validate BEFORE storing
from dataclasses import dataclass, field
from typing import Optional
from datetime import datetime

@dataclass
class ScrapedProduct:
    url: str
    title: str
    price: Optional[float]
    currency: str = "USD"
    scraped_at: str = field(default_factory=lambda: datetime.utcnow().isoformat())

    def validate(self) -> list[str]:
        errors = []
        if not self.url.startswith('http'):
            errors.append("Invalid URL")
        if not self.title or len(self.title) < 3:
            errors.append("Title too short or missing")
        if self.price is not None and self.price < 0:
            errors.append("Negative price")
        if self.price is not None and self.price > 1_000_000:
            errors.append("Price suspiciously high — verify")
        if self.currency not in ("USD", "EUR", "GBP", "BTC"):
            errors.append(f"Unknown currency: {self.currency}")
        return errors
```
| Method | When to Use | Implementation |
|---|---|---|
| URL-based | Pages with unique URLs | Hash the canonical URL |
| Content hash | Same URL, changing content | MD5/SHA256 of key fields |
| Fuzzy matching | Near-duplicate detection | Jaccard similarity > 0.85 |
| Composite key | Multi-field uniqueness | Hash(domain + product_id + variant) |

```python
import hashlib

def dedup_key(item: dict, fields: list[str]) -> str:
    """Generate dedup key from selected fields."""
    values = "|".join(str(item.get(f, "")) for f in fields)
    return hashlib.sha256(values.encode()).hexdigest()

# Usage
seen = set()
clean_items = []
for item in scraped_items:
    key = dedup_key(item, ["url", "product_id"])
    if key not in seen:
        seen.add(key)
        clean_items.append(item)
```
```
Raw HTML → Parse → Extract → Validate → Clean → Deduplicate → Store
                                 ↓
                      Quarantine (failed validation)
```

Common cleaning operations:

| Problem | Solution |
|---|---|
| HTML entities (`&amp;` etc.) | `html.unescape()` |
| Extra whitespace | `" ".join(text.split())` |
| Unicode issues | `unicodedata.normalize('NFKD', text)` |
| Price in text ("$49.99") | Regex: `r'[\$£€]?([\d,]+\.?\d*)'` |
| Date formats vary | `dateutil.parser.parse()` with dayfirst flag |
| Relative URLs | `urllib.parse.urljoin(base, relative)` |
| Encoding issues | `chardet.detect()` then decode |
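A few cleaning helpers matching the table above (the example strings in the comments are illustrative):

```python
import html
import re
import unicodedata
from urllib.parse import urljoin

def clean_text(raw: str) -> str:
    """Unescape entities, normalize unicode, collapse whitespace."""
    text = html.unescape(raw)
    text = unicodedata.normalize("NFKD", text)
    return " ".join(text.split())

def parse_price(raw: str) -> float | None:
    """Pull a numeric price out of text like '$1,249.99'."""
    match = re.search(r"[\$£€]?\s*([\d,]+\.?\d*)", raw)
    return float(match.group(1).replace(",", "")) if match else None

def absolutize(base_url: str, href: str) -> str:
    """Resolve a relative link against the page URL."""
    return urljoin(base_url, href)

# clean_text("Fish &amp; Chips\u00a0 special")  -> "Fish & Chips special"
# parse_price("Now only $1,249.99!")            -> 1249.99
```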
| Volume | Frequency | Query Needs | Recommendation |
|---|---|---|---|
| <10K records | One-time | None | JSON/CSV files |
| <10K records | Recurring | Simple lookups | SQLite |
| 10K-1M records | Recurring | Complex queries | PostgreSQL |
| 1M+ records | Continuous | Analytics | PostgreSQL + partitioning |
| Append-only logs | Continuous | Time-series | ClickHouse / TimescaleDB |
```python
import sqlite3
import json

def init_db(path="scraper_data.db"):
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS items (
            id INTEGER PRIMARY KEY,
            url TEXT UNIQUE,
            data JSON NOT NULL,
            scraped_at TEXT DEFAULT (datetime('now')),
            updated_at TEXT,
            checksum TEXT
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_url ON items(url)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_scraped ON items(scraped_at)")
    return conn

def upsert(conn, url, data, checksum):
    conn.execute("""
        INSERT INTO items (url, data, checksum)
        VALUES (?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
            data = excluded.data,
            updated_at = datetime('now'),
            checksum = excluded.checksum
        WHERE items.checksum != excluded.checksum
    """, (url, json.dumps(data), checksum))
    conn.commit()
```
```python
import csv
import json

# CSV export
def to_csv(items, path, fields):
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(items)

# JSON Lines (best for large datasets — streaming)
def to_jsonl(items, path):
    with open(path, 'w') as f:
        for item in items:
            f.write(json.dumps(item) + '\n')

# Incremental export (only new/changed since last export)
def export_since(conn, last_export_time):
    cursor = conn.execute(
        "SELECT data FROM items WHERE scraped_at > ? OR updated_at > ?",
        (last_export_time, last_export_time)
    )
    return [json.loads(row[0]) for row in cursor]
```
| HTTP Code | Meaning | Action |
|---|---|---|
| 200 | Success | Process normally |
| 301/302 | Redirect | Follow (max 5 hops) |
| 403 | Forbidden/blocked | Rotate proxy, slow down |
| 404 | Not found | Log, skip, mark URL dead |
| 429 | Rate limited | Respect Retry-After, back off 2x |
| 500-504 | Server error | Retry 3x with backoff |
| Connection timeout | Network issue | Retry with different proxy |
| SSL error | Certificate issue | Log, investigate, skip |
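A simplified dispatch sketch of this table; the returned action names are placeholders for whatever your scraper's control loop understands:

```python
import time
import requests

def handle_response(resp: requests.Response) -> str:
    """Map status codes to the actions in the table above (simplified)."""
    if resp.status_code == 200:
        return "process"
    if resp.status_code in (301, 302):
        return "follow_redirect"
    if resp.status_code == 403:
        return "rotate_proxy"
    if resp.status_code == 404:
        return "mark_dead"
    if resp.status_code == 429:
        retry_after = int(resp.headers.get("Retry-After", 60))
        time.sleep(retry_after)
        return "retry"
    if 500 <= resp.status_code < 600:
        return "retry_with_backoff"
    return "log_and_skip"
```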
```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=300):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.last_failure = 0
        self.state = "closed"  # closed | open | half-open

    def record_failure(self):
        self.failures += 1
        self.last_failure = time.time()
        if self.failures >= self.threshold:
            self.state = "open"
            # Alert: "Circuit open — too many failures"

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def can_proceed(self):
        if self.state == "closed":
            return True
        if self.state == "open":
            if time.time() - self.last_failure > self.reset_timeout:
                self.state = "half-open"
                return True  # Try one request
            return False
        return True  # half-open: allow attempt
```
```python
import json
from pathlib import Path

class Checkpointer:
    def __init__(self, path="checkpoint.json"):
        self.path = Path(path)
        self.state = self._load()

    def _load(self):
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {"completed_urls": [], "last_page": 0, "cursor": None}

    def save(self):
        self.path.write_text(json.dumps(self.state))

    def is_done(self, url):
        return url in self.state["completed_urls"]

    def mark_done(self, url):
        self.state["completed_urls"].append(url)
        if len(self.state["completed_urls"]) % 50 == 0:
            self.save()  # Periodic save
```
```yaml
dashboard:
  real_time:
    - metric: "requests_per_minute"
      alert_if: "> 60 for small sites"
    - metric: "success_rate"
      alert_if: "< 90%"
    - metric: "avg_response_time_ms"
      alert_if: "> 5000"
    - metric: "blocked_rate"
      alert_if: "> 10%"
  per_run:
    - metric: "pages_scraped"
    - metric: "items_extracted"
    - metric: "items_validated"
    - metric: "items_deduplicated"
    - metric: "new_items"
    - metric: "updated_items"
    - metric: "errors_by_type"
    - metric: "run_duration"
    - metric: "proxy_cost"
  weekly:
    - metric: "data_freshness"
      description: "% of records updated in last 7 days"
    - metric: "site_structure_changes"
      description: "Selectors that stopped matching"
    - metric: "total_cost"
      description: "Proxy + compute + storage"
```
Sites redesign. Selectors break. Detect it early:

```python
def health_check(results: list[dict], expected_fields: list[str]) -> dict:
    """Check if scraper is still extracting correctly."""
    total = len(results)
    if total == 0:
        return {"status": "CRITICAL", "message": "Zero results — likely broken"}

    field_coverage = {}
    for field in expected_fields:
        filled = sum(1 for r in results if r.get(field))
        coverage = filled / total
        field_coverage[field] = coverage

    issues = []
    for field, coverage in field_coverage.items():
        if coverage < 0.5:
            issues.append(f"{field}: {coverage:.0%} fill rate (expected >50%)")

    if issues:
        return {"status": "WARNING", "issues": issues}
    return {"status": "OK", "field_coverage": field_coverage}
```
Daily:
- Check success rate per target domain
- Review error logs for new patterns
- Verify data freshness

Weekly:
- Compare extraction counts vs baseline (>20% drop = investigate)
- Review proxy spend
- Spot-check 10 random records for accuracy

Monthly:
- Full selector validation against live pages
- Review legal compliance (robots.txt changes, ToS updates)
- Cost optimization review
- Prune dead URLs from queue
use_case: "Track competitor prices daily" tool: "requests + BeautifulSoup" schedule: "Daily at 03:00 UTC (off-peak)" targets: ["competitor-a.com/products", "competitor-b.com/api"] data: - product_id - product_name - price - currency - in_stock - scraped_at storage: "SQLite with price history" alerts: "Price change > 10% → notify"
use_case: "Aggregate job listings from multiple boards" tool: "Scrapy with per-site spiders" schedule: "Every 6 hours" targets: ["board-a.com", "board-b.com", "board-c.com"] data: - title - company - location - salary_range - posted_date - url - source dedup: "Hash(title + company + location)" storage: "PostgreSQL"
use_case: "Monitor industry news mentions" tool: "requests + RSS feeds (preferred) + web fallback" schedule: "Every 30 minutes" approach: 1: "RSS/Atom feeds (fastest, cleanest)" 2: "Google News RSS for topic" 3: "Direct scraping if no feed" data: - headline - source - url - published_at - snippet - sentiment alerts: "Keyword match → immediate notification"
use_case: "Track brand mentions and sentiment" tool: "Official APIs (always) + web search fallback" rules: - NEVER scrape social platforms directly — use APIs - Twitter/X: Official API ($100/mo basic) - Reddit: Official API (free tier available) - LinkedIn: No scraping (aggressive legal action) - Instagram: Official API only (Meta Business) fallback: "Brave/Google search for public mentions"
use_case: "Track property listings and prices" tool: "Playwright (most listing sites are JS-heavy)" schedule: "Daily" challenges: - Heavy JavaScript rendering - Anti-bot measures (Cloudflare common) - Frequent layout changes - Map-based results approach: "API endpoint discovery via network tab first"
```
Single machine (small scale):
├── asyncio + aiohttp (Python)       → 50-200 concurrent requests
├── Worker pool (ThreadPoolExecutor) → 10-50 threads
└── Scrapy reactor                   → Built-in concurrency

Multi-machine (large scale):
├── URL queue:   Redis / RabbitMQ / SQS
├── Workers:     Multiple Scrapy/custom workers
├── Results:     Shared PostgreSQL / S3
└── Coordinator: Celery / custom scheduler
```
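A minimal single-machine concurrency sketch with asyncio + aiohttp, assuming a plain URL list and no per-domain rate limiting (add that before pointing it at a real target):

```python
import asyncio
import aiohttp

async def fetch(session: aiohttp.ClientSession, sem: asyncio.Semaphore, url: str) -> str:
    """Fetch one URL, bounded by the shared semaphore."""
    async with sem:
        async with session.get(url) as resp:
            return await resp.text()

async def crawl(urls: list[str], concurrency: int = 50) -> list[str]:
    sem = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, sem, u) for u in urls))

# pages = asyncio.run(crawl(url_list, concurrency=50))
```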
| Lever | Impact | How |
|---|---|---|
| Static > Browser | 10-50x cheaper | Always try HTTP first |
| Block images/CSS/fonts | 60-80% bandwidth saved | Route filtering |
| Cache DNS | Minor but cumulative | Local DNS cache |
| Compress responses | 50-70% bandwidth | Accept-Encoding: gzip, br |
| Smart scheduling | Avoid redundant scrapes | Change detection before full re-scrape |
| Proxy tier matching | 3-10x cost difference | Don't use residential for easy sites |
Before building a scraper, check if the site has hidden API endpoints:

1. Open DevTools → Network tab
2. Filter by XHR/Fetch
3. Navigate the site, click load-more, filter/sort
4. Look for JSON responses — these are your goldmine
5. Most SPAs load data via REST/GraphQL APIs

Common hidden API patterns:

```
/api/v1/products?page=1&limit=20
/graphql with query parameters
/_next/data/... (Next.js data routes)
/wp-json/wp/v2/posts (WordPress)
```
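Once an endpoint like `/api/v1/products` shows up in the network tab, calling it directly usually beats rendering the page. A sketch, assuming a hypothetical `example.com` endpoint that paginates with `page`/`limit` and returns a `results` array:

```python
import requests

# Hypothetical endpoint discovered in the network tab — replace with the real one
API_URL = "https://example.com/api/v1/products"

session = requests.Session()
session.headers.update({
    "Accept": "application/json",
    # Some endpoints check this header to distinguish XHR from direct hits
    "X-Requested-With": "XMLHttpRequest",
})

page, items = 1, []
while True:
    resp = session.get(API_URL, params={"page": page, "limit": 20}, timeout=30)
    resp.raise_for_status()
    batch = resp.json().get("results", [])  # response field name is an assumption
    if not batch:
        break
    items.extend(batch)
    page += 1
```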
```python
# Minimize browser resource usage
context = browser.new_context(
    viewport={"width": 1280, "height": 720},
    java_script_enabled=True,  # Only if needed
    has_touch=False,
    is_mobile=False,
)

# Block resource types you don't need
page.route("**/*", lambda route: (
    route.abort()
    if route.request.resource_type in ["image", "stylesheet", "font", "media"]
    else route.continue_()
))
```
```python
# When authorized to scrape behind login
# ALWAYS use session-based auth, never store passwords in code
import os
import requests

# Pattern: Login once, reuse session
session = requests.Session()
login_resp = session.post("https://example.com/login", data={
    "username": os.environ["SCRAPE_USER"],
    "password": os.environ["SCRAPE_PASS"],
})
assert login_resp.ok, "Login failed"

# Session cookies are now stored — use for subsequent requests
data_resp = session.get("https://example.com/api/data")
```
```python
def has_changed(url, session, last_etag=None, last_modified=None):
    """Check if page changed without downloading full content."""
    headers = {}
    if last_etag:
        headers["If-None-Match"] = last_etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    resp = session.head(url, headers=headers)

    if resp.status_code == 304:
        return False, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    return True, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
```
| Dimension | Weight | What to Assess |
|---|---|---|
| Legal compliance | 20% | robots.txt, ToS, PII handling, audit trail |
| Data quality | 20% | Validation, accuracy, completeness, freshness |
| Resilience | 15% | Error handling, retries, circuit breakers, checkpointing |
| Anti-detection | 15% | Proxy rotation, fingerprint diversity, rate limiting |
| Architecture | 10% | Right tool selection, clean code, modularity |
| Monitoring | 10% | Success rates, breakage detection, alerting |
| Performance | 5% | Speed, cost efficiency, resource usage |
| Documentation | 5% | Runbook, schema docs, legal assessment |

Grading: 90+ Excellent | 75-89 Good | 60-74 Needs work | <60 Redesign
| # | Mistake | Fix |
|---|---|---|
| 1 | No robots.txt check | Always check first — it's your legal defense |
| 2 | Fixed delays (no jitter) | Add ±30% random jitter to all delays |
| 3 | No data validation | Validate every field before storing |
| 4 | Using browser for static HTML | HTTP client is 10-50x faster and cheaper |
| 5 | Single IP, no rotation | Proxy rotation for any serious scraping |
| 6 | No breakage detection | Monitor extraction counts and field fill rates |
| 7 | Storing raw HTML only | Extract + structure immediately |
| 8 | No checkpoint/resume | Long scrapes must be resumable |
| 9 | Ignoring structured data | JSON-LD/microdata is cleaner than CSS selectors |
| 10 | Scraping when API exists | Always check for API first |
**Single-page apps (React/Vue/Angular):** Must use browser rendering OR find the underlying API (network tab). Prefer API discovery — it's faster and more reliable.

**Infinite scroll:** Intercept the XHR/fetch calls that load more content. Simulate scrolling only as a last resort. The API endpoint usually accepts page or offset params.

**CAPTCHAs:** If you're hitting CAPTCHAs, you're scraping too aggressively. Slow down first. If CAPTCHAs persist: managed services (2Captcha, Anti-Captcha) or rethink the approach.

**Dynamic class names (CSS modules, Tailwind):** Use data attributes, ARIA labels, or text content selectors instead. `[data-testid="price"]` survives redesigns. `.sc-bdVTJa` does not.

**Multi-language sites:** Detect language via the `html[lang]` attribute. Set the Accept-Language header to get the desired locale. Watch for different URL structures (`/en/`, `/de/`, subdomains).
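For the infinite-scroll case, a sketch of capturing the underlying fetch responses with Playwright instead of parsing the DOM; the `/api/feed` path and page URL are placeholders for whatever the network tab shows:

```python
from playwright.sync_api import sync_playwright

batches = []

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/feed", wait_until="networkidle")

    # Scroll a few times; capture the JSON response each scroll triggers
    for _ in range(3):
        with page.expect_response(lambda r: "/api/feed" in r.url) as resp_info:
            page.mouse.wheel(0, 4000)
        batches.append(resp_info.value.json())

    browser.close()
```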
"Check if I can scrape [URL]" → Run compliance checklist (robots.txt, ToS, data type) "What tool should I use for [site]?" → Analyze site rendering, anti-bot, recommend tool "Build a scraper for [description]" → Full architecture brief + code pattern "My scraper is getting blocked" → Anti-detection diagnostic + proxy/stealth recommendations "Extract [data] from [URL]" → Check structured data first, then CSS selectors "Monitor [site] for changes" → Change detection + scheduling + alerting setup "How do I handle pagination on [site]?" → Identify pagination type + code pattern "Scrape at scale ([N] pages)" → Concurrency architecture + cost estimate "Clean and store this scraped data" → Validation + dedup + storage recommendation "Is my scraper healthy?" → Run health check + breakage detection "Find the API behind [site]" → Network tab mining guide + common patterns "Set up price monitoring for [competitors]" → Full e-commerce monitor pattern