
Web Scraping & Data Extraction Engine

Complete web scraping methodology — legal compliance, architecture design, anti-detection, data pipelines, and production operations. Use when building scrap...



Install for OpenClaw

Quick setup
  1. Download the package from Yavira.
  2. Extract the archive and review SKILL.md first.
  3. Import or place the package into your OpenClaw setup.

Requirements

Target platform
OpenClaw
Install method
Manual import
Extraction
Extract archive
Prerequisites
OpenClaw
Primary doc
SKILL.md

Package facts

Download mode
Yavira redirect
Package format
ZIP package
Source platform
Tencent SkillHub
What's included
README.md, SKILL.md

Validation

  • Use the Yavira download entry.
  • Review SKILL.md after the package is downloaded.
  • Confirm the extracted package contains the expected setup assets.

Install with your agent

Agent handoff

Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.

  1. Download the package from Yavira.
  2. Extract it into a folder your agent can access.
  3. Paste one of the prompts below and point your agent at the extracted folder.
New install

I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Then review README.md for any prerequisites, environment setup, or post-install checks. Tell me what you changed and call out any manual steps you could not complete.

Upgrade existing

I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Then review README.md for any prerequisites, environment setup, or post-install checks. Summarize what changed and any follow-up checks I should run.

Trust & source

Release facts

Source
Tencent SkillHub
Verification
Indexed source record
Version
1.0.0

Documentation

Primary doc: SKILL.md (45 sections)

Quick Health Check (Run First)

Score your scraping operation (2 points each):

| Signal | Healthy | Unhealthy |
|---|---|---|
| Legal compliance | robots.txt checked, ToS reviewed | Scraping blindly |
| Architecture | Tool matches site complexity | Using Puppeteer for static HTML |
| Anti-detection | Rotation, delays, fingerprint diversity | Single IP, no delays |
| Data quality | Validation + dedup pipeline | Raw dumps, no cleaning |
| Error handling | Retry logic, circuit breakers | Crashes on first 403 |
| Monitoring | Success rates tracked, alerts set | No visibility |
| Storage | Structured, deduplicated, versioned | Flat files, duplicates |
| Scheduling | Appropriate frequency, off-peak | Hammering during business hours |

Score: /16 → 12+: Production-ready | 8-11: Needs work | <8: Stop and redesign

Pre-Scrape Compliance Checklist

```yaml
compliance_brief:
  target_domain: ""
  date_assessed: ""
  robots_txt:
    checked: false
    target_paths_allowed: false
    crawl_delay_specified: ""
    ai_bot_rules: ""        # Many sites now block AI crawlers specifically
  terms_of_service:
    reviewed: false
    scraping_mentioned: false
    scraping_prohibited: false
  api_available: false
  api_sufficient: false
  data_classification:
    type: ""                # public-factual | public-personal | behind-auth | copyrighted
    contains_pii: false
    pii_types: []           # name, email, phone, address, photo
    gdpr_applies: false     # EU residents' data
    ccpa_applies: false     # California residents' data
  legal_risk: ""            # low | medium | high | do-not-scrape
  decision: ""              # proceed | use-api | request-permission | abandon
  justification: ""
```

Legal Landscape Quick Reference

| Scenario | Risk Level | Key Case Law |
|---|---|---|
| Public data, no login, robots.txt allows | LOW | hiQ v. LinkedIn (2022) |
| Public data, robots.txt disallows | MEDIUM | Meta v. Bright Data (2024) |
| Behind authentication | HIGH | Van Buren v. US (2021), CFAA |
| Personal data without consent | HIGH | GDPR Art. 6, CCPA §1798.100 |
| Republishing copyrighted content | HIGH | Copyright Act §106 |
| Price/product comparison | LOW | eBay v. Bidder's Edge (fair use) |
| Academic/research use | LOW-MEDIUM | Varies by jurisdiction |
| Bypassing anti-bot measures | HIGH | CFAA "exceeds authorized access" |

Decision Rules

- API exists and covers your needs? → Use the API. Always.
- robots.txt disallows your target? → Respect it unless you have written permission.
- Data behind login? → Do not scrape without explicit authorization.
- Contains PII? → GDPR/CCPA compliance required before collection.
- Copyrighted content? → Extract facts/data points only, never full content.
- Site explicitly prohibits scraping? → Request permission or find an alternative source.
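
The robots.txt rules above can be checked programmatically with the standard library's parser. A minimal sketch, using made-up robots.txt content and URLs:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content — a real check fetches https://site/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def may_fetch(url: str, user_agent: str = "*") -> bool:
    """Return True only if robots.txt permits fetching this URL."""
    return rp.can_fetch(user_agent, url)

allowed = may_fetch("https://example.com/products")     # path not disallowed
blocked = may_fetch("https://example.com/private/x")    # under Disallow: /private/
delay = rp.crawl_delay("*")                             # honor Crawl-delay before scraping
```

Run this gate before every new target; refusing to proceed when `may_fetch` is False is the cheapest compliance control you have.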

AI Crawler Considerations (2025+)

Many sites now specifically block AI-related crawlers:

```
# Common AI bot blocks in robots.txt
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: Google-Extended
User-agent: CCBot
User-agent: anthropic-ai
User-agent: ClaudeBot
User-agent: Bytespider
User-agent: PerplexityBot
```

Rule: If collecting data for AI training, check for these specific blocks.

Tool Selection Matrix

| Tool/Approach | Best For | Speed | JS Support | Complexity | Cost |
|---|---|---|---|---|---|
| HTTP client (requests/axios) | Static HTML, APIs | ⚡⚡⚡ | ❌ | Low | Free |
| Beautiful Soup / Cheerio | Static HTML parsing | ⚡⚡⚡ | ❌ | Low | Free |
| Scrapy | Large-scale structured crawling | ⚡⚡⚡ | Plugin | Medium | Free |
| Playwright / Puppeteer | JS-rendered, SPAs, interactions | ⚡ | ✅ | Medium | Free |
| Selenium | Legacy, browser automation | ⚡ | ✅ | High | Free |
| Crawlee | Hybrid (HTTP + browser fallback) | ⚡⚡ | ✅ | Medium | Free |
| Firecrawl / ScrapingBee | Managed, anti-bot bypass | ⚡⚡ | ✅ | Low | Paid |
| Bright Data / Oxylabs | Enterprise, proxy + browser | ⚡⚡ | ✅ | Low | Paid |

Decision Tree

```
Is the content in the initial HTML source?
├── YES → Is the site structure consistent?
│   ├── YES → Static scraper (requests + BeautifulSoup/Cheerio)
│   └── NO → Scrapy with custom parsers
└── NO → Does the page require user interaction?
    ├── YES → Playwright/Puppeteer with interaction scripts
    └── NO → Playwright in non-interactive mode
        ├── At scale (>10K pages)? → Crawlee (hybrid mode)
        └── Heavy anti-bot? → Managed service (Firecrawl/ScrapingBee)
```

Architecture Brief YAML

```yaml
scraping_project:
  name: ""
  objective: ""             # What data, why, how often
  targets:
    - domain: ""
      pages_estimated: 0
      rendering: "static"   # static | javascript | spa
      anti_bot: "none"      # none | basic | cloudflare | advanced
      rate_limit: ""        # requests per second safe limit
  tool_selected: ""
  justification: ""
  data_schema:
    fields: []
    output_format: ""       # json | csv | database
  schedule:
    frequency: ""           # once | hourly | daily | weekly
    preferred_time: ""      # off-peak for target timezone
  infrastructure:
    proxy_needed: false
    proxy_type: ""          # residential | datacenter | mobile
    storage: ""
    monitoring: ""
```

HTTP Request Best Practices

```python
# Python example — production request pattern
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

# Retry strategy
retry = Retry(
    total=3,
    backoff_factor=1,  # 1s, 2s, 4s
    status_forcelist=[429, 500, 502, 503, 504],
    respect_retry_after_header=True,
)
session.mount("https://", HTTPAdapter(max_retries=retry))

# Realistic headers
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Cache-Control": "no-cache",
})
```

Header Rotation Strategy

Rotate these to avoid fingerprinting:

| Header | Rotation Pool Size | Notes |
|---|---|---|
| User-Agent | 20-50 real browser UAs | Match OS distribution |
| Accept-Language | 5-10 locale combos | Match proxy geo |
| Sec-Ch-Ua | Match User-Agent | Chrome/Edge/Brave |
| Referer | Vary per request | Previous page or search engine |
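
A minimal rotation sketch following the table's guidance. The pools below are tiny illustrative placeholders; a real deployment uses the 20-50 UAs the table suggests, with matching Sec-Ch-Ua values:

```python
import random

# Placeholder pools — replace with real, current browser header sets
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
]
ACCEPT_LANGUAGES = ["en-US,en;q=0.9", "en-GB,en;q=0.8"]

def rotated_headers() -> dict:
    """Pick a fresh header combination per request."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": random.choice(ACCEPT_LANGUAGES),
    }
```

Pass the result to each request (`session.get(url, headers=rotated_headers())`); keep pools consistent so combinations stay plausible.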

Rate Limiting Rules

| Site Type | Safe Delay | Aggressive (risky) |
|---|---|---|
| Small business site | 5-10 seconds | 2-3 seconds |
| Medium site | 2-5 seconds | 1-2 seconds |
| Large platform (Amazon, etc.) | 3-5 seconds | 1 second |
| API endpoint | Per API docs | Never exceed |
| robots.txt crawl-delay | Respect exactly | Never below |

Rules:
- Always respect Crawl-delay in robots.txt
- Add random jitter (±30%) to avoid pattern detection
- Slow down during business hours for smaller sites
- Respect Retry-After headers — they mean it
- Watch for 429s — back off exponentially (2x each time)
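
The jitter and exponential-backoff rules can be sketched in a few lines. The function names here are ours, not from any library:

```python
import random
import time

def polite_delay(base_seconds: float, jitter: float = 0.30) -> float:
    """Sleep for base ±30% jitter (the rule above); returns the delay used."""
    delay = base_seconds * random.uniform(1 - jitter, 1 + jitter)
    time.sleep(delay)
    return delay

def backoff_seconds(attempt: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Exponential backoff for 429s: double per consecutive failure, capped."""
    return min(cap, base * (2 ** attempt))
```

Call `polite_delay` between requests with the safe delay from the table, and `backoff_seconds(attempt)` after each consecutive 429 (unless a Retry-After header gives an explicit wait).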

CSS Selector Strategy (Priority Order)

1. Data attributes → `[data-product-id]`, `[data-price]` (most stable)
2. Semantic IDs → `#product-title`, `#price` (stable but can change)
3. ARIA attributes → `[aria-label="Price"]` (accessibility-driven, fairly stable)
4. Semantic HTML → `article`, `main`, `nav` (structural, stable)
5. Class names → `.product-card` (can change with redesigns)
6. XPath position → `//div[3]/span[2]` (FRAGILE — last resort)
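
One way to encode this priority order is a fallback chain that tries selectors from most to least stable. A sketch assuming a BeautifulSoup-style `select_one` callable; the helper name is ours:

```python
def select_with_fallback(select_one, selectors):
    """Try selectors in stability order (best first).

    select_one: callable like soup.select_one
    selectors:  ordered list, e.g. ['[data-price]', '#price', '.price']
    Returns (matched_selector, node) or (None, None).
    """
    for css in selectors:
        node = select_one(css)
        if node is not None:
            return css, node
    return None, None
```

Logging which selector actually matched also doubles as breakage telemetry: when extraction starts falling through to the fragile selectors, the page has probably changed.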

Extraction Patterns

Structured data first — check before writing CSS selectors:

```python
# 1. Check JSON-LD (best source — structured, clean)
import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
for script in soup.find_all('script', type='application/ld+json'):
    data = json.loads(script.string)
    # Often contains: Product, Article, Organization, etc.

# 2. Check Open Graph meta tags
og_title = soup.find('meta', property='og:title')
og_price = soup.find('meta', property='product:price:amount')

# 3. Check microdata
items = soup.find_all(itemtype=True)

# 4. Fall back to CSS selectors only if the above are empty
```

Table extraction pattern:

```python
import pandas as pd

# Quick table extraction
tables = pd.read_html(html)  # Returns list of DataFrames

# For complex tables with merged cells
def extract_table(soup, selector):
    table = soup.select_one(selector)
    headers = [th.get_text(strip=True) for th in table.select('thead th')]
    rows = []
    for tr in table.select('tbody tr'):
        cells = [td.get_text(strip=True) for td in tr.select('td')]
        rows.append(dict(zip(headers, cells)))
    return rows
```

Pagination handling:

```python
# Pattern 1: Next button
while True:
    # ... scrape current page ...
    next_link = soup.select_one('a.next-page, [rel="next"], .pagination .next a')
    if not next_link or not next_link.get('href'):
        break
    url = urljoin(base_url, next_link['href'])

# Pattern 2: API pagination (infinite scroll sites)
page = 1
while True:
    resp = session.get(f"{api_url}?page={page}&limit=50")
    data = resp.json()
    if not data.get('results'):
        break
    # ... process results ...
    page += 1

# Pattern 3: Cursor-based
cursor = None
while True:
    params = {"limit": 50}
    if cursor:
        params["cursor"] = cursor
    resp = session.get(api_url, params=params)
    data = resp.json()
    # ... process ...
    cursor = data.get('next_cursor')
    if not cursor:
        break
```

JavaScript-Rendered Content

```python
# Playwright pattern for JS-rendered pages
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent="Mozilla/5.0 ...",
    )
    page = context.new_page()

    # Block unnecessary resources (speed + stealth)
    page.route("**/*.{png,jpg,jpeg,gif,svg,woff,woff2}", lambda route: route.abort())

    page.goto(url, wait_until="networkidle")

    # Wait for specific content (better than arbitrary sleep)
    page.wait_for_selector('[data-product-id]', timeout=10000)

    # Extract after JS rendering
    content = page.content()
    # ... parse with BeautifulSoup/Cheerio ...
    browser.close()
```

Detection Signals (What Sites Check)

| Signal | Detection Method | Mitigation |
|---|---|---|
| IP reputation | IP blacklists, datacenter ranges | Residential proxies |
| Request rate | Requests/min from same IP | Rate limiting + jitter |
| TLS fingerprint | JA3/JA4 hash matching | Use real browser or curl-impersonate |
| Browser fingerprint | Canvas, WebGL, fonts | Playwright with stealth plugin |
| JavaScript challenges | Cloudflare Turnstile, hCaptcha | Managed browser services |
| Cookie/session behavior | Missing cookies, no history | Full session management |
| Navigation pattern | Direct URL hits, no referrer | Simulate natural browsing |
| Mouse/keyboard events | No interaction telemetry | Event simulation (Playwright) |
| Header consistency | Mismatched headers vs UA | Header sets that match |

Proxy Strategy

```yaml
proxy_strategy:
  # Tier 1: Free/Datacenter (for non-protected sites)
  basic:
    type: "datacenter"
    cost: "$1-5/GB"
    success_rate: "60-80%"
    use_for: "APIs, small sites, no anti-bot"
  # Tier 2: Residential (for most protected sites)
  standard:
    type: "residential"
    cost: "$5-15/GB"
    success_rate: "90-95%"
    use_for: "Cloudflare, major platforms"
    rotation: "per-request or sticky 10min"
  # Tier 3: Mobile/ISP (for maximum stealth)
  premium:
    type: "mobile"
    cost: "$15-30/GB"
    success_rate: "95-99%"
    use_for: "Aggressive anti-bot, social media"

rules:
  - Start with cheapest tier, escalate only on blocks
  - Match proxy geo to target audience geo
  - Rotate on 403/429, not every request
  - Use sticky sessions for multi-page scrapes
  - Monitor proxy health — remove slow/blocked IPs
```

Playwright Stealth Configuration

```python
# Essential stealth for Playwright
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        args=[
            '--disable-blink-features=AutomationControlled',
            '--disable-features=IsolateOrigins,site-per-process',
        ],
    )
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        locale="en-US",
        timezone_id="America/New_York",
        geolocation={"latitude": 40.7128, "longitude": -74.0060},
        permissions=["geolocation"],
    )

    # Remove automation indicators
    page = context.new_page()
    page.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
        Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3]});
    """)
```

Cloudflare Bypass Decision

```
Cloudflare detected?
├── JS Challenge only → Playwright with stealth + residential proxy
├── Turnstile CAPTCHA → Managed service (ScrapingBee/Bright Data)
├── Under Attack Mode → Wait, try later, or managed service
└── WAF blocking → Different approach needed
    ├── Check for API endpoints (network tab)
    ├── Check for mobile app API
    └── Consider if data is available elsewhere
```

Data Validation Rules

```python
# Validation pattern — validate BEFORE storing
from dataclasses import dataclass, field
from typing import Optional
from datetime import datetime

@dataclass
class ScrapedProduct:
    url: str
    title: str
    price: Optional[float]
    currency: str = "USD"
    scraped_at: str = field(default_factory=lambda: datetime.utcnow().isoformat())

    def validate(self) -> list[str]:
        errors = []
        if not self.url.startswith('http'):
            errors.append("Invalid URL")
        if not self.title or len(self.title) < 3:
            errors.append("Title too short or missing")
        if self.price is not None and self.price < 0:
            errors.append("Negative price")
        if self.price is not None and self.price > 1_000_000:
            errors.append("Price suspiciously high — verify")
        if self.currency not in ("USD", "EUR", "GBP", "BTC"):
            errors.append(f"Unknown currency: {self.currency}")
        return errors
```

Deduplication Strategy

| Method | When to Use | Implementation |
|---|---|---|
| URL-based | Pages with unique URLs | Hash the canonical URL |
| Content hash | Same URL, changing content | MD5/SHA256 of key fields |
| Fuzzy matching | Near-duplicate detection | Jaccard similarity > 0.85 |
| Composite key | Multi-field uniqueness | Hash(domain + product_id + variant) |

```python
import hashlib

def dedup_key(item: dict, fields: list[str]) -> str:
    """Generate dedup key from selected fields."""
    values = "|".join(str(item.get(f, "")) for f in fields)
    return hashlib.sha256(values.encode()).hexdigest()

# Usage
seen = set()
clean_items = []
for item in scraped_items:
    key = dedup_key(item, ["url", "product_id"])
    if key not in seen:
        seen.add(key)
        clean_items.append(item)
```

Data Cleaning Pipeline

```
Raw HTML → Parse → Extract → Validate → Clean → Deduplicate → Store
                                 ↓
                      Quarantine (failed validation)
```

Common cleaning operations:

| Problem | Solution |
|---|---|
| HTML entities (`&amp;`) | `html.unescape()` |
| Extra whitespace | `" ".join(text.split())` |
| Unicode issues | `unicodedata.normalize('NFKD', text)` |
| Price in text ("$49.99") | Regex: `r'[\$£€]?([\d,]+\.?\d*)'` |
| Date formats vary | `dateutil.parser.parse()` with dayfirst flag |
| Relative URLs | `urllib.parse.urljoin(base, relative)` |
| Encoding issues | `chardet.detect()` then decode |
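
Several of the cleaning operations in the table combine into a small standard-library sketch. The price regex is the one from the table, with a guard added for non-numeric matches; the function names are ours:

```python
import html
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Unescape entities, normalize unicode, collapse whitespace."""
    text = html.unescape(raw)
    text = unicodedata.normalize("NFKD", text)
    return " ".join(text.split())

PRICE_RE = re.compile(r'[\$£€]?([\d,]+\.?\d*)')

def parse_price(text: str):
    """Extract a numeric price from free text, or None if absent."""
    m = PRICE_RE.search(text)
    if not m:
        return None
    digits = m.group(1).replace(",", "")
    return float(digits) if digits and digits != "." else None
```

Run these in the Clean stage, after validation has quarantined records that are structurally broken.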

Storage Decision Guide

| Volume | Frequency | Query Needs | Recommendation |
|---|---|---|---|
| <10K records | One-time | None | JSON/CSV files |
| <10K records | Recurring | Simple lookups | SQLite |
| 10K-1M records | Recurring | Complex queries | PostgreSQL |
| 1M+ records | Continuous | Analytics | PostgreSQL + partitioning |
| Append-only logs | Continuous | Time-series | ClickHouse / TimescaleDB |

SQLite Pattern (Most Common)

```python
import sqlite3
import json

def init_db(path="scraper_data.db"):
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS items (
            id INTEGER PRIMARY KEY,
            url TEXT UNIQUE,
            data JSON NOT NULL,
            scraped_at TEXT DEFAULT (datetime('now')),
            updated_at TEXT,
            checksum TEXT
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_url ON items(url)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_scraped ON items(scraped_at)")
    return conn

def upsert(conn, url, data, checksum):
    conn.execute("""
        INSERT INTO items (url, data, checksum) VALUES (?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
            data = excluded.data,
            updated_at = datetime('now'),
            checksum = excluded.checksum
        WHERE items.checksum != excluded.checksum
    """, (url, json.dumps(data), checksum))
    conn.commit()
```

Export Formats

```python
import csv
import json

# CSV export
def to_csv(items, path, fields):
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(items)

# JSON Lines (best for large datasets — streaming)
def to_jsonl(items, path):
    with open(path, 'w') as f:
        for item in items:
            f.write(json.dumps(item) + '\n')

# Incremental export (only new/changed since last export)
def export_since(conn, last_export_time):
    cursor = conn.execute(
        "SELECT data FROM items WHERE scraped_at > ? OR updated_at > ?",
        (last_export_time, last_export_time)
    )
    return [json.loads(row[0]) for row in cursor]
```

Error Classification

| HTTP Code | Meaning | Action |
|---|---|---|
| 200 | Success | Process normally |
| 301/302 | Redirect | Follow (max 5 hops) |
| 403 | Forbidden/blocked | Rotate proxy, slow down |
| 404 | Not found | Log, skip, mark URL dead |
| 429 | Rate limited | Respect Retry-After, back off 2x |
| 500-504 | Server error | Retry 3x with backoff |
| Connection timeout | Network issue | Retry with different proxy |
| SSL error | Certificate issue | Log, investigate, skip |
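
The status-code rows of the table map naturally onto a small dispatcher. A sketch; the action names are ours:

```python
def classify_response(status: int) -> str:
    """Map an HTTP status code to a scraper action, per the table above."""
    if status == 200:
        return "process"
    if status in (301, 302):
        return "follow_redirect"   # cap at 5 hops in the caller
    if status == 403:
        return "rotate_proxy"
    if status == 404:
        return "mark_dead"
    if status == 429:
        return "backoff"           # honor Retry-After if present
    if 500 <= status <= 504:
        return "retry"
    return "log_and_skip"
```

Centralizing this mapping keeps retry/rotation policy in one place instead of scattered across request call sites.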

Circuit Breaker Pattern

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=300):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.last_failure = 0
        self.state = "closed"  # closed | open | half-open

    def record_failure(self):
        self.failures += 1
        self.last_failure = time.time()
        if self.failures >= self.threshold:
            self.state = "open"
            # Alert: "Circuit open — too many failures"

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def can_proceed(self):
        if self.state == "closed":
            return True
        if self.state == "open":
            if time.time() - self.last_failure > self.reset_timeout:
                self.state = "half-open"
                return True  # Try one request
            return False
        return True  # half-open: allow attempt
```

Checkpoint & Resume

```python
import json
from pathlib import Path

class Checkpointer:
    def __init__(self, path="checkpoint.json"):
        self.path = Path(path)
        self.state = self._load()

    def _load(self):
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {"completed_urls": [], "last_page": 0, "cursor": None}

    def save(self):
        self.path.write_text(json.dumps(self.state))

    def is_done(self, url):
        return url in self.state["completed_urls"]

    def mark_done(self, url):
        self.state["completed_urls"].append(url)
        if len(self.state["completed_urls"]) % 50 == 0:
            self.save()  # Periodic save
```

Scraper Health Dashboard

```yaml
dashboard:
  real_time:
    - metric: "requests_per_minute"
      alert_if: "> 60 for small sites"
    - metric: "success_rate"
      alert_if: "< 90%"
    - metric: "avg_response_time_ms"
      alert_if: "> 5000"
    - metric: "blocked_rate"
      alert_if: "> 10%"
  per_run:
    - metric: "pages_scraped"
    - metric: "items_extracted"
    - metric: "items_validated"
    - metric: "items_deduplicated"
    - metric: "new_items"
    - metric: "updated_items"
    - metric: "errors_by_type"
    - metric: "run_duration"
    - metric: "proxy_cost"
  weekly:
    - metric: "data_freshness"
      description: "% of records updated in last 7 days"
    - metric: "site_structure_changes"
      description: "Selectors that stopped matching"
    - metric: "total_cost"
      description: "Proxy + compute + storage"
```

Breakage Detection

Sites redesign. Selectors break. Detect it early:

```python
def health_check(results: list[dict], expected_fields: list[str]) -> dict:
    """Check if the scraper is still extracting correctly."""
    total = len(results)
    if total == 0:
        return {"status": "CRITICAL", "message": "Zero results — likely broken"}

    field_coverage = {}
    for field in expected_fields:
        filled = sum(1 for r in results if r.get(field))
        field_coverage[field] = filled / total

    issues = []
    for field, coverage in field_coverage.items():
        if coverage < 0.5:
            issues.append(f"{field}: {coverage:.0%} fill rate (expected >50%)")

    if issues:
        return {"status": "WARNING", "issues": issues}
    return {"status": "OK", "field_coverage": field_coverage}
```

Operational Runbook

Daily:
- Check success rate per target domain
- Review error logs for new patterns
- Verify data freshness

Weekly:
- Compare extraction counts vs baseline (>20% drop = investigate)
- Review proxy spend
- Spot-check 10 random records for accuracy

Monthly:
- Full selector validation against live pages
- Review legal compliance (robots.txt changes, ToS updates)
- Cost optimization review
- Prune dead URLs from queue
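
The weekly baseline comparison (>20% drop = investigate) is easy to codify. A sketch; the function name is ours:

```python
def needs_investigation(current: int, baseline: int, threshold: float = 0.20) -> bool:
    """Runbook check: flag when extraction count drops >20% below baseline."""
    if baseline <= 0:
        return False  # no baseline yet, nothing to compare against
    return (baseline - current) / baseline > threshold
```

Feed it the `items_extracted` metric from the per-run dashboard against a rolling weekly average.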

Pattern 1: E-commerce Price Monitor

```yaml
use_case: "Track competitor prices daily"
tool: "requests + BeautifulSoup"
schedule: "Daily at 03:00 UTC (off-peak)"
targets: ["competitor-a.com/products", "competitor-b.com/api"]
data:
  - product_id
  - product_name
  - price
  - currency
  - in_stock
  - scraped_at
storage: "SQLite with price history"
alerts: "Price change > 10% → notify"
```

Pattern 2: Job Board Aggregator

```yaml
use_case: "Aggregate job listings from multiple boards"
tool: "Scrapy with per-site spiders"
schedule: "Every 6 hours"
targets: ["board-a.com", "board-b.com", "board-c.com"]
data:
  - title
  - company
  - location
  - salary_range
  - posted_date
  - url
  - source
dedup: "Hash(title + company + location)"
storage: "PostgreSQL"
```

Pattern 3: News & Content Monitor

```yaml
use_case: "Monitor industry news mentions"
tool: "requests + RSS feeds (preferred) + web fallback"
schedule: "Every 30 minutes"
approach:
  1: "RSS/Atom feeds (fastest, cleanest)"
  2: "Google News RSS for topic"
  3: "Direct scraping if no feed"
data:
  - headline
  - source
  - url
  - published_at
  - snippet
  - sentiment
alerts: "Keyword match → immediate notification"
```

Pattern 4: Social Media Intelligence

```yaml
use_case: "Track brand mentions and sentiment"
tool: "Official APIs (always) + web search fallback"
rules:
  - "NEVER scrape social platforms directly — use APIs"
  - "Twitter/X: Official API ($100/mo basic)"
  - "Reddit: Official API (free tier available)"
  - "LinkedIn: No scraping (aggressive legal action)"
  - "Instagram: Official API only (Meta Business)"
fallback: "Brave/Google search for public mentions"
```

Pattern 5: Real Estate Listings

```yaml
use_case: "Track property listings and prices"
tool: "Playwright (most listing sites are JS-heavy)"
schedule: "Daily"
challenges:
  - Heavy JavaScript rendering
  - Anti-bot measures (Cloudflare common)
  - Frequent layout changes
  - Map-based results
approach: "API endpoint discovery via network tab first"
```

Concurrency Architecture

```
Single machine (small scale):
├── asyncio + aiohttp (Python) → 50-200 concurrent requests
├── Worker pool (ThreadPoolExecutor) → 10-50 threads
└── Scrapy reactor → Built-in concurrency

Multi-machine (large scale):
├── URL queue: Redis / RabbitMQ / SQS
├── Workers: Multiple Scrapy/custom workers
├── Results: Shared PostgreSQL / S3
└── Coordinator: Celery / custom scheduler
```

Cost Optimization

| Lever | Impact | How |
|---|---|---|
| Static > Browser | 10-50x cheaper | Always try HTTP first |
| Block images/CSS/fonts | 60-80% bandwidth saved | Route filtering |
| Cache DNS | Minor but cumulative | Local DNS cache |
| Compress responses | 50-70% bandwidth | Accept-Encoding: gzip, br |
| Smart scheduling | Avoid redundant scrapes | Change detection before full re-scrape |
| Proxy tier matching | 3-10x cost difference | Don't use residential for easy sites |

API Discovery (Network Tab Mining)

Before building a scraper, check if the site has hidden API endpoints:

1. Open DevTools → Network tab
2. Filter by XHR/Fetch
3. Navigate the site, click load-more, filter/sort
4. Look for JSON responses — these are your goldmine
5. Most SPAs load data via REST/GraphQL APIs

Common hidden API patterns:

```
/api/v1/products?page=1&limit=20
/graphql with query parameters
/_next/data/... (Next.js data routes)
/wp-json/wp/v2/posts (WordPress)
```
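
A quick way to triage URLs captured from the network tab is to match them against these common patterns. A sketch; the hint list is illustrative, not exhaustive:

```python
import re

# Regexes derived from the common hidden-API patterns listed above
API_HINTS = [
    re.compile(r"/api/v\d+/"),      # versioned REST routes
    re.compile(r"/graphql\b"),      # GraphQL endpoints
    re.compile(r"/_next/data/"),    # Next.js data routes
    re.compile(r"/wp-json/"),       # WordPress REST API
]

def looks_like_api(url: str) -> bool:
    """True if the URL matches a known hidden-API pattern."""
    return any(p.search(url) for p in API_HINTS)
```

Pipe an exported HAR file's request URLs through this filter to shortlist candidate endpoints before inspecting their JSON responses by hand.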

Headless Browser Optimization

```python
# Minimize browser resource usage
context = browser.new_context(
    viewport={"width": 1280, "height": 720},
    java_script_enabled=True,  # Only if needed
    has_touch=False,
    is_mobile=False,
)

# Block resource types you don't need
page.route("**/*", lambda route: (
    route.abort()
    if route.request.resource_type in ["image", "stylesheet", "font", "media"]
    else route.continue_()
))
```

Scraping Behind Authentication

```python
# When authorized to scrape behind login.
# ALWAYS use session-based auth; never store passwords in code.
import os
import requests

# Pattern: Login once, reuse session
session = requests.Session()
login_resp = session.post("https://example.com/login", data={
    "username": os.environ["SCRAPE_USER"],
    "password": os.environ["SCRAPE_PASS"],
})
assert login_resp.ok, "Login failed"

# Session cookies are now stored — use for subsequent requests
data_resp = session.get("https://example.com/api/data")
```

Change Detection (Avoid Redundant Scrapes)

```python
def has_changed(url, session, last_etag=None, last_modified=None):
    """Check if a page changed without downloading the full content."""
    headers = {}
    if last_etag:
        headers["If-None-Match"] = last_etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    resp = session.head(url, headers=headers)
    if resp.status_code == 304:
        return False, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    return True, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
```

Quality Scoring Rubric (0-100)

| Dimension | Weight | What to Assess |
|---|---|---|
| Legal compliance | 20% | robots.txt, ToS, PII handling, audit trail |
| Data quality | 20% | Validation, accuracy, completeness, freshness |
| Resilience | 15% | Error handling, retries, circuit breakers, checkpointing |
| Anti-detection | 15% | Proxy rotation, fingerprint diversity, rate limiting |
| Architecture | 10% | Right tool selection, clean code, modularity |
| Monitoring | 10% | Success rates, breakage detection, alerting |
| Performance | 5% | Speed, cost efficiency, resource usage |
| Documentation | 5% | Runbook, schema docs, legal assessment |

Grading: 90+ Excellent | 75-89 Good | 60-74 Needs work | <60 Redesign

10 Common Mistakes

| # | Mistake | Fix |
|---|---|---|
| 1 | No robots.txt check | Always check first — it's your legal defense |
| 2 | Fixed delays (no jitter) | Add ±30% random jitter to all delays |
| 3 | No data validation | Validate every field before storing |
| 4 | Using browser for static HTML | HTTP client is 10-50x faster and cheaper |
| 5 | Single IP, no rotation | Proxy rotation for any serious scraping |
| 6 | No breakage detection | Monitor extraction counts and field fill rates |
| 7 | Storing raw HTML only | Extract + structure immediately |
| 8 | No checkpoint/resume | Long scrapes must be resumable |
| 9 | Ignoring structured data | JSON-LD/microdata is cleaner than CSS selectors |
| 10 | Scraping when API exists | Always check for API first |

5 Edge Cases

- **Single-page apps (React/Vue/Angular):** Must use browser rendering OR find the underlying API (network tab). Prefer API discovery — it's faster and more reliable.
- **Infinite scroll:** Intercept the XHR/fetch calls that load more content. Simulate scrolling only as a last resort. The API endpoint usually accepts `page` or `offset` params.
- **CAPTCHAs:** If you're hitting CAPTCHAs, you're scraping too aggressively. Slow down first. If CAPTCHAs persist: managed services (2Captcha, Anti-Captcha) or rethink the approach.
- **Dynamic class names (CSS modules, Tailwind):** Use data attributes, ARIA labels, or text content selectors instead. `[data-testid="price"]` survives redesigns; `.sc-bdVTJa` does not.
- **Multi-language sites:** Detect language via the `html[lang]` attribute. Set the Accept-Language header to get the desired locale. Watch for different URL structures (`/en/`, `/de/`, subdomains).
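
The multi-language edge case's `html[lang]` detection can be sketched with a small regex sniffer. For production, a real HTML parser is more robust; the function name is ours:

```python
import re

def page_lang(html_text: str):
    """Return the <html lang="..."> value, or None if the attribute is absent."""
    m = re.search(r'<html[^>]*\blang=["\']([^"\']+)["\']', html_text, re.IGNORECASE)
    return m.group(1) if m else None
```

Use the result to route records into per-locale pipelines, or to detect that a proxy's geo is serving you the wrong language variant.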

Natural Language Commands

- "Check if I can scrape [URL]" → Run compliance checklist (robots.txt, ToS, data type)
- "What tool should I use for [site]?" → Analyze site rendering, anti-bot, recommend tool
- "Build a scraper for [description]" → Full architecture brief + code pattern
- "My scraper is getting blocked" → Anti-detection diagnostic + proxy/stealth recommendations
- "Extract [data] from [URL]" → Check structured data first, then CSS selectors
- "Monitor [site] for changes" → Change detection + scheduling + alerting setup
- "How do I handle pagination on [site]?" → Identify pagination type + code pattern
- "Scrape at scale ([N] pages)" → Concurrency architecture + cost estimate
- "Clean and store this scraped data" → Validation + dedup + storage recommendation
- "Is my scraper healthy?" → Run health check + breakage detection
- "Find the API behind [site]" → Network tab mining guide + common patterns
- "Set up price monitoring for [competitors]" → Full e-commerce monitor pattern

Category context

Code helpers, APIs, CLIs, browser automation, testing, and developer operations.

Source: Tencent SkillHub

Largest current source with strong distribution and engagement signals.

Package contents

Included in package
2 Docs
  • SKILL.md Primary doc
  • README.md Docs