Requirements
- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Advanced web scraping with Scrapling — MCP-native guidance for extraction, crawling, and anti-bot handling. Use via mcporter (MCP) for execution; this skill...
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
Install:

> I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.

Upgrade:

> I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.
Guidance Layer + MCP Integration

Use this skill for strategy and patterns. For execution, call Scrapling's MCP server via mcporter.
```bash
pip install scrapling[mcp]

# Or for full features:
pip install scrapling[mcp,playwright]
python -m playwright install chromium
```
{ "mcpServers": { "scrapling": { "command": "python", "args": ["-m", "scrapling.mcp"] } } }
```bash
mcporter call scrapling fetch_page --url "https://example.com"
```
| Task | Tool | Example |
| --- | --- | --- |
| Fetch a page | mcporter | `mcporter call scrapling fetch_page --url URL` |
| Extract with CSS | mcporter | `mcporter call scrapling css_select --selector ".title::text"` |
| Which fetcher to use? | This skill | See "Fetcher Selection Guide" below |
| Anti-bot strategy? | This skill | See "Anti-Bot Escalation Ladder" |
| Complex crawl patterns? | This skill | See "Spider Recipes" |
```
┌─────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│     Fetcher     │────▶│  DynamicFetcher  │────▶│  StealthyFetcher │
│     (HTTP)      │     │   (Browser/JS)   │     │    (Anti-bot)    │
└─────────────────┘     └──────────────────┘     └──────────────────┘
  Fastest                 JS-rendered             Cloudflare,
  Static pages            SPAs, React/Vue         Turnstile, etc.
```
- Static HTML? → Fetcher (10-100x faster)
- Need JS execution? → DynamicFetcher
- Getting blocked? → StealthyFetcher
- Complex session? → Use Session variants
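The same ladder in code: a minimal sketch that starts with plain HTTP and escalates only when extraction comes back empty. The emptiness check and the `DynamicFetcher.fetch` signature are assumptions pieced together from examples elsewhere in this skill; verify against references/api-reference.md.

```python
# Hedged sketch: escalate fetchers only when the cheaper tier fails.
# DynamicFetcher.fetch's exact signature is an assumption here.
from scrapling.fetchers import Fetcher, DynamicFetcher, StealthyFetcher

url = "https://example.com/products"

page = Fetcher.get(url)                # 1. plain HTTP, fastest
if not page.css('.product'):           # nothing found? likely JS-rendered
    page = DynamicFetcher.fetch(url)   # 2. real browser, executes JS
if not page.css('.product'):           # still nothing? likely anti-bot
    page = StealthyFetcher.fetch(url, headless=True, solve_cloudflare=True)
```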
- fetch_page — HTTP fetcher
- fetch_dynamic — Browser-based with Playwright
- fetch_stealthy — Anti-bot bypass mode
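The latter two should be callable the same way as the fetch_page example above; the flag names here are assumptions, so check references/mcp-setup.md before relying on them.

```bash
# Assumed to mirror the fetch_page call shown earlier; verify flag
# names in references/mcp-setup.md.
mcporter call scrapling fetch_dynamic --url "https://spa.example.com"
mcporter call scrapling fetch_stealthy --url "https://protected.example.com"
```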
```
# MCP call: fetch_page with options
{
  "url": "https://example.com",
  "headers": {"User-Agent": "..."},
  "delay": 2.0
}
```
```python
# Use sessions for cookie/state across requests
FetcherSession(impersonate="chrome")  # TLS fingerprint spoofing
```
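A short sketch of session reuse: one `FetcherSession` carries cookies and the impersonated TLS fingerprint across requests. The context-manager pattern is taken from the proxy-rotation example below; the URLs are placeholders.

```python
# Sketch: one session, many requests; cookies set by the first
# response are sent automatically with the second.
from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate="chrome") as session:
    listing = session.get("https://example.com/products")
    detail = session.get("https://example.com/products/1")
```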
```python
# MCP: fetch_stealthy
StealthyFetcher.fetch(
    url,
    headless=True,
    solve_cloudflare=True,  # Auto-solve Turnstile
    network_idle=True
)
```
See references/proxy-rotation.md
Scrapling can survive website redesigns using adaptive selectors:

```python
# First run — save fingerprints
products = page.css('.product', auto_save=True)

# Later runs — auto-relocate if DOM changed
products = page.css('.product', adaptive=True)
```

MCP usage:

```bash
mcporter call scrapling css_select \
  --selector ".product" \
  --adaptive true \
  --auto-save true
```
When to use Spiders vs direct fetching:

- ✅ Spider: 10+ pages, concurrency needed, resume capability, proxy rotation
- ✅ Direct: 1-5 pages, quick extraction, simple flow
```python
from scrapling.spiders import Spider, Response

class ProductSpider(Spider):
    name = "products"
    start_urls = ["https://example.com/products"]
    concurrent_requests = 10
    download_delay = 1.0

    async def parse(self, response: Response):
        for product in response.css('.product'):
            yield {
                "name": product.css('h2::text').get(),
                "price": product.css('.price::text').get(),
                "url": response.url
            }
        # Follow pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

# Run with resume capability
result = ProductSpider(crawldir="./crawl_data").start()
result.items.to_jsonl("products.jsonl")
```
```python
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "/protected/" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast")
```
- Pause/Resume: `crawldir` parameter saves checkpoints
- Streaming: `async for item in spider.stream()` for real-time processing (sketched below)
- Auto-retry: Configurable retry on blocked requests
- Export: Built-in `to_json()`, `to_jsonl()`
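A hedged streaming sketch, reusing the ProductSpider defined above. It assumes `stream()` drives the crawl itself and yields items as an async iterator, as the feature list suggests.

```python
# Sketch: consume items as the crawl produces them instead of
# waiting for start() to finish. Assumes stream() runs the crawl.
import asyncio

async def main():
    spider = ProductSpider(crawldir="./crawl_data")  # resumable, as above
    async for item in spider.stream():
        print(item["name"], item["price"])           # real-time processing

asyncio.run(main())
```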
```bash
# Extract to markdown
scrapling extract get 'https://example.com' content.md

# Extract specific element
scrapling extract get 'https://example.com' content.txt \
  --css-selector '.article' \
  --impersonate 'chrome'

# Stealth mode
scrapling extract stealthy-fetch 'https://protected.com' content.md \
  --no-headless \
  --solve-cloudflare
```
```bash
scrapling shell
```

```python
# Inside shell:
>>> page = Fetcher.get('https://example.com')
>>> page.css('h1::text').get()
>>> page.find_all('div', class_='item')
```
```python
import re

# Find by attributes
page.find_all('div', {'class': 'product', 'data-id': True})
page.find_all('div', class_='product', id=re.compile(r'item-\d+'))

# Text search
page.find_by_text('Add to Cart', tag='button')
page.find_by_regex(r'\$\d+\.\d{2}')

# Navigation
first = page.css('.product')[0]
parent = first.parent
siblings = first.next_siblings
children = first.children

# Similarity
similar = first.find_similar()  # Find visually/structurally similar elements
below = first.below_elements()  # Elements below in DOM
```
```python
# Get robust selector for any element
element = page.css('.product')[0]
selector = element.auto_css_selector()  # Returns stable CSS path
xpath = element.auto_xpath()
```
```python
from scrapling.spiders import ProxyRotator
from scrapling.fetchers import FetcherSession

# Cyclic rotation
rotator = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://user:pass@proxy3:8080"
], strategy="cyclic")

# Use with any session
with FetcherSession(proxy=rotator.next()) as session:
    page = session.get('https://example.com')
```
```python
# Page numbers
for page_num in range(1, 11):
    url = f"https://example.com/products?page={page_num}"
    ...

# Next button
while next_page := response.css('.next a::attr(href)').get():
    yield response.follow(next_page)

# Infinite scroll (DynamicFetcher)
with DynamicSession() as session:
    page = session.fetch(url)
    page.scroll_to_bottom()
    items = page.css('.item').getall()
```
```python
with StealthySession(headless=False) as session:
    # Login
    login_page = session.fetch('https://example.com/login')
    login_page.fill('input[name="username"]', 'user')
    login_page.fill('input[name="password"]', 'pass')
    login_page.click('button[type="submit"]')

    # Now session has cookies
    protected_page = session.fetch('https://example.com/dashboard')
```
```python
# Extract JSON from __NEXT_DATA__
import json
import re

next_data = json.loads(
    re.search(
        r'__NEXT_DATA__" type="application/json">(.*?)</script>',
        page.html_content,
        re.S
    ).group(1)
)
props = next_data['props']['pageProps']
```
```python
# JSON (pretty)
result.items.to_json('output.json')

# JSONL (streaming, one per line)
result.items.to_jsonl('output.jsonl')

# Python objects
for item in result.items:
    print(item['title'])
```
- Use HTTP fetcher when possible — 10-100x faster than browser
- Impersonate browsers — `impersonate='chrome'` for TLS fingerprinting
- HTTP/3 support — `FetcherSession(http3=True)`
- Limit resources — `disable_resources=True` in Dynamic/Stealthy
- Connection pooling — Reuse sessions across requests
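A sketch combining the tips above: reuse one impersonated HTTP/3 session for a batch of requests instead of reconnecting each time. The `http3` keyword is named in this list; its placement on the constructor here is an assumption.

```python
# Hedged sketch: one pooled, impersonated HTTP/3 session for a batch.
from scrapling.fetchers import FetcherSession

urls = [f"https://example.com/page/{n}" for n in range(1, 6)]

with FetcherSession(impersonate="chrome", http3=True) as session:
    pages = [session.get(u) for u in urls]  # reuses pooled connections
```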
- Only scrape content you're authorized to access
- Respect robots.txt and ToS
- Add delays (`download_delay`) for large crawls
- Don't bypass paywalls or authentication without permission
- Never scrape personal/sensitive data
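The robots.txt check is cheap to automate with Python's standard library; this is plain stdlib, not a Scrapling API, and the user-agent string is a placeholder.

```python
# Plain-stdlib check (not a Scrapling API): confirm the target path
# is allowed before fetching it. "MyScraperBot" is a placeholder.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("MyScraperBot", "https://example.com/products"):
    ...  # safe to fetch, e.g. with Fetcher.get()
```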
- references/mcp-setup.md — Detailed MCP configuration
- references/anti-bot.md — Anti-bot handling strategies
- references/proxy-rotation.md — Proxy setup and rotation
- references/spider-recipes.md — Advanced crawling patterns
- references/api-reference.md — Quick API reference
- references/links.md — Official docs links
- scripts/scrapling_scrape.py — Quick one-off extraction
- scripts/scrapling_smoke_test.py — Test connectivity and anti-bot indicators