Requirements
- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Advanced web scraping with Scrapling — MCP-native guidance for extraction, crawling, and anti-bot handling. Use via mcporter (MCP) for execution; this skill...
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
Install:

> I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.

Upgrade:

> I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.
Guidance Layer + MCP Integration

Use this skill for strategy and patterns. For execution, call Scrapling's MCP server via mcporter.
```bash
pip install scrapling[mcp]

# Or for full features:
pip install scrapling[mcp,playwright]
python -m playwright install chromium
```
{ "mcpServers": { "scrapling": { "command": "python", "args": ["-m", "scrapling.mcp"] } } }
```bash
mcporter call scrapling fetch_page --url "https://example.com"
```
| Task | Tool | Example |
| --- | --- | --- |
| Fetch a page | mcporter | `mcporter call scrapling fetch_page --url URL` |
| Extract with CSS | mcporter | `mcporter call scrapling css_select --selector ".title::text"` |
| Which fetcher to use? | This skill | See "Fetcher Selection Guide" below |
| Anti-bot strategy? | This skill | See "Anti-Bot Escalation Ladder" |
| Complex crawl patterns? | This skill | See "Spider Recipes" |
```
┌─────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│     Fetcher     │────▶│  DynamicFetcher  │────▶│  StealthyFetcher │
│     (HTTP)      │     │   (Browser/JS)   │     │    (Anti-bot)    │
└─────────────────┘     └──────────────────┘     └──────────────────┘
  Fastest                 JS-rendered             Cloudflare,
  Static pages            SPAs, React/Vue         Turnstile, etc.
```
- Static HTML? → Fetcher (10-100x faster)
- Need JS execution? → DynamicFetcher
- Getting blocked? → StealthyFetcher
- Complex session? → Use Session variants
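The same ladder in code: a minimal sketch that starts with plain HTTP and escalates only when extraction comes back empty. The emptiness check and the `DynamicFetcher.fetch` signature are assumptions pieced together from examples elsewhere in this skill; verify against references/api-reference.md.

```python
# Hedged sketch: escalate fetchers only when the cheaper tier fails.
# DynamicFetcher.fetch's exact signature is an assumption here.
from scrapling.fetchers import Fetcher, DynamicFetcher, StealthyFetcher

url = "https://example.com/products"

page = Fetcher.get(url)                # 1. plain HTTP, fastest
if not page.css('.product'):           # nothing found? likely JS-rendered
    page = DynamicFetcher.fetch(url)   # 2. real browser, executes JS
if not page.css('.product'):           # still nothing? likely anti-bot
    page = StealthyFetcher.fetch(url, headless=True, solve_cloudflare=True)
```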
- fetch_page — HTTP fetcher
- fetch_dynamic — Browser-based with Playwright
- fetch_stealthy — Anti-bot bypass mode
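The latter two should be callable the same way as the fetch_page example above; the flag names here are assumptions, so check references/mcp-setup.md before relying on them.

```bash
# Assumed to mirror the fetch_page call shown earlier; verify flag
# names in references/mcp-setup.md.
mcporter call scrapling fetch_dynamic --url "https://spa.example.com"
mcporter call scrapling fetch_stealthy --url "https://protected.example.com"
```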
```
# MCP call: fetch_page with options
{
  "url": "https://example.com",
  "headers": {"User-Agent": "..."},
  "delay": 2.0
}
```
```python
# Use sessions for cookie/state across requests
FetcherSession(impersonate="chrome")  # TLS fingerprint spoofing
```
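A short sketch of session reuse: one `FetcherSession` carries cookies and the impersonated TLS fingerprint across requests. The context-manager pattern is taken from the proxy-rotation example below; the URLs are placeholders.

```python
# Sketch: one session, many requests; cookies set by the first
# response are sent automatically with the second.
from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate="chrome") as session:
    listing = session.get("https://example.com/products")
    detail = session.get("https://example.com/products/1")
```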
```python
# MCP: fetch_stealthy
StealthyFetcher.fetch(
    url,
    headless=True,
    solve_cloudflare=True,  # Auto-solve Turnstile
    network_idle=True
)
```
See references/proxy-rotation.md
Scrapling can survive website redesigns using adaptive selectors:

```python
# First run — save fingerprints
products = page.css('.product', auto_save=True)

# Later runs — auto-relocate if DOM changed
products = page.css('.product', adaptive=True)
```

MCP usage:

```bash
mcporter call scrapling css_select \
  --selector ".product" \
  --adaptive true \
  --auto-save true
```
When to use Spiders vs direct fetching:

- ✅ Spider: 10+ pages, concurrency needed, resume capability, proxy rotation
- ✅ Direct: 1-5 pages, quick extraction, simple flow
```python
from scrapling.spiders import Spider, Response

class ProductSpider(Spider):
    name = "products"
    start_urls = ["https://example.com/products"]
    concurrent_requests = 10
    download_delay = 1.0

    async def parse(self, response: Response):
        for product in response.css('.product'):
            yield {
                "name": product.css('h2::text').get(),
                "price": product.css('.price::text').get(),
                "url": response.url
            }
        # Follow pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

# Run with resume capability
result = ProductSpider(crawldir="./crawl_data").start()
result.items.to_jsonl("products.jsonl")
```
```python
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "/protected/" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast")
```
- Pause/Resume: `crawldir` parameter saves checkpoints
- Streaming: `async for item in spider.stream()` for real-time processing (sketched below)
- Auto-retry: Configurable retry on blocked requests
- Export: Built-in `to_json()`, `to_jsonl()`
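A hedged streaming sketch, reusing the ProductSpider defined above. It assumes `stream()` drives the crawl itself and yields items as an async iterator, as the feature list suggests.

```python
# Sketch: consume items as the crawl produces them instead of
# waiting for start() to finish. Assumes stream() runs the crawl.
import asyncio

async def main():
    spider = ProductSpider(crawldir="./crawl_data")  # resumable, as above
    async for item in spider.stream():
        print(item["name"], item["price"])           # real-time processing

asyncio.run(main())
```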
```bash
# Extract to markdown
scrapling extract get 'https://example.com' content.md

# Extract specific element
scrapling extract get 'https://example.com' content.txt \
  --css-selector '.article' \
  --impersonate 'chrome'

# Stealth mode
scrapling extract stealthy-fetch 'https://protected.com' content.md \
  --no-headless \
  --solve-cloudflare
```
```bash
scrapling shell
```

```python
# Inside shell:
>>> page = Fetcher.get('https://example.com')
>>> page.css('h1::text').get()
>>> page.find_all('div', class_='item')
```
```python
import re

# Find by attributes
page.find_all('div', {'class': 'product', 'data-id': True})
page.find_all('div', class_='product', id=re.compile(r'item-\d+'))

# Text search
page.find_by_text('Add to Cart', tag='button')
page.find_by_regex(r'\$\d+\.\d{2}')

# Navigation
first = page.css('.product')[0]
parent = first.parent
siblings = first.next_siblings
children = first.children

# Similarity
similar = first.find_similar()  # Find visually/structurally similar elements
below = first.below_elements()  # Elements below in DOM
```
```python
# Get robust selector for any element
element = page.css('.product')[0]
selector = element.auto_css_selector()  # Returns stable CSS path
xpath = element.auto_xpath()
```
```python
from scrapling.spiders import ProxyRotator
from scrapling.fetchers import FetcherSession

# Cyclic rotation
rotator = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://user:pass@proxy3:8080"
], strategy="cyclic")

# Use with any session
with FetcherSession(proxy=rotator.next()) as session:
    page = session.get('https://example.com')
```
```python
# Page numbers
for page_num in range(1, 11):
    url = f"https://example.com/products?page={page_num}"
    ...

# Next button
while next_page := response.css('.next a::attr(href)').get():
    yield response.follow(next_page)

# Infinite scroll (DynamicFetcher)
with DynamicSession() as session:
    page = session.fetch(url)
    page.scroll_to_bottom()
    items = page.css('.item').getall()
```
```python
with StealthySession(headless=False) as session:
    # Login
    login_page = session.fetch('https://example.com/login')
    login_page.fill('input[name="username"]', 'user')
    login_page.fill('input[name="password"]', 'pass')
    login_page.click('button[type="submit"]')

    # Now session has cookies
    protected_page = session.fetch('https://example.com/dashboard')
```
```python
# Extract JSON from __NEXT_DATA__
import json
import re

next_data = json.loads(
    re.search(
        r'__NEXT_DATA__" type="application/json">(.*?)</script>',
        page.html_content,
        re.S
    ).group(1)
)
props = next_data['props']['pageProps']
```
```python
# JSON (pretty)
result.items.to_json('output.json')

# JSONL (streaming, one per line)
result.items.to_jsonl('output.jsonl')

# Python objects
for item in result.items:
    print(item['title'])
```
- Use HTTP fetcher when possible — 10-100x faster than browser
- Impersonate browsers — `impersonate='chrome'` for TLS fingerprinting
- HTTP/3 support — `FetcherSession(http3=True)`
- Limit resources — `disable_resources=True` in Dynamic/Stealthy
- Connection pooling — Reuse sessions across requests
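A sketch combining the tips above: reuse one impersonated HTTP/3 session for a batch of requests instead of reconnecting each time. The `http3` keyword is named in this list; its placement on the constructor here is an assumption.

```python
# Hedged sketch: one pooled, impersonated HTTP/3 session for a batch.
from scrapling.fetchers import FetcherSession

urls = [f"https://example.com/page/{n}" for n in range(1, 6)]

with FetcherSession(impersonate="chrome", http3=True) as session:
    pages = [session.get(u) for u in urls]  # reuses pooled connections
```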
- Only scrape content you're authorized to access
- Respect robots.txt and ToS
- Add delays (`download_delay`) for large crawls
- Don't bypass paywalls or authentication without permission
- Never scrape personal/sensitive data
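The robots.txt check is cheap to automate with Python's standard library; this is plain stdlib, not a Scrapling API, and the user-agent string is a placeholder.

```python
# Plain-stdlib check (not a Scrapling API): confirm the target path
# is allowed before fetching it. "MyScraperBot" is a placeholder.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("MyScraperBot", "https://example.com/products"):
    ...  # safe to fetch, e.g. with Fetcher.get()
```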
- references/mcp-setup.md — Detailed MCP configuration
- references/anti-bot.md — Anti-bot handling strategies
- references/proxy-rotation.md — Proxy setup and rotation
- references/spider-recipes.md — Advanced crawling patterns
- references/api-reference.md — Quick API reference
- references/links.md — Official docs links
- scripts/scrapling_scrape.py — Quick one-off extraction
- scripts/scrapling_smoke_test.py — Test connectivity and anti-bot indicators