Requirements
- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Adaptive web scraping framework with anti-bot bypass and spider crawling.
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.
"Effortless web scraping for the modern web."
- Repository: https://github.com/D4Vinci/Scrapling
- Author: D4Vinci (Karim Shoair)
- License: BSD-3-Clause
- Documentation: https://scrapling.readthedocs.io
- GitHub: https://github.com/paoloanzn/free-solscan-api
- X Post: https://x.com/paoloanzn/status/2026361234032046319
- Author: @paoloanzn
- Insight: "Web scraping is 80% reverse engineering"
```bash
# Core library (parser only)
pip install scrapling

# With fetchers (HTTP + browser automation) - RECOMMENDED
pip install "scrapling[fetchers]"
scrapling install

# With shell (CLI tools) - RECOMMENDED
pip install "scrapling[shell]"

# With AI (MCP server) - OPTIONAL
pip install "scrapling[ai]"

# Everything
pip install "scrapling[all]"

# Browser for stealth/dynamic mode
playwright install chromium

# For Cloudflare bypass (advanced)
pip install cloudscraper
```
Use Scrapling to:
- Research topics from websites
- Extract data from blogs, news sites, and docs
- Crawl multiple pages with Spider
- Gather content for summaries
- Extract brand data from any website
- Reverse engineer APIs from websites

Do NOT use it for:
- X/Twitter (use the x-tweet-fetcher skill)
- Login-protected sites (unless credentials are provided)
- Paywalled content (respect robots.txt)
- Sites that prohibit scraping in their TOS
```python
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://example.com')

# Extract content
title = page.css('h1::text').get()
paragraphs = page.css('p::text').getall()
```
```python
from scrapling.fetchers import StealthyFetcher

StealthyFetcher.adaptive = True
page = StealthyFetcher.fetch('https://example.com', headless=True, solve_cloudflare=True)
```
```python
from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch('https://example.com', headless=True, network_idle=True)
```
```python
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://example.com')

# First scrape - saves selectors
items = page.css('.product', auto_save=True)

# Later - if the site changes, use adaptive=True to relocate elements
items = page.css('.product', adaptive=True)
```
```python
from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com"]
    concurrent_requests = 3

    async def parse(self, response: Response):
        for item in response.css('.item'):
            yield {"item": item.css('h2::text').get()}

        # Follow links
        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])

MySpider().start()
```
```bash
# Simple fetch to file
scrapling extract get https://example.com content.html

# Stealthy fetch (bypass anti-bot)
scrapling extract stealthy-fetch https://example.com content.html

# Interactive shell
scrapling shell https://example.com
```
```python
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://example.com/article')

# Try multiple selectors for the title
title = (
    page.css('[itemprop="headline"]::text').get()
    or page.css('article h1::text').get()
    or page.css('h1::text').get()
)

# Get paragraphs
content = page.css('article p::text, .article-body p::text').getall()

print(f"Title: {title}")
print(f"Paragraphs: {len(content)}")
```
```python
from scrapling.spiders import Spider, Response

class ResearchSpider(Spider):
    name = "research"
    start_urls = ["https://news.ycombinator.com"]
    concurrent_requests = 5

    async def parse(self, response: Response):
        for item in response.css('.titleline a::text').getall()[:10]:
            yield {"title": item, "source": "HN"}

        more = response.css('.morelink::attr(href)').get()
        if more:
            yield response.follow(more)

ResearchSpider().start()
```
Auto-crawl all pages on a domain by following internal links:

```python
from scrapling.spiders import Spider, Response
from urllib.parse import urljoin

class EasyCrawl(Spider):
    """Auto-crawl all pages on a domain."""
    name = "easy_crawl"
    start_urls = ["https://example.com"]
    concurrent_requests = 3

    def __init__(self):
        super().__init__()
        self.visited = set()

    async def parse(self, response: Response):
        # Extract content
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
            'h1': response.css('h1::text').get(),
        }

        # Follow internal links (limit to 50 pages)
        if len(self.visited) >= 50:
            return
        self.visited.add(response.url)

        links = response.css('a::attr(href)').getall()[:20]
        for link in links:
            full_url = urljoin(response.url, link)
            if full_url not in self.visited:
                yield response.follow(full_url)

# Usage
result = EasyCrawl()
result.start()
```
Crawl pages from sitemap.xml (with fallback to link discovery):

```python
from scrapling.fetchers import Fetcher
from urllib.parse import urlparse
import re

def get_sitemap_urls(url: str, max_urls: int = 100) -> list:
    """Extract URLs from sitemap.xml - also checks robots.txt."""
    parsed = urlparse(url)
    base_url = f"{parsed.scheme}://{parsed.netloc}"
    sitemap_urls = [
        f"{base_url}/sitemap.xml",
        f"{base_url}/sitemap-index.xml",
        f"{base_url}/sitemap_index.xml",
        f"{base_url}/sitemap-news.xml",
    ]
    all_urls = []

    # First check robots.txt for a sitemap URL
    try:
        robots = Fetcher.get(f"{base_url}/robots.txt")
        if robots.status == 200:
            sitemap_in_robots = re.findall(r'Sitemap:\s*(\S+)', robots.text, re.IGNORECASE)
            for sm in sitemap_in_robots:
                sitemap_urls.insert(0, sm)
    except Exception:
        pass

    # Try each sitemap location
    for sitemap_url in sitemap_urls:
        try:
            page = Fetcher.get(sitemap_url, timeout=10)
            if page.status != 200:
                continue
            text = page.text
            # Check if it's XML
            if '<?xml' in text or '<urlset' in text or '<sitemapindex' in text:
                urls = re.findall(r'<loc>([^<]+)</loc>', text)
                all_urls.extend(urls[:max_urls])
                print(f"Found {len(urls)} URLs in {sitemap_url}")
        except Exception:
            continue

    return list(set(all_urls))[:max_urls]

def crawl_from_sitemap(domain_url: str, max_pages: int = 50):
    """Crawl pages from the sitemap."""
    print(f"Fetching sitemap for {domain_url}...")
    urls = get_sitemap_urls(domain_url)
    if not urls:
        print("No sitemap found. Use EasyCrawl instead!")
        return []

    print(f"Found {len(urls)} URLs, crawling first {max_pages}...")
    results = []
    for url in urls[:max_pages]:
        try:
            page = Fetcher.get(url, timeout=10)
            results.append({
                'url': url,
                'status': page.status,
                'title': page.css('title::text').get(),
            })
        except Exception as e:
            results.append({'url': url, 'error': str(e)[:50]})
    return results

# Usage
print("=== Sitemap Crawl ===")
results = crawl_from_sitemap('https://example.com', max_pages=10)
for r in results[:3]:
    print(f"  {r.get('title', r.get('error', 'N/A'))}")

# Alternative: link discovery with the EasyCrawl spider defined above
print("\n=== Easy Crawl (Link Discovery) ===")
EasyCrawl().start()
```
Inspired by Firecrawl's behavior - combines sitemap discovery with link following:

```python
from scrapling.fetchers import Fetcher
from scrapling.spiders import Spider, Response
from urllib.parse import urljoin, urlparse
import re

def firecrawl_crawl(url: str, max_pages: int = 50, use_sitemap: bool = True):
    """
    Firecrawl-style crawling:
    - use_sitemap=True: Discover URLs from the sitemap first (default)
    - use_sitemap=False: Only follow HTML links (like sitemap: "skip")
    Matches Firecrawl's crawl behavior.
    """
    parsed = urlparse(url)
    domain = parsed.netloc

    # ========== Method 1: Sitemap Discovery ==========
    if use_sitemap:
        print("[Firecrawl] Discovering URLs from sitemap...")
        sitemap_urls = [
            f"{url.rstrip('/')}/sitemap.xml",
            f"{url.rstrip('/')}/sitemap-index.xml",
        ]
        all_urls = []

        # Try sitemaps
        for sm_url in sitemap_urls:
            try:
                page = Fetcher.get(sm_url, timeout=15)
                if page.status == 200:
                    # Handle bytes
                    text = page.body.decode('utf-8', errors='ignore') if isinstance(page.body, bytes) else str(page.body)
                    if '<urlset' in text:
                        urls = re.findall(r'<loc>([^<]+)</loc>', text)
                        all_urls.extend(urls[:max_pages])
                        print(f"[Firecrawl] Found {len(urls)} URLs in {sm_url}")
            except Exception:
                continue

        if all_urls:
            print(f"[Firecrawl] Total: {len(all_urls)} URLs from sitemap")
            # Crawl discovered URLs
            results = []
            for page_url in all_urls[:max_pages]:
                try:
                    page = Fetcher.get(page_url, timeout=15)
                    results.append({
                        'url': page_url,
                        'status': page.status,
                        'title': page.css('title::text').get() if page.status == 200 else None,
                    })
                except Exception as e:
                    results.append({'url': page_url, 'error': str(e)[:50]})
            return results

    # ========== Method 2: Link Discovery (sitemap: skip) ==========
    print("[Firecrawl] Sitemap skipped - using link discovery...")

    class LinkCrawl(Spider):
        name = "firecrawl_link"
        start_urls = [url]
        concurrent_requests = 3

        def __init__(self):
            super().__init__()
            self.visited = set()
            self.domain = domain
            self.results = []

        async def parse(self, response: Response):
            if len(self.results) >= max_pages:
                return
            self.results.append({
                'url': response.url,
                'status': response.status,
                'title': response.css('title::text').get(),
            })

            # Follow internal links only
            links = response.css('a::attr(href)').getall()[:20]
            for link in links:
                full_url = urljoin(response.url, link)
                parsed_link = urlparse(full_url)
                if parsed_link.netloc == self.domain and full_url not in self.visited:
                    self.visited.add(full_url)
                    if len(self.visited) < max_pages:
                        yield response.follow(full_url)

    result = LinkCrawl()
    result.start()
    return result.results

# Usage
print("=== Firecrawl-Style (sitemap: include) ===")
results = firecrawl_crawl('https://www.cloudflare.com', max_pages=5, use_sitemap=True)
print(f"Crawled: {len(results)} pages")

print("\n=== Firecrawl-Style (sitemap: skip) ===")
results = firecrawl_crawl('https://example.com', max_pages=5, use_sitemap=False)
print(f"Crawled: {len(results)} pages")
```
```python
from scrapling.fetchers import Fetcher, StealthyFetcher

try:
    page = Fetcher.get('https://example.com')
except Exception:
    # Try stealth mode
    page = StealthyFetcher.fetch('https://example.com', headless=True)

if page.status == 403:
    print("Blocked - try StealthyFetcher")
elif page.status == 200:
    print("Success!")
```
```python
from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate='chrome') as session:
    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
    quotes = page.css('.quote .text::text').getall()
```
```python
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast", callback=self.parse)
```
```python
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://quotes.toscrape.com/')

# Multiple selection methods
quotes = page.css('.quote')                    # CSS
quotes = page.xpath('//div[@class="quote"]')   # XPath
quotes = page.find_all('div', class_='quote')  # BeautifulSoup-style

# Navigation
first_quote = page.css('.quote')[0]
author = first_quote.css('.author::text').get()
parent = first_quote.parent

# Find similar elements
similar = first_quote.find_similar()
```
"Web scraping is 80% reverse engineering." This section covers advanced techniques to discover and replicate APIs directly from websites — often revealing data that's "hidden" behind paid APIs.
Many websites load data via client-side requests. Use browser DevTools to find them.

Steps:
1. Open browser DevTools (F12)
2. Go to the Network tab
3. Reload the page
4. Look for XHR or Fetch requests
5. Check whether the endpoints return JSON data

What to look for:
- Requests to /api/* endpoints
- Responses containing structured data (JSON)
- The same endpoints used on both free and paid sections

Example pattern:

```
# Found in Network tab:
GET https://api.example.com/v1/users/transactions
Response: {"data": [...], "pagination": {...}}
```
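Before reverse engineering any auth, it is worth checking whether a discovered endpoint already returns JSON without credentials. A minimal probe sketch using requests; the URL is the hypothetical endpoint from the example pattern above, not a real API:

```python
import requests

# Hypothetical endpoint copied from the Network tab (see the example pattern above)
url = "https://api.example.com/v1/users/transactions"

response = requests.get(url, headers={"Accept": "application/json"}, timeout=15)
print(response.status_code)

# If the endpoint is open JSON, inspect the top-level keys to understand the schema
if response.headers.get("Content-Type", "").startswith("application/json"):
    data = response.json()
    print(list(data.keys()) if isinstance(data, dict) else type(data))
```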
Auth tokens are often generated client-side. Find them in the site's .js files.

Steps:
1. In the Network tab, look at the Initiator column
2. Click the .js file making the request
3. Search for the auth header name (e.g., sol-aut, Authorization, X-API-Key)
4. Find the function generating the token

Common patterns:
- Plain-text function names: generateToken(), createAuthHeader()
- Obfuscated code: search for the header name directly
- Random string generation: Math.random(), crypto.getRandomValues()
Once you've found the endpoint and auth pattern:

```python
import requests
import random
import string

def generate_auth_token():
    """Replicate discovered token generation logic."""
    chars = string.ascii_letters + string.digits
    token = ''.join(random.choice(chars) for _ in range(40))
    # Insert fixed string at a random position
    fixed = "B9dls0fK"
    pos = random.randint(0, len(token))
    return token[:pos] + fixed + token[pos:]

def scrape_api_endpoint(url):
    """Hit the discovered API endpoint with replicated auth."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'application/json',
        'sol-aut': generate_auth_token(),  # Replicate the discovered header
    }
    response = requests.get(url, headers=headers)
    return response.json()
```
For Cloudflare-protected endpoints, use cloudscraper:

```bash
pip install cloudscraper
```

```python
import cloudscraper

def create_scraper():
    """Create a cloudscraper session that bypasses Cloudflare."""
    scraper = cloudscraper.create_scraper(
        browser={
            'browser': 'chrome',
            'platform': 'windows',
            'desktop': True
        }
    )
    return scraper

# Usage
scraper = create_scraper()
response = scraper.get('https://api.example.com/endpoint')
data = response.json()
```
```python
import cloudscraper
import random
import string

class APIReplicator:
    """Replicate a discovered API from a website."""

    def __init__(self, base_url):
        self.base_url = base_url
        self.session = cloudscraper.create_scraper()

    def generate_token(self, pattern="random"):
        """Replicate discovered token generation."""
        if pattern == "solscan":
            # 40-char random string + fixed string at a random position
            chars = string.ascii_letters + string.digits
            token = ''.join(random.choice(chars) for _ in range(40))
            fixed = "B9dls0fK"
            pos = random.randint(0, len(token))
            return token[:pos] + fixed + token[pos:]
        else:
            # Generic random token
            return ''.join(random.choices(string.ascii_letters + string.digits, k=32))

    def get(self, endpoint, headers=None, auth_header=None, auth_pattern="random"):
        """Make an API request with the discovered auth."""
        url = f"{self.base_url}{endpoint}"

        # Build headers
        request_headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'application/json',
        }

        # Add the discovered auth header
        if auth_header:
            request_headers[auth_header] = self.generate_token(auth_pattern)

        # Merge custom headers
        if headers:
            request_headers.update(headers)

        response = self.session.get(url, headers=request_headers)
        return response

# Usage example
api = APIReplicator("https://api.solscan.io")
data = api.get(
    "/account/transactions",
    auth_header="sol-aut",
    auth_pattern="solscan"
)
print(data)
```
When approaching a new site:

| Step | Action | Tool |
|------|--------|------|
| 1 | Open DevTools Network tab | F12 |
| 2 | Reload page, filter by XHR/Fetch | Network filter |
| 3 | Look for JSON responses | Response tab |
| 4 | Check if the same endpoint is used for "premium" data | Compare requests |
| 5 | Find the auth header in JS files | Initiator column |
| 6 | Extract the token generation logic | JS debugger |
| 7 | Replicate in Python | Replicator class |
| 8 | Test against the API | Run script |
Extract brand data, colors, logos, and copy from any website:

```python
from scrapling.fetchers import Fetcher
from urllib.parse import urljoin

def extract_brand_data(url: str) -> dict:
    """Extract structured brand data from any website - Firecrawl style."""
    # Try a plain fetch first, fall back to stealth mode (handles anti-bot)
    try:
        page = Fetcher.get(url)
    except Exception:
        from scrapling.fetchers import StealthyFetcher
        page = StealthyFetcher.fetch(url, headless=True)

    # Helper to get text from an element
    def get_text(elements):
        return elements[0].text if elements else None

    # Helper to get an attribute
    def get_attr(elements, attr_name):
        return elements[0].attrib.get(attr_name) if elements else None

    # Brand name (try multiple selectors; meta tags carry their value in 'content')
    brand_name = (
        get_attr(page.css('[property="og:site_name"]'), 'content')
        or get_text(page.css('h1'))
        or get_text(page.css('title'))
    )

    # Tagline
    tagline = (
        get_attr(page.css('[property="og:description"]'), 'content')
        or get_text(page.css('.tagline'))
        or get_text(page.css('.hero-text'))
        or get_text(page.css('header h2'))
    )

    # Logo URL
    logo_url = (
        get_attr(page.css('[rel="icon"]'), 'href')
        or get_attr(page.css('[rel="apple-touch-icon"]'), 'href')
        or get_attr(page.css('.logo img'), 'src')
    )
    if logo_url and not logo_url.startswith('http'):
        logo_url = urljoin(url, logo_url)

    # Favicon
    favicon = get_attr(page.css('[rel="icon"]'), 'href')
    favicon_url = urljoin(url, favicon) if favicon else None

    # OG image
    og_image = get_attr(page.css('[property="og:image"]'), 'content')
    og_image_url = urljoin(url, og_image) if og_image else None

    # Screenshot (using an external service)
    screenshot_url = f"https://image.thum.io/get/width/1200/crop/800/{url}"

    # Description
    description = (
        get_attr(page.css('[property="og:description"]'), 'content')
        or get_attr(page.css('[name="description"]'), 'content')
    )

    # CTA text
    cta_text = (
        get_text(page.css('a[href*="signup"]'))
        or get_text(page.css('.cta'))
        or get_text(page.css('[class*="button"]'))
    )

    # Social links
    social_links = {}
    for platform in ['twitter', 'facebook', 'instagram', 'linkedin', 'youtube', 'github']:
        link = get_attr(page.css(f'a[href*="{platform}"]'), 'href')
        if link:
            social_links[platform] = link

    # Features (from a feature grid/cards)
    features = []
    feature_cards = page.css('[class*="feature"], .feature-card, .benefit-item')
    for card in feature_cards[:6]:
        feature_text = get_text(card.css('h3, h4, p'))
        if feature_text:
            features.append(feature_text.strip())

    return {
        'brandName': brand_name,
        'tagline': tagline,
        'description': description,
        'features': features,
        'logoUrl': logo_url,
        'faviconUrl': favicon_url,
        'ctaText': cta_text,
        'socialLinks': social_links,
        'screenshotUrl': screenshot_url,
        'ogImageUrl': og_image_url
    }

# Usage
brand_data = extract_brand_data('https://example.com')
print(brand_data)
```
```bash
# Extract brand data using the Python function above
python3 -c "
import json
import sys
sys.path.insert(0, '/path/to/skill')
from brand_extraction import extract_brand_data
data = extract_brand_data('$URL')
print(json.dumps(data, indent=2))
"
```
| Feature | Status | Notes |
|---------|--------|-------|
| Basic fetch | ✅ Working | Fetcher.get() |
| Stealthy fetch | ✅ Working | StealthyFetcher.fetch() |
| Dynamic fetch | ✅ Working | DynamicFetcher.fetch() |
| Adaptive parsing | ✅ Working | auto_save + adaptive |
| Spider crawling | ✅ Working | async def parse() |
| CSS selectors | ✅ Working | .css() |
| XPath | ✅ Working | .xpath() |
| Session management | ✅ Working | FetcherSession, StealthySession |
| Proxy rotation | ✅ Working | ProxyRotator class |
| CLI tools | ✅ Working | scrapling extract |
| Brand data extraction | ✅ Working | extract_brand_data() |
| API reverse engineering | ✅ Working | APIReplicator class |
| Cloudscraper bypass | ✅ Working | cloudscraper integration |
| Easy site crawl | ✅ Working | EasyCrawl class |
| Sitemap crawl | ✅ Working | get_sitemap_urls() |
| MCP server | ❌ Excluded | Not needed |
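The ProxyRotator class referenced in the table is not shown elsewhere in this document. A minimal sketch of what such a helper might look like, assuming the fetcher accepts a `proxy` keyword argument (check Scrapling's docs for the exact parameter name):

```python
from itertools import cycle
from scrapling.fetchers import Fetcher

class ProxyRotator:
    """Cycle through a list of proxies across requests (illustrative sketch only)."""

    def __init__(self, proxies):
        self._proxies = cycle(proxies)

    def get(self, url, **kwargs):
        # NOTE: 'proxy' is assumed here; the actual keyword may differ in Scrapling
        return Fetcher.get(url, proxy=next(self._proxies), **kwargs)

# Usage (hypothetical proxy endpoints)
rotator = ProxyRotator([
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
])
page = rotator.get('https://example.com')
```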
```python
page = Fetcher.get('https://spectrum.ieee.org/...')
title = page.css('h1::text').get()
content = page.css('article p::text').getall()
```

✅ Works
```python
page = Fetcher.get('https://news.ycombinator.com')
stories = page.css('.titleline a::text').getall()
```

✅ Works
```python
page = Fetcher.get('https://example.com')
title = page.css('h1::text').get()
```

✅ Works
| Issue | Solution |
|-------|----------|
| 403/429 blocked | Use StealthyFetcher or cloudscraper |
| Cloudflare | Use StealthyFetcher or cloudscraper |
| JavaScript required | Use DynamicFetcher |
| Site changed | Use adaptive=True |
| Paid API exposed | Use API reverse engineering |
| Captcha | Cannot bypass - skip or use the official API |
| Auth required | Do NOT bypass - use the official API |
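The first three rows of the table amount to an escalation ladder: plain HTTP first, stealth mode if blocked, a full browser render if the page needs JavaScript. A minimal sketch of that ladder using the fetchers shown earlier; the status checks are illustrative and real sites may fail in other ways:

```python
from scrapling.fetchers import Fetcher, StealthyFetcher, DynamicFetcher

def fetch_with_fallback(url: str):
    """Escalate from plain HTTP to stealth to a real browser, per the table above."""
    # 1. Plain HTTP fetch - fastest, works on most static sites
    try:
        page = Fetcher.get(url)
        if page.status == 200:
            return page
    except Exception:
        pass

    # 2. Stealth fetch - for 403/429 and Cloudflare-style blocks
    try:
        page = StealthyFetcher.fetch(url, headless=True, solve_cloudflare=True)
        if page.status == 200:
            return page
    except Exception:
        pass

    # 3. Dynamic fetch - when the content only appears after JavaScript runs
    return DynamicFetcher.fetch(url, headless=True, network_idle=True)

# Usage
page = fetch_with_fallback('https://example.com')
print(page.css('title::text').get())
```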
Related skills:
- [[content-research]] - Research workflow
- [[blogwatcher]] - RSS/feed monitoring
- [[youtube-watcher]] - Video content
- [[chirp]] - Twitter/X interactions
- [[newsletter-digest]] - Content summarization
- [[x-tweet-fetcher]] - X/Twitter (use instead of Scrapling)
- Added: Firecrawl-Style Crawl - combines sitemap discovery with link following
- Added: use_sitemap parameter - matches Firecrawl's sitemap "include"/"skip" behavior
- Verified: cloudflare.com returns 2,447 URLs from its sitemap
- Fixed: EasyCrawl Spider syntax - updated to work with scrapling's actual Spider API
- Verified: Spider crawling works - tested and crawled 20+ pages from example.com
- Added: Easy Site Crawl - auto-crawl all pages on a domain with the EasyCrawl spider
- Added: Sitemap Crawl - extract URLs from sitemap.xml and crawl them
- Feature parity with Firecrawl for site crawling capabilities
- Enhanced: API Reverse Engineering methodology - detailed step-by-step process from @paoloanzn's work and a real Solscan case study with exact timeline
- Added: Step-by-step methodology section
- Added: Real example documentation (Solscan March 2025 vs Feb 2026)
- Added: Discovery checklist with 10 steps
- Documented: How to find auth headers in JS files
- Documented: Token generation pattern extraction
- Updated: Cloudscraper integration with multi-attempt pattern
- Verified: Solscan now patched (Cloudflare on both endpoints)
- Fixed: Brand Data Extraction API - corrected selectors for scrapling's Response object
  - Fixed .html → .text / .body
  - Fixed .title() → page.css('title')
  - Fixed .logo img::src → .logo img::attr(src)
- Tested and verified working
- Added: API Reverse Engineering section
  - API Endpoint Discovery (Network tab analysis)
  - JavaScript Analysis (finding auth logic)
  - Cloudscraper integration for Cloudflare bypass
  - Complete APIReplicator class
  - Discovery checklist
- Added cloudscraper to installation
- Synced with upstream GitHub README exactly
- Added Brand Data Extraction section
- Clean, core-only version
- Synced with original Scrapling GitHub README

Last updated: 2026-02-25
Code helpers, APIs, CLIs, browser automation, testing, and developer operations.
Largest current source with strong distribution and engagement signals.