{
  "schemaVersion": "1.0",
  "item": {
    "slug": "scrapling",
    "name": "Scrapling",
    "source": "tencent",
    "type": "skill",
    "category": "开发工具",
    "sourceUrl": "https://clawhub.ai/zendenho7/scrapling",
    "canonicalUrl": "https://clawhub.ai/zendenho7/scrapling",
    "targetPlatform": "OpenClaw"
  },
  "install": {
    "downloadMode": "redirect",
    "downloadUrl": "/downloads/scrapling",
    "sourceDownloadUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=scrapling",
    "sourcePlatform": "tencent",
    "targetPlatform": "OpenClaw",
    "installMethod": "Manual import",
    "extraction": "Extract archive",
    "prerequisites": [
      "OpenClaw"
    ],
    "packageFormat": "ZIP package",
    "includedAssets": [
      "SKILL.md",
      "_meta.json",
      "run.sh"
    ],
    "primaryDoc": "SKILL.md",
    "quickSetup": [
      "Download the package from Yavira.",
      "Extract the archive and review SKILL.md first.",
      "Import or place the package into your OpenClaw setup."
    ],
    "agentAssist": {
      "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
      "steps": [
        "Download the package from Yavira.",
        "Extract it into a folder your agent can access.",
        "Paste one of the prompts below and point your agent at the extracted folder."
      ],
      "prompts": [
        {
          "label": "New install",
          "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
        },
        {
          "label": "Upgrade existing",
          "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
        }
      ]
    },
    "sourceHealth": {
      "source": "tencent",
      "status": "healthy",
      "reason": "direct_download_ok",
      "recommendedAction": "download",
      "checkedAt": "2026-05-07T17:22:31.273Z",
      "expiresAt": "2026-05-14T17:22:31.273Z",
      "httpStatus": 200,
      "finalUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=afrexai-annual-report",
      "contentType": "application/zip",
      "probeMethod": "head",
      "details": {
        "probeUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=afrexai-annual-report",
        "contentDisposition": "attachment; filename=\"afrexai-annual-report-1.0.0.zip\"",
        "redirectLocation": null,
        "bodySnippet": null
      },
      "scope": "source",
      "summary": "Source download looks usable.",
      "detail": "Yavira can redirect you to the upstream package for this source.",
      "primaryActionLabel": "Download for OpenClaw",
      "primaryActionHref": "/downloads/scrapling"
    },
    "validation": {
      "installChecklist": [
        "Use the Yavira download entry.",
        "Review SKILL.md after the package is downloaded.",
        "Confirm the extracted package contains the expected setup assets."
      ],
      "postInstallChecks": [
        "Confirm the extracted package includes the expected docs or setup files.",
        "Validate the skill or prompts are available in your target agent workspace.",
        "Capture any manual follow-up steps the agent could not complete."
      ]
    },
    "downloadPageUrl": "https://openagent3.xyz/downloads/scrapling",
    "agentPageUrl": "https://openagent3.xyz/skills/scrapling/agent",
    "manifestUrl": "https://openagent3.xyz/skills/scrapling/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/scrapling/agent.md"
  },
  "agentAssist": {
    "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
    "steps": [
      "Download the package from Yavira.",
      "Extract it into a folder your agent can access.",
      "Paste one of the prompts below and point your agent at the extracted folder."
    ],
    "prompts": [
      {
        "label": "New install",
        "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
      },
      {
        "label": "Upgrade existing",
        "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
      }
    ]
  },
  "documentation": {
    "source": "clawhub",
    "primaryDoc": "SKILL.md",
    "sections": [
      {
        "title": "Scrapling - Adaptive Web Scraping",
        "body": "\"Effortless web scraping for the modern web.\""
      },
      {
        "title": "Core Library",
        "body": "Repository: https://github.com/D4Vinci/Scrapling\nAuthor: D4Vinci (Karim Shoair)\nLicense: BSD-3-Clause\nDocumentation: https://scrapling.readthedocs.io"
      },
      {
        "title": "API Reverse Engineering Methodology",
        "body": "GitHub: https://github.com/paoloanzn/free-solscan-api\nX Post: https://x.com/paoloanzn/status/2026361234032046319\nAuthor: @paoloanzn\nInsight: \"Web scraping is 80% reverse engineering\""
      },
      {
        "title": "Installation",
        "body": "# Core library (parser only)\npip install scrapling\n\n# With fetchers (HTTP + browser automation) - RECOMMENDED\npip install \"scrapling[fetchers]\"\nscrapling install\n\n# With shell (CLI tools) - RECOMMENDED\npip install \"scrapling[shell]\"\n\n# With AI (MCP server) - OPTIONAL\npip install \"scrapling[ai]\"\n\n# Everything\npip install \"scrapling[all]\"\n\n# Browser for stealth/dynamic mode\nplaywright install chromium\n\n# For Cloudflare bypass (advanced)\npip install cloudscraper"
      },
      {
        "title": "When to Use Scrapling",
        "body": "Use Scrapling when:\n\nResearch topics from websites\nExtract data from blogs, news sites, docs\nCrawl multiple pages with Spider\nGather content for summaries\nExtract brand data from any website\nReverse engineer APIs from websites\n\nDo NOT use for:\n\nX/Twitter (use x-tweet-fetcher skill)\nLogin-protected sites (unless credentials provided)\nPaywalled content (respect robots.txt)\nSites that prohibit scraping in their TOS"
      },
      {
        "title": "1. Basic Fetch (Most Common)",
        "body": "from scrapling.fetchers import Fetcher\n\npage = Fetcher.get('https://example.com')\n\n# Extract content\ntitle = page.css('h1::text').get()\nparagraphs = page.css('p::text').getall()"
      },
      {
        "title": "2. Stealthy Fetch (Anti-Bot/Cloudflare)",
        "body": "from scrapling.fetchers import StealthyFetcher\n\nStealthyFetcher.adaptive = True\npage = StealthyFetcher.fetch('https://example.com', headless=True, solve_cloudflare=True)"
      },
      {
        "title": "3. Dynamic Fetch (Full Browser Automation)",
        "body": "from scrapling.fetchers import DynamicFetcher\n\npage = DynamicFetcher.fetch('https://example.com', headless=True, network_idle=True)"
      },
      {
        "title": "4. Adaptive Parsing (Survives Design Changes)",
        "body": "from scrapling.fetchers import Fetcher\n\npage = Fetcher.get('https://example.com')\n\n# First scrape - saves selectors\nitems = page.css('.product', auto_save=True)\n\n# Later - if site changes, use adaptive=True to relocate\nitems = page.css('.product', adaptive=True)"
      },
      {
        "title": "5. Spider (Multiple Pages)",
        "body": "from scrapling.spiders import Spider, Response\n\nclass MySpider(Spider):\n    name = \"demo\"\n    start_urls = [\"https://example.com\"]\n    concurrent_requests = 3\n    \n    async def parse(self, response: Response):\n        for item in response.css('.item'):\n            yield {\"item\": item.css('h2::text').get()}\n        \n        # Follow links\n        next_page = response.css('.next a')\n        if next_page:\n            yield response.follow(next_page[0].attrib['href'])\n\nMySpider().start()"
      },
      {
        "title": "6. CLI Usage",
        "body": "# Simple fetch to file\nscrapling extract get https://example.com content.html\n\n# Stealthy fetch (bypass anti-bot)\nscrapling extract stealthy-fetch https://example.com content.html\n\n# Interactive shell\nscrapling shell https://example.com"
      },
      {
        "title": "Extract Article Content",
        "body": "from scrapling.fetchers import Fetcher\n\npage = Fetcher.get('https://example.com/article')\n\n# Try multiple selectors for title\ntitle = (\n    page.css('[itemprop=\"headline\"]::text').get() or\n    page.css('article h1::text').get() or\n    page.css('h1::text').get()\n)\n\n# Get paragraphs\ncontent = page.css('article p::text, .article-body p::text').getall()\n\nprint(f\"Title: {title}\")\nprint(f\"Paragraphs: {len(content)}\")"
      },
      {
        "title": "Research Multiple Pages",
        "body": "from scrapling.spiders import Spider, Response\n\nclass ResearchSpider(Spider):\n    name = \"research\"\n    start_urls = [\"https://news.ycombinator.com\"]\n    concurrent_requests = 5\n    \n    async def parse(self, response: Response):\n        for item in response.css('.titleline a::text').getall()[:10]:\n            yield {\"title\": item, \"source\": \"HN\"}\n        \n        more = response.css('.morelink::attr(href)').get()\n        if more:\n            yield response.follow(more)\n\nResearchSpider().start()"
      },
      {
        "title": "Crawl Entire Site (Easy Mode)",
        "body": "Auto-crawl all pages on a domain by following internal links:\n\nfrom scrapling.spiders import Spider, Response\nfrom urllib.parse import urljoin, urlparse\n\nclass EasyCrawl(Spider):\n    \"\"\"Auto-crawl all pages on a domain.\"\"\"\n    \n    name = \"easy_crawl\"\n    start_urls = [\"https://example.com\"]\n    concurrent_requests = 3\n    \n    def __init__(self):\n        super().__init__()\n        self.visited = set()\n    \n    async def parse(self, response: Response):\n        # Extract content\n        yield {\n            'url': response.url,\n            'title': response.css('title::text').get(),\n            'h1': response.css('h1::text').get(),\n        }\n        \n        # Follow internal links (limit to 50 pages)\n        if len(self.visited) >= 50:\n            return\n        \n        self.visited.add(response.url)\n        \n        links = response.css('a::attr(href)').getall()[:20]\n        for link in links:\n            full_url = urljoin(response.url, link)\n            if full_url not in self.visited:\n                yield response.follow(full_url)\n\n# Usage\nresult = EasyCrawl()\nresult.start()"
      },
      {
        "title": "Sitemap Crawl",
        "body": "Crawl pages from sitemap.xml (with fallback to link discovery):\n\nfrom scrapling.fetchers import Fetcher\nfrom scrapling.spiders import Spider, Response\nfrom urllib.parse import urljoin, urlparse\nimport re\n\ndef get_sitemap_urls(url: str, max_urls: int = 100) -> list:\n    \"\"\"Extract URLs from sitemap.xml - also checks robots.txt.\"\"\"\n    \n    parsed = urlparse(url)\n    base_url = f\"{parsed.scheme}://{parsed.netloc}\"\n    \n    sitemap_urls = [\n        f\"{base_url}/sitemap.xml\",\n        f\"{base_url}/sitemap-index.xml\",\n        f\"{base_url}/sitemap_index.xml\",\n        f\"{base_url}/sitemap-news.xml\",\n    ]\n    \n    all_urls = []\n    \n    # First check robots.txt for sitemap URL\n    try:\n        robots = Fetcher.get(f\"{base_url}/robots.txt\")\n        if robots.status == 200:\n            sitemap_in_robots = re.findall(r'Sitemap:\\s*(\\S+)', robots.text, re.IGNORECASE)\n            for sm in sitemap_in_robots:\n                sitemap_urls.insert(0, sm)\n    except:\n        pass\n    \n    # Try each sitemap location\n    for sitemap_url in sitemap_urls:\n        try:\n            page = Fetcher.get(sitemap_url, timeout=10)\n            if page.status != 200:\n                continue\n            \n            text = page.text\n            \n            # Check if it's XML\n            if '<?xml' in text or '<urlset' in text or '<sitemapindex' in text:\n                urls = re.findall(r'<loc>([^<]+)</loc>', text)\n                all_urls.extend(urls[:max_urls])\n                print(f\"Found {len(urls)} URLs in {sitemap_url}\")\n        except:\n            continue\n    \n    return list(set(all_urls))[:max_urls]\n\ndef crawl_from_sitemap(domain_url: str, max_pages: int = 50):\n    \"\"\"Crawl pages from sitemap.\"\"\"\n    \n    print(f\"Fetching sitemap for {domain_url}...\")\n    urls = get_sitemap_urls(domain_url)\n    \n    if not urls:\n        print(\"No sitemap found. Use EasyCrawl instead!\")\n        return []\n    \n    print(f\"Found {len(urls)} URLs, crawling first {max_pages}...\")\n    \n    results = []\n    for url in urls[:max_pages]:\n        try:\n            page = Fetcher.get(url, timeout=10)\n            results.append({\n                'url': url,\n                'status': page.status,\n                'title': page.css('title::text').get(),\n            })\n        except Exception as e:\n            results.append({'url': url, 'error': str(e)[:50]})\n    \n    return results\n\n# Usage\nprint(\"=== Sitemap Crawl ===\")\nresults = crawl_from_sitemap('https://example.com', max_pages=10)\nfor r in results[:3]:\n    print(f\"  {r.get('title', r.get('error', 'N/A'))}\")\n\n# Alternative: Easy crawl all links\nprint(\"\\n=== Easy Crawl (Link Discovery) ===\")\nresult = EasyCrawl(start_urls=[\"https://example.com\"], max_pages=10).start()\nprint(f\"Crawled {len(result.items)} pages\")"
      },
      {
        "title": "Firecrawl-Style Crawl (Best of Both Worlds)",
        "body": "Inspired by Firecrawl's behavior - combines sitemap discovery with link following:\n\nfrom scrapling.fetchers import Fetcher\nfrom scrapling.spiders import Spider, Response\nfrom urllib.parse import urljoin, urlparse\nimport re\n\ndef firecrawl_crawl(url: str, max_pages: int = 50, use_sitemap: bool = True):\n    \"\"\"\n    Firecrawl-style crawling:\n    - use_sitemap=True: Discover URLs from sitemap first (default)\n    - use_sitemap=False: Only follow HTML links (like sitemap:\"skip\")\n    \n    Matches Firecrawl's crawl behavior.\n    \"\"\"\n    \n    parsed = urlparse(url)\n    domain = parsed.netloc\n    \n    # ========== Method 1: Sitemap Discovery ==========\n    if use_sitemap:\n        print(f\"[Firecrawl] Discovering URLs from sitemap...\")\n        \n        sitemap_urls = [\n            f\"{url.rstrip('/')}/sitemap.xml\",\n            f\"{url.rstrip('/')}/sitemap-index.xml\",\n        ]\n        \n        all_urls = []\n        \n        # Try sitemaps\n        for sm_url in sitemap_urls:\n            try:\n                page = Fetcher.get(sm_url, timeout=15)\n                if page.status == 200:\n                    # Handle bytes\n                    text = page.body.decode('utf-8', errors='ignore') if isinstance(page.body, bytes) else str(page.body)\n                    \n                    if '<urlset' in text:\n                        urls = re.findall(r'<loc>([^<]+)</loc>', text)\n                        all_urls.extend(urls[:max_pages])\n                        print(f\"[Firecrawl] Found {len(urls)} URLs in {sm_url}\")\n            except:\n                continue\n        \n        if all_urls:\n            print(f\"[Firecrawl] Total: {len(all_urls)} URLs from sitemap\")\n            \n            # Crawl discovered URLs\n            results = []\n            for page_url in all_urls[:max_pages]:\n                try:\n                    page = Fetcher.get(page_url, timeout=15)\n                    results.append({\n                        'url': page_url,\n                        'status': page.status,\n                        'title': page.css('title::text').get() if page.status == 200 else None,\n                    })\n                except Exception as e:\n                    results.append({'url': page_url, 'error': str(e)[:50]})\n            \n            return results\n    \n    # ========== Method 2: Link Discovery (sitemap: skip) ==========\n    print(f\"[Firecrawl] Sitemap skip - using link discovery...\")\n    \n    class LinkCrawl(Spider):\n        name = \"firecrawl_link\"\n        start_urls = [url]\n        concurrent_requests = 3\n        \n        def __init__(self):\n            super().__init__()\n            self.visited = set()\n            self.domain = domain\n            self.results = []\n        \n        async def parse(self, response: Response):\n            if len(self.results) >= max_pages:\n                return\n            \n            self.results.append({\n                'url': response.url,\n                'status': response.status,\n                'title': response.css('title::text').get(),\n            })\n            \n            # Follow internal links\n            links = response.css('a::attr(href)').getall()[:20]\n            for link in links:\n                full_url = urljoin(response.url, link)\n                parsed_link = urlparse(full_url)\n                \n                if parsed_link.netloc == self.domain and full_url not in self.visited:\n                    
self.visited.add(full_url)\n                    if len(self.visited) < max_pages:\n                        yield response.follow(full_url)\n    \n    result = LinkCrawl()\n    result.start()\n    return result.results\n\n# Usage\nprint(\"=== Firecrawl-Style (sitemap: include) ===\")\nresults = firecrawl_crawl('https://www.cloudflare.com', max_pages=5, use_sitemap=True)\nprint(f\"Crawled: {len(results)} pages\")\n\nprint(\"\\n=== Firecrawl-Style (sitemap: skip) ===\")\nresults = firecrawl_crawl('https://example.com', max_pages=5, use_sitemap=False)\nprint(f\"Crawled: {len(results)} pages\")"
      },
      {
        "title": "Handle Errors",
        "body": "from scrapling.fetchers import Fetcher, StealthyFetcher\n\ntry:\n    page = Fetcher.get('https://example.com')\nexcept Exception as e:\n    # Try stealth mode\n    page = StealthyFetcher.fetch('https://example.com', headless=True)\n    \nif page.status == 403:\n    print(\"Blocked - try StealthyFetcher\")\nelif page.status == 200:\n    print(\"Success!\")"
      },
      {
        "title": "Session Management",
        "body": "from scrapling.fetchers import FetcherSession\n\nwith FetcherSession(impersonate='chrome') as session:\n    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)\n    quotes = page.css('.quote .text::text').getall()"
      },
      {
        "title": "Multiple Session Types in Spider",
        "body": "from scrapling.spiders import Spider, Request, Response\nfrom scrapling.fetchers import FetcherSession, AsyncStealthySession\n\nclass MultiSessionSpider(Spider):\n    name = \"multi\"\n    start_urls = [\"https://example.com/\"]\n    \n    def configure_sessions(self, manager):\n        manager.add(\"fast\", FetcherSession(impersonate=\"chrome\"))\n        manager.add(\"stealth\", AsyncStealthySession(headless=True), lazy=True)\n    \n    async def parse(self, response: Response):\n        for link in response.css('a::attr(href)').getall():\n            if \"protected\" in link:\n                yield Request(link, sid=\"stealth\")\n            else:\n                yield Request(link, sid=\"fast\", callback=self.parse)"
      },
      {
        "title": "Advanced Parsing & Navigation",
        "body": "from scrapling.fetchers import Fetcher\n\npage = Fetcher.get('https://quotes.toscrape.com/')\n\n# Multiple selection methods\nquotes = page.css('.quote')           # CSS\nquotes = page.xpath('//div[@class=\"quote\"]')  # XPath\nquotes = page.find_all('div', class_='quote')  # BeautifulSoup-style\n\n# Navigation\nfirst_quote = page.css('.quote')[0]\nauthor = first_quote.css('.author::text').get()\nparent = first_quote.parent\n\n# Find similar elements\nsimilar = first_quote.find_similar()"
      },
      {
        "title": "Advanced: API Reverse Engineering",
        "body": "\"Web scraping is 80% reverse engineering.\"\n\nThis section covers advanced techniques to discover and replicate APIs directly from websites — often revealing data that's \"hidden\" behind paid APIs."
      },
      {
        "title": "1. API Endpoint Discovery",
        "body": "Many websites load data via client-side requests. Use browser DevTools to find them:\n\nSteps:\n\nOpen browser DevTools (F12)\nGo to Network tab\nReload the page\nLook for XHR or Fetch requests\nCheck if endpoints return JSON data\n\nWhat to look for:\n\nRequests to /api/* endpoints\nResponses containing structured data (JSON)\nSame endpoints used on both free and paid sections\n\nExample pattern:\n\n# Found in Network tab:\nGET https://api.example.com/v1/users/transactions\nResponse: {\"data\": [...], \"pagination\": {...}}"
      },
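      {
        "title": "Example: Verifying a Discovered Endpoint (sketch)",
        "body": "A minimal sketch, assuming the endpoint found in the Network tab works without auth: request it directly and confirm it returns JSON before investing time in replicating headers or tokens. The URL is a placeholder taken from the example pattern above.\n\nimport requests\n\n# Assumption: endpoint copied from the DevTools Network tab\nurl = 'https://api.example.com/v1/users/transactions'\n\nresponse = requests.get(url, headers={'Accept': 'application/json'}, timeout=15)\nprint(response.status_code, response.headers.get('Content-Type'))\n\nif 'json' in response.headers.get('Content-Type', ''):\n    data = response.json()\n    # Inspect the top-level shape (e.g. 'data', 'pagination') before writing a scraper\n    print(list(data.keys()) if isinstance(data, dict) else type(data))"
      },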
      {
        "title": "2. JavaScript Analysis",
        "body": "Auth tokens often generated client-side. Find them in .js files:\n\nSteps:\n\nIn Network tab, look at Initiator column\nClick the .js file making the request\nSearch for auth header name (e.g., sol-aut, Authorization, X-API-Key)\nFind the function generating the token\n\nCommon patterns:\n\nPlain text function names: generateToken(), createAuthHeader()\nObfuscated: Search for the header name directly\nRandom string generation: Math.random(), crypto.getRandomValues()"
      },
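      {
        "title": "Example: Searching Scripts for the Auth Header (sketch)",
        "body": "A minimal sketch that automates the manual search above: fetch a page with Scrapling's Fetcher, download each referenced .js file, and look for a suspected auth header name. The header name 'sol-aut' and the start URL are placeholders; replace them with what you saw in DevTools.\n\nfrom urllib.parse import urljoin\n\nfrom scrapling.fetchers import Fetcher\n\nHEADER_NAME = 'sol-aut'  # Assumption: the header name you spotted in the Network tab\nSTART_URL = 'https://example.com'\n\npage = Fetcher.get(START_URL)\n\n# Collect the script URLs referenced by the page\nscript_urls = [urljoin(START_URL, src) for src in page.css('script::attr(src)').getall()]\n\nfor script_url in script_urls:\n    try:\n        js = Fetcher.get(script_url, timeout=10).text\n    except Exception:\n        continue\n    if HEADER_NAME in js:\n        # Likely the file that builds the auth header - open it in a JS debugger next\n        print(f'Found {HEADER_NAME!r} in {script_url}')"
      },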
      {
        "title": "3. Replicating Discovered APIs",
        "body": "Once you've found the endpoint and auth pattern:\n\nimport requests\nimport random\nimport string\n\ndef generate_auth_token():\n    \"\"\"Replicate discovered token generation logic.\"\"\"\n    chars = string.ascii_letters + string.digits\n    token = ''.join(random.choice(chars) for _ in range(40))\n    # Insert fixed string at random position\n    fixed = \"B9dls0fK\"\n    pos = random.randint(0, len(token))\n    return token[:pos] + fixed + token[pos:]\n\ndef scrape_api_endpoint(url):\n    \"\"\"Hit discovered API endpoint with replicated auth.\"\"\"\n    headers = {\n        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',\n        'Accept': 'application/json',\n        'sol-aut': generate_auth_token(),  # Replicate discovered header\n    }\n    \n    response = requests.get(url, headers=headers)\n    return response.json()"
      },
      {
        "title": "4. Cloudscraper Bypass (Cloudflare)",
        "body": "For Cloudflare-protected endpoints, use cloudscraper:\n\npip install cloudscraper\n\nimport cloudscraper\n\ndef create_scraper():\n    \"\"\"Create a cloudscraper session that bypasses Cloudflare.\"\"\"\n    scraper = cloudscraper.create_scraper(\n        browser={\n            'browser': 'chrome',\n            'platform': 'windows',\n            'desktop': True\n        }\n    )\n    return scraper\n\n# Usage\nscraper = create_scraper()\nresponse = scraper.get('https://api.example.com/endpoint')\ndata = response.json()"
      },
      {
        "title": "5. Complete API Replication Pattern",
        "body": "import cloudscraper\nimport random\nimport string\nimport json\n\nclass APIReplicator:\n    \"\"\"Replicate discovered API from website.\"\"\"\n    \n    def __init__(self, base_url):\n        self.base_url = base_url\n        self.session = cloudscraper.create_scraper()\n    \n    def generate_token(self, pattern=\"random\"):\n        \"\"\"Replicate discovered token generation.\"\"\"\n        if pattern == \"solscan\":\n            # 40-char random + fixed string at random position\n            chars = string.ascii_letters + string.digits\n            token = ''.join(random.choice(chars) for _ in range(40))\n            fixed = \"B9dls0fK\"\n            pos = random.randint(0, len(token))\n            return token[:pos] + fixed + token[pos:]\n        else:\n            # Generic random token\n            return ''.join(random.choices(string.ascii_letters + string.digits, k=32))\n    \n    def get(self, endpoint, headers=None, auth_header=None, auth_pattern=\"random\"):\n        \"\"\"Make API request with discovered auth.\"\"\"\n        url = f\"{self.base_url}{endpoint}\"\n        \n        # Build headers\n        request_headers = {\n            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',\n            'Accept': 'application/json',\n        }\n        \n        # Add discovered auth header\n        if auth_header:\n            request_headers[auth_header] = self.generate_token(auth_pattern)\n        \n        # Merge custom headers\n        if headers:\n            request_headers.update(headers)\n        \n        response = self.session.get(url, headers=request_headers)\n        return response\n\n# Usage example\napi = APIReplicator(\"https://api.solscan.io\")\ndata = api.get(\n    \"/account/transactions\",\n    auth_header=\"sol-aut\",\n    auth_pattern=\"solscan\"\n)\nprint(data)"
      },
      {
        "title": "6. Discovery Checklist",
        "body": "When approaching a new site:\n\nStepActionTool1Open DevTools Network tabF122Reload page, filter by XHR/FetchNetwork filter3Look for JSON responsesResponse tab4Check if same endpoint used for \"premium\" dataCompare requests5Find auth header in JS filesInitiator column6Extract token generation logicJS debugger7Replicate in PythonReplicator class8Test against APIRun script"
      },
      {
        "title": "Brand Data Extraction (Firecrawl Alternative)",
        "body": "Extract brand data, colors, logos, and copy from any website:\n\nfrom scrapling.fetchers import Fetcher\nfrom urllib.parse import urljoin\nimport re\n\ndef extract_brand_data(url: str) -> dict:\n    \"\"\"Extract structured brand data from any website - Firecrawl style.\"\"\"\n    \n    # Try stealth mode first (handles anti-bot)\n    try:\n        page = Fetcher.get(url)\n    except:\n        from scrapling.fetchers import StealthyFetcher\n        page = StealthyFetcher.fetch(url, headless=True)\n    \n    # Helper to get text from element\n    def get_text(elements):\n        return elements[0].text if elements else None\n    \n    # Helper to get attribute\n    def get_attr(elements, attr_name):\n        return elements[0].attrib.get(attr_name) if elements else None\n    \n    # Brand name (try multiple selectors)\n    brand_name = (\n        get_text(page.css('[property=\"og:site_name\"]')) or\n        get_text(page.css('h1')) or\n        get_text(page.css('title'))\n    )\n    \n    # Tagline\n    tagline = (\n        get_text(page.css('[property=\"og:description\"]')) or\n        get_text(page.css('.tagline')) or\n        get_text(page.css('.hero-text')) or\n        get_text(page.css('header h2'))\n    )\n    \n    # Logo URL\n    logo_url = (\n        get_attr(page.css('[rel=\"icon\"]'), 'href') or\n        get_attr(page.css('[rel=\"apple-touch-icon\"]'), 'href') or\n        get_attr(page.css('.logo img'), 'src')\n    )\n    if logo_url and not logo_url.startswith('http'):\n        logo_url = urljoin(url, logo_url)\n    \n    # Favicon\n    favicon = get_attr(page.css('[rel=\"icon\"]'), 'href')\n    favicon_url = urljoin(url, favicon) if favicon else None\n    \n    # OG Image\n    og_image = get_attr(page.css('[property=\"og:image\"]'), 'content')\n    og_image_url = urljoin(url, og_image) if og_image else None\n    \n    # Screenshot (using external service)\n    screenshot_url = f\"https://image.thum.io/get/width/1200/crop/800/{url}\"\n    \n    # Description\n    description = (\n        get_text(page.css('[property=\"og:description\"]')) or\n        get_attr(page.css('[name=\"description\"]'), 'content')\n    )\n    \n    # CTA text\n    cta_text = (\n        get_text(page.css('a[href*=\"signup\"]')) or\n        get_text(page.css('.cta')) or\n        get_text(page.css('[class*=\"button\"]'))\n    )\n    \n    # Social links\n    social_links = {}\n    for platform in ['twitter', 'facebook', 'instagram', 'linkedin', 'youtube', 'github']:\n        link = get_attr(page.css(f'a[href*=\"{platform}\"]'), 'href')\n        if link:\n            social_links[platform] = link\n    \n    # Features (from feature grid/cards)\n    features = []\n    feature_cards = page.css('[class*=\"feature\"], .feature-card, .benefit-item')\n    for card in feature_cards[:6]:\n        feature_text = get_text(card.css('h3, h4, p'))\n        if feature_text:\n            features.append(feature_text.strip())\n    \n    return {\n        'brandName': brand_name,\n        'tagline': tagline,\n        'description': description,\n        'features': features,\n        'logoUrl': logo_url,\n        'faviconUrl': favicon_url,\n        'ctaText': cta_text,\n        'socialLinks': social_links,\n        'screenshotUrl': screenshot_url,\n        'ogImageUrl': og_image_url\n    }\n\n# Usage\nbrand_data = extract_brand_data('https://example.com')\nprint(brand_data)"
      },
      {
        "title": "Brand Data CLI",
        "body": "# Extract brand data using the Python function above\npython3 -c \"\nimport json\nimport sys\nsys.path.insert(0, '/path/to/skill')\nfrom brand_extraction import extract_brand_data\ndata = extract_brand_data('$URL')\nprint(json.dumps(data, indent=2))\n\""
      },
      {
        "title": "Feature Comparison",
        "body": "FeatureStatusNotesBasic fetch✅ WorkingFetcher.get()Stealthy fetch✅ WorkingStealthyFetcher.fetch()Dynamic fetch✅ WorkingDynamicFetcher.fetch()Adaptive parsing✅ Workingauto_save + adaptiveSpider crawling✅ Workingasync def parse()CSS selectors✅ Working.css()XPath✅ Working.xpath()Session management✅ WorkingFetcherSession, StealthySessionProxy rotation✅ WorkingProxyRotator classCLI tools✅ Workingscrapling extractBrand data extraction✅ Workingextract_brand_data()API reverse engineering✅ WorkingAPIReplicator classCloudscraper bypass✅ Workingcloudscraper integrationEasy site crawl✅ WorkingEasyCrawl classSitemap crawl✅ Workingget_sitemap_urls()MCP server❌ ExcludedNot needed"
      },
      {
        "title": "IEEE Spectrum",
        "body": "page = Fetcher.get('https://spectrum.ieee.org/...')\ntitle = page.css('h1::text').get()\ncontent = page.css('article p::text').getall()\n\n✅ Works"
      },
      {
        "title": "Hacker News",
        "body": "page = Fetcher.get('https://news.ycombinator.com')\nstories = page.css('.titleline a::text').getall()\n\n✅ Works"
      },
      {
        "title": "Example Domain",
        "body": "page = Fetcher.get('https://example.com')\ntitle = page.css('h1::text').get()\n\n✅ Works"
      },
      {
        "title": "🔧 Quick Troubleshooting",
        "body": "IssueSolution403/429 BlockedUse StealthyFetcher or cloudscraperCloudflareUse StealthyFetcher or cloudscraperJavaScript requiredUse DynamicFetcherSite changedUse adaptive=TruePaid API exposedUse API reverse engineeringCaptchaCannot bypass - skip or use official APIAuth requiredDo NOT bypass - use official API"
      },
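      {
        "title": "Escalation Pattern (sketch)",
        "body": "A hedged sketch tying the troubleshooting table together: start with the plain Fetcher and escalate to StealthyFetcher (anti-bot/Cloudflare) and then DynamicFetcher (JavaScript rendering) only when the cheaper option fails. The status-code checks are a simplification; adapt them to the site you are scraping.\n\nfrom scrapling.fetchers import Fetcher, StealthyFetcher, DynamicFetcher\n\ndef fetch_with_escalation(url: str):\n    \"\"\"Try the cheapest fetcher first; escalate only on blocks or errors.\"\"\"\n    try:\n        page = Fetcher.get(url)\n        if page.status == 200:\n            return page\n    except Exception:\n        pass\n    \n    # 403/429 or a Cloudflare challenge: try the stealth browser\n    page = StealthyFetcher.fetch(url, headless=True, solve_cloudflare=True)\n    if page.status == 200:\n        return page\n    \n    # Still blocked or the content needs JavaScript: full browser automation\n    return DynamicFetcher.fetch(url, headless=True, network_idle=True)\n\n# Usage\npage = fetch_with_escalation('https://example.com')\nprint(page.status, page.css('title::text').get())"
      },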
      {
        "title": "Skill Graph",
        "body": "Related skills:\n\n[[content-research]] - Research workflow\n[[blogwatcher]] - RSS/feed monitoring\n[[youtube-watcher]] - Video content\n[[chirp]] - Twitter/X interactions\n[[newsletter-digest]] - Content summarization\n[[x-tweet-fetcher]] - X/Twitter (use instead of Scrapling)"
      },
      {
        "title": "v1.0.8 (2026-02-25)",
        "body": "Added: Firecrawl-Style Crawl - Combines sitemap discovery + link following\nAdded: use_sitemap parameter - Matches Firecrawl's sitemap:\"include\"/\"skip\" behavior\nVerified: cloudflare.com returns 2,447 URLs from sitemap!"
      },
      {
        "title": "v1.0.7 (2026-02-25)",
        "body": "Fixed: EasyCrawl Spider syntax - Updated to work with scrapling's actual Spider API\nVerified: Spider crawling works - Tested and crawled 20+ pages from example.com"
      },
      {
        "title": "v1.0.6 (2026-02-25)",
        "body": "Added: Easy Site Crawl - Auto-crawl all pages on a domain with EasyCrawl spider\nAdded: Sitemap Crawl - Extract URLs from sitemap.xml and crawl them\nFeature parity with Firecrawl for site crawling capabilities"
      },
      {
        "title": "v1.0.5 (2026-02-25)",
        "body": "Enhanced: API Reverse Engineering methodology\n\nDetailed step-by-step process from @paoloanzn's work\nReal Solscan case study with exact timeline\nAdded: Step-by-step methodology section\nAdded: Real example documentation (Solscan March 2025 vs Feb 2026)\nAdded: Discovery checklist with 10 steps\nDocumented: How to find auth headers in JS files\nDocumented: Token generation pattern extraction\nUpdated: Cloudscraper integration with multi-attempt pattern\nVerified: Solscan now patched (Cloudflare on both endpoints)"
      },
      {
        "title": "v1.0.4 (2026-02-25)",
        "body": "Fixed: Brand Data Extraction API - Corrected selectors for scrapling's Response object\nFixed .html → .text / .body\nFixed .title() → page.css('title')\nFixed .logo img::src → .logo img::attr(src)\nTested and verified working"
      },
      {
        "title": "v1.0.3 (2026-02-25)",
        "body": "Added: API Reverse Engineering section\n\nAPI Endpoint Discovery (Network tab analysis)\nJavaScript Analysis (finding auth logic)\nCloudscraper integration for Cloudflare bypass\nComplete APIReplicator class\nDiscovery checklist\n\n\nAdded cloudscraper to installation"
      },
      {
        "title": "v1.0.2 (2026-02-25)",
        "body": "Synced with upstream GitHub README exactly\nAdded Brand Data Extraction section\nClean, core-only version"
      },
      {
        "title": "v1.0.1 (2026-02-25)",
        "body": "Synced with original Scrapling GitHub README\n\nLast updated: 2026-02-25"
      }
    ],
    "body": "Scrapling - Adaptive Web Scraping\n\n\"Effortless web scraping for the modern web.\"\n\nCredits\nCore Library\nRepository: https://github.com/D4Vinci/Scrapling\nAuthor: D4Vinci (Karim Shoair)\nLicense: BSD-3-Clause\nDocumentation: https://scrapling.readthedocs.io\nAPI Reverse Engineering Methodology\nGitHub: https://github.com/paoloanzn/free-solscan-api\nX Post: https://x.com/paoloanzn/status/2026361234032046319\nAuthor: @paoloanzn\nInsight: \"Web scraping is 80% reverse engineering\"\nInstallation\n# Core library (parser only)\npip install scrapling\n\n# With fetchers (HTTP + browser automation) - RECOMMENDED\npip install \"scrapling[fetchers]\"\nscrapling install\n\n# With shell (CLI tools) - RECOMMENDED\npip install \"scrapling[shell]\"\n\n# With AI (MCP server) - OPTIONAL\npip install \"scrapling[ai]\"\n\n# Everything\npip install \"scrapling[all]\"\n\n# Browser for stealth/dynamic mode\nplaywright install chromium\n\n# For Cloudflare bypass (advanced)\npip install cloudscraper\n\nAgent Instructions\nWhen to Use Scrapling\n\nUse Scrapling when:\n\nResearch topics from websites\nExtract data from blogs, news sites, docs\nCrawl multiple pages with Spider\nGather content for summaries\nExtract brand data from any website\nReverse engineer APIs from websites\n\nDo NOT use for:\n\nX/Twitter (use x-tweet-fetcher skill)\nLogin-protected sites (unless credentials provided)\nPaywalled content (respect robots.txt)\nSites that prohibit scraping in their TOS\nQuick Commands\n1. Basic Fetch (Most Common)\nfrom scrapling.fetchers import Fetcher\n\npage = Fetcher.get('https://example.com')\n\n# Extract content\ntitle = page.css('h1::text').get()\nparagraphs = page.css('p::text').getall()\n\n2. Stealthy Fetch (Anti-Bot/Cloudflare)\nfrom scrapling.fetchers import StealthyFetcher\n\nStealthyFetcher.adaptive = True\npage = StealthyFetcher.fetch('https://example.com', headless=True, solve_cloudflare=True)\n\n3. Dynamic Fetch (Full Browser Automation)\nfrom scrapling.fetchers import DynamicFetcher\n\npage = DynamicFetcher.fetch('https://example.com', headless=True, network_idle=True)\n\n4. Adaptive Parsing (Survives Design Changes)\nfrom scrapling.fetchers import Fetcher\n\npage = Fetcher.get('https://example.com')\n\n# First scrape - saves selectors\nitems = page.css('.product', auto_save=True)\n\n# Later - if site changes, use adaptive=True to relocate\nitems = page.css('.product', adaptive=True)\n\n5. Spider (Multiple Pages)\nfrom scrapling.spiders import Spider, Response\n\nclass MySpider(Spider):\n    name = \"demo\"\n    start_urls = [\"https://example.com\"]\n    concurrent_requests = 3\n    \n    async def parse(self, response: Response):\n        for item in response.css('.item'):\n            yield {\"item\": item.css('h2::text').get()}\n        \n        # Follow links\n        next_page = response.css('.next a')\n        if next_page:\n            yield response.follow(next_page[0].attrib['href'])\n\nMySpider().start()\n\n6. 
CLI Usage\n# Simple fetch to file\nscrapling extract get https://example.com content.html\n\n# Stealthy fetch (bypass anti-bot)\nscrapling extract stealthy-fetch https://example.com content.html\n\n# Interactive shell\nscrapling shell https://example.com\n\nCommon Patterns\nExtract Article Content\nfrom scrapling.fetchers import Fetcher\n\npage = Fetcher.get('https://example.com/article')\n\n# Try multiple selectors for title\ntitle = (\n    page.css('[itemprop=\"headline\"]::text').get() or\n    page.css('article h1::text').get() or\n    page.css('h1::text').get()\n)\n\n# Get paragraphs\ncontent = page.css('article p::text, .article-body p::text').getall()\n\nprint(f\"Title: {title}\")\nprint(f\"Paragraphs: {len(content)}\")\n\nResearch Multiple Pages\nfrom scrapling.spiders import Spider, Response\n\nclass ResearchSpider(Spider):\n    name = \"research\"\n    start_urls = [\"https://news.ycombinator.com\"]\n    concurrent_requests = 5\n    \n    async def parse(self, response: Response):\n        for item in response.css('.titleline a::text').getall()[:10]:\n            yield {\"title\": item, \"source\": \"HN\"}\n        \n        more = response.css('.morelink::attr(href)').get()\n        if more:\n            yield response.follow(more)\n\nResearchSpider().start()\n\nCrawl Entire Site (Easy Mode)\n\nAuto-crawl all pages on a domain by following internal links:\n\nfrom scrapling.spiders import Spider, Response\nfrom urllib.parse import urljoin, urlparse\n\nclass EasyCrawl(Spider):\n    \"\"\"Auto-crawl all pages on a domain.\"\"\"\n    \n    name = \"easy_crawl\"\n    start_urls = [\"https://example.com\"]\n    concurrent_requests = 3\n    \n    def __init__(self):\n        super().__init__()\n        self.visited = set()\n    \n    async def parse(self, response: Response):\n        # Extract content\n        yield {\n            'url': response.url,\n            'title': response.css('title::text').get(),\n            'h1': response.css('h1::text').get(),\n        }\n        \n        # Follow internal links (limit to 50 pages)\n        if len(self.visited) >= 50:\n            return\n        \n        self.visited.add(response.url)\n        \n        links = response.css('a::attr(href)').getall()[:20]\n        for link in links:\n            full_url = urljoin(response.url, link)\n            if full_url not in self.visited:\n                yield response.follow(full_url)\n\n# Usage\nresult = EasyCrawl()\nresult.start()\n\nSitemap Crawl\n\nCrawl pages from sitemap.xml (with fallback to link discovery):\n\nfrom scrapling.fetchers import Fetcher\nfrom scrapling.spiders import Spider, Response\nfrom urllib.parse import urljoin, urlparse\nimport re\n\ndef get_sitemap_urls(url: str, max_urls: int = 100) -> list:\n    \"\"\"Extract URLs from sitemap.xml - also checks robots.txt.\"\"\"\n    \n    parsed = urlparse(url)\n    base_url = f\"{parsed.scheme}://{parsed.netloc}\"\n    \n    sitemap_urls = [\n        f\"{base_url}/sitemap.xml\",\n        f\"{base_url}/sitemap-index.xml\",\n        f\"{base_url}/sitemap_index.xml\",\n        f\"{base_url}/sitemap-news.xml\",\n    ]\n    \n    all_urls = []\n    \n    # First check robots.txt for sitemap URL\n    try:\n        robots = Fetcher.get(f\"{base_url}/robots.txt\")\n        if robots.status == 200:\n            sitemap_in_robots = re.findall(r'Sitemap:\\s*(\\S+)', robots.text, re.IGNORECASE)\n            for sm in sitemap_in_robots:\n                sitemap_urls.insert(0, sm)\n    except:\n        pass\n    \n    # Try each sitemap 
location\n    for sitemap_url in sitemap_urls:\n        try:\n            page = Fetcher.get(sitemap_url, timeout=10)\n            if page.status != 200:\n                continue\n            \n            text = page.text\n            \n            # Check if it's XML\n            if '<?xml' in text or '<urlset' in text or '<sitemapindex' in text:\n                urls = re.findall(r'<loc>([^<]+)</loc>', text)\n                all_urls.extend(urls[:max_urls])\n                print(f\"Found {len(urls)} URLs in {sitemap_url}\")\n        except:\n            continue\n    \n    return list(set(all_urls))[:max_urls]\n\ndef crawl_from_sitemap(domain_url: str, max_pages: int = 50):\n    \"\"\"Crawl pages from sitemap.\"\"\"\n    \n    print(f\"Fetching sitemap for {domain_url}...\")\n    urls = get_sitemap_urls(domain_url)\n    \n    if not urls:\n        print(\"No sitemap found. Use EasyCrawl instead!\")\n        return []\n    \n    print(f\"Found {len(urls)} URLs, crawling first {max_pages}...\")\n    \n    results = []\n    for url in urls[:max_pages]:\n        try:\n            page = Fetcher.get(url, timeout=10)\n            results.append({\n                'url': url,\n                'status': page.status,\n                'title': page.css('title::text').get(),\n            })\n        except Exception as e:\n            results.append({'url': url, 'error': str(e)[:50]})\n    \n    return results\n\n# Usage\nprint(\"=== Sitemap Crawl ===\")\nresults = crawl_from_sitemap('https://example.com', max_pages=10)\nfor r in results[:3]:\n    print(f\"  {r.get('title', r.get('error', 'N/A'))}\")\n\n# Alternative: Easy crawl all links\nprint(\"\\n=== Easy Crawl (Link Discovery) ===\")\nresult = EasyCrawl(start_urls=[\"https://example.com\"], max_pages=10).start()\nprint(f\"Crawled {len(result.items)} pages\")\n\nFirecrawl-Style Crawl (Best of Both Worlds)\n\nInspired by Firecrawl's behavior - combines sitemap discovery with link following:\n\nfrom scrapling.fetchers import Fetcher\nfrom scrapling.spiders import Spider, Response\nfrom urllib.parse import urljoin, urlparse\nimport re\n\ndef firecrawl_crawl(url: str, max_pages: int = 50, use_sitemap: bool = True):\n    \"\"\"\n    Firecrawl-style crawling:\n    - use_sitemap=True: Discover URLs from sitemap first (default)\n    - use_sitemap=False: Only follow HTML links (like sitemap:\"skip\")\n    \n    Matches Firecrawl's crawl behavior.\n    \"\"\"\n    \n    parsed = urlparse(url)\n    domain = parsed.netloc\n    \n    # ========== Method 1: Sitemap Discovery ==========\n    if use_sitemap:\n        print(f\"[Firecrawl] Discovering URLs from sitemap...\")\n        \n        sitemap_urls = [\n            f\"{url.rstrip('/')}/sitemap.xml\",\n            f\"{url.rstrip('/')}/sitemap-index.xml\",\n        ]\n        \n        all_urls = []\n        \n        # Try sitemaps\n        for sm_url in sitemap_urls:\n            try:\n                page = Fetcher.get(sm_url, timeout=15)\n                if page.status == 200:\n                    # Handle bytes\n                    text = page.body.decode('utf-8', errors='ignore') if isinstance(page.body, bytes) else str(page.body)\n                    \n                    if '<urlset' in text:\n                        urls = re.findall(r'<loc>([^<]+)</loc>', text)\n                        all_urls.extend(urls[:max_pages])\n                        print(f\"[Firecrawl] Found {len(urls)} URLs in {sm_url}\")\n            except:\n                continue\n        \n        if all_urls:\n            
print(f\"[Firecrawl] Total: {len(all_urls)} URLs from sitemap\")\n            \n            # Crawl discovered URLs\n            results = []\n            for page_url in all_urls[:max_pages]:\n                try:\n                    page = Fetcher.get(page_url, timeout=15)\n                    results.append({\n                        'url': page_url,\n                        'status': page.status,\n                        'title': page.css('title::text').get() if page.status == 200 else None,\n                    })\n                except Exception as e:\n                    results.append({'url': page_url, 'error': str(e)[:50]})\n            \n            return results\n    \n    # ========== Method 2: Link Discovery (sitemap: skip) ==========\n    print(f\"[Firecrawl] Sitemap skip - using link discovery...\")\n    \n    class LinkCrawl(Spider):\n        name = \"firecrawl_link\"\n        start_urls = [url]\n        concurrent_requests = 3\n        \n        def __init__(self):\n            super().__init__()\n            self.visited = set()\n            self.domain = domain\n            self.results = []\n        \n        async def parse(self, response: Response):\n            if len(self.results) >= max_pages:\n                return\n            \n            self.results.append({\n                'url': response.url,\n                'status': response.status,\n                'title': response.css('title::text').get(),\n            })\n            \n            # Follow internal links\n            links = response.css('a::attr(href)').getall()[:20]\n            for link in links:\n                full_url = urljoin(response.url, link)\n                parsed_link = urlparse(full_url)\n                \n                if parsed_link.netloc == self.domain and full_url not in self.visited:\n                    self.visited.add(full_url)\n                    if len(self.visited) < max_pages:\n                        yield response.follow(full_url)\n    \n    result = LinkCrawl()\n    result.start()\n    return result.results\n\n# Usage\nprint(\"=== Firecrawl-Style (sitemap: include) ===\")\nresults = firecrawl_crawl('https://www.cloudflare.com', max_pages=5, use_sitemap=True)\nprint(f\"Crawled: {len(results)} pages\")\n\nprint(\"\\n=== Firecrawl-Style (sitemap: skip) ===\")\nresults = firecrawl_crawl('https://example.com', max_pages=5, use_sitemap=False)\nprint(f\"Crawled: {len(results)} pages\")\n\nHandle Errors\nfrom scrapling.fetchers import Fetcher, StealthyFetcher\n\ntry:\n    page = Fetcher.get('https://example.com')\nexcept Exception as e:\n    # Try stealth mode\n    page = StealthyFetcher.fetch('https://example.com', headless=True)\n    \nif page.status == 403:\n    print(\"Blocked - try StealthyFetcher\")\nelif page.status == 200:\n    print(\"Success!\")\n\nSession Management\nfrom scrapling.fetchers import FetcherSession\n\nwith FetcherSession(impersonate='chrome') as session:\n    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)\n    quotes = page.css('.quote .text::text').getall()\n\nMultiple Session Types in Spider\nfrom scrapling.spiders import Spider, Request, Response\nfrom scrapling.fetchers import FetcherSession, AsyncStealthySession\n\nclass MultiSessionSpider(Spider):\n    name = \"multi\"\n    start_urls = [\"https://example.com/\"]\n    \n    def configure_sessions(self, manager):\n        manager.add(\"fast\", FetcherSession(impersonate=\"chrome\"))\n        manager.add(\"stealth\", AsyncStealthySession(headless=True), 
lazy=True)\n    \n    async def parse(self, response: Response):\n        for link in response.css('a::attr(href)').getall():\n            if \"protected\" in link:\n                yield Request(link, sid=\"stealth\")\n            else:\n                yield Request(link, sid=\"fast\", callback=self.parse)\n\nAdvanced Parsing & Navigation\nfrom scrapling.fetchers import Fetcher\n\npage = Fetcher.get('https://quotes.toscrape.com/')\n\n# Multiple selection methods\nquotes = page.css('.quote')           # CSS\nquotes = page.xpath('//div[@class=\"quote\"]')  # XPath\nquotes = page.find_all('div', class_='quote')  # BeautifulSoup-style\n\n# Navigation\nfirst_quote = page.css('.quote')[0]\nauthor = first_quote.css('.author::text').get()\nparent = first_quote.parent\n\n# Find similar elements\nsimilar = first_quote.find_similar()\n\nAdvanced: API Reverse Engineering\n\n\"Web scraping is 80% reverse engineering.\"\n\nThis section covers advanced techniques to discover and replicate APIs directly from websites — often revealing data that's \"hidden\" behind paid APIs.\n\n1. API Endpoint Discovery\n\nMany websites load data via client-side requests. Use browser DevTools to find them:\n\nSteps:\n\nOpen browser DevTools (F12)\nGo to Network tab\nReload the page\nLook for XHR or Fetch requests\nCheck if endpoints return JSON data\n\nWhat to look for:\n\nRequests to /api/* endpoints\nResponses containing structured data (JSON)\nSame endpoints used on both free and paid sections\n\nExample pattern:\n\n# Found in Network tab:\nGET https://api.example.com/v1/users/transactions\nResponse: {\"data\": [...], \"pagination\": {...}}\n\n2. JavaScript Analysis\n\nAuth tokens often generated client-side. Find them in .js files:\n\nSteps:\n\nIn Network tab, look at Initiator column\nClick the .js file making the request\nSearch for auth header name (e.g., sol-aut, Authorization, X-API-Key)\nFind the function generating the token\n\nCommon patterns:\n\nPlain text function names: generateToken(), createAuthHeader()\nObfuscated: Search for the header name directly\nRandom string generation: Math.random(), crypto.getRandomValues()\n3. Replicating Discovered APIs\n\nOnce you've found the endpoint and auth pattern:\n\nimport requests\nimport random\nimport string\n\ndef generate_auth_token():\n    \"\"\"Replicate discovered token generation logic.\"\"\"\n    chars = string.ascii_letters + string.digits\n    token = ''.join(random.choice(chars) for _ in range(40))\n    # Insert fixed string at random position\n    fixed = \"B9dls0fK\"\n    pos = random.randint(0, len(token))\n    return token[:pos] + fixed + token[pos:]\n\ndef scrape_api_endpoint(url):\n    \"\"\"Hit discovered API endpoint with replicated auth.\"\"\"\n    headers = {\n        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',\n        'Accept': 'application/json',\n        'sol-aut': generate_auth_token(),  # Replicate discovered header\n    }\n    \n    response = requests.get(url, headers=headers)\n    return response.json()\n\n4. 
Cloudscraper Bypass (Cloudflare)\n\nFor Cloudflare-protected endpoints, use cloudscraper:\n\npip install cloudscraper\n\nimport cloudscraper\n\ndef create_scraper():\n    \"\"\"Create a cloudscraper session that bypasses Cloudflare.\"\"\"\n    scraper = cloudscraper.create_scraper(\n        browser={\n            'browser': 'chrome',\n            'platform': 'windows',\n            'desktop': True\n        }\n    )\n    return scraper\n\n# Usage\nscraper = create_scraper()\nresponse = scraper.get('https://api.example.com/endpoint')\ndata = response.json()\n\n5. Complete API Replication Pattern\nimport cloudscraper\nimport random\nimport string\nimport json\n\nclass APIReplicator:\n    \"\"\"Replicate discovered API from website.\"\"\"\n    \n    def __init__(self, base_url):\n        self.base_url = base_url\n        self.session = cloudscraper.create_scraper()\n    \n    def generate_token(self, pattern=\"random\"):\n        \"\"\"Replicate discovered token generation.\"\"\"\n        if pattern == \"solscan\":\n            # 40-char random + fixed string at random position\n            chars = string.ascii_letters + string.digits\n            token = ''.join(random.choice(chars) for _ in range(40))\n            fixed = \"B9dls0fK\"\n            pos = random.randint(0, len(token))\n            return token[:pos] + fixed + token[pos:]\n        else:\n            # Generic random token\n            return ''.join(random.choices(string.ascii_letters + string.digits, k=32))\n    \n    def get(self, endpoint, headers=None, auth_header=None, auth_pattern=\"random\"):\n        \"\"\"Make API request with discovered auth.\"\"\"\n        url = f\"{self.base_url}{endpoint}\"\n        \n        # Build headers\n        request_headers = {\n            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',\n            'Accept': 'application/json',\n        }\n        \n        # Add discovered auth header\n        if auth_header:\n            request_headers[auth_header] = self.generate_token(auth_pattern)\n        \n        # Merge custom headers\n        if headers:\n            request_headers.update(headers)\n        \n        response = self.session.get(url, headers=request_headers)\n        return response\n\n# Usage example\napi = APIReplicator(\"https://api.solscan.io\")\ndata = api.get(\n    \"/account/transactions\",\n    auth_header=\"sol-aut\",\n    auth_pattern=\"solscan\"\n)\nprint(data)\n\n6. 
Discovery Checklist\n\nWhen approaching a new site:\n\nStep\tAction\tTool\n1\tOpen DevTools Network tab\tF12\n2\tReload page, filter by XHR/Fetch\tNetwork filter\n3\tLook for JSON responses\tResponse tab\n4\tCheck if same endpoint used for \"premium\" data\tCompare requests\n5\tFind auth header in JS files\tInitiator column\n6\tExtract token generation logic\tJS debugger\n7\tReplicate in Python\tReplicator class\n8\tTest against API\tRun script\nBrand Data Extraction (Firecrawl Alternative)\n\nExtract brand data, colors, logos, and copy from any website:\n\nfrom scrapling.fetchers import Fetcher\nfrom urllib.parse import urljoin\nimport re\n\ndef extract_brand_data(url: str) -> dict:\n    \"\"\"Extract structured brand data from any website - Firecrawl style.\"\"\"\n    \n    # Try stealth mode first (handles anti-bot)\n    try:\n        page = Fetcher.get(url)\n    except:\n        from scrapling.fetchers import StealthyFetcher\n        page = StealthyFetcher.fetch(url, headless=True)\n    \n    # Helper to get text from element\n    def get_text(elements):\n        return elements[0].text if elements else None\n    \n    # Helper to get attribute\n    def get_attr(elements, attr_name):\n        return elements[0].attrib.get(attr_name) if elements else None\n    \n    # Brand name (try multiple selectors)\n    brand_name = (\n        get_text(page.css('[property=\"og:site_name\"]')) or\n        get_text(page.css('h1')) or\n        get_text(page.css('title'))\n    )\n    \n    # Tagline\n    tagline = (\n        get_text(page.css('[property=\"og:description\"]')) or\n        get_text(page.css('.tagline')) or\n        get_text(page.css('.hero-text')) or\n        get_text(page.css('header h2'))\n    )\n    \n    # Logo URL\n    logo_url = (\n        get_attr(page.css('[rel=\"icon\"]'), 'href') or\n        get_attr(page.css('[rel=\"apple-touch-icon\"]'), 'href') or\n        get_attr(page.css('.logo img'), 'src')\n    )\n    if logo_url and not logo_url.startswith('http'):\n        logo_url = urljoin(url, logo_url)\n    \n    # Favicon\n    favicon = get_attr(page.css('[rel=\"icon\"]'), 'href')\n    favicon_url = urljoin(url, favicon) if favicon else None\n    \n    # OG Image\n    og_image = get_attr(page.css('[property=\"og:image\"]'), 'content')\n    og_image_url = urljoin(url, og_image) if og_image else None\n    \n    # Screenshot (using external service)\n    screenshot_url = f\"https://image.thum.io/get/width/1200/crop/800/{url}\"\n    \n    # Description\n    description = (\n        get_text(page.css('[property=\"og:description\"]')) or\n        get_attr(page.css('[name=\"description\"]'), 'content')\n    )\n    \n    # CTA text\n    cta_text = (\n        get_text(page.css('a[href*=\"signup\"]')) or\n        get_text(page.css('.cta')) or\n        get_text(page.css('[class*=\"button\"]'))\n    )\n    \n    # Social links\n    social_links = {}\n    for platform in ['twitter', 'facebook', 'instagram', 'linkedin', 'youtube', 'github']:\n        link = get_attr(page.css(f'a[href*=\"{platform}\"]'), 'href')\n        if link:\n            social_links[platform] = link\n    \n    # Features (from feature grid/cards)\n    features = []\n    feature_cards = page.css('[class*=\"feature\"], .feature-card, .benefit-item')\n    for card in feature_cards[:6]:\n        feature_text = get_text(card.css('h3, h4, p'))\n        if feature_text:\n            features.append(feature_text.strip())\n    \n    return {\n        'brandName': brand_name,\n        'tagline': tagline,\n        'description': 
description,\n        'features': features,\n        'logoUrl': logo_url,\n        'faviconUrl': favicon_url,\n        'ctaText': cta_text,\n        'socialLinks': social_links,\n        'screenshotUrl': screenshot_url,\n        'ogImageUrl': og_image_url\n    }\n\n# Usage\nbrand_data = extract_brand_data('https://example.com')\nprint(brand_data)\n\nBrand Data CLI\n# Extract brand data using the Python function above\npython3 -c \"\nimport json\nimport sys\nsys.path.insert(0, '/path/to/skill')\nfrom brand_extraction import extract_brand_data\ndata = extract_brand_data('$URL')\nprint(json.dumps(data, indent=2))\n\"\n\nFeature Comparison\nFeature\tStatus\tNotes\nBasic fetch\t✅ Working\tFetcher.get()\nStealthy fetch\t✅ Working\tStealthyFetcher.fetch()\nDynamic fetch\t✅ Working\tDynamicFetcher.fetch()\nAdaptive parsing\t✅ Working\tauto_save + adaptive\nSpider crawling\t✅ Working\tasync def parse()\nCSS selectors\t✅ Working\t.css()\nXPath\t✅ Working\t.xpath()\nSession management\t✅ Working\tFetcherSession, StealthySession\nProxy rotation\t✅ Working\tProxyRotator class\nCLI tools\t✅ Working\tscrapling extract\nBrand data extraction\t✅ Working\textract_brand_data()\nAPI reverse engineering\t✅ Working\tAPIReplicator class\nCloudscraper bypass\t✅ Working\tcloudscraper integration\nEasy site crawl\t✅ Working\tEasyCrawl class\nSitemap crawl\t✅ Working\tget_sitemap_urls()\nMCP server\t❌ Excluded\tNot needed\nExamples Tested\nIEEE Spectrum\npage = Fetcher.get('https://spectrum.ieee.org/...')\ntitle = page.css('h1::text').get()\ncontent = page.css('article p::text').getall()\n\n\n✅ Works\n\nHacker News\npage = Fetcher.get('https://news.ycombinator.com')\nstories = page.css('.titleline a::text').getall()\n\n\n✅ Works\n\nExample Domain\npage = Fetcher.get('https://example.com')\ntitle = page.css('h1::text').get()\n\n\n✅ Works\n\n🔧 Quick Troubleshooting\nIssue\tSolution\n403/429 Blocked\tUse StealthyFetcher or cloudscraper\nCloudflare\tUse StealthyFetcher or cloudscraper\nJavaScript required\tUse DynamicFetcher\nSite changed\tUse adaptive=True\nPaid API exposed\tUse API reverse engineering\nCaptcha\tCannot bypass - skip or use official API\nAuth required\tDo NOT bypass - use official API\nSkill Graph\n\nRelated skills:\n\n[[content-research]] - Research workflow\n[[blogwatcher]] - RSS/feed monitoring\n[[youtube-watcher]] - Video content\n[[chirp]] - Twitter/X interactions\n[[newsletter-digest]] - Content summarization\n[[x-tweet-fetcher]] - X/Twitter (use instead of Scrapling)\nChangelog\nv1.0.8 (2026-02-25)\nAdded: Firecrawl-Style Crawl - Combines sitemap discovery + link following\nAdded: use_sitemap parameter - Matches Firecrawl's sitemap:\"include\"/\"skip\" behavior\nVerified: cloudflare.com returns 2,447 URLs from sitemap!\nv1.0.7 (2026-02-25)\nFixed: EasyCrawl Spider syntax - Updated to work with scrapling's actual Spider API\nVerified: Spider crawling works - Tested and crawled 20+ pages from example.com\nv1.0.6 (2026-02-25)\nAdded: Easy Site Crawl - Auto-crawl all pages on a domain with EasyCrawl spider\nAdded: Sitemap Crawl - Extract URLs from sitemap.xml and crawl them\nFeature parity with Firecrawl for site crawling capabilities\nv1.0.5 (2026-02-25)\nEnhanced: API Reverse Engineering methodology\nDetailed step-by-step process from @paoloanzn's work\nReal Solscan case study with exact timeline\nAdded: Step-by-step methodology section\nAdded: Real example documentation (Solscan March 2025 vs Feb 2026)\nAdded: Discovery checklist with 10 steps\nDocumented: How to find auth headers in JS 
files\nDocumented: Token generation pattern extraction\nUpdated: Cloudscraper integration with multi-attempt pattern\nVerified: Solscan now patched (Cloudflare on both endpoints)\nv1.0.4 (2026-02-25)\nFixed: Brand Data Extraction API - Corrected selectors for scrapling's Response object\nFixed .html → .text / .body\nFixed .title() → page.css('title')\nFixed .logo img::src → .logo img::attr(src)\nTested and verified working\nv1.0.3 (2026-02-25)\nAdded: API Reverse Engineering section\nAPI Endpoint Discovery (Network tab analysis)\nJavaScript Analysis (finding auth logic)\nCloudscraper integration for Cloudflare bypass\nComplete APIReplicator class\nDiscovery checklist\nAdded cloudscraper to installation\nv1.0.2 (2026-02-25)\nSynced with upstream GitHub README exactly\nAdded Brand Data Extraction section\nClean, core-only version\nv1.0.1 (2026-02-25)\nSynced with original Scrapling GitHub README\n\nLast updated: 2026-02-25"
  },
  "trust": {
    "sourceLabel": "tencent",
    "provenanceUrl": "https://clawhub.ai/zendenho7/scrapling",
    "publisherUrl": "https://clawhub.ai/zendenho7/scrapling",
    "owner": "zendenho7",
    "version": "1.0.8",
    "license": null,
    "verificationStatus": "Indexed source record"
  },
  "links": {
    "detailUrl": "https://openagent3.xyz/skills/scrapling",
    "downloadUrl": "https://openagent3.xyz/downloads/scrapling",
    "agentUrl": "https://openagent3.xyz/skills/scrapling/agent",
    "manifestUrl": "https://openagent3.xyz/skills/scrapling/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/scrapling/agent.md"
  }
}