{
  "schemaVersion": "1.0",
  "item": {
    "slug": "crawl4ai",
    "name": "Crawl4ai",
    "source": "tencent",
    "type": "skill",
    "category": "开发工具",
    "sourceUrl": "https://clawhub.ai/codylrn804/crawl4ai",
    "canonicalUrl": "https://clawhub.ai/codylrn804/crawl4ai",
    "targetPlatform": "OpenClaw"
  },
  "install": {
    "downloadMode": "redirect",
    "downloadUrl": "/downloads/crawl4ai",
    "sourceDownloadUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=crawl4ai",
    "sourcePlatform": "tencent",
    "targetPlatform": "OpenClaw",
    "installMethod": "Manual import",
    "extraction": "Extract archive",
    "prerequisites": [
      "OpenClaw"
    ],
    "packageFormat": "ZIP package",
    "includedAssets": [
      "SKILL.md",
      "references/api_reference.md",
      "references/error_handling.md",
      "references/examples.md",
      "scripts/extract_from_html.py",
      "scripts/scrape_multiple_pages.py"
    ],
    "primaryDoc": "SKILL.md",
    "quickSetup": [
      "Download the package from Yavira.",
      "Extract the archive and review SKILL.md first.",
      "Import or place the package into your OpenClaw setup."
    ],
    "agentAssist": {
      "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
      "steps": [
        "Download the package from Yavira.",
        "Extract it into a folder your agent can access.",
        "Paste one of the prompts below and point your agent at the extracted folder."
      ],
      "prompts": [
        {
          "label": "New install",
          "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
        },
        {
          "label": "Upgrade existing",
          "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
        }
      ]
    },
    "sourceHealth": {
      "source": "tencent",
      "status": "healthy",
      "reason": "direct_download_ok",
      "recommendedAction": "download",
      "checkedAt": "2026-04-30T16:55:25.780Z",
      "expiresAt": "2026-05-07T16:55:25.780Z",
      "httpStatus": 200,
      "finalUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=network",
      "contentType": "application/zip",
      "probeMethod": "head",
      "details": {
        "probeUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=network",
        "contentDisposition": "attachment; filename=\"network-1.0.0.zip\"",
        "redirectLocation": null,
        "bodySnippet": null
      },
      "scope": "source",
      "summary": "Source download looks usable.",
      "detail": "Yavira can redirect you to the upstream package for this source.",
      "primaryActionLabel": "Download for OpenClaw",
      "primaryActionHref": "/downloads/crawl4ai"
    },
    "validation": {
      "installChecklist": [
        "Use the Yavira download entry.",
        "Review SKILL.md after the package is downloaded.",
        "Confirm the extracted package contains the expected setup assets."
      ],
      "postInstallChecks": [
        "Confirm the extracted package includes the expected docs or setup files.",
        "Validate the skill or prompts are available in your target agent workspace.",
        "Capture any manual follow-up steps the agent could not complete."
      ]
    },
    "downloadPageUrl": "https://openagent3.xyz/downloads/crawl4ai",
    "agentPageUrl": "https://openagent3.xyz/skills/crawl4ai/agent",
    "manifestUrl": "https://openagent3.xyz/skills/crawl4ai/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/crawl4ai/agent.md"
  },
  "agentAssist": {
    "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
    "steps": [
      "Download the package from Yavira.",
      "Extract it into a folder your agent can access.",
      "Paste one of the prompts below and point your agent at the extracted folder."
    ],
    "prompts": [
      {
        "label": "New install",
        "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
      },
      {
        "label": "Upgrade existing",
        "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
      }
    ]
  },
  "documentation": {
    "source": "clawhub",
    "primaryDoc": "SKILL.md",
    "sections": [
      {
        "title": "Overview",
        "body": "Crawl4ai is an AI-powered web scraping framework designed to extract structured data from websites efficiently. It combines traditional HTML parsing with AI to handle dynamic content, extract text intelligently, and clean and structure data from complex web pages."
      },
      {
        "title": "When to Use This Skill",
        "body": "Use when Codex needs to:\n\nExtract structured data from web pages (products, articles, forms, tables, etc.)\nScrape websites with dynamic content or complex JavaScript\nClean and normalize extracted data from various HTML structures\nWork with APIs or web services that return HTML\nHandle CORS limitations by scraping directly\nProcess web content at scale with reliability\n\nTrigger phrases:\n\n\"Extract data from this website\"\n\"Scrape this page for [specific data]\"\n\"Parse this HTML\"\n\"Get data from [URL]\"\n\"Extract structured information from [website]\"\n\"Scrape [website] for [data type]\"\n\"Web scrape [URL]\""
      },
      {
        "title": "Basic Usage",
        "body": "from crawl4ai import AsyncWebCrawler, BrowserMode\n\nasync def scrape_page(url):\n    async with AsyncWebCrawler() as crawler:\n        result = await crawler.arun(\n            url=url,\n            browser_mode=BrowserMode.LATEST,\n            headless=True\n        )\n        return result.markdown, result.clean_html"
      },
      {
        "title": "Extracting Structured Data",
        "body": "from crawl4ai import AsyncWebCrawler, JsonModeScreener\nimport json\n\nasync def extract_products(url):\n    async with AsyncWebCrawler() as crawler:\n        result = await crawler.arun(\n            url=url,\n            screenshot=True,\n            javascript=True,\n            bypass_cache=True\n        )\n        # Extract product data\n        products = []\n        for item in result.extracted_content:\n            if item['type'] == 'product':\n                products.append({\n                    'name': item['name'],\n                    'price': item['price'],\n                    'url': item['url']\n                })\n        return products"
      },
      {
        "title": "Web Scraping Basics",
        "body": "Scenario: User wants to scrape a website for all article titles.\n\nfrom crawl4ai import AsyncWebCrawler\n\nasync def scrape_articles(url):\n    async with AsyncWebCrawler() as crawler:\n        result = await crawler.arun(\n            url=url,\n            javascript=True,\n            verbose=True\n        )\n        # Extract article titles from HTML\n        articles = result.extracted_content if result.extracted_content else []\n        titles = [item.get('name', item.get('text', '')) for item in articles]\n        return titles\n\nTrigger: \"Scrape this site for article titles\" or \"Get all titles from [URL]\""
      },
      {
        "title": "Dynamic Content Handling",
        "body": "Scenario: Website loads data via JavaScript.\n\nfrom crawl4ai import AsyncWebCrawler\n\nasync def scrape_dynamic_site(url):\n    async with AsyncWebCrawler() as crawler:\n        result = await crawler.arun(\n            url=url,\n            javascript=True,  # Wait for JS execution\n            wait_for=\"body\",   # Wait for specific element\n            delay=1.5,         # Wait time after load\n            headless=True\n        )\n        return result.markdown\n\nTrigger: \"Scrape this dynamic website\" or \"This page needs JavaScript to load data\""
      },
      {
        "title": "Structured Data Extraction",
        "body": "Scenario: Extract specific fields like prices, descriptions, etc.\n\nfrom crawl4ai import AsyncWebCrawler\n\nasync def extract_product_details(url):\n    async with AsyncWebCrawler() as crawler:\n        result = await crawler.arun(\n            url=url,\n            screenshot=True,\n            js_code=\"\"\"\n                const products = document.querySelectorAll('.product');\n                return Array.from(products).map(p => ({\n                    name: p.querySelector('.name')?.textContent,\n                    price: p.querySelector('.price')?.textContent,\n                    url: p.querySelector('a')?.href\n                }));\n            \"\"\"\n        )\n        return result.extracted_content\n\nTrigger: \"Extract product details from this page\" or \"Get price and name from [URL]\""
      },
      {
        "title": "HTML Cleaning and Parsing",
        "body": "Scenario: Clean messy HTML and extract clean text.\n\nfrom crawl4ai import AsyncWebCrawler\n\nasync def clean_and_parse(url):\n    async with AsyncWebCrawler() as crawler:\n        result = await crawler.arun(\n            url=url,\n            remove_tags=['script', 'style', 'nav', 'footer', 'header'],\n            only_main_content=True\n        )\n        # Clean and return markdown\n        clean_text = result.clean_html\n        return clean_text\n\nTrigger: \"Clean this HTML\" or \"Extract main content from this page\""
      },
      {
        "title": "Custom JavaScript Injection",
        "body": "async def custom_scrape(url, custom_js):\n    async with AsyncWebCrawler() as crawler:\n        result = await crawler.arun(\n            url=url,\n            js_code=custom_js,\n            js_only=True  # Only execute JS, don't download resources\n        )\n        return result.extracted_content"
      },
      {
        "title": "Session Management",
        "body": "from crawl4ai import AsyncWebCrawler\n\nasync def multi_page_scrape(base_url, urls):\n    async with AsyncWebCrawler() as crawler:\n        results = []\n        for url in urls:\n            result = await crawler.arun(\n                url=url,\n                session_id=f\"session_{url}\",\n                bypass_cache=True\n            )\n            results.append({\n                'url': url,\n                'content': result.markdown,\n                'status': result.success\n            })\n        return results"
      },
      {
        "title": "Best Practices",
        "body": "Always check if the site allows scraping - Respect robots.txt and terms of service\nUse appropriate delays - Add delays between requests to avoid overwhelming servers\nHandle errors gracefully - Implement retry logic and error handling\nBe selective with data - Extract only what you need, don't dump entire pages\nStore data reliably - Save extracted data in structured formats (JSON, CSV)\nClean URLs - Handle redirects and malformed URLs"
      },
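      {
        "title": "Polite Scraping (Sketch)",
        "body": "A minimal sketch of the delay and retry advice above, assuming only the arun() call and the result.success and result.markdown attributes shown in the earlier examples; the function name and parameters are illustrative.\n\nimport asyncio\nfrom crawl4ai import AsyncWebCrawler\n\nasync def polite_scrape(urls, delay_seconds=2.0, max_retries=3):\n    # Reuse one crawler for every request and pause between pages.\n    async with AsyncWebCrawler() as crawler:\n        results = {}\n        for url in urls:\n            for attempt in range(max_retries):\n                result = await crawler.arun(url=url)\n                if result.success:\n                    results[url] = result.markdown\n                    break\n                # Back off a little longer after each failed attempt.\n                await asyncio.sleep(delay_seconds * (attempt + 1))\n            # Courtesy delay between pages to avoid overwhelming the server.\n            await asyncio.sleep(delay_seconds)\n        return results"
      },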
      {
        "title": "Error Handling",
        "body": "async def robust_scrape(url):\n    try:\n        async with AsyncWebCrawler() as crawler:\n            result = await crawler.arun(\n                url=url,\n                timeout=30000  # 30 seconds timeout\n            )\n            if result.success:\n                return result.markdown, result.extracted_content\n            else:\n                print(f\"Scraping failed: {result.error_message}\")\n                return None, None\n    except Exception as e:\n        print(f\"Scraping error: {str(e)}\")\n        return None, None"
      },
      {
        "title": "Output Formats",
        "body": "Crawl4ai supports multiple output formats:\n\nMarkdown: Clean, readable text (result.markdown)\nClean HTML: Structured, cleaned HTML (result.clean_html)\nExtracted Content: Structured JSON data (result.extracted_content)\nScreenshot: Visual representation (result.screenshot)\nLinks: All links found on page (result.links)"
      },
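      {
        "title": "Saving Outputs (Sketch)",
        "body": "A short sketch that saves the output formats listed above, assuming the result attributes behave as in the earlier examples (extracted_content as structured data); the function name and file names are illustrative.\n\nimport json\nfrom crawl4ai import AsyncWebCrawler\n\nasync def save_outputs(url, prefix=\"page\"):\n    async with AsyncWebCrawler() as crawler:\n        result = await crawler.arun(url=url)\n        # Markdown for human review.\n        with open(f\"{prefix}.md\", \"w\", encoding=\"utf-8\") as f:\n            f.write(result.markdown or \"\")\n        # Structured data, if the crawl produced any.\n        if result.extracted_content:\n            with open(f\"{prefix}.json\", \"w\", encoding=\"utf-8\") as f:\n                json.dump(result.extracted_content, f, indent=2)\n        # Links found on the page, for follow-up crawling.\n        return result.links"
      },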
      {
        "title": "scripts/",
        "body": "Python scripts for common crawling operations:\n\nscrape_single_page.py - Basic scraping utility\nscrape_multiple_pages.py - Batch scraping with pagination\nextract_from_html.py - HTML parsing helper\nclean_html.py - HTML cleaning utility"
      },
      {
        "title": "references/",
        "body": "Documentation and examples:\n\napi_reference.md - Complete API documentation\nexamples.md - Common use cases and patterns\nerror_handling.md - Troubleshooting guide"
      }
    ]
  },
  "trust": {
    "sourceLabel": "tencent",
    "provenanceUrl": "https://clawhub.ai/codylrn804/crawl4ai",
    "publisherUrl": "https://clawhub.ai/codylrn804/crawl4ai",
    "owner": "codylrn804",
    "version": "1.0.0",
    "license": null,
    "verificationStatus": "Indexed source record"
  },
  "links": {
    "detailUrl": "https://openagent3.xyz/skills/crawl4ai",
    "downloadUrl": "https://openagent3.xyz/downloads/crawl4ai",
    "agentUrl": "https://openagent3.xyz/skills/crawl4ai/agent",
    "manifestUrl": "https://openagent3.xyz/skills/crawl4ai/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/crawl4ai/agent.md"
  }
}