Requirements
- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
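The package ships as an archive that you extract before handing it to the agent. A minimal extraction sketch using only the standard library; the archive and folder names are placeholders for whatever you downloaded:

```python
import zipfile

# Placeholder archive name; use the file you actually downloaded from Yavira
with zipfile.ZipFile("crawl4ai-skill.zip") as archive:
    archive.extractall("crawl4ai-skill")
# SKILL.md inside the extracted folder is the primary doc to hand to the agent
```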
AI-powered web scraping framework for extracting structured data from websites. Use when Codex needs to crawl, scrape, or extract data from web pages using AI-powered parsing, handle dynamic content, or work with complex HTML structures.
Hand the extracted package to your coding agent with a concrete install brief rather than working through the steps manually.
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.
Crawl4ai is an AI-powered web scraping framework designed to extract structured data from websites efficiently. It combines traditional HTML parsing with AI to handle dynamic content, extract text intelligently, and clean and structure data from complex web pages.
Use when Codex needs to:
- Extract structured data from web pages (products, articles, forms, tables, etc.)
- Scrape websites with dynamic content or complex JavaScript
- Clean and normalize extracted data from various HTML structures
- Work with APIs or web services that return HTML
- Handle CORS limitations by scraping directly
- Process web content at scale with reliability

Trigger phrases:
- "Extract data from this website"
- "Scrape this page for [specific data]"
- "Parse this HTML"
- "Get data from [URL]"
- "Extract structured information from [website]"
- "Scrape [website] for [data type]"
- "Web scrape [URL]"
```python
from crawl4ai import AsyncWebCrawler, BrowserMode

async def scrape_page(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            browser_mode=BrowserMode.LATEST,
            headless=True
        )
        return result.markdown, result.cleaned_html
```
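A minimal way to invoke the coroutine from a script, using only the standard library; the URL is a placeholder:

```python
import asyncio

# Hypothetical target URL for illustration
markdown, html = asyncio.run(scrape_page("https://example.com"))
print((markdown or "")[:500])
```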
```python
from crawl4ai import AsyncWebCrawler
import json

async def extract_products(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            screenshot=True,
            javascript=True,
            bypass_cache=True
        )
        # extracted_content arrives as a JSON string; parse it before iterating
        items = json.loads(result.extracted_content) if result.extracted_content else []
        products = []
        for item in items:
            if item.get('type') == 'product':
                products.append({
                    'name': item['name'],
                    'price': item['price'],
                    'url': item['url']
                })
        return products
```
Scenario: User wants to scrape a website for all article titles.

```python
from crawl4ai import AsyncWebCrawler
import json

async def scrape_articles(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            javascript=True,
            verbose=True
        )
        # extracted_content is a JSON string; parse it before pulling titles
        articles = json.loads(result.extracted_content) if result.extracted_content else []
        titles = [item.get('name', item.get('text', '')) for item in articles]
        return titles
```

Trigger: "Scrape this site for article titles" or "Get all titles from [URL]"
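When the page's structure is known, crawl4ai also supports schema-driven CSS extraction via JsonCssExtractionStrategy. A sketch assuming titles live in `h2.title` elements inside `article` containers; the selectors are placeholders you adapt to the target site:

```python
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def scrape_titles_by_selector(url):
    # Placeholder selectors; adjust to the target site's markup
    schema = {
        "name": "articles",
        "baseSelector": "article",
        "fields": [
            {"name": "title", "selector": "h2.title", "type": "text"},
        ],
    }
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            extraction_strategy=JsonCssExtractionStrategy(schema),
        )
        return json.loads(result.extracted_content) if result.extracted_content else []
```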
Scenario: Website loads data via JavaScript.

```python
from crawl4ai import AsyncWebCrawler

async def scrape_dynamic_site(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            javascript=True,   # Wait for JS execution
            wait_for="body",   # Wait for specific element
            delay=1.5,         # Wait time after load
            headless=True
        )
        return result.markdown
```

Trigger: "Scrape this dynamic website" or "This page needs JavaScript to load data"
Scenario: Extract specific fields like prices, descriptions, etc.

```python
from crawl4ai import AsyncWebCrawler

async def extract_product_details(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            screenshot=True,
            js_code="""
                const products = document.querySelectorAll('.product');
                return Array.from(products).map(p => ({
                    name: p.querySelector('.name')?.textContent,
                    price: p.querySelector('.price')?.textContent,
                    url: p.querySelector('a')?.href
                }));
            """
        )
        return result.extracted_content
```

Trigger: "Extract product details from this page" or "Get price and name from [URL]"
Scenario: Clean messy HTML and extract clean text.

```python
from crawl4ai import AsyncWebCrawler

async def clean_and_parse(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            remove_tags=['script', 'style', 'nav', 'footer', 'header'],
            only_main_content=True
        )
        # Return the cleaned HTML for the main content
        return result.cleaned_html
```

Trigger: "Clean this HTML" or "Extract main content from this page"
```python
from crawl4ai import AsyncWebCrawler

async def custom_scrape(url, custom_js):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            js_code=custom_js,
            js_only=True  # Only execute JS, don't download resources
        )
        return result.extracted_content
```
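For example, a call that collects every link's href with a small page script; the JS snippet is illustrative and should be adapted to the target page:

```python
import asyncio

# Illustrative page script; adjust the selector to the target site
js = "return Array.from(document.querySelectorAll('a')).map(a => a.href);"
links = asyncio.run(custom_scrape("https://example.com", js))
```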
```python
from crawl4ai import AsyncWebCrawler

async def multi_page_scrape(urls):
    async with AsyncWebCrawler() as crawler:
        results = []
        for url in urls:
            result = await crawler.arun(
                url=url,
                session_id=f"session_{url}",
                bypass_cache=True
            )
            results.append({
                'url': url,
                'content': result.markdown,
                'status': result.success
            })
        return results
```
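Sequential crawling is the safest default. If the target site tolerates it, the same loop can run concurrently with a semaphore to cap parallelism; a sketch, assuming the shared crawler instance is safe to reuse across tasks:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def multi_page_scrape_concurrent(urls, max_parallel=3):
    semaphore = asyncio.Semaphore(max_parallel)  # cap simultaneous requests
    async with AsyncWebCrawler() as crawler:
        async def fetch(url):
            async with semaphore:
                result = await crawler.arun(url=url, bypass_cache=True)
                return {'url': url, 'content': result.markdown, 'status': result.success}
        return await asyncio.gather(*(fetch(u) for u in urls))
```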
- Always check if the site allows scraping - respect robots.txt and terms of service
- Use appropriate delays - add delays between requests to avoid overwhelming servers (see the sketch after this list)
- Handle errors gracefully - implement retry logic and error handling
- Be selective with data - extract only what you need; don't dump entire pages
- Store data reliably - save extracted data in structured formats (JSON, CSV)
- Clean URLs - handle redirects and malformed URLs
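A minimal sketch of the delay-and-retry pattern, assuming a fixed backoff between attempts; tune `retries` and `delay_s` per site:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def polite_scrape(url, retries=3, delay_s=2.0):
    async with AsyncWebCrawler() as crawler:
        for attempt in range(1, retries + 1):
            result = await crawler.arun(url=url)
            if result.success:
                return result.markdown
            # Back off before retrying so the server isn't hammered
            await asyncio.sleep(delay_s * attempt)
    return None
```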
```python
from crawl4ai import AsyncWebCrawler

async def robust_scrape(url):
    try:
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=url,
                timeout=30000  # 30-second timeout (milliseconds)
            )
            if result.success:
                return result.markdown, result.extracted_content
            else:
                print(f"Scraping failed: {result.error_message}")
                return None, None
    except Exception as e:
        print(f"Scraping error: {str(e)}")
        return None, None
```
Crawl4ai supports multiple output formats:
- Markdown: clean, readable text (result.markdown)
- Cleaned HTML: structured, cleaned HTML (result.cleaned_html)
- Extracted content: structured JSON data (result.extracted_content)
- Screenshot: visual representation (result.screenshot)
- Links: all links found on the page (result.links)
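A short sketch that persists the main formats to disk after a crawl; the file names are arbitrary:

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler

async def save_outputs(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, screenshot=True)
        # Write each output format to its own file
        with open("page.md", "w") as f:
            f.write(result.markdown or "")
        with open("page.html", "w") as f:
            f.write(result.cleaned_html or "")
        with open("links.json", "w") as f:
            json.dump(result.links or {}, f, indent=2)

asyncio.run(save_outputs("https://example.com"))
```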
Python scripts for common crawling operations:
- scrape_single_page.py - Basic scraping utility
- scrape_multiple_pages.py - Batch scraping with pagination
- extract_from_html.py - HTML parsing helper
- clean_html.py - HTML cleaning utility
Documentation and examples:
- api_reference.md - Complete API documentation
- examples.md - Common use cases and patterns
- error_handling.md - Troubleshooting guide