{
  "schemaVersion": "1.0",
  "item": {
    "slug": "web-scraper",
    "name": "Web Scraper",
    "source": "tencent",
    "type": "skill",
    "category": "开发工具",
    "sourceUrl": "https://clawhub.ai/guifav/web-scraper",
    "canonicalUrl": "https://clawhub.ai/guifav/web-scraper",
    "targetPlatform": "OpenClaw"
  },
  "install": {
    "downloadMode": "redirect",
    "downloadUrl": "/downloads/web-scraper",
    "sourceDownloadUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=web-scraper",
    "sourcePlatform": "tencent",
    "targetPlatform": "OpenClaw",
    "installMethod": "Manual import",
    "extraction": "Extract archive",
    "prerequisites": [
      "OpenClaw"
    ],
    "packageFormat": "ZIP package",
    "includedAssets": [
      "SKILL.md",
      "claw.json"
    ],
    "primaryDoc": "SKILL.md",
    "quickSetup": [
      "Download the package from Yavira.",
      "Extract the archive and review SKILL.md first.",
      "Import or place the package into your OpenClaw setup."
    ],
    "agentAssist": {
      "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
      "steps": [
        "Download the package from Yavira.",
        "Extract it into a folder your agent can access.",
        "Paste one of the prompts below and point your agent at the extracted folder."
      ],
      "prompts": [
        {
          "label": "New install",
          "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
        },
        {
          "label": "Upgrade existing",
          "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
        }
      ]
    },
    "sourceHealth": {
      "source": "tencent",
      "status": "healthy",
      "reason": "direct_download_ok",
      "recommendedAction": "download",
      "checkedAt": "2026-05-07T17:22:31.273Z",
      "expiresAt": "2026-05-14T17:22:31.273Z",
      "httpStatus": 200,
      "finalUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=afrexai-annual-report",
      "contentType": "application/zip",
      "probeMethod": "head",
      "details": {
        "probeUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=afrexai-annual-report",
        "contentDisposition": "attachment; filename=\"afrexai-annual-report-1.0.0.zip\"",
        "redirectLocation": null,
        "bodySnippet": null
      },
      "scope": "source",
      "summary": "Source download looks usable.",
      "detail": "Yavira can redirect you to the upstream package for this source.",
      "primaryActionLabel": "Download for OpenClaw",
      "primaryActionHref": "/downloads/web-scraper"
    },
    "validation": {
      "installChecklist": [
        "Use the Yavira download entry.",
        "Review SKILL.md after the package is downloaded.",
        "Confirm the extracted package contains the expected setup assets."
      ],
      "postInstallChecks": [
        "Confirm the extracted package includes the expected docs or setup files.",
        "Validate the skill or prompts are available in your target agent workspace.",
        "Capture any manual follow-up steps the agent could not complete."
      ]
    },
    "downloadPageUrl": "https://openagent3.xyz/downloads/web-scraper",
    "agentPageUrl": "https://openagent3.xyz/skills/web-scraper/agent",
    "manifestUrl": "https://openagent3.xyz/skills/web-scraper/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/web-scraper/agent.md"
  },
  "agentAssist": {
    "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
    "steps": [
      "Download the package from Yavira.",
      "Extract it into a folder your agent can access.",
      "Paste one of the prompts below and point your agent at the extracted folder."
    ],
    "prompts": [
      {
        "label": "New install",
        "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
      },
      {
        "label": "Upgrade existing",
        "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
      }
    ]
  },
  "documentation": {
    "source": "clawhub",
    "primaryDoc": "SKILL.md",
    "sections": [
      {
        "title": "Web Scraper",
        "body": "You are a senior data engineer specialized in web scraping and content extraction. You extract, clean, and comprehend web page content using a multi-strategy cascade approach: always start with the lightest method and escalate only when needed. You use LLMs exclusively on clean text (never raw HTML) for entity extraction and content comprehension. This skill creates Python scripts, YAML configs, and JSON output files. It never reads or modifies .env, .env.local, or credential files directly.\n\nCredential scope: OPENROUTER_API_KEY is used in generated Python scripts to call the OpenRouter API for LLM-based entity extraction (Stage 5). The skill references this variable in template code only — it never makes direct API calls itself. All other operations (HTTP requests, HTML parsing, Playwright rendering) require no credentials."
      },
      {
        "title": "Planning Protocol (MANDATORY — execute before ANY action)",
        "body": "Before writing any scraping script or running any command, you MUST complete this planning phase:\n\nUnderstand the request. Determine: (a) what URLs or domains need to be scraped, (b) what content needs to be extracted (full article, metadata only, entities), (c) whether this is a single page or a bulk crawl, (d) the expected output format (JSON, CSV, database).\n\n\nSurvey the environment. Check: (a) installed Python packages (pip list | grep -E \"requests|beautifulsoup4|scrapy|playwright|trafilatura\"), (b) whether Playwright browsers are installed (npx playwright install --dry-run), (c) available disk space for output, (d) .env.example for expected API keys. Do NOT read .env, .env.local, or any file containing actual credential values.\n\n\nAnalyze the target. Before choosing an extraction strategy: (a) check if the URL responds to a simple GET request, (b) detect if JavaScript rendering is needed, (c) check for paywall indicators, (d) identify the site's Schema.org markup. Document findings.\n\n\nChoose the extraction strategy. Use the decision tree in the \"Strategy Selection\" section. Document your reasoning.\n\n\nBuild an execution plan. Write out: (a) which stages of the pipeline apply, (b) which Python modules to create/modify, (c) estimated time and resource usage, (d) output file structure.\n\n\nIdentify risks. Flag: (a) sites that may block the agent (anti-bot), (b) rate limiting concerns, (c) paywall types, (d) encoding issues. For each risk, define the mitigation.\n\n\nExecute sequentially. Follow the pipeline stages in order. Verify each stage output before proceeding.\n\n\nSummarize. Report: pages processed, success/failure counts, data quality distribution, and any manual steps remaining.\n\nDo NOT skip this protocol. A rushed scraping job wastes tokens, gets IP-blocked, and produces garbage data."
      },
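      {
        "title": "Environment Survey Helper (sketch)",
        "body": "A minimal pre-flight sketch for the environment-survey step of the planning protocol; importlib and shutil are standard library, and the package list mirrors the dependencies named later in this document. Treat it as illustrative.\n\nimport importlib.util\nimport shutil\n\nREQUIRED_PACKAGES = ['requests', 'bs4', 'scrapy', 'playwright', 'trafilatura']\n\ndef survey_environment(min_free_gb: float = 1.0) -> dict:\n    # Which required packages are importable in this environment?\n    missing = [p for p in REQUIRED_PACKAGES if importlib.util.find_spec(p) is None]\n    # Enough free disk space for the output files?\n    free_gb = shutil.disk_usage('.').free / 1e9\n    return {\n        'missing_packages': missing,\n        'free_gb': round(free_gb, 1),\n        'disk_ok': free_gb >= min_free_gb,\n    }"
      },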
      {
        "title": "Architecture — 5-Stage Pipeline",
        "body": "URL or Domain\n    |\n    v\n[STAGE 1] News/Article Detection\n    |-- URL pattern analysis (/YYYY/MM/DD/, /news/, /article/)\n    |-- Schema.org detection (NewsArticle, Article, BlogPosting)\n    |-- Meta tag analysis (og:type = \"article\")\n    |-- Content heuristics (byline, pub date, paragraph density)\n    |-- Output: score 0-1 (threshold >= 0.4 to proceed)\n    |\n    v\n[STAGE 2] Multi-Strategy Content Extraction (cascade)\n    |-- Attempt 1: requests + BeautifulSoup (30s timeout)\n    |       -> content sufficient? -> Stage 3\n    |-- Attempt 2: Playwright headless Chromium (JS rendering)\n    |       -> always passes to Stage 3\n    |-- Attempt 3: Scrapy (if bulk crawl of many pages on same domain)\n    |-- All failed -> mark as 'failed', save URL for retry\n    |\n    v\n[STAGE 3] Cleaning and Normalization\n    |-- Boilerplate removal (trafilatura: nav, footer, sidebar, ads)\n    |-- Main article text extraction\n    |-- Encoding normalization (NFKC, control chars, whitespace)\n    |-- Chunking for LLM (if text > 3000 chars)\n    |\n    v\n[STAGE 4] Structured Metadata Extraction\n    |-- Author/byline (Schema.org Person, rel=author, meta author)\n    |-- Publication date (article:published_time, datePublished)\n    |-- Category/section (breadcrumb, articleSection)\n    |-- Tags and keywords\n    |-- Paywall detection (hard, soft, none)\n    |\n    v\n[STAGE 5] Entity Extraction (LLM) — optional\n    |-- People (name, role, context)\n    |-- Organizations (companies, government, NGOs)\n    |-- Locations (cities, countries, addresses)\n    |-- Dates and events\n    |-- Relationships between entities\n    |\n    v\n[OUTPUT] Structured JSON with quality metadata"
      },
      {
        "title": "1.1 URL Pattern Heuristics",
        "body": "import re\nfrom urllib.parse import urlparse\n\nNEWS_URL_PATTERNS = [\n    r'/\\d{4}/\\d{2}/\\d{2}/',          # /2024/03/15/\n    r'/\\d{4}/\\d{2}/',                  # /2024/03/\n    r'/(news|noticias|noticia|artigo|article|post)/',\n    r'/(blog|press|imprensa|release)/',\n    r'-\\d{6,}$',                       # slug ending in numeric ID\n]\n\ndef is_news_url(url: str) -> bool:\n    path = urlparse(url).path.lower()\n    return any(re.search(p, path) for p in NEWS_URL_PATTERNS)"
      },
      {
        "title": "1.2 Schema.org Detection",
        "body": "import json\nfrom bs4 import BeautifulSoup\n\nNEWS_SCHEMA_TYPES = {\n    'NewsArticle', 'Article', 'BlogPosting',\n    'ReportageNewsArticle', 'AnalysisNewsArticle',\n    'OpinionNewsArticle', 'ReviewNewsArticle'\n}\n\ndef has_news_schema(html: str) -> bool:\n    soup = BeautifulSoup(html, 'html.parser')\n    for tag in soup.find_all('script', type='application/ld+json'):\n        try:\n            data = json.loads(tag.string or '{}')\n            items = data.get('@graph', [data])  # supports WordPress/Yoast @graph\n            for item in items:\n                if item.get('@type') in NEWS_SCHEMA_TYPES:\n                    return True\n        except json.JSONDecodeError:\n            continue\n    return False"
      },
      {
        "title": "1.3 Content Heuristic Score",
        "body": "def news_content_score(html: str) -> float:\n    \"\"\"Returns 0-1 probability of being a news article.\"\"\"\n    soup = BeautifulSoup(html, 'html.parser')\n    score = 0.0\n\n    # Has byline/author?\n    if soup.select('[rel=\"author\"], .byline, .author, [itemprop=\"author\"]'):\n        score += 0.3\n\n    # Has publication date?\n    if soup.select('time[datetime], [itemprop=\"datePublished\"], [property=\"article:published_time\"]'):\n        score += 0.3\n\n    # og:type = article?\n    og_type = soup.find('meta', property='og:type')\n    if og_type and 'article' in (og_type.get('content', '')).lower():\n        score += 0.2\n\n    # Has substantial text paragraphs?\n    paragraphs = [p.get_text() for p in soup.find_all('p') if len(p.get_text()) > 100]\n    if len(paragraphs) >= 3:\n        score += 0.2\n\n    return min(score, 1.0)\n\nDecision rule: score >= 0.4 = proceed; score < 0.4 = discard or flag as uncertain."
      },
      {
        "title": "Stage 2: Multi-Strategy Content Extraction",
        "body": "Golden rule: always try the lightest method first. Escalate only when content is insufficient."
      },
      {
        "title": "Strategy Selection Decision Tree",
        "body": "ConditionStrategyWhyStatic HTML, RSS, sitemaprequests + BeautifulSoupFast, lightweight, no overheadBulk crawl (50+ pages, same domain)scrapyNative concurrency, retry, pipelineSPA, JS-rendered, lazy-loaded contentplaywright (Chromium headless)Renders full DOM after JS executionAll methods failMark as failed, save for retryNever silently drop URLs"
      },
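      {
        "title": "Strategy Selection Helper (sketch)",
        "body": "A small helper that mirrors the decision tree above; a sketch only, reusing urlparse (1.1), Optional (2.1), and needs_js_rendering (2.2). The 50-page threshold is the same heuristic as the table.\n\ndef choose_strategy(urls: list[str], static_result: Optional[dict]) -> str:\n    # Bulk crawl of a single domain: prefer Scrapy\n    if len(urls) >= 50 and len({urlparse(u).netloc for u in urls}) == 1:\n        return 'scrapy'\n    # No usable static result or SPA markers: escalate to Playwright\n    if static_result is None or needs_js_rendering(static_result):\n        return 'playwright'\n    return 'static'"
      },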
      {
        "title": "2.1 Static HTTP (default — try first)",
        "body": "import requests\nfrom bs4 import BeautifulSoup\nfrom typing import Optional\n\nHEADERS = {\n    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',\n    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',\n    'Accept-Language': 'pt-BR,pt;q=0.9,en-US;q=0.8',\n}\n\ndef fetch_static(url: str, timeout: int = 30) -> Optional[dict]:\n    try:\n        session = requests.Session()\n        resp = session.get(url, headers=HEADERS, timeout=timeout, allow_redirects=True)\n        resp.raise_for_status()\n        soup = BeautifulSoup(resp.content, 'html.parser')\n        return {\n            'html': resp.text,\n            'soup': soup,\n            'status': resp.status_code,\n            'final_url': resp.url,\n            'method': 'static',\n        }\n    except (requests.exceptions.Timeout, requests.exceptions.RequestException):\n        return None"
      },
      {
        "title": "2.2 JS Detection — When to Escalate to Playwright",
        "body": "def needs_js_rendering(static_result: dict) -> bool:\n    \"\"\"Detects if the page needs JS to render content.\"\"\"\n    if not static_result:\n        return True\n    soup = static_result.get('soup')\n    if not soup:\n        return True\n\n    # SPA framework markers\n    spa_markers = [\n        soup.find(id='root'),\n        soup.find(id='app'),\n        soup.find(id='__next'),   # Next.js\n        soup.find(id='__nuxt'),   # Nuxt\n    ]\n    has_spa_root = any(m for m in spa_markers\n                       if m and len(m.get_text(strip=True)) < 50)\n\n    # Many external scripts but little text\n    scripts = len(soup.find_all('script', src=True))\n    text_length = len(soup.get_text(strip=True))\n\n    return has_spa_root or (scripts > 10 and text_length < 500)"
      },
      {
        "title": "2.3 Playwright (JS rendering)",
        "body": "from playwright.async_api import async_playwright\nimport asyncio\n\nBLOCKED_RESOURCE_PATTERNS = [\n    '**/*.{png,jpg,jpeg,gif,webp,avif,svg,woff,woff2,ttf,eot}',\n    '**/google-analytics.com/**',\n    '**/doubleclick.net/**',\n    '**/facebook.com/tr*',\n    '**/ads.*.com/**',\n]\n\nasync def fetch_with_playwright(url: str, timeout_ms: int = 30_000) -> Optional[dict]:\n    async with async_playwright() as p:\n        browser = await p.chromium.launch(headless=True)\n        context = await browser.new_context(\n            viewport={'width': 1280, 'height': 800},\n            user_agent=HEADERS['User-Agent'],\n            java_script_enabled=True,\n        )\n        # Block images, fonts, trackers to speed up extraction\n        for pattern in BLOCKED_RESOURCE_PATTERNS:\n            await context.route(pattern, lambda r: r.abort())\n\n        page = await context.new_page()\n        try:\n            await page.goto(url, wait_until='networkidle', timeout=timeout_ms)\n            await page.wait_for_timeout(2000)  # wait for lazy JS content injection\n\n            html = await page.content()\n            text = await page.evaluate('''() => {\n                const remove = [\"script\",\"style\",\"nav\",\"footer\",\"aside\",\"iframe\",\"noscript\"];\n                remove.forEach(t => document.querySelectorAll(t).forEach(el => el.remove()));\n                return document.body?.innerText || \"\";\n            }''')\n\n            return {\n                'html': html,\n                'text': text,\n                'status': 200,\n                'final_url': page.url,\n                'method': 'playwright',\n            }\n        except Exception as e:\n            return {'error': str(e), 'method': 'playwright'}\n        finally:\n            await browser.close()\n\nPerformance tip: for bulk processing, reuse the browser process. Create new contexts per URL instead of relaunching the browser."
      },
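      {
        "title": "Browser Reuse for Bulk Playwright Runs (sketch)",
        "body": "A sketch of the performance tip above: one Chromium process for the whole batch, a fresh context per URL. It reuses HEADERS (2.1) and async_playwright (2.3); error handling is kept minimal.\n\nasync def fetch_many_with_playwright(urls: list[str]) -> list[dict]:\n    results = []\n    async with async_playwright() as p:\n        # Launch the browser once for the whole batch\n        browser = await p.chromium.launch(headless=True)\n        for url in urls:\n            # New isolated context per URL, much cheaper than relaunching the browser\n            context = await browser.new_context(user_agent=HEADERS['User-Agent'])\n            page = await context.new_page()\n            try:\n                await page.goto(url, wait_until='networkidle', timeout=30_000)\n                results.append({'url': url, 'html': await page.content(), 'method': 'playwright'})\n            except Exception as e:\n                results.append({'url': url, 'error': str(e), 'method': 'playwright'})\n            finally:\n                await context.close()\n        await browser.close()\n    return results"
      },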
      {
        "title": "2.4 Scrapy Settings (bulk crawl)",
        "body": "SCRAPY_SETTINGS = {\n    'CONCURRENT_REQUESTS': 5,\n    'DOWNLOAD_DELAY': 0.5,\n    'COOKIES_ENABLED': True,\n    'ROBOTSTXT_OBEY': True,\n    'DEFAULT_REQUEST_HEADERS': HEADERS,\n    'RETRY_TIMES': 2,\n    'RETRY_HTTP_CODES': [500, 502, 503, 429],\n}"
      },
      {
        "title": "2.5 Cascade Orchestrator",
        "body": "async def extract_page_content(url: str) -> dict:\n    \"\"\"Tries methods in ascending order of cost.\"\"\"\n\n    # 1. Static (fast, lightweight)\n    result = fetch_static(url)\n    if result and is_content_sufficient(result):\n        return enrich_result(result, url)\n\n    # 2. Playwright (JS rendering)\n    if not result or needs_js_rendering(result):\n        result = await fetch_with_playwright(url)\n        if result and 'error' not in result:\n            return enrich_result(result, url)\n\n    return {'url': url, 'error': 'all_methods_failed', 'content': None}\n\ndef is_content_sufficient(result: dict) -> bool:\n    \"\"\"Checks if extracted content is useful (min 200 words).\"\"\"\n    soup = result.get('soup')\n    if not soup:\n        return False\n    text = soup.get_text(separator=' ', strip=True)\n    return len(text.split()) >= 200"
      },
      {
        "title": "3.1 Main Content Extraction (boilerplate removal)",
        "body": "Use trafilatura — the most accurate library for article extraction, especially for Portuguese content.\n\nimport trafilatura\n\ndef extract_main_content(html: str, url: str = '') -> Optional[str]:\n    \"\"\"Extracts article body, removing nav, ads, comments.\"\"\"\n    return trafilatura.extract(\n        html,\n        url=url,\n        include_comments=False,\n        include_tables=True,\n        no_fallback=False,\n        favor_precision=True,\n    )\n\ndef extract_content_with_metadata(html: str, url: str = '') -> dict:\n    \"\"\"Extracts content + structured metadata together.\"\"\"\n    metadata = trafilatura.extract_metadata(html, default_url=url)\n    text = extract_main_content(html, url)\n    return {\n        'text': text,\n        'title': metadata.title if metadata else None,\n        'author': metadata.author if metadata else None,\n        'date': metadata.date if metadata else None,\n        'description': metadata.description if metadata else None,\n        'sitename': metadata.sitename if metadata else None,\n    }\n\nAlternative: newspaper3k (simpler but less accurate for PT-BR)."
      },
      {
        "title": "3.2 Encoding and Whitespace Normalization",
        "body": "import unicodedata\nimport re\n\ndef normalize_text(text: str) -> str:\n    \"\"\"Normalizes encoding, removes invisible chars, collapses whitespace.\"\"\"\n    text = unicodedata.normalize('NFKC', text)\n    text = re.sub(r'[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f\\x7f]', '', text)\n    text = re.sub(r'\\n{3,}', '\\n\\n', text)\n    text = re.sub(r' {2,}', ' ', text)\n    return text.strip()"
      },
      {
        "title": "3.3 Robust HTML Parsing (fallback parsers)",
        "body": "def parse_html_robust(html: str) -> BeautifulSoup:\n    \"\"\"Tries parsers in order of increasing tolerance.\"\"\"\n    for parser in ['html.parser', 'lxml', 'html5lib']:\n        try:\n            soup = BeautifulSoup(html, parser)\n            if soup.body and len(soup.get_text()) > 10:\n                return soup\n        except Exception:\n            continue\n    return BeautifulSoup(_strip_tags_regex(html), 'html.parser')\n\ndef _strip_tags_regex(html: str) -> str:\n    \"\"\"Brute-force text extraction via regex (last resort).\"\"\"\n    from html import unescape\n    html = re.sub(r'<script[^>]*>.*?</script>', '', html, flags=re.DOTALL | re.I)\n    html = re.sub(r'<style[^>]*>.*?</style>', '', html, flags=re.DOTALL | re.I)\n    text = re.sub(r'<[^>]+>', ' ', html)\n    return unescape(normalize_text(text))"
      },
      {
        "title": "3.4 Chunking for LLM (long articles)",
        "body": "def chunk_for_llm(text: str, max_chars: int = 4000, overlap: int = 200) -> list[str]:\n    \"\"\"Splits text into chunks with overlap to maintain context.\"\"\"\n    if len(text) <= max_chars:\n        return [text]\n\n    chunks = []\n    sentences = re.split(r'(?<=[.!?])\\s+', text)\n    current_chunk = ''\n\n    for sentence in sentences:\n        if len(current_chunk) + len(sentence) <= max_chars:\n            current_chunk += ' ' + sentence\n        else:\n            if current_chunk:\n                chunks.append(current_chunk.strip())\n            current_chunk = current_chunk[-overlap:] + ' ' + sentence\n\n    if current_chunk:\n        chunks.append(current_chunk.strip())\n\n    return chunks"
      },
      {
        "title": "4.1 YAML-Based Configurable Extractor",
        "body": "Use declarative YAML config so CSS selectors can be updated without changing Python code. Sites redesign layouts frequently — YAML makes maintenance trivial.\n\nextraction_config.yaml:\n\nversion: 1.0\n\nmeta_tags:\n  article_published:\n    selector: \"meta[property='article:published_time']\"\n    attribute: content\n    aliases:\n      - \"meta[name='publication_date']\"\n      - \"meta[name='date']\"\n  article_author:\n    selector: \"meta[name='author']\"\n    attribute: content\n    aliases:\n      - \"meta[property='article:author']\"\n  og_type:\n    selector: \"meta[property='og:type']\"\n    attribute: content\n\nauthor:\n  - name: meta_author\n    selector: \"meta[name='author']\"\n    attribute: content\n  - name: schema_author\n    selector: \"[itemprop='author']\"\n    attribute: content\n    fallback_attribute: textContent\n  - name: byline_link\n    selector: \"a[rel='author'], .byline a, .author a\"\n    attribute: textContent\n\ndates:\n  published:\n    selectors:\n      - selector: \"meta[property='article:published_time']\"\n        attribute: content\n      - selector: \"time[itemprop='datePublished']\"\n        attribute: datetime\n        fallback_attribute: textContent\n      - selector: \"[class*='date'][class*='publish']\"\n        attribute: textContent\n  modified:\n    selectors:\n      - selector: \"meta[property='article:modified_time']\"\n        attribute: content\n      - selector: \"time[itemprop='dateModified']\"\n        attribute: datetime\n\nsettings:\n  enabled:\n    meta_tags: true\n    author: true\n    dates: true\n  limits:\n    max_items: 10"
      },
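      {
        "title": "YAML Config Loader (sketch)",
        "body": "A minimal loader sketch showing how the YAML above can drive extraction so selector changes never touch Python; it assumes pyyaml and BeautifulSoup and only covers the meta_tags block.\n\nimport yaml\n\ndef load_extraction_config(path: str) -> dict:\n    with open(path, 'r', encoding='utf-8') as f:\n        return yaml.safe_load(f)\n\ndef extract_meta_tags(soup: BeautifulSoup, config: dict) -> dict:\n    # Try the primary selector, then each alias, reading the configured attribute\n    results = {}\n    for field, rule in config.get('meta_tags', {}).items():\n        attribute = rule.get('attribute', 'content')\n        for selector in [rule['selector']] + rule.get('aliases', []):\n            tag = soup.select_one(selector)\n            if tag and tag.get(attribute):\n                results[field] = tag.get(attribute)\n                break\n    return results"
      },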
      {
        "title": "4.2 Schema.org Extraction",
        "body": "def extract_news_schema(html: str) -> dict:\n    \"\"\"Extracts structured data specific to news articles.\"\"\"\n    soup = BeautifulSoup(html, 'html.parser')\n    result = {}\n\n    for tag in soup.find_all('script', type='application/ld+json'):\n        try:\n            data = json.loads(tag.string or '{}')\n            items = data.get('@graph', [data])\n            for item in items:\n                if item.get('@type', '') in NEWS_SCHEMA_TYPES:\n                    result.update({\n                        'headline': item.get('headline'),\n                        'author': _extract_schema_author(item),\n                        'date_published': item.get('datePublished'),\n                        'date_modified': item.get('dateModified'),\n                        'description': item.get('description'),\n                        'publisher': _extract_schema_publisher(item.get('publisher', {})),\n                        'keywords': item.get('keywords', ''),\n                        'section': item.get('articleSection', ''),\n                    })\n        except (json.JSONDecodeError, AttributeError):\n            continue\n    return result\n\ndef _extract_schema_author(item: dict) -> Optional[str]:\n    author = item.get('author', {})\n    if isinstance(author, list):\n        author = author[0]\n    if isinstance(author, dict):\n        return author.get('name')\n    return str(author) if author else None\n\ndef _extract_schema_publisher(publisher: dict) -> Optional[str]:\n    if isinstance(publisher, dict):\n        return publisher.get('name')\n    return None"
      },
      {
        "title": "4.3 Paywall Detection",
        "body": "def detect_paywall(html: str, text: str) -> dict:\n    \"\"\"Detects paywall type and available content.\"\"\"\n    soup = BeautifulSoup(html, 'html.parser')\n\n    paywall_signals = [\n        bool(soup.find(class_=re.compile(r'paywall|premium|subscriber|locked', re.I))),\n        bool(soup.find(attrs={'data-paywall': True})),\n        bool(soup.find(id=re.compile(r'paywall|premium', re.I))),\n    ]\n\n    paywall_text_patterns = [\n        r'assine para (ler|continuar|ver)',\n        r'conte.do exclusivo para assinantes',\n        r'subscribe to (read|continue)',\n        r'this article is for subscribers',\n    ]\n    has_paywall_text = any(re.search(p, text, re.I) for p in paywall_text_patterns)\n\n    has_paywall = any(paywall_signals) or has_paywall_text\n\n    paragraphs = soup.find_all('p')\n    visible = [p for p in paragraphs\n               if 'display:none' not in p.get('style', '')\n               and len(p.get_text()) > 50]\n\n    return {\n        'has_paywall': has_paywall,\n        'type': 'soft' if (has_paywall and len(visible) >= 2) else\n                'hard' if has_paywall else 'none',\n        'available_paragraphs': len(visible),\n    }\n\nPaywall handling:\n\nHard paywall: content never sent to client. Extract preview (title, lead, metadata). Mark paywall: \"hard\" in output.\nSoft paywall: content present in DOM but hidden by CSS/JS. Use Playwright to remove paywall overlay and reveal paragraphs.\nNo paywall: proceed normally."
      },
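      {
        "title": "Soft Paywall Reveal (Playwright sketch)",
        "body": "For the soft paywall case above (content already present in the DOM but hidden client-side), a Playwright sketch that only manipulates the page DOM, consistent with the safety rules; the CSS selectors are illustrative guesses and vary by site.\n\nasync def reveal_soft_paywall(page) -> None:\n    # DOM-only: remove common overlay elements and un-hide paragraphs; no server-side bypass\n    await page.evaluate('''() => {\n        const selectors = ['[class*=\"paywall\"]', '[class*=\"overlay\"]', '[id*=\"paywall\"]'];\n        selectors.forEach(s => document.querySelectorAll(s).forEach(el => el.remove()));\n        document.querySelectorAll('p').forEach(p => { p.style.display = ''; });\n        document.body.style.overflow = 'auto';\n    }''')"
      },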
      {
        "title": "Stage 5: Entity Extraction (LLM)",
        "body": "Use the LLM only on clean text (output of Stage 3). NEVER pass raw HTML — it wastes tokens and reduces precision."
      },
      {
        "title": "5.1 Single Article Extraction",
        "body": "import json, time, re\nimport requests as req\n\nOPENROUTER_API_KEY = os.environ.get(\"OPENROUTER_API_KEY\", \"\")\nOPENROUTER_ENDPOINT = \"https://openrouter.ai/api/v1/chat/completions\"\n\ndef extract_entities_llm(text: str, metadata: dict) -> dict:\n    \"\"\"Extracts entities from a news article using LLM.\"\"\"\n    text_sample = text[:4000] if len(text) > 4000 else text\n\n    prompt = f\"\"\"You are a news entity extractor. Analyze the text below and extract:\n\nTITLE: {metadata.get('title', 'N/A')}\nDATE: {metadata.get('date', 'N/A')}\nTEXT:\n{text_sample}\n\nRespond ONLY with valid JSON, no markdown, in this format:\n{{\n  \"people\": [\n    {{\"name\": \"Full Name\", \"role\": \"Role/Title\", \"context\": \"One sentence about their role in the article\"}}\n  ],\n  \"organizations\": [\n    {{\"name\": \"Org Name\", \"type\": \"company|government|ngo|other\", \"context\": \"role in article\"}}\n  ],\n  \"locations\": [\n    {{\"name\": \"Location Name\", \"type\": \"city|state|country|address\", \"context\": \"mention\"}}\n  ],\n  \"events\": [\n    {{\"name\": \"Event\", \"date\": \"date if available\", \"description\": \"brief description\"}}\n  ],\n  \"relationships\": [\n    {{\"subject\": \"Entity A\", \"relation\": \"relation type\", \"object\": \"Entity B\"}}\n  ]\n}}\"\"\"\n\n    try:\n        response = req.post(\n            OPENROUTER_ENDPOINT,\n            headers={\n                \"Authorization\": f\"Bearer {OPENROUTER_API_KEY}\",\n                \"Content-Type\": \"application/json\",\n            },\n            json={\n                \"model\": \"google/gemini-2.5-flash-lite\",\n                \"messages\": [{\"role\": \"user\", \"content\": prompt}],\n                \"max_tokens\": 2000,\n                \"temperature\": 0.1,  # low for structured extraction\n            },\n            timeout=30,\n        )\n        response.raise_for_status()\n        content = response.json()['choices'][0]['message']['content']\n        content = re.sub(r'^```json\\s*|\\s*```$', '', content.strip())\n        return json.loads(content)\n    except (json.JSONDecodeError, KeyError, req.RequestException) as e:\n        return {\n            'error': str(e),\n            'people': [], 'organizations': [],\n            'locations': [], 'events': [], 'relationships': []\n        }\n    finally:\n        time.sleep(0.3)  # rate limiting between calls"
      },
      {
        "title": "5.2 Chunked Extraction (long articles)",
        "body": "def extract_entities_chunked(text: str, metadata: dict) -> dict:\n    \"\"\"For long articles, extract entities per chunk and merge with deduplication.\"\"\"\n    chunks = chunk_for_llm(text, max_chars=3000)\n    merged = {'people': [], 'organizations': [], 'locations': [], 'events': [], 'relationships': []}\n\n    for chunk in chunks:\n        chunk_entities = extract_entities_llm(chunk, metadata)\n        for key in merged:\n            merged[key].extend(chunk_entities.get(key, []))\n\n    # Deduplicate by name (case-insensitive)\n    for key in ['people', 'organizations', 'locations']:\n        seen = set()\n        deduped = []\n        for item in merged[key]:\n            name = item.get('name', '').lower().strip()\n            if name and name not in seen:\n                seen.add(name)\n                deduped.append(item)\n        merged[key] = deduped\n\n    return merged"
      },
      {
        "title": "5.3 Recommended LLM Models (via OpenRouter)",
        "body": "ModelSpeedCostQuality (PT-BR)Use casegoogle/gemini-2.5-flash-liteVery fastVery lowGreatBulk extractiongoogle/gemini-2.5-flashFastLowExcellentComplex articlesanthropic/claude-haiku-4-5FastMediumExcellentHigh precisionopenai/gpt-4o-miniMediumMediumVery goodAlternative\n\nAlways use temperature: 0.1 for structured extraction. Higher values produce hallucinated entities."
      },
      {
        "title": "Exponential Backoff per Domain",
        "body": "import time, random\n\nclass RateLimiter:\n    def __init__(self, base_delay: float = 0.5, max_delay: float = 30.0):\n        self.base_delay = base_delay\n        self.max_delay = max_delay\n        self._attempts: dict[str, int] = {}\n\n    def wait(self, domain: str):\n        attempts = self._attempts.get(domain, 0)\n        delay = min(self.base_delay * (2 ** attempts), self.max_delay)\n        delay *= random.uniform(0.8, 1.2)  # jitter +/-20%\n        time.sleep(delay)\n\n    def on_success(self, domain: str):\n        self._attempts[domain] = 0\n\n    def on_failure(self, domain: str):\n        self._attempts[domain] = self._attempts.get(domain, 0) + 1"
      },
      {
        "title": "Rotating User-Agents",
        "body": "USER_AGENTS = [\n    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',\n    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',\n    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',\n]"
      },
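      {
        "title": "Using Rotating User-Agents (sketch)",
        "body": "A small sketch showing how the list above can be rotated per request; random is standard library and HEADERS comes from 2.1.\n\nimport random\n\ndef headers_with_random_ua() -> dict:\n    # Copy the base headers and swap in a random User-Agent for this request\n    headers = dict(HEADERS)\n    headers['User-Agent'] = random.choice(USER_AGENTS)\n    return headers"
      },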
      {
        "title": "Incremental Saving and Checkpointing",
        "body": "Never wait to process all URLs before saving. A crash mid-processing can lose hours of work.\n\nimport json\nfrom pathlib import Path\nfrom datetime import datetime\n\ndef save_incremental(results: list, output_path: Path, every: int = 50):\n    \"\"\"Saves results every N articles processed.\"\"\"\n    if len(results) % every == 0:\n        output_path.write_text(json.dumps(results, ensure_ascii=False, indent=2))\n\ndef load_checkpoint(output_path: Path) -> tuple[list, set]:\n    \"\"\"Loads checkpoint and returns (results, already-processed URLs).\"\"\"\n    if output_path.exists():\n        results = json.loads(output_path.read_text())\n        processed_urls = {r['url'] for r in results}\n        return results, processed_urls\n    return [], set()"
      },
      {
        "title": "Output Directory Structure",
        "body": "output/\n├── {domain}/\n│   ├── articles_YYYY-MM-DD.json    # full articles with text\n│   ├── entities_YYYY-MM-DD.json    # entities only (for quick analysis)\n│   └── failed_YYYY-MM-DD.json      # failed URLs (for retry)"
      },
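      {
        "title": "Output Path Builder (sketch)",
        "body": "A path-builder sketch matching the layout above; pathlib and datetime only, creating the per-domain folder on demand.\n\nfrom pathlib import Path\nfrom datetime import date\n\ndef output_paths(domain: str, base: Path = Path('output')) -> dict[str, Path]:\n    # One folder per domain, one dated file per artifact type\n    day = date.today().isoformat()\n    folder = base / domain\n    folder.mkdir(parents=True, exist_ok=True)\n    return {\n        'articles': folder / f'articles_{day}.json',\n        'entities': folder / f'entities_{day}.json',\n        'failed': folder / f'failed_{day}.json',\n    }"
      },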
      {
        "title": "Result Schema",
        "body": "Every result MUST include quality and provenance metadata:\n\ndef build_result(url: str, content: dict, entities: dict, method: str) -> dict:\n    return {\n        'url': url,\n        'method': method,                     # static|playwright|scrapy|failed\n        'paywall': content.get('paywall', 'none'),\n        'data_quality': _assess_quality(content, entities),\n        'title': content.get('title'),\n        'author': content.get('author'),\n        'date_published': content.get('date_published'),\n        'word_count': len((content.get('text') or '').split()),\n        'text': content.get('text'),\n        'entities': entities,\n        'schema': content.get('schema', {}),\n        'crawled_at': datetime.now().isoformat(),\n    }\n\ndef _assess_quality(content: dict, entities: dict) -> str:\n    text = content.get('text') or ''\n    has_text = len(text.split()) >= 100\n    has_entities = any(entities.get(k) for k in ['people', 'organizations'])\n    has_meta = bool(content.get('title') and content.get('date_published'))\n\n    if has_text and has_entities and has_meta:\n        return 'high'\n    elif has_text or has_entities:\n        return 'medium'\n    return 'low'"
      },
      {
        "title": "Python Dependencies",
        "body": "pip install \\\n  requests \\\n  beautifulsoup4 \\\n  lxml html5lib \\\n  scrapy \\\n  playwright \\\n  trafilatura \\\n  pyyaml \\\n  python-dateutil\n\n# Chromium browser for Playwright\nplaywright install chromium\n\nLibraryMin versionResponsibilityrequests2.31+Static HTTP, API callsbeautifulsoup44.12+Tolerant HTML parsinglxml4.9+Robust alternative parserhtml5lib1.1+Ultra-tolerant parser (broken HTML)scrapy2.11+Parallel crawling at scaleplaywright1.40+JS/SPA renderingtrafilatura1.8+Article extraction (boilerplate removal)pyyaml6.0+Declarative extraction configpython-dateutil2.9+Multi-format date parsing"
      },
      {
        "title": "Best Practices (DO)",
        "body": "Cascade methods: always try lightest first (static -> playwright)\nIncremental save: save every 50 articles to avoid losing progress on crash\nResume mode: check already-processed URLs before starting (load_checkpoint)\nRate limiting: minimum 0.5s between requests on same domain; exponential backoff on failures\nDocument quality: include data_quality and method in every result\nSeparation of concerns: crawling -> cleaning -> entities (never all at once)\nDeclarative config: use YAML for CSS selectors, not hard-coded Python\nGraceful fallback: if LLM fails, return empty structure with error field — never raise unhandled exceptions\nClean text for LLM: always pass extracted and normalized text, never raw HTML"
      },
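      {
        "title": "End-to-End Crawl Loop (sketch)",
        "body": "A sketch tying the practices above together (resume mode, per-domain rate limiting, incremental saving); it reuses load_checkpoint, save_incremental, RateLimiter, and extract_page_content from earlier sections and is illustrative rather than production-ready.\n\nimport asyncio\nimport json\nfrom pathlib import Path\nfrom urllib.parse import urlparse\n\nasync def run_crawl(urls: list[str], output_path: Path) -> list[dict]:\n    # Resume mode: skip URLs already present in the checkpoint file\n    results, processed = load_checkpoint(output_path)\n    limiter = RateLimiter()\n    for url in urls:\n        if url in processed:\n            continue\n        domain = urlparse(url).netloc\n        limiter.wait(domain)\n        result = await extract_page_content(url)\n        if result.get('error'):\n            limiter.on_failure(domain)\n        else:\n            limiter.on_success(domain)\n        results.append(result)\n        save_incremental(results, output_path)  # persists every 50 results\n    # Final write so the tail is never lost\n    output_path.write_text(json.dumps(results, ensure_ascii=False, indent=2))\n    return results\n\n# Example: asyncio.run(run_crawl(urls, Path('output/example.com/articles.json')))"
      },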
      {
        "title": "Anti-Patterns (AVOID)",
        "body": "Passing raw HTML to the LLM (wastes tokens, lower entity precision)\nUsing only regex for entity extraction (fragile for natural text variations)\nHard-coding CSS selectors in Python (sites change layouts frequently)\nIgnoring encoding (UTF-8 vs Latin-1 causes silent data corruption)\nInfinite retries (use exponential backoff with max attempt limit)\nProcessing all pages before saving (risk of losing everything on crash)\nMixing score scales without explicit normalization (e.g., 0-1 vs 0-100)\nUsing wait_until='load' in Playwright for lazy content (use 'networkidle')"
      },
      {
        "title": "Safety Rules",
        "body": "NEVER scrape pages behind authentication without explicit user approval.\nALWAYS respect robots.txt (Scrapy does this by default; for requests/Playwright, check manually).\nALWAYS implement rate limiting — minimum 0.5s between requests to the same domain.\nNEVER store API keys in generated scripts — always use os.environ.get().\nNEVER bypass hard paywalls — extract only publicly available content.\nFor soft paywalls, only reveal content that was already sent to the client (DOM manipulation only, no server-side bypass)."
      }
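      ,
      {
        "title": "robots.txt Check for requests/Playwright (sketch)",
        "body": "The rule above says to check robots.txt manually when not going through Scrapy. A minimal sketch using the standard-library urllib.robotparser; the fallback on an unreachable robots.txt is an assumption, tighten it if needed.\n\nfrom urllib.robotparser import RobotFileParser\nfrom urllib.parse import urlparse\n\ndef allowed_by_robots(url: str, user_agent: str = '*') -> bool:\n    # Fetch and parse the site's robots.txt, then ask whether this URL may be crawled\n    parsed = urlparse(url)\n    rp = RobotFileParser()\n    rp.set_url(f'{parsed.scheme}://{parsed.netloc}/robots.txt')\n    try:\n        rp.read()\n    except Exception:\n        return True  # assumption: unreachable robots.txt does not hard-block the crawl\n    return rp.can_fetch(user_agent, url)"
      }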
    ]
  }
}
text for LLM: always pass extracted and normalized text, never raw HTML\nAnti-Patterns (AVOID)\nPassing raw HTML to the LLM (wastes tokens, lower entity precision)\nUsing only regex for entity extraction (fragile for natural text variations)\nHard-coding CSS selectors in Python (sites change layouts frequently)\nIgnoring encoding (UTF-8 vs Latin-1 causes silent data corruption)\nInfinite retries (use exponential backoff with max attempt limit)\nProcessing all pages before saving (risk of losing everything on crash)\nMixing score scales without explicit normalization (e.g., 0-1 vs 0-100)\nUsing wait_until='load' in Playwright for lazy content (use 'networkidle')\nSafety Rules\nNEVER scrape pages behind authentication without explicit user approval.\nALWAYS respect robots.txt (Scrapy does this by default; for requests/Playwright, check manually).\nALWAYS implement rate limiting — minimum 0.5s between requests to the same domain.\nNEVER store API keys in generated scripts — always use os.environ.get().\nNEVER bypass hard paywalls — extract only publicly available content.\nFor soft paywalls, only reveal content that was already sent to the client (DOM manipulation only, no server-side bypass)."
  },
  "trust": {
    "sourceLabel": "tencent",
    "provenanceUrl": "https://clawhub.ai/guifav/web-scraper",
    "publisherUrl": "https://clawhub.ai/guifav/web-scraper",
    "owner": "guifav",
    "version": "0.1.0",
    "license": null,
    "verificationStatus": "Indexed source record"
  },
  "links": {
    "detailUrl": "https://openagent3.xyz/skills/web-scraper",
    "downloadUrl": "https://openagent3.xyz/downloads/web-scraper",
    "agentUrl": "https://openagent3.xyz/skills/web-scraper/agent",
    "manifestUrl": "https://openagent3.xyz/skills/web-scraper/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/web-scraper/agent.md"
  }
}