{
  "schemaVersion": "1.0",
  "item": {
    "slug": "habib-pdf-to-json",
    "name": "habib-pdf-to-json",
    "source": "tencent",
    "type": "skill",
    "category": "数据分析",
    "sourceUrl": "https://clawhub.ai/dbmoradi60/habib-pdf-to-json",
    "canonicalUrl": "https://clawhub.ai/dbmoradi60/habib-pdf-to-json",
    "targetPlatform": "OpenClaw"
  },
  "install": {
    "downloadMode": "redirect",
    "downloadUrl": "/downloads/habib-pdf-to-json",
    "sourceDownloadUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=habib-pdf-to-json",
    "sourcePlatform": "tencent",
    "targetPlatform": "OpenClaw",
    "installMethod": "Manual import",
    "extraction": "Extract archive",
    "prerequisites": [
      "OpenClaw"
    ],
    "packageFormat": "ZIP package",
    "includedAssets": [
      "SKILL.md",
      "_meta.json"
    ],
    "primaryDoc": "SKILL.md",
    "quickSetup": [
      "Download the package from Yavira.",
      "Extract the archive and review SKILL.md first.",
      "Import or place the package into your OpenClaw setup."
    ],
    "agentAssist": {
      "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
      "steps": [
        "Download the package from Yavira.",
        "Extract it into a folder your agent can access.",
        "Paste one of the prompts below and point your agent at the extracted folder."
      ],
      "prompts": [
        {
          "label": "New install",
          "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
        },
        {
          "label": "Upgrade existing",
          "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
        }
      ]
    },
    "sourceHealth": {
      "source": "tencent",
      "status": "healthy",
      "reason": "direct_download_ok",
      "recommendedAction": "download",
      "checkedAt": "2026-04-23T16:43:11.935Z",
      "expiresAt": "2026-04-30T16:43:11.935Z",
      "httpStatus": 200,
      "finalUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=4claw-imageboard",
      "contentType": "application/zip",
      "probeMethod": "head",
      "details": {
        "probeUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=4claw-imageboard",
        "contentDisposition": "attachment; filename=\"4claw-imageboard-1.0.1.zip\"",
        "redirectLocation": null,
        "bodySnippet": null
      },
      "scope": "source",
      "summary": "Source download looks usable.",
      "detail": "Yavira can redirect you to the upstream package for this source.",
      "primaryActionLabel": "Download for OpenClaw",
      "primaryActionHref": "/downloads/habib-pdf-to-json"
    },
    "validation": {
      "installChecklist": [
        "Use the Yavira download entry.",
        "Review SKILL.md after the package is downloaded.",
        "Confirm the extracted package contains the expected setup assets."
      ],
      "postInstallChecks": [
        "Confirm the extracted package includes the expected docs or setup files.",
        "Validate the skill or prompts are available in your target agent workspace.",
        "Capture any manual follow-up steps the agent could not complete."
      ]
    },
    "downloadPageUrl": "https://openagent3.xyz/downloads/habib-pdf-to-json",
    "agentPageUrl": "https://openagent3.xyz/skills/habib-pdf-to-json/agent",
    "manifestUrl": "https://openagent3.xyz/skills/habib-pdf-to-json/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/habib-pdf-to-json/agent.md"
  },
  "agentAssist": {
    "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
    "steps": [
      "Download the package from Yavira.",
      "Extract it into a folder your agent can access.",
      "Paste one of the prompts below and point your agent at the extracted folder."
    ],
    "prompts": [
      {
        "label": "New install",
        "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
      },
      {
        "label": "Upgrade existing",
        "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
      }
    ]
  },
  "documentation": {
    "source": "clawhub",
    "primaryDoc": "SKILL.md",
    "sections": [
      {
        "title": "Overview",
        "body": "Based on DDC methodology (Chapter 2.4), this skill transforms unstructured PDF documents into structured formats suitable for analysis and integration. Construction projects generate vast amounts of PDF documentation - specifications, BOMs, schedules, and reports - that need to be extracted and processed.\n\nBook Reference: \"Преобразование данных в структурированную форму\" / \"Data Transformation to Structured Form\"\n\n\"Преобразование данных из неструктурированной в структурированную форму — это и искусство, и наука. Этот процесс часто занимает значительную часть работы инженера по обработке данных.\"\n— DDC Book, Chapter 2.4"
      },
      {
        "title": "ETL Process Overview",
        "body": "The conversion follows the ETL pattern:\n\nExtract: Load the PDF document\nTransform: Parse and structure the content\nLoad: Save to CSV, Excel, or JSON"
      },
      {
        "title": "Quick Start",
        "body": "import pdfplumber\nimport pandas as pd\n\n# Extract table from PDF\nwith pdfplumber.open(\"construction_spec.pdf\") as pdf:\n    page = pdf.pages[0]\n    table = page.extract_table()\n    df = pd.DataFrame(table[1:], columns=table[0])\n    df.to_excel(\"extracted_data.xlsx\", index=False)"
      },
      {
        "title": "Installation",
        "body": "# Core libraries\npip install pdfplumber pandas openpyxl\n\n# For scanned PDFs (OCR)\npip install pytesseract pdf2image\n# Also install Tesseract OCR: https://github.com/tesseract-ocr/tesseract\n\n# For advanced PDF operations\npip install pypdf"
      },
      {
        "title": "Extract All Tables from PDF",
        "body": "import pdfplumber\nimport pandas as pd\n\ndef extract_tables_from_pdf(pdf_path):\n    \"\"\"Extract all tables from a PDF file\"\"\"\n    all_tables = []\n\n    with pdfplumber.open(pdf_path) as pdf:\n        for page_num, page in enumerate(pdf.pages):\n            tables = page.extract_tables()\n            for table_num, table in enumerate(tables):\n                if table and len(table) > 1:\n                    # First row as header\n                    df = pd.DataFrame(table[1:], columns=table[0])\n                    df['_page'] = page_num + 1\n                    df['_table'] = table_num + 1\n                    all_tables.append(df)\n\n    if all_tables:\n        return pd.concat(all_tables, ignore_index=True)\n    return pd.DataFrame()\n\n# Usage\ndf = extract_tables_from_pdf(\"material_specification.pdf\")\ndf.to_excel(\"materials.xlsx\", index=False)"
      },
      {
        "title": "Extract Text with Layout",
        "body": "import pdfplumber\n\ndef extract_text_with_layout(pdf_path):\n    \"\"\"Extract text preserving layout structure\"\"\"\n    full_text = []\n\n    with pdfplumber.open(pdf_path) as pdf:\n        for page in pdf.pages:\n            text = page.extract_text()\n            if text:\n                full_text.append(text)\n\n    return \"\\n\\n--- Page Break ---\\n\\n\".join(full_text)\n\n# Usage\ntext = extract_text_with_layout(\"project_report.pdf\")\nwith open(\"report_text.txt\", \"w\", encoding=\"utf-8\") as f:\n    f.write(text)"
      },
      {
        "title": "Extract Specific Table by Position",
        "body": "import pdfplumber\nimport pandas as pd\n\ndef extract_table_from_area(pdf_path, page_num, bbox):\n    \"\"\"\n    Extract table from specific area on page\n\n    Args:\n        pdf_path: Path to PDF file\n        page_num: Page number (0-indexed)\n        bbox: Bounding box (x0, top, x1, bottom) in points\n    \"\"\"\n    with pdfplumber.open(pdf_path) as pdf:\n        page = pdf.pages[page_num]\n        cropped = page.within_bbox(bbox)\n        table = cropped.extract_table()\n\n        if table:\n            return pd.DataFrame(table[1:], columns=table[0])\n    return pd.DataFrame()\n\n# Usage - extract table from specific area\n# bbox format: (left, top, right, bottom) in points (1 inch = 72 points)\ndf = extract_table_from_area(\"drawing.pdf\", 0, (50, 100, 550, 400))"
      },
      {
        "title": "Extract Text from Scanned PDF",
        "body": "import pytesseract\nfrom pdf2image import convert_from_path\nimport pandas as pd\n\ndef ocr_scanned_pdf(pdf_path, language='eng'):\n    \"\"\"\n    Extract text from scanned PDF using OCR\n\n    Args:\n        pdf_path: Path to scanned PDF\n        language: Tesseract language code (eng, deu, rus, etc.)\n    \"\"\"\n    # Convert PDF pages to images\n    images = convert_from_path(pdf_path, dpi=300)\n\n    extracted_text = []\n    for i, image in enumerate(images):\n        text = pytesseract.image_to_string(image, lang=language)\n        extracted_text.append({\n            'page': i + 1,\n            'text': text\n        })\n\n    return pd.DataFrame(extracted_text)\n\n# Usage\ndf = ocr_scanned_pdf(\"scanned_specification.pdf\", language='eng')\ndf.to_csv(\"ocr_results.csv\", index=False)"
      },
      {
        "title": "OCR Table Extraction",
        "body": "import pytesseract\nfrom pdf2image import convert_from_path\nimport pandas as pd\nimport cv2\nimport numpy as np\n\ndef ocr_table_from_scanned_pdf(pdf_path, page_num=0):\n    \"\"\"Extract table from scanned PDF using OCR with table detection\"\"\"\n    # Convert specific page to image\n    images = convert_from_path(pdf_path, first_page=page_num+1,\n                                last_page=page_num+1, dpi=300)\n    image = np.array(images[0])\n\n    # Convert to grayscale\n    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)\n\n    # Apply thresholding\n    _, binary = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY_INV)\n\n    # Extract text with table structure\n    custom_config = r'--oem 3 --psm 6'\n    text = pytesseract.image_to_string(gray, config=custom_config)\n\n    # Parse text into table structure\n    lines = text.strip().split('\\n')\n    data = [line.split() for line in lines if line.strip()]\n\n    if data:\n        # Assume first row is header\n        df = pd.DataFrame(data[1:], columns=data[0] if len(data[0]) > 0 else None)\n        return df\n    return pd.DataFrame()\n\n# Usage\ndf = ocr_table_from_scanned_pdf(\"scanned_bom.pdf\")\nprint(df)"
      },
      {
        "title": "Bill of Materials (BOM) Extraction",
        "body": "import pdfplumber\nimport pandas as pd\nimport re\n\ndef extract_bom_from_pdf(pdf_path):\n    \"\"\"Extract Bill of Materials from construction PDF\"\"\"\n    all_items = []\n\n    with pdfplumber.open(pdf_path) as pdf:\n        for page in pdf.pages:\n            tables = page.extract_tables()\n            for table in tables:\n                if not table or len(table) < 2:\n                    continue\n\n                # Find header row (look for common BOM headers)\n                header_keywords = ['item', 'description', 'quantity', 'unit', 'material']\n                for i, row in enumerate(table):\n                    if row and any(keyword in str(row).lower() for keyword in header_keywords):\n                        # Found header, process remaining rows\n                        headers = [str(h).strip() for h in row]\n                        for data_row in table[i+1:]:\n                            if data_row and any(cell for cell in data_row if cell):\n                                item = dict(zip(headers, data_row))\n                                all_items.append(item)\n                        break\n\n    return pd.DataFrame(all_items)\n\n# Usage\nbom = extract_bom_from_pdf(\"project_bom.pdf\")\nbom.to_excel(\"bom_extracted.xlsx\", index=False)"
      },
      {
        "title": "Project Schedule Extraction",
        "body": "import pdfplumber\nimport pandas as pd\nfrom datetime import datetime\n\ndef extract_schedule_from_pdf(pdf_path):\n    \"\"\"Extract project schedule/gantt data from PDF\"\"\"\n    with pdfplumber.open(pdf_path) as pdf:\n        all_tasks = []\n\n        for page in pdf.pages:\n            tables = page.extract_tables()\n            for table in tables:\n                if not table:\n                    continue\n\n                # Look for schedule-like table\n                headers = table[0] if table else []\n\n                # Check if it looks like a schedule\n                schedule_keywords = ['task', 'activity', 'start', 'end', 'duration']\n                if any(kw in str(headers).lower() for kw in schedule_keywords):\n                    for row in table[1:]:\n                        if row and any(cell for cell in row if cell):\n                            task = dict(zip(headers, row))\n                            all_tasks.append(task)\n\n    df = pd.DataFrame(all_tasks)\n\n    # Try to parse dates\n    date_columns = ['Start', 'End', 'Start Date', 'End Date', 'Finish']\n    for col in date_columns:\n        if col in df.columns:\n            df[col] = pd.to_datetime(df[col], errors='coerce')\n\n    return df\n\n# Usage\nschedule = extract_schedule_from_pdf(\"project_schedule.pdf\")\nprint(schedule)"
      },
      {
        "title": "Specification Parsing",
        "body": "import pdfplumber\nimport pandas as pd\nimport re\n\ndef parse_specification_pdf(pdf_path):\n    \"\"\"Parse construction specification document\"\"\"\n    specs = []\n\n    with pdfplumber.open(pdf_path) as pdf:\n        full_text = \"\"\n        for page in pdf.pages:\n            text = page.extract_text()\n            if text:\n                full_text += text + \"\\n\"\n\n    # Parse sections (common spec format)\n    section_pattern = r'(\\d+\\.\\d+(?:\\.\\d+)?)\\s+([A-Z][^\\n]+)'\n    sections = re.findall(section_pattern, full_text)\n\n    for num, title in sections:\n        specs.append({\n            'section_number': num,\n            'title': title.strip(),\n            'level': len(num.split('.'))\n        })\n\n    return pd.DataFrame(specs)\n\n# Usage\nspecs = parse_specification_pdf(\"technical_spec.pdf\")\nprint(specs)"
      },
      {
        "title": "Process Multiple PDFs",
        "body": "import pdfplumber\nimport pandas as pd\nfrom pathlib import Path\n\ndef batch_extract_tables(folder_path, output_folder):\n    \"\"\"Process all PDFs in folder and extract tables\"\"\"\n    pdf_files = Path(folder_path).glob(\"*.pdf\")\n    results = []\n\n    for pdf_path in pdf_files:\n        print(f\"Processing: {pdf_path.name}\")\n        try:\n            with pdfplumber.open(pdf_path) as pdf:\n                for page_num, page in enumerate(pdf.pages):\n                    tables = page.extract_tables()\n                    for table_num, table in enumerate(tables):\n                        if table and len(table) > 1:\n                            df = pd.DataFrame(table[1:], columns=table[0])\n                            df['_source_file'] = pdf_path.name\n                            df['_page'] = page_num + 1\n\n                            # Save individual table\n                            output_name = f\"{pdf_path.stem}_p{page_num+1}_t{table_num+1}.xlsx\"\n                            df.to_excel(Path(output_folder) / output_name, index=False)\n                            results.append(df)\n        except Exception as e:\n            print(f\"Error processing {pdf_path.name}: {e}\")\n\n    # Combined output\n    if results:\n        combined = pd.concat(results, ignore_index=True)\n        combined.to_excel(Path(output_folder) / \"all_tables.xlsx\", index=False)\n\n    return len(results)\n\n# Usage\ncount = batch_extract_tables(\"./pdf_documents/\", \"./extracted/\")\nprint(f\"Extracted {count} tables\")"
      },
      {
        "title": "Data Cleaning After Extraction",
        "body": "import pandas as pd\n\ndef clean_extracted_data(df):\n    \"\"\"Clean common issues in PDF-extracted data\"\"\"\n    # Remove completely empty rows\n    df = df.dropna(how='all')\n\n    # Strip whitespace from string columns\n    for col in df.select_dtypes(include=['object']).columns:\n        df[col] = df[col].str.strip()\n\n    # Remove rows where all cells are empty strings\n    df = df[df.apply(lambda row: any(cell != '' for cell in row), axis=1)]\n\n    # Convert numeric columns\n    for col in df.columns:\n        # Try to convert to numeric\n        numeric_series = pd.to_numeric(df[col], errors='coerce')\n        if numeric_series.notna().sum() > len(df) * 0.5:  # More than 50% numeric\n            df[col] = numeric_series\n\n    return df\n\n# Usage\ndf = extract_tables_from_pdf(\"document.pdf\")\ndf_clean = clean_extracted_data(df)\ndf_clean.to_excel(\"clean_data.xlsx\", index=False)"
      },
      {
        "title": "Export Options",
        "body": "import pandas as pd\nimport json\n\ndef export_to_multiple_formats(df, base_name):\n    \"\"\"Export DataFrame to multiple formats\"\"\"\n    # Excel\n    df.to_excel(f\"{base_name}.xlsx\", index=False)\n\n    # CSV\n    df.to_csv(f\"{base_name}.csv\", index=False, encoding='utf-8-sig')\n\n    # JSON\n    df.to_json(f\"{base_name}.json\", orient='records', indent=2)\n\n    # JSON Lines (for large datasets)\n    df.to_json(f\"{base_name}.jsonl\", orient='records', lines=True)\n\n# Usage\ndf = extract_tables_from_pdf(\"document.pdf\")\nexport_to_multiple_formats(df, \"extracted_data\")"
      },
      {
        "title": "Quick Reference",
        "body": "TaskToolCodeExtract tablepdfplumberpage.extract_table()Extract textpdfplumberpage.extract_text()OCR scannedpytesseractpytesseract.image_to_string(image)Merge PDFspypdfwriter.add_page(page)Convert to imagepdf2imageconvert_from_path(pdf)"
      },
      {
        "title": "Troubleshooting",
        "body": "IssueSolutionTable not detectedTry adjusting table settings: page.extract_table(table_settings={})Wrong column alignmentUse visual debugging: page.to_image().draw_rects()OCR quality poorIncrease DPI, preprocess image, use correct languageMemory issuesProcess pages one at a time, close PDF after processing"
      },
      {
        "title": "Resources",
        "body": "Book: \"Data-Driven Construction\" by Artem Boiko, Chapter 2.4\nWebsite: https://datadrivenconstruction.io\npdfplumber Docs: https://github.com/jsvine/pdfplumber\nTesseract OCR: https://github.com/tesseract-ocr/tesseract"
      },
      {
        "title": "Next Steps",
        "body": "See image-to-data for image processing\nSee cad-to-data for CAD/BIM data extraction\nSee etl-pipeline for automated processing workflows\nSee data-quality-check for validating extracted data"
      }
    ]
  },
  "trust": {
    "sourceLabel": "tencent",
    "provenanceUrl": "https://clawhub.ai/dbmoradi60/habib-pdf-to-json",
    "publisherUrl": "https://clawhub.ai/dbmoradi60/habib-pdf-to-json",
    "owner": "dbmoradi60",
    "version": "1.0.0",
    "license": null,
    "verificationStatus": "Indexed source record"
  },
  "links": {
    "detailUrl": "https://openagent3.xyz/skills/habib-pdf-to-json",
    "downloadUrl": "https://openagent3.xyz/downloads/habib-pdf-to-json",
    "agentUrl": "https://openagent3.xyz/skills/habib-pdf-to-json/agent",
    "manifestUrl": "https://openagent3.xyz/skills/habib-pdf-to-json/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/habib-pdf-to-json/agent.md"
  }
}