{
  "schemaVersion": "1.0",
  "item": {
    "slug": "rag-construction",
    "name": "Rag Construction",
    "source": "tencent",
    "type": "skill",
    "category": "开发工具",
    "sourceUrl": "https://clawhub.ai/datadrivenconstruction/rag-construction",
    "canonicalUrl": "https://clawhub.ai/datadrivenconstruction/rag-construction",
    "targetPlatform": "OpenClaw"
  },
  "install": {
    "downloadMode": "redirect",
    "downloadUrl": "/downloads/rag-construction",
    "sourceDownloadUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=rag-construction",
    "sourcePlatform": "tencent",
    "targetPlatform": "OpenClaw",
    "installMethod": "Manual import",
    "extraction": "Extract archive",
    "prerequisites": [
      "OpenClaw"
    ],
    "packageFormat": "ZIP package",
    "includedAssets": [
      "claw.json",
      "instructions.md",
      "SKILL.md"
    ],
    "primaryDoc": "SKILL.md",
    "quickSetup": [
      "Download the package from Yavira.",
      "Extract the archive and review SKILL.md first.",
      "Import or place the package into your OpenClaw setup."
    ],
    "agentAssist": {
      "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
      "steps": [
        "Download the package from Yavira.",
        "Extract it into a folder your agent can access.",
        "Paste one of the prompts below and point your agent at the extracted folder."
      ],
      "prompts": [
        {
          "label": "New install",
          "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
        },
        {
          "label": "Upgrade existing",
          "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
        }
      ]
    },
    "sourceHealth": {
      "source": "tencent",
      "status": "healthy",
      "reason": "direct_download_ok",
      "recommendedAction": "download",
      "checkedAt": "2026-05-07T17:22:31.273Z",
      "expiresAt": "2026-05-14T17:22:31.273Z",
      "httpStatus": 200,
      "finalUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=afrexai-annual-report",
      "contentType": "application/zip",
      "probeMethod": "head",
      "details": {
        "probeUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=afrexai-annual-report",
        "contentDisposition": "attachment; filename=\"afrexai-annual-report-1.0.0.zip\"",
        "redirectLocation": null,
        "bodySnippet": null
      },
      "scope": "source",
      "summary": "Source download looks usable.",
      "detail": "Yavira can redirect you to the upstream package for this source.",
      "primaryActionLabel": "Download for OpenClaw",
      "primaryActionHref": "/downloads/rag-construction"
    },
    "validation": {
      "installChecklist": [
        "Use the Yavira download entry.",
        "Review SKILL.md after the package is downloaded.",
        "Confirm the extracted package contains the expected setup assets."
      ],
      "postInstallChecks": [
        "Confirm the extracted package includes the expected docs or setup files.",
        "Validate the skill or prompts are available in your target agent workspace.",
        "Capture any manual follow-up steps the agent could not complete."
      ]
    },
    "downloadPageUrl": "https://openagent3.xyz/downloads/rag-construction",
    "agentPageUrl": "https://openagent3.xyz/skills/rag-construction/agent",
    "manifestUrl": "https://openagent3.xyz/skills/rag-construction/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/rag-construction/agent.md"
  },
  "agentAssist": {
    "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
    "steps": [
      "Download the package from Yavira.",
      "Extract it into a folder your agent can access.",
      "Paste one of the prompts below and point your agent at the extracted folder."
    ],
    "prompts": [
      {
        "label": "New install",
        "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
      },
      {
        "label": "Upgrade existing",
        "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
      }
    ]
  },
  "documentation": {
    "source": "clawhub",
    "primaryDoc": "SKILL.md",
    "sections": [
      {
        "title": "Overview",
        "body": "Based on DDC methodology (Chapter 2.3), this skill builds Retrieval-Augmented Generation (RAG) systems for construction knowledge bases, enabling semantic search and AI-powered question answering over construction documents.\n\nBook Reference: \"Pandas DataFrame и LLM ChatGPT\" / \"Pandas DataFrame and LLM ChatGPT\""
      },
      {
        "title": "Quick Start",
        "body": "from dataclasses import dataclass, field\nfrom enum import Enum\nfrom typing import List, Dict, Optional, Any, Callable\nfrom datetime import datetime\nimport json\nimport hashlib\nimport re\n\nclass DocumentType(Enum):\n    \"\"\"Types of construction documents\"\"\"\n    SPECIFICATION = \"specification\"\n    DRAWING = \"drawing\"\n    CONTRACT = \"contract\"\n    RFI = \"rfi\"\n    SUBMITTAL = \"submittal\"\n    CHANGE_ORDER = \"change_order\"\n    MEETING_MINUTES = \"meeting_minutes\"\n    DAILY_REPORT = \"daily_report\"\n    SAFETY_REPORT = \"safety_report\"\n    INSPECTION = \"inspection\"\n    MANUAL = \"manual\"\n    STANDARD = \"standard\"\n\nclass ChunkingStrategy(Enum):\n    \"\"\"Text chunking strategies\"\"\"\n    FIXED_SIZE = \"fixed_size\"\n    PARAGRAPH = \"paragraph\"\n    SECTION = \"section\"\n    SEMANTIC = \"semantic\"\n    SENTENCE = \"sentence\"\n\n@dataclass\nclass DocumentChunk:\n    \"\"\"A chunk of document text\"\"\"\n    id: str\n    document_id: str\n    content: str\n    metadata: Dict[str, Any]\n    embedding: Optional[List[float]] = None\n    token_count: int = 0\n    position: int = 0\n\n@dataclass\nclass Document:\n    \"\"\"Construction document\"\"\"\n    id: str\n    title: str\n    doc_type: DocumentType\n    content: str\n    source: str\n    metadata: Dict[str, Any] = field(default_factory=dict)\n    chunks: List[DocumentChunk] = field(default_factory=list)\n    created_at: datetime = field(default_factory=datetime.now)\n\n@dataclass\nclass SearchResult:\n    \"\"\"Search result from vector store\"\"\"\n    chunk: DocumentChunk\n    score: float\n    document_title: str\n    doc_type: DocumentType\n\n@dataclass\nclass RAGResponse:\n    \"\"\"Response from RAG system\"\"\"\n    query: str\n    answer: str\n    sources: List[SearchResult]\n    confidence: float\n    tokens_used: int\n\n\nclass TextChunker:\n    \"\"\"Split documents into chunks for embedding\"\"\"\n\n    def __init__(\n        self,\n        strategy: ChunkingStrategy = ChunkingStrategy.PARAGRAPH,\n        chunk_size: int = 500,\n        chunk_overlap: int = 50\n    ):\n        self.strategy = strategy\n        self.chunk_size = chunk_size\n        self.chunk_overlap = chunk_overlap\n\n    def chunk_document(self, document: Document) -> List[DocumentChunk]:\n        \"\"\"Split document into chunks\"\"\"\n        if self.strategy == ChunkingStrategy.FIXED_SIZE:\n            return self._chunk_fixed_size(document)\n        elif self.strategy == ChunkingStrategy.PARAGRAPH:\n            return self._chunk_by_paragraph(document)\n        elif self.strategy == ChunkingStrategy.SECTION:\n            return self._chunk_by_section(document)\n        elif self.strategy == ChunkingStrategy.SENTENCE:\n            return self._chunk_by_sentence(document)\n        else:\n            return self._chunk_fixed_size(document)\n\n    def _chunk_fixed_size(self, document: Document) -> List[DocumentChunk]:\n        \"\"\"Chunk by fixed character size with overlap\"\"\"\n        chunks = []\n        text = document.content\n        start = 0\n        position = 0\n\n        while start < len(text):\n            end = start + self.chunk_size\n\n            # Find word boundary\n            if end < len(text):\n                while end > start and text[end] not in ' \\n\\t':\n                    end -= 1\n\n            chunk_text = text[start:end].strip()\n            if chunk_text:\n                chunk_id = self._generate_chunk_id(document.id, position)\n                
chunks.append(DocumentChunk(\n                    id=chunk_id,\n                    document_id=document.id,\n                    content=chunk_text,\n                    metadata={\n                        \"doc_type\": document.doc_type.value,\n                        \"title\": document.title,\n                        **document.metadata\n                    },\n                    token_count=len(chunk_text.split()),\n                    position=position\n                ))\n                position += 1\n\n            start = end - self.chunk_overlap\n            if start >= len(text):\n                break\n\n        return chunks\n\n    def _chunk_by_paragraph(self, document: Document) -> List[DocumentChunk]:\n        \"\"\"Chunk by paragraphs\"\"\"\n        chunks = []\n        paragraphs = document.content.split('\\n\\n')\n        current_chunk = \"\"\n        position = 0\n\n        for para in paragraphs:\n            para = para.strip()\n            if not para:\n                continue\n\n            if len(current_chunk) + len(para) < self.chunk_size:\n                current_chunk += \"\\n\\n\" + para if current_chunk else para\n            else:\n                if current_chunk:\n                    chunk_id = self._generate_chunk_id(document.id, position)\n                    chunks.append(DocumentChunk(\n                        id=chunk_id,\n                        document_id=document.id,\n                        content=current_chunk,\n                        metadata={\n                            \"doc_type\": document.doc_type.value,\n                            \"title\": document.title,\n                            **document.metadata\n                        },\n                        token_count=len(current_chunk.split()),\n                        position=position\n                    ))\n                    position += 1\n                current_chunk = para\n\n        # Add remaining content\n        if current_chunk:\n            chunk_id = self._generate_chunk_id(document.id, position)\n            chunks.append(DocumentChunk(\n                id=chunk_id,\n                document_id=document.id,\n                content=current_chunk,\n                metadata={\n                    \"doc_type\": document.doc_type.value,\n                    \"title\": document.title,\n                    **document.metadata\n                },\n                token_count=len(current_chunk.split()),\n                position=position\n            ))\n\n        return chunks\n\n    def _chunk_by_section(self, document: Document) -> List[DocumentChunk]:\n        \"\"\"Chunk by document sections (headers)\"\"\"\n        # Split by common section patterns\n        section_pattern = r'\\n(?=(?:\\d+\\.|\\d+\\s|SECTION|ARTICLE|PART)\\s+[A-Z])'\n        sections = re.split(section_pattern, document.content)\n\n        chunks = []\n        for position, section in enumerate(sections):\n            section = section.strip()\n            if section:\n                # If section is too large, further split it\n                if len(section) > self.chunk_size * 2:\n                    sub_chunker = TextChunker(ChunkingStrategy.PARAGRAPH, self.chunk_size)\n                    sub_doc = Document(\n                        id=f\"{document.id}_sec{position}\",\n                        title=document.title,\n                        doc_type=document.doc_type,\n                        content=section,\n                        source=document.source,\n                        
metadata=document.metadata\n                    )\n                    sub_chunks = sub_chunker.chunk_document(sub_doc)\n                    for i, chunk in enumerate(sub_chunks):\n                        chunk.id = self._generate_chunk_id(document.id, position * 100 + i)\n                        chunk.position = position * 100 + i\n                    chunks.extend(sub_chunks)\n                else:\n                    chunk_id = self._generate_chunk_id(document.id, position)\n                    chunks.append(DocumentChunk(\n                        id=chunk_id,\n                        document_id=document.id,\n                        content=section,\n                        metadata={\n                            \"doc_type\": document.doc_type.value,\n                            \"title\": document.title,\n                            **document.metadata\n                        },\n                        token_count=len(section.split()),\n                        position=position\n                    ))\n\n        return chunks\n\n    def _chunk_by_sentence(self, document: Document) -> List[DocumentChunk]:\n        \"\"\"Chunk by sentences, grouping to meet size requirements\"\"\"\n        # Simple sentence splitting\n        sentences = re.split(r'(?<=[.!?])\\s+', document.content)\n\n        chunks = []\n        current_chunk = \"\"\n        position = 0\n\n        for sentence in sentences:\n            if len(current_chunk) + len(sentence) < self.chunk_size:\n                current_chunk += \" \" + sentence if current_chunk else sentence\n            else:\n                if current_chunk:\n                    chunk_id = self._generate_chunk_id(document.id, position)\n                    chunks.append(DocumentChunk(\n                        id=chunk_id,\n                        document_id=document.id,\n                        content=current_chunk.strip(),\n                        metadata={\n                            \"doc_type\": document.doc_type.value,\n                            \"title\": document.title,\n                            **document.metadata\n                        },\n                        token_count=len(current_chunk.split()),\n                        position=position\n                    ))\n                    position += 1\n                current_chunk = sentence\n\n        if current_chunk:\n            chunk_id = self._generate_chunk_id(document.id, position)\n            chunks.append(DocumentChunk(\n                id=chunk_id,\n                document_id=document.id,\n                content=current_chunk.strip(),\n                metadata={\n                    \"doc_type\": document.doc_type.value,\n                    \"title\": document.title,\n                    **document.metadata\n                },\n                token_count=len(current_chunk.split()),\n                position=position\n            ))\n\n        return chunks\n\n    def _generate_chunk_id(self, doc_id: str, position: int) -> str:\n        \"\"\"Generate unique chunk ID\"\"\"\n        return hashlib.md5(f\"{doc_id}_{position}\".encode()).hexdigest()[:12]\n\n\nclass VectorStore:\n    \"\"\"Simple in-memory vector store for RAG\"\"\"\n\n    def __init__(self):\n        self.chunks: Dict[str, DocumentChunk] = {}\n        self.embeddings: Dict[str, List[float]] = {}\n\n    def add_chunks(self, chunks: List[DocumentChunk]):\n        \"\"\"Add chunks to the store\"\"\"\n        for chunk in chunks:\n            self.chunks[chunk.id] = chunk\n            if 
chunk.embedding:\n                self.embeddings[chunk.id] = chunk.embedding\n\n    def search(\n        self,\n        query_embedding: List[float],\n        top_k: int = 5,\n        filter_metadata: Optional[Dict] = None\n    ) -> List[Tuple[DocumentChunk, float]]:\n        \"\"\"Search for similar chunks\"\"\"\n        results = []\n\n        for chunk_id, chunk in self.chunks.items():\n            # Apply metadata filter\n            if filter_metadata:\n                match = all(\n                    chunk.metadata.get(k) == v\n                    for k, v in filter_metadata.items()\n                )\n                if not match:\n                    continue\n\n            # Calculate similarity (cosine similarity simulation)\n            if chunk_id in self.embeddings:\n                score = self._cosine_similarity(query_embedding, self.embeddings[chunk_id])\n                results.append((chunk, score))\n\n        # Sort by score descending\n        results.sort(key=lambda x: x[1], reverse=True)\n        return results[:top_k]\n\n    def _cosine_similarity(self, a: List[float], b: List[float]) -> float:\n        \"\"\"Calculate cosine similarity between two vectors\"\"\"\n        if len(a) != len(b):\n            return 0.0\n\n        dot_product = sum(x * y for x, y in zip(a, b))\n        norm_a = sum(x * x for x in a) ** 0.5\n        norm_b = sum(x * x for x in b) ** 0.5\n\n        if norm_a == 0 or norm_b == 0:\n            return 0.0\n\n        return dot_product / (norm_a * norm_b)\n\n    def get_stats(self) -> Dict:\n        \"\"\"Get store statistics\"\"\"\n        doc_types = {}\n        for chunk in self.chunks.values():\n            doc_type = chunk.metadata.get(\"doc_type\", \"unknown\")\n            doc_types[doc_type] = doc_types.get(doc_type, 0) + 1\n\n        return {\n            \"total_chunks\": len(self.chunks),\n            \"chunks_with_embeddings\": len(self.embeddings),\n            \"chunks_by_type\": doc_types\n        }\n\n\nclass EmbeddingModel:\n    \"\"\"Simulated embedding model (replace with actual model in production)\"\"\"\n\n    def __init__(self, model_name: str = \"text-embedding-ada-002\"):\n        self.model_name = model_name\n        self.dimension = 1536\n\n    def embed(self, text: str) -> List[float]:\n        \"\"\"Generate embedding for text\"\"\"\n        # Simulation: generate deterministic embedding based on text hash\n        text_hash = hashlib.sha256(text.encode()).digest()\n        embedding = []\n        for i in range(self.dimension):\n            byte_idx = i % len(text_hash)\n            embedding.append((text_hash[byte_idx] - 128) / 128.0)\n        return embedding\n\n    def embed_batch(self, texts: List[str]) -> List[List[float]]:\n        \"\"\"Generate embeddings for multiple texts\"\"\"\n        return [self.embed(text) for text in texts]\n\n\nclass ConstructionRAG:\n    \"\"\"\n    RAG system for construction knowledge bases.\n    Based on DDC methodology Chapter 2.3.\n    \"\"\"\n\n    def __init__(\n        self,\n        embedding_model: Optional[EmbeddingModel] = None,\n        chunking_strategy: ChunkingStrategy = ChunkingStrategy.PARAGRAPH,\n        chunk_size: int = 500\n    ):\n        self.embedding_model = embedding_model or EmbeddingModel()\n        self.chunker = TextChunker(chunking_strategy, chunk_size)\n        self.vector_store = VectorStore()\n        self.documents: Dict[str, Document] = {}\n\n    def add_document(self, document: Document) -> int:\n        \"\"\"\n        Add a document to the 
knowledge base.\n\n        Args:\n            document: Document to add\n\n        Returns:\n            Number of chunks created\n        \"\"\"\n        # Store document\n        self.documents[document.id] = document\n\n        # Chunk document\n        chunks = self.chunker.chunk_document(document)\n\n        # Generate embeddings\n        for chunk in chunks:\n            chunk.embedding = self.embedding_model.embed(chunk.content)\n\n        # Add to vector store\n        self.vector_store.add_chunks(chunks)\n\n        # Update document with chunks\n        document.chunks = chunks\n\n        return len(chunks)\n\n    def add_documents(self, documents: List[Document]) -> Dict[str, int]:\n        \"\"\"Add multiple documents\"\"\"\n        results = {}\n        for doc in documents:\n            results[doc.id] = self.add_document(doc)\n        return results\n\n    def search(\n        self,\n        query: str,\n        top_k: int = 5,\n        doc_type: Optional[DocumentType] = None\n    ) -> List[SearchResult]:\n        \"\"\"\n        Search the knowledge base.\n\n        Args:\n            query: Search query\n            top_k: Number of results to return\n            doc_type: Filter by document type\n\n        Returns:\n            List of search results\n        \"\"\"\n        # Generate query embedding\n        query_embedding = self.embedding_model.embed(query)\n\n        # Build filter\n        filter_metadata = None\n        if doc_type:\n            filter_metadata = {\"doc_type\": doc_type.value}\n\n        # Search vector store\n        results = self.vector_store.search(\n            query_embedding,\n            top_k=top_k,\n            filter_metadata=filter_metadata\n        )\n\n        # Build search results\n        search_results = []\n        for chunk, score in results:\n            doc = self.documents.get(chunk.document_id)\n            search_results.append(SearchResult(\n                chunk=chunk,\n                score=score,\n                document_title=doc.title if doc else \"Unknown\",\n                doc_type=doc.doc_type if doc else DocumentType.MANUAL\n            ))\n\n        return search_results\n\n    def query(\n        self,\n        question: str,\n        top_k: int = 5,\n        doc_type: Optional[DocumentType] = None\n    ) -> RAGResponse:\n        \"\"\"\n        Answer a question using RAG.\n\n        Args:\n            question: Question to answer\n            top_k: Number of context chunks to use\n            doc_type: Filter by document type\n\n        Returns:\n            RAG response with answer and sources\n        \"\"\"\n        # Search for relevant context\n        search_results = self.search(question, top_k=top_k, doc_type=doc_type)\n\n        if not search_results:\n            return RAGResponse(\n                query=question,\n                answer=\"I couldn't find relevant information to answer this question.\",\n                sources=[],\n                confidence=0.0,\n                tokens_used=0\n            )\n\n        # Build context from search results\n        context_parts = []\n        for i, result in enumerate(search_results):\n            context_parts.append(\n                f\"[Source {i+1}: {result.document_title}]\\n{result.chunk.content}\"\n            )\n\n        context = \"\\n\\n\".join(context_parts)\n\n        # Generate answer (simulated - in production, call LLM)\n        answer = self._generate_answer(question, context, search_results)\n\n        # Calculate confidence\n    
    avg_score = sum(r.score for r in search_results) / len(search_results)\n\n        return RAGResponse(\n            query=question,\n            answer=answer,\n            sources=search_results,\n            confidence=avg_score,\n            tokens_used=len(context.split()) + len(question.split())\n        )\n\n    def _generate_answer(\n        self,\n        question: str,\n        context: str,\n        sources: List[SearchResult]\n    ) -> str:\n        \"\"\"\n        Generate answer from context.\n        In production, this would call an LLM API.\n        \"\"\"\n        # Simulated answer generation\n        answer_parts = [\n            f\"Based on the available construction documentation:\\n\"\n        ]\n\n        # Extract key information from sources\n        for source in sources[:3]:\n            # Take first sentence of each relevant chunk\n            first_sentence = source.chunk.content.split('.')[0] + '.'\n            answer_parts.append(f\"- {first_sentence}\")\n\n        answer_parts.append(\n            f\"\\n\\nThis information comes from {len(sources)} source documents \"\n            f\"including: {', '.join(set(s.document_title for s in sources[:3]))}.\"\n        )\n\n        return \"\\n\".join(answer_parts)\n\n    def get_document_summary(self, document_id: str) -> Optional[Dict]:\n        \"\"\"Get summary of a document\"\"\"\n        doc = self.documents.get(document_id)\n        if not doc:\n            return None\n\n        return {\n            \"id\": doc.id,\n            \"title\": doc.title,\n            \"type\": doc.doc_type.value,\n            \"chunks\": len(doc.chunks),\n            \"total_tokens\": sum(c.token_count for c in doc.chunks),\n            \"source\": doc.source,\n            \"created_at\": doc.created_at.isoformat()\n        }\n\n    def get_stats(self) -> Dict:\n        \"\"\"Get system statistics\"\"\"\n        return {\n            \"total_documents\": len(self.documents),\n            \"vector_store\": self.vector_store.get_stats(),\n            \"embedding_model\": self.embedding_model.model_name,\n            \"chunking_strategy\": self.chunker.strategy.value\n        }\n\n    def export_knowledge_base(self) -> Dict:\n        \"\"\"Export knowledge base for backup/transfer\"\"\"\n        return {\n            \"documents\": [\n                {\n                    \"id\": doc.id,\n                    \"title\": doc.title,\n                    \"type\": doc.doc_type.value,\n                    \"content\": doc.content,\n                    \"source\": doc.source,\n                    \"metadata\": doc.metadata\n                }\n                for doc in self.documents.values()\n            ],\n            \"stats\": self.get_stats(),\n            \"exported_at\": datetime.now().isoformat()\n        }"
      },
      {
        "title": "Build Construction Knowledge Base",
        "body": "rag = ConstructionRAG(\n    chunking_strategy=ChunkingStrategy.SECTION,\n    chunk_size=500\n)\n\n# Add specifications\nspec_doc = Document(\n    id=\"spec-03300\",\n    title=\"Cast-in-Place Concrete Specification\",\n    doc_type=DocumentType.SPECIFICATION,\n    content=\"\"\"\n    SECTION 03 30 00 - CAST-IN-PLACE CONCRETE\n\n    PART 1 - GENERAL\n    1.1 SUMMARY\n    A. Section includes cast-in-place concrete for foundations,\n       slabs, walls, and other structural elements.\n\n    1.2 RELATED SECTIONS\n    A. Section 03 10 00 - Concrete Forming\n    B. Section 03 20 00 - Concrete Reinforcing\n\n    PART 2 - PRODUCTS\n    2.1 CONCRETE MATERIALS\n    A. Portland Cement: ASTM C150, Type I or II\n    B. Aggregates: ASTM C33, graded\n    C. Water: Clean, potable\n    \"\"\",\n    source=\"project_specs.pdf\",\n    metadata={\"division\": \"03\", \"project\": \"Building A\"}\n)\n\nchunks_created = rag.add_document(spec_doc)\nprint(f\"Created {chunks_created} chunks\")"
      },
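      {
        "title": "Add Multiple Documents",
        "body": "# A minimal sketch, not part of the original package code: batch ingestion\n# with add_documents(), using only the classes defined in Quick Start.\n# The RFI document below is a hypothetical example.\nrfi_doc = Document(\n    id=\"rfi-0042\",\n    title=\"RFI 0042 - Foundation Cement Type\",\n    doc_type=DocumentType.RFI,\n    content=\"Question: Can Type II cement be substituted for Type I on the foundation pour? Response: Yes, Section 03 30 00 permits ASTM C150 Type I or II Portland cement.\",\n    source=\"rfi_log.pdf\",\n    metadata={\"project\": \"Building A\"}\n)\n\nresults = rag.add_documents([rfi_doc])\nprint(results)  # {document_id: number_of_chunks_created}\nprint(rag.get_stats())  # document count, chunks by type, chunking strategy"
      },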
      {
        "title": "Search Knowledge Base",
        "body": "# Search for concrete requirements\nresults = rag.search(\n    query=\"concrete strength requirements\",\n    top_k=5,\n    doc_type=DocumentType.SPECIFICATION\n)\n\nfor result in results:\n    print(f\"Score: {result.score:.3f}\")\n    print(f\"Document: {result.document_title}\")\n    print(f\"Content: {result.chunk.content[:200]}...\")\n    print()"
      },
      {
        "title": "Answer Questions with RAG",
        "body": "response = rag.query(\n    question=\"What type of cement should be used for foundations?\",\n    top_k=3\n)\n\nprint(f\"Answer: {response.answer}\")\nprint(f\"Confidence: {response.confidence:.0%}\")\nprint(f\"Sources: {len(response.sources)}\")"
      },
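      {
        "title": "Inspect and Export the Knowledge Base",
        "body": "# A minimal sketch, not part of the original package code: inspecting and\n# exporting the knowledge base with the helper methods defined in Quick Start.\n# The output file name is a hypothetical example.\nsummary = rag.get_document_summary(\"spec-03300\")\nprint(summary)  # chunk count, total tokens, source, created_at\n\nstats = rag.get_stats()\nprint(f\"Documents: {stats['total_documents']}\")\n\n# Export for backup or transfer (json is imported in Quick Start)\nexport = rag.export_knowledge_base()\nwith open(\"knowledge_base_export.json\", \"w\") as f:\n    json.dump(export, f, indent=2)\nprint(f\"Exported {len(export['documents'])} documents\")"
      },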
      {
        "title": "Quick Reference",
        "body": "ComponentPurposeConstructionRAGMain RAG systemTextChunkerDocument chunkingVectorStoreEmbedding storageEmbeddingModelText embeddingsDocumentChunkChunk with metadataRAGResponseQuery response"
      },
      {
        "title": "Resources",
        "body": "Book: \"Data-Driven Construction\" by Artem Boiko, Chapter 2.3\nWebsite: https://datadrivenconstruction.io"
      },
      {
        "title": "Next Steps",
        "body": "Use llm-data-automation for automation\nUse vector-search for advanced search\nUse document-classification-nlp for classification"
      }
    ],
    "body": "RAG Construction\nOverview\n\nBased on DDC methodology (Chapter 2.3), this skill builds Retrieval-Augmented Generation (RAG) systems for construction knowledge bases, enabling semantic search and AI-powered question answering over construction documents.\n\nBook Reference: \"Pandas DataFrame и LLM ChatGPT\" / \"Pandas DataFrame and LLM ChatGPT\"\n\nQuick Start\nfrom dataclasses import dataclass, field\nfrom enum import Enum\nfrom typing import List, Dict, Optional, Any, Callable\nfrom datetime import datetime\nimport json\nimport hashlib\nimport re\n\nclass DocumentType(Enum):\n    \"\"\"Types of construction documents\"\"\"\n    SPECIFICATION = \"specification\"\n    DRAWING = \"drawing\"\n    CONTRACT = \"contract\"\n    RFI = \"rfi\"\n    SUBMITTAL = \"submittal\"\n    CHANGE_ORDER = \"change_order\"\n    MEETING_MINUTES = \"meeting_minutes\"\n    DAILY_REPORT = \"daily_report\"\n    SAFETY_REPORT = \"safety_report\"\n    INSPECTION = \"inspection\"\n    MANUAL = \"manual\"\n    STANDARD = \"standard\"\n\nclass ChunkingStrategy(Enum):\n    \"\"\"Text chunking strategies\"\"\"\n    FIXED_SIZE = \"fixed_size\"\n    PARAGRAPH = \"paragraph\"\n    SECTION = \"section\"\n    SEMANTIC = \"semantic\"\n    SENTENCE = \"sentence\"\n\n@dataclass\nclass DocumentChunk:\n    \"\"\"A chunk of document text\"\"\"\n    id: str\n    document_id: str\n    content: str\n    metadata: Dict[str, Any]\n    embedding: Optional[List[float]] = None\n    token_count: int = 0\n    position: int = 0\n\n@dataclass\nclass Document:\n    \"\"\"Construction document\"\"\"\n    id: str\n    title: str\n    doc_type: DocumentType\n    content: str\n    source: str\n    metadata: Dict[str, Any] = field(default_factory=dict)\n    chunks: List[DocumentChunk] = field(default_factory=list)\n    created_at: datetime = field(default_factory=datetime.now)\n\n@dataclass\nclass SearchResult:\n    \"\"\"Search result from vector store\"\"\"\n    chunk: DocumentChunk\n    score: float\n    document_title: str\n    doc_type: DocumentType\n\n@dataclass\nclass RAGResponse:\n    \"\"\"Response from RAG system\"\"\"\n    query: str\n    answer: str\n    sources: List[SearchResult]\n    confidence: float\n    tokens_used: int\n\n\nclass TextChunker:\n    \"\"\"Split documents into chunks for embedding\"\"\"\n\n    def __init__(\n        self,\n        strategy: ChunkingStrategy = ChunkingStrategy.PARAGRAPH,\n        chunk_size: int = 500,\n        chunk_overlap: int = 50\n    ):\n        self.strategy = strategy\n        self.chunk_size = chunk_size\n        self.chunk_overlap = chunk_overlap\n\n    def chunk_document(self, document: Document) -> List[DocumentChunk]:\n        \"\"\"Split document into chunks\"\"\"\n        if self.strategy == ChunkingStrategy.FIXED_SIZE:\n            return self._chunk_fixed_size(document)\n        elif self.strategy == ChunkingStrategy.PARAGRAPH:\n            return self._chunk_by_paragraph(document)\n        elif self.strategy == ChunkingStrategy.SECTION:\n            return self._chunk_by_section(document)\n        elif self.strategy == ChunkingStrategy.SENTENCE:\n            return self._chunk_by_sentence(document)\n        else:\n            return self._chunk_fixed_size(document)\n\n    def _chunk_fixed_size(self, document: Document) -> List[DocumentChunk]:\n        \"\"\"Chunk by fixed character size with overlap\"\"\"\n        chunks = []\n        text = document.content\n        start = 0\n        position = 0\n\n        while start < len(text):\n            end = start + 
self.chunk_size\n\n            # Find word boundary\n            if end < len(text):\n                while end > start and text[end] not in ' \\n\\t':\n                    end -= 1\n\n            chunk_text = text[start:end].strip()\n            if chunk_text:\n                chunk_id = self._generate_chunk_id(document.id, position)\n                chunks.append(DocumentChunk(\n                    id=chunk_id,\n                    document_id=document.id,\n                    content=chunk_text,\n                    metadata={\n                        \"doc_type\": document.doc_type.value,\n                        \"title\": document.title,\n                        **document.metadata\n                    },\n                    token_count=len(chunk_text.split()),\n                    position=position\n                ))\n                position += 1\n\n            start = end - self.chunk_overlap\n            if start >= len(text):\n                break\n\n        return chunks\n\n    def _chunk_by_paragraph(self, document: Document) -> List[DocumentChunk]:\n        \"\"\"Chunk by paragraphs\"\"\"\n        chunks = []\n        paragraphs = document.content.split('\\n\\n')\n        current_chunk = \"\"\n        position = 0\n\n        for para in paragraphs:\n            para = para.strip()\n            if not para:\n                continue\n\n            if len(current_chunk) + len(para) < self.chunk_size:\n                current_chunk += \"\\n\\n\" + para if current_chunk else para\n            else:\n                if current_chunk:\n                    chunk_id = self._generate_chunk_id(document.id, position)\n                    chunks.append(DocumentChunk(\n                        id=chunk_id,\n                        document_id=document.id,\n                        content=current_chunk,\n                        metadata={\n                            \"doc_type\": document.doc_type.value,\n                            \"title\": document.title,\n                            **document.metadata\n                        },\n                        token_count=len(current_chunk.split()),\n                        position=position\n                    ))\n                    position += 1\n                current_chunk = para\n\n        # Add remaining content\n        if current_chunk:\n            chunk_id = self._generate_chunk_id(document.id, position)\n            chunks.append(DocumentChunk(\n                id=chunk_id,\n                document_id=document.id,\n                content=current_chunk,\n                metadata={\n                    \"doc_type\": document.doc_type.value,\n                    \"title\": document.title,\n                    **document.metadata\n                },\n                token_count=len(current_chunk.split()),\n                position=position\n            ))\n\n        return chunks\n\n    def _chunk_by_section(self, document: Document) -> List[DocumentChunk]:\n        \"\"\"Chunk by document sections (headers)\"\"\"\n        # Split by common section patterns\n        section_pattern = r'\\n(?=(?:\\d+\\.|\\d+\\s|SECTION|ARTICLE|PART)\\s+[A-Z])'\n        sections = re.split(section_pattern, document.content)\n\n        chunks = []\n        for position, section in enumerate(sections):\n            section = section.strip()\n            if section:\n                # If section is too large, further split it\n                if len(section) > self.chunk_size * 2:\n                    sub_chunker = 
TextChunker(ChunkingStrategy.PARAGRAPH, self.chunk_size)\n                    sub_doc = Document(\n                        id=f\"{document.id}_sec{position}\",\n                        title=document.title,\n                        doc_type=document.doc_type,\n                        content=section,\n                        source=document.source,\n                        metadata=document.metadata\n                    )\n                    sub_chunks = sub_chunker.chunk_document(sub_doc)\n                    for i, chunk in enumerate(sub_chunks):\n                        chunk.id = self._generate_chunk_id(document.id, position * 100 + i)\n                        chunk.position = position * 100 + i\n                    chunks.extend(sub_chunks)\n                else:\n                    chunk_id = self._generate_chunk_id(document.id, position)\n                    chunks.append(DocumentChunk(\n                        id=chunk_id,\n                        document_id=document.id,\n                        content=section,\n                        metadata={\n                            \"doc_type\": document.doc_type.value,\n                            \"title\": document.title,\n                            **document.metadata\n                        },\n                        token_count=len(section.split()),\n                        position=position\n                    ))\n\n        return chunks\n\n    def _chunk_by_sentence(self, document: Document) -> List[DocumentChunk]:\n        \"\"\"Chunk by sentences, grouping to meet size requirements\"\"\"\n        # Simple sentence splitting\n        sentences = re.split(r'(?<=[.!?])\\s+', document.content)\n\n        chunks = []\n        current_chunk = \"\"\n        position = 0\n\n        for sentence in sentences:\n            if len(current_chunk) + len(sentence) < self.chunk_size:\n                current_chunk += \" \" + sentence if current_chunk else sentence\n            else:\n                if current_chunk:\n                    chunk_id = self._generate_chunk_id(document.id, position)\n                    chunks.append(DocumentChunk(\n                        id=chunk_id,\n                        document_id=document.id,\n                        content=current_chunk.strip(),\n                        metadata={\n                            \"doc_type\": document.doc_type.value,\n                            \"title\": document.title,\n                            **document.metadata\n                        },\n                        token_count=len(current_chunk.split()),\n                        position=position\n                    ))\n                    position += 1\n                current_chunk = sentence\n\n        if current_chunk:\n            chunk_id = self._generate_chunk_id(document.id, position)\n            chunks.append(DocumentChunk(\n                id=chunk_id,\n                document_id=document.id,\n                content=current_chunk.strip(),\n                metadata={\n                    \"doc_type\": document.doc_type.value,\n                    \"title\": document.title,\n                    **document.metadata\n                },\n                token_count=len(current_chunk.split()),\n                position=position\n            ))\n\n        return chunks\n\n    def _generate_chunk_id(self, doc_id: str, position: int) -> str:\n        \"\"\"Generate unique chunk ID\"\"\"\n        return hashlib.md5(f\"{doc_id}_{position}\".encode()).hexdigest()[:12]\n\n\nclass VectorStore:\n    
\"\"\"Simple in-memory vector store for RAG\"\"\"\n\n    def __init__(self):\n        self.chunks: Dict[str, DocumentChunk] = {}\n        self.embeddings: Dict[str, List[float]] = {}\n\n    def add_chunks(self, chunks: List[DocumentChunk]):\n        \"\"\"Add chunks to the store\"\"\"\n        for chunk in chunks:\n            self.chunks[chunk.id] = chunk\n            if chunk.embedding:\n                self.embeddings[chunk.id] = chunk.embedding\n\n    def search(\n        self,\n        query_embedding: List[float],\n        top_k: int = 5,\n        filter_metadata: Optional[Dict] = None\n    ) -> List[Tuple[DocumentChunk, float]]:\n        \"\"\"Search for similar chunks\"\"\"\n        results = []\n\n        for chunk_id, chunk in self.chunks.items():\n            # Apply metadata filter\n            if filter_metadata:\n                match = all(\n                    chunk.metadata.get(k) == v\n                    for k, v in filter_metadata.items()\n                )\n                if not match:\n                    continue\n\n            # Calculate similarity (cosine similarity simulation)\n            if chunk_id in self.embeddings:\n                score = self._cosine_similarity(query_embedding, self.embeddings[chunk_id])\n                results.append((chunk, score))\n\n        # Sort by score descending\n        results.sort(key=lambda x: x[1], reverse=True)\n        return results[:top_k]\n\n    def _cosine_similarity(self, a: List[float], b: List[float]) -> float:\n        \"\"\"Calculate cosine similarity between two vectors\"\"\"\n        if len(a) != len(b):\n            return 0.0\n\n        dot_product = sum(x * y for x, y in zip(a, b))\n        norm_a = sum(x * x for x in a) ** 0.5\n        norm_b = sum(x * x for x in b) ** 0.5\n\n        if norm_a == 0 or norm_b == 0:\n            return 0.0\n\n        return dot_product / (norm_a * norm_b)\n\n    def get_stats(self) -> Dict:\n        \"\"\"Get store statistics\"\"\"\n        doc_types = {}\n        for chunk in self.chunks.values():\n            doc_type = chunk.metadata.get(\"doc_type\", \"unknown\")\n            doc_types[doc_type] = doc_types.get(doc_type, 0) + 1\n\n        return {\n            \"total_chunks\": len(self.chunks),\n            \"chunks_with_embeddings\": len(self.embeddings),\n            \"chunks_by_type\": doc_types\n        }\n\n\nclass EmbeddingModel:\n    \"\"\"Simulated embedding model (replace with actual model in production)\"\"\"\n\n    def __init__(self, model_name: str = \"text-embedding-ada-002\"):\n        self.model_name = model_name\n        self.dimension = 1536\n\n    def embed(self, text: str) -> List[float]:\n        \"\"\"Generate embedding for text\"\"\"\n        # Simulation: generate deterministic embedding based on text hash\n        text_hash = hashlib.sha256(text.encode()).digest()\n        embedding = []\n        for i in range(self.dimension):\n            byte_idx = i % len(text_hash)\n            embedding.append((text_hash[byte_idx] - 128) / 128.0)\n        return embedding\n\n    def embed_batch(self, texts: List[str]) -> List[List[float]]:\n        \"\"\"Generate embeddings for multiple texts\"\"\"\n        return [self.embed(text) for text in texts]\n\n\nclass ConstructionRAG:\n    \"\"\"\n    RAG system for construction knowledge bases.\n    Based on DDC methodology Chapter 2.3.\n    \"\"\"\n\n    def __init__(\n        self,\n        embedding_model: Optional[EmbeddingModel] = None,\n        chunking_strategy: ChunkingStrategy = 
ChunkingStrategy.PARAGRAPH,\n        chunk_size: int = 500\n    ):\n        self.embedding_model = embedding_model or EmbeddingModel()\n        self.chunker = TextChunker(chunking_strategy, chunk_size)\n        self.vector_store = VectorStore()\n        self.documents: Dict[str, Document] = {}\n\n    def add_document(self, document: Document) -> int:\n        \"\"\"\n        Add a document to the knowledge base.\n\n        Args:\n            document: Document to add\n\n        Returns:\n            Number of chunks created\n        \"\"\"\n        # Store document\n        self.documents[document.id] = document\n\n        # Chunk document\n        chunks = self.chunker.chunk_document(document)\n\n        # Generate embeddings\n        for chunk in chunks:\n            chunk.embedding = self.embedding_model.embed(chunk.content)\n\n        # Add to vector store\n        self.vector_store.add_chunks(chunks)\n\n        # Update document with chunks\n        document.chunks = chunks\n\n        return len(chunks)\n\n    def add_documents(self, documents: List[Document]) -> Dict[str, int]:\n        \"\"\"Add multiple documents\"\"\"\n        results = {}\n        for doc in documents:\n            results[doc.id] = self.add_document(doc)\n        return results\n\n    def search(\n        self,\n        query: str,\n        top_k: int = 5,\n        doc_type: Optional[DocumentType] = None\n    ) -> List[SearchResult]:\n        \"\"\"\n        Search the knowledge base.\n\n        Args:\n            query: Search query\n            top_k: Number of results to return\n            doc_type: Filter by document type\n\n        Returns:\n            List of search results\n        \"\"\"\n        # Generate query embedding\n        query_embedding = self.embedding_model.embed(query)\n\n        # Build filter\n        filter_metadata = None\n        if doc_type:\n            filter_metadata = {\"doc_type\": doc_type.value}\n\n        # Search vector store\n        results = self.vector_store.search(\n            query_embedding,\n            top_k=top_k,\n            filter_metadata=filter_metadata\n        )\n\n        # Build search results\n        search_results = []\n        for chunk, score in results:\n            doc = self.documents.get(chunk.document_id)\n            search_results.append(SearchResult(\n                chunk=chunk,\n                score=score,\n                document_title=doc.title if doc else \"Unknown\",\n                doc_type=doc.doc_type if doc else DocumentType.MANUAL\n            ))\n\n        return search_results\n\n    def query(\n        self,\n        question: str,\n        top_k: int = 5,\n        doc_type: Optional[DocumentType] = None\n    ) -> RAGResponse:\n        \"\"\"\n        Answer a question using RAG.\n\n        Args:\n            question: Question to answer\n            top_k: Number of context chunks to use\n            doc_type: Filter by document type\n\n        Returns:\n            RAG response with answer and sources\n        \"\"\"\n        # Search for relevant context\n        search_results = self.search(question, top_k=top_k, doc_type=doc_type)\n\n        if not search_results:\n            return RAGResponse(\n                query=question,\n                answer=\"I couldn't find relevant information to answer this question.\",\n                sources=[],\n                confidence=0.0,\n                tokens_used=0\n            )\n\n        # Build context from search results\n        context_parts = []\n        for i, result 
in enumerate(search_results):\n            context_parts.append(\n                f\"[Source {i+1}: {result.document_title}]\\n{result.chunk.content}\"\n            )\n\n        context = \"\\n\\n\".join(context_parts)\n\n        # Generate answer (simulated - in production, call LLM)\n        answer = self._generate_answer(question, context, search_results)\n\n        # Calculate confidence\n        avg_score = sum(r.score for r in search_results) / len(search_results)\n\n        return RAGResponse(\n            query=question,\n            answer=answer,\n            sources=search_results,\n            confidence=avg_score,\n            tokens_used=len(context.split()) + len(question.split())\n        )\n\n    def _generate_answer(\n        self,\n        question: str,\n        context: str,\n        sources: List[SearchResult]\n    ) -> str:\n        \"\"\"\n        Generate answer from context.\n        In production, this would call an LLM API.\n        \"\"\"\n        # Simulated answer generation\n        answer_parts = [\n            f\"Based on the available construction documentation:\\n\"\n        ]\n\n        # Extract key information from sources\n        for source in sources[:3]:\n            # Take first sentence of each relevant chunk\n            first_sentence = source.chunk.content.split('.')[0] + '.'\n            answer_parts.append(f\"- {first_sentence}\")\n\n        answer_parts.append(\n            f\"\\n\\nThis information comes from {len(sources)} source documents \"\n            f\"including: {', '.join(set(s.document_title for s in sources[:3]))}.\"\n        )\n\n        return \"\\n\".join(answer_parts)\n\n    def get_document_summary(self, document_id: str) -> Optional[Dict]:\n        \"\"\"Get summary of a document\"\"\"\n        doc = self.documents.get(document_id)\n        if not doc:\n            return None\n\n        return {\n            \"id\": doc.id,\n            \"title\": doc.title,\n            \"type\": doc.doc_type.value,\n            \"chunks\": len(doc.chunks),\n            \"total_tokens\": sum(c.token_count for c in doc.chunks),\n            \"source\": doc.source,\n            \"created_at\": doc.created_at.isoformat()\n        }\n\n    def get_stats(self) -> Dict:\n        \"\"\"Get system statistics\"\"\"\n        return {\n            \"total_documents\": len(self.documents),\n            \"vector_store\": self.vector_store.get_stats(),\n            \"embedding_model\": self.embedding_model.model_name,\n            \"chunking_strategy\": self.chunker.strategy.value\n        }\n\n    def export_knowledge_base(self) -> Dict:\n        \"\"\"Export knowledge base for backup/transfer\"\"\"\n        return {\n            \"documents\": [\n                {\n                    \"id\": doc.id,\n                    \"title\": doc.title,\n                    \"type\": doc.doc_type.value,\n                    \"content\": doc.content,\n                    \"source\": doc.source,\n                    \"metadata\": doc.metadata\n                }\n                for doc in self.documents.values()\n            ],\n            \"stats\": self.get_stats(),\n            \"exported_at\": datetime.now().isoformat()\n        }\n\nCommon Use Cases\nBuild Construction Knowledge Base\nrag = ConstructionRAG(\n    chunking_strategy=ChunkingStrategy.SECTION,\n    chunk_size=500\n)\n\n# Add specifications\nspec_doc = Document(\n    id=\"spec-03300\",\n    title=\"Cast-in-Place Concrete Specification\",\n    doc_type=DocumentType.SPECIFICATION,\n    
content=\"\"\"\n    SECTION 03 30 00 - CAST-IN-PLACE CONCRETE\n\n    PART 1 - GENERAL\n    1.1 SUMMARY\n    A. Section includes cast-in-place concrete for foundations,\n       slabs, walls, and other structural elements.\n\n    1.2 RELATED SECTIONS\n    A. Section 03 10 00 - Concrete Forming\n    B. Section 03 20 00 - Concrete Reinforcing\n\n    PART 2 - PRODUCTS\n    2.1 CONCRETE MATERIALS\n    A. Portland Cement: ASTM C150, Type I or II\n    B. Aggregates: ASTM C33, graded\n    C. Water: Clean, potable\n    \"\"\",\n    source=\"project_specs.pdf\",\n    metadata={\"division\": \"03\", \"project\": \"Building A\"}\n)\n\nchunks_created = rag.add_document(spec_doc)\nprint(f\"Created {chunks_created} chunks\")\n\nSearch Knowledge Base\n# Search for concrete requirements\nresults = rag.search(\n    query=\"concrete strength requirements\",\n    top_k=5,\n    doc_type=DocumentType.SPECIFICATION\n)\n\nfor result in results:\n    print(f\"Score: {result.score:.3f}\")\n    print(f\"Document: {result.document_title}\")\n    print(f\"Content: {result.chunk.content[:200]}...\")\n    print()\n\nAnswer Questions with RAG\nresponse = rag.query(\n    question=\"What type of cement should be used for foundations?\",\n    top_k=3\n)\n\nprint(f\"Answer: {response.answer}\")\nprint(f\"Confidence: {response.confidence:.0%}\")\nprint(f\"Sources: {len(response.sources)}\")\n\nQuick Reference\nComponent\tPurpose\nConstructionRAG\tMain RAG system\nTextChunker\tDocument chunking\nVectorStore\tEmbedding storage\nEmbeddingModel\tText embeddings\nDocumentChunk\tChunk with metadata\nRAGResponse\tQuery response\nResources\nBook: \"Data-Driven Construction\" by Artem Boiko, Chapter 2.3\nWebsite: https://datadrivenconstruction.io\nNext Steps\nUse llm-data-automation for automation\nUse vector-search for advanced search\nUse document-classification-nlp for classification"
  },
  "trust": {
    "sourceLabel": "tencent",
    "provenanceUrl": "https://clawhub.ai/datadrivenconstruction/rag-construction",
    "publisherUrl": "https://clawhub.ai/datadrivenconstruction/rag-construction",
    "owner": "datadrivenconstruction",
    "version": "2.1.0",
    "license": null,
    "verificationStatus": "Indexed source record"
  },
  "links": {
    "detailUrl": "https://openagent3.xyz/skills/rag-construction",
    "downloadUrl": "https://openagent3.xyz/downloads/rag-construction",
    "agentUrl": "https://openagent3.xyz/skills/rag-construction/agent",
    "manifestUrl": "https://openagent3.xyz/skills/rag-construction/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/rag-construction/agent.md"
  }
}