# Send RAG Engineering to your agent
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
## Fast path
- Download the package from Yavira.
- Extract it into a folder your agent can access.
- Paste one of the prompts below and point your agent at the extracted folder.
## Suggested prompts
### New install

```text
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Then review README.md for any prerequisites, environment setup, or post-install checks. Tell me what you changed and call out any manual steps you could not complete.
```
### Upgrade existing

```text
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Then review README.md for any prerequisites, environment setup, or post-install checks. Summarize what changed and any follow-up checks I should run.
```
## Machine-readable fields
```json
{
  "schemaVersion": "1.0",
  "item": {
    "slug": "afrexai-rag-engineering",
    "name": "RAG Engineering",
    "source": "tencent",
    "type": "skill",
    "category": "AI 智能",
    "sourceUrl": "https://clawhub.ai/1kalin/afrexai-rag-engineering",
    "canonicalUrl": "https://clawhub.ai/1kalin/afrexai-rag-engineering",
    "targetPlatform": "OpenClaw"
  },
  "install": {
    "downloadUrl": "/downloads/afrexai-rag-engineering",
    "sourceDownloadUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=afrexai-rag-engineering",
    "sourcePlatform": "tencent",
    "targetPlatform": "OpenClaw",
    "packageFormat": "ZIP package",
    "primaryDoc": "SKILL.md",
    "includedAssets": [
      "README.md",
      "SKILL.md"
    ],
    "downloadMode": "redirect",
    "sourceHealth": {
      "source": "tencent",
      "slug": "afrexai-rag-engineering",
      "status": "healthy",
      "reason": "direct_download_ok",
      "recommendedAction": "download",
      "checkedAt": "2026-05-03T07:46:55.272Z",
      "expiresAt": "2026-05-10T07:46:55.272Z",
      "httpStatus": 200,
      "finalUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=afrexai-rag-engineering",
      "contentType": "application/zip",
      "probeMethod": "head",
      "details": {
        "probeUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=afrexai-rag-engineering",
        "contentDisposition": "attachment; filename=\"afrexai-rag-engineering-1.0.0.zip\"",
        "redirectLocation": null,
        "bodySnippet": null,
        "slug": "afrexai-rag-engineering"
      },
      "scope": "item",
      "summary": "Item download looks usable.",
      "detail": "Yavira can redirect you to the upstream package for this item.",
      "primaryActionLabel": "Download for OpenClaw",
      "primaryActionHref": "/downloads/afrexai-rag-engineering"
    },
    "validation": {
      "installChecklist": [
        "Use the Yavira download entry.",
        "Review SKILL.md after the package is downloaded.",
        "Confirm the extracted package contains the expected setup assets."
      ],
      "postInstallChecks": [
        "Confirm the extracted package includes the expected docs or setup files.",
        "Validate the skill or prompts are available in your target agent workspace.",
        "Capture any manual follow-up steps the agent could not complete."
      ]
    }
  },
  "links": {
    "detailUrl": "https://openagent3.xyz/skills/afrexai-rag-engineering",
    "downloadUrl": "https://openagent3.xyz/downloads/afrexai-rag-engineering",
    "agentUrl": "https://openagent3.xyz/skills/afrexai-rag-engineering/agent",
    "manifestUrl": "https://openagent3.xyz/skills/afrexai-rag-engineering/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/afrexai-rag-engineering/agent.md"
  }
}
```
## Documentation

### RAG Engineering — Complete Retrieval-Augmented Generation System

Build production RAG systems that actually work. From chunking strategy to evaluation — the complete methodology.

You are an expert RAG engineer. When the user needs to build, optimize, or debug a RAG system, follow this complete methodology.

### Quick Health Check (Existing Systems)

| Signal | Healthy | Warning | Critical |
|---|---|---|---|
| Answer relevance | >85% users satisfied | 60-85% | <60% |
| Retrieval precision@5 | >70% relevant chunks | 40-70% | <40% |
| Hallucination rate | <5% | 5-15% | >15% |
| Latency (P95) | <3s | 3-8s | >8s |
| Context utilization | >60% of retrieved used | 30-60% | <30% |
| Cost per query | <$0.05 | $0.05-0.20 | >$0.20 |

### RAG Project Brief

```yaml
rag_brief:
  project: "[name]"
  date: "YYYY-MM-DD"

  # What problem are we solving?
  use_case: "[customer support / code search / document Q&A / research / legal / medical]"
  user_persona: "[who asks questions]"
  query_types:
    - factual: "[percentage] — direct fact lookup"
    - analytical: "[percentage] — synthesis across documents"
    - procedural: "[percentage] — how-to, step-by-step"
    - comparative: "[percentage] — compare X vs Y"
    - conversational: "[percentage] — multi-turn follow-ups"

  # What data do we have?
  corpus:
    total_documents: "[count]"
    total_size: "[GB/TB]"
    document_types:
      - type: "[PDF/HTML/markdown/code/JSON/CSV]"
        count: "[count]"
        avg_length: "[pages/tokens]"
    update_frequency: "[static / daily / real-time]"
    languages: ["en", "..."]
    quality: "[curated / mixed / noisy]"

  # Requirements
  accuracy_target: "[% — start with 85%]"
  latency_target: "[ms P95]"
  max_cost_per_query: "[$]"
  scale: "[queries/day]"
  multi_turn: "[yes/no]"
  citations_required: "[yes/no]"

  # Constraints
  deployment: "[cloud / on-prem / hybrid]"
  data_sensitivity: "[public / internal / PII / regulated]"
  budget: "[$/month for infrastructure]"
```

### RAG Architecture Decision Tree

```text
Is your corpus < 100 documents AND < 50 pages each?
├─ YES → Consider full-context stuffing (no RAG needed)
│        Use: Long-context model (Gemini 1M, Claude 200K)
│        When: Static docs, low query volume, budget allows
│
└─ NO → RAG is appropriate
         │
         Is real-time freshness critical?
         ├─ YES → Streaming RAG with incremental indexing
         └─ NO → Batch-indexed RAG
                  │
                  Do queries need multi-step reasoning?
                  ├─ YES → Agentic RAG (query planning + tool use)
                  └─ NO → Standard retrieval pipeline
                           │
                           Single document type?
                           ├─ YES → Single-index RAG
                           └─ NO → Multi-index with routing
```

### Architecture Patterns

| Pattern | Use Case | Complexity | Quality |
|---|---|---|---|
| Naive RAG | Simple Q&A, prototypes | Low | Medium |
| Advanced RAG | Production systems | Medium | High |
| Modular RAG | Complex multi-source | High | Highest |
| Agentic RAG | Multi-step research | Highest | Highest |
| Graph RAG | Entity-heavy domains | High | High for relational queries |
| Hybrid RAG | Mixed query types | Medium-High | High |

### Document Processing Pipeline

Raw Documents → Extraction → Cleaning → Enrichment → Chunking → Embedding → Indexing

### Extraction Strategy by Document Type

| Document Type | Extraction Tool | Key Challenges | Quality Tips |
|---|---|---|---|
| PDF | PyMuPDF, Unstructured, Docling | Tables, images, multi-column | Use layout-aware parser; OCR for scanned |
| HTML | BeautifulSoup, Trafilatura | Boilerplate, navigation | Extract main content only; preserve headers |
| Markdown | Direct parse | Minimal | Preserve structure; handle frontmatter |
| Code | Tree-sitter, AST | Context loss | Include file path + imports as metadata |
| CSV/JSON | pandas, jq | Schema understanding | Convert rows to natural language |
| DOCX/PPTX | python-docx, python-pptx | Formatting, embedded media | Extract text + table structure |
| Images | GPT-4V, Claude Vision | OCR accuracy | Generate text descriptions; store as metadata |
| Audio/Video | Whisper, Assembly | Timestamps, speakers | Chunk by speaker turn or topic segment |

### Cleaning Checklist

- [ ] Remove headers/footers/page numbers (PDF artifacts)
- [ ] Normalize whitespace (collapse multiple spaces/newlines)
- [ ] Fix encoding issues (UTF-8 normalize)
- [ ] Remove boilerplate (disclaimers, repeated navigation)
- [ ] Preserve meaningful formatting (tables, lists, code blocks)
- [ ] Handle special characters and Unicode consistently
- [ ] Detect and flag low-quality documents (OCR confidence < 80%)
- [ ] Deduplicate (exact + near-duplicate detection)

### Metadata Enrichment

Always extract and store:

```yaml
document_metadata:
  source_id: "[unique document identifier]"
  source_url: "[original URL or file path]"
  title: "[document title]"
  author: "[if available]"
  created_date: "[ISO 8601]"
  modified_date: "[ISO 8601]"
  document_type: "[pdf/html/md/code/...]"
  language: "[ISO 639-1]"
  section_hierarchy: ["Chapter", "Section", "Subsection"]
  tags: ["auto-generated", "topic", "tags"]
  access_level: "[public/internal/restricted]"
  quality_score: "[0-100 from cleaning pipeline]"
```

Enrichment strategies:

- Auto-generate summaries per document (for hybrid search)
- Extract entities (people, companies, products, dates)
- Classify by topic/category
- Generate hypothetical questions (HyDE technique at index time)

### The Chunking Decision Is Critical

Bad chunking is the #1 cause of poor RAG quality. No amount of model sophistication fixes bad chunks.

### Chunking Method Selection

| Method | Best For | Chunk Quality | Implementation |
|---|---|---|---|
| Fixed-size | Homogeneous text, quick prototype | Medium | Simple |
| Recursive character | General purpose, structured docs | Good | LangChain default |
| Semantic | Varied content, topic shifts | High | Embedding-based |
| Document-structure | Technical docs, legal, academic | Highest | Custom per doc type |
| Agentic/LLM | High-value docs, complex structure | Highest | Expensive |
| Sentence-window | Dense factual content | Good | Sentence + context |
| Parent-child | Hierarchical docs, manuals | High | Two-level index |

### Chunking Decision Tree

```text
Is your content highly structured (headers, sections, numbered)?
├─ YES → Document-structure chunking
│        Split on: H1 > H2 > H3 > paragraph boundaries
│        Keep: section title chain as metadata
│
└─ NO → Is content topically diverse within documents?
         ├─ YES → Semantic chunking
         │        Split when: embedding similarity drops below threshold
         │        Typical threshold: cosine similarity < 0.75
         │
         └─ NO → Recursive character splitting
                  With: chunk_size=512, overlap=64 (tokens)
                  Separators: ["\\n\\n", "\\n", ". ", " "]
```
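The recursive branch above can be sketched as follows. This is a minimal illustration, not a library API: it measures characters rather than tokens, and omits overlap for brevity.

```python
def recursive_split(text, chunk_size=512, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator present, merging pieces back
    together while they fit; fall back to a hard character split.
    Note: lengths are characters here, not tokens."""
    if len(text) <= chunk_size:
        return [text]
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = current + sep + part if current else part
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                if len(part) > chunk_size:  # piece still too big: recurse finer
                    chunks.extend(recursive_split(part, chunk_size, separators[i + 1:]))
                    current = ""
                else:
                    current = part
        if current:
            chunks.append(current)
        return chunks
    # no separator matched: hard split
    return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]
```

In production you would count tokens with your model's tokenizer and add the overlap window described above.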

### Chunk Size Guidelines

| Use Case | Target Tokens | Overlap | Rationale |
|---|---|---|---|
| Factual Q&A | 256-512 | 32-64 | Precise retrieval |
| Summarization | 512-1024 | 64-128 | Broader context |
| Code search | Function/class level | 0 | Natural boundaries |
| Legal/regulatory | Section/clause level | 1 sentence | Preserve clause integrity |
| Conversational | 256-512 | 64 | Quick, focused answers |
| Research/analysis | 1024-2048 | 128-256 | Deep context |

### Chunk Quality Rules

- Self-contained: a chunk should make sense on its own (add context headers if needed)
- Atomic: one main idea per chunk when possible
- Retrievable: would this chunk be useful if a user searched for its content?
- No orphans: don't create chunks < 50 tokens (merge with neighbors)
- Preserve structure: tables, code blocks, and lists should not be split mid-element
- Context prefix: prepend document title + section hierarchy to each chunk

### Parent-Child (Two-Level) Strategy

```text
Parent chunks: 2048 tokens (stored for LLM context)
  └─ Child chunks: 256 tokens (stored for retrieval)

Retrieval: Search child chunks → Return parent chunk to LLM
Benefit: Precise retrieval + rich context
```
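A minimal sketch of the two-level lookup. Keyword overlap stands in for the child-chunk vector search, and all names (`children`, `child_to_parent`, `parents`) are illustrative:

```python
def retrieve_parents(query, children, child_to_parent, parents, top_k=3):
    """children: {child_id: text}; child_to_parent: {child_id: parent_id};
    parents: {parent_id: text}. Search small chunks, return big ones."""
    terms = set(query.lower().split())
    # Stand-in scorer: a real system ranks children by embedding similarity.
    ranked = sorted(children,
                    key=lambda cid: -len(terms & set(children[cid].lower().split())))
    seen, results = set(), []
    for cid in ranked:
        pid = child_to_parent[cid]
        if pid not in seen:               # dedupe: siblings share a parent
            seen.add(pid)
            results.append(parents[pid])
        if len(results) == top_k:
            break
    return results
```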

### Chunk Quality Scoring

Score each chunk (automated):

| Dimension | Weight | 0 (Bad) | 5 (Good) | 10 (Great) |
|---|---|---|---|---|
| Self-contained | 25% | Sentence fragment | Needs context | Standalone meaningful |
| Information density | 25% | Mostly boilerplate | Mixed | Dense, useful content |
| Boundary quality | 20% | Mid-sentence split | Paragraph boundary | Section/topic boundary |
| Metadata completeness | 15% | No metadata | Basic fields | Full enrichment |
| Size appropriateness | 15% | <50 or >2048 tokens | Within range | Optimal for use case |

Target: Average chunk quality score > 7.0

### Embedding Model Selection

| Model | Dimensions | Max Tokens | Quality | Speed | Cost |
|---|---|---|---|---|---|
| text-embedding-3-large (OpenAI) | 3072 (or 256-3072 via MRL) | 8191 | Excellent | Fast | $0.13/1M tokens |
| text-embedding-3-small (OpenAI) | 1536 (or 256-1536) | 8191 | Good | Very fast | $0.02/1M tokens |
| voyage-3-large (Voyage) | 1024 | 32000 | Excellent | Fast | $0.18/1M tokens |
| voyage-code-3 (Voyage) | 1024 | 32000 | Best for code | Fast | $0.18/1M tokens |
| Cohere embed-v4 | 1024 | 128000 | Excellent | Fast | $0.10/1M tokens |
| BGE-M3 (open source) | 1024 | 8192 | Very good | Self-host | Free (compute) |
| nomic-embed-text (open source) | 768 | 8192 | Good | Self-host | Free (compute) |
| GTE-Qwen2 (open source) | 1024-1792 | 8192 | Excellent | Self-host | Free (compute) |

### Model Selection Rules

- Start with: text-embedding-3-small (best cost/quality for prototypes)
- Production default: text-embedding-3-large or voyage-3-large
- Code search: voyage-code-3 or domain-fine-tuned
- Multilingual: Cohere embed-v4 or BGE-M3
- Privacy/on-prem: BGE-M3 or GTE-Qwen2
- Budget constrained: MRL (Matryoshka) — reduce dimensions (e.g., 3072→256) for 10x storage savings with ~5% quality loss

### Embedding Best Practices

- Prefix queries differently from documents: some models (Nomic, E5) need task-specific prefixes — documents as `search_document: {text}`, queries as `search_query: {text}`
- Normalize embeddings: L2 normalize for cosine similarity
- Batch embedding: process in batches of 100-500 for throughput
- Cache embeddings: store and reuse; don't re-embed unchanged documents
- Benchmark on YOUR data: generic benchmarks (MTEB) don't predict domain-specific performance
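The prefixing and normalization practices can be sketched as below. The prefix strings follow the Nomic/E5-style convention quoted above; always verify the exact strings against your model card.

```python
import math

# Task prefixes in the Nomic/E5 style (model-specific; check your model card).
PREFIXES = {"document": "search_document: ", "query": "search_query: "}

def with_task_prefix(text, kind):
    """Prepend the model's task prefix before embedding."""
    return PREFIXES[kind] + text

def l2_normalize(vec):
    """Scale to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)
```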

### Embedding Quality Test

Before committing to a model, run this:

1. Create 50 query-document pairs from your actual data
2. Embed all queries and documents
3. Calculate recall@5 and recall@10
4. Compare 2-3 models
5. Pick the one with highest recall on YOUR domain

Target: recall@5 > 0.7 on your domain test set
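The scoring step of this test is a few lines. A sketch assuming one gold document per query (real test sets may mark several relevant documents):

```python
def recall_at_k(retrieved, gold, k=5):
    """retrieved: {query: ranked doc ids}; gold: {query: relevant doc id}.
    Returns the fraction of queries whose gold doc appears in the top k."""
    hits = sum(1 for q, doc in gold.items() if doc in retrieved[q][:k])
    return hits / len(gold)
```

Run it once per candidate embedding model over the same 50-pair set and compare.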

### Vector Database Selection

| Database | Type | Scale | Features | Best For |
|---|---|---|---|---|
| Pinecone | Managed | Billions | Serverless, metadata filter | Production SaaS |
| Weaviate | Managed/Self-host | Millions-Billions | Hybrid search, modules | Feature-rich apps |
| Qdrant | Managed/Self-host | Billions | Filtering, quantization | High-performance |
| ChromaDB | Embedded | Thousands-Millions | Simple API | Prototypes, local |
| pgvector | Extension | Millions | SQL integration | Postgres-native apps |
| Milvus | Self-host | Billions | GPU support | Large scale |
| LanceDB | Embedded | Millions | Serverless, multimodal | Cost-sensitive |

### Selection Decision

```text
Scale < 100K chunks AND simple use case?
├─ YES → ChromaDB or pgvector
└─ NO → Need managed service?
         ├─ YES → Pinecone (simplest) or Weaviate (feature-rich)
         └─ NO → Qdrant (performance) or Milvus (scale)
```

### Indexing Strategy

| Index Type | Recall | Speed | Memory | Use When |
|---|---|---|---|---|
| Flat/Brute | 100% | Slow | High | <50K vectors, accuracy critical |
| IVF | 95-99% | Fast | Medium | 50K-10M vectors |
| HNSW | 95-99% | Very fast | High | Default choice for quality |
| PQ (Product Quantization) | 90-95% | Fast | Low | Memory constrained |
| HNSW+PQ | 93-98% | Fast | Medium | Scale + quality balance |

Default recommendation: HNSW with ef_construction=200, M=16

### Hybrid Search Architecture

```text
Query → [Sparse Search (BM25)] → Top K₁ results
      → [Dense Search (Vector)] → Top K₂ results
      → [Reciprocal Rank Fusion] → Final Top K results → LLM
```

Why hybrid?

- Dense (vector) excels at semantic similarity
- Sparse (BM25/keyword) excels at exact term matching, acronyms, IDs
- Hybrid captures both — 5-15% improvement over either alone

RRF Formula: score = Σ 1/(k + rank_i) where k=60 (default)
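The fusion step is a direct translation of the formula. A minimal sketch that takes the ranked ID lists from the sparse and dense searches:

```python
def rrf_fuse(rankings, k=60, top_n=10):
    """Reciprocal Rank Fusion: score(d) = Σ 1/(k + rank_i) across lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Documents that appear high in both lists accumulate the largest scores, which is why fusion rewards agreement between the sparse and dense retrievers.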

### Metadata Filtering

Always support these filters:

```yaml
filterable_fields:
  - source_type: "[document type]"
  - created_after: "[date filter]"
  - access_level: "[permission-based filtering]"
  - language: "[language filter]"
  - tags: "[topic/category filter]"
  - quality_score_min: "[minimum quality threshold]"
```

Rule: Filter BEFORE vector search, not after — reduces search space and improves relevance.
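In managed vector DBs the pre-filter is a query parameter; conceptually it is just this (a simplified equality-only sketch — real filters also support ranges like `created_after` and `quality_score_min`):

```python
def prefilter(chunks, **filters):
    """Shrink the candidate set with metadata equality filters before
    running vector search over what remains."""
    def keep(chunk):
        meta = chunk["metadata"]
        return all(meta.get(field) == value for field, value in filters.items())
    return [c for c in chunks if keep(c)]
```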

### Query Processing Pipeline

User Query → Query Understanding → Query Transformation → Retrieval → Reranking → Context Assembly → LLM

### Query Transformation Techniques

| Technique | What It Does | When to Use | Quality Boost |
|---|---|---|---|
| Query rewriting | LLM rewrites query for clarity | Vague/conversational queries | +10-15% |
| HyDE | Generate hypothetical answer, embed that | Factual Q&A | +5-15% |
| Multi-query | Generate 3-5 query variants | Complex questions | +10-20% |
| Step-back | Abstract to higher-level question | Complex reasoning | +5-10% |
| Query decomposition | Break into sub-questions | Multi-part questions | +15-25% |
| Query routing | Route to different indexes | Multi-source systems | +10-20% |

### Recommended: Multi-Query + Reranking

```python
# Pseudocode for production retrieval
def retrieve(user_query: str, top_k: int = 5) -> list[Chunk]:
    # Step 1: Generate query variants
    queries = generate_query_variants(user_query, n=3)  # LLM generates 3 variants
    queries.append(user_query)  # Include original

    # Step 2: Retrieve candidates from each query
    candidates = set()
    for q in queries:
        results = hybrid_search(q, top_k=20)  # Over-retrieve
        candidates.update(results)

    # Step 3: Rerank
    reranked = rerank(user_query, list(candidates), top_k=top_k)

    return reranked
```

### Reranking

Why rerank? Embedding similarity is a rough filter. Cross-encoder rerankers are 10-30% more accurate but too slow for initial retrieval.

| Reranker | Quality | Speed | Cost |
|---|---|---|---|
| Cohere Rerank 3.5 | Excellent | Fast | $2/1M queries |
| Voyage Reranker 2 | Excellent | Fast | API pricing |
| BGE-reranker-v2-m3 | Very good | Medium | Free (self-host) |
| ColBERT v2 | Excellent | Medium | Free (self-host) |
| LLM-as-reranker | Best | Slow | Expensive |

Default: Cohere Rerank 3.5 (best quality/cost ratio)

### Retrieval Parameters

| Parameter | Default | Range | Impact |
|---|---|---|---|
| top_k (initial retrieval) | 20 | 10-50 | Higher = better recall, more noise |
| top_k (after reranking) | 5 | 3-10 | Higher = more context, more cost |
| similarity threshold | 0.3 | 0.2-0.5 | Filter low-relevance results |
| MMR diversity | λ=0.7 | 0.5-1.0 | Lower = more diverse results |

### Context Assembly

```yaml
context_assembly:
  ordering: "relevance_descending"  # Most relevant first
  deduplication: true  # Remove near-duplicate chunks
  max_context_tokens: 4000  # Leave room for system prompt + answer
  include_metadata: true  # Source, date, section as inline citations
  separator: "\\n---\\n"  # Clear chunk boundaries

  # Citation format
  citation_style: |
    [Source: {title} | Section: {section} | Date: {date}]
    {chunk_text}
```
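The assembly settings above can be sketched in a few lines. This is an illustration, not a fixed API: tokens are approximated as `len(text) // 4`, and deduplication is exact-match only.

```python
def assemble_context(chunks, max_tokens=4000):
    """Order by relevance, drop exact duplicates, prepend a citation header
    to each chunk, and stop before the token budget is exceeded."""
    seen, parts, used = set(), [], 0
    for c in sorted(chunks, key=lambda c: -c["score"]):
        if c["text"] in seen:
            continue
        cost = len(c["text"]) // 4 + 16  # rough text + header token estimate
        if used + cost > max_tokens:
            break
        seen.add(c["text"])
        header = f"[Source: {c['title']} | Section: {c['section']} | Date: {c['date']}]"
        parts.append(header + "\n" + c["text"])
        used += cost
    return "\n---\n".join(parts)
```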

### System Prompt Template

```text
You are a helpful assistant that answers questions based on the provided context.

## Rules
1. Answer ONLY based on the provided context. If the context doesn't contain the answer, say "I don't have enough information to answer this question."
2. Always cite your sources using [Source: X] notation.
3. If the context contains conflicting information, acknowledge the conflict and present both perspectives.
4. Never make up information or fill gaps with your training data.
5. If the question is ambiguous, ask for clarification.
6. Keep answers concise but complete.

## Context
{retrieved_context}

## Conversation History (if multi-turn)
{conversation_history}

## User Question
{user_query}
```

### Prompt Engineering for RAG

Grounding rules (prevent hallucination):

- Explicitly instruct: "Only use the provided context"
- Add: "If you're unsure, say so rather than guessing"
- Include: "Quote relevant passages to support your answer"
- Test: ask questions NOT in the context — the model should decline

Citation instructions:

- Inline: "Based on [Document Title, Section X]..."
- Footnote: "...the process involves three steps.[1]"
- Both: use inline for key claims, footnotes for supporting details

### Model Selection for Generation

| Model | Context Window | Quality | Cost | Best For |
|---|---|---|---|---|
| GPT-4o | 128K | Excellent | Medium | General production |
| GPT-4o-mini | 128K | Good | Low | High-volume, cost-sensitive |
| Claude Sonnet | 200K | Excellent | Medium | Nuanced answers, long context |
| Claude Haiku | 200K | Good | Low | Fast, cost-sensitive |
| Gemini 1.5 Pro | 1M | Excellent | Medium | Very large context needs |
| Llama 3.1 70B | 128K | Very good | Self-host | Privacy, on-prem |

### Multi-Turn Conversation

```yaml
conversation_strategy:
  # How to handle follow-up questions
  query_reformulation: true  # Rewrite follow-ups as standalone queries
  context_carry_forward: "last_2_turns"  # How much history to include
  memory:
    type: "sliding_window"  # or "summary" for long conversations
    window_size: 5  # Number of turns to keep

  # Example reformulation
  # Turn 1: "What is RAG?" → search as-is
  # Turn 2: "How does it handle updates?" → reformulate: "How does RAG handle document updates?"
```
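The reformulation step boils down to one LLM call; what you control is the prompt. A sketch of building that prompt (the LLM call itself is out of scope, and the wording is illustrative):

```python
def reformulation_prompt(history, follow_up, carry_turns=2):
    """Build a prompt asking an LLM to rewrite a follow-up question as a
    standalone search query. history: list of (user, assistant) turns."""
    turns = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in history[-carry_turns:])
    return (
        "Rewrite the final user question as a standalone search query, "
        "resolving pronouns and references from the conversation.\n\n"
        f"{turns}\nUser: {follow_up}\nStandalone query:"
    )
```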

### RAG Evaluation is Non-Negotiable

If you're not measuring, you're guessing. Every production RAG system needs automated evaluation.

### Evaluation Dimensions

| Dimension | What It Measures | Method |
|---|---|---|
| Retrieval Precision | Are retrieved chunks relevant? | Human or LLM judge |
| Retrieval Recall | Are all relevant chunks found? | Gold set comparison |
| Answer Faithfulness | Does answer match context? (no hallucination) | LLM-as-judge |
| Answer Relevance | Does answer address the question? | LLM-as-judge |
| Answer Completeness | Are all aspects of the question addressed? | LLM-as-judge |
| Citation Accuracy | Are citations correct and sufficient? | Automated + human |
| Latency | End-to-end response time | Instrumentation |
| Cost | Per-query cost | Instrumentation |

### Evaluation Dataset

Build a golden test set (minimum 100 examples):

```yaml
eval_example:
  query: "What is the refund policy for enterprise customers?"
  expected_sources: ["policy-doc-v3.pdf", "enterprise-agreement.md"]
  expected_answer_contains:
    - "30-day refund window"
    - "written notice required"
    - "prorated for annual plans"
  answer_type: "factual"
  difficulty: "easy"  # easy / medium / hard
```

Test set composition:

- 40% easy (single document, direct answer)
- 35% medium (multiple documents, synthesis needed)
- 15% hard (requires reasoning, edge cases)
- 10% unanswerable (answer NOT in corpus — must detect)

### LLM-as-Judge Prompts

Faithfulness (hallucination detection):

```text
Given the context and the answer, determine if the answer is faithful to the context.

Context: {context}
Question: {question}
Answer: {answer}

Score 1-5:
1 = Contains fabricated information not in context
2 = Mostly faithful but includes unsupported claims
3 = Faithful with minor extrapolation
4 = Faithful, well-supported
5 = Perfectly faithful, every claim traceable to context

Score: [1-5]
Reasoning: [explain]
```

Answer Relevance:

```text
Does this answer address the user's question?

Question: {question}
Answer: {answer}

Score 1-5:
1 = Completely irrelevant
2 = Partially relevant, misses key aspects
3 = Relevant but incomplete
4 = Relevant and mostly complete
5 = Perfectly addresses all aspects of the question

Score: [1-5]
Reasoning: [explain]
```

### Evaluation Tools

| Tool | Type | Best For |
|---|---|---|
| RAGAS | Open source | Comprehensive RAG metrics |
| DeepEval | Open source | LLM-as-judge + classic metrics |
| Arize Phoenix | Open source | Tracing + evaluation |
| LangSmith | Managed | LangChain ecosystem |
| Braintrust | Managed | Eval + logging + monitoring |
| Custom | DIY | Maximum control |

### Evaluation Cadence

| Frequency | What to Evaluate |
|---|---|
| Every PR | Run golden test set (automated CI) |
| Weekly | Sample 50 production queries for human review |
| Monthly | Full evaluation suite + benchmark comparison |
| Quarterly | Revisit golden test set, add new examples |

### Production Architecture

```text
┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│   Client     │────▶│  API Gateway  │────▶│  RAG Service  │
│   (App/API)  │     │  (Rate limit) │     │              │
└─────────────┘     └──────────────┘     │  Query Proc.  │
                                          │  Retrieval    │
                                          │  Reranking    │
                                          │  Generation   │
                                          └───────┬───────┘
                                                  │
                    ┌──────────────┐     ┌────────▼───────┐
                    │  Ingestion    │────▶│  Vector Store   │
                    │  Pipeline     │     │  + Metadata     │
                    └──────────────┘     └────────────────┘
```

### Production Checklist

Pre-Launch (Mandatory):

- [ ] Golden test set passing (>85% on all dimensions)
- [ ] Hallucination rate < 5% on test set
- [ ] Latency P95 < target (typically 3-5s)
- [ ] Rate limiting configured
- [ ] Input validation (max query length, content filtering)
- [ ] Output filtering (PII detection, content safety)
- [ ] Error handling (vector DB down, LLM timeout, empty results)
- [ ] Fallback behavior defined ("I don't know" > hallucination)
- [ ] Logging and tracing enabled
- [ ] Cost monitoring and alerts set
- [ ] Load tested at 2x expected peak

Security:

- [ ] No prompt injection vectors (user input sanitized)
- [ ] Access control on documents (user sees only authorized content)
- [ ] No PII leakage across user boundaries
- [ ] API authentication required
- [ ] Rate limiting per user/API key
- [ ] Audit logging for compliance

### Caching Strategy

```yaml
caching:
  query_cache:
    type: "semantic"  # Cache semantically similar queries
    ttl: 3600  # 1 hour
    similarity_threshold: 0.95
    expected_hit_rate: "20-40%"

  embedding_cache:
    type: "exact"  # Cache document embeddings
    ttl: 86400  # 24 hours (or until document changes)

  llm_response_cache:
    type: "exact_query_context"
    ttl: 1800  # 30 minutes
    invalidate_on: "source_document_update"
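The semantic query cache above amounts to: embed the incoming query, and if a cached query's embedding is within the similarity threshold and its TTL, reuse its answer. A minimal sketch with a caller-supplied `embed` function (all names illustrative; a real implementation would use an ANN index rather than a linear scan):

```python
import math
import time

class SemanticQueryCache:
    """Serve a cached answer when a new query embeds close enough to a
    previously cached one."""
    def __init__(self, embed, threshold=0.95, ttl=3600):
        self.embed, self.threshold, self.ttl = embed, threshold, ttl
        self._entries = []  # (vector, answer, timestamp)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query):
        vec, now = self.embed(query), time.time()
        for cached_vec, answer, ts in self._entries:
            if now - ts < self.ttl and self._cosine(vec, cached_vec) >= self.threshold:
                return answer
        return None  # cache miss: run the full pipeline, then put()

    def put(self, query, answer):
        self._entries.append((self.embed(query), answer, time.time()))
```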

### Scaling Considerations

| Scale | Architecture | Notes |
|---|---|---|
| <1K queries/day | Single instance, managed vector DB | Keep it simple |
| 1K-100K/day | Horizontal scaling, caching | Add semantic cache |
| 100K-1M/day | Microservices, async, CDN | Separate ingestion/retrieval |
| >1M/day | Distributed, multi-region | Custom infrastructure |

### Production Monitoring Dashboard

```yaml
rag_dashboard:
  real_time:
    - query_volume: "[queries/min]"
    - latency_p50: "[ms]"
    - latency_p95: "[ms]"
    - latency_p99: "[ms]"
    - error_rate: "[%]"
    - cache_hit_rate: "[%]"

  quality_signals:
    - retrieval_confidence_avg: "[0-1 — average similarity score]"
    - empty_retrieval_rate: "[% queries with no results above threshold]"
    - fallback_rate: "[% queries where model says 'I don't know']"
    - user_feedback_positive: "[% thumbs up]"
    - citation_rate: "[% answers with citations]"

  cost:
    - embedding_cost_daily: "[$]"
    - llm_cost_daily: "[$]"
    - reranker_cost_daily: "[$]"
    - vector_db_cost_daily: "[$]"
    - total_cost_per_query: "[$]"

  data_health:
    - index_freshness: "[time since last update]"
    - total_chunks_indexed: "[count]"
    - failed_ingestion_count: "[count]"
    - avg_chunk_quality_score: "[0-10]"
```

### Alert Rules

| Alert | Threshold | Severity |
|---|---|---|
| Latency P95 > 8s | 5 min sustained | Warning |
| Latency P95 > 15s | 1 min sustained | Critical |
| Error rate > 5% | 5 min sustained | Critical |
| Empty retrieval > 30% | 1 hour | Warning |
| Hallucination detected | Any flagged | Warning |
| Cost per query > 2x baseline | 1 hour | Warning |
| Vector DB latency > 500ms | 5 min sustained | Warning |
| Index staleness > 24h | If freshness SLA is <24h | Warning |

### Continuous Improvement Loop

Monitor → Identify Failure Patterns → Root Cause → Fix → Evaluate → Deploy

Weekly review questions:

- What are the top 5 query types with lowest satisfaction?
- Which documents are never retrieved? (potential indexing issues)
- Which queries trigger "I don't know"? (coverage gaps)
- What's the hallucination trend? (improving or degrading?)
- Are costs trending up or down per query?

### Agentic RAG

```text
User Query → Query Planner (LLM) → [Plan: search A, then search B, compare]
                                     ↓
                               Tool Execution
                               ├─ search_documents(query_A)
                               ├─ search_documents(query_B)
                               ├─ calculate(comparison)
                               └─ synthesize(results)
                                     ↓
                               Final Answer
```

When to use: Multi-step reasoning, cross-document comparison, calculation needed.

Implementation:

1. Define tools: search_docs, lookup_entity, calculate, compare
2. Use function calling with planning prompt
3. Limit iterations (max 5 tool calls per query)
4. Track and log the full reasoning chain
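The bounded plan-act loop these steps describe can be sketched as below. All names (`plan`, `tools`, the step dict shape) are illustrative, not a specific framework's API; in practice `plan` would be an LLM function-calling step.

```python
def run_agentic_query(query, plan, tools, max_steps=5):
    """Bounded tool loop: `plan(query, trace)` returns either
    {"action": "answer", "text": ...} or {"action": tool_name, "args": {...}}.
    The trace records every tool call for logging/debugging."""
    trace = []
    for _ in range(max_steps):
        step = plan(query, trace)
        if step["action"] == "answer":
            return step["text"], trace
        result = tools[step["action"]](**step["args"])
        trace.append((step["action"], step["args"], result))
    return "Step limit reached without a final answer.", trace
```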

### Graph RAG

```yaml
graph_rag:
  when_to_use:
    - "Entity-heavy domains (legal, medical, organizational)"
    - "Queries about relationships ('who reports to X?')"
    - "Multi-hop reasoning ('what products use components from supplier Y?')"

  architecture:
    entities: "[Extract entities from documents]"
    relationships: "[Extract entity-entity relationships]"
    communities: "[Cluster entities into topic communities]"
    summaries: "[Generate community summaries]"

  retrieval:
    local_search: "Entity-focused — find specific entities and their neighbors"
    global_search: "Community-focused — synthesize across topic clusters"
    hybrid: "Combine vector similarity + graph traversal"
```

### Corrective RAG (CRAG)

```text
Query → Retrieve → Evaluate Relevance →
  ├─ CORRECT: Retrieved docs are relevant → Generate answer
  ├─ AMBIGUOUS: Partially relevant → Refine query + re-retrieve
  └─ INCORRECT: Not relevant → Fall back to web search or "I don't know"
```

### Self-RAG

```text
Query → Retrieve → Generate + Self-Reflect →
  ├─ "Is retrieval needed?" → Skip if query is simple
  ├─ "Are results relevant?" → Re-retrieve if not
  ├─ "Is my answer supported?" → Revise if not faithful
  └─ "Is my answer useful?" → Regenerate if not
```

### RAG + Fine-Tuning

| Approach | When | Benefit |
|---|---|---|
| RAG only | Dynamic knowledge, many sources | Flexible, no training needed |
| Fine-tuning only | Static knowledge, consistent format | Fast inference, no retrieval |
| RAG + Fine-tuned embeddings | Domain-specific vocabulary | Better retrieval quality |
| RAG + Fine-tuned generator | Consistent output format needed | Better answers + grounding |

### Multi-Modal RAG

```yaml
multimodal_rag:
  document_types:
    images: "Generate text descriptions via vision model; embed descriptions"
    tables: "Convert to structured text; embed as markdown"
    charts: "Describe in natural language; embed description"
    diagrams: "Generate detailed caption; store image reference + caption"

  retrieval:
    strategy: "Text-first retrieval with multimodal context assembly"
    image_in_context: "Include as base64 or URL reference in prompt"
```

### Diagnostic Decision Tree

```text
RAG quality is poor
├─ Retrieved chunks are irrelevant
│   ├─ Check: Chunking strategy → Are chunks self-contained?
│   ├─ Check: Embedding model → Run domain benchmark test
│   ├─ Check: Query transformation → Enable multi-query or HyDE
│   └─ Fix: Add reranking if not present
│
├─ Retrieved chunks are relevant but answer is wrong
│   ├─ Check: System prompt → Is grounding instruction clear?
│   ├─ Check: Context window → Is relevant info getting truncated?
│   ├─ Check: Conflicting sources → Add conflict resolution instructions
│   └─ Fix: Upgrade generation model
│
├─ System says "I don't know" too often
│   ├─ Check: Similarity threshold → Too high? Lower from 0.5 to 0.3
│   ├─ Check: Corpus coverage → Missing documents?
│   ├─ Check: top_k → Too low? Increase from 5 to 10
│   └─ Fix: Add query expansion
│
├─ Hallucination / makes things up
│   ├─ Check: System prompt → Add explicit grounding instructions
│   ├─ Check: Temperature → Set to 0.0-0.3 for factual tasks
│   ├─ Check: Retrieved context → Is it misleading or ambiguous?
│   └─ Fix: Add faithfulness evaluation in post-processing
│
└─ Too slow
    ├─ Check: Embedding latency → Batch? Cache?
    ├─ Check: Vector search → Index type? Quantization?
    ├─ Check: Reranker → Faster model or reduce candidate set
    └─ Fix: Add caching layer (semantic query cache)
```

### 10 RAG Anti-Patterns

| # | Anti-Pattern | Why It's Bad | Fix |
|---|---|---|---|
| 1 | No reranking | Vector similarity is noisy | Add cross-encoder reranker |
| 2 | Fixed chunk size for all docs | Different docs need different strategies | Use document-aware chunking |
| 3 | No evaluation | Flying blind | Build golden test set + automated eval |
| 4 | Ignoring metadata | Missing obvious filtering opportunities | Add metadata enrichment + filtering |
| 5 | Single query embedding | Misses semantic variants | Use multi-query retrieval |
| 6 | No "I don't know" | Hallucination when context insufficient | Add explicit grounding + confidence |
| 7 | Embedding documents without context | Chunks lose meaning in isolation | Prepend title/section to chunks |
| 8 | No freshness management | Stale answers from outdated docs | Implement update pipeline + TTL |
| 9 | Oversized context | Wasted tokens, increased cost + latency | Optimize top_k, use reranking |
| 10 | No access control | Users see unauthorized content | Implement document-level ACL filtering |

### 10 Common Mistakes

| Mistake | Impact | Fix |
| --- | --- | --- |
| Starting with complex architecture | Wasted time | Start naive, add complexity based on eval data |
| Not measuring before optimizing | Optimizing wrong thing | Eval first, then optimize worst dimension |
| Chunking at arbitrary character count | Bad retrieval | Use semantic or structure-aware chunking |
| Using same embedding for all languages | Poor multilingual results | Use multilingual model or per-language index |
| Ignoring the 20% of hard queries | 80% of user complaints | Build hard query test set, optimize for tail |
| No conversation context | Bad multi-turn experience | Implement query reformulation |
| Stuffing entire documents | Wasted tokens, noise | Retrieve only relevant chunks |
| Not handling "no results" gracefully | Hallucination | Define explicit fallback behavior |
| Over-engineering from day 1 | Never ships | MVP in 1 week, iterate from data |
| Not versioning your index | Can't rollback | Version embeddings + index config |

### RAG System Health Score (0-100)

| Dimension | Weight | Score 0-10 |
| --- | --- | --- |
| Retrieval quality (precision + recall) | 20% | ___ |
| Answer faithfulness (no hallucination) | 20% | ___ |
| Answer relevance & completeness | 15% | ___ |
| Latency & performance | 10% | ___ |
| Cost efficiency | 10% | ___ |
| Evaluation coverage | 10% | ___ |
| Data freshness & quality | 10% | ___ |
| Security & access control | 5% | ___ |

Weighted Score: ___ / 100

| Grade | Score | Action |
| --- | --- | --- |
| A | 85-100 | Production-ready, continuous improvement |
| B | 70-84 | Good foundation, address gaps |
| C | 55-69 | Significant improvements needed |
| D | 40-54 | Fundamental issues, review architecture |
| F | <40 | Rebuild needed |
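
Filling in the rubric can be automated. A minimal sketch of the weighted score and grade lookup, with weights copied from the table (the dimension keys are illustrative names):

```python
# Weights from the health-score rubric; they sum to 1.0.
WEIGHTS = {
    "retrieval_quality": 0.20,
    "faithfulness": 0.20,
    "relevance_completeness": 0.15,
    "latency": 0.10,
    "cost": 0.10,
    "eval_coverage": 0.10,
    "freshness_quality": 0.10,
    "security": 0.05,
}

def health_score(scores: dict[str, float]) -> float:
    """Each dimension is rated 0-10; returns the weighted 0-100 total."""
    return sum(WEIGHTS[k] * scores[k] * 10 for k in WEIGHTS)

def grade(total: float) -> str:
    # Cutoffs from the grade table above.
    for cutoff, letter in [(85, "A"), (70, "B"), (55, "C"), (40, "D")]:
        if total >= cutoff:
            return letter
    return "F"
```

For example, rating every dimension 5/10 yields a weighted score of 50, grade D.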

### Low-Volume / Small Corpus

- Skip vector DB — use in-memory search or full-context stuffing
- Focus on chunking quality over retrieval sophistication
- Simple keyword + semantic hybrid is sufficient
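
For a small corpus, "skip the vector DB" can be as simple as token-overlap scoring in memory. A minimal sketch of the keyword half only (the semantic half would add an embedding similarity term):

```python
def score(query: str, doc: str) -> float:
    """Fraction of query tokens that appear in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def search(query: str, docs: list[str], top_k: int = 3) -> list[str]:
    # Linear scan is fine here: the whole point is the corpus is small.
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    return ranked[:top_k]
```

At a few hundred documents this scans the entire corpus per query in microseconds, with no index to build, version, or keep fresh.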

### High-Security / Regulated

- On-prem vector DB + self-hosted embedding model
- Document-level ACL enforcement at retrieval time
- Audit logging every query + response
- Data residency compliance for vector storage
- Consider homomorphic encryption for embeddings
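
Document-level ACL enforcement at retrieval time can be sketched as a filter on the user's group memberships, applied before any chunk reaches the prompt. The chunk shape and group names below are illustrative:

```python
def acl_filter(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    """Keep only chunks whose document ACL intersects the user's groups.
    Unauthorized content is dropped before it can enter the context window."""
    return [c for c in chunks if c["allowed_groups"] & user_groups]

retrieved = [
    {"text": "Q3 revenue figures...", "allowed_groups": {"finance"}},
    {"text": "Onboarding guide...", "allowed_groups": {"all-staff"}},
]
visible = acl_filter(retrieved, user_groups={"all-staff", "eng"})
```

Where the vector DB supports metadata filters, push this condition into the query itself so unauthorized chunks never occupy top_k slots; post-filtering alone can silently shrink the candidate set.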

### Multi-Language

- Use multilingual embedding model (BGE-M3, Cohere embed-v4)
- Consider per-language indexes for large corpora
- Query language detection → route to appropriate index
- Cross-lingual retrieval: query in English, retrieve in any language
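
Query-language detection and routing can be sketched with a crude script-based heuristic; a real system would use a proper language-ID model, and the index names here are illustrative:

```python
def detect_language(query: str) -> str:
    """Crude heuristic: classify by Unicode script of the first non-Latin
    character found. Good enough only as a sketch of the routing step."""
    for ch in query:
        if "\u4e00" <= ch <= "\u9fff":   # CJK Unified Ideographs
            return "zh"
        if "\u3040" <= ch <= "\u30ff":   # Hiragana / Katakana
            return "ja"
    return "en"

INDEX_BY_LANGUAGE = {"zh": "docs-zh", "ja": "docs-ja", "en": "docs-en"}

def route(query: str) -> str:
    return INDEX_BY_LANGUAGE[detect_language(query)]
```

With a multilingual embedding model and a single index, this routing step disappears entirely; per-language indexes mainly pay off at large corpus sizes.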

### Real-Time / Streaming

- Event-driven ingestion (Kafka/webhooks → chunk → embed → index)
- Incremental indexing (add/update/delete individual chunks)
- Version management (don't serve partially indexed documents)
- Consider time-weighted scoring (recent docs ranked higher)
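
Time-weighted scoring can be sketched as an exponential recency decay blended into the similarity score; the blend weight and half-life below are illustrative tuning parameters, not prescribed values:

```python
def time_weighted(similarity: float, age_days: float,
                  recency_weight: float = 0.3,
                  half_life_days: float = 30.0) -> float:
    """Blend similarity with a recency term that halves every half-life.
    A doc's recency contribution drops from 0.3 at age 0 to 0.15 at 30 days."""
    recency = 0.5 ** (age_days / half_life_days)
    return (1 - recency_weight) * similarity + recency_weight * recency
```

Blending (rather than multiplying similarity by the decay) keeps old but highly relevant documents retrievable instead of zeroing them out.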

### Very Large Corpus (>10M documents)

- Tiered retrieval: coarse filter → fine retrieval → reranking
- Hierarchical indexing (cluster → sub-cluster → document → chunk)
- Async processing pipeline with queue management
- Consider pre-computed answers for top 1000 queries
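
The tiered retrieval pipeline can be sketched as three narrowing stages, each cheaper per document than the next. The stage functions are caller-supplied stand-ins (metadata/BM25 filter, embedding scorer, cross-encoder reranker), and the cutoffs are illustrative:

```python
from typing import Any, Callable, Sequence

def tiered_retrieve(
    query: str,
    corpus: Sequence[Any],
    coarse_filter: Callable[[str, Sequence[Any]], Sequence[Any]],
    fine_score: Callable[[str, Any], float],
    rerank: Callable[[str, list[Any]], list[Any]],
    k_coarse: int = 10_000,
    k_fine: int = 50,
    k_final: int = 5,
) -> list[Any]:
    # Tier 1: cheap filter (metadata / BM25) narrows millions to thousands.
    candidates = list(coarse_filter(query, corpus))[:k_coarse]
    # Tier 2: semantic scoring runs only on the survivors.
    scored = sorted(candidates, key=lambda d: fine_score(query, d),
                    reverse=True)[:k_fine]
    # Tier 3: the expensive reranker sees only a small candidate set.
    return list(rerank(query, scored))[:k_final]
```

The design choice is cost shaping: the per-document cost rises at each tier while the candidate count falls, so the expensive model never touches more than `k_fine` documents.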

### Natural Language Commands

When the user says... you respond with:

| Command | Action |
| --- | --- |
| "Design a RAG system for [use case]" | Complete Phase 1 brief + architecture recommendation |
| "Help me chunk [document type]" | Chunking strategy recommendation + implementation |
| "Which embedding model should I use?" | Model comparison for their use case + benchmark plan |
| "My RAG results are bad" | Diagnostic decision tree walkthrough |
| "Evaluate my RAG system" | Evaluation framework setup + golden test set design |
| "Optimize retrieval" | Query transformation + reranking recommendations |
| "How do I handle [specific scenario]?" | Relevant pattern from advanced section |
| "Set up monitoring" | Dashboard YAML + alert rules for their scale |
| "How much will this cost?" | Cost estimation based on their scale + optimization tips |
| "Compare [approach A] vs [approach B]" | Decision matrix with pros/cons for their context |
| "I'm getting hallucinations" | Faithfulness diagnosis + grounding improvements |
| "Score my RAG system" | Full quality rubric assessment |

Built by AfrexAI — AI agents that compound capital and code.
Zero dependencies. Pure methodology. Works with any RAG stack.
## Trust
- Source: tencent
- Verification: Indexed source record
- Publisher: 1kalin
- Version: 1.0.0
## Source health
- Status: healthy
- Item download looks usable.
- Yavira can redirect you to the upstream package for this item.
- Health scope: item
- Reason: direct_download_ok
- Checked at: 2026-05-03T07:46:55.272Z
- Expires at: 2026-05-10T07:46:55.272Z
- Recommended action: Download for OpenClaw
## Links
- [Detail page](https://openagent3.xyz/skills/afrexai-rag-engineering)
- [Send to Agent page](https://openagent3.xyz/skills/afrexai-rag-engineering/agent)
- [JSON manifest](https://openagent3.xyz/skills/afrexai-rag-engineering/agent.json)
- [Markdown brief](https://openagent3.xyz/skills/afrexai-rag-engineering/agent.md)
- [Download page](https://openagent3.xyz/downloads/afrexai-rag-engineering)