Requirements
- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Data engineering skill for building scalable data pipelines, ETL/ELT systems, and data infrastructure. Expertise in Python, SQL, Spark, Airflow, dbt, Kafka, and modern data stack. Includes data modeling, pipeline orchestration, data quality, and DataOps. Use when designing data architectures, building data pipelines, optimizing data workflows, implementing data governance, or troubleshooting data issues.
Instead of working through the install steps yourself, hand the extracted package to your coding agent with a concrete install brief.
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.
Production-grade data engineering skill for building scalable, reliable data systems.
Contents:
- Trigger Phrases
- Quick Start
- Workflows: Building a Batch ETL Pipeline, Implementing Real-Time Streaming, Data Quality Framework Setup
- Architecture Decision Framework
- Tech Stack Reference
- Documentation
- Troubleshooting
Activate this skill when you see:
- Pipeline Design: "Design a data pipeline for...", "Build an ETL/ELT process...", "How should I ingest data from...", "Set up data extraction from..."
- Architecture: "Should I use batch or streaming?", "Lambda vs Kappa architecture", "How to handle late-arriving data", "Design a data lakehouse"
- Data Modeling: "Create a dimensional model...", "Star schema vs snowflake", "Implement slowly changing dimensions", "Design a data vault"
- Data Quality: "Add data validation to...", "Set up data quality checks", "Monitor data freshness", "Implement data contracts"
- Performance: "Optimize this Spark job", "Query is running slow", "Reduce pipeline execution time", "Tune Airflow DAG"
# Generate pipeline orchestration config
python scripts/pipeline_orchestrator.py generate \
  --type airflow \
  --source postgres \
  --destination snowflake \
  --schedule "0 5 * * *"

# Validate data quality
python scripts/data_quality_validator.py validate \
  --input data/sales.parquet \
  --schema schemas/sales.json \
  --checks freshness,completeness,uniqueness

# Optimize ETL performance
python scripts/etl_performance_optimizer.py analyze \
  --query queries/daily_aggregation.sql \
  --engine spark \
  --recommend
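For orientation, a hand-written Airflow DAG equivalent to the generated Postgres-to-Snowflake config might look roughly like the sketch below. This is a minimal sketch assuming Airflow 2.x; the dag_id, task names, and callables are illustrative assumptions, not the actual output of pipeline_orchestrator.py.

# Illustrative Airflow DAG: extract from Postgres, load to Snowflake, daily at 05:00
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_from_postgres(**context):
    ...  # pull yesterday's rows and stage them, e.g. as Parquet

def load_to_snowflake(**context):
    ...  # copy the staged files into the target table

with DAG(
    dag_id="postgres_to_snowflake_daily",   # hypothetical name
    schedule_interval="0 5 * * *",           # matches the --schedule flag above
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_from_postgres)
    load = PythonOperator(task_id="load", python_callable=load_to_snowflake)
    extract >> load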
→ See references/workflows.md for details
Use this framework to choose the right approach for your data pipeline.
| Criteria | Batch | Streaming |
| --- | --- | --- |
| Latency requirement | Hours to days | Seconds to minutes |
| Data volume | Large historical datasets | Continuous event streams |
| Processing complexity | Complex transformations, ML | Simple aggregations, filtering |
| Cost sensitivity | More cost-effective | Higher infrastructure cost |
| Error handling | Easier to reprocess | Requires careful design |

Decision Tree:

Is real-time insight required?
├── Yes → Use streaming
│   └── Is exactly-once semantics needed?
│       ├── Yes → Kafka + Flink/Spark Structured Streaming
│       └── No → Kafka + consumer groups
└── No → Use batch
    └── Is data volume > 1TB daily?
        ├── Yes → Spark/Databricks
        └── No → dbt + warehouse compute
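If you want the choice captured in code (for example inside a project scaffolding script), the tree translates into a small helper like the sketch below; the function name, inputs, and return strings are illustrative, not part of the skill's scripts.

# Illustrative helper encoding the decision tree above
def recommend_pipeline(realtime_required: bool,
                       exactly_once: bool = False,
                       daily_volume_tb: float = 0.0) -> str:
    """Return a suggested stack following the batch vs streaming decision tree."""
    if realtime_required:
        if exactly_once:
            return "streaming: Kafka + Flink/Spark Structured Streaming"
        return "streaming: Kafka + consumer groups"
    if daily_volume_tb > 1.0:
        return "batch: Spark/Databricks"
    return "batch: dbt + warehouse compute"

# Example: nightly finance reporting over roughly 200 GB/day
print(recommend_pipeline(realtime_required=False, daily_volume_tb=0.2))
# -> batch: dbt + warehouse compute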
| Aspect | Lambda | Kappa |
| --- | --- | --- |
| Complexity | Two codebases (batch + stream) | Single codebase |
| Maintenance | Higher (sync batch/stream logic) | Lower |
| Reprocessing | Native batch layer | Replay from source |
| Use case | ML training + real-time serving | Pure event-driven |

When to choose Lambda:
- Need to train ML models on historical data
- Complex batch transformations not feasible in streaming
- Existing batch infrastructure

When to choose Kappa:
- Event-sourced architecture
- All processing can be expressed as stream operations
- Starting fresh without legacy systems
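Under Kappa, reprocessing means replaying the source topic through the same streaming job. Below is a minimal Spark Structured Streaming sketch assuming the Kafka connector and a Delta-capable environment; the broker address, topic, and storage paths are placeholders.

# Illustrative Kappa-style job: single streaming codebase, replay by reading from the earliest offset
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kappa_events").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "orders")                       # placeholder topic
    .option("startingOffsets", "earliest")               # "replay from source" = reprocessing
    .load()
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
)

query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/orders/")  # checkpoint enables exactly-once delivery to the sink
    .outputMode("append")
    .start("s3://example-bucket/tables/orders/")
)
query.awaitTermination()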
| Feature | Warehouse (Snowflake/BigQuery) | Lakehouse (Delta/Iceberg) |
| --- | --- | --- |
| Best for | BI, SQL analytics | ML, unstructured data |
| Storage cost | Higher (proprietary format) | Lower (open formats) |
| Flexibility | Schema-on-write | Schema-on-read |
| Performance | Excellent for SQL | Good, improving |
| Ecosystem | Mature BI tools | Growing ML tooling |
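To make the schema-on-write vs schema-on-read distinction concrete, the sketch below applies a schema to raw lakehouse files at read time with PySpark; the bucket path and column names are hypothetical.

# Schema-on-read sketch: structure is imposed when querying, not when the files were written
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("lakehouse_read").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("occurred_at", TimestampType()),
])

# raw JSON landed in object storage with no enforced schema
events = spark.read.schema(event_schema).json("s3://example-bucket/raw/events/")
events.groupBy(F.to_date("occurred_at").alias("day")).sum("amount").show()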
| Category | Technologies |
| --- | --- |
| Languages | Python, SQL, Scala |
| Orchestration | Airflow, Prefect, Dagster |
| Transformation | dbt, Spark, Flink |
| Streaming | Kafka, Kinesis, Pub/Sub |
| Storage | S3, GCS, Delta Lake, Iceberg |
| Warehouses | Snowflake, BigQuery, Redshift, Databricks |
| Quality | Great Expectations, dbt tests, Monte Carlo |
| Monitoring | Prometheus, Grafana, Datadog |
See references/data_pipeline_architecture.md for:
- Lambda vs Kappa architecture patterns
- Batch processing with Spark and Airflow
- Stream processing with Kafka and Flink
- Exactly-once semantics implementation
- Error handling and dead letter queues
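As a concrete illustration of the dead letter queue idea listed above, here is a sketch using the kafka-python client; the topic names and the transform function are assumptions.

# Illustrative dead letter queue: failed records are routed to a DLQ topic instead of failing the job
import json
from kafka import KafkaConsumer, KafkaProducer  # assumes the kafka-python package

consumer = KafkaConsumer("orders", bootstrap_servers="broker:9092")  # placeholder topic and broker
producer = KafkaProducer(bootstrap_servers="broker:9092")

def transform(raw: bytes) -> dict:
    """Illustrative transform: parse and validate one record."""
    record = json.loads(raw)
    if "order_id" not in record:
        raise ValueError("missing order_id")
    return record

for message in consumer:
    try:
        record = transform(message.value)
        producer.send("orders_clean", json.dumps(record).encode("utf-8"))
    except Exception as exc:
        # keep the poison record plus error context so it can be inspected and replayed later
        producer.send("orders_dlq", json.dumps({
            "error": str(exc),
            "raw": message.value.decode("utf-8", errors="replace"),
        }).encode("utf-8"))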
See references/data_modeling_patterns.md for:
- Dimensional modeling (Star/Snowflake)
- Slowly Changing Dimensions (SCD Types 1-6)
- Data Vault modeling
- dbt best practices
- Partitioning and clustering
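For a feel of what SCD Type 2 bookkeeping involves before reaching for dbt snapshots or a warehouse MERGE, here is a small pandas sketch; the column names and helper signature are illustrative, and the reference doc covers the warehouse-native versions.

# Illustrative SCD Type 2 merge: expire changed rows, append new current versions
import pandas as pd

def scd2_merge(dim: pd.DataFrame, updates: pd.DataFrame, key: str,
               tracked: list[str], effective_date: str) -> pd.DataFrame:
    """dim carries key, tracked columns, valid_from, valid_to, is_current; updates carries key + tracked."""
    current = dim[dim["is_current"]].set_index(key)
    incoming = updates.set_index(key)

    common = current.index.intersection(incoming.index)
    changed = [k for k in common
               if not current.loc[k, tracked].equals(incoming.loc[k, tracked])]
    brand_new = list(incoming.index.difference(current.index))

    # close out the old version of every changed key
    expire_mask = dim[key].isin(changed) & dim["is_current"]
    dim.loc[expire_mask, "valid_to"] = effective_date
    dim.loc[expire_mask, "is_current"] = False

    # append a fresh current row for changed and brand-new keys
    new_rows = incoming.loc[changed + brand_new, tracked].reset_index()
    new_rows["valid_from"] = effective_date
    new_rows["valid_to"] = None
    new_rows["is_current"] = True
    return pd.concat([dim, new_rows], ignore_index=True)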
See references/dataops_best_practices.md for:
- Data testing frameworks
- Data contracts and schema validation
- CI/CD for data pipelines
- Observability and lineage
- Incident response
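The freshness, completeness, and uniqueness checks invoked in the Quick Start can be prototyped in plain pandas before formalizing them in a testing framework; the thresholds and column names below are assumptions.

# Illustrative data quality checks: freshness, completeness, uniqueness
import pandas as pd

def run_quality_checks(df: pd.DataFrame, key: str, updated_col: str,
                       required: list[str], max_staleness_hours: int = 24) -> dict:
    latest = df[updated_col].max()
    # assumes the clock and the data share a timezone convention
    age = pd.Timestamp.now(tz=latest.tz) - latest
    return {
        "freshness_ok": bool(age <= pd.Timedelta(hours=max_staleness_hours)),
        "completeness": {c: float(df[c].notna().mean()) for c in required},  # non-null fraction per column
        "uniqueness_ok": bool(not df[key].duplicated().any()),
    }

# Example with hypothetical columns:
# report = run_quality_checks(sales, key="order_id", updated_col="updated_at",
#                             required=["order_id", "amount", "customer_id"])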
→ See references/troubleshooting.md for details