Tencent SkillHub · AI

ML Engineering

Provides end-to-end methodology for defining, engineering, experimenting, deploying, and operating production ML/AI systems at scale.


Install for OpenClaw

Quick setup
  1. Download the package from Yavira.
  2. Extract the archive and review SKILL.md first.
  3. Import or place the package into your OpenClaw setup.

Requirements

Target platform
OpenClaw
Install method
Manual import
Extraction
Extract archive
Prerequisites
OpenClaw
Primary doc
SKILL.md

Package facts

Download mode
Yavira redirect
Package format
ZIP package
Source platform
Tencent SkillHub
What's included
README.md, SKILL.md

Validation

  • Use the Yavira download entry.
  • Review SKILL.md after the package is downloaded.
  • Confirm the extracted package contains the expected setup assets.

Install with your agent

Agent handoff

Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.

  1. Download the package from Yavira.
  2. Extract it into a folder your agent can access.
  3. Paste one of the prompts below and point your agent at the extracted folder.
New install

I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Then review README.md for any prerequisites, environment setup, or post-install checks. Tell me what you changed and call out any manual steps you could not complete.

Upgrade existing

I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Then review README.md for any prerequisites, environment setup, or post-install checks. Summarize what changed and any follow-up checks I should run.

Trust & source

Release facts

Source
Tencent SkillHub
Verification
Indexed source record
Version
1.0.0

Documentation

Primary doc: SKILL.md (37 sections)

ML & AI Engineering System

Complete methodology for building, deploying, and operating production ML/AI systems, from experiment to scale.

Phase 1: Problem Framing

Before writing any code, define the ML problem precisely.

ML Problem Brief

problem_brief:
  business_objective: ""      # What business metric improves?
  success_metric: ""          # Quantified target (e.g., "reduce churn 15%")
  baseline: ""                # Current performance without ML
  ml_task_type: ""            # classification | regression | ranking | generation | clustering | anomaly_detection | recommendation
  prediction_target: ""       # What exactly are we predicting?
  prediction_consumer: ""     # Who/what uses the prediction? (API | dashboard | email | automated action)
  latency_requirement: ""     # real-time (<100ms) | near-real-time (<1s) | batch (minutes-hours)
  data_available: ""          # What data exists today?
  data_gaps: ""               # What's missing?
  ethical_considerations: ""  # Bias risks, fairness requirements, privacy
  kill_criteria:              # When to abandon the ML approach
    - "Baseline heuristic achieves >90% of ML performance"
    - "Data quality too poor after 2 weeks of cleaning"
    - "Model can't beat random by >10% on holdout set"

ML vs Rules Decision

Signal | Use Rules | Use ML
Logic is explainable in <10 rules | ✅ | ❌
Pattern is too complex for humans | ❌ | ✅
Training data >1,000 labeled examples | n/a | ✅
Needs to adapt to new patterns | ❌ | ✅
Must be 100% auditable/deterministic | ✅ | ❌
Pattern changes faster than you can update rules | ❌ | ✅

Rule of thumb: Start with rules/heuristics. Only add ML when rules fail to capture the pattern.

Data Quality Assessment

Score each data source (0-5 per dimension):

Dimension | 0 (Terrible) | 5 (Excellent)
Completeness | >50% missing | <1% missing
Accuracy | Known errors, no validation | Validated against source of truth
Consistency | Different formats, duplicates | Standardized, deduplicated
Timeliness | Months stale | Real-time or daily refresh
Relevance | Weak proxy for target | Direct signal for prediction
Volume | <100 samples | >10,000 samples per class

Minimum score to proceed: 18/30. Below 18 → fix data first, don't build models.
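A minimal sketch of this gate in Python; the dimension names and the 18/30 threshold come from the table above, while the example scores are hypothetical:

DIMENSIONS = ["completeness", "accuracy", "consistency", "timeliness", "relevance", "volume"]

def assess(scores: dict[str, int], minimum: int = 18) -> bool:
    # Each dimension is scored 0-5; refuse to proceed below the minimum total.
    missing = set(DIMENSIONS) - scores.keys()
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    total = sum(scores[d] for d in DIMENSIONS)
    print(f"data quality score: {total}/30")
    return total >= minimum

# Hypothetical source: timeliness and volume drag it below the gate (17/30).
ok = assess({"completeness": 4, "accuracy": 4, "consistency": 3,
             "timeliness": 2, "relevance": 4, "volume": 0})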

Feature Engineering Patterns

feature_types:
  numerical:
    - raw_value            # Use as-is if normally distributed
    - log_transform        # Right-skewed distributions (revenue, counts)
    - standardize          # z-score for algorithms sensitive to scale (SVM, KNN, neural nets)
    - bin_to_categorical   # When relationship is non-linear and data is limited
  categorical:
    - one_hot              # <20 categories, tree-based models handle natively
    - target_encoding      # High-cardinality (>20 categories), use with K-fold to prevent leakage
    - embedding            # Very high-cardinality (user IDs, product IDs) with deep learning
  temporal:
    - lag_features         # Value at t-1, t-7, t-30
    - rolling_statistics   # Mean, std, min, max over windows
    - time_since_event     # Days since last purchase, hours since login
    - cyclical_encoding    # sin/cos for hour-of-day, day-of-week, month
  text:
    - tfidf                # Simple, interpretable, good baseline
    - sentence_embeddings  # Semantic similarity, modern NLP
    - llm_extraction       # Use LLM to extract structured fields from unstructured text
  interaction:
    - ratios               # Feature A / Feature B (e.g., clicks/impressions = CTR)
    - differences          # Feature A - Feature B (e.g., price - competitor_price)
    - polynomial           # A * B, A^2 (use sparingly with high-cardinality features)
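The two patterns above that are easiest to get wrong are cyclical encoding and leakage-safe target encoding. A minimal pandas sketch of both, assuming a DataFrame with hypothetical "hour", high-cardinality "city", and binary "target" columns:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def add_cyclical_hour(df: pd.DataFrame) -> pd.DataFrame:
    # sin/cos encoding so hour 23 and hour 0 end up close together
    df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
    df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
    return df

def kfold_target_encode(df: pd.DataFrame, col: str, target: str, n_splits: int = 5) -> pd.Series:
    # Encode each row using target means computed on the OTHER folds only,
    # so the encoding never sees the row's own label (prevents leakage).
    encoded = pd.Series(index=df.index, dtype=float)
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        fold_means = df.iloc[train_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = df.iloc[val_idx][col].map(fold_means).fillna(global_mean).values
    return encoded

# Usage (df is your DataFrame):
# df = add_cyclical_hour(df)
# df["city_te"] = kfold_target_encode(df, "city", "target")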

Feature Store Design

feature_store:
  offline_store:    # For training: batch computed, stored in data warehouse
    storage: "BigQuery | Snowflake | S3+Parquet"
    compute: "Spark | dbt | SQL"
    refresh: "daily | hourly"
  online_store:     # For serving: low-latency lookups
    storage: "Redis | DynamoDB | Feast online"
    latency_target: "<10ms p99"
    refresh: "streaming | near-real-time"
  registry:         # Feature metadata
    naming: "{entity}_{feature_name}_{window}_{aggregation}"  # e.g., user_purchase_count_30d_sum
    documentation:
      - description
      - data_type
      - source_table
      - owner
      - created_date
      - known_issues

Data Leakage Prevention Checklist

  • No future information in features (time-travel check)
  • Train/val/test split done BEFORE feature engineering
  • Target encoding uses only training fold statistics
  • No features derived from the target variable
  • Temporal splits for time-series (no random shuffle)
  • Holdout set created BEFORE any EDA
  • Duplicates removed BEFORE splitting (same entity not in train AND test)
  • Normalization/scaling fit on train, applied to val/test
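A minimal scikit-learn sketch of the last two items: split first, then fit any preprocessing on the training fold only (the data here is random placeholder data):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = np.random.rand(1000, 8), np.random.randint(0, 2, 1000)  # placeholder data

# Split BEFORE any fitting so test-set statistics never leak into preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scaler = StandardScaler().fit(X_train)  # fit on train only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)     # apply train statistics to test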

Experiment Tracking Template

experiment:
  id: "EXP-{YYYY-MM-DD}-{NNN}"
  hypothesis: ""           # "Adding user tenure features will improve churn prediction AUC by >2%"
  dataset_version: ""      # Hash or version of training data
  features_used: []        # List of feature names
  model_type: ""           # Algorithm name
  hyperparameters: {}      # All hyperparams logged
  training_time: ""        # Wall clock
  metrics:
    primary: {}            # The one metric that matters
    secondary: {}          # Supporting metrics
  baseline_comparison: ""  # Delta vs baseline
  verdict: "promoted | archived | iterate"
  notes: ""
  artifacts:
    - model_path: ""
    - notebook_path: ""
    - confusion_matrix: ""
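A minimal MLflow sketch that records the same fields; the experiment name, parameter values, metrics, and artifact path are illustrative, not prescribed:

import mlflow

mlflow.set_experiment("churn-prediction")  # hypothetical experiment name

with mlflow.start_run(run_name="EXP-2024-01-15-001"):
    mlflow.set_tag("hypothesis", "user tenure features improve AUC by >2%")
    mlflow.log_param("model_type", "lightgbm")
    mlflow.log_params({"learning_rate": 0.05, "num_leaves": 63})
    mlflow.log_param("dataset_version", "d41d8cd9")  # hash of training data

    # ... train and evaluate here ...

    mlflow.log_metric("val_auc", 0.874)         # primary metric
    mlflow.log_metric("baseline_delta", 0.021)  # delta vs current baseline
    mlflow.log_artifact("confusion_matrix.png") # assumes this file was written above
    mlflow.set_tag("verdict", "promoted")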

Model Selection Guide

Task | Start With | Scale To | Avoid
Tabular classification | XGBoost/LightGBM | Neural nets only if >100K samples | Deep learning on <10K samples
Tabular regression | XGBoost/LightGBM | CatBoost for high-cardinality cats | Linear regression without feature engineering
Image classification | Fine-tune ResNet/EfficientNet | Vision Transformer if >100K images | Training from scratch
Text classification | Fine-tune BERT/RoBERTa | LLM few-shot if labeled data scarce | Bag-of-words for nuanced tasks
Text generation | GPT-4/Claude API | Fine-tuned Llama/Mistral for cost | Training from scratch
Time series | Prophet/ARIMA baseline → LightGBM | Temporal Fusion Transformer | LSTM without strong reason
Recommendation | Collaborative filtering baseline | Two-tower neural | Complex models on <1K users
Anomaly detection | Isolation Forest | Autoencoder if high-dimensional | Supervised methods without labeled anomalies
Search/ranking | BM25 baseline → Learning to Rank | Cross-encoder reranking | Pure keyword without semantic

Hyperparameter Tuning Strategy

  1. Manual first: understand the 3-5 most impactful parameters
  2. Bayesian optimization (Optuna): 50-100 trials for production models
  3. Grid search: only for final fine-tuning of 2-3 parameters
  4. Random search: better than grid for >4 parameters

Key hyperparameters by model:

Model | Critical Params | Typical Range
XGBoost | learning_rate, max_depth, n_estimators, min_child_weight | 0.01-0.3, 3-10, 100-1000, 1-10
LightGBM | learning_rate, num_leaves, feature_fraction, min_data_in_leaf | 0.01-0.3, 15-255, 0.5-1.0, 5-100
Neural Net | learning_rate, batch_size, hidden_dims, dropout | 1e-5 to 1e-2, 32-512, arch-dependent, 0.1-0.5
Random Forest | n_estimators, max_depth, min_samples_leaf | 100-1000, 5-30, 1-20
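A minimal Optuna sketch over the LightGBM ranges in the table above; it assumes X_train and y_train already exist and that ROC-AUC is the objective (both assumptions):

import optuna
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

def objective(trial: optuna.Trial) -> float:
    # Search spaces mirror the LightGBM row of the table above.
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.5, 1.0),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 5, 100),
    }
    model = lgb.LGBMClassifier(**params, n_estimators=300)
    return cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)  # 50-100 trials, as recommended above
print(study.best_params, study.best_value)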

Metric Selection by Task

Task | Primary Metric | When to Use | Watch Out For
Binary classification (balanced) | F1-score | Equal importance of precision/recall | n/a
Binary classification (imbalanced) | PR-AUC | Rare positive class (<5%) | ROC-AUC hides poor performance on minority
Multi-class | Macro F1 | All classes equally important | Micro F1 if class frequency = importance
Regression | MAE | Outliers should not dominate | RMSE penalizes large errors more
Ranking | NDCG@K | Top-K results matter most | MAP if binary relevance
Generation | Human eval + automated | Quality is subjective | BLEU/ROUGE alone are insufficient
Anomaly detection | Precision@K | False positives are expensive | Recall if missing anomalies is dangerous
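A quick scikit-learn illustration of the imbalanced row: with ~2% positives and a weak scorer, ROC-AUC looks respectable while PR-AUC exposes the poor minority-class performance. The data is synthetic, so exact numbers will vary:

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.02).astype(int)  # ~2% positives
y_score = rng.random(100_000) + 0.3 * y_true       # positives only slightly shifted up

print("ROC-AUC:", round(roc_auc_score(y_true, y_score), 3))            # looks decent (~0.75)
print("PR-AUC :", round(average_precision_score(y_true, y_score), 3))  # reveals weakness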

Evaluation Rigor Checklist

  • Metrics computed on TRUE holdout (never seen during training OR tuning)
  • Cross-validation for small datasets (<10K samples)
  • Stratified splits for imbalanced classes
  • Temporal split for time-dependent data
  • Confidence intervals reported (bootstrap or cross-val)
  • Performance broken down by important segments (geography, user cohort, etc.)
  • Fairness metrics across protected groups
  • Comparison against simple baseline (majority class, mean prediction, rules)
  • Error analysis: examined top 50 worst predictions manually
  • Calibration plot for probabilistic predictions

Offline-to-Online Gap Analysis

Before deploying, verify these don't cause train-serving skew:

Check | Offline | Online | Action
Feature computation | Batch SQL | Real-time API | Verify same logic, test with replay
Data freshness | Point-in-time snapshot | Latest value | Document acceptable staleness
Missing values | Imputed in pipeline | May be truly missing | Handle gracefully in serving
Feature distributions | Training period | Current period | Monitor drift post-deploy

Deployment Pattern Decision Tree

Is latency < 100ms required?
├── Yes → Is model < 500MB?
│         ├── Yes → Embedded serving (FastAPI + model in memory)
│         └── No  → Model server (Triton, TorchServe, vLLM)
└── No  → Is it a batch prediction?
          ├── Yes → Batch pipeline (Spark, Airflow + offline inference)
          └── No  → Async queue (Celery/SQS → worker → result store)

Production Serving Checklist

serving_config:
  model:
    format: ""            # ONNX | TorchScript | SavedModel | safetensors
    version: ""           # Semantic version
    size_mb: null
    load_time_seconds: null
  infrastructure:
    compute: ""           # CPU | GPU (T4/A10/A100/H100)
    instances: null       # Min/max for autoscaling
    autoscale_metric: ""  # RPS | latency_p99 | GPU_utilization
    autoscale_target: null
  api:
    endpoint: ""
    input_schema: {}      # Pydantic model or JSON schema
    output_schema: {}
    timeout_ms: null
    rate_limit: null
  reliability:
    health_check: "/health"
    readiness_check: "/ready"  # Model loaded and warm
    graceful_shutdown: true
    circuit_breaker: true
    fallback: ""          # Rules-based fallback when model is down
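A minimal embedded-serving sketch matching the checklist's health/readiness split; the model path and the single-list input schema are placeholder assumptions:

import joblib
import numpy as np
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
model = None  # loaded at startup so /ready can distinguish "process up" from "model warm"

class PredictRequest(BaseModel):
    features: list[float]  # replace with named fields matching your real input_schema

@app.on_event("startup")
def load_model() -> None:
    global model
    model = joblib.load("model/model.joblib")  # hypothetical path

@app.get("/health")
def health():  # liveness: the process is up
    return {"status": "ok"}

@app.get("/ready")
def ready():   # readiness: the model is loaded and warm
    if model is None:
        raise HTTPException(status_code=503, detail="model not loaded")
    return {"status": "ready"}

@app.post("/predict")
def predict(req: PredictRequest):
    if model is None:
        raise HTTPException(status_code=503, detail="model not loaded")
    proba = model.predict_proba(np.array([req.features]))[0, 1]
    return {"score": float(proba)}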

Containerization Template

# Multi-stage build for minimal image
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin
COPY model/ ./model/
COPY src/ ./src/

# Non-root user
RUN useradd -m appuser && chown -R appuser /app
USER appuser

# Health check: python:3.11-slim does not ship curl, so use the stdlib instead
HEALTHCHECK --interval=30s --timeout=5s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')" || exit 1

EXPOSE 8080
CMD ["uvicorn", "src.serve:app", "--host", "0.0.0.0", "--port", "8080"]

A/B Testing for Models

ab_test:
  name: ""
  hypothesis: ""
  primary_metric: ""          # Business metric (revenue, engagement, etc.)
  guardrail_metrics: []       # Metrics that must NOT degrade
  traffic_split:
    control: 50               # Current model
    treatment: 50             # New model
  minimum_sample_size: null   # Power analysis: use statsmodels or an online calculator
  minimum_runtime_days: null  # At least 1 full business cycle (7 days min)
  decision_criteria:
    ship: "Treatment > control by >X% with p<0.05 AND no guardrail regression"
    iterate: "Promising signal but not significant; extend test or refine model"
    kill: "No improvement after 2x minimum runtime OR guardrail breach"
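A minimal power-analysis sketch for minimum_sample_size, using statsmodels as the template suggests; the baseline rate and detectable lift are hypothetical:

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10  # control conversion rate (hypothetical)
mde = 0.11       # minimum detectable treatment rate (+10% relative lift)

# Cohen's h effect size for two proportions, then solve for per-arm sample size
effect_size = proportion_effectsize(mde, baseline)
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, ratio=1.0
)
print(f"minimum sample size per arm: {int(round(n_per_arm)):,}")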

LLM Application Architecture

┌──────────────────────────────────────────────┐
│ Application Layer                            │
│ (Prompt templates, chains, output parsers)   │
├──────────────────────────────────────────────┤
│ Orchestration Layer                          │
│ (Routing, fallback, retry, caching)          │
├──────────────────────────────────────────────┤
│ Model Layer                                  │
│ (API calls, fine-tuned models, embeddings)   │
├──────────────────────────────────────────────┤
│ Data Layer                                   │
│ (Vector store, context retrieval, memory)    │
└──────────────────────────────────────────────┘

Model Selection for LLM Tasks

Task | Best Option | Cost-Effective Option | When to Fine-Tune
General reasoning | Claude Opus / GPT-4o | Claude Sonnet / GPT-4o-mini | Never for general reasoning
Classification | Fine-tuned small model | Few-shot with Sonnet | >1,000 labeled examples + high volume
Extraction | Structured output API | Regex + LLM fallback | Consistent format needed at scale
Summarization | Claude Sonnet | GPT-4o-mini | Domain-specific style needed
Code generation | Claude Sonnet | Codestral / DeepSeek | Internal codebase conventions
Embeddings | text-embedding-3-large | text-embedding-3-small | Domain-specific vocab (medical, legal)

RAG System Architecture

rag_pipeline:
  ingestion:
    chunking:
      strategy: "semantic"   # semantic | fixed_size | recursive
      chunk_size: 512        # tokens (512-1024 for most use cases)
      overlap: 50            # tokens of overlap between chunks
      metadata_to_preserve:
        - source_document
        - page_number
        - section_heading
        - date_created
    embedding:
      model: "text-embedding-3-large"
      dimensions: 1536       # Or 256/512 with Matryoshka for cost savings
    vector_store: "Pinecone | Weaviate | pgvector | Qdrant"
  retrieval:
    strategy: "hybrid"       # dense | sparse | hybrid (recommended)
    top_k: 10                # Retrieve more, then rerank
    reranking:
      model: "Cohere rerank | cross-encoder"
      top_n: 3               # Final context chunks
    filters: []              # Metadata filters (date range, source, etc.)
  generation:
    model: ""
    system_prompt: |
      Answer based ONLY on the provided context.
      If the context doesn't contain the answer, say "I don't have enough information."
      Cite sources using [Source: document_name, page X].
    temperature: 0.1         # Low for factual, higher for creative
    max_tokens: null
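A minimal sketch of the fixed_size chunking strategy with overlap. It splits on words as a rough proxy for tokens; a real pipeline would count tokens with the embedding model's tokenizer:

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    # Fixed-size chunks with overlap, measured in words as a rough token proxy.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk already covers the tail
    return chunks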

RAG Quality Checklist

  • Chunking preserves semantic meaning (not cutting mid-sentence)
  • Metadata enables filtering (dates, sources, categories)
  • Retrieval returns relevant chunks (test with 50+ queries manually)
  • Reranking improves precision (compare with/without)
  • System prompt prevents hallucination (tested with adversarial queries)
  • Sources are cited and verifiable
  • Handles "I don't know" gracefully
  • Latency acceptable (<3s for interactive, <30s for complex)
  • Cost per query tracked and within budget

LLM Cost Optimization

Strategy | Savings | Trade-off
Prompt caching | 50-90% on repeated prefixes | Requires cache-friendly prompt design
Model routing (small → large) | 40-70% | Slightly higher latency, need router logic
Batch API | 50% | Hours of delay, batch-only workloads
Shorter prompts | Linear with token reduction | May reduce quality
Fine-tuned small model | 80-95% vs large model API | Training cost + maintenance
Semantic caching | 50-80% for similar queries | May return stale/wrong cached result
Output token limits | Proportional | May truncate useful information
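A minimal semantic-caching sketch: embed the incoming query and return a cached answer if a prior query is close enough in cosine similarity. The embed function is a placeholder for whatever embedding model you use, and the 0.95 threshold is an assumption to tune; set it too low and you hit exactly the stale/wrong-answer trade-off noted above:

import numpy as np

CACHE: list[tuple[np.ndarray, str]] = []  # (query embedding, cached answer)
SIM_THRESHOLD = 0.95                      # assumption: tune on real traffic

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("call your embedding model here")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cached_answer(query: str) -> str | None:
    # Linear scan is fine for a sketch; use a vector index at scale.
    q = embed(query)
    for emb, answer in CACHE:
        if cosine(q, emb) >= SIM_THRESHOLD:
            return answer  # cache hit: skip the LLM call
    return None

def store(query: str, answer: str) -> None:
    CACHE.append((embed(query), answer))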

Monitoring Dashboard

monitoring:
  model_performance:
    metrics:
      - name: "primary_metric"  # Same as offline evaluation
        threshold: null         # Alert if below
        window: "1h | 1d | 7d"
      - name: "prediction_distribution"
        alert: "KL divergence > 0.1 from training distribution"
  latency:
    p50_ms: null
    p95_ms: null
    p99_ms: null
    alert_threshold_ms: null
  throughput:
    requests_per_second: null
    error_rate_threshold: 0.01  # Alert if >1% errors
  data_drift:
    method: "PSI | KS-test | JS-divergence"
    features_to_monitor: []     # Top 10 most important features
    check_frequency: "hourly | daily"
    alert_threshold: null       # PSI > 0.2 = significant drift
  concept_drift:
    method: "performance_degradation"
    ground_truth_delay: ""      # How long until we get labels?
    proxy_metrics: []           # Metrics available before ground truth
    retraining_trigger: ""      # When to retrain

Drift Response Playbook

Drift Type | Detection | Severity | Response
Feature drift (input distribution shifts) | PSI > 0.1 | Warning | Investigate cause, monitor performance
Feature drift (PSI > 0.25) | PSI > 0.25 | Critical | Retrain on recent data within 24h
Concept drift (relationship changes) | Performance drop >5% | Critical | Retrain with new labels, review features
Label drift (target distribution changes) | Chi-square test | Warning | Verify label quality, check for data issues
Prediction drift (output distribution shifts) | KL divergence | Warning | May indicate upstream data issue
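A minimal PSI implementation consistent with the thresholds in this playbook; bin edges are fixed from the training ("expected") distribution, and the bin count is a common default rather than a requirement:

import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    # Population Stability Index between training ("expected") and live ("actual") values.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    # Clip live values into the training range so nothing falls outside the bins.
    a = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

# PSI < 0.1: stable; 0.1-0.25: warning; > 0.25: critical (retrain within 24h)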

Automated Retraining Pipeline

retraining:
  triggers:
    - type: "scheduled"
      frequency: "weekly | monthly"
    - type: "performance"
      condition: "primary_metric < threshold for 24h"
    - type: "drift"
      condition: "PSI > 0.2 on any top-10 feature"
  pipeline:
    1_data_validation:
      - check_completeness
      - check_distribution_shift
      - check_label_quality
    2_training:
      - use_latest_N_months_data
      - same_hyperparameters_as_production  # Unless scheduled tuning
      - log_all_metrics
    3_evaluation:
      - compare_vs_production_model
      - must_beat_production_on_primary_metric
      - must_not_regress_on_guardrail_metrics
      - evaluate_on_golden_test_set
    4_deployment:
      - canary_deployment: 5%
      - monitor_for: "4h minimum"
      - auto_rollback_if: "error_rate > 2x baseline"
      - gradual_rollout: "5% → 25% → 50% → 100%"
    5_notification:
      - log_retraining_event
      - notify_team_on_failure
      - update_model_registry

ML Platform Components

Component | Purpose | Tools
Experiment tracking | Log runs, compare results | MLflow, W&B, Neptune
Feature store | Centralized feature management | Feast, Tecton, Hopsworks
Model registry | Version, stage, approve models | MLflow Registry, SageMaker
Pipeline orchestration | DAG-based ML workflows | Airflow, Prefect, Dagster, Kubeflow
Model serving | Low-latency inference | Triton, TorchServe, vLLM, BentoML
Monitoring | Drift, performance, data quality | Evidently, Whylogs, Great Expectations
Vector store | Embedding storage for RAG | Pinecone, Weaviate, pgvector, Qdrant
GPU management | Training and inference compute | K8s + GPU operator, RunPod, Modal

CI/CD for ML

ml_cicd:
  on_code_change:
    - lint_and_type_check
    - unit_tests (data transforms, feature logic)
    - integration_tests (pipeline end-to-end on sample data)
  on_data_change:
    - data_validation (Great Expectations / custom)
    - feature_pipeline_run
    - smoke_test_predictions
  on_model_change:
    - full_evaluation_suite
    - bias_and_fairness_check
    - performance_regression_test
    - model_size_and_latency_check
    - security_scan (model file, dependencies)
    - staging_deployment
    - integration_test_in_staging
    - approval_gate (manual for major versions)
    - canary_deployment

Model Registry Workflow

┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│ Development  │ ───→ │   Staging    │ ───→ │  Production  │
│              │      │              │      │              │
│ - Experiment │      │ - Eval suite │      │ - Canary     │
│ - Log metrics│      │ - Load test  │      │ - Monitor    │
│ - Compare    │      │ - Approval   │      │ - Rollback   │
└──────────────┘      └──────────────┘      └──────────────┘

Promotion criteria:
  • Dev → Staging: Beats current production on offline metrics
  • Staging → Production: Passes load test + integration test + human approval
  • Auto-rollback: Error rate >2x OR latency >2x OR primary metric drops >5%

Bias Detection Checklist

  • Training data represents all demographic groups proportionally
  • Performance metrics broken down by protected attributes
  • Equal opportunity: similar true positive rates across groups
  • Calibration: predicted probabilities match actual rates per group
  • No proxy features for protected attributes (ZIP code → race)
  • Fairness metric selected and threshold defined BEFORE training
  • Disparate impact ratio >0.8 (80% rule)
  • Edge cases tested: what happens with unusual inputs?
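A minimal sketch of the 80% rule check; the column names are hypothetical and y_pred holds binary decisions:

import pandas as pd

def disparate_impact(df: pd.DataFrame, group_col: str, pred_col: str) -> float:
    # Ratio of the lowest group's positive-decision rate to the highest group's.
    # The 80% rule flags a ratio below 0.8.
    rates = df.groupby(group_col)[pred_col].mean()
    return float(rates.min() / rates.max())

# Hypothetical usage: df has "gender" (protected attribute) and "y_pred" (0/1 decision)
# ratio = disparate_impact(df, "gender", "y_pred")
# assert ratio > 0.8, f"disparate impact {ratio:.2f} fails the 80% rule"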

Model Card Template

model_card:
  model_name: ""
  version: ""
  date: ""
  owner: ""
  description: ""
  intended_use: ""
  out_of_scope_uses: ""
  training_data:
    source: ""
    size: ""
    date_range: ""
    known_biases: ""
  evaluation:
    metrics: {}
    datasets: []
    sliced_metrics: {}  # Performance by subgroup
  limitations: []
  ethical_considerations: []
  maintenance:
    retraining_schedule: ""
    monitoring: ""
    contact: ""

GPU Selection Guide

Use Case | GPU | VRAM | Cost/hr (cloud) | Best For
Fine-tune 7B model | A10G | 24GB | ~$1 | LoRA/QLoRA fine-tuning
Fine-tune 70B model | A100 80GB | 80GB | ~$4 | Full fine-tuning medium models
Serve 7B model | T4 | 16GB | ~$0.50 | Inference at scale
Serve 70B model | A100 40GB | 40GB | ~$2 | Large model inference
Train from scratch | H100 | 80GB | ~$8 | Pre-training, large-scale training

Inference Optimization Techniques

Technique | Speedup | Quality Impact | Complexity
Quantization (INT8) | 2-3x | <1% degradation | Low
Quantization (INT4/GPTQ) | 3-4x | 1-3% degradation | Medium
Batching | 2-10x throughput | None | Low
KV-cache optimization | 20-40% memory savings | None | Medium
Speculative decoding | 2-3x for LLMs | None (mathematically exact) | High
Model distillation | 5-10x smaller model | 2-5% degradation | High
ONNX Runtime | 1.5-3x | None | Low
TensorRT | 2-5x | <1% | Medium
vLLM (PagedAttention) | 2-4x throughput for LLMs | None | Low
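A minimal example of the first row: post-training dynamic INT8 quantization in PyTorch, which stores Linear weights as INT8 and quantizes activations on the fly (CPU inference). The tiny model here is a placeholder; in practice you would load your trained network:

import torch
import torch.nn as nn

# Placeholder model standing in for a trained network.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 2)).eval()

# Dynamic quantization: INT8 weights, activations quantized per batch at runtime.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # same interface, smaller weights, faster CPU matmuls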

Cost Tracking Template

ml_costs:
  training:
    compute_cost_per_run: null
    runs_per_month: null
    data_storage_monthly: null
    experiment_tracking: null
  inference:
    cost_per_1k_predictions: null
    daily_volume: null
    monthly_cost: null
    cost_per_query_breakdown:
      compute: null
      model_api_calls: null
      vector_db: null
      data_transfer: null
  optimization_targets:
    cost_per_prediction: null  # Target
    monthly_budget: null
    cost_reduction_goal: ""

Phase 11: ML System Quality Rubric

Score your ML system (0-100):

Dimension | Weight | 0-2 (Poor) | 3-4 (Good) | 5 (Excellent)
Problem framing | 15% | No clear business metric | Defined success metric | Kill criteria + baseline + ROI estimate
Data quality | 15% | Ad-hoc data, no validation | Automated quality checks | Feature store + lineage + versioning
Experiment rigor | 15% | No tracking, one-off notebooks | MLflow/W&B tracking | Reproducible pipelines + proper evaluation
Model performance | 15% | Barely beats baseline | Significant improvement | Calibrated, fair, robust to edge cases
Deployment | 10% | Manual deployment | CI/CD for models | Canary + auto-rollback + A/B testing
Monitoring | 15% | No monitoring | Basic metrics dashboard | Drift detection + auto-retraining + alerts
Documentation | 5% | Nothing documented | Model card exists | Full model card + runbooks + decision log
Cost efficiency | 10% | No cost tracking | Budget exists | Optimized inference + cost-per-prediction tracking

Scoring:
  • 80-100: Production-grade ML system
  • 60-79: Good foundations, missing operational maturity
  • 40-59: Prototype quality, not ready for production
  • <40: Science project, needs fundamental rework

Common Mistakes

Mistake | Fix
Optimizing model before fixing data | Data quality > model complexity. Always.
Using accuracy on imbalanced data | Use PR-AUC, F1, or domain-specific metric
No baseline comparison | Always start with simple heuristic baseline
Training on future data | Temporal splits for time-series, strict leakage checks
Deploying without monitoring | No model in production without drift detection
Fine-tuning when prompting works | Try few-shot prompting first; fine-tune only for scale/cost
GPU for everything | CPU inference is often sufficient and 10x cheaper
Ignoring calibration | If probabilities matter (risk scoring), calibrate
One-time model deployment | ML is a continuous system; plan for retraining from day 1
Premature scaling | Prove value with batch predictions before building real-time serving

Quick Commands

"Frame ML problem" β†’ Phase 1 brief "Assess data quality" β†’ Phase 2 scoring "Select model" β†’ Phase 3 guide "Evaluate model" β†’ Phase 4 checklist "Deploy model" β†’ Phase 5 serving config "Build RAG" β†’ Phase 6 RAG architecture "Set up monitoring" β†’ Phase 7 dashboard "Optimize costs" β†’ Phase 10 tracking "Score ML system" β†’ Phase 11 rubric "Detect drift" β†’ Phase 7 playbook "A/B test model" β†’ Phase 5 template "Create model card" β†’ Phase 9 template

Category context

Agent frameworks, memory systems, reasoning layers, and model-native orchestration.

Source: Tencent SkillHub

Largest current source with strong distribution and engagement signals.

Package contents

Included in package (2 docs):
  • SKILL.md (primary doc)
  • README.md