{
  "schemaVersion": "1.0",
  "item": {
    "slug": "ml-pipeline",
    "name": "ML Pipeline",
    "source": "tencent",
    "type": "skill",
    "category": "AI 智能",
    "sourceUrl": "https://clawhub.ai/ahuserious/ml-pipeline",
    "canonicalUrl": "https://clawhub.ai/ahuserious/ml-pipeline",
    "targetPlatform": "OpenClaw"
  },
  "install": {
    "downloadMode": "redirect",
    "downloadUrl": "/downloads/ml-pipeline",
    "sourceDownloadUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=ml-pipeline",
    "sourcePlatform": "tencent",
    "targetPlatform": "OpenClaw",
    "installMethod": "Manual import",
    "extraction": "Extract archive",
    "prerequisites": [
      "OpenClaw"
    ],
    "packageFormat": "ZIP package",
    "includedAssets": [
      "SKILL.md",
      "assets/README.md",
      "references/README.md",
      "scripts/README.md",
      "scripts/data_validation.py",
      "scripts/data_visualizer.py"
    ],
    "primaryDoc": "SKILL.md",
    "quickSetup": [
      "Download the package from Yavira.",
      "Extract the archive and review SKILL.md first.",
      "Import or place the package into your OpenClaw setup."
    ],
    "agentAssist": {
      "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
      "steps": [
        "Download the package from Yavira.",
        "Extract it into a folder your agent can access.",
        "Paste one of the prompts below and point your agent at the extracted folder."
      ],
      "prompts": [
        {
          "label": "New install",
          "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
        },
        {
          "label": "Upgrade existing",
          "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
        }
      ]
    },
    "sourceHealth": {
      "source": "tencent",
      "status": "healthy",
      "reason": "direct_download_ok",
      "recommendedAction": "download",
      "checkedAt": "2026-04-30T16:55:25.780Z",
      "expiresAt": "2026-05-07T16:55:25.780Z",
      "httpStatus": 200,
      "finalUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=network",
      "contentType": "application/zip",
      "probeMethod": "head",
      "details": {
        "probeUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=network",
        "contentDisposition": "attachment; filename=\"network-1.0.0.zip\"",
        "redirectLocation": null,
        "bodySnippet": null
      },
      "scope": "source",
      "summary": "Source download looks usable.",
      "detail": "Yavira can redirect you to the upstream package for this source.",
      "primaryActionLabel": "Download for OpenClaw",
      "primaryActionHref": "/downloads/ml-pipeline"
    },
    "validation": {
      "installChecklist": [
        "Use the Yavira download entry.",
        "Review SKILL.md after the package is downloaded.",
        "Confirm the extracted package contains the expected setup assets."
      ],
      "postInstallChecks": [
        "Confirm the extracted package includes the expected docs or setup files.",
        "Validate the skill or prompts are available in your target agent workspace.",
        "Capture any manual follow-up steps the agent could not complete."
      ]
    },
    "downloadPageUrl": "https://openagent3.xyz/downloads/ml-pipeline",
    "agentPageUrl": "https://openagent3.xyz/skills/ml-pipeline/agent",
    "manifestUrl": "https://openagent3.xyz/skills/ml-pipeline/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/ml-pipeline/agent.md"
  },
  "agentAssist": {
    "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
    "steps": [
      "Download the package from Yavira.",
      "Extract it into a folder your agent can access.",
      "Paste one of the prompts below and point your agent at the extracted folder."
    ],
    "prompts": [
      {
        "label": "New install",
        "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
      },
      {
        "label": "Upgrade existing",
        "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
      }
    ]
  },
  "documentation": {
    "source": "clawhub",
    "primaryDoc": "SKILL.md",
    "sections": [
      {
        "title": "ML Pipeline",
        "body": "Unified skill for the complete ML pipeline within a quant trading research system.\nConsolidates eight prior skills into a single authoritative reference covering\nthe full lifecycle: data validation, feature creation, selection,\ntransformation, anti-leakage checks, pipeline automation, deep learning optimization, and deployment."
      },
      {
        "title": "1. When to Use",
        "body": "Activate this skill when the task involves any of the following:\n\nCreating, selecting, or transforming features for an ML-driven strategy.\nAuditing an existing feature pipeline for data leakage or overfitting risk.\nAutomating an end-to-end ML pipeline (data prep through model export).\nEvaluating feature importance, scaling, encoding, or interaction effects.\nIntegrating features with a feature store (Feast, Tecton, custom Parquet store).\nExplaining core ML concepts (bias-variance, cross-validation, regularisation)\nin the context of feature engineering decisions."
      },
      {
        "title": "2. Inputs to Gather",
        "body": "Before starting work, collect or confirm:\n\nInputDetailsObjectiveTarget metric (Sharpe, accuracy, RMSE ...), constraints, time horizon.DataSymbols / instruments, timeframe, bar type, sampling frequency, data sources.Leakage risksPoint-in-time concerns, survivorship bias, look-ahead in labels or features.Compute budgetCPU/GPU limits, wall-clock budget for AutoML search.LatencyOnline vs. offline inference, acceptable prediction latency.InterpretabilityRegulatory or research need for explainable features / models.Deployment targetWhere the model will run (notebook, backtest harness, live engine)."
      },
      {
        "title": "3.1 Numerical Features",
        "body": "Interaction terms: price * volume, high / low, close - open.\nRolling statistics: mean, std, skew, kurtosis over configurable windows.\nPolynomial / log transforms: log(volume + 1), spread^2.\nBinning / discretisation: equal-width, quantile-based, or domain-driven bins."
      },
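      {
        "title": "3.1.1 Example: Numerical Feature Sketch",
        "body": "A minimal pandas sketch of the numerical patterns above (interaction terms, rolling statistics, log transform, quantile bins). The OHLCV column names and the 20-bar window are illustrative assumptions, not requirements of this skill.\n\nimport numpy as np\nimport pandas as pd\n\ndef add_numerical_features(df: pd.DataFrame, window: int = 20) -> pd.DataFrame:\n    out = df.copy()\n    # Interaction terms\n    out['px_vol'] = out['close'] * out['volume']\n    out['hl_ratio'] = out['high'] / out['low']\n    out['co_diff'] = out['close'] - out['open']\n    # Rolling statistics over a configurable window\n    roll = out['close'].rolling(window)\n    out['roll_mean'] = roll.mean()\n    out['roll_std'] = roll.std()\n    out['roll_skew'] = roll.skew()\n    # Log transform and quantile binning\n    out['log_volume'] = np.log1p(out['volume'])\n    out['vol_bin'] = pd.qcut(out['volume'], 4, labels=False, duplicates='drop')\n    return out"
      },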
      {
        "title": "3.2 Categorical Features",
        "body": "One-hot encoding: for low-cardinality categoricals (sector, exchange).\nTarget encoding: mean-target per category with smoothing (careful of leakage -- use only in-fold means).\nOrdinal encoding: when categories have a natural order (credit rating)."
      },
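      {
        "title": "3.2.1 Example: In-Fold Target Encoding",
        "body": "A hedged sketch of smoothed target encoding computed from in-fold means only, as cautioned above. KFold is shown for brevity; on ordered financial data substitute a purged time-series split (see section 4.3). Column and parameter names are placeholders.\n\nimport pandas as pd\nfrom sklearn.model_selection import KFold\n\ndef target_encode_oof(df, cat_col, target_col, n_splits=5, smoothing=10.0):\n    enc = pd.Series(index=df.index, dtype=float)\n    global_mean = df[target_col].mean()\n    for train_idx, val_idx in KFold(n_splits).split(df):\n        tr = df.iloc[train_idx]\n        # Category means computed on the training fold only\n        stats = tr.groupby(cat_col)[target_col].agg(['mean', 'count'])\n        smoothed = (stats['count'] * stats['mean'] + smoothing * global_mean) / (stats['count'] + smoothing)\n        enc.iloc[val_idx] = df[cat_col].iloc[val_idx].map(smoothed).fillna(global_mean).to_numpy()\n    return enc"
      },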
      {
        "title": "3.3 Time-Series Specific",
        "body": "Lag features: return_{t-1}, return_{t-5}, etc.\nCalendar features: day-of-week, month, quarter, options-expiry flag.\nRolling z-score: (x - rolling_mean) / rolling_std for stationarity.\nFractional differentiation: preserve memory while achieving stationarity (Lopez de Prado)."
      },
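      {
        "title": "3.3.1 Example: Lag and Rolling Z-Score Features",
        "body": "A minimal sketch of the lag, calendar, and rolling z-score patterns above; it assumes a DatetimeIndex, a close column, and a 20-bar window. Note the shift(1) on the z-score so the value at time t uses only bars up to t-1 (see section 4.2).\n\nimport pandas as pd\n\ndef add_ts_features(df: pd.DataFrame, window: int = 20) -> pd.DataFrame:\n    out = df.copy()\n    out['return_1'] = out['close'].pct_change()\n    # Lag features\n    out['return_lag_1'] = out['return_1'].shift(1)\n    out['return_lag_5'] = out['return_1'].shift(5)\n    # Calendar features (requires a DatetimeIndex)\n    out['dow'] = out.index.dayofweek\n    out['month'] = out.index.month\n    # Rolling z-score, shifted one bar to avoid look-ahead\n    mean = out['close'].rolling(window).mean()\n    std = out['close'].rolling(window).std()\n    out['zscore'] = ((out['close'] - mean) / std).shift(1)\n    return out"
      },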
      {
        "title": "3.4 Feature Selection Techniques",
        "body": "Filter methods: mutual information, variance threshold, correlation pruning.\nWrapper methods: recursive feature elimination (RFE), forward/backward selection.\nEmbedded methods: L1 regularisation, tree-based importance, SHAP values.\nPermutation importance: model-agnostic; run on out-of-fold predictions."
      },
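      {
        "title": "3.4.1 Example: Permutation Importance",
        "body": "A hedged sketch of model-agnostic permutation importance run on held-out data, per the note above. The model choice and the X_train / X_valid frames are placeholders.\n\nfrom sklearn.ensemble import RandomForestRegressor\nfrom sklearn.inspection import permutation_importance\n\n# Fit on the training fold, score importance on held-out data\nmodel = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)\nresult = permutation_importance(model, X_valid, y_valid, n_repeats=10, random_state=0)\nranked = sorted(zip(X_valid.columns, result.importances_mean), key=lambda t: -t[1])"
      },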
      {
        "title": "4. Anti-Leakage Checks",
        "body": "Data leakage is the single most common cause of inflated backtest results.\nApply these checks at every pipeline stage:"
      },
      {
        "title": "4.1 Label Leakage",
        "body": "Labels must be computed from future returns relative to the feature\ntimestamp. Verify that the label window does not overlap the feature window.\nUse purging and embargo when labels span multiple bars."
      },
      {
        "title": "4.2 Feature Leakage",
        "body": "No feature may use information from time t+1 or later at prediction time t.\nRolling statistics must use a closed left window: df['feat'].rolling(20).mean().shift(1).\nTarget-encoded categoricals must be computed on the training fold only."
      },
      {
        "title": "4.3 Cross-Validation Leakage",
        "body": "Use purged k-fold or walk-forward CV for time-series. Never use random\nk-fold on ordered data.\nInsert an embargo gap between train and test folds to prevent bleed-through\nfrom autocorrelation."
      },
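      {
        "title": "4.3.1 Example: Walk-Forward Split with Embargo",
        "body": "A minimal sketch of a walk-forward splitter with an embargo gap between train and test, as described above; window sizes are illustrative.\n\ndef walk_forward_splits(n_bars, train_size, test_size, embargo):\n    # Yield (train_idx, test_idx) pairs separated by an embargo gap\n    start = 0\n    while start + train_size + embargo + test_size <= n_bars:\n        train = list(range(start, start + train_size))\n        test_start = start + train_size + embargo\n        test = list(range(test_start, test_start + test_size))\n        yield train, test\n        start += test_size"
      },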
      {
        "title": "4.4 Survivorship & Selection Bias",
        "body": "Ensure the universe of instruments at time t reflects what was actually\ntradable at that time (delisted stocks, halted symbols removed later).\nBackfill from point-in-time databases where available."
      },
      {
        "title": "4.5 Validation Checklist",
        "body": "Run before every backtest:\n\n[ ] Labels computed strictly from future returns (no overlap with features)\n[ ] All rolling features shifted by at least 1 bar\n[ ] Target encoding uses in-fold means only\n[ ] Walk-forward or purged CV used (no random shuffle on time-series)\n[ ] Embargo gap >= max(label_horizon, autocorrelation_lag)\n[ ] Universe is point-in-time (no survivorship bias)\n[ ] No global scaling fitted on full dataset (fit on train, transform test)"
      },
      {
        "title": "5.1 Prerequisites",
        "body": "Python environment with one or more AutoML libraries:\nAuto-sklearn, TPOT, H2O AutoML, PyCaret, Optuna, or custom Optuna pipelines.\nTraining data in CSV / Parquet / database.\nProblem type identified: classification, regression, or time-series forecasting."
      },
      {
        "title": "5.2 Pipeline Steps",
        "body": "StepAction1. Define requirementsProblem type, evaluation metric, time/resource budget, interpretability needs.2. Data infrastructureLoad data, quality assessment, train/val/test split strategy, define feature transforms.3. Configure AutoMLSelect framework, define algorithm search space, set preprocessing steps, choose tuning strategy (Bayesian, random, Hyperband).4. Execute trainingRun automated feature engineering, model selection, hyperparameter optimisation, cross-validation.5. Analyse & exportCompare models, extract best config, feature importance, visualisations, export for deployment."
      },
      {
        "title": "5.3 Pipeline Configuration Template",
        "body": "pipeline_config = {\n    \"task_type\": \"classification\",        # or \"regression\", \"time_series\"\n    \"time_budget_seconds\": 3600,\n    \"algorithms\": [\"rf\", \"xgboost\", \"catboost\", \"lightgbm\"],\n    \"preprocessing\": [\"scaling\", \"encoding\", \"imputation\"],\n    \"tuning_strategy\": \"bayesian\",        # or \"random\", \"hyperband\"\n    \"cv_folds\": 5,\n    \"cv_type\": \"purged_kfold\",            # or \"walk_forward\"\n    \"embargo_bars\": 10,\n    \"early_stopping_rounds\": 50,\n    \"metric\": \"sharpe_ratio\",             # domain-specific metric\n}"
      },
      {
        "title": "5.4 Output Artifacts",
        "body": "automl_config.py -- pipeline configuration.\nbest_model.pkl / .joblib / .onnx -- serialised model.\nfeature_pipeline.pkl -- fitted preprocessing + feature transforms.\nevaluation_report.json -- metrics, confusion matrix / residuals, feature rankings.\ndeployment/ -- prediction API code, input validation, requirements.txt."
      },
      {
        "title": "6.1 Bias-Variance Trade-off",
        "body": "More features increase model capacity (lower bias) but risk overfitting (higher variance).\nUse regularisation (L1/L2), feature selection, or dimensionality reduction to manage."
      },
      {
        "title": "6.2 Evaluation Strategy",
        "body": "Walk-forward validation: the gold standard for time-series strategies.\nRoll a fixed-width training window forward; test on the next out-of-sample period.\nMonte Carlo permutation tests: shuffle labels and re-evaluate to estimate\nthe probability that observed performance is due to chance.\nCombinatorial purged CV (CPCV): generate many train/test combinations with\npurging for more robust performance estimates."
      },
      {
        "title": "6.3 Feature Scaling",
        "body": "Fit scalers (StandardScaler, MinMaxScaler, RobustScaler) on the training set only.\nApply the same fitted scaler to validation and test sets.\nRobustScaler is often preferred for financial data due to heavy tails."
      },
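      {
        "title": "6.3.1 Example: Train-Only Scaler Fit",
        "body": "A sketch of the fit-on-train rule above; the X_train / X_valid / X_test splits are placeholders.\n\nfrom sklearn.preprocessing import RobustScaler\n\nscaler = RobustScaler().fit(X_train)   # fit on the training set only\nX_train_s = scaler.transform(X_train)\nX_valid_s = scaler.transform(X_valid)  # reuse the fitted scaler\nX_test_s = scaler.transform(X_test)"
      },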
      {
        "title": "6.4 Handling Missing Data",
        "body": "Forward-fill then backward-fill for price data (be aware of leakage on backfill).\nIndicator column for missingness can itself be informative.\nTree-based models can handle NaN natively; linear models cannot."
      },
      {
        "title": "7. Workflow",
        "body": "For any feature engineering task, follow this sequence:\n\nRestate the task in measurable terms (metric, constraints, deadline).\nEnumerate required artifacts: datasets, feature definitions, configs, scripts, reports.\nPropose a default approach and 1-2 alternatives with trade-offs.\nImplement feature pipeline with anti-leakage checks built in.\nValidate with walk-forward CV, Monte Carlo, and the leakage checklist above.\nDeliver repo-ready code, documentation, and a run command."
      },
      {
        "title": "8.1 Optimizer Selection",
        "body": "OptimizerBest ForLearning RateAdamMost cases, adaptive1e-3 to 1e-4AdamWTransformers, weight decay1e-4 to 1e-5SGD + MomentumLarge batches, fine-tuning1e-2 to 1e-3RAdamStability without warmup1e-3"
      },
      {
        "title": "8.2 Learning Rate Scheduling",
        "body": "OneCycleLR: Best for short training, fast convergence\nCosineAnnealing: Smooth decay, good generalization\nReduceOnPlateau: Adaptive when validation loss plateaus\nWarmup + Decay: Standard for transformers"
      },
      {
        "title": "8.3 Regularization Techniques",
        "body": "Dropout: 0.1-0.5 for fully connected layers\nL2 (Weight Decay): 1e-4 to 1e-2\nBatch Normalization: Stabilizes training\nEarly Stopping: Monitor validation loss, patience 5-10 epochs"
      },
      {
        "title": "8.4 PyTorch Lightning Integration",
        "body": "import pytorch_lightning as pl\n\nclass TradingModel(pl.LightningModule):\n    def configure_optimizers(self):\n        optimizer = torch.optim.AdamW(self.parameters(), lr=1e-4)\n        scheduler = torch.optim.lr_scheduler.OneCycleLR(\n            optimizer, max_lr=1e-3, total_steps=self.trainer.estimated_stepping_batches\n        )\n        return [optimizer], [scheduler]"
      },
      {
        "title": "8.5 Financial Reinforcement Learning",
        "body": "State: Market features, portfolio state, position\nAction: Buy/Sell/Hold, position sizing\nReward: Risk-adjusted returns (Sharpe, Sortino)\nFrameworks: Stable-Baselines3, RLlib, FinRL"
      },
      {
        "title": "9. Error Handling",
        "body": "ProblemCauseFixAutoML search finds no good modelInsufficient time budget or poor featuresIncrease budget, engineer better features, expand algorithm search space.Out of memory during trainingDataset too large for available RAMDownsample, use incremental learning, simplify feature engineering.Model accuracy below thresholdWeak signal or overfittingCollect more data, add domain-driven features, regularise, adjust metric.Feature transforms produce NaN/InfDivision by zero, log of negativeAdd guards: np.where(denom != 0, ...), np.log1p(np.abs(x)).Optimiser fails to convergeBad hyperparameter rangesTighten search bounds, increase iterations, exclude unstable algorithms."
      },
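      {
        "title": "9.1 Example: NaN/Inf Guards",
        "body": "A sketch of the guard patterns from the error table above, using safe division so the expression is never evaluated where the denominator is zero. num, denom, and x are placeholder arrays.\n\nimport numpy as np\n\n# Safe division: zero where denom == 0, num / denom elsewhere\nratio = np.divide(num, denom, out=np.zeros_like(num, dtype=float), where=denom != 0)\n# Log transform that tolerates zero and negative inputs\nlog_x = np.log1p(np.abs(x))"
      },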
      {
        "title": "10. Bundled Scripts",
        "body": "All scripts live in scripts/ within this skill directory.\n\nScriptPurposedata_validation.pyValidate input data quality before pipeline execution.model_evaluation.pyEvaluate trained model performance and generate reports.pipeline_deployment.pyDeploy a trained pipeline to a target environment with rollback support.feature_engineering_pipeline.pyEnd-to-end feature engineering: load, clean, transform, select, train.feature_importance_analyzer.pyAnalyse feature importance (permutation, SHAP, tree-based).data_visualizer.pyVisualise feature distributions and relationships to target.feature_store_integration.pyIntegrate with feature stores (Feast, Tecton) for online/offline serving."
      },
      {
        "title": "Frameworks",
        "body": "scikit-learn -- preprocessing, feature selection, pipelines.\nAuto-sklearn / TPOT / H2O AutoML / PyCaret -- automated pipeline search.\nOptuna -- flexible hyperparameter optimisation.\nSHAP -- model-agnostic feature importance.\nFeast / Tecton -- feature store management.\nPyTorch Lightning -- https://lightning.ai/docs/pytorch/stable/\nStable-Baselines3 -- https://stable-baselines3.readthedocs.io/\nFinRL -- https://github.com/AI4Finance-Foundation/FinRL"
      },
      {
        "title": "Key References",
        "body": "Lopez de Prado, Advances in Financial Machine Learning (2018) -- purged CV, fractional differentiation, meta-labelling.\nHastie, Tibshirani & Friedman, The Elements of Statistical Learning -- bias-variance, regularisation, model selection.\nscikit-learn user guide: feature extraction, preprocessing, model selection."
      },
      {
        "title": "Best Practices",
        "body": "Always start with a simple baseline before running AutoML.\nBalance automation with domain knowledge -- blind search rarely beats informed priors.\nMonitor resource consumption; set hard timeouts.\nValidate on true out-of-sample holdout data, not just cross-validation.\nDocument every pipeline decision for reproducibility."
      }
    ],
    "body": "ML Pipeline\n\nUnified skill for the complete ML pipeline within a quant trading research system. Consolidates eight prior skills into a single authoritative reference covering the full lifecycle: data validation, feature creation, selection, transformation, anti-leakage checks, pipeline automation, deep learning optimization, and deployment.\n\n1. When to Use\n\nActivate this skill when the task involves any of the following:\n\nCreating, selecting, or transforming features for an ML-driven strategy.\nAuditing an existing feature pipeline for data leakage or overfitting risk.\nAutomating an end-to-end ML pipeline (data prep through model export).\nEvaluating feature importance, scaling, encoding, or interaction effects.\nIntegrating features with a feature store (Feast, Tecton, custom Parquet store).\nExplaining core ML concepts (bias-variance, cross-validation, regularisation) in the context of feature engineering decisions.\n2. Inputs to Gather\n\nBefore starting work, collect or confirm:\n\nInput\tDetails\nObjective\tTarget metric (Sharpe, accuracy, RMSE ...), constraints, time horizon.\nData\tSymbols / instruments, timeframe, bar type, sampling frequency, data sources.\nLeakage risks\tPoint-in-time concerns, survivorship bias, look-ahead in labels or features.\nCompute budget\tCPU/GPU limits, wall-clock budget for AutoML search.\nLatency\tOnline vs. offline inference, acceptable prediction latency.\nInterpretability\tRegulatory or research need for explainable features / models.\nDeployment target\tWhere the model will run (notebook, backtest harness, live engine).\n3. 
Feature Creation Patterns\n3.1 Numerical Features\nInteraction terms: price * volume, high / low, close - open.\nRolling statistics: mean, std, skew, kurtosis over configurable windows.\nPolynomial / log transforms: log(volume + 1), spread^2.\nBinning / discretisation: equal-width, quantile-based, or domain-driven bins.\n3.2 Categorical Features\nOne-hot encoding: for low-cardinality categoricals (sector, exchange).\nTarget encoding: mean-target per category with smoothing (careful of leakage -- use only in-fold means).\nOrdinal encoding: when categories have a natural order (credit rating).\n3.3 Time-Series Specific\nLag features: return_{t-1}, return_{t-5}, etc.\nCalendar features: day-of-week, month, quarter, options-expiry flag.\nRolling z-score: (x - rolling_mean) / rolling_std for stationarity.\nFractional differentiation: preserve memory while achieving stationarity (Lopez de Prado).\n3.4 Feature Selection Techniques\nFilter methods: mutual information, variance threshold, correlation pruning.\nWrapper methods: recursive feature elimination (RFE), forward/backward selection.\nEmbedded methods: L1 regularisation, tree-based importance, SHAP values.\nPermutation importance: model-agnostic; run on out-of-fold predictions.\n4. Anti-Leakage Checks\n\nData leakage is the single most common cause of inflated backtest results. Apply these checks at every pipeline stage:\n\n4.1 Label Leakage\nLabels must be computed from future returns relative to the feature timestamp. Verify that the label window does not overlap the feature window.\nUse purging and embargo when labels span multiple bars.\n4.2 Feature Leakage\nNo feature may use information from time t+1 or later at prediction time t.\nRolling statistics must use a closed left window: df['feat'].rolling(20).mean().shift(1).\nTarget-encoded categoricals must be computed on the training fold only.\n4.3 Cross-Validation Leakage\nUse purged k-fold or walk-forward CV for time-series. 
Never use random k-fold on ordered data.\nInsert an embargo gap between train and test folds to prevent bleed-through from autocorrelation.\n4.4 Survivorship & Selection Bias\nEnsure the universe of instruments at time t reflects what was actually tradable at that time (delisted stocks, halted symbols removed later).\nBackfill from point-in-time databases where available.\n4.5 Validation Checklist\n\nRun before every backtest:\n\n[ ] Labels computed strictly from future returns (no overlap with features)\n[ ] All rolling features shifted by at least 1 bar\n[ ] Target encoding uses in-fold means only\n[ ] Walk-forward or purged CV used (no random shuffle on time-series)\n[ ] Embargo gap >= max(label_horizon, autocorrelation_lag)\n[ ] Universe is point-in-time (no survivorship bias)\n[ ] No global scaling fitted on full dataset (fit on train, transform test)\n\n5. Pipeline Automation (AutoML)\n5.1 Prerequisites\nPython environment with one or more AutoML libraries: Auto-sklearn, TPOT, H2O AutoML, PyCaret, Optuna, or custom Optuna pipelines.\nTraining data in CSV / Parquet / database.\nProblem type identified: classification, regression, or time-series forecasting.\n5.2 Pipeline Steps\nStep\tAction\n1. Define requirements\tProblem type, evaluation metric, time/resource budget, interpretability needs.\n2. Data infrastructure\tLoad data, quality assessment, train/val/test split strategy, define feature transforms.\n3. Configure AutoML\tSelect framework, define algorithm search space, set preprocessing steps, choose tuning strategy (Bayesian, random, Hyperband).\n4. Execute training\tRun automated feature engineering, model selection, hyperparameter optimisation, cross-validation.\n5. 
Analyse & export\tCompare models, extract best config, feature importance, visualisations, export for deployment.\n5.3 Pipeline Configuration Template\npipeline_config = {\n    \"task_type\": \"classification\",        # or \"regression\", \"time_series\"\n    \"time_budget_seconds\": 3600,\n    \"algorithms\": [\"rf\", \"xgboost\", \"catboost\", \"lightgbm\"],\n    \"preprocessing\": [\"scaling\", \"encoding\", \"imputation\"],\n    \"tuning_strategy\": \"bayesian\",        # or \"random\", \"hyperband\"\n    \"cv_folds\": 5,\n    \"cv_type\": \"purged_kfold\",            # or \"walk_forward\"\n    \"embargo_bars\": 10,\n    \"early_stopping_rounds\": 50,\n    \"metric\": \"sharpe_ratio\",             # domain-specific metric\n}\n\n5.4 Output Artifacts\nautoml_config.py -- pipeline configuration.\nbest_model.pkl / .joblib / .onnx -- serialised model.\nfeature_pipeline.pkl -- fitted preprocessing + feature transforms.\nevaluation_report.json -- metrics, confusion matrix / residuals, feature rankings.\ndeployment/ -- prediction API code, input validation, requirements.txt.\n6. Core ML Fundamentals (Feature-Engineering Context)\n6.1 Bias-Variance Trade-off\nMore features increase model capacity (lower bias) but risk overfitting (higher variance).\nUse regularisation (L1/L2), feature selection, or dimensionality reduction to manage.\n6.2 Evaluation Strategy\nWalk-forward validation: the gold standard for time-series strategies. 
Roll a fixed-width training window forward; test on the next out-of-sample period.\nMonte Carlo permutation tests: shuffle labels and re-evaluate to estimate the probability that observed performance is due to chance.\nCombinatorial purged CV (CPCV): generate many train/test combinations with purging for more robust performance estimates.\n6.3 Feature Scaling\nFit scalers (StandardScaler, MinMaxScaler, RobustScaler) on the training set only.\nApply the same fitted scaler to validation and test sets.\nRobustScaler is often preferred for financial data due to heavy tails.\n6.4 Handling Missing Data\nForward-fill then backward-fill for price data (be aware of leakage on backfill).\nIndicator column for missingness can itself be informative.\nTree-based models can handle NaN natively; linear models cannot.\n7. Workflow\n\nFor any feature engineering task, follow this sequence:\n\nRestate the task in measurable terms (metric, constraints, deadline).\nEnumerate required artifacts: datasets, feature definitions, configs, scripts, reports.\nPropose a default approach and 1-2 alternatives with trade-offs.\nImplement feature pipeline with anti-leakage checks built in.\nValidate with walk-forward CV, Monte Carlo, and the leakage checklist above.\nDeliver repo-ready code, documentation, and a run command.\n8. 
Deep Learning Optimization\n8.1 Optimizer Selection\nOptimizer\tBest For\tLearning Rate\nAdam\tMost cases, adaptive\t1e-3 to 1e-4\nAdamW\tTransformers, weight decay\t1e-4 to 1e-5\nSGD + Momentum\tLarge batches, fine-tuning\t1e-2 to 1e-3\nRAdam\tStability without warmup\t1e-3\n8.2 Learning Rate Scheduling\nOneCycleLR: Best for short training, fast convergence\nCosineAnnealing: Smooth decay, good generalization\nReduceOnPlateau: Adaptive when validation loss plateaus\nWarmup + Decay: Standard for transformers\n8.3 Regularization Techniques\nDropout: 0.1-0.5 for fully connected layers\nL2 (Weight Decay): 1e-4 to 1e-2\nBatch Normalization: Stabilizes training\nEarly Stopping: Monitor validation loss, patience 5-10 epochs\n8.4 PyTorch Lightning Integration\nimport pytorch_lightning as pl\n\nclass TradingModel(pl.LightningModule):\n    def configure_optimizers(self):\n        optimizer = torch.optim.AdamW(self.parameters(), lr=1e-4)\n        scheduler = torch.optim.lr_scheduler.OneCycleLR(\n            optimizer, max_lr=1e-3, total_steps=self.trainer.estimated_stepping_batches\n        )\n        return [optimizer], [scheduler]\n\n8.5 Financial Reinforcement Learning\nState: Market features, portfolio state, position\nAction: Buy/Sell/Hold, position sizing\nReward: Risk-adjusted returns (Sharpe, Sortino)\nFrameworks: Stable-Baselines3, RLlib, FinRL\n9. 
Error Handling\nProblem\tCause\tFix\nAutoML search finds no good model\tInsufficient time budget or poor features\tIncrease budget, engineer better features, expand algorithm search space.\nOut of memory during training\tDataset too large for available RAM\tDownsample, use incremental learning, simplify feature engineering.\nModel accuracy below threshold\tWeak signal or overfitting\tCollect more data, add domain-driven features, regularise, adjust metric.\nFeature transforms produce NaN/Inf\tDivision by zero, log of negative\tAdd guards: np.where(denom != 0, ...), np.log1p(np.abs(x)).\nOptimiser fails to converge\tBad hyperparameter ranges\tTighten search bounds, increase iterations, exclude unstable algorithms.\n10. Bundled Scripts\n\nAll scripts live in scripts/ within this skill directory.\n\nScript\tPurpose\ndata_validation.py\tValidate input data quality before pipeline execution.\nmodel_evaluation.py\tEvaluate trained model performance and generate reports.\npipeline_deployment.py\tDeploy a trained pipeline to a target environment with rollback support.\nfeature_engineering_pipeline.py\tEnd-to-end feature engineering: load, clean, transform, select, train.\nfeature_importance_analyzer.py\tAnalyse feature importance (permutation, SHAP, tree-based).\ndata_visualizer.py\tVisualise feature distributions and relationships to target.\nfeature_store_integration.py\tIntegrate with feature stores (Feast, Tecton) for online/offline serving.\n11. 
Resources\nFrameworks\nscikit-learn -- preprocessing, feature selection, pipelines.\nAuto-sklearn / TPOT / H2O AutoML / PyCaret -- automated pipeline search.\nOptuna -- flexible hyperparameter optimisation.\nSHAP -- model-agnostic feature importance.\nFeast / Tecton -- feature store management.\nPyTorch Lightning -- https://lightning.ai/docs/pytorch/stable/\nStable-Baselines3 -- https://stable-baselines3.readthedocs.io/\nFinRL -- https://github.com/AI4Finance-Foundation/FinRL\nKey References\nLopez de Prado, Advances in Financial Machine Learning (2018) -- purged CV, fractional differentiation, meta-labelling.\nHastie, Tibshirani & Friedman, The Elements of Statistical Learning -- bias-variance, regularisation, model selection.\nscikit-learn user guide: feature extraction, preprocessing, model selection.\nBest Practices\nAlways start with a simple baseline before running AutoML.\nBalance automation with domain knowledge -- blind search rarely beats informed priors.\nMonitor resource consumption; set hard timeouts.\nValidate on true out-of-sample holdout data, not just cross-validation.\nDocument every pipeline decision for reproducibility."
  },
  "trust": {
    "sourceLabel": "tencent",
    "provenanceUrl": "https://clawhub.ai/ahuserious/ml-pipeline",
    "publisherUrl": "https://clawhub.ai/ahuserious/ml-pipeline",
    "owner": "ahuserious",
    "version": "0.1.0",
    "license": null,
    "verificationStatus": "Indexed source record"
  },
  "links": {
    "detailUrl": "https://openagent3.xyz/skills/ml-pipeline",
    "downloadUrl": "https://openagent3.xyz/downloads/ml-pipeline",
    "agentUrl": "https://openagent3.xyz/skills/ml-pipeline/agent",
    "manifestUrl": "https://openagent3.xyz/skills/ml-pipeline/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/ml-pipeline/agent.md"
  }
}