{
  "schemaVersion": "1.0",
  "item": {
    "slug": "data-cleaning-annotation-workflow",
    "name": "Data Cleaning & Annotation Workflow",
    "source": "tencent",
    "type": "skill",
    "category": "开发工具",
    "sourceUrl": "https://clawhub.ai/Deyashmukh/data-cleaning-annotation-workflow",
    "canonicalUrl": "https://clawhub.ai/Deyashmukh/data-cleaning-annotation-workflow",
    "targetPlatform": "OpenClaw"
  },
  "install": {
    "downloadMode": "redirect",
    "downloadUrl": "/downloads/data-cleaning-annotation-workflow",
    "sourceDownloadUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=data-cleaning-annotation-workflow",
    "sourcePlatform": "tencent",
    "targetPlatform": "OpenClaw",
    "installMethod": "Manual import",
    "extraction": "Extract archive",
    "prerequisites": [
      "OpenClaw"
    ],
    "packageFormat": "ZIP package",
    "includedAssets": [
      "SKILL.md",
      "references/platform_guide.md",
      "scripts/clean_dataset.py",
      "scripts/download_kaggle.sh"
    ],
    "primaryDoc": "SKILL.md",
    "quickSetup": [
      "Download the package from Yavira.",
      "Extract the archive and review SKILL.md first.",
      "Import or place the package into your OpenClaw setup."
    ],
    "agentAssist": {
      "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
      "steps": [
        "Download the package from Yavira.",
        "Extract it into a folder your agent can access.",
        "Paste one of the prompts below and point your agent at the extracted folder."
      ],
      "prompts": [
        {
          "label": "New install",
          "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
        },
        {
          "label": "Upgrade existing",
          "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
        }
      ]
    },
    "sourceHealth": {
      "source": "tencent",
      "status": "healthy",
      "reason": "direct_download_ok",
      "recommendedAction": "download",
      "checkedAt": "2026-04-30T16:55:25.780Z",
      "expiresAt": "2026-05-07T16:55:25.780Z",
      "httpStatus": 200,
      "finalUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=network",
      "contentType": "application/zip",
      "probeMethod": "head",
      "details": {
        "probeUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=network",
        "contentDisposition": "attachment; filename=\"network-1.0.0.zip\"",
        "redirectLocation": null,
        "bodySnippet": null
      },
      "scope": "source",
      "summary": "Source download looks usable.",
      "detail": "Yavira can redirect you to the upstream package for this source.",
      "primaryActionLabel": "Download for OpenClaw",
      "primaryActionHref": "/downloads/data-cleaning-annotation-workflow"
    },
    "validation": {
      "installChecklist": [
        "Use the Yavira download entry.",
        "Review SKILL.md after the package is downloaded.",
        "Confirm the extracted package contains the expected setup assets."
      ],
      "postInstallChecks": [
        "Confirm the extracted package includes the expected docs or setup files.",
        "Validate the skill or prompts are available in your target agent workspace.",
        "Capture any manual follow-up steps the agent could not complete."
      ]
    },
    "downloadPageUrl": "https://openagent3.xyz/downloads/data-cleaning-annotation-workflow",
    "agentPageUrl": "https://openagent3.xyz/skills/data-cleaning-annotation-workflow/agent",
    "manifestUrl": "https://openagent3.xyz/skills/data-cleaning-annotation-workflow/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/data-cleaning-annotation-workflow/agent.md"
  },
  "agentAssist": {
    "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
    "steps": [
      "Download the package from Yavira.",
      "Extract it into a folder your agent can access.",
      "Paste one of the prompts below and point your agent at the extracted folder."
    ],
    "prompts": [
      {
        "label": "New install",
        "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
      },
      {
        "label": "Upgrade existing",
        "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
      }
    ]
  },
  "documentation": {
    "source": "clawhub",
    "primaryDoc": "SKILL.md",
    "sections": [
      {
        "title": "Simulacrum Data Annotation Workflow",
        "body": "Complete end-to-end workflow for time series dataset preparation and annotation on the Data Annotation platform (data.smlcrm.com)."
      },
      {
        "title": "What This Skill Does",
        "body": "This skill captures the precise workflow for processing time series datasets (Energy, Manufacturing, Climate) from discovery to CLEAN status:\n\nFind Dataset: Search Kaggle for Energy/Manufacturing/Climate time series data\nDownload: Get CSV files via browser or Kaggle CLI\nClean: Run Python/pandas script to handle missing values, duplicates, formatting\nUpload RAW: Upload original CSV with metadata (name, domain, source URL, description)\nConfigure Headers: Set column types (Time, Target, Covariate, Group) and units\nAssign Groups: Select ALL variables (target + covariates), apply ALL group tags\nUpload Cleaned: Final upload → CLEAN status"
      },
      {
        "title": "Supported Domains",
        "body": "Energy: Power consumption, utilities, renewable energy, grid data\nManufacturing: Industrial processes, steel production, emissions, equipment data\nClimate: CO2 emissions, environmental monitoring, weather correlation data"
      },
      {
        "title": "Quick Start",
        "body": "For the full pipeline from Kaggle to annotated dataset:\n\n1. Find dataset on Kaggle\n2. Download (browser or kaggle CLI)\n3. Clean with scripts/clean_dataset.py\n4. Upload RAW dataset to data.smlcrm.com (with metadata)\n5. Click \"Clean\" and upload cleaned file\n6. Configure column metadata (types, units)\n7. Assign groups to variables\n8. Upload cleaned dataset → CLEAN status"
      },
      {
        "title": "Step 1: Find and Download Dataset",
        "body": "From Kaggle (Browser Method):\n\nNavigate to kaggle.com/datasets\nSearch for relevant dataset (e.g., \"steel industry energy consumption\", \"manufacturing emissions\", \"climate CO2\")\nReview data description, file list, and preview\nClick \"Download\" button\nExtract CSV file from downloaded zip\n\nAlternative: Kaggle CLI\n\n# Install if needed: pip install kaggle\n# Configure: kaggle competitions list\n\nscripts/download_kaggle.sh <dataset-name> [output-dir]\n# Example: scripts/download_kaggle.sh csafrit2/steel-industry-energy-consumption"
      },
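      {
        "title": "Illustrative Kaggle CLI Wrapper",
        "body": "For orientation, scripts/download_kaggle.sh likely amounts to a thin wrapper over the Kaggle CLI. The sketch below is illustrative only, not the bundled script, and assumes credentials are already in ~/.kaggle/kaggle.json:\n\n#!/usr/bin/env bash\n# Download and unzip a Kaggle dataset into an output directory (default: current dir)\nset -euo pipefail\ndataset=\"$1\"\noutdir=\"${2:-.}\"\nkaggle datasets download -d \"$dataset\" -p \"$outdir\" --unzip"
      },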
      {
        "title": "Step 2: Clean the Dataset",
        "body": "Always run the cleaning script before upload:\n\npython3 scripts/clean_dataset.py <input.csv> [-o <output.csv>]\n\nWhat the script does:\n\nStrips whitespace from column names\nRemoves duplicate rows\nFills missing numeric values with median\nFills missing categorical values with mode or 'Unknown'\nConverts timestamp columns to datetime format\nOutputs column summary for metadata configuration\n\nOutput:\n\nCleaned CSV file ready for upload\nColumn summary printed to console (save this for metadata config)"
      },
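      {
        "title": "Illustrative Cleaning Pass (pandas)",
        "body": "The cleaning steps above can be sketched in pandas. This is an illustrative outline, not the bundled scripts/clean_dataset.py, and the file names are placeholders:\n\nimport pandas as pd\n\ndf = pd.read_csv('input.csv')\ndf.columns = df.columns.str.strip()             # strip whitespace from column names\ndf = df.drop_duplicates()                       # remove duplicate rows\nfor col in df.select_dtypes('number').columns:  # numeric: fill missing with median\n    df[col] = df[col].fillna(df[col].median())\nfor col in df.select_dtypes('object').columns:  # categorical: mode, else 'Unknown'\n    mode = df[col].mode()\n    df[col] = df[col].fillna(mode.iloc[0] if not mode.empty else 'Unknown')\nprint(df.dtypes)                                # column summary for metadata config\ndf.to_csv('output.csv', index=False)"
      },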
      {
        "title": "Step 3: Upload Raw Dataset to Platform",
        "body": "Navigate to data.smlcrm.com/dashboard\nClick \"Upload Dataset\" button\nFill in metadata for the RAW dataset:\n\nName: Descriptive dataset name\nDomain: Category (Energy, Manufacturing, Climate, etc.)\nSource URL: Kaggle or original source URL\nDescription: Brief summary of the dataset\n\n\nUpload the original/raw CSV file (not cleaned yet)\nClick Upload\n\nResult: Dataset appears in list with RAW status"
      },
      {
        "title": "Step 4: Upload Cleaned File & Configure Metadata",
        "body": "Find the RAW dataset in the list\nClick \"Clean\" button\nUpload the cleaned CSV file (from Step 2)\nConfigure headers for each column:\n\nSettingDescriptionNameColumn name (editable)UnitsMeasurement units (kWh, °C, %, ratio, tCO2, etc.)TypeTime / Target / Covariate / Group\n\nColumn Type Guide:\n\nTime: Timestamp/datetime columns (usually required)\nTarget: Variable to predict (at least one required)\nCovariate: Input features/independent variables\nGroup: Categorical segment variables (WeekStatus, Day_of_week, Load_Type, etc.)\n\nBulk Configuration:\n\nSelect multiple rows via checkboxes\nUse \"Apply\" dropdown to set type for selected columns\nSet units individually or in bulk\n\nCommon Unit Patterns:\n\nEnergy: kWh, MWh, MW\nPower: kVarh, kW\nEmissions: tCO2, kgCO2\nRatios: ratio, %\nTime: seconds, minutes, hours"
      },
      {
        "title": "Step 5: Assign Groups to Variables",
        "body": "Purpose: Group variables define how data is segmented for analysis.\n\nExact Workflow:\n\nSelect ALL variables by checking their checkboxes:\n\nTarget variable(s)\nALL covariate variables\n\n\n\nApply ALL group tags to selected variables:\n\nClick first group tag (e.g., WeekStatus) → all selected get this group\nClick second group tag (e.g., Day_of_week) → all selected get this group\nClick third group tag (e.g., Load_Type) → all selected get this group\nContinue for all available group tags\n\n\n\nResult: All variables have all groups assigned (e.g., \"WeekStatus × Day_of_week × Load_Type\")\n\nImportant: Assign groups to BOTH target variables AND all covariates."
      },
      {
        "title": "Step 6: Final Upload",
        "body": "Click \"Upload Cleaned Dataset\" button\nWait for processing\nDataset status changes from RAW → CLEAN\nVerify data points count is correct"
      },
      {
        "title": "Example: Steel Industry Energy Dataset",
        "body": "Source: https://www.kaggle.com/datasets/csafrit2/steel-industry-energy-consumption\n\nMetadata:\n\nName: Steel Industry Energy Consumption (South Korea)\nDomain: Energy\nData Points: 350,400\n\nColumn Configuration:\n\nColumnTypeUnitsTimestampsTime-Usage_kWhTargetkWhLagging_Current_Reactive.Power_kVarhCovariatekVarhLeading_Current_Reactive_Power_kVarhCovariatekVarhCO2(tCO2)CovariatetCO2Lagging_Current_Power_FactorCovariateratioLeading_Current_Power_FactorCovariateratioNSMCovariatesecondsWeekStatusGroup-Day_of_weekGroup-Load_TypeGroup-\n\nGroup Assignment:\n\nSelect: Usage_kWh, Lagging_Current_Reactive.Power_kVarh, Leading_Current_Reactive_Power_kVarh, CO2(tCO2), Lagging_Current_Power_Factor, Leading_Current_Power_Factor, NSM\nClick: WeekStatus → all selected get WeekStatus\nClick: Day_of_week → all selected get Day_of_week\nClick: Load_Type → all selected get Load_Type\nFinal: All variables show \"WeekStatus × Day_of_week × Load_Type\""
      },
      {
        "title": "Reference Materials",
        "body": "For detailed platform configuration guidance, see references/platform_guide.md."
      },
      {
        "title": "Troubleshooting",
        "body": "\"Next\" button disabled:\n\nCheck at least one Time column is set\nCheck at least one Target column is set\nVerify all columns have types assigned\n\nGroups not appearing:\n\nColumns must be marked as \"Group\" type first\nProceed to next step after setting Group types\n\nUpload fails:\n\nRe-run cleaning script\nCheck CSV format (comma-delimited)\nVerify no empty column names"
      },
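      {
        "title": "Illustrative Pre-Upload Check",
        "body": "The upload-failure checks above can be automated with a short pandas snippet. This is an illustrative sketch (the file name is a placeholder), not part of the bundled scripts:\n\nimport pandas as pd\n\ndf = pd.read_csv('cleaned.csv')\n# pandas names headerless columns 'Unnamed: N', so these indicate empty column names\nbad = [c for c in df.columns if not c.strip() or c.startswith('Unnamed')]\nif bad:\n    raise SystemExit(f'Empty or unnamed columns, re-run cleaning: {bad}')\nprint(f'{len(df)} rows, {len(df.columns)} columns ready for upload')"
      },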
      {
        "title": "Scripts",
        "body": "ScriptPurposescripts/clean_dataset.pyClean and prepare CSV for uploadscripts/download_kaggle.shDownload datasets via Kaggle CLI"
      },
      {
        "title": "Platform URL",
        "body": "Data Annotation Platform: https://data.smlcrm.com"
      }
    ],
    "body": "Simulacrum Data Annotation Workflow\n\nComplete end-to-end workflow for time series dataset preparation and annotation on the Data Annotation platform (data.smlcrm.com).\n\nWhat This Skill Does\n\nThis skill captures the precise workflow for processing time series datasets (Energy, Manufacturing, Climate) from discovery to CLEAN status:\n\nFind Dataset: Search Kaggle for Energy/Manufacturing/Climate time series data\nDownload: Get CSV files via browser or Kaggle CLI\nClean: Run Python/pandas script to handle missing values, duplicates, formatting\nUpload RAW: Upload original CSV with metadata (name, domain, source URL, description)\nConfigure Headers: Set column types (Time, Target, Covariate, Group) and units\nAssign Groups: Select ALL variables (target + covariates), apply ALL group tags\nUpload Cleaned: Final upload → CLEAN status\nSupported Domains\nEnergy: Power consumption, utilities, renewable energy, grid data\nManufacturing: Industrial processes, steel production, emissions, equipment data\nClimate: CO2 emissions, environmental monitoring, weather correlation data\nQuick Start\n\nFor the full pipeline from Kaggle to annotated dataset:\n\n1. Find dataset on Kaggle\n2. Download (browser or kaggle CLI)\n3. Clean with scripts/clean_dataset.py\n4. Upload RAW dataset to data.smlcrm.com (with metadata)\n5. Click \"Clean\" and upload cleaned file\n6. Configure column metadata (types, units)\n7. Assign groups to variables\n8. 
Upload cleaned dataset → CLEAN status\n\nWorkflow Steps\nStep 1: Find and Download Dataset\n\nFrom Kaggle (Browser Method):\n\nNavigate to kaggle.com/datasets\nSearch for relevant dataset (e.g., \"steel industry energy consumption\", \"manufacturing emissions\", \"climate CO2\")\nReview data description, file list, and preview\nClick \"Download\" button\nExtract CSV file from downloaded zip\n\nAlternative: Kaggle CLI\n\n# Install if needed: pip install kaggle\n# Configure: kaggle competitions list\n\nscripts/download_kaggle.sh <dataset-name> [output-dir]\n# Example: scripts/download_kaggle.sh csafrit2/steel-industry-energy-consumption\n\nStep 2: Clean the Dataset\n\nAlways run the cleaning script before upload:\n\npython3 scripts/clean_dataset.py <input.csv> [-o <output.csv>]\n\n\nWhat the script does:\n\nStrips whitespace from column names\nRemoves duplicate rows\nFills missing numeric values with median\nFills missing categorical values with mode or 'Unknown'\nConverts timestamp columns to datetime format\nOutputs column summary for metadata configuration\n\nOutput:\n\nCleaned CSV file ready for upload\nColumn summary printed to console (save this for metadata config)\nStep 3: Upload Raw Dataset to Platform\nNavigate to data.smlcrm.com/dashboard\nClick \"Upload Dataset\" button\nFill in metadata for the RAW dataset:\nName: Descriptive dataset name\nDomain: Category (Energy, Manufacturing, Climate, etc.)\nSource URL: Kaggle or original source URL\nDescription: Brief summary of the dataset\nUpload the original/raw CSV file (not cleaned yet)\nClick Upload\n\nResult: Dataset appears in list with RAW status\n\nStep 4: Upload Cleaned File & Configure Metadata\nFind the RAW dataset in the list\nClick \"Clean\" button\nUpload the cleaned CSV file (from Step 2)\nConfigure headers for each column:\nSetting\tDescription\nName\tColumn name (editable)\nUnits\tMeasurement units (kWh, °C, %, ratio, tCO2, etc.)\nType\tTime / Target / Covariate / Group\n\nColumn Type 
Guide:\n\nTime: Timestamp/datetime columns (usually required)\nTarget: Variable to predict (at least one required)\nCovariate: Input features/independent variables\nGroup: Categorical segment variables (WeekStatus, Day_of_week, Load_Type, etc.)\n\nBulk Configuration:\n\nSelect multiple rows via checkboxes\nUse \"Apply\" dropdown to set type for selected columns\nSet units individually or in bulk\n\nCommon Unit Patterns:\n\nEnergy: kWh, MWh, MW\nPower: kVarh, kW\nEmissions: tCO2, kgCO2\nRatios: ratio, %\nTime: seconds, minutes, hours\nStep 5: Assign Groups to Variables\n\nPurpose: Group variables define how data is segmented for analysis.\n\nExact Workflow:\n\nSelect ALL variables by checking their checkboxes:\n\nTarget variable(s)\nALL covariate variables\n\nApply ALL group tags to selected variables:\n\nClick first group tag (e.g., WeekStatus) → all selected get this group\nClick second group tag (e.g., Day_of_week) → all selected get this group\nClick third group tag (e.g., Load_Type) → all selected get this group\nContinue for all available group tags\n\nResult: All variables have all groups assigned (e.g., \"WeekStatus × Day_of_week × Load_Type\")\n\nImportant: Assign groups to BOTH target variables AND all covariates.\n\nStep 6: Final Upload\nClick \"Upload Cleaned Dataset\" button\nWait for processing\nDataset status changes from RAW → CLEAN\nVerify data points count is correct\nExample: Steel Industry Energy Dataset\n\nSource: https://www.kaggle.com/datasets/csafrit2/steel-industry-energy-consumption\n\nMetadata:\n\nName: Steel Industry Energy Consumption (South Korea)\nDomain: Energy\nData Points: 350,400\n\nColumn 
Configuration:\n\nColumn\tType\tUnits\nTimestamps\tTime\t-\nUsage_kWh\tTarget\tkWh\nLagging_Current_Reactive.Power_kVarh\tCovariate\tkVarh\nLeading_Current_Reactive_Power_kVarh\tCovariate\tkVarh\nCO2(tCO2)\tCovariate\ttCO2\nLagging_Current_Power_Factor\tCovariate\tratio\nLeading_Current_Power_Factor\tCovariate\tratio\nNSM\tCovariate\tseconds\nWeekStatus\tGroup\t-\nDay_of_week\tGroup\t-\nLoad_Type\tGroup\t-\n\nGroup Assignment:\n\nSelect: Usage_kWh, Lagging_Current_Reactive.Power_kVarh, Leading_Current_Reactive_Power_kVarh, CO2(tCO2), Lagging_Current_Power_Factor, Leading_Current_Power_Factor, NSM\nClick: WeekStatus → all selected get WeekStatus\nClick: Day_of_week → all selected get Day_of_week\nClick: Load_Type → all selected get Load_Type\nFinal: All variables show \"WeekStatus × Day_of_week × Load_Type\"\nReference Materials\n\nFor detailed platform configuration guidance, see references/platform_guide.md.\n\nTroubleshooting\n\n\"Next\" button disabled:\n\nCheck at least one Time column is set\nCheck at least one Target column is set\nVerify all columns have types assigned\n\nGroups not appearing:\n\nColumns must be marked as \"Group\" type first\nProceed to next step after setting Group types\n\nUpload fails:\n\nRe-run cleaning script\nCheck CSV format (comma-delimited)\nVerify no empty column names\nScripts\nScript\tPurpose\nscripts/clean_dataset.py\tClean and prepare CSV for upload\nscripts/download_kaggle.sh\tDownload datasets via Kaggle CLI\nPlatform URL\n\nData Annotation Platform: https://data.smlcrm.com"
  },
  "trust": {
    "sourceLabel": "tencent",
    "provenanceUrl": "https://clawhub.ai/Deyashmukh/data-cleaning-annotation-workflow",
    "publisherUrl": "https://clawhub.ai/Deyashmukh/data-cleaning-annotation-workflow",
    "owner": "Deyashmukh",
    "version": "1.0.0",
    "license": null,
    "verificationStatus": "Indexed source record"
  },
  "links": {
    "detailUrl": "https://openagent3.xyz/skills/data-cleaning-annotation-workflow",
    "downloadUrl": "https://openagent3.xyz/downloads/data-cleaning-annotation-workflow",
    "agentUrl": "https://openagent3.xyz/skills/data-cleaning-annotation-workflow/agent",
    "manifestUrl": "https://openagent3.xyz/skills/data-cleaning-annotation-workflow/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/data-cleaning-annotation-workflow/agent.md"
  }
}