{
  "schemaVersion": "1.0",
  "item": {
    "slug": "dataset-finder",
    "name": "Dataset Finder",
    "source": "tencent",
    "type": "skill",
    "category": "开发工具",
    "sourceUrl": "https://clawhub.ai/anisafifi/dataset-finder",
    "canonicalUrl": "https://clawhub.ai/anisafifi/dataset-finder",
    "targetPlatform": "OpenClaw"
  },
  "install": {
    "downloadMode": "redirect",
    "downloadUrl": "/downloads/dataset-finder",
    "sourceDownloadUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=dataset-finder",
    "sourcePlatform": "tencent",
    "targetPlatform": "OpenClaw",
    "installMethod": "Manual import",
    "extraction": "Extract archive",
    "prerequisites": [
      "OpenClaw"
    ],
    "packageFormat": "ZIP package",
    "includedAssets": [
      "SKILL.md",
      "references/readme.md",
      "scripts/dataset.py",
      "scripts/requirements.txt"
    ],
    "primaryDoc": "SKILL.md",
    "quickSetup": [
      "Download the package from Yavira.",
      "Extract the archive and review SKILL.md first.",
      "Import or place the package into your OpenClaw setup."
    ],
    "agentAssist": {
      "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
      "steps": [
        "Download the package from Yavira.",
        "Extract it into a folder your agent can access.",
        "Paste one of the prompts below and point your agent at the extracted folder."
      ],
      "prompts": [
        {
          "label": "New install",
          "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
        },
        {
          "label": "Upgrade existing",
          "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
        }
      ]
    },
    "sourceHealth": {
      "source": "tencent",
      "status": "healthy",
      "reason": "direct_download_ok",
      "recommendedAction": "download",
      "checkedAt": "2026-04-30T16:55:25.780Z",
      "expiresAt": "2026-05-07T16:55:25.780Z",
      "httpStatus": 200,
      "finalUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=network",
      "contentType": "application/zip",
      "probeMethod": "head",
      "details": {
        "probeUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=network",
        "contentDisposition": "attachment; filename=\"network-1.0.0.zip\"",
        "redirectLocation": null,
        "bodySnippet": null
      },
      "scope": "source",
      "summary": "Source download looks usable.",
      "detail": "Yavira can redirect you to the upstream package for this source.",
      "primaryActionLabel": "Download for OpenClaw",
      "primaryActionHref": "/downloads/dataset-finder"
    },
    "validation": {
      "installChecklist": [
        "Use the Yavira download entry.",
        "Review SKILL.md after the package is downloaded.",
        "Confirm the extracted package contains the expected setup assets."
      ],
      "postInstallChecks": [
        "Confirm the extracted package includes the expected docs or setup files.",
        "Validate the skill or prompts are available in your target agent workspace.",
        "Capture any manual follow-up steps the agent could not complete."
      ]
    },
    "downloadPageUrl": "https://openagent3.xyz/downloads/dataset-finder",
    "agentPageUrl": "https://openagent3.xyz/skills/dataset-finder/agent",
    "manifestUrl": "https://openagent3.xyz/skills/dataset-finder/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/dataset-finder/agent.md"
  },
  "agentAssist": {
    "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
    "steps": [
      "Download the package from Yavira.",
      "Extract it into a folder your agent can access.",
      "Paste one of the prompts below and point your agent at the extracted folder."
    ],
    "prompts": [
      {
        "label": "New install",
        "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
      },
      {
        "label": "Upgrade existing",
        "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
      }
    ]
  },
  "documentation": {
    "source": "clawhub",
    "primaryDoc": "SKILL.md",
    "sections": [
      {
        "title": "Dataset Finder",
        "body": "Search, download, and explore datasets from multiple repositories including Kaggle, Hugging Face, UCI ML Repository, and Data.gov. Preview statistics, generate data cards, and manage datasets for machine learning projects.\n\n⚠️ Prerequisite: Install OpenClawCLI (Windows, MacOS)\n\nInstallation:\n\n# Standard installation\npip install kaggle datasets pandas huggingface-hub requests beautifulsoup4\n\n# If you encounter permission errors, use a virtual environment\npython -m venv venv\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\npip install kaggle datasets pandas huggingface-hub requests beautifulsoup4\n\nNever use --break-system-packages as it can damage your system's Python installation."
      },
      {
        "title": "Quick Reference",
        "body": "TaskCommandSearch Kagglepython scripts/dataset.py kaggle search \"housing prices\"Download Kaggle datasetpython scripts/dataset.py kaggle download \"username/dataset-name\"Search Hugging Facepython scripts/dataset.py huggingface search \"sentiment\"Download HF datasetpython scripts/dataset.py huggingface download \"dataset-name\"Search UCI MLpython scripts/dataset.py uci search \"classification\"Preview datasetpython scripts/dataset.py preview dataset.csvGenerate data cardpython scripts/dataset.py datacard dataset.csv --output README.mdList local datasetspython scripts/dataset.py list"
      },
      {
        "title": "1. Multi-Repository Search",
        "body": "Search across multiple data repositories from a single interface.\n\nSupported Sources:\n\nKaggle - ML competitions and community datasets\nHugging Face - NLP, vision, and audio datasets\nUCI ML Repository - Classic ML datasets\nData.gov - US government open data\nLocal - Manage downloaded datasets"
      },
      {
        "title": "2. Dataset Download",
        "body": "Download datasets with automatic format detection.\n\nSupported formats:\n\nCSV, TSV\nJSON, JSONL\nParquet\nExcel (XLSX, XLS)\nZIP archives\nHDF5\nFeather"
      },
      {
        "title": "3. Dataset Preview",
        "body": "Get quick statistics and insights without loading entire datasets.\n\nPreview features:\n\nShape (rows × columns)\nColumn names and types\nMissing value counts\nBasic statistics (mean, std, min, max)\nMemory usage\nSample rows"
      },
      {
        "title": "4. Data Card Generation",
        "body": "Automatically generate dataset documentation.\n\nIncludes:\n\nDataset description\nSchema information\nStatistics summary\nUsage examples\nLicense information\nCitation details"
      },
      {
        "title": "Kaggle",
        "body": "Search and download datasets from Kaggle.\n\nSetup:\n\nGet Kaggle API credentials from https://www.kaggle.com/settings\nPlace kaggle.json in ~/.kaggle/ (Linux/Mac) or %USERPROFILE%\\.kaggle\\ (Windows)\n\n# Search datasets\npython scripts/dataset.py kaggle search \"house prices\"\n\n# Search with filters\npython scripts/dataset.py kaggle search \"NLP\" --file-type csv --sort-by hotness\n\n# Download dataset\npython scripts/dataset.py kaggle download \"zillow/zecon\"\n\n# Download specific files\npython scripts/dataset.py kaggle download \"username/dataset\" --file \"train.csv\"\n\n# List dataset files\npython scripts/dataset.py kaggle list \"username/dataset-name\"\n\nSearch options:\n\n--file-type - Filter by file type (csv, json, etc.)\n--license - Filter by license type\n--sort-by - Sort by hotness, votes, updated, or relevance\n--max-results - Limit number of results\n\nOutput:\n\n1. House Prices - Advanced Regression Techniques\n   Owner: zillow/zecon\n   Size: 1.5 MB\n   Last updated: 2023-06-15\n   Downloads: 150,000+\n   URL: https://www.kaggle.com/datasets/zillow/zecon\n\n2. Housing Prices Dataset\n   Owner: username/housing-data\n   Size: 850 KB\n   Last updated: 2023-08-20\n   Downloads: 50,000+\n   URL: https://www.kaggle.com/datasets/username/housing-data"
      },
      {
        "title": "Hugging Face Datasets",
        "body": "Search and download datasets from Hugging Face Hub.\n\n# Search datasets\npython scripts/dataset.py huggingface search \"sentiment analysis\"\n\n# Search with filters\npython scripts/dataset.py huggingface search \"NLP\" --task text-classification --language en\n\n# Download dataset\npython scripts/dataset.py huggingface download \"imdb\"\n\n# Download specific split\npython scripts/dataset.py huggingface download \"imdb\" --split train\n\n# Download specific configuration\npython scripts/dataset.py huggingface download \"glue\" --config mrpc\n\n# Stream large datasets\npython scripts/dataset.py huggingface download \"large-dataset\" --streaming\n\nSearch options:\n\n--task - Filter by task (text-classification, translation, etc.)\n--language - Filter by language code\n--multimodal - Include multimodal datasets\n--benchmark - Only benchmark datasets\n--max-results - Limit results\n\nOutput:\n\n1. IMDB Movie Reviews\n   Dataset ID: imdb\n   Tasks: sentiment-classification\n   Languages: en\n   Size: 84.1 MB\n   Downloads: 1M+\n   URL: https://huggingface.co/datasets/imdb\n\n2. Stanford Sentiment Treebank\n   Dataset ID: sst2\n   Tasks: sentiment-classification\n   Languages: en\n   Size: 7.4 MB\n   Downloads: 500K+\n   URL: https://huggingface.co/datasets/sst2"
      },
      {
        "title": "UCI ML Repository",
        "body": "Search and download classic ML datasets.\n\n# Search datasets\npython scripts/dataset.py uci search \"classification\"\n\n# Search by characteristics\npython scripts/dataset.py uci search \"regression\" --min-samples 1000\n\n# Download dataset\npython scripts/dataset.py uci download \"iris\"\n\n# Download with metadata\npython scripts/dataset.py uci download \"wine-quality\" --include-metadata\n\nSearch options:\n\n--task-type - classification, regression, clustering\n--min-samples - Minimum number of instances\n--min-features - Minimum number of features\n--data-type - tabular, text, image, time-series\n\nOutput:\n\n1. Iris Dataset\n   ID: iris\n   Task: classification\n   Samples: 150\n   Features: 4\n   Classes: 3\n   Missing values: No\n   URL: https://archive.ics.uci.edu/ml/datasets/iris\n\n2. Wine Quality\n   ID: wine-quality\n   Task: classification/regression\n   Samples: 6497\n   Features: 11\n   Missing values: No\n   URL: https://archive.ics.uci.edu/ml/datasets/wine+quality"
      },
      {
        "title": "Data.gov",
        "body": "Search US government open data.\n\n# Search datasets\npython scripts/dataset.py datagov search \"census\"\n\n# Search with organization filter\npython scripts/dataset.py datagov search \"health\" --organization \"cdc.gov\"\n\n# Search by topic\npython scripts/dataset.py datagov search \"education\" --tags \"schools,students\"\n\n# Download dataset\npython scripts/dataset.py datagov download \"dataset-id\"\n\nSearch options:\n\n--organization - Filter by publishing organization\n--tags - Filter by tags (comma-separated)\n--format - Filter by format (csv, json, xml, etc.)\n--max-results - Limit results\n\nOutput:\n\n1. 2020 Census Demographic Data\n   Organization: census.gov\n   Format: CSV\n   Size: 125 MB\n   Last updated: 2023-01-15\n   Tags: census, demographics, population\n   URL: https://catalog.data.gov/dataset/..."
      },
      {
        "title": "Preview Datasets",
        "body": "Get quick insights without loading entire datasets.\n\n# Basic preview\npython scripts/dataset.py preview data.csv\n\n# Detailed statistics\npython scripts/dataset.py preview data.csv --detailed\n\n# Custom sample size\npython scripts/dataset.py preview data.csv --sample 20\n\n# Multiple files\npython scripts/dataset.py preview train.csv test.csv\n\nOutput:\n\nDataset: train.csv\nShape: 1000 rows × 15 columns\nSize: 2.5 MB\nMemory usage: 120 KB\n\nColumns:\n  - id (int64): no missing values\n  - name (object): 5 missing values\n  - age (int64): no missing values\n  - income (float64): 12 missing values\n  - category (object): no missing values\n\nNumeric columns statistics:\n           age       income\ncount   1000.0       988.0\nmean      35.2     65432.1\nstd       12.5     25000.0\nmin       18.0     20000.0\nmax       75.0    150000.0\n\nCategorical columns:\n  - category: 5 unique values\n  - name: 995 unique values\n\nSample (first 5 rows):\n   id      name  age    income category\n0   1  John Doe   35   65000.0        A\n1   2  Jane Doe   28   55000.0        B\n2   3  Bob Smith  42   85000.0        A\n..."
      },
      {
        "title": "Generate Data Cards",
        "body": "Create standardized dataset documentation.\n\n# Generate data card\npython scripts/dataset.py datacard dataset.csv --output DATACARD.md\n\n# Include statistics\npython scripts/dataset.py datacard dataset.csv --include-stats --output README.md\n\n# Custom template\npython scripts/dataset.py datacard dataset.csv --template custom_template.md\n\n# Multiple datasets\npython scripts/dataset.py datacard train.csv test.csv --output-dir datacards/\n\nGenerated data card includes:\n\nDataset description\nFile information (size, format, rows, columns)\nSchema (column names, types, descriptions)\nStatistics (distributions, missing values, correlations)\nSample data\nUsage examples\nLicense and citation\nKnown issues/limitations\n\nExample output (DATACARD.md):\n\n# Dataset Card: Housing Prices\n\n## Dataset Description\nThis dataset contains housing prices and features for regression analysis.\n\n## Dataset Information\n- **Format:** CSV\n- **Size:** 1.2 MB\n- **Rows:** 1,460\n- **Columns:** 81\n\n## Schema\n| Column | Type | Description | Missing |\n|--------|------|-------------|---------|\n| Id | int64 | Unique identifier | 0 |\n| MSSubClass | int64 | Building class | 0 |\n| LotArea | int64 | Lot size in sq ft | 0 |\n| SalePrice | int64 | Sale price | 0 |\n...\n\n## Statistics\n- Numerical features: 38\n- Categorical features: 43\n- Missing values: 19 columns affected\n- Target variable: SalePrice (range: $34,900 - $755,000)\n\n## Usage\n```python\nimport pandas as pd\ndf = pd.read_csv('housing_prices.csv')"
      },
      {
        "title": "License",
        "body": "Creative Commons\n\n### List Local Datasets\n\nManage downloaded datasets.\n\n```bash\n# List all datasets\npython scripts/dataset.py list\n\n# List with details\npython scripts/dataset.py list --detailed\n\n# Filter by source\npython scripts/dataset.py list --source kaggle\n\n# Filter by size\npython scripts/dataset.py list --min-size 100MB --max-size 1GB\n\nOutput:\n\nLocal Datasets (5 total, 2.5 GB):\n\n1. zillow/zecon (Kaggle)\n   Downloaded: 2024-01-15\n   Size: 1.5 MB\n   Files: train.csv, test.csv\n   Location: datasets/kaggle/zillow/zecon/\n\n2. imdb (Hugging Face)\n   Downloaded: 2024-01-20\n   Size: 84.1 MB\n   Splits: train, test, unsupervised\n   Location: datasets/huggingface/imdb/\n\n3. iris (UCI ML)\n   Downloaded: 2024-01-18\n   Size: 4.5 KB\n   Files: iris.data, iris.names\n   Location: datasets/uci/iris/"
      },
      {
        "title": "Machine Learning Project Setup",
        "body": "Find and download datasets for a new ML project.\n\n# Step 1: Search for relevant datasets\npython scripts/dataset.py kaggle search \"house prices\" --max-results 10 --output search_results.json\n\n# Step 2: Download selected dataset\npython scripts/dataset.py kaggle download \"zillow/zecon\"\n\n# Step 3: Preview the data\npython scripts/dataset.py preview datasets/kaggle/zillow/zecon/train.csv --detailed\n\n# Step 4: Generate documentation\npython scripts/dataset.py datacard datasets/kaggle/zillow/zecon/train.csv --output DATACARD.md"
      },
      {
        "title": "NLP Project Dataset Collection",
        "body": "Gather text datasets for NLP tasks.\n\n# Search Hugging Face for sentiment datasets\npython scripts/dataset.py huggingface search \"sentiment\" --task text-classification --language en\n\n# Download multiple datasets\npython scripts/dataset.py huggingface download \"imdb\"\npython scripts/dataset.py huggingface download \"sst2\"\npython scripts/dataset.py huggingface download \"yelp_polarity\"\n\n# Preview each dataset\npython scripts/dataset.py list --source huggingface"
      },
      {
        "title": "Dataset Comparison",
        "body": "Compare multiple datasets for selection.\n\n# Search across repositories\npython scripts/dataset.py kaggle search \"titanic\" --output kaggle_results.json\npython scripts/dataset.py uci search \"classification\" --output uci_results.json\n\n# Preview candidates\npython scripts/dataset.py preview candidate1.csv --output stats1.txt\npython scripts/dataset.py preview candidate2.csv --output stats2.txt\n\n# Generate comparison data cards\npython scripts/dataset.py datacard candidate1.csv candidate2.csv --output-dir comparison/"
      },
      {
        "title": "Building a Dataset Library",
        "body": "Organize datasets for team use.\n\n# Create organized structure\nmkdir -p datasets/{kaggle,huggingface,uci,custom}\n\n# Download datasets with metadata\npython scripts/dataset.py kaggle download \"dataset1\" --output-dir datasets/kaggle/\npython scripts/dataset.py huggingface download \"dataset2\" --output-dir datasets/huggingface/\n\n# Generate data cards for all\npython scripts/dataset.py datacard datasets/**/*.csv --output-dir datacards/\n\n# Create inventory\npython scripts/dataset.py list --detailed --output inventory.json"
      },
      {
        "title": "Data Quality Assessment",
        "body": "Assess dataset quality before use.\n\n# Preview with detailed statistics\npython scripts/dataset.py preview dataset.csv --detailed --output quality_report.txt\n\n# Check for issues\npython scripts/dataset.py validate dataset.csv --check-missing --check-duplicates --check-outliers\n\n# Generate comprehensive data card\npython scripts/dataset.py datacard dataset.csv --include-stats --include-quality --output QA_REPORT.md"
      },
      {
        "title": "Batch Download",
        "body": "Download multiple datasets at once.\n\n# Create download list\ncat > datasets.txt << EOF\nkaggle:zillow/zecon\nkaggle:username/housing\nhuggingface:imdb\nuci:iris\nEOF\n\n# Batch download\npython scripts/dataset.py batch-download datasets.txt --output-dir datasets/"
      },
      {
        "title": "Dataset Conversion",
        "body": "Convert between formats.\n\n# CSV to Parquet\npython scripts/dataset.py convert data.csv --format parquet --output data.parquet\n\n# Excel to CSV\npython scripts/dataset.py convert data.xlsx --format csv --output data.csv\n\n# JSON to CSV\npython scripts/dataset.py convert data.json --format csv --output data.csv"
      },
      {
        "title": "Dataset Splitting",
        "body": "Split datasets for ML workflows.\n\n# Train/test split\npython scripts/dataset.py split data.csv --train 0.8 --test 0.2\n\n# Train/val/test split\npython scripts/dataset.py split data.csv --train 0.7 --val 0.15 --test 0.15\n\n# Stratified split\npython scripts/dataset.py split data.csv --stratify target_column --train 0.8 --test 0.2"
      },
      {
        "title": "Dataset Merging",
        "body": "Combine multiple datasets.\n\n# Concatenate datasets\npython scripts/dataset.py merge file1.csv file2.csv --output combined.csv\n\n# Join on key\npython scripts/dataset.py merge left.csv right.csv --on id --how inner --output joined.csv"
      },
      {
        "title": "Search Strategy",
        "body": "Start broad - Use general keywords first\nRefine iteratively - Add filters based on results\nCheck multiple sources - Different repositories have different strengths\nReview metadata - Check size, format, license before downloading"
      },
      {
        "title": "Download Management",
        "body": "Check size first - Use search to see dataset size\nPreview before download - When possible, preview samples\nOrganize by source - Keep repository structure clear\nTrack downloads - Use list command to manage local datasets"
      },
      {
        "title": "Data Quality",
        "body": "Always preview - Check data before using\nGenerate data cards - Document all datasets\nValidate data - Check for missing values, outliers\nKeep metadata - Save original descriptions and licenses"
      },
      {
        "title": "Storage",
        "body": "Use version control - Track dataset versions\nCompress when possible - Use Parquet or HDF5 for large datasets\nClean regularly - Remove unused datasets\nBackup important data - Keep copies of critical datasets"
      },
      {
        "title": "Installation Issues",
        "body": "\"Missing required dependency\"\n\n# Install all dependencies\npip install kaggle datasets pandas huggingface-hub requests beautifulsoup4\n\n# Or use virtual environment\npython -m venv venv\nsource venv/bin/activate\npip install -r requirements.txt\n\n\"Kaggle API credentials not found\"\n\nGo to https://www.kaggle.com/settings\nClick \"Create New API Token\"\nSave kaggle.json to:\n\nLinux/Mac: ~/.kaggle/\nWindows: %USERPROFILE%\\.kaggle\\\n\n\nSet permissions: chmod 600 ~/.kaggle/kaggle.json\n\n\"Hugging Face authentication required\"\n\n# Login to Hugging Face\nhuggingface-cli login\n\n# Or set token\nexport HF_TOKEN=\"your_token_here\""
      },
      {
        "title": "Search Issues",
        "body": "\"No results found\"\n\nTry broader search terms\nRemove restrictive filters\nCheck spelling\nTry different repository\n\n\"Search timeout\"\n\nCheck internet connection\nRepository may be down temporarily\nTry again in a few minutes"
      },
      {
        "title": "Download Issues",
        "body": "\"Download failed\"\n\nCheck internet connection\nVerify dataset still exists\nCheck available disk space\nTry downloading specific files\n\n\"Permission denied\"\n\nSome datasets require accepting terms\nMay need API credentials\nCheck dataset license\n\n\"Out of memory\"\n\nUse streaming for large datasets\nDownload in chunks\nUse Parquet instead of CSV"
      },
      {
        "title": "Preview Issues",
        "body": "\"Cannot load dataset\"\n\nCheck file format\nVerify file is not corrupted\nTry specifying encoding: --encoding utf-8\n\n\"Preview too slow\"\n\nUse smaller sample size\nPreview first N rows only\nUse format-specific tools"
      },
      {
        "title": "Command Reference",
        "body": "python scripts/dataset.py <command> [OPTIONS]\n\nCOMMANDS:\n  kaggle              Kaggle operations (search, download, list)\n  huggingface         Hugging Face operations\n  uci                 UCI ML Repository operations\n  datagov             Data.gov operations\n  preview             Preview dataset statistics\n  datacard            Generate dataset documentation\n  list                List local datasets\n  batch-download      Download multiple datasets\n  convert             Convert dataset formats\n  split               Split dataset for ML\n  merge               Combine datasets\n\nKAGGLE:\n  search QUERY        Search Kaggle datasets\n    --file-type       Filter by file type\n    --license         Filter by license\n    --sort-by         Sort results\n    --max-results     Limit results\n  \n  download DATASET    Download Kaggle dataset\n    --file            Download specific file\n    --output-dir      Output directory\n\nHUGGING FACE:\n  search QUERY        Search HF datasets\n    --task            Filter by task\n    --language        Filter by language\n    --max-results     Limit results\n  \n  download DATASET    Download HF dataset\n    --split           Specific split\n    --config          Configuration\n    --streaming       Stream large datasets\n\nUCI:\n  search QUERY        Search UCI datasets\n    --task-type       Filter by task\n    --min-samples     Minimum samples\n  \n  download DATASET    Download UCI dataset\n\nPREVIEW:\n  preview FILE        Preview dataset\n    --detailed        Detailed statistics\n    --sample N        Sample size\n\nDATACARD:\n  datacard FILE       Generate data card\n    --output          Output file\n    --include-stats   Include statistics\n    --template        Custom template\n\nLIST:\n  list                List local datasets\n    --detailed        Show details\n    --source          Filter by source\n\nHELP:\n  --help              Show help"
      },
      {
        "title": "Quick Dataset Search",
        "body": "# Find housing datasets\npython scripts/dataset.py kaggle search \"housing\"\n\n# Find NLP datasets\npython scripts/dataset.py huggingface search \"sentiment\" --task text-classification\n\n# Find classic ML datasets\npython scripts/dataset.py uci search \"classification\""
      },
      {
        "title": "Download and Preview",
        "body": "# Download from Kaggle\npython scripts/dataset.py kaggle download \"zillow/zecon\"\n\n# Preview the data\npython scripts/dataset.py preview datasets/kaggle/zillow/zecon/train.csv --detailed\n\n# Generate documentation\npython scripts/dataset.py datacard datasets/kaggle/zillow/zecon/train.csv"
      },
      {
        "title": "Multi-Source Search",
        "body": "# Search all repositories\npython scripts/dataset.py kaggle search \"titanic\" --output kaggle.json\npython scripts/dataset.py huggingface search \"titanic\" --output hf.json\npython scripts/dataset.py uci search \"classification\" --output uci.json\n\n# Compare results\ncat kaggle.json hf.json uci.json"
      },
      {
        "title": "Dataset Management",
        "body": "# List all downloaded datasets\npython scripts/dataset.py list --detailed\n\n# Preview multiple datasets\npython scripts/dataset.py preview *.csv\n\n# Generate data cards for all\npython scripts/dataset.py datacard *.csv --output-dir datacards/"
      },
      {
        "title": "Support",
        "body": "For issues or questions:\n\nCheck this documentation\nRun python scripts/dataset.py --help\nVerify API credentials are set\nCheck repository-specific documentation\n\nResources:\n\nOpenClawCLI: https://clawhub.ai/\nKaggle API: https://github.com/Kaggle/kaggle-api\nHugging Face Datasets: https://huggingface.co/docs/datasets/\nUCI ML Repository: https://archive.ics.uci.edu/ml/\nData.gov API: https://www.data.gov/developers/apis"
      }
    ],
    "body": "Dataset Finder\n\nSearch, download, and explore datasets from multiple repositories including Kaggle, Hugging Face, UCI ML Repository, and Data.gov. Preview statistics, generate data cards, and manage datasets for machine learning projects.\n\n⚠️ Prerequisite: Install OpenClawCLI (Windows, MacOS)\n\nInstallation:\n\n# Standard installation\npip install kaggle datasets pandas huggingface-hub requests beautifulsoup4\n\n# If you encounter permission errors, use a virtual environment\npython -m venv venv\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\npip install kaggle datasets pandas huggingface-hub requests beautifulsoup4\n\n\nNever use --break-system-packages as it can damage your system's Python installation.\n\nQuick Reference\nTask\tCommand\nSearch Kaggle\tpython scripts/dataset.py kaggle search \"housing prices\"\nDownload Kaggle dataset\tpython scripts/dataset.py kaggle download \"username/dataset-name\"\nSearch Hugging Face\tpython scripts/dataset.py huggingface search \"sentiment\"\nDownload HF dataset\tpython scripts/dataset.py huggingface download \"dataset-name\"\nSearch UCI ML\tpython scripts/dataset.py uci search \"classification\"\nPreview dataset\tpython scripts/dataset.py preview dataset.csv\nGenerate data card\tpython scripts/dataset.py datacard dataset.csv --output README.md\nList local datasets\tpython scripts/dataset.py list\nCore Features\n1. Multi-Repository Search\n\nSearch across multiple data repositories from a single interface.\n\nSupported Sources:\n\nKaggle - ML competitions and community datasets\nHugging Face - NLP, vision, and audio datasets\nUCI ML Repository - Classic ML datasets\nData.gov - US government open data\nLocal - Manage downloaded datasets\n2. Dataset Download\n\nDownload datasets with automatic format detection.\n\nSupported formats:\n\nCSV, TSV\nJSON, JSONL\nParquet\nExcel (XLSX, XLS)\nZIP archives\nHDF5\nFeather\n3. Dataset Preview\n\nGet quick statistics and insights without loading entire datasets.\n\nPreview features:\n\nShape (rows × columns)\nColumn names and types\nMissing value counts\nBasic statistics (mean, std, min, max)\nMemory usage\nSample rows\n4. Data Card Generation\n\nAutomatically generate dataset documentation.\n\nIncludes:\n\nDataset description\nSchema information\nStatistics summary\nUsage examples\nLicense information\nCitation details\nRepository-Specific Commands\nKaggle\n\nSearch and download datasets from Kaggle.\n\nSetup:\n\nGet Kaggle API credentials from https://www.kaggle.com/settings\nPlace kaggle.json in ~/.kaggle/ (Linux/Mac) or %USERPROFILE%\\.kaggle\\ (Windows)\n# Search datasets\npython scripts/dataset.py kaggle search \"house prices\"\n\n# Search with filters\npython scripts/dataset.py kaggle search \"NLP\" --file-type csv --sort-by hotness\n\n# Download dataset\npython scripts/dataset.py kaggle download \"zillow/zecon\"\n\n# Download specific files\npython scripts/dataset.py kaggle download \"username/dataset\" --file \"train.csv\"\n\n# List dataset files\npython scripts/dataset.py kaggle list \"username/dataset-name\"\n\n\nSearch options:\n\n--file-type - Filter by file type (csv, json, etc.)\n--license - Filter by license type\n--sort-by - Sort by hotness, votes, updated, or relevance\n--max-results - Limit number of results\n\nOutput:\n\n1. House Prices - Advanced Regression Techniques\n   Owner: zillow/zecon\n   Size: 1.5 MB\n   Last updated: 2023-06-15\n   Downloads: 150,000+\n   URL: https://www.kaggle.com/datasets/zillow/zecon\n\n2. Housing Prices Dataset\n   Owner: username/housing-data\n   Size: 850 KB\n   Last updated: 2023-08-20\n   Downloads: 50,000+\n   URL: https://www.kaggle.com/datasets/username/housing-data\n\nHugging Face Datasets\n\nSearch and download datasets from Hugging Face Hub.\n\n# Search datasets\npython scripts/dataset.py huggingface search \"sentiment analysis\"\n\n# Search with filters\npython scripts/dataset.py huggingface search \"NLP\" --task text-classification --language en\n\n# Download dataset\npython scripts/dataset.py huggingface download \"imdb\"\n\n# Download specific split\npython scripts/dataset.py huggingface download \"imdb\" --split train\n\n# Download specific configuration\npython scripts/dataset.py huggingface download \"glue\" --config mrpc\n\n# Stream large datasets\npython scripts/dataset.py huggingface download \"large-dataset\" --streaming\n\n\nSearch options:\n\n--task - Filter by task (text-classification, translation, etc.)\n--language - Filter by language code\n--multimodal - Include multimodal datasets\n--benchmark - Only benchmark datasets\n--max-results - Limit results\n\nOutput:\n\n1. IMDB Movie Reviews\n   Dataset ID: imdb\n   Tasks: sentiment-classification\n   Languages: en\n   Size: 84.1 MB\n   Downloads: 1M+\n   URL: https://huggingface.co/datasets/imdb\n\n2. Stanford Sentiment Treebank\n   Dataset ID: sst2\n   Tasks: sentiment-classification\n   Languages: en\n   Size: 7.4 MB\n   Downloads: 500K+\n   URL: https://huggingface.co/datasets/sst2\n\nUCI ML Repository\n\nSearch and download classic ML datasets.\n\n# Search datasets\npython scripts/dataset.py uci search \"classification\"\n\n# Search by characteristics\npython scripts/dataset.py uci search \"regression\" --min-samples 1000\n\n# Download dataset\npython scripts/dataset.py uci download \"iris\"\n\n# Download with metadata\npython scripts/dataset.py uci download \"wine-quality\" --include-metadata\n\n\nSearch options:\n\n--task-type - classification, regression, clustering\n--min-samples - Minimum number of instances\n--min-features - Minimum number of features\n--data-type - tabular, text, image, time-series\n\nOutput:\n\n1. Iris Dataset\n   ID: iris\n   Task: classification\n   Samples: 150\n   Features: 4\n   Classes: 3\n   Missing values: No\n   URL: https://archive.ics.uci.edu/ml/datasets/iris\n\n2. Wine Quality\n   ID: wine-quality\n   Task: classification/regression\n   Samples: 6497\n   Features: 11\n   Missing values: No\n   URL: https://archive.ics.uci.edu/ml/datasets/wine+quality\n\nData.gov\n\nSearch US government open data.\n\n# Search datasets\npython scripts/dataset.py datagov search \"census\"\n\n# Search with organization filter\npython scripts/dataset.py datagov search \"health\" --organization \"cdc.gov\"\n\n# Search by topic\npython scripts/dataset.py datagov search \"education\" --tags \"schools,students\"\n\n# Download dataset\npython scripts/dataset.py datagov download \"dataset-id\"\n\n\nSearch options:\n\n--organization - Filter by publishing organization\n--tags - Filter by tags (comma-separated)\n--format - Filter by format (csv, json, xml, etc.)\n--max-results - Limit results\n\nOutput:\n\n1. 2020 Census Demographic Data\n   Organization: census.gov\n   Format: CSV\n   Size: 125 MB\n   Last updated: 2023-01-15\n   Tags: census, demographics, population\n   URL: https://catalog.data.gov/dataset/...\n\nDataset Management\nPreview Datasets\n\nGet quick insights without loading entire datasets.\n\n# Basic preview\npython scripts/dataset.py preview data.csv\n\n# Detailed statistics\npython scripts/dataset.py preview data.csv --detailed\n\n# Custom sample size\npython scripts/dataset.py preview data.csv --sample 20\n\n# Multiple files\npython scripts/dataset.py preview train.csv test.csv\n\n\nOutput:\n\nDataset: train.csv\nShape: 1000 rows × 15 columns\nSize: 2.5 MB\nMemory usage: 120 KB\n\nColumns:\n  - id (int64): no missing values\n  - name (object): 5 missing values\n  - age (int64): no missing values\n  - income (float64): 12 missing values\n  - category (object): no missing values\n\nNumeric columns statistics:\n           age       income\ncount   1000.0       988.0\nmean      35.2     65432.1\nstd       12.5     25000.0\nmin       18.0     20000.0\nmax       75.0    150000.0\n\nCategorical columns:\n  - category: 5 unique values\n  - name: 995 unique values\n\nSample (first 5 rows):\n   id      name  age    income category\n0   1  John Doe   35   65000.0        A\n1   2  Jane Doe   28   55000.0        B\n2   3  Bob Smith  42   85000.0        A\n...\n\nGenerate Data Cards\n\nCreate standardized dataset documentation.\n\n# Generate data card\npython scripts/dataset.py datacard dataset.csv --output DATACARD.md\n\n# Include statistics\npython scripts/dataset.py datacard dataset.csv --include-stats --output README.md\n\n# Custom template\npython scripts/dataset.py datacard dataset.csv --template custom_template.md\n\n# Multiple datasets\npython scripts/dataset.py datacard train.csv test.csv --output-dir datacards/\n\n\nGenerated data card includes:\n\nDataset description\nFile information (size, format, rows, columns)\nSchema (column names, types, descriptions)\nStatistics (distributions, missing values, correlations)\nSample data\nUsage examples\nLicense and citation\nKnown issues/limitations\n\nExample output (DATACARD.md):\n\n# Dataset Card: Housing Prices\n\n## Dataset Description\nThis dataset contains housing prices and features for regression analysis.\n\n## Dataset Information\n- **Format:** CSV\n- **Size:** 1.2 MB\n- **Rows:** 1,460\n- **Columns:** 81\n\n## Schema\n| Column | Type | Description | Missing |\n|--------|------|-------------|---------|\n| Id | int64 | Unique identifier | 0 |\n| MSSubClass | int64 | Building class | 0 |\n| LotArea | int64 | Lot size in sq ft | 0 |\n| SalePrice | int64 | Sale price | 0 |\n...\n\n## Statistics\n- Numerical features: 38\n- Categorical features: 43\n- Missing values: 19 columns affected\n- Target variable: SalePrice (range: $34,900 - $755,000)\n\n## Usage\n```python\nimport pandas as pd\ndf = pd.read_csv('housing_prices.csv')\n\nLicense\n\nCreative Commons\n\n\n### List Local Datasets\n\nManage downloaded datasets.\n\n```bash\n# List all datasets\npython scripts/dataset.py list\n\n# List with details\npython scripts/dataset.py list --detailed\n\n# Filter by source\npython scripts/dataset.py list --source kaggle\n\n# Filter by size\npython scripts/dataset.py list --min-size 100MB --max-size 1GB\n\n\nOutput:\n\nLocal Datasets (5 total, 2.5 GB):\n\n1. zillow/zecon (Kaggle)\n   Downloaded: 2024-01-15\n   Size: 1.5 MB\n   Files: train.csv, test.csv\n   Location: datasets/kaggle/zillow/zecon/\n\n2. imdb (Hugging Face)\n   Downloaded: 2024-01-20\n   Size: 84.1 MB\n   Splits: train, test, unsupervised\n   Location: datasets/huggingface/imdb/\n\n3. iris (UCI ML)\n   Downloaded: 2024-01-18\n   Size: 4.5 KB\n   Files: iris.data, iris.names\n   Location: datasets/uci/iris/\n\nCommon Workflows\nMachine Learning Project Setup\n\nFind and download datasets for a new ML project.\n\n# Step 1: Search for relevant datasets\npython scripts/dataset.py kaggle search \"house prices\" --max-results 10 --output search_results.json\n\n# Step 2: Download selected dataset\npython scripts/dataset.py kaggle download \"zillow/zecon\"\n\n# Step 3: Preview the data\npython scripts/dataset.py preview datasets/kaggle/zillow/zecon/train.csv --detailed\n\n# Step 4: Generate documentation\npython scripts/dataset.py datacard datasets/kaggle/zillow/zecon/train.csv --output DATACARD.md\n\nNLP Project Dataset Collection\n\nGather text datasets for NLP tasks.\n\n# Search Hugging Face for sentiment datasets\npython scripts/dataset.py huggingface search \"sentiment\" --task text-classification --language en\n\n# Download multiple datasets\npython scripts/dataset.py huggingface download \"imdb\"\npython scripts/dataset.py huggingface download \"sst2\"\npython scripts/dataset.py huggingface download \"yelp_polarity\"\n\n# Preview each dataset\npython scripts/dataset.py list --source huggingface\n\nDataset Comparison\n\nCompare multiple datasets for selection.\n\n# Search across repositories\npython scripts/dataset.py kaggle search \"titanic\" --output kaggle_results.json\npython scripts/dataset.py uci search \"classification\" --output uci_results.json\n\n# Preview candidates\npython scripts/dataset.py preview candidate1.csv --output stats1.txt\npython scripts/dataset.py preview candidate2.csv --output stats2.txt\n\n# Generate comparison data cards\npython scripts/dataset.py datacard candidate1.csv candidate2.csv --output-dir comparison/\n\nBuilding a Dataset Library\n\nOrganize datasets for team use.\n\n# Create organized structure\nmkdir -p datasets/{kaggle,huggingface,uci,custom}\n\n# Download datasets with metadata\npython scripts/dataset.py kaggle download \"dataset1\" --output-dir datasets/kaggle/\npython scripts/dataset.py huggingface download \"dataset2\" --output-dir datasets/huggingface/\n\n# Generate data cards for all\npython scripts/dataset.py datacard datasets/**/*.csv --output-dir datacards/\n\n# Create inventory\npython scripts/dataset.py list --detailed --output inventory.json\n\nData Quality Assessment\n\nAssess dataset quality before use.\n\n# Preview with detailed statistics\npython scripts/dataset.py preview dataset.csv --detailed --output quality_report.txt\n\n# Check for issues\npython scripts/dataset.py validate dataset.csv --check-missing --check-duplicates --check-outliers\n\n# Generate comprehensive data card\npython scripts/dataset.py datacard dataset.csv --include-stats --include-quality --output QA_REPORT.md\n\nAdvanced Features\nBatch Download\n\nDownload multiple datasets at once.\n\n# Create download list\ncat > datasets.txt << EOF\nkaggle:zillow/zecon\nkaggle:username/housing\nhuggingface:imdb\nuci:iris\nEOF\n\n# Batch download\npython scripts/dataset.py batch-download datasets.txt --output-dir datasets/\n\nDataset Conversion\n\nConvert between formats.\n\n# CSV to Parquet\npython scripts/dataset.py convert data.csv --format parquet --output data.parquet\n\n# Excel to CSV\npython scripts/dataset.py convert data.xlsx --format csv --output data.csv\n\n# JSON to CSV\npython scripts/dataset.py convert data.json --format csv --output data.csv\n\nDataset Splitting\n\nSplit datasets for ML workflows.\n\n# Train/test split\npython scripts/dataset.py split data.csv --train 0.8 --test 0.2\n\n# Train/val/test split\npython scripts/dataset.py split data.csv --train 0.7 --val 0.15 --test 0.15\n\n# Stratified split\npython scripts/dataset.py split data.csv --stratify target_column --train 0.8 --test 0.2\n\nDataset Merging\n\nCombine multiple datasets.\n\n# Concatenate datasets\npython scripts/dataset.py merge file1.csv file2.csv --output combined.csv\n\n# Join on key\npython scripts/dataset.py merge left.csv right.csv --on id --how inner --output joined.csv\n\nBest Practices\nSearch Strategy\nStart broad - Use general keywords first\nRefine iteratively - Add filters based on results\nCheck multiple sources - Different repositories have different strengths\nReview metadata - Check size, format, license before downloading\nDownload Management\nCheck size first - Use search to see dataset size\nPreview before download - When possible, preview samples\nOrganize by source - Keep repository structure clear\nTrack downloads - Use list command to manage local datasets\nData Quality\nAlways preview - Check data before using\nGenerate data cards - Document all datasets\nValidate data - Check for missing values, outliers\nKeep metadata - Save original descriptions and licenses\nStorage\nUse version control - Track dataset versions\nCompress when possible - Use Parquet or HDF5 for large datasets\nClean regularly - Remove unused datasets\nBackup important data - Keep copies of critical datasets\nTroubleshooting\nInstallation Issues\n\n\"Missing required dependency\"\n\n# Install all dependencies\npip install kaggle datasets pandas huggingface-hub requests beautifulsoup4\n\n# Or use virtual environment\npython -m venv venv\nsource venv/bin/activate\npip install -r requirements.txt\n\n\n\"Kaggle API credentials not found\"\n\nGo to https://www.kaggle.com/settings\nClick \"Create New API Token\"\nSave kaggle.json to:\nLinux/Mac: ~/.kaggle/\nWindows: %USERPROFILE%\\.kaggle\\\nSet permissions: chmod 600 ~/.kaggle/kaggle.json\n\n\"Hugging Face authentication required\"\n\n# Login to Hugging Face\nhuggingface-cli login\n\n# Or set token\nexport HF_TOKEN=\"your_token_here\"\n\nSearch Issues\n\n\"No results found\"\n\nTry broader search terms\nRemove restrictive filters\nCheck spelling\nTry different repository\n\n\"Search timeout\"\n\nCheck internet connection\nRepository may be down temporarily\nTry again in a few minutes\nDownload Issues\n\n\"Download failed\"\n\nCheck internet connection\nVerify dataset still exists\nCheck available disk space\nTry downloading specific files\n\n\"Permission denied\"\n\nSome datasets require accepting terms\nMay need API credentials\nCheck dataset license\n\n\"Out of memory\"\n\nUse streaming for large datasets\nDownload in chunks\nUse Parquet instead of CSV\nPreview Issues\n\n\"Cannot load dataset\"\n\nCheck file format\nVerify file is not corrupted\nTry specifying encoding: --encoding utf-8\n\n\"Preview too slow\"\n\nUse smaller sample size\nPreview first N rows only\nUse format-specific tools\nCommand Reference\npython scripts/dataset.py <command> [OPTIONS]\n\nCOMMANDS:\n  kaggle              Kaggle operations (search, download, list)\n  huggingface         Hugging Face operations\n  uci                 UCI ML Repository operations\n  datagov             Data.gov operations\n  preview             Preview dataset statistics\n  datacard            Generate dataset documentation\n  list                List local datasets\n  batch-download      Download multiple datasets\n  convert             Convert dataset formats\n  split               Split dataset for ML\n  merge               Combine datasets\n\nKAGGLE:\n  search QUERY        Search Kaggle datasets\n    --file-type       Filter by file type\n    --license         Filter by license\n    --sort-by         Sort results\n    --max-results     Limit results\n  \n  download DATASET    Download Kaggle dataset\n    --file            Download specific file\n    --output-dir      Output directory\n\nHUGGING FACE:\n  search QUERY        Search HF datasets\n    --task            Filter by task\n    --language        Filter by language\n    --max-results     Limit results\n  \n  download DATASET    Download HF dataset\n    --split           Specific split\n    --config          Configuration\n    --streaming       Stream large datasets\n\nUCI:\n  search QUERY        Search UCI datasets\n    --task-type       Filter by task\n    --min-samples     Minimum samples\n  \n  download DATASET    Download UCI dataset\n\nPREVIEW:\n  preview FILE        Preview dataset\n    --detailed        Detailed statistics\n    --sample N        Sample size\n\nDATACARD:\n  datacard FILE       Generate data card\n    --output          Output file\n    --include-stats   Include statistics\n    --template        Custom template\n\nLIST:\n  list                List local datasets\n    --detailed        Show details\n    --source          Filter by source\n\nHELP:\n  --help              Show help\n\nExamples by Use Case\nQuick Dataset Search\n# Find housing datasets\npython scripts/dataset.py kaggle search \"housing\"\n\n# Find NLP datasets\npython scripts/dataset.py huggingface search \"sentiment\" --task text-classification\n\n# Find classic ML datasets\npython scripts/dataset.py uci search \"classification\"\n\nDownload and Preview\n# Download from Kaggle\npython scripts/dataset.py kaggle download \"zillow/zecon\"\n\n# Preview the data\npython scripts/dataset.py preview datasets/kaggle/zillow/zecon/train.csv --detailed\n\n# Generate documentation\npython scripts/dataset.py datacard datasets/kaggle/zillow/zecon/train.csv\n\nMulti-Source Search\n# Search all repositories\npython scripts/dataset.py kaggle search \"titanic\" --output kaggle.json\npython scripts/dataset.py huggingface search \"titanic\" --output hf.json\npython scripts/dataset.py uci search \"classification\" --output uci.json\n\n# Compare results\ncat kaggle.json hf.json uci.json\n\nDataset Management\n# List all downloaded datasets\npython scripts/dataset.py list --detailed\n\n# Preview multiple datasets\npython scripts/dataset.py preview *.csv\n\n# Generate data cards for all\npython scripts/dataset.py datacard *.csv --output-dir datacards/\n\nSupport\n\nFor issues or questions:\n\nCheck this documentation\nRun python scripts/dataset.py --help\nVerify API credentials are set\nCheck repository-specific documentation\n\nResources:\n\nOpenClawCLI: https://clawhub.ai/\nKaggle API: https://github.com/Kaggle/kaggle-api\nHugging Face Datasets: https://huggingface.co/docs/datasets/\nUCI ML Repository: https://archive.ics.uci.edu/ml/\nData.gov API: https://www.data.gov/developers/apis"
  },
  "trust": {
    "sourceLabel": "tencent",
    "provenanceUrl": "https://clawhub.ai/anisafifi/dataset-finder",
    "publisherUrl": "https://clawhub.ai/anisafifi/dataset-finder",
    "owner": "anisafifi",
    "version": "0.1.0",
    "license": null,
    "verificationStatus": "Indexed source record"
  },
  "links": {
    "detailUrl": "https://openagent3.xyz/skills/dataset-finder",
    "downloadUrl": "https://openagent3.xyz/downloads/dataset-finder",
    "agentUrl": "https://openagent3.xyz/skills/dataset-finder/agent",
    "manifestUrl": "https://openagent3.xyz/skills/dataset-finder/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/dataset-finder/agent.md"
  }
}