{
  "schemaVersion": "1.0",
  "item": {
    "slug": "data-quality-check",
    "name": "Data Quality Check",
    "source": "tencent",
    "type": "skill",
    "category": "数据分析",
    "sourceUrl": "https://clawhub.ai/datadrivenconstruction/data-quality-check",
    "canonicalUrl": "https://clawhub.ai/datadrivenconstruction/data-quality-check",
    "targetPlatform": "OpenClaw"
  },
  "install": {
    "downloadMode": "redirect",
    "downloadUrl": "/downloads/data-quality-check",
    "sourceDownloadUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=data-quality-check",
    "sourcePlatform": "tencent",
    "targetPlatform": "OpenClaw",
    "installMethod": "Manual import",
    "extraction": "Extract archive",
    "prerequisites": [
      "OpenClaw"
    ],
    "packageFormat": "ZIP package",
    "includedAssets": [
      "claw.json",
      "instructions.md",
      "SKILL.md"
    ],
    "primaryDoc": "SKILL.md",
    "quickSetup": [
      "Download the package from Yavira.",
      "Extract the archive and review SKILL.md first.",
      "Import or place the package into your OpenClaw setup."
    ],
    "agentAssist": {
      "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
      "steps": [
        "Download the package from Yavira.",
        "Extract it into a folder your agent can access.",
        "Paste one of the prompts below and point your agent at the extracted folder."
      ],
      "prompts": [
        {
          "label": "New install",
          "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
        },
        {
          "label": "Upgrade existing",
          "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
        }
      ]
    },
    "sourceHealth": {
      "source": "tencent",
      "status": "healthy",
      "reason": "direct_download_ok",
      "recommendedAction": "download",
      "checkedAt": "2026-04-23T16:43:11.935Z",
      "expiresAt": "2026-04-30T16:43:11.935Z",
      "httpStatus": 200,
      "finalUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=4claw-imageboard",
      "contentType": "application/zip",
      "probeMethod": "head",
      "details": {
        "probeUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=4claw-imageboard",
        "contentDisposition": "attachment; filename=\"4claw-imageboard-1.0.1.zip\"",
        "redirectLocation": null,
        "bodySnippet": null
      },
      "scope": "source",
      "summary": "Source download looks usable.",
      "detail": "Yavira can redirect you to the upstream package for this source.",
      "primaryActionLabel": "Download for OpenClaw",
      "primaryActionHref": "/downloads/data-quality-check"
    },
    "validation": {
      "installChecklist": [
        "Use the Yavira download entry.",
        "Review SKILL.md after the package is downloaded.",
        "Confirm the extracted package contains the expected setup assets."
      ],
      "postInstallChecks": [
        "Confirm the extracted package includes the expected docs or setup files.",
        "Validate the skill or prompts are available in your target agent workspace.",
        "Capture any manual follow-up steps the agent could not complete."
      ]
    },
    "downloadPageUrl": "https://openagent3.xyz/downloads/data-quality-check",
    "agentPageUrl": "https://openagent3.xyz/skills/data-quality-check/agent",
    "manifestUrl": "https://openagent3.xyz/skills/data-quality-check/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/data-quality-check/agent.md"
  },
  "agentAssist": {
    "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
    "steps": [
      "Download the package from Yavira.",
      "Extract it into a folder your agent can access.",
      "Paste one of the prompts below and point your agent at the extracted folder."
    ],
    "prompts": [
      {
        "label": "New install",
        "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
      },
      {
        "label": "Upgrade existing",
        "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
      }
    ]
  },
  "documentation": {
    "source": "clawhub",
    "primaryDoc": "SKILL.md",
    "sections": [
      {
        "title": "Overview",
        "body": "Based on DDC methodology (Chapter 2.6), this skill provides comprehensive data quality assessment for construction projects. Poor data quality leads to poor decisions - validate early, validate often.\n\nBook Reference: \"Требования к качеству данных и его обеспечение\" / \"Data Quality Requirements\"\n\n\"Качество данных определяется пятью ключевыми метриками: полнота, точность, согласованность, своевременность и достоверность.\"\n— DDC Book, Chapter 2.6"
      },
      {
        "title": "Quick Start",
        "body": "import pandas as pd\n\n# Load construction data\ndf = pd.read_excel(\"bim_export.xlsx\")\n\n# Quick quality check\nquality_score = {\n    'completeness': (1 - df.isnull().sum().sum() / df.size) * 100,\n    'unique_ids': df['ElementId'].nunique() == len(df),\n    'valid_volumes': (df['Volume_m3'] >= 0).all()\n}\n\nprint(f\"Completeness: {quality_score['completeness']:.1f}%\")\nprint(f\"Unique IDs: {quality_score['unique_ids']}\")\nprint(f\"Valid volumes: {quality_score['valid_volumes']}\")"
      },
      {
        "title": "The 5 Quality Metrics",
        "body": "import pandas as pd\nimport numpy as np\nimport re\nfrom datetime import datetime, timedelta\n\nclass DataQualityChecker:\n    \"\"\"Comprehensive data quality assessment for construction data\"\"\"\n\n    def __init__(self, df):\n        self.df = df.copy()\n        self.results = {}\n        self.issues = []\n\n    def check_completeness(self, required_columns=None):\n        \"\"\"Check for missing values (Полнота)\"\"\"\n        if required_columns is None:\n            required_columns = self.df.columns.tolist()\n\n        completeness = {}\n        for col in required_columns:\n            if col in self.df.columns:\n                non_null = self.df[col].notna().sum()\n                total = len(self.df)\n                completeness[col] = (non_null / total) * 100\n            else:\n                completeness[col] = 0\n                self.issues.append(f\"Missing required column: {col}\")\n\n        overall = np.mean(list(completeness.values()))\n\n        self.results['completeness'] = {\n            'by_column': completeness,\n            'overall': overall,\n            'threshold': 95,\n            'passed': overall >= 95\n        }\n\n        return self.results['completeness']\n\n    def check_accuracy(self, rules=None):\n        \"\"\"Check data accuracy against rules (Точность)\"\"\"\n        if rules is None:\n            # Default construction data rules\n            rules = {\n                'Volume_m3': {'min': 0, 'max': 10000},\n                'Area_m2': {'min': 0, 'max': 100000},\n                'Weight_kg': {'min': 0, 'max': 1000000},\n                'Cost': {'min': 0, 'max': 100000000}\n            }\n\n        accuracy = {}\n        for col, bounds in rules.items():\n            if col in self.df.columns:\n                valid = self.df[col].between(\n                    bounds.get('min', -np.inf),\n                    bounds.get('max', np.inf)\n                ).sum()\n                total = 
self.df[col].notna().sum()\n                accuracy[col] = (valid / total * 100) if total > 0 else 100\n\n                # Log invalid values\n                invalid_count = total - valid\n                if invalid_count > 0:\n                    self.issues.append(\n                        f\"{col}: {invalid_count} values outside range [{bounds.get('min')}, {bounds.get('max')}]\"\n                    )\n\n        overall = np.mean(list(accuracy.values())) if accuracy else 100\n\n        self.results['accuracy'] = {\n            'by_column': accuracy,\n            'overall': overall,\n            'threshold': 98,\n            'passed': overall >= 98\n        }\n\n        return self.results['accuracy']\n\n    def check_consistency(self, unique_cols=None, relationship_rules=None):\n        \"\"\"Check data consistency (Согласованность)\"\"\"\n        consistency = {}\n\n        # Check unique columns\n        if unique_cols is None:\n            unique_cols = ['ElementId']\n\n        for col in unique_cols:\n            if col in self.df.columns:\n                is_unique = self.df[col].nunique() == len(self.df)\n                consistency[f'{col}_unique'] = 100 if is_unique else \\\n                    (self.df[col].nunique() / len(self.df) * 100)\n\n                if not is_unique:\n                    duplicates = self.df[self.df[col].duplicated()][col].unique()\n                    self.issues.append(f\"Duplicate {col}: {len(duplicates)} duplicates found\")\n\n        # Check cross-field relationships\n        if relationship_rules is None:\n            relationship_rules = [\n                ('End_Date', '>=', 'Start_Date'),\n                ('Gross_Volume', '>=', 'Net_Volume')\n            ]\n\n        for col1, op, col2 in relationship_rules:\n            if col1 in self.df.columns and col2 in self.df.columns:\n                if op == '>=':\n                    valid = (self.df[col1] >= self.df[col2]).sum()\n                elif op == '>':\n           
         valid = (self.df[col1] > self.df[col2]).sum()\n                elif op == '==':\n                    valid = (self.df[col1] == self.df[col2]).sum()\n\n                total = self.df[[col1, col2]].notna().all(axis=1).sum()\n                consistency[f'{col1}_{op}_{col2}'] = (valid / total * 100) if total > 0 else 100\n\n        overall = np.mean(list(consistency.values())) if consistency else 100\n\n        self.results['consistency'] = {\n            'checks': consistency,\n            'overall': overall,\n            'threshold': 99,\n            'passed': overall >= 99\n        }\n\n        return self.results['consistency']\n\n    def check_timeliness(self, date_col='Modified_Date', max_age_days=30):\n        \"\"\"Check data timeliness (Своевременность)\"\"\"\n        if date_col not in self.df.columns:\n            self.results['timeliness'] = {\n                'overall': None,\n                'message': f'Column {date_col} not found'\n            }\n            return self.results['timeliness']\n\n        dates = pd.to_datetime(self.df[date_col], errors='coerce')\n        cutoff = datetime.now() - timedelta(days=max_age_days)\n\n        recent = (dates >= cutoff).sum()\n        total = dates.notna().sum()\n        timeliness_pct = (recent / total * 100) if total > 0 else 0\n\n        oldest = dates.min()\n        newest = dates.max()\n        avg_age = (datetime.now() - dates.mean()).days if dates.notna().any() else None\n\n        self.results['timeliness'] = {\n            'recent_percentage': timeliness_pct,\n            'oldest_record': oldest,\n            'newest_record': newest,\n            'average_age_days': avg_age,\n            'threshold': 80,\n            'passed': timeliness_pct >= 80\n        }\n\n        return self.results['timeliness']\n\n    def check_validity(self, patterns=None):\n        \"\"\"Check data validity with regex patterns (Достоверность)\"\"\"\n        if patterns is None:\n            patterns = {\n             
   'ElementId': r'^[A-Z]{1,3}\\d{3,6}$',  # e.g., W001, FL12345\n                'Level': r'^Level\\s*\\d+$|^L\\d+$|^Уровень\\s*\\d+$',\n                'Email': r'^[\\w\\.-]+@[\\w\\.-]+\\.\\w+$',\n                'Phone': r'^\\+?\\d{10,15}$'\n            }\n\n        validity = {}\n        for col, pattern in patterns.items():\n            if col in self.df.columns:\n                non_null = self.df[col].dropna()\n                if len(non_null) > 0:\n                    matches = non_null.astype(str).str.match(pattern).sum()\n                    validity[col] = (matches / len(non_null) * 100)\n\n                    invalid = len(non_null) - matches\n                    if invalid > 0:\n                        self.issues.append(f\"{col}: {invalid} values don't match pattern\")\n                else:\n                    validity[col] = 100\n\n        overall = np.mean(list(validity.values())) if validity else 100\n\n        self.results['validity'] = {\n            'by_column': validity,\n            'overall': overall,\n            'threshold': 95,\n            'passed': overall >= 95\n        }\n\n        return self.results['validity']\n\n    def run_full_check(self):\n        \"\"\"Run all quality checks\"\"\"\n        self.check_completeness()\n        self.check_accuracy()\n        self.check_consistency()\n        self.check_timeliness()\n        self.check_validity()\n\n        # Calculate overall score\n        scores = []\n        for metric in ['completeness', 'accuracy', 'consistency', 'validity']:\n            if metric in self.results and self.results[metric].get('overall'):\n                scores.append(self.results[metric]['overall'])\n\n        self.results['overall_score'] = np.mean(scores) if scores else 0\n        self.results['grade'] = self._calculate_grade(self.results['overall_score'])\n        self.results['issues'] = self.issues\n\n        return self.results\n\n    def _calculate_grade(self, score):\n        \"\"\"Calculate quality 
grade\"\"\"\n        if score >= 98:\n            return 'A+'\n        elif score >= 95:\n            return 'A'\n        elif score >= 90:\n            return 'B'\n        elif score >= 80:\n            return 'C'\n        elif score >= 70:\n            return 'D'\n        else:\n            return 'F'\n\n    def generate_report(self):\n        \"\"\"Generate quality report\"\"\"\n        if not self.results:\n            self.run_full_check()\n\n        report = []\n        report.append(\"=\" * 60)\n        report.append(\"DATA QUALITY REPORT\")\n        report.append(\"=\" * 60)\n        report.append(f\"Records analyzed: {len(self.df)}\")\n        report.append(f\"Columns: {len(self.df.columns)}\")\n        report.append(\"\")\n        report.append(f\"OVERALL SCORE: {self.results['overall_score']:.1f}% (Grade: {self.results['grade']})\")\n        report.append(\"\")\n        report.append(\"-\" * 60)\n\n        # Detail by dimension\n        for metric in ['completeness', 'accuracy', 'consistency', 'validity', 'timeliness']:\n            if metric in self.results:\n                r = self.results[metric]\n                passed = '✓' if r.get('passed', False) else '✗'\n                overall = r.get('overall', r.get('recent_percentage', 'N/A'))\n                if isinstance(overall, (int, float)):\n                    report.append(f\"{metric.upper():15s}: {overall:>6.1f}% {passed}\")\n                else:\n                    report.append(f\"{metric.upper():15s}: {overall}\")\n\n        report.append(\"-\" * 60)\n\n        if self.issues:\n            report.append(\"\")\n            report.append(\"ISSUES FOUND:\")\n            for issue in self.issues[:10]:  # Show first 10\n                report.append(f\"  • {issue}\")\n            if len(self.issues) > 10:\n                report.append(f\"  ... and {len(self.issues) - 10} more issues\")\n\n        report.append(\"\")\n        report.append(\"=\" * 60)\n\n        return \"\\n\".join(report)"
      },
      {
        "title": "Custom Validation Rules",
        "body": "class ValidationRulesBuilder:\n    \"\"\"Build custom validation rules for construction data\"\"\"\n\n    def __init__(self):\n        self.rules = []\n\n    def add_not_null(self, column):\n        \"\"\"Column must not have null values\"\"\"\n        self.rules.append({\n            'type': 'not_null',\n            'column': column,\n            'check': lambda df, col=column: df[col].notna().all()\n        })\n        return self\n\n    def add_unique(self, column):\n        \"\"\"Column must have unique values\"\"\"\n        self.rules.append({\n            'type': 'unique',\n            'column': column,\n            'check': lambda df, col=column: df[col].nunique() == len(df)\n        })\n        return self\n\n    def add_range(self, column, min_val=None, max_val=None):\n        \"\"\"Column values must be within range\"\"\"\n        self.rules.append({\n            'type': 'range',\n            'column': column,\n            'min': min_val,\n            'max': max_val,\n            'check': lambda df, col=column, mn=min_val, mx=max_val:\n                df[col].between(mn or -np.inf, mx or np.inf).all()\n        })\n        return self\n\n    def add_regex(self, column, pattern):\n        \"\"\"Column values must match regex pattern\"\"\"\n        self.rules.append({\n            'type': 'regex',\n            'column': column,\n            'pattern': pattern,\n            'check': lambda df, col=column, p=pattern:\n                df[col].astype(str).str.match(p).all()\n        })\n        return self\n\n    def add_in_list(self, column, valid_values):\n        \"\"\"Column values must be in list\"\"\"\n        self.rules.append({\n            'type': 'in_list',\n            'column': column,\n            'valid_values': valid_values,\n            'check': lambda df, col=column, vals=valid_values:\n                df[col].isin(vals).all()\n        })\n        return self\n\n    def add_custom(self, name, check_func):\n        \"\"\"Add 
custom validation function\"\"\"\n        self.rules.append({\n            'type': 'custom',\n            'name': name,\n            'check': check_func\n        })\n        return self\n\n    def validate(self, df):\n        \"\"\"Run all validation rules\"\"\"\n        results = []\n\n        for rule in self.rules:\n            try:\n                passed = rule['check'](df)\n                results.append({\n                    'rule': rule.get('name', f\"{rule['type']}:{rule.get('column', 'custom')}\"),\n                    'passed': passed,\n                    'type': rule['type']\n                })\n            except Exception as e:\n                results.append({\n                    'rule': rule.get('name', f\"{rule['type']}:{rule.get('column', 'custom')}\"),\n                    'passed': False,\n                    'error': str(e)\n                })\n\n        return results\n\n# Usage example\nrules = (ValidationRulesBuilder()\n    .add_not_null('ElementId')\n    .add_unique('ElementId')\n    .add_range('Volume_m3', min_val=0)\n    .add_range('Cost', min_val=0)\n    .add_in_list('Category', ['Wall', 'Floor', 'Column', 'Beam', 'Slab'])\n    .add_regex('Level', r'^Level\\s*\\d+$')\n)\n\nresults = rules.validate(df)\nfor r in results:\n    status = '✓' if r['passed'] else '✗'\n    print(f\"{status} {r['rule']}\")"
      },
      {
        "title": "Automated Quality Pipeline",
        "body": "class DataQualityPipeline:\n    \"\"\"Automated data quality pipeline\"\"\"\n\n    def __init__(self, config=None):\n        self.config = config or self._default_config()\n        self.history = []\n\n    def _default_config(self):\n        return {\n            'required_columns': ['ElementId', 'Category', 'Volume_m3'],\n            'unique_columns': ['ElementId'],\n            'numeric_ranges': {\n                'Volume_m3': (0, 10000),\n                'Area_m2': (0, 100000),\n                'Cost': (0, 100000000)\n            },\n            'valid_categories': ['Wall', 'Floor', 'Column', 'Beam', 'Slab',\n                                 'Foundation', 'Roof', 'Stair', 'Door', 'Window'],\n            'min_quality_score': 90\n        }\n\n    def run(self, df, source_name='unknown'):\n        \"\"\"Run quality pipeline\"\"\"\n        checker = DataQualityChecker(df)\n\n        # Configure checks based on config\n        checker.check_completeness(self.config['required_columns'])\n        checker.check_accuracy({\n            col: {'min': r[0], 'max': r[1]}\n            for col, r in self.config['numeric_ranges'].items()\n        })\n        checker.check_consistency(self.config['unique_columns'])\n        checker.check_validity()\n\n        results = checker.run_full_check()\n\n        # Store in history\n        self.history.append({\n            'timestamp': datetime.now(),\n            'source': source_name,\n            'records': len(df),\n            'score': results['overall_score'],\n            'grade': results['grade'],\n            'issues_count': len(results['issues'])\n        })\n\n        # Check threshold\n        passed = results['overall_score'] >= self.config['min_quality_score']\n\n        return {\n            'passed': passed,\n            'score': results['overall_score'],\n            'grade': results['grade'],\n            'details': results,\n            'report': checker.generate_report()\n        }\n\n    def 
get_history_summary(self):\n        \"\"\"Get quality history summary\"\"\"\n        if not self.history:\n            return \"No quality checks performed yet.\"\n\n        df_history = pd.DataFrame(self.history)\n        return {\n            'total_checks': len(self.history),\n            'avg_score': df_history['score'].mean(),\n            'min_score': df_history['score'].min(),\n            'max_score': df_history['score'].max(),\n            'latest': self.history[-1]\n        }"
      },
      {
        "title": "Export Quality Report",
        "body": "def export_quality_report(df, output_path, include_details=True):\n    \"\"\"Export comprehensive quality report to Excel\"\"\"\n    checker = DataQualityChecker(df)\n    results = checker.run_full_check()\n\n    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:\n        # Summary sheet\n        summary = pd.DataFrame({\n            'Metric': ['Overall Score', 'Grade', 'Records', 'Columns', 'Issues'],\n            'Value': [\n                f\"{results['overall_score']:.1f}%\",\n                results['grade'],\n                len(df),\n                len(df.columns),\n                len(results['issues'])\n            ]\n        })\n        summary.to_excel(writer, sheet_name='Summary', index=False)\n\n        # Completeness details\n        if 'completeness' in results:\n            comp_df = pd.DataFrame.from_dict(\n                results['completeness']['by_column'],\n                orient='index',\n                columns=['Completeness_%']\n            )\n            comp_df.to_excel(writer, sheet_name='Completeness')\n\n        # Issues list\n        if results['issues']:\n            issues_df = pd.DataFrame({'Issue': results['issues']})\n            issues_df.to_excel(writer, sheet_name='Issues', index=False)\n\n        # Missing values analysis\n        if include_details:\n            missing = df.isnull().sum()\n            missing_df = pd.DataFrame({\n                'Column': missing.index,\n                'Missing_Count': missing.values,\n                'Missing_%': (missing.values / len(df) * 100).round(2)\n            })\n            missing_df.to_excel(writer, sheet_name='Missing_Values', index=False)\n\n    return output_path"
      },
      {
        "title": "Quick Reference",
        "body": "MetricDescriptionThresholdCompleteness% non-null values≥ 95%AccuracyValues within valid range≥ 98%ConsistencyUnique IDs, valid relationships≥ 99%ValidityMatch expected patterns≥ 95%TimelinessRecords updated recently≥ 80%"
      },
      {
        "title": "Common Validation Patterns",
        "body": "# Construction-specific regex patterns\nPATTERNS = {\n    'element_id': r'^[A-Z]{1,3}\\d{3,8}$',\n    'revit_id': r'^\\d{5,8}$',\n    'ifc_guid': r'^[A-Za-z0-9_$]{22}$',\n    'level': r'^(Level|L|Уровень)\\s*[-]?\\d+$',\n    'grid': r'^[A-Z]{1,2}[-/]?\\d{0,3}$',\n    'date_iso': r'^\\d{4}-\\d{2}-\\d{2}$',\n    'cost_code': r'^\\d{2,3}[.-]\\d{2,4}[.-]?\\d{0,4}$'\n}"
      },
      {
        "title": "Resources",
        "body": "Book: \"Data-Driven Construction\" by Artem Boiko, Chapter 2.6\nWebsite: https://datadrivenconstruction.io\nGreat Expectations: https://greatexpectations.io"
      },
      {
        "title": "Next Steps",
        "body": "See bim-validation-pipeline for BIM-specific validation\nSee etl-pipeline for data processing pipelines\nSee data-visualization for quality dashboards"
      }
    ],
    "body": "Data Quality Check for Construction\nOverview\n\nBased on DDC methodology (Chapter 2.6), this skill provides comprehensive data quality assessment for construction projects. Poor data quality leads to poor decisions - validate early, validate often.\n\nBook Reference: \"Требования к качеству данных и его обеспечение\" / \"Data Quality Requirements\"\n\n\"Качество данных определяется пятью ключевыми метриками: полнота, точность, согласованность, своевременность и достоверность.\" — DDC Book, Chapter 2.6\n\nQuick Start\nimport pandas as pd\n\n# Load construction data\ndf = pd.read_excel(\"bim_export.xlsx\")\n\n# Quick quality check\nquality_score = {\n    'completeness': (1 - df.isnull().sum().sum() / df.size) * 100,\n    'unique_ids': df['ElementId'].nunique() == len(df),\n    'valid_volumes': (df['Volume_m3'] >= 0).all()\n}\n\nprint(f\"Completeness: {quality_score['completeness']:.1f}%\")\nprint(f\"Unique IDs: {quality_score['unique_ids']}\")\nprint(f\"Valid volumes: {quality_score['valid_volumes']}\")\n\nData Quality Dimensions\nThe 5 Quality Metrics\nimport pandas as pd\nimport numpy as np\nimport re\nfrom datetime import datetime, timedelta\n\nclass DataQualityChecker:\n    \"\"\"Comprehensive data quality assessment for construction data\"\"\"\n\n    def __init__(self, df):\n        self.df = df.copy()\n        self.results = {}\n        self.issues = []\n\n    def check_completeness(self, required_columns=None):\n        \"\"\"Check for missing values (Полнота)\"\"\"\n        if required_columns is None:\n            required_columns = self.df.columns.tolist()\n\n        completeness = {}\n        for col in required_columns:\n            if col in self.df.columns:\n                non_null = self.df[col].notna().sum()\n                total = len(self.df)\n                completeness[col] = (non_null / total) * 100\n            else:\n                completeness[col] = 0\n                self.issues.append(f\"Missing required column: 
{col}\")\n\n        overall = np.mean(list(completeness.values()))\n\n        self.results['completeness'] = {\n            'by_column': completeness,\n            'overall': overall,\n            'threshold': 95,\n            'passed': overall >= 95\n        }\n\n        return self.results['completeness']\n\n    def check_accuracy(self, rules=None):\n        \"\"\"Check data accuracy against rules (Точность)\"\"\"\n        if rules is None:\n            # Default construction data rules\n            rules = {\n                'Volume_m3': {'min': 0, 'max': 10000},\n                'Area_m2': {'min': 0, 'max': 100000},\n                'Weight_kg': {'min': 0, 'max': 1000000},\n                'Cost': {'min': 0, 'max': 100000000}\n            }\n\n        accuracy = {}\n        for col, bounds in rules.items():\n            if col in self.df.columns:\n                valid = self.df[col].between(\n                    bounds.get('min', -np.inf),\n                    bounds.get('max', np.inf)\n                ).sum()\n                total = self.df[col].notna().sum()\n                accuracy[col] = (valid / total * 100) if total > 0 else 100\n\n                # Log invalid values\n                invalid_count = total - valid\n                if invalid_count > 0:\n                    self.issues.append(\n                        f\"{col}: {invalid_count} values outside range [{bounds.get('min')}, {bounds.get('max')}]\"\n                    )\n\n        overall = np.mean(list(accuracy.values())) if accuracy else 100\n\n        self.results['accuracy'] = {\n            'by_column': accuracy,\n            'overall': overall,\n            'threshold': 98,\n            'passed': overall >= 98\n        }\n\n        return self.results['accuracy']\n\n    def check_consistency(self, unique_cols=None, relationship_rules=None):\n        \"\"\"Check data consistency (Согласованность)\"\"\"\n        consistency = {}\n\n        # Check unique columns\n        if unique_cols is 
None:\n            unique_cols = ['ElementId']\n\n        for col in unique_cols:\n            if col in self.df.columns:\n                is_unique = self.df[col].nunique() == len(self.df)\n                consistency[f'{col}_unique'] = 100 if is_unique else \\\n                    (self.df[col].nunique() / len(self.df) * 100)\n\n                if not is_unique:\n                    duplicates = self.df[self.df[col].duplicated()][col].unique()\n                    self.issues.append(f\"Duplicate {col}: {len(duplicates)} duplicates found\")\n\n        # Check cross-field relationships\n        if relationship_rules is None:\n            relationship_rules = [\n                ('End_Date', '>=', 'Start_Date'),\n                ('Gross_Volume', '>=', 'Net_Volume')\n            ]\n\n        for col1, op, col2 in relationship_rules:\n            if col1 in self.df.columns and col2 in self.df.columns:\n                if op == '>=':\n                    valid = (self.df[col1] >= self.df[col2]).sum()\n                elif op == '>':\n                    valid = (self.df[col1] > self.df[col2]).sum()\n                elif op == '==':\n                    valid = (self.df[col1] == self.df[col2]).sum()\n\n                total = self.df[[col1, col2]].notna().all(axis=1).sum()\n                consistency[f'{col1}_{op}_{col2}'] = (valid / total * 100) if total > 0 else 100\n\n        overall = np.mean(list(consistency.values())) if consistency else 100\n\n        self.results['consistency'] = {\n            'checks': consistency,\n            'overall': overall,\n            'threshold': 99,\n            'passed': overall >= 99\n        }\n\n        return self.results['consistency']\n\n    def check_timeliness(self, date_col='Modified_Date', max_age_days=30):\n        \"\"\"Check data timeliness (Своевременность)\"\"\"\n        if date_col not in self.df.columns:\n            self.results['timeliness'] = {\n                'overall': None,\n                'message': 
f'Column {date_col} not found'\n            }\n            return self.results['timeliness']\n\n        dates = pd.to_datetime(self.df[date_col], errors='coerce')\n        cutoff = datetime.now() - timedelta(days=max_age_days)\n\n        recent = (dates >= cutoff).sum()\n        total = dates.notna().sum()\n        timeliness_pct = (recent / total * 100) if total > 0 else 0\n\n        oldest = dates.min()\n        newest = dates.max()\n        avg_age = (datetime.now() - dates.mean()).days if dates.notna().any() else None\n\n        self.results['timeliness'] = {\n            'recent_percentage': timeliness_pct,\n            'oldest_record': oldest,\n            'newest_record': newest,\n            'average_age_days': avg_age,\n            'threshold': 80,\n            'passed': timeliness_pct >= 80\n        }\n\n        return self.results['timeliness']\n\n    def check_validity(self, patterns=None):\n        \"\"\"Check data validity with regex patterns\"\"\"\n        if patterns is None:\n            patterns = {\n                'ElementId': r'^[A-Z]{1,3}\\d{3,6}$',  # e.g., W001, FL12345\n                'Level': r'^Level\\s*\\d+$|^L\\d+$|^Уровень\\s*\\d+$',\n                'Email': r'^[\\w\\.-]+@[\\w\\.-]+\\.\\w+$',\n                'Phone': r'^\\+?\\d{10,15}$'\n            }\n\n        validity = {}\n        for col, pattern in patterns.items():\n            if col in self.df.columns:\n                non_null = self.df[col].dropna()\n                if len(non_null) > 0:\n                    matches = non_null.astype(str).str.match(pattern).sum()\n                    validity[col] = (matches / len(non_null) * 100)\n\n                    invalid = len(non_null) - matches\n                    if invalid > 0:\n                        self.issues.append(f\"{col}: {invalid} values don't match pattern\")\n                else:\n                    validity[col] = 100\n\n        overall = np.mean(list(validity.values())) if validity else 100\n\n        self.results['validity'] = {\n            'by_column': validity,\n            'overall': overall,\n            'threshold': 95,\n            'passed': overall >= 95\n        }\n\n        return self.results['validity']\n\n    def run_full_check(self):\n        \"\"\"Run all quality checks, skipping any already run with custom settings\"\"\"\n        if 'completeness' not in self.results:\n            self.check_completeness()\n        if 'accuracy' not in self.results:\n            self.check_accuracy()\n        if 'consistency' not in self.results:\n            self.check_consistency()\n        if 'timeliness' not in self.results:\n            self.check_timeliness()\n        if 'validity' not in self.results:\n            self.check_validity()\n\n        # Calculate overall score (0 is a valid score, so test for None explicitly)\n        scores = []\n        for metric in ['completeness', 'accuracy', 'consistency', 'validity']:\n            if metric in self.results and self.results[metric].get('overall') is not None:\n                scores.append(self.results[metric]['overall'])\n\n        self.results['overall_score'] = np.mean(scores) if scores else 0\n        self.results['grade'] = self._calculate_grade(self.results['overall_score'])\n        self.results['issues'] = self.issues\n\n        return self.results\n\n    def _calculate_grade(self, score):\n        \"\"\"Calculate quality grade\"\"\"\n        if score >= 98:\n            return 'A+'\n        elif score >= 95:\n            return 'A'\n        elif score >= 90:\n            return 'B'\n        elif score >= 80:\n            return 'C'\n        elif score >= 70:\n            return 'D'\n        else:\n            return 'F'\n\n    def generate_report(self):\n        \"\"\"Generate quality report\"\"\"\n        if not self.results:\n            self.run_full_check()\n\n        report = []\n        report.append(\"=\" * 60)\n        report.append(\"DATA QUALITY REPORT\")\n        report.append(\"=\" * 60)\n        report.append(f\"Records analyzed: {len(self.df)}\")\n        report.append(f\"Columns: {len(self.df.columns)}\")\n        report.append(\"\")\n        report.append(f\"OVERALL SCORE: {self.results['overall_score']:.1f}% (Grade: {self.results['grade']})\")\n        report.append(\"\")\n        report.append(\"-\" * 60)\n\n        # 
Detail by dimension\n        for metric in ['completeness', 'accuracy', 'consistency', 'validity', 'timeliness']:\n            if metric in self.results:\n                r = self.results[metric]\n                passed = '✓' if r.get('passed', False) else '✗'\n                overall = r.get('overall', r.get('recent_percentage', 'N/A'))\n                if isinstance(overall, (int, float)):\n                    report.append(f\"{metric.upper():15s}: {overall:>6.1f}% {passed}\")\n                else:\n                    report.append(f\"{metric.upper():15s}: {overall}\")\n\n        report.append(\"-\" * 60)\n\n        if self.issues:\n            report.append(\"\")\n            report.append(\"ISSUES FOUND:\")\n            for issue in self.issues[:10]:  # Show first 10\n                report.append(f\"  • {issue}\")\n            if len(self.issues) > 10:\n                report.append(f\"  ... and {len(self.issues) - 10} more issues\")\n\n        report.append(\"\")\n        report.append(\"=\" * 60)\n\n        return \"\\n\".join(report)\n\nValidation Rules Builder\nCustom Validation Rules\nclass ValidationRulesBuilder:\n    \"\"\"Build custom validation rules for construction data\"\"\"\n\n    def __init__(self):\n        self.rules = []\n\n    def add_not_null(self, column):\n        \"\"\"Column must not have null values\"\"\"\n        self.rules.append({\n            'type': 'not_null',\n            'column': column,\n            'check': lambda df, col=column: df[col].notna().all()\n        })\n        return self\n\n    def add_unique(self, column):\n        \"\"\"Column must have unique values\"\"\"\n        self.rules.append({\n            'type': 'unique',\n            'column': column,\n            'check': lambda df, col=column: df[col].nunique() == len(df)\n        })\n        return self\n\n    def add_range(self, column, min_val=None, max_val=None):\n        \"\"\"Column values must be within range\"\"\"\n        self.rules.append({\n            
'type': 'range',\n            'column': column,\n            'min': min_val,\n            'max': max_val,\n            # 0 is falsy, so test bounds against None rather than using 'or'\n            'check': lambda df, col=column, mn=min_val, mx=max_val:\n                df[col].between(mn if mn is not None else -np.inf,\n                                mx if mx is not None else np.inf).all()\n        })\n        return self\n\n    def add_regex(self, column, pattern):\n        \"\"\"Column values must match regex pattern\"\"\"\n        self.rules.append({\n            'type': 'regex',\n            'column': column,\n            'pattern': pattern,\n            'check': lambda df, col=column, p=pattern:\n                df[col].astype(str).str.match(p).all()\n        })\n        return self\n\n    def add_in_list(self, column, valid_values):\n        \"\"\"Column values must be in list\"\"\"\n        self.rules.append({\n            'type': 'in_list',\n            'column': column,\n            'valid_values': valid_values,\n            'check': lambda df, col=column, vals=valid_values:\n                df[col].isin(vals).all()\n        })\n        return self\n\n    def add_custom(self, name, check_func):\n        \"\"\"Add custom validation function\"\"\"\n        self.rules.append({\n            'type': 'custom',\n            'name': name,\n            'check': check_func\n        })\n        return self\n\n    def validate(self, df):\n        \"\"\"Run all validation rules\"\"\"\n        results = []\n\n        for rule in self.rules:\n            try:\n                passed = rule['check'](df)\n                results.append({\n                    'rule': rule.get('name', f\"{rule['type']}:{rule.get('column', 'custom')}\"),\n                    'passed': passed,\n                    'type': rule['type']\n                })\n            except Exception as e:\n                results.append({\n                    'rule': rule.get('name', f\"{rule['type']}:{rule.get('column', 'custom')}\"),\n                    'passed': False,\n                    'error': str(e)\n                })\n\n        return results\n\n# Usage example\nrules = (ValidationRulesBuilder()\n    .add_not_null('ElementId')\n    .add_unique('ElementId')\n    .add_range('Volume_m3', min_val=0)\n    .add_range('Cost', min_val=0)\n    .add_in_list('Category', ['Wall', 'Floor', 'Column', 'Beam', 'Slab'])\n    .add_regex('Level', r'^Level\\s*\\d+$')\n)\n\nresults = rules.validate(df)\nfor r in results:\n    status = '✓' if r['passed'] else '✗'\n    print(f\"{status} {r['rule']}\")\n\nAutomated Quality Pipeline\nclass DataQualityPipeline:\n    \"\"\"Automated data quality pipeline\"\"\"\n\n    def __init__(self, config=None):\n        self.config = config or self._default_config()\n        self.history = []\n\n    def _default_config(self):\n        return {\n            'required_columns': ['ElementId', 'Category', 'Volume_m3'],\n            'unique_columns': ['ElementId'],\n            'numeric_ranges': {\n                'Volume_m3': (0, 10000),\n                'Area_m2': (0, 100000),\n                'Cost': (0, 100000000)\n            },\n            'valid_categories': ['Wall', 'Floor', 'Column', 'Beam', 'Slab',\n                                 'Foundation', 'Roof', 'Stair', 'Door', 'Window'],\n            'min_quality_score': 90\n        }\n\n    def run(self, df, source_name='unknown'):\n        \"\"\"Run quality pipeline\"\"\"\n        checker = DataQualityChecker(df)\n\n        # Configure checks based on config\n        checker.check_completeness(self.config['required_columns'])\n        checker.check_accuracy({\n            col: {'min': r[0], 'max': r[1]}\n            for col, r in self.config['numeric_ranges'].items()\n        })\n        checker.check_consistency(self.config['unique_columns'])\n        checker.check_validity()\n\n        results = checker.run_full_check()\n\n        # Store in history\n        self.history.append({\n            'timestamp': datetime.now(),\n            'source': source_name,\n            'records': len(df),\n            'score': 
results['overall_score'],\n            'grade': results['grade'],\n            'issues_count': len(results['issues'])\n        })\n\n        # Check threshold\n        passed = results['overall_score'] >= self.config['min_quality_score']\n\n        return {\n            'passed': passed,\n            'score': results['overall_score'],\n            'grade': results['grade'],\n            'details': results,\n            'report': checker.generate_report()\n        }\n\n    def get_history_summary(self):\n        \"\"\"Get quality history summary\"\"\"\n        if not self.history:\n            return \"No quality checks performed yet.\"\n\n        df_history = pd.DataFrame(self.history)\n        return {\n            'total_checks': len(self.history),\n            'avg_score': df_history['score'].mean(),\n            'min_score': df_history['score'].min(),\n            'max_score': df_history['score'].max(),\n            'latest': self.history[-1]\n        }\n\nQuality Reporting\nExport Quality Report\ndef export_quality_report(df, output_path, include_details=True):\n    \"\"\"Export comprehensive quality report to Excel\"\"\"\n    checker = DataQualityChecker(df)\n    results = checker.run_full_check()\n\n    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:\n        # Summary sheet\n        summary = pd.DataFrame({\n            'Metric': ['Overall Score', 'Grade', 'Records', 'Columns', 'Issues'],\n            'Value': [\n                f\"{results['overall_score']:.1f}%\",\n                results['grade'],\n                len(df),\n                len(df.columns),\n                len(results['issues'])\n            ]\n        })\n        summary.to_excel(writer, sheet_name='Summary', index=False)\n\n        # Completeness details\n        if 'completeness' in results:\n            comp_df = pd.DataFrame.from_dict(\n                results['completeness']['by_column'],\n                orient='index',\n                columns=['Completeness_%']\n  
          )\n            comp_df.to_excel(writer, sheet_name='Completeness')\n\n        # Issues list\n        if results['issues']:\n            issues_df = pd.DataFrame({'Issue': results['issues']})\n            issues_df.to_excel(writer, sheet_name='Issues', index=False)\n\n        # Missing values analysis\n        if include_details:\n            missing = df.isnull().sum()\n            missing_df = pd.DataFrame({\n                'Column': missing.index,\n                'Missing_Count': missing.values,\n                'Missing_%': (missing.values / len(df) * 100).round(2)\n            })\n            missing_df.to_excel(writer, sheet_name='Missing_Values', index=False)\n\n    return output_path\n\nQuick Reference\nMetric\tDescription\tThreshold\nCompleteness\t% non-null values\t≥ 95%\nAccuracy\tValues within valid range\t≥ 98%\nConsistency\tUnique IDs, valid relationships\t≥ 99%\nValidity\tMatch expected patterns\t≥ 95%\nTimeliness\tRecords updated recently\t≥ 80%\nCommon Validation Patterns\n# Construction-specific regex patterns\nPATTERNS = {\n    'element_id': r'^[A-Z]{1,3}\\d{3,8}$',\n    'revit_id': r'^\\d{5,8}$',\n    'ifc_guid': r'^[A-Za-z0-9_$]{22}$',\n    'level': r'^(Level|L|Уровень)\\s*[-]?\\d+$',\n    'grid': r'^[A-Z]{1,2}[-/]?\\d{0,3}$',\n    'date_iso': r'^\\d{4}-\\d{2}-\\d{2}$',\n    'cost_code': r'^\\d{2,3}[.-]\\d{2,4}[.-]?\\d{0,4}$'\n}\n\nResources\nBook: \"Data-Driven Construction\" by Artem Boiko, Chapter 2.6\nWebsite: https://datadrivenconstruction.io\nGreat Expectations: https://greatexpectations.io\nNext Steps\nSee bim-validation-pipeline for BIM-specific validation\nSee etl-pipeline for data processing pipelines\nSee data-visualization for quality dashboards"
  },
  "trust": {
    "sourceLabel": "tencent",
    "provenanceUrl": "https://clawhub.ai/datadrivenconstruction/data-quality-check",
    "publisherUrl": "https://clawhub.ai/datadrivenconstruction/data-quality-check",
    "owner": "datadrivenconstruction",
    "version": "2.1.0",
    "license": null,
    "verificationStatus": "Indexed source record"
  },
  "links": {
    "detailUrl": "https://openagent3.xyz/skills/data-quality-check",
    "downloadUrl": "https://openagent3.xyz/downloads/data-quality-check",
    "agentUrl": "https://openagent3.xyz/skills/data-quality-check/agent",
    "manifestUrl": "https://openagent3.xyz/skills/data-quality-check/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/data-quality-check/agent.md"
  }
}