
Azure AI Evaluation Py

Azure AI Evaluation SDK for Python. Use for evaluating generative AI applications with quality, safety, and custom evaluators. Triggers: "azure-ai-evaluation", "evaluators", "GroundednessEvaluator", "evaluate", "AI quality metrics".


Install for OpenClaw

Quick setup
  1. Download the package from Yavira.
  2. Extract the archive and review SKILL.md first.
  3. Import or place the package into your OpenClaw setup.

Requirements

  • Target platform: OpenClaw
  • Install method: Manual import
  • Extraction: Extract archive
  • Prerequisites: OpenClaw
  • Primary doc: SKILL.md

Package facts

  • Download mode: Yavira redirect
  • Package format: ZIP package
  • Source platform: Tencent SkillHub
  • What's included: SKILL.md, references/built-in-evaluators.md, references/custom-evaluators.md, scripts/run_batch_evaluation.py

Validation

  • Use the Yavira download entry.
  • Review SKILL.md after the package is downloaded.
  • Confirm the extracted package contains the expected setup assets.

Install with your agent

Agent handoff

Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.

  1. Download the package from Yavira.
  2. Extract it into a folder your agent can access.
  3. Paste one of the prompts below and point your agent at the extracted folder.
New install

I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.

Upgrade existing

I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.

Trust & source

Release facts

  • Source: Tencent SkillHub
  • Verification: Indexed source record
  • Version: 0.1.0

Documentation

Primary doc: SKILL.md (16 sections)

Azure AI Evaluation SDK for Python

Assess generative AI application performance with built-in and custom evaluators.

Installation

pip install azure-ai-evaluation

# With remote evaluation support
pip install azure-ai-evaluation[remote]

Environment Variables

# For AI-assisted evaluators
AZURE_OPENAI_ENDPOINT=https://<resource>.openai.azure.com
AZURE_OPENAI_API_KEY=<your-api-key>
AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini

# For Foundry project integration
AIPROJECT_CONNECTION_STRING=<your-connection-string>

Quality Evaluators (AI-Assisted)

import os

from azure.ai.evaluation import (
    GroundednessEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
    SimilarityEvaluator,
    RetrievalEvaluator,
)

# Initialize with Azure OpenAI model config
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"],
}

groundedness = GroundednessEvaluator(model_config)
relevance = RelevanceEvaluator(model_config)
coherence = CoherenceEvaluator(model_config)
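
The same configuration can also be expressed with the SDK's typed dictionary; a minimal sketch, assuming AzureOpenAIModelConfiguration is exported by the installed SDK version (it behaves like the plain dict above):

from azure.ai.evaluation import AzureOpenAIModelConfiguration

# TypedDict construction produces the same mapping as the plain dict above
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT"],
)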

Quality Evaluators (NLP-based)

from azure.ai.evaluation import (
    F1ScoreEvaluator,
    RougeScoreEvaluator,
    BleuScoreEvaluator,
    GleuScoreEvaluator,
    MeteorScoreEvaluator,
)

f1 = F1ScoreEvaluator()
rouge = RougeScoreEvaluator()  # Note: some SDK versions require a rouge_type argument here
bleu = BleuScoreEvaluator()
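
NLP evaluators score a response against a reference answer, so each call takes a ground_truth argument; a minimal single-row sketch (the sample strings are illustrative):

# F1 measures token overlap between the response and the reference answer
result = f1(
    response="Azure AI provides AI services and tools.",
    ground_truth="Azure AI is Microsoft's platform of AI services and tools.",
)
print(result["f1_score"])  # 0-1, per the evaluator reference table below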

Safety Evaluators

from azure.ai.evaluation import (
    ViolenceEvaluator,
    SexualEvaluator,
    SelfHarmEvaluator,
    HateUnfairnessEvaluator,
    IndirectAttackEvaluator,
    ProtectedMaterialEvaluator,
)

# Safety evaluators are scoped to an Azure AI (Foundry) project
project_scope = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}

violence = ViolenceEvaluator(azure_ai_project=project_scope)
sexual = SexualEvaluator(azure_ai_project=project_scope)
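
Safety evaluators are called per row with the query and response; a short sketch using the violence evaluator initialized above (the exact result fields vary by evaluator, so this just prints the whole dict):

result = violence(
    query="What is Azure AI?",
    response="Azure AI provides AI services and tools.",
)
# The result dict carries a severity label, a 0-7 score, and the model's reasoning
print(result)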

Single Row Evaluation

from azure.ai.evaluation import GroundednessEvaluator

groundedness = GroundednessEvaluator(model_config)

result = groundedness(
    query="What is Azure AI?",
    context="Azure AI is Microsoft's AI platform...",
    response="Azure AI provides AI services and tools.",
)

print(f"Groundedness score: {result['groundedness']}")
print(f"Reason: {result['groundedness_reason']}")

Batch Evaluation with evaluate()

from azure.ai.evaluation import evaluate

result = evaluate(
    data="test_data.jsonl",
    evaluators={
        "groundedness": groundedness,
        "relevance": relevance,
        "coherence": coherence,
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.context}",
                "response": "${data.response}",
            }
        }
    },
)

print(result["metrics"])
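
For reference, each line of test_data.jsonl is a JSON object whose keys match the column mapping above; an illustrative two-row example (contents are made up):

{"query": "What is Azure AI?", "context": "Azure AI is Microsoft's AI platform...", "response": "Azure AI provides AI services and tools."}
{"query": "What is an evaluator?", "context": "Evaluators score model outputs against quality and safety criteria...", "response": "An evaluator scores a model response."}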

Composite Evaluators

from azure.ai.evaluation import QAEvaluator, ContentSafetyEvaluator

# All quality metrics in one
qa_evaluator = QAEvaluator(model_config)

# All safety metrics in one
safety_evaluator = ContentSafetyEvaluator(azure_ai_project=project_scope)

result = evaluate(
    data="data.jsonl",
    evaluators={
        "qa": qa_evaluator,
        "content_safety": safety_evaluator,
    },
)

Evaluate Application Target

from azure.ai.evaluation import evaluate
from my_app import chat_app  # Your application

result = evaluate(
    data="queries.jsonl",
    target=chat_app,  # Callable that takes a query and returns a response
    evaluators={
        "groundedness": groundedness,
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${outputs.context}",
                "response": "${outputs.response}",
            }
        }
    },
)
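
The target must be a callable that accepts the data columns as keyword arguments and returns a dict; its keys are exposed to the column mapping as ${outputs.*}. A minimal sketch of such an application (my_app.chat_app is the hypothetical module from the snippet above, and the hard-coded strings stand in for real retrieval and generation):

# my_app.py (illustrative)
def chat_app(query: str) -> dict:
    # In a real app, retrieve context and call your model here
    context = "Azure AI is Microsoft's AI platform..."
    response = f"Answer to: {query}"
    # Keys map to ${outputs.context} and ${outputs.response} above
    return {"context": context, "response": response}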

Custom Evaluators: Code-Based

from azure.ai.evaluation import evaluate

# A code-based custom evaluator is any callable that returns a dict of metrics
def word_count_evaluator(response: str) -> dict:
    return {"word_count": len(response.split())}

# Use in evaluate()
result = evaluate(
    data="data.jsonl",
    evaluators={"word_count": word_count_evaluator},
)

Custom Evaluators: Prompt-Based

import os
from openai import AzureOpenAI

# A prompt-based evaluator is a callable that asks an LLM to score the response; this sketch calls Azure OpenAI directly through the openai package.
class CustomEvaluator:
    def __init__(self, deployment: str):
        self.deployment = deployment
        self.client = AzureOpenAI(
            azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
            api_key=os.environ["AZURE_OPENAI_API_KEY"],
            api_version="2024-06-01",
        )

    def __call__(self, query: str, response: str) -> dict:
        prompt = f"Rate this response 1-5 (reply with a single digit). Query: {query} Response: {response}"
        completion = self.client.chat.completions.create(
            model=self.deployment,
            messages=[{"role": "user", "content": prompt}],
        )
        return {"custom_score": int(completion.choices[0].message.content.strip())}
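
As with any custom evaluator, an instance can be passed straight into evaluate(); a short usage sketch assuming the class above and the deployment name from the environment:

custom = CustomEvaluator(deployment=os.environ["AZURE_OPENAI_DEPLOYMENT"])

result = evaluate(
    data="data.jsonl",
    evaluators={"custom": custom},
)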

Log to Foundry Project

import os

from azure.ai.evaluation import evaluate
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

project = AIProjectClient.from_connection_string(
    conn_str=os.environ["AIPROJECT_CONNECTION_STRING"],
    credential=DefaultAzureCredential(),
)

result = evaluate(
    data="data.jsonl",
    evaluators={"groundedness": groundedness},
    azure_ai_project=project.scope,  # Logs results to Foundry
)

print(f"View results: {result['studio_url']}")

Evaluator Reference

Evaluator                   Type        Metrics
GroundednessEvaluator       AI          groundedness (1-5)
RelevanceEvaluator          AI          relevance (1-5)
CoherenceEvaluator          AI          coherence (1-5)
FluencyEvaluator            AI          fluency (1-5)
SimilarityEvaluator         AI          similarity (1-5)
RetrievalEvaluator          AI          retrieval (1-5)
F1ScoreEvaluator            NLP         f1_score (0-1)
RougeScoreEvaluator         NLP         rouge scores
ViolenceEvaluator           Safety      violence (0-7)
SexualEvaluator             Safety      sexual (0-7)
SelfHarmEvaluator           Safety      self_harm (0-7)
HateUnfairnessEvaluator     Safety      hate_unfairness (0-7)
QAEvaluator                 Composite   All quality metrics
ContentSafetyEvaluator      Composite   All safety metrics

Best Practices

  • Use composite evaluators for comprehensive assessment
  • Map columns correctly; mismatched columns cause silent failures
  • Log to Foundry for tracking and comparison across runs
  • Create custom evaluators for domain-specific metrics
  • Use NLP evaluators when you have ground truth answers
  • Safety evaluators require an Azure AI project scope
  • Batch evaluation is more efficient than single-row loops; a combined example follows this list
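
A combined batch run that persists results locally; a minimal sketch, assuming the data file also carries a ground_truth column for the NLP evaluator and that evaluate() in your SDK version accepts an output_path argument and returns row-level results under "rows" (inspect the result dict if it does not):

result = evaluate(
    data="test_data.jsonl",
    evaluators={"groundedness": groundedness, "f1": f1},
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.context}",
                "response": "${data.response}",
                "ground_truth": "${data.ground_truth}",
            }
        }
    },
    output_path="./evaluation_results.json",  # assumed parameter; persists the full result
)

print(result["metrics"])            # aggregate metrics
for row in result.get("rows", []):  # per-row scores, if present
    print(row)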

Reference Files

  • references/built-in-evaluators.md: Detailed patterns for AI-assisted, NLP-based, and Safety evaluators with configuration tables
  • references/custom-evaluators.md: Creating code-based and prompt-based custom evaluators, testing patterns
  • scripts/run_batch_evaluation.py: CLI tool for running batch evaluations with quality, safety, and custom evaluators

Category context

Agent frameworks, memory systems, reasoning layers, and model-native orchestration.

Source: Tencent SkillHub

Largest current source with strong distribution and engagement signals.

Package contents

Included in package (3 docs, 1 script)

  • SKILL.md (primary doc)
  • references/built-in-evaluators.md (doc)
  • references/custom-evaluators.md (doc)
  • scripts/run_batch_evaluation.py (script)