Requirements
- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Azure AI Evaluation SDK for Python. Use for evaluating generative AI applications with quality, safety, and custom evaluators. Triggers: "azure-ai-evaluation", "evaluators", "GroundednessEvaluator", "evaluate", "AI quality metrics".
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.
Assess generative AI application performance with built-in and custom evaluators.
```bash
pip install azure-ai-evaluation

# With remote evaluation support
pip install azure-ai-evaluation[remote]
```
```bash
# For AI-assisted evaluators
AZURE_OPENAI_ENDPOINT=https://<resource>.openai.azure.com
AZURE_OPENAI_API_KEY=<your-api-key>
AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini

# For Foundry project integration
AIPROJECT_CONNECTION_STRING=<your-connection-string>
```
```python
import os

from azure.ai.evaluation import (
    GroundednessEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
    SimilarityEvaluator,
    RetrievalEvaluator,
)

# Initialize with Azure OpenAI model config
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"],
}

groundedness = GroundednessEvaluator(model_config)
relevance = RelevanceEvaluator(model_config)
coherence = CoherenceEvaluator(model_config)
```
```python
from azure.ai.evaluation import (
    F1ScoreEvaluator,
    RougeScoreEvaluator,
    RougeType,
    BleuScoreEvaluator,
    GleuScoreEvaluator,
    MeteorScoreEvaluator,
)

f1 = F1ScoreEvaluator()
rouge = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_L)  # rouge_type is required
bleu = BleuScoreEvaluator()
```
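As a usage sketch (the strings are illustrative), NLP evaluators compare a response against a ground-truth answer rather than calling a judge model:

```python
result = f1(
    response="Azure AI provides AI services and tools.",
    ground_truth="Azure AI is Microsoft's platform of AI services and tools.",
)
print(result["f1_score"])  # 0-1, higher is better
```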
```python
from azure.ai.evaluation import (
    ViolenceEvaluator,
    SexualEvaluator,
    SelfHarmEvaluator,
    HateUnfairnessEvaluator,
    IndirectAttackEvaluator,
    ProtectedMaterialEvaluator,
)
from azure.identity import DefaultAzureCredential

# Safety evaluators run against an Azure AI project
project_scope = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}

violence = ViolenceEvaluator(
    azure_ai_project=project_scope, credential=DefaultAzureCredential()
)
sexual = SexualEvaluator(
    azure_ai_project=project_scope, credential=DefaultAzureCredential()
)
```
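Safety evaluators are called like the quality evaluators, on a query/response pair; a minimal sketch (result keys follow the metrics table below):

```python
result = violence(
    query="What is Azure AI?",
    response="Azure AI provides AI services and tools.",
)
# Severity label plus a 0-7 score and reasoning
print(result["violence"], result["violence_score"])
```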
```python
from azure.ai.evaluation import GroundednessEvaluator

groundedness = GroundednessEvaluator(model_config)

result = groundedness(
    query="What is Azure AI?",
    context="Azure AI is Microsoft's AI platform...",
    response="Azure AI provides AI services and tools.",
)

print(f"Groundedness score: {result['groundedness']}")
print(f"Reason: {result['groundedness_reason']}")
```
```python
from azure.ai.evaluation import evaluate

result = evaluate(
    data="test_data.jsonl",
    evaluators={
        "groundedness": groundedness,
        "relevance": relevance,
        "coherence": coherence,
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.context}",
                "response": "${data.response}",
            }
        }
    },
)

print(result["metrics"])
```
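`evaluate()` reads one JSON object per line; a minimal sketch that writes a compatible `test_data.jsonl` (the rows are illustrative, and the field names only need to match the column mapping above):

```python
import json

rows = [
    {
        "query": "What is Azure AI?",
        "context": "Azure AI is Microsoft's AI platform...",
        "response": "Azure AI provides AI services and tools.",
    },
]

with open("test_data.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```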
```python
from azure.ai.evaluation import QAEvaluator, ContentSafetyEvaluator
from azure.identity import DefaultAzureCredential

# All quality metrics in one
qa_evaluator = QAEvaluator(model_config)

# All safety metrics in one
safety_evaluator = ContentSafetyEvaluator(
    azure_ai_project=project_scope, credential=DefaultAzureCredential()
)

result = evaluate(
    data="data.jsonl",
    evaluators={
        "qa": qa_evaluator,
        "content_safety": safety_evaluator,
    },
)
```
```python
from azure.ai.evaluation import evaluate
from my_app import chat_app  # Your application

result = evaluate(
    data="queries.jsonl",
    target=chat_app,  # Callable that takes a query, returns a response
    evaluators={
        "groundedness": groundedness,
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${outputs.context}",
                "response": "${outputs.response}",
            }
        }
    },
)
```
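Since the mapping above reads `${outputs.context}` and `${outputs.response}`, the target callable must return a dict with those keys. A hypothetical stub for `my_app.chat_app` (names and return values assumed, not part of the SDK):

```python
# my_app.py - hypothetical target; evaluate() calls it once per data row
def chat_app(query: str) -> dict:
    # Stand-ins for your real retrieval and generation stack
    context = f"Documents retrieved for: {query}"
    response = f"Generated answer for: {query}"
    # These keys feed ${outputs.context} and ${outputs.response}
    return {"context": context, "response": response}
```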
```python
# Any callable (or callable class) that returns a dict of metrics
# can be used as an evaluator - no decorator is required
def word_count_evaluator(response: str) -> dict:
    return {"word_count": len(response.split())}

# Use in evaluate()
result = evaluate(
    data="data.jsonl",
    evaluators={"word_count": word_count_evaluator},
)
```
```python
from openai import AzureOpenAI

# Prompt-based evaluator: calls the judge model directly through the
# openai client; any callable class returning a dict works in evaluate()
class CustomEvaluator:
    def __init__(self, model_config):
        self.deployment = model_config["azure_deployment"]
        self.client = AzureOpenAI(
            azure_endpoint=model_config["azure_endpoint"],
            api_key=model_config["api_key"],
            api_version="2024-06-01",
        )

    def __call__(self, *, query: str, response: str) -> dict:
        prompt = f"Rate this response 1-5. Reply with the number only. Query: {query} Response: {response}"
        result = self.client.chat.completions.create(
            model=self.deployment,
            messages=[{"role": "user", "content": prompt}],
        )
        return {"custom_score": int(result.choices[0].message.content.strip())}
```
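Both custom evaluators then plug into `evaluate()` alongside the built-ins; a minimal sketch assuming the `model_config` and data file from earlier:

```python
custom = CustomEvaluator(model_config)

result = evaluate(
    data="data.jsonl",
    evaluators={
        "word_count": word_count_evaluator,
        "custom": custom,
    },
)
print(result["metrics"])
```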
```python
import os

from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

project = AIProjectClient.from_connection_string(
    conn_str=os.environ["AIPROJECT_CONNECTION_STRING"],
    credential=DefaultAzureCredential(),
)

result = evaluate(
    data="data.jsonl",
    evaluators={"groundedness": groundedness},
    azure_ai_project=project.scope,  # Logs results to Foundry
)

print(f"View results: {result['studio_url']}")
```
| Evaluator | Type | Metrics |
|---|---|---|
| GroundednessEvaluator | AI | groundedness (1-5) |
| RelevanceEvaluator | AI | relevance (1-5) |
| CoherenceEvaluator | AI | coherence (1-5) |
| FluencyEvaluator | AI | fluency (1-5) |
| SimilarityEvaluator | AI | similarity (1-5) |
| RetrievalEvaluator | AI | retrieval (1-5) |
| F1ScoreEvaluator | NLP | f1_score (0-1) |
| RougeScoreEvaluator | NLP | rouge scores |
| ViolenceEvaluator | Safety | violence (0-7) |
| SexualEvaluator | Safety | sexual (0-7) |
| SelfHarmEvaluator | Safety | self_harm (0-7) |
| HateUnfairnessEvaluator | Safety | hate_unfairness (0-7) |
| QAEvaluator | Composite | All quality metrics |
| ContentSafetyEvaluator | Composite | All safety metrics |
- Use composite evaluators for comprehensive assessment
- Map columns correctly: mismatched columns cause silent failures
- Log to Foundry for tracking and comparison across runs
- Create custom evaluators for domain-specific metrics
- Use NLP evaluators when you have ground-truth answers
- Safety evaluators require an Azure AI project scope
- Batch evaluation is more efficient than single-row loops
| File | Contents |
|---|---|
| references/built-in-evaluators.md | Detailed patterns for AI-assisted, NLP-based, and Safety evaluators with configuration tables |
| references/custom-evaluators.md | Creating code-based and prompt-based custom evaluators, testing patterns |
| scripts/run_batch_evaluation.py | CLI tool for running batch evaluations with quality, safety, and custom evaluators |