# Send Adaptive Testing to your agent
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
## Fast path
- Download the package from Yavira.
- Extract it into a folder your agent can access.
- Paste one of the prompts below and point your agent at the extracted folder.
## Suggested prompts
### New install

```text
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.
```
### Upgrade existing

```text
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.
```
## Machine-readable fields
```json
{
  "schemaVersion": "1.0",
  "item": {
    "slug": "adaptivetest",
    "name": "Adaptive Testing",
    "source": "tencent",
    "type": "skill",
    "category": "开发工具",
    "sourceUrl": "https://clawhub.ai/woodstocksoftware/adaptivetest",
    "canonicalUrl": "https://clawhub.ai/woodstocksoftware/adaptivetest",
    "targetPlatform": "OpenClaw"
  },
  "install": {
    "downloadUrl": "/downloads/adaptivetest",
    "sourceDownloadUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=adaptivetest",
    "sourcePlatform": "tencent",
    "targetPlatform": "OpenClaw",
    "packageFormat": "ZIP package",
    "primaryDoc": "SKILL.md",
    "includedAssets": [
      "SKILL.md",
      "references/calibration.md",
      "references/implementation.md"
    ],
    "downloadMode": "redirect",
    "sourceHealth": {
      "source": "tencent",
      "slug": "adaptivetest",
      "status": "healthy",
      "reason": "direct_download_ok",
      "recommendedAction": "download",
      "checkedAt": "2026-04-29T08:50:56.859Z",
      "expiresAt": "2026-05-06T08:50:56.859Z",
      "httpStatus": 200,
      "finalUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=adaptivetest",
      "contentType": "application/zip",
      "probeMethod": "head",
      "details": {
        "probeUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=adaptivetest",
        "contentDisposition": "attachment; filename=\"adaptivetest-1.0.3.zip\"",
        "redirectLocation": null,
        "bodySnippet": null,
        "slug": "adaptivetest"
      },
      "scope": "item",
      "summary": "Item download looks usable.",
      "detail": "Yavira can redirect you to the upstream package for this item.",
      "primaryActionLabel": "Download for OpenClaw",
      "primaryActionHref": "/downloads/adaptivetest"
    },
    "validation": {
      "installChecklist": [
        "Use the Yavira download entry.",
        "Review SKILL.md after the package is downloaded.",
        "Confirm the extracted package contains the expected setup assets."
      ],
      "postInstallChecks": [
        "Confirm the extracted package includes the expected docs or setup files.",
        "Validate the skill or prompts are available in your target agent workspace.",
        "Capture any manual follow-up steps the agent could not complete."
      ]
    }
  },
  "links": {
    "detailUrl": "https://openagent3.xyz/skills/adaptivetest",
    "downloadUrl": "https://openagent3.xyz/downloads/adaptivetest",
    "agentUrl": "https://openagent3.xyz/skills/adaptivetest/agent",
    "manifestUrl": "https://openagent3.xyz/skills/adaptivetest/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/adaptivetest/agent.md"
  }
}
```
## Documentation

### Adaptive Testing with IRT

Design computerized adaptive tests that measure ability efficiently and accurately using Item Response Theory.

### Core Concept

Adaptive tests adjust difficulty in real time based on student responses. A correct answer → harder question. Incorrect → easier question. The result: accurate ability estimates in ~50% fewer questions than fixed-length tests.

Key advantage: Traditional tests waste time on too-easy or too-hard questions. Adaptive tests spend time where measurement matters most — near the student's ability level.

### Quick Decision Tree

| You need to... | See |
| --- | --- |
| Understand IRT models and parameters | IRT Fundamentals |
| Design a new adaptive test | Test Design Workflow |
| Choose item selection algorithm | Item Selection |
| Decide when to stop the test | Stopping Rules |
| Calibrate new questions | references/calibration.md |
| Implement CAT algorithm | references/implementation.md |

### The 3-Parameter Logistic (3PL) Model

Most adaptive tests use the 3PL model. Each question has three parameters:

- a (discrimination) — How well the question differentiates ability levels. Higher = steeper curve. Typical range: 0.5 to 2.5
- b (difficulty) — The ability level where P(correct) = 0.5. Range: -3 to +3 (standardized scale)
- c (guessing) — Probability of guessing correctly. Usually 0.2 to 0.25 for multiple choice

Probability of correct response:

P(correct | ability, a, b, c) = c + (1 - c) / (1 + e^(-a(ability - b)))

Simpler models:

- 2PL: Set c = 0 (no guessing parameter)
- 1PL (Rasch): Set c = 0 and a = 1 for all items (only difficulty varies)

Use 3PL for high-stakes tests. Use 2PL/1PL when sample size is small (<500 responses per item).
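
As a sanity check, the formula translates directly into Python. This is an illustrative sketch (the function name is ours, not part of the package):

```python
import math

def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """3PL probability of a correct response at ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# The simpler models fall out as special cases:
# p_correct(theta, a, b, 0.0)    -> 2PL (no guessing)
# p_correct(theta, 1.0, b, 0.0)  -> 1PL / Rasch (only difficulty varies)
```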

### Information and Standard Error

Information measures how precisely an item estimates ability at a given level. Peak information occurs when ability ≈ difficulty (b parameter).

Standard Error (SE) is the reciprocal of the square root of information:

SE = 1 / sqrt(Information)

Goal of CAT: Maximize information (minimize SE) at the student's true ability level.
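
In code, the standard 3PL item information function and the SE over the administered items look like this (a sketch reusing `p_correct` from above):

```python
def item_information(theta: float, a: float, b: float, c: float) -> float:
    """Fisher information of a 3PL item at ability theta."""
    p = p_correct(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def standard_error(theta: float, items: list) -> float:
    """SE of the ability estimate; items is a list of (a, b, c) tuples."""
    total = sum(item_information(theta, a, b, c) for a, b, c in items)
    return 1.0 / math.sqrt(total)
```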

### 1. Define Test Specifications

- Purpose: Placement, diagnostic, certification, progress monitoring?
- Content domain: Single skill or multidimensional?
- Target population: What ability range (-3 to +3)?
- Constraints: Time limit, minimum/maximum length, content balance

### 2. Build Item Bank

Minimum bank size: 10× the average test length. For a 20-item CAT, you need ≥200 calibrated items.

Distribution targets:

- Difficulty (b): Spread across expected ability range
- Discrimination (a): Target 1.0 to 2.0 (high discrimination)
- Exposure: No item used >20% of the time

Content balancing: If testing math, ensure geometry/algebra/etc. are proportionally represented.
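
Before launch it is worth a quick distribution check over the calibrated bank. A minimal sketch, assuming the bank is a list of (a, b, c) tuples:

```python
def bank_summary(bank: list) -> None:
    """Print rough distribution checks for a bank of (a, b, c) tuples."""
    a_vals = [a for a, _, _ in bank]
    b_vals = [b for _, b, _ in bank]
    print(f"{len(bank)} items")
    print(f"discrimination a: {min(a_vals):.2f} to {max(a_vals):.2f}")
    print(f"difficulty b:     {min(b_vals):.2f} to {max(b_vals):.2f}")
    weak = sum(1 for a in a_vals if a < 1.0)
    print(f"items below a = 1.0: {weak} ({weak / len(bank):.0%})")
```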

### 3. Choose Algorithms

Pick one from each category:

Item selection (see below):

- Maximum Information
- Randomesque (MFI + exposure control)
- Content balancing

Ability estimation (a minimal EAP sketch follows this list):

- Maximum Likelihood Estimation (MLE)
- Expected A Posteriori (EAP) — better for extreme scores
- Weighted Likelihood Estimation (WLE)

Stopping rule (see below):

- Fixed length
- Standard error threshold
- Information threshold
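
Because EAP is the recommended choice for extreme scores, here is a minimal EAP sketch under a standard-normal prior with a fixed quadrature grid (grid size and names are illustrative; `p_correct` is from the 3PL sketch above):

```python
def eap_estimate(responses: list, items: list, n_points: int = 61) -> float:
    """EAP ability estimate; responses are 0/1, items are (a, b, c) tuples."""
    grid = [-4.0 + 8.0 * k / (n_points - 1) for k in range(n_points)]
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) prior
        for u, (a, b, c) in zip(responses, items):
            p = p_correct(theta, a, b, c)
            w *= p if u else 1.0 - p
        weights.append(w)
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total
```

Unlike MLE, the posterior mean stays finite for all-correct or all-incorrect response patterns, because the prior pulls the estimate back toward 0.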

### 4. Simulate Performance

Before going live, simulate 1000+ test sessions with known abilities. Check:

- Average test length
- SE at different ability levels
- Item exposure rates
- Content balance adherence

Adjust if needed.
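
A harness for this can be a few lines. The sketch below assumes a `run_cat(bank, answer_fn)` session function like the one outlined at the end of this brief, and reuses `p_correct` for the simulated student:

```python
import random

def simulate(bank: list, run_cat, n_sessions: int = 1000) -> None:
    """Simulate examinees with known abilities and report CAT diagnostics."""
    exposure = [0] * len(bank)
    lengths, errors = [], []
    for _ in range(n_sessions):
        true_theta = random.gauss(0.0, 1.0)
        # Simulated student answers correctly with 3PL probability.
        answer = lambda item: int(random.random() < p_correct(true_theta, *item))
        theta_hat, se, used = run_cat(bank, answer)
        lengths.append(len(used))
        errors.append(theta_hat - true_theta)
        for i in used:
            exposure[i] += 1
    print(f"average length: {sum(lengths) / n_sessions:.1f}")
    print(f"mean error:     {sum(errors) / n_sessions:+.3f}")
    print(f"max exposure:   {max(exposure) / n_sessions:.0%}")
```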

### Maximum Fisher Information (MFI)

Rule: Select the item with highest information at current ability estimate.

- Pros: Optimal precision, shortest tests
- Cons: Overuses "best" items, poor security

Use when: Pilot testing, low-stakes practice

### Randomesque (MFI + Exposure Control)

Rule: Select from top N items by information (e.g., top 5), choose randomly from that set.

- Pros: Balances precision and security
- Cons: Slightly longer tests than pure MFI

Use when: Operational tests, default choice
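
The selector itself is short. A sketch reusing `item_information` from above, with the bank as a list of (a, b, c) tuples:

```python
import random

def select_randomesque(theta: float, bank: list, administered: set,
                       top_n: int = 5) -> int:
    """Pick randomly among the top-N most informative unused items."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    candidates.sort(key=lambda i: item_information(theta, *bank[i]),
                    reverse=True)
    return random.choice(candidates[:top_n])
```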

### a-Stratified

Rule: Stratify the bank by discrimination; administer low-a items while the ability estimate is still rough, saving high-a items for later stages.

- Pros: Protects high-discrimination items from overexposure
- Cons: Complex to implement

Use when: Very large item banks, research settings

### Content Balancing

Rule: Track content area usage, prioritize underrepresented areas when selecting next item.

Implementation: Weight information by content constraint satisfaction.

Use when: Blueprint requirements, multidimensional tests
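
One possible weighting (an assumption on our part, not a formula from the package docs): boost an item's information by how far its content area lags its blueprint target. Bank entries here are assumed to be (a, b, c, area) tuples:

```python
import random

def select_content_balanced(theta: float, bank: list, administered: set,
                            targets: dict, top_n: int = 5) -> int:
    """targets maps area -> desired share, e.g. {"algebra": 0.4, ...}."""
    given = max(1, len(administered))
    counts: dict = {}
    for i in administered:
        area = bank[i][3]
        counts[area] = counts.get(area, 0) + 1

    def score(i: int) -> float:
        a, b, c, area = bank[i]
        deficit = targets[area] - counts.get(area, 0) / given
        # Underrepresented areas get a boost; overrepresented ones do not.
        return item_information(theta, a, b, c) * (1.0 + max(0.0, deficit))

    candidates = [i for i in range(len(bank)) if i not in administered]
    candidates.sort(key=score, reverse=True)
    return random.choice(candidates[:top_n])
```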

### Fixed Length

Stop after N items (e.g., 20 questions).

- Pros: Predictable time, simple
- Cons: May over/under-test some students

Use when: Time limits matter, simple implementation needed

### Standard Error Threshold

Stop when SE < target (e.g., SE < 0.3).

- Pros: Consistent precision across ability levels
- Cons: Variable test length (harder to schedule)

Typical targets:

- Low-stakes: SE < 0.4
- Medium-stakes: SE < 0.3
- High-stakes: SE < 0.25

Use when: Precision matters more than time

### Combined Rule

Stop when (SE < target) OR (length ≥ max) OR (length ≥ min AND ability estimate stable).

Use when: Production systems (safest approach)
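
A direct translation of the rule (the thresholds shown are illustrative defaults):

```python
def should_stop(se: float, length: int, theta_history: list,
                se_target: float = 0.3, min_len: int = 10,
                max_len: int = 25, stable_eps: float = 0.05) -> bool:
    """Combined rule: precision reached, hard cap hit, or estimate stable."""
    if se < se_target or length >= max_len:
        return True
    if length >= min_len and len(theta_history) >= 2:
        return abs(theta_history[-1] - theta_history[-2]) < stable_eps
    return False
```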

### Starting Ability Estimate

Options:

- Population mean (θ = 0)
- Prior information (e.g., grade level, a previous test score)
- A medium-difficulty first question, then estimate from there

Never start at extremes (-3 or +3).

### Handling Extreme Response Patterns

- All correct or all incorrect: MLE fails. Use EAP or a Bayesian prior to regularize.
- Rapid changes: If the ability estimate jumps by more than 1.0, consider a response anomaly (cheating, guessing).

### Exposure Control

Track how often each item is used. Flag items used >20% of the time. Consider:

- Randomesque selection (above)
- Sympson-Hetter method (advanced)
- Periodic item bank refresh
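
Flagging overexposed items is trivial once usage is logged per session (a sketch; the 20% cap matches the guideline above):

```python
def flag_overexposed(exposure: list, n_sessions: int,
                     cap: float = 0.20) -> list:
    """Return indices of items administered in more than `cap` of sessions."""
    return [i for i, n in enumerate(exposure) if n / n_sessions > cap]
```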

### Multidimensional IRT (MIRT)

If testing multiple skills (e.g., algebra + geometry), use separate ability estimates per dimension. Select items to balance information across dimensions.

Warning: MIRT requires larger item banks and more complex calibration.

### Common Mistakes

- ❌ Too few items in bank → High exposure, security risk
  ✅ Aim for 10× average test length

- ❌ Poorly distributed difficulties → Accurate only in a narrow ability range
  ✅ Spread items across -2 to +2 difficulty

- ❌ Ignoring content balance → May skip important topics
  ✅ Build content constraints into item selection

- ❌ Using MLE with all-incorrect responses → Returns -∞
  ✅ Use EAP or cap estimates at -3/+3

- ❌ No exposure control → Same items every test
  ✅ Use randomesque or Sympson-Hetter

### When to Load References

| Need | File |
| --- | --- |
| Calibrate new items (collect data, estimate parameters) | references/calibration.md |
| Implement CAT algorithm (code patterns, libraries) | references/implementation.md |

### Real-World Example: K-12 Math Placement

Setup:

- Item bank: 300 questions, b from -2 (basic) to +2 (advanced)
- Target: SE < 0.35 or max 25 questions
- Content: 40% algebra, 30% geometry, 30% statistics
- Algorithm: Randomesque (top 5), EAP estimation

Flow:

1. Start at θ = 0 (grade-level average)
2. Select item: b ≈ 0, content area needed
3. Student answers → update ability estimate (EAP)
4. Select next: maximize information at new θ, respect content balance, randomesque from top 5
5. Stop when SE < 0.35 or 25 questions reached
6. Report: ability estimate + placement recommendation

Result: Average 18 questions, 95% of students placed within ±0.5 grade levels of true ability.
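
Wiring the earlier sketches together, one session for this setup could look like the following (content balancing omitted for brevity; `answer_fn` stands in for the live student interaction):

```python
def run_cat(bank: list, answer_fn, se_target: float = 0.35,
            max_len: int = 25):
    """One CAT session over a bank of (a, b, c) items.
    answer_fn(item) is assumed to return 1 (correct) or 0 (incorrect)."""
    administered: set = set()
    responses, items = [], []
    theta, se = 0.0, float("inf")  # start at the population mean
    while len(items) < max_len and se >= se_target:
        idx = select_randomesque(theta, bank, administered)
        administered.add(idx)
        items.append(bank[idx])
        responses.append(answer_fn(bank[idx]))
        theta = eap_estimate(responses, items)  # EAP, per the setup above
        se = standard_error(theta, items)
    return theta, se, sorted(administered)
```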

### Further Reading

- Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems.
- Wainer, H. (2000). Computerized Adaptive Testing: A Primer (2nd ed.).
- van der Linden, W. J., & Glas, C. A. W. (2010). Elements of Adaptive Testing.

IRT packages:

- Python: girth, catsim
- R: mirt, TAM, catR
- Production: Custom implementation or AdaptiveTest.io
## Trust
- Source: tencent
- Verification: Indexed source record
- Publisher: woodstocksoftware
- Version: 1.0.3
## Source health
- Status: healthy
- Item download looks usable.
- Yavira can redirect you to the upstream package for this item.
- Health scope: item
- Reason: direct_download_ok
- Checked at: 2026-04-29T08:50:56.859Z
- Expires at: 2026-05-06T08:50:56.859Z
- Recommended action: Download for OpenClaw
## Links
- [Detail page](https://openagent3.xyz/skills/adaptivetest)
- [Send to Agent page](https://openagent3.xyz/skills/adaptivetest/agent)
- [JSON manifest](https://openagent3.xyz/skills/adaptivetest/agent.json)
- [Markdown brief](https://openagent3.xyz/skills/adaptivetest/agent.md)
- [Download page](https://openagent3.xyz/downloads/adaptivetest)