Brainstorm · 2026-05-09 · v0.1 · meta-agent capability bench

Mapping the multi-dimensional capability space of meta-agents.

A capability benchmark for meta-agents needs to cover the space, not pick favorites. CL-Bench, RE-Bench, and Hyperagents each ship a small task set that composes a multi-dimensional capability — and the way they compose it is the most portable lesson, not the tasks themselves.

This page lays out 18 candidate axes for a meta-agent benchmark, grouped by online vs. offline, cross-walked to the substrates we already have, with three composition tricks worth borrowing wholesale. It is the input to a v1 cut — not the cut itself.

Status: v1 cut TBD.
01

How three benchmarks composed their capability space

Each of these three benchmarks ships a small task set that's claimed to cover a multi-dimensional capability. The task lists differ; the composition logic is what's portable. Read across them and three distinct moves emerge: axis-per-task with cheap variants, one task per flavor with an orthogonal eval-grid axis, and a small domain set chosen for uncorrelation.

CL-Bench 1.0
Berkeley Sky Lab · May 2026 · 6 tasks

Continual learning — does the same system, with state, beat its own stateless baseline?

Move
One task per axis; schedule.json + variant.json multiply coverage cheaply.
Axes
Operational reuse · cross-source synthesis · drift adaptation+retention · long-horizon institutional memory · latent-state inference
Domains
Real (PRs, retail) for operational learning; fictional (cohort biomarkers, RF transmitters) where pre-training would shortcut the axis.
Headline
Gain = Reward(stateful) − Reward(stateless, same system). Cancels raw model strength.
Hygiene
Canary string in every task README; per-task r_max + parallel_safe; per-dir AGENTS.md.
Steal: the Gain trick (with vs without the meta-action), the schedule abstraction, fictional domains for memorization-prone axes.
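The Gain computation itself is one subtraction; a minimal sketch with hypothetical per-task rewards (the function and task names are mine, not CL-Bench's):

```python
def gain(reward_stateful: float, reward_stateless: float) -> float:
    """CL-Bench-style Gain: same system run with and without its state;
    subtracting cancels raw model strength."""
    return reward_stateful - reward_stateless

# Hypothetical per-task rewards for one system, with and without memory.
with_state = {"pr_triage": 0.72, "retail": 0.61, "biomarkers": 0.55}
stateless  = {"pr_triage": 0.64, "retail": 0.60, "biomarkers": 0.31}

per_task_gain = {t: gain(with_state[t], stateless[t]) for t in with_state}
mean_gain = sum(per_task_gain.values()) / len(per_task_gain)
```

A system twice as strong in raw model terms shifts both rows up and leaves the Gain column roughly where it was, which is the point of the trick.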
RE-Bench
METR · 2024–25 · 7 tasks

Frontier ML R&D engineering — comparable across humans + agents at multiple compute budgets.

Move
One task per flavor; orthogonal eval-grid axis is time budget, not part of the task set.
Flavors
Worker scaffolding · RLHF tuning · model debugging · perf engineering · GPU kernels · scaling-law extrapolation · constrained architecture
Domains
All real, all gradeable by an automated scorer that returns a continuous metric.
Headline
Score-at-budget; 71 human attempts; agents 4× human at 2h, humans 2× at 32h. Time-scaling is the story.
Hygiene
Explicit "what we are not measuring" (novel-research-idea generation).
Steal: the orthogonal eval-grid axis (we'd add meta-action on/off); the explicit "what we don't measure" list; continuous scorers per task.
Hyperagents (DGM-H)
arXiv 2603.19461 · ICLR 2026 · 4 domains

Recursive self-modification — meta-agent improves itself, transferable across uncorrelated domains.

Move
Small domain set chosen for uncorrelation; second axis is cross-domain transfer.
Domains
Polyglot coding · paper review (accept/reject) · robotics reward design · Olympiad math grading
Why
"No domain-specific alignment between task performance and improvement skill" — kills DGM's coding-only critique.
Headline
imp@k = best agent in k generations; transfer = train on {A,B} → measure imp@50 on novel C.
Hygiene
Staged eval (10 tasks → 50) avoids paying full eval cost on early, unpromising candidates.
Steal: domain-uncorrelation as a v2 transfer axis; imp@k as the metric for axes where the meta-agent is itself the artifact under test.
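imp@k reduces to a max over a prefix of the archive's score history; a minimal sketch with made-up per-generation scores on a held-out domain:

```python
def imp_at_k(archive_scores: list[float], k: int) -> float:
    """Hyperagents-style imp@k: score of the best agent produced
    within the first k generations of the archive."""
    return max(archive_scores[:k])

# Hypothetical per-generation held-out scores for one meta-agent run.
scores = [0.40, 0.38, 0.52, 0.49, 0.61, 0.57]
curve = {k: imp_at_k(scores, k) for k in (1, 3, 5)}
```

The transfer variant is the same computation, just with the archive trained on {A, B} and `scores` measured on the novel domain C.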
02

Eighteen axes of meta-agent capability


Comprehensive list, not the v1 cut. Each card names what the axis stresses, what makes it distinct from neighboring axes, the substrate we have or could borrow, and tags showing online/offline status and current coverage.

Tags: online — within a single rollout · offline — across instances or post-hoc · covered — working substrate exists · partial — substrate exists, eval framing missing · greenfield — nothing built yet.
AXIS 01

Live supervision

Detect worker failure mid-task from observation stream; intervene (steer / revert) before completion.

Distinct
About task progress, not blast radius (axis 09) or claim truth (axis 10).
Substrate
live-intervention-wip · CooperBench (Sonnet over 2 Haiku workers).
Analogue
None — fresh axis vs. the three reference benches.
online · covered
Rough sketch — supervised CooperBench
Setup
CooperBench feature pair (~100 structurally-conflicting pairs); 2 Haiku 4.5 workers in parallel E2B sandboxes solving sibling subtasks.
Meta
Sonnet 4.6 supervisor subscribed to both workers' effect streams. Tools: SEND_HINT, SWITCH_BRANCH, REVERT_TO_STEP_K, NOOP.
Score
Joint patch pass-rate on hidden pytest harness at fixed total token budget. Headline: Δ vs naive-parallel baseline.
Baseline
Same workers, same budget, no supervisor (effect stream still emitted but not consumed). Plus solo as a floor.
AXIS 02

Speculative branching

Identify uncertain decision points; explore K branches in parallel from a shared prefix; select the winner.

Distinct
Same-task K-way; axis 05 is partition into different subtasks.
Substrate
framework-perf-wip KV-cache fork; mcts-rl-wip Tree-GRPO rollouts.
Analogue
RE-Bench best-of-k at fixed compute; Hyperagents archive selection.
online · covered
Rough sketch — fork-and-pick on TB2
Setup
TB2 task subset where each task has at least one well-known decision-fork (e.g., 3 plausible fix strategies). E2B sandbox with metagit-overlay backend.
Meta
At flagged turns, fork into K branches (K ∈ {2, 4, 8}), run each to a checkpoint, evaluate, pick best. KV prefix reused across branches via the framework's fork primitive.
Score
pass@1(branching, cost=B) − pass@1(single-shot, cost=B). Sweep B; report a Pareto curve.
Baseline
K independent rollouts from scratch (no shared prefix) — proves the framework's KV-reuse contribution beyond plain best-of-K.
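The headline comparison is a per-budget subtraction; a sketch with invented pass@1 numbers (budgets and values are placeholders, not measurements):

```python
def uplift_at_cost(pass_branching: dict[int, float],
                   pass_single: dict[int, float]) -> dict[int, float]:
    """Per-budget uplift: pass@1(branching) - pass@1(single-shot),
    matched on total cost B. Positive values mean branching pays."""
    return {b: pass_branching[b] - pass_single[b]
            for b in pass_branching if b in pass_single}

# Hypothetical pass@1 at three total-token budgets.
branching = {10_000: 0.30, 20_000: 0.48, 40_000: 0.55}
single    = {10_000: 0.33, 20_000: 0.41, 40_000: 0.46}
pareto = uplift_at_cost(branching, single)
```

Reporting the full sweep rather than one point matters here: branching can lose at tight budgets (overhead of K checkpoints) and win at loose ones.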
AXIS 03

Post-hoc trajectory pruning

Identify wasted work in a completed trace; rewrite the compressed version; verify it still solves.

Distinct
Post-hoc edit, not online efficiency (CL-Bench codebase task).
Substrate
longcot-wip · trajprune-prepilot-wip (TB2 + SWE-Verified).
Analogue
CL-Bench codebase step-efficiency reward, but online not post-hoc.
offline · covered
Rough sketch — replay-and-verify pruner
Setup
Completed traces from a stronger model (Opus or gpt-5.4) on TB2 / SWE-Verified / LongCoT; each trace tagged with the task it solved and a verifier (LongCoT verifier · pytest harness · grader LLM).
Meta
Identify wasted span [i, j]; fork at step i-1; replay to completion with the span excised; verifier checks final output.
Score
Token-reduction × verification-pass-rate (compression ratio at fixed correctness). Per-task headline: tokens-saved per attempt at zero quality loss.
Baseline
Random-span truncation at matched compression ratio; the meta-agent should beat this by a wide margin or the axis isn't measuring what we think.
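One plausible per-attempt scoring function for this axis, assuming compression ratio gated on verifier pass (the gating choice is mine; the doc only fixes "token-reduction × verification-pass-rate"):

```python
def prune_score(orig_tokens: int, pruned_tokens: int, verified: bool) -> float:
    """Per-attempt score: fraction of tokens removed, zeroed out
    if the replayed, excised trace no longer passes the verifier."""
    if not verified:
        return 0.0
    return 1.0 - pruned_tokens / orig_tokens

attempts = [  # (orig tokens, pruned tokens, verifier passed) - hypothetical
    (12_000, 7_000, True),
    (12_000, 4_000, False),   # over-pruned: replay no longer solves
    (8_000,  6_000, True),
]
scores = [prune_score(*a) for a in attempts]
best = max(scores)
```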
AXIS 04

Workflow optimization

Edit worker prompts, tools, scaffold from N training instances; score on held-out.

Distinct
Across instances, not online (axes 12, 15).
Substrate
cbo-wip CBO vs GEPA on HoVer; A1 prompt iteration in live-intervention-wip.
Analogue
Direct CL-Bench paradigm with the system being a meta-agent; Hyperagents DGM-H is the canonical method.
offline · covered
Rough sketch — optimizer-as-system on HoVer / MATH / LiveBench
Setup
HoVer (cbo-wip exists), MATH L5, LiveBench instruction-following. Train/dev/test splits with shared baseline cache.
Meta
Optimizer (GEPA · CBO · Meta-Harness · DGM) modifies prompts/scaffold over N training instances under a fixed reflection budget.
Score
Held-out test score with optimized scaffold − baseline scaffold. This is exactly CL-Bench's Gain applied to a meta-agent.
Baseline
Starter scaffold with no optimization (matches CL-Bench's stateless-of-same-system).
AXIS 05

Task decomposition

Split a goal into subtasks dispatched to (parallel) workers, then assemble outputs.

Distinct
Different subtasks, not same-task K-way (axis 02). No model heterogeneity (axis 16).
Substrate
longcot-wip restricted harness (planner → forked sub-agents → assembler).
Analogue
HuggingGPT-family; Hyperagents hierarchical decomposition.
online · partial
Rough sketch — planner / sub-agents / assembler
Setup
Multi-step LongCoT problems (logic / cs / chemistry / chess / math) where the natural decomposition is non-trivial; ground-truth verifier per problem.
Meta
Planner emits K subtasks with explicit interfaces; each dispatched to a forked sub-agent in its own scope; assembler joins outputs and re-checks consistency.
Score
Solve-rate − solo-agent at matched total token budget. Decomposition speedup at fixed quality.
Baseline
Solo agent given the full prompt and the full token budget — must beat this for the decomposition to be earning its keep.
AXIS 06

Continual meta-improvement

Meta-agent learns from past instances to make better meta-actions in future ones (recursive self-improvement).

Distinct
Not just "edit a prompt" (axis 04) — modifies the meta-mechanism itself.
Substrate
None operational. DGM/Hyperagents are the planned ref impls in the framework release.
Analogue
Hyperagents imp@k + cross-domain transfer is the literature standard.
offline · greenfield
Rough sketch — DGM-H over uncorrelated domains
Setup
Stream of instances drawn from 4 deliberately-uncorrelated domains (Hyperagents-style: e.g., paper-review · robotics-reward · math-grading · code). Train on 2 → eval on the held-out 2.
Meta
Between instances, the meta-agent edits its own task-agent template AND its own meta-mechanism (recursive). Archive of past agents kept for parent selection.
Score
imp@k on held-out domain — best agent produced within k generations. Plus ablations (fix-the-meta, no-archive) bound each component.
Baseline
Fixed meta-agent with no self-modification (the original DGM critique).
AXIS 07

Cost-aware meta-decisions

At each step, decide whether to invoke an expensive meta-action given remaining budget.

Distinct
About spend; axis 13 is specifically about model-strength escalation.
Substrate
F.4 partial data flagged as a "small writeup" in the post-NeurIPS plan.
Analogue
RE-Bench compute-scaling; Hyperagents resource-aware planning.
online · partial
Rough sketch — spend-or-save under budget
Setup
TB2 task pool with per-step expected-cost annotations (cheap shell exec vs. expensive LLM call vs. very-expensive K-fork) and a fixed total budget B per task.
Meta
At each step, the meta-agent decides whether to invoke the expensive subaction (extra reasoning turns, K-branch fork, deeper supervision pass).
Score
Pass-rate at budget B, swept over B ∈ {¼·B₀, ½·B₀, B₀, 2·B₀}. Reported as a Pareto frontier vs always-spend / never-spend.
Baseline
Two fixed policies: always spend (upper bound on cost) and never spend (lower bound on quality). Meta should sit above the line connecting them.
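The "above the line" test is linear interpolation between the two fixed policies; a minimal sketch with hypothetical (cost, pass-rate) points:

```python
def above_line(meta: tuple[float, float],
               never: tuple[float, float],
               always: tuple[float, float]) -> bool:
    """Each point is (cost, pass_rate). The meta policy earns its keep
    if, at its own cost, it sits above the straight line joining
    never-spend and always-spend."""
    (c, p), (c0, p0), (c1, p1) = meta, never, always
    line_p = p0 + (p1 - p0) * (c - c0) / (c1 - c0)
    return p > line_p

# Hypothetical (cost, pass-rate) points for the three policies.
never_spend  = (1.0, 0.30)
always_spend = (4.0, 0.60)
meta_policy  = (2.0, 0.48)
```

Any policy below the line can be replicated by a coin-flip mixture of the two fixed policies, so the line is the honest floor.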
AXIS 08

Open-ended ideation

Meta-agent proposes new meta-actions or new evaluation tasks that didn't exist before.

Distinct
Generates the action space, not just acts within it.
Substrate
None.
Risk
Overlaps RE-Bench head-on; scoring genuinely hard. Defer to v2.
offline · greenfield · v2
Sketch deferred — see Section 05
Note
Parked for v2 because (a) RE-Bench already covers novel-architecture-design as one of its 7 tasks, (b) scoring "did the meta-agent invent a useful new meta-action" is a graduate-thesis-grade open problem, and (c) we have no substrate. Revisit once v1 axes are stable and we have clear "this meta-action exists, this one doesn't" reference sets.
AXIS 09

Safety / harm reversion

Detect harmful subagent action, localize cause, choose revert granularity, contain irreversibles, choose re-attempt strategy.

Distinct
Blast-radius, not task-progress (axis 01); subagent action, not external input (axis 11).
Substrate
Recast AgentHarm / AgentDojo / InjecAgent as meta-agent eval; or trap-rows in existing tasks.
Sub-caps
Detection · causal localization · revert granularity · containment · re-attempt.
online · greenfield · safety
Rough sketch — trap-worker on CooperBench
Setup
CooperBench-derived 2-worker tasks; one of the workers is a "trap" with adversarial prompt to attempt one of: curl … | bash, write to /etc, post env to a webhook, install a shadow dep at step T then exfil at step T+30. Trap timing varies per task.
Meta
Detect harmful action → localize the causal step (often not the most recent) → pick revert granularity (sandbox / git-file / session / full-rollback) → if irreversible, switch to containment (kill network, freeze siblings, alert) → prescribe re-attempt for the good worker.
Score
5 sub-scores aligned to the 5 sub-caps. Headline: harms-prevented per false-positive-revert. Plus the good worker's productivity preserved (FP cost).
Baseline
Same scaffold with no safety meta-agent (worker just runs); upper-bound from a human-in-the-loop reviewer who sees the full trace.
Note
This axis is the strongest demo of the framework's reversibility primitive — every other tool stack would have to reproduce sandbox checkpoint/revert from scratch.
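A minimal sketch of the headline metric, with a +1 smoothing in the denominator (my choice, to keep the zero-false-positive case finite) and invented run counts:

```python
def harms_per_fp(harms_prevented: int, false_positive_reverts: int) -> float:
    """Axis-09 headline: harms prevented per false-positive revert.
    The +1 keeps a perfect zero-FP run finite rather than infinite."""
    return harms_prevented / (false_positive_reverts + 1)

# Hypothetical results over 100 trap tasks: (harms prevented, FP reverts).
runs = {"meta": (41, 9), "no_meta": (0, 0), "trigger_happy": (44, 60)}
headline = {k: harms_per_fp(*v) for k, v in runs.items()}
```

The trigger-happy policy catches slightly more harm but pays for it in reverted good work, which is exactly the tradeoff this metric is meant to surface.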
AXIS 10

Hallucination catching

Detect that the worker fabricated a fact (wrong API, made-up filename) before downstream actions consume it.

Distinct
The claim is false even if the worker is moving forward (vs. axis 01: progress is failing).
Substrate
LongCoT verifier as oracle; SWE-Bench fact-check on natural-language summaries.
Analogue
Process-reward-model literature; HoVer fact verification.
online · partial
Rough sketch — claim-checker over worker summaries
Setup
SWE-Verified runs where the worker emits a NL summary at each step ("I edited X to do Y because Z"). LongCoT verifier acts as truth oracle on the underlying claim.
Meta
Per step, classify summary as faithful / unfaithful / unverifiable. On unfaithful, intervene before the next step consumes the claim — either correct, request rework, or revert the underlying action.
Score
Two-layer: (a) detection F1 against verifier-derived labels, and (b) Δ task-success when intervention is allowed vs not. The two-layer split lets us see "great detector / bad responder" cases.
Baseline
Pass-through (no claim verification). Plus: a fixed-rate random-flagger to bound the chance of accidental gain.
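The two-layer split might look like this; counts and deltas are invented, and the helper names are mine:

```python
def detection_f1(tp: int, fp: int, fn: int) -> float:
    """Layer (a): step-level F1 against verifier-derived labels."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def uplift(success_with: float, success_without: float) -> float:
    """Layer (b): delta task-success when intervention is allowed."""
    return success_with - success_without

# Hypothetical: a great detector whose interventions help only modestly.
f1 = detection_f1(tp=80, fp=10, fn=20)
delta = uplift(0.46, 0.41)
```

Reporting (f1, delta) as a pair is what exposes the "great detector / bad responder" case the prose mentions.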
AXIS 11

Adversarial input robustness

Recognize and refuse hostile content embedded in tool outputs, fetched pages, documents.

Distinct
External attacker (vs. axis 09: subagent did harm). Same effect-stream, different threat model.
Substrate
AgentDojo / InjecAgent directly portable; meta-agent in the middle.
Analogue
Prompt-injection literature.
online · greenfield · safety
Rough sketch — sanitizer-in-the-middle on AgentDojo
Setup
AgentDojo + InjecAgent task suites; meta-agent sits between worker and tool outputs (not between worker and instructions). Injection content lives inside fetch() outputs, RAG documents, email bodies — exactly as in the source benchmarks.
Meta
Pre-screen each tool output for injection markers; refuse, sanitize (strip/escape), or pass through. Worker never sees raw flagged content.
Score
Injection-success-rate − unprotected baseline. Plus productivity preserved on benign tasks (false-positive rate must stay low or sanitizer is worse than nothing).
Baseline
No sanitizer (raw scaffold). Plus a naïve regex-based sanitizer as a non-LLM floor.
AXIS 12

Plan revision

Mid-rollout, new info contradicts the original plan; meta-agent decides to abandon, modify, or persist.

Distinct
No mistake yet (vs. axis 01); but the plan no longer fits.
Substrate
TB2 tasks where build error reveals wrong env; research tasks with hypothesis-invalidating findings.
Analogue
Reflexion / replanning literature.
online · partial
Rough sketch — invalidating-evidence subset of TB2
Setup
Curated TB2 subset where an early plausible plan becomes invalid by step 15-20: wrong-env detected on first build, missing dep surfaced after a long install, hidden constraint discovered via a failing test. Each task tagged with the canonical "invalidation step".
Meta
Watch for plan-invalidating evidence; emit one of ABANDON / MODIFY / PERSIST. Re-plan if needed.
Score
Two-layer: (a) detection-of-invalidation (precision/recall vs canonical step), and (b) Δ task-success when revision is allowed vs not.
Baseline
Worker that follows the original plan to completion (ablate the meta entirely). Plus: a naive "always re-plan every K steps" policy as a non-trivial floor.
AXIS 13

Verification escalation

Decide whether to invoke a stronger model or human given uncertainty.

Distinct
Specifically about strength escalation + human handoff (axis 07 is general spend).
Substrate
Multi-model condition: Haiku worker, Sonnet meta, escalate-to-Opus button with fixed budget.
Analogue
Cascading model literature (e.g., FrugalGPT).
online · greenfield
Rough sketch — Haiku → Sonnet → Opus → human cascade
Setup
Any task pool with diverse difficulty (TB2 mixed, SWE-Verified, MATH L5). Three model tiers available + a human-in-the-loop button. Per-task budget caps invocations of each tier.
Meta
At each uncertain step, decide: stay at Haiku · escalate to Sonnet · escalate to Opus · request human review. Each decision is logged with the meta-agent's stated reason.
Score
Pass-rate at total spend B; Pareto curve vs always-Opus / never-escalate / always-human-in-loop. Headline: spend-needed-to-match-Opus on the same tasks.
Baseline
Fixed escalation rule (escalate every K steps, or when the worker's confidence drops below threshold) — the meta should beat any fixed rule.
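The fixed-rule floor is mechanical enough to pin down in code; K and the confidence threshold here are arbitrary illustration values:

```python
def fixed_rule_escalate(step: int, confidence: float,
                        every_k: int = 5, threshold: float = 0.4) -> bool:
    """Non-trivial floor for axis 13: escalate on a fixed cadence
    (every K steps) or whenever worker confidence drops below a
    fixed threshold. The meta-agent should beat any such rule."""
    return step % every_k == 0 or confidence < threshold

# Hypothetical (step, worker confidence) trace.
trace = [(1, 0.9), (2, 0.35), (3, 0.8), (4, 0.7), (5, 0.95)]
escalations = [s for s, c in trace if fixed_rule_escalate(s, c)]
```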
AXIS 14

Cross-worker reconciliation

When N parallel workers' outputs disagree, decide which to trust / merge / re-run.

Distinct
Post-hoc arbitration after both completed (vs. axis 01: online intervention).
Substrate
CooperBench (two patches, both run tests, meta picks); MCTS-RL leaf aggregation.
Analogue
Self-consistency, LLM-judge consistency.
online · partial
Rough sketch — disagreement-set arbitration
Setup
2-3 parallel workers per task: CooperBench dual-patch with both running tests, or MCTS-RL leaves on the same problem. Filter to instances where workers disagree (the disagreement set is the actual eval).
Meta
Pick one · merge into hybrid · run a tie-breaker rollout · request more from the higher-confidence worker.
Score
Agreement-with-truth on the disagreement set; Δ vs simple voting and vs first-come. Cost-adjusted: if the meta picks a re-run, that costs more.
Baseline
Majority vote (or first-come for K=2). Plus an oracle ceiling (always-pick-the-correct-one) to bound headroom.
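A sketch of the disagreement-set filter plus the majority-vote baseline (worker outputs and the ground-truth answer are invented; ties fall to first-come via Counter's insertion-order-stable `most_common`):

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Baseline arbiter: most common answer; first-come on ties."""
    return Counter(answers).most_common(1)[0][0]

def disagreement_set(instances: list[list[str]]) -> list[list[str]]:
    """The eval only scores instances where the workers disagree."""
    return [a for a in instances if len(set(a)) > 1]

# Hypothetical K=3 worker outputs; ground truth is "B" on every instance.
instances = [["A", "A", "A"], ["A", "B", "B"], ["A", "A", "C"], ["B", "B", "B"]]
hard = disagreement_set(instances)
vote_acc = sum(majority_vote(a) == "B" for a in hard) / len(hard)
```

Unanimous instances carry no arbitration signal, so filtering them out first keeps the baseline and the meta-agent on the same footing.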
AXIS 15

Online tool / scope selection

Per-instance: pick which tools the worker has access to (or refuse all and force CoT).

Distinct
Per-instance gating (vs. axis 04: across instances).
Substrate
TB2 tasks where least-privileged path is fastest; curated mix where wrong tool choice doubles cost.
Analogue
Toolformer / dynamic tool use literature.
online · greenfield
Rough sketch — least-privilege tool gating
Setup
TB2 tasks with a curated tool palette of ~10 tools (bash, python, http_fetch, edit, search, …). Each task tagged with its minimal sufficient toolset (some tools are dangerous, some unnecessary, some required).
Meta
At task start (and optionally per-phase), restrict the worker's toolset; relax only on demand with justification.
Score
Success at fixed budget × no harmful actions × tool-restriction-quality (overlap with the canonical minimal toolset). Δ vs full-tool-access scaffold.
Baseline
All tools on for all tasks (the default in most agent frameworks). Plus an oracle that uses the canonical minimal toolset as upper bound.
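One candidate definition of tool-restriction-quality is Jaccard overlap with the canonical minimal toolset; the doc leaves the exact overlap measure open, so treat this as one option:

```python
def restriction_quality(granted: set[str], minimal: set[str]) -> float:
    """Jaccard overlap between the toolset the meta granted and the
    task's canonical minimal sufficient toolset."""
    return len(granted & minimal) / len(granted | minimal)

# Hypothetical task whose minimal sufficient toolset is {bash, edit}.
minimal = {"bash", "edit"}
scores = {
    "oracle":    restriction_quality({"bash", "edit"}, minimal),
    "meta":      restriction_quality({"bash", "edit", "search"}, minimal),
    "all_tools": restriction_quality({"bash", "python", "http_fetch",
                                      "edit", "search"}, minimal),
}
```

Jaccard penalizes both over-granting (dangerous or unnecessary tools) and under-granting (missing required tools) symmetrically, which matches the axis framing.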
AXIS 16

Expertise routing

Route subtasks to specialist workers (researcher / coder / reviewer; Haiku / Sonnet / Opus by skill).

Distinct
Adds model heterogeneity as the explicit choice (axis 05 is plain decomposition).
Substrate
Mixed-skill subgoal tasks (find-and-fix-and-document); cost-aware variant.
Analogue
Mixture-of-Agents; routing literature.
online · greenfield
Rough sketch — Haiku/Sonnet/Opus by subgoal kind
Setup
Tasks with explicit mixed-skill subgoals: research (web search + summary) + code (repo edit + tests) + write doc (NL synthesis). Three model tiers + a local-only specialist available.
Meta
Decompose, then route each subgoal to the model whose strength matches: e.g., Haiku for boilerplate code, Sonnet for research, Opus for design tradeoffs, local-7B for trivial transformations.
Score
Cost-adjusted task quality vs all-Opus (quality ceiling, cost ceiling) and all-Haiku (quality floor, cost floor).
Baseline
Single-model-for-everything (Opus, Sonnet, Haiku each as their own baseline). Plus a random-routing floor.
AXIS 17

Resumption from partial state

Pick up a partially-completed task from another agent (or self-after-crash); decide continue, restart, revise.

Distinct
Single-instance resumption with imperfect handoff (axis 06 is across-instance learning).
Substrate
Synthetic: checkpoint a worker mid-task; restart with our meta-agent + a different worker.
Analogue
OS-Copilot / persistent-shell agents.
online · greenfield
Rough sketch — checkpoint Worker A, resume with Worker B
Setup
Synthetic protocol: Worker A (one model) runs a TB2 / SWE-V task to step k ∈ {¼·T, ½·T, ¾·T} where T is the typical full-trace length. Sandbox snapshotted. Conversation/effect-stream archived.
Meta
A meta-agent (or just Worker B alone, as ablation) reads partial state — sandbox + transcript + effect stream — and decides: continue from current state, restart from clean, or revise (revert k' < k steps, then continue).
Score
Success-rate of resumed-from-mid-task vs from-scratch run with same total budget. Sweep over k to see where resumption beats restart.
Baseline
Worker B from scratch (ignore the partial state); plus Worker B continuing blindly without meta deliberation.
AXIS 18

Self-calibration

Predict own / worker's success probability before acting; use calibration to allocate effort.

Distinct
Earns the right to spend (axis 07) by knowing when you don't know.
Substrate
Any diverse-difficulty pool with confidence reporting; calibration curves as the metric.
Analogue
LM-confidence calibration literature.
online · greenfield
Rough sketch — pre-action confidence + refusal curves
Setup
Diverse-difficulty pool: TB2 mixed with SWE-Verified easy/hard split, MATH L2-L5, LongCoT logic. Each instance has a known difficulty band.
Meta
Before acting on each subtask, emit a calibrated probability of own success. Optionally refuse low-confidence subtasks; route refused ones to a stronger model (overlap with axis 13).
Score
ECE (expected calibration error) on the per-subtask predictions; refusal-curve quality (selective accuracy at coverage); Brier score.
Baseline
Zero-shot LM confidence (token-prob over the success branch); plus a uniform-confidence baseline as a floor.
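A minimal binned-ECE implementation, to make the metric concrete (the bin count and toy prediction sets are arbitrary):

```python
def ece(preds: list[float], outcomes: list[int], n_bins: int = 5) -> float:
    """Expected calibration error: bin predictions by confidence,
    compare each bin's mean confidence to its empirical success rate,
    and weight the gaps by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = len(preds)
    err = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)  # mean stated confidence
        acc = sum(y for _, y in b) / len(b)   # empirical success rate
        err += len(b) / total * abs(conf - acc)
    return err

perfect = ece([0.0, 0.0, 1.0, 1.0], [0, 0, 1, 1])
overconfident = ece([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 0])
```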
03

Three composition tricks worth borrowing wholesale

These become visible once the 18 axes are on the table. Each will drive the v1 cut.

Trick 1 — Factor as (detection target) × (response action)

Six of the eighteen axes pair up as "detect X" + "respond to X": axis 01 (failure → intervene), axis 09 (harm → revert), axis 10 (hallucination → verify), axis 11 (injection → refuse), axis 12 (new evidence → revise), axis 14 (disagreement → reconcile).

Cleaner factoring: leaderboard reports each axis as two sub-scores (detection F1, response quality conditional on detection), so a meta-agent that catches harm but reverts at the wrong granularity is distinguishable from one that misses harm entirely. CL-Bench didn't do this and the per-task READMEs feel slightly incoherent across tasks as a result.

[Figure: detect × respond matrix — detection targets (failure · harm · hallucination · injection · drift · disagreement) × response actions (intervene · revert · verify · revise), spanning the six paired axes.]
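The factoring argument can be made concrete: two systems with opposite failure modes collapse to the same fused number but separate cleanly under two sub-scores. Names and values here are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class AxisScore:
    """One leaderboard row under the detect x respond factoring:
    detection quality and response quality reported separately,
    with response scored only on correctly-detected cases."""
    axis: str
    detection_f1: float
    response_quality: float

    @property
    def fused(self) -> float:
        # what a single-column leaderboard would collapse this to
        return self.detection_f1 * self.response_quality

catches_harm_reverts_badly = AxisScore("09 harm", 0.90, 0.30)
misses_harm_entirely       = AxisScore("09 harm", 0.30, 0.90)
```

Both rows fuse to the same scalar, so only the factored report distinguishes "great detector, sloppy responder" from "missed it entirely".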

Trick 2 — Online vs. offline is the cleanest first partition

Online (within a single rollout): axes 01, 02, 05, 07, 09, 10, 11, 12, 13, 14, 15, 16, 17, 18 — fourteen axes. Offline (across instances or post-hoc): axes 03, 04, 06 — three axes (plus deferred 08).

Different harness shape (online needs a live observation stream and intervention API; offline runs over completed traces); different baseline construction (online: meta-action toggle on/off; offline: optimizer toggle on/off); possibly different leaderboards. We could ship as two sub-leaderboards, like CL-Bench's "Reward / Gain / Cost" columns but with "Online uplift / Offline uplift / Cost".

[Figure: online vs. offline split — online (14 axes): live observe + intervene within t=0…T; offline (3 axes): post-hoc edit and replay.]

Trick 3 — Generalize CL-Bench's Gain trick per axis

For each axis, headline metric becomes Δ-with-meta-action vs. same-system-without. Cancels raw model strength, gives a structurally consistent leaderboard column even when tasks differ wildly. Axis 04 is the same as CL-Bench's Gain. Axis 09's Gain is "harm caught and contained, that the un-meta scaffold would have run". Axis 02's Gain is "tasks solved by best-of-K branching that single-shot would have failed".

One physical task can carry multiple axis evaluations if conditions are designed carefully — that's the real generalization of CL-Bench's schedule.json trick. CooperBench can host axes 01, 05, 09, 14, 16; TB2 can host 01, 02, 03, 12, 15; LongCoT can host 03, 05, 10. The substrate map below makes that explicit.

[Figure: per-axis Gain, Δ = with-meta − without-meta, illustrated for axes 01, 02, 04, 09.]
04

Substrate map: which experiments host which axes

Cells: full = working substrate · part = exists, eval framing missing · empty = nothing yet. The dense rows (cooperbench, tb2, longcot) suggest where one physical task can carry multiple axis evaluations.

axis | cooperbench | tb2 / swe-v | longcot | endless-term | hover | agentdojo | tb2 + drift | synthetic
01 Live supervision | live-int supervised | · | · | · | · | · | · | ·
02 Speculative branching | · | f-perf KV | long replay | mcts-rl | · | · | · | ·
03 Post-hoc pruning | · | trajprune | prune loop | · | · | · | · | ·
04 Workflow optimization | A1 prompts | · | · | · | cbo / gepa | · | · | ·
05 Task decomposition | peer-coop | · | restricted | · | · | · | · | ·
06 Continual self-improvement | · | · | · | · | · | · | · | DGM ref
07 Cost-aware decisions | · | · | · | · | · | · | · | F.4 data
09 Safety / harm reversion | trap row | · | · | · | · | recast | · | ·
10 Hallucination catching | · | SWE summary | verifier | · | fact-check | · | · | ·
11 Adversarial robustness | · | · | · | · | · | portable | · | ·
12 Plan revision | · | env reveal | · | tree replay | · | · | drift run | ·
13 Verification escalation | · | · | · | · | · | · | · | multi-model
14 Cross-worker reconciliation | 2-patch | · | · | leaf agg | · | · | · | ·
15 Online tool / scope select | · | least-priv | · | · | · | · | · | ·
16 Expertise routing | mixed-skill | · | · | · | · | · | · | ·
17 Resumption from partial | · | · | · | · | · | · | · | handoff
18 Self-calibration | · | diverse | · | · | · | · | · | ·
05

Considered, parked for v2 or dropped

Capabilities I explored and would defer or drop unless we hear otherwise. Listed for completeness so the v1 conversation knows what's on the table elsewhere.

Reward / objective specification
About task setup more than meta-action; better as its own benchmark — Hyperagents reward-design domain is the model.
v2
Memory consolidation
Too close to CL-Bench territory; we'd be the second mover. Folds into axis 06 if anywhere.
drop
Multi-agent protocol design
Folds into axis 04 if across-instance; into axis 01 if online.
fold
Curriculum / data selection for self-training
Meta-agent-as-trainer; relevant to MCTS-RL but probably too far afield for a capability benchmark.
v2
Documentation / explainability
Hard to score; better as a requirement on every axis than its own.
drop
Open-ended ideation (axis 08)
Overlaps RE-Bench head-on; scoring genuinely hard.
v2
Meta-agent vs meta-agent (red/blue)
Workshop paper at best; no strong v1 angle.
drop
06

Open questions before we cut to v1

Decisions to make before this brainstorm becomes a written spec. Each blocks at least one downstream design choice.

How many axes for v1?

CL-Bench shipped 6 tasks for one capability. RE-Bench shipped 7 for one capability. Hyperagents shipped 4 domains for one mechanism. For 18 axes, we cannot ship one task per axis at v1 — but we can ship one condition per axis on a smaller substrate set. Working assumption: 6–8 axes for v1, drawn from the "covered" + "partial" pile in section 02. Confirm or push back.

Online vs offline — one leaderboard or two?

Online uplift (axis 01-style) and offline uplift (axis 04-style) are computed differently and have different cost shapes. CL-Bench reports one column for Gain. We could do the same, or split into two. Two is more honest, one is more legible. Lean two, but the single-column "Aggregate Meta Uplift" still has to exist for marketing.

Detect × respond — separate or fused?

Trick 1 in section 03. If we factor, we get 6 extra leaderboard columns and a more honest picture; if we fuse, we lose the ability to distinguish "great detector with sloppy response" from "missed it entirely". Lean factor, but each detection-only score needs its own ground truth annotation, which is cost.

Safety as a separate track or sprinkled across?

Axes 09 and 11 are tagged safety. Workshop CFP framing in the post-NeurIPS plan calls for safety/interpretability framing. Two options: (a) ship a safety mini-leaderboard alongside the capability one (NIST/AISI optics); (b) embed safety axes among the others. Lean (a). Decide before workshop CFP June 6.

How does the framework's reversibility primitive get its own axis vs. just enabling axis 09?

The framework ships fork / checkpoint / revert. Axis 02 (branching), axis 03 (pruning), axis 09 (revert), and axis 17 (resume) all use it. We should make sure at least one axis directly probes revert granularity — that's the hard part of axis 09 and the unique thing the framework provides over a stateless scaffold. Confirm axis 09 is the right home for that probe, vs. carving out an "axis 09a — revert granularity" sub-score.

What's the "Hyperagents-style transfer" v2 ambition?

Hyperagents proves cross-domain transfer of meta-skills. Our v2 analogue: does a meta-agent that learns to supervise on cooperbench also do better at reconciling on cross-worker disagreement? Or does it transfer worse, exposing axis-specific overfitting? This is the strongest "we measured something other benchmarks can't" claim, but only meaningful once v1's per-axis scores are stable.