Live supervision
Detect worker failure mid-task from observation stream; intervene (steer / revert) before completion.
Rough sketch — supervised CooperBench
- Setup
- CooperBench feature pair (~100 structurally-conflicting pairs); 2 Haiku 4.5 workers in parallel E2B sandboxes solving sibling subtasks.
- Meta
- Sonnet 4.6 supervisor subscribed to both workers' effect streams. Tools: SEND_HINT, SWITCH_BRANCH, REVERT_TO_STEP_K, NOOP.
- Score
- Joint patch pass-rate on hidden pytest harness at fixed total token budget. Headline: Δ vs naive-parallel baseline.
- Baseline
- Same workers, same budget, no supervisor (effect stream still emitted but not consumed). Plus a solo (single-worker) run as a floor.
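
Illustrative sketch of the supervisor loop and the headline metric, in Python. Everything here is an assumption made for illustration: the `Effect` fields, the `Supervisor` class, and the rule-based policy are hypothetical stand-ins for the Sonnet 4.6 supervisor, which would make these decisions from the raw effect stream. Only the four tool names come from the sketch above. SWITCH_BRANCH is listed but has no rule, since its semantics are not specified here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, List, Optional, Sequence


class Action(Enum):
    # Tool names taken from the sketch; their exact semantics are assumed.
    SEND_HINT = "SEND_HINT"
    SWITCH_BRANCH = "SWITCH_BRANCH"
    REVERT_TO_STEP_K = "REVERT_TO_STEP_K"
    NOOP = "NOOP"


@dataclass(frozen=True)
class Effect:
    """One observed worker step. All fields are hypothetical."""
    worker: str
    step: int
    files_touched: FrozenSet[str]
    tests_failed: bool = False


@dataclass(frozen=True)
class Decision:
    action: Action
    target: Optional[str] = None   # worker the action is aimed at
    step: Optional[int] = None     # only set for REVERT_TO_STEP_K
    note: str = ""


class Supervisor:
    """Toy rule-based stand-in for the LLM supervisor."""

    def __init__(self) -> None:
        self.history: Dict[str, List[Effect]] = {}

    def observe(self, effect: Effect) -> Decision:
        self.history.setdefault(effect.worker, []).append(effect)

        # Rule 1: steer a worker that edits a file its sibling already touched.
        for other, effects in self.history.items():
            if other == effect.worker:
                continue
            for seen in effects:
                overlap = effect.files_touched & seen.files_touched
                if overlap:
                    return Decision(
                        Action.SEND_HINT,
                        target=effect.worker,
                        note=f"overlap with {other}: {sorted(overlap)}",
                    )

        # Rule 2: on a failing step, revert to the last step that did not fail.
        if effect.tests_failed:
            good = [e.step for e in self.history[effect.worker] if not e.tests_failed]
            if good:
                return Decision(Action.REVERT_TO_STEP_K, target=effect.worker, step=max(good))

        return Decision(Action.NOOP)


def pass_rate_delta(supervised: Sequence[bool], baseline: Sequence[bool]) -> float:
    """Headline metric: supervised pass-rate minus baseline pass-rate over the same pairs."""
    if not supervised or len(supervised) != len(baseline):
        raise ValueError("need two non-empty result lists of equal length")
    return sum(supervised) / len(supervised) - sum(baseline) / len(baseline)
```

In the actual setup the `observe` rules would be replaced by a model call that receives both streams and returns one of the four tools; the rule-based version is only useful as a cheap sanity check of the plumbing before spending tokens on the supervisor.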