Asapi

rl-gym benchmark · Data Forensics

Real work is messy.
Can an agent handle it?

Every decision rests on quiet assumptions we never notice. Can an agent navigate them — reading the data as it is, hearing what's really being asked, and answering the way a human would?

How Data Forensics works

Each scenario drops an agent into a simulated business day: a stakeholder with no technical or database knowledge asks 20 business questions in plain language, the way a real manager would, over a dirty, multi-table dataset. The agent gets only a simple schema and no clean columns; some answers live in the data, some don't. The agent has to do the rest.

Step 1

Ask

20 everyday business questions — no SQL, no column names, no guarantee the answer is even in the data.

Example

"Which hauler moved the most tonnage in Q3 2023?"

Step 2

Clean & join

The agent inspects the tables, repairs quality issues, and merges across them to recover the truth.

Example

Station codes appear as ST-01 and st_1 — normalise, then join manifests → stations.
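The normalise-then-join step in this example can be sketched in a few lines of Python. This is only an illustration, not the benchmark's own code: the canonical `ST-NN` form, the table contents, and the names `normalise_station_code` and `join_manifests` are all assumptions made for the example.

```python
import re


def normalise_station_code(raw: str) -> str:
    """Map variants such as 'ST-01' and 'st_1' to one canonical form, 'ST-01'."""
    match = re.fullmatch(r"\s*st[\s_\-]*0*(\d+)\s*", raw, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognised station code: {raw!r}")
    return f"ST-{int(match.group(1)):02d}"


def join_manifests(manifests, stations):
    """Attach the station name to each manifest row after normalising its code."""
    joined = []
    for row in manifests:
        code = normalise_station_code(row["station"])
        # .get() keeps unmatched rows visible (name is None) instead of dropping them.
        joined.append({**row, "station": code, "station_name": stations.get(code)})
    return joined


# Hypothetical data, invented for this sketch.
stations = {"ST-01": "Northgate", "ST-02": "Riverside"}
manifests = [
    {"station": "ST-01", "tonnes": 12.5},
    {"station": "st_1", "tonnes": 7.5},
    {"station": "st_2", "tonnes": 4.0},
]

joined = join_manifests(manifests, stations)
tonnes_by_station = {}
for row in joined:
    tonnes_by_station[row["station"]] = tonnes_by_station.get(row["station"], 0.0) + row["tonnes"]

print(tonnes_by_station)
```

Without the normalisation step, `ST-01` and `st_1` would be counted as two different stations and the tonnage would be split between them.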

Step 3

Grade vs gold

Each answer is scored by category against a gold run that knows every obstacle.

Example

Answer 142.5 t vs gold 142.5 t → excellent. Off by 30% → naive.
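One way to picture this grading step for numeric answers is a relative-error check against the gold value. The sketch below is an assumption throughout: the page names the grades excellent, good, and naive, but the 1% and 10% cut-offs and the function name `grade` are invented here, and the real grader may use different rules.

```python
def grade(answer: float, gold: float,
          excellent_tol: float = 0.01, good_tol: float = 0.10) -> str:
    """Grade a numeric answer by its relative error against the gold value.

    The tolerances are illustrative assumptions, not the benchmark's thresholds.
    """
    if gold == 0:
        # Relative error is undefined for a zero gold value; only an exact match passes.
        rel_error = 0.0 if answer == 0 else float("inf")
    else:
        rel_error = abs(answer - gold) / abs(gold)

    if rel_error <= excellent_tol:
        return "excellent"
    if rel_error <= good_tol:
        return "good"
    return "naive"


exact = grade(142.5, 142.5)            # matches gold exactly
off_by_30 = grade(142.5 * 0.7, 142.5)  # 30% below gold
print(exact, off_by_30)
```

Under these assumed cut-offs the exact match grades as excellent and the answer that is 30% off grades as naive, in line with the example above.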

Step 4

Aggregate

Category scores roll up into one aggregate score, making models directly comparable.

Example

9 excellent · 6 good · 5 naive → 63% aggregate.
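The roll-up can be read as a weighted average of grade counts. The weights are not given on this page; the values below (excellent 1.0, good 0.6, naive 0.0) are an assumption chosen because they reproduce the 63% in the example, and other weightings could round to the same figure.

```python
# Assumed weights, picked only because they reproduce the example above.
ASSUMED_WEIGHTS = {"excellent": 1.0, "good": 0.6, "naive": 0.0}


def aggregate_score(counts: dict, weights: dict = ASSUMED_WEIGHTS) -> float:
    """Return the weighted share of credit earned, as a percentage."""
    total_questions = sum(counts.values())
    if total_questions == 0:
        raise ValueError("no graded answers to aggregate")
    earned = sum(weights[grade] * n for grade, n in counts.items())
    return 100.0 * earned / total_questions


score = aggregate_score({"excellent": 9, "good": 6, "naive": 5})
print(f"{score:.0f}% aggregate")
```

With 20 questions, that is (9 × 1.0 + 6 × 0.6 + 5 × 0.0) / 20 = 12.6 / 20, or 63%.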

The Leaderboard

Each provider's frontier flagship takes on twelve data-quality scenarios. Avg Agg % is the mean of a model's aggregate score across the scenarios in which it produced a valid submission; Scenarios shows how many of the suite it cleared. Where a model is missing submissions, its average covers only the ones it completed and is flagged with an asterisk.

Rank Provider Model Route Avg Agg % Avg Wall (mm:ss) Avg Cost ($) Scenarios

* Averaged over fewer than the full suite — only scenarios with a valid submission.
Qwen3-Coder-480B is one version behind Alibaba's current flagship (Qwen 3.6, not attempted — no first-party key); rows carry the best-available Bedrock entry.
Amazon's flagship Nova Premier produced no submission; rows carry the mid-tier Nova Pro.
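The averaging rule for Avg Agg % and the asterisk can be sketched as follows. The function name `leaderboard_row`, the output layout, and the sample scores are hypothetical; only the rule itself (mean over valid submissions, flagged when fewer than the full suite) comes from the description above.

```python
from statistics import mean
from typing import Optional


def leaderboard_row(scores: dict[str, Optional[float]], suite_size: int) -> Optional[dict]:
    """Average a model's aggregate scores over scenarios with a valid submission.

    A score of None stands for a scenario with no valid submission.
    """
    valid = [s for s in scores.values() if s is not None]
    if not valid:
        return None  # nothing to average, so no row
    partial = len(valid) < suite_size
    return {
        "avg_agg": round(mean(valid), 2),
        # Asterisk marks an average taken over fewer than the full suite.
        "scenarios": f"{len(valid)}/{suite_size}" + ("*" if partial else ""),
    }


# Hypothetical three-scenario suite with one missing submission.
row = leaderboard_row({"scenario_a": 50.0, "scenario_b": None, "scenario_c": 40.0}, suite_size=3)
print(row)
```

Because missing scenarios are skipped rather than scored as zero, an asterisked average is not directly comparable with one taken over the full suite.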

Native-harness reference

The same models run through their own agent harness rather than a bare API call, averaged across the scenarios where they ran and shown as reference points outside the cohort ranking. Over twelve scenarios, whether the native harness helps or hurts depends on both the scenario and the model: Fable 5 in Claude Code leads this table at 61.35 and posts the single highest run in several scenarios, yet the same native loop drops Opus 4.8 below its own bare-API score on Helios and Brierdale (see findings).

Provider Model Route Avg Agg % Avg Wall (mm:ss) Avg Cost ($) Scenarios

Full per-question scoreboard, grading methodology, and analysis are in the combined technical report.

Interesting findings

A few things that surprised us across twelve scenarios and thirteen models — patterns that cut against the intuition that a bigger, pricier, or "smarter" model simply wins.

New: Fable 5

Fable 5's lead: part extra reasoning time, part genuinely sharper

Fable 5 pauses to reason before every step — about 4× longer per step — and that can't be switched off. So in the simple agent it effectively ran with thinking on while Opus 4.8 ran with thinking off, which inflates the headline +11 gap. Compare each model at its best setup and the real lead is about +9: mostly from two scenarios (Kitchen Supply, Crosswind) where Fable fixes planted data errors that Opus never does, roughly +4 everywhere else — and on one scenario (Brindholm) Opus is 20 points better.

New: reality check

The ceiling rose, but the floor is still low

Fable 5 lifts the high end — ~58% on average, and above 60 on five scenarios (Aurora Transit's 80.75 is the best single score). But everyone else still averages in the mid-40s or below, and the hardest scenario (Brivane) tops out at 38.0. On messy data, even the best agents still get many answers wrong.

Agent can flip

A model's own agent doesn't always help

Running a model inside its own full agent (Claude Code, Codex) usually raises its score — but not always. On Helios, Opus 4.8 scores 58.0 with the simple agent and only 46.25 in Claude Code, while Fable 5 goes the other way (52.75 → 63.5, the best Helios run). On Brierdale the split is extreme: Fable climbs to 74.75 while Opus drops to 40. Whether the agent helps depends on the model.
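The Helios flip is easy to see as arithmetic, taking the lift as the native-agent score minus the simple-agent score. The scores below are the ones quoted in this finding; the dictionary layout and the name `harness_lift` are assumptions made for the sketch.

```python
# Helios scores quoted in the finding above.
helios = {
    "Opus 4.8": {"simple_agent": 58.0, "claude_code": 46.25},
    "Fable 5": {"simple_agent": 52.75, "claude_code": 63.5},
}


def harness_lift(runs: dict) -> dict:
    """Return native-agent score minus simple-agent score for each model."""
    return {
        model: scores["claude_code"] - scores["simple_agent"]
        for model, scores in runs.items()
    }


lifts = harness_lift(helios)
for model, lift in lifts.items():
    # Positive means the native agent helped; negative means it hurt.
    print(f"{model}: {lift:+.2f}")
```

On Helios this gives −11.75 for Opus 4.8 and +10.75 for Fable 5: the same change of harness moves the two models in opposite directions.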

No clean sweep

No single model wins every scenario

Fable 5 wins 8 of 12 — but not all. Opus 4.8 still wins three (Helios, Helix Marine, Elo), and DeepSeek V4-Pro wins Pellichar at about 1/40th of Fable's cost. Even the strongest model loses a third of the scenarios; which model wins still depends on the scenario.

Scenario-dependent

The Opus 4.7 → 4.8 upgrade helped and hurt

Moving from Opus 4.7 to 4.8 helped on some scenarios and hurt on others — from −23 (Brierdale) to +14.75 (Aurora Transit) under Claude Code, averaging out to roughly zero. There is no single "the upgrade is better" answer; it depends on the scenario.

Cost ≠ quality

The cheapest model rides near the top

DeepSeek V4-Pro averages 42.9 at roughly $0.05–0.07 a run — level with models costing 20–30× more, and it wins one scenario outright (Pellichar). Meanwhile the priciest single run (Gemini 3.5 Flash, ~$14 on Aurora Transit) returned no answers at all.

Dive deep into Scenarios

Rank Provider Model Route Agg % Wall (mm:ss) Cost ($)

Native-harness reference

The same models run through their own production agent on this scenario rather than a bare API call — shown alongside the cohort but not interleaved into the ranking. Compare the bare-API row for the same model to see the per-scenario harness lift (positive = native helps; negative = SWE-agent beats native, which happens on some scenarios).

Provider Model Route Agg % Wall (mm:ss) Cost ($)
