rl-gym: Data-Quality Track — Combined Technical Report (Twelve Scenarios, Frontier Flagship Cohort)
Publication draft, 2026-06-03.
1. Abstract
We introduce rl-gym, an agent-native benchmark that evaluates whether a language-model agent can complete an end-to-end synthetic data-science take-home: read a stakeholder brief, inspect a dirty multi-table relational dataset, recover from seeded data-quality failures, and submit a JSON file containing numeric answers to twenty business questions, each scored against a hidden canonical key by a deterministic rubric. Unlike multiple-choice knowledge benchmarks, code-execution suites, or atomic-instruction tests, rl-gym grades the answer artifact an analyst hands to a stakeholder, treating the intervening reasoning, tool use, and recovery as opaque.
This report aggregates twelve frozen data-quality scenarios — the original five (Brierdale, Brindholm, Brivane, Halberon, Pellichar) plus seven newly authored covers (Aurora Transit, Crosswind, Elo, Helios, Helix Marine, Jobpost Analytics Co., Kitchen Supply Co.) — each a distinct fictional cover over a different procedurally-reseeded public dataset, each carrying its own multi-table corpus, twenty-question pack, and seeded obstacle catalogue. We assemble a cross-scenario leaderboard of each major provider's current flagship model under one harness — SWE-agent [SWE-agent] — scoring each flagship on every scenario in which it produced a valid submission. Most providers shipped a new flagship we ran through the vendor's own first-party API (Anthropic Opus 4.8 and the prior-generation Opus 4.7, OpenAI GPT-5.5, DeepSeek V4-Pro, z.ai GLM-5.1, MiniMax M2.7, Moonshot Kimi K2.6, Google Gemini 3.5 Flash); Anthropic's newest model, Fable 5, was run on all twelve scenarios. The remaining flagships are carried at their Bedrock-hosted scores (Alibaba, Mistral, Meta, Amazon). For the three Anthropic models and for GPT-5.5 we additionally run their native production agents — Fable 5, Opus 4.8, and Opus 4.7 under Claude Code, GPT-5.5 under Codex — to isolate how much of a score is the harness rather than the model.
The headline of this report is that the sign of the Opus 4.7 → 4.8 upgrade is
scenario-dependent — and that Fable 5, added afterward, leads the suite on both
harnesses. Under SWE-agent, the Opus upgrade is a small cross-scenario win:
Opus 4.8 averages 46.67 against Opus 4.7's 44.56 over the twelve scenarios
(Δ = +2.11) and leads the rest of the cohort — but Fable 5, run on all
twelve, leads the suite outright at 57.83, topping eight of the twelve boards.
(The raw +11.2 bare-API gap over Opus 4.8 is not regime-matched — SWE-agent
passes no thinking budget, so Opus runs thinking-off while Fable's adaptive
thinking is always on; regime-matched the lead is ≈ +9 and concentrated in two
scenarios, §8.4.) Under native Claude Code at matched max effort, the two Opus
generations are essentially tied — Opus 4.8 51.27 vs Opus 4.7 51.98
(Δ = −0.71) — while Fable 5 leads the native board at 61.35. The per-scenario Opus native
deltas (4.8 − 4.7, max) span −23.0 (brierdale) to +14.75 (aurora_transit),
with sign distribution +6 positive / −5 negative / 1 flat across the twelve
scenarios. The previous
five-scenario report identified a cross-scenario native regression (−5.25) and
treated it as the suite's flagship finding; the wider eleven-scenario sample
does not support that generalisation. Brierdale (−23.0) remains the largest
single anomaly, and a controlled effort-tier comparison rules it in as orchestration-driven
(§8.1), but the six new scenarios shift the net cross-scenario native delta to
−0.82 — and on four of those six the upgrade is positive under the same native
agent (aurora_transit +14.75, helios +4.00, helix_marine +2.00, crosswind
+0.75). The honest reading is: under SWE-agent the upgrade is mildly positive
and consistent; under native Claude Code the upgrade averages near zero with
high per-scenario variance, dominated by the brierdale outlier. We therefore
report a per-scenario distribution rather than a single trend.
Three further cross-scenario findings emerge. First, the cross-scenario leaderboard is shaped by provider, not routing. Fable 5 leads at 57.83 over twelve scenarios; behind it a first-party band (Opus 4.8 46.67, Opus 4.7 44.56, DeepSeek V4-Pro 42.90, GPT-5.5 42.46, GLM-5.1 40.70) clusters inside six points, ahead of MiniMax M2.7 (29.40) and Bedrock-hosted Qwen3-Coder-480B (22.27) / Mistral Large 3 (19.96); Gemini 3.5 Flash averages 43.92 over its three valid submissions but, with so few scenarios scored, does not cover half the suite. The cluster at the top has tightened over the broader sample: the spread inside the top five drops from 6.0 points (5-scenario) to 4.5 points (11-scenario), and the order of the middle three (Opus 4.7, V4-Pro, GPT-5.5) is reshuffled. Second, no single model wins every scenario, though Fable 5 comes close: it tops eight bare-API boards (brierdale 60.5, brindholm 51.0, brivane 38.0, halberon 44.25, aurora_transit 80.75, crosswind 76.5, jobpost_analytics_co 63.25, kitchen_supply_co 74.75), Opus 4.8 tops three (helios 58.0, helix_marine 55.0, elo 62.5), and DeepSeek V4-Pro tops pellichar (52.50), so even the strongest model loses a third of the scenarios to a rival. Third, the harness-vs-native delta is itself scenario-dependent in sign. On helios, SWE-agent Opus 4.8 (58.00) out-scores its own native Claude Code run (46.25); on aurora_transit the same Opus 4.8 is 21.25 points below its native run (49.00 vs 70.25). The native-lift story is no longer "natives generally beat SWE-agent"; it is "natives usually beat SWE-agent except where the SWE-agent operating point happens to suit the scenario, of which we now have at least two clear cases (helios, crosswind)."
2. Introduction
Saturation of static benchmarks — MMLU, GSM8K, HumanEval — has pushed evaluation toward larger, harder, or more contamination-resistant variants (MMLU-Pro, GPQA, LiveCodeBench, SWE-bench). Yet a category remains under-served: the multi-question, rubric-graded, agentic deliverable that mimics what a junior analyst hands to a hiring manager. Existing benchmarks address adjacent shapes:
- Knowledge MCQ benchmarks (MMLU, MMLU-Pro, BIG-bench, HELM) score single-answer selections without any intermediate data manipulation.
- Code benchmarks (HumanEval, LiveCodeBench, SWE-bench) score executable artifacts against unit tests, treating the dataset as a side concern.
- Instruction-following benchmarks (IFEval) verify atomic format compliance, not multi-step business reasoning.
- Single-shot canonical-answer benchmarks (MATH, GPQA) test reasoning over short premises, not extended data exploration.
None tests whether an agent can recover from messy data it has never seen, deliver twenty numeric answers, and have those answers checked against hidden ground truth. rl-gym fills this gap. Its design commitments are:
- Open-book over a private synthetic corpus. The agent is given dirty CSVs and a stakeholder brief; the canonical answers, the obstacle catalogue, and the rubric are hidden. Pretraining contamination is structurally blocked by the freshly synthesised fictional cover.
- Twenty questions, no closed-form gimmes. The question pack is designed so that one groupby().agg() call does not produce the right answer for any non-trivial slot. Every question targets at least one catalogued data-quality failure mode.
- Rubric-as-grader, not human-as-grader. A deterministic per-question rubric maps the submitted numeric value to one of seven verdict tiers; the agent's reasoning trace and any explanatory text are never read by the grader.
- Persona-calibrated thresholds. Tier thresholds are anchored using six synthesised personas whose submissions are constructed by inverting the per-question grading metric. This decouples threshold tuning from obstacle authorship.
- Single-artifact deliverable. The agent's submission is a JSON file with bare submission keys q01..q20. No code is executed, no tests are run, no auxiliary files are scored.
- Frozen builds. Each scenario's corpus, question pack, rubric, and construction seed are fixed for this report.
- Agent-and-condition agnostic. The harness is silent about prompts, tool sets, or step budgets; comparable runs document their conditions, but the benchmark grades only the final JSON.
This report asks four questions across the twelve-scenario suite:
- When the cohort is restricted to current flagship models served at their vendors' first-party APIs, does the suite produce a wide model-quality gradient, or does the spread collapse?
- Which scenarios and which questions discriminate inside the top cohort, and which collapse to universal success or universal failure?
- How much of a model's score is attributable to the harness rather than the model? Native-agent baselines on every scenario isolate the harness effect.
- Is the sign of a model version upgrade (Opus 4.7 → 4.8) consistent across scenarios, or is it scenario-dependent? The previous five-scenario report identified a cross-scenario native regression and treated it as a single finding; the wider eleven-scenario sample reframes this as per-scenario heterogeneity dominated by one large negative outlier (§8.1).
3. Related Work
Knowledge-MCQ benchmarks. MMLU [MMLU] tests broad knowledge through four-way multiple choice over 57 subject areas; its successor MMLU-Pro [MMLU-Pro] moves to ten-way multiple choice over 14 subject areas. BIG-bench [BIG-bench] aggregates 200+ tasks ranging from translation to logical inference, exposing them through a programmatic API. HELM [HELM] takes a different aggregation strategy, sweeping a fixed model set across a curated task taxonomy and reporting multiple metric categories per task. All four share the property that the artifact graded is a short text selection, the dataset is fully published, and the construction pipeline yields questions designed to admit a unique correct option. rl-gym differs in three ways: the artifact is a structured JSON of twenty numeric values, the dataset is private and synthetic, and the questions are written so that no single-line query is correct.
Code and agent benchmarks. HumanEval [HumanEval] introduced functional-correctness grading via unit-test execution, formalised pass@k as the metric, and proved to be saturable through fine-tuning on similar surface forms. LiveCodeBench [LiveCodeBench] extends this with a continuously-updating problem stream to mitigate contamination, multiple task types, and difficulty bands. SWE-bench [SWE-bench] moves to repository-scale evaluation: issue text plus codebase to patch, graded by re-running the project's own test suite. rl-gym shares with these benchmarks the commitment to execution-grounded grading (the canonical answers are computed from the clean source by code, not adjudicated by humans), but the executed artifact is the canonical-answer pipeline, not the candidate's code. The candidate's code is not seen, executed, or scored.
Instruction-following and constrained-output benchmarks. IFEval [IFEval] evaluates whether an LLM follows atomic, machine-verifiable instructions. It reports a strict and a loose match. rl-gym extends this format-compliance idea: each of the twenty questions has an "expected format" (a dict keyed by an entity, a scalar, an integer count, a set of names), and submissions failing format validation are dropped to the lowest tier. But the analytic content — the numeric value — is the primary grading axis, not the format compliance.
Single-shot canonical-answer benchmarks. MATH [MATH] grades against boxed numeric or expression answers extracted via regex normalisation; GPQA [GPQA] grades multiple-choice questions written by domain experts to be objective even to motivated non-experts. Both benchmarks share with rl-gym the property that the canonical answer is unambiguous and the grader is deterministic. They differ in dataset scale (a single short premise vs. a multi-table relational corpus), in the number of answers per "instance" (one vs. twenty), and in the construction pipeline (expert-write-and-validate vs. dataset-synthesise-and-obstacle-inject).
Agent harness as a benchmark variable. A growing thread in benchmark
methodology treats the agent scaffold itself as part of the system under test
rather than as a fixed wrapper. SWE-bench's leaderboard explicitly conditions on
the harness (agentless, SWE-agent [SWE-agent], OpenHands, first-party
scaffolds). This report holds the harness constant for the leaderboard — every
ranked entrant runs through SWE-agent — so that the cross-scenario leaderboard
measures provider/model signal at the SWE-agent operating point, then varies the
harness deliberately in §8.1 to characterise the per-scenario distribution of
the within-provider generation upgrade. The eleven-scenario replication is, to
our knowledge, the strongest demonstration to date that a measured upgrade
direction under a single harness can be a scenario property rather than a
model property.
The gap rl-gym fills. No published benchmark we are aware of simultaneously requires: (i) reasoning over a multi-table dirty dataset; (ii) producing twenty numeric answers, each with its own metric; (iii) scoring via a hidden, deterministic rubric anchored by persona calibration; (iv) per-instance contamination resistance via synthetic-cover-over-real-source construction; and (v) replication of the same evaluation shape across eleven disjoint industry covers. The closest functional analog is the data-science take-home interview itself, but those are rarely reproducible or graded against published rubrics.
4. The rl-gym Suite
4.1 Concept
An rl-gym evaluation presents the candidate (a language-model agent) with: a
stakeholder brief introducing a fictional organisation; a private synthetic
relational dataset; and a list of twenty business questions, each phrased in a
different stakeholder voice. The candidate is told the data is dirty and to
expect cleanup work, but is never told which fields, rows, or relationships
have been tampered with. The candidate submits a JSON file containing one value
per question (q01..q20); a deterministic grader scores each value against a
hidden canonical answer derived by running a project-internal pipeline on the
clean source data. The aggregate score is the arithmetic mean of per-question
tier percentages.

Figure: pipeline overview. The agent's deliverable is submission.json. The lower row (clean source, per-question canonical pipeline, canonical answers) is hidden; the grader compares the agent's JSON against the canonical and emits per-question tier verdicts. The same pipeline shape is instantiated twelve times, once per scenario, over twelve different source datasets and fictional covers.
4.2 The twelve scenarios
Each scenario is a distinct fictional cover over a different procedurally-reseeded public dataset. All twelve share the twenty-question pack size, the seven-tier rubric, the persona-calibration protocol, and the single-JSON deliverable; they differ in corpus shape, obstacle count, metric mix, and the character of their difficulty (reconciliation-depth versus shape-reading).
Brierdale — a procedurally-reseeded Kaggle FMCG sales 1M-row dataset
recast as Brierdale Waste & Recovery Group, a pan-European waste-and-recovery
operator built from a 2020–2022 acquisition wave consolidating three predecessor
operators. Five tables (stations, waste_streams, haulers, calendar,
daily_manifests) carry 39 seeded obstacles. Signature: a five-table-join
sanity gate and an anniversary-date attractor on a literal license_signed_date
field.
Brindholm — a procedurally-reseeded public time-series corpus recast as Brindholm Coastal Terminals, a regional container-port operator formed in mid-2021 from seven predecessor operators with an in-flight platform migration. Nine tables carry 40 seeded obstacles. Signature: the widest shared failure floor in the suite and the largest native-vs-SWE-agent harness lift on the original five.
Brivane — a procedurally-reseeded Kaggle Recruit Restaurant Visitor Forecasting dataset recast as Brivane Clinical Operations Network, a contract research organisation that consolidated a sponsor-direct programme and an external broker network onto one platform. Seven tables carry 25 seeded obstacles, joined only through a corrupted cross-network bridge. Signature: the purest reconciliation test and the lowest absolute SWE-agent scores in the suite.
Halberon — a procedurally-reseeded Kaggle Polymarket export recast as the Halberon Meteoritical Council, a meteorite-authentication bureau formed in mid-2022 by a four-way merger and a 28-month archive-modernisation mandate. Six tables carry 30 seeded obstacles. Signature: deep status-flag and case-variant reconciliation plus an encoding-drift hazard.
Pellichar — a procedurally-reseeded Kaggle ASHRAE Great Energy Predictor III dataset recast as Pellichar Cooperage Trust, a whisky-cooperage consortium running hourly maturation-cellar telemetry across multiple jurisdictions. Seven tables carry a seeded obstacle catalogue clustering into roughly a dozen recurring patterns. Signature: several of its hardest questions fail on the output container (count-vs-share dict, list-vs-ratio dict, key cardinality) rather than on deep reconciliation, making it a specification-reading benchmark as much as a data-cleaning one.
Aurora Transit — a procedurally-reseeded Kaggle ride-bookings dataset recast as the Aurora Transit ride-hail and micro-mobility platform in the fictional coastal metro of Meridian. Sixteen candidate-facing tables (bookings, payments, fare breakdowns, ratings, cancellations, riders, drivers, vehicles, shifts, surge windows, promo codes, and reference catalogues) carry 80 seeded obstacles — the largest table count and obstacle count in the suite. Signature: a flagship-district rush-hour question pair (q11, q19) that defeats every scored SWE-agent run, and the largest native lift in the suite (Claude Code Opus 4.8 70.25 — the single highest cell anywhere).
Crosswind — a procedurally-reseeded Kaggle M5 Forecasting (Walmart) dataset recast as the Crosswind Pharmacy Network, a ten-pharmacy regional dispensing chain across three regions, over about five and a half years of daily activity. Six candidate-facing tables (pharmacy master, drug formulary, daily dispensing fact, reimbursement-rate schedule, calendar/event, weekly claim-volume rollup) carry seeded data-quality obstacles. Signature: the narrowest scored range in the suite (0.00 to 42.75) and an SWE/native parity where Opus 4.8 max, Opus 4.7 max, and GPT-5.5/Codex all land within a single point.
Helios — a procedurally-reseeded JPX Tokyo Stock Exchange Prediction
dataset recast as the Helios Power Exchange wholesale electricity market run
by the Borealis Grid Authority market-data desk. Five candidate-facing
tables (asset master, two disjoint daily clearing-price boards, quarterly
operator filings, weekly participant-flow bulletin) carry seeded obstacles.
Signature: a five-question hard core (q01, q02, q03, q13, q18 — the per-fuel /
per-board / per-tier / monthly dispatch-energy roll-ups) that lands at
worse_than_naive for every scored run, SWE-agent and native alike, driven by
a single deeply-seeded per-asset multiplicative meter-scale quantisation. The
only scenario in the suite where SWE-agent Opus 4.8 (58.00) out-scores its own
native Claude Code run (46.25).
Helix Marine — a procedurally-reseeded Kaggle Zillow Prize home-value dataset recast as the Helix Marine online brokerage for pre-owned recreational marine vessels, whose HelixValue automated-appraisal product stands in for the source dataset's home-value estimator. Eight tables (vessel master, build specs, amenities, marina dimension, annual appraisals, listing events, HelixValue estimates, reference-code lookup) carry seeded obstacles. Signature: a tight top under SWE-agent (Opus 4.8 55.00 → GPT-5.5 50.75 → Opus 4.7 48.75 → Gemini 3.5 Flash 47.00 — four models within 8 points) with a small native lift (+3.25 for Opus 4.8).
Jobpost Analytics Co. — a procedurally-reseeded Kaggle LinkedIn Job Postings dataset recast as a B2B labor-market intelligence vendor that resells job-posting data to recruitment agencies and corporate talent-acquisition teams. Twelve candidate-facing tables (employer master, postings fact, three posting-attribute junctions, salaries fact, two company-attribute tables, headcount-snapshot, three lookup tables) carry 60 seeded obstacles. Signature: the only scenario in the suite where a non-frontier provider leads the SWE-agent board (MiniMax M2.7 55.00 #1) inside a six-model leading band packed into 6.5 points, with native GPT-5.5/Codex (65.25) the single best cell.
Kitchen Supply Co. — a procedurally-reseeded Kaggle Instacart Online
Grocery dataset recast as a B2B kitchen-supply distributor that ships to
restaurants and institutional kitchens. Thirteen candidate-facing warehouse
tables (accounts, orders, order lines, two-level product catalogue, suppliers,
fulfillment centers, promotions, deliveries, inventory snapshots, price history,
reference-code lookup) carry seeded obstacles. Signature: the flattest leaderboard
in the suite (SWE-agent 3.75 to 46.50) with a ten-question naive-floor and a
five-question worse_than_naive spine (q01, q07, q14, q19, q20) shared across
every scored run; native baselines bunch tightly too (47.25 / 46.50 / 45.75 —
barely two points across a generation).
Elo — a procedurally-reseeded public road-freight dataset recast as Caldermoor Freight Exchange, a national less-than-truckload (LTL) carrier network whose records straddle a platform-migration cutover. Five tables (a shipper-account dimension plus fact feeds either side of the migration) carry 25 seeded obstacles, including irrecoverable sentinel-overwrite losses. Signature: a tight four-way top (Opus 4.8, Opus 4.7, V4-Pro, Fable 5 within 3.5 points), a per-hub roll-up (q11) that defeats every scored cell, and the closest Fable-versus-Opus match in the suite (§8.4).
4.3 Common question-design rules
Every pack is governed by four rules: (1) no closed-form-gimme — the obvious schema-driven path is wrong for every non-trivial slot; (2) coverage matrix — every seeded obstacle is consumed by at least one question, and every question hits one or more obstacles; (3) banned-token lint — candidate-facing prose is linted against four lexicons (format/rubric terms, obstacle-design terms, cleanup phrases, structure words) and must return zero; (4) three-plausible-readings audit — each question is reviewed to enumerate at least three plausible interpretations and confirm the canonical pipeline targets the one a careful analyst would pick. Each pack also enforces a segmented/set-valued discipline: at least ten of twenty slots are on segmented or set-valued metrics, preventing a single lucky scalar from dominating the aggregate.
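Rule (2) and the segmented/set-valued discipline are mechanical enough to sketch. The snippet below is a minimal illustration, not the suite's own tooling: the function names, the obstacle identifiers, and the "segmented" / "set" / "scalar" shape labels are all hypothetical.

```python
def check_coverage(question_obstacles, catalogue):
    """Return rule-(2) violations; an empty list means the matrix passes.

    question_obstacles: {question id: set of obstacle ids it consumes}
    catalogue:          set of all seeded obstacle ids (hypothetical ids)
    """
    problems = []
    consumed = set().union(*question_obstacles.values())
    # Every seeded obstacle must be consumed by at least one question.
    for obstacle in sorted(catalogue - consumed):
        problems.append(f"obstacle {obstacle} is not consumed by any question")
    # Every question must hit one or more obstacles.
    for qid, hits in sorted(question_obstacles.items()):
        if not hits:
            problems.append(f"question {qid} hits no obstacle")
    return problems


def check_segmented_discipline(metric_shape, minimum=10):
    """True if at least `minimum` slots are segmented or set-valued.

    metric_shape: {question id: "segmented" | "set" | "scalar"} (assumed labels)
    """
    count = sum(1 for shape in metric_shape.values() if shape in {"segmented", "set"})
    return count >= minimum
```

A pack would pass when check_coverage returns an empty list and check_segmented_discipline returns True.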
4.4 Submission specification
The candidate produces a single JSON file with bare submission keys q01..q20 at
solution/submission.json. Each value is a dictionary keyed by entity strings, a
scalar float, an integer, a date string, or a list of identifier strings. A
separate validator (validate_submission.py) checks structure and types only; it
never grades. Submissions that fail validation are graded as if every question
were at the worst tier. A canary string is included in each candidate brief to
support later training-data-contamination checks; obstacles are never enumerated
to the model, and each brief contains a single generic warning that the data
should be expected to be dirty.
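As an illustration of the structure-and-types check described above, here is a minimal sketch. It is not the contents of validate_submission.py, which this report does not reproduce; the function name and messages are hypothetical, and it does not model the per-question expected type.

```python
EXPECTED_KEYS = [f"q{i:02d}" for i in range(1, 21)]  # q01..q20


def structural_problems(submission):
    """Return structural problems; an empty list means the shape is acceptable.

    Checks structure and types only. It never compares values to any answer.
    """
    problems = []
    missing = [key for key in EXPECTED_KEYS if key not in submission]
    if missing:
        problems.append("missing keys: " + ", ".join(missing))
    for key in EXPECTED_KEYS:
        if key not in submission:
            continue
        value = submission[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (dict, list, int, float, str)):
            problems.append(f"{key}: unsupported value type {type(value).__name__}")
        elif isinstance(value, dict) and not all(isinstance(k, str) for k in value):
            problems.append(f"{key}: dict keys must be entity strings")
        elif isinstance(value, list) and not all(isinstance(v, str) for v in value):
            problems.append(f"{key}: list entries must be identifier strings")
    return problems
```

Extra top-level keys, such as an optional explanations field, are ignored by this sketch.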
5. Grading Methodology
5.1 Per-question metric kinds
Each of the twenty questions per scenario is graded by a single metric kind, chosen during construction and frozen before the agent runs. The closed metric set is enforced by a runtime sanity check on the rubric: any rubric entry declaring a kind outside the allowed set is rejected.
- mae_dict: per-key MAPE over an entity-keyed dict, with each entry capped at 1.0. Used for slot-level rollups.
- rel_err: symmetric MAPE on a single scalar.
- abs_err: absolute error on an integer or on the day-difference between two dates.
- name_match: F1 over a set of identifier strings, optionally averaged across a fixed set of sub-keys.
- pp_err: percentage-point error on a proportion.
- top3: weighted error over a top-ranked list (RBO over ordering plus value MAPE).
- combined: a documented composite over the above.
The twelve scenarios draw from this closed set in different proportions (§4.2); not every kind is used in every scenario.
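Four of the metric kinds can be sketched directly from their one-line descriptions. The code below is an illustration under stated assumptions, not the production grader: the symmetric-MAPE denominator (mean of absolute values), the treatment of missing keys and zero-valued truths in mae_dict, and the handling of empty sets in name_match are all assumptions.

```python
def rel_err(submitted, canonical):
    """Symmetric relative error on a scalar (assumed form: |a - c| / mean(|a|, |c|))."""
    denom = (abs(submitted) + abs(canonical)) / 2
    return 0.0 if denom == 0 else abs(submitted - canonical) / denom


def mae_dict(submitted, canonical, cap=1.0):
    """Mean per-key relative error over the canonical keys, each entry capped at `cap`."""
    errors = []
    for key, truth in canonical.items():
        if key not in submitted:
            errors.append(cap)  # assumption: a missing key takes the full cap
        elif truth == 0:
            errors.append(0.0 if submitted[key] == 0 else cap)  # assumption
        else:
            errors.append(min(cap, abs(submitted[key] - truth) / abs(truth)))
    return sum(errors) / len(errors)


def name_match(submitted, canonical):
    """F1 between two sets of identifier strings."""
    submitted, canonical = set(submitted), set(canonical)
    if not submitted and not canonical:
        return 1.0  # assumption: two empty sets agree
    true_pos = len(submitted & canonical)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(submitted)
    recall = true_pos / len(canonical)
    return 2 * precision * recall / (precision + recall)


def pp_err(submitted, canonical):
    """Percentage-point error between two proportions expressed in [0, 1]."""
    return abs(submitted - canonical) * 100
```

Note that rel_err, mae_dict, and pp_err are errors (lower is better) while name_match is a score (higher is better), so any tier classifier built on them has to know each metric's direction.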
5.2 Seven-tier classifier
Per-question scores are normalised against three anchors (naive, good, excellent) declared in the rubric for each question. The classifier produces one of seven tiers, each mapped to a percentage that contributes to the aggregate by simple arithmetic mean:
| Tier | Percentage | Roughly |
|---|---|---|
| beyond_excellent | 100 | Better than the excellent anchor by a clear margin |
| excellent | 100 | At or just below the excellent anchor |
| good_to_excellent | 70 | Between good and excellent |
| good | 50 | At or just below the good anchor |
| naive_to_good | 30 | Between naive and good |
| naive | 15 | At or just below the naive anchor |
| worse_than_naive | 0 | Worse than a naive-baseline solver |
The aggregate score is reported as a percentage and is interpreted by tier mix more than by point value.
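The aggregation step is simple enough to show in full. The sketch below uses the tier percentages listed above; the function names are hypothetical and the example verdicts are illustrative, not taken from any run in this report.

```python
from collections import Counter

# Tier-to-percentage mapping as given in the tier table.
TIER_PERCENT = {
    "beyond_excellent": 100,
    "excellent": 100,
    "good_to_excellent": 70,
    "good": 50,
    "naive_to_good": 30,
    "naive": 15,
    "worse_than_naive": 0,
}


def aggregate_score(verdicts):
    """Arithmetic mean of per-question tier percentages."""
    return sum(TIER_PERCENT[tier] for tier in verdicts.values()) / len(verdicts)


def tier_mix(verdicts):
    """Count of questions per tier, the lens recommended for reading a score."""
    return Counter(verdicts.values())


# Illustrative verdicts only: five questions in each of four tiers.
tiers = ["excellent"] * 5 + ["good"] * 5 + ["naive"] * 5 + ["worse_than_naive"] * 5
example = {f"q{i:02d}": tier for i, tier in enumerate(tiers, start=1)}
print(aggregate_score(example))  # 41.25
```

The same 41.25 could come from a very different tier mix, which is why the aggregate is read alongside the per-tier counts.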
5.3 Persona calibration
Tier anchors are not set by hand. They are derived from a six-persona stress test in which synthesised submissions are graded by the same production rubric the agent submissions go through, and the per-question grades determine the anchors. The six personas span five functional roles: a literal-clean reference (submits the canonical answers verbatim, scores 100 %); a naive baseline; a no-hints-thorough baseline; an excellent baseline (the canonical pipeline run against the dirty data); and two stress-test personas designed to fail along specific catalogued patterns. The personas are constructed by inverting the per-question grading metric so each submission lands at the targeted anchor value, decoupling threshold tuning from obstacle authorship.
Each scenario's measured band centres lie within their target bands and the rubric satisfies the reachability gates: strict numeric ordering of naive vs. good vs. excellent on all twenty questions; a naive-to-excellent spread of at least 60 percentage points; a no-hints-to-excellent spread of at least 35 percentage points; and meaningful naive-to-excellent separation on at least fourteen of twenty questions. The original five scenarios' measured centres are summarised in Appendix C; the six new scenarios pass the same gates and are recorded internally.
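The four reachability gates can be expressed as a single check. This is a sketch under assumptions: the input layout and function name are hypothetical, the spreads are taken between persona aggregate scores, and "meaningful separation" is modelled as a per-question gap of at least min_gap percentage points, a threshold the report does not specify.

```python
def reachability_gates(anchors, persona_pct, min_gap=30.0):
    """Evaluate the four reachability gates and return one boolean per gate.

    anchors:     {qid: (naive, good, excellent)} anchor values in metric space
    persona_pct: {"naive" | "no_hints" | "excellent": {qid: tier percentage}}
    min_gap:     assumed per-question threshold for "meaningful" separation
    """
    def mean(per_question):
        return sum(per_question.values()) / len(per_question)

    naive = persona_pct["naive"]
    no_hints = persona_pct["no_hints"]
    excellent = persona_pct["excellent"]

    # Strict ordering in either direction, since some metrics are errors
    # (lower is better) and others are scores (higher is better).
    ordered = all(n < g < e or n > g > e for n, g, e in anchors.values())
    separated = sum(1 for qid in excellent if excellent[qid] - naive[qid] >= min_gap)

    return {
        "strict_ordering": ordered,
        "naive_to_excellent_spread": mean(excellent) - mean(naive) >= 60,
        "no_hints_to_excellent_spread": mean(excellent) - mean(no_hints) >= 35,
        "per_question_separation": separated >= 14,
    }
```

A rubric would pass when every value in the returned dict is True.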
5.4 Canonical-answer dead-code audit
A subtle failure mode in any rubric-graded benchmark is canonical-answer dead code: the canonical pipeline contains helpers or filters that appear to apply a cleanup but in practice are no-ops on the dirty data, so a candidate who performs every standard normalisation would match the canonical bit-for-bit without recognising any seeded obstacle. rl-gym defends against this with a canonical-answer audit pass: every helper reachable from the canonical-compute entry point must materially change the dirty data, and a "dumb-dirty" baseline that applies no cleanup must match the canonical on at most two of twenty questions. A separate trap-test step then runs a comprehensive-canonicalisation persona (or a real frontier-model submission) and verifies that it scores below 40 % with at least half its questions at-or-below the naive anchor and zero bit-for-bit excellent matches. All eleven scenarios pass this audit.
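The two numeric gates in this audit reduce to a few comparisons. The sketch below is illustrative only: the function names and input shapes are hypothetical, and the check that every canonical helper materially changes the dirty data is not modelled.

```python
AT_OR_BELOW_NAIVE = {"naive", "worse_than_naive"}


def dumb_dirty_gate(bit_exact_matches, limit=2):
    """Pass if the no-cleanup baseline matches the canonical on at most `limit` questions.

    bit_exact_matches: {qid: bool}, True where the baseline equals the canonical.
    """
    return sum(bit_exact_matches.values()) <= limit


def trap_test_gate(verdicts, aggregate_pct, excellent_bit_matches):
    """Pass if the comprehensive-canonicalisation submission is properly trapped.

    verdicts:              {qid: tier name} for the trap-test submission
    aggregate_pct:         its aggregate score in percent
    excellent_bit_matches: number of bit-for-bit excellent matches
    """
    low = sum(1 for tier in verdicts.values() if tier in AT_OR_BELOW_NAIVE)
    return (
        aggregate_pct < 40                  # scores below 40 %
        and 2 * low >= len(verdicts)        # at least half at-or-below naive
        and excellent_bit_matches == 0      # no bit-for-bit excellent matches
    )
```

Both gates must pass for a scenario to clear the audit as described above.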
5.5 Closed-form-floor exclusion rule
A small number of questions per scenario turn out to be solvable by a closed-form computation that survives the dirty-data cleanup unchanged — typically because the answer is a small integer or a set whose canonical value is invariant to the seeded obstacles. These are flagged as closed-form floor because they do not discriminate inside the "completes the schema correctly" group. The per-scenario results sections identify each floor and the suite recommends tightening those questions before re-use.
5.6 Scope of grading
The submission schema includes an optional explanations field; the grader does
not consume it. The grader executes no candidate code, lints no candidate
notebook, and reads no auxiliary files. The only artifact graded is the submitted
JSON. Qualitative analysis in this report draws on the explanations and on the
agent's trajectory, but no part of the aggregate score depends on them.
6. Models and Run Setup
6.1 Provider selection, flagship sourcing, and routing
The cohort is one entry per major provider: that provider's current flagship model, evaluated under the SWE-agent harness — with one deliberate exception. Anthropic appears twice, as both the new flagship (Opus 4.8) and the prior-generation flagship (Opus 4.7), so that the generation-over-generation upgrade can be measured directly under the same harness (§8.1). Where a provider shipped a new flagship for which a first-party API key was available, we ran it through the vendor's own endpoint (Anthropic, OpenAI, DeepSeek, z.ai, MiniMax, Moonshot, Google). Where a provider's current flagship had no newer first-party release and was already evaluated on the same frozen builds through Bedrock, we carry that score forward rather than re-run it (Mistral Large 3, Meta Llama 4 Maverick, Amazon Nova Pro, and — with a caveat — Alibaba's Qwen line). Every entry, first-party or Bedrock, runs through the identical SWE-agent scaffold on the identical seed-0 build of each scenario; only the model endpoint differs.
Two consequences follow and are treated as first-class caveats:
- Mixed routing. The upper entries are first-party; the trailing entries are Bedrock-hosted. Because routing correlates with both score and model recency in this cohort, cross-band comparisons are not pure model-capability comparisons (§8.2).
- One incomplete flagship, one tier substitution. Alibaba's current flagship (Qwen 3.6) was not attempted — no first-party key — so the Qwen row carries the best-available Bedrock entry (Qwen3-Coder-480B, one version behind) with an explicit marker. Amazon's flagship-tier Nova Premier produced no submission, so the Amazon row carries the mid-tier Nova Pro.
Separately, for the Opus generation pair and for GPT-5.5 we run the native
production agent — Opus 4.8 and Opus 4.7 under Claude Code (max effort, with
an Opus 4.8 xhigh control on the original five scenarios), GPT-5.5 under Codex —
on every scenario, as a harness-sensitivity baseline (§8.3) and as the basis for
the §8.1 per-scenario upgrade-distribution analysis. These are reported alongside
but not interleaved into the SWE-agent ranking.
6.2 Harness
Every ranked entrant runs through one harness. The SWE-agent [SWE-agent] adapter
exposes str_replace_editor (file edit), shell (bash, with whitelisted
commands), and submit (end-of-run marker). Tool calls are JSON-encoded function
calls, routed via LiteLLM so that the same scaffold serves every provider. The
system prompt and the candidate-facing question pack are byte-identical across all
runs of a given scenario.
Two SWE-agent mechanics affect the results and are worth flagging. SWE-agent was
authored around SWE-bench, where the expected artifact is a unified diff; rl-gym
wants solution/submission.json instead, so the system prompt retargets the
expected output. A model that calls submit before writing the file ends the
run with nothing on disk and is graded as no submission. The tool surface is
function-call-shaped; a model that cannot emit structured tool calls cycles on
parse errors and never submits. Three minor integration adjustments (supplying
list prices for new model versions so the cost meter could proceed; caching and
restoring one provider's required reasoning-trace echo; padding empty-content
assistant turns for Moonshot's strict-non-empty constraint) touch neither the
model, the prompt, nor the grading.
The native-agent baselines (§8.1, §8.3) run under the providers' own production agents — Claude Code and Codex — which expose richer orchestration tooling (sub-agent workflows, task lists, monitors) than SWE-agent's three-tool surface. That difference is not incidental: it is the proximate cause of the brierdale anomaly that anchors the §8.1 distribution's negative tail.
6.3 Run conditions
- One trajectory per provider entry per scenario. No re-runs averaged into the aggregate.
- Identical seed-0 build per scenario — the same byte-identical injected corpus across all runs of that scenario.
- Identical candidate-facing package (stakeholder brief, schema document without obstacle traces, twenty questions, dirty data, validator).
- Per-run cost cap set to the adapter's maximum tier; models run until they call submit, hit the cost cap, or exit on repeated tool-error retries. No per-provider prompt or step override.
- Routing (mixed, by design). New-release flagships run via the vendor's first-party API; unchanged flagships are carried at their Bedrock-hosted scores on the same builds. The native-agent baselines use Claude Code (Opus 4.8 at max, plus an xhigh control on the original five; Opus 4.7 at max) and Codex (GPT-5.5 at high).
6.4 Cohort
The provider cohort, with the endpoint each was run through:
| Provider | Flagship | Model string | Route | Source |
|---|---|---|---|---|
| Anthropic | Opus 4.8 | anthropic/claude-opus-4-8 | first-party | this report |
| OpenAI | GPT-5.5 | openai/gpt-5.5 | first-party | this report |
| DeepSeek | V4-Pro | deepseek/deepseek-v4-pro | first-party | this report |
| Anthropic | Opus 4.7 | anthropic/claude-opus-4-7 | first-party | this report |
| z.ai | GLM-5.1 | zai/glm-5.1 | first-party | this report |
| Moonshot | Kimi K2.6 | moonshot/kimi-k2.6 | first-party | this report |
| MiniMax | M2.7 | minimax/MiniMax-M2.7 | first-party | this report |
| Alibaba | Qwen3-Coder-480B * | bedrock/qwen.qwen3-coder-480b-a35b-v1:0 | Bedrock | carried |
| Mistral | Large 3 | bedrock/mistral.mistral-large-3-675b-instruct | Bedrock | carried |
| Meta | Llama 4 Maverick | bedrock/us.meta.llama4-maverick-17b-instruct-v1:0 | Bedrock | carried |
| Amazon | Nova Pro ** | bedrock/us.amazon.nova-pro-v1:0 | Bedrock | carried |
| Google | Gemini 3.5 Flash | gemini/gemini-3.5-flash | first-party | this report |
* Alibaba's current flagship Qwen 3.6 was not attempted (no first-party key); this row carries the best-available Bedrock entry, Qwen3-Coder-480B, one version behind.
** Amazon's flagship-tier Nova Premier produced no submission; the Amazon row carries the mid-tier Nova Pro.
The native-agent baselines accompany the ranking (reported in §8.1 and §8.3, not
interleaved into it): Opus 4.8 under Claude Code (max on all eleven scenarios,
xhigh on the original five), Opus 4.7 under Claude Code (max on all eleven),
and GPT-5.5 under Codex (high on all eleven).
7. Results
7.1 Cross-scenario leaderboard
The cross-scenario aggregate is the mean of a model's aggregate over the
scenarios in which it produced a valid submission. The Scenarios column gives
the count out of twelve; Avg Wall is the mean wall-clock of the valid runs.
| Rank | Provider | Model | Route | Avg Agg % | Scenarios | Avg Wall |
|---|---|---|---|---|---|---|
| 1 | Anthropic | Fable 5 | FP | 57.83 | 12/12 | 15:50 |
| 2 | Anthropic | Opus 4.8 | FP | 46.67 | 12/12 | 11:44 |
| 3 | Anthropic | Opus 4.7 | FP | 44.56 | 12/12 | 06:56 |
| 4 | Google | Gemini 3.5 Flash (thin) | FP | 43.92 | 3/12 | 26:11 |
| 5 | DeepSeek | V4-Pro | FP | 42.90 | 12/12 | 13:52 |
| 6 | OpenAI | GPT-5.5 | FP | 42.46 | 12/12 | 09:01 |
| 7 | z.ai | GLM-5.1 | FP | 40.70 | 11/12 | 15:42 |
| 8 | MiniMax | M2.7 | FP | 29.40 | 12/12 | 08:15 |
| 9 | Moonshot | Kimi K2.6 (thin) | FP | 26.75 | 1/12 | 46:09 |
| 10 | Alibaba | Qwen3-Coder-480B * | BR | 22.27 | 12/12 | 19:33 |
| 11 | Mistral | Large 3 | BR | 19.96 | 12/12 | 05:28 |
| 12 | Meta | Llama 4 Maverick | BR | 1.41 | 8/12 | 01:07 |
| 13 | Amazon | Nova Pro ** | BR | 1.22 | 8/12 | 07:50 |
* Qwen3-Coder-480B is one version behind Alibaba's current flagship (Qwen 3.6, not attempted).
** Amazon's flagship-tier Nova Premier produced no submission; Nova Pro is the best-available Amazon entry.
Fable 5 leads the table outright on full 12/12 coverage (57.83) and tops eight of the twelve boards. Note that the raw +11.2 gap over Opus 4.8 is not regime-matched: under SWE-agent, Fable's always-on adaptive thinking runs while Opus runs thinking-off (§8.4). (thin) marks coverage below half the suite: Gemini 3.5 Flash scored on helios, helix_marine, and kitchen_supply_co (3/12); Kimi K2.6 scored only on halberon (1/12). Their cross-scenario averages should not be read as comparable to the 12/12 entries (§7.4).
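The aggregation rule and the (thin) flag can be written out as a short sketch. The function name `cross_scenario_summary` and the use of `None` for a no-submission cell are assumptions made for illustration; the Gemini 3.5 Flash and Kimi K2.6 cell values in the example are the per-scenario scores quoted in §7.4.

```python
from statistics import mean
from typing import Optional

SUITE_SIZE = 12  # scenarios in the cross-scenario leaderboard


def cross_scenario_summary(cells: dict[str, Optional[float]]) -> dict:
    """Average a model's per-scenario aggregates over valid submissions only.

    `cells` maps scenario name -> aggregate score, or None for no submission.
    Scenarios missing from the mapping are treated as no submission.
    """
    valid = [score for score in cells.values() if score is not None]
    coverage = len(valid)
    return {
        "avg": round(mean(valid), 2) if valid else None,
        "coverage": f"{coverage}/{SUITE_SIZE}",
        # "(thin)" marks coverage below half the suite.
        "thin": coverage < SUITE_SIZE / 2,
    }


gemini = {
    "helios": 43.50,
    "helix_marine": 47.00,
    "kitchen_supply_co": 41.25,
    "halberon": None,  # budget exhaustion, no file emitted
}
kimi = {"halberon": 26.75}

gemini_summary = cross_scenario_summary(gemini)
kimi_summary = cross_scenario_summary(kimi)
print(gemini_summary)
print(kimi_summary)
```

Because the denominator of the average is the number of valid runs rather than the suite size, a model with one strong run and eleven dropouts can outrank a model with twelve mediocre runs, which is why the coverage column has to be read alongside the average.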

The cross-scenario board separates into three regions:
- A clear leader, then a tight band behind it (40.7–46.7). Fable 5 posts the highest average in the suite by a wide margin (57.83, full 12/12 coverage) and tops eight of the twelve boards, though the raw +11.2 gap over Opus 4.8 mixes model quality with a reasoning-regime asymmetry (Fable's always-on adaptive thinking vs Opus thinking-off under this harness; §8.4). Behind it, Opus 4.8 leads the rest of the cohort at 46.67, 2.11 points clear of Opus 4.7 (44.56). DeepSeek V4-Pro (42.90), GPT-5.5 (42.46), and GLM-5.1 (40.70) follow inside a tight cluster: those five (Opus 4.8 through GLM-5.1) span only six points, comparable to per-scenario single-trial noise, so positions 3–7 are better read as a partial order than a strict ranking. The robust separations are Fable 5's per-scenario dominance and Opus 4.8's lead among the remaining full-coverage entries.
- Mid band (~20–29). MiniMax M2.7 (29.40, first-party) now sits noticeably ahead of the Bedrock-hosted Qwen3-Coder-480B (22.27) and Mistral Large 3 (19.96). The first-party MiniMax entry is ~7 points above the leading Bedrock entry — driven by M2.7's outsized 55.00 on jobpost_analytics_co (§7.2).
- Floor (≤1.5). Meta Llama 4 Maverick (1.41, 8/12) and Amazon Nova Pro (1.22, 8/12) never assemble the multi-table joins on the scenarios they reach. Their partial coverage reflects no-submissions (Llama 4: helios, jobpost, kitchen_supply; Nova Pro: pellichar/brierdale/halberon dropouts not re-attempted in the original cohort) rather than scenario-specific incapacity.
The DeepSeek result holds across the wider sample: V4-Pro reaches the top band (42.90, fourth among the full-coverage entries) while running at roughly two orders of magnitude below the Western frontier's per-run cost (§7.5). It remains the cost-quality frontier point of the entire suite.
Scope note on Fable 5. Fable 5 was run on all twelve scenarios. The cross-scenario leaderboard (§7.1), the model × scenario heatmap (§7.3), the efficiency frontier (§7.5), the §8.4 synthesis, and every per-scenario board in Appendix A all include it. The only views that do not are the Opus 4.7 → 4.8 harness analysis (§8.1, §8.3, Figure 5) and the native-vs-SWE-agent appendix (Appendix B): those are specifically about the Opus generation pair and its three native conditions, so references to "eleven scenarios" there are intentional and exclude Fable 5.
7.2 Per-scenario summary
Each scenario produces its own twenty-question board. The table below gives, per
scenario, the top SWE-agent model and score and the best native-agent score
(across Fable 5, Opus 4.8, and Opus 4.7 under Claude Code max and GPT-5.5 under
Codex).
| Scenario | Top SWE model | Top score | Native best |
|---|---|---|---|
| Brierdale | Fable 5 | 60.50 | 74.75 (Fable 5 CC max) |
| Brindholm | Fable 5 | 51.00 | 76.50 (Opus 4.7 CC max) |
| Brivane | Fable 5 | 38.00 | 38.00 (Opus 4.7 CC max) |
| Halberon | Fable 5 | 44.25 | 57.25 (Fable 5 CC max) |
| Pellichar | V4-Pro | 52.50 | 48.50 (GPT-5.5 Codex) |
| Aurora Transit | Fable 5 | 80.75 | 73.25 (Fable 5 CC max) |
| Crosswind | Fable 5 | 76.50 | 81.50 (Fable 5 CC max) |
| Helios | Opus 4.8 | 58.00 | 63.50 (Fable 5 CC max) |
| Helix Marine | Opus 4.8 | 55.00 | 60.75 (Fable 5 CC max) |
| Jobpost Analytics Co. | Fable 5 | 63.25 | 65.25 (GPT-5.5 Codex) |
| Kitchen Supply Co. | Fable 5 | 74.75 | 67.00 (Fable 5 CC max) |
| Elo | Opus 4.8 | 62.50 | 66.50 (Fable 5 CC max) |
Four readings follow. First, Fable 5 dominates but does not sweep: it tops
eight bare-API boards (brierdale, brindholm, brivane, halberon, aurora_transit,
crosswind, jobpost_analytics_co, kitchen_supply_co), while Opus 4.8 tops three
(helios, helix_marine, elo) and DeepSeek V4-Pro tops pellichar — so even the
strongest model loses a third of the scenarios to a rival. Second, absolute
difficulty varies widely: brivane remains the hardest scenario (top score 38.0,
no model clears half the points), while aurora_transit (80.75), crosswind (76.5),
and kitchen_supply_co (74.75) — all Fable 5 wins — are now the most tractable.
Third, the highest single cell anywhere in the suite is native Fable 5 on
crosswind (Claude Code max, 81.5), with native Fable 5 on brierdale (74.75) and
aurora_transit (73.25) close behind; the highest cell under Opus 4.8 remains
aurora_transit (70.25). Fourth, the native-vs-SWE picture is model- and
scenario-dependent: helios is the scenario where SWE-agent out-scores the
native run for the Opus pair (Opus 4.8 SWE 58.00 vs CC max 46.25, −11.75),
yet on that same scenario native Fable 5 (63.5) is the highest run; and on the
boards Fable 5 already aces (aurora, kitchen, jobpost) its own native loop costs
it a few points. "Native generally wins" is no longer a clean rule (§8.3, §8.4).
7.3 Model × scenario heatmap
The aggregate of every scored cohort entry across the twelve scenarios is shown as a heatmap in Figure 3. The full per-scenario boards are reproduced in Appendix A.

Two structural patterns hold across the grid. The first-party top band is rank-correlated but not identical scenario to scenario: Opus 4.8 is top-band on most scenarios, but its margin swings from a clear win on brierdale and helios to mid-pack on crosswind (22.75, sixth) and jobpost (48.50, fifth). The Bedrock floor is a join-failure floor, not a reasoning floor: Meta Llama 4 Maverick and Amazon Nova Pro score near zero wherever they appear because they never assemble the multi-table join, and their absence from some columns is a no-submission outcome rather than a low score.
7.4 No-submission and floor failure modes
Across the eleven scenarios, four providers contribute most of the no-submission / floor mass:
- Google Gemini 3.5 Flash — 3/11. Scored on helios (43.50), helix_marine (47.00), and kitchen_supply_co (41.25); no valid submission on the other eight. The failure modes are heterogeneous: model-behavioural runaway (deliberation-heavy trajectories writing per-question scratch work but never assembling the answer file; aurora_transit, crosswind, jobpost), context over-runs (brierdale, pellichar), and on halberon a 35:06 / $6.86 budget exhaustion with no file emitted — the most expensive single entry anywhere in the suite. Where Gemini 3.5 Flash does submit, its scores are competitive with the first-party top band (43.92 cross-scenario over 3/11); the binding constraint is reaching a submission, not analytic capability once one is produced.
- Moonshot Kimi K2.6 — 1/11. Scored only on halberon (26.75), where it landed mid-band but at by far the slowest first-party wall in that scenario (46:09). On the other ten scenarios it either produced no submission or did not converge within step/context budget on retry. Its cross-scenario average (26.75) is a single-scenario figure and is not comparable to the 11/11 entries.
- Amazon Nova Pro — 8/11. Submitted on most scenarios at floor scores (≤2.25 on aurora_transit and jobpost; 0.00 on pellichar and helix_marine; 0.75 on brindholm and brivane) driven by join failure; no submission on brierdale, halberon, or kitchen_supply.
- Meta Llama 4 Maverick — 8/11. Submitted on most scenarios with every submitted run a near-zero join failure (≤3.75); no submission on brindholm, jobpost, or kitchen_supply.
None of the no-submission outcomes is a function-calling or patch-convention failure of the kind §6.2 describes; all are produce-no-artifact terminations. Their absence from a column should not be read as a hard capability floor — where these endpoints reach submission, they reach a score.
7.5 Cross-scenario efficiency
Per-run cost and wall-clock vary by roughly two orders of magnitude across the cohort. Cost is the LiteLLM meter's billed figure for the first-party runs and the Bedrock meter's figure for the carried runs; the two are list-price comparable but not identical accounting, and the USD figure is not snapshot-locked across providers.

Two patterns stand out. First, cost is uncorrelated with score: DeepSeek V4-Pro reaches the top band at ≈$0.06 per run while Opus 4.8 leads at ≈$1.6 — roughly 25× the spend for a ~4-point gain — and the three Chinese-lab flagships cluster an order of magnitude below the Western frontier at comparable scores. Second, spend does not buy a submission: Gemini 3.5 Flash's most expensive runs (aurora_transit $14+, halberon $6.86) produced no answer file. The honest cost reading is unchanged from the original five: the benchmark's cost-quality frontier is held by a cheap model (DeepSeek V4-Pro), and the most expensive runs in the entire study — the native Claude Code baselines at $22–75 per run — sit where the §8.1 brierdale outlier lives.
7.6 Chain-of-thought audit
A trajectory-level search of the scored SWE-agent runs for pivot markers (wrong
approach, abandoned, reconsider, pivot) and for an uncertainty register
returns a near-null result across all eleven scenarios: the scored SWE-agent
runs follow a produce-and-submit pattern, drafting per-question values and
writing them to solution/submission.json without a documented mid-run pivot.
The trajectories that break the pattern are the native-agent runs — and there
the over-deliberation is not always benign: on brierdale (the §8.1 outlier),
brindholm, brivane, and jobpost, the Opus 4.8 max Claude Code runs ran large
self-verification Workflows whose reconcile stages re-litigated answers, with
the brierdale instance the only one that unambiguously collapsed multiple
top-tier readings (§8.1). The grader reads none of this; every aggregate is
computed solely from submission.json. Trajectory notes are qualitative
context only.
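A marker search of this kind can be sketched in a few lines. The trajectory storage format is not specified in this report, so the example assumes each trajectory is available as plain text; `count_pivot_markers` is a hypothetical helper, and only the four marker strings come from the audit description above.

```python
import re
from collections import Counter

# Marker phrases named in the audit; matching is case-insensitive.
PIVOT_MARKERS = ("wrong approach", "abandoned", "reconsider", "pivot")

_MARKER_RE = re.compile(
    r"\b(" + "|".join(re.escape(m) for m in PIVOT_MARKERS) + r")\b",
    flags=re.IGNORECASE,
)


def count_pivot_markers(trajectory_text: str) -> Counter:
    """Count whole-word occurrences of each pivot marker in one trajectory."""
    hits = Counter({marker: 0 for marker in PIVOT_MARKERS})
    for match in _MARKER_RE.finditer(trajectory_text):
        hits[match.group(1).lower()] += 1
    return hits


sample = (
    "Drafted q01 through q20 and wrote solution/submission.json. "
    "On q07 I should reconsider the join key; the first attempt was the "
    "Wrong approach."
)
counts = count_pivot_markers(sample)
print(counts)
```

Whole-word matching is a deliberate choice in this sketch: it keeps identifiers such as `pivot_table` from counting as a pivot, at the cost of missing inflected forms such as "reconsidered". A bare method call like `df.pivot(...)` would still match, so hits from a search like this need a manual read before being interpreted.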
8. Analysis
8.1 The Opus 4.7 → 4.8 upgrade sign is scenario-dependent (headline)
This is the flagship finding of the suite as it now stands. We measured the Anthropic Opus 4.7 → 4.8 generation upgrade under two harnesses across all eleven scenarios, holding each frozen build and seed fixed. The per-scenario distribution is the result, not the average alone.
| Harness | Opus 4.7 (avg over 11) | Opus 4.8 (avg over 11) | Δ (4.8 − 4.7) |
|---|---|---|---|
| SWE-agent | 43.07 | 45.23 | +2.16 |
| Native Claude Code (max) | 50.82 | 50.00 | −0.82 |
The cross-scenario averages tell only the headline: under SWE-agent the upgrade
is mildly positive on average; under native Claude Code at matched max effort
the upgrade is essentially tied. The per-scenario detail is what the
five-scenario report missed — the upgrade direction varies sharply by scenario
under both harnesses:
| Scenario | SWE 4.7 | SWE 4.8 | Δ SWE | CC 4.7 max | CC 4.8 max | Δ native |
|---|---|---|---|---|---|---|
| Brierdale | 50.25 | 59.00 | +8.75 | 63.00 | 40.00 | −23.00 |
| Brindholm | 47.50 | 46.00 | −1.50 | 76.50 | 71.00 | −5.50 |
| Brivane | 26.00 | 30.75 | +4.75 | 38.00 | 29.50 | −8.50 |
| Halberon | 33.75 | 43.50 | +9.75 | 41.75 | 47.75 | +6.00 |
| Pellichar | 44.00 | 45.25 | +1.25 | 37.75 | 42.50 | +4.75 |
| Aurora Transit | 43.25 | 49.00 | +5.75 | 55.50 | 70.25 | +14.75 |
| Crosswind | 36.50 | 22.75 | −13.75 | 40.00 | 40.75 | +0.75 |
| Helios | 43.75 | 58.00 | +14.25 | 42.25 | 46.25 | +4.00 |
| Helix Marine | 48.75 | 55.00 | +6.25 | 56.25 | 58.25 | +2.00 |
| Jobpost Analytics Co. | 54.25 | 48.50 | −5.75 | 61.50 | 58.00 | −3.50 |
| Kitchen Supply Co. | 45.75 | 39.75 | −6.00 | 46.50 | 45.75 | −0.75 |
| Average | 43.07 | 45.23 | +2.16 | 50.82 | 50.00 | −0.82 |
Under SWE-agent, the upgrade is positive on 7 of 11 scenarios; the four
negatives include crosswind (−13.75) and kitchen_supply_co (−6.00) where Opus
4.8 underperforms its predecessor decisively. Under native Claude Code at
matched max effort, the upgrade splits +5 / −5 / 1 flat across the
eleven scenarios. The five-scenario subset reported in the previous combined TR
was 3 negative / 2 positive, with brierdale's −23.0 dominating the average;
the wider sample is much more balanced.

Figure (caption partially recovered): … max Δ = −0.71 (essentially tied). Right: per-scenario native deltas (4.8 − 4.7, max), sorted by sign. The distribution spans −23.0 (brierdale) to +14.75 (aurora_transit); 5 scenarios positive, 5 negative, 1 flat. Brierdale is the dominant negative outlier. Single trial per cell, so magnitudes are noisy; the per-scenario distribution is the load-bearing finding.

Brierdale (−23.0) is the dominant negative outlier and remains
mechanistically explained. Under native Claude Code, max-effort 4.8 front-loads
a correct, format-valid baseline and then spends the majority of a 2–3.4× longer
budget running a large per-question self-verification Workflow whose reconcile
stage re-litigates already-correct answers into "more defensible" wrong
readings, collapsing top-tier questions. Brierdale is the most extreme instance
because the package's examples/example_submission.json — explicitly labelled a
placeholder decoy — was treated by max-effort 4.8 as a "ground-truth calibration
oracle"; its 21-agent verification Workflow noticed the literal-wording
tensions on several questions but resolved each toward "defensible because it
matches the example," endorsing the violations rather than catching them. The
xhigh control (one tier below max) under the same Claude Code harness
reached the opposite verdict on the example file — treating it as a decoy and
relying on question wording — and scored 57.0 versus max's 40.0, a 17-point
gap from one tier of effort change with the model identical. The collapse is
therefore a property of how the heaviest effort tier resolves under native
orchestration and the presence of a specific anchor (the decoy file), not a
stable 4.8 capability deficit. No other scenario reproduces the magnitude.
On the six new scenarios the same max-effort 4.8 averages +2.88 per
scenario against 4.7 natively; only jobpost (−3.50) and kitchen_supply
(−0.75) carry small negatives, both within plausible single-trial range.
The §8.1 finding as it now stands is therefore: the Opus 4.7 → 4.8 upgrade is not a clean cross-scenario regression under native Claude Code. It is a scenario-dependent distribution whose mean over eleven scenarios is statistically indistinguishable from zero (−0.82 against a per-scenario s.d. of roughly 10). The previous report's −5.25 cross-scenario regression headline is withdrawn — it was an artefact of an oversampled negative subset. What survives is brierdale as a specific, mechanistically-characterised outlier where the reconcile pathology compounds with an exploitable anchor file in the package. The general claim "max-effort 4.8 under native Claude Code spends compute on a self-verification Workflow that can re-litigate correct answers into wrong ones" is still supported by trajectory reads on brierdale + brindholm + brivane + jobpost, but the outcome of that compute is not uniformly negative across the suite (aurora_transit is the same Workflow producing the suite's single highest native score at 70.25).
Why the previous five-scenario report concluded the opposite. The original five included three negatives (brierdale, brindholm, brivane), one large positive (halberon), and one small positive (pellichar). The negatives were clustered on scenarios with rich multi-table-bridge reconciliation where the self-verification Workflow has many targets to re-litigate; the new six include scenarios (aurora_transit, helios) where the same Workflow is productive (the extra tooling lifts native 4.8 above 4.7 by 4+ points). The corrected reading is that the reconcile pathology is real but not universal: it depends on whether the scenario's question pack contains many "defensible alternative reading" attractors and on whether the candidate package contains an anchor artifact (like brierdale's decoy file) for the reconcile to fix on.
The single-trial-per-cell caveat remains in force. With s.d. ~10 across the eleven per-scenario deltas, a multi-seed sweep is now the highest-priority follow-up (§11).
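The mean and spread quoted in this section can be re-derived from the Δ native column of the per-scenario table above. The sketch below does only that arithmetic; the standard-error line is an assumption of the sketch (it treats the eleven scenarios as independent draws) and is not an analysis performed in this report.

```python
from math import sqrt
from statistics import mean, stdev

# Native Claude Code (max) deltas, Opus 4.8 minus Opus 4.7, from the table above.
NATIVE_DELTA = {
    "brierdale": -23.00, "brindholm": -5.50, "brivane": -8.50,
    "halberon": 6.00, "pellichar": 4.75, "aurora_transit": 14.75,
    "crosswind": 0.75, "helios": 4.00, "helix_marine": 2.00,
    "jobpost_analytics_co": -3.50, "kitchen_supply_co": -0.75,
}

deltas = list(NATIVE_DELTA.values())
n = len(deltas)
mean_delta = mean(deltas)
sd_delta = stdev(deltas)        # sample s.d. (n - 1 denominator)
std_error = sd_delta / sqrt(n)  # assumes scenarios are independent draws

print(f"n = {n}")
print(f"mean delta = {mean_delta:+.2f}")
print(f"sample s.d. = {sd_delta:.2f}")
print(f"standard error = {std_error:.2f}")
print(f"mean / standard error = {mean_delta / std_error:+.2f}")
```

With a spread of this size, the mean sits well inside one standard error of zero, which is the sense in which the cross-scenario delta is treated as a tie and the per-scenario distribution carries the finding.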
8.2 Routing and recency are confounded with the score
The cross-scenario leaderboard mixes two routing layers by construction (§6.1): the upper entries run first-party, the trailing entries are Bedrock-hosted. Because the first-party entries are also the newest releases, three variables move together down the table — model identity, model recency, and routing — and cannot be separated from a single trial per cell. Over eleven scenarios the mid-band routing seam widens: MiniMax M2.7 (29.40, first-party) now sits ~7 points above the leading Bedrock entry (Qwen3-Coder-480B, 22.27), versus 0.55 points on the original five — driven primarily by M2.7's 55.00 outlier on jobpost_analytics_co. The floor entries (Meta, Nova Pro) are Bedrock and older and weaker, so their position is overdetermined. The honest reading is unchanged: this is a leaderboard of each provider's best-available flagship as served, not a controlled capability ranking.
8.3 Harness sensitivity: native lift is scenario-dependent, not uniform
For three models we hold the model and the build fixed and vary only the agent. Across the eleven scenarios the native-versus-SWE-agent picture is:
| Model / effort | Native agent | Native avg (11) | SWE-agent avg (11) | Native − SWE |
|---|---|---|---|---|
| GPT-5.5 | Codex | 51.25 | 41.11 | +10.14 |
| Opus 4.7 (max) | Claude Code | 50.82 | 43.07 | +7.75 |
| Opus 4.8 (max) | Claude Code | 50.00 | 45.23 | +4.77 |
| Opus 4.8 (xhigh, 5-only) | Claude Code | 48.55 | 44.90 (5-only) | +3.65 |
Three observations.
First, native agents lift most models, but the lift is now substantially narrower than the five-scenario report suggested. Opus 4.7 gains +7.75 (vs +11.10 on five) and GPT-5.5 gains +10.14 (vs +9.75 on five) moving from SWE-agent to their native agents. The Opus 4.7 native-lift attenuation is driven by helios (SWE 43.75 vs native 42.25 — the native lift is negative there) and by crosswind (SWE 36.50 vs native 40.00 — only +3.5).
Second, Opus 4.8's native lift recovers under the wider sample. Its native
Claude Code max average (50.00) now sits +4.77 above its SWE-agent
average (45.23), a meaningful lift where the five-scenario report measured
only +1.25. The recovery is driven by aurora_transit (+21.25 native, the
largest Opus 4.8 lift among the new scenarios), helix_marine (+3.25), and the
dilution of brierdale's −19 native penalty in the cross-scenario mean. The
"Opus 4.8's native lift nearly vanishes" claim from the previous report does
not generalise.
Third, harness sensitivity has both signs. The new scenarios produce two
clear cases where SWE-agent is not a uniform handicap:
- Helios: SWE-agent Opus 4.8 (58.00) > native Claude Code Opus 4.8 (46.25) —
an 11.75-point harness advantage for SWE-agent on the same model. The
per-asset multiplicative meter-scale quantisation that drives the 5-question
hard core (§4.2 helios signature) is not recovered by either harness; both
collapse on the same questions, but SWE-agent's leaner loop concentrates
budget on the questions that are recoverable, while native max-effort
spends extra tooling on the unrecoverable ones.
- Crosswind: SWE-agent GLM-5.1 (42.75) exceeds the best of the three Opus/Codex
native cells (Opus 4.8 CC max 40.75) — the only scenario where a SWE-agent
cell beats all three native conditions of the Opus-pair probe. (Fable 5, added
later, tops the scenario outright on both harnesses — SWE 76.5, native 81.5; §8.4.)
The implication is direct: the absolute numbers in the cross-scenario leaderboard are model-under-SWE-agent scores, and the native-vs-SWE-agent delta is scenario-dependent in sign. On scenarios with rich reconciliation under richer tooling (aurora_transit, jobpost, brindholm), native lifts are large; on scenarios where the hard core is unrecoverable under either harness (helios, crosswind), the SWE-agent operating point can match or beat the native operating point. The full per-scenario native-versus-SWE-agent grid is in Appendix B.
Caveats: one trial per agent/effort/scenario cell, and the native runs assume the same frozen builds as the harness cohort — the question packs and metric kinds match — but a build drift between runs cannot be fully excluded (§9).
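The five-scenario versus eleven-scenario comparison for Opus 4.8 can be reproduced from the SWE 4.8 and CC 4.8 max columns of the §8.1 per-scenario table. The sketch below is a re-computation under that assumption; `mean_native_lift` is a hypothetical helper name.

```python
from statistics import mean

# (SWE-agent, native Claude Code max) aggregates for Opus 4.8, from the §8.1 table.
OPUS_48 = {
    "brierdale": (59.00, 40.00), "brindholm": (46.00, 71.00),
    "brivane": (30.75, 29.50), "halberon": (43.50, 47.75),
    "pellichar": (45.25, 42.50), "aurora_transit": (49.00, 70.25),
    "crosswind": (22.75, 40.75), "helios": (58.00, 46.25),
    "helix_marine": (55.00, 58.25), "jobpost_analytics_co": (48.50, 58.00),
    "kitchen_supply_co": (39.75, 45.75),
}
ORIGINAL_FIVE = ("brierdale", "brindholm", "brivane", "halberon", "pellichar")


def mean_native_lift(scenarios) -> float:
    """Mean of (native - SWE-agent) over the given scenario names."""
    return mean(OPUS_48[s][1] - OPUS_48[s][0] for s in scenarios)


lift_five = mean_native_lift(ORIGINAL_FIVE)
lift_eleven = mean_native_lift(OPUS_48)
# Scenarios where the native run scored below the SWE-agent run.
negative = sorted(s for s, (swe, native) in OPUS_48.items() if native < swe)

print(f"original five: {lift_five:+.2f}")
print(f"all eleven:    {lift_eleven:+.2f}")
print("native below SWE-agent on:", ", ".join(negative))
```

Keeping the per-scenario pairs rather than the two averages makes the sign of each cell visible, which is what the scenario-dependence claim rests on.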
8.4 Fable 5: leads on both harnesses — but the bare-API gap is not regime-matched
Fable 5 ran on all twelve scenarios under both the bare API (SWE-agent) and native Claude Code. It posts the highest bare-API average (57.83) and the highest native average (61.35, against Opus 4.7's 51.98, GPT-5.5/Codex's 51.48, and Opus 4.8's 51.27), and tops eight of the twelve bare-API boards. Attributing that lead, however, requires care, because the headline same-harness comparison is confounded by the reasoning regime.
The SWE-agent scaffold is identical, but the reasoning regimes are not. Under
SWE-agent the --effort knob maps only to a per-run cost cap — the harness never
passes a thinking budget. Opus 4.8 therefore runs thinking-off at the provider's
default 4,096-token output cap, while Fable 5's adaptive thinking cannot be disabled
(an explicit disable is rejected by the API) and runs under a raised 32k output cap.
The trace economics confirm it: on Aurora Transit, Fable spends ~37 s per call against
Opus's ~9 s, while Opus emits more visible thought text — Fable's depth is hidden
thinking, not verbosity. So the raw +11.2 bare-API gap mixes model quality with
test-time reasoning that one model gets and the other does not.
Regime-matched, the gap narrows and concentrates. Comparing each model's best condition per scenario (so each runs with its own thinking regime), Fable 5 leads by +9.1 mean / +6.4 median — and just +4.0 over the ten scenarios outside Crosswind and Kitchen Supply Co., with one inversion (Brindholm, Opus +20). The per-scenario mechanism splits cleanly in the traces:
- Regime-driven gaps (Aurora Transit, Jobpost Analytics Co.). Native Opus 4.8 — where its thinking and agentic loop are on — recovers most of what its bare-API run lost: Aurora 49.0 → 70.25, recovering exactly the unit-drift and derived-component questions (q01, q11, q13, q18 go 15/15/15/0 → 100/100/100/100); Jobpost 48.5 → 58.0. Against native Opus, Fable's Aurora lead shrinks from +31.75 to +10.5.
- Genuine model gaps (Crosswind, Kitchen Supply Co.). Opus 4.8 fails the core reconciliation in both regimes (Crosswind: 22.75 SWE / 40.75 native vs Fable's 76.5 / 81.5; Kitchen: 39.75 / 45.75 vs 74.75 / 67.0). The Kitchen traces are explicit: Opus's own exploration prints the four-value tier anomaly and never resolves it, and it never touches the supplier-entity merge or the sign-flipped promotion amounts — all of which Fable detects and corrects, documenting them in a cleaning log. These two scenarios alone carry roughly half the cross-scenario lead.
- Near-ties elsewhere. On the remaining originals the same-scaffold gap is ~+1 to +7, within single-trial noise for most cells.
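The difference between the same-harness gap and the regime-matched (best-condition) gap can be illustrated on the four scenarios for which this report quotes both conditions for both models. The sketch below is not the full twelve-scenario computation; the helper names are hypothetical, and the Brindholm entry for Fable 5 assumes that both of its runs scored 51.0.

```python
# (bare API under SWE-agent, native Claude Code max) scores quoted in this report.
FABLE_5 = {
    "aurora_transit": (80.75, 73.25),
    "crosswind": (76.50, 81.50),
    "kitchen_supply_co": (74.75, 67.00),
    "brindholm": (51.00, 51.00),  # assumption: both Fable 5 runs at 51.0
}
OPUS_48 = {
    "aurora_transit": (49.00, 70.25),
    "crosswind": (22.75, 40.75),
    "kitchen_supply_co": (39.75, 45.75),
    "brindholm": (46.00, 71.00),
}


def same_harness_gap(scenario: str) -> float:
    """Fable 5 minus Opus 4.8, both under SWE-agent (regimes differ)."""
    return FABLE_5[scenario][0] - OPUS_48[scenario][0]


def best_condition_gap(scenario: str) -> float:
    """Fable 5's best condition minus Opus 4.8's best condition."""
    return max(FABLE_5[scenario]) - max(OPUS_48[scenario])


for name in FABLE_5:
    print(
        f"{name:18s} same-harness {same_harness_gap(name):+6.2f}   "
        f"best-condition {best_condition_gap(name):+6.2f}"
    )
```

On Aurora Transit the gap shrinks sharply once Opus 4.8 is allowed its native condition, on Crosswind and Kitchen Supply Co. it stays large, and on Brindholm it changes sign, which is the three-way split the bullets above describe.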
The native loop adds little on net, and its sign is scenario-dependent. Fable 5's
native average is only +3.5 over its bare-API average, and the per-scenario lift
splits by scenario character. On reconciliation-heavy covers it gains (Brierdale
+14.25, Helix Marine +14.25, Halberon +13.0, Helios +10.75, Elo +7.5, Crosswind +5.0)
— the extra budget goes into re-deriving from the data. But on analytics-heavy
covers where its bare-API run was already very strong, the loop adds nothing or costs
a little (Aurora Transit −7.5, Kitchen Supply Co. −7.75, Jobpost Analytics Co. −7.0).
So "native lift" is not a fixed property — it depends on whether the scenario leaves
headroom the loop can recover. Opus 4.8, by contrast, frequently regresses under its
own native agent: its max-effort Codegen-and-Verify sub-agent panels police
arithmetic under a planning dossier that pre-commits the wrong output shapes, so the
flawed premise goes unchecked (Brierdale 59→40, Helios 58→46.25).
Two scenarios invert or neutralise the pattern, which is the tell that it is real. On Brindholm the roles flip: Opus 4.8 runs a genuine re-derivation workflow while Fable 5 only reproduces its own first answer, so Fable's bare-API and native runs both land at 51.0 (no native lift) while native Opus 4.8 reaches 71.0. On Elo, neither model uses verification sub-agents (both verify by re-running on the data), so they finish in a near-tie (Fable native 66.5, Opus 65.25) and the gap shrinks to a single units-interpretation call. The axis that separates the two models is not depth of reasoning but whether the agent loop is used to interrogate the data or to ratify a plan.
9. Limitations
- Single trial per cell. Every cohort entry, every native baseline, and every per-scenario cell is a single trajectory. No within-condition variance or decoding-seed sweep is reported, so the first-party top-band ordering (positions 2–6 inside 4.5 points), any small rank swap, and the magnitudes of the §8.1 upgrade deltas are below the resolution of the benchmark. The per-scenario distribution of the §8.1 upgrade is the load-bearing finding; the per-scenario point estimates are not.
- The §8.1 finding is now scenario-distributional, not cross-scenario-mean. The Opus 4.7 → 4.8 native cross-scenario delta is −0.82 over eleven scenarios (vs −5.25 reported on five). A reader must not quote either figure as "the" upgrade effect; what is reported is the distribution of per-scenario deltas. The previous report's headline regression claim is withdrawn in light of the wider sample.
- Mixed routing and recency confound (central). The cross-scenario leaderboard is not a controlled capability ranking. The top entries are newest-release flagships served first-party; the trailing entries are older flagships served through Bedrock. Model identity, recency, and routing move together down the table, and a single trial cannot separate them (§8.2).
- Scores are harness-conditioned. Every ranked entry runs through SWE-agent at its default operating point. §8.3 shows native-agent averages differ by +4.77 to +10.14 points on average across eleven scenarios, with per-scenario deltas spanning −11.75 (helios) to +21.25 (aurora_transit). The absolute numbers are model-under-SWE-agent, not model capability in the abstract.
- Uneven scenario coverage. Kimi K2.6 produced a valid submission on only one of eleven scenarios (halberon), so its cross-scenario average is a single-scenario figure not comparable to the 11/11 entries; Gemini 3.5 Flash produced a valid submission on three of eleven (helios, helix_marine, kitchen_supply_co); Meta (8/11) and Nova Pro (8/11) are partial. The cross-scenario average is computed only over each model's valid submissions, so models with thin coverage carry more single-trial risk.
- One incomplete flagship, one tier substitution. Alibaba's current flagship (Qwen 3.6) was not attempted — no first-party key — so the Qwen row carries the one-version-behind Qwen3-Coder-480B. Amazon's flagship-tier Nova Premier produced no submission, so the Amazon row carries Nova Pro. Both rows understate their provider's marketed flagship.
- Cross-report sourcing assumes stable builds. The first-party rows on the six new scenarios were run for this report on the fix-swe-opus4-harness branch (DQ-6 cohort, 2026-06-02); the original five rows are carried from the 2026-05-29 standardized rerun on the same branch; the Bedrock rows are carried from earlier sweeps. All are assumed to share the same frozen builds (the question packs and metric kinds match), but a build drift across the runs cannot be fully excluded.
- Eleven corpora, not a population. The suite spans eleven synthetic corpora and eleven question packs across eleven industries. The wider sample reframes the §8.1 finding but eleven scenarios remain a small sample and the generalisation beyond this suite is not resolved here.
- Bedrock and native pricing not standardised. Token counts are reported from the LiteLLM meter; the USD figure depends on provider list prices at the time of the run and was not snapshot-locked. Cross-provider and SWE-vs-native USD comparisons are indicative, not exact.
- Persona calibration externally illegible. The six-persona stress test (§5.3) is a defensible internal threshold-tuning protocol, but no external reviewer can independently audit it without re-running the builds.
- No abstention or confidence instrumentation. Agents are not asked to flag low-confidence answers or to abstain; calibration plots cannot be drawn.
- Closed metric set, English-only, tabular-only. Seven metric kinds are allowed by construction; all stakeholder prose, schema names, and entity strings are in English; schemas are tabular with no multi-modal content. A high-token native run and a low-token SWE-agent run that emit the same JSON receive the same score; token-efficiency is reported separately (§7.5, §8.3) but is not part of the aggregate.
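The uneven-coverage caveat above can be made concrete with a small sketch. It averages only the scenarios on which a model produced a valid submission, using SWE-agent scores copied from the Appendix A tables. The function name `coverage_average` and the dictionary layout are illustrative assumptions, not part of the rl-gym harness, and only a subset of the no-submission scenarios is listed.

```python
from statistics import mean

# SWE-agent aggregate scores copied from the Appendix A tables.
# None marks a scenario with no valid submission ("no sub" or "not run").
# Only a few of the no-submission scenarios are listed, for brevity.
VALID_SCORES = {
    "Gemini 3.5 Flash": {
        "helios": 43.50,
        "helix_marine": 47.00,
        "kitchen_supply_co": 41.25,
        "halberon": None,
    },
    "Kimi K2.6": {"halberon": 26.75, "brierdale": None},
}


def coverage_average(per_scenario):
    """Return (mean over valid submissions, number of valid submissions)."""
    valid = [score for score in per_scenario.values() if score is not None]
    if not valid:
        return None, 0
    return round(mean(valid), 2), len(valid)


for model, scores in VALID_SCORES.items():
    avg, n = coverage_average(scores)
    print(f"{model}: {avg} over {n} valid scenario(s)")
```

A one-scenario mean such as Kimi K2.6's is simply that scenario's score, which is why it is not comparable to the full-coverage entries.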
10. Broader Impacts and Ethics
rl-gym is built entirely from synthetic data with fictional cover; no real personal information is present, no real organisations are named, and no scraped human-authored content is graded. The source datasets (Kaggle FMCG sales, a public time-series corpus, Kaggle Recruit Restaurant Visitor Forecasting, Kaggle Polymarket, Kaggle ASHRAE Great Energy Predictor III, Kaggle JPX, Kaggle M5, Kaggle ride-bookings, Kaggle Zillow Prize, Kaggle LinkedIn Job Postings, Kaggle Instacart Online Grocery) are public and licensed for research use; rl-gym uses each as a procedural seed rather than as content. The clinical-trial framing of Brivane in particular is fictional and carries no real health data.
Three responsible-use considerations apply.
Over-reading provider rankings and version upgrades. The cross-scenario leaderboard ranks one flagship per provider on eleven benchmark instances, under one harness, in a single trial per cell — and it mixes first-party and Bedrock routing (§8.2). It is not a general "which provider is best" claim. Two reading rules apply: behind Opus 4.8's top-band lead the first six models are a cluster, not a ranked podium; and the first-party/Bedrock split confounds routing with model recency. The Opus 4.7 → 4.8 result (§8.1) must never be quoted as "4.8 is better" or "4.8 is worse" without the harness and the per-scenario distribution attached. The previous report's cross-scenario regression headline (−5.25) does not survive the wider sample and is withdrawn.
Documentation of the brierdale outlier. Brierdale's −23.0 native delta remains the largest single negative anomaly in the suite, and the reconcile pathology mechanism (§8.1) is corroborated by trajectory reads on three additional scenarios where smaller negative deltas appear. But the magnitude is brierdale-specific (driven in part by the package's decoy example file functioning as an anchor for the reconcile) and does not generalise. The finding should be read as a specific, mechanistically-characterised anomaly, not a verdict on the weights.
Data and scoring privacy. Each scenario ships with: a canary string in the candidate-facing instructions to support future contamination audits; the rubric and canonical answers kept private; obstacles never enumerated to the model; and the construction seed frozen so that the reported results remain reproducible.
11. Conclusion and Next Steps
Across eleven frozen data-quality scenarios, the rl-gym suite places each major provider's current flagship on one cross-scenario leaderboard under the SWE-agent harness, and reframes the previous five-scenario report's headline finding. The sign of the Opus 4.7 → 4.8 upgrade under native Claude Code is now characterised as a scenario-dependent distribution rather than a cross-scenario regression. Over eleven scenarios the native delta averages −0.82 (Opus 4.8 50.00 vs Opus 4.7 50.82) — essentially tied — with a per-scenario distribution spanning −23.0 (brierdale) to +14.75 (aurora_transit), 5 positive / 5 negative / 1 flat. Brierdale is the dominant negative outlier, mechanistically explained by a reconcile pathology compounded with a decoy anchor file in the candidate package; the magnitude does not generalise. Under SWE-agent the upgrade is mildly positive on average (+2.11); Opus 4.8 leads the rest of the cohort, but Fable 5 — added later, on all twelve scenarios — tops the cross-scenario board at 57.83.
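The native delta summary quoted above can be recomputed from the Claude Code columns of Appendix B. The sketch below is illustrative only: the variable names are assumptions, and the standard deviation uses the sample (n − 1) form, which the report does not specify.

```python
from statistics import mean, stdev

# (Opus 4.8 CC max, Opus 4.7 CC max) per scenario, copied from Appendix B.
NATIVE_CC_MAX = {
    "brierdale": (40.00, 63.00),
    "brindholm": (71.00, 76.50),
    "brivane": (29.50, 38.00),
    "halberon": (47.75, 41.75),
    "pellichar": (42.50, 37.75),
    "aurora_transit": (70.25, 55.50),
    "crosswind": (40.75, 40.00),
    "helios": (46.25, 42.25),
    "helix_marine": (58.25, 56.25),
    "jobpost_analytics_co": (58.00, 61.50),
    "kitchen_supply_co": (45.75, 46.50),
}

# Per-scenario native delta: Opus 4.8 minus Opus 4.7, both under Claude Code.
deltas = {name: new - old for name, (new, old) in NATIVE_CC_MAX.items()}

avg_delta = mean(list(deltas.values()))
worst = min(deltas, key=deltas.get)
best = max(deltas, key=deltas.get)
spread = stdev(list(deltas.values()))  # sample s.d. (assumption)

print(f"mean delta: {avg_delta:+.2f}")
print(f"range: {deltas[worst]:+.2f} ({worst}) to {deltas[best]:+.2f} ({best})")
print(f"sample s.d.: {spread:.1f}")
```

With a spread this large relative to the mean, a single trial per cell cannot establish the sign of the upgrade, which is the motivation for the multi-seed follow-up.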
Behind the headline the cross-scenario pattern holds. Below Fable 5, a first-party band (Opus 4.8 46.67, Opus 4.7 44.56, DeepSeek V4-Pro 42.90, GPT-5.5 42.46, GLM-5.1 40.70) clusters inside six points well above a mid band (MiniMax M2.7 29.40, Qwen3-Coder-480B 22.27, Mistral Large 3 19.96) and a join-failure floor (Meta 1.41, Nova Pro 1.22, Kimi 1/11, Gemini 3/11). No single model wins every scenario: Fable 5 tops eight boards, Opus 4.8 tops three (helios, helix_marine, elo), and DeepSeek V4-Pro tops pellichar — so even the strongest model loses a third of the scenarios to a rival. DeepSeek V4-Pro remains the cost-quality frontier point of the entire suite — top-band on cohort cost roughly two orders of magnitude below the Western frontier. The most important structural caveats are that the leaderboard mixes first-party and Bedrock routing (§8.2), and that absolute scores — and now the per-scenario distribution of a generation upgrade — are conditioned on the harness.
The harness-sensitivity picture has also gained nuance. On most scenarios native agents beat SWE-agent (Opus 4.7 +7.75 avg, GPT-5.5 +10.14 avg, Opus 4.8 +4.77 avg). But on helios SWE-agent Opus 4.8 (58.00) out-scores its own native Claude Code run (46.25), and on crosswind SWE-agent GLM-5.1 (42.75) tops the cross-condition board over the best native run. The native lift is no longer a uniform claim across the suite.
Next steps:
- Run multi-seed trials. Convert single-trial positions — especially the §8.1 per-scenario deltas, the top-band cluster, and the per-scenario top finishes — into statistically-supported findings. With s.d. ~10 across the eleven per-scenario native deltas, a multi-seed sweep is the highest-priority follow-up.
- Probe the brierdale anomaly directly. Ablate the decoy example file and, separately, the self-verification Workflow on brierdale to confirm which is the proximate driver of the −23 collapse, and whether a "distrust the example / skip the reconcile" instruction recovers max to xhigh levels.
- Close the routing confound. Re-run the Bedrock-hosted flagships first-party (or the whole cohort through one route) so the leaderboard becomes a clean within-route comparison.
- Extend coverage on the thin entries. Re-run Kimi K2.6 on the remaining ten scenarios with a step-budget or context-budget cap that prevents the stamina loop; retry Gemini 3.5 Flash with a reasoning-budget cap to convert its 3/11 into a fuller picture.
- Probe Opus 4.8 xhigh on the six new scenarios. The xhigh effort control is only on the original five. Extending it to the new six would test whether the "more 4.8 effort is not better" finding holds across the wider suite or is specific to brierdale's anchor structure.
The suite is best read as a set of controlled benchmark instances: eleven fixed corpora, eleven sets of hidden answers, deterministic rubrics, and a cross-scenario leaderboard whose absolute level — and whose per-scenario distribution of a model upgrade — is conditioned on the harness.
Appendix A. The Twelve Per-Scenario Leaderboards
Tier counts: E+ beyond_excellent, E excellent, g2e good_to_excellent, G
good, n2g naive_to_good, N naive, W worse_than_naive. g+ is the
good-or-better count. All entries are SWE-agent unless noted. Cost is the
LiteLLM/Bedrock meter's billed figure in USD; wall is mm:ss.
A.1 Brierdale (FMCG seed; 5 tables; 39 obstacles)
| Rank | Provider | Model | Route | Agg % | g+ | Wall | Cost ($) |
|---|---|---|---|---|---|---|---|
| 1 | Anthropic | Fable 5 | FP | 60.50 | | 14:48 | |
| 2 | Anthropic | Opus 4.8 | FP | 59.00 | 15 | 08:03 | 1.96 |
| 3 | OpenAI | GPT-5.5 | FP | 55.50 | 13 | 13:04 | 2.13 |
| 4 | z.ai | GLM-5.1 | FP | 50.50 | 12 | 12:33 | 0.29 |
| 5 | Anthropic | Opus 4.7 | FP | 50.25 | 12 | 04:58 | 1.10 |
| 6 | DeepSeek | V4-Pro | FP | 48.00 | 11 | 12:13 | 0.05 |
| 7 | MiniMax | M2.7 | FP | 44.50 | 10 | 07:43 | 0.06 |
| 8 | Alibaba | Qwen3-Coder-480B | BR | 38.75 | 10 | 13:26 | 0.52 |
| 9 | Mistral | Large 3 | BR | 27.75 | 4 | 05:04 | 0.71 |
| 10 | Meta | Llama 4 Maverick | BR | 0.00 | 0 | 00:54 | 0.02 |
| — | Moonshot | Kimi K2.6 | FP | no sub | | 01:35 | — |
| — | Amazon | Nova Pro | BR | no sub | | 04:49 | 0.97 |
| — | Google | Gemini 3.5 Flash | FP | no sub | | 01:29 | — |
Fable 5 now leads at 60.50 and also posts the top native run (74.75 under Claude Code). Best non-Fable: 59.00 (Opus 4.8); median 48.00. Next-best native 63.00 (Opus 4.7 CC max). Brierdale also shows the single largest native Opus 4.7 → 4.8 regression in the suite (−23.0); see §8.1.
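The "best non-Fable" and "median" figures quoted under each leaderboard can be reproduced from the table rows. A minimal sketch for Brierdale, assuming (as the quoted 48.00 suggests) that the median is taken over the scored non-Fable SWE-agent rows and that no-submission rows are excluded; the list name is illustrative.

```python
from statistics import median

# Scored non-Fable SWE-agent aggregates from the A.1 Brierdale table,
# in rank order (Opus 4.8 down to Llama 4 Maverick). Rows marked
# "no sub" are excluded, which is an assumption about the convention.
BRIERDALE_NON_FABLE = [59.00, 55.50, 50.50, 50.25, 48.00, 44.50, 38.75, 27.75, 0.00]

best_non_fable = max(BRIERDALE_NON_FABLE)
median_non_fable = median(BRIERDALE_NON_FABLE)

print(f"best non-Fable: {best_non_fable:.2f}")
print(f"median: {median_non_fable:.2f}")
```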
A.2 Brindholm (time-series seed; 9 tables; 40 obstacles)
| Rank | Provider | Model | Route | Agg % | g+ | Wall | Cost ($) |
|---|---|---|---|---|---|---|---|
| 1 | Anthropic | Fable 5 | FP | 51.00 | | 14:07 | |
| 2 | OpenAI | GPT-5.5 | FP | 47.50 | 11 | 07:39 | 1.20 |
| 3 | Anthropic | Opus 4.7 | FP | 47.50 | 10 | 07:49 | 1.85 |
| 4 | Anthropic | Opus 4.8 | FP | 46.00 | 10 | 15:05 | 4.15 |
| 5 | DeepSeek | V4-Pro | FP | 43.50 | 10 | 12:37 | 0.05 |
| 6 | z.ai | GLM-5.1 | FP | 43.00 | 10 | 15:27 | 0.30 |
| 7 | MiniMax | M2.7 | FP | 15.50 | 3 | 10:01 | 0.03 |
| 8 | Alibaba | Qwen3-Coder-480B | BR | 13.00 | 3 | 32:42 | 1.06 |
| 9 | Mistral | Large 3 | BR | 9.75 | 2 | 05:50 | 0.65 |
| 10 | Amazon | Nova Pro | BR | 0.75 | 0 | 09:48 | 0.58 |
| — | Meta | Llama 4 Maverick | BR | no sub | | 00:57 | 0.02 |
| — | Google | Gemini 3.5 Flash | FP | no sub | | 01:27 | — |
| — | Moonshot | Kimi K2.6 | FP | not run | | — | — |
Fable 5 now leads at 51.0 (native 51.0 under Claude Code). Best non-Fable: 47.50 (GPT-5.5 / Opus 4.7 tie); median 43.00. Native best 76.50 (Opus 4.7 CC
max) — the highest single native cell anywhere in the suite. Sharply bimodal
tier mix: top cluster heavy with beyond_excellent over a wide
worse_than_naive floor (the suite's widest shared failure floor), with almost
no middle band.
A.3 Brivane (Recruit Restaurant seed; 7 tables; 25 obstacles)
| Rank | Provider | Model | Route | Agg % | g+ | Wall | Cost ($) |
|---|---|---|---|---|---|---|---|
| 1 | Anthropic | Fable 5 | FP | 38.00 | | 12:47 | |
| 2 | Anthropic | Opus 4.8 | FP | 30.75 | 6 | 04:35 | 0.83 |
| 3 | Anthropic | Opus 4.7 | FP | 26.00 | 4 | 04:14 | 0.90 |
| 4 | DeepSeek | V4-Pro | FP | 25.50 | 5 | 15:48 | 0.07 |
| 5 | OpenAI | GPT-5.5 | FP | 23.00 | 4 | 11:26 | 1.01 |
| 6 | z.ai | GLM-5.1 | FP | 20.75 | 4 | 10:38 | 0.24 |
| 7 | Alibaba | Qwen3-Coder-480B | BR | 16.25 | 3 | 12:41 | 0.40 |
| 8 | MiniMax | M2.7 | FP | 6.00 | 0 | 10:37 | 0.05 |
| 9 | Mistral | Large 3 | BR | 1.50 | 0 | 02:23 | 0.16 |
| 10 | Amazon | Nova Pro | BR | 0.75 | 0 | 01:25 | 0.16 |
| 11 | Meta | Llama 4 Maverick | BR | 0.00 | 0 | 01:20 | 0.06 |
| — | Moonshot | Kimi K2.6 | FP | no sub | | 01:37 | — |
| — | Google | Gemini 3.5 Flash | FP | no sub | | 01:29 | — |
Fable 5 now leads at 38.00 (native 37.00 under Claude Code). Best non-Fable: 30.75 (Opus 4.8); median 18.50. Native best 38.00 (Opus 4.7 CC max). The hardest scenario in the suite, a pure cross-network-bridge reconciliation test on which no non-Fable SWE-agent run clears a third of the points; near-bimodal excellent/worse tier mix.
A.4 Halberon (Polymarket seed; 6 tables; 30 obstacles)
| Rank | Provider | Model | Route | Agg % | g+ | Wall | Cost ($) |
|---|---|---|---|---|---|---|---|
| 1 | Anthropic | Fable 5 | FP | 44.25 | | 18:51 | |
| 2 | Anthropic | Opus 4.8 | FP | 43.50 | 9 | 07:28 | 1.67 |
| 3 | OpenAI | GPT-5.5 | FP | 39.50 | 9 | 06:58 | 1.45 |
| 4 | DeepSeek | V4-Pro | FP | 38.25 | 7 | 11:43 | 0.05 |
| 5 | Anthropic | Opus 4.7 | FP | 33.75 | 9 | 05:32 | 1.21 |
| 6 | z.ai | GLM-5.1 | FP | 31.75 | 7 | 25:16 | 0.32 |
| 7 | Moonshot | Kimi K2.6 | FP | 26.75 | 7 | 46:09 | 0.65 |
| 8 | MiniMax | M2.7 | FP | 16.50 | 3 | 08:22 | 0.04 |
| 9 | Mistral | Large 3 | BR | 15.75 | 2 | 09:44 | 1.17 |
| 10 | Alibaba | Qwen3-Coder-480B | BR | 14.25 | 2 | 82:34 | 0.52 |
| 11 | Meta | Llama 4 Maverick | BR | 1.50 | 0 | 02:27 | 0.01 |
| — | Amazon | Nova Pro | BR | no sub | | 00:59 | 0.06 |
| — | Google | Gemini 3.5 Flash | FP | no sub | | 35:06 | 6.86 |
Fable 5 now leads at 44.25 and also posts the top native run (57.25 under Claude Code). Best non-Fable: 43.50 (Opus 4.8); median 29.25. Next-best native 57.00 (GPT-5.5 Codex). The only scenario in which Kimi K2.6 produced a valid submission.
A.5 Pellichar (ASHRAE seed; 7 tables; ~dozen obstacle patterns)
| Rank | Provider | Model | Route | Agg % | g+ | Wall | Cost ($) |
|---|---|---|---|---|---|---|---|
| 1 | DeepSeek | V4-Pro | FP | 52.50 | 11 | 23:46 | 0.06 |
| 2 | z.ai | GLM-5.1 | FP | 48.50 | 11 | 19:43 | 0.23 |
| 3 | Anthropic | Fable 5 | FP | 46.75 | | 25:07 | |
| 4 | Anthropic | Opus 4.8 | FP | 45.25 | 10 | 11:43 | 1.64 |
| 5 | OpenAI | GPT-5.5 | FP | 44.25 | 9 | 23:15 | 1.49 |
| 6 | Anthropic | Opus 4.7 | FP | 44.00 | 9 | 13:22 | 1.24 |
| 7 | MiniMax | M2.7 | FP | 37.75 | 7 | 08:57 | 0.04 |
| 8 | Alibaba | Qwen3-Coder-480B | BR | 35.25 | 6 | 54:44 | 0.61 |
| 9 | Mistral | Large 3 | BR | 30.00 | 6 | 11:48 | 1.12 |
| 10 | Meta | Llama 4 Maverick | BR | 1.50 | 0 | 00:46 | 0.01 |
| 11 | Amazon | Nova Pro | BR | 0.00 | 0 | 01:51 | 0.28 |
| — | Moonshot | Kimi K2.6 | FP | no sub | | 13:18 | 0.05 |
| — | Google | Gemini 3.5 Flash | FP | not run | | — | — |
Top 52.50 (DeepSeek V4-Pro — the single highest SWE-agent score on the original five); median 40.88. Native best 48.50 (GPT-5.5 Codex). The one of the original five where the best native run sits below the top SWE-agent run.
A.6 Aurora Transit (Kaggle ride-bookings; 16 tables; 80 obstacles)
| Rank | Provider | Model | Route | Agg % | Wall |
|---|---|---|---|---|---|
| 1 | Anthropic | Fable 5 | FP | 80.75 | 23:15 |
| 2 | z.ai | GLM-5.1 | FP | 52.25 | 29:10 |
| 3 | DeepSeek | V4-Pro | FP | 49.00 | 16:55 |
| 4 | Anthropic | Opus 4.8 | FP | 49.00 | 08:05 |
| 5 | OpenAI | GPT-5.5 | FP | 46.00 | 07:03 |
| 6 | Anthropic | Opus 4.7 | FP | 43.25 | 07:12 |
| 7 | MiniMax | M2.7 | FP | 30.75 | 10:01 |
| 8 | Alibaba | Qwen3-Coder-480B | BR | 20.50 | 04:02 |
| 9 | Mistral | Large 3 | BR | 16.00 | 05:48 |
| 10 | Amazon | Nova Pro | BR | 2.25 | 30:46 |
| — | Moonshot | Kimi K2.6 | FP | no sub | — |
| — | Meta | Llama 4 Maverick | BR | no sub | — |
| — | Google | Gemini 3.5 Flash | FP | no sub | — |
Fable 5 now leads at 80.75 and also posts the top native run (73.25 under Claude Code). Best non-Fable: 52.25 (GLM-5.1); next-best native 70.25 (Opus 4.8 CC max). Among the six new scenarios this is the highest Opus 4.8 cell and the largest single native-vs-SWE-agent lift (+21.25). Sixteen-table corpus with 80 seeded obstacles, the largest table count and obstacle count in the suite. Question trio q05/q11/q19 (flagship-district rush-hour metrics) defeats every scored SWE-agent run.
A.7 Crosswind (Kaggle M5 Forecasting (Walmart); 6 tables)
| Rank | Provider | Model | Route | Agg % | Wall |
|---|---|---|---|---|---|
| 1 | Anthropic | Fable 5 | FP | 76.50 | 14:59 |
| 2 | z.ai | GLM-5.1 | FP | 42.75 | 09:48 |
| 3 | Anthropic | Opus 4.7 | FP | 36.50 | 07:39 |
| 4 | OpenAI | GPT-5.5 | FP | 31.00 | 08:10 |
| 5 | DeepSeek | V4-Pro | FP | 30.25 | 10:19 |
| 6 | MiniMax | M2.7 | FP | 30.25 | 09:57 |
| 7 | Anthropic | Opus 4.8 | FP | 22.75 | 05:35 |
| 8 | Mistral | Large 3 | BR | 15.50 | 03:37 |
| 9 | Alibaba | Qwen3-Coder-480B | BR | 15.50 | 08:13 |
| 10 | Meta | Llama 4 Maverick | BR | 0.75 | 00:55 |
| 11 | Amazon | Nova Pro | BR | 0.00 | 11:31 |
| — | Moonshot | Kimi K2.6 | FP | no sub | — |
| — | Google | Gemini 3.5 Flash | FP | no sub | — |
Fable 5 now leads at 76.50 and also posts the top native run (81.50 under Claude Code). Best non-Fable: 42.75 (GLM-5.1). Next-best native 40.75 (Opus 4.8 CC max); excluding Fable 5, this is a scenario where the top SWE-agent score exceeds the top native score across all three native conditions. The non-Fable cohort spans only 42.75 points, and a three-question hard core (q12, q13, q19) lands at worse_than_naive for every scored run, native and SWE-agent alike.
A.8 Helios (JPX Tokyo Stock Exchange seed; 5 tables)
| Rank | Provider | Model | Route | Agg % | Wall |
|---|---|---|---|---|---|
| 1 | Anthropic | Opus 4.8 | FP | 58.00 | 04:54 |
| 2 | Anthropic | Fable 5 | FP | 52.75 | 14:55 |
| 3 | Anthropic | Opus 4.7 | FP | 43.75 | 08:56 |
| 4 | Google | Gemini 3.5 Flash | FP | 43.50 | 24:04 |
| 5 | DeepSeek | V4-Pro | FP | 36.00 | 15:33 |
| 6 | OpenAI | GPT-5.5 | FP | 24.50 | 06:27 |
| 7 | z.ai | GLM-5.1 | FP | 20.50 | 10:12 |
| 8 | Mistral | Large 3 | BR | 17.50 | 06:19 |
| 9 | Alibaba | Qwen3-Coder-480B | BR | 17.50 | 10:29 |
| 10 | MiniMax | M2.7 | FP | 13.75 | 03:46 |
| 11 | Meta | Llama 4 Maverick | BR | 3.75 | 00:54 |
| 12 | Amazon | Nova Pro | BR | 3.75 | 04:01 |
| — | Moonshot | Kimi K2.6 | FP | no sub | — |
Top 58.00 (Opus 4.8). Native best 46.25 (Opus 4.8 CC max). Among the six new scenarios this is the only one where Opus 4.8 under SWE-agent exceeds its own native run (−11.75 native lift). Five-question hard core (q01, q02, q03, q13, q18: per-fuel / per-board / per-tier / monthly dispatch-energy roll-ups) at worse_than_naive for every one of 14 scored runs (SWE-agent + native), driven by a single per-asset multiplicative meter-scale quantisation that no configuration recovers.
A.9 Helix Marine (Kaggle Zillow Prize; 8 tables)
| Rank | Provider | Model | Route | Agg % | Wall |
|---|---|---|---|---|---|
| 1 | Anthropic | Opus 4.8 | FP | 55.00 | 05:30 |
| 2 | OpenAI | GPT-5.5 | FP | 50.75 | 07:48 |
| 3 | Anthropic | Opus 4.7 | FP | 48.75 | 05:05 |
| 4 | Google | Gemini 3.5 Flash | FP | 47.00 | 18:47 |
| 5 | Anthropic | Fable 5 | FP | 46.50 | 14:52 |
| 6 | z.ai | GLM-5.1 | FP | 42.75 | 19:29 |
| 7 | MiniMax | M2.7 | FP | 40.00 | 03:14 |
| 8 | DeepSeek | V4-Pro | FP | 39.50 | 06:57 |
| 9 | Mistral | Large 3 | BR | 39.00 | 03:07 |
| 10 | Alibaba | Qwen3-Coder-480B | BR | 33.50 | 03:10 |
| 11 | Meta | Llama 4 Maverick | BR | 0.00 | 00:52 |
| 12 | Amazon | Nova Pro | BR | 0.00 | 01:12 |
| — | Moonshot | Kimi K2.6 | FP | no sub | — |
Top 55.00 (Opus 4.8); native best 58.25 (Opus 4.8 CC max). Tight top — four
models within 8 points (Opus 4.8 55, GPT-5.5 50.75, Opus 4.7 48.75, Gemini 3.5
Flash 47). Five-question hard core (q02, q03, q04, q05, q13)
naive-or-worse_than_naive for every scored SWE-agent run.
A.10 Jobpost Analytics Co. (Kaggle LinkedIn Job Postings; 12 tables; 60 obstacles)
| Rank | Provider | Model | Route | Agg % | Wall |
|---|---|---|---|---|---|
| 1 | Anthropic | Fable 5 | FP | 63.25 | 10:51 |
| 2 | MiniMax | M2.7 | FP | 55.00 | 03:53 |
| 3 | Anthropic | Opus 4.7 | FP | 54.25 | 04:59 |
| 4 | DeepSeek | V4-Pro | FP | 52.50 | 21:42 |
| 5 | OpenAI | GPT-5.5 | FP | 51.50 | 05:40 |
| 6 | Anthropic | Opus 4.8 | FP | 48.50 | 04:12 |
| 7 | z.ai | GLM-5.1 | FP | 48.50 | 10:00 |
| 8 | Alibaba | Qwen3-Coder-480B | BR | 17.25 | 02:03 |
| 9 | Mistral | Large 3 | BR | 17.00 | 02:29 |
| 10 | Amazon | Nova Pro | BR | 2.25 | 02:04 |
| — | Moonshot | Kimi K2.6 | FP | no sub | — |
| — | Meta | Llama 4 Maverick | BR | no sub | — |
| — | Google | Gemini 3.5 Flash | FP | no sub | — |
Fable 5 now leads at 63.25 (native 56.25 under Claude Code). Best non-Fable: 55.00 (MiniMax M2.7, the first scenario in the suite where a non-frontier provider tops the non-Fable SWE-agent board); native best 65.25 (GPT-5.5 Codex, the single highest cell on this scenario). Below Fable 5, a six-model leading band is packed into 6.5 points (55.00 → 48.50), then a cliff to a two-model tail (Qwen 17.25, Mistral 17.00) and a single-model floor (Nova Pro 2.25).
A.11 Kitchen Supply Co. (Kaggle Instacart Online Grocery; 13 tables)
| Rank | Provider | Model | Route | Agg % | Wall |
|---|---|---|---|---|---|
| 1 | Anthropic | Fable 5 | FP | 74.75 | 13:21 |
| 2 | z.ai | GLM-5.1 | FP | 46.50 | 10:24 |
| 3 | Anthropic | Opus 4.7 | FP | 45.75 | 05:59 |
| 4 | Google | Gemini 3.5 Flash | FP | 41.25 | 35:42 |
| 5 | DeepSeek | V4-Pro | FP | 40.00 | 07:40 |
| 6 | Anthropic | Opus 4.8 | FP | 39.75 | 04:04 |
| 7 | OpenAI | GPT-5.5 | FP | 38.75 | 03:14 |
| 8 | MiniMax | M2.7 | FP | 30.25 | 13:46 |
| 9 | Alibaba | Qwen3-Coder-480B | BR | 25.75 | 05:38 |
| 10 | Mistral | Large 3 | BR | 15.25 | 03:41 |
| 11 | Meta | Llama 4 Maverick | BR | 3.75 | 00:47 |
| — | Moonshot | Kimi K2.6 | FP | no sub | — |
| — | Amazon | Nova Pro | BR | no sub | — |
Fable 5 now leads at 74.75 and also posts the top native run (67.00 under Claude Code). Best non-Fable: 46.50 (GLM-5.1); next-best native 47.25 (GPT-5.5 Codex). Below Fable 5 this is the flattest leaderboard in the suite: six endpoints within 8 points (46.50 → 38.75), with no model running away. Ten questions are naive-or-worse_than_naive for every scored endpoint and all three native agents; a five-question spine (q01, q07, q14, q19, q20) is worse_than_naive for every run in the report.
A.12 Elo (road-freight LTL seed; 5 tables; 25 obstacles)
| Rank | Provider | Model | Route | Agg % | Wall |
|---|---|---|---|---|---|
| 1 | Anthropic | Opus 4.8 | FP | 62.50 | 61:40 |
| 2 | Anthropic | Opus 4.7 | FP | 61.00 | 07:32 |
| 3 | DeepSeek | V4-Pro | FP | 59.75 | 11:16 |
| 4 | Anthropic | Fable 5 | FP | 59.00 | 12:03 |
| 5 | OpenAI | GPT-5.5 | FP | 57.25 | 07:33 |
| 6 | Mistral | Large 3 | BR | 34.50 | 05:40 |
| 7 | MiniMax | M2.7 | FP | 32.50 | 08:43 |
| 8 | Alibaba | Qwen3-Coder-480B | BR | 19.75 | 04:57 |
| — | z.ai | GLM-5.1 | FP | no sub | — |
| — | Moonshot | Kimi K2.6 | FP | no sub | — |
| — | Meta | Llama 4 Maverick | BR | no sub | — |
| — | Amazon | Nova Pro | BR | no sub | — |
| — | Google | Gemini 3.5 Flash | FP | no sub | — |
Top 62.50 (Opus 4.8; an infra DNF in the first build re-ran cleanly); native best
66.50 (Fable 5 — Claude Code, ahead of Opus 4.8's 65.25). A tight four-way top —
Opus 4.8, Opus 4.7, V4-Pro, and Fable 5 within 3.5 points. q11 (a per-hub roll-up)
is worse_than_naive for every scored cell and all four natives. The closest
Fable-versus-Opus match in the suite: neither model uses verification sub-agents,
so the small gap is a single units-interpretation call, not a delegation pattern
(§8.4).
Appendix B. Native-vs-SWE-agent per Scenario
For the three probed conditions the table gives the per-scenario aggregate
under each harness, holding the model and the build fixed. Native Claude Code
runs Opus 4.8 at max on all eleven scenarios (and xhigh on the original
five only); Opus 4.7 at max on all eleven; GPT-5.5 runs under Codex on all
eleven.
| Scenario | Op4.8 SWE | Op4.8 CC max | Op4.7 SWE | Op4.7 CC max | GPT SWE | GPT Codex |
|---|---|---|---|---|---|---|
| Brierdale | 59.00 | 40.00 | 50.25 | 63.00 | 55.50 | 58.00 |
| Brindholm | 46.00 | 71.00 | 47.50 | 76.50 | 47.50 | 70.50 |
| Brivane | 30.75 | 29.50 | 26.00 | 38.00 | 23.00 | 24.50 |
| Halberon | 43.50 | 47.75 | 33.75 | 41.75 | 39.50 | 57.00 |
| Pellichar | 45.25 | 42.50 | 44.00 | 37.75 | 44.25 | 48.50 |
| Aurora Transit | 49.00 | 70.25 | 43.25 | 55.50 | 46.00 | 59.00 |
| Crosswind | 22.75 | 40.75 | 36.50 | 40.00 | 31.00 | 40.00 |
| Helios | 58.00 | 46.25 | 43.75 | 42.25 | 24.50 | 44.50 |
| Helix Marine | 55.00 | 58.25 | 48.75 | 56.25 | 50.75 | 49.25 |
| Jobpost Analytics Co. | 48.50 | 58.00 | 54.25 | 61.50 | 51.50 | 65.25 |
| Kitchen Supply Co. | 39.75 | 45.75 | 45.75 | 46.50 | 38.75 | 47.25 |
| Average (11) | 45.23 | 50.00 | 43.07 | 50.82 | 41.11 | 51.25 |
Reading the grid:
- Opus 4.8 native lift (SWE → CC max): +4.77 cross-scenario (much wider
than the +1.25 reported on five; the recovery is driven by aurora_transit
+21.25 and brierdale's −19 being only one of eleven cells).
- Opus 4.7 native lift: +7.75 cross-scenario (narrower than +11.10 on five;
attenuated by helios where the native lift is negative, −1.50).
- GPT-5.5 native lift: +10.14 cross-scenario (essentially unchanged from
+9.75 on five).
- Among the six new scenarios, two invert the native-lift sign for at least one model: helios (Opus 4.8 SWE 58.00 > native 46.25) and helix_marine (GPT-5.5 SWE 50.75 > native 49.25, a smaller +1.5 SWE-advantage).
Native cost per run ranges $22–75 (Claude Code) and ≈$1.1–1.8 (Codex), versus $0.02–4.15 under SWE-agent; the most expensive runs in the study are the native Claude Code baselines. One trial per cell; build-consistency caveat in §9.
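The cross-scenario lifts in the reading notes follow directly from the grid. The sketch below recomputes them as the mean of per-scenario (native − SWE-agent) differences; the tuple layout and the `average_lift` helper are illustrative assumptions, and the numbers are copied from the table above.

```python
from statistics import mean

# Rows copied from the Appendix B grid. Column order per scenario:
# (Op4.8 SWE, Op4.8 CC max, Op4.7 SWE, Op4.7 CC max, GPT SWE, GPT Codex)
GRID = {
    "brierdale": (59.00, 40.00, 50.25, 63.00, 55.50, 58.00),
    "brindholm": (46.00, 71.00, 47.50, 76.50, 47.50, 70.50),
    "brivane": (30.75, 29.50, 26.00, 38.00, 23.00, 24.50),
    "halberon": (43.50, 47.75, 33.75, 41.75, 39.50, 57.00),
    "pellichar": (45.25, 42.50, 44.00, 37.75, 44.25, 48.50),
    "aurora_transit": (49.00, 70.25, 43.25, 55.50, 46.00, 59.00),
    "crosswind": (22.75, 40.75, 36.50, 40.00, 31.00, 40.00),
    "helios": (58.00, 46.25, 43.75, 42.25, 24.50, 44.50),
    "helix_marine": (55.00, 58.25, 48.75, 56.25, 50.75, 49.25),
    "jobpost_analytics_co": (48.50, 58.00, 54.25, 61.50, 51.50, 65.25),
    "kitchen_supply_co": (39.75, 45.75, 45.75, 46.50, 38.75, 47.25),
}

# Index of the SWE-agent column for each model; the native column follows it.
SWE_COLUMN = {"Opus 4.8": 0, "Opus 4.7": 2, "GPT-5.5": 4}


def per_scenario_lift(model):
    """Native minus SWE-agent aggregate for each scenario."""
    swe = SWE_COLUMN[model]
    return {name: row[swe + 1] - row[swe] for name, row in GRID.items()}


def average_lift(model):
    """Mean native lift across all scenarios in the grid, to 2 d.p."""
    return round(mean(list(per_scenario_lift(model).values())), 2)


for model in SWE_COLUMN:
    print(f"{model}: {average_lift(model):+.2f}")
```

Because the same eleven scenarios enter both columns, the mean of the differences equals the difference of the column averages in the table's last row.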
Appendix C. Persona-Calibration Summary
The rubric thresholds for each scenario are anchored via the six-persona stress test of §5.3 (personas constructed by inverting the per-question grading metric, so each lands at its targeted anchor). Measured band centres for the two scenarios with published values from the original five:
| Persona | Target band | Brierdale | Halberon |
|---|---|---|---|
| Literal-clean reference (canonical verbatim) | 100 % | 100.0 % | 100.0 % |
| Naive baseline | 10–20 % | 14.2 % | 15.0 % |
| No-hints-thorough baseline | 18–40 % | 31.5 % | 25.0 % |
| Excellent baseline (canonical pipeline on dirty data) | 75–95 % | 86.5 % | 91.8 % |
| Stress-test pattern A | 18–50 % | 27.8 % | 31.5 % |
| Stress-test pattern B | 18–50 % | 25.0 % | 32.0 % |
(Brindholm, Brivane, and Pellichar report their band centres as within-or-just-below their target bands without publishing every measured percentage; the literal-clean reference is 100.0 % in all five. Pellichar's published centres are naive 14.2 %, no-hints 33.0 %, excellent 94.2 %, stress A 26.2 %, stress B 27.0 %. The six new scenarios pass the same reachability gates and are recorded internally.)
Every scenario passes the reachability gates: strict naive < good < excellent ordering on all twenty questions; naive-to-excellent spread ≥ 60 pp (brierdale-class spreads run 76–80 pp); no-hints-to-excellent spread ≥ 35 pp; and meaningful naive-to-excellent separation on at least fourteen of twenty questions. All eleven pass the dead-code / trap-test floor: a dumb-dirty no-cleanup baseline matches the canonical on at most two of twenty questions, and a comprehensive-canonicalisation submission scores below 40 % with at least half its questions at-or-below naive and zero bit-for-bit excellent matches.
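The ordering and spread gates described above can be sketched as simple checks. This is a minimal illustration, not the rl-gym implementation: the class and function names are hypothetical, and the spread gates are applied here to the published aggregate band centres, whereas the actual gates may be evaluated per question.

```python
from dataclasses import dataclass


@dataclass
class BandCentres:
    """Measured persona band centres for one scenario, in percent."""
    naive: float
    no_hints: float
    excellent: float


def passes_spread_gates(c, min_naive_spread=60.0, min_no_hints_spread=35.0):
    """Naive-to-excellent >= 60 pp and no-hints-to-excellent >= 35 pp."""
    return (
        c.excellent - c.naive >= min_naive_spread
        and c.excellent - c.no_hints >= min_no_hints_spread
    )


def strict_ordering(naive, good, excellent):
    """True if naive < good < excellent holds on every question.

    Each argument is a per-question score list of equal length
    (hypothetical layout; the report does not publish these lists).
    """
    return all(n < g < e for n, g, e in zip(naive, good, excellent))


# Band centres published in this appendix.
PUBLISHED = {
    "brierdale": BandCentres(naive=14.2, no_hints=31.5, excellent=86.5),
    "halberon": BandCentres(naive=15.0, no_hints=25.0, excellent=91.8),
    "pellichar": BandCentres(naive=14.2, no_hints=33.0, excellent=94.2),
}

for name, centres in PUBLISHED.items():
    print(name, passes_spread_gates(centres))
```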
References
- [BIG-bench] Srivastava, A., Rastogi, A., Rao, A., et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. arXiv:2206.04615, 2022.
- [GPQA] Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., Bowman, S. R. GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv:2311.12022, 2023.
- [HELM] Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., et al. Holistic Evaluation of Language Models. TMLR, August 2023. arXiv:2211.09110.
- [HumanEval] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. O., Kaplan, J., et al. Evaluating Large Language Models Trained on Code. arXiv:2107.03374, 2021.
- [IFEval] Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y., Zhou, D., Hou, L. Instruction-Following Evaluation for Large Language Models. arXiv:2311.07911, 2023.
- [LiveCodeBench] Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S. I., Solar-Lezama, A., Sen, K., Stoica, I. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. arXiv:2403.07974, 2024.
- [MATH] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., Steinhardt, J. Measuring Mathematical Problem Solving With the MATH Dataset. arXiv:2103.03874, 2021.
- [MMLU] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J. Measuring Massive Multitask Language Understanding. ICLR 2021. arXiv:2009.03300.
- [MMLU-Pro] Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., et al. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. arXiv:2406.01574, 2024.
- [regression] rl-gym internal sweep. Opus 4.7 → 4.8 native-vs-SWE-agent per-scenario distribution (eleven scenarios; brierdale outlier). Frozen-build cross-scenario comparison, 2026.
- [SWE-agent] Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., Press, O. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. NeurIPS 2024. arXiv:2405.15793.
- [SWE-bench] Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., Narasimhan, K. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024. arXiv:2310.06770.
End of rl-gym combined technical report — Data-Quality Track, eleven scenarios, SWE-agent harness cohort with native baselines.