rl-gym: Data-Quality Track — Combined Technical Report (Twelve Scenarios, Frontier Flagship Cohort)

Publication draft, 2026-06-03.


1. Abstract

We introduce rl-gym, an agent-native benchmark that evaluates whether a language-model agent can complete an end-to-end synthetic data-science take-home: read a stakeholder brief, inspect a dirty multi-table relational dataset, recover from seeded data-quality failures, and submit a JSON file containing numeric answers to twenty business questions, each scored against a hidden canonical key by a deterministic rubric. Unlike multiple-choice knowledge benchmarks, code-execution suites, or atomic-instruction tests, rl-gym grades the answer artifact an analyst hands to a stakeholder, treating the intervening reasoning, tool use, and recovery as opaque.

This report aggregates twelve frozen data-quality scenarios — the original five (Brierdale, Brindholm, Brivane, Halberon, Pellichar) plus seven newly authored covers (Aurora Transit, Crosswind, Elo, Helios, Helix Marine, Jobpost Analytics Co., Kitchen Supply Co.) — each a distinct fictional cover over a different procedurally-reseeded public dataset, each carrying its own multi-table corpus, twenty-question pack, and seeded obstacle catalogue. We assemble a cross-scenario leaderboard of each major provider's current flagship model under one harness — SWE-agent [SWE-agent] — scoring each flagship on every scenario in which it produced a valid submission. Most providers shipped a new flagship we ran through the vendor's own first-party API (Anthropic Opus 4.8 and the prior-generation Opus 4.7, OpenAI GPT-5.5, DeepSeek V4-Pro, z.ai GLM-5.1, MiniMax M2.7, Moonshot Kimi K2.6, Google Gemini 3.5 Flash); Anthropic's newest model, Fable 5, was run on all twelve scenarios. The remaining flagships are carried at their Bedrock-hosted scores (Alibaba, Mistral, Meta, Amazon). For the three Anthropic models and for GPT-5.5 we additionally run their native production agents — Fable 5, Opus 4.8, and Opus 4.7 under Claude Code, GPT-5.5 under Codex — to isolate how much of a score is the harness rather than the model.

The headline of this report is that the sign of the Opus 4.7 → 4.8 upgrade is scenario-dependent, and that Fable 5, added afterward, leads the suite on both harnesses. Under SWE-agent, the Opus upgrade is a small cross-scenario win: Opus 4.8 averages 46.67 against Opus 4.7's 44.56 over the twelve scenarios (Δ = +2.11) and leads the rest of the cohort. Fable 5, run on all twelve, nonetheless leads the suite outright at 57.83, topping eight of the twelve boards. (The raw +11.2 bare-API gap over Opus 4.8 is not regime-matched: SWE-agent passes no thinking budget, so Opus runs thinking-off while Fable's adaptive thinking is always on; regime-matched, the lead is ≈ +9 and concentrated in two scenarios, §8.4.) Under native Claude Code at matched max effort, the two Opus generations are essentially tied (Opus 4.8 51.27 vs Opus 4.7 51.98, Δ = −0.71), while Fable 5 leads the native board at 61.35. The per-scenario Opus native deltas (4.8 − 4.7, max) span −23.0 (brierdale) to +14.75 (aurora_transit), with a sign distribution of 6 positive / 5 negative / 1 flat across the twelve scenarios. The previous five-scenario report identified a cross-scenario native regression (−5.25) and treated it as the suite's flagship finding; the wider eleven-scenario sample does not support that generalisation. Brierdale (−23.0) remains the largest single anomaly, and a controlled effort-tier comparison identifies it as orchestration-driven (§8.1), but the six new scenarios shift the net cross-scenario native delta to −0.82, and on four of those six the upgrade is positive under the same native agent (aurora_transit +14.75, helios +4.00, helix_marine +2.00, crosswind +0.75). The honest reading is this: under SWE-agent the upgrade is mildly positive and consistent; under native Claude Code the upgrade averages near zero with high per-scenario variance, dominated by the brierdale outlier. We therefore report a per-scenario distribution rather than a single trend.

Three further cross-scenario findings emerge. First, the cross-scenario leaderboard is shaped by provider, not routing. Fable 5 leads at 57.83 over twelve scenarios; behind it a first-party band (Opus 4.8 46.67, Opus 4.7 44.56, DeepSeek V4-Pro 42.90, GPT-5.5 42.46, GLM-5.1 40.70) clusters inside six points, ahead of MiniMax M2.7 (29.40) and Bedrock-hosted Qwen3-Coder-480B (22.27) / Mistral Large 3 (19.96); Gemini 3.5 Flash averages 43.92 over its three valid submissions but covers less than half the suite. The cluster at the top has tightened over the broader sample: the spread inside the top five drops from 6.0 points (5-scenario) to 4.5 points (11-scenario), and the order of the middle three (Opus 4.7, V4-Pro, GPT-5.5) is reshuffled. Second, no single model wins every scenario, although Fable 5 comes close. Fable 5 tops eight bare-API boards (brierdale 60.5, brindholm 51.0, brivane 38.0, halberon 44.25, aurora_transit 80.75, crosswind 76.5, jobpost_analytics_co 63.25, kitchen_supply_co 74.75), Opus 4.8 tops three (helios 58.0, helix_marine 55.0, elo 62.5), and DeepSeek V4-Pro tops pellichar (52.50), so even the strongest model loses a third of the scenarios to a rival. Third, the harness-vs-native delta is itself scenario-dependent in sign. On helios, SWE-agent Opus 4.8 (58.00) out-scores its own native Claude Code run (46.25); on aurora_transit the same Opus 4.8 is 21.25 points below its native run (49.00 vs 70.25). The native-lift story is no longer "natives generally beat SWE-agent"; it is "natives usually beat SWE-agent except where the SWE-agent operating point happens to suit the scenario, of which we now have at least two clear cases (helios, crosswind)."

2. Introduction

Saturation of static benchmarks (MMLU, GSM8K, HumanEval) has pushed evaluation toward larger, harder, or more contamination-resistant variants (MMLU-Pro, GPQA, LiveCodeBench, SWE-bench). Yet a category remains under-served: the multi-question, rubric-graded, agentic deliverable that mimics what a junior analyst hands to a hiring manager. Existing benchmarks address adjacent shapes: knowledge multiple-choice, code and agent execution, instruction following, and single-shot canonical-answer grading.

None tests whether an agent can recover from messy data it has never seen, deliver twenty numeric answers, and have those answers checked against hidden ground truth. rl-gym fills this gap. Its design commitments are:

  1. Open-book over a private synthetic corpus. The agent is given dirty CSVs and a stakeholder brief; the canonical answers, the obstacle catalogue, and the rubric are hidden. Pretraining contamination is structurally blocked by the freshly synthesised fictional cover.
  2. Twenty questions, no closed-form gimmes. The question pack is designed so that one groupby().agg() call does not produce the right answer for any non-trivial slot. Every question targets at least one catalogued data-quality failure mode.
  3. Rubric-as-grader, not human-as-grader. A deterministic per-question rubric maps the submitted numeric value to one of seven verdict tiers; the agent's reasoning trace and any explanatory text are never read by the grader.
  4. Persona-calibrated thresholds. Tier thresholds are anchored using six synthesised personas whose submissions are constructed by inverting the per-question grading metric. This decouples threshold tuning from obstacle authorship.
  5. Single-artifact deliverable. The agent's submission is a JSON file with bare submission keys q01..q20. No code is executed, no tests are run, no auxiliary files are scored.
  6. Frozen builds. Each scenario's corpus, question pack, rubric, and construction seed are fixed for this report.
  7. Agent-and-condition agnostic. The harness is silent about prompts, tool sets, or step budgets; comparable runs document their conditions, but the benchmark grades only the final JSON.

This report asks four questions across the twelve-scenario suite:

  1. When the cohort is restricted to current flagship models served at their vendors' first-party APIs, does the suite produce a wide model-quality gradient, or does the spread collapse?
  2. Which scenarios and which questions discriminate inside the top cohort, and which collapse to universal success or universal failure?
  3. How much of a model's score is attributable to the harness rather than the model? Native-agent baselines on every scenario isolate the harness effect.
  4. Is the sign of a model version upgrade (Opus 4.7 → 4.8) consistent across scenarios, or is it scenario-dependent? The previous five-scenario report identified a cross-scenario native regression and treated it as a single finding; the wider eleven-scenario sample reframes this as per-scenario heterogeneity dominated by one large negative outlier (§8.1).

3. Related Work

Knowledge-MCQ benchmarks. MMLU [MMLU] and its successor MMLU-Pro [MMLU-Pro] test broad knowledge through multiple choice: four-way over 57 subjects in MMLU, up to ten-way over 14 subject areas in MMLU-Pro. BIG-bench [BIG-bench] aggregates 200+ tasks ranging from translation to logical inference, exposing them through a programmatic API. HELM [HELM] takes a different aggregation strategy, sweeping a fixed model set across a curated task taxonomy and reporting multiple metric categories per task. All four share the property that the artifact graded is a short text selection, the dataset is fully published, and the construction pipeline yields questions designed to admit a unique correct option. rl-gym differs in three ways: the artifact is a structured JSON of twenty numeric values, the dataset is private and synthetic, and the questions are written so that no single-line query is correct.

Code and agent benchmarks. HumanEval [HumanEval] introduced functional-correctness grading via unit-test execution, formalised pass@k as the metric, and proved to be saturable through fine-tuning on similar surface forms. LiveCodeBench [LiveCodeBench] extends this with a continuously-updating problem stream to mitigate contamination, multiple task types, and difficulty bands. SWE-bench [SWE-bench] moves to repository-scale evaluation: issue text plus codebase to patch, graded by re-running the project's own test suite. rl-gym shares with these benchmarks the commitment to execution-grounded grading (the canonical answers are computed from the clean source by code, not adjudicated by humans), but the executed artifact is the canonical-answer pipeline, not the candidate's code. The candidate's code is not seen, executed, or scored.

Instruction-following and constrained-output benchmarks. IFEval [IFEval] evaluates whether an LLM follows atomic, machine-verifiable instructions. It reports a strict and a loose match. rl-gym extends this format-compliance idea: each of the twenty questions has an "expected format" (a dict keyed by an entity, a scalar, an integer count, a set of names), and submissions failing format validation are dropped to the lowest tier. But the analytic content — the numeric value — is the primary grading axis, not the format compliance.

Single-shot canonical-answer benchmarks. MATH [MATH] grades against boxed numeric or expression answers extracted via regex normalisation; GPQA [GPQA] grades multiple-choice questions written by domain experts to be objective even to motivated non-experts. Both benchmarks share with rl-gym the property that the canonical answer is unambiguous and the grader is deterministic. They differ in dataset scale (a single short premise vs. a multi-table relational corpus), in the number of answers per "instance" (one vs. twenty), and in the construction pipeline (expert-write-and-validate vs. dataset-synthesise-and-obstacle-inject).

Agent harness as a benchmark variable. A growing thread in benchmark methodology treats the agent scaffold itself as part of the system under test rather than as a fixed wrapper. SWE-bench's leaderboard explicitly conditions on the harness (agentless, SWE-agent [SWE-agent], OpenHands, first-party scaffolds). This report holds the harness constant for the leaderboard — every ranked entrant runs through SWE-agent — so that the cross-scenario leaderboard measures provider/model signal at the SWE-agent operating point, then varies the harness deliberately in §8.1 to characterise the per-scenario distribution of the within-provider generation upgrade. The eleven-scenario replication is, to our knowledge, the strongest demonstration to date that a measured upgrade direction under a single harness can be a scenario property rather than a model property.

The gap rl-gym fills. No published benchmark we are aware of simultaneously requires: (i) reasoning over a multi-table dirty dataset; (ii) producing twenty numeric answers, each with its own metric; (iii) scoring via a hidden, deterministic rubric anchored by persona calibration; (iv) per-instance contamination resistance via synthetic-cover-over-real-source construction; and (v) replication of the same evaluation shape across twelve disjoint industry covers. The closest functional analog is the data-science take-home interview itself, but those are rarely reproducible or graded against published rubrics.

4. The rl-gym Suite

4.1 Concept

An rl-gym evaluation presents the candidate (a language-model agent) with: a stakeholder brief introducing a fictional organisation; a private synthetic relational dataset; and a list of twenty business questions, each phrased in a different stakeholder voice. The candidate is told the data is dirty and to expect cleanup work, but is never told which fields, rows, or relationships have been tampered with. The candidate submits a JSON file containing one value per question (q01..q20); a deterministic grader scores each value against a hidden canonical answer derived by running a project-internal pipeline on the clean source data. The aggregate score is the arithmetic mean of per-question tier percentages.

Figure 1. rl-gym pipeline overview. The agent sees only the upper row (stakeholder brief, dirty CSVs, twenty questions) and emits submission.json. The lower row (clean source, per-question canonical pipeline, canonical answers) is hidden; the grader compares the agent's JSON against the canonical and emits per-question tier verdicts. The same pipeline shape is instantiated twelve times, once per scenario, over twelve different source datasets and fictional covers.

4.2 The twelve scenarios

Each scenario is a distinct fictional cover over a different procedurally-reseeded public dataset. All twelve share the twenty-question pack size, the seven-tier rubric, the persona-calibration protocol, and the single-JSON deliverable; they differ in corpus shape, obstacle count, metric mix, and the character of their difficulty (reconciliation-depth versus shape-reading).

Brierdale — a procedurally-reseeded Kaggle FMCG sales 1M-row dataset recast as Brierdale Waste & Recovery Group, a pan-European waste-and-recovery operator built from a 2020–2022 acquisition wave consolidating three predecessor operators. Five tables (stations, waste_streams, haulers, calendar, daily_manifests) carry 39 seeded obstacles. Signature: a five-table-join sanity gate and an anniversary-date attractor on a literal license_signed_date field.

Brindholm — a procedurally-reseeded public time-series corpus recast as Brindholm Coastal Terminals, a regional container-port operator formed in mid-2021 from seven predecessor operators with an in-flight platform migration. Nine tables carry 40 seeded obstacles. Signature: the widest shared failure floor in the suite and the largest native-vs-SWE-agent harness lift on the original five.

Brivane — a procedurally-reseeded Kaggle Recruit Restaurant Visitor Forecasting dataset recast as Brivane Clinical Operations Network, a contract research organisation that consolidated a sponsor-direct programme and an external broker network onto one platform. Seven tables carry 25 seeded obstacles, joined only through a corrupted cross-network bridge. Signature: the purest reconciliation test and the lowest absolute SWE-agent scores in the suite.

Halberon — a procedurally-reseeded Kaggle Polymarket export recast as the Halberon Meteoritical Council, a meteorite-authentication bureau formed in mid-2022 by a four-way merger and a 28-month archive-modernisation mandate. Six tables carry 30 seeded obstacles. Signature: deep status-flag and case-variant reconciliation plus an encoding-drift hazard.

Pellichar — a procedurally-reseeded Kaggle ASHRAE Great Energy Predictor III dataset recast as Pellichar Cooperage Trust, a whisky-cooperage consortium running hourly maturation-cellar telemetry across multiple jurisdictions. Seven tables carry a seeded obstacle catalogue clustering into roughly a dozen recurring patterns. Signature: several of its hardest questions fail on the output container (count-vs-share dict, list-vs-ratio dict, key cardinality) rather than on deep reconciliation, making it a specification-reading benchmark as much as a data-cleaning one.

Aurora Transit — a procedurally-reseeded Kaggle ride-bookings dataset recast as the Aurora Transit ride-hail and micro-mobility platform in the fictional coastal metro of Meridian. Sixteen candidate-facing tables (bookings, payments, fare breakdowns, ratings, cancellations, riders, drivers, vehicles, shifts, surge windows, promo codes, and reference catalogues) carry 80 seeded obstacles — the largest table count and obstacle count in the suite. Signature: a flagship-district rush-hour question pair (q11, q19) that defeats every scored SWE-agent run, and the largest native lift in the suite (Claude Code Opus 4.8 70.25 — the single highest cell anywhere).

Crosswind — a procedurally-reseeded Kaggle M5 Forecasting (Walmart) dataset recast as the Crosswind Pharmacy Network, a ten-pharmacy regional dispensing chain across three regions, over about five and a half years of daily activity. Six candidate-facing tables (pharmacy master, drug formulary, daily dispensing fact, reimbursement-rate schedule, calendar/event, weekly claim-volume rollup) carry seeded data-quality obstacles. Signature: the narrowest scored range in the suite (0.00 to 42.75) and an SWE/native parity where Opus 4.8 max, Opus 4.7 max, and GPT-5.5/Codex all land within a single point.

Helios — a procedurally-reseeded JPX Tokyo Stock Exchange Prediction dataset recast as the Helios Power Exchange wholesale electricity market run by the Borealis Grid Authority market-data desk. Five candidate-facing tables (asset master, two disjoint daily clearing-price boards, quarterly operator filings, weekly participant-flow bulletin) carry seeded obstacles. Signature: a five-question hard core (q01, q02, q03, q13, q18 — the per-fuel / per-board / per-tier / monthly dispatch-energy roll-ups) that lands at worse_than_naive for every scored run, SWE-agent and native alike, driven by a single deeply-seeded per-asset multiplicative meter-scale quantisation. The only scenario in the suite where SWE-agent Opus 4.8 (58.00) out-scores its own native Claude Code run (46.25).

Helix Marine — a procedurally-reseeded Kaggle Zillow Prize home-value dataset recast as the Helix Marine online brokerage for pre-owned recreational marine vessels, whose HelixValue automated-appraisal product stands in for the source dataset's home-value estimator. Eight tables (vessel master, build specs, amenities, marina dimension, annual appraisals, listing events, HelixValue estimates, reference-code lookup) carry seeded obstacles. Signature: a tight top under SWE-agent (Opus 4.8 55.00 → GPT-5.5 50.75 → Opus 4.7 48.75 → Gemini 3.5 Flash 47.00 — four models within 8 points) with a small native lift (+3.25 for Opus 4.8).

Jobpost Analytics Co. — a procedurally-reseeded Kaggle LinkedIn Job Postings dataset recast as a B2B labor-market intelligence vendor that resells job-posting data to recruitment agencies and corporate talent-acquisition teams. Twelve candidate-facing tables (employer master, postings fact, three posting-attribute junctions, salaries fact, two company-attribute tables, headcount-snapshot, three lookup tables) carry 60 seeded obstacles. Signature: the only scenario in the suite where a non-frontier provider leads the SWE-agent board (MiniMax M2.7 55.00 #1) inside a six-model leading band packed into 6.5 points, with native GPT-5.5/Codex (65.25) the single best cell.

Kitchen Supply Co. — a procedurally-reseeded Kaggle Instacart Online Grocery dataset recast as a B2B kitchen-supply distributor that ships to restaurants and institutional kitchens. Thirteen candidate-facing warehouse tables (accounts, orders, order lines, two-level product catalogue, suppliers, fulfillment centers, promotions, deliveries, inventory snapshots, price history, reference-code lookup) carry seeded obstacles. Signature: the flattest leaderboard in the suite (SWE-agent 3.75 to 46.50) with a ten-question naive-floor and a five-question worse_than_naive spine (q01, q07, q14, q19, q20) shared across every scored run; native baselines bunch tightly too (47.25 / 46.50 / 45.75 — barely two points across a generation).

Elo — a procedurally-reseeded public road-freight dataset recast as Caldermoor Freight Exchange, a national less-than-truckload (LTL) carrier network whose records straddle a platform-migration cutover. Five tables (a shipper-account dimension plus fact feeds either side of the migration) carry 25 seeded obstacles, including irrecoverable sentinel-overwrite losses. Signature: a tight four-way top (Opus 4.8, Opus 4.7, V4-Pro, Fable 5 within 3.5 points), a per-hub roll-up (q11) that defeats every scored cell, and the closest Fable-versus-Opus match in the suite (§8.4).

4.3 Common question-design rules

Every pack is governed by four rules: (1) no closed-form-gimme — the obvious schema-driven path is wrong for every non-trivial slot; (2) coverage matrix — every seeded obstacle is consumed by at least one question, and every question hits one or more obstacles; (3) banned-token lint — candidate-facing prose is linted against four lexicons (format/rubric terms, obstacle-design terms, cleanup phrases, structure words) and must return zero; (4) three-plausible-readings audit — each question is reviewed to enumerate at least three plausible interpretations and confirm the canonical pipeline targets the one a careful analyst would pick. Each pack also enforces a segmented/set-valued discipline: at least ten of twenty slots are on segmented or set-valued metrics, preventing a single lucky scalar from dominating the aggregate.
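
Rule (2), the coverage matrix, lends itself to a mechanical check. The sketch below is illustrative only: the function name coverage_gaps, the obstacle ids, and the dict-of-sets layout are assumptions made for this example, not the project's internal tooling.

```python
from typing import Dict, List, Set


def coverage_gaps(question_obstacles: Dict[str, Set[str]],
                  catalogue: Set[str]) -> Dict[str, List[str]]:
    """Return violations of the coverage-matrix rule.

    question_obstacles maps a question id (e.g. "q01") to the obstacle ids it
    exercises; catalogue is the full set of seeded obstacle ids. Both the ids
    and this data layout are hypothetical.
    """
    # Union of every obstacle touched by at least one question.
    consumed = set().union(*question_obstacles.values())
    return {
        # Seeded obstacles that no question consumes.
        "unconsumed_obstacles": sorted(catalogue - consumed),
        # Questions that hit no catalogued obstacle at all.
        "questions_without_obstacle": sorted(
            q for q, obs in question_obstacles.items() if not (obs & catalogue)
        ),
    }


# Toy matrix: obstacle O3 is never consumed and q03 hits nothing.
matrix = {"q01": {"O1", "O2"}, "q02": {"O2"}, "q03": set()}
gaps = coverage_gaps(matrix, {"O1", "O2", "O3"})
print(gaps)
```

A pack would satisfy the rule when both lists come back empty.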

4.4 Submission specification

The candidate produces a single JSON file with bare submission keys q01..q20 at solution/submission.json. Each value is a dictionary keyed by entity strings, a scalar float, an integer, a date string, or a list of identifier strings. A separate validator (validate_submission.py) checks structure and types only; it never grades. Submissions that fail validation are graded as if every question were at the worst tier. A canary string is included in each candidate brief to support later training-data-contamination checks; obstacles are never enumerated to the model, and each brief contains a single generic warning that the data should be expected to be dirty.
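
A structure-and-types check of the kind described above might look like the following. This is a minimal sketch and not the contents of validate_submission.py: the function names and error strings are hypothetical, and the per-question expected format (which questions take a dict, which a scalar, and so on) is not modelled because it is not published here.

```python
import json
from typing import Any, List

# The twenty bare submission keys q01..q20.
EXPECTED_KEYS = [f"q{i:02d}" for i in range(1, 21)]


def _is_allowed_value(value: Any) -> bool:
    """Accept the value shapes listed in the specification."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        return False
    if isinstance(value, (int, float, str)):
        # Integer, scalar float, or date string (date format not checked here).
        return True
    if isinstance(value, list):
        # List of identifier strings.
        return all(isinstance(item, str) for item in value)
    # Dictionary keyed by entity strings (JSON object keys are always strings).
    return isinstance(value, dict)


def structural_errors(raw: str) -> List[str]:
    """Return structural problems only; an empty list means 'valid'. Never grades."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top level must be a JSON object"]
    errors = [f"missing key {k}" for k in EXPECTED_KEYS if k not in payload]
    errors += [
        f"{k}: unsupported value type"
        for k in EXPECTED_KEYS
        if k in payload and not _is_allowed_value(payload[k])
    ]
    return errors
```

Under the rule above, any non-empty error list would send every question to the worst tier.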

5. Grading Methodology

5.1 Per-question metric kinds

Each of the twenty questions per scenario is graded by a single metric kind, chosen during construction and frozen before the agent runs. The closed metric set is enforced by a runtime sanity check on the rubric: any submission claiming a kind outside the allowed set is rejected.

The twelve scenarios draw from this closed set in different proportions (§4.2); not every kind is used in every scenario.

5.2 Seven-tier classifier

Per-question scores are normalised against three anchors (naive, good, excellent) declared in the rubric for each question. The classifier produces one of seven tiers, each mapped to a percentage that contributes to the aggregate by simple arithmetic mean:

Tier                Percentage   Roughly
beyond_excellent    100          Better than the excellent anchor by a clear margin
excellent           100          At or just below the excellent anchor
good_to_excellent   70           Between good and excellent
good                50           At or just below the good anchor
naive_to_good       30           Between naive and good
naive               15           At or just below the naive anchor
worse_than_naive    0            Worse than a naive-baseline solver

The aggregate score is reported as a percentage and is interpreted by tier mix more than by point value.
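
The tier-to-percentage mapping and the arithmetic-mean aggregate can be written down directly. The sketch below copies the percentages from the table above; the function name aggregate_score and the example verdict mix are illustrative assumptions, and the classifier that assigns tiers from the anchors is deliberately not reproduced.

```python
from typing import Dict

# Percentages taken from the tier table above.
TIER_PERCENT = {
    "beyond_excellent": 100,
    "excellent": 100,
    "good_to_excellent": 70,
    "good": 50,
    "naive_to_good": 30,
    "naive": 15,
    "worse_than_naive": 0,
}


def aggregate_score(verdicts: Dict[str, str]) -> float:
    """Arithmetic mean of per-question tier percentages."""
    percents = [TIER_PERCENT[tier] for tier in verdicts.values()]
    return sum(percents) / len(percents)


# Made-up verdict mix: five questions each at four different tiers.
tiers = ["excellent"] * 5 + ["good"] * 5 + ["naive"] * 5 + ["worse_than_naive"] * 5
example = {f"q{i:02d}": tier for i, tier in enumerate(tiers, start=1)}
print(aggregate_score(example))  # 41.25
```

The example shows why tier mix matters more than point value: the same 41.25 could also arise from a flatter spread of middling tiers.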

5.3 Persona calibration

Tier anchors are not set by hand. They are derived from a six-persona stress test in which synthesised submissions are graded by the same production rubric the agent submissions go through, and the per-question grades determine the anchors. The personas span five functional roles: a literal-clean reference (submits the canonical answers verbatim, scores 100 %); a naive baseline; a no-hints-thorough baseline; an excellent baseline (the canonical pipeline run against the dirty data); and two stress-test personas designed to fail along specific catalogued patterns. The personas are constructed by inverting the per-question grading metric so each submission lands at the targeted anchor value, decoupling threshold tuning from obstacle authorship.

Each scenario's measured band centres lie within their target bands and the rubric satisfies the reachability gates: strict numeric ordering of naive vs. good vs. excellent on all twenty questions; a naive-to-excellent spread of at least 60 percentage points; a no-hints-to-excellent spread of at least 35 percentage points; and meaningful naive-to-excellent separation on at least fourteen of twenty questions. The original five scenarios' measured centres are summarised in Appendix C; the six new scenarios pass the same gates and are recorded internally.

5.4 Canonical-answer dead-code audit

A subtle failure mode in any rubric-graded benchmark is canonical-answer dead code: the canonical pipeline contains helpers or filters that appear to apply a cleanup but in practice are no-ops on the dirty data, so a candidate who performs every standard normalisation would match the canonical bit-for-bit without recognising any seeded obstacle. rl-gym defends against this with a canonical-answer audit pass: every helper reachable from the canonical-compute entry point must materially change the dirty data, and a "dumb-dirty" baseline that applies no cleanup must match the canonical on at most two of twenty questions. A separate trap-test step then runs a comprehensive-canonicalisation persona (or a real frontier-model submission) and verifies that it scores below 40 % with at least half its questions at-or-below the naive anchor and zero bit-for-bit excellent matches. All eleven scenarios pass this audit.
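
The two numeric gates in this audit can be expressed as small predicates. The sketch below assumes hypothetical function names and input shapes, and it uses Python equality as a stand-in for the bit-for-bit comparison; the helper-reachability check on the canonical pipeline is not modelled.

```python
from typing import Any, Dict

# Tiers that count as "at-or-below the naive anchor".
AT_OR_BELOW_NAIVE = {"naive", "worse_than_naive"}


def dumb_dirty_ok(dirty_answers: Dict[str, Any],
                  canonical: Dict[str, Any],
                  max_matches: int = 2) -> bool:
    """A no-cleanup baseline may match the canonical on at most two questions."""
    matches = sum(1 for q, value in canonical.items()
                  if dirty_answers.get(q) == value)
    return matches <= max_matches


def trap_test_ok(aggregate_pct: float,
                 verdicts: Dict[str, str],
                 exact_matches: int) -> bool:
    """Comprehensive-canonicalisation persona must stay trapped.

    Requires an aggregate below 40 %, at least half the questions at or below
    the naive anchor, and zero exact excellent matches.
    """
    low = sum(1 for tier in verdicts.values() if tier in AT_OR_BELOW_NAIVE)
    return aggregate_pct < 40 and 2 * low >= len(verdicts) and exact_matches == 0


# Toy data: the dirty baseline agrees with the canonical on exactly two slots.
canonical = {f"q{i:02d}": float(i) for i in range(1, 21)}
dirty = {q: v + 1.0 for q, v in canonical.items()}
dirty["q01"], dirty["q02"] = canonical["q01"], canonical["q02"]
print(dumb_dirty_ok(dirty, canonical))
```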

5.5 Closed-form-floor exclusion rule

A small number of questions per scenario turn out to be solvable by a closed-form computation that survives the dirty-data cleanup unchanged — typically because the answer is a small integer or a set whose canonical value is invariant to the seeded obstacles. These are flagged as closed-form floor because they do not discriminate inside the "completes the schema correctly" group. The per-scenario results sections identify each floor and the suite recommends tightening those questions before re-use.

5.6 Scope of grading

The submission schema includes an optional explanations field; the grader does not consume it. The grader executes no candidate code, lints no candidate notebook, and reads no auxiliary files. The only artifact graded is the submitted JSON. Qualitative analysis in this report draws on the explanations and on the agent's trajectory, but no part of the aggregate score depends on them.

6. Models and Run Setup

6.1 Provider selection, flagship sourcing, and routing

The cohort is one entry per major provider: that provider's current flagship model, evaluated under the SWE-agent harness — with one deliberate exception. Anthropic appears twice, as both the new flagship (Opus 4.8) and the prior-generation flagship (Opus 4.7), so that the generation-over-generation upgrade can be measured directly under the same harness (§8.1). Where a provider shipped a new flagship for which a first-party API key was available, we ran it through the vendor's own endpoint (Anthropic, OpenAI, DeepSeek, z.ai, MiniMax, Moonshot, Google). Where a provider's current flagship had no newer first-party release and was already evaluated on the same frozen builds through Bedrock, we carry that score forward rather than re-run it (Mistral Large 3, Meta Llama 4 Maverick, Amazon Nova Pro, and — with a caveat — Alibaba's Qwen line). Every entry, first-party or Bedrock, runs through the identical SWE-agent scaffold on the identical seed-0 build of each scenario; only the model endpoint differs.

Two consequences follow and are treated as first-class caveats:

Separately, for the Opus generation pair and for GPT-5.5 we run the native production agent — Opus 4.8 and Opus 4.7 under Claude Code (max effort, with an Opus 4.8 xhigh control on the original five scenarios), GPT-5.5 under Codex — on every scenario, as a harness-sensitivity baseline (§8.3) and as the basis for the §8.1 per-scenario upgrade-distribution analysis. These are reported alongside but not interleaved into the SWE-agent ranking.

6.2 Harness

Every ranked entrant runs through one harness. The SWE-agent [SWE-agent] adapter exposes str_replace_editor (file edit), shell (bash, with whitelisted commands), and submit (end-of-run marker). Tool calls are JSON-encoded function calls, routed via LiteLLM so that the same scaffold serves every provider. The system prompt and the candidate-facing question pack are byte-identical across all runs of a given scenario.

Two SWE-agent mechanics affect the results and are worth flagging. First, SWE-agent was authored around SWE-bench, where the expected artifact is a unified diff; rl-gym wants solution/submission.json instead, so the system prompt retargets the expected output. A model that calls submit before writing the file ends the run with nothing on disk and is graded as no submission. Second, the tool surface is function-call-shaped; a model that cannot emit structured tool calls cycles on parse errors and never submits. Three minor integration adjustments (supplying list prices for new model versions so the cost meter could proceed; caching and restoring one provider's required reasoning-trace echo; padding empty-content assistant turns for Moonshot's strict-non-empty constraint) do not touch the model, the prompt, or the grading.

The native-agent baselines (§8.1, §8.3) run under the providers' own production agents — Claude Code and Codex — which expose richer orchestration tooling (sub-agent workflows, task lists, monitors) than SWE-agent's three-tool surface. That difference is not incidental: it is the proximate cause of the brierdale anomaly that anchors the §8.1 distribution's negative tail.

6.3 Run conditions

6.4 Cohort

The provider cohort, with the endpoint each was run through:

Provider   Flagship            Model string                                       Route        Source
Anthropic  Opus 4.8            anthropic/claude-opus-4-8                          first-party  this report
OpenAI     GPT-5.5             openai/gpt-5.5                                     first-party  this report
DeepSeek   V4-Pro              deepseek/deepseek-v4-pro                           first-party  this report
Anthropic  Opus 4.7            anthropic/claude-opus-4-7                          first-party  this report
z.ai       GLM-5.1             zai/glm-5.1                                        first-party  this report
Moonshot   Kimi K2.6           moonshot/kimi-k2.6                                 first-party  this report
MiniMax    M2.7                minimax/MiniMax-M2.7                               first-party  this report
Alibaba    Qwen3-Coder-480B *  bedrock/qwen.qwen3-coder-480b-a35b-v1:0            Bedrock      carried
Mistral    Large 3             bedrock/mistral.mistral-large-3-675b-instruct      Bedrock      carried
Meta       Llama 4 Maverick    bedrock/us.meta.llama4-maverick-17b-instruct-v1:0  Bedrock      carried
Amazon     Nova Pro **         bedrock/us.amazon.nova-pro-v1:0                    Bedrock      carried
Google     Gemini 3.5 Flash    gemini/gemini-3.5-flash                            first-party  this report

* Alibaba's current flagship Qwen 3.6 was not attempted (no first-party key); this row carries the best-available Bedrock entry, Qwen3-Coder-480B, one version behind. ** Amazon's flagship-tier Nova Premier produced no submission; the Amazon row carries the mid-tier Nova Pro.

The native-agent baselines accompany the ranking (reported in §8.1 and §8.3, not interleaved into it): Opus 4.8 under Claude Code (max on all eleven scenarios, xhigh on the original five), Opus 4.7 under Claude Code (max on all eleven), and GPT-5.5 under Codex (high on all eleven).

7. Results

7.1 Cross-scenario leaderboard

The cross-scenario aggregate is the mean of a model's aggregate over the scenarios in which it produced a valid submission. The Scenarios column gives the count out of twelve; Avg Wall is the mean wall-clock of the valid runs.

Rank  Provider   Model                    Route  Avg Agg %  Scenarios  Avg Wall
1     Anthropic  Fable 5                  FP     57.83      12/12      15:50
2     Anthropic  Opus 4.8                 FP     46.67      12/12      11:44
3     Anthropic  Opus 4.7                 FP     44.56      12/12      06:56
4     Google     Gemini 3.5 Flash (thin)  FP     43.92      3/12       26:11
5     DeepSeek   V4-Pro                   FP     42.90      12/12      13:52
6     OpenAI     GPT-5.5                  FP     42.46      12/12      09:01
7     z.ai       GLM-5.1                  FP     40.70      11/12      15:42
8     MiniMax    M2.7                     FP     29.40      12/12      08:15
9     Moonshot   Kimi K2.6 (thin)         FP     26.75      1/12       46:09
10    Alibaba    Qwen3-Coder-480B *       BR     22.27      12/12      19:33
11    Mistral    Large 3                  BR     19.96      12/12      05:28
12    Meta       Llama 4 Maverick         BR     1.41       8/12       01:07
13    Amazon     Nova Pro **              BR     1.22       8/12       07:50

Route: FP = first-party, BR = Bedrock.

* Qwen3-Coder-480B is one version behind Alibaba's current flagship (Qwen 3.6, not attempted).
** Amazon's flagship-tier Nova Premier produced no submission; Nova Pro is the best-available Amazon entry.
(thin) marks coverage below half the suite: Gemini 3.5 Flash scored on helios, helix_marine, and kitchen_supply_co (3/12); Kimi K2.6 scored only on halberon (1/12). Their cross-scenario averages should not be read as comparable to the 12/12 entries (§7.4).

Fable 5 leads the table outright on full 12/12 coverage (57.83) and tops eight of the twelve boards. The raw +11.2 gap over Opus 4.8 is not regime-matched: under SWE-agent, Fable 5's always-on adaptive thinking runs while Opus 4.8 runs thinking-off (§8.4).
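The aggregation rule behind the table (mean over valid submissions, with a coverage count and a thin flag below half the suite) can be sketched as follows. This is an illustrative reconstruction rather than the harness's code: the function name is hypothetical, `None` is an assumed encoding for a no-submission cell, and the Opus 4.8 per-scenario scores are copied from the §8.1 table plus the Elo cell in §7.2.

```python
from statistics import mean

SUITE_SIZE = 12  # twelve frozen scenarios


def cross_scenario_average(scores):
    """Return (average, coverage, is_thin) for one model.

    `scores` maps scenario -> aggregate, with None for no valid submission.
    The average is taken over valid submissions only.
    """
    valid = [s for s in scores.values() if s is not None]
    if not valid:
        return None, 0, True
    coverage = len(valid)
    return round(mean(valid), 2), coverage, coverage < SUITE_SIZE / 2


# Opus 4.8 under SWE-agent (scores as reported in this document).
OPUS_48_SWE = {
    "brierdale": 59.00, "brindholm": 46.00, "brivane": 30.75,
    "halberon": 43.50, "pellichar": 45.25, "aurora_transit": 49.00,
    "crosswind": 22.75, "helios": 58.00, "helix_marine": 55.00,
    "jobpost_analytics_co": 48.50, "kitchen_supply_co": 39.75, "elo": 62.50,
}

# Kimi K2.6 produced a valid submission on halberon only.
KIMI_K26_SWE = dict.fromkeys(OPUS_48_SWE, None)
KIMI_K26_SWE["halberon"] = 26.75

print(cross_scenario_average(OPUS_48_SWE))   # full-coverage entry
print(cross_scenario_average(KIMI_K26_SWE))  # single-scenario, flagged thin
```

The thin flag makes explicit why a 1/12 or 3/12 average is not comparable to a 12/12 average: the denominator differs even though the column looks the same.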

Figure 2. Cross-scenario SWE-agent leaderboard. Bars are each model's mean aggregate over the scenarios in which it produced a valid submission, coloured by route (first-party vs Bedrock). Over twelve scenarios Fable 5 (57.83, full 12/12 coverage) posts the highest average by a wide margin, while the next first-party band — Opus 4.8 (46.67), Opus 4.7 (44.56), DeepSeek V4-Pro (42.90), GPT-5.5 (42.46), GLM-5.1 (40.70) — clusters inside six points. MiniMax M2.7 (29.40) pulls away from the Bedrock band (Qwen 22.27, Mistral 19.96), and Meta and Nova Pro floor the table. Fable 5, Gemini 3.5 Flash, and Kimi K2.6 are drawn with explicit coverage annotation.

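The cross-scenario board separates into three regions below Fable 5 (57.83): a first-party band from Opus 4.8 (46.67) down to GLM-5.1 (40.70), a mid band from MiniMax M2.7 (29.40) through Qwen3-Coder-480B (22.27) and Mistral Large 3 (19.96), and a floor held by Meta Llama 4 Maverick (1.41) and Amazon Nova Pro (1.22).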

The DeepSeek result holds across the wider sample: V4-Pro reaches the top band (42.90, fourth among the full-coverage entries) at roughly 25× below the Western frontier's per-run cost (§7.5). It remains the cost-quality frontier point of the entire suite.

Scope note on Fable 5. Fable 5 was run on all twelve scenarios. The cross-scenario leaderboard (§7.1), the model × scenario heatmap (§7.3), the efficiency frontier (§7.5), the §8.4 synthesis, and every per-scenario board in Appendix A all include it. The only views that do not are the Opus 4.7 → 4.8 harness analysis (§8.1, §8.3, Figure 5) and the native-vs-SWE-agent appendix (Appendix B): those are specifically about the Opus generation pair and its three native conditions, so references to "eleven scenarios" there are intentional and exclude Fable 5.

7.2 Per-scenario summary

Each scenario produces its own twenty-question board. The table below gives, per scenario, the top SWE-agent model and its score, and the best native-agent score with the agent that produced it (across Fable 5, Opus 4.8, and Opus 4.7 under Claude Code max, and GPT-5.5 under Codex).

Scenario               Top SWE model  Top score  Native best
Brierdale              Fable 5        60.50      74.75 (Fable 5 CC max)
Brindholm              Fable 5        51.00      76.50 (Opus 4.7 CC max)
Brivane                Fable 5        38.00      38.00 (Opus 4.7 CC max)
Halberon               Fable 5        44.25      57.25 (Fable 5 CC max)
Pellichar              V4-Pro         52.50      48.50 (GPT-5.5 Codex)
Aurora Transit         Fable 5        80.75      73.25 (Fable 5 CC max)
Crosswind              Fable 5        76.50      81.50 (Fable 5 CC max)
Helios                 Opus 4.8       58.00      63.50 (Fable 5 CC max)
Helix Marine           Opus 4.8       55.00      60.75 (Fable 5 CC max)
Jobpost Analytics Co.  Fable 5        63.25      65.25 (GPT-5.5 Codex)
Kitchen Supply Co.     Fable 5        74.75      67.00 (Fable 5 CC max)
Elo                    Opus 4.8       62.50      66.50 (Fable 5 CC max)

Four readings follow. First, Fable 5 dominates but does not sweep: it tops eight bare-API boards (brierdale, brindholm, brivane, halberon, aurora_transit, crosswind, jobpost_analytics_co, kitchen_supply_co), while Opus 4.8 tops three (helios, helix_marine, elo) and DeepSeek V4-Pro tops pellichar, so even the strongest model loses a third of the scenarios to a rival. Second, absolute difficulty varies widely: brivane remains the hardest scenario (top score 38.00, so no model clears half the points), while aurora_transit (80.75), crosswind (76.50), and kitchen_supply_co (74.75), all Fable 5 wins, are now the most tractable. Third, the highest single cell anywhere in the suite is native Fable 5 on crosswind (Claude Code max, 81.50); among native Fable 5 cells, brierdale (74.75) and aurora_transit (73.25) come next, and Opus 4.8's best cell on the new scenarios remains native aurora_transit (70.25). Fourth, the native-vs-SWE picture is model- and scenario-dependent: helios is a scenario where SWE-agent out-scores the native run for both Opus models (Opus 4.8: SWE 58.00 vs CC max 46.25, −11.75), yet on that same scenario native Fable 5 (63.50) is the highest run; and on the boards Fable 5 already leads comfortably (aurora_transit, kitchen_supply_co, jobpost_analytics_co) its own native loop costs it 7 to 8 points. "Native generally wins" is no longer a clean rule (§8.3, §8.4).
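The first two readings are simple tallies over the per-scenario table and can be reproduced with a short sketch. The dictionary below transcribes the "Top SWE model" and "Top score" columns of the §7.2 table; the variable names are illustrative only.

```python
from collections import Counter

# scenario -> (top SWE-agent model, top score), transcribed from the §7.2 table
TOP_SWE = {
    "brierdale": ("Fable 5", 60.50),
    "brindholm": ("Fable 5", 51.00),
    "brivane": ("Fable 5", 38.00),
    "halberon": ("Fable 5", 44.25),
    "pellichar": ("V4-Pro", 52.50),
    "aurora_transit": ("Fable 5", 80.75),
    "crosswind": ("Fable 5", 76.50),
    "helios": ("Opus 4.8", 58.00),
    "helix_marine": ("Opus 4.8", 55.00),
    "jobpost_analytics_co": ("Fable 5", 63.25),
    "kitchen_supply_co": ("Fable 5", 74.75),
    "elo": ("Opus 4.8", 62.50),
}

# Boards topped per model.
wins = Counter(model for model, _ in TOP_SWE.values())

# Hardest and most tractable scenarios, judged by the best score anyone reached.
hardest = min(TOP_SWE, key=lambda s: TOP_SWE[s][1])
easiest = max(TOP_SWE, key=lambda s: TOP_SWE[s][1])

# Share of boards the overall leader does not top.
lost_share = 1 - wins["Fable 5"] / len(TOP_SWE)

print(wins, hardest, easiest, round(lost_share, 2))
```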

7.3 Model × scenario heatmap

The aggregate of every scored cohort entry across the twelve scenarios is shown as a heatmap in Figure 3. The full per-scenario boards are reproduced in Appendix A.

Figure 3. Model × scenario aggregate heatmap. Rows are provider flagships (ordered by cross-scenario average — Fable 5 is now the top row), columns are the twelve scenarios; cells are the SWE-agent aggregate, colour-scaled. Fable 5 forms a warm top row across most columns, with the first-party band warm below it; the Bedrock floor (Meta, Nova Pro) is uniformly cold; brivane stays the coldest column (the hardest scenario), while Fable 5's strongest wins (aurora_transit, crosswind, kitchen_supply_co) warm the right side. Empty cells mark no-submission / not-run conditions.

Two structural patterns hold across the grid. The first-party top band is rank-correlated but not identical from scenario to scenario: Opus 4.8 is top-band on most scenarios, but its standing swings from a win on helios and a close second on brierdale (59.00 to Fable 5's 60.50) to mid-pack on crosswind (22.75, sixth) and jobpost (48.50, fifth). The Bedrock floor is a join-failure floor, not a reasoning floor: Meta Llama 4 Maverick and Amazon Nova Pro score near zero wherever they appear because they never assemble the multi-table join, and their absence from some columns is a no-submission outcome rather than a low score.

7.4 No-submission and floor failure modes

Across the twelve scenarios, three providers contribute most of the no-submission / floor mass:

None of the no-submission outcomes is a function-calling or patch-convention failure of the kind §6.2 describes; all are produce-no-artifact terminations. Their absence from a column should not be read as a hard capability floor — where these endpoints reach submission, they reach a score.

7.5 Cross-scenario efficiency

Per-run cost and wall-clock vary by roughly two orders of magnitude across the cohort. Cost is the LiteLLM meter's billed figure for the first-party runs and the Bedrock meter's figure for the carried runs; the two are list-price comparable but not identical accounting, and the USD figure is not snapshot-locked across providers.

Figure 4. Cost (log scale, x) versus cross-scenario aggregate (y) over twelve scenarios. Fable 5 sits at the top of the score axis (57.83) at a moderate ~$2–4/run, well clear of the field on quality. The cost-quality frontier point remains DeepSeek V4-Pro: fourth among the full-coverage entries at sub-$0.10 per run, more than an order of magnitude cheaper than the Western frontier (Opus 4.8 and GPT-5.5 in the ~$1.5/run range) for a ~4-point gap to Opus 4.8. The three Chinese-lab flagships (DeepSeek, GLM-5.1, MiniMax) all sit at the left edge of the cost axis. Gemini 3.5 Flash's no-submission runs were among the most expensive SWE-agent runs in the suite (halberon, $6.86). The native Claude Code runs (§8, $22–75) are off the right edge of this chart entirely.

Two patterns stand out. First, cost does not track score: DeepSeek V4-Pro reaches the top band at ≈$0.06 per run while Opus 4.8 sits 3.77 points above it at ≈$1.6, roughly 25× the spend for a ~4-point gain, and the three Chinese-lab flagships cluster an order of magnitude below the Western frontier in cost. Second, spend does not buy a submission: Gemini 3.5 Flash's most expensive runs (aurora_transit $14+, halberon $6.86) produced no answer file. The honest cost reading is unchanged from the original five: the benchmark's cost-quality frontier is held by a cheap model (DeepSeek V4-Pro), and the most expensive runs in the entire study, the native Claude Code baselines at $22–75 per run, include the §8.1 brierdale outlier.
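The spend-versus-gain comparison in the first pattern is plain arithmetic. The sketch below works it through using the approximate per-run costs quoted above and the leaderboard averages from §7.1; because the costs are stated only approximately, the outputs are indicative rather than exact.

```python
# Approximate per-run costs (USD) as quoted in this section; leaderboard
# averages from the §7.1 table.
deepseek_cost, deepseek_score = 0.06, 42.90
opus48_cost, opus48_score = 1.6, 46.67

# How many times more Opus 4.8 spends per run.
spend_ratio = opus48_cost / deepseek_cost

# Cross-scenario points gained for that extra spend.
score_gap = opus48_score - deepseek_score

# Marginal cost of each additional aggregate point, per run.
usd_per_extra_point = (opus48_cost - deepseek_cost) / score_gap

print(f"spend ratio: {spend_ratio:.1f}x")
print(f"score gap: {score_gap:.2f} points")
print(f"marginal cost: ${usd_per_extra_point:.2f} per extra point per run")
```

With these inputs the ratio comes out slightly above the "roughly 25×" quoted in the text, which is within the precision of the approximate cost figures.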

7.6 Chain-of-thought audit

A trajectory-level search of the scored SWE-agent runs for pivot markers (wrong approach, abandoned, reconsider, pivot) and for an uncertainty register returns a near-null result across all eleven scenarios: the scored SWE-agent runs follow a produce-and-submit pattern, drafting per-question values and writing them to solution/submission.json without a documented mid-run pivot. The trajectories that break the pattern are the native-agent runs — and there the over-deliberation is not always benign: on brierdale (the §8.1 outlier), brindholm, brivane, and jobpost, the Opus 4.8 max Claude Code runs ran large self-verification Workflows whose reconcile stages re-litigated answers, with the brierdale instance the only one that unambiguously collapsed multiple top-tier readings (§8.1). The grader reads none of this; every aggregate is computed solely from submission.json. Trajectory notes are qualitative context only.
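A marker search of this kind can be sketched as a case-insensitive regular-expression scan. The marker list is the one named above; everything else is an assumption for illustration, including the function names and the idea that each trajectory is available as a single text string (the real trajectory format is not described in this report).

```python
import re
from collections import Counter

# Pivot markers named in the audit description.
PIVOT_MARKERS = ("wrong approach", "abandoned", "reconsider", "pivot")

# Leading word boundary only, so inflected forms ("reconsidering", "pivoted")
# also match. Caveat: "pivot" will also hit "pivot_table" or "pivot table",
# so hits need a manual read before being counted as real pivots.
_PATTERN = re.compile(
    "|".join(rf"\b{re.escape(marker)}" for marker in PIVOT_MARKERS),
    re.IGNORECASE,
)


def count_pivot_markers(trajectory_text: str) -> Counter:
    """Count occurrences of each pivot marker in one trajectory's text."""
    return Counter(m.group(0).lower() for m in _PATTERN.finditer(trajectory_text))


def audit(trajectories: dict[str, str]) -> dict[str, Counter]:
    """Map run id -> marker counts, keeping only runs with at least one hit."""
    hits = {run_id: count_pivot_markers(text) for run_id, text in trajectories.items()}
    return {run_id: counts for run_id, counts in hits.items() if counts}
```

A near-null result in this framing means `audit(...)` returns an empty or nearly empty mapping for the scored SWE-agent runs.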

8. Analysis

8.1 The Opus 4.7 → 4.8 upgrade sign is scenario-dependent (headline)

This is the flagship finding of the suite as it now stands. We measured the Anthropic Opus 4.7 → 4.8 generation upgrade under two harnesses across all eleven scenarios, holding each frozen build and seed fixed. The per-scenario distribution is the result, not the average alone.

Harness                   Opus 4.7 (avg over 11)  Opus 4.8 (avg over 11)  Δ (4.8 − 4.7)
SWE-agent                 43.07                   45.23                   +2.16
Native Claude Code (max)  50.82                   50.00                   −0.82

The cross-scenario averages tell only the headline: under SWE-agent the upgrade is mildly positive on average; under native Claude Code at matched max effort the upgrade is essentially tied. The per-scenario detail is what the five-scenario report missed — the upgrade direction varies sharply by scenario under both harnesses:

Scenario               SWE 4.7  SWE 4.8  Δ SWE   CC 4.7 max  CC 4.8 max  Δ native
Brierdale              50.25    59.00    +8.75   63.00       40.00       −23.00
Brindholm              47.50    46.00    −1.50   76.50       71.00       −5.50
Brivane                26.00    30.75    +4.75   38.00       29.50       −8.50
Halberon               33.75    43.50    +9.75   41.75       47.75       +6.00
Pellichar              44.00    45.25    +1.25   37.75       42.50       +4.75
Aurora Transit         43.25    49.00    +5.75   55.50       70.25       +14.75
Crosswind              36.50    22.75    −13.75  40.00       40.75       +0.75
Helios                 43.75    58.00    +14.25  42.25       46.25       +4.00
Helix Marine           48.75    55.00    +6.25   56.25       58.25       +2.00
Jobpost Analytics Co.  54.25    48.50    −5.75   61.50       58.00       −3.50
Kitchen Supply Co.     45.75    39.75    −6.00   46.50       45.75       −0.75
Average                43.07    45.23    +2.16   50.82       50.00       −0.82

Under SWE-agent, the upgrade is positive on 7 of 11 scenarios; the four negatives include crosswind (−13.75) and kitchen_supply_co (−6.00) where Opus 4.8 underperforms its predecessor decisively. Under native Claude Code at matched max effort, the upgrade splits +5 / −5 / 1 flat across the eleven scenarios. The five-scenario subset reported in the previous combined TR was 3 negative / 2 positive, with brierdale's −23.0 dominating the average; the wider sample is much more balanced.
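The deltas and averages in the per-scenario table can be recomputed directly from its four score columns. The sketch below transcribes those columns and derives the mean deltas, the SWE-agent sign count, and the native extremes; the names are illustrative, and no "flat" band is applied because the report does not state the threshold it uses for that label.

```python
from statistics import mean

# scenario -> (SWE 4.7, SWE 4.8, CC 4.7 max, CC 4.8 max), from the §8.1 table
SCORES = {
    "brierdale": (50.25, 59.00, 63.00, 40.00),
    "brindholm": (47.50, 46.00, 76.50, 71.00),
    "brivane": (26.00, 30.75, 38.00, 29.50),
    "halberon": (33.75, 43.50, 41.75, 47.75),
    "pellichar": (44.00, 45.25, 37.75, 42.50),
    "aurora_transit": (43.25, 49.00, 55.50, 70.25),
    "crosswind": (36.50, 22.75, 40.00, 40.75),
    "helios": (43.75, 58.00, 42.25, 46.25),
    "helix_marine": (48.75, 55.00, 56.25, 58.25),
    "jobpost_analytics_co": (54.25, 48.50, 61.50, 58.00),
    "kitchen_supply_co": (45.75, 39.75, 46.50, 45.75),
}

# Upgrade delta (4.8 minus 4.7) under each harness.
swe_delta = {s: v[1] - v[0] for s, v in SCORES.items()}
native_delta = {s: v[3] - v[2] for s, v in SCORES.items()}

swe_mean = round(mean(swe_delta.values()), 2)
native_mean = round(mean(native_delta.values()), 2)

# Strict sign counts under SWE-agent.
swe_positive = sum(d > 0 for d in swe_delta.values())
swe_negative = sum(d < 0 for d in swe_delta.values())

# Scenarios at the two ends of the native distribution.
native_worst = min(native_delta, key=native_delta.get)
native_best = max(native_delta, key=native_delta.get)

print(swe_mean, native_mean, swe_positive, swe_negative, native_worst, native_best)
```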

Figure 5. The Opus 4.7 → 4.8 upgrade is scenario-dependent (Fable 5 is not part of this Opus-pair view). Left: cross-scenario averages over twelve scenarios — SWE-agent Δ = +2.11, native Claude Code max Δ = −0.71 (essentially tied). Right: per-scenario native deltas (4.8 − 4.7, max), sorted by sign. The distribution spans −23.0 (brierdale) to +14.75 (aurora_transit); 6 scenarios positive, 5 negative, 1 flat. Brierdale is the dominant negative outlier. Single trial per cell — magnitudes are noisy; the per-scenario distribution is the load-bearing finding.

Brierdale (−23.0) is the dominant negative outlier and remains mechanistically explained. Under native Claude Code, max-effort 4.8 front-loads a correct, format-valid baseline and then spends the majority of a 2–3.4× longer budget running a large per-question self-verification Workflow whose reconcile stage re-litigates already-correct answers into "more defensible" wrong readings, collapsing top-tier questions. Brierdale is the most extreme instance because max-effort 4.8 treated the package's examples/example_submission.json, which is explicitly labelled a placeholder decoy, as a "ground-truth calibration oracle": its 21-agent verification Workflow noticed the literal-wording tensions on several questions but resolved each toward "defensible because it matches the example," endorsing the violations rather than catching them.

The xhigh control (one tier below max) under the same Claude Code harness reached the opposite verdict on the example file, treating it as a decoy and relying on question wording, and scored 57.0 versus max's 40.0: a 17-point gap from one tier of effort change with the model identical. The collapse is therefore a property of how the heaviest effort tier resolves under native orchestration in the presence of a specific anchor (the decoy file), not a stable 4.8 capability deficit. No other scenario reproduces the magnitude. On the six new scenarios the same max-effort 4.8 averages +2.88 per scenario against 4.7 natively (the mean of the six native deltas in the table above); only jobpost (−3.50) and kitchen_supply (−0.75) carry small negatives, both within plausible single-trial range.

The §8.1 finding as it now stands is therefore: the Opus 4.7 → 4.8 upgrade is not a clean cross-scenario regression under native Claude Code. It is a scenario-dependent distribution whose mean over eleven scenarios is indistinguishable from zero at this sample size (−0.82 against a per-scenario s.d. of roughly 10, i.e. a standard error near 3). The previous report's −5.25 cross-scenario regression headline is withdrawn: it was an artefact of an oversampled negative subset. What survives is brierdale as a specific, mechanistically-characterised outlier where the reconcile pathology compounds with an exploitable anchor file in the package. The general claim "max-effort 4.8 under native Claude Code spends compute on a self-verification Workflow that can re-litigate correct answers into wrong ones" is still supported by trajectory reads on brierdale, brindholm, brivane, and jobpost, but the outcome of that compute is not uniformly negative across the suite (on aurora_transit the same Workflow produces the largest positive native delta in the table, +14.75, at a score of 70.25).
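The "mean against spread" comparison can be checked with the standard library. The sketch below takes the eleven native deltas from the §8.1 table and computes the sample standard deviation and the standard error of the mean. It treats the eleven scenarios as a simple sample, which is an assumption: with one trial per cell this describes between-scenario spread only, not within-scenario noise.

```python
from math import sqrt
from statistics import mean, stdev

# Native Claude Code (max) deltas, Opus 4.8 minus Opus 4.7, from the §8.1 table.
NATIVE_DELTAS = {
    "brierdale": -23.00, "brindholm": -5.50, "brivane": -8.50,
    "halberon": 6.00, "pellichar": 4.75, "aurora_transit": 14.75,
    "crosswind": 0.75, "helios": 4.00, "helix_marine": 2.00,
    "jobpost_analytics_co": -3.50, "kitchen_supply_co": -0.75,
}

values = list(NATIVE_DELTAS.values())
delta_mean = mean(values)
delta_sd = stdev(values)                    # sample s.d. (n - 1 denominator)
delta_se = delta_sd / sqrt(len(values))     # standard error of the mean

print(f"mean = {delta_mean:+.2f}, s.d. = {delta_sd:.2f}, s.e. = {delta_se:.2f}")
print(f"mean is {abs(delta_mean) / delta_se:.2f} standard errors from zero")
```

The mean sits well inside one standard error of zero, which is the sense in which the cross-scenario average carries no sign information on its own.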

Why the previous five-scenario report concluded the opposite. The original five included three negatives (brierdale, brindholm, brivane), one large positive (halberon), and one small positive (pellichar). The negatives were clustered on scenarios with rich multi-table-bridge reconciliation where the self-verification Workflow has many targets to re-litigate; the new six include scenarios (aurora_transit, helios) where the same Workflow is productive (the extra tooling lifts native 4.8 above 4.7 by 4+ points). The corrected reading is that the reconcile pathology is real but not universal: it depends on whether the scenario's question pack contains many "defensible alternative reading" attractors and on whether the candidate package contains an anchor artifact (like brierdale's decoy file) for the reconcile to fix on.

The single-trial-per-cell caveat remains in force. With s.d. ~10 across the eleven per-scenario deltas, a multi-seed sweep is now the highest-priority follow-up (§11).

8.2 Routing and recency are confounded with the score

The cross-scenario leaderboard mixes two routing layers by construction (§6.1): the upper entries run first-party, the trailing entries are Bedrock-hosted. Because the first-party entries are also the newest releases, three variables move together down the table (model identity, model recency, and routing) and cannot be separated from a single trial per cell. Over twelve scenarios the mid-band routing seam widens: MiniMax M2.7 (29.40, first-party) now sits ~7 points above the leading Bedrock entry (Qwen3-Coder-480B, 22.27), versus 0.55 points on the original five, driven primarily by M2.7's 55.00 outlier on jobpost_analytics_co. The floor entries (Meta, Nova Pro) are Bedrock-hosted, older, and weaker, so their position is overdetermined. The honest reading is unchanged: this is a leaderboard of each provider's best-available flagship as served, not a controlled capability ranking.

8.3 Harness sensitivity: native lift is scenario-dependent, not uniform

For three models we hold the model and the build fixed and vary only the agent. Across the eleven scenarios the native-versus-SWE-agent picture is:

Model / effort            Native agent  Native avg (11)  SWE-agent avg (11)  Native − SWE
GPT-5.5                   Codex         51.25            41.11               +10.14
Opus 4.7 (max)            Claude Code   50.82            43.07               +7.75
Opus 4.8 (max)            Claude Code   50.00            45.23               +4.77
Opus 4.8 (xhigh, 5-only)  Claude Code   48.55            44.90 (5-only)      +3.65

Three observations.

First, native agents lift every model in the probe on average, but for Opus 4.7 the lift is substantially narrower than the five-scenario report suggested: it gains +7.75 (vs +11.10 on five) moving from SWE-agent to Claude Code, while GPT-5.5's gain under Codex is essentially unchanged at +10.14 (vs +9.75 on five). The Opus 4.7 attenuation is driven by helios (SWE 43.75 vs native 42.25, where the native lift is negative) and by crosswind (SWE 36.50 vs native 40.00, only +3.50).

Second, Opus 4.8's native lift recovers under the wider sample. Its native Claude Code max average (50.00) now sits +4.77 above its SWE-agent average (45.23), a meaningful lift where the five-scenario report measured only +1.25. The recovery is driven by aurora_transit (+21.25 native, the largest single lift among the new scenarios), helix_marine (+3.25), and the dilution of brierdale's −19 native penalty across the wider cross-scenario mean. The "Opus 4.8's native lift nearly vanishes" claim from the previous report does not generalise.

Third, harness sensitivity has both signs. The new scenarios produce two clear cases where SWE-agent is not a uniform handicap:

- Helios: SWE-agent Opus 4.8 (58.00) > native Claude Code Opus 4.8 (46.25), an 11.75-point harness advantage for SWE-agent on the same model. The per-asset multiplicative meter-scale quantisation that drives the 5-question hard core (§4.2 helios signature) is not recovered by either harness; both collapse on the same questions, but SWE-agent's leaner loop concentrates budget on the questions that are recoverable, while native max-effort spends extra tooling on the unrecoverable ones.
- Crosswind: SWE-agent GLM-5.1 (42.75) exceeds the best of the three Opus/Codex native cells (Opus 4.8 CC max 40.75), the only scenario where a SWE-agent cell beats all three native conditions of the Opus-pair probe. (Fable 5, added later, tops the scenario outright on both harnesses: SWE 76.5, native 81.5; §8.4.)

The implication is direct: the absolute numbers in the cross-scenario leaderboard are model-under-SWE-agent scores, and the native-vs-SWE-agent delta is scenario-dependent in sign. On scenarios with rich reconciliation under richer tooling (aurora_transit, jobpost, brindholm), native lifts are large; on scenarios where the hard core is unrecoverable under either harness (helios, crosswind), the SWE-agent operating point can match or beat the native operating point. The full per-scenario native-versus-SWE-agent grid is in Appendix B.
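The sign-dependence of the native lift can be seen by differencing the two Opus 4.8 columns of the §8.1 table scenario by scenario. The sketch below does only that; it is an illustrative recomputation with hypothetical variable names, and the full grid for all models is the one in Appendix B.

```python
from statistics import mean

# scenario -> (SWE-agent, native Claude Code max) for Opus 4.8, from the §8.1 table
OPUS_48 = {
    "brierdale": (59.00, 40.00),
    "brindholm": (46.00, 71.00),
    "brivane": (30.75, 29.50),
    "halberon": (43.50, 47.75),
    "pellichar": (45.25, 42.50),
    "aurora_transit": (49.00, 70.25),
    "crosswind": (22.75, 40.75),
    "helios": (58.00, 46.25),
    "helix_marine": (55.00, 58.25),
    "jobpost_analytics_co": (48.50, 58.00),
    "kitchen_supply_co": (39.75, 45.75),
}

# Native lift = native score minus SWE-agent score, per scenario.
lift = {scenario: native - swe for scenario, (swe, native) in OPUS_48.items()}

mean_lift = round(mean(lift.values()), 2)

# Scenarios where the native agent scores below SWE-agent for the same model.
native_worse = sorted(s for s, d in lift.items() if d < 0)

for scenario, d in sorted(lift.items(), key=lambda kv: kv[1]):
    print(f"{scenario:22s} {d:+7.2f}")
print("mean lift:", mean_lift, "| native worse on:", native_worse)
```

The positive average coexists with several negative cells, which is why a single "native lift" number should not be applied to any individual scenario.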

Caveats: one trial per agent/effort/scenario cell, and the native runs assume the same frozen builds as the harness cohort — the question packs and metric kinds match — but a build drift between runs cannot be fully excluded (§9).

8.4 Fable 5: leads on both harnesses — but the bare-API gap is not regime-matched

Fable 5 ran on all twelve scenarios under both the bare API (SWE-agent) and native Claude Code. It posts the highest bare-API average (57.83) and the highest native average (61.35, against Opus 4.7's 51.98, GPT-5.5/Codex's 51.48, and Opus 4.8's 51.27), and tops eight of the twelve bare-API boards. Attributing that lead, however, requires care, because the headline same-harness comparison is confounded by the reasoning regime.

The SWE-agent scaffold is identical, but the reasoning regimes are not. Under SWE-agent the --effort knob maps only to a per-run cost cap — the harness never passes a thinking budget. Opus 4.8 therefore runs thinking-off at the provider's default 4,096-token output cap, while Fable 5's adaptive thinking cannot be disabled (an explicit disable is rejected by the API) and runs under a raised 32k output cap. The trace economics confirm it: on Aurora Transit, Fable spends ~37 s per call against Opus's ~9 s, while Opus emits more visible thought text — Fable's depth is hidden thinking, not verbosity. So the raw +11.2 bare-API gap mixes model quality with test-time reasoning that one model gets and the other does not.

Regime-matched, the gap narrows and concentrates. Comparing each model's best condition per scenario (so each runs with its own thinking regime), Fable 5 leads by +9.1 mean / +6.4 median — and just +4.0 over the ten scenarios outside Crosswind and Kitchen Supply Co., with one inversion (Brindholm, Opus +20). The per-scenario mechanism splits cleanly in the traces:

The native loop adds little on net, and its sign is scenario-dependent. Fable 5's native average is only +3.5 over its bare-API average, and the per-scenario lift splits by scenario character. On reconciliation-heavy covers it gains (Brierdale +14.25, Helix Marine +14.25, Halberon +13.0, Helios +10.75, Elo +7.5, Crosswind +5.0), because the extra budget goes into re-deriving from the data. But on analytics-heavy covers where its bare-API run was already very strong, the loop costs points (Aurora Transit −7.5, Kitchen Supply Co. −7.75, Jobpost Analytics Co. −7.0). So "native lift" is not a fixed property: it depends on whether the scenario leaves headroom the loop can recover. Opus 4.8, by contrast, regresses under its own native agent on four of the eleven Opus-pair scenarios (§8.1 table), most sharply where its max-effort Codegen-and-Verify sub-agent panels police arithmetic under a planning dossier that pre-commits the wrong output shapes, so the flawed premise goes unchecked (Brierdale 59→40, Helios 58→46.25).

Two scenarios invert or neutralise the pattern, which is the tell that it is real. On Brindholm the roles flip: Opus 4.8 runs a genuine re-derivation workflow (native 71.00) while Fable 5 only reproduces its own first answer, so its bare-API and native runs both land at 51.0 with no native lift. On Elo, neither model uses verification sub-agents (both verify by re-running on the data), so they finish in a near-tie (Fable native 66.5, Opus 65.25) and the gap shrinks to a single units-interpretation call. The axis that separates the two models is not depth of reasoning but whether the agent loop is used to interrogate the data or to ratify a plan.

9. Limitations

  1. Single trial per cell. Every cohort entry, every native baseline, and every per-scenario cell is a single trajectory. No within-condition variance or decoding-seed sweep is reported, so the first-party top-band ordering (positions 2–6 inside 4.5 points), any small rank swap, and the magnitudes of the §8.1 upgrade deltas are below the resolution of the benchmark. The per-scenario distribution of the §8.1 upgrade is the load-bearing finding; the per-scenario point estimates are not.
  2. The §8.1 finding is now scenario-distributional, not cross-scenario-mean. The Opus 4.7 → 4.8 native cross-scenario delta is −0.82 over eleven scenarios (vs −5.25 reported on five). A reader must not quote either figure as "the" upgrade effect; what is reported is the distribution of per-scenario deltas. The previous report's headline regression claim is withdrawn in light of the wider sample.
  3. Mixed routing and recency confound (central). The cross-scenario leaderboard is not a controlled capability ranking. The top entries are newest-release flagships served first-party; the trailing entries are older flagships served through Bedrock. Model identity, recency, and routing move together down the table, and a single trial cannot separate them (§8.2).
  4. Scores are harness-conditioned. Every ranked entry runs through SWE-agent at its default operating point. §8.3 shows native-agent averages sitting +4.77 to +10.14 points above the SWE-agent averages across eleven scenarios, with Opus-pair per-scenario deltas on the six new scenarios spanning −11.75 (helios) to +21.25 (aurora_transit). The absolute numbers are model-under-SWE-agent, not model capability in the abstract.
  5. Uneven scenario coverage. Kimi K2.6 produced a valid submission on only one of twelve scenarios (halberon), so its cross-scenario average is a single-scenario figure not comparable to the 12/12 entries; Gemini 3.5 Flash produced a valid submission on three of twelve (helios, helix_marine, kitchen_supply_co); GLM-5.1 (11/12), Meta (8/12), and Nova Pro (8/12) are partial. The cross-scenario average is computed only over each model's valid submissions, so models with thin coverage carry more single-trial risk.
  6. One incomplete flagship, one tier substitution. Alibaba's current flagship (Qwen 3.6) was not attempted — no first-party key — so the Qwen row carries the one-version-behind Qwen3-Coder-480B. Amazon's flagship-tier Nova Premier produced no submission, so the Amazon row carries Nova Pro. Both rows understate their provider's marketed flagship.
  7. Cross-report sourcing assumes stable builds. The first-party rows on the six new scenarios were run for this report on the fix-swe-opus4-harness branch (DQ-6 cohort, 2026-06-02); the original five rows are carried from the 2026-05-29 standardized rerun on the same branch; the Bedrock rows are carried from earlier sweeps. All are assumed to share the same frozen builds — the question packs and metric kinds match — but a build drift across the runs cannot be fully excluded.
  8. Twelve corpora, not a population. The suite spans twelve synthetic corpora and twelve question packs, each under a distinct fictional cover. The wider sample reframes the §8.1 finding, but twelve scenarios remain a small sample and generalisation beyond this suite is not resolved here.
  9. Bedrock and native pricing not standardised. Token counts are reported from the LiteLLM meter; the USD figure depends on provider list prices at the time of the run and was not snapshot-locked. Cross-provider and SWE-vs-native USD comparisons are indicative, not exact.
  10. Persona calibration externally illegible. The six-persona stress test (§5.3) is a defensible internal threshold-tuning protocol, but no external reviewer can independently audit it without re-running the builds.
  11. No abstention or confidence instrumentation. Agents are not asked to flag low-confidence answers or to abstain; calibration plots cannot be drawn.
  12. Closed metric set, English-only, tabular-only. Seven metric kinds are allowed by construction; all stakeholder prose, schema names, and entity strings are in English; schemas are tabular with no multi-modal content. A high-token native run and a low-token SWE-agent run that emit the same JSON receive the same score; token-efficiency is reported separately (§7.5, §8.3) but is not part of the aggregate.

10. Broader Impacts and Ethics

rl-gym is built entirely from synthetic data with fictional cover; no real personal information is present, no real organisations are named, and no scraped human-authored content is graded. The source datasets (Kaggle FMCG sales, a public time-series corpus, Kaggle Recruit Restaurant Visitor Forecasting, Kaggle Polymarket, Kaggle ASHRAE Great Energy Predictor III, Kaggle JPX, Kaggle M5, Kaggle ride-bookings, Kaggle Zillow Prize, Kaggle LinkedIn Job Postings, Kaggle Instacart Online Grocery) are public and licensed for research use; rl-gym uses each as a procedural seed rather than as content. The clinical-trial framing of Brivane in particular is fictional and carries no real health data.

Three responsible-use considerations apply.

Over-reading provider rankings and version upgrades. The cross-scenario leaderboard ranks provider flagships on twelve benchmark instances, under one harness, in a single trial per cell, and it mixes first-party and Bedrock routing (§8.2). It is not a general "which provider is best" claim. Two reading rules apply: behind Fable 5's lead, the first-party models from Opus 4.8 down to GLM-5.1 are a cluster, not a ranked podium; and the first-party/Bedrock split confounds routing with model recency. The Opus 4.7 → 4.8 result (§8.1) must never be quoted as "4.8 is better" or "4.8 is worse" without the harness and the per-scenario distribution attached. The previous report's cross-scenario regression headline (−5.25) does not survive the wider sample and is withdrawn.

Documentation of the brierdale outlier. Brierdale's −23.0 native delta remains the largest single negative anomaly in the suite, and the reconcile pathology mechanism (§8.1) is corroborated by trajectory reads on three additional scenarios where smaller negative deltas appear. But the magnitude is brierdale-specific (driven in part by the package's decoy example file functioning as an anchor for the reconcile) and does not generalise. The finding should be read as a specific, mechanistically-characterised anomaly, not a verdict on the weights.

Data and scoring privacy. Each scenario ships with: a canary string in the candidate-facing instructions to support future contamination audits; the rubric and canonical answers kept private; obstacles never enumerated to the model; and the construction seed frozen so that the reported results remain reproducible.

11. Conclusion and Next Steps

Across twelve frozen data-quality scenarios, the rl-gym suite places each major provider's current flagship on one cross-scenario leaderboard under the SWE-agent harness, and reframes the previous five-scenario report's headline finding. The sign of the Opus 4.7 → 4.8 upgrade under native Claude Code is now characterised as a scenario-dependent distribution rather than a cross-scenario regression. Over the eleven scenarios of the Opus-pair analysis the native delta averages −0.82 (Opus 4.8 50.00 vs Opus 4.7 50.82), essentially tied, with a per-scenario distribution spanning −23.0 (brierdale) to +14.75 (aurora_transit), 5 positive / 5 negative / 1 flat. Brierdale is the dominant negative outlier, mechanistically explained by a reconcile pathology compounded with a decoy anchor file in the candidate package; the magnitude does not generalise. Under SWE-agent the upgrade is mildly positive on average (+2.16 over the same eleven scenarios, +2.11 over all twelve). Opus 4.8 leads the rest of the cohort, but Fable 5, run on all twelve scenarios, tops the cross-scenario board at 57.83.

Behind the headline the cross-scenario pattern holds. Below Fable 5, a first-party band (Opus 4.8 46.67, Opus 4.7 44.56, DeepSeek V4-Pro 42.90, GPT-5.5 42.46, GLM-5.1 40.70) clusters inside six points, well above a mid band (MiniMax M2.7 29.40, Qwen3-Coder-480B 22.27, Mistral Large 3 19.96) and a join-failure floor (Meta 1.41, Nova Pro 1.22); Kimi K2.6 (1/12) and Gemini 3.5 Flash (3/12) are thin-coverage entries. No single model wins every scenario: Fable 5 tops eight boards, Opus 4.8 tops three (helios, helix_marine, elo), and DeepSeek V4-Pro tops pellichar, so even the strongest model loses a third of the scenarios to a rival. DeepSeek V4-Pro remains the cost-quality frontier point of the entire suite: top-band at roughly 25× lower per-run cost than the Western frontier (§7.5). The most important structural caveats are that the leaderboard mixes first-party and Bedrock routing (§8.2), and that absolute scores, and now the per-scenario distribution of a generation upgrade, are conditioned on the harness.

The harness-sensitivity picture has also gained nuance. On most scenarios native agents beat SWE-agent (Opus 4.7 +7.75 avg, GPT-5.5 +10.14 avg, Opus 4.8 +4.77 avg). But on helios SWE-agent Opus 4.8 (58.00) out-scores its own native Claude Code run (46.25), and on crosswind SWE-agent GLM-5.1 (42.75) out-scores the best non-Fable native run (40.75). The native lift therefore no longer holds uniformly across the suite.

Next steps:

  1. Run multi-seed trials. Convert single-trial positions — especially the §8.1 per-scenario deltas, the top-band cluster, and the per-scenario top finishes — into statistically-supported findings. With s.d. ~10 across the eleven per-scenario native deltas, a multi-seed sweep is the highest-priority follow-up.
  2. Probe the brierdale anomaly directly. Ablate the decoy example file and, separately, the self-verification Workflow on brierdale to confirm which is the proximate driver of the −23 collapse, and whether a "distrust the example / skip the reconcile" instruction restores the max-effort score to xhigh levels.
  3. Close the routing confound. Re-run the Bedrock-hosted flagships first-party (or the whole cohort through one route) so the leaderboard becomes a clean within-route comparison.
  4. Extend coverage on the thin entries. Re-run Kimi K2.6 on the remaining ten scenarios with a step-budget or context-budget cap that prevents the stamina loop; retry Gemini 3.5 Flash with a reasoning-budget cap to convert its 3/11 into a fuller picture.
  5. Probe Opus 4.8 xhigh on the six new scenarios. The xhigh effort control is only on the original five. Extending it to the new six would test whether the "more 4.8 effort is not better" finding holds across the wider suite or is specific to brierdale's anchor structure.

The suite is best read as a set of controlled benchmark instances: twelve fixed corpora, twelve sets of hidden answers, deterministic rubrics, and a cross-scenario leaderboard whose absolute level, and whose per-scenario distribution of a model upgrade, is conditioned on the harness.

Appendix A. The Twelve Per-Scenario Leaderboards

Tier counts: E+ beyond_excellent, E excellent, g2e good_to_excellent, G good, n2g naive_to_good, N naive, W worse_than_naive. g+ is the good-or-better count. All entries are SWE-agent unless noted. Cost is the LiteLLM/Bedrock meter's billed figure in USD; wall is mm:ss.
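As an illustration of how the g+ column relates to the tier labels, the sketch below tallies a list of per-question tiers and counts those at good or better. It assumes the tier order given in the legend and that good_to_excellent counts toward g+. The example tier list is made up for illustration and is not taken from any run.

```python
from collections import Counter

# Tier abbreviations from the legend, ordered worst to best.
TIER_ORDER = ["W", "N", "n2g", "G", "g2e", "E", "E+"]
# Assumption: every tier from G upward counts toward g+.
GOOD_OR_BETTER = set(TIER_ORDER[TIER_ORDER.index("G"):])

def tier_summary(tiers):
    """Return per-tier counts and the g+ (good-or-better) count."""
    unknown = set(tiers) - set(TIER_ORDER)
    if unknown:
        raise ValueError(f"unknown tier labels: {sorted(unknown)}")
    counts = Counter(tiers)
    g_plus = sum(counts[t] for t in GOOD_OR_BETTER)
    return {t: counts.get(t, 0) for t in TIER_ORDER}, g_plus

# Hypothetical twenty-question submission, for illustration only.
example = (["E+"] * 3 + ["E"] * 5 + ["g2e"] * 2 + ["G"] * 2
           + ["n2g"] * 3 + ["N"] * 2 + ["W"] * 3)
counts, g_plus = tier_summary(example)
print(counts, "g+ =", g_plus)
```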

A.1 Brierdale (FMCG seed; 5 tables; 39 obstacles)

Rank  Provider   Model             Route  Agg %    g+  Wall   Cost ($)
1     Anthropic  Fable 5           FP     60.50    -   14:48  -
2     Anthropic  Opus 4.8          FP     59.00    15  08:03  1.96
3     OpenAI     GPT-5.5           FP     55.50    13  13:04  2.13
4     z.ai       GLM-5.1           FP     50.50    12  12:33  0.29
5     Anthropic  Opus 4.7          FP     50.25    12  04:58  1.10
6     DeepSeek   V4-Pro            FP     48.00    11  12:13  0.05
7     MiniMax    M2.7              FP     44.50    10  07:43  0.06
8     Alibaba    Qwen3-Coder-480B  BR     38.75    10  13:26  0.52
9     Mistral    Large 3           BR     27.75    4   05:04  0.71
10    Meta       Llama 4 Maverick  BR     0.00     0   00:54  0.02
-     Moonshot   Kimi K2.6         FP     no sub   -   01:35  -
-     Amazon     Nova Pro          BR     no sub   -   04:49  0.97
-     Google     Gemini 3.5 Flash  FP     no sub   -   01:29  -

Fable 5 now leads the SWE-agent board at 60.5 and also posts the top native run (74.75 under Claude Code). Best non-Fable: 59.00 (Opus 4.8); median 48.00. Next-best native 63.00 (Opus 4.7 CC max). The single largest native Opus 4.7 → 4.8 regression in the suite (−23.0); see §8.1.

A.2 Brindholm (time-series seed; 9 tables; 40 obstacles)

Rank  Provider   Model             Route  Agg %    g+  Wall   Cost ($)
1     Anthropic  Fable 5           FP     51.00    -   14:07  -
2     OpenAI     GPT-5.5           FP     47.50    11  07:39  1.20
3     Anthropic  Opus 4.7          FP     47.50    10  07:49  1.85
4     Anthropic  Opus 4.8          FP     46.00    10  15:05  4.15
5     DeepSeek   V4-Pro            FP     43.50    10  12:37  0.05
6     z.ai       GLM-5.1           FP     43.00    10  15:27  0.30
7     MiniMax    M2.7              FP     15.50    3   10:01  0.03
8     Alibaba    Qwen3-Coder-480B  BR     13.00    3   32:42  1.06
9     Mistral    Large 3           BR     9.75     2   05:50  0.65
10    Amazon     Nova Pro          BR     0.75     0   09:48  0.58
-     Meta       Llama 4 Maverick  BR     no sub   -   00:57  0.02
-     Google     Gemini 3.5 Flash  FP     no sub   -   01:27  -
-     Moonshot   Kimi K2.6         FP     not run  -   -      -

Fable 5 now leads at 51.0 (native 51.0 under Claude Code). Best non-Fable: 47.50 (GPT-5.5 / Opus 4.7 tie); median 43.00. Native best 76.50 (Opus 4.7 CC max) — the highest single native cell anywhere in the suite. Sharply bimodal tier mix: top cluster heavy with beyond_excellent over a wide worse_than_naive floor (the suite's widest shared failure floor), with almost no middle band.

A.3 Brivane (Recruit Restaurant seed; 7 tables; 25 obstacles)

Rank  Provider   Model             Route  Agg %    g+  Wall   Cost ($)
1     Anthropic  Fable 5           FP     38.00    -   12:47  -
2     Anthropic  Opus 4.8          FP     30.75    6   04:35  0.83
3     Anthropic  Opus 4.7          FP     26.00    4   04:14  0.90
4     DeepSeek   V4-Pro            FP     25.50    5   15:48  0.07
5     OpenAI     GPT-5.5           FP     23.00    4   11:26  1.01
6     z.ai       GLM-5.1           FP     20.75    4   10:38  0.24
7     Alibaba    Qwen3-Coder-480B  BR     16.25    3   12:41  0.40
8     MiniMax    M2.7              FP     6.00     0   10:37  0.05
9     Mistral    Large 3           BR     1.50     0   02:23  0.16
10    Amazon     Nova Pro          BR     0.75     0   01:25  0.16
11    Meta       Llama 4 Maverick  BR     0.00     0   01:20  0.06
-     Moonshot   Kimi K2.6         FP     no sub   -   01:37  -
-     Google     Gemini 3.5 Flash  FP     no sub   -   01:29  -

Fable 5 now leads at 38.0 (native 37.0 under Claude Code). Best non-Fable: 30.75 (Opus 4.8); median 18.50. Native best 38.00 (Opus 4.7 CC max). The hardest scenario in the suite: a pure cross-network-bridge reconciliation test on which no non-Fable SWE-agent run clears a third of the points, with a near-bimodal excellent/worse tier mix.

A.4 Halberon (Polymarket seed; 6 tables; 30 obstacles)

Rank  Provider   Model             Route  Agg %    g+  Wall   Cost ($)
1     Anthropic  Fable 5           FP     44.25    -   18:51  -
2     Anthropic  Opus 4.8          FP     43.50    9   07:28  1.67
3     OpenAI     GPT-5.5           FP     39.50    9   06:58  1.45
4     DeepSeek   V4-Pro            FP     38.25    7   11:43  0.05
5     Anthropic  Opus 4.7          FP     33.75    9   05:32  1.21
6     z.ai       GLM-5.1           FP     31.75    7   25:16  0.32
7     Moonshot   Kimi K2.6         FP     26.75    7   46:09  0.65
8     MiniMax    M2.7              FP     16.50    3   08:22  0.04
9     Mistral    Large 3           BR     15.75    2   09:44  1.17
10    Alibaba    Qwen3-Coder-480B  BR     14.25    2   82:34  0.52
11    Meta       Llama 4 Maverick  BR     1.50     0   02:27  0.01
-     Amazon     Nova Pro          BR     no sub   -   00:59  0.06
-     Google     Gemini 3.5 Flash  FP     no sub   -   35:06  6.86

Fable 5 now leads the SWE-agent board at 44.25 and also posts the top native run (57.25 under Claude Code). Best non-Fable: 43.50 (Opus 4.8); median 29.25. Next-best native 57.00 (GPT-5.5 Codex). The only scenario in which Kimi K2.6 produced a valid submission.

A.5 Pellichar (ASHRAE seed; 7 tables; ~dozen obstacle patterns)

Rank  Provider   Model             Route  Agg %    g+  Wall   Cost ($)
1     DeepSeek   V4-Pro            FP     52.50    11  23:46  0.06
2     z.ai       GLM-5.1           FP     48.50    11  19:43  0.23
3     Anthropic  Fable 5           FP     46.75    -   25:07  -
4     Anthropic  Opus 4.8          FP     45.25    10  11:43  1.64
5     OpenAI     GPT-5.5           FP     44.25    9   23:15  1.49
6     Anthropic  Opus 4.7          FP     44.00    9   13:22  1.24
7     MiniMax    M2.7              FP     37.75    7   08:57  0.04
8     Alibaba    Qwen3-Coder-480B  BR     35.25    6   54:44  0.61
9     Mistral    Large 3           BR     30.00    6   11:48  1.12
10    Meta       Llama 4 Maverick  BR     1.50     0   00:46  0.01
11    Amazon     Nova Pro          BR     0.00     0   01:51  0.28
-     Moonshot   Kimi K2.6         FP     no sub   -   13:18  0.05
-     Google     Gemini 3.5 Flash  FP     not run  -   -      -

Top 52.50 (DeepSeek V4-Pro, the single highest SWE-agent score on the original five); median 40.88. Native best 48.50 (GPT-5.5 Codex). The only one of the original five in which the best native run sits below the top SWE-agent run.

A.6 Aurora Transit (Kaggle ride-bookings; 16 tables; 80 obstacles)

Rank  Provider   Model             Route  Agg %    Wall
1     Anthropic  Fable 5           FP     80.75    23:15
2     z.ai       GLM-5.1           FP     52.25    29:10
3     DeepSeek   V4-Pro            FP     49.00    16:55
4     Anthropic  Opus 4.8          FP     49.00    08:05
5     OpenAI     GPT-5.5           FP     46.00    07:03
6     Anthropic  Opus 4.7          FP     43.25    07:12
7     MiniMax    M2.7              FP     30.75    10:01
8     Alibaba    Qwen3-Coder-480B  BR     20.50    04:02
9     Mistral    Large 3           BR     16.00    05:48
10    Amazon     Nova Pro          BR     2.25     30:46
-     Moonshot   Kimi K2.6         FP     no sub   -
-     Meta       Llama 4 Maverick  BR     no sub   -
-     Google     Gemini 3.5 Flash  FP     no sub   -

Fable 5 now leads the SWE-agent board at 80.75 and also posts the top native run (73.25 under Claude Code). Best non-Fable: 52.25 (GLM-5.1); next-best native 70.25 (Opus 4.8 CC max), the highest Opus 4.8 cell and the largest single native-vs-SWE-agent lift (+21.25) among the newly added scenarios. Sixteen-table corpus with 80 seeded obstacles, the largest table count and obstacle count in the suite. Question trio q05/q11/q19 (flagship-district rush-hour metrics) defeats every scored SWE-agent run.

A.7 Crosswind (Kaggle M5 Forecasting (Walmart); 6 tables)

Rank  Provider   Model             Route  Agg %    Wall
1     Anthropic  Fable 5           FP     76.50    14:59
2     z.ai       GLM-5.1           FP     42.75    09:48
3     Anthropic  Opus 4.7          FP     36.50    07:39
4     OpenAI     GPT-5.5           FP     31.00    08:10
5     DeepSeek   V4-Pro            FP     30.25    10:19
6     MiniMax    M2.7              FP     30.25    09:57
7     Anthropic  Opus 4.8          FP     22.75    05:35
8     Mistral    Large 3           BR     15.50    03:37
9     Alibaba    Qwen3-Coder-480B  BR     15.50    08:13
10    Meta       Llama 4 Maverick  BR     0.75     00:55
11    Amazon     Nova Pro          BR     0.00     11:31
-     Moonshot   Kimi K2.6         FP     no sub   -
-     Google     Gemini 3.5 Flash  FP     no sub   -

Fable 5 now leads the SWE-agent board at 76.5 and also posts the top native run (81.5 under Claude Code). Best non-Fable: 42.75 (GLM-5.1). Next-best native 40.75 (Opus 4.8 CC max); the only scenario where the top SWE-agent score exceeds the top native score across all three native conditions. The non-Fable cohort spans only 42.75 points, and a three-question hard core (q12, q13, q19) lands at worse_than_naive for every scored run, native and SWE-agent alike.

A.8 Helios (JPX Tokyo Stock Exchange seed; 5 tables)

Rank  Provider   Model             Route  Agg %    Wall
1     Anthropic  Opus 4.8          FP     58.00    04:54
2     Anthropic  Fable 5           FP     52.75    14:55
3     Anthropic  Opus 4.7          FP     43.75    08:56
4     Google     Gemini 3.5 Flash  FP     43.50    24:04
5     DeepSeek   V4-Pro            FP     36.00    15:33
6     OpenAI     GPT-5.5           FP     24.50    06:27
7     z.ai       GLM-5.1           FP     20.50    10:12
8     Mistral    Large 3           BR     17.50    06:19
9     Alibaba    Qwen3-Coder-480B  BR     17.50    10:29
10    MiniMax    M2.7              FP     13.75    03:46
11    Meta       Llama 4 Maverick  BR     3.75     00:54
12    Amazon     Nova Pro          BR     3.75     04:01
-     Moonshot   Kimi K2.6         FP     no sub   -

Top 58.00 (Opus 4.8). Native best 46.25 (Opus 4.8 CC max) — the only scenario where SWE-agent on the same model exceeds the native run (−11.75 native lift). Five-question hard core (q01, q02, q03, q13, q18 — per-fuel / per-board / per-tier / monthly dispatch-energy roll-ups) at worse_than_naive for every one of 14 scored runs (SWE-agent + native), driven by a single per-asset multiplicative meter-scale quantisation that no configuration recovers.

A.9 Helix Marine (Kaggle Zillow Prize; 8 tables)

Rank  Provider   Model             Route  Agg %    Wall
1     Anthropic  Opus 4.8          FP     55.00    05:30
2     OpenAI     GPT-5.5           FP     50.75    07:48
3     Anthropic  Opus 4.7          FP     48.75    05:05
4     Google     Gemini 3.5 Flash  FP     47.00    18:47
5     Anthropic  Fable 5           FP     46.50    14:52
6     z.ai       GLM-5.1           FP     42.75    19:29
7     MiniMax    M2.7              FP     40.00    03:14
8     DeepSeek   V4-Pro            FP     39.50    06:57
9     Mistral    Large 3           BR     39.00    03:07
10    Alibaba    Qwen3-Coder-480B  BR     33.50    03:10
11    Meta       Llama 4 Maverick  BR     0.00     00:52
12    Amazon     Nova Pro          BR     0.00     01:12
-     Moonshot   Kimi K2.6         FP     no sub   -

Top 55.00 (Opus 4.8); native best 58.25 (Opus 4.8 CC max). Tight top — four models within 8 points (Opus 4.8 55, GPT-5.5 50.75, Opus 4.7 48.75, Gemini 3.5 Flash 47). Five-question hard core (q02, q03, q04, q05, q13) naive-or-worse_than_naive for every scored SWE-agent run.

A.10 Jobpost Analytics Co. (Kaggle LinkedIn Job Postings; 12 tables; 60 obstacles)

Rank  Provider   Model             Route  Agg %    Wall
1     Anthropic  Fable 5           FP     63.25    10:51
2     MiniMax    M2.7              FP     55.00    03:53
3     Anthropic  Opus 4.7          FP     54.25    04:59
4     DeepSeek   V4-Pro            FP     52.50    21:42
5     OpenAI     GPT-5.5           FP     51.50    05:40
6     Anthropic  Opus 4.8          FP     48.50    04:12
7     z.ai       GLM-5.1           FP     48.50    10:00
8     Alibaba    Qwen3-Coder-480B  BR     17.25    02:03
9     Mistral    Large 3           BR     17.00    02:29
10    Amazon     Nova Pro          BR     2.25     02:04
-     Moonshot   Kimi K2.6         FP     no sub   -
-     Meta       Llama 4 Maverick  BR     no sub   -
-     Google     Gemini 3.5 Flash  FP     no sub   -

Fable 5 now leads the SWE-agent board at 63.25 (native 56.25 under Claude Code). Best non-Fable: 55.00 (MiniMax M2.7, the first scenario in the suite where a non-frontier provider leads the non-Fable SWE-agent board); native best 65.25 (GPT-5.5 Codex, the single highest cell on this scenario). Below Fable 5, a six-model band is packed into 6.5 points (55.00 → 48.50), followed by a cliff to a two-model tail (Qwen 17.25, Mistral 17.00) and a single-model floor (Nova Pro 2.25).

A.11 Kitchen Supply Co. (Kaggle Instacart Online Grocery; 13 tables)

Rank  Provider   Model             Route  Agg %    Wall
1     Anthropic  Fable 5           FP     74.75    13:21
2     z.ai       GLM-5.1           FP     46.50    10:24
3     Anthropic  Opus 4.7          FP     45.75    05:59
4     Google     Gemini 3.5 Flash  FP     41.25    35:42
5     DeepSeek   V4-Pro            FP     40.00    07:40
6     Anthropic  Opus 4.8          FP     39.75    04:04
7     OpenAI     GPT-5.5           FP     38.75    03:14
8     MiniMax    M2.7              FP     30.25    13:46
9     Alibaba    Qwen3-Coder-480B  BR     25.75    05:38
10    Mistral    Large 3           BR     15.25    03:41
11    Meta       Llama 4 Maverick  BR     3.75     00:47
-     Moonshot   Kimi K2.6         FP     no sub   -
-     Amazon     Nova Pro          BR     no sub   -

Fable 5 now leads the SWE-agent board at 74.75 and also posts the top native run (67.0 under Claude Code). Best non-Fable: 46.50 (GLM-5.1); next-best native 47.25 (GPT-5.5 Codex). Below Fable 5 this is the flattest leaderboard in the suite: six endpoints sit within 8 points (46.50 → 38.75) and none runs away. Ten questions are naive-or-worse_than_naive for every scored endpoint and all three native agents; a five-question spine (q01, q07, q14, q19, q20) is worse_than_naive for every run in the report.

A.12 Elo (road-freight LTL seed; 5 tables; 25 obstacles)

Rank  Provider   Model             Route  Agg %    Wall
1     Anthropic  Opus 4.8          FP     62.50    61:40
2     Anthropic  Opus 4.7          FP     61.00    07:32
3     DeepSeek   V4-Pro            FP     59.75    11:16
4     Anthropic  Fable 5           FP     59.00    12:03
5     OpenAI     GPT-5.5           FP     57.25    07:33
6     Mistral    Large 3           BR     34.50    05:40
7     MiniMax    M2.7              FP     32.50    08:43
8     Alibaba    Qwen3-Coder-480B  BR     19.75    04:57
-     z.ai       GLM-5.1           FP     no sub   -
-     Moonshot   Kimi K2.6         FP     no sub   -
-     Meta       Llama 4 Maverick  BR     no sub   -
-     Amazon     Nova Pro          BR     no sub   -
-     Google     Gemini 3.5 Flash  FP     no sub   -

Top 62.50 (Opus 4.8; an infra DNF in the first build re-ran cleanly); native best 66.50 (Fable 5 — Claude Code, ahead of Opus 4.8's 65.25). A tight four-way top — Opus 4.8, Opus 4.7, V4-Pro, and Fable 5 within 3.5 points. q11 (a per-hub roll-up) is worse_than_naive for every scored cell and all four natives. The closest Fable-versus-Opus match in the suite: neither model uses verification sub-agents, so the small gap is a single units-interpretation call, not a delegation pattern (§8.4).

Appendix B. Native-vs-SWE-agent per Scenario

For the three probed conditions the table gives the per-scenario aggregate under each harness, holding the model and the build fixed. Native Claude Code runs Opus 4.8 at max on all eleven scenarios (and xhigh on the original five only); Opus 4.7 at max on all eleven; GPT-5.5 runs under Codex on all eleven.

Scenario               Op4.8 SWE  Op4.8 CC max  Op4.7 SWE  Op4.7 CC max  GPT SWE  GPT Codex
Brierdale              59.00      40.00         50.25      63.00         55.50    58.00
Brindholm              46.00      71.00         47.50      76.50         47.50    70.50
Brivane                30.75      29.50         26.00      38.00         23.00    24.50
Halberon               43.50      47.75         33.75      41.75         39.50    57.00
Pellichar              45.25      42.50         44.00      37.75         44.25    48.50
Aurora Transit         49.00      70.25         43.25      55.50         46.00    59.00
Crosswind              22.75      40.75         36.50      40.00         31.00    40.00
Helios                 58.00      46.25         43.75      42.25         24.50    44.50
Helix Marine           55.00      58.25         48.75      56.25         50.75    49.25
Jobpost Analytics Co.  48.50      58.00         54.25      61.50         51.50    65.25
Kitchen Supply Co.     39.75      45.75         45.75      46.50         38.75    47.25
Average (11)           45.23      50.00         43.07      50.82         41.11    51.25

Reading the grid:

  - Opus 4.8 native lift (SWE → CC max): +4.77 cross-scenario (much wider than the +1.25 reported on five; the recovery is driven by aurora_transit +21.25 and by brierdale's −19 being only one of eleven cells).
  - Opus 4.7 native lift: +7.75 cross-scenario (narrower than +11.10 on five; attenuated by helios, where the native lift is negative, −1.50).
  - GPT-5.5 native lift: +10.14 cross-scenario (essentially unchanged from +9.75 on five).
  - Among the six new scenarios, two invert the native-lift sign for at least one model: helios (Opus 4.8 SWE 58.00 > native 46.25) and helix_marine (GPT-5.5 SWE 50.75 > native 49.25, a smaller 1.5-point SWE advantage).
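The averages and lifts quoted in this reading can be recomputed directly from the grid. The following sketch is illustrative: it hard-codes the six columns of the Appendix B table and derives each column average and each native lift. The dictionary layout and names are ours.

```python
from statistics import mean

# Columns of the Appendix B grid, rows in table order
# (Brierdale through Kitchen Supply Co.).
grid = {
    "Op4.8 SWE":    [59.00, 46.00, 30.75, 43.50, 45.25, 49.00, 22.75, 58.00, 55.00, 48.50, 39.75],
    "Op4.8 CC max": [40.00, 71.00, 29.50, 47.75, 42.50, 70.25, 40.75, 46.25, 58.25, 58.00, 45.75],
    "Op4.7 SWE":    [50.25, 47.50, 26.00, 33.75, 44.00, 43.25, 36.50, 43.75, 48.75, 54.25, 45.75],
    "Op4.7 CC max": [63.00, 76.50, 38.00, 41.75, 37.75, 55.50, 40.00, 42.25, 56.25, 61.50, 46.50],
    "GPT SWE":      [55.50, 47.50, 23.00, 39.50, 44.25, 46.00, 31.00, 24.50, 50.75, 51.50, 38.75],
    "GPT Codex":    [58.00, 70.50, 24.50, 57.00, 48.50, 59.00, 40.00, 44.50, 49.25, 65.25, 47.25],
}

# (SWE-agent column, native column) for each probed model.
pairs = {
    "Opus 4.8": ("Op4.8 SWE", "Op4.8 CC max"),
    "Opus 4.7": ("Op4.7 SWE", "Op4.7 CC max"),
    "GPT-5.5": ("GPT SWE", "GPT Codex"),
}

avg = {col: round(mean(vals), 2) for col, vals in grid.items()}
# Native lift: native average minus SWE-agent average, over the same eleven scenarios.
lift = {
    model: round(mean(grid[native]) - mean(grid[swe]), 2)
    for model, (swe, native) in pairs.items()
}

print(avg)
print(lift)
```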

Native cost per run ranges $22–75 (Claude Code) and ≈$1.1–1.8 (Codex), versus $0.02–4.15 under SWE-agent; the most expensive runs in the study are the native Claude Code baselines. One trial per cell; build-consistency caveat in §9.

Appendix C. Persona-Calibration Summary

The rubric thresholds for each scenario are anchored via the six-persona stress test of §5.3 (personas constructed by inverting the per-question grading metric, so each lands at its targeted anchor). Measured band centres for the two scenarios with published values from the original five:

Persona                                                Target band  Brierdale  Halberon
Literal-clean reference (canonical verbatim)           100 %        100.0 %    100.0 %
Naive baseline                                         10–20 %      14.2 %     15.0 %
No-hints-thorough baseline                             18–40 %      31.5 %     25.0 %
Excellent baseline (canonical pipeline on dirty data)  75–95 %      86.5 %     91.8 %
Stress-test pattern A                                  18–50 %      27.8 %     31.5 %
Stress-test pattern B                                  18–50 %      25.0 %     32.0 %

(Brindholm, Brivane, and Pellichar report their band centres as within-or-just-below their target bands without publishing every measured percentage; the literal-clean reference is 100.0 % in all five. Pellichar's published centres are naive 14.2 %, no-hints 33.0 %, excellent 94.2 %, stress A 26.2 %, stress B 27.0 %. The six new scenarios pass the same reachability gates and are recorded internally.)
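A quick way to confirm that the published centres sit inside their target bands is sketched below. The band limits and measured values are copied from the table and the parenthetical note above. The dictionary layout and the `in_band` helper are illustrative and are not the report's calibration code.

```python
# Target bands (percent) copied from the Appendix C table.
bands = {
    "naive": (10.0, 20.0),
    "no_hints": (18.0, 40.0),
    "excellent": (75.0, 95.0),
    "stress_a": (18.0, 50.0),
    "stress_b": (18.0, 50.0),
}

# Published band centres (percent) for the three scenarios that report them.
measured = {
    "brierdale": {"naive": 14.2, "no_hints": 31.5, "excellent": 86.5,
                  "stress_a": 27.8, "stress_b": 25.0},
    "halberon": {"naive": 15.0, "no_hints": 25.0, "excellent": 91.8,
                 "stress_a": 31.5, "stress_b": 32.0},
    "pellichar": {"naive": 14.2, "no_hints": 33.0, "excellent": 94.2,
                  "stress_a": 26.2, "stress_b": 27.0},
}

def in_band(value, band):
    """True when a measured centre lies inside its inclusive target band."""
    low, high = band
    return low <= value <= high

report = {
    scenario: {persona: in_band(value, bands[persona])
               for persona, value in centres.items()}
    for scenario, centres in measured.items()
}
all_ok = all(all(flags.values()) for flags in report.values())
print(report)
print("all centres in band:", all_ok)
```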

Every scenario passes the reachability gates: strict naive < good < excellent ordering on all twenty questions; naive-to-excellent spread ≥ 60 pp (brierdale-class spreads run 76–80 pp); no-hints-to-excellent spread ≥ 35 pp; and meaningful naive-to-excellent separation on at least fourteen of twenty questions. All eleven pass the dead-code / trap-test floor: a dumb-dirty no-cleanup baseline matches the canonical on at most two of twenty questions, and a comprehensive-canonicalisation submission scores below 40 % with at least half its questions at-or-below naive and zero bit-for-bit excellent matches.
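The four reachability gates can be expressed as a small check over per-question anchor scores. This is a sketch under stated assumptions, not the report's gate code: each anchor is assumed to be a list of per-question scores in percent, the two spreads are assumed to be differences of anchor means, and `min_separation_pp`, the threshold for a "meaningful" per-question separation, is a hypothetical parameter because the report does not state its value.

```python
from statistics import mean

def reachability_gates(naive, good, excellent, no_hints, min_separation_pp=20.0):
    """Evaluate the four gates on per-question scores (percent, one entry per question)."""
    n = len(naive)
    if not (len(good) == len(excellent) == len(no_hints) == n):
        raise ValueError("all anchors must cover the same questions")
    # Gate 1: strict naive < good < excellent ordering on every question.
    ordered = all(a < g < e for a, g, e in zip(naive, good, excellent))
    # Gates 2 and 3: spreads, assumed here to be differences of anchor means.
    spread = mean(excellent) - mean(naive)
    hint_spread = mean(excellent) - mean(no_hints)
    # Gate 4: number of questions with a meaningful naive-to-excellent gap.
    separated = sum((e - a) >= min_separation_pp for a, e in zip(naive, excellent))
    return {
        "strict_ordering": ordered,
        "naive_to_excellent_ge_60pp": spread >= 60.0,
        "no_hints_to_excellent_ge_35pp": hint_spread >= 35.0,
        "separation_on_14_questions": separated >= 14,
    }

# Synthetic twenty-question anchors, made up for illustration.
gates = reachability_gates(
    naive=[15.0] * 20, good=[55.0] * 20, excellent=[90.0] * 20, no_hints=[30.0] * 20
)
print(gates)
```

Returning the gates individually, not as a single pass/fail flag, makes it easier to see which condition a candidate scenario misses.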

References


End of rl-gym combined technical report: Data-Quality Track, twelve scenarios, SWE-agent harness cohort with native baselines.