Asapi

Robust Modeling · messy evals

Measuring model quality when the data shifts.

Eleven business ML scenarios test whether agents can build models that hold up on messy, inconsistent, real-world data — before fragile assumptions show up as business-PnL loss.

11
scenarios
4
task types

How Robust Modeling works

Each scenario hands an agent a messy multi-table training set and a held-out eval set that the agent never gets to inspect ahead of time. The agent must train a model in one session and persist whatever inference code it needs; a second session then loads the artefacts and produces a per-row submission on the eval data. Scoring compares the result against a gold run of the same scenario whose data is left clean.
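The two-session contract can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the `model.json` file name, the `row_id` and `prediction` fields, and the mean predictor are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def train_and_persist(rows, artefact_dir):
    """Session 1: fit a trivial mean predictor and save it to disk."""
    mean = sum(r["y"] for r in rows) / len(rows)
    path = Path(artefact_dir) / "model.json"  # hypothetical artefact name
    path.write_text(json.dumps({"kind": "mean", "value": mean}))
    return path


def predict(eval_rows, artefact_dir):
    """Session 2: rebuild the model from disk only; one prediction per eval row."""
    model = json.loads((Path(artefact_dir) / "model.json").read_text())
    return [{"row_id": r["row_id"], "prediction": model["value"]} for r in eval_rows]


with tempfile.TemporaryDirectory() as d:
    train_and_persist([{"y": 1.0}, {"y": 3.0}], d)
    # Nothing from the training session survives except the files in d.
    submission = predict([{"row_id": "a"}, {"row_id": "b"}], d)
```

The point of the split is that `predict` receives nothing from the training session except what was written to disk, so any preprocessing state the model depends on has to be persisted as well.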

Step 1

Train

Agent reads training data — possibly distorted by reversible obstacles — and persists a model + a predict.py.

Example

A categorical's codes can differ between training and inference — a model that doesn't reconcile them meets unfamiliar values exactly when it matters.
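One way to guard against this, sketched under assumptions: normalise labels before encoding and reserve a code for values never seen in training. The normalisation rule (trim and casefold) and the reserved code `0` are illustrative choices, not something the scenarios prescribe.

```python
def normalize(value):
    """Canonicalise a raw category label (assumed rule: trim + casefold)."""
    return str(value).strip().casefold()


def fit_vocab(train_values):
    """Map each normalised training label to an integer code; 0 is reserved for unknowns."""
    vocab = {}
    for v in train_values:
        key = normalize(v)
        if key not in vocab:
            vocab[key] = len(vocab) + 1
    return vocab


def encode(values, vocab):
    """Encode labels with the training vocabulary; unseen labels fall back to 0."""
    return [vocab.get(normalize(v), 0) for v in values]


train_labels = ["Gold", "silver ", "GOLD", "bronze"]
vocab = fit_vocab(train_labels)

# Inference-time labels arrive with different casing and one unseen value.
codes = encode(["gold", "Silver", "platinum"], vocab)
```

The vocabulary has to be persisted alongside the model so the inference session encodes with the same mapping.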

Step 2

Predict

A fresh sandbox loads the agent's persisted artefacts and runs predict.py against the eval data.

Example

A signal that looks strong during training can be unavailable or meaningless at inference — a model that leans on it degrades.
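A defensive pattern for this case, as a rough sketch: record a fallback for every feature at training time and substitute it whenever the feature is absent at inference. The feature names and the choice of the training median as fallback are assumptions for illustration; a feature that is present but meaningless needs a different check.

```python
from statistics import median


def fit_fallbacks(train_rows, features):
    """Record each feature's training median, ignoring missing values."""
    return {
        f: median(r[f] for r in train_rows if r.get(f) is not None)
        for f in features
    }


def safe_features(row, features, fallbacks):
    """Build a feature vector, replacing absent values with the training fallback."""
    vector = []
    for f in features:
        value = row.get(f)
        vector.append(fallbacks[f] if value is None else value)
    return vector


train_rows = [{"a": 1, "b": 10}, {"a": 3, "b": None}, {"a": 5, "b": 30}]
features = ["a", "b"]
fallbacks = fit_fallbacks(train_rows, features)

# At inference, "a" is null and "b" is not provided at all.
vector = safe_features({"a": None}, features, fallbacks)
```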

Step 3

Score vs gold

The submission is scored on business PnL: the dollar value of the agent's predictions under the scenario's cost model, restricted to the rows the evaluator deems scorable.

Example

Each scenario scores only the rows where the business outcome is well-defined, under a cost model that reflects what errors actually cost.
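For a binary task, scoring of this kind might look like the sketch below. The payoff values and the `scorable` mask are hypothetical; each scenario defines its own cost model, which is not reproduced here.

```python
def business_pnl(preds, labels, scorable, gain_tp=100.0, cost_fp=20.0, cost_fn=50.0):
    """Sum per-row dollar outcomes over scorable rows only (assumed payoffs)."""
    total = 0.0
    for pred, label, ok in zip(preds, labels, scorable):
        if not ok:
            continue  # outcome not well-defined for this row, so it is excluded
        if pred == 1 and label == 1:
            total += gain_tp   # acted correctly
        elif pred == 1 and label == 0:
            total -= cost_fp   # acted needlessly
        elif pred == 0 and label == 1:
            total -= cost_fn   # missed a case that mattered
        # a true negative earns and costs nothing in this toy model
    return total


pnl = business_pnl(
    preds=[1, 1, 0, 0, 1],
    labels=[1, 0, 1, 0, 1],
    scorable=[True, True, True, True, False],
)
```

Because errors are priced asymmetrically, a model with higher accuracy can still post a lower PnL than one that makes fewer of the expensive mistakes.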

Step 4

Compare

Gold-vs-default delta. The wider the gap, the bigger the bite from obstacles the agent failed to catch and undo.

Example

Gold PnL -$2.36K → default -$37.15K: a ~16× larger loss, from a data issue the agent didn't catch.

Scenario Benchmarks

Default figures are from Claude Code · Fable 5 · max effort on the default (messy, with reversible data obstacles) condition of each scenario, scored on business PnL. Gold is the clean-data ceiling — the best result achievable with no obstacles present — shown as a model-agnostic reference.

| Scenario | Task | Gold PnL | Default PnL | Δ PnL | Δ % |
| --- | --- | --- | --- | --- | --- |
| FleetSignal | binary | +$56.83M | -$92.32M | -$149.15M | -262% |
| Depot Throughput | time-series | -$2.00K | -$2.00K | $0 | 0% |
| Patient Readmission Risk | binary | +$16.30M | -$214.94M | -$231.24M | -1419% |
| Workforce Training Completion | binary | -$104.28 | -$238.00 | -$133.72 | -128% |
| Workforce Competency Score | regression | -$766.21 | -$1.02K | -$255.79 | -33% |
| Collections Routing | multiclass | -$35.92M | -$35.92M | $0 | 0% |
| Commodity Trade P&L | regression | +$2.98B | +$2.32B | -$660.68M | -22% |
| Event Ticket Demand Tier | multiclass | +$9.60K | +$6.54K | -$3.06K | -32% |
| High Demand Day | binary | +$27.80M | +$27.30M | -$0.50M | -2% |
| PnL Band Tier | multiclass | +$47.36M | +$20.55M | -$26.81M | -57% |
| Commodity Throughput | regression | +$3.18B | +$2.54B | -$645.95M | -20% |

Δ PnL is Fable's default (messy) PnL minus the clean ceiling, and Δ % is that difference relative to the ceiling's magnitude; a negative value means the obstacles cost PnL. Where Fable's default matched or beat the clean reference (Depot Throughput, Collections Routing), the ceiling is shown equal to the default (Δ 0), since no separate clean-data run was made for those. Where the ceiling PnL is small, or the default flips sign relative to it (Patient Readmission Risk), the percentage runs large; read it as directional. PnL units differ with each scenario's cost model, so compare within a row, not across rows.
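The two delta columns follow directly from that definition. A small sketch, using the FleetSignal and Patient Readmission Risk figures from the table (in millions of dollars); the function name and the handling of a zero ceiling are assumptions:

```python
def pnl_delta(gold, default):
    """Return (Δ PnL, Δ %) where Δ % is relative to the magnitude of the gold PnL."""
    delta = default - gold
    # Assumed convention: the percentage is undefined when the ceiling is exactly zero.
    pct = None if gold == 0 else 100.0 * delta / abs(gold)
    return delta, pct


fleet_delta, fleet_pct = pnl_delta(gold=56.83, default=-92.32)
readmit_delta, readmit_pct = pnl_delta(gold=16.30, default=-214.94)

print(f"FleetSignal: {fleet_delta:+.2f}M, {fleet_pct:+.0f}%")
print(f"Patient Readmission Risk: {readmit_delta:+.2f}M, {readmit_pct:+.0f}%")
```

Dividing by the ceiling's magnitude is what makes the percentage blow up when the ceiling is small relative to the loss, as in the Patient Readmission Risk row.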

Dive deep into Scenarios

Condition Agent Variant Wall (mm:ss) Metric PnL

Interesting findings