Robust Modeling · messy evals
Measuring model quality when the data shifts.
Eleven business ML scenarios test whether agents can build models that hold up on messy, inconsistent, real-world data — before fragile assumptions show up as business-PnL loss.
How Robust Modeling works
Each scenario hands an agent a messy multi-table training set and a held-out eval set that the agent never gets to inspect ahead of time. The agent must train a model in one session and persist whatever inference code it needs; a second session then loads the artefacts and produces a per-row submission on the eval data. Scoring compares that submission against a gold run of the same scenario, whose data is left clean.
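The two-session flow can be sketched roughly as follows. This is a minimal illustration, not the benchmark's harness: the file name `model.json`, the column names `row_id` and `target`, and the mean-of-target "model" are all hypothetical stand-ins.

```python
import json
import tempfile
from pathlib import Path


def train_session(rows, artefact_dir):
    """Session 1: fit a trivial model and persist everything inference needs."""
    # Hypothetical "model": the mean of the target, standing in for a real fit.
    mean_target = sum(r["target"] for r in rows) / len(rows)
    artefact = {"mean_target": mean_target, "id_column": "row_id"}
    Path(artefact_dir, "model.json").write_text(json.dumps(artefact))


def predict_session(eval_rows, artefact_dir):
    """Session 2: a fresh process that sees only the persisted artefacts."""
    artefact = json.loads(Path(artefact_dir, "model.json").read_text())
    id_col = artefact["id_column"]
    # One prediction per eval row, keyed by row id, forms the submission.
    return [
        {"row_id": r[id_col], "prediction": artefact["mean_target"]}
        for r in eval_rows
    ]


with tempfile.TemporaryDirectory() as tmp:
    train_session([{"row_id": 1, "target": 2.0}, {"row_id": 2, "target": 4.0}], tmp)
    submission = predict_session([{"row_id": 10}, {"row_id": 11}], tmp)
```

The point of the split is that nothing held in memory during training survives into prediction, so any preprocessing state has to be written to disk alongside the model.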
Train
Agent reads training data — possibly distorted by reversible obstacles — and persists a model + a predict.py.
A categorical's codes can differ between training and inference — a model that doesn't reconcile them meets unfamiliar values exactly when it matters.
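One way to guard against mismatched categorical codes is to canonicalize values before encoding and to reserve a slot for anything unseen. The sketch below assumes the mismatch is limited to casing, whitespace, and missing values; the function names and the `__missing__` token are hypothetical, and real scenarios may need a richer mapping.

```python
def normalize(value):
    """Canonicalize a raw categorical code: trim, lowercase, treat None as missing."""
    if value is None:
        return "__missing__"
    return str(value).strip().lower()


def fit_vocabulary(train_values):
    """Map each canonical training category to an integer id; 0 is reserved for unknowns."""
    vocab = {}
    for v in train_values:
        key = normalize(v)
        if key not in vocab:
            vocab[key] = len(vocab) + 1
    return vocab


def encode(values, vocab):
    """Encode values with a fitted vocabulary; unseen categories map to 0 instead of raising."""
    return [vocab.get(normalize(v), 0) for v in values]


# Training sees "Gold", "silver " and a missing value; inference sees other spellings.
vocab = fit_vocabulary(["Gold", "silver ", "GOLD", None])
codes = encode(["gold", " Silver", "platinum", None], vocab)
# codes == [1, 2, 0, 3]: "platinum" was never seen in training, so it gets the unknown id.
```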
Predict
A fresh sandbox loads the agent's persisted artefacts and runs predict.py against the eval data.
A signal that looks strong during training can be unavailable or meaningless at inference — a model that leans on it degrades.
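A simple check of this kind could run inside predict.py, comparing how often each feature is missing at inference against the rate recorded during training. The sketch below is an assumption about how such a check might look; the feature names and the `max_gap` threshold are hypothetical.

```python
def missing_rate(rows, feature):
    """Fraction of rows where the feature is absent or None."""
    if not rows:
        return 1.0
    missing = sum(1 for r in rows if r.get(feature) is None)
    return missing / len(rows)


def unavailable_features(train_rows, infer_rows, features, max_gap=0.5):
    """Return features whose missing rate rises by more than max_gap at inference."""
    flagged = []
    for f in features:
        gap = missing_rate(infer_rows, f) - missing_rate(train_rows, f)
        if gap > max_gap:
            flagged.append(f)
    return flagged


# "settled_amount" is populated in training but absent at inference.
train = [{"age": 30, "settled_amount": 120.0}, {"age": 41, "settled_amount": 80.0}]
infer = [{"age": 35, "settled_amount": None}, {"age": 52}]
flagged = unavailable_features(train, infer, ["age", "settled_amount"])
```

A flagged feature is a cue to fall back to a model variant trained without it, rather than feeding the model imputed placeholders it never learned from.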
Score vs gold
Submission scored on a business PnL — the dollar value of the agent's predictions under the scenario's cost model — restricted to the rows the evaluator deems scorable.
Each scenario scores only the rows where the business outcome is well-defined, under a cost model that reflects what errors actually cost.
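As an illustration of scoring on scorable rows only, the sketch below sums a per-row payoff for a binary task. The payoff values and the `scorable` mask are hypothetical; each scenario defines its own cost model, which is not reproduced here.

```python
def business_pnl(y_true, y_pred, scorable, payoff):
    """Sum the payoff of each (truth, prediction) pair over scorable rows only."""
    total = 0.0
    for truth, pred, ok in zip(y_true, y_pred, scorable):
        if ok:
            total += payoff[(truth, pred)]
    return total


# Hypothetical binary payoffs in dollars, keyed by (truth, prediction).
payoff = {(1, 1): 100.0, (0, 1): -20.0, (1, 0): -50.0, (0, 0): 0.0}

y_true = [1, 0, 1, 0]
y_pred = [1, 1, 0, 0]
scorable = [True, True, True, False]  # the last row has no well-defined outcome
pnl = business_pnl(y_true, y_pred, scorable, payoff)
# pnl == 30.0: one hit (+100), one false alarm (-20), one miss (-50).
```

Because the payoffs are asymmetric, two models with the same accuracy can land on very different PnL, which is why the table reports dollars instead of an accuracy metric.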
Compare
Gold-vs-default delta. The wider the gap, the bigger the bite from obstacles the agent failed to catch and undo.
Gold PnL -$2.36K → default -$37.15K: a ~16× larger loss, from a data issue the agent didn't catch.
Scenario Benchmarks
Default figures are from Claude Code · Fable 5 · max effort on the default (messy, with reversible data obstacles) condition of each scenario, scored on business PnL. Gold is the clean-data ceiling — the best result achievable with no obstacles present — shown as a model-agnostic reference.
| Scenario | Task | Gold PnL | Default PnL | Δ PnL | Δ % |
|---|---|---|---|---|---|
| FleetSignal | binary | +$56.83M | -$92.32M | -$149.15M | -262% |
| Depot Throughput | time-series | -$2.00K | -$2.00K | $0 | 0% |
| Patient Readmission Risk | binary | +$16.30M | -$214.94M | -$231.24M | -1419% |
| Workforce Training Completion | binary | -$104.28 | -$238.00 | -$133.72 | -128% |
| Workforce Competency Score | regression | -$766.21 | -$1.02K | -$255.79 | -33% |
| Collections Routing | multiclass | -$35.92M | -$35.92M | $0 | 0% |
| Commodity Trade P&L | regression | +$2.98B | +$2.32B | -$660.68M | -22% |
| Event Ticket Demand Tier | multiclass | +$9.60K | +$6.54K | -$3.06K | -32% |
| High Demand Day | binary | +$27.80M | +$27.30M | -$0.50M | -2% |
| PnL Band Tier | multiclass | +$47.36M | +$20.55M | -$26.81M | -57% |
| Commodity Throughput | regression | +$3.18B | +$2.54B | -$645.95M | -20% |
Δ PnL is Fable's default (messy) PnL minus the clean ceiling, and Δ % is that difference relative to the ceiling's magnitude; a negative value means the obstacles cost PnL. Where Fable's default matched or beat the clean reference (Depot Throughput, Collections Routing), the ceiling is shown equal to the default (Δ 0), since no separate clean-data run was made for those. Where the ceiling PnL is small or the PnL flips sign between gold and default (Patient Readmission Risk), the percentage runs large; read it as directional. PnL units differ with each scenario's cost model, so compare within a row, not across rows.
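The two delta columns follow directly from that definition. A minimal sketch, with a hypothetical function name and values taken from the table:

```python
def pnl_delta(gold, default):
    """Return (delta, delta_pct): default minus gold, and that as a % of |gold|."""
    delta = default - gold
    # Percentage is undefined when the ceiling is exactly zero.
    delta_pct = None if gold == 0 else 100.0 * delta / abs(gold)
    return delta, delta_pct


# FleetSignal row, in $M: gold +56.83, default -92.32.
fleet_delta, fleet_pct = pnl_delta(56.83, -92.32)

# Workforce Training Completion row, in $: gold -104.28, default -238.00.
train_delta, train_pct = pnl_delta(-104.28, -238.00)
```

Dividing by the absolute value of the ceiling keeps the sign of Δ % aligned with Δ PnL even when the ceiling itself is negative, as in the Workforce Training Completion row.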
Dive deep into Scenarios
| Condition | Agent | Variant | Wall (mm:ss) | Metric | PnL |
|---|---|---|---|---|---|
