Robust Modeling · messy evals

Measuring model quality when the data shifts.

Eleven business ML scenarios test whether agents can build models that hold up on messy, inconsistent, real-world data — before fragile assumptions show up as business-PnL loss.

11 scenarios

4 task types

How Robust Modeling works

Each scenario hands an agent a messy multi-table training set and a held-out eval set that the agent never gets to inspect ahead of time. In one session the agent trains a model and persists it along with whatever inference code it needs; a second session then loads those artefacts and produces a per-row submission on the eval data. Scoring compares the result against a gold run of the same scenario in which the data is left clean.

Step 1

Train

Agent reads training data — possibly distorted by reversible obstacles — and persists a model + a predict.py.

Example

A categorical's codes can differ between training and inference — a model that doesn't reconcile them meets unfamiliar values exactly when it matters.
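A minimal sketch of one way to reconcile category codes, under stated assumptions: raw values are normalized to a canonical form before encoding, the mapping learned in training is reused at inference, and anything unseen goes to an explicit unknown bucket. The normalization rule (trim plus casefold), the `UNKNOWN` sentinel, and the function names are hypothetical; the scenarios may distort codes in ways this rule would not undo.

```python
UNKNOWN = -1  # hypothetical sentinel for values never seen in training


def canonical(value):
    """Normalize a raw category value; trim + casefold is an assumed rule."""
    return str(value).strip().casefold()


def fit_category_map(train_values):
    """Assign a stable integer code to each canonical category seen in training."""
    mapping = {}
    for value in train_values:
        mapping.setdefault(canonical(value), len(mapping))
    return mapping


def encode(values, mapping):
    """Encode with the training-time mapping; unseen values become UNKNOWN."""
    return [mapping.get(canonical(value), UNKNOWN) for value in values]


# Illustrative data only: the same categories spelled inconsistently.
train_values = ["Truck", "van", "VAN ", "truck"]
category_map = fit_category_map(train_values)

# At inference the spellings differ again, and one value is new.
encoded = encode(["TRUCK", " Van", "drone"], category_map)
print(encoded)
```

Persisting `category_map` alongside the model keeps the encoding identical across the two sessions, and the explicit unknown code makes unfamiliar values visible instead of silently colliding with a real category.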

Step 2

Predict

A fresh sandbox loads the agent's persisted artefacts and runs predict.py against the eval data.
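The two-session contract can be illustrated with a toy round trip. This is a sketch under assumptions, not the benchmark's actual interface: the artefact name `model.json`, the row fields, and the mean predictor standing in for a real model are all hypothetical. The point it shows is that the second function rebuilds everything from disk and shares no in-memory state with the first.

```python
import json
import tempfile
from pathlib import Path

ARTEFACT_NAME = "model.json"  # hypothetical artefact file name


def train_and_persist(train_rows, artefact_dir):
    """Session 1: fit a trivial mean predictor and write its state to disk."""
    mean_target = sum(row["y"] for row in train_rows) / len(train_rows)
    path = Path(artefact_dir) / ARTEFACT_NAME
    path.write_text(json.dumps({"mean_target": mean_target}))
    return path


def load_and_predict(eval_rows, artefact_dir):
    """Session 2: rebuild the predictor from disk only, one prediction per row."""
    state = json.loads((Path(artefact_dir) / ARTEFACT_NAME).read_text())
    return [
        {"row_id": row["row_id"], "prediction": state["mean_target"]}
        for row in eval_rows
    ]


with tempfile.TemporaryDirectory() as artefact_dir:
    train_and_persist([{"y": 2.0}, {"y": 4.0}], artefact_dir)
    submission = load_and_predict([{"row_id": 10}, {"row_id": 11}], artefact_dir)

print(submission)
```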

Example

A signal that looks strong during training can be unavailable or meaningless at inference — a model that leans on it degrades.
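One defensive habit this suggests, sketched here with hypothetical names and thresholds: before training, keep only features that are expected to exist at prediction time and that are populated often enough in training to be trusted. The `inference_columns` list is an assumption the agent would have to infer from the task description, since the eval data itself cannot be inspected, and the 0.5 fill-rate cutoff is arbitrary.

```python
def usable_features(train_rows, inference_columns, min_fill_rate=0.5):
    """Keep features available at inference and sufficiently populated in training."""
    available = set(inference_columns)
    candidates = sorted({key for row in train_rows for key in row})
    kept = []
    for column in candidates:
        if column not in available:
            continue  # assumed to be missing at prediction time
        filled = sum(1 for row in train_rows if row.get(column) is not None)
        if filled / len(train_rows) >= min_fill_rate:
            kept.append(column)
    return kept


# Illustrative rows: repair_cost is only known after the outcome,
# and region is mostly empty.
train_rows = [
    {"mileage": 120, "repair_cost": 900, "region": "north"},
    {"mileage": 80, "repair_cost": 0, "region": None},
    {"mileage": 95, "repair_cost": 450, "region": None},
]
features = usable_features(train_rows, inference_columns=["mileage", "region"])
print(features)
```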

Step 3

Score vs gold

Submission scored on a business PnL — the dollar value of the agent's predictions under the scenario's cost model — restricted to the rows the evaluator deems scorable.

Example

Each scenario scores only the rows where the business outcome is well-defined, under a cost model that reflects what errors actually cost.
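As a rough illustration of scoring on business PnL, the sketch below sums a per-decision dollar payoff over scorable rows only. The payoff table, the `scorable` flag, and the binary framing are assumptions made for the example; each scenario defines its own cost model and its own rule for which rows count.

```python
def business_pnl(rows, cost_model):
    """Sum the dollar value of each decision, skipping rows that are not scorable."""
    total = 0.0
    for row in rows:
        if not row["scorable"]:
            continue  # outcome not well-defined, so excluded from the score
        total += cost_model[(row["predicted"], row["actual"])]
    return total


# Hypothetical payoffs keyed by (predicted, actual); values are illustrative only.
cost_model = {
    (1, 1): 500.0,   # acted on a real event
    (1, 0): -100.0,  # acted needlessly
    (0, 1): -800.0,  # missed a costly event
    (0, 0): 0.0,     # correctly did nothing
}

rows = [
    {"predicted": 1, "actual": 1, "scorable": True},
    {"predicted": 0, "actual": 1, "scorable": True},
    {"predicted": 1, "actual": 0, "scorable": True},
    {"predicted": 0, "actual": 1, "scorable": False},  # ignored by the scorer
]
pnl = business_pnl(rows, cost_model)
print(pnl)
```

With asymmetric payoffs like these, a single missed event outweighs a correct call, which is why a model can look accurate and still post a negative PnL.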

Step 4

Compare

Gold-vs-default delta. The wider the gap, the bigger the bite from obstacles the agent failed to catch and undo.

Example

Gold PnL -$2.36K → default -$37.15K: a ~16× larger loss, from a data issue the agent didn't catch.
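The arithmetic behind that example can be checked directly. The two PnL figures come from the example above; the variable names are only for illustration.

```python
gold_pnl = -2.36e3      # dollars, from the example above
default_pnl = -37.15e3  # dollars, from the example above

loss_ratio = default_pnl / gold_pnl  # how many times larger the loss is
extra_loss = default_pnl - gold_pnl  # additional dollars lost versus gold

print(f"{loss_ratio:.1f}x larger loss, {extra_loss:,.0f} dollars versus gold")
```

The ratio comes out just under 16, which is where the "~16×" figure comes from.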

Scenario Benchmarks

Default figures are from Claude Code · Fable 5 · max effort on the default (messy, with reversible data obstacles) condition of each scenario, scored on business PnL. Gold is the clean-data ceiling — the best result achievable with no obstacles present — shown as a model-agnostic reference.

Scenario	Task	Gold PnL	Default PnL	Δ PnL	Δ %
FleetSignal	binary	+$56.83M	-$92.32M	-$149.15M	-262%
Depot Throughput	time-series	-$2.00K	-$2.00K	$0	0%
Patient Readmission Risk	binary	+$16.30M	-$214.94M	-$231.24M	-1419%
Workforce Training Completion	binary	-$104.28	-$238.00	-$133.72	-128%
Workforce Competency Score	regression	-$766.21	-$1.02K	-$255.79	-33%
Collections Routing	multiclass	-$35.92M	-$35.92M	$0	0%
Commodity Trade P&L	regression	+$2.98B	+$2.32B	-$660.68M	-22%
Event Ticket Demand Tier	multiclass	+$9.60K	+$6.54K	-$3.06K	-32%
High Demand Day	binary	+$27.80M	+$27.30M	-$0.50M	-2%
PnL Band Tier	multiclass	+$47.36M	+$20.55M	-$26.81M	-57%
Commodity Throughput	regression	+$3.18B	+$2.54B	-$645.95M	-20%

Δ PnL is Fable's default (messy) PnL minus the clean ceiling, and Δ % is that difference relative to the ceiling's magnitude; negative values mean the obstacles cost accuracy/PnL. Where Fable's default matched or beat the clean reference (Depot Throughput, Collections Routing), the ceiling is shown equal to the default (Δ 0), since no separate clean-data run was made for those. Where the ceiling PnL is small or the default flips sign relative to it (Patient Readmission Risk), the percentage runs large; read it as directional. PnL units differ with each scenario's cost model, so compare within a row, not across rows.
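The two delta columns can be reproduced from the Gold and Default columns. The sketch below uses three rows from the table, in millions of dollars; the function name and the guard for a zero ceiling are assumptions added for the example.

```python
def delta_pnl(gold, default):
    """Return (delta, delta_pct), with delta_pct relative to the magnitude of gold."""
    delta = default - gold
    pct = 100.0 * delta / abs(gold) if gold != 0 else float("nan")
    return delta, pct


# (gold, default) in millions of dollars, taken from the table above.
table_rows = {
    "FleetSignal": (56.83, -92.32),
    "Patient Readmission Risk": (16.30, -214.94),
    "PnL Band Tier": (47.36, 20.55),
}

results = {name: delta_pnl(gold, default) for name, (gold, default) in table_rows.items()}
for name, (delta, pct) in results.items():
    print(f"{name}: {delta:+.2f}M ({pct:+.0f}%)")
```

Dividing by the absolute value of the ceiling keeps the sign of Δ % aligned with the sign of Δ PnL even when the ceiling itself is negative, as in the Workforce rows.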

Dive deep into Scenarios

Scenario

Condition	Agent	Variant	Wall (mm:ss)	Metric	PnL

Interesting findings