Generic Agent vs ML Agent
Six REST-hosted, turn-based games. Each is an academic machine-learning problem wrapped in casual narrative so the surface reveals no ML structure. Two Claude agents play every game: a Generic agent and an ML agent prompted to use modelling, backtesting, and cross-validation. The gap between them is the benchmark signal.
Game factory
How a game is made
A pipeline of subagents converts a hard ML problem into a playable game.
In each game, discovery is incremental, the optimal strategy requires ML reasoning, and a greedy reference player and a Bayes-optimal player bracket the achievable score range.
Benchmark summary
Score table
Raw final scores as recorded.
Metric direction is taken from each game's own scoring rule. The gap bar shows the ML agent's relative advantage, capped at 100%.
| Game | Metric | Generic | ML | Winner | ML advantage |
|---|---|---|---|---|---|

Across all six games, the ML agent wins by identifying hidden structure before acting on it.
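
As a rough illustration of how an "ML advantage" figure could be derived from two raw scores and a metric direction, here is a minimal sketch. The exact formula behind the gap bar is not stated here, so the normalisation by the Generic agent's score, the zero-baseline handling, and the names `ml_advantage` and `higher_is_better` are all assumptions made for illustration.

```python
def ml_advantage(generic: float, ml: float, higher_is_better: bool) -> float:
    """Relative advantage of the ML agent, capped at 1.0 (i.e. 100%).

    Assumption: the advantage is measured relative to the Generic
    agent's score. The real gap bar may use a different baseline.
    """
    # Orient the difference so that a positive value always means
    # "the ML agent did better", whatever the metric direction is.
    diff = (ml - generic) if higher_is_better else (generic - ml)
    baseline = abs(generic)
    if baseline == 0:
        # Assumption: with a zero baseline, any improvement counts as a full gap.
        return 1.0 if diff > 0 else 0.0
    # Cap at 100%, as the gap bar does.
    return min(1.0, diff / baseline)


# Hypothetical scores, not taken from the benchmark.
print(ml_advantage(50, 80, higher_is_better=True))   # higher-is-better metric
print(ml_advantage(20, 15, higher_is_better=False))  # lower-is-better metric
print(ml_advantage(10, 40, higher_is_better=True))   # hits the 100% cap
```

Taking the direction from each game's scoring rule keeps a lower-is-better metric from being read as a loss for the agent that actually did better.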
Detailed benchmark
The games and the play-throughs
Each card shows the cover story, hidden ML problem, agent approaches, and animated replay.
Benchmark signal
What the gap means
The decisive factor is methodology, not just persistence.
The ML agent's methodology (system identification, cross-validation, dynamic programming, graphical models, causal-effect inversion, and backtesting) accounts for its win in all six games.
Reject noise
In Switchbench, the generic agent overfit a spurious pattern from a small sample. The ML agent used leave-one-out cross-validation, rejected the phantom feature, and calibrated the decision rule.
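
The leave-one-out check described above can be sketched on a toy dataset. This is not the Switchbench data or the agent's actual code: the six observations, the binary feature, and the two predictors are hypothetical, chosen so that the feature looks helpful in-sample but fails once each point is held out.

```python
from statistics import mean


def loo_mse(xs, ys, predict):
    """Leave-one-out mean squared error of a predictor."""
    errors = []
    for i in range(len(ys)):
        # Train on every point except i, then predict the held-out point.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        pred = predict(train_x, train_y, xs[i])
        errors.append((ys[i] - pred) ** 2)
    return mean(errors)


def predict_global(train_x, train_y, x):
    """Baseline: ignore the feature and predict the overall mean."""
    return mean(train_y)


def predict_by_feature(train_x, train_y, x):
    """Candidate: predict the mean of training points sharing the feature value."""
    same = [y for xi, y in zip(train_x, train_y) if xi == x]
    return mean(same) if same else mean(train_y)


# Hypothetical small sample: group means differ (2.0 vs 2.5) only by chance.
xs = [0, 0, 0, 1, 1, 1]
ys = [1.0, 3.0, 2.0, 2.5, 1.5, 3.5]

baseline_err = loo_mse(xs, ys, predict_global)
feature_err = loo_mse(xs, ys, predict_by_feature)

# Keep the feature only if it improves held-out error.
keep_feature = feature_err < baseline_err
print(baseline_err, feature_err, keep_feature)
```

In-sample, the per-group means always fit at least as well as the global mean, which is how a small sample can suggest a pattern that is not there. Held-out error penalises the extra parameters, so here the feature is rejected.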
Identify hidden dynamics
In AnvilChase and SignalHunter, simulators, latent-state estimates, dynamic programming, and cross-validated feature search exposed structure the generic approach missed.
Choose the right model class
Precision matrices, LOSO-CV edge selection, residual covariance, and backtested Hawkes-style configurations won where direct heuristics flattened the signal.
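
To show why a precision matrix can recover structure that a direct look at correlations misses, here is a small sketch. It assumes a hypothetical three-variable chain A → B → C with a hand-written covariance matrix. It is not drawn from any of the six games, and the helpers `invert` and `partial_corr` are illustrative names.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Build the augmented matrix [A | I].
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def partial_corr(precision, i, j):
    """Partial correlation of i and j given all other variables."""
    return -precision[i][j] / (precision[i][i] * precision[j][j]) ** 0.5


# Hypothetical covariance of a chain A -> B -> C (unit variances, rho = 0.5).
cov = [
    [1.00, 0.50, 0.25],
    [0.50, 1.00, 0.50],
    [0.25, 0.50, 1.00],
]
precision = invert(cov)

marginal_ac = cov[0][2]                      # A and C look correlated
conditional_ac = partial_corr(precision, 0, 2)  # given B, they are not
print(marginal_ac, conditional_ac)
```

A and C have a marginal correlation of 0.25, yet the corresponding precision entry is zero up to rounding: once B is accounted for, there is no direct A to C edge. A heuristic that reads raw correlations would draw that edge anyway.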
