AI game-solving benchmark · report

Generic Agent vs ML Agent

Six REST-hosted, turn-based games. Each is an academic machine-learning problem wrapped in a casual narrative, so the surface reveals no ML structure. Two Claude agents play every game: a Generic agent, and an ML agent prompted to use modelling, backtesting, and cross-validation. The gap between them is the benchmark signal.

games

recorded sessions

0/6

ML wins

Game factory

How a game is made

A pipeline of subagents converts a hard ML problem into a playable game.

Each game is built so that discovery is incremental, the optimal strategy requires ML reasoning, and two reference players, one greedy and one Bayes-optimal, bracket the achievable score range.

Base ML problem → Reframe as game → Parameterize → Build & simulate → Difficulty gate → Tune params → REST host :18000 → PLAYER.md

Benchmark summary

Score table

Raw final scores as recorded.

Metric direction is taken from each game's own scoring rule. The gap bar shows the ML agent's relative advantage, capped at 100%.
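
One way the gap-bar value could be computed is sketched below. The report does not state the formula, so normalising by the Generic score and the name `ml_advantage` are assumptions; only the direction handling and the cap at 100% come from the text above.

```python
def ml_advantage(generic: float, ml: float, higher_is_better: bool) -> float:
    """Relative advantage of the ML agent as a fraction in [0, 1].

    Assumed formula: gap in the ML agent's favour, divided by the
    magnitude of the Generic score, then capped at 1.0 (100%).
    """
    # Orient the gap so that a positive value always favours the ML agent.
    gap = (ml - generic) if higher_is_better else (generic - ml)
    if gap <= 0:
        return 0.0  # the ML agent did not win, so the bar is empty
    denom = abs(generic)
    if denom == 0:
        return 1.0  # any positive gap over a zero baseline saturates the bar
    return min(gap / denom, 1.0)  # cap at 100%
```

For a higher-is-better metric, `ml_advantage(100, 150, True)` gives 0.5; for a lower-is-better metric, `ml_advantage(20, 15, False)` gives 0.25.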

Game	Metric	Generic	ML	Winner	ML advantage

Across all six games, the ML agent wins by identifying hidden structure before acting on it.

Detailed benchmark

The games and the play-throughs

Each card shows the cover story, hidden ML problem, agent approaches, and animated replay.

Exact logs and reconstructed turn-by-turn series are preserved from the source report.

Benchmark signal

What the gap means

The decisive factor is methodology, not just persistence.

In all six games, the ML agent's advantage comes from its methodology: system identification, cross-validation, dynamic programming, graphical models, causal-effect inversion, and backtesting.

Reject noise

In Switchbench, the Generic agent overfit to a spurious pattern in a small sample. The ML agent used leave-one-out cross-validation, rejected the phantom feature, and calibrated the decision rule.
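
The leave-one-out check can be illustrated with a minimal sketch. The sample, the feature values, and both predictors are hypothetical and are not taken from Switchbench or from the agent's logs. The point is only that a feature is kept when it lowers held-out error, not when it merely lowers in-sample error.

```python
from statistics import mean

# Hypothetical sample of (feature value, outcome) pairs; not Switchbench data.
DATA = [("A", 1), ("A", 1), ("A", 1), ("A", 0),
        ("B", 1), ("B", 1), ("B", 0), ("B", 0)]

def predict_baseline(train, x):
    """Ignore the feature: predict the mean outcome of the training rows."""
    return mean(y for _, y in train)

def predict_with_feature(train, x):
    """Predict the mean outcome of training rows that share the feature value."""
    same = [y for f, y in train if f == x]
    return mean(same) if same else predict_baseline(train, x)

def in_sample_mse(data, predict):
    """Error when the model is scored on the same rows it was fitted on."""
    return mean((predict(data, x) - y) ** 2 for x, y in data)

def loo_mse(data, predict):
    """Leave-one-out: score each row with a model fitted on all other rows."""
    errors = []
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        errors.append((predict(train, x) - y) ** 2)
    return mean(errors)

def keep_feature(data):
    """Keep the feature only if it lowers held-out error."""
    return loo_mse(data, predict_with_feature) < loo_mse(data, predict_baseline)
```

On this sample the feature lowers in-sample error (about 0.219 versus 0.234) but raises leave-one-out error (about 0.389 versus 0.306), so `keep_feature(DATA)` returns `False`.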

Identify hidden dynamics

In AnvilChase and SignalHunter, simulators, latent-state estimates, dynamic programming, and cross-validated feature search exposed structure the generic approach missed.

Choose the right model class

Precision matrices, LOSO-CV edge selection, residual covariance, and backtested Hawkes-style configurations won where direct heuristics flattened the signal.