
Battleship Deep Dive: AI Naval Strategy Analysis

398 matches, 31 model variants, 1 AI victory β€” why Battleship remains the hardest challenge for LLMs

Battleship Overview

  • 398 total matches
  • 1 AI win
  • 360 AI errors (foul outs)
  • 31 model variants
  • 384 illegal moves total
  • $10.10 total cost

Match Outcomes

  • AI Error (Foul Out): 360
  • Human Won: 38
  • AI Won: 1

AI Win Rate Comparison Across Games

  • WordDuel: 8.9%
  • Connect4: 7.6%
  • TicTacToe: 5.6%
  • Mastermind: 1.3%
  • Battleship: 0.3%

Illegal Moves per Game (by Provider)

  • Zhipu (GLM): 1.2
  • Anthropic: 1.1
  • OpenAI: 1.1
  • Google: 1.0
  • Meta: 0.9
  • xAI: 0.8
  • Mistral: 0.4

Cost per Game ($) β€” Top 5 Most Expensive

  • Opus 4.5 Vision: $0.203
  • Opus 4.5 Text: $0.144
  • Sonnet 4.5 Vision: $0.090
  • Sonnet 4.5 Text: $0.047
  • GPT-5.2 Vision: $0.045

Summary

Battleship is the most challenging game on PlayTheAI for AI models. Out of 398 completed matches across 31 model variants, only a single AI victory has been recorded β€” by Claude Sonnet 4.5 in text mode. The overwhelming majority of matches (90.5%) end with the AI being disqualified after accumulating 3 illegal moves (shooting at previously targeted cells).

The Core Challenge: Coordinate Memory

Battleship requires tracking which cells have already been targeted on a 7Γ—7 grid β€” a task that tests spatial memory and state tracking across multiple turns. In our observations, models tend to struggle with this as the game progresses:

  • Most AI errors occur between 3 and 7 shots into the game
  • The peak failure point is at 4 shots (33 matches ended there)
  • Very few models survive past 10 shots without committing a foul

This suggests that models can handle the first few moves but begin to lose track of previously visited coordinates as the game state becomes more complex.
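The state tracking the game demands is trivial for conventional code, which is what makes the failure mode striking. A minimal sketch of the bookkeeping involved, using the 7×7 grid and 3-foul limit from the article (class and method names are illustrative, not PlayTheAI's actual implementation):

```python
# Minimal sketch of Battleship shot bookkeeping: remember every targeted
# cell and reject repeats or out-of-bounds shots as fouls.
class ShotTracker:
    SIZE = 7        # 7x7 grid, per the article
    FOUL_LIMIT = 3  # third illegal move -> disqualification

    def __init__(self):
        self.history = set()  # cells already targeted, e.g. {(2, 4)}
        self.fouls = 0

    def fire(self, row, col):
        """Return True if the shot is legal; otherwise count a foul."""
        if not (0 <= row < self.SIZE and 0 <= col < self.SIZE):
            self.fouls += 1
            return False
        if (row, col) in self.history:
            self.fouls += 1  # repeat shot: the dominant AI error
            return False
        self.history.add((row, col))
        return True

    @property
    def disqualified(self):
        return self.fouls >= self.FOUL_LIMIT
```

A `set` lookup makes repeat detection exact in constant time; the models, by contrast, must reconstruct this set from conversation context on every turn.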

The Foul-Out Pattern

Of 159 post-bugfix AI errors, 157 ended with exactly 3 illegal moves β€” the maximum before disqualification. The typical pattern:

  1. AI plays 2-7 valid shots
  2. AI attempts to shoot an already-targeted cell
  3. After retry, AI tries another previously-targeted cell
  4. Third illegal attempt β†’ disqualification

This pattern is consistent across all providers and models, suggesting a fundamental difficulty with maintaining shot history rather than a model-specific issue.
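The pattern above is consistent with a "leaky memory" model of the failure: if each past shot is recalled only with some probability per turn, repeat shots become more likely as the history grows. A toy simulation under that assumption (the recall probability is a made-up parameter, not a measured value):

```python
# Toy simulation of a "forgetful" shooter: each past shot is recalled
# only with probability p_recall per turn, so repeats accumulate as the
# history grows. Purely illustrative; p_recall is an assumption.
import random

def shots_until_foul_out(p_recall=0.8, size=7, foul_limit=3, rng=None):
    rng = rng or random.Random()
    fired, fouls, valid = set(), 0, 0
    while fouls < foul_limit:
        remembered = {c for c in fired if rng.random() < p_recall}
        choices = [(r, c) for r in range(size) for c in range(size)
                   if (r, c) not in remembered]
        if not choices:  # whole board fired with perfect recall; stop
            break
        shot = rng.choice(choices)
        if shot in fired:
            fouls += 1   # repeat of a forgotten cell
        else:
            fired.add(shot)
            valid += 1
    return valid         # valid shots before the third foul
```

Even with 80% recall per cell, such a shooter typically fouls out within a handful of shots, roughly matching the 3–7 shot failure window reported above.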

The One AI Victory

Match 245414c7 stands alone: Claude Sonnet 4.5 (text mode) defeated its human opponent in 15 AI shots, sinking all 3 ships. The AI showed an efficient search pattern, finding and systematically targeting adjacent cells after a hit. Notably, even this winning game included 2 illegal moves β€” the AI was one foul away from disqualification.

Text vs Vision: A Clear Difference

Mode    Matches  AI Wins  Illegal/Game  Cost
Text    205      1        0.9           $3.52
Vision  193      0        1.1           $6.58

Text mode shows slightly fewer illegal moves per game, and the only AI win came in text mode. Vision mode costs nearly twice as much while performing marginally worse; models seem to find the visual grid representation more challenging to parse accurately.
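For context on what "text mode" gives a model to work with, here is a sketch of the kind of board serialization a text prompt might restate each turn. The symbols and layout are assumptions about the prompt format, not PlayTheAI's actual rendering:

```python
# Render a 7x7 Battleship board as text: X = hit, o = miss, . = untried.
# Columns are lettered A-G, rows numbered 1-7 (assumed coordinate scheme).
def render_board(hits, misses, size=7):
    rows = ["  " + " ".join(chr(ord("A") + c) for c in range(size))]
    for r in range(size):
        cells = []
        for c in range(size):
            if (r, c) in hits:
                cells.append("X")
            elif (r, c) in misses:
                cells.append("o")
            else:
                cells.append(".")
        rows.append(f"{r + 1} " + " ".join(cells))
    return "\n".join(rows)
```

A serialization like this makes the shot history explicit in every prompt, which may explain why text mode tracks coordinates slightly better than an image of the same grid.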

Provider Analysis

Mistral stands out with the lowest illegal move rate (0.4/game), suggesting better coordinate tracking capabilities in Battleship. xAI (Grok) follows at 0.8/game. At the other end, Zhipu (GLM-4.7) shows the highest rate at 1.2 illegal moves per game.

Cost-wise, Anthropic models dominate spending ($7.01 total) β€” Claude Opus 4.5 costs $0.20/game in vision mode, making it 500x more expensive than Gemini Flash Lite ($0.0004/game) with no better outcomes.

ELO Observations

Battleship ELOs cluster tightly between 965 and 1012, with most variants sitting below the starting value of 1000. The highest-rated is Grok 4.1 Fast (vision) at 1012, despite 0 wins in 11 matches. ELO rankings in Battleship are essentially a measure of "who loses the slowest": models with fewer illegal moves and more shots before fouling out lose less ELO per match.
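To see how "losing slowly" could still differentiate ratings, here is the standard Elo update with an assumed margin multiplier. The margin scaling is a guess at how a site might weight lopsided losses, not a documented PlayTheAI rule:

```python
# Standard Elo update with an assumed margin multiplier: a smaller margin
# (the AI survived longer before fouling out) bleeds less rating per loss.
def elo_update(rating, opponent, score, k=32, margin=1.0):
    # Expected score from the logistic Elo formula.
    expected = 1 / (1 + 10 ** ((opponent - rating) / 400))
    return rating + k * margin * (score - expected)

# Two 1000-rated models losing (score=0) to a 1000-rated human:
fast_loss = elo_update(1000, 1000, 0, margin=1.0)  # 984.0: loses 16 points
slow_loss = elo_update(1000, 1000, 0, margin=0.5)  # 992.0: loses 8 points
```

Under any such scheme, a model that fouls out on shot 3 every match drifts down faster than one that regularly survives to shot 10, even if neither ever wins.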

Comparison With Other Games

Game        AI Win Rate  Avg Illegal/Game  Matches
WordDuel    8.9%         0.0               338
Connect4    7.6%         0.8               1,033
TicTacToe   5.6%         1.0               2,152
Mastermind  1.3%         0.0               229
Battleship  0.3%         1.0               398

Battleship's 0.3% AI win rate is dramatically lower than any other game. While TicTacToe has a similar illegal move rate, the coordinate tracking requirement in Battleship makes those illegal moves far more consequential.

Key Takeaways

  1. Coordinate memory is the bottleneck: Models consistently fail at tracking which cells they've already targeted
  2. Foul-out is the dominant failure mode: 90.5% of all matches end in AI disqualification
  3. No model has solved it: 31 variants tested, only a single win across 398 matches
  4. Text > Vision for this task: Text input yields slightly better coordinate tracking
  5. Cost doesn't correlate with performance: Claude Opus 4.5 at $0.20/game performs no better than Gemini Flash Lite at $0.0004/game

⚠️ Open Beta β€” These are preliminary observations based on 398 matches (172 post-bugfix). Battleship data before January 12, 2026 is affected by a grid reconstruction bug and was excluded from analysis where noted. Interested in scientific collaboration? Contact us!

πŸ†• Neue Version verfΓΌgbar!