Battleship Deep Dive: AI Naval Strategy Analysis
398 matches, 31 model variants, 1 AI victory: why Battleship remains the hardest challenge for LLMs
Battleship Overview
Summary
Battleship is the most challenging game on PlayTheAI for AI models. Out of 398 completed matches across 31 model variants, only a single AI victory has been recorded, by Claude Sonnet 4.5 in text mode. The overwhelming majority of matches (90.5%) end with the AI being disqualified after accumulating 3 illegal moves (shooting at previously targeted cells).
The Core Challenge: Coordinate Memory
Battleship requires tracking which cells have already been targeted on a 7×7 grid, a task that tests spatial memory and state tracking across multiple turns. In our observations, models tend to struggle with this as the game progresses:
- Most AI errors occur between 3 and 7 shots into the game
- The peak failure point is at 4 shots (33 matches ended there)
- Very few models survive past 10 shots without committing a foul
This suggests that models can handle the first few moves but begin to lose track of previously visited coordinates as the game state becomes more complex.
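The state-tracking task itself is trivial to implement in code, which underlines that the difficulty is memory rather than logic. A minimal sketch of the bookkeeping an agent would need (the 7×7 grid size comes from the article; the function and variable names are illustrative, not PlayTheAI's actual implementation):

```python
# Minimal model of the state a Battleship agent must maintain:
# the set of every cell it has already targeted on the 7x7 grid.

GRID_SIZE = 7  # board size used on PlayTheAI, per the article

def is_legal_shot(shot, history):
    """A shot is legal if it is on the board and not a repeat."""
    row, col = shot
    on_board = 0 <= row < GRID_SIZE and 0 <= col < GRID_SIZE
    return on_board and shot not in history

history = set()
for shot in [(0, 0), (3, 4), (0, 0)]:  # third shot repeats the first
    if is_legal_shot(shot, history):
        history.add(shot)
    else:
        print(f"illegal: {shot} already targeted or off-board")
```

A single `set` membership test is all the "memory" required; the models effectively have to reconstruct this set from the conversation history on every turn.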
The Foul-Out Pattern
Of the 159 post-bugfix AI losses, 157 ended with exactly 3 illegal moves, the maximum before disqualification. The typical pattern:
- The AI plays 2-7 valid shots
- The AI attempts to shoot an already-targeted cell
- After a retry, the AI targets another previously targeted cell
- A third illegal attempt triggers disqualification
This pattern is consistent across all providers and models, suggesting a fundamental difficulty with maintaining shot history rather than a model-specific issue.
The One AI Victory
Match 245414c7 stands alone: Claude Sonnet 4.5 (text mode) defeated its human opponent in 15 AI shots, sinking all 3 ships. The AI used an efficient search pattern, systematically targeting adjacent cells after each hit. Notably, even this winning game included 2 illegal moves: the AI was one foul away from disqualification.
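The "after a hit, probe the neighbors" behavior seen in that game is the classic hunt/target heuristic for Battleship. A sketch of the target-selection step (hypothetical code, not the model's actual reasoning):

```python
def target_candidates(hit, history, size=7):
    """After a hit, list the orthogonal neighbors that are on the
    board and not yet targeted (classic hunt/target heuristic)."""
    row, col = hit
    neighbors = [(row - 1, col), (row + 1, col),
                 (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in neighbors
            if 0 <= r < size and 0 <= c < size and (r, c) not in history]

# A hit in the corner has only two on-board neighbors to try.
print(target_candidates((0, 0), history=set()))  # [(1, 0), (0, 1)]
```

The heuristic is cheap to state in code; the hard part for an LLM is keeping `history` accurate across many turns of dialogue.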
Text vs Vision: A Clear Difference
| Mode | Matches | AI Wins | Illegal/Game | Total Cost |
|---|---|---|---|---|
| Text | 205 | 1 | 0.9 | $3.52 |
| Vision | 193 | 0 | 1.1 | $6.58 |
Text mode shows slightly fewer illegal moves per game, and the only AI win came in text mode. Vision mode costs nearly twice as much while performing marginally worse; models seem to find the visual grid representation harder to parse accurately than the text one.
Provider Analysis
Mistral stands out with the lowest illegal move rate (0.4/game), suggesting better coordinate tracking capabilities in Battleship. xAI (Grok) follows at 0.8/game. At the other end, Zhipu (GLM-4.7) shows the highest rate at 1.2 illegal moves per game.
Cost-wise, Anthropic models dominate spending ($7.01 total): Claude Opus 4.5 costs $0.20/game in vision mode, making it 500x more expensive than Gemini Flash Lite ($0.0004/game) with no better outcomes.
ELO Observations
Battleship ELOs cluster tightly between 965 and 1012, with most variants sitting below the starting value of 1000. The highest-rated is Grok 4.1 Fast (vision) at 1012, despite 0 wins in 11 matches. ELO rankings in Battleship are essentially a measure of "who loses the slowest": models with fewer illegal moves and more shots before fouling out lose less ELO per match.
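The "loses the slowest" effect falls out of the standard Elo update: a loss costs less rating when the loss was already expected. A sketch using the usual logistic expectation (the K-factor of 32 is a common default and an assumption here; the site's actual rating parameters are not published in this article):

```python
def elo_update(rating, opponent, score, k=32):
    """Standard Elo update. score: 1 = win, 0 = loss, 0.5 = draw.
    k=32 is a common default, not necessarily the site's value."""
    expected = 1 / (1 + 10 ** ((opponent - rating) / 400))
    return rating + k * (score - expected)

# A 965-rated variant losing to the 1012-rated leader sheds fewer
# points than the k/2 = 16 an even-match loss would cost, because
# the loss was already priced in.
print(round(elo_update(965, 1012, 0), 1))
```

With every variant losing almost every match, the ratings converge toward whoever bleeds the fewest points per loss, which is why the spread stays so narrow.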
Comparison With Other Games
| Game | AI Win Rate | Avg Illegal/Game | Matches |
|---|---|---|---|
| WordDuel | 8.9% | 0.0 | 338 |
| Connect4 | 7.6% | 0.8 | 1,033 |
| TicTacToe | 5.6% | 1.0 | 2,152 |
| Mastermind | 1.3% | 0.0 | 229 |
| Battleship | 0.3% | 1.0 | 398 |
Battleship's 0.3% AI win rate is dramatically lower than any other game. While TicTacToe has a similar illegal move rate, the coordinate tracking requirement in Battleship makes those illegal moves far more consequential.
Key Takeaways
- Coordinate memory is the bottleneck: Models consistently fail at tracking which cells they've already targeted
- Foul-out is the dominant failure mode: 90.5% of all matches end in AI disqualification
- No model has solved it: 31 variants tested, just one win across 398 matches
- Text > Vision for this task: Text input yields slightly better coordinate tracking
- Cost doesn't correlate with performance: Claude Opus 4.5 at $0.20/game performs no better than Gemini Flash Lite at $0.0004/game
⚠️ Open Beta: These are preliminary observations based on 398 matches (172 post-bugfix). Battleship data before January 12, 2026 is affected by a grid reconstruction bug and was excluded from analysis where noted. Interested in scientific collaboration? Contact us!