
Battleship Deep Dive: AI Naval Strategy Analysis

398 matches, 31 model variants, 1 AI victory β€” why Battleship remains the hardest challenge for LLMs

Battleship Overview

  • 398 total matches
  • 1 AI win
  • 360 AI errors (foul outs)
  • 31 model variants
  • 384 illegal moves total
  • $10.10 total cost

Match Outcomes

  • AI Error (Foul Out): 360
  • Human Won: 38
  • AI Won: 1

AI Win Rate Comparison Across Games

  • WordDuel: 8.9%
  • Connect4: 7.6%
  • TicTacToe: 5.6%
  • Mastermind: 1.3%
  • Battleship: 0.3%

Illegal Moves per Game (by Provider)

  • Zhipu (GLM): 1.2
  • Anthropic: 1.1
  • OpenAI: 1.1
  • Google: 1.0
  • Meta: 0.9
  • xAI: 0.8
  • Mistral: 0.4

Cost per Game ($) β€” Top 5 Most Expensive

  • Opus 4.5 Vision: $0.203
  • Opus 4.5 Text: $0.144
  • Sonnet 4.5 Vision: $0.090
  • Sonnet 4.5 Text: $0.047
  • GPT-5.2 Vision: $0.045

Summary

Battleship is the most challenging game on PlayTheAI for AI models. Out of 398 completed matches across 31 model variants, only a single AI victory has been recorded β€” by Claude Sonnet 4.5 in text mode. The overwhelming majority of matches (90.5%) end with the AI being disqualified after accumulating 3 illegal moves (shooting at previously targeted cells).

The Core Challenge: Coordinate Memory

Battleship requires tracking which cells have already been targeted on a 7Γ—7 grid β€” a task that tests spatial memory and state tracking across multiple turns. In our observations, models tend to struggle with this as the game progresses:

  • Most AI errors occur between 3 and 7 shots into the game
  • The peak failure point is at 4 shots (33 matches ended there)
  • Very few models survive past 10 shots without committing a foul

This suggests that models can handle the first few moves but begin to lose track of previously visited coordinates as the game state becomes more complex.
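The state tracking the game demands is trivial for conventional code, which is what makes the failure mode striking. A minimal sketch of the bookkeeping involved, using the 7×7 grid and 3-foul limit from the article (class and method names are illustrative, not PlayTheAI's actual implementation):

```python
# Minimal sketch of Battleship shot bookkeeping: remember every targeted
# cell and reject repeats or out-of-bounds shots as fouls.
class ShotTracker:
    SIZE = 7        # 7x7 grid, per the article
    FOUL_LIMIT = 3  # third illegal move -> disqualification

    def __init__(self):
        self.history = set()  # cells already targeted, e.g. {(2, 4)}
        self.fouls = 0

    def fire(self, row, col):
        """Return True if the shot is legal; otherwise count a foul."""
        if not (0 <= row < self.SIZE and 0 <= col < self.SIZE):
            self.fouls += 1
            return False
        if (row, col) in self.history:
            self.fouls += 1  # repeat shot: the dominant AI error
            return False
        self.history.add((row, col))
        return True

    @property
    def disqualified(self):
        return self.fouls >= self.FOUL_LIMIT
```

A `set` lookup makes repeat detection exact in constant time; the models, by contrast, must reconstruct this set from conversation context on every turn.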

The Foul-Out Pattern

Of 159 post-bugfix AI errors, 157 ended with exactly 3 illegal moves β€” the maximum before disqualification. The typical pattern:

  1. AI plays 2-7 valid shots
  2. AI attempts to shoot an already-targeted cell
  3. After retry, AI tries another previously-targeted cell
  4. Third illegal attempt β†’ disqualification

This pattern is consistent across all providers and models, suggesting a fundamental difficulty with maintaining shot history rather than a model-specific issue.
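The pattern above is consistent with a "leaky memory" model of the failure: if each past shot is recalled only with some probability per turn, repeat shots become more likely as the history grows. A toy simulation under that assumption (the recall probability is a made-up parameter, not a measured value):

```python
# Toy simulation of a "forgetful" shooter: each past shot is recalled
# only with probability p_recall per turn, so repeats accumulate as the
# history grows. Purely illustrative; p_recall is an assumption.
import random

def shots_until_foul_out(p_recall=0.8, size=7, foul_limit=3, rng=None):
    rng = rng or random.Random()
    fired, fouls, valid = set(), 0, 0
    while fouls < foul_limit:
        remembered = {c for c in fired if rng.random() < p_recall}
        choices = [(r, c) for r in range(size) for c in range(size)
                   if (r, c) not in remembered]
        if not choices:  # whole board fired with perfect recall; stop
            break
        shot = rng.choice(choices)
        if shot in fired:
            fouls += 1   # repeat of a forgotten cell
        else:
            fired.add(shot)
            valid += 1
    return valid         # valid shots before the third foul
```

Even with 80% recall per cell, such a shooter typically fouls out within a handful of shots, roughly matching the 3–7 shot failure window reported above.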

The One AI Victory

Match 245414c7 stands alone: Claude Sonnet 4.5 (text mode) defeated its human opponent in 15 AI shots, sinking all 3 ships. The AI showed an efficient search pattern, finding and systematically targeting adjacent cells after a hit. Notably, even this winning game included 2 illegal moves β€” the AI was one foul away from disqualification.

Text vs Vision: A Clear Difference

Mode    Matches  AI Wins  Illegal/Game  Cost
Text    205      1        0.9           $3.52
Vision  193      0        1.1           $6.58

Text mode shows slightly fewer illegal moves per game, and the only AI win came in text mode. Vision mode costs nearly twice as much while performing marginally worse; models seem to find the visual grid representation more challenging to parse accurately.
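For context on what "text mode" gives a model to work with, here is a sketch of the kind of board serialization a text prompt might restate each turn. The symbols and layout are assumptions about the prompt format, not PlayTheAI's actual rendering:

```python
# Render a 7x7 Battleship board as text: X = hit, o = miss, . = untried.
# Columns are lettered A-G, rows numbered 1-7 (assumed coordinate scheme).
def render_board(hits, misses, size=7):
    rows = ["  " + " ".join(chr(ord("A") + c) for c in range(size))]
    for r in range(size):
        cells = []
        for c in range(size):
            if (r, c) in hits:
                cells.append("X")
            elif (r, c) in misses:
                cells.append("o")
            else:
                cells.append(".")
        rows.append(f"{r + 1} " + " ".join(cells))
    return "\n".join(rows)
```

A serialization like this makes the shot history explicit in every prompt, which may explain why text mode tracks coordinates slightly better than an image of the same grid.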

Provider Analysis

Mistral stands out with the lowest illegal move rate (0.4/game), suggesting better coordinate tracking capabilities in Battleship. xAI (Grok) follows at 0.8/game. At the other end, Zhipu (GLM-4.7) shows the highest rate at 1.2 illegal moves per game.

Cost-wise, Anthropic models dominate spending ($7.01 total) β€” Claude Opus 4.5 costs $0.20/game in vision mode, making it 500x more expensive than Gemini Flash Lite ($0.0004/game) with no better outcomes.

ELO Observations

Battleship ELOs cluster tightly between 965 and 1012, with most variants sitting below the starting value of 1000. The highest-rated is Grok 4.1 Fast (vision) at 1012, despite 0 wins in 11 matches. ELO rankings in Battleship are essentially a measure of "who loses the slowest": models with fewer illegal moves and more shots before fouling out lose less ELO per match.
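To see how "losing slowly" could still differentiate ratings, here is the standard Elo update with an assumed margin multiplier. The margin scaling is a guess at how a site might weight lopsided losses, not a documented PlayTheAI rule:

```python
# Standard Elo update with an assumed margin multiplier: a smaller margin
# (the AI survived longer before fouling out) bleeds less rating per loss.
def elo_update(rating, opponent, score, k=32, margin=1.0):
    # Expected score from the logistic Elo formula.
    expected = 1 / (1 + 10 ** ((opponent - rating) / 400))
    return rating + k * margin * (score - expected)

# Two 1000-rated models losing (score=0) to a 1000-rated human:
fast_loss = elo_update(1000, 1000, 0, margin=1.0)  # 984.0: loses 16 points
slow_loss = elo_update(1000, 1000, 0, margin=0.5)  # 992.0: loses 8 points
```

Under any such scheme, a model that fouls out on shot 3 every match drifts down faster than one that regularly survives to shot 10, even if neither ever wins.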

Comparison With Other Games

Game        AI Win Rate  Avg Illegal/Game  Matches
WordDuel    8.9%         0.0               338
Connect4    7.6%         0.8               1,033
TicTacToe   5.6%         1.0               2,152
Mastermind  1.3%         0.0               229
Battleship  0.3%         1.0               398

Battleship's 0.3% AI win rate is dramatically lower than any other game. While TicTacToe has a similar illegal move rate, the coordinate tracking requirement in Battleship makes those illegal moves far more consequential.

Key Takeaways

  1. Coordinate memory is the bottleneck: Models consistently fail at tracking which cells they've already targeted
  2. Foul-out is the dominant failure mode: 90.5% of all matches end in AI disqualification
  3. No model has solved it: 31 variants tested, only a single win across 398 matches
  4. Text > Vision for this task: Text input yields slightly better coordinate tracking
  5. Cost doesn't correlate with performance: Claude Opus 4.5 at $0.20/game performs no better than Gemini Flash Lite at $0.0004/game

⚠️ Open Beta β€” These are preliminary observations based on 398 matches (172 post-bugfix). Battleship data before January 12, 2026 is affected by a grid reconstruction bug and was excluded from analysis where noted. Interested in scientific collaboration? Contact us!

πŸ†• Neue Version verfΓΌgbar!