Deep Dive

Which Game is Hardest for AI? A 4,150-Match Analysis

From a 0.3% win rate in Battleship to 8.9% in WordDuel: five games reveal very different AI capabilities

Data Overview

  • 4,150 completed matches
  • 31 model variants
  • 5 games analyzed
  • 4.7% average AI win rate

AI Win Rate per Game (%)

  • WordDuel: 8.9
  • Connect4: 7.5
  • TicTacToe: 5.5
  • Mastermind: 1.3
  • Battleship: 0.3

ELO Spread (Range Best to Worst)

  • TicTacToe: 492
  • WordDuel: 248
  • Connect4: 172
  • Mastermind: 86
  • Battleship: 47

AI Error Rate per Game (%)

  • Battleship: 90.2
  • TicTacToe: 22.0
  • Connect4: 16.3
  • Mastermind: 0
  • WordDuel: 0

Illegal Moves per Match (avg)

  • TicTacToe: 1.02
  • Battleship: 0.96
  • Connect4: 0.75
  • WordDuel: 0
  • Mastermind: 0

The Difficulty Ranking

Across 4,150 completed matches with 31 AI model variants, the five games on PlayTheAI form a clear difficulty hierarchy for AI. The ranking, from hardest to easiest:

  1. Battleship: 0.3% AI win rate (1 win in 399 matches)
  2. Mastermind: 1.3% AI win rate (3 wins in 229 matches)
  3. TicTacToe: 5.5% AI win rate (119 wins in 2,151 matches)
  4. Connect4: 7.5% AI win rate (77 wins in 1,033 matches)
  5. WordDuel: 8.9% AI win rate (30 wins in 338 matches)

Why Battleship is Nearly Impossible for AI

Battleship stands out as the most challenging game by a wide margin. With only 1 AI victory across 399 completed matches, the numbers are striking:

  • 90.2% of matches end in AI error, far more than any other game
  • The ELO spread is just 47 points (965–1,012), meaning even the best model barely performs above the baseline
  • Coordinate tracking is the core problem: models need to remember which cells they've already fired at, which ones were hits, and maintain a spatial model of the opponent's fleet

The highest-rated model, Grok 4.1 Fast (vision) at 1,012 ELO, still has a 0% win rate. Claude Sonnet 4.5 (text) is the only model with any Battleship win at all (1 win in 17 matches, 1,002 ELO).

The skill Battleship tests (persistent spatial state tracking) appears to be fundamentally difficult for current language models in our tests.
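The bookkeeping that models fail at is simple to state in code. A minimal sketch of the state a player must carry between turns; the class and method names here are illustrative, not part of the PlayTheAI harness:

```python
# Minimal cross-turn state for Battleship: which cells were already fired
# at, and which of those were hits. Anything outside legal_targets() is an
# illegal (repeated) shot.
class ShotTracker:
    def __init__(self, size=10):
        self.size = size
        self.fired = set()   # (row, col) cells already targeted
        self.hits = set()    # subset of fired cells that hit a ship

    def record(self, cell, hit):
        self.fired.add(cell)
        if hit:
            self.hits.add(cell)

    def legal_targets(self):
        """Cells not yet fired at; any legal next shot must come from here."""
        return [(r, c) for r in range(self.size)
                for c in range(self.size) if (r, c) not in self.fired]

tracker = ShotTracker()
tracker.record((0, 0), hit=True)
tracker.record((0, 1), hit=False)
# 98 of 100 cells remain legal targets on a 10x10 board
```

A few lines of set arithmetic suffice, which is exactly why the 90.2% error rate is so striking: the failure is not strategy but persistence of this state across turns.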

Mastermind: The Feedback Processing Challenge

Mastermind presents a different kind of difficulty. The AI needs to guess a hidden color code and use feedback (black and white pegs) to narrow down possibilities:

  • 1.3% AI win rate but a notable 27.9% draw rate: models can partially solve codes but struggle to crack them within the move limit
  • Zero illegal moves (the format is simple), yet performance remains poor
  • Only 3 models have ever won a Mastermind game, each with just one win: Claude Haiku 4.5, Gemini 3 Flash Preview, and Claude Opus 4.5
  • The ELO spread is only 86 points; models cluster near 1,000 with limited differentiation

Mastermind tests iterative deduction from structured feedback. The high draw rate suggests models understand the game mechanics but lack the ability to efficiently narrow the search space.
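The narrowing step that models miss can be sketched as candidate elimination: after each guess, discard every code that would not have produced the observed pegs. The 6-color, 4-slot setup below is an assumption for illustration, not necessarily PlayTheAI's configuration:

```python
# Candidate elimination for Mastermind: keep only codes consistent with
# every (guess, feedback) pair seen so far.
from itertools import product

COLORS = "RGBYOP"   # assumed palette
CODE_LEN = 4        # assumed code length

def score(guess, code):
    """Return (black, white) pegs: exact matches, then color-only matches."""
    black = sum(g == c for g, c in zip(guess, code))
    common = sum(min(guess.count(col), code.count(col)) for col in COLORS)
    return black, common - black

def consistent(candidates, guess, pegs):
    """Keep only codes that would have produced the observed feedback."""
    return [c for c in candidates if score(guess, c) == pegs]

candidates = ["".join(p) for p in product(COLORS, repeat=CODE_LEN)]  # 1296 codes
pegs = score("RRGG", "RGBY")   # feedback if the hidden code were "RGBY"
candidates = consistent(candidates, "RRGG", pegs)
# the hidden code always survives the filter; most other codes do not
```

Repeating this filter after every guess shrinks the space rapidly; the high draw rate suggests models apply something like it inconsistently, wasting guesses on codes their own feedback has already ruled out.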

TicTacToe: Simple Rules, Surprising Failures

Despite being the simplest game, TicTacToe produces the most illegal moves per match (1.02) and reveals the largest ELO spread (492 points):

  • The top models (Claude Opus 4.5 at 1,404 ELO, Gemini 3 Flash at 1,403) perform near-perfectly, achieving frequent draws against humans, which is the theoretically optimal outcome
  • The worst performers (Grok 4 Fast text at 912 ELO) struggle with basic rules, averaging 2.6 illegal moves per match
  • 22% of matches end in AI error: models attempt to play in occupied cells, a mistake that should be trivial to avoid

TicTacToe is actually the best game for differentiating between models. The 492-point ELO spread shows that basic spatial awareness and board state understanding vary enormously across models.
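To put the 492-point spread in perspective: under the standard Elo formulas (the same model behind these ratings, though the site's exact K-factor and baseline are not specified here), that gap implies the stronger model is expected to score roughly 94% against the weaker one:

```python
def elo_expected(r_a, r_b):
    """Expected score of player A versus B under the standard Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=32):
    """A's new rating after one game (score_a: 1 win, 0.5 draw, 0 loss)."""
    return r_a + k * (score_a - elo_expected(r_a, r_b))

# The 492-point TicTacToe gap (1,404 vs 912) maps to ~94% expected score,
# versus ~57% for Battleship's 47-point gap: far more room to tell models apart.
gap = elo_expected(1404, 912)
```

By contrast, Battleship's 47-point spread corresponds to an expected score barely above 50%, which is why it differentiates models so poorly.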

Connect4: Where Strategy Matters

Connect4 offers the clearest strategic test of planning ahead and recognizing patterns:

  • 7.5% AI win rate with 0% draws (the game always has a winner)
  • Grok 4 Fast (text) stands out with a 17.5% win rate and 1,129 ELO, the highest per-game ELO outside of TicTacToe/WordDuel
  • The game has a moderate illegal move rate (0.75/match), mostly from attempting to play in full columns

Interesting pattern: several models that perform poorly at TicTacToe actually do well at Connect4. Grok 4 Fast has the worst TicTacToe ELO (912) but the best Connect4 ELO (1,129), suggesting the skills these games test are quite different.

WordDuel: The AI's Best Chance

WordDuel (a Wordle-style word guessing game) is where AI performs best:

  • 8.9% AI win rate and a massive 46.4% draw rate: many games end with neither side guessing the word
  • Claude Opus 4.5 (text) reaches a remarkable 50% win rate with 1,200 ELO
  • Gemini 3 Flash Preview (vision) achieves 1,244 ELO with a 42.9% win rate
  • Zero illegal moves: the word format is easy to parse

Language is where large language models naturally excel. Premium models show strong vocabulary knowledge and deduction from color-coded letter feedback.
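The deduction WordDuel rewards can be sketched as feedback-consistent filtering of a word list. This is an illustrative toy, not the site's scoring code; 'g' marks a correct position, 'y' a correct letter in the wrong spot, '.' an absent letter:

```python
def feedback(guess, answer):
    """Wordle-style feedback string: 'g' exact, 'y' misplaced, '.' absent."""
    result = ["."] * len(guess)
    remaining = list(answer)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:               # greens first, consuming matched letters
            result[i] = "g"
            remaining.remove(g)
    for i, g in enumerate(guess):
        if result[i] == "." and g in remaining:
            result[i] = "y"      # right letter, wrong position
            remaining.remove(g)
    return "".join(result)

def filter_words(words, guess, fb):
    """Keep only words consistent with the observed feedback."""
    return [w for w in words if feedback(guess, w) == fb]

words = ["crane", "trace", "react", "crate", "cater"]   # toy word list
fb = feedback("crane", "crate")            # feedback against hidden "crate"
survivors = filter_words(words, "crane", fb)   # only "crate" survives
```

Structurally this is the same elimination loop as Mastermind, yet models do far better here, consistent with the idea that the advantage comes from vocabulary and letter statistics rather than the deduction itself.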

The Difficulty Matrix: What Each Game Tests

Game       | Core Challenge                         | AI Win Rate | Why It's Hard
Battleship | Spatial memory, coordinate tracking    | 0.3%        | No persistent state between turns
Mastermind | Iterative deduction from feedback      | 1.3%        | Efficient search space narrowing
TicTacToe  | Board state awareness, basic strategy  | 5.5%        | Occupied cell detection (!)
Connect4   | Pattern recognition, planning ahead    | 7.5%        | Column validity, vertical threats
WordDuel   | Vocabulary, letter-position deduction  | 8.9%        | Natural fit for language models

Key Takeaway

The difficulty spectrum reveals a clear pattern: games requiring persistent spatial state (Battleship) or structured iterative reasoning (Mastermind) are much harder for AI than games relying on pattern matching (Connect4) or language processing (WordDuel).

The most surprising finding may be TicTacToe's position in the middle. Despite its simplicity, it exposes a fundamental gap: many models still struggle to correctly read a 3x3 board, making it an unexpectedly effective benchmark for basic reasoning.

Based on 4,150 matches across 31 model variants.


⚠️ Open Beta β€” preliminary observations. Sample sizes vary by game (2,151 for TicTacToe, 229 for Mastermind). Interested in scientific collaboration? Contact us!

πŸ†• Neue Version verfΓΌgbar!