Week 2: January 5-11, 2026
582 matches across 31 models • Gemini 3 Flash leads TicTacToe (+94 ELO) • Connect4 column-3 fixation observed
4 min read
Week at a Glance
Record-Breaking Sunday: 499 Matches
Sunday January 11th saw 499 matches played - 86% of the entire week's activity (582 matches). This surge from 3-22 daily matches to 499 indicates a significant weekend testing session. The activity spike allowed for extensive model comparison across all games.
Gemini 3 Flash Preview Leads TicTacToe
Gemini 3 Flash Preview Vision gained +94 ELO in TicTacToe this week (11 matches, now ELO 1158), making it the week's top performer. The same model also added +86 ELO in WordDuel. Claude Sonnet 4.5 Vision followed with +80 ELO in TicTacToe. Google's preview model shows strong tactical reasoning.
Connect4: Column 3 Fixation Pattern
Over 50 Connect4 matches on Sunday showed models repeatedly choosing column 3, even when blocked. Claude Opus 4.5 showed this pattern in 10 separate matches, Claude Sonnet 4.5 in 10 matches, GPT-5.2 in 10 matches. This center-column preference persists across model families, suggesting a shared training bias for 'safe' opening moves.
Mastermind: Repeated Guesses Despite Feedback
Multiple models repeated identical guesses despite receiving feedback. GPT-4o repeated 'YRGB' 13 times in one match. Claude Haiku 4.5 repeated 'RGBY' up to 10 times. Gemini 2.5 Flash Lite continued with 'RRRR' for 10 attempts. This pattern suggests challenges in incorporating feedback into subsequent guesses.
Claude 3.5 Haiku: ELO Decline in TicTacToe
After leading TicTacToe last week, Claude 3.5 Haiku Text lost -21 ELO this week (7 matches). The Vision variant also dropped -20 ELO (10 matches). Meanwhile, larger Claude models gained: Opus 4.5 +67 ELO, Sonnet 4.5 +73-80 ELO. The smaller Haiku model appears to struggle against the current player skill level.
Claude Opus 4.5 Maintains WordDuel Lead
Claude Opus 4.5 Text remains the WordDuel champion with ELO 1166 and 56% win rate (9 matches). Gemini 3 Flash Preview Vision also showed strong WordDuel performance, gaining +86 ELO to reach 1128. Both models demonstrate solid word reasoning capabilities.
TicTacToe Most Popular Game
TicTacToe accounted for 38% of all matches this week (224 of 582). Connect4 followed with 33% (191 matches). WordDuel (15%) and Mastermind (14%) saw fewer matches. The preference for grid-based games may reflect their faster pace for rapid model testing.