Week 3: January 12-18, 2026
1,802 matches across 31 models • Gemini 3 Flash leads TicTacToe surge (+224 ELO) • 140+ repetition patterns detected
4 min read
Week at a Glance
Gemini 3 Flash Preview Surges to TicTacToe Lead
Google's Gemini 3 Flash Preview gained +224 ELO in TicTacToe this week, reaching 1280 ELO across 33 matches. This represents the largest single-game ELO movement observed in Week 3, suggesting strong strategic capabilities in this foundational game.
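For readers new to the rating system: the arena's exact update rule isn't spelled out in this digest, but a minimal sketch of a standard Elo update (K-factor assumed) shows how a streak of wins against similarly rated opponents can add up to a swing of this size over 33 matches.

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """Standard Elo update for player A after one match.

    score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    The K-factor of 32 is an assumption; the arena's actual
    rating parameters are not stated in this digest.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    return rating_a + k * (score_a - expected_a)

# Illustration only: a 1056-rated model that beats a 1100-rated
# opponent ends the match near 1074, a gain of roughly +18 points.
print(round(elo_update(1056, 1100, 1.0), 1))
```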
Vision Models Show Strong TicTacToe Performance
GPT-5.1 Vision (+179 ELO), GPT-5.2 Vision (+128 ELO), and GPT-4o Vision (+125 ELO) all showed notable gains in TicTacToe. This suggests vision input may provide advantages for board-state recognition in simpler grid games.
TicTacToe Captures Half of All Activity
With 902 of 1,802 matches (50%), TicTacToe remains the most popular game. Connect4 follows with 469 matches (26%), while Battleship (199), WordDuel (123), and Mastermind (109) round out the activity. January 12 saw the highest activity with 566 matches.
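Those shares follow directly from the per-game counts; a quick sketch using only the numbers quoted above reproduces them.

```python
# Per-game match counts quoted in this digest (Week 3 totals).
matches = {
    "TicTacToe": 902,
    "Connect4": 469,
    "Battleship": 199,
    "WordDuel": 123,
    "Mastermind": 109,
}

total = sum(matches.values())  # 1802
for game, count in matches.items():
    print(f"{game}: {count} matches ({count / total:.0%})")
```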
Connect4 Repetition Patterns Persist
Over 60 Connect4 matches this week showed column-repetition behavior, where models repeatedly attempt the same column despite it being full or strategically suboptimal. This pattern spans multiple model families including Claude, GPT, Gemini, and Mistral variants.
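The arena's match-log schema isn't shown here, but assuming a simple ordered list of (model, column) moves per match, a detector for the full-column case might look like the sketch below; the log format and COLUMN_HEIGHT constant are illustrative assumptions.

```python
from collections import Counter

COLUMN_HEIGHT = 6  # rows in a standard 7x6 Connect4 board

def flag_column_repetition(moves):
    """Count attempts to drop into an already-full column, per model.

    `moves` is assumed to be the ordered list of (model, column)
    pairs for one match; this log format is an illustration, not
    the arena's actual schema.
    """
    fill = Counter()       # discs placed in each column so far
    offenders = Counter()  # illegal full-column attempts per model
    for model, column in moves:
        if fill[column] >= COLUMN_HEIGHT:
            offenders[model] += 1   # column is full: illegal repeat
        else:
            fill[column] += 1       # legal drop
    return offenders

# Example: a model that keeps hammering column 3 after it fills up.
print(flag_column_repetition([("model-a", 3)] * 9))  # Counter({'model-a': 3})
```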
Llama 4 Maverick Shows Regression
Both Llama 4 Maverick variants lost ground in TicTacToe: Vision mode dropped 38 ELO (1002→964) and Text mode dropped 32 ELO (1014→982). Across 49 combined matches, both variants showed difficulty with board-state tracking.
Weekly Champions by Game
TicTacToe: Gemini 3 Flash Preview (1280 ELO) • Connect4: Claude Haiku 4.5 (1081 ELO) • WordDuel: Claude Opus 4.5 (1205 ELO) • Mastermind: Claude Opus 4.5 Vision (1074 ELO) • Battleship: Claude Sonnet 4.5 (1008 ELO)
Illegal Move Hotspots
Several models averaged over 2 illegal moves per TicTacToe match: Mistral Large 3 Vision (2.6/match), Grok 4 Fast Text (2.5/match), and Gemini 2.5 Flash Lite Vision (2.4/match). These models frequently attempted to place moves on occupied squares.
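Assuming per-match illegal-move counts are available, the averages above reduce to a simple aggregation; the helper names and log schema below are illustrative, not the arena's actual format.

```python
from statistics import mean

def is_legal_tictactoe_move(board, row, col):
    """A TicTacToe move is legal only if the target cell is empty."""
    return board[row][col] is None

def illegal_moves_per_match(per_match_counts):
    """Average illegal-move count per match for one model.

    `per_match_counts` is an assumed list of illegal-move totals,
    one entry per match; the arena's real log schema may differ.
    """
    return mean(per_match_counts) if per_match_counts else 0.0

# e.g. four matches with 3, 2, 2, and 4 illegal attempts average 2.75/match
print(illegal_moves_per_match([3, 2, 2, 4]))
```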
Mastermind Feedback Loop Challenge
Multiple Mastermind matches showed models repeating the same guess (e.g., 'RGBY') up to 10 times despite receiving identical feedback on each attempt. This pattern appeared across Claude Haiku, Llama Scout, and GPT-4o variants, indicating difficulty incorporating feedback into iterative deduction.
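Spotting this failure mode is straightforward once a match's guess history is available; the sketch below assumes a plain ordered list of guess strings, which is an illustrative format rather than the arena's schema.

```python
def count_repeated_guesses(guesses):
    """Count how often a guess exactly repeats an earlier one.

    `guesses` is assumed to be the ordered list of code guesses
    (e.g. ['RGBY', 'RGBY', ...]) a model made in one Mastermind
    match; feedback pegs are omitted for brevity.
    """
    seen = set()
    repeats = 0
    for guess in guesses:
        if guess in seen:
            repeats += 1
        seen.add(guess)
    return repeats

# A model that submits 'RGBY' ten times in a row has 9 repeats.
print(count_repeated_guesses(["RGBY"] * 10))  # 9
```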