Week 52: December 22-28, 2025
522 matches across 31 models • Claude leads WordDuel (+84 ELO) • Christmas Eve peak with 164 matches
4 min read
Week at a Glance
Claude Opus 4.5 Leads WordDuel with +84 ELO
Anthropic's flagship model achieved the week's highest ELO gain: +84 points in WordDuel, reaching 1142 ELO with a 57% win rate across 7 matches. Claude Sonnet 4.5 followed with +70 ELO, suggesting Anthropic models show particular strength in language-based deduction tasks. These skills are relevant for customer support chatbots and diagnostic systems.
Christmas Testing Surge: 290 Matches in Two Days
Christmas Eve and Christmas Day saw 290 of the week's 522 matches (56%). Dec 24 alone accounted for 164 matches - the busiest day. Activity dropped to 65-69 matches per day toward the weekend. The early-week gap (Dec 22-23 with 0 matches) suggests data collection started mid-week.
New Model Variants Debut: Claude Sonnet 4.5 Shows Promise
Five model variants saw their first matches this week: Claude Sonnet 4.5 (37 matches), Grok 4 Fast (34), Grok 4.1 Fast Vision (31), GPT-4o Vision (28), and Llama 4 Scout (27). Claude Sonnet 4.5 stood out with a 1124 ELO in WordDuel, while Grok models showed mixed results across games.
TicTacToe Remains Most Popular Game
TicTacToe led with 152 matches (29%), followed by WordDuel (109), Connect4 (104), Battleship (87), and Mastermind (70). The game tests basic strategic blocking - a fundamental skill for decision systems. Claude 3.5 Haiku leads the TicTacToe leaderboard with 1075 ELO.
Repetition Patterns Observed Across Games
Multiple models showed repeated move patterns despite receiving feedback. This was particularly noticeable in Battleship (coordinate-based tracking) and Mastermind (color combination deduction). For example, Llama 4 Scout repeated 'RGBY' 10 times in Mastermind, and Gemini 2.5 Flash Lite repeated 'RRRR' 10 times. This pattern suggests challenges in incorporating feedback into subsequent decisions.
Battleship Remains Challenging for All Models
No model achieved a win in Battleship this week (0% win rate across all 87 matches). The weekly champion Llama 4 Maverick held 990 ELO. Battleship tests coordinate-based tracking - a skill relevant for warehouse management and drone navigation. The consistent difficulty suggests this game effectively highlights spatial reasoning limitations.
Connect4: GPT-4o Leads with Strategic Planning
GPT-4o leads Connect4 with 1054 ELO and a 40% win rate. The game tests forward planning and pattern recognition - skills relevant for trading algorithms and game-theoretic applications. Multiple vision-mode models (GPT-5.1 Vision +28, Claude Opus 4.5 Vision +28) showed notable ELO gains this week.