Week 1: December 29, 2025 - January 4, 2026
389 matches across 31 models • Claude leads WordDuel (+80 ELO) • 116 repetition patterns observed
Week at a Glance
Claude Family Leads WordDuel Rankings
Claude Opus 4.5 gained 80 ELO in WordDuel this week (5 matches, ELO 1138), and Claude Sonnet 4.5 followed with a gain of 70 ELO (4 matches). Both models performed strongly on word-based reasoning, although 4-5 matches per model is a small sample.
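For readers unfamiliar with how rating gains like these accumulate, here is a minimal sketch of the standard Elo update. This report does not state its rating parameters, so the K-factor of 32 and the function names below are assumptions for illustration only, not the leaderboard's actual configuration.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expected score of a player against an opponent."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def update_rating(rating: float, opponent: float, score: float, k: float = 32.0) -> float:
    """Return the new rating after one match.

    score is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    k=32 is an assumed K-factor; the report does not specify one.
    """
    return rating + k * (score - expected_score(rating, opponent))


# Hypothetical example: two equally rated players, the first one wins.
new_rating = update_rating(1000.0, 1000.0, score=1.0)
```

Under these assumed settings, a win against an equally rated opponent is worth half of K, so a handful of wins can plausibly move a rating by tens of points in a single week.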
New Year's Week: Expected Activity Dip
389 matches were played across the holiday week. Activity peaked on December 30 (89 matches) and dropped to 38-45 matches per day on New Year's Eve and the following days. Weekend activity recovered to 43-52 matches per day, a typical holiday pattern.
Mastermind: Repetition Patterns Persist
116 repetition bugs were detected this week. Mastermind shows notable patterns: Gemini 2.5 Flash Lite guessed 'RRRR' up to 10 times despite receiving feedback, and multiple models (Claude Haiku 4.5, GPT-4o, Mistral Large 3) repeated 'RGBY' 4-8 times. This suggests these models have difficulty adapting their guesses to feedback.
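One simple way to flag this kind of repetition is to count how often each guess recurs within a match. The sketch below is illustrative only: the report does not describe its actual detection criteria, so the function name, the `min_repeats` threshold, and the sample guess list are all hypothetical.

```python
from collections import Counter


def find_repeated_guesses(guesses, min_repeats=2):
    """Return {guess: count} for guesses made at least min_repeats times.

    Guesses are normalized (trimmed, upper-cased) before counting.
    min_repeats is an assumed threshold, not the report's definition.
    """
    counts = Counter(g.strip().upper() for g in guesses)
    return {guess: n for guess, n in counts.items() if n >= min_repeats}


# Hypothetical guess log from a single Mastermind match.
match_guesses = ["RRRR", "RGBY", "RRRR", "rrrr", "GGBB"]
repeats = find_repeated_guesses(match_guesses)
```

A stricter variant could count only consecutive repeats, which would separate a model that is stuck on one guess from one that returns to an earlier guess later in the match.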
TicTacToe: Claude 3.5 Haiku Claims Weekly Title
Claude 3.5 Haiku led TicTacToe with ELO 1104 (+50 this week) and a 36% win rate across 11 matches. The compact model outperformed larger counterparts in this strategic game, demonstrating efficient tactical reasoning.
Battleship: Consistent Coordinate Tracking Issues
All tested models averaged 3 illegal moves per match in Battleship. This pattern appeared across model families: Gemini 3 Flash (27 illegal moves in 9 matches), Mistral Large 3 (24 in 8 matches), and Claude 3.5 Haiku Vision (24 in 8 matches). Spatial memory and coordinate tracking remain challenging across the board.
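The per-match rate follows directly from the counts quoted above. A small sketch of the arithmetic, using only the three models named in this section (the dictionary layout and function name are illustrative choices):

```python
# (illegal moves, matches) as quoted in this report for Battleship.
battleship_counts = {
    "Gemini 3 Flash": (27, 9),
    "Mistral Large 3": (24, 8),
    "Claude 3.5 Haiku Vision": (24, 8),
}


def illegal_moves_per_match(illegal: int, matches: int) -> float:
    """Average number of illegal moves per match."""
    return illegal / matches


rates = {
    model: illegal_moves_per_match(illegal, matches)
    for model, (illegal, matches) in battleship_counts.items()
}
```

Each of the three models works out to exactly 3 illegal moves per match.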
New Models: Grok 4 Fast Debuts
Grok 4 Fast entered testing with 38 matches this week and showed mixed results: it rated above average in Connect4 (ELO 1048, 25% win rate) and WordDuel (ELO 1042), while its TicTacToe (ELO 978) and Battleship (ELO 990) results left room for improvement. More data is needed for a comprehensive assessment.
WordDuel: Most Repetition Issues This Week
WordDuel recorded 25+ repetition incidents. Notable patterns: Grok 4.1 Fast repeated 'STERN' and 'HAUSF' 6 times each in separate matches, LLaMA 4 Scout fixated on 'HAUTE' across multiple games, and GPT-4o Mini repeated 'FRAUEN' in 3 different matches. Models appear to struggle to update their word hypotheses after feedback.