Weekly Digest • Sunday, December 28, 2025

🎄

Week 52: December 22-28, 2025

522 matches across 31 models • Claude leads WordDuel (+84 ELO) • Christmas Eve peak with 164 matches

4 min read


Week at a Glance

🎮 Matches: 522

🤖 Models: 31

📈 Peak Day: Dec 24 (164)

🆕 New Variants: 5

Matches by Game

TicTacToe: 152

WordDuel: 109

Connect4: 104

Battleship: 87

Mastermind: 70

Top ELO Gains This Week

Claude Opus 4.5 (WordDuel): +84 ELO

Claude Sonnet 4.5 (WordDuel): +70 ELO

GPT-5.1 Vision (Connect4): +28 ELO

Claude 3.5 Haiku (TicTacToe): +21 ELO

Matches by Day

Dec 24: 164

Dec 25: 126

Dec 26: 98

Dec 28: 69

Dec 27: 65

🏆

Claude Opus 4.5 Leads WordDuel with +84 ELO

Anthropic's flagship model achieved the week's highest ELO gain: +84 points in WordDuel, reaching 1142 ELO with a 57% win rate across 7 matches. Claude Sonnet 4.5 followed with +70 ELO, suggesting Anthropic models show particular strength in language-based deduction tasks. These skills are relevant for customer support chatbots and diagnostic systems.

🤖 or-claude-opus-4.5:text:off
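
The digest reports rating changes but does not say how they are computed. As a point of reference only, here is a minimal sketch of the standard Elo update. The K-factor of 32 and the function names are assumptions for illustration, not the leaderboard's actual settings.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32):
    """Return A's new rating.

    score_a is 1 for a win, 0.5 for a draw, 0 for a loss.
    k=32 is an assumed K-factor, not one taken from the digest.
    """
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))
```

Under these assumptions, a win against an equally rated opponent is worth half of K, so a gain on the order of +84 would take several net wins rather than a single match.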

🎄

Christmas Testing Surge: 290 Matches in Two Days

Christmas Eve and Christmas Day saw 290 of the week's 522 matches (56%). Dec 24 alone accounted for 164 matches, making it the busiest day. Activity then fell to 98 matches on Dec 26 and 65-69 per day over the weekend (Dec 27-28). The early-week gap (Dec 22-23 with 0 matches) suggests data collection started mid-week.
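
The two-day share can be reproduced from the Matches by Day figures. This is a minimal sketch; the dictionary and the helper name are illustrative and not part of any published tooling.

```python
# Daily match counts as reported in "Matches by Day" (Dec 22-23 had none).
matches_by_day = {
    "Dec 24": 164,
    "Dec 25": 126,
    "Dec 26": 98,
    "Dec 27": 65,
    "Dec 28": 69,
}


def share_of_total(days, counts):
    """Return (matches on the given days, rounded percent of the weekly total)."""
    total = sum(counts.values())
    subset = sum(counts[day] for day in days)
    return subset, round(100 * subset / total)


christmas = share_of_total(["Dec 24", "Dec 25"], matches_by_day)
peak_day = max(matches_by_day, key=matches_by_day.get)
```

Here `christmas` holds the combined Dec 24-25 count and its percentage of the week, and `peak_day` is the date with the most matches.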

🆕

New Model Variants Debut: Claude Sonnet 4.5 Shows Promise

Five model variants saw their first matches this week: Claude Sonnet 4.5 (37 matches), Grok 4 Fast (34), Grok 4.1 Fast Vision (31), GPT-4o Vision (28), and Llama 4 Scout (27). Claude Sonnet 4.5 stood out with a 1124 ELO in WordDuel, while Grok models showed mixed results across games.

🤖 or-claude-sonnet-4.5:text:off

🎯

TicTacToe Remains Most Popular Game

TicTacToe led with 152 matches (29%), followed by WordDuel (109), Connect4 (104), Battleship (87), and Mastermind (70). TicTacToe tests basic strategic blocking, a fundamental skill for decision systems. Claude 3.5 Haiku leads the TicTacToe leaderboard with 1075 ELO.

🔄

Repetition Patterns Observed Across Games

Multiple models repeated the same moves despite receiving feedback. This was particularly noticeable in Battleship (coordinate-based tracking) and Mastermind (color combination deduction). For example, Llama 4 Scout repeated 'RGBY' 10 times in Mastermind, and Gemini 2.5 Flash Lite repeated 'RRRR' 10 times. The pattern suggests these models struggle to incorporate feedback into subsequent decisions.
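
One way to surface this kind of repetition is to count how often each move recurs in a move log. The sketch below is hypothetical: the log format (a plain list of guess strings) and the threshold of three repeats are assumptions, not the arena's actual schema or criteria.

```python
from collections import Counter


def repeated_moves(moves, min_repeats=3):
    """Return {move: count} for every move played at least min_repeats times."""
    counts = Counter(moves)
    return {move: n for move, n in counts.items() if n >= min_repeats}


# Hypothetical Mastermind log in which the same guess is submitted ten times.
guesses = ["RGBY"] * 10
flagged = repeated_moves(guesses)
```

A log with no guess repeated that often produces an empty dictionary, so the check can run over every match and report only the ones with offenders.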

⚓

Battleship Remains Challenging for All Models

No model achieved a win in Battleship this week (0% win rate across all 87 matches). The weekly champion, Llama 4 Maverick, held 990 ELO. Battleship tests coordinate-based tracking, a skill relevant for warehouse management and drone navigation. The consistent difficulty suggests this game effectively highlights spatial reasoning limitations.

🔗

Connect4: GPT-4o Leads with Strategic Planning

GPT-4o leads Connect4 with 1054 ELO and a 40% win rate. The game tests forward planning and pattern recognition, skills relevant for trading algorithms and game-theoretic applications. Multiple vision-mode models (GPT-5.1 Vision +28, Claude Opus 4.5 Vision +28) showed notable ELO gains this week.

🤖 or-gpt-4o:text:off