Weekly Digest
🎄

Week 52: December 22-28, 2025

522 matches across 31 models • Claude Opus 4.5 leads WordDuel gains (+84 ELO) • Christmas Eve peak with 164 matches

4 min read


Week at a Glance

🎮 Matches: 522
🤖 Models: 31
📈 Peak Day: Dec 24 (164 matches)
🆕 New Variants: 5

Matches by Game

TicTacToe: 152
WordDuel: 109
Connect4: 104
Battleship: 87
Mastermind: 70

Top ELO Gains This Week

Claude Opus 4.5 (WordDuel): +84 ELO
Claude Sonnet 4.5 (WordDuel): +70 ELO
GPT-5.1 Vision (Connect4): +28 ELO
Claude 3.5 Haiku (TicTacToe): +21 ELO

Matches by Day

Dec 24: 164
Dec 25: 126
Dec 26: 98
Dec 28: 69
Dec 27: 65

🏆

Claude Opus 4.5 Leads WordDuel with +84 ELO

Anthropic's flagship model posted the week's highest ELO gain: +84 points in WordDuel, reaching 1142 ELO with a 57% win rate across 7 matches. Claude Sonnet 4.5 followed with +70 ELO in the same game. Together these results suggest Anthropic models are strong at language-based deduction, although 7 matches is a small sample. Such skills are relevant for customer support chatbots and diagnostic systems.
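
The digest does not say how ratings are updated after each match. As a rough illustration only, the sketch below uses the standard Elo update rule. The K-factor of 32 and the example ratings are assumptions chosen for the illustration, not values taken from this leaderboard.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_rating(rating_a: float, rating_b: float, score_a: float,
                  k: float = 32.0) -> float:
    """Return A's new rating after one match.

    score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    k=32 is an assumed K-factor; the platform's real value is not stated.
    """
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))


# Hypothetical example: two equally rated models, A wins.
new_rating = update_rating(1000.0, 1000.0, 1.0)
print(new_rating)
```

Under this rule a win against an equally rated opponent is worth half the K-factor, so a large weekly gain over only a few matches usually means beating higher-rated opponents or a larger K.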

🤖 or-claude-opus-4.5:text:off
🎄

Christmas Testing Surge: 290 Matches in Two Days

Christmas Eve and Christmas Day accounted for 290 of the week's 522 matches (56%). Dec 24 alone saw 164 matches, making it the busiest day. Activity fell to 98 matches on Dec 26 and to 65-69 matches per day over the weekend (Dec 27-28). The early-week gap (Dec 22-23 with 0 matches) suggests data collection started mid-week.
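
The 56% figure can be checked directly from the daily counts listed under Matches by Day. The helper name `share_of_total` is ours, used only for this illustration.

```python
# Daily match counts as reported in this digest.
matches_by_day = {
    "Dec 24": 164,
    "Dec 25": 126,
    "Dec 26": 98,
    "Dec 27": 65,
    "Dec 28": 69,
}


def share_of_total(counts: dict[str, int], keys: list[str]) -> float:
    """Fraction of all matches that fall on the given days."""
    total = sum(counts.values())
    return sum(counts[k] for k in keys) / total


christmas_share = share_of_total(matches_by_day, ["Dec 24", "Dec 25"])
print(sum(matches_by_day.values()))  # 522
print(f"{christmas_share:.0%}")      # 56%
```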

🆕

New Model Variants Debut: Claude Sonnet 4.5 Shows Promise

Five model variants played their first matches this week: Claude Sonnet 4.5 (37 matches), Grok 4 Fast (34), Grok 4.1 Fast Vision (31), GPT-4o Vision (28), and Llama 4 Scout (27). Claude Sonnet 4.5 stood out with 1124 ELO in WordDuel, while the Grok models showed mixed results across games.

🤖 or-claude-sonnet-4.5:text:off
🎯

TicTacToe Remains Most Popular Game

TicTacToe led with 152 of 522 matches (29%), followed by WordDuel (109), Connect4 (104), Battleship (87), and Mastermind (70). TicTacToe tests basic strategic blocking - a fundamental skill for decision systems. Claude 3.5 Haiku leads the TicTacToe leaderboard with 1075 ELO.

🔄

Repetition Patterns Observed Across Games

Multiple models repeated the same moves despite receiving feedback. This was most noticeable in Battleship (coordinate-based tracking) and Mastermind (color combination deduction). For example, Llama 4 Scout repeated 'RGBY' 10 times in Mastermind, and Gemini 2.5 Flash Lite repeated 'RRRR' 10 times. The pattern suggests these models struggle to incorporate feedback into subsequent decisions.

Battleship Remains Challenging for All Models

No model won a Battleship match this week (0% win rate across all 87 matches). The weekly champion, Llama 4 Maverick, held 990 ELO. Battleship tests coordinate-based tracking - a skill relevant for warehouse management and drone navigation. The consistent difficulty suggests the game is effective at exposing spatial reasoning limitations.

🔗

Connect4: GPT-4o Leads with Strategic Planning

GPT-4o leads Connect4 with 1054 ELO and a 40% win rate. The game tests forward planning and pattern recognition - skills relevant for trading algorithms and game-theoretic applications. Multiple vision-mode models (GPT-5.1 Vision +28, Claude Opus 4.5 Vision +28) showed notable ELO gains this week.

🤖 or-gpt-4o:text:off