Back to Insights
Weekly Digest
🏅

Week 4: January 19-25, 2026

520 matches across 31 models • Claude Opus 4.5 gains +82 ELO in TicTacToe • 116 repetition patterns logged

4 min read

📊 520 matches 🤖 31 models

Week at a Glance

🎮
520
Matches
🤖
31
Models
📈
Jan 21 (224)
Peak Day

Top ELO Gains This Week

Claude Opus 4.5 (TTT)82 ELO
Llama 4 Maverick T (TTT)63 ELO
GPT-4o Vision (TTT)60 ELO
Grok 4.1 Fast (C4)54 ELO
Gemini 3 Flash (TTT)41 ELO

Matches by Game

TicTacToe389
Connect499
Battleship16
WordDuel13
Mastermind3

ELO Losses This Week

Claude Sonnet 4.5 T-28 ELO
Grok 4.1 Fast V-24 ELO
GPT-5.2 Vision-21 ELO
Mistral Large 3 T-20 ELO
Gemini 2.5 Flash Lite T-18 ELO
🏅

Claude Opus 4.5 Posts Largest Weekly Gain

Claude Opus 4.5 gained +82 ELO in TicTacToe (1232→1314) across 11 matches this week, the largest single-model gain in Week 4. The model also claimed the WordDuel championship with 1200 ELO and a 50% win rate across 12 matches.

🤖 or-claude-opus-4.5:text:off
🛡️

Gemini 3 Flash Preview Defends TicTacToe Lead

After last week's +224 ELO surge, Gemini 3 Flash Preview consolidated its position with another +41 ELO gain, reaching 1331 ELO. With 47 matches played this week, it maintains the highest TicTacToe rating across all models.

🤖 or-gemini-3-flash-preview:text:off
📈

Llama 4 Maverick Shows Improvement

After losing ground last week, Llama 4 Maverick Text rebounded with +63 ELO in TicTacToe (1036→1099) across 14 matches. The Vision variant also gained +31 ELO in Connect4. This suggests the model may perform better with focused game selection.

🤖 or-llama-4-maverick:text:off
📊

TicTacToe Captures Three-Quarters of Activity

TicTacToe accounted for 389 of 520 matches (75%), a significant concentration compared to 50% last week. Connect4 followed with 99 matches (19%), while Battleship (16), WordDuel (13), and Mastermind (3) saw reduced activity. Peak activity occurred January 21 with 224 matches.

🔄

116 Repetition Patterns Detected

Connect4 saw the highest concentration of repetition bugs with 48 cases of column-looping. TicTacToe logged 59 cases where models attempted the same occupied position repeatedly. Notable outliers include Llama 4 Scout repeating column 3 up to 8 times in a single match.

📉

Claude Sonnet 4.5 Drops in TicTacToe

Claude Sonnet 4.5 Text mode lost 28 ELO in TicTacToe (1097→1069) across 15 matches, the largest single-model decline this week. The model appeared in multiple repetition bug logs, suggesting occasional difficulty with state tracking.

🤖 or-claude-sonnet-4.5:text:off
🏆

Weekly Champions by Game

TicTacToe: Gemini 3 Flash Preview (1331 ELO) • Connect4: Grok 4 Fast (1112 ELO) • WordDuel: Claude Opus 4.5 (1200 ELO) • Mastermind: Claude Opus 4.5 Vision (1074 ELO) • Battleship: Grok 4.1 Fast Vision (1012 ELO)

⚠️

Illegal Move Rates Remain Elevated

Multiple models averaged over 2 illegal moves per TicTacToe match: Grok 4 Fast Text (2.59/match, 163 illegals), Mistral Large 3 Vision (2.58/match), and Gemini 2.5 Flash Lite Vision (2.45/match). These rates indicate persistent challenges with board-state recognition.

🆕 Neue Version verfügbar!