Weekly Digest
📊

Week 5: January 26 - February 1, 2026

458 matches across 31 models • Claude Opus 4.5 leads TicTacToe (+71 ELO) • 113 repetition patterns observed, 68 of them in Connect4

Week at a Glance

🎮 Matches: 458
🤖 Models: 31
🏆 Top Model: Claude Opus 4.5

Top ELO Gains

Claude Opus 4.5: +71 ELO
GPT-4o: +60 ELO
Claude Sonnet 4.5 (Vision): +55 ELO
Claude Haiku 4.5: +52 ELO
Llama 4 Maverick: +43 ELO

Biggest ELO Losses

GPT-5.2: -36 ELO
Claude 3.5 Haiku: -22 ELO
GPT-5.1 (Vision): -21 ELO
Gemini 2.5 Flash Lite: -18 ELO
GPT-4o (Vision): -16 ELO

Matches by Game

TicTacToe: 317
Connect4: 101
WordDuel: 28
Battleship: 7
Mastermind: 5

Activity by Day

Mon (Jan 26): 141
Tue (Jan 27): 120
Wed (Jan 28): 63
Thu (Jan 29): 45
Fri (Jan 30): 0
Sat (Jan 31): 37
Sun (Feb 1): 52
🏆

Claude Opus 4.5 Leads the Rankings

Claude Opus 4.5 (text mode) gained 71 ELO in TicTacToe this week, reaching 1385 ELO over 8 matches. This is the largest gain by any single model this week and suggests strong strategic play against human players.

🤖 or-claude-opus-4.5:text:off
📈

GPT-4o Shows Strong Recovery

GPT-4o (text mode) climbed 60 ELO in TicTacToe across 12 matches, moving from 1092 to 1152. The gain reflects consistent results throughout the week, and the model appears to adapt well to human strategies.

🤖 or-gpt-4o:text:off
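
For readers unfamiliar with how a rating moves match by match, here is a minimal sketch of the standard Elo update. The digest does not state which rating formula or K-factor is used, so the K-factor of 32, the function names, and the example opponent rating are all assumptions made for illustration.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expected score for `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def update_rating(rating: float, opponent: float, score: float, k: float = 32.0) -> float:
    """Return the new rating after one match.

    score: 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    k: assumed K-factor (32 is a common default, not taken from the digest).
    """
    return rating + k * (score - expected_score(rating, opponent))


if __name__ == "__main__":
    # Hypothetical match: a 1092-rated model beats a 1200-rated opponent.
    print(round(update_rating(1092, 1200, 1.0), 1))
```

Under this formula, a win against a higher-rated opponent moves the rating more than a win against a lower-rated one, which is why a 60-point gain over 12 matches can come from a modest number of wins.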
👁️

Vision Mode Yields Mixed Results

Claude Sonnet 4.5 gained 55 ELO in Connect4 using vision mode, while GPT-5.1 lost 21 ELO in TicTacToe with vision. This suggests that the benefit of visual board interpretation varies considerably between models and games.

📉

GPT-5.2 Faces Challenges in TicTacToe

GPT-5.2 (text mode) lost 36 ELO in TicTacToe this week, dropping from 1179 to 1143 across 9 matches. Despite being a newer model, it appeared to struggle with the relatively simple game mechanics compared to other frontier models.

🤖 or-gpt-5.2:text:off
🔁

113 Repetition Patterns Observed

We detected 113 repetition patterns this week, in which a model repeatedly attempted the same invalid move. Connect4 accounted for the majority (68 cases, about 60%), with Llama 4 Scout, Claude Sonnet 4.5, and Gemini 3 Flash showing this pattern most frequently. This points to continued difficulty correcting course after a rejected move.
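
One way such a pattern could be counted is sketched below. This is not the digest's actual detector: the input format (an ordered list of `(move, is_valid)` pairs for one model in one match) and the rule that a run of identical consecutive invalid moves counts once are both assumptions.

```python
from typing import Iterable, Tuple


def count_repetitions(attempts: Iterable[Tuple[str, bool]]) -> int:
    """Count runs in which the same invalid move is retried back to back.

    attempts: ordered (move, is_valid) pairs for one model in one match
              (hypothetical log format).
    A run of two or more identical consecutive invalid moves counts once.
    """
    count = 0
    prev_invalid = None  # last invalid move seen since the last valid one
    in_run = False       # True once the current run has been counted
    for move, is_valid in attempts:
        if is_valid:
            # A valid move ends any ongoing run.
            prev_invalid = None
            in_run = False
            continue
        if move == prev_invalid:
            if not in_run:
                count += 1
                in_run = True
        else:
            in_run = False
        prev_invalid = move
    return count
```

For example, a model that tries a full Connect4 column three times before choosing a legal one would contribute one repetition under this rule, while two different invalid moves in a row would contribute none.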

🎯

TicTacToe Remains Most Popular

With 317 of 458 matches (69%), TicTacToe continues to be the most-played game. Connect4 follows with 101 matches (22%). The simpler games attract more player engagement, likely due to faster match completion times.
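
The percentages can be reproduced from the "Matches by Game" counts. The short sketch below uses those counts directly; the variable names are illustrative.

```python
# Match counts per game, taken from the "Matches by Game" list in this digest.
matches_by_game = {
    "TicTacToe": 317,
    "Connect4": 101,
    "WordDuel": 28,
    "Battleship": 7,
    "Mastermind": 5,
}

total = sum(matches_by_game.values())

# Share of all matches per game, rounded to whole percent.
shares = {game: round(100 * n / total) for game, n in matches_by_game.items()}

if __name__ == "__main__":
    for game, pct in shares.items():
        print(f"{game}: {matches_by_game[game]} matches ({pct}%)")
```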

📊

Illegal Move Rates Vary Widely

Mistral Large 3 (vision) has the highest illegal-move rate in TicTacToe at 2.6 per match, followed by Grok 4 Fast (2.6) and Gemini 2.5 Flash Lite (2.5). In contrast, top performers like Claude Opus 4.5 and GPT-4o maintain significantly lower error rates.

📅

Monday Peak, Friday Dip

Activity peaked on Monday (141 matches), declined through Thursday (45), and dropped to zero on Friday. Engagement varied widely across the week, with the weekend showing a moderate recovery (37 matches on Saturday, 52 on Sunday).
