Week 5: January 26 - February 1, 2026
458 matches across 31 models • Claude Opus 4.5 leads TicTacToe (+71 ELO) • 113 repetition patterns observed, 68 of them in Connect4
Week at a Glance
Claude Opus 4.5 Leads the Rankings
Claude Opus 4.5 (text mode) gained 71 ELO in TicTacToe this week, reaching 1385 ELO after 8 matches. This is the largest single-model gain of the week and points to strong play against human opponents, although 8 matches is a small sample.
GPT-4o Shows Strong Recovery
GPT-4o (text mode) climbed 60 ELO in TicTacToe across 12 matches, moving from 1092 to 1152. The gain suggests the model held up well against human strategies over the course of the week.
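For readers unfamiliar with how rating changes like these arise, the sketch below shows the standard Elo update for a single match. It is illustrative only: the K-factor of 32, the function names, and the assumption that this leaderboard uses the plain Elo formula are not taken from the report.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expected score of a player against an opponent."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def update(rating: float, opponent: float, score: float, k: float = 32.0) -> float:
    """Return the new rating after one match.

    score is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    k = 32 is an assumed K-factor, not a value stated in the report.
    """
    return rating + k * (score - expected_score(rating, opponent))


# Hypothetical example: a model rated 1092 beats an equally rated opponent.
new_rating = update(1092, 1092, 1.0)
print(round(new_rating))  # 1108
```

Under these assumptions a win against an equal opponent is worth 16 points, so a weekly swing of several dozen points can come from only a handful of matches.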
Vision Mode Yields Mixed Results
Claude Sonnet 4.5 gained 55 ELO in Connect4 using vision mode, while GPT-5.1 lost 21 ELO in TicTacToe with vision. This suggests that the benefit of visual board interpretation varies significantly between models and games.
GPT-5.2 Faces Challenges in TicTacToe
GPT-5.2 (text mode) lost 36 ELO this week in TicTacToe, dropping from 1179 to 1143 across 9 matches. Despite being a newer model, it appeared to struggle with the relatively simple game mechanics compared to other frontier models.
113 Repetition Patterns Observed
We detected 113 repetition patterns this week, cases in which a model repeatedly attempted the same invalid move. Connect4 accounted for the majority (68 cases, about 60%), with Llama 4 Scout, Claude Sonnet 4.5, and Gemini 3 Flash showing the pattern most frequently. This indicates that models still struggle to correct course after a move is rejected.
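One way to detect the pattern described above is to look for consecutive runs of the same invalid move. The sketch below is a minimal illustration, not the detector used for this report: the `(move, is_valid)` log format, the function name, and the threshold of two repeats are all assumptions.

```python
from itertools import groupby


def count_repetitions(attempts, min_repeats=2):
    """Count runs where the same invalid move is attempted back to back.

    attempts: ordered list of (move, is_valid) tuples (hypothetical log format).
    min_repeats: assumed threshold for calling a run a repetition pattern.
    """
    runs = 0
    # groupby collapses consecutive identical (move, is_valid) entries into one group.
    for (move, is_valid), group in groupby(attempts):
        if not is_valid and len(list(group)) >= min_repeats:
            runs += 1
    return runs


# Hypothetical Connect4 log: column 3 is full and gets retried three times.
log = [(3, False), (3, False), (3, False), (4, True), (0, False), (1, False)]
print(count_repetitions(log))  # 1
```

Two different invalid moves in a row, as at the end of the example log, do not count, because the model did at least change its move after the rejection.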
TicTacToe Remains Most Popular
With 317 of 458 matches (69%), TicTacToe remains the most-played game. Connect4 follows with 101 matches (22%), leaving 40 matches (9%) in other games. Simpler games appear to attract more play, likely because matches finish faster.
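The percentages above can be reproduced from the raw counts. In this short sketch the match counts come from the report, while the function name and the "Other" bucket (everything not attributed to TicTacToe or Connect4) are introduced here for illustration.

```python
def shares(counts, total):
    """Return each game's share of total matches as a whole-number percentage."""
    return {game: round(100 * n / total) for game, n in counts.items()}


total_matches = 458
counts = {"TicTacToe": 317, "Connect4": 101}
# Matches not attributed to either game above (hypothetical "Other" bucket).
counts["Other"] = total_matches - sum(counts.values())

print(shares(counts, total_matches))
# {'TicTacToe': 69, 'Connect4': 22, 'Other': 9}
```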
Illegal Move Rates Vary Widely
Mistral Large 3 (vision) has the highest illegal-move rate in TicTacToe at 2.6 per match, with Grok 4 Fast (also 2.6) and Gemini 2.5 Flash Lite (2.5) close behind. In contrast, top performers such as Claude Opus 4.5 and GPT-4o maintain significantly lower error rates.
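An illegal-moves-per-match figure like the ones above is a simple average over a model's matches. The sketch below assumes a hypothetical record format of `(model, illegal_moves_in_match)` and uses placeholder model names and made-up counts; none of the numbers are from the report.

```python
from collections import defaultdict


def illegal_rate_by_model(records):
    """Average illegal moves per match for each model.

    records: iterable of (model, illegal_moves_in_match) pairs (assumed format).
    """
    totals = defaultdict(lambda: [0, 0])  # model -> [illegal moves, matches]
    for model, illegal in records:
        totals[model][0] += illegal
        totals[model][1] += 1
    return {model: moves / matches for model, (moves, matches) in totals.items()}


# Made-up sample data with placeholder model names.
sample = [("model-a", 3), ("model-a", 2), ("model-b", 0), ("model-b", 1), ("model-a", 3)]
rates = illegal_rate_by_model(sample)

# Rank models from most to least error-prone.
for model, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {rate:.1f} illegal moves per match")
```

Averaging per match rather than per move keeps the figure comparable across models that played different numbers of matches, though models with very few matches will still produce noisy rates.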
Monday Peak, Friday Dip
Activity peaked on Monday (141 matches) and dropped to zero on Friday, then recovered moderately over the weekend (37-52 matches per day). This unusual pattern suggests that player engagement varies significantly across the week.