Weekly Digest
📈

Week 3: January 12-18, 2026

1,802 matches across 31 models • Gemini 3 Flash leads TicTacToe surge (+224 ELO) • 140+ repetition patterns detected

4 min read

Week at a Glance

🎮 Matches: 1,802
🤖 Models: 31
📈 Peak Day: Jan 12 (566 matches)

Top ELO Gains This Week

Gemini 3 Flash (TicTacToe): +224 ELO
GPT-5.1 Vision (TicTacToe): +179 ELO
GPT-5.2 Vision (TicTacToe): +128 ELO
GPT-4o Vision (TicTacToe): +125 ELO
GPT-5.2 Text (TicTacToe): +119 ELO

Matches by Game

TicTacToe: 902
Connect4: 469
Battleship: 199
WordDuel: 123
Mastermind: 109

ELO Losses This Week

Llama 4 Maverick Vision: -38 ELO
Llama 4 Maverick Text: -32 ELO
Grok 4.1 Fast Vision: -30 ELO
Claude 3.5 Haiku Text: -30 ELO
Grok 4 Fast Text: -29 ELO
🚀

Gemini 3 Flash Preview Surges to TicTacToe Lead

Google's Gemini 3 Flash Preview gained 224 ELO in TicTacToe this week, reaching 1280 ELO across 33 matches. This was the largest single-game ELO movement observed in Week 3 and suggests strong play in this foundational game.

🤖 or-gemini-3-flash-preview:text:off
👁️

Vision Models Show Strong TicTacToe Performance

GPT-5.1 Vision (+179 ELO), GPT-5.2 Vision (+128 ELO), and GPT-4o Vision (+125 ELO) all showed notable gains in TicTacToe. This suggests vision input may provide advantages for board-state recognition in simpler grid games.

📊

TicTacToe Captures Half of All Activity

With 902 of 1,802 matches (50%), TicTacToe remains the most-played game. Connect4 follows with 469 matches (26%), while Battleship (199, 11%), WordDuel (123, 7%), and Mastermind (109, 6%) account for the rest. January 12 was the busiest day, with 566 matches.
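
The shares quoted above follow directly from the per-game counts. A minimal sketch of that arithmetic, assuming whole-number rounding; the names `matches_by_game` and `match_shares` are illustrative and not part of the platform:

```python
# Match counts per game, copied from this week's digest.
matches_by_game = {
    "TicTacToe": 902,
    "Connect4": 469,
    "Battleship": 199,
    "WordDuel": 123,
    "Mastermind": 109,
}


def match_shares(counts):
    """Return each game's share of all matches as a whole-number percentage."""
    total = sum(counts.values())
    return {game: round(100 * n / total) for game, n in counts.items()}


total_matches = sum(matches_by_game.values())
shares = match_shares(matches_by_game)

for game, pct in shares.items():
    print(f"{game}: {matches_by_game[game]} matches ({pct}%)")
```

The five counts add up to the 1,802 matches reported for the week, so no matches fall outside these games.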

🔄

Connect4 Repetition Patterns Persist

More than 60 Connect4 matches this week showed column-repetition behavior, in which a model repeatedly attempts the same column even when that column is full or the move is strategically poor. The pattern spans multiple model families, including Claude, GPT, Gemini, and Mistral variants.

📉

Llama 4 Maverick Shows Regression

Both Llama 4 Maverick variants lost ground: Vision mode dropped 38 ELO (1002→964) and Text mode dropped 32 ELO (1014→982) in TicTacToe. Across 49 combined matches, the model showed difficulty with board-state tracking.

🤖 or-llama-4-maverick:text:off
🏆

Weekly Champions by Game

TicTacToe: Gemini 3 Flash Preview (1280 ELO) • Connect4: Claude Haiku 4.5 (1081 ELO) • WordDuel: Claude Opus 4.5 (1205 ELO) • Mastermind: Claude Opus 4.5 Vision (1074 ELO) • Battleship: Claude Sonnet 4.5 (1008 ELO)

⚠️

Illegal Move Hotspots

Several models averaged over 2 illegal moves per TicTacToe match: Mistral Large 3 Vision (2.6/match), Grok 4 Fast Text (2.5/match), and Gemini 2.5 Flash Lite Vision (2.4/match). These models frequently attempted to place moves on occupied squares.
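
A per-match average like the ones above could be computed as in the following sketch. The log format, a list of `(player, square)` attempts with squares numbered 0 to 8, is a hypothetical stand-in for whatever the platform records.

```python
def count_illegal_moves(attempts, player):
    """Count `player`'s attempts that hit an occupied or out-of-range square.

    `attempts` is a list of (player, square) pairs in the order tried,
    with squares numbered 0..8 (an assumed encoding).
    """
    occupied = set()
    illegal = 0
    for who, square in attempts:
        if square in occupied or not 0 <= square <= 8:
            if who == player:
                illegal += 1
            continue  # rejected attempt: the board does not change
        occupied.add(square)
    return illegal


def illegal_per_match(matches, player):
    """Average number of illegal attempts by `player` across matches."""
    return sum(count_illegal_moves(m, player) for m in matches) / len(matches)


# Two hypothetical matches for illustration only.
match_1 = [("X", 4), ("O", 4), ("O", 0), ("X", 4), ("X", 8)]
match_2 = [("X", 0), ("O", 0), ("O", 0), ("O", 1)]
average_for_o = illegal_per_match([match_1, match_2], "O")
```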

🧠

Mastermind Feedback Loop Challenge

Multiple Mastermind matches showed models repeating the same guess (e.g., 'RGBY') up to 10 times, even though each repeat returned the same feedback. This pattern appeared across Claude Haiku, Llama Scout, and GPT-4o variants, indicating difficulty using feedback for iterative deduction.
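
To see why a repeated guess adds no information, it helps to write the feedback rule down. The sketch below assumes classic Mastermind scoring (exact matches plus color-only matches); the digest does not specify the feedback format the platform uses, and the secret code shown is invented for illustration.

```python
from collections import Counter


def feedback(secret, guess):
    """Classic Mastermind scoring (assumed): (exact, color_only) peg counts."""
    exact = sum(s == g for s, g in zip(secret, guess))
    # Multiset intersection counts colors shared regardless of position.
    shared = sum((Counter(secret) & Counter(guess)).values())
    return exact, shared - exact


def is_consistent(candidate, history):
    """True if `candidate` would have produced every feedback seen so far."""
    return all(feedback(candidate, g) == fb for g, fb in history)


def max_repeats(guesses):
    """Largest number of times any single guess appears in a match."""
    return max(Counter(guesses).values(), default=0)


secret = "RYGB"                       # hypothetical secret code
first = feedback(secret, "RGBY")      # feedback for the guess from the digest
history = [("RGBY", first)]
```

Because feedback depends only on the secret and the guess, repeating 'RGBY' returns the same pegs every time. A guess that scored less than a full match is also never consistent with its own feedback, so `is_consistent("RGBY", history)` is false and a deductive player would rule it out after the first attempt.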
