Weekly Digest
🎯

Week 13: March 23-29, 2026

46 matches • Connect4 column-3 repetition trap across 5 models • Gemini Flash leads WordDuel (1272 ELO) • Claude 4.6 early results

5 min read

📊 46 matches 🤖 38 models

Week at a Glance

🎮 Matches: 46
🤖 Models: 38
📅 Peak Day: Thu (24)
🏆 WordDuel Leader: Gemini Flash

Matches by Game

TicTacToe: 23
Connect4: 10
WordDuel: 4
Mastermind: 4
Dots & Boxes: 3
Battleship: 2

ELO Changes This Week

GPT-4o-mini: +29 ELO
Llama 4 Maverick: +27 ELO
Grok 4 Fast: +14 ELO
Sonnet 4.6 👁️: +13 ELO
Gemini 3 Flash: -25 ELO

Repetition Bugs by Game

Connect4: 7 cases
TicTacToe: 3 cases
Mastermind: 1 case
📅

Thursday as Game Day: 24 of 46 Matches on One Day

This week saw 46 matches across 6 games, with over half (24) concentrated on Thursday, March 26. Wednesday had zero matches, while the remaining days ranged between 3 and 7. TicTacToe attracted 50% of all activity (23 matches), making it the clear favorite. Dots & Boxes and Battleship saw only 3 and 2 matches, respectively.

🎯

Connect4: Column 3 as Repetition Trap

7 of 11 repetition bugs this week occurred in Connect4 — and 5 of those involved column 3 (the center column). Claude Opus 4.6 dropped pieces into column 3 six times in a row in two separate matches. Claude Sonnet 4.6, GPT-5.1, and GPT-4o-mini showed the same pattern. The center-column heuristic seems deeply anchored in these models, even when the column is full. A similar center obsession appeared in TicTacToe, where Grok 4 Fast and Llama 4 Maverick repeatedly targeted position 4 (center).
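A repetition like this can be flagged by finding the longest run of identical consecutive moves from one player. The sketch below is only an illustration of that idea, not the platform's actual detector: the function names, the threshold of 3, and the sample move list are all assumptions.

```python
from itertools import groupby


def longest_repeat(moves):
    """Return (move, run_length) for the longest run of identical consecutive moves."""
    best_move, best_len = None, 0
    for move, group in groupby(moves):
        run = len(list(group))
        if run > best_len:
            best_move, best_len = move, run
    return best_move, best_len


def is_repetition_bug(moves, threshold=3):
    """Flag a move list whose longest run reaches the (assumed) threshold."""
    return longest_repeat(moves)[1] >= threshold


# Hypothetical Connect4 move list for one player: column 3 chosen six times in a row.
sample_moves = [2, 3, 3, 3, 3, 3, 3, 4]
column, run = longest_repeat(sample_moves)
print(f"column {column} repeated {run} times, flagged: {is_repetition_bug(sample_moves)}")
```

On a standard 6-row board, a run of six drops into one column is enough to fill it, which is why a center-column habit turns into an illegal-move loop so quickly.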

🔁

GPT-4o Repeats "OBYP" 9 Times in Mastermind

In a Mastermind match on March 26, GPT-4o guessed the color combination "OBYP" nine consecutive times without adapting to the feedback. Mastermind provides precise hints (black/white pegs) after each guess — but GPT-4o showed no sign of incorporating this information. This type of feedback-processing failure is particularly relevant since Mastermind's deduction logic mirrors real-world diagnostic workflows.
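To see why repeating a guess wastes the feedback, it helps to score a guess and then filter the code space down to what is still possible. The following is a minimal sketch under stated assumptions: the six-color alphabet `ROYGBP`, the code length of 4, and the secret `OYBG` are hypothetical, since the digest only reports the guess "OBYP".

```python
from collections import Counter
from itertools import product

COLORS = "ROYGBP"  # assumed alphabet; only O, B, Y, P appear in the digest


def score_guess(secret, guess):
    """Return (black, white): right color and position, right color in the wrong position."""
    black = sum(s == g for s, g in zip(secret, guess))
    # Multiset intersection counts shared colors regardless of position.
    shared = sum((Counter(secret) & Counter(guess)).values())
    return black, shared - black


def remaining_codes(guess, feedback, length=4):
    """All codes that would have produced this feedback for this guess."""
    return [
        "".join(code)
        for code in product(COLORS, repeat=length)
        if score_guess("".join(code), guess) == feedback
    ]


hypothetical_secret = "OYBG"  # illustrative only
feedback = score_guess(hypothetical_secret, "OBYP")
candidates = remaining_codes("OBYP", feedback)
print(feedback, len(candidates), "OBYP" in candidates)
```

Unless the feedback is four black pegs, the guess itself never survives this filter, so submitting it again cannot yield new information.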

🏆

Gemini 3 Flash: WordDuel Leader with 41% Win Rate

Gemini 3 Flash Preview (Vision) leads WordDuel with 1272 ELO and a 41% win rate across 17 matches, the highest win rate of any game champion this week. However, the same model lost 25 ELO points in TicTacToe, so it is both a game leader and the week's biggest ELO loser. This contrast suggests strong language deduction abilities but weaker spatial strategy.

🤖 or-gemini-3-flash-preview:vision:off
📊

Claude 4.6: Early Results Show Strengths and Weaknesses

With initial data accumulating, Claude 4.6 shows a mixed picture. Sonnet 4.6 (Vision) reaches 1145 ELO in TicTacToe — the highest among all new model variants. Opus 4.6 (Text) follows at 1137. Both perform competitively in TicTacToe but struggle in Connect4, where Opus 4.6 triggered repetition bugs in multiple matches. Sonnet 4.6 (Text) shows promise in WordDuel (1 win) and Connect4 (20% win rate). Overall, TicTacToe appears to be this generation's strongest game so far.

👁️

Vision vs Text: Different Results for New Models

Among the new model variants, vision and text modes show notable differences. Claude Sonnet 4.6 Vision (1145 ELO in TicTacToe) outperforms its text counterpart (1101 ELO). For Qwen 3.5 Plus, the gap is even clearer: the text variant has a 0% win rate across 18 matches, while the vision variant achieved wins in TicTacToe (13%) and Connect4 (17%). Whether vision input genuinely helps or these are small-sample effects remains to be seen as more data comes in.
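The small-sample caveat can be made concrete with a confidence interval on a win rate. The sketch below uses the Wilson score interval at roughly 95% confidence (z = 1.96); that choice of method is an assumption made here for illustration and not something the leaderboard is known to compute.

```python
import math


def wilson_interval(wins, games, z=1.96):
    """Wilson score interval for a win rate; returns (low, high) as fractions."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1 + z * z / games
    center = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Qwen 3.5 Plus (Text): 0 wins in 18 matches, per the digest.
low, high = wilson_interval(0, 18)
print(f"{low:.1%} to {high:.1%}")
```

For 0 wins in 18 matches the upper bound comes out at roughly 18%, so a true win rate in the low teens would still be compatible with 18 straight losses.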

📈

ELO Movers: Small Sample, Clear Signals

GPT-4o-mini gained 29 ELO in TicTacToe this week (1131 → 1160), and Llama 4 Maverick climbed 27 (1190 → 1217), both from single matches with human wins. Grok 4 Fast added 14 in Mastermind. On the losing side, Gemini 3 Flash Preview dropped 25 in TicTacToe across 2 matches. With only 46 matches total, individual results have an outsized impact on the rankings.
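The digest does not state the rating formula or K-factor behind these numbers. Assuming the standard Elo update, a short sketch shows why one match can move a rating this much; the K-factor of 32 and the ratings used below are illustrative assumptions, not values taken from the platform.

```python
def expected_score(rating, opponent_rating):
    """Standard Elo expected score for the player with `rating`."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400))


def updated_rating(rating, opponent_rating, score, k=32):
    """New rating after one game; score is 1.0 for a win, 0.5 draw, 0.0 loss.

    k=32 is an assumed K-factor, not a documented platform setting.
    """
    return rating + k * (score - expected_score(rating, opponent_rating))


# Hypothetical ratings: a win against an equal opponent vs. an upset win.
even_gain = updated_rating(1200, 1200, 1.0) - 1200
upset_gain = updated_rating(1200, 1500, 1.0) - 1200
print(round(even_gain, 1), round(upset_gain, 1))
```

Under these assumptions a single win is worth at most K points, and an upset against a higher-rated opponent earns more than a win between equals, which is how one result can reshuffle a thinly populated leaderboard.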

⚠️

Note: Open Beta

⚠️ Open Beta: Preliminary observations based on limited data.
