Week 13: March 23-29, 2026
46 matches • Connect4 column-3 repetition trap across 5 models • Gemini Flash leads WordDuel (1272 ELO) • Claude 4.6 early results
Week at a Glance
Thursday as Game Day: 24 of 46 Matches on One Day
This week saw 46 matches across 6 games, with over half (24) concentrated on Thursday, March 26. Wednesday had zero matches, while each of the remaining days saw between 3 and 7. TicTacToe attracted 50% of all activity (23 matches), making it the clear favorite. Battleship and Dots & Boxes each saw only 2-3 matches.
Connect4: Column 3 as Repetition Trap
7 of 11 repetition bugs this week occurred in Connect4 — and 5 of those involved column 3 (the center column). Claude Opus 4.6 dropped pieces into column 3 six times in a row in two separate matches. Claude Sonnet 4.6, GPT-5.1, and GPT-4o-mini showed the same pattern. The center-column heuristic seems deeply anchored in these models, even when the column is full. A similar center obsession appeared in TicTacToe, where Grok 4 Fast and Llama 4 Maverick repeatedly targeted position 4 (center).
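To make the pattern concrete, here is a minimal sketch of how such repetitions could be flagged from a move log. It assumes a standard 6x7 board with 0-indexed columns (so column 3 is the center). The function names and log format are hypothetical, not the platform's actual detection logic.

```python
ROWS, COLS = 6, 7  # assumed standard Connect4 board; columns 0-6, column 3 is the center


def find_full_column_attempts(attempts):
    """Replay attempted moves; return indices of attempts at a full or out-of-range column.

    `attempts` lists column indices in the order they were tried, with both
    players interleaved. Rejected attempts are recorded but not placed.
    """
    heights = [0] * COLS  # number of pieces currently stacked in each column
    rejected = []
    for i, col in enumerate(attempts):
        if not 0 <= col < COLS or heights[col] >= ROWS:
            rejected.append(i)
        else:
            heights[col] += 1
    return rejected


def longest_run(values):
    """Return (value, length) of the longest run of identical consecutive items."""
    best = (None, 0)
    current, length = None, 0
    for v in values:
        length = length + 1 if v == current else 1
        current = v
        if length > best[1]:
            best = (current, length)
    return best
```

Feeding one model's moves into `longest_run` surfaces streaks like the six consecutive column-3 drops, while `find_full_column_attempts` shows which attempts in a full game log could not have been placed because the column already held six pieces.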
GPT-4o Repeats "OBYP" 9 Times in Mastermind
In a Mastermind match on March 26, GPT-4o guessed the color combination "OBYP" nine consecutive times without adapting to the feedback. Mastermind provides precise hints (black/white pegs) after each guess — but GPT-4o showed no sign of incorporating this information. This type of feedback-processing failure is particularly relevant since Mastermind's deduction logic mirrors real-world diagnostic workflows.
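For readers unfamiliar with the peg feedback, the sketch below shows one common way to score a guess and to filter the codes still consistent with earlier feedback. The six-color palette "ROYGBP" and the code length of 4 are assumptions for illustration; the match's actual settings are not stated here.

```python
from collections import Counter
from itertools import product


def score(guess, secret):
    """Return (black, white): exact-position hits, then right color in the wrong position."""
    black = sum(g == s for g, s in zip(guess, secret))
    # Multiset intersection counts colors shared regardless of position.
    shared = sum((Counter(guess) & Counter(secret)).values())
    return black, shared - black


def consistent_codes(history, colors="ROYGBP", length=4):
    """List every code that would have produced the feedback seen so far.

    `history` is a list of (guess, (black, white)) pairs.
    `colors` and `length` are hypothetical defaults.
    """
    codes = ("".join(c) for c in product(colors, repeat=length))
    return [c for c in codes if all(score(g, c) == fb for g, fb in history)]
```

Any feedback short of four black pegs removes the guess itself from the consistent set, so repeating "OBYP" can never narrow the search further. For example, `consistent_codes([("OBYP", (1, 2))])` no longer contains "OBYP".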
Gemini 3 Flash: WordDuel Leader with 41% Win Rate
Gemini 3 Flash Preview (Vision) leads WordDuel with 1272 ELO and a 41% win rate across 17 matches, the highest win rate of any game champion this week. However, the same model lost 25 ELO points in TicTacToe, making it both the top-performing game champion and the biggest ELO loser of the week. This contrast suggests strong language deduction abilities but weaker spatial strategy.
Claude 4.6: Early Results Show Strengths and Weaknesses
With initial data accumulating, Claude 4.6 shows a mixed picture. Sonnet 4.6 (Vision) reaches 1145 ELO in TicTacToe — the highest among all new model variants. Opus 4.6 (Text) follows at 1137. Both perform competitively in TicTacToe but struggle in Connect4, where Opus 4.6 triggered repetition bugs in multiple matches. Sonnet 4.6 (Text) shows promise in WordDuel (1 win) and Connect4 (20% win rate). Overall, TicTacToe appears to be this generation's strongest game so far.
Vision vs Text: Different Results for New Models
Among the new model variants, vision and text modes show notable differences. Claude Sonnet 4.6 Vision (1145 ELO in TicTacToe) outperforms its text counterpart (1101 ELO). For Qwen 3.5 Plus, the gap is even clearer: the text variant has a 0% win rate across 18 matches, while the vision variant achieved wins in TicTacToe (13%) and Connect4 (17%). Whether vision input genuinely helps or these are small-sample effects remains to be seen as more data comes in.
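One way to gauge how much a small sample can support is a confidence interval on the win rate. The sketch below uses the Wilson score interval, a standard statistical method chosen here for illustration. It is not something the rankings themselves are known to use, and the function name is hypothetical.

```python
import math


def wilson_interval(wins, matches, z=1.96):
    """Approximate 95% Wilson score interval for a win rate (z=1.96 assumed)."""
    if matches <= 0:
        raise ValueError("matches must be positive")
    p = wins / matches
    denom = 1 + z * z / matches
    center = (p + z * z / (2 * matches)) / denom
    half = z * math.sqrt(p * (1 - p) / matches + z * z / (4 * matches ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)
```

Calling `wilson_interval(0, 18)` for the Qwen 3.5 Plus text variant gives an upper bound well above zero, which illustrates why a 0% record over 18 matches does not rule out a modest underlying win rate.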
ELO Movers: Small Sample, Clear Signals
GPT-4o-mini gained 29 ELO in TicTacToe this week (1131 → 1160), and Llama 4 Maverick climbed 27 (1190 → 1217), both from single matches with human wins. Grok 4 Fast added 14 in Mastermind. On the losing side, Gemini 3 Flash Preview dropped 25 in TicTacToe across 2 matches. With only 46 matches total, individual results have an outsized impact on the rankings.
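To show why a single match can move a rating this much, here is a sketch of the classic Elo update. The K-factor of 32 is a common default used purely as an assumption. The exact rating formula and K-factor behind these rankings are not stated in this report.

```python
def expected_score(rating, opponent_rating):
    """Expected result (0..1) under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400))


def updated_rating(rating, opponent_rating, result, k=32):
    """Return the new rating; result is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.

    k=32 is an assumed K-factor, not a documented platform setting.
    """
    return rating + k * (result - expected_score(rating, opponent_rating))
```

Under these assumptions, one match can shift a rating by up to `k` points, and the swing is largest when the result is an upset. With few matches per model, a handful of such swings can dominate the table.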
Note: Open Beta
⚠️ Open Beta: Preliminary observations based on limited data.