Week 12: March 16-22, 2026
71 matches across 38 models • Claude 4.6 debuts in 4 variants • Sonnet 4.6 leads ELO gains (+54) • 9 repetition patterns observed
Week at a Glance
Claude 4.6 Generation Enters the Arena
This week marks the debut of Claude 4.6 in four new variants: Opus and Sonnet, each in text and vision mode. Claude Opus 4.6 (text) shows an early TicTacToe ELO of 1145 across 13 matches, while Sonnet 4.6 (vision) reaches 1132 ELO in the same game over 18 matches. The data is still limited, but initial results suggest competitive performance, especially in strategic board games.
Sonnet 4.6 Leads ELO Gains with +54 in TicTacToe
Claude Sonnet 4.6 (text) was this week's biggest ELO mover, climbing from 1050 to 1104 in TicTacToe over 4 matches. GPT-5.1 came second with +26 ELO in WordDuel (2 matches). No significant ELO losses were recorded this week, likely because of the low overall volume of 71 matches.
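For readers who want to put the +54 in context, here is a minimal sketch of the textbook Elo update. It is an illustration only: this recap does not say how PlayTheAI computes ratings, and the K-factor of 32 and the equally rated opponent are assumptions made for the example.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Textbook Elo expected score of `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def update(rating: float, opponent: float, score: float, k: float = 32.0) -> float:
    """New rating after one match; score is 1 (win), 0.5 (draw) or 0 (loss).

    k=32 is an assumed K-factor, not a value taken from PlayTheAI.
    """
    return rating + k * (score - expected_score(rating, opponent))


# Figures reported in this recap.
start, end, matches = 1050, 1104, 4
gain = end - start                  # total ELO gained this week
avg_gain_per_match = gain / matches

# Hypothetical comparison: one win against an equally rated opponent.
after_one_win = update(1050, 1050, 1.0)

print(gain, avg_gain_per_match, after_one_win)
```

Under these assumptions a single win against an equal opponent is worth 16 points, so an average of 13.5 per match over 4 matches is in the range a short winning run would produce.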
Dots and Boxes Gains Traction as Second Most Popular Game
With 18 matches, Dots and Boxes was the second most-played game after TicTacToe (20). Players seem drawn to the strategic line-drawing challenge. However, the game also produces consistently elevated illegal-move rates: 7 model-game combinations were flagged, suggesting that rule compliance in Dots and Boxes remains a notable challenge for AI models.
9 Repetition Patterns: Feedback Processing Remains Challenging
Nine repetition patterns were detected across 7 different models. Standout cases: GPT-4o Mini repeated the guess 'RBGY' 7 times in a single Mastermind match, and Gemini 2.5 Flash Lite guessed 'ADIEU' 6 times in WordDuel. In Connect4, three different models (Sonnet 4.5, Gemini 3 Flash, Mistral Large) each repeatedly played column 3. These patterns suggest ongoing difficulty incorporating game feedback into subsequent decisions, which is relevant for any AI application that relies on iterative feedback loops.
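One simple way to surface patterns like these is to count how often each move occurs within a single match. The sketch below is an assumption about how such a check could look, not a description of PlayTheAI's actual detection logic. The function name, the threshold of 3, and the sample move log are all hypothetical.

```python
from collections import Counter


def repeated_moves(moves: list[str], threshold: int = 3) -> dict[str, int]:
    """Return the moves that occur at least `threshold` times in one match.

    The threshold is an illustrative choice; the recap does not define
    what counts as a repetition pattern.
    """
    counts = Counter(moves)
    return {move: n for move, n in counts.items() if n >= threshold}


# Hypothetical Mastermind move log: 'RBGY' is guessed 7 times in total.
guesses = ["RBGY", "RGBY", "RBGY", "RBGY", "YYBB", "RBGY", "RBGY", "RBGY", "RBGY"]
flagged = repeated_moves(guesses)
print(flagged)
```

A plain count suits guessing games such as Mastermind or WordDuel, where repeating a guess adds no information. In Connect4, playing the same column several times can be a legal and sensible line, so a real check would probably need game-specific rules.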
Quiet Week with Monday Spike: 52% of Matches on One Day
With 71 total matches, this was a quieter week on PlayTheAI. Notably, 37 matches (52%) were played on Monday alone, with the remaining six days averaging just 5-6 matches each. Players challenged 38 different AI models across all 6 available games, with TicTacToe and Dots and Boxes together accounting for over half of all activity.
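The shares quoted above can be reproduced from the reported counts alone. This short check uses only figures from this recap; the variable names are made up for the example.

```python
total_matches = 71
monday_matches = 37
tictactoe_matches = 20
dots_and_boxes_matches = 18

# Share of the week's matches played on Monday.
monday_share = monday_matches / total_matches

# Average matches per day across the remaining six days.
rest_daily_avg = (total_matches - monday_matches) / 6

# Combined share of the two most-played games.
top_two_share = (tictactoe_matches + dots_and_boxes_matches) / total_matches

print(f"Monday share: {monday_share:.1%}")
print(f"Other days, average per day: {rest_daily_avg:.2f}")
print(f"TicTacToe + Dots and Boxes share: {top_two_share:.1%}")
```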
Qwen 3.5 Plus Joins the Model Roster
Alongside Claude 4.6, Qwen 3.5 Plus also debuted in both text and vision modes. The vision variant shows early promise in TicTacToe (1083 ELO, 14 matches), while the text variant has yet to secure a win across 17 matches. With limited data, it's too early for conclusions — more matches will reveal where Qwen 3.5 Plus settles in the rankings.
Battleship and Dots and Boxes: Illegal Move Hotspots
Of the 20 model-game combinations with elevated illegal move rates, 13 involve Battleship and 7 involve Dots and Boxes. In Battleship, GPT-4o Mini (vision) averages 3 illegal moves per match across 20 matches, while GPT-5.2 and Grok 4 Fast (vision) show similarly high rates. Coordinate tracking in grid-based games appears to remain difficult across model families, a pattern that may matter for applications such as warehouse robotics and navigation systems.
Note: Open Beta
⚠️ Open Beta: Preliminary observations based on limited data. All findings reflect a small sample size and should be interpreted with caution.