Week 9: February 23 - March 1, 2026
138 matches across 38 models • Claude 4.6 family debuts (+55 ELO) • Column 3 pattern in 17 Connect4 matches
5 min read
Week at a Glance
Claude 4.6 Family Makes Strong Debut
The new Claude 4.6 generation entered PlayTheAI this week with both Opus 4.6 and Sonnet 4.6 in text and vision modes. Claude Opus 4.6 Vision stood out as the strongest newcomer: +55 ELO in TicTacToe (reaching 1081 after 4 matches) and +28 in Connect4 (reaching 1024 after 2 matches), with an overall 33% win rate against human challengers. Opus 4.6 Text also showed promise, sitting at 1109 ELO in TicTacToe after 9 matches. Sonnet 4.6 Vision reached 1050 ELO in TicTacToe with a solid +28 ELO gain. Early results suggest the 4.6 generation enters the mid-field with room to grow as more matches accumulate.
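The ELO swings above are easier to read with the update rule in mind. PlayTheAI's exact rating parameters aren't published here, so this is a minimal sketch assuming the standard Elo formula, a hypothetical K-factor of 32, and the 1000-point starting rating mentioned below:

```python
def elo_update(rating: float, opponent: float, score: float, k: float = 32.0) -> float:
    """Standard Elo update: score is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.

    K-factor of 32 is an assumption; PlayTheAI's actual value isn't stated.
    """
    # Expected score from the logistic rating-difference curve.
    expected = 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))
    return rating + k * (score - expected)

# A new 1000-rated model beating an equally rated opponent gains k/2 = 16 points.
new_rating = elo_update(1000.0, 1000.0, 1.0)  # 1016.0
```

With a high K-factor, a few wins against similarly rated opponents can move a new model by tens of points, which is why a rating built on 4 matches (like Opus 4.6 Vision's 1081) is still volatile.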
TicTacToe Accounts for 54% of All Matches
With 74 of 138 matches this week, TicTacToe remains the go-to game for human challengers. It's the quickest game to play and the most accessible entry point for testing AI models. Connect4 follows at 28% (39 matches). Activity peaked on Sunday (46 matches) and Monday (38 matches), with a quieter mid-week period. The pattern suggests weekend players drive the bulk of testing activity.
Connect4 Column 3 Pattern: 17 of 24 Repetition Cases
Twenty-four repetition patterns were observed this week, 17 of them in Connect4 – and all 17 involved repeated plays to column 3. The pattern spans multiple model families: Claude Opus 4.6 (3 cases, up to 6 consecutive repeats), GPT-5.2 (2 cases), Qwen 3.5 Plus (3 cases), Claude Sonnet 4.6, Claude Opus 4.5, Claude Sonnet 4.5, and Gemini 2.5 Flash Lite. Of the remaining 7 cases, 6 occurred in TicTacToe (position 4 repeats) and 1 in Battleship. That the Connect4 column 3 pattern cuts across model families is notable – it may point to something in the game state or prompt that encourages this behavior, rather than a model-specific issue.
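Flagging cases like these is straightforward once move histories are logged. A minimal sketch, assuming a hypothetical move log represented as a list of column indices (PlayTheAI's actual detection logic isn't described here):

```python
from itertools import groupby

def longest_repeat(moves):
    """Length of the longest run of identical consecutive moves in a game log."""
    return max((sum(1 for _ in run) for _, run in groupby(moves)), default=0)

# Hypothetical Connect4 history: six consecutive plays to column 3, then other columns.
history = [3, 3, 3, 3, 3, 3, 0, 4]
run_length = longest_repeat(history)  # 6
```

A threshold on the run length (say, 3 or more identical moves) would then mark a match as a repetition case for review.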
Gemini 3 Flash Preview Leads WordDuel at 1270 ELO
Gemini 3 Flash Preview (Vision) holds the WordDuel crown this week with an impressive 1270 ELO and a 47% win rate across 15 career matches – the highest ELO in any game outside the TicTacToe leaders. It gained another +26 ELO this week. WordDuel tests language understanding and deductive reasoning, suggesting strong linguistic capabilities in Google's latest preview model. With only 5 WordDuel matches played this week, the sample is small, but the trend is consistent.
Qwen 3.5 Plus Enters with Mixed Results
Alibaba's Qwen 3.5 Plus debuted this week with both text and vision variants, accumulating 26 matches across 4 games. The vision variant performed better overall (18% win rate, 1062 ELO in TicTacToe) than the text variant (0% win rate so far). However, Qwen 3.5 Plus appeared in 4 of the week's 24 repetition cases and showed elevated illegal move rates in TicTacToe (3 per match in text mode). It's still early days – the model needs more matches to establish a reliable baseline.
Battleship and Dots & Boxes: Consistently Challenging
Battleship continues to push all models to their limits: most average around 3 illegal moves per match, regardless of model family or size. GPT-4o Mini (Vision) is the worst offender, with 60 illegal moves across 20 matches. The highest Battleship ELO is just 994 (Claude Sonnet 4.5 Text), below even the 1000 starting point. Dots & Boxes, the newest game, shows a similar picture: every model that played it this week averaged 3 illegal moves per match. These spatial-reasoning games seem to expose fundamental challenges in coordinate tracking and rule adherence that even frontier models share.
Weekly Champions: Claude Opus 4.5 Holds TicTacToe Throne
Claude Opus 4.5 (Text) remains the undisputed TicTacToe leader at 1437 ELO, despite a slight -11 ELO dip this week. In Connect4, Grok 4 Fast (Text) leads with 1154 ELO. Mastermind's top spot goes to Claude Opus 4.5 (Vision) at 1074 ELO. Battleship stays close with Claude Sonnet 4.5 (Text) at 994. The week's biggest ELO movement came from a newcomer – Claude Opus 4.6 Vision gaining +55 in TicTacToe – but it still has a long way to go to challenge the established 4.5 generation at the top.
Open Beta: Preliminary observations based on limited data.
PlayTheAI is an open beta hobby project. All observations are based on a limited number of matches and should not be considered definitive benchmarks. ELO ratings stabilize with more games – many models this week have fewer than 20 matches per game. Results may shift significantly as more human challengers test the models.