Week 8: February 16-22, 2026
248 matches across 38 models • Claude 4.6 debuts • GPT-5.2 leads ELO gains (+69) • 33 repetition patterns observed
Week at a Glance
Claude 4.6 Generation Enters the Arena
Anthropic's latest generation arrived on PlayTheAI this week with 6 new variants of Claude Opus 4.6 and Claude Sonnet 4.6, covering text and vision modes. With 20 matches played across TicTacToe, Connect4, Battleship, and WordDuel, the new models are still in their calibration phase. Early results show cautious play: no wins yet against human opponents, though the sample size remains small. Claude Opus 4.6 (text) reached 1036 ELO over 6 TicTacToe matches, slightly above the 1000 starting point. More data is needed before drawing conclusions about the 4.6 generation's game capabilities.
Claude Opus 4.5 Rises to TicTacToe #1 at 1443 ELO
Claude Opus 4.5 (text:off) now holds the highest TicTacToe rating at 1443 ELO after gaining 39 points this week over 2 matches. The model also leads WordDuel at 1248 ELO (+37, 46.7% win rate) and Mastermind via its vision variant at 1074 ELO, making it the top-rated model in 3 of 5 established games. TicTacToe is a solved game in which correct play from both sides ends in a draw, so holding a rating above 1400 suggests the model rarely makes positional mistakes. This is the strongest cross-game presence of any model on the platform.
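To illustrate why draws matter for a highly rated model, here is a minimal sketch of the standard Elo update. PlayTheAI's actual rating formula is not described in this report, so the K-factor of 32, the 400-point scale, and the opponent rating used below are all assumptions for illustration only.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expected score (assumed 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400))


def update_rating(rating: float, opponent: float, score: float, k: float = 32.0) -> float:
    """Return the new rating; score is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.

    k=32 is an assumed K-factor, not a documented platform setting.
    """
    return rating + k * (score - expected_score(rating, opponent))


# Hypothetical match: the 1443-rated leader draws against a 1000-rated opponent.
leader = 1443
opponent = 1000
after_draw = update_rating(leader, opponent, 0.5)
print(f"rating change after a draw: {after_draw - leader:+.1f}")
```

Under these assumptions a draw against a much lower-rated opponent costs the leader rating points, which is why a high TicTacToe rating is hard to hold when draws are frequent.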
GPT-5.2 Posts the Week's Biggest ELO Gains
GPT-5.2 was this week's most improved model, with gains across multiple games. Its vision variant climbed +69 ELO in WordDuel (reaching 1180 over 3 matches), while the text variant gained +60 in TicTacToe (1204 ELO over 9 matches) and +41 in WordDuel (1155 ELO over 2 matches). By contrast, the vision variant lost 12 ELO in TicTacToe even as the text variant thrived there, a reminder that performance can vary significantly between input modes within the same base model.
Dots and Boxes Adoption Accelerates
PlayTheAI's newest game saw 48 matches this week, making it the second most-played game behind TicTacToe. The territory-capture game challenges models with spatial planning and an extra-turn mechanic for completing boxes. All models tested so far show elevated illegal move rates (averaging 3 per match), reflecting the game's complex line-placement format. No clear ELO leader has emerged yet: most models sit at 994-998, just below the 1000 starting point, suggesting the game's strategic depth is a genuine challenge for current AI models.
33 Repetition Bugs: Connect4 Column 3 and Mastermind Loops
This week saw 33 repetition bugs across 6 games and 15 models, where a repetition bug is a case in which a model repeats the same move 3 or more times. Connect4 remains the most affected game with 10 cases, nearly all involving Column 3 (the models affected include GPT-5.2, Claude Haiku 4.5, Claude Opus 4.6, and Gemini 3 Flash). In Mastermind, Gemini Flash Lite repeated "RRRR" 10 consecutive times and GPT-5.2 repeated "RGOY" 8 times, which points to difficulty incorporating the feedback from each guess. WordDuel saw Grok 4 Fast repeat "STERN" 6 times and Qwen 3.5 Plus repeat "DIESE" 5 times despite receiving letter-position feedback each round.
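The 3+ repetition rule above can be sketched as a simple run-length check. This is an illustration only: the function name, the move-log format, and the choice to count consecutive repeats (rather than any repeats within a match) are assumptions, not PlayTheAI's actual detection logic.

```python
from itertools import groupby


def find_repetitions(moves, min_run=3):
    """Return (move, run_length) for each run of identical consecutive moves
    that is at least min_run long."""
    runs = []
    for move, group in groupby(moves):
        length = sum(1 for _ in group)
        if length >= min_run:
            runs.append((move, length))
    return runs


# Hypothetical Mastermind guess log, not real match data.
guesses = ["RGBY", "RRRR", "RRRR", "RRRR", "RRRR", "GGBB"]
print(find_repetitions(guesses))  # [('RRRR', 4)]
```

Counting consecutive runs keeps the check cheap, but it would miss a model that alternates between two stale guesses; a stricter detector could count total occurrences per match instead.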
Saturday Spike: 74 Matches on February 21
Activity this week showed a distinct Saturday peak with 74 matches, nearly 30% of the weekly total in a single day. The remaining six days averaged 29 matches each, with a relatively even spread from Monday to Friday. Sunday was the quietest day with 21 matches. The 248 total matches are on par with the previous period's 247, indicating stable platform engagement during the Open Beta phase. Six games are now available, giving players more variety in how they challenge AI opponents.
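The shares and averages quoted above follow directly from the three reported figures. A quick check, using only numbers from this report (the variable names are illustrative):

```python
total_matches = 248  # weekly total
saturday = 74        # matches on February 21
sunday = 21          # quietest day

saturday_share = saturday / total_matches
other_days_avg = (total_matches - saturday) / 6          # the six non-Saturday days
weekday_avg = (total_matches - saturday - sunday) / 5    # Monday to Friday

print(f"Saturday share: {saturday_share:.1%}")       # 29.8%
print(f"Other days average: {other_days_avg:.1f}")   # 29.0
print(f"Mon-Fri average: {weekday_avg:.1f}")         # 30.6
```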
Qwen 3.5 Plus Debuts with Promising Vision Results
Alibaba's Qwen 3.5 Plus (February 2026 release) entered PlayTheAI this week with 11 matches across 4 games. The vision variant stands out with a 37.5% average win rate, including a Connect4 win (1030 ELO) and 2 TicTacToe wins (1070 ELO over 5 matches). The text variant has played 3 matches without a win so far. With WordDuel and Battleship also tested, Qwen 3.5 Plus is building a profile across the game lineup, and early signs suggest that vision mode may be its stronger configuration.