Weekly Digest
🆕

Week 7: February 9-15, 2026

247 matches across 32 models • Dots and Boxes debuts as 6th game • Gemini 3 Flash extends TicTacToe lead to 1407 ELO • 49 repetition bugs observed

4 min read


Week at a Glance

🎮 247 Matches
🤖 32 Active Models
🎲 6 Games
🆕 New Game: Dots & Boxes

Top ELO Gains This Week

Gemini 3 Flash (vision), TicTacToe: +37 ELO
Mistral Large 3 (text), Connect4: +31 ELO
Claude Haiku 4.5 (text), Connect4: +30 ELO
Claude Haiku 4.5 (text), WordDuel: +30 ELO
Claude Opus 4.5 (text), WordDuel: +11 ELO

Matches by Game

TicTacToe: 135
Connect4: 38
Battleship: 23
Dots & Boxes: 20
WordDuel: 19
Mastermind: 12

Matches by Day

Mon 9: 14
Tue 10: 30
Wed 11: 16
Thu 12: 9
Fri 13: 89
Sat 14: 30
Sun 15: 59
🆕

Dots and Boxes Debuts as the 6th Game on PlayTheAI

PlayTheAI's game library expanded this week with Dots and Boxes, a strategic territory game where players draw lines between dots to claim boxes. With 20 matches played in its debut week, the game adds a new dimension to AI benchmarking: spatial planning combined with turn-order tactics, where completing a box grants an extra move. Early data shows this territory-capture mechanic creates interesting decision points for AI models.
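The extra-move rule is easy to get wrong in an implementation, so here is a minimal sketch of it. This is not PlayTheAI's engine: the `DotsAndBoxes` class, the `box_edges` helper, and the edge encoding (`("h", r, c)` for a horizontal line, `("v", r, c)` for a vertical one) are all hypothetical names chosen for illustration, and the sketch skips bounds checking.

```python
def box_edges(r, c):
    """The four edges around box (r, c): top, bottom, left, right."""
    return {("h", r, c), ("h", r + 1, c), ("v", r, c), ("v", r, c + 1)}


class DotsAndBoxes:
    """Illustrative two-player board; rows and cols count boxes, not dots."""

    def __init__(self, rows=2, cols=2):
        self.rows, self.cols = rows, cols
        self.drawn = set()   # edges drawn so far
        self.owner = {}      # (r, c) -> player who claimed the box
        self.current = 0     # player to move: 0 or 1

    def play(self, edge):
        """Draw an edge, claim any boxes it completes, and return them."""
        if edge in self.drawn:
            raise ValueError("edge already drawn")
        self.drawn.add(edge)
        completed = [
            (r, c)
            for r in range(self.rows)
            for c in range(self.cols)
            if (r, c) not in self.owner and box_edges(r, c) <= self.drawn
        ]
        for box in completed:
            self.owner[box] = self.current
        # Extra-move rule: the turn passes only if no box was completed.
        if not completed:
            self.current = 1 - self.current
        return completed


game = DotsAndBoxes(rows=1, cols=1)
game.play(("h", 0, 0))               # player 0, no box completed
game.play(("h", 1, 0))               # player 1, no box completed
game.play(("v", 0, 0))               # player 0, no box completed
claimed = game.play(("v", 0, 1))     # player 1 closes the box
print(claimed, game.owner, game.current)
```

In the usage above, player 1 draws the fourth side, claims the box, and remains the player to move, which is the decision point the game adds for models: a line that looks harmless can hand the opponent a box and another turn.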

🏆

Gemini 3 Flash Preview Extends TicTacToe Lead to 1407 ELO

Gemini 3 Flash Preview (text:off) now holds the highest TicTacToe rating at 1407 ELO over 62 matches, widening the gap from last week's near-tie with Claude Opus 4.5. Its vision variant gained +37 ELO this week (the biggest weekly gain), climbing to 1287 ELO over 4 matches. Gemini seems particularly well-suited to TicTacToe's coordinate-based format, maintaining strong move parsing across both text and vision modes.

🎮 tictactoe 🤖 or-gemini-3-flash-preview:text:off
🧠

Claude Sonnet 4.5 Thinking Mode Makes Its Debut

A new model variant appeared this week: Claude Sonnet 4.5 (text:on), the first Thinking-enabled variant for Sonnet. In its initial 3 matches, it won 2 WordDuel games (reaching 1060 ELO) and played 1 Connect4 match, for a 67% win rate. That is a promising start for the Thinking-mode variant, but the sample is small, and more matches are needed to assess whether the extra reasoning budget consistently translates to stronger play.

🤖 or-claude-sonnet-4.5:text:on
📊

Friday the 13th Peak: 89 Matches in a Single Day

Activity varied significantly across the week. Friday, February 13 stood out with 89 matches, about 2.5 times the weekly average of roughly 35 matches per day, while Thursday was the quietest day with just 9. The 247 total weekly matches represent a quieter week compared to the platform's peak periods, likely reflecting natural usage fluctuations during the Open Beta phase.
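The daily figures can be checked with a few lines of arithmetic. The sketch below uses the per-day counts from the Matches by Day list; the variable names are illustrative only.

```python
# Daily match counts for February 9-15, 2026, from the Matches by Day list.
matches_by_day = {
    "Mon 9": 14, "Tue 10": 30, "Wed 11": 16, "Thu 12": 9,
    "Fri 13": 89, "Sat 14": 30, "Sun 15": 59,
}

total = sum(matches_by_day.values())
daily_average = total / len(matches_by_day)
peak_day = max(matches_by_day, key=matches_by_day.get)
quiet_day = min(matches_by_day, key=matches_by_day.get)
peak_ratio = matches_by_day[peak_day] / daily_average

print(f"total={total}, average={daily_average:.1f}")   # total=247, average=35.3
print(f"peak={peak_day} ({peak_ratio:.1f}x average), quietest={quiet_day}")
```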

🔄

Claude Haiku 4.5 Gains Ground in Connect4 and WordDuel

Claude Haiku 4.5 (text:off) had a solid week, gaining +30 ELO in both Connect4 (reaching 1079 ELO) and WordDuel (reaching 1064 ELO). For one of the smaller Claude models, matching gains in two different games suggest that its reasoning carries over well to structured game formats, though one week of results is limited evidence. Mistral Large 3 also improved, with +31 ELO in Connect4 (999 ELO) and +14 in WordDuel (1050 ELO).

🤖 or-claude-haiku-4.5:text:off
⚠️

49 Repetition Bugs Across 13 Models and 6 Games

This week saw 49 repetition bugs: cases where an AI model repeated the same move 3 or more times even though the move was invalid or suboptimal. TicTacToe accounted for the majority, with models repeatedly trying occupied positions (especially position 4). Connect4 showed persistent column repetitions across several models. Notably, in Mastermind, Llama 4 Maverick repeated "RGBO" 9 and 10 times in two separate matches, and Gemini 3 Flash Preview repeated "BBYY" 8 times, which suggests these models struggle to incorporate feedback from previous guesses.
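A detector for this kind of repetition can be sketched in a few lines. The sketch assumes that "repeated" means the same move submitted in consecutive attempts by one model within one match, which the digest does not spell out. The `find_repetitions` function and the sample guess log are hypothetical and are not taken from PlayTheAI's code or data.

```python
from itertools import groupby


def find_repetitions(moves, threshold=3):
    """Return (move, run_length) for each run of identical consecutive
    moves that is at least `threshold` long."""
    runs = []
    for move, group in groupby(moves):
        length = sum(1 for _ in group)
        if length >= threshold:
            runs.append((move, length))
    return runs


# Hypothetical Mastermind guess log for one model in one match.
guesses = ["RGBY", "RGBO", "RGBO", "RGBO", "RGBO", "BBYY"]
print(find_repetitions(guesses))  # [('RGBO', 4)]
```

Counting consecutive runs rather than total occurrences avoids flagging a move that is legitimately replayed later in a match, for example the same Connect4 column chosen on non-adjacent turns.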

Illegal Move Rates: TicTacToe Coordinate Parsing Persists as Challenge

The highest illegal move rates continue to cluster in TicTacToe. Mistral Large 3 (vision) averages 2.6 illegal moves per match over 70 games, followed by Grok 4 Fast (text) at 2.5 per match over 77 games. These rates have remained stable compared to the previous week, indicating a consistent formatting challenge rather than a worsening trend. Battleship also shows elevated rates, with Claude Sonnet 4.5 (vision) at 1.8 illegal moves per match across 17 games.
