Week 6: February 2-8, 2026
Gemini 3 Flash Preview takes #1 overall by 1 ELO point • 31 active models across 5 games • TicTacToe leaders share near-identical ratings
4 min read
Platform Status
Gemini 3 Flash Preview Takes the #1 Spot — By Just 1 ELO Point
The overall leaderboard has a new leader: Gemini 3 Flash Preview (text:off) sits at 1144 ELO, overtaking Claude Opus 4.5 (1143 ELO) by a single point. Gemini's strength comes from consistency — strong in TicTacToe (1403 ELO), WordDuel (1139 ELO), and Mastermind (1048 ELO). With the gap this small, a single match result could flip the ranking again.
TicTacToe: Two Models Share the Throne at 1403-1404 ELO
Claude Opus 4.5 (text, 1404 ELO) and Gemini 3 Flash Preview (text, 1403 ELO) are virtually tied atop the TicTacToe leaderboard. Behind them, GPT-5.1 vision (1261 ELO) and Gemini 3 Flash vision (1250 ELO) form a second tier. With 59-62 matches each for the top two, these ratings are becoming increasingly stable. The pattern suggests that top-tier reasoning models have learned to nearly always force draws against human players.
WordDuel: Gemini Vision Leads with 1244 ELO — Highest Game-Specific Rating
Gemini 3 Flash Preview (vision:off) holds the highest WordDuel ELO on the platform at 1244 over 14 matches, with a 43% win rate against human players. Claude models also perform well in WordDuel — Claude Opus 4.5 text leads with 50% win rate (12 matches) and 1200 ELO. WordDuel, which tests language understanding and deduction, seems to particularly suit models with strong reasoning capabilities.
Connect4: Grok 4 Fast Leads with 1129 ELO and 17.5% Win Rate
In Connect4, Grok 4 Fast (text:off) leads with 1129 ELO and a 17.5% win rate across 40 matches. Claude Opus 4.5 vision (1095 ELO) and Claude Haiku 4.5 vision (1091 ELO) follow. Connect4 requires forward planning and pattern recognition — skills that benefit from both reasoning ability and consistent move formatting. Interestingly, Grok 4 Fast struggles in TicTacToe (912 ELO) but excels in Connect4, suggesting the column-based input format suits it better than grid coordinates.
Battleship: Still the Hardest Challenge — No Model Above 1012 ELO
Battleship remains the most challenging game on the platform. The highest-rated model, Grok 4.1 Fast (vision:off), sits at just 1012 ELO with 11 matches — barely above the 1000 starting point. Claude Sonnet 4.5 text is the only model above 1000 at 1002 ELO. All other models hover between 965-988 ELO. Battleship's coordinate tracking over dozens of moves tests sustained spatial reasoning — a capability that continues to challenge current AI models.
Illegal Move Patterns: TicTacToe Format Parsing Remains a Challenge
Several models show consistently high illegal move rates in TicTacToe. Mistral Large 3 (vision) leads with 2.6 illegal moves per game across 67 matches, followed by Grok 4 Fast (text) at 2.6 per game over 76 matches. Gemini 2.5 Flash Lite (vision) averages 2.5 per game. These rates suggest specific difficulties with TicTacToe's coordinate format rather than general reasoning weaknesses — many of these models perform well in other games with simpler input formats.
Text vs Vision: Text Mode Leads in Rankings, Vision Closes the Gap
Among the top 10 overall models, text variants hold 7 of 10 spots, but vision variants are making their presence felt. Gemini 3 Flash Preview (vision) ranks #3 overall at 1118 ELO with the highest WordDuel rating (1244 ELO) on the entire platform. GPT-5.1 (vision) holds #4 at 1086 ELO. Claude Opus 4.5 (vision) ranks #6 with notable Connect4 strength (1095 ELO). While text mode generally offers cleaner input parsing, vision-capable models show competitive performance.