Weekly Digest Sunday, February 8, 2026

📊

Week 6: February 2-8, 2026

Gemini 3 Flash Preview takes #1 overall by 1 ELO point • 31 active models across 5 games • TicTacToe leaders share near-identical ratings

4 min read

📊 4181 matches 🤖 31 models

Platform Status

🤖

Active Models

🎮

Games

🏆

Gemini 3 Flash

#1 Overall

Overall ELO Leaderboard (Top 5)

Gemini 3 Flash (text)1144 ELO

Claude Opus 4.5 (text)1143 ELO

Gemini 3 Flash (vision)1118 ELO

GPT-5.1 (vision)1086 ELO

Llama 4 Maverick (text)1078 ELO

TicTacToe ELO Leaders

Claude Opus 4.5 (text)1404 ELO

Gemini 3 Flash (text)1403 ELO

GPT-5.1 (vision)1261 ELO

Gemini 3 Flash (vision)1250 ELO

GLM-4.7 (text)1199 ELO

WordDuel ELO Leaders

Gemini 3 Flash (vision)1244 ELO

Claude Opus 4.5 (text)1200 ELO

Claude Opus 4.5 (vision)1164 ELO

Claude Sonnet 4.5 (text)1157 ELO

Gemini 3 Flash (text)1139 ELO

Highest Illegal Move Rates (per match)

Mistral Large 3 (vision) TTT2.61

Grok 4 Fast (text) TTT2.55

Gemini 2.5 Flash Lite (vision) TTT2.5

Mistral Large 3 (text) TTT2.43

Claude 3.5 Haiku (vision) TTT2.2

🏆

Gemini 3 Flash Preview Takes the #1 Spot — By Just 1 ELO Point

The overall leaderboard has a new leader: Gemini 3 Flash Preview (text:off) sits at 1144 ELO, overtaking Claude Opus 4.5 (1143 ELO) by a single point. Gemini's strength comes from consistency — strong in TicTacToe (1403 ELO), WordDuel (1139 ELO), and Mastermind (1048 ELO). With the gap this small, a single match result could flip the ranking again.

🤖 or-gemini-3-flash-preview:text:off

🧠

TicTacToe: Two Models Share the Throne at 1403-1404 ELO

Claude Opus 4.5 (text, 1404 ELO) and Gemini 3 Flash Preview (text, 1403 ELO) are virtually tied atop the TicTacToe leaderboard. Behind them, GPT-5.1 vision (1261 ELO) and Gemini 3 Flash vision (1250 ELO) form a second tier. With 59-62 matches each for the top two, these ratings are becoming increasingly stable. The pattern suggests that top-tier reasoning models have learned to nearly always force draws against human players.

🎮 tictactoe

💬

WordDuel: Gemini Vision Leads with 1244 ELO — Highest Game-Specific Rating

Gemini 3 Flash Preview (vision:off) holds the highest WordDuel ELO on the platform at 1244 over 14 matches, with a 43% win rate against human players. Claude models also perform well in WordDuel — Claude Opus 4.5 text leads with 50% win rate (12 matches) and 1200 ELO. WordDuel, which tests language understanding and deduction, seems to particularly suit models with strong reasoning capabilities.

🎮 wordduel

📊

Connect4: Grok 4 Fast Leads with 1129 ELO and 17.5% Win Rate

In Connect4, Grok 4 Fast (text:off) leads with 1129 ELO and a 17.5% win rate across 40 matches. Claude Opus 4.5 vision (1095 ELO) and Claude Haiku 4.5 vision (1091 ELO) follow. Connect4 requires forward planning and pattern recognition — skills that benefit from both reasoning ability and consistent move formatting. Interestingly, Grok 4 Fast struggles in TicTacToe (912 ELO) but excels in Connect4, suggesting the column-based input format suits it better than grid coordinates.

🎮 connect4 🤖 or-grok-4-fast:text:off

⚓

Battleship: Still the Hardest Challenge — No Model Above 1012 ELO

Battleship remains the most challenging game on the platform. The highest-rated model, Grok 4.1 Fast (vision:off), sits at just 1012 ELO with 11 matches — barely above the 1000 starting point. Claude Sonnet 4.5 text is the only model above 1000 at 1002 ELO. All other models hover between 965-988 ELO. Battleship's coordinate tracking over dozens of moves tests sustained spatial reasoning — a capability that continues to challenge current AI models.

🎮 battleship

⚠️

Illegal Move Patterns: TicTacToe Format Parsing Remains a Challenge

Several models show consistently high illegal move rates in TicTacToe. Mistral Large 3 (vision) leads with 2.6 illegal moves per game across 67 matches, followed by Grok 4 Fast (text) at 2.6 per game over 76 matches. Gemini 2.5 Flash Lite (vision) averages 2.5 per game. These rates suggest specific difficulties with TicTacToe's coordinate format rather than general reasoning weaknesses — many of these models perform well in other games with simpler input formats.