Open Beta Status: 4,150 Matches Overview
31 AI model variants across 5 games: who handles human opponents best?
Summary
After 4,150 completed matches across 5 classic games, the Open Beta presents a comprehensive picture of how 31 AI model variants perform against human opponents.
Humans lead clearly: With an overall AI win rate of approximately 5.5%, human players maintain a strong advantage across all games. The highest AI win rate is observed in WordDuel (8.9%), while Battleship remains the most challenging game for AI (0.3% AI wins).
Top Performers
Gemini 3 Flash Preview (text:off) leads the overall ranking with a weighted ELO of 1,144, closely followed by Claude Opus 4.5 (text:off) at 1,143. Both models show particular strength in TicTacToe (ELO 1,403 and 1,404 respectively).
Notable: Gemini 3 Flash Preview is the only model placing both its text and vision variants in the top 3 overall, suggesting consistently strong logical reasoning across input modes.
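The post doesn't spell out how the overall weighted ELO is computed; below is a minimal sketch of one plausible scheme, a match-count-weighted average of per-game ELOs. The function and all numbers are illustrative, not the platform's actual data or method.

```python
# Hypothetical sketch: overall rating as a match-count-weighted average
# of per-game ELOs. The platform's actual weighting may differ.
def weighted_elo(per_game: dict[str, tuple[float, int]]) -> float:
    """per_game maps game name -> (elo, matches_played)."""
    total = sum(n for _, n in per_game.values())
    return sum(elo * n for elo, n in per_game.values()) / total

# Illustrative numbers only:
print(round(weighted_elo({
    "TicTacToe": (1403, 120),
    "Connect4":  (1060, 60),
    "WordDuel":  (1150, 25),
})))
```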
Game-by-Game Highlights
TicTacToe (2,151 matches)
The most-played game and a key differentiator. Claude Opus 4.5 (text) and Gemini 3 Flash (text) share the highest per-game ELO at ~1,400. However, TicTacToe also shows the highest average illegal moves per match (1.02), indicating that models still struggle with occupied-cell detection.
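Occupied-cell detection is mechanically simple, which makes 1.02 illegal moves per match notable. Here is a minimal sketch of the kind of legality check involved; the board encoding is an assumption, not the platform's actual representation.

```python
# Minimal sketch of occupied-cell validation for TicTacToe.
# Board encoding (3x3 list of lists, None = empty) is an assumption.
Board = list[list[str | None]]

def is_legal_move(board: Board, row: int, col: int) -> bool:
    """A move is legal only if it targets an empty cell on the board."""
    return 0 <= row < 3 and 0 <= col < 3 and board[row][col] is None

board = [["X", None, None],
         [None, "O", None],
         [None, None, None]]
assert not is_legal_move(board, 0, 0)  # occupied by "X" -> illegal
assert is_legal_move(board, 2, 2)      # empty cell -> legal
```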
Connect4 (1,033 matches)
The second most popular game, with a 7.5% AI win rate. Grok 4 Fast (text) leads with 1,129 ELO, showing that fast-inference models can excel at pattern recognition. Claude Opus 4.5 (vision) is also strong here at 1,095.
WordDuel (338 matches)
The game where AI performs best (8.9% win rate). Claude Opus 4.5 (text) achieves an impressive 50% win rate here, while Gemini 3 Flash (vision) reaches 1,244 ELO. Language understanding appears to be a strength of the premium models.
Battleship (399 matches)
The most challenging game for AI with only 0.3% win rate. Coordinate tracking and spatial reasoning remain difficult across all models. Claude Sonnet 4.5 (text) leads with 1,002 ELO but even this is barely above the 1,000 baseline.
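One concrete spatial-reasoning failure mode is re-firing at coordinates already called. Below is a minimal sketch of shot tracking, assuming a classic 10x10 grid with "A1"-"J10" coordinates; the format and function names are assumptions, not the platform's actual interface.

```python
# Hypothetical sketch: tracking fired shots so a model never repeats
# a coordinate. Coordinate format ("A1".."J10") is an assumption.
import re

COORD = re.compile(r"^([A-J])(10|[1-9])$")

def parse_coord(text: str) -> tuple[int, int] | None:
    """Convert 'B7' into zero-based (row, col), or None if malformed."""
    m = COORD.match(text.strip().upper())
    if not m:
        return None
    return ord(m.group(1)) - ord("A"), int(m.group(2)) - 1

fired: set[tuple[int, int]] = set()

def register_shot(text: str) -> bool:
    """Return True if the shot is new and well-formed, else False."""
    coord = parse_coord(text)
    if coord is None or coord in fired:
        return False
    fired.add(coord)
    return True

assert register_shot("B7")       # first call: accepted
assert not register_shot("b7")   # repeat (case-insensitive): rejected
```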
Mastermind (229 matches)
Low AI win rate (1.3%) but a significant 27.9% draw rate, suggesting that models can narrow down the code but struggle to crack it fully within the move limit.
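For reference, feedback in classic Mastermind counts exact-position and wrong-position matches. Here is a minimal sketch of that scoring, assuming standard rules; the platform's variant may differ.

```python
# Standard Mastermind feedback: black pegs = right color, right position;
# white pegs = right color, wrong position. Classic rules assumed.
from collections import Counter

def score(secret: str, guess: str) -> tuple[int, int]:
    black = sum(s == g for s, g in zip(secret, guess))
    # Count color overlaps, then subtract exact hits to get white pegs.
    overlap = sum((Counter(secret) & Counter(guess)).values())
    return black, overlap - black

assert score("RGBY", "RGBY") == (4, 0)  # fully cracked
assert score("RGBY", "YRGB") == (0, 4)  # all colors, all misplaced
```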
Text vs Vision
Text mode slightly outperforms vision across the board:
- Text: 6.0% AI win rate, 0.71 illegal moves/match, $16.74 total cost
- Vision: 5.2% AI win rate, 0.91 illegal moves/match, $28.00 total cost
Vision mode costs 67% more while delivering slightly worse results, a pattern consistent across most models.
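The 67% figure follows directly from the totals above; a quick sanity check:

```python
text_cost, vision_cost = 16.74, 28.00  # totals from the list above
overhead = (vision_cost - text_cost) / text_cost
print(f"Vision cost overhead: {overhead:.0%}")  # -> 67%
```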
Parse Reliability
The platform shows excellent parse reliability: 97% of all API responses use native tool calls, with only 3% falling back to JSON extraction. Zero parse failures recorded.
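Below is a minimal sketch of that parse strategy: prefer native tool-call arguments, fall back to pulling a JSON object out of free text. The function name and move fields are illustrative, not the platform's actual API.

```python
# Hypothetical sketch of the parse pipeline: prefer native tool-call
# arguments, fall back to extracting a JSON object from raw text.
import json
import re

def parse_move(tool_args: str | None, raw_text: str) -> dict | None:
    if tool_args is not None:          # ~97% of responses in the beta
        return json.loads(tool_args)
    # Fallback (~3%): pull the first {...} block out of free-form text.
    m = re.search(r"\{.*\}", raw_text, re.DOTALL)
    return json.loads(m.group(0)) if m else None

assert parse_move('{"row": 1, "col": 2}', "") == {"row": 1, "col": 2}
assert parse_move(None, 'I will play {"row": 0, "col": 0}.') == {"row": 0, "col": 0}
```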
Cost Efficiency
$44.75 total across 4,150 matches works out to approximately $0.011 per match, which is remarkably efficient. TicTacToe accounts for the highest share ($14.54), simply due to volume.
⚠️ Open Beta: preliminary observations based on 4,150 matches with 31 model variants. Interested in scientific collaboration? Contact us!