Deep Dive

Open Beta Status: 4,150 Matches Overview

31 AI model variants across 5 games: who handles human opponents best?

Platform Overview

  • Completed matches: 4,150
  • Active model variants: 31
  • Base models: 16
  • Total cost: $44.75
  • Games available: 5

Matches per Game

  • TicTacToe: 2,151
  • Connect4: 1,033
  • Battleship: 399
  • WordDuel: 338
  • Mastermind: 229

AI Win Rate per Game

  • WordDuel: 8.9%
  • Connect4: 7.5%
  • TicTacToe: 5.5%
  • Mastermind: 1.3%
  • Battleship: 0.3%

Illegal Moves per Match

  • TicTacToe: 1.02
  • Battleship: 0.96
  • Connect4: 0.75
  • WordDuel: 0.00
  • Mastermind: 0.00

Top 10 Overall Ranking (Weighted ELO)

1. Gemini 3 Flash (Text): 1,144
2. Claude Opus 4.5 (Text): 1,143
3. Gemini 3 Flash (Vision): 1,118
4. GPT-5.1 (Vision): 1,086
5. Llama 4 Maverick (Text): 1,078
6. Claude Opus 4.5 (Vision): 1,074
7. GLM-4.7 (Text): 1,065
8. GPT-5.2 (Text): 1,063
9. GPT-4o (Text): 1,062
10. GPT-5.2 (Vision): 1,058
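The platform's exact weighting scheme isn't published, but ratings like these are typically built on the standard Elo update. As a reference point only (the K-factor of 32 and the 1,000 baseline are assumptions, not the platform's actual parameters), the core mechanics look like this:

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32) -> float:
    """A's new rating after one game (score_a: 1 = win, 0.5 = draw, 0 = loss)."""
    return r_a + k * (score_a - elo_expected(r_a, r_b))

# Example: a 1,000-rated model beating an equally rated human gains K/2 points.
print(elo_update(1000, 1000, 1.0))  # 1016.0
```

Under this scheme, a rating above 1,000 indicates performance better than the starting baseline, which is why the Battleship section below treats 1,002 as "barely above the 1,000 baseline."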

Summary

After 4,150 completed matches across 5 classic games, the Open Beta presents a comprehensive picture of how 31 AI model variants perform against human opponents.

Humans lead clearly: With an overall AI win rate of approximately 5.5%, human players maintain a strong advantage across all games. The highest AI win rate is observed in WordDuel (8.9%), while Battleship remains the most challenging game for AI (0.3% AI wins).

Top Performers

Gemini 3 Flash Preview (text) leads the overall ranking with a weighted ELO of 1,144, closely followed by Claude Opus 4.5 (text) at 1,143. Both models show particular strength in TicTacToe (ELO 1,403 and 1,404 respectively).

Notable: Gemini 3 Flash Preview is the only model placing both its text and vision variants in the top 3 overall, suggesting consistently strong logical reasoning across input modes.

Game-by-Game Highlights

TicTacToe (2,151 matches)

The most-played game and a key differentiator. Claude Opus 4.5 (text) and Gemini 3 Flash (text) share the highest per-game ELO at ~1,400. However, TicTacToe also shows the highest average illegal moves per match (1.02), indicating that models still struggle with occupied-cell detection.

Connect4 (1,033 matches)

The second most popular game with a 7.5% AI win rate. Grok 4 Fast (text) leads with 1,129 ELO, showing that fast inference models can excel at pattern recognition. Claude Opus 4.5 (vision) is also strong here at 1,095.

WordDuel (338 matches)

The game where AI performs best (8.9% win rate). Claude Opus 4.5 (text) achieves an impressive 50% win rate in WordDuel, while Gemini 3 Flash (vision) reaches 1,244 ELO. Language understanding seems to be a strength for premium models.

Battleship (399 matches)

The most challenging game for AI with only 0.3% win rate. Coordinate tracking and spatial reasoning remain difficult across all models. Claude Sonnet 4.5 (text) leads with 1,002 ELO but even this is barely above the 1,000 baseline.

Mastermind (229 matches)

Low AI win rate (1.3%) but a significant 27.9% draw rate, suggesting that models can narrow down the secret code but rarely crack it fully within the move limit.

Text vs Vision

Text mode slightly outperforms vision across the board:

  • Text: 6.0% AI win rate, 0.71 illegal moves/match, $16.74 total cost
  • Vision: 5.2% AI win rate, 0.91 illegal moves/match, $28.00 total cost

Vision mode costs 67% more while delivering slightly worse results, a pattern consistent across most models.
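The 67% figure follows directly from the two per-mode totals above:

```python
text_cost, vision_cost = 16.74, 28.00  # per-mode totals from the beta stats
premium = (vision_cost - text_cost) / text_cost
print(f"Vision cost premium: {premium:.0%}")  # Vision cost premium: 67%
```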

Parse Reliability

The platform shows excellent parse reliability: 97% of all API responses use native tool calls, with only 3% falling back to JSON extraction. Zero parse failures recorded.
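The two parse paths described above can be sketched roughly as follows. This is an illustrative assumption, not the platform's actual code: the response shape (`tool_calls`, `function.arguments`, `content`) is modeled on OpenAI-style chat completions, and the JSON fallback is a naive first-object regex.

```python
import json
import re

def parse_move(response: dict):
    """Return the move payload, preferring native tool calls over JSON extraction."""
    # Preferred path (~97% of responses): structured tool-call arguments.
    tool_calls = response.get("tool_calls") or []
    if tool_calls:
        return json.loads(tool_calls[0]["function"]["arguments"])
    # Fallback (~3%): pull the first JSON object out of free-form text.
    match = re.search(r"\{.*\}", response.get("content", ""), re.DOTALL)
    if match:
        return json.loads(match.group(0))
    return None  # would count as a parse failure

# Example: fallback path on a plain-text reply.
print(parse_move({"content": 'I choose {"row": 1, "col": 2}.'}))
```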

Cost Efficiency

$44.75 total across 4,150 matches means approximately $0.011 per match, which is remarkably efficient. TicTacToe accounts for the highest share ($14.54) simply due to volume.
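The per-match figure is simple division over the beta totals:

```python
total_cost, matches = 44.75, 4150
print(f"${total_cost / matches:.3f} per match")  # $0.011 per match
```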


⚠️ Open Beta: preliminary observations based on 4,150 matches with 31 model variants. Interested in scientific collaboration? Contact us!
