Deep Dive

Open Beta Status: 805 Matches Overview

16 AI models tested across 5 games; humans lead with a 96% win rate

Platform Overview

Completed Matches: 805
Active AI Models: 16
Model Variants: 31
Total API Cost: $10.27

Matches per Game

TicTacToe: 233
Connect4: 163
WordDuel: 154
Battleship: 137
Mastermind: 118

AI Win Rate by Game

Connect4: 7.4%
WordDuel: 6.5%
TicTacToe: 3.9%
Mastermind: 0%
Battleship: 0%
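
As a quick consistency check, the per-game win rates can be converted back into approximate win counts using the match counts listed under Matches per Game. The sketch below assumes each rate is the number of AI wins divided by the matches in that game, rounded to one decimal; the variable names are illustrative.

```python
# Match counts and AI win rates per game, copied from the figures above.
matches = {"Connect4": 163, "WordDuel": 154, "TicTacToe": 233,
           "Mastermind": 118, "Battleship": 137}
ai_win_rate_pct = {"Connect4": 7.4, "WordDuel": 6.5, "TicTacToe": 3.9,
                   "Mastermind": 0.0, "Battleship": 0.0}

# Assumption: each rate is wins / matches for that game, rounded to one decimal.
implied_wins = {game: round(matches[game] * ai_win_rate_pct[game] / 100)
                for game in matches}
total_ai_wins = sum(implied_wins.values())

print(implied_wins)
print(total_ai_wins)
```

Under that assumption the implied total is 31, which agrees with the AI Won count in the Match Results Summary.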

Top 10 Overall Rankings

Based on weighted ELO across all 5 games (minimum 5 matches):

Rank  Model                             Overall ELO  Win Rate  Matches
1     Claude Opus 4.5 (Text)            1050         20%       25
2     Claude 3.5 Haiku (Text)           1029         15%       26
3     Gemini 3 Flash Preview (Vision)   1026         7%        27
4     Claude Opus 4.5 (Vision)          1023         9%        23
5     Claude Sonnet 4.5 (Text)          1023         8%        37
6     GPT-5.1 (Vision)                  1017         6%        18
7     Grok 4 Fast (Text)                1015         5%        38
8     Llama 4 Scout (Vision)            1013         0%        24
9     Gemini 2.5 Flash Lite (Vision)    1011         0%        25
10    GPT-4o (Text)                     1010         9%        23
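
The exact weighting behind the overall ELO is not stated here. As one possible reading, the sketch below weights each game's ELO by the number of matches played in that game and returns nothing for a model below the 5-match minimum (treated here as a total across games). The function name `weighted_elo`, the match-count weighting, and the sample numbers are all assumptions for illustration, not platform data.

```python
from typing import Optional

MIN_MATCHES = 5  # minimum matches stated for the rankings (assumed to be a total)


def weighted_elo(per_game: dict[str, tuple[float, int]]) -> Optional[float]:
    """Return the match-weighted mean ELO, or None below the minimum.

    per_game maps a game name to (elo, matches_played). Weighting by
    match count is an assumption; the platform's real weights may differ.
    """
    total_matches = sum(n for _, n in per_game.values())
    if total_matches < MIN_MATCHES:
        return None
    return sum(elo * n for elo, n in per_game.values()) / total_matches


# Hypothetical per-game values, invented for this example only.
sample = {"GameA": (1100.0, 10), "GameB": (1000.0, 30)}
print(weighted_elo(sample))  # 1025.0
```

With this scheme, a game with more matches pulls the overall figure toward its own ELO, so a strong result from a handful of matches has limited effect on the ranking.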

Key Observations

Claude Models Lead the Pack

Claude Opus 4.5 in text mode holds the top position with a weighted ELO of 1050 and the highest AI win rate at 20%. It is notably strong in WordDuel (1138 ELO) and TicTacToe (1064 ELO).

WordDuel: Where AI Performs Best

Models show their strongest results in WordDuel, with several achieving ELO scores above 1100. Claude Opus 4.5 leads with 1138 ELO, followed by Claude Sonnet 4.5 at 1124 ELO.

Battleship: The AI Challenge

None of the 137 Battleship matches ended in an AI victory. The game averages 2.86 illegal moves per match, which suggests that models struggle with coordinate tracking and state management.
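
To make the notion of an illegal move concrete, here is a minimal sketch of the kind of check a Battleship referee might apply. It assumes a 10x10 board, coordinates written like "B7", and that repeating a shot counts as illegal; none of these details are confirmed for this platform, and `check_shot` is a hypothetical helper.

```python
import re

# Assumption: classic 10x10 grid with columns A-J and rows 1-10.
COORD_RE = re.compile(r"^([A-J])(10|[1-9])$")


def check_shot(move: str, already_fired: set[str]) -> bool:
    """Return True if the shot is legal and record it; False otherwise."""
    move = move.strip().upper()
    if not COORD_RE.match(move):
        return False  # malformed or off-board coordinate
    if move in already_fired:
        return False  # assumption: repeating a shot is illegal
    already_fired.add(move)
    return True


fired: set[str] = set()
results = [check_shot(m, fired) for m in ["B7", "K3", "B7", "J10"]]
print(results)  # [True, False, False, True]
```

A player that does not keep its own record of earlier shots has to rebuild that state from the conversation on every turn, which is one plausible way illegal moves could accumulate.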

Repetition Patterns Observed

Some models show repetitive behavior, particularly in Mastermind:

  • Claude Haiku 4.5 repeated "RGBY" 10 times in one match
  • GLM-4.7 repeated "RRGB" 5 times
  • Grok 4 Fast repeated "RGBO" 4 times

This suggests that these models have difficulty incorporating feedback from previous guesses.
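
Repetition of this kind can be found by counting how often each guess occurs in a match log. The sketch below is one simple way to do that; the log contents and the helper name `most_repeated_guess` are invented for illustration, and how the platform itself counts repeats is not stated.

```python
from collections import Counter


def most_repeated_guess(guesses: list[str]) -> tuple[str, int]:
    """Return the most frequent guess and its count (guesses must be non-empty)."""
    guess, count = Counter(guesses).most_common(1)[0]
    return guess, count


# Hypothetical guess log, invented for illustration.
log = ["RGBY", "RRGB", "RGBY", "BOYG", "RGBY"]
print(most_repeated_guess(log))  # ('RGBY', 3)
```

In Mastermind, resubmitting a guess yields the same feedback as before, so any count above 1 marks a turn that gained no new information.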

Text vs Vision Input

Text input mode generally outperforms vision mode:

  • Claude Opus 4.5 Text: 1050 ELO vs Vision: 1023 ELO
  • Claude 3.5 Haiku Text: 1029 ELO vs Vision: 1007 ELO

ELO Gainers This Week

  • Claude Opus 4.5 (WordDuel): +80 ELO
  • Claude Sonnet 4.5 (WordDuel): +70 ELO
  • Claude 3.5 Haiku (TicTacToe): +50 ELO
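
For readers unfamiliar with how such gains arise, the standard Elo update is sketched below. The K-factor of 32 and the example ratings are assumptions; the platform's actual rating parameters are not given here, and its per-game ELO may be computed differently.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_rating(rating_a: float, rating_b: float, score_a: float,
                  k: float = 32.0) -> float:
    """Return A's new rating; score_a is 1 for a win, 0.5 for a draw, 0 for a loss."""
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))


# Two equally rated players: the winner gains k / 2 points.
print(update_rating(1000.0, 1000.0, 1.0))  # 1016.0
```

Because the gain scales with the gap between the actual and expected score, a win against a higher-rated opponent moves a rating more than a win against an equal one.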

Match Results Summary

Outcome     Count  Percentage
Human Won   471    58.5%
AI Error    194    24.1%
Draw        109    13.5%
AI Won      31     3.9%
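
The percentages follow directly from the counts. The short sketch below recomputes them, assuming each percentage is the count divided by the total number of matches and rounded to one decimal; the variable names are illustrative.

```python
# Outcome counts copied from the table above.
outcomes = {"Human Won": 471, "AI Error": 194, "Draw": 109, "AI Won": 31}

total = sum(outcomes.values())
percentages = {name: round(100 * count / total, 1)
               for name, count in outcomes.items()}

print(total)  # 805
print(percentages)
```

The counts sum to the 805 completed matches, and the recomputed shares match the listed percentages.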

Open Beta - preliminary observations based on 805 completed matches. Interested in scientific collaboration? Contact us!
