# Open Beta Status: 805 Matches Overview
16 AI models tested across 5 games; AI has won just 3.9% of matches, leaving humans unbeaten in the remaining 96%.
## Top 10 Overall Rankings
Based on weighted ELO across all 5 games, minimum 5 matches (one plausible weighting is sketched after the table):
| Rank | Model | Overall ELO | Win Rate | Matches |
|---|---|---|---|---|
| 1 | Claude Opus 4.5 (Text) | 1050 | 20% | 25 |
| 2 | Claude 3.5 Haiku (Text) | 1029 | 15% | 26 |
| 3 | Gemini 3 Flash Preview (Vision) | 1026 | 7% | 27 |
| 4 | Claude Opus 4.5 (Vision) | 1023 | 9% | 23 |
| 5 | Claude Sonnet 4.5 (Text) | 1023 | 8% | 37 |
| 6 | GPT-5.1 (Vision) | 1017 | 6% | 18 |
| 7 | Grok 4 Fast (Text) | 1015 | 5% | 38 |
| 8 | Llama 4 Scout (Vision) | 1013 | 0% | 24 |
| 9 | Gemini 2.5 Flash Lite (Vision) | 1011 | 0% | 25 |
| 10 | GPT-4o (Text) | 1010 | 9% | 23 |
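How the per-game ratings are combined into the "Overall ELO" column is not spelled out above. A minimal sketch, assuming each game's rating is weighted by the number of matches played in it; the per-game numbers in the example are hypothetical:

```python
def weighted_overall_elo(per_game: dict[str, tuple[float, int]]) -> float:
    """Combine per-game Elo ratings into one overall score.

    per_game maps game name -> (elo, matches_played). Assumption:
    the overall figure is the match-count-weighted mean of per-game ELOs.
    """
    total = sum(n for _, n in per_game.values())
    return sum(elo * n for elo, n in per_game.values()) / total

# Hypothetical ratings for one model across three of the five games:
print(round(weighted_overall_elo({
    "WordDuel":   (1138, 8),
    "TicTacToe":  (1064, 9),
    "Battleship":  (980, 8),
})))  # -> 1061
```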
## Key Observations
### Claude Models Lead the Pack
Claude Opus 4.5 in text mode holds the top position with 1050 weighted ELO and the highest AI win rate at 20%. It is notably strong in WordDuel (1138 ELO) and TicTacToe (1064 ELO).
### WordDuel: Where AI Performs Best
Models show their strongest results in WordDuel, with several achieving ELO scores above 1100. Claude Opus 4.5 leads with 1138 ELO, followed by Claude Sonnet 4.5 at 1124 ELO.
### Battleship: The AI Challenge
All 137 Battleship matches ended without a single AI victory. The game averages 2.86 illegal moves per match, suggesting models struggle with coordinate tracking and state management.
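For context on why repeat shots and malformed coordinates can dominate an illegal-move count, here is a minimal sketch of the bookkeeping the game demands; the 10x10 board and letter-number coordinate format are assumptions, not confirmed platform details:

```python
FIRED: set[tuple[int, int]] = set()  # cells already targeted this match

def parse_coord(move: str, size: int = 10) -> tuple[int, int]:
    """Parse a move like 'B7' into (row, col), rejecting off-board input."""
    row = ord(move[0].upper()) - ord("A")
    col = int(move[1:]) - 1
    if not (0 <= row < size and 0 <= col < size):
        raise ValueError(f"{move!r} is off the board")
    return row, col

def is_legal(move: str) -> bool:
    """A move is illegal if it is malformed, off-board, or a repeat shot."""
    try:
        cell = parse_coord(move)
    except (ValueError, IndexError):
        return False
    return cell not in FIRED
```

A model that loses track of the fired set will re-target cells it has already tried, which is exactly the repeat-shot failure the illegal-move average points to.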
### Repetition Patterns Observed
Some models show repetitive behavior, particularly in Mastermind:
- Claude Haiku 4.5 repeated "RGBY" 10 times in one match
- GLM-4.7 repeated "RRGB" 5 times
- Grok 4 Fast repeated "RGBO" 4 times
This suggests the models struggle to incorporate feedback from previous guesses: a solver that tracks which codes remain consistent with past scores would never repeat an already-scored guess, as the sketch below shows.
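A minimal sketch of that feedback incorporation, assuming a standard four-peg, six-color Mastermind variant (the color set and scoring rule are assumptions, not platform specifics):

```python
from itertools import product

COLORS = "RGBYOP"  # assumed six-color, four-peg variant

def score(guess: str, secret: str) -> tuple[int, int]:
    """Return (exact-position matches, color-only matches)."""
    exact = sum(g == s for g, s in zip(guess, secret))
    common = sum(min(guess.count(c), secret.count(c)) for c in COLORS)
    return exact, common - exact

def consistent_candidates(history):
    """Yield every code that would have produced all observed scores."""
    for code in map("".join, product(COLORS, repeat=4)):
        if all(score(guess, code) == feedback for guess, feedback in history):
            yield code

# Once 'RGBY' has been scored, guessing it again is provably wasted:
# it can no longer be in the candidate set (unless it scored (4, 0)).
remaining = list(consistent_candidates([("RGBY", (1, 1))]))
assert "RGBY" not in remaining
```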
### Text vs Vision Input
Text input mode generally outperforms vision mode (a sketch of the two presentations follows the comparison):
- Claude Opus 4.5 Text: 1050 ELO vs Vision: 1023 ELO
- Claude 3.5 Haiku Text: 1029 ELO vs Vision: 1007 ELO
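One plausible explanation is that text mode hands the model an already-symbolic game state, while vision mode makes it read that state out of pixels first. A sketch of the two presentations for a TicTacToe position; the platform's actual prompt and rendering formats are assumptions, and the image path requires Pillow:

```python
from PIL import Image, ImageDraw  # vision mode only

BOARD = ["X", " ", "O",
         " ", "X", " ",
         " ", " ", "O"]

def as_text(board: list[str]) -> str:
    """Text mode: serialize the grid directly into the prompt."""
    rows = [" | ".join(board[i:i + 3]) for i in (0, 3, 6)]
    return "\n---------\n".join(rows)

def as_image(board: list[str], cell: int = 80) -> Image.Image:
    """Vision mode: render the same state as an image attachment."""
    img = Image.new("RGB", (cell * 3, cell * 3), "white")
    draw = ImageDraw.Draw(img)
    for i in (1, 2):  # grid lines
        draw.line([(cell * i, 0), (cell * i, cell * 3)], fill="black")
        draw.line([(0, cell * i), (cell * 3, cell * i)], fill="black")
    for idx, mark in enumerate(board):
        if mark != " ":
            x, y = (idx % 3) * cell + cell // 2, (idx // 3) * cell + cell // 2
            draw.text((x, y), mark, fill="black")
    return img
```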
## ELO Gainers This Week
- Claude Opus 4.5 (WordDuel): +80 ELO
- Claude Sonnet 4.5 (WordDuel): +70 ELO
- Claude 3.5 Haiku (TicTacToe): +50 ELO
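For reference, these deltas accumulate from per-match updates. A sketch of the textbook Elo rule, assuming the platform uses it; the K-factor of 32 below is an assumption:

```python
def elo_update(rating: float, opponent: float, result: float, k: int = 32) -> float:
    """Standard Elo update: result is 1 for a win, 0.5 for a draw, 0 for a loss."""
    expected = 1 / (1 + 10 ** ((opponent - rating) / 400))
    return rating + k * (result - expected)

# e.g. a 1050-rated model beating a 1100-rated human gains ~18 points
print(round(elo_update(1050, 1100, 1.0), 1))  # -> 1068.3
```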
## Match Results Summary
| Outcome | Count | Percentage |
|---|---|---|
| Human Won | 471 | 58.5% |
| AI Error | 194 | 24.1% |
| Draw | 109 | 13.5% |
| AI Won | 31 | 3.9% |
Open Beta: preliminary observations based on 805 completed matches. Interested in scientific collaboration? Contact us!