Text Mode vs Vision Mode: Does Seeing the Board Help AI?
15 models tested in both modes across 4,011 matches. Text wins 59% of head-to-head comparisons.
Text vs Vision Overview
The Question
When AI models play classic board games against humans on PlayTheAI, they can receive the game state in two ways: as structured text (coordinates and symbols) or as a visual screenshot of the board. Intuitively, seeing the actual board should help, just like it helps humans. But does multimodal vision input actually improve AI game performance?
We tested 15 models that support both text and vision modes across all 5 games, totaling 4,011 matches.
The Verdict: Text Wins Overall
Across 75 direct comparisons (15 models × 5 games), text mode produces higher ELO in 59% of cases. Only 40% favor vision, with 1% tied.
The weighted overall ELO tells the same story. Of the 15 models tested in both modes:
- 12 models perform better in text mode
- Only 3 models perform better in vision mode (GPT-5.1, Grok 4.1 Fast, Mistral Large 3)
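As an illustration, the head-to-head tally can be computed from per-mode ratings. The `tally_mode_wins` helper and the sample entries below are a sketch, not the actual dataset or pipeline:

```python
# Sketch: tallying which mode "wins" each (model, game) comparison.
# The ratings structure and sample entries are illustrative assumptions.

def tally_mode_wins(ratings):
    """ratings maps (model, game) -> {"text": elo, "vision": elo}."""
    wins = {"text": 0, "vision": 0, "tie": 0}
    for elos in ratings.values():
        if elos["text"] > elos["vision"]:
            wins["text"] += 1
        elif elos["vision"] > elos["text"]:
            wins["vision"] += 1
        else:
            wins["tie"] += 1
    return wins

sample = {
    ("Claude Opus 4.5", "TicTacToe"): {"text": 1404, "vision": 1058},
    ("GPT-5.1", "TicTacToe"): {"text": 1090, "vision": 1261},
}
print(tally_mode_wins(sample))  # {'text': 1, 'vision': 1, 'tie': 0}
```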
The TicTacToe Effect
The largest text advantage comes from TicTacToe, where text mode leads by an average of 63 ELO points. The top 5 biggest Text-over-Vision gaps all come from TicTacToe:
| Model | Text ELO | Vision ELO | Delta (Vision − Text) |
|---|---|---|---|
| Claude Opus 4.5 | 1,404 | 1,058 | -346 |
| Llama 4 Maverick | 1,162 | 948 | -214 |
| Llama 4 Scout | 1,107 | 932 | -175 |
| Gemini 3 Flash | 1,403 | 1,250 | -153 |
| GPT-4o-mini | 1,138 | 993 | -145 |
Why such a dramatic gap? TicTacToe is a simple 3×3 grid. In text mode, models receive a clean, unambiguous representation like `X | O | _`. The vision screenshot adds visual noise without adding useful information. Models appear to make more parsing errors and illegal moves when processing the image: 1.17 illegal moves per match in vision vs 0.90 in text.
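A minimal sketch of what the text-mode pipeline removes, assuming a simple row-major board and an "empty cell" legality rule (the exact format PlayTheAI sends is not shown here):

```python
# Sketch: a text encoding makes positions and legality explicit,
# with no image-parsing step in between. Format is an assumption.

EMPTY = "_"

def render_board(board):
    """board is a list of 3 rows, each a list of 'X', 'O', or '_'."""
    return "\n".join(" | ".join(row) for row in board)

def is_legal(board, row, col):
    """A move is legal only if it targets an empty cell on the grid."""
    return 0 <= row < 3 and 0 <= col < 3 and board[row][col] == EMPTY

board = [["X", "O", "_"],
         ["_", "X", "_"],
         ["_", "_", "O"]]
print(render_board(board))       # first row reads: X | O | _
print(is_legal(board, 1, 0))     # True: cell is empty
print(is_legal(board, 0, 0))     # False: already occupied
```

A vision-mode model must first recover this same structure from pixels before it can reason about legality, which is the extra failure point discussed below.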
The draw rate in TicTacToe further supports this: 15.3% in text mode vs only 7.6% in vision. Text-mode AIs are more likely to play correctly enough to reach a strategic draw, while vision-mode AIs lose due to illegal moves or parsing mistakes.
Where Vision Actually Helps
Despite the overall text advantage, there are notable cases where vision mode produces better results:
GPT-5.1 in TicTacToe (+171 ELO): The biggest outlier. GPT-5.1 reaches 1,261 ELO in vision mode versus 1,090 in text. This suggests GPT-5.1 has particularly strong visual reasoning capabilities, unlike most other models.
Claude Opus 4.5 in Connect4 (+137 ELO): Opus reaches 1,095 in vision vs 958 in text. It wins 5 matches in vision mode vs 0 in text. For a vertically-dropping game like Connect4, the visual representation may genuinely help with pattern recognition.
Mistral Large 3 in Connect4 (+110 ELO): Similarly benefits from seeing the Connect4 grid visually, reaching 1,078 ELO.
WordDuel: The Surprising Outlier. WordDuel is the only game where vision and text modes perform almost identically (avg 1,090 vs 1,087 ELO). Gemini 3 Flash even reaches 1,244 ELO in vision mode (+105 over text). This is surprising because WordDuel is fundamentally a word game, yet the color-coded letter feedback in the visual mode seems to help some models process the clues.
The Cost of Seeing
Vision mode consistently costs more due to the additional image tokens in each prompt:
| Model | Text $/Match | Vision $/Match | Vision Premium |
|---|---|---|---|
| GPT-4o-mini | $0.0006 | $0.0209 | 33.9x |
| Gemini 2.5 Flash Lite | $0.0004 | $0.0012 | 2.9x |
| Gemini 3 Flash | $0.0026 | $0.0067 | 2.6x |
| GPT-4o | $0.0097 | $0.0215 | 2.2x |
| Llama 4 Scout | $0.0008 | $0.0016 | 2.0x |
| GPT-5.1 | $0.0056 | $0.0106 | 1.9x |
| Average across all | – | – | 1.7x |
The biggest outlier is GPT-4o-mini, where vision costs roughly 34x more than text, likely due to how OpenRouter prices image tokens for this model. Meanwhile, Grok models show almost no cost difference (1.0-1.1x), suggesting their image processing is efficiently priced.
Overall, vision mode costs about 1.7x more on average while typically delivering worse performance.
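The premium column is simply the ratio of the two per-match costs. A sketch using figures from the table (small discrepancies with the published ratios, such as GPT-4o-mini's 33.9x, are presumably because the table's ratios were computed from unrounded costs):

```python
# Sketch: computing the "vision premium" from per-match costs.
# Costs below are the rounded figures quoted in the table above.

def vision_premium(text_cost, vision_cost):
    """Ratio of vision-mode cost to text-mode cost per match."""
    return vision_cost / text_cost

print(round(vision_premium(0.0097, 0.0215), 1))  # GPT-4o: 2.2
print(round(vision_premium(0.0026, 0.0067), 1))  # Gemini 3 Flash: 2.6
```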
Illegal Moves: Vision Makes Models Sloppier
Across all board games (TicTacToe, Connect4, Battleship), vision mode consistently produces more illegal moves:
- TicTacToe: 0.90/match (text) vs 1.17/match (vision), +30% more errors
- Battleship: 0.85/match (text) vs 1.06/match (vision), +25% more errors
- Connect4: 0.70/match (text) vs 0.80/match (vision), +14% more errors
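The percentage increases above follow directly from the per-match rates; a quick sketch to reproduce them:

```python
# Sketch: relative increase in illegal moves per match, vision over text,
# using the per-game averages quoted above.

def pct_increase(text_rate, vision_rate):
    """Percent increase of vision-mode rate over text-mode rate."""
    return round((vision_rate - text_rate) / text_rate * 100)

print(pct_increase(0.90, 1.17))  # TicTacToe: 30
print(pct_increase(0.85, 1.06))  # Battleship: 25
print(pct_increase(0.70, 0.80))  # Connect4: 14
```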
Some models show dramatic spikes in vision mode:
- Llama 4 Maverick in TicTacToe: 0.08 illegal moves (text) → 1.84 (vision), a 23x increase
- Claude Haiku 4.5 in TicTacToe: 0.73 (text) → 1.92 (vision), a 2.6x increase
- Gemini 2.5 Flash Lite in TicTacToe: 1.25 (text) → 2.50 (vision), a 2x increase
The hypothesis: parsing board positions from an image introduces a step where errors creep in. The model must first "read" the image, then map pixels to positions, then determine valid moves. In text mode, the positions are already explicitly encoded, removing a potential failure point.
Game-by-Game Summary
| Game | Text Better | Vision Better | Verdict |
|---|---|---|---|
| TicTacToe | 67% | 33% | Text clearly better |
| WordDuel | 64% | 36% | Text slightly better |
| Battleship | 60% | 40% | Text slightly better |
| Connect4 | 53% | 47% | Nearly even |
| Mastermind | 53% | 47% | Nearly even |
The pattern suggests that simpler, more structured games favor text mode, while more complex games (Connect4, Mastermind) show less difference. Connect4 is the closest to even, and it is the one game where visual pattern recognition (spotting 4-in-a-row diagonals) seems genuinely useful.
The Exception: GPT-5.1
GPT-5.1 is the only model where vision mode clearly outperforms text overall (weighted ELO 1,086 vs 1,039, a +48-point advantage). Its TicTacToe performance of 1,261 ELO in vision mode is the 4th-highest per-game ELO among all variants. GPT-5.1 appears to have particularly strong image comprehension, allowing it to overcome the typical vision disadvantage.
Key Takeaways
Text mode outperforms vision in most cases. 12 of 15 models achieve higher overall ELO with text input. The AI "reads" a structured text board more reliably than it "sees" a visual one.
Vision increases costs by ~1.7x on average without delivering commensurate performance improvements.
Vision causes 14-30% more illegal moves in board games, suggesting image-to-move parsing is a reliability bottleneck.
Connect4 is the exception: the visual grid representation seems to genuinely help pattern recognition, with text and vision modes nearly even.
GPT-5.1 bucks the trend: its vision variant outperforms text by +48 ELO points, suggesting that model architecture matters more than the general text-vs-vision debate.
For production deployments: Text mode offers the best cost-performance ratio. Reserve vision mode for specific models (like GPT-5.1) that have demonstrated visual reasoning strength.
⚠️ Open Beta: preliminary observations based on 4,011 matches across 15 models tested in both text and vision modes. Interested in scientific collaboration? Contact us!