Deep Dive

Text Mode vs Vision Mode: Does Seeing the Board Help AI?

15 models tested in both modes across 4,011 matches. Text wins 59% of head-to-head comparisons.

Text vs Vision Overview

  • Models tested (both modes): 15
  • Total matches (text): 2,027
  • Total matches (vision): 1,984
  • Head-to-head ELO comparisons won by text: 59%
  • Average vision cost multiplier: 1.7x

Average ELO by Game: Text vs Vision

  • TicTacToe: 1,111 (text) vs 1,048 (vision)
  • WordDuel: 1,087 (text) vs 1,090 (vision)
  • Connect4: 1,021 (text) vs 1,018 (vision)

Where Vision Wins Big (ELO Ξ”)

  • GPT-5.1 @ TicTacToe: +171
  • Opus 4.5 @ Connect4: +137
  • Mistral L3 @ Connect4: +110
  • Gemini 3 @ WordDuel: +105
  • Haiku 4.5 @ WordDuel: +70

Where Text Wins Big (ELO Ξ”)

  • Opus 4.5 @ TicTacToe: +346
  • Maverick @ TicTacToe: +214
  • Scout @ TicTacToe: +175
  • Gemini 3 @ TicTacToe: +153
  • GPT-4o-mini @ TicTacToe: +145

Illegal Moves per Match: Text vs Vision

  • TicTacToe: 0.90 (text) vs 1.17 (vision)
  • Battleship: 0.85 (text) vs 1.06 (vision)
  • Connect4: 0.70 (text) vs 0.80 (vision)

Overall Weighted ELO: Text vs Vision

  • Gemini 3 Flash (text): 1,144
  • Claude Opus 4.5 (text): 1,143
  • Gemini 3 Flash (vision): 1,118
  • GPT-5.1 (vision): 1,086
  • Maverick (text): 1,078
  • Opus 4.5 (vision): 1,074
  • GPT-5.2 (text): 1,063
  • GPT-5.2 (vision): 1,058

The Question

When AI models play classic board games against humans on PlayTheAI, they can receive the game state in two ways: as structured text (coordinates and symbols) or as a visual screenshot of the board. Intuitively, seeing the actual board should help β€” just like it helps humans. But does multimodal vision input actually improve AI game performance?

We tested 15 models that support both text and vision modes across all 5 games, totaling 4,011 matches.

The Verdict: Text Wins Overall

Across 75 direct comparisons (15 models Γ— 5 games), text mode produces a higher ELO in 59% of cases; 40% favor vision and 1% are tied.
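As a minimal sketch of how such a tally works, the following compares per-cell ELO pairs (one pair per model-game combination). The sample values are illustrative, not the full dataset:

```python
def tally(comparisons):
    """comparisons: list of (text_elo, vision_elo) pairs, one per model-game cell.

    Returns the fraction of cells won by text, won by vision, and tied.
    """
    n = len(comparisons)
    text = sum(1 for t, v in comparisons if t > v)
    vision = sum(1 for t, v in comparisons if v > t)
    tied = sum(1 for t, v in comparisons if t == v)
    return {"text": text / n, "vision": vision / n, "tied": tied / n}

# Four illustrative cells drawn from the tables in this post
sample = [(1404, 1058), (1090, 1261), (1087, 1090), (1111, 1111)]
print(tally(sample))  # fractions of cells won by each mode
```

Applied to all 75 cells, the same tally yields the 59% / 40% / 1% split reported above.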

The weighted overall ELO tells the same story. Of the 15 models tested in both modes:

  • 12 models perform better in text mode
  • Only 3 models perform better in vision mode (GPT-5.1, Grok 4.1 Fast, Mistral Large 3)

The TicTacToe Effect

The largest text advantage comes from TicTacToe, where text mode leads by an average of 63 ELO points. The top 5 biggest Text-over-Vision gaps all come from TicTacToe:

Model              Text ELO  Vision ELO  Delta
Claude Opus 4.5    1,404     1,058       -346
Llama 4 Maverick   1,162     948         -214
Llama 4 Scout      1,107     932         -175
Gemini 3 Flash     1,403     1,250       -153
GPT-4o-mini        1,138     993         -145

Why such a dramatic gap? TicTacToe is a simple 3Γ—3 grid. In text mode, models receive a clean, unambiguous representation like X | O | _. The vision screenshot adds visual noise without adding useful information. Models appear to make more parsing errors and illegal moves when processing the image β€” 1.17 illegal moves per match in vision vs 0.90 in text.
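A minimal sketch of the kind of unambiguous encoding a text-mode model receives, assuming a simple row-per-line format (the exact prompt format on PlayTheAI may differ):

```python
def render_board(cells):
    """Render a 9-element TicTacToe board ('X', 'O', or '_', row-major) as text."""
    rows = [" | ".join(cells[i:i + 3]) for i in range(0, 9, 3)]
    return "\n".join(rows)

board = ["X", "O", "_",
         "_", "X", "_",
         "_", "_", "O"]
print(render_board(board))
# X | O | _
# _ | X | _
# _ | _ | O
```

Every cell's state is spelled out explicitly; there is nothing for the model to mis-read.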

The draw rate in TicTacToe further supports this: 15.3% in text mode vs only 7.6% in vision. Text-mode AIs are more likely to play correctly enough to reach a strategic draw, while vision-mode AIs lose due to illegal moves or parsing mistakes.

Where Vision Actually Helps

Despite the overall text advantage, there are notable cases where vision mode produces better results:

GPT-5.1 in TicTacToe (+171 ELO) β€” The biggest outlier. GPT-5.1 reaches 1,261 ELO in vision mode versus 1,090 in text. This suggests GPT-5.1 has particularly strong visual reasoning capabilities, unlike most other models.

Claude Opus 4.5 in Connect4 (+137 ELO) β€” Opus reaches 1,095 in vision vs 958 in text. It wins 5 matches in vision mode vs 0 in text. For a vertically-dropping game like Connect4, the visual representation may genuinely help in pattern recognition.

Mistral Large 3 in Connect4 (+110 ELO) β€” Similarly benefits from seeing the Connect4 grid visually, reaching 1,078 ELO.

WordDuel: The Surprising Outlier β€” WordDuel is the only game where vision and text modes perform almost identically (avg 1,090 vs 1,087 ELO). Gemini 3 Flash even reaches 1,244 ELO in vision mode (+105 over text). This is surprising because WordDuel is fundamentally a word game β€” yet the color-coded letter feedback in the visual mode seems to help some models process the clues.

The Cost of Seeing

Vision mode consistently costs more due to the additional image tokens in each prompt:

Model                   Text $/Match  Vision $/Match  Vision Premium
GPT-4o-mini             $0.0006       $0.0209         33.9x
Gemini 2.5 Flash Lite   $0.0004      $0.0012          2.9x
Gemini 3 Flash          $0.0026      $0.0067          2.6x
GPT-4o                  $0.0097      $0.0215          2.2x
Llama 4 Scout           $0.0008      $0.0016          2.0x
GPT-5.1                 $0.0056      $0.0106          1.9x
Average across all      β€”            β€”                1.7x

The biggest outlier is GPT-4o-mini, where vision costs 34x more than text β€” likely due to how OpenRouter prices image tokens for this model. Meanwhile, Grok models show almost no cost difference (1.0-1.1x), suggesting their image processing is efficiently priced.

Overall, vision mode costs about 1.7x more on average while typically delivering worse performance.

Illegal Moves: Vision Makes Models Sloppier

Across all board games (TicTacToe, Connect4, Battleship), vision mode consistently produces more illegal moves:

  • TicTacToe: 0.90/match (text) vs 1.17/match (vision) β€” +30% more errors
  • Battleship: 0.85/match (text) vs 1.06/match (vision) β€” +25% more errors
  • Connect4: 0.70/match (text) vs 0.80/match (vision) β€” +14% more errors

Some models show dramatic spikes in vision mode:

  • Llama 4 Maverick in TicTacToe: 0.08 illegal moves (text) β†’ 1.84 (vision) β€” a 23x increase
  • Claude Haiku 4.5 in TicTacToe: 0.73 (text) β†’ 1.92 (vision) β€” 2.6x increase
  • Gemini 2.5 Flash Lite in TicTacToe: 1.25 (text) β†’ 2.50 (vision) β€” 2x increase

The hypothesis: parsing board positions from an image introduces a step where errors creep in. The model must first "read" the image, then map pixels to positions, then determine valid moves. In text mode, the positions are already explicitly encoded, removing a potential failure point.

Game-by-Game Summary

Game        Text Better  Vision Better  Verdict
TicTacToe   67%          33%            Text clearly better
WordDuel    64%          36%            Text slightly better
Battleship  60%          40%            Text slightly better
Connect4    53%          47%            Nearly even
Mastermind  53%          47%            Nearly even

The pattern suggests that simpler, highly structured games favor text mode, while games with more complex state (Connect4, Mastermind) show less of a gap. Connect4 is the closest to even, and it is the one game where visual pattern recognition (spotting four-in-a-row diagonals) seems genuinely useful.

The Exception: GPT-5.1

GPT-5.1 is the only model where vision mode clearly outperforms text overall (weighted ELO 1,086 vs 1,039 β€” a +48 point advantage). Its TicTacToe performance of 1,261 ELO in vision mode is the 4th highest per-game ELO among all variants. GPT-5.1 appears to have particularly strong image comprehension, allowing it to overcome the typical vision disadvantage.
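For reference, a weighted overall ELO of this kind can be computed as a match-count-weighted average of per-game ELOs. This is a sketch under the assumption that the site weights each game by matches played; the actual weighting scheme is not specified in this post, and the sample numbers are illustrative:

```python
def weighted_elo(per_game):
    """per_game: list of (elo, matches) tuples, one per game.

    Returns the ELO average weighted by how many matches each game contributed.
    """
    total = sum(matches for _, matches in per_game)
    return sum(elo * matches for elo, matches in per_game) / total

# A model strong in a low-volume game is pulled toward its high-volume result
print(weighted_elo([(1261, 40), (1100, 60)]))
```

The point of weighting is that a spectacular ELO in one lightly played game cannot dominate a model's overall ranking.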

Key Takeaways

  1. Text mode outperforms vision in most cases. 12 of 15 models achieve higher overall ELO with text input. The AI "reads" a structured text board more reliably than it "sees" a visual one.

  2. Vision increases costs by ~1.7x on average without delivering commensurate performance improvements.

  3. Vision causes 14-30% more illegal moves in board games, suggesting image-to-move parsing is a reliability bottleneck.

  4. Connect4 is the exception β€” the visual grid representation seems to genuinely help pattern recognition, with text and vision modes nearly even.

  5. GPT-5.1 bucks the trend β€” its vision variant outperforms text by +48 ELO points, suggesting that model architecture matters more than the general text-vs-vision debate.

  6. For production deployments: Text mode offers the best cost-performance ratio. Reserve vision mode for specific models (like GPT-5.1) that have demonstrated visual reasoning strength.


⚠️ Open Beta β€” preliminary observations based on 4,011 matches across 15 models tested in both text and vision modes. Interested in scientific collaboration? Contact us!
