Deep Dive

Text Mode vs Vision Mode: Does Seeing the Board Help AI?

15 models tested in both modes across 4,011 matches. Text wins 59% of head-to-head comparisons.

Text vs Vision Overview

  • Models tested (both modes): 15
  • Total matches (text): 2,027
  • Total matches (vision): 1,984
  • Head-to-head ELO comparisons won by text: 59%
  • Average vision cost multiplier: 1.7x

Average ELO by Game: Text vs Vision

  • TicTacToe: 1,111 (text) vs 1,048 (vision)
  • WordDuel: 1,087 (text) vs 1,090 (vision)
  • Connect4: 1,021 (text) vs 1,018 (vision)

Where Vision Wins Big (ELO Ξ”)

  • GPT-5.1 @ TicTacToe: +171
  • Opus 4.5 @ Connect4: +137
  • Mistral L3 @ Connect4: +110
  • Gemini 3 @ WordDuel: +105
  • Haiku 4.5 @ WordDuel: +70

Where Text Wins Big (ELO Ξ”)

  • Opus 4.5 @ TicTacToe: +346
  • Maverick @ TicTacToe: +214
  • Scout @ TicTacToe: +175
  • Gemini 3 @ TicTacToe: +153
  • GPT-4o-mini @ TicTacToe: +145

Illegal Moves per Match: Text vs Vision

  • TicTacToe: 0.90 (text) vs 1.17 (vision)
  • Battleship: 0.85 (text) vs 1.06 (vision)
  • Connect4: 0.70 (text) vs 0.80 (vision)

Overall Weighted ELO: Text vs Vision

  • Gemini 3 Flash (text): 1,144
  • Claude Opus 4.5 (text): 1,143
  • Gemini 3 Flash (vision): 1,118
  • GPT-5.1 (vision): 1,086
  • Maverick (text): 1,078
  • Opus 4.5 (vision): 1,074
  • GPT-5.2 (text): 1,063
  • GPT-5.2 (vision): 1,058

The Question

When AI models play classic board games against humans on PlayTheAI, they can receive the game state in two ways: as structured text (coordinates and symbols) or as a visual screenshot of the board. Intuitively, seeing the actual board should help β€” just like it helps humans. But does multimodal vision input actually improve AI game performance?

We tested 15 models that support both text and vision modes across all 5 games, totaling 4,011 matches.

The Verdict: Text Wins Overall

Across 75 direct comparisons (15 models Γ— 5 games), text mode produces a higher ELO in 59% of cases; 40% favor vision and 1% are tied.
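As a minimal sketch of how such a tally works, the following compares per-cell ELO pairs (one pair per model-game combination). The sample values are illustrative, not the full dataset:

```python
def tally(comparisons):
    """comparisons: list of (text_elo, vision_elo) pairs, one per model-game cell.

    Returns the fraction of cells won by text, won by vision, and tied.
    """
    n = len(comparisons)
    text = sum(1 for t, v in comparisons if t > v)
    vision = sum(1 for t, v in comparisons if v > t)
    tied = sum(1 for t, v in comparisons if t == v)
    return {"text": text / n, "vision": vision / n, "tied": tied / n}

# Four illustrative cells drawn from the tables in this post
sample = [(1404, 1058), (1090, 1261), (1087, 1090), (1111, 1111)]
print(tally(sample))  # fractions of cells won by each mode
```

Applied to all 75 cells, the same tally yields the 59% / 40% / 1% split reported above.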

The weighted overall ELO tells the same story. Of the 15 models tested in both modes:

  • 12 models perform better in text mode
  • Only 3 models perform better in vision mode (GPT-5.1, Grok 4.1 Fast, Mistral Large 3)

The TicTacToe Effect

The largest text advantage comes from TicTacToe, where text mode leads by an average of 63 ELO points. The top 5 biggest Text-over-Vision gaps all come from TicTacToe:

Model              Text ELO  Vision ELO  Delta
Claude Opus 4.5    1,404     1,058       -346
Llama 4 Maverick   1,162     948         -214
Llama 4 Scout      1,107     932         -175
Gemini 3 Flash     1,403     1,250       -153
GPT-4o-mini        1,138     993         -145

Why such a dramatic gap? TicTacToe is a simple 3Γ—3 grid. In text mode, models receive a clean, unambiguous representation like X | O | _. The vision screenshot adds visual noise without adding useful information. Models appear to make more parsing errors and illegal moves when processing the image β€” 1.17 illegal moves per match in vision vs 0.90 in text.
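A minimal sketch of the kind of unambiguous encoding a text-mode model receives, assuming a simple row-per-line format (the exact prompt format on PlayTheAI may differ):

```python
def render_board(cells):
    """Render a 9-element TicTacToe board ('X', 'O', or '_', row-major) as text."""
    rows = [" | ".join(cells[i:i + 3]) for i in range(0, 9, 3)]
    return "\n".join(rows)

board = ["X", "O", "_",
         "_", "X", "_",
         "_", "_", "O"]
print(render_board(board))
# X | O | _
# _ | X | _
# _ | _ | O
```

Every cell's state is spelled out explicitly; there is nothing for the model to mis-read.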

The draw rate in TicTacToe further supports this: 15.3% in text mode vs only 7.6% in vision. Text-mode AIs are more likely to play correctly enough to reach a strategic draw, while vision-mode AIs lose due to illegal moves or parsing mistakes.

Where Vision Actually Helps

Despite the overall text advantage, there are notable cases where vision mode produces better results:

GPT-5.1 in TicTacToe (+171 ELO) β€” The biggest outlier. GPT-5.1 reaches 1,261 ELO in vision mode versus 1,090 in text. This suggests GPT-5.1 has particularly strong visual reasoning capabilities, unlike most other models.

Claude Opus 4.5 in Connect4 (+137 ELO) β€” Opus reaches 1,095 in vision vs 958 in text. It wins 5 matches in vision mode vs 0 in text. For a vertically-dropping game like Connect4, the visual representation may genuinely help in pattern recognition.

Mistral Large 3 in Connect4 (+110 ELO) β€” Similarly benefits from seeing the Connect4 grid visually, reaching 1,078 ELO.

WordDuel: The Surprising Outlier β€” WordDuel is the only game where vision and text modes perform almost identically (avg 1,090 vs 1,087 ELO). Gemini 3 Flash even reaches 1,244 ELO in vision mode (+105 over text). This is surprising because WordDuel is fundamentally a word game β€” yet the color-coded letter feedback in the visual mode seems to help some models process the clues.

The Cost of Seeing

Vision mode consistently costs more due to the additional image tokens in each prompt:

Model                   Text $/Match  Vision $/Match  Vision Premium
GPT-4o-mini             $0.0006       $0.0209         33.9x
Gemini 2.5 Flash Lite   $0.0004      $0.0012          2.9x
Gemini 3 Flash          $0.0026      $0.0067          2.6x
GPT-4o                  $0.0097      $0.0215          2.2x
Llama 4 Scout           $0.0008      $0.0016          2.0x
GPT-5.1                 $0.0056      $0.0106          1.9x
Average across all      β€”            β€”                1.7x

The biggest outlier is GPT-4o-mini, where vision costs 34x more than text β€” likely due to how OpenRouter prices image tokens for this model. Meanwhile, Grok models show almost no cost difference (1.0-1.1x), suggesting their image processing is efficiently priced.

Overall, vision mode costs about 1.7x more on average while typically delivering worse performance.

Illegal Moves: Vision Makes Models Sloppier

Across all board games (TicTacToe, Connect4, Battleship), vision mode consistently produces more illegal moves:

  • TicTacToe: 0.90/match (text) vs 1.17/match (vision) β€” +30% more errors
  • Battleship: 0.85/match (text) vs 1.06/match (vision) β€” +25% more errors
  • Connect4: 0.70/match (text) vs 0.80/match (vision) β€” +14% more errors

Some models show dramatic spikes in vision mode:

  • Llama 4 Maverick in TicTacToe: 0.08 illegal moves (text) β†’ 1.84 (vision) β€” a 23x increase
  • Claude Haiku 4.5 in TicTacToe: 0.73 (text) β†’ 1.92 (vision) β€” 2.6x increase
  • Gemini 2.5 Flash Lite in TicTacToe: 1.25 (text) β†’ 2.50 (vision) β€” 2x increase

The hypothesis: parsing board positions from an image introduces a step where errors creep in. The model must first "read" the image, then map pixels to positions, then determine valid moves. In text mode, the positions are already explicitly encoded, removing a potential failure point.

Game-by-Game Summary

Game        Text Better  Vision Better  Verdict
TicTacToe   67%          33%            Text clearly better
WordDuel    64%          36%            Text slightly better
Battleship  60%          40%            Text slightly better
Connect4    53%          47%            Nearly even
Mastermind  53%          47%            Nearly even

The pattern suggests that simpler, highly structured games favor text mode, while games with more complex state (Connect4, Mastermind) show less of a gap. Connect4 is the closest to even, and it is the one game where visual pattern recognition (spotting four-in-a-row diagonals) seems genuinely useful.

The Exception: GPT-5.1

GPT-5.1 is the only model where vision mode clearly outperforms text overall (weighted ELO 1,086 vs 1,039 β€” a +48 point advantage). Its TicTacToe performance of 1,261 ELO in vision mode is the 4th highest per-game ELO among all variants. GPT-5.1 appears to have particularly strong image comprehension, allowing it to overcome the typical vision disadvantage.
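For reference, a weighted overall ELO of this kind can be computed as a match-count-weighted average of per-game ELOs. This is a sketch under the assumption that the site weights each game by matches played; the actual weighting scheme is not specified in this post, and the sample numbers are illustrative:

```python
def weighted_elo(per_game):
    """per_game: list of (elo, matches) tuples, one per game.

    Returns the ELO average weighted by how many matches each game contributed.
    """
    total = sum(matches for _, matches in per_game)
    return sum(elo * matches for elo, matches in per_game) / total

# A model strong in a low-volume game is pulled toward its high-volume result
print(weighted_elo([(1261, 40), (1100, 60)]))
```

The point of weighting is that a spectacular ELO in one lightly played game cannot dominate a model's overall ranking.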

Key Takeaways

  1. Text mode outperforms vision in most cases. 12 of 15 models achieve higher overall ELO with text input. The AI "reads" a structured text board more reliably than it "sees" a visual one.

  2. Vision increases costs by ~1.7x on average without delivering commensurate performance improvements.

  3. Vision causes 14-30% more illegal moves in board games, suggesting image-to-move parsing is a reliability bottleneck.

  4. Connect4 is the exception β€” the visual grid representation seems to genuinely help pattern recognition, with text and vision modes nearly even.

  5. GPT-5.1 bucks the trend β€” its vision variant outperforms text by +48 ELO points, suggesting that model architecture matters more than the general text-vs-vision debate.

  6. For production deployments: Text mode offers the best cost-performance ratio. Reserve vision mode for specific models (like GPT-5.1) that have demonstrated visual reasoning strength.


⚠️ Open Beta β€” preliminary observations based on 4,011 matches across 15 models tested in both text and vision modes. Interested in scientific collaboration? Contact us!
