FAQ

Everything you need to know about PlayTheAI

Our Philosophy

Why does PlayTheAI exist?
Traditional AI benchmarks show near-human or superhuman scores. But these tests can be optimized for during training. We provide an alternative: dynamic games against real humans that can't be memorized or gamed.
What does this show about AI capabilities?
In our tests, models scoring 90%+ on reasoning benchmarks often show single-digit win rates against average humans in simple strategy games. This gap between benchmark performance and real-world generalization is what we explore.
Why use games instead of traditional benchmarks?
Games are dynamic and unpredictable. Each match is unique - impossible to train for specifically. Unlike MMLU or HumanEval, you can't memorize the test set. This tests generalization ability in a new way.
Should AI need 'extended thinking' for simple games?
If a model needs minutes of reasoning to play Tic-Tac-Toe, that's itself notable. Children master these games intuitively. True generalization shouldn't require special reasoning modes for simple tasks.

General

What is PlayTheAI?
PlayTheAI is a benchmark platform where you can compete against various AI models in classic games. Real-time Elo ratings are generated to show how well different AI models perform against humans.
Is PlayTheAI free?
Yes, PlayTheAI is completely free. No account, no hidden costs. Just play!
Do I need an account?
No, you can play anonymously right away. In the future (Phase 3), there will be optional accounts to save your personal match history.

Playing

How do I choose which AI to play against?
You can choose 'Random' for blind matching (you only find out which model it was after the game), or you can select a specific model from the list.
Why does the AI response sometimes take longer?
Larger models (e.g., 70B parameters) need more computation time. This is intentional - we measure intelligence, not speed. All models have unlimited thinking time.
What happens if I close the browser window?
Your match is automatically saved. When you return, you can continue playing (Match Recovery).
Can AIs cheat?
No. The AIs only see the game board, like a human player. They don't have access to hidden information or special algorithms - only their language understanding abilities.

Elo & Rating

What is Elo?
Elo is a rating system originally developed for chess. The higher the number, the stronger the player. For us, Elo shows how well an AI performs against humans.
How does the rating system work?
Every AI starts at 1000 Elo. Humans are the baseline at 1500. If you win, the AI loses points. When an AI reaches 1500+, it plays at human level or better!
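For the curious, a standard Elo update looks roughly like the sketch below (illustrative Python; the K-factor of 32 is an assumption, not necessarily our production value, while the 1500 human baseline is the one described above):

  def expected_score(rating_a, rating_b):
      # Probability that player A beats player B under the Elo model
      return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

  def update_ai_elo(ai_elo, result, human_elo=1500, k=32):
      # result: 1.0 if the AI won, 0.5 for a draw, 0.0 if the human won
      return ai_elo + k * (result - expected_score(ai_elo, human_elo))

  # A fresh model at 1000 Elo that loses to the 1500 human baseline:
  print(update_ai_elo(1000, 0.0))  # drops to roughly 998.3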
Do I also have an Elo rating?
No, humans play anonymously. We treat every human player as an 'average player' (1500). This way we focus on AI performance, not individual players.
What does the uncertainty (±) mean?
The uncertainty shows how accurate the Elo is. New models with few matches have high uncertainty. After many games, the rating becomes more precise.
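As a rough intuition (an illustration, not our exact formula), the uncertainty of a rating estimate shrinks roughly with the square root of the number of matches played:

  import math

  def rating_uncertainty(matches, starting_spread=350):
      # Illustrative only: starting_spread is a made-up initial ± value
      return starting_spread / math.sqrt(max(matches, 1))

  print(round(rating_uncertainty(4)))    # ±175 after a handful of games
  print(round(rating_uncertainty(400)))  # ±18 after hundreds of games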
Why are reasoning models excluded from rankings?

Reasoning models (o1, R1, etc.) are excluded for two main reasons:

1. Fairness: It is currently impossible to set a fair limit. Limiting by time only tests server speed (GPUs), not intelligence. Limiting by tokens is not supported by most providers (only Anthropic).
2. User Experience: Simple games like Tic-Tac-Toe should be fast. Waiting 5 minutes for an AI to 'overthink' a move destroys the game flow.

We want a fair and fun experience.

Want your reasoning model included?
AI providers can supply a free API key to enable all variants (non-reasoner + reasoner). We pay for the non-reasoner via OpenRouter; you pay for the reasoner via your API key. Win-win!

Learn more about the API Key Deal →
How is the Overall Elo calculated?

When viewing "All Games", we calculate a weighted Overall Elo using logarithmic weighting. This prevents "farming" - playing many easy games to inflate rankings.

Formula:

Overall = Σ(log(matches+1) × elo) / Σ(log(matches+1))

Why logarithmic? With linear weighting, 1000 games would count 100× more than 10 games. With log weighting, 1000 games count only ~3× more. This ensures:

  • Models can't gain unfair advantage by only playing easy games
  • Difficult games (fewer matches) still count meaningfully
  • True skill across all games is reflected fairly

Example: A model with 1200 Elo in TicTacToe (50 games) and 1100 Elo in Connect4 (10 games) gets Overall ≈ 1162 Elo (not the ≈ 1183 a linear match-count weighting would give).
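In code, the weighting looks roughly like this (a minimal Python sketch of the formula above):

  import math

  def overall_elo(results):
      # results: list of (elo, matches) pairs, one entry per game type
      weights = [math.log(matches + 1) for _, matches in results]
      total = sum(w * elo for w, (elo, _) in zip(weights, results))
      return total / sum(weights)

  # The example above: 1200 Elo over 50 TicTacToe games, 1100 Elo over 10 Connect4 games
  print(round(overall_elo([(1200, 50), (1100, 10)])))  # ≈ 1162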

AI Models

Which AI models are available?
Currently 32 model variants from 7 providers are Elo-ranked: OpenAI (GPT-4o, GPT-5.1, GPT-5.2, o4), Anthropic (Claude 3.5 Haiku, 4.5 Haiku/Sonnet), Google (Gemini 3 Flash), Meta (Llama 4 Scout/Maverick), Mistral (Large 3), Alibaba (Qwen3), and xAI (Grok 4).
Which providers are supported?
We support models from various providers including OpenAI, Anthropic, Google, Meta, and more.
Why do the AIs differ in strength?
Each model was trained differently and has different strengths. Some are better at language, others at logic. Our games test different cognitive abilities.
What are the 9 skill dimensions?
We measure 9 cognitive dimensions: Language, Logic, Combinatorics, Spatial Reasoning, Strategy, Knowledge, Probability, Deception, and Deduction. The radar chart shows where each model is strong or weak.
What's the difference between 'Instant Response' and 'Reasoning' models?
Instant Response (Zero-Shot) models answer immediately without internal deliberation - like human intuition. They're fast, cheap, and what most users interact with daily. Reasoning models use 'extended thinking' with thousands of tokens before responding - like solving a math problem on paper. We primarily rank Instant Response models for fair comparison, since reasoning models trade speed and cost for accuracy. Both modes are available to play.

For AI Providers

How can I get my model listed on PlayTheAI?
If your model is available on OpenRouter, it's likely already listed! For direct API integration or data partnerships, contact us via the feedback form or visit our Research page.
What data do partners receive?
Data partners receive detailed monthly performance reports: Elo trends, win rates vs competitors, token efficiency, response times, and failure analysis. Enterprise partners get custom dashboards and API access.
What metrics do you track for AI models?
We track Elo rating, win/loss/draw rates, tokens per move, response time, foul rate (invalid moves), tool call success rate, and more. Data partners receive detailed monthly reports with all metrics.
Why is PlayTheAI different from other benchmarks?
Unlike static benchmarks (MMLU, HumanEval), our games are dynamic and unpredictable. Each match is unique - impossible to "train for". We test against real humans with diverse strategies, providing authentic performance data that reflects real-world capabilities.
Can I test unreleased models privately?
Yes! Enterprise partners can access private benchmark instances for pre-release testing. Contact us to discuss your requirements.

Technical

How does it work technically?
You play against a real AI, not pre-programmed moves. The AI receives the current game state as text, 'thinks', and returns its move. This happens in real-time via our API.
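At a high level, one AI turn looks like the sketch below (a hypothetical illustration; the real prompt wording and move parsing are more involved, and 'ask_model' merely stands in for the actual API call):

  def ai_turn(board_text, ask_model):
      # ask_model: any function that sends a prompt to a model and returns its text reply
      prompt = ("You are playing Tic-Tac-Toe. Current board:\n"
                + board_text
                + "\nReply with your move as 'row,col'.")
      reply = ask_model(prompt)
      row, col = (int(x) for x in reply.strip().split(","))
      return row, col

  # Example with a stubbed-out model that always answers "1,1":
  print(ai_turn("X | O | _\n_ | _ | _\n_ | _ | _", lambda p: "1,1"))  # (1, 1)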
Is my data stored?
We only store anonymous match data (game history, result). No personal data, no tracking cookies. See our privacy policy for details.
Can I use the API?
The API is currently not public. If you're interested in an integration, contact us via the feedback form.
How do AIs receive game information?
For fair comparison, all AI models receive exactly the same prompt with identical game rules. 📝 Text models receive everything as text: system prompt with game rules and board state as ASCII representation (e.g., 'X | O | _'). 👁️ Vision models receive: system prompt and game rules as text (identical to text models), plus the board state as a PNG image. The only difference between modes is how the board is presented - not the rules or instructions.
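As an illustration, an ASCII board like the one mentioned above could be rendered like this (a hypothetical sketch, not our exact prompt format):

  def board_to_ascii(board):
      # board: 3x3 list of 'X', 'O', or '_' cells, rendered row by row
      return "\n".join(" | ".join(row) for row in board)

  position = [["X", "O", "_"],
              ["_", "X", "_"],
              ["_", "_", "O"]]
  print(board_to_ascii(position))
  # X | O | _
  # _ | X | _
  # _ | _ | O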
Can I verify the moves / audit a game?
Yes! Every game has a unique Game ID displayed at the end. We store detailed logs for every match to ensure scientific integrity. If you want to audit a specific game, send us the Game ID via the feedback form, and we will provide you with the full replay log including every API call and response.
What is 'Foul Rate' and why does it matter?
Unlike chat benchmarks, games have strict rules. When an AI attempts an invalid move (e.g., placing a piece on an occupied square), we record it as a 'foul'. After 3 fouls, the AI is disqualified and loses automatically. Foul rates reveal instruction-following ability - how well a model understands game mechanics and respects constraints. Some models achieve 0% foul rates, others fail frequently. This is a unique metric no other benchmark tracks, and a key differentiator for real-world reliability.
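Conceptually, foul handling works like the sketch below (illustrative Python; the three-foul limit comes from the answer above, the rest of the names are assumptions):

  MAX_FOULS = 3  # after three invalid moves the AI forfeits automatically

  def apply_ai_move(board, cell, piece, fouls):
      # board: dict mapping cell index (0-8) to 'X' or 'O'; returns (fouls, disqualified)
      if cell in board:                      # occupied square -> foul, move rejected
          fouls += 1
          return fouls, fouls >= MAX_FOULS
      board[cell] = piece
      return fouls, False

  board, fouls = {4: "X"}, 0
  fouls, dq = apply_ai_move(board, 4, "O", fouls)   # tries the occupied centre -> foul #1
  print(fouls, dq)                                  # 1 False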

Still have questions?

We're happy to help!

Give Feedback