Thinking Modes & Fair Play

How we ensure fair AI rankings with thinking token limits

What are Thinking Tokens?

Some AI models can "think" before answering. They generate internal reasoning tokens that help them solve complex problems. These thinking tokens are like the AI's inner monologue - you don't see them, but they improve the AI's decision-making.

🧠

<strong>Example:</strong> When playing Connect Four, a thinking model might internally reason: "If I place here, they could block me... but if I place there, I create two threats at once..." This reasoning uses thinking tokens.

The Problem: Unlimited Thinking = Unfair Advantage

Without limits, some models could think for thousands of tokens per move, essentially "brute-forcing" their way to better moves. This creates an unfair comparison:

Unfair Comparison

• Model A: 50,000 thinking tokens/move
• Model B: 500 thinking tokens/move
• Model A wins, but used 100x more resources

Fair Comparison

• Model A: 2,048 thinking tokens/move (limited)
• Model B: 500 thinking tokens/move
• True intelligence comparison!

Our Solution: The 2048 Token Limit

For ranked games, thinking models are limited to 2,048 thinking tokens per move. This creates a level playing field while still allowing models to use their reasoning capabilities.

⚖️

Why 2,048 tokens?

2,048 tokens is enough for meaningful reasoning (roughly 1,500 words of internal thought), but prevents "compute arms race" where the model with more resources always wins.

Model Categories

We automatically detect and categorize models based on their behavior:

Standard ELO Eligible

Models that don't use thinking tokens. They respond directly without internal reasoning. Examples: Llama 3.1, Mistral, Gemma.

2048 Budget ELO Eligible

Thinking models that respect the 2,048 token limit. These models can reason, but within fair bounds. Examples: Claude with limited thinking, Gemini Flash Thinking.

Unlimited Not ELO Eligible

Models that cannot be limited or consistently exceed the budget. These models can still be played, but their results don't affect ELO rankings. Examples: DeepSeek R1, o1, QwQ.

Reasoner ELO Eligible

Models that always reason - they can't be turned off. These are dedicated reasoning models like GPT-5.x, Grok 3, and OpenAI o-series. They compete in their own ELO category.

Thinks-in-Response ELO Eligible if ≤2048

Models that reason in their visible response instead of separate thinking tokens. Detected when a model produces >200 response tokens while using tool calling. These models get a fair chance: if they stay under 2048 tokens, they're ELO eligible!

How "Thinks-in-Response" Works

Some models reason in their visible response instead of separate thinking tokens. We detect this and apply the same 2048 limit to their response tokens - giving them a fair chance to compete!

⚖️

Fair Play for All

If a model produces 0 thinking tokens but >200 response tokens, we track those response tokens. As long as they stay under 2048, the model is ELO eligible - same limit as thinking models!

Auto-Learning System

We don't manually configure each model. Instead, our system automatically learns which modes work for each model during gameplay:

1 Model starts with all modes available (Standard, 2048 Budget, Unlimited)
2 During gameplay, we measure actual thinking token usage
3 If a model exceeds 2,500 tokens (2,048 + buffer), Budget mode is disabled
4 If a model can't be limited at all, it's marked as "Unlimited only"
5 Settings are locked after detection to prevent flip-flopping

Frequently Asked Questions

Can I play unlimited thinking models?

Yes! You can play any model in unranked mode. Unlimited thinking models can still be played, they just don't affect the ELO leaderboard to maintain fair rankings.

Why not just compare models with unlimited thinking?

Because then we'd be comparing compute budgets, not intelligence. A model that thinks for 100,000 tokens will usually beat one that thinks for 1,000 - but that's not a fair intelligence test.

How do I know if a model uses thinking tokens?

Look for the mode badge on the model page. Models with "2048 Budget" or "Unlimited" use thinking tokens. Standard models don't think internally.

← Back to Leaderboard