► Benchmarks · May 2026

How the frontier ranks. How open source closes the gap.

Reader-voted scores beside the public benchmark numbers, on the tasks readers actually use these models for. Pick a task, see who wins, vote for the model you trust.

Illustrative · scores below are placeholders while the live benchmark aggregator is wired up. Reader votes are real and persist locally; the numeric scores are not measurements.

Top closed-source frontier models · scored on coding (illustrative)

Claude 4.6 · Anthropic · 95 (illustrative)
GPT-5 · OpenAI · 92 (illustrative)
Gemini 3 Pro · Google · 88 (illustrative)
Grok 4 · xAI · 82 (illustrative)
Reka Core 2 · Reka · 70 (illustrative)

How we score

Planned methodology (not yet live). The scores shown above are illustrative placeholders; the pipeline below describes how scoring will work once the aggregator ships.

Public benchmarks: MMLU, HumanEval, GPQA, MATH, GSM8K, MMMU, and AIME 2025 will be aggregated, normalized, and weighted toward fresher releases, refreshed monthly.
Reader votes: One vote per reader per model. The vote count appears beside the benchmark score: these are independent signals, not blended.
What we don't do: We don't include vendor self-reported numbers. We don't include benchmarks the vendor created themselves. We don't blend benchmarks and votes into a single ranking.
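
The planned pipeline above could look roughly like the following Python sketch. It is not the aggregator itself, which has not shipped; every name here (`Result`, `normalize`, `freshness_weight`, `aggregate`, `count_votes`) is hypothetical. It also makes three assumptions the methodology does not state: that "normalized" means min-max scaling per benchmark, that "weighted toward fresher releases" means an exponential decay on the benchmark's release date, and that the half-life is one year.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    model: str
    benchmark: str
    score: float    # raw score on the benchmark's own scale
    released: date  # benchmark release date (assumed field)


def normalize(results):
    """Min-max scale each benchmark's scores to 0..100 across models."""
    by_bench = {}
    for r in results:
        by_bench.setdefault(r.benchmark, []).append(r.score)
    out = []
    for r in results:
        lo, hi = min(by_bench[r.benchmark]), max(by_bench[r.benchmark])
        # If every model ties there is no spread to scale; use the midpoint.
        norm = 50.0 if hi == lo else 100.0 * (r.score - lo) / (hi - lo)
        out.append((r, norm))
    return out


def freshness_weight(released, today, half_life_days=365.0):
    """Exponential decay: a benchmark one half-life old counts half as much."""
    age_days = max((today - released).days, 0)
    return 0.5 ** (age_days / half_life_days)


def aggregate(results, today, half_life_days=365.0):
    """Freshness-weighted mean of normalized scores, per model."""
    totals = {}
    for r, norm in normalize(results):
        w = freshness_weight(r.released, today, half_life_days)
        num, den = totals.get(r.model, (0.0, 0.0))
        totals[r.model] = (num + w * norm, den + w)
    return {model: num / den for model, (num, den) in totals.items()}


def count_votes(votes):
    """Count one vote per (reader, model) pair; repeat votes are ignored.

    Votes are tallied separately and never mixed into aggregate().
    """
    counts = {}
    for _reader, model in set(votes):
        counts[model] = counts.get(model, 0) + 1
    return counts
```

`aggregate` and `count_votes` return separate dictionaries on purpose, so the two signals can sit side by side without being blended into one ranking. Excluding vendor self-reported numbers and vendor-created benchmarks would happen before `aggregate` is called, by filtering the input list.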
