How the frontier ranks. How open source closes the gap.
Reader-voted scores beside the public benchmark numbers, on the tasks readers actually use these models for. Pick a task, see who wins, vote for the model you trust.
Illustrative · scores below are placeholders while the live benchmark aggregator is wired up. Reader votes are real and persist locally; the numbered scores are not measurements.
Top closed-source frontier models · scored on coding (illustrative)
Claude 4.6 · Anthropic · 95 (illustrative)
GPT-5 · OpenAI · 92 (illustrative)
Gemini 3 Pro · Google · 88 (illustrative)
Grok 4 · xAI · 82 (illustrative)
Reka Core 2 · Reka · 70 (illustrative)
How we score
Planned methodology (not yet live). The scores shown above are illustrative placeholders; the pipeline below describes how scoring will work once the aggregator ships.
Public benchmarks. MMLU, HumanEval, GPQA, MATH, GSM8K, MMMU, and AIME 2025 will be aggregated, normalized, and weighted toward fresher releases. The aggregate will be refreshed monthly.
Reader votes. One vote per reader per model. The vote count appears beside the benchmark score; the two are independent signals and are not blended.
What we don't do. We don't include vendor self-reported numbers. We don't include benchmarks a vendor created itself. We don't blend benchmarks and votes into a single ranking.
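As a rough sketch of how the planned pipeline could fit together, the Python below normalizes each benchmark to a 0-100 scale, weights it by freshness, and keeps reader votes in a separate tally. Everything named here is an assumption for illustration, not the shipped aggregator: `BenchmarkResult`, `freshness_weight`, `aggregate`, and `VoteTally` are hypothetical, "fresher releases" is read as the benchmark's release date, the one-year half-life is an arbitrary choice, and normalization is a simple division by the benchmark's maximum score.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class BenchmarkResult:
    benchmark: str    # benchmark name, e.g. "MMLU"
    raw_score: float  # score on the benchmark's own scale
    max_score: float  # top of that scale, used to normalize
    released: date    # assumed to be the benchmark's release date


def freshness_weight(released: date, today: date,
                     half_life_days: float = 365.0) -> float:
    """Assumed decay: the weight halves every `half_life_days`."""
    age_days = max((today - released).days, 0)
    return 0.5 ** (age_days / half_life_days)


def aggregate(results: list[BenchmarkResult], today: date) -> float:
    """Freshness-weighted mean of scores normalized to 0-100."""
    if not results:
        raise ValueError("no benchmark results to aggregate")
    weights = [freshness_weight(r.released, today) for r in results]
    normalized = [100.0 * r.raw_score / r.max_score for r in results]
    weighted_sum = sum(w * n for w, n in zip(weights, normalized))
    return weighted_sum / sum(weights)


class VoteTally:
    """One vote per reader per model, never mixed into `aggregate`."""

    def __init__(self) -> None:
        self._votes: set[tuple[str, str]] = set()

    def vote(self, reader_id: str, model: str) -> bool:
        """Record a vote; return False if this reader already voted."""
        key = (reader_id, model)
        if key in self._votes:
            return False
        self._votes.add(key)
        return True

    def count(self, model: str) -> int:
        """Number of distinct readers who voted for `model`."""
        return sum(1 for _, voted in self._votes if voted == model)
```

The benchmark score and the vote count come from two separate objects, so a page built on this sketch would show them side by side without ever combining them into one number.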