► Benchmarks

How the field stacks up. Cited, not cherry-picked.

Real numbers from HuggingFace Open LLM v2 and LMSYS Chatbot Arena, rolled up into the categories you actually pick a model for. Weekly editorial blurb lands soon.

No data yet

Awaiting first ingest

The HuggingFace and LMSYS fetchers run on cron. Numbers land when the first run completes. See methodology for what we score and how.

How we score

Per-benchmark scores are converted to percentile rank within the field, weighted by recency (six-month half-life), then combined with a geometric mean inside each category. A geometric mean penalizes one-trick-pony models: one weak benchmark drags the category score down far more than it would in a plain average. We need at least two benchmarks in a category before we publish a rollup.
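A rough sketch of that pipeline in Python, for illustration only. The function names, the 182.5-day half-life constant, the "at or below" percentile definition, and the choice to apply recency weights as exponents in a weighted geometric mean are all assumptions made for this example; the full methodology is the authority on the real details.

```python
import math
from datetime import date

# Assumption: "six-month half-life" approximated as half a year in days.
HALF_LIFE_DAYS = 182.5


def percentile_rank(score, field_scores):
    """Percent of the field scoring at or below `score`, on a 0-100 scale."""
    if not field_scores:
        raise ValueError("field_scores must not be empty")
    at_or_below = sum(1 for s in field_scores if s <= score)
    return 100.0 * at_or_below / len(field_scores)


def recency_weight(result_date, today):
    """Exponential decay: the weight halves every HALF_LIFE_DAYS."""
    age_days = (today - result_date).days
    return 0.5 ** (age_days / HALF_LIFE_DAYS)


def category_rollup(entries, min_benchmarks=2):
    """Weighted geometric mean of (percentile, weight) pairs.

    Returns None when the category has fewer than `min_benchmarks`
    entries, mirroring the "at least two benchmarks" rule.
    """
    if len(entries) < min_benchmarks:
        return None
    total_weight = sum(w for _, w in entries)
    # Floor percentiles at a tiny positive value so log() is always defined.
    log_sum = sum(w * math.log(max(p, 1e-9)) for p, w in entries)
    return math.exp(log_sum / total_weight)


if __name__ == "__main__":
    # A hypothetical model that is strong on one benchmark and weak on another.
    lopsided = [(90.0, 1.0), (10.0, 1.0)]
    print(category_rollup(lopsided))  # about 30, versus an arithmetic mean of 50
```

With equal weights, percentiles of 90 and 10 roll up to roughly 30 rather than the 50 a plain average would give, which is the one-trick-pony penalty in action.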

Read the full methodology