Easy Benchmarks: LLM model index

MMLU-Pro

Knowledge and reasoning benchmark score.

MMLU-Pro extends MMLU with more challenging, reasoning-focused questions, removes trivial or noisy items, and expands multiple-choice options from four to ten. It is meant to be more discriminative for advanced language models.

Test type: Multiple-choice reasoning and knowledge benchmark across broad academic domains.
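Scoring such a benchmark reduces to exact-match accuracy over letter answers. A minimal sketch (not the official MMLU-Pro harness) assuming answers are single letters drawn from the ten options A–J:

```python
from string import ascii_uppercase

# MMLU-Pro expands the answer set to ten options, so valid letters are A-J.
OPTIONS = ascii_uppercase[:10]

def accuracy(predictions, gold):
    """Exact-match accuracy over letter answers; invalid letters count as wrong."""
    assert len(predictions) == len(gold)
    correct = sum(
        1 for p, g in zip(predictions, gold)
        if p in OPTIONS and p == g
    )
    return correct / len(gold)

print(accuracy(["A", "J", "C", "B"], ["A", "J", "D", "B"]))  # 0.75
```

Real harnesses also have to extract the letter from free-form model output before this step; that parsing is omitted here.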

Coverage

345 models have this metric.

Current leader: Gemini 3 Pro Preview (high) at 89.8%.

Project links

This app ranks the MMLU-Pro score exposed by the Artificial Analysis snapshot.

Paper · GitHub
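The ranking itself is simple: filter the snapshot to models that report the metric, then sort by score descending. A sketch assuming a hypothetical snapshot shape (the real Artificial Analysis field names may differ):

```python
# Hypothetical entries; "mmlu_pro" is an assumed field name, None = metric absent.
snapshot = [
    {"model": "Gemini 3 Pro Preview (high)", "creator": "Google", "mmlu_pro": 0.898},
    {"model": "Claude Opus 4.5 (Reasoning)", "creator": "Anthropic", "mmlu_pro": 0.895},
    {"model": "Example model without the metric", "creator": "Acme", "mmlu_pro": None},
]

# Keep only models that report the metric, then rank by score descending.
ranked = sorted(
    (m for m in snapshot if m["mmlu_pro"] is not None),
    key=lambda m: m["mmlu_pro"],
    reverse=True,
)
for rank, m in enumerate(ranked, start=1):
    print(f"#{rank} {m['model']} {m['mmlu_pro']:.1%}")
```

Ties (e.g. two models at 89.5%) need an explicit tiebreak rule; `sorted` is stable, so without one the snapshot's original order decides.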

Top MMLU-Pro Models

Top models ranked by MMLU-Pro.

Leaderboard

| Rank | Model | Creator | MMLU-Pro | Speed | Blended Price |
|------|-------|---------|----------|-------|---------------|
| #1 | Gemini 3 Pro Preview (high) | Google | 89.8% | 128.7 tok/s | $4.50/M |
| #2 | Claude Opus 4.5 (Reasoning) | Anthropic | 89.5% | 57 tok/s | $10.00/M |
| #3 | Gemini 3 Pro Preview (low) | Google | 89.5% | n/a | $4.50/M |
| #4 | Gemini 3 Flash Preview (Reasoning) | Google | 89.0% | 193.2 tok/s | $1.13/M |
| #5 | Claude Opus 4.5 (Non-reasoning) | Anthropic | 88.9% | 50.3 tok/s | $10.00/M |
| #6 | Gemini 3 Flash Preview (Non-reasoning) | Google | 88.2% | 178.3 tok/s | $1.13/M |
| #7 | Claude 4.1 Opus (Reasoning) | Anthropic | 88.0% | 35.8 tok/s | $30.00/M |
| #8 | Claude 4.5 Sonnet (Reasoning) | Anthropic | 87.5% | 43.8 tok/s | $6.00/M |
| #9 | MiniMax-M2.1 | MiniMax | 87.5% | 84.8 tok/s | $0.525/M |
| #10 | GPT-5.2 (xhigh) | OpenAI | 87.4% | 71.8 tok/s | $4.81/M |
| #11 | Claude 4 Opus (Reasoning) | Anthropic | 87.3% | 36.8 tok/s | $30.00/M |
| #12 | GPT-5 (high) | OpenAI | 87.1% | 84.2 tok/s | $3.44/M |
| #13 | GPT-5.1 (high) | OpenAI | 87.0% | 123.3 tok/s | $3.44/M |
| #14 | GPT-5 (medium) | OpenAI | 86.7% | 82.3 tok/s | $3.44/M |
| #15 | Grok 4 | xAI | 86.6% | 50.3 tok/s | $6.00/M |
| #16 | GPT-5 Codex (high) | OpenAI | 86.5% | 166.8 tok/s | $3.44/M |
| #17 | DeepSeek V3.2 Speciale | DeepSeek | 86.3% | n/a | - |
| #18 | DeepSeek V3.2 (Reasoning) | DeepSeek | 86.2% | n/a | $0.315/M |
| #19 | Gemini 2.5 Pro | Google | 86.2% | 120.2 tok/s | $3.44/M |
| #20 | Claude 4 Opus (Non-reasoning) | Anthropic | 86.0% | 36.6 tok/s | $30.00/M |
| #21 | Claude 4.5 Sonnet (Non-reasoning) | Anthropic | 86.0% | 44.2 tok/s | $6.00/M |
| #22 | GPT-5 (low) | OpenAI | 86.0% | 65.8 tok/s | $3.44/M |
| #23 | GPT-5.1 Codex (high) | OpenAI | 86.0% | 162.7 tok/s | $3.44/M |
| #24 | GPT-5.2 (medium) | OpenAI | 85.9% | n/a | $4.81/M |
| #25 | Gemini 2.5 Pro Preview (Mar '25) | Google | 85.8% | n/a | - |
| #26 | GLM-4.7 (Reasoning) | Z AI | 85.6% | 90.3 tok/s | $1.00/M |
| #27 | Doubao Seed Code | ByteDance Seed | 85.4% | n/a | - |
| #28 | Grok 4.1 Fast (Reasoning) | xAI | 85.4% | 140.9 tok/s | $0.275/M |
| #29 | o3 | OpenAI | 85.3% | 72.7 tok/s | $3.50/M |
| #30 | DeepSeek V3.1 (Reasoning) | DeepSeek | 85.1% | n/a | $0.865/M |
| #31 | DeepSeek V3.1 Terminus (Reasoning) | DeepSeek | 85.1% | n/a | $1.91/M |
| #32 | DeepSeek V3.2 Exp (Reasoning) | DeepSeek | 85.0% | n/a | $0.315/M |
| #33 | Grok 4 Fast (Reasoning) | xAI | 85.0% | 76.2 tok/s | $0.275/M |
| #34 | Cogito v2.1 (Reasoning) | Deep Cogito | 84.9% | 51.1 tok/s | $1.25/M |
| #35 | DeepSeek R1 0528 (May '25) | DeepSeek | 84.9% | n/a | $2.36/M |
| #36 | Kimi K2 Thinking | Kimi | 84.8% | 99 tok/s | $1.08/M |
| #37 | DeepSeek R1 (Jan '25) | DeepSeek | 84.4% | n/a | $2.36/M |
| #38 | MiMo-V2-Flash (Reasoning) | Xiaomi | 84.3% | 118.8 tok/s | $0.150/M |
| #39 | Qwen3 235B A22B 2507 (Reasoning) | Alibaba | 84.3% | 56 tok/s | $2.63/M |
| #40 | Claude 4 Sonnet (Reasoning) | Anthropic | 84.2% | 50.3 tok/s | $6.00/M |
| #41 | Gemini 2.5 Flash Preview (Sep '25) (Reasoning) | Google | 84.2% | n/a | - |
| #42 | o1 | OpenAI | 84.1% | 103.3 tok/s | $26.25/M |
| #43 | Qwen3 Max | Alibaba | 84.1% | 32.2 tok/s | $2.40/M |
| #44 | K-EXAONE (Reasoning) | LG AI Research | 83.8% | n/a | - |
| #45 | Qwen3 Max (Preview) | Alibaba | 83.8% | 45.1 tok/s | $2.40/M |
| #46 | Claude 3.7 Sonnet (Reasoning) | Anthropic | 83.7% | n/a | $6.00/M |
| #47 | Claude 4 Sonnet (Non-reasoning) | Anthropic | 83.7% | 47.6 tok/s | $6.00/M |
| #48 | DeepSeek V3.2 (Non-reasoning) | DeepSeek | 83.7% | n/a | $0.315/M |
| #49 | Gemini 2.5 Pro Preview (May '25) | Google | 83.7% | n/a | $3.44/M |
| #50 | GPT-5 mini (high) | OpenAI | 83.7% | 85.7 tok/s | $0.688/M |
| #51 | DeepSeek V3.1 Terminus (Non-reasoning) | DeepSeek | 83.6% | n/a | $0.453/M |
| #52 | DeepSeek V3.2 Exp (Non-reasoning) | DeepSeek | 83.6% | n/a | $0.315/M |
| #53 | Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) | Google | 83.6% | n/a | - |
| #54 | Qwen3 VL 235B A22B (Reasoning) | Alibaba | 83.6% | 46.2 tok/s | $2.63/M |
| #55 | GLM-4.5 (Reasoning) | Z AI | 83.5% | 46.4 tok/s | $1.00/M |
| #56 | DeepSeek V3.1 (Non-reasoning) | DeepSeek | 83.3% | n/a | $0.834/M |
| #57 | Gemini 2.5 Flash (Reasoning) | Google | 83.2% | 199.6 tok/s | $0.850/M |
| #58 | o4-mini (high) | OpenAI | 83.2% | 124.5 tok/s | $1.93/M |
| #59 | ERNIE 5.0 Thinking Preview | Baidu | 83.0% | n/a | - |
| #60 | Nova 2.0 Pro Preview (medium) | Amazon | 83.0% | 112.7 tok/s | $3.44/M |
| #61 | GLM-4.6 (Reasoning) | Z AI | 82.9% | 26.3 tok/s | $0.963/M |
| #62 | Hermes 4 - Llama-3.1 405B (Reasoning) | Nous Research | 82.9% | 34.9 tok/s | $1.50/M |
| #63 | GPT-5 mini (medium) | OpenAI | 82.8% | 77.2 tok/s | $0.688/M |
| #64 | Grok 3 mini Reasoning (high) | xAI | 82.8% | 215.5 tok/s | $0.350/M |
| #65 | Qwen3 235B A22B (Reasoning) | Alibaba | 82.8% | 61.4 tok/s | $2.63/M |
| #66 | Qwen3 235B A22B 2507 Instruct | Alibaba | 82.8% | 64.7 tok/s | $1.23/M |
| #67 | Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) | NVIDIA | 82.5% | 41 tok/s | $0.900/M |
| #68 | Kimi K2 | Kimi | 82.4% | 33 tok/s | $1.04/M |
| #69 | Qwen3 Max Thinking (Preview) | Alibaba | 82.4% | 40.8 tok/s | $2.40/M |
| #70 | Qwen3 Next 80B A3B (Reasoning) | Alibaba | 82.4% | 172.2 tok/s | $1.88/M |
| #71 | Qwen3 VL 235B A22B Instruct | Alibaba | 82.3% | 49 tok/s | $1.23/M |
| #72 | INTELLECT-3 | Prime Intellect | 82.2% | n/a | - |
| #73 | Ling-1T | InclusionAI | 82.2% | n/a | - |
| #74 | Nova 2.0 Pro Preview (low) | Amazon | 82.2% | 122.6 tok/s | $3.44/M |
| #75 | GPT-5 (ChatGPT) | OpenAI | 82.0% | 149.8 tok/s | $3.44/M |
| #76 | GPT-5.1 Codex mini (high) | OpenAI | 82.0% | 207.2 tok/s | $0.688/M |
| #77 | MiniMax-M2 | MiniMax | 82.0% | 83.5 tok/s | $0.525/M |
| #78 | DeepSeek V3 0324 | DeepSeek | 81.9% | n/a | $1.25/M |
| #79 | Kimi K2 0905 | Kimi | 81.9% | 12.5 tok/s | $1.08/M |
| #80 | Qwen3 Next 80B A3B Instruct | Alibaba | 81.9% | 155.3 tok/s | $0.875/M |
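Working with rows like these requires handling the missing-value cells (`n/a` speed, `-` price). A sketch, using a hypothetical score-per-dollar ratio as an example derived metric (not the app's own "Value" index):

```python
def parse_price(cell):
    """Blended price cell like '$4.50/M' -> USD per million tokens; '-' -> None."""
    if cell == "-":
        return None
    return float(cell.lstrip("$").rstrip("/M"))

# A few rows from the leaderboard above: (model, MMLU-Pro, blended price).
rows = [
    ("Gemini 3 Pro Preview (high)", "89.8%", "$4.50/M"),
    ("DeepSeek V3.2 Speciale", "86.3%", "-"),
    ("MiniMax-M2.1", "87.5%", "$0.525/M"),
]

# Score points per dollar per million tokens; rows without a price are skipped.
for name, score, price in rows:
    p = parse_price(price)
    if p is None:
        continue
    print(f"{name}: {float(score.rstrip('%')) / p:.1f} pts/$")
```

Note that `str.rstrip("/M")` strips a *set* of trailing characters, not a literal suffix; it works here because prices never end in `/` or `M` after the unit is removed.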