Easy Benchmarks

MMLU-Pro

Knowledge and reasoning benchmark score.

MMLU-Pro extends MMLU with more challenging, reasoning-focused questions, removes trivial or noisy items, and expands multiple-choice options from four to ten. It is meant to be more discriminative for advanced language models.

Test type: Multiple-choice reasoning and knowledge benchmark across broad academic domains.
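One concrete effect of moving from four to ten options is a lower random-guess baseline. The sketch below works through that arithmetic. It assumes every question has exactly one correct answer and the full number of options, which the description above does not state. The helper name `chance_accuracy` is hypothetical.

```python
from fractions import Fraction


def chance_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing on single-answer questions."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return Fraction(1, num_options)


# Assumption: every question offers the full number of options.
mmlu_baseline = chance_accuracy(4)       # original MMLU: four options
mmlu_pro_baseline = chance_accuracy(10)  # MMLU-Pro: ten options

print(f"MMLU chance accuracy:     {float(mmlu_baseline):.0%}")      # 25%
print(f"MMLU-Pro chance accuracy: {float(mmlu_pro_baseline):.0%}")  # 10%
```

Under these assumptions a guessing model scores about 10% rather than 25%, so more of the score range is left to separate strong models.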

Coverage

345 models have this metric.

Top score: 89.8%

Current leader: Gemini 3 Pro Preview (high)

Project links

This app ranks the MMLU-Pro score exposed by the Artificial Analysis snapshot.


Top MMLU-Pro Models

Top models ranked by MMLU-Pro.

Leaderboard

| Rank | Model | Creator | MMLU-Pro | Speed | Blended Price |
|---|---|---|---|---|---|
| #1 | Gemini 3 Pro Preview (high) | Google | 89.8% | n/a | $4.50/M |
| #2 | Claude Opus 4.5 (Reasoning) | Anthropic | 89.5% | 53.5 tok/s | $10.94/M |
| #3 | Gemini 3 Pro Preview (low) | Google | 89.5% | n/a | $4.50/M |
| #4 | Gemini 3 Flash Preview (Reasoning) | Google | 89.0% | 172.8 tok/s | $1.13/M |
| #5 | Claude Opus 4.5 (Non-reasoning) | Anthropic | 88.9% | 47.6 tok/s | $10.94/M |
| #6 | Gemini 3 Flash Preview (Non-reasoning) | Google | 88.2% | 181.3 tok/s | $1.13/M |
| #7 | Claude 4.1 Opus (Reasoning) | Anthropic | 88.0% | 33.7 tok/s | $32.81/M |
| #8 | Claude 4.5 Sonnet (Reasoning) | Anthropic | 87.5% | 50.1 tok/s | $6.56/M |
| #9 | MiniMax-M2.1 | MiniMax | 87.5% | 184.6 tok/s | $0.525/M |
| #10 | GPT-5.2 (xhigh) | OpenAI | 87.4% | 71 tok/s | $4.81/M |
| #11 | Claude 4 Opus (Reasoning) | Anthropic | 87.3% | 36.4 tok/s | $32.81/M |
| #12 | GPT-5 (high) | OpenAI | 87.1% | 111.1 tok/s | $3.44/M |
| #13 | GPT-5.1 (high) | OpenAI | 87.0% | 121.2 tok/s | $3.44/M |
| #14 | GPT-5 (medium) | OpenAI | 86.7% | 85.6 tok/s | $3.44/M |
| #15 | Grok 4 | xAI | 86.6% | n/a | $11.00/M |
| #16 | GPT-5 Codex (high) | OpenAI | 86.5% | 171.1 tok/s | $3.44/M |
| #17 | DeepSeek V3.2 Speciale | DeepSeek | 86.3% | n/a | - |
| #18 | DeepSeek V3.2 (Reasoning) | DeepSeek | 86.2% | n/a | $0.337/M |
| #19 | Gemini 2.5 Pro | Google | 86.2% | 132 tok/s | $3.44/M |
| #20 | Claude 4 Opus (Non-reasoning) | Anthropic | 86.0% | 33.9 tok/s | $32.81/M |
| #21 | Claude 4.5 Sonnet (Non-reasoning) | Anthropic | 86.0% | 42.3 tok/s | $6.56/M |
| #22 | GPT-5 (low) | OpenAI | 86.0% | 79.3 tok/s | $3.44/M |
| #23 | GPT-5.1 Codex (high) | OpenAI | 86.0% | 182.1 tok/s | $3.44/M |
| #24 | GPT-5.2 (medium) | OpenAI | 85.9% | n/a | $4.81/M |
| #25 | Gemini 2.5 Pro Preview (Mar' 25) | Google | 85.8% | n/a | - |
| #26 | GLM-4.7 (Reasoning) | Z AI | 85.6% | 79.2 tok/s | $1.00/M |
| #27 | Doubao Seed Code | ByteDance Seed | 85.4% | n/a | - |
| #28 | Grok 4.1 Fast (Reasoning) | xAI | 85.4% | n/a | - |
| #29 | o3 | OpenAI | 85.3% | 122.3 tok/s | $3.50/M |
| #30 | DeepSeek V3.1 (Reasoning) | DeepSeek | 85.1% | n/a | $0.865/M |
| #31 | DeepSeek V3.1 Terminus (Reasoning) | DeepSeek | 85.1% | n/a | $1.91/M |
| #32 | DeepSeek V3.2 Exp (Reasoning) | DeepSeek | 85.0% | n/a | $0.310/M |
| #33 | Grok 4 Fast (Reasoning) | xAI | 85.0% | n/a | $0.275/M |
| #34 | Cogito v2.1 (Reasoning) | Deep Cogito | 84.9% | 62.8 tok/s | $1.25/M |
| #35 | DeepSeek R1 0528 (May '25) | DeepSeek | 84.9% | n/a | $2.06/M |
| #36 | Kimi K2 Thinking | Kimi | 84.8% | 131.1 tok/s | $1.08/M |
| #37 | DeepSeek R1 (Jan '25) | DeepSeek | 84.4% | n/a | $2.43/M |
| #38 | MiMo-V2-Flash (Reasoning) | Xiaomi | 84.3% | 129.5 tok/s | $0.150/M |
| #39 | Qwen3 235B A22B 2507 (Reasoning) | Alibaba | 84.3% | 59.4 tok/s | $0.838/M |
| #40 | Claude 4 Sonnet (Reasoning) | Anthropic | 84.2% | 45.5 tok/s | $6.56/M |
| #41 | Gemini 2.5 Flash Preview (Sep '25) (Reasoning) | Google | 84.2% | n/a | - |
| #42 | o1 | OpenAI | 84.1% | 123.2 tok/s | $26.25/M |
| #43 | Qwen3 Max | Alibaba | 84.1% | 48.2 tok/s | $3.05/M |
| #44 | K-EXAONE (Reasoning) | LG AI Research | 83.8% | n/a | - |
| #45 | Qwen3 Max (Preview) | Alibaba | 83.8% | 47.1 tok/s | $2.40/M |
| #46 | Claude 3.7 Sonnet (Reasoning) | Anthropic | 83.7% | n/a | - |
| #47 | Claude 4 Sonnet (Non-reasoning) | Anthropic | 83.7% | 45.2 tok/s | $6.56/M |
| #48 | DeepSeek V3.2 (Non-reasoning) | DeepSeek | 83.7% | n/a | $0.775/M |
| #49 | Gemini 2.5 Pro Preview (May' 25) | Google | 83.7% | n/a | $3.44/M |
| #50 | GPT-5 mini (high) | OpenAI | 83.7% | 87.4 tok/s | $0.688/M |
| #51 | DeepSeek V3.1 Terminus (Non-reasoning) | DeepSeek | 83.6% | n/a | $0.453/M |
| #52 | DeepSeek V3.2 Exp (Non-reasoning) | DeepSeek | 83.6% | n/a | $0.310/M |
| #53 | Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) | Google | 83.6% | n/a | - |
| #54 | Qwen3 VL 235B A22B (Reasoning) | Alibaba | 83.6% | 32.5 tok/s | $2.17/M |
| #55 | GLM-4.5 (Reasoning) | Z AI | 83.5% | 50.1 tok/s | $1.00/M |
| #56 | DeepSeek V3.1 (Non-reasoning) | DeepSeek | 83.3% | n/a | $0.834/M |
| #57 | Gemini 2.5 Flash (Reasoning) | Google | 83.2% | 221.3 tok/s | $0.850/M |
| #58 | o4-mini (high) | OpenAI | 83.2% | 151 tok/s | $1.93/M |
| #59 | ERNIE 5.0 Thinking Preview | Baidu | 83.0% | n/a | - |
| #60 | Nova 2.0 Pro Preview (medium) | Amazon | 83.0% | 127.7 tok/s | $3.44/M |
| #61 | GLM-4.6 (Reasoning) | Z AI | 82.9% | 43.9 tok/s | $0.963/M |
| #62 | Hermes 4 - Llama-3.1 405B (Reasoning) | Nous Research | 82.9% | 39.5 tok/s | $1.50/M |
| #63 | GPT-5 mini (medium) | OpenAI | 82.8% | 86.7 tok/s | $0.688/M |
| #64 | Grok 3 mini Reasoning (high) | xAI | 82.8% | 58.8 tok/s | $0.350/M |
| #65 | Qwen3 235B A22B (Reasoning) | Alibaba | 82.8% | 59 tok/s | $2.63/M |
| #66 | Qwen3 235B A22B 2507 Instruct | Alibaba | 82.8% | 42.5 tok/s | $0.356/M |
| #67 | Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) | NVIDIA | 82.5% | 52.7 tok/s | $0.900/M |
| #68 | Kimi K2 | Kimi | 82.4% | 24.3 tok/s | $1.04/M |
| #69 | Qwen3 Max Thinking (Preview) | Alibaba | 82.4% | 50.7 tok/s | $2.40/M |
| #70 | Qwen3 Next 80B A3B (Reasoning) | Alibaba | 82.4% | 135.7 tok/s | $1.88/M |
| #71 | Qwen3 VL 235B A22B Instruct | Alibaba | 82.3% | 48.1 tok/s | $0.700/M |
| #72 | INTELLECT-3 | Prime Intellect | 82.2% | n/a | - |
| #73 | Ling-1T | InclusionAI | 82.2% | n/a | - |
| #74 | Nova 2.0 Pro Preview (low) | Amazon | 82.2% | 147.9 tok/s | $3.44/M |
| #75 | GPT-5 (ChatGPT) | OpenAI | 82.0% | 167.3 tok/s | $3.44/M |
| #76 | GPT-5.1 Codex mini (high) | OpenAI | 82.0% | 213.6 tok/s | $0.688/M |
| #77 | MiniMax-M2 | MiniMax | 82.0% | 102.9 tok/s | $0.525/M |
| #78 | DeepSeek V3 0324 | DeepSeek | 81.9% | n/a | $1.21/M |
| #79 | Kimi K2 0905 | Kimi | 81.9% | 24.2 tok/s | $1.08/M |
| #80 | Qwen3 Next 80B A3B Instruct | Alibaba | 81.9% | 131.1 tok/s | $0.875/M |
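
The ordering in the leaderboard can be approximated with a simple sort. This is a minimal sketch, not the app's actual implementation. The `Entry` type and `rank` function are hypothetical. The tie-break (case-insensitive model name) is an assumption inferred from how rows with equal displayed scores are ordered above.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    score: float  # MMLU-Pro score in percent, as displayed


def rank(entries):
    """Return (position, model, score) tuples ordered like the leaderboard."""
    # Highest score first; equal scores fall back to the model name,
    # compared case-insensitively (assumed tie-break, not confirmed).
    ordered = sorted(entries, key=lambda e: (-e.score, e.model.casefold()))
    return [(i, e.model, e.score) for i, e in enumerate(ordered, start=1)]


# A few rows copied from the table, deliberately shuffled.
sample = [
    Entry("Qwen3 Max", 84.1),
    Entry("o1", 84.1),
    Entry("Gemini 3 Pro Preview (low)", 89.5),
    Entry("Claude Opus 4.5 (Reasoning)", 89.5),
    Entry("Gemini 3 Pro Preview (high)", 89.8),
]

for position, model, score in rank(sample):
    print(f"#{position} {model} {score:.1f}%")
```

The sample rows come out in the same relative order as in the leaderboard. The positions differ because only five rows are included. Displayed scores are rounded to one decimal place, so the real ordering may depend on unrounded values that this page does not expose.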