Knowledge and reasoning benchmark score.
MMLU-Pro extends MMLU with more challenging, reasoning-focused questions, removes trivial or noisy items, and expands multiple-choice options from four to ten. It is meant to be more discriminative for advanced language models.
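One reason the wider option set helps is that it lowers the score a model can reach by guessing. The sketch below illustrates this under the simplifying assumption that every item has exactly four options (MMLU) or exactly ten (MMLU-Pro). The function names and the chance-adjusted rescaling are illustrative only and are not part of the benchmark's reported metric.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on one multiple-choice item."""
    if num_options < 2:
        raise ValueError("a multiple-choice item needs at least two options")
    return 1.0 / num_options


def chance_adjusted(accuracy: float, num_options: int) -> float:
    """Rescale accuracy so random guessing maps to 0 and a perfect score to 1.

    Illustrative assumption: this normalization is not how the scores in
    the leaderboard are reported.
    """
    baseline = chance_accuracy(num_options)
    return (accuracy - baseline) / (1.0 - baseline)


mmlu_baseline = chance_accuracy(4)       # four options per item (assumed)
mmlu_pro_baseline = chance_accuracy(10)  # ten options per item (assumed)
print(f"MMLU guess baseline:     {mmlu_baseline:.0%}")
print(f"MMLU-Pro guess baseline: {mmlu_pro_baseline:.0%}")
```

With a lower guessing floor, more of the score range reflects actual knowledge and reasoning, which is consistent with the benchmark's aim of being more discriminative.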
Test type: Multiple-choice reasoning and knowledge benchmark across broad academic domains.
Top models ranked by MMLU-Pro.
| Rank | Model | Creator | MMLU-Pro Score | Speed | Blended Price |
|---|---|---|---|---|---|
| #1 | Gemini 3 Pro Preview (high) | Google | 89.8% | n/a | $4.50/M |
| #2 | | Anthropic | 89.5% | 53.5 tok/s | $10.94/M |
| #3 | Gemini 3 Pro Preview (low) | Google | 89.5% | n/a | $4.50/M |
| #4 | Gemini 3 Flash Preview (Reasoning) | Google | 89.0% | 172.8 tok/s | $1.13/M |
| #5 | Claude Opus 4.5 (Non-reasoning) | Anthropic | 88.9% | 47.6 tok/s | $10.94/M |
| #6 | Gemini 3 Flash Preview (Non-reasoning) | Google | 88.2% | 181.3 tok/s | $1.13/M |
| #7 | Claude 4.1 Opus (Reasoning) | Anthropic | 88.0% | 33.7 tok/s | $32.81/M |
| #8 | Claude 4.5 Sonnet (Reasoning) | Anthropic | 87.5% | 50.1 tok/s | $6.56/M |
| #9 | MiniMax-M2.1 | MiniMax | 87.5% | 184.6 tok/s | $0.525/M |
| #10 | GPT-5.2 (xhigh) | OpenAI | 87.4% | 71 tok/s | $4.81/M |
| #11 | Claude 4 Opus (Reasoning) | Anthropic | 87.3% | 36.4 tok/s | $32.81/M |
| #12 | GPT-5 (high) | OpenAI | 87.1% | 111.1 tok/s | $3.44/M |
| #13 | GPT-5.1 (high) | OpenAI | 87.0% | 121.2 tok/s | $3.44/M |
| #14 | GPT-5 (medium) | OpenAI | 86.7% | 85.6 tok/s | $3.44/M |
| #15 | Grok 4 | xAI | 86.6% | n/a | $11.00/M |
| #16 | GPT-5 Codex (high) | OpenAI | 86.5% | 171.1 tok/s | $3.44/M |
| #17 | DeepSeek V3.2 Speciale | DeepSeek | 86.3% | n/a | - |
| #18 | DeepSeek V3.2 (Reasoning) | DeepSeek | 86.2% | n/a | $0.337/M |
| #19 | Gemini 2.5 Pro | Google | 86.2% | 132 tok/s | $3.44/M |
| #20 | Claude 4 Opus (Non-reasoning) | Anthropic | 86.0% | 33.9 tok/s | $32.81/M |
| #21 | Claude 4.5 Sonnet (Non-reasoning) | Anthropic | 86.0% | 42.3 tok/s | $6.56/M |
| #22 | GPT-5 (low) | OpenAI | 86.0% | 79.3 tok/s | $3.44/M |
| #23 | GPT-5.1 Codex (high) | OpenAI | 86.0% | 182.1 tok/s | $3.44/M |
| #24 | GPT-5.2 (medium) | OpenAI | 85.9% | n/a | $4.81/M |
| #25 | Gemini 2.5 Pro Preview (Mar '25) | Google | 85.8% | n/a | - |
| #26 | GLM-4.7 (Reasoning) | Z AI | 85.6% | 79.2 tok/s | $1.00/M |
| #27 | Doubao Seed Code | ByteDance Seed | 85.4% | n/a | - |
| #28 | Grok 4.1 Fast (Reasoning) | xAI | 85.4% | n/a | - |
| #29 | o3 | OpenAI | 85.3% | 122.3 tok/s | $3.50/M |
| #30 | DeepSeek V3.1 (Reasoning) | DeepSeek | 85.1% | n/a | $0.865/M |
| #31 | DeepSeek V3.1 Terminus (Reasoning) | DeepSeek | 85.1% | n/a | $1.91/M |
| #32 | DeepSeek V3.2 Exp (Reasoning) | DeepSeek | 85.0% | n/a | $0.310/M |
| #33 | Grok 4 Fast (Reasoning) | xAI | 85.0% | n/a | $0.275/M |
| #34 | Cogito v2.1 (Reasoning) | Deep Cogito | 84.9% | 62.8 tok/s | $1.25/M |
| #35 | DeepSeek R1 0528 (May '25) | DeepSeek | 84.9% | n/a | $2.06/M |
| #36 | Kimi K2 Thinking | Kimi | 84.8% | 131.1 tok/s | $1.08/M |
| #37 | DeepSeek R1 (Jan '25) | DeepSeek | 84.4% | n/a | $2.43/M |
| #38 | MiMo-V2-Flash (Reasoning) | Xiaomi | 84.3% | 129.5 tok/s | $0.150/M |
| #39 | Qwen3 235B A22B 2507 (Reasoning) | Alibaba | 84.3% | 59.4 tok/s | $0.838/M |
| #40 | Claude 4 Sonnet (Reasoning) | Anthropic | 84.2% | 45.5 tok/s | $6.56/M |
| #41 | Gemini 2.5 Flash Preview (Sep '25) (Reasoning) | Google | 84.2% | n/a | - |
| #42 | o1 | OpenAI | 84.1% | 123.2 tok/s | $26.25/M |
| #43 | Qwen3 Max | Alibaba | 84.1% | 48.2 tok/s | $3.05/M |
| #44 | K-EXAONE (Reasoning) | LG AI Research | 83.8% | n/a | - |
| #45 | Qwen3 Max (Preview) | Alibaba | 83.8% | 47.1 tok/s | $2.40/M |
| #46 | Claude 3.7 Sonnet (Reasoning) | Anthropic | 83.7% | n/a | - |
| #47 | Claude 4 Sonnet (Non-reasoning) | Anthropic | 83.7% | 45.2 tok/s | $6.56/M |
| #48 | DeepSeek V3.2 (Non-reasoning) | DeepSeek | 83.7% | n/a | $0.775/M |
| #49 | Gemini 2.5 Pro Preview (May '25) | Google | 83.7% | n/a | $3.44/M |
| #50 | GPT-5 mini (high) | OpenAI | 83.7% | 87.4 tok/s | $0.688/M |
| #51 | DeepSeek V3.1 Terminus (Non-reasoning) | DeepSeek | 83.6% | n/a | $0.453/M |
| #52 | DeepSeek V3.2 Exp (Non-reasoning) | DeepSeek | 83.6% | n/a | $0.310/M |
| #53 | Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) | Google | 83.6% | n/a | - |
| #54 | Qwen3 VL 235B A22B (Reasoning) | Alibaba | 83.6% | 32.5 tok/s | $2.17/M |
| #55 | GLM-4.5 (Reasoning) | Z AI | 83.5% | 50.1 tok/s | $1.00/M |
| #56 | DeepSeek V3.1 (Non-reasoning) | DeepSeek | 83.3% | n/a | $0.834/M |
| #57 | Gemini 2.5 Flash (Reasoning) | Google | 83.2% | 221.3 tok/s | $0.850/M |
| #58 | o4-mini (high) | OpenAI | 83.2% | 151 tok/s | $1.93/M |
| #59 | ERNIE 5.0 Thinking Preview | Baidu | 83.0% | n/a | - |
| #60 | Nova 2.0 Pro Preview (medium) | Amazon | 83.0% | 127.7 tok/s | $3.44/M |
| #61 | GLM-4.6 (Reasoning) | Z AI | 82.9% | 43.9 tok/s | $0.963/M |
| #62 | Hermes 4 - Llama-3.1 405B (Reasoning) | Nous Research | 82.9% | 39.5 tok/s | $1.50/M |
| #63 | GPT-5 mini (medium) | OpenAI | 82.8% | 86.7 tok/s | $0.688/M |
| #64 | Grok 3 mini Reasoning (high) | xAI | 82.8% | 58.8 tok/s | $0.350/M |
| #65 | Qwen3 235B A22B (Reasoning) | Alibaba | 82.8% | 59 tok/s | $2.63/M |
| #66 | Qwen3 235B A22B 2507 Instruct | Alibaba | 82.8% | 42.5 tok/s | $0.356/M |
| #67 | Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) | NVIDIA | 82.5% | 52.7 tok/s | $0.900/M |
| #68 | Kimi K2 | Kimi | 82.4% | 24.3 tok/s | $1.04/M |
| #69 | Qwen3 Max Thinking (Preview) | Alibaba | 82.4% | 50.7 tok/s | $2.40/M |
| #70 | Qwen3 Next 80B A3B (Reasoning) | Alibaba | 82.4% | 135.7 tok/s | $1.88/M |
| #71 | Qwen3 VL 235B A22B Instruct | Alibaba | 82.3% | 48.1 tok/s | $0.700/M |
| #72 | INTELLECT-3 | Prime Intellect | 82.2% | n/a | - |
| #73 | Ling-1T | InclusionAI | 82.2% | n/a | - |
| #74 | Nova 2.0 Pro Preview (low) | Amazon | 82.2% | 147.9 tok/s | $3.44/M |
| #75 | GPT-5 (ChatGPT) | OpenAI | 82.0% | 167.3 tok/s | $3.44/M |
| #76 | GPT-5.1 Codex mini (high) | OpenAI | 82.0% | 213.6 tok/s | $0.688/M |
| #77 | MiniMax-M2 | MiniMax | 82.0% | 102.9 tok/s | $0.525/M |
| #78 | DeepSeek V3 0324 | DeepSeek | 81.9% | n/a | $1.21/M |
| #79 | Kimi K2 0905 | Kimi | 81.9% | 24.2 tok/s | $1.08/M |
| #80 | Qwen3 Next 80B A3B Instruct | Alibaba | 81.9% | 131.1 tok/s | $0.875/M |