Knowledge and reasoning benchmark score.
MMLU-Pro extends MMLU with more challenging, reasoning-focused questions, removes trivial or noisy items, and expands multiple-choice options from four to ten. It is meant to be more discriminative for advanced language models.
Test type: Multiple-choice reasoning and knowledge benchmark across broad academic domains.
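One practical effect of moving from four to ten answer options is a lower random-guess baseline. The sketch below works through that arithmetic, assuming every question has exactly one correct answer and the stated number of options; the helper name `chance_accuracy` is hypothetical and used only for illustration.

```python
from fractions import Fraction


def chance_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing on a single-answer question."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return Fraction(1, num_options)


# Assumption: four options per question for MMLU, ten for MMLU-Pro.
mmlu_chance = chance_accuracy(4)
mmlu_pro_chance = chance_accuracy(10)

print(f"MMLU chance level:     {float(mmlu_chance):.0%}")
print(f"MMLU-Pro chance level: {float(mmlu_pro_chance):.0%}")
```

Under these assumptions the guessing floor drops from 25% to 10%, so a given score on MMLU-Pro reflects less credit from lucky guesses.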
Top models ranked by MMLU-Pro.
| Rank | Model | Creator | MMLU-Pro Score | Speed | Blended Price |
|---|---|---|---|---|---|
| #1 | Gemini 3 Pro Preview (high) | Google | 89.8% | 128.7 tok/s | $4.50/M |
| #2 | | Anthropic | 89.5% | 57 tok/s | $10.00/M |
| #3 | Gemini 3 Pro Preview (low) | Google | 89.5% | n/a | $4.50/M |
| #4 | Gemini 3 Flash Preview (Reasoning) | Google | 89.0% | 193.2 tok/s | $1.13/M |
| #5 | Claude Opus 4.5 (Non-reasoning) | Anthropic | 88.9% | 50.3 tok/s | $10.00/M |
| #6 | Gemini 3 Flash Preview (Non-reasoning) | Google | 88.2% | 178.3 tok/s | $1.13/M |
| #7 | Claude 4.1 Opus (Reasoning) | Anthropic | 88.0% | 35.8 tok/s | $30.00/M |
| #8 | Claude 4.5 Sonnet (Reasoning) | Anthropic | 87.5% | 43.8 tok/s | $6.00/M |
| #9 | MiniMax-M2.1 | MiniMax | 87.5% | 84.8 tok/s | $0.525/M |
| #10 | GPT-5.2 (xhigh) | OpenAI | 87.4% | 71.8 tok/s | $4.81/M |
| #11 | Claude 4 Opus (Reasoning) | Anthropic | 87.3% | 36.8 tok/s | $30.00/M |
| #12 | GPT-5 (high) | OpenAI | 87.1% | 84.2 tok/s | $3.44/M |
| #13 | GPT-5.1 (high) | OpenAI | 87.0% | 123.3 tok/s | $3.44/M |
| #14 | GPT-5 (medium) | OpenAI | 86.7% | 82.3 tok/s | $3.44/M |
| #15 | Grok 4 | xAI | 86.6% | 50.3 tok/s | $6.00/M |
| #16 | GPT-5 Codex (high) | OpenAI | 86.5% | 166.8 tok/s | $3.44/M |
| #17 | DeepSeek V3.2 Speciale | DeepSeek | 86.3% | n/a | - |
| #18 | DeepSeek V3.2 (Reasoning) | DeepSeek | 86.2% | n/a | $0.315/M |
| #19 | Gemini 2.5 Pro | Google | 86.2% | 120.2 tok/s | $3.44/M |
| #20 | Claude 4 Opus (Non-reasoning) | Anthropic | 86.0% | 36.6 tok/s | $30.00/M |
| #21 | Claude 4.5 Sonnet (Non-reasoning) | Anthropic | 86.0% | 44.2 tok/s | $6.00/M |
| #22 | GPT-5 (low) | OpenAI | 86.0% | 65.8 tok/s | $3.44/M |
| #23 | GPT-5.1 Codex (high) | OpenAI | 86.0% | 162.7 tok/s | $3.44/M |
| #24 | GPT-5.2 (medium) | OpenAI | 85.9% | n/a | $4.81/M |
| #25 | Gemini 2.5 Pro Preview (Mar '25) | Google | 85.8% | n/a | - |
| #26 | GLM-4.7 (Reasoning) | Z AI | 85.6% | 90.3 tok/s | $1.00/M |
| #27 | Doubao Seed Code | ByteDance Seed | 85.4% | n/a | - |
| #28 | Grok 4.1 Fast (Reasoning) | xAI | 85.4% | 140.9 tok/s | $0.275/M |
| #29 | o3 | OpenAI | 85.3% | 72.7 tok/s | $3.50/M |
| #30 | DeepSeek V3.1 (Reasoning) | DeepSeek | 85.1% | n/a | $0.865/M |
| #31 | DeepSeek V3.1 Terminus (Reasoning) | DeepSeek | 85.1% | n/a | $1.91/M |
| #32 | DeepSeek V3.2 Exp (Reasoning) | DeepSeek | 85.0% | n/a | $0.315/M |
| #33 | Grok 4 Fast (Reasoning) | xAI | 85.0% | 76.2 tok/s | $0.275/M |
| #34 | Cogito v2.1 (Reasoning) | Deep Cogito | 84.9% | 51.1 tok/s | $1.25/M |
| #35 | DeepSeek R1 0528 (May '25) | DeepSeek | 84.9% | n/a | $2.36/M |
| #36 | Kimi K2 Thinking | Kimi | 84.8% | 99 tok/s | $1.08/M |
| #37 | DeepSeek R1 (Jan '25) | DeepSeek | 84.4% | n/a | $2.36/M |
| #38 | MiMo-V2-Flash (Reasoning) | Xiaomi | 84.3% | 118.8 tok/s | $0.150/M |
| #39 | Qwen3 235B A22B 2507 (Reasoning) | Alibaba | 84.3% | 56 tok/s | $2.63/M |
| #40 | Claude 4 Sonnet (Reasoning) | Anthropic | 84.2% | 50.3 tok/s | $6.00/M |
| #41 | Gemini 2.5 Flash Preview (Sep '25) (Reasoning) | Google | 84.2% | n/a | - |
| #42 | o1 | OpenAI | 84.1% | 103.3 tok/s | $26.25/M |
| #43 | Qwen3 Max | Alibaba | 84.1% | 32.2 tok/s | $2.40/M |
| #44 | K-EXAONE (Reasoning) | LG AI Research | 83.8% | n/a | - |
| #45 | Qwen3 Max (Preview) | Alibaba | 83.8% | 45.1 tok/s | $2.40/M |
| #46 | Claude 3.7 Sonnet (Reasoning) | Anthropic | 83.7% | n/a | $6.00/M |
| #47 | Claude 4 Sonnet (Non-reasoning) | Anthropic | 83.7% | 47.6 tok/s | $6.00/M |
| #48 | DeepSeek V3.2 (Non-reasoning) | DeepSeek | 83.7% | n/a | $0.315/M |
| #49 | Gemini 2.5 Pro Preview (May '25) | Google | 83.7% | n/a | $3.44/M |
| #50 | GPT-5 mini (high) | OpenAI | 83.7% | 85.7 tok/s | $0.688/M |
| #51 | DeepSeek V3.1 Terminus (Non-reasoning) | DeepSeek | 83.6% | n/a | $0.453/M |
| #52 | DeepSeek V3.2 Exp (Non-reasoning) | DeepSeek | 83.6% | n/a | $0.315/M |
| #53 | Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning) | Google | 83.6% | n/a | - |
| #54 | Qwen3 VL 235B A22B (Reasoning) | Alibaba | 83.6% | 46.2 tok/s | $2.63/M |
| #55 | GLM-4.5 (Reasoning) | Z AI | 83.5% | 46.4 tok/s | $1.00/M |
| #56 | DeepSeek V3.1 (Non-reasoning) | DeepSeek | 83.3% | n/a | $0.834/M |
| #57 | Gemini 2.5 Flash (Reasoning) | Google | 83.2% | 199.6 tok/s | $0.850/M |
| #58 | o4-mini (high) | OpenAI | 83.2% | 124.5 tok/s | $1.93/M |
| #59 | ERNIE 5.0 Thinking Preview | Baidu | 83.0% | n/a | - |
| #60 | Nova 2.0 Pro Preview (medium) | Amazon | 83.0% | 112.7 tok/s | $3.44/M |
| #61 | GLM-4.6 (Reasoning) | Z AI | 82.9% | 26.3 tok/s | $0.963/M |
| #62 | Hermes 4 - Llama-3.1 405B (Reasoning) | Nous Research | 82.9% | 34.9 tok/s | $1.50/M |
| #63 | GPT-5 mini (medium) | OpenAI | 82.8% | 77.2 tok/s | $0.688/M |
| #64 | Grok 3 mini Reasoning (high) | xAI | 82.8% | 215.5 tok/s | $0.350/M |
| #65 | Qwen3 235B A22B (Reasoning) | Alibaba | 82.8% | 61.4 tok/s | $2.63/M |
| #66 | Qwen3 235B A22B 2507 Instruct | Alibaba | 82.8% | 64.7 tok/s | $1.23/M |
| #67 | Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) | NVIDIA | 82.5% | 41 tok/s | $0.900/M |
| #68 | Kimi K2 | Kimi | 82.4% | 33 tok/s | $1.04/M |
| #69 | Qwen3 Max Thinking (Preview) | Alibaba | 82.4% | 40.8 tok/s | $2.40/M |
| #70 | Qwen3 Next 80B A3B (Reasoning) | Alibaba | 82.4% | 172.2 tok/s | $1.88/M |
| #71 | Qwen3 VL 235B A22B Instruct | Alibaba | 82.3% | 49 tok/s | $1.23/M |
| #72 | INTELLECT-3 | Prime Intellect | 82.2% | n/a | - |
| #73 | Ling-1T | InclusionAI | 82.2% | n/a | - |
| #74 | Nova 2.0 Pro Preview (low) | Amazon | 82.2% | 122.6 tok/s | $3.44/M |
| #75 | GPT-5 (ChatGPT) | OpenAI | 82.0% | 149.8 tok/s | $3.44/M |
| #76 | GPT-5.1 Codex mini (high) | OpenAI | 82.0% | 207.2 tok/s | $0.688/M |
| #77 | MiniMax-M2 | MiniMax | 82.0% | 83.5 tok/s | $0.525/M |
| #78 | DeepSeek V3 0324 | DeepSeek | 81.9% | n/a | $1.25/M |
| #79 | Kimi K2 0905 | Kimi | 81.9% | 12.5 tok/s | $1.08/M |
| #80 | Qwen3 Next 80B A3B Instruct | Alibaba | 81.9% | 155.3 tok/s | $0.875/M |
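
For readers who want to work with the table programmatically, here is a minimal sketch of parsing its rows into typed records. It assumes each row has exactly six pipe-separated cells, that `n/a` and `-` mean a value is unavailable, and that `/M` denotes a price per million tokens; the function name `parse_row` and the dictionary keys are hypothetical.

```python
from typing import Optional


def parse_row(line: str) -> dict:
    """Parse one leaderboard row of the form '| #9 | Model | Creator | 87.5% | 84.8 tok/s | $0.525/M |'."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    if len(cells) != 6:
        raise ValueError(f"expected 6 cells, got {len(cells)}: {line!r}")
    rank, model, creator, score, speed, price = cells

    # Missing measurements are kept as None rather than coerced to zero.
    speed_value: Optional[float] = None if speed == "n/a" else float(speed.split()[0])
    price_value: Optional[float] = (
        None if price in ("-", "n/a") else float(price.lstrip("$").split("/")[0])
    )
    return {
        "rank": int(rank.lstrip("#")),
        "model": model,
        "creator": creator,
        "score_pct": float(score.rstrip("%")),
        "speed_tok_s": speed_value,
        "price_usd_per_m": price_value,
    }


# Three rows copied from the table above.
sample = [
    "| #9 | MiniMax-M2.1 | MiniMax | 87.5% | 84.8 tok/s | $0.525/M |",
    "| #17 | DeepSeek V3.2 Speciale | DeepSeek | 86.3% | n/a | - |",
    "| #10 | GPT-5.2 (xhigh) | OpenAI | 87.4% | 71.8 tok/s | $4.81/M |",
]
rows = [parse_row(line) for line in sample]
for row in rows:
    print(row)
```

Keeping unavailable speed and price values as `None` avoids silently treating them as zero when filtering or sorting.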