Graduate-level science and reasoning benchmark score.
GPQA, most often reported via its GPQA Diamond subset on model leaderboards, is a graduate-level Google-proof Q&A benchmark: expert-written science questions designed so that web search alone is not enough to answer them. It targets expert reasoning in biology, physics, and chemistry.
Test type: Expert-written, four-option multiple-choice science Q&A, usually scored by exact extraction of the chosen option letter.
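As a minimal sketch of what exact option extraction looks like in practice, the snippet below pulls a final answer letter out of a model completion and scores it against an answer key. The answer-declaration pattern and the `extract_choice`/`score` helpers are illustrative assumptions, not the official GPQA grading harness.

```python
import re

# Minimal sketch of exact option extraction for multiple-choice QA.
# The answer pattern, helpers, and demo strings are illustrative
# assumptions, not the official GPQA grading harness.

def extract_choice(completion: str) -> str | None:
    """Pull the final answer letter (A-D) out of a model completion."""
    # Prefer an explicit "Answer: X" / "Answer is (X)" declaration.
    match = re.search(r"[Aa]nswer\s*(?:is)?\s*:?\s*\(?([ABCD])\)?", completion)
    if match:
        return match.group(1)
    # Otherwise fall back to the last standalone option letter.
    letters = re.findall(r"\b([ABCD])\b", completion)
    return letters[-1] if letters else None

def score(completions: list[str], gold: list[str]) -> float:
    """Fraction of completions whose extracted option matches the key."""
    hits = sum(extract_choice(c) == g for c, g in zip(completions, gold))
    return hits / len(gold)

if __name__ == "__main__":
    demo = ["Step by step... Answer: (C)", "The best option is B."]
    print(score(demo, ["C", "B"]))  # 1.0
```

Extraction like this is brittle by design: a completion that never commits to a single option letter simply scores as wrong, which is why reasoning models are typically prompted to end with an explicit answer declaration.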
Top models ranked by GPQA Diamond score.
| Rank | Model | Creator | GPQA Score | Output Speed | Blended Price (USD per 1M tokens) |
|---|---|---|---|---|---|
| #1 | Gemini 3.1 Pro Preview | Google | 94.1% | 131.2 tok/s | $4.50/M |
| #2 | GPT-5.5 (xhigh) | OpenAI | 93.5% | 66.1 tok/s | $11.25/M |
| #3 | GPT-5.5 (high) | OpenAI | 93.2% | 59.3 tok/s | $11.25/M |
| #4 | GPT-5.5 (medium) | OpenAI | 92.6% | 57.5 tok/s | $11.25/M |
| #5 | GPT-5.4 (xhigh) | OpenAI | 92.0% | 93.5 tok/s | $5.63/M |
| #6 | GPT-5.3 Codex (xhigh) | OpenAI | 91.5% | 87.1 tok/s | $4.81/M |
| #7 | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | Anthropic | 91.4% | 51.8 tok/s | $10.00/M |
| #8 | Grok 4.20 0309 v2 (Reasoning) | xAI | 91.1% | 89.3 tok/s | $3.00/M |
| #9 | Kimi K2.6 | Kimi | 91.1% | 29.1 tok/s | $1.71/M |
| #10 | GPT-5.5 (low) | OpenAI | 91.0% | 56.8 tok/s | $11.25/M |
| #11 | Gemini 3 Pro Preview (high) | Google | 90.8% | 128.7 tok/s | $4.50/M |
| #12 | DeepSeek V4 Pro (Reasoning, High Effort) | DeepSeek | 90.5% | 32.9 tok/s | $2.18/M |
| #13 | GPT-5.2 (xhigh) | OpenAI | 90.3% | 71.8 tok/s | $4.81/M |
| #14 | GPT-5.2 Codex (xhigh) | OpenAI | 89.9% | 87.7 tok/s | $4.81/M |
| #15 | Gemini 3 Flash Preview (Reasoning) | Google | 89.8% | 193.2 tok/s | $1.13/M |
| #16 | Claude Opus 4.6 (Adaptive Reasoning, Max Effort) | Anthropic | 89.6% | 49.9 tok/s | $10.00/M |
| #17 | DeepSeek V4 Flash (Reasoning, Max Effort) | DeepSeek | 89.4% | 77.4 tok/s | $0.175/M |
| #18 | Qwen3.5 397B A17B (Reasoning) | Alibaba | 89.3% | 50.4 tok/s | $1.35/M |
| #19 | DeepSeek V4 Pro (Reasoning, Max Effort) | DeepSeek | 88.8% | 34.3 tok/s | $2.18/M |
| #20 | Qwen3.6 Max Preview | Alibaba | 88.8% | 33.2 tok/s | $2.93/M |
| #21 | Gemini 3 Pro Preview (low) | Google | 88.7% | n/a | $4.50/M |
| #22 | Claude Opus 4.7 (Non-reasoning, High Effort) | Anthropic | 88.5% | 43 tok/s | $10.00/M |
| #23 | Grok 4.20 0309 (Reasoning) | xAI | 88.5% | 87.8 tok/s | $3.00/M |
| #24 | Muse Spark | Meta | 88.4% | n/a | - |
| #25 | Qwen3.6 Plus | Alibaba | 88.2% | 53.1 tok/s | $1.13/M |
| #26 | Kimi K2.5 (Reasoning) | Kimi | 87.9% | 31.6 tok/s | $1.20/M |
| #27 | Grok 4 | xAI | 87.7% | 50.3 tok/s | $6.00/M |
| #28 | Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | Anthropic | 87.5% | 68 tok/s | $6.00/M |
| #29 | GPT-5.4 mini (xhigh) | OpenAI | 87.5% | 158.9 tok/s | $1.69/M |
| #30 | MiniMax-M2.7 | MiniMax | 87.4% | 43.9 tok/s | $0.525/M |
| #31 | GPT-5.1 (high) | OpenAI | 87.3% | 123.3 tok/s | $3.44/M |
| #32 | DeepSeek V3.2 Speciale | DeepSeek | 87.1% | n/a | - |
| #33 | GPT-5.4 (low) | OpenAI | 87.1% | 59.1 tok/s | $5.63/M |
| #34 | MiMo-V2-Pro | Xiaomi | 87.0% | n/a | - |
| #35 | GLM-5.1 (Reasoning) | Z AI | 86.8% | 45.7 tok/s | $2.15/M |
| #36 | DeepSeek V4 Flash (Reasoning, High Effort) | DeepSeek | 86.7% | n/a | $0.175/M |
| #37 | Hy3-preview (Reasoning) | Tencent | 86.7% | 86.4 tok/s | - |
| #38 | Claude Opus 4.5 (Reasoning) | Anthropic | 86.6% | 57 tok/s | $10.00/M |
| #39 | MiMo-V2.5-Pro | Xiaomi | 86.6% | 59.9 tok/s | $1.50/M |
| #40 | GPT-5.2 (medium) | OpenAI | 86.4% | n/a | $4.81/M |
| #41 | Qwen3 Max Thinking | Alibaba | 86.1% | 34.3 tok/s | $2.40/M |
| #42 | Qwen3.5 397B A17B (Non-reasoning) | Alibaba | 86.1% | 52.5 tok/s | $1.35/M |
| #43 | GPT-5.1 Codex (high) | OpenAI | 86.0% | 162.7 tok/s | $3.44/M |
| #44 | GLM-4.7 (Reasoning) | Z AI | 85.9% | 90.3 tok/s | $1.00/M |
| #45 | Qwen3.5 27B (Reasoning) | Alibaba | 85.8% | 87 tok/s | $0.825/M |
| #46 | Gemma 4 31B (Reasoning) | Google | 85.7% | 34.8 tok/s | - |
| #47 | Qwen3.5 122B A10B (Reasoning) | Alibaba | 85.7% | 139.9 tok/s | $1.10/M |
| #48 | KAT Coder Pro V2 | KwaiKAT | 85.5% | 110.7 tok/s | $0.525/M |
| #49 | MiMo-V2-Omni-0327 | Xiaomi | 85.5% | n/a | - |
| #50 | GPT-5 (high) | OpenAI | 85.4% | 84.2 tok/s | $3.44/M |
| #51 | Grok 4.1 Fast (Reasoning) | xAI | 85.3% | 140.9 tok/s | $0.275/M |
| #52 | MiMo-V2.5 | Xiaomi | 84.9% | n/a | - |
| #53 | Nanbeige4.1-3B | Nanbeige | 84.9% | n/a | - |
| #54 | MiniMax-M2.5 | MiniMax | 84.8% | 79.7 tok/s | $0.525/M |
| #55 | GLM-5-Turbo | Z AI | 84.7% | n/a | - |
| #56 | Grok 4 Fast (Reasoning) | xAI | 84.7% | 76.2 tok/s | $0.275/M |
| #57 | MiMo-V2-Flash (Reasoning) | Xiaomi | 84.6% | 118.8 tok/s | $0.150/M |
| #58 | o3-pro | OpenAI | 84.5% | 16.9 tok/s | $35.00/M |
| #59 | Qwen3.5 35B A3B (Reasoning) | Alibaba | 84.5% | 137.7 tok/s | $0.688/M |
| #60 | Gemini 2.5 Pro | Google | 84.4% | 120.2 tok/s | $3.44/M |
| #61 | GPT-5 (medium) | OpenAI | 84.2% | 82.3 tok/s | $3.44/M |
| #62 | Qwen3.5 27B (Non-reasoning) | Alibaba | 84.2% | 90.6 tok/s | $0.825/M |
| #63 | Qwen3.6 27B (Reasoning) | Alibaba | 84.2% | 64.1 tok/s | $1.35/M |
| #64 | Qwen3.6 35B A3B (Reasoning) | Alibaba | 84.1% | 191.8 tok/s | $0.557/M |
| #65 | Claude Opus 4.6 (Non-reasoning, High Effort) | Anthropic | 84.0% | 42 tok/s | $10.00/M |
| #66 | DeepSeek V3.2 (Reasoning) | DeepSeek | 84.0% | n/a | $0.315/M |
| #67 | GLM-5.1 (Non-reasoning) | Z AI | 83.9% | 41.5 tok/s | $2.15/M |
| #68 | Kimi K2 Thinking | Kimi | 83.8% | 99 tok/s | $1.08/M |
| #69 | GPT-5 Codex (high) | OpenAI | 83.7% | 166.8 tok/s | $3.44/M |
| #70 | Gemini 2.5 Pro Preview (Mar '25) | Google | 83.6% | n/a | - |
| #71 | MiMo-V2-Flash (Feb 2026) | Xiaomi | 83.5% | 120.6 tok/s | $0.150/M |
| #72 | Claude 4.5 Sonnet (Reasoning) | Anthropic | 83.4% | 43.8 tok/s | $6.00/M |
| #73 | Step 3.5 Flash | StepFun | 83.1% | 123.6 tok/s | $0.150/M |
| #74 | MiniMax-M2.1 | MiniMax | 83.0% | 84.8 tok/s | $0.525/M |
| #75 | Qwen3.6 27B (Non-reasoning) | Alibaba | 82.9% | 60.5 tok/s | $1.35/M |
| #76 | GPT-5 mini (high) | OpenAI | 82.8% | 85.7 tok/s | $0.688/M |
| #77 | MiMo-V2-Omni | Xiaomi | 82.8% | n/a | - |
| #78 | o3 | OpenAI | 82.7% | 72.7 tok/s | $3.50/M |
| #79 | Qwen3.5 122B A10B (Non-reasoning) | Alibaba | 82.7% | 131.5 tok/s | $1.10/M |
| #80 | Qwen3.5 Omni Plus | Alibaba | 82.6% | 56 tok/s | $1.50/M |
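A note on the Blended Price column: the $/M figures fold separate input and output token prices into a single number in USD per million tokens. The sketch below assumes a common 3:1 input-to-output weighting; the ratio, the `blended_price` helper, and the example prices are assumptions for illustration, not the leaderboard's published method.

```python
# Sketch of a blended token price, assuming a 3:1 input:output
# weighting (an assumption; the table does not state its ratio).

def blended_price(input_usd_per_m: float, output_usd_per_m: float,
                  input_weight: float = 3.0,
                  output_weight: float = 1.0) -> float:
    """Weighted average of per-1M-token prices, in USD per 1M tokens."""
    total = input_weight + output_weight
    return (input_weight * input_usd_per_m
            + output_weight * output_usd_per_m) / total

# Illustrative: $1.25/M input and $10.00/M output blend to $3.44/M
# under a 3:1 ratio, consistent with the GPT-5 rows above.
print(f"${blended_price(1.25, 10.00):.2f}/M")  # -> $3.44/M
```

If a workload's actual input:output mix differs from the assumed ratio, the effective blended cost shifts accordingly; long-reasoning models in particular skew toward output tokens, which are usually the more expensive side.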