Easy Benchmarks

GPQA

Graduate-level science and reasoning benchmark score.

GPQA, often reported as GPQA Diamond in model leaderboards, is a graduate-level Google-proof Q&A benchmark. It focuses on expert science reasoning where strong retrieval alone is not enough.

Test type: Expert multiple-choice science Q&A, usually evaluated with exact option extraction.
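As an illustration of what "exact option extraction" can look like in practice, here is a minimal sketch in Python. It assumes the model is prompted to end its response with a line such as "Answer: C" and that each question has four options labeled A to D. The regular expression and the function names `extract_option` and `accuracy` are hypothetical and are not taken from the GPQA project or from Artificial Analysis; real evaluation harnesses may use different prompts and parsing rules.

```python
import re
from typing import Optional, Sequence

# Hypothetical pattern: matches "Answer: C" or "answer: (C)".
# The word "answer" is case-insensitive, but the option letter must be an
# uppercase A-D that is not followed by another letter (so "Answer: Apple"
# is rejected rather than read as option A).
ANSWER_RE = re.compile(r"(?i:answer)\s*:\s*\(?([A-D])\)?(?![A-Za-z])")


def extract_option(response: str) -> Optional[str]:
    """Return the last option letter stated in the response, or None."""
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None


def accuracy(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of responses whose extracted option equals the gold label.

    A response with no extractable option is counted as incorrect.
    """
    if not gold or len(responses) != len(gold):
        raise ValueError("responses and gold must be non-empty and equal in length")
    correct = sum(extract_option(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


if __name__ == "__main__":
    demo_responses = [
        "The spin states split into three levels. Answer: A",
        "Working through the mechanism gives answer: (B)",
        "I am not sure which option is right.",
    ]
    demo_gold = ["A", "B", "C"]
    print(f"{accuracy(demo_responses, demo_gold):.1%}")
```

Under this sketch, a response that never states an option in the expected format is scored as wrong, which is one common convention but not the only one. Scores like the percentages on this page would correspond to `accuracy` multiplied by 100.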

Coverage

504 models have this metric.

Top score: 94.1%

Current leader: Gemini 3.1 Pro Preview

Project links

This app ranks the GPQA score exposed by the Artificial Analysis snapshot.


Top GPQA Models

Top models ranked by GPQA.

Leaderboard

| Rank | Model | Creator | Value | Speed | Blended Price |
|---|---|---|---|---|---|
| #1 | Gemini 3.1 Pro Preview | Google | 94.1% | 124.7 tok/s | $4.50/M |
| #2 | GPT-5.5 (xhigh) | OpenAI | 93.5% | 69 tok/s | $11.25/M |
| #3 | GPT-5.5 (high) | OpenAI | 93.2% | 61.6 tok/s | $11.25/M |
| #4 | MiniMax-M3 | MiniMax | 92.9% | 45.6 tok/s | $0.525/M |
| #5 | Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) | Anthropic | 92.6% | n/a | $20.00/M |
| #6 | GPT-5.5 (medium) | OpenAI | 92.6% | 58.7 tok/s | $11.25/M |
| #7 | Qwen3.7 Max | Alibaba | 92.3% | 186.5 tok/s | $3.75/M |
| #8 | Gemini 3.5 Flash (high) | Google | 92.2% | 203.3 tok/s | $3.38/M |
| #9 | Gemini 3.5 Flash (medium) | Google | 92.1% | 210.1 tok/s | $3.38/M |
| #10 | Claude Opus 4.8 (Adaptive Reasoning, Max Effort) | Anthropic | 92.0% | 67.8 tok/s | $10.00/M |
| #11 | GPT-5.4 (xhigh) | OpenAI | 92.0% | 75.5 tok/s | $5.63/M |
| #12 | GPT-5.3 Codex (xhigh) | OpenAI | 91.5% | 84.5 tok/s | $4.81/M |
| #13 | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | Anthropic | 91.4% | 53.8 tok/s | $10.00/M |
| #14 | Grok 4.20 0309 v2 (Reasoning) | xAI | 91.1% | 168.7 tok/s | $3.00/M |
| #15 | Kimi K2.6 | Kimi | 91.1% | 41.6 tok/s | $1.71/M |
| #16 | GPT-5.5 (low) | OpenAI | 91.0% | 66.4 tok/s | $11.25/M |
| #17 | Gemini 3 Pro Preview (high) | Google | 90.8% | n/a | $4.50/M |
| #18 | DeepSeek V4 Pro (Reasoning, High Effort) | DeepSeek | 90.5% | 65.7 tok/s | $0.544/M |
| #19 | GPT-5.2 (xhigh) | OpenAI | 90.3% | 71 tok/s | $4.81/M |
| #20 | Grok 4.3 (high) | xAI | 90.1% | 159.7 tok/s | $1.56/M |
| #21 | Qwen3.7 Plus | Alibaba | 90.0% | 53.6 tok/s | $0.590/M |
| #22 | GPT-5.2 Codex (xhigh) | OpenAI | 89.9% | 105.3 tok/s | $4.81/M |
| #23 | Gemini 3 Flash Preview (Reasoning) | Google | 89.8% | 172.8 tok/s | $1.13/M |
| #24 | Claude Opus 4.6 (Adaptive Reasoning, Max Effort) | Anthropic | 89.6% | 47.3 tok/s | $10.94/M |
| #25 | DeepSeek V4 Flash (Reasoning, Max Effort) | DeepSeek | 89.4% | 98.3 tok/s | $0.175/M |
| #26 | Qwen3.5 397B A17B (Reasoning) | Alibaba | 89.3% | 51.8 tok/s | $1.35/M |
| #27 | Grok 4.3 (medium) | xAI | 89.0% | 136.9 tok/s | $1.56/M |
| #28 | DeepSeek V4 Pro (Reasoning, Max Effort) | DeepSeek | 88.8% | 61.6 tok/s | $0.544/M |
| #29 | Qwen3.6 Max Preview | Alibaba | 88.8% | 40.9 tok/s | $2.93/M |
| #30 | Gemini 3 Pro Preview (low) | Google | 88.7% | n/a | $4.50/M |
| #31 | Claude Opus 4.7 (Non-reasoning, High Effort) | Anthropic | 88.5% | 46 tok/s | $10.00/M |
| #32 | Grok 4.20 0309 (Reasoning) | xAI | 88.5% | 166.5 tok/s | $3.00/M |
| #33 | Muse Spark | Meta | 88.4% | n/a | - |
| #34 | Qwen3.6 Plus | Alibaba | 88.2% | 52.8 tok/s | $1.13/M |
| #35 | Kimi K2.5 (Reasoning) | Kimi | 87.9% | 31.7 tok/s | $1.19/M |
| #36 | Grok 4 | xAI | 87.7% | n/a | $11.00/M |
| #37 | Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | Anthropic | 87.5% | 63.2 tok/s | $6.00/M |
| #38 | GPT-5.4 mini (xhigh) | OpenAI | 87.5% | 178.8 tok/s | $1.69/M |
| #39 | MiniMax-M2.7 | MiniMax | 87.4% | 75 tok/s | $0.525/M |
| #40 | GPT-5.1 (high) | OpenAI | 87.3% | 121.2 tok/s | $3.44/M |
| #41 | DeepSeek V3.2 Speciale | DeepSeek | 87.1% | n/a | - |
| #42 | GPT-5.4 (low) | OpenAI | 87.1% | 63.6 tok/s | $5.63/M |
| #43 | MiMo-V2-Pro | Xiaomi | 87.0% | 42.5 tok/s | $1.50/M |
| #44 | GLM-5.1 (Reasoning) | Z AI | 86.8% | 46.8 tok/s | $2.15/M |
| #45 | DeepSeek V4 Flash (Reasoning, High Effort) | DeepSeek | 86.7% | n/a | $0.175/M |
| #46 | TEHy3-preview (Reasoning) | Tencent | 86.7% | 96 tok/s | $0.200/M |
| #47 | Claude Opus 4.5 (Reasoning) | Anthropic | 86.6% | 53.5 tok/s | $10.94/M |
| #48 | MiMo-V2.5-Pro | Xiaomi | 86.6% | 43.3 tok/s | $0.544/M |
| #49 | GPT-5.2 (medium) | OpenAI | 86.4% | n/a | $4.81/M |
| #50 | Qwen3 Max Thinking | Alibaba | 86.1% | n/a | $2.40/M |
| #51 | Qwen3.5 397B A17B (Non-reasoning) | Alibaba | 86.1% | 53.1 tok/s | $1.35/M |
| #52 | GPT-5.1 Codex (high) | OpenAI | 86.0% | 182.1 tok/s | $3.44/M |
| #53 | GLM-4.7 (Reasoning) | Z AI | 85.9% | 79.2 tok/s | $1.00/M |
| #54 | Qwen3.5 27B (Reasoning) | Alibaba | 85.8% | 82.8 tok/s | $0.825/M |
| #55 | Gemma 4 31B (Reasoning) | Google | 85.7% | 34.8 tok/s | - |
| #56 | Qwen3.5 122B A10B (Reasoning) | Alibaba | 85.7% | 143.6 tok/s | $1.10/M |
| #57 | Ring-2.6-1T | InclusionAI | 85.7% | 122.1 tok/s | $0.850/M |
| #58 | KAT Coder Pro V2 | KwaiKAT | 85.5% | 118.1 tok/s | $0.525/M |
| #59 | MiMo-V2-Omni-0327 | Xiaomi | 85.5% | 85.6 tok/s | $0.800/M |
| #60 | GPT-5 (high) | OpenAI | 85.4% | 111.1 tok/s | $3.44/M |
| #61 | Grok 4.1 Fast (Reasoning) | xAI | 85.3% | n/a | - |
| #62 | MiMo-V2.5 | Xiaomi | 84.9% | 77.4 tok/s | $0.175/M |
| #63 | Nanbeige4.1-3B | Nanbeige | 84.9% | n/a | - |
| #64 | MiniMax-M2.5 | MiniMax | 84.8% | 202.9 tok/s | $0.525/M |
| #65 | GLM-5-Turbo | Z AI | 84.7% | n/a | - |
| #66 | Grok 4 Fast (Reasoning) | xAI | 84.7% | n/a | $0.275/M |
| #67 | GPT-5.5 Instant (May 2026) | OpenAI | 84.6% | n/a | $11.25/M |
| #68 | MiMo-V2-Flash (Reasoning) | Xiaomi | 84.6% | 129.5 tok/s | $0.150/M |
| #69 | o3-pro | OpenAI | 84.5% | 22.7 tok/s | $35.00/M |
| #70 | Qwen3.5 35B A3B (Reasoning) | Alibaba | 84.5% | 130.4 tok/s | $0.688/M |
| #71 | Gemini 2.5 Pro | Google | 84.4% | 132 tok/s | $3.44/M |
| #72 | Grok 4.3 (low) | xAI | 84.3% | 148.4 tok/s | $1.56/M |
| #73 | GPT-5 (medium) | OpenAI | 84.2% | 85.6 tok/s | $3.44/M |
| #74 | Qwen3.5 27B (Non-reasoning) | Alibaba | 84.2% | 90.6 tok/s | $0.875/M |
| #75 | Qwen3.6 27B (Reasoning) | Alibaba | 84.2% | 54.7 tok/s | $1.35/M |
| #76 | Qwen3.6 35B A3B (Reasoning) | Alibaba | 84.1% | 159.9 tok/s | $0.557/M |
| #77 | Claude Opus 4.6 (Non-reasoning, High Effort) | Anthropic | 84.0% | 40.9 tok/s | $10.94/M |
| #78 | DeepSeek V3.2 (Reasoning) | DeepSeek | 84.0% | n/a | $0.337/M |
| #79 | GLM-5.1 (Non-reasoning) | Z AI | 83.9% | 45.6 tok/s | $2.15/M |
| #80 | Kimi K2 Thinking | Kimi | 83.8% | 131.1 tok/s | $1.08/M |