GPQA

Graduate-level science and reasoning benchmark score.

GPQA, often reported as GPQA Diamond in model leaderboards, is a graduate-level Google-proof Q&A benchmark. It focuses on expert science reasoning where strong retrieval alone is not enough.

Test type: Expert multiple-choice science Q&A, usually evaluated with exact option extraction.
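As a rough illustration of what "exact option extraction" can mean in practice, the sketch below pulls a single answer letter out of a free-form model response and scores it against a gold label. The `Answer: X` convention, the four-option A-D range, and the names `ANSWER_RE`, `extract_option`, and `accuracy` are assumptions made for this example; they are not taken from the GPQA harness or from the Artificial Analysis methodology.

```python
import re
from typing import Optional, Sequence

# Assumed convention: the response ends with something like "Answer: B" or "Answer: (b)".
# The negative lookahead rejects words that merely start with A-D (e.g. "Answer: Diamond").
ANSWER_RE = re.compile(r"answer\s*:\s*\(?([A-D])(?![A-Za-z])", re.IGNORECASE)


def extract_option(response: str) -> Optional[str]:
    """Return the last stated option letter in upper case, or None if none is found."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].upper() if matches else None


def accuracy(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of responses whose extracted option equals the gold letter.

    A response with no extractable option counts as wrong.
    """
    if not gold or len(responses) != len(gold):
        raise ValueError("responses and gold must be non-empty and the same length")
    correct = sum(extract_option(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


sample_responses = [
    "The second structure is more stable. Answer: B",
    "I first considered A, but the units rule it out. Answer: (c)",
    "I am not sure.",
]
sample_gold = ["B", "C", "D"]
score = accuracy(sample_responses, sample_gold)
print(f"{score:.1%}")
```

Taking the last match rather than the first is a design choice for this sketch: responses that reason through several options before committing tend to state the final choice at the end.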

Coverage

478 models have this metric.

Top score: 94.1%

Current leader: Gemini 3.1 Pro Preview

Project links

This app ranks the GPQA score exposed by the Artificial Analysis snapshot.

Top GPQA Models

Top models ranked by GPQA.

Leaderboard

| Rank | Model | Creator | Value | Speed | Blended Price |
| --- | --- | --- | --- | --- | --- |
| #1 | Gemini 3.1 Pro Preview | Google | 94.1% | 131.2 tok/s | $4.50/M |
| #2 | GPT-5.5 (xhigh) | OpenAI | 93.5% | 66.1 tok/s | $11.25/M |
| #3 | GPT-5.5 (high) | OpenAI | 93.2% | 59.3 tok/s | $11.25/M |
| #4 | GPT-5.5 (medium) | OpenAI | 92.6% | 57.5 tok/s | $11.25/M |
| #5 | GPT-5.4 (xhigh) | OpenAI | 92.0% | 93.5 tok/s | $5.63/M |
| #6 | GPT-5.3 Codex (xhigh) | OpenAI | 91.5% | 87.1 tok/s | $4.81/M |
| #7 | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | Anthropic | 91.4% | 51.8 tok/s | $10.00/M |
| #8 | Grok 4.20 0309 v2 (Reasoning) | xAI | 91.1% | 89.3 tok/s | $3.00/M |
| #9 | Kimi K2.6 | Kimi | 91.1% | 29.1 tok/s | $1.71/M |
| #10 | GPT-5.5 (low) | OpenAI | 91.0% | 56.8 tok/s | $11.25/M |
| #11 | Gemini 3 Pro Preview (high) | Google | 90.8% | 128.7 tok/s | $4.50/M |
| #12 | DeepSeek V4 Pro (Reasoning, High Effort) | DeepSeek | 90.5% | 32.9 tok/s | $2.18/M |
| #13 | GPT-5.2 (xhigh) | OpenAI | 90.3% | 71.8 tok/s | $4.81/M |
| #14 | GPT-5.2 Codex (xhigh) | OpenAI | 89.9% | 87.7 tok/s | $4.81/M |
| #15 | Gemini 3 Flash Preview (Reasoning) | Google | 89.8% | 193.2 tok/s | $1.13/M |
| #16 | Claude Opus 4.6 (Adaptive Reasoning, Max Effort) | Anthropic | 89.6% | 49.9 tok/s | $10.00/M |
| #17 | DeepSeek V4 Flash (Reasoning, Max Effort) | DeepSeek | 89.4% | 77.4 tok/s | $0.175/M |
| #18 | Qwen3.5 397B A17B (Reasoning) | Alibaba | 89.3% | 50.4 tok/s | $1.35/M |
| #19 | DeepSeek V4 Pro (Reasoning, Max Effort) | DeepSeek | 88.8% | 34.3 tok/s | $2.18/M |
| #20 | Qwen3.6 Max Preview | Alibaba | 88.8% | 33.2 tok/s | $2.93/M |
| #21 | Gemini 3 Pro Preview (low) | Google | 88.7% | n/a | $4.50/M |
| #22 | Claude Opus 4.7 (Non-reasoning, High Effort) | Anthropic | 88.5% | 43 tok/s | $10.00/M |
| #23 | Grok 4.20 0309 (Reasoning) | xAI | 88.5% | 87.8 tok/s | $3.00/M |
| #24 | Muse Spark | Meta | 88.4% | n/a | - |
| #25 | Qwen3.6 Plus | Alibaba | 88.2% | 53.1 tok/s | $1.13/M |
| #26 | Kimi K2.5 (Reasoning) | Kimi | 87.9% | 31.6 tok/s | $1.20/M |
| #27 | Grok 4 | xAI | 87.7% | 50.3 tok/s | $6.00/M |
| #28 | Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | Anthropic | 87.5% | 68 tok/s | $6.00/M |
| #29 | GPT-5.4 mini (xhigh) | OpenAI | 87.5% | 158.9 tok/s | $1.69/M |
| #30 | MiniMax-M2.7 | MiniMax | 87.4% | 43.9 tok/s | $0.525/M |
| #31 | GPT-5.1 (high) | OpenAI | 87.3% | 123.3 tok/s | $3.44/M |
| #32 | DeepSeek V3.2 Speciale | DeepSeek | 87.1% | n/a | - |
| #33 | GPT-5.4 (low) | OpenAI | 87.1% | 59.1 tok/s | $5.63/M |
| #34 | MiMo-V2-Pro | Xiaomi | 87.0% | n/a | - |
| #35 | GLM-5.1 (Reasoning) | Z AI | 86.8% | 45.7 tok/s | $2.15/M |
| #36 | DeepSeek V4 Flash (Reasoning, High Effort) | DeepSeek | 86.7% | n/a | $0.175/M |
| #37 | TEHy3-preview (Reasoning) | Tencent | 86.7% | 86.4 tok/s | - |
| #38 | Claude Opus 4.5 (Reasoning) | Anthropic | 86.6% | 57 tok/s | $10.00/M |
| #39 | MiMo-V2.5-Pro | Xiaomi | 86.6% | 59.9 tok/s | $1.50/M |
| #40 | GPT-5.2 (medium) | OpenAI | 86.4% | n/a | $4.81/M |
| #41 | Qwen3 Max Thinking | Alibaba | 86.1% | 34.3 tok/s | $2.40/M |
| #42 | Qwen3.5 397B A17B (Non-reasoning) | Alibaba | 86.1% | 52.5 tok/s | $1.35/M |
| #43 | GPT-5.1 Codex (high) | OpenAI | 86.0% | 162.7 tok/s | $3.44/M |
| #44 | GLM-4.7 (Reasoning) | Z AI | 85.9% | 90.3 tok/s | $1.00/M |
| #45 | Qwen3.5 27B (Reasoning) | Alibaba | 85.8% | 87 tok/s | $0.825/M |
| #46 | Gemma 4 31B (Reasoning) | Google | 85.7% | 34.8 tok/s | - |
| #47 | Qwen3.5 122B A10B (Reasoning) | Alibaba | 85.7% | 139.9 tok/s | $1.10/M |
| #48 | KAT Coder Pro V2 | KwaiKAT | 85.5% | 110.7 tok/s | $0.525/M |
| #49 | MiMo-V2-Omni-0327 | Xiaomi | 85.5% | n/a | - |
| #50 | GPT-5 (high) | OpenAI | 85.4% | 84.2 tok/s | $3.44/M |
| #51 | Grok 4.1 Fast (Reasoning) | xAI | 85.3% | 140.9 tok/s | $0.275/M |
| #52 | MiMo-V2.5 | Xiaomi | 84.9% | n/a | - |
| #53 | Nanbeige4.1-3B | Nanbeige | 84.9% | n/a | - |
| #54 | MiniMax-M2.5 | MiniMax | 84.8% | 79.7 tok/s | $0.525/M |
| #55 | GLM-5-Turbo | Z AI | 84.7% | n/a | - |
| #56 | Grok 4 Fast (Reasoning) | xAI | 84.7% | 76.2 tok/s | $0.275/M |
| #57 | MiMo-V2-Flash (Reasoning) | Xiaomi | 84.6% | 118.8 tok/s | $0.150/M |
| #58 | o3-pro | OpenAI | 84.5% | 16.9 tok/s | $35.00/M |
| #59 | Qwen3.5 35B A3B (Reasoning) | Alibaba | 84.5% | 137.7 tok/s | $0.688/M |
| #60 | Gemini 2.5 Pro | Google | 84.4% | 120.2 tok/s | $3.44/M |
| #61 | GPT-5 (medium) | OpenAI | 84.2% | 82.3 tok/s | $3.44/M |
| #62 | Qwen3.5 27B (Non-reasoning) | Alibaba | 84.2% | 90.6 tok/s | $0.825/M |
| #63 | Qwen3.6 27B (Reasoning) | Alibaba | 84.2% | 64.1 tok/s | $1.35/M |
| #64 | Qwen3.6 35B A3B (Reasoning) | Alibaba | 84.1% | 191.8 tok/s | $0.557/M |
| #65 | Claude Opus 4.6 (Non-reasoning, High Effort) | Anthropic | 84.0% | 42 tok/s | $10.00/M |
| #66 | DeepSeek V3.2 (Reasoning) | DeepSeek | 84.0% | n/a | $0.315/M |
| #67 | GLM-5.1 (Non-reasoning) | Z AI | 83.9% | 41.5 tok/s | $2.15/M |
| #68 | Kimi K2 Thinking | Kimi | 83.8% | 99 tok/s | $1.08/M |
| #69 | GPT-5 Codex (high) | OpenAI | 83.7% | 166.8 tok/s | $3.44/M |
| #70 | Gemini 2.5 Pro Preview (Mar' 25) | Google | 83.6% | n/a | - |
| #71 | MiMo-V2-Flash (Feb 2026) | Xiaomi | 83.5% | 120.6 tok/s | $0.150/M |
| #72 | Claude 4.5 Sonnet (Reasoning) | Anthropic | 83.4% | 43.8 tok/s | $6.00/M |
| #73 | Step 3.5 Flash | StepFun | 83.1% | 123.6 tok/s | $0.150/M |
| #74 | MiniMax-M2.1 | MiniMax | 83.0% | 84.8 tok/s | $0.525/M |
| #75 | Qwen3.6 27B (Non-reasoning) | Alibaba | 82.9% | 60.5 tok/s | $1.35/M |
| #76 | GPT-5 mini (high) | OpenAI | 82.8% | 85.7 tok/s | $0.688/M |
| #77 | MiMo-V2-Omni | Xiaomi | 82.8% | n/a | - |
| #78 | o3 | OpenAI | 82.7% | 72.7 tok/s | $3.50/M |
| #79 | Qwen3.5 122B A10B (Non-reasoning) | Alibaba | 82.7% | 131.5 tok/s | $1.10/M |
| #80 | Qwen3.5 Omni Plus | Alibaba | 82.6% | 56 tok/s | $1.50/M |
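
For readers who want to reproduce an ordering like the one in this leaderboard from their own copy of the data, a minimal sketch follows. It sorts by GPQA score in descending order and breaks ties by case-insensitive model name. The tie-break rule is an assumption: it is consistent with the tied rows shown here (for example, the two models at 91.1%), but the page does not state it, and the displayed scores are rounded. The `Entry` type and `rank` function are hypothetical names for this example only.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    model: str
    creator: str
    gpqa: float  # GPQA score in percent, as displayed in the table


def rank(entries: List[Entry]) -> List[Tuple[int, Entry]]:
    """Order entries by score (highest first) and number them from 1.

    Ties are broken by case-insensitive model name. This is an assumed rule,
    not one confirmed by the source page.
    """
    ordered = sorted(entries, key=lambda e: (-e.gpqa, e.model.casefold()))
    return list(enumerate(ordered, start=1))


# A small sample of rows copied from the table, deliberately out of order.
rows = [
    Entry("Kimi K2.6", "Kimi", 91.1),
    Entry("GPT-5.5 (xhigh)", "OpenAI", 93.5),
    Entry("Grok 4.20 0309 v2 (Reasoning)", "xAI", 91.1),
    Entry("Gemini 3.1 Pro Preview", "Google", 94.1),
]

ranked = rank(rows)
for position, entry in ranked:
    print(f"#{position} {entry.model} ({entry.creator}) {entry.gpqa:.1f}%")
```

The positions printed by this sample are relative to the four sample rows, not to the full leaderboard.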