Easy Benchmarks: LLM model index

Humanity's Last Exam

Score on a difficult benchmark of broad knowledge and reasoning.

Humanity's Last Exam is a broad expert benchmark from CAIS and Scale AI. The public project describes 2,500 difficult questions across many subjects, with closed-ended answers for automatic grading and held-out questions to monitor overfitting.

Test type: Closed-ended expert reasoning and knowledge benchmark with automatic grading.
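As a rough illustration of what automatic grading of closed-ended answers can look like, here is a minimal sketch that scores predictions by normalized exact match and reports accuracy as a percentage. This is an assumption for illustration only: the function names `normalize` and `accuracy_percent` are hypothetical, and the grader actually used by the benchmark may work differently.

```python
from typing import Sequence


def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse internal whitespace (assumed normalization)."""
    return " ".join(answer.strip().lower().split())


def accuracy_percent(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the share of predictions that match their reference, in percent.

    Hypothetical helper: exact match after normalization is an assumption,
    not the benchmark's documented grading rule.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)
```

A score such as 44.7% would then simply be the fraction of questions answered correctly, expressed as a percentage.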

Coverage

474 models have this metric.

Top score: 44.7%

Current leader: Gemini 3.1 Pro Preview

Project links

This app ranks the HLE score exposed by the Artificial Analysis snapshot.

Official website | GitHub

Top HLE Models

Top models ranked by HLE.

Leaderboard

Rank | Model | Creator | Value (HLE score) | Speed | Blended Price
#1 | Gemini 3.1 Pro Preview | Google | 44.7% | 131.2 tok/s | $4.50/M
#2 | GPT-5.5 (xhigh) | OpenAI | 44.3% | 66.1 tok/s | $11.25/M
#3 | GPT-5.5 (high) | OpenAI | 43.0% | 59.3 tok/s | $11.25/M
#4 | GPT-5.4 (xhigh) | OpenAI | 41.6% | 93.5 tok/s | $5.63/M
#5 | GPT-5.5 (medium) | OpenAI | 40.6% | 57.5 tok/s | $11.25/M
#6 | GPT-5.3 Codex (xhigh) | OpenAI | 39.9% | 87.1 tok/s | $4.81/M
#7 | Muse Spark | Meta | 39.9% | n/a | -
#8 | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | Anthropic | 39.6% | 51.8 tok/s | $10.00/M
#9 | Gemini 3 Pro Preview (high) | Google | 37.2% | 128.7 tok/s | $4.50/M
#10 | Claude Opus 4.6 (Adaptive Reasoning, Max Effort) | Anthropic | 36.7% | 49.9 tok/s | $10.00/M
#11 | DeepSeek V4 Pro (Reasoning, Max Effort) | DeepSeek | 35.9% | 34.3 tok/s | $2.18/M
#12 | Kimi K2.6 | Kimi | 35.9% | 29.1 tok/s | $1.71/M
#13 | GPT-5.2 (xhigh) | OpenAI | 35.4% | 71.8 tok/s | $4.81/M
#14 | Gemini 3 Flash Preview (Reasoning) | Google | 34.7% | 193.2 tok/s | $1.13/M
#15 | MiMo-V2.5-Pro | Xiaomi | 33.8% | 59.9 tok/s | $1.50/M
#16 | DeepSeek V4 Pro (Reasoning, High Effort) | DeepSeek | 33.5% | 32.9 tok/s | $2.18/M
#17 | GPT-5.2 Codex (xhigh) | OpenAI | 33.5% | 87.7 tok/s | $4.81/M
#18 | KAT-Coder-Pro V1 | KwaiKAT | 33.4% | 117.1 tok/s | $0.525/M
#19 | Grok 4.20 0309 v2 (Reasoning) | xAI | 32.2% | 89.3 tok/s | $3.00/M
#20 | DeepSeek V4 Flash (Reasoning, Max Effort) | DeepSeek | 32.1% | 77.4 tok/s | $0.175/M
#21 | Claude Opus 4.7 (Non-reasoning, High Effort) | Anthropic | 31.2% | 43 tok/s | $10.00/M
#22 | GPT-5.5 (low) | OpenAI | 31.0% | 56.8 tok/s | $11.25/M
#23 | Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | Anthropic | 30.0% | 68 tok/s | $6.00/M
#24 | Grok 4.20 0309 (Reasoning) | xAI | 30.0% | 87.8 tok/s | $3.00/M
#25 | Kimi K2.5 (Reasoning) | Kimi | 29.4% | 31.6 tok/s | $1.20/M
#26 | GPT-5.4 (low) | OpenAI | 28.9% | 59.1 tok/s | $5.63/M
#27 | Qwen3.6 Max Preview | Alibaba | 28.9% | 33.2 tok/s | $2.93/M
#28 | Claude Opus 4.5 (Reasoning) | Anthropic | 28.4% | 57 tok/s | $10.00/M
#29 | MiMo-V2-Pro | Xiaomi | 28.3% | n/a | -
#30 | MiniMax-M2.7 | MiniMax | 28.1% | 43.9 tok/s | $0.525/M
#31 | GLM-5.1 (Reasoning) | Z AI | 28.0% | 45.7 tok/s | $2.15/M
#32 | DeepSeek V4 Flash (Reasoning, High Effort) | DeepSeek | 27.8% | n/a | $0.175/M
#33 | Gemini 3 Pro Preview (low) | Google | 27.6% | n/a | $4.50/M
#34 | Qwen3.5 397B A17B (Reasoning) | Alibaba | 27.3% | 50.4 tok/s | $1.35/M
#35 | GLM-5 (Reasoning) | Z AI | 27.2% | 64.5 tok/s | $1.55/M
#36 | GPT-5.4 mini (xhigh) | OpenAI | 26.6% | 158.9 tok/s | $1.69/M
#37 | GPT-5 (high) | OpenAI | 26.5% | 84.2 tok/s | $3.44/M
#38 | GPT-5.1 (high) | OpenAI | 26.5% | 123.3 tok/s | $3.44/M
#39 | GPT-5.4 nano (xhigh) | OpenAI | 26.5% | 160.3 tok/s | $0.463/M
#40 | Qwen3 Max Thinking | Alibaba | 26.2% | 34.3 tok/s | $2.40/M
#41 | DeepSeek V3.2 Speciale | DeepSeek | 26.1% | n/a | -
#42 | Qwen3.6 Plus | Alibaba | 25.7% | 53.1 tok/s | $1.13/M
#43 | GLM-5.1 (Non-reasoning) | Z AI | 25.6% | 41.5 tok/s | $2.15/M
#44 | GPT-5 Codex (high) | OpenAI | 25.6% | 166.8 tok/s | $3.44/M
#45 | TEHy3-preview (Reasoning) | Tencent | 25.5% | 86.4 tok/s | -
#46 | GLM-5-Turbo | Z AI | 25.4% | n/a | -
#47 | MiMo-V2.5 | Xiaomi | 25.2% | n/a | -
#48 | GLM-4.7 (Reasoning) | Z AI | 25.1% | 90.3 tok/s | $1.00/M
#49 | GPT-5.2 (medium) | OpenAI | 24.9% | n/a | $4.81/M
#50 | Grok 4.20 0309 v2 (Non-reasoning) | xAI | 24.2% | 86.6 tok/s | $3.00/M
#51 | Grok 4 | xAI | 23.9% | 50.3 tok/s | $6.00/M
#52 | GPT-5 (medium) | OpenAI | 23.5% | 82.3 tok/s | $3.44/M
#53 | GPT-5.1 Codex (high) | OpenAI | 23.4% | 162.7 tok/s | $3.44/M
#54 | Qwen3.5 122B A10B (Reasoning) | Alibaba | 23.4% | 139.9 tok/s | $1.10/M
#55 | Gemma 4 31B (Reasoning) | Google | 22.7% | 34.8 tok/s | -
#56 | Step 3.5 Flash 2603 | StepFun | 22.6% | 132.3 tok/s | -
#57 | Grok 4.20 0309 (Non-reasoning) | xAI | 22.5% | 77.1 tok/s | $3.00/M
#58 | Kimi K2 Thinking | Kimi | 22.3% | 99 tok/s | $1.08/M
#59 | DeepSeek V3.2 (Reasoning) | DeepSeek | 22.2% | n/a | $0.315/M
#60 | MiniMax-M2.1 | MiniMax | 22.2% | 84.8 tok/s | $0.525/M
#61 | Qwen3.5 27B (Reasoning) | Alibaba | 22.2% | 87 tok/s | $0.825/M
#62 | Qwen3.6 27B (Reasoning) | Alibaba | 21.6% | 64.1 tok/s | $1.35/M
#63 | Gemini 2.5 Pro | Google | 21.1% | 120.2 tok/s | $3.44/M
#64 | MiMo-V2-Flash (Reasoning) | Xiaomi | 21.1% | 118.8 tok/s | $0.150/M
#65 | MiMo-V2-Omni-0327 | Xiaomi | 20.4% | n/a | -
#66 | Qwen3.6 35B A3B (Reasoning) | Alibaba | 20.2% | 191.8 tok/s | $0.557/M
#67 | MiMo-V2-Flash (Feb 2026) | Xiaomi | 20.0% | 120.6 tok/s | $0.150/M
#68 | o3 | OpenAI | 20.0% | 72.7 tok/s | $3.50/M
#69 | MiMo-V2-Omni | Xiaomi | 19.9% | n/a | -
#70 | GPT-5 mini (high) | OpenAI | 19.7% | 85.7 tok/s | $0.688/M
#71 | Qwen3.5 35B A3B (Reasoning) | Alibaba | 19.7% | 137.7 tok/s | $0.688/M
#72 | NVIDIA Nemotron 3 Super 120B A12B (Reasoning) | NVIDIA | 19.2% | 162.5 tok/s | $0.412/M
#73 | MiniMax-M2.5 | MiniMax | 19.1% | 79.7 tok/s | $0.525/M
#74 | Step 3.5 Flash | StepFun | 19.1% | 123.6 tok/s | $0.150/M
#75 | Qwen3.5 397B A17B (Non-reasoning) | Alibaba | 18.8% | 52.5 tok/s | $1.35/M
#76 | Claude Opus 4.6 (Non-reasoning, High Effort) | Anthropic | 18.6% | 42 tok/s | $10.00/M
#77 | gpt-oss-120B (high) | OpenAI | 18.5% | 212.3 tok/s | $0.263/M
#78 | GPT-5 (low) | OpenAI | 18.4% | 65.8 tok/s | $3.44/M
#79 | Gemma 4 26B A4B (Reasoning) | Google | 18.3% | n/a | $0.198/M
#80 | Kimi K2.6 (Non-reasoning) | Kimi | 18.2% | n/a | -
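
The ranking in the leaderboard above can be approximated with a short sketch: drop models that have no HLE score, sort the rest by score in descending order, and number them from 1. The `ModelScore` type and `rank_by_hle` function are hypothetical names introduced for illustration. How the app actually breaks ties (for example between two models at 39.9%) is not stated on the page, so this sketch simply assumes ties keep their input order.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ModelScore:
    """One leaderboard entry (hypothetical shape, not the app's real schema)."""
    model: str
    creator: str
    hle_percent: Optional[float]  # None when the snapshot has no HLE score


def rank_by_hle(rows: List[ModelScore]) -> List[Tuple[int, ModelScore]]:
    """Return (rank, entry) pairs ordered by HLE score, highest first.

    Entries without a score are skipped. Python's sort is stable, so tied
    scores keep their input order (an assumed tie-break rule).
    """
    scored = [r for r in rows if r.hle_percent is not None]
    ordered = sorted(scored, key=lambda r: r.hle_percent, reverse=True)
    return list(enumerate(ordered, start=1))


if __name__ == "__main__":
    sample = [
        ModelScore("GPT-5.3 Codex (xhigh)", "OpenAI", 39.9),
        ModelScore("GPT-5.5 (xhigh)", "OpenAI", 44.3),
        ModelScore("Muse Spark", "Meta", 39.9),
        ModelScore("Gemini 3.1 Pro Preview", "Google", 44.7),
    ]
    for rank, row in rank_by_hle(sample):
        print(f"#{rank} {row.model} ({row.creator}): {row.hle_percent:.1f}%")
```

The length of the `scored` list corresponds to the coverage figure (the number of models that have this metric), and the first entry of the result is the current leader.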