Easy Benchmarks

Humanity's Last Exam

Score on a difficult benchmark of broad knowledge and reasoning.

Humanity's Last Exam is a broad expert benchmark from CAIS and Scale AI. The public project describes 2,500 difficult questions across many subjects, with closed-ended answers for automatic grading and held-out questions to monitor overfitting.

Test type: Closed-ended expert reasoning and knowledge benchmark with automatic grading.

Coverage

500 models have this metric.

Top score: 53.3%

Current leader: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)

Project links

This app ranks the HLE score exposed by the Artificial Analysis snapshot.

Official website | GitHub

Top HLE Models

Top models ranked by HLE.

Leaderboard

Rank | Model | Creator | HLE Score | Speed | Blended Price
#1 | Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) | Anthropic | 53.3% | n/a | $20.00/M
#2 | Claude Opus 4.8 (Adaptive Reasoning, Max Effort) | Anthropic | 45.7% | 67.8 tok/s | $10.00/M
#3 | Gemini 3.1 Pro Preview | Google | 44.7% | 124.7 tok/s | $4.50/M
#4 | GPT-5.5 (xhigh) | OpenAI | 44.3% | 69 tok/s | $11.25/M
#5 | GPT-5.5 (high) | OpenAI | 43.0% | 61.6 tok/s | $11.25/M
#6 | GPT-5.4 (xhigh) | OpenAI | 41.6% | 75.5 tok/s | $5.63/M
#7 | Gemini 3.5 Flash (high) | Google | 41.0% | 203.3 tok/s | $3.38/M
#8 | GPT-5.5 (medium) | OpenAI | 40.6% | 58.7 tok/s | $11.25/M
#9 | Gemini 3.5 Flash (medium) | Google | 39.9% | 210.1 tok/s | $3.38/M
#10 | GPT-5.3 Codex (xhigh) | OpenAI | 39.9% | 84.5 tok/s | $4.81/M
#11 | Muse Spark | Meta | 39.9% | n/a | -
#12 | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | Anthropic | 39.6% | 53.8 tok/s | $10.00/M
#13 | Qwen3.7 Max | Alibaba | 38.1% | 186.5 tok/s | $3.75/M
#14 | Gemini 3 Pro Preview (high) | Google | 37.2% | n/a | $4.50/M
#15 | MiniMax-M3 | MiniMax | 37.1% | 45.6 tok/s | $0.525/M
#16 | Claude Opus 4.6 (Adaptive Reasoning, Max Effort) | Anthropic | 36.7% | 47.3 tok/s | $10.94/M
#17 | DeepSeek V4 Pro (Reasoning, Max Effort) | DeepSeek | 35.9% | 61.6 tok/s | $0.544/M
#18 | Kimi K2.6 | Kimi | 35.9% | 41.6 tok/s | $1.71/M
#19 | GPT-5.2 (xhigh) | OpenAI | 35.4% | 71 tok/s | $4.81/M
#20 | Grok 4.3 (high) | xAI | 35.0% | 159.7 tok/s | $1.56/M
#21 | Gemini 3 Flash Preview (Reasoning) | Google | 34.7% | 172.8 tok/s | $1.13/M
#22 | MiMo-V2.5-Pro | Xiaomi | 33.8% | 43.3 tok/s | $0.544/M
#23 | DeepSeek V4 Pro (Reasoning, High Effort) | DeepSeek | 33.5% | 65.7 tok/s | $0.544/M
#24 | GPT-5.2 Codex (xhigh) | OpenAI | 33.5% | 105.3 tok/s | $4.81/M
#25 | KAT-Coder-Pro V1 | KwaiKAT | 33.4% | 114.7 tok/s | $0.525/M
#26 | Qwen3.7 Plus | Alibaba | 33.4% | 53.6 tok/s | $0.590/M
#27 | Grok 4.20 0309 v2 (Reasoning) | xAI | 32.2% | 168.7 tok/s | $3.00/M
#28 | DeepSeek V4 Flash (Reasoning, Max Effort) | DeepSeek | 32.1% | 98.3 tok/s | $0.175/M
#29 | Claude Opus 4.7 (Non-reasoning, High Effort) | Anthropic | 31.2% | 46 tok/s | $10.00/M
#30 | GPT-5.5 (low) | OpenAI | 31.0% | 66.4 tok/s | $11.25/M
#31 | Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | Anthropic | 30.0% | 63.2 tok/s | $6.00/M
#32 | Grok 4.20 0309 (Reasoning) | xAI | 30.0% | 166.5 tok/s | $3.00/M
#33 | Kimi K2.5 (Reasoning) | Kimi | 29.4% | 31.7 tok/s | $1.19/M
#34 | GPT-5.4 (low) | OpenAI | 28.9% | 63.6 tok/s | $5.63/M
#35 | Qwen3.6 Max Preview | Alibaba | 28.9% | 40.9 tok/s | $2.93/M
#36 | Claude Opus 4.5 (Reasoning) | Anthropic | 28.4% | 53.5 tok/s | $10.94/M
#37 | MiMo-V2-Pro | Xiaomi | 28.3% | 42.5 tok/s | $1.50/M
#38 | Grok 4.3 (medium) | xAI | 28.1% | 136.9 tok/s | $1.56/M
#39 | MiniMax-M2.7 | MiniMax | 28.1% | 75 tok/s | $0.525/M
#40 | GLM-5.1 (Reasoning) | Z AI | 28.0% | 46.8 tok/s | $2.15/M
#41 | DeepSeek V4 Flash (Reasoning, High Effort) | DeepSeek | 27.8% | n/a | $0.175/M
#42 | Gemini 3 Pro Preview (low) | Google | 27.6% | n/a | $4.50/M
#43 | Qwen3.5 397B A17B (Reasoning) | Alibaba | 27.3% | 51.8 tok/s | $1.35/M
#44 | GLM-5 (Reasoning) | Z AI | 27.2% | 79.5 tok/s | $1.55/M
#45 | GPT-5.4 mini (xhigh) | OpenAI | 26.6% | 178.8 tok/s | $1.69/M
#46 | GPT-5 (high) | OpenAI | 26.5% | 111.1 tok/s | $3.44/M
#47 | GPT-5.1 (high) | OpenAI | 26.5% | 121.2 tok/s | $3.44/M
#48 | GPT-5.4 nano (xhigh) | OpenAI | 26.5% | 147.6 tok/s | $0.463/M
#49 | Qwen3 Max Thinking | Alibaba | 26.2% | n/a | $2.40/M
#50 | DeepSeek V3.2 Speciale | DeepSeek | 26.1% | n/a | -
#51 | Qwen3.6 Plus | Alibaba | 25.7% | 52.8 tok/s | $1.13/M
#52 | GLM-5.1 (Non-reasoning) | Z AI | 25.6% | 45.6 tok/s | $2.15/M
#53 | GPT-5 Codex (high) | OpenAI | 25.6% | 171.1 tok/s | $3.44/M
#54 | TEHy3-preview (Reasoning) | Tencent | 25.5% | 96 tok/s | $0.200/M
#55 | GLM-5-Turbo | Z AI | 25.4% | n/a | -
#56 | MiMo-V2.5 | Xiaomi | 25.2% | 77.4 tok/s | $0.175/M
#57 | GLM-4.7 (Reasoning) | Z AI | 25.1% | 79.2 tok/s | $1.00/M
#58 | GPT-5.2 (medium) | OpenAI | 24.9% | n/a | $4.81/M
#59 | Grok 4.20 0309 v2 (Non-reasoning) | xAI | 24.2% | 160.7 tok/s | $3.00/M
#60 | Grok 4 | xAI | 23.9% | n/a | $11.00/M
#61 | GPT-5 (medium) | OpenAI | 23.5% | 85.6 tok/s | $3.44/M
#62 | GPT-5.1 Codex (high) | OpenAI | 23.4% | 182.1 tok/s | $3.44/M
#63 | Qwen3.5 122B A10B (Reasoning) | Alibaba | 23.4% | 143.6 tok/s | $1.10/M
#64 | Gemini 3.5 Flash (minimal) | Google | 23.1% | 202.7 tok/s | $3.38/M
#65 | Gemma 4 31B (Reasoning) | Google | 22.7% | 34.8 tok/s | -
#66 | Step 3.5 Flash 2603 | StepFun | 22.6% | 231 tok/s | $0.150/M
#67 | Grok 4.20 0309 (Non-reasoning) | xAI | 22.5% | 158.9 tok/s | $3.00/M
#68 | Kimi K2 Thinking | Kimi | 22.3% | 131.1 tok/s | $1.08/M
#69 | DeepSeek V3.2 (Reasoning) | DeepSeek | 22.2% | n/a | $0.337/M
#70 | MiniMax-M2.1 | MiniMax | 22.2% | 184.6 tok/s | $0.525/M
#71 | Qwen3.5 27B (Reasoning) | Alibaba | 22.2% | 82.8 tok/s | $0.825/M
#72 | Qwen3.6 27B (Reasoning) | Alibaba | 21.6% | 54.7 tok/s | $1.35/M
#73 | Gemini 2.5 Pro | Google | 21.1% | 132 tok/s | $3.44/M
#74 | MiMo-V2-Flash (Reasoning) | Xiaomi | 21.1% | 129.5 tok/s | $0.150/M
#75 | MiMo-V2-Omni-0327 | Xiaomi | 20.4% | 85.6 tok/s | $0.800/M
#76 | GPT-5.5 Instant (May 2026) | OpenAI | 20.3% | n/a | $11.25/M
#77 | Qwen3.6 35B A3B (Reasoning) | Alibaba | 20.2% | 159.9 tok/s | $0.557/M
#78 | MiMo-V2-Flash (Feb 2026) | Xiaomi | 20.0% | 124.9 tok/s | $0.150/M
#79 | o3 | OpenAI | 20.0% | 122.3 tok/s | $3.50/M
#80 | MiMo-V2-Omni | Xiaomi | 19.9% | 81.5 tok/s | -
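
The leaderboard lists both an HLE score and a blended price, so one way to read the two columns together is to divide score by price. The Python sketch below does that for a handful of rows copied from the leaderboard. The metric name `points_per_dollar`, the `Row` type, and the decision to skip models with no listed price are illustrative assumptions; this is not how the site computes its own Value ranking, which the page does not describe.

```python
from typing import List, NamedTuple, Optional


class Row(NamedTuple):
    """One leaderboard entry (hypothetical structure for this sketch)."""
    model: str
    hle_score: float              # HLE score in percent
    price_per_m: Optional[float]  # blended price in USD per "M"; None if listed as "-"


# A few rows copied from the leaderboard above.
ROWS = [
    Row("Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)", 53.3, 20.00),
    Row("Gemini 3.1 Pro Preview", 44.7, 4.50),
    Row("GPT-5.5 (xhigh)", 44.3, 11.25),
    Row("Muse Spark", 39.9, None),
    Row("MiniMax-M3", 37.1, 0.525),
    Row("DeepSeek V4 Flash (Reasoning, Max Effort)", 32.1, 0.175),
]


def points_per_dollar(row: Row) -> Optional[float]:
    """Assumed metric: HLE percentage points per blended-price dollar."""
    if row.price_per_m is None:
        return None
    return row.hle_score / row.price_per_m


def rank_by_value(rows: List[Row]) -> List[Row]:
    """Sort priced rows by the assumed metric, best first; unpriced rows are dropped."""
    priced = [r for r in rows if r.price_per_m is not None]
    return sorted(priced, key=lambda r: r.hle_score / r.price_per_m, reverse=True)


if __name__ == "__main__":
    for r in rank_by_value(ROWS):
        print(f"{r.model}: {points_per_dollar(r):.1f} points per dollar")
```

On these rows the cheapest models come out on top even though they sit lower on the raw HLE ranking, which is why a ratio like this is better read as a complement to the score than as a replacement for it.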