MATH-500

Mathematical problem-solving benchmark score.

MATH-500 is a 500-problem held-out subset of the MATH benchmark, taken from the split used in OpenAI's "Let's Verify Step by Step" work. It is widely used to test mathematical problem solving where only the final answer is checked.

Test type: Competition math word problems, usually scored by normalized final-answer matching.
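
As a rough illustration of what "normalized final-answer matching" can look like, the sketch below extracts the last `\boxed{...}` answer from a model's output, applies a few light LaTeX normalizations, and compares the result to the reference string. The function names and the specific normalization rules are assumptions chosen for illustration. They are not the scoring code used by Artificial Analysis or by this app, and real graders often add symbolic equivalence checks on top of string matching.

```python
import re
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # unbalanced braces


def normalize_answer(ans: str) -> str:
    """Apply a small, illustrative set of normalizations (an assumption, not a standard)."""
    ans = ans.strip().strip("$")
    # Drop sizing commands such as \left( and \right) but keep e.g. \rightarrow.
    ans = re.sub(r"\\(left|right)(?![a-zA-Z])", "", ans)
    # Treat display-style and text-style fractions as plain \frac.
    ans = ans.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")
    # Remove thin-space commands and unwrap \text{...}.
    ans = ans.replace("\\!", "").replace("\\,", "")
    ans = re.sub(r"\\text\{([^{}]*)\}", r"\1", ans)
    # Remove all whitespace and a trailing period.
    ans = re.sub(r"\s+", "", ans)
    return ans.rstrip(".")


def is_correct(model_output: str, reference: str) -> bool:
    """True if the normalized boxed answer equals the normalized reference."""
    predicted = extract_boxed(model_output)
    if predicted is None:
        return False
    return normalize_answer(predicted) == normalize_answer(reference)


def accuracy(outputs, references) -> float:
    """Fraction of problems answered correctly."""
    correct = sum(is_correct(o, r) for o, r in zip(outputs, references))
    return correct / len(references)
```

Under single-pass scoring of this kind, 497 correct answers out of 500 would give a score of 99.4%. Whether the scores listed on this page come from a single pass or from an average over repeated runs is not stated here.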

Coverage

201 models have this metric.

Top score: 99.4%

Current leader: GPT-5 (high)

Project links

This app ranks the MATH-500 score exposed by the Artificial Analysis snapshot.

Dataset: OpenAI split

Top MATH-500 Models

Top models ranked by MATH-500.

Leaderboard

| Rank | Model | Creator | MATH-500 score | Speed | Blended price |
|---|---|---|---|---|---|
| #1 | GPT-5 (high) | OpenAI | 99.4% | 84.2 tok/s | $3.44/M |
| #2 | Grok 3 mini Reasoning (high) | xAI | 99.2% | 215.5 tok/s | $0.350/M |
| #3 | o3 | OpenAI | 99.2% | 72.7 tok/s | $3.50/M |
| #4 | Claude 4 Sonnet (Reasoning) | Anthropic | 99.1% | 50.3 tok/s | $6.00/M |
| #5 | GPT-5 (medium) | OpenAI | 99.1% | 82.3 tok/s | $3.44/M |
| #6 | Grok 4 | xAI | 99.0% | 50.3 tok/s | $6.00/M |
| #7 | o4-mini (high) | OpenAI | 98.9% | 124.5 tok/s | $1.93/M |
| #8 | GPT-5 (low) | OpenAI | 98.7% | 65.8 tok/s | $3.44/M |
| #9 | Gemini 2.5 Pro Preview (May' 25) | Google | 98.6% | n/a | $3.44/M |
| #10 | o3-mini (high) | OpenAI | 98.5% | 140 tok/s | $1.93/M |
| #11 | Qwen3 235B A22B 2507 (Reasoning) | Alibaba | 98.4% | 56 tok/s | $2.63/M |
| #12 | DeepSeek R1 0528 (May '25) | DeepSeek | 98.3% | n/a | $2.36/M |
| #13 | Llama Nemotron Super 49B v1.5 (Reasoning) | NVIDIA | 98.3% | 50.8 tok/s | $0.175/M |
| #14 | Claude 4 Opus (Reasoning) | Anthropic | 98.2% | 36.8 tok/s | $30.00/M |
| #15 | Gemini 2.5 Flash (Reasoning) | Google | 98.1% | 199.6 tok/s | $0.850/M |
| #16 | Gemini 2.5 Flash Preview (Reasoning) | Google | 98.1% | n/a | - |
| #17 | Gemini 2.5 Pro Preview (Mar' 25) | Google | 98.0% | n/a | - |
| #18 | MiniMax M1 80k | MiniMax | 98.0% | n/a | $0.963/M |
| #19 | Qwen3 235B A22B 2507 Instruct | Alibaba | 98.0% | 64.7 tok/s | $1.23/M |
| #20 | GLM-4.5 (Reasoning) | Z AI | 97.9% | 46.4 tok/s | $1.00/M |
| #21 | EXAONE 4.0 32B (Reasoning) | LG AI Research | 97.7% | n/a | - |
| #22 | Qwen3 30B A3B 2507 (Reasoning) | Alibaba | 97.6% | 143.2 tok/s | $0.750/M |
| #23 | Qwen3 30B A3B 2507 Instruct | Alibaba | 97.5% | 97.9 tok/s | $0.350/M |
| #24 | o3-mini | OpenAI | 97.3% | 140.1 tok/s | $1.93/M |
| #25 | MiniMax M1 40k | MiniMax | 97.2% | n/a | - |
| #26 | Kimi K2 | Kimi | 97.1% | 33 tok/s | $1.04/M |
| #27 | o1 | OpenAI | 97.0% | 103.3 tok/s | $26.25/M |
| #28 | Gemini 2.5 Flash-Lite (Reasoning) | Google | 96.9% | 243.6 tok/s | $0.175/M |
| #29 | Gemini 2.5 Pro | Google | 96.7% | 120.2 tok/s | $3.44/M |
| #30 | Solar Pro 2 (Reasoning) | Upstage | 96.7% | n/a | - |
| #31 | DeepSeek R1 (Jan '25) | DeepSeek | 96.6% | n/a | $2.36/M |
| #32 | GLM-4.5-Air | Z AI | 96.5% | 72.9 tok/s | $0.372/M |
| #33 | Magistral Small 1 | Mistral | 96.3% | n/a | - |
| #34 | Qwen3 14B (Reasoning) | Alibaba | 96.1% | 63 tok/s | $1.31/M |
| #35 | Qwen3 32B (Reasoning) | Alibaba | 96.1% | 91.4 tok/s | $2.63/M |
| #36 | Llama 3.3 Nemotron Super 49B v1 (Reasoning) | NVIDIA | 95.9% | n/a | - |
| #37 | Qwen3 30B A3B (Reasoning) | Alibaba | 95.9% | 77.8 tok/s | $0.750/M |
| #38 | QwQ 32B | Alibaba | 95.7% | 30.4 tok/s | $0.745/M |
| #39 | Sonar Reasoning Pro | Perplexity | 95.7% | n/a | - |
| #40 | R1 1776 | Perplexity | 95.4% | n/a | - |
| #41 | Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) | NVIDIA | 95.2% | 41 tok/s | $0.900/M |
| #42 | DeepSeek R1 Distill Qwen 14B | DeepSeek | 94.9% | n/a | - |
| #43 | Claude 3.7 Sonnet (Reasoning) | Anthropic | 94.7% | n/a | $6.00/M |
| #44 | Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning) | NVIDIA | 94.7% | n/a | - |
| #45 | Gemini 2.0 Flash Thinking Experimental (Jan '25) | Google | 94.4% | n/a | - |
| #46 | o1-mini | OpenAI | 94.4% | n/a | - |
| #47 | DeepSeek V3 0324 | DeepSeek | 94.2% | n/a | $1.25/M |
| #48 | Qwen3 Coder 480B A35B Instruct | Alibaba | 94.2% | 66.1 tok/s | $3.00/M |
| #49 | Claude 4 Opus (Non-reasoning) | Anthropic | 94.1% | 36.6 tok/s | $30.00/M |
| #50 | DeepSeek R1 Distill Qwen 32B | DeepSeek | 94.1% | n/a | - |
| #51 | EXAONE 4.0 32B (Non-reasoning) | LG AI Research | 93.9% | n/a | - |
| #52 | DeepSeek R1 Distill Llama 70B | DeepSeek | 93.5% | 44 tok/s | $0.875/M |
| #53 | Claude 4 Sonnet (Non-reasoning) | Anthropic | 93.4% | 47.6 tok/s | $6.00/M |
| #54 | Qwen3 4B (Reasoning) | Alibaba | 93.3% | 101.8 tok/s | $0.398/M |
| #55 | DeepSeek R1 0528 Qwen3 8B | DeepSeek | 93.2% | n/a | - |
| #56 | Gemini 2.5 Flash (Non-reasoning) | Google | 93.2% | 189.1 tok/s | $0.850/M |
| #57 | ERNIE 4.5 300B A47B | Baidu | 93.1% | 22.7 tok/s | $0.485/M |
| #58 | Gemini 2.0 Flash (Feb '25) | Google | 93.0% | n/a | $0.263/M |
| #59 | Qwen3 235B A22B (Reasoning) | Alibaba | 93.0% | 61.4 tok/s | $2.63/M |
| #60 | Gemini 2.5 Flash Preview (Non-reasoning) | Google | 92.6% | n/a | - |
| #61 | Gemini 2.5 Flash-Lite (Non-reasoning) | Google | 92.6% | 239.9 tok/s | $0.175/M |
| #62 | GPT-4.1 mini | OpenAI | 92.5% | 78.4 tok/s | $0.700/M |
| #63 | o1-preview | OpenAI | 92.4% | n/a | $28.88/M |
| #64 | Gemini 2.0 Pro Experimental (Feb '25) | Google | 92.3% | n/a | - |
| #65 | Sonar Reasoning | Perplexity | 92.1% | n/a | - |
| #66 | Magistral Medium 1 | Mistral | 91.7% | n/a | - |
| #67 | GPT-4.1 | OpenAI | 91.3% | 86.4 tok/s | $3.50/M |
| #68 | Gemini 2.0 Flash (experimental) | Google | 91.1% | n/a | - |
| #69 | QwQ 32B-Preview | Alibaba | 91.0% | n/a | - |
| #70 | Mistral Medium 3 | Mistral | 90.7% | 56.8 tok/s | $0.800/M |
| #71 | Qwen3 8B (Reasoning) | Alibaba | 90.4% | 87.9 tok/s | $0.660/M |
| #72 | Qwen3 235B A22B (Non-reasoning) | Alibaba | 90.2% | 61.1 tok/s | $1.23/M |
| #73 | Solar Pro 2 (Preview) (Reasoning) | Upstage | 90.0% | n/a | - |
| #74 | Qwen3 1.7B (Reasoning) | Alibaba | 89.4% | 136.7 tok/s | $0.398/M |
| #75 | GPT-4o (March 2025, chatgpt-4o-latest) | OpenAI | 89.3% | n/a | - |
| #76 | Qwen3 Coder 30B A3B Instruct | Alibaba | 89.3% | 110.3 tok/s | $0.900/M |
| #77 | Reka Flash 3 | Reka AI | 89.3% | 90.6 tok/s | $0.350/M |
| #78 | Llama 4 Maverick | Meta | 88.9% | 115.2 tok/s | $0.475/M |
| #79 | Solar Pro 2 (Non-reasoning) | Upstage | 88.9% | n/a | - |
| #80 | DeepSeek V3 (Dec '24) | DeepSeek | 88.7% | n/a | $0.625/M |
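
The ordering in the leaderboard above can be reproduced with a simple sort. The sketch below ranks entries by MATH-500 score, highest first, and breaks ties alphabetically by model name. The alphabetical tie-break is an assumption inferred from the order of tied rows in the leaderboard; the app's actual rule is not documented here and may instead depend on unrounded scores. The `Entry` type and `rank` function are hypothetical names, and the four sample rows are copied from the leaderboard.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    model: str
    creator: str
    score: float  # MATH-500 score in percent, as displayed


def rank(entries: List[Entry]) -> List[Tuple[int, Entry]]:
    """Sort by score descending, then by model name (case-insensitive, assumed tie-break)."""
    ordered = sorted(entries, key=lambda e: (-e.score, e.model.lower()))
    return [(position, entry) for position, entry in enumerate(ordered, start=1)]


# Four rows from the leaderboard, deliberately shuffled.
rows = [
    Entry("o3", "OpenAI", 99.2),
    Entry("GPT-5 (high)", "OpenAI", 99.4),
    Entry("Claude 4 Sonnet (Reasoning)", "Anthropic", 99.1),
    Entry("Grok 3 mini Reasoning (high)", "xAI", 99.2),
]

ranked = rank(rows)
for position, entry in ranked:
    print(f"#{position} {entry.model} ({entry.creator}) {entry.score:.1f}%")
```

With these four rows the sketch yields the same order as ranks #1 to #4 in the leaderboard.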