Mathematical problem-solving benchmark score.
MATH-500 is a 500-problem held-out subset of the MATH benchmark test split used in OpenAI's "Let's Verify Step by Step" work. It is widely used to test final-answer mathematical problem solving.
Test type: Competition math word problems, usually scored by normalized final-answer matching (see the sketch below).
201 models report this metric.
Current leader: GPT-5 (high)
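Scoring by normalized final-answer matching can be made concrete with a short sketch. The `normalize_answer` and `is_correct` helpers below are hypothetical and intentionally simplified; production graders for MATH-style benchmarks typically layer LaTeX parsing and numeric-equivalence checks on top of rules like these.

```python
import re

def normalize_answer(ans: str) -> str:
    """Canonicalize a final answer before exact-match comparison (illustrative rules only)."""
    ans = ans.strip()
    # Unwrap a \boxed{...} wrapper, common in MATH-style model outputs.
    m = re.fullmatch(r"\\boxed\{(.*)\}", ans)
    if m:
        ans = m.group(1)
    # Drop math delimiters, thousands separators, and trailing periods.
    ans = ans.strip("$ ").replace(",", "").rstrip(".")
    # Collapse whitespace so "1 / 2" and "1/2" compare equal.
    ans = re.sub(r"\s+", "", ans)
    return ans.lower()

def is_correct(predicted: str, reference: str) -> bool:
    return normalize_answer(predicted) == normalize_answer(reference)

assert is_correct(r"\boxed{1/2}", "1/2")
assert not is_correct("3.14", "22/7")
```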
This app ranks the MATH-500 score exposed by the Artificial Analysis snapshot.
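As a rough illustration of the ranking step, the sketch below sorts models by their MATH-500 score. The `ModelRow` fields and the sample entries are assumptions for illustration, not the actual Artificial Analysis snapshot schema.

```python
from dataclasses import dataclass

@dataclass
class ModelRow:
    name: str
    creator: str
    math_500: float | None             # fraction correct, e.g. 0.994 for 99.4%
    tokens_per_sec: float | None       # output speed
    blended_price_per_m: float | None  # blended USD price per million tokens

def rank_by_math500(rows: list[ModelRow]) -> list[ModelRow]:
    """Keep models that report MATH-500 and sort them best-first."""
    scored = [r for r in rows if r.math_500 is not None]
    return sorted(scored, key=lambda r: r.math_500, reverse=True)

# Hypothetical snapshot rows; the last entry is dropped because it lacks a score.
snapshot = [
    ModelRow("GPT-5 (high)", "OpenAI", 0.994, 84.2, 3.44),
    ModelRow("o3", "OpenAI", 0.992, 72.7, 3.50),
    ModelRow("Example model", "Example Lab", None, None, None),
]

for i, row in enumerate(rank_by_math500(snapshot), start=1):
    print(f"#{i} {row.name} ({row.creator}): {row.math_500:.1%}")
```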
Top models ranked by MATH-500.
| Rank | Model | Creator | MATH-500 | Speed | Blended Price |
|---|---|---|---|---|---|
| #1 | GPT-5 (high) | OpenAI | 99.4% | 84.2 tok/s | $3.44/M |
| #2 | | xAI | 99.2% | 215.5 tok/s | $0.350/M |
| #3 | o3 | OpenAI | 99.2% | 72.7 tok/s | $3.50/M |
| #4 | Claude 4 Sonnet (Reasoning) | Anthropic | 99.1% | 50.3 tok/s | $6.00/M |
| #5 | GPT-5 (medium) | OpenAI | 99.1% | 82.3 tok/s | $3.44/M |
| #6 | Grok 4 | xAI | 99.0% | 50.3 tok/s | $6.00/M |
| #7 | o4-mini (high) | OpenAI | 98.9% | 124.5 tok/s | $1.93/M |
| #8 | GPT-5 (low) | OpenAI | 98.7% | 65.8 tok/s | $3.44/M |
| #9 | Gemini 2.5 Pro Preview (May '25) | Google | 98.6% | n/a | $3.44/M |
| #10 | o3-mini (high) | OpenAI | 98.5% | 140 tok/s | $1.93/M |
| #11 | Qwen3 235B A22B 2507 (Reasoning) | Alibaba | 98.4% | 56 tok/s | $2.63/M |
| #12 | DeepSeek R1 0528 (May '25) | DeepSeek | 98.3% | n/a | $2.36/M |
| #13 | Llama Nemotron Super 49B v1.5 (Reasoning) | NVIDIA | 98.3% | 50.8 tok/s | $0.175/M |
| #14 | Claude 4 Opus (Reasoning) | Anthropic | 98.2% | 36.8 tok/s | $30.00/M |
| #15 | Gemini 2.5 Flash (Reasoning) | Google | 98.1% | 199.6 tok/s | $0.850/M |
| #16 | Gemini 2.5 Flash Preview (Reasoning) | Google | 98.1% | n/a | - |
| #17 | Gemini 2.5 Pro Preview (Mar '25) | Google | 98.0% | n/a | - |
| #18 | MiniMax M1 80k | MiniMax | 98.0% | n/a | $0.963/M |
| #19 | Qwen3 235B A22B 2507 Instruct | Alibaba | 98.0% | 64.7 tok/s | $1.23/M |
| #20 | GLM-4.5 (Reasoning) | Z AI | 97.9% | 46.4 tok/s | $1.00/M |
| #21 | EXAONE 4.0 32B (Reasoning) | LG AI Research | 97.7% | n/a | - |
| #22 | Qwen3 30B A3B 2507 (Reasoning) | Alibaba | 97.6% | 143.2 tok/s | $0.750/M |
| #23 | Qwen3 30B A3B 2507 Instruct | Alibaba | 97.5% | 97.9 tok/s | $0.350/M |
| #24 | o3-mini | OpenAI | 97.3% | 140.1 tok/s | $1.93/M |
| #25 | MiniMax M1 40k | MiniMax | 97.2% | n/a | - |
| #26 | Kimi K2 | Kimi | 97.1% | 33 tok/s | $1.04/M |
| #27 | o1 | OpenAI | 97.0% | 103.3 tok/s | $26.25/M |
| #28 | Gemini 2.5 Flash-Lite (Reasoning) | Google | 96.9% | 243.6 tok/s | $0.175/M |
| #29 | Gemini 2.5 Pro | Google | 96.7% | 120.2 tok/s | $3.44/M |
| #30 | Solar Pro 2 (Reasoning) | Upstage | 96.7% | n/a | - |
| #31 | DeepSeek R1 (Jan '25) | DeepSeek | 96.6% | n/a | $2.36/M |
| #32 | GLM-4.5-Air | Z AI | 96.5% | 72.9 tok/s | $0.372/M |
| #33 | Magistral Small 1 | Mistral | 96.3% | n/a | - |
| #34 | Qwen3 14B (Reasoning) | Alibaba | 96.1% | 63 tok/s | $1.31/M |
| #35 | Qwen3 32B (Reasoning) | Alibaba | 96.1% | 91.4 tok/s | $2.63/M |
| #36 | Llama 3.3 Nemotron Super 49B v1 (Reasoning) | NVIDIA | 95.9% | n/a | - |
| #37 | Qwen3 30B A3B (Reasoning) | Alibaba | 95.9% | 77.8 tok/s | $0.750/M |
| #38 | QwQ 32B | Alibaba | 95.7% | 30.4 tok/s | $0.745/M |
| #39 | Sonar Reasoning Pro | Perplexity | 95.7% | n/a | - |
| #40 | R1 1776 | Perplexity | 95.4% | n/a | - |
| #41 | Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) | NVIDIA | 95.2% | 41 tok/s | $0.900/M |
| #42 | DeepSeek R1 Distill Qwen 14B | DeepSeek | 94.9% | n/a | - |
| #43 | Claude 3.7 Sonnet (Reasoning) | Anthropic | 94.7% | n/a | $6.00/M |
| #44 | Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning) | NVIDIA | 94.7% | n/a | - |
| #45 | Gemini 2.0 Flash Thinking Experimental (Jan '25) | Google | 94.4% | n/a | - |
| #46 | o1-mini | OpenAI | 94.4% | n/a | - |
| #47 | DeepSeek V3 0324 | DeepSeek | 94.2% | n/a | $1.25/M |
| #48 | Qwen3 Coder 480B A35B Instruct | Alibaba | 94.2% | 66.1 tok/s | $3.00/M |
| #49 | Claude 4 Opus (Non-reasoning) | Anthropic | 94.1% | 36.6 tok/s | $30.00/M |
| #50 | DeepSeek R1 Distill Qwen 32B | DeepSeek | 94.1% | n/a | - |
| #51 | EXAONE 4.0 32B (Non-reasoning) | LG AI Research | 93.9% | n/a | - |
| #52 | DeepSeek R1 Distill Llama 70B | DeepSeek | 93.5% | 44 tok/s | $0.875/M |
| #53 | Claude 4 Sonnet (Non-reasoning) | Anthropic | 93.4% | 47.6 tok/s | $6.00/M |
| #54 | Qwen3 4B (Reasoning) | Alibaba | 93.3% | 101.8 tok/s | $0.398/M |
| #55 | DeepSeek R1 0528 Qwen3 8B | DeepSeek | 93.2% | n/a | - |
| #56 | Gemini 2.5 Flash (Non-reasoning) | Google | 93.2% | 189.1 tok/s | $0.850/M |
| #57 | ERNIE 4.5 300B A47B | Baidu | 93.1% | 22.7 tok/s | $0.485/M |
| #58 | Gemini 2.0 Flash (Feb '25) | Google | 93.0% | n/a | $0.263/M |
| #59 | Qwen3 235B A22B (Reasoning) | Alibaba | 93.0% | 61.4 tok/s | $2.63/M |
| #60 | Gemini 2.5 Flash Preview (Non-reasoning) | Google | 92.6% | n/a | - |
| #61 | Gemini 2.5 Flash-Lite (Non-reasoning) | Google | 92.6% | 239.9 tok/s | $0.175/M |
| #62 | GPT-4.1 mini | OpenAI | 92.5% | 78.4 tok/s | $0.700/M |
| #63 | o1-preview | OpenAI | 92.4% | n/a | $28.88/M |
| #64 | Gemini 2.0 Pro Experimental (Feb '25) | Google | 92.3% | n/a | - |
| #65 | Sonar Reasoning | Perplexity | 92.1% | n/a | - |
| #66 | Magistral Medium 1 | Mistral | 91.7% | n/a | - |
| #67 | GPT-4.1 | OpenAI | 91.3% | 86.4 tok/s | $3.50/M |
| #68 | Gemini 2.0 Flash (Experimental) | Google | 91.1% | n/a | - |
| #69 | QwQ 32B-Preview | Alibaba | 91.0% | n/a | - |
| #70 | Mistral Medium 3 | Mistral | 90.7% | 56.8 tok/s | $0.800/M |
| #71 | Qwen3 8B (Reasoning) | Alibaba | 90.4% | 87.9 tok/s | $0.660/M |
| #72 | Qwen3 235B A22B (Non-reasoning) | Alibaba | 90.2% | 61.1 tok/s | $1.23/M |
| #73 | Solar Pro 2 (Preview) (Reasoning) | Upstage | 90.0% | n/a | - |
| #74 | Qwen3 1.7B (Reasoning) | Alibaba | 89.4% | 136.7 tok/s | $0.398/M |
| #75 | GPT-4o (March 2025, chatgpt-4o-latest) | OpenAI | 89.3% | n/a | - |
| #76 | Qwen3 Coder 30B A3B Instruct | Alibaba | 89.3% | 110.3 tok/s | $0.900/M |
| #77 | Reka Flash 3 | Reka AI | 89.3% | 90.6 tok/s | $0.350/M |
| #78 | Llama 4 Maverick | Meta | 88.9% | 115.2 tok/s | $0.475/M |
| #79 | Solar Pro 2 (Non-reasoning) | Upstage | 88.9% | n/a | - |
| #80 | DeepSeek V3 (Dec '24) | DeepSeek | 88.7% | n/a | $0.625/M |