Mathematical problem-solving benchmark score.
MATH-500 is a 500-problem held-out subset of the MATH benchmark, taken from the split used in OpenAI's "Let's Verify Step by Step" work. It is widely used to test final-answer mathematical problem solving.
Test type: Competition math word problems, usually scored by normalized final-answer matching.
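As a rough illustration of what "normalized final-answer matching" can mean, here is a minimal sketch. The function names (`normalize_answer`, `is_correct`, `accuracy`) and the specific normalization rules are assumptions made for this example; the grader actually used for the scores on this page is not described here and may differ, for instance by checking symbolic equivalence.

```python
import re


def normalize_answer(ans: str) -> str:
    """Hypothetical normalizer: reduce a final answer to a canonical string."""
    s = ans.strip().strip("$")  # drop surrounding whitespace and math delimiters
    boxed = re.fullmatch(r"\\boxed\{(.*)\}", s)
    if boxed:  # unwrap a single outer \boxed{...}
        s = boxed.group(1)
    s = re.sub(r"\\(left|right)(?![A-Za-z])", "", s)  # \left( ... \right) -> ( ... )
    s = s.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")
    s = re.sub(r"\\text\{([^}]*)\}", r"\1", s)  # \text{cm} -> cm
    s = re.sub(r"\s+", "", s)  # whitespace is not significant
    return s.rstrip(".")  # ignore a trailing period


def is_correct(prediction: str, reference: str) -> bool:
    """Exact string match after normalization (an assumed scoring rule)."""
    return normalize_answer(prediction) == normalize_answer(reference)


def accuracy(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (prediction, reference) pairs judged correct."""
    return sum(is_correct(p, r) for p, r in pairs) / len(pairs)
```

Under these assumed rules, `$\boxed{\dfrac{1}{2}}$` and `\frac{1}{2}` count as the same answer. Pure string matching would still reject equivalent forms such as `0.5`, which is one reason real graders tend to be more elaborate.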
201 models have this metric.
Current leader: GPT-5 (high)
This app ranks the MATH-500 score exposed by the Artificial Analysis snapshot.
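The ranking step can be sketched as a descending sort on the score. The `Entry` type and `rank_by_score` function below are hypothetical names for this example, and the tie-breaking rule (keep input order for equal scores) is an assumption, since the snapshot's own rule is not stated.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    creator: str
    score: float  # MATH-500 accuracy, in percent


def rank_by_score(entries: list[Entry]) -> list[tuple[int, Entry]]:
    """Return (rank, entry) pairs, highest score first.

    sorted() is stable, so entries with equal scores keep their input
    order; that tie-break is an assumption made for this sketch.
    """
    ordered = sorted(entries, key=lambda e: e.score, reverse=True)
    return list(enumerate(ordered, start=1))


# A few rows taken from the table on this page, deliberately out of order.
sample = [
    Entry("Grok 4", "xAI", 99.0),
    Entry("GPT-5 (medium)", "OpenAI", 99.1),
    Entry("GPT-5 (high)", "OpenAI", 99.4),
    Entry("Claude 4 Sonnet (Reasoning)", "Anthropic", 99.1),
]

for rank, entry in rank_by_score(sample):
    print(f"#{rank} {entry.model} ({entry.creator}): {entry.score:.1f}%")
```

Ranks here are ordinal, so tied scores still receive distinct positions, which matches how tied rows are numbered in the table.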
Top models ranked by MATH-500.
| Rank | Model | Creator | Value | Speed | Blended Price |
|---|---|---|---|---|---|
| #1 | GPT-5 (high) | OpenAI | 99.4% | 111.1 tok/s | $3.44/M |
| #2 | | xAI | 99.2% | 58.8 tok/s | $0.350/M |
| #3 | o3 | OpenAI | 99.2% | 122.3 tok/s | $3.50/M |
| #4 | GPT-5 (medium) | OpenAI | 99.1% | 85.6 tok/s | $3.44/M |
| #5 | Claude 4 Sonnet (Reasoning) | Anthropic | 99.1% | 45.5 tok/s | $6.56/M |
| #6 | Grok 4 | xAI | 99.0% | n/a | $11.00/M |
| #7 | o4-mini (high) | OpenAI | 98.9% | 151 tok/s | $1.93/M |
| #8 | GPT-5 (low) | OpenAI | 98.7% | 79.3 tok/s | $3.44/M |
| #9 | Gemini 2.5 Pro Preview (May '25) | Google | 98.6% | n/a | $3.44/M |
| #10 | o3-mini (high) | OpenAI | 98.5% | 218.5 tok/s | $1.93/M |
| #11 | Qwen3 235B A22B 2507 (Reasoning) | Alibaba | 98.4% | 59.4 tok/s | $0.838/M |
| #12 | Llama Nemotron Super 49B v1.5 (Reasoning) | NVIDIA | 98.3% | 44.2 tok/s | $0.175/M |
| #13 | DeepSeek R1 0528 (May '25) | DeepSeek | 98.3% | n/a | $2.06/M |
| #14 | Claude 4 Opus (Reasoning) | Anthropic | 98.2% | 36.4 tok/s | $32.81/M |
| #15 | Gemini 2.5 Flash (Reasoning) | Google | 98.1% | 221.3 tok/s | $0.850/M |
| #16 | Gemini 2.5 Flash Preview (Reasoning) | Google | 98.1% | n/a | - |
| #17 | Gemini 2.5 Pro Preview (Mar '25) | Google | 98.0% | n/a | - |
| #18 | MiniMax M1 80k | MiniMax | 98.0% | n/a | $0.963/M |
| #19 | Qwen3 235B A22B 2507 Instruct | Alibaba | 98.0% | 42.5 tok/s | $0.356/M |
| #20 | GLM-4.5 (Reasoning) | Z AI | 97.9% | 50.1 tok/s | $1.00/M |
| #21 | EXAONE 4.0 32B (Reasoning) | LG AI Research | 97.7% | n/a | - |
| #22 | Qwen3 30B A3B 2507 (Reasoning) | Alibaba | 97.6% | 139.3 tok/s | $0.673/M |
| #23 | Qwen3 30B A3B 2507 Instruct | Alibaba | 97.5% | 105.2 tok/s | $0.213/M |
| #24 | o3-mini | OpenAI | 97.3% | 203.3 tok/s | $1.93/M |
| #25 | MiniMax M1 40k | MiniMax | 97.2% | n/a | - |
| #26 | Kimi K2 | Kimi | 97.1% | 24.3 tok/s | $1.04/M |
| #27 | o1 | OpenAI | 97.0% | 123.2 tok/s | $26.25/M |
| #28 | Gemini 2.5 Flash-Lite (Reasoning) | Google | 96.9% | 265.2 tok/s | $0.175/M |
| #29 | Gemini 2.5 Pro | Google | 96.7% | 132 tok/s | $3.44/M |
| #30 | Solar Pro 2 (Reasoning) | Upstage | 96.7% | n/a | - |
| #31 | DeepSeek R1 (Jan '25) | DeepSeek | 96.6% | n/a | $2.43/M |
| #32 | GLM-4.5-Air | Z AI | 96.5% | 74.5 tok/s | $0.372/M |
| #33 | Magistral Small 1 | Mistral | 96.3% | n/a | - |
| #34 | Qwen3 14B (Reasoning) | Alibaba | 96.1% | 63.5 tok/s | $0.731/M |
| #35 | Qwen3 32B (Reasoning) | Alibaba | 96.1% | 98.4 tok/s | $0.276/M |
| #36 | Qwen3 30B A3B (Reasoning) | Alibaba | 95.9% | 68.5 tok/s | $0.180/M |
| #37 | Llama 3.3 Nemotron Super 49B v1 (Reasoning) | NVIDIA | 95.9% | n/a | - |
| #38 | QwQ 32B | Alibaba | 95.7% | 31 tok/s | $0.745/M |
| #39 | Sonar Reasoning Pro | Perplexity | 95.7% | n/a | - |
| #40 | R1 1776 | Perplexity | 95.4% | n/a | - |
| #41 | Llama 3.1 Nemotron Ultra 253B v1 (Reasoning) | NVIDIA | 95.2% | 52.7 tok/s | $0.900/M |
| #42 | DeepSeek R1 Distill Qwen 14B | DeepSeek | 94.9% | n/a | - |
| #43 | Claude 3.7 Sonnet (Reasoning) | Anthropic | 94.7% | n/a | - |
| #44 | Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning) | NVIDIA | 94.7% | n/a | - |
| #45 | Gemini 2.0 Flash Thinking Experimental (Jan '25) | Google | 94.4% | n/a | - |
| #46 | o1-mini | OpenAI | 94.4% | n/a | - |
| #47 | DeepSeek V3 0324 | DeepSeek | 94.2% | n/a | $1.21/M |
| #48 | Qwen3 Coder 480B A35B Instruct | Alibaba | 94.2% | 61 tok/s | $0.675/M |
| #49 | Claude 4 Opus (Non-reasoning) | Anthropic | 94.1% | 33.9 tok/s | $32.81/M |
| #50 | DeepSeek R1 Distill Qwen 32B | DeepSeek | 94.1% | n/a | - |
| #51 | EXAONE 4.0 32B (Non-reasoning) | LG AI Research | 93.9% | n/a | - |
| #52 | DeepSeek R1 Distill Llama 70B | DeepSeek | 93.5% | 44.7 tok/s | $0.787/M |
| #53 | Claude 4 Sonnet (Non-reasoning) | Anthropic | 93.4% | 45.2 tok/s | $6.56/M |
| #54 | Qwen3 4B (Reasoning) | Alibaba | 93.3% | n/a | $0.398/M |
| #55 | DeepSeek R1 0528 Qwen3 8B | DeepSeek | 93.2% | n/a | - |
| #56 | Gemini 2.5 Flash (Non-reasoning) | Google | 93.2% | 185.1 tok/s | $0.850/M |
| #57 | ERNIE 4.5 300B A47B | Baidu | 93.1% | 23.7 tok/s | $0.485/M |
| #58 | Gemini 2.0 Flash (Feb '25) | Google | 93.0% | n/a | $0.262/M |
| #59 | Qwen3 235B A22B (Reasoning) | Alibaba | 93.0% | 59 tok/s | $2.63/M |
| #60 | Gemini 2.5 Flash Preview (Non-reasoning) | Google | 92.6% | n/a | - |
| #61 | Gemini 2.5 Flash-Lite (Non-reasoning) | Google | 92.6% | 229.5 tok/s | $0.175/M |
| #62 | GPT-4.1 mini | OpenAI | 92.5% | 79.3 tok/s | $0.700/M |
| #63 | o1-preview | OpenAI | 92.4% | n/a | $28.88/M |
| #64 | Gemini 2.0 Pro Experimental (Feb '25) | Google | 92.3% | n/a | - |
| #65 | Sonar Reasoning | Perplexity | 92.1% | n/a | - |
| #66 | Magistral Medium 1 | Mistral | 91.7% | n/a | - |
| #67 | GPT-4.1 | OpenAI | 91.3% | 128.3 tok/s | $3.50/M |
| #68 | Gemini 2.0 Flash (experimental) | Google | 91.1% | n/a | - |
| #69 | QwQ 32B-Preview | Alibaba | 91.0% | n/a | - |
| #70 | Mistral Medium 3 | Mistral | 90.7% | 42.2 tok/s | $0.800/M |
| #71 | Qwen3 8B (Reasoning) | Alibaba | 90.4% | 62.5 tok/s | $0.370/M |
| #72 | Qwen3 235B A22B (Non-reasoning) | Alibaba | 90.2% | 65.4 tok/s | $0.787/M |
| #73 | Solar Pro 2 (Preview) (Reasoning) | Upstage | 90.0% | n/a | - |
| #74 | Qwen3 1.7B (Reasoning) | Alibaba | 89.4% | n/a | $0.398/M |
| #75 | Qwen3 Coder 30B A3B Instruct | Alibaba | 89.3% | 83 tok/s | $0.352/M |
| #76 | GPT-4o (March 2025, chatgpt-4o-latest) | OpenAI | 89.3% | n/a | - |
| #77 | Reka Flash 3 | Reka AI | 89.3% | 93.2 tok/s | $0.350/M |
| #78 | Llama 4 Maverick | Meta | 88.9% | 92.9 tok/s | $0.475/M |
| #79 | Solar Pro 2 (Non-reasoning) | Upstage | 88.9% | n/a | - |
| #80 | DeepSeek V3 (Dec '24) | DeepSeek | 88.7% | n/a | $0.523/M |