AI Model Benchmarks 2026

16 top AI models compared on four standard benchmarks: MMLU, HumanEval, MATH-500, and GPQA Diamond. All are accessible through one AIPower API.

🧠 MMLU leader (general knowledge): 🇺🇸 GPT-5, 94.2%

💻 Coding leader (HumanEval): 🇺🇸 Claude Sonnet 4, 93.7%

🧮 Math champion (MATH-500): 🇨🇳 DeepSeek R1, 97.3%

🔬 Reasoning king (GPQA Diamond): 🇨🇳 DeepSeek R1, 71.5%

Model	Model ID	MMLU	HumanEval	MATH-500	GPQA Diamond	Context (tokens)	Input ($/M tokens)	Output ($/M tokens)
🇺🇸 GPT-5	openai/gpt-5	94.2%	91.0%	94.5%	58.2%	272K	$2.88	$17.25
🇺🇸 Claude Opus 4.6	anthropic/claude-opus	92.8%	93.4%	91.2%	68.4%	200K	$5.75	$28.75
🇺🇸 Claude Sonnet 4	anthropic/claude-sonnet	90.1%	93.7%	87.3%	62.1%	200K	$3.45	$17.25
🇺🇸 Gemini 2.5 Pro	google/gemini-2.5-pro	91.8%	88.9%	89.5%	60.3%	1M	$1.88	$15.00
🇨🇳 DeepSeek R1	deepseek/deepseek-reasoner	90.8%	89.5%	97.3%	71.5%	64K	$0.34	$0.50
🇨🇳 DeepSeek V3	deepseek/deepseek-chat	88.5%	92.7%	85.3%	59.1%	64K	$0.32	$0.48
🇨🇳 GLM-5.1	zhipu/glm-5.1	87.3%	92.1%	82.8%	54.8%	128K	$1.20	$3.84
🇨🇳 Qwen Plus	qwen/qwen-plus	89.2%	86.1%	79.8%	56.3%	128K	$0.13	$1.80
🇨🇳 Kimi K2.5	moonshot/kimi-k2.5	85.7%	89.5%	80.1%	52.4%	256K	$0.23	$1.15
🇺🇸 Gemini 2.5 Flash	google/gemini-2.5-flash	83.5%	82.3%	78.2%	48.6%	1M	$0.35	$2.88
🇺🇸 GPT-4o Mini	openai/gpt-4o-mini	82.1%	87.2%	75.4%	46.8%	128K	$0.23	$0.90
🇨🇳 Qwen Turbo	qwen/qwen-turbo	81.3%	80.2%	67.4%	42.1%	128K	$0.08	$0.31
🇨🇳 MiniMax Text 01	minimax/minimax-text-01	80.5%	78.3%	72.6%	45.2%	1M	$0.35	$1.38
🇨🇳 Doubao Pro	doubao/doubao-pro-256k	79.8%	76.1%	70.5%	43.8%	256K	$0.06	$0.11
🇨🇳 Moonshot v1 8K	moonshot/moonshot-v1-8k	73.2%	72.5%	61.3%	38.4%	8K	$0.14	$0.14
🇨🇳 GLM-4 Flash	zhipu/glm-4-flash	68.5%	65.8%	55.2%	32.1%	128K	$0.01	$0.01

Sources: OpenAI, Anthropic, Google, DeepSeek, Alibaba, Zhipu public papers & model cards. Updated April 2026.
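The per-benchmark leaders can be re-derived from the table with a few lines of Python. This is a minimal sketch: the scores are copied from the first six rows of the table (the remaining rows score lower on every benchmark), and the names `SCORES`, `BENCHMARKS`, and `leaders` are illustrative, not part of any AIPower tooling.

```python
# Scores copied from the table: (MMLU, HumanEval, MATH-500, GPQA Diamond), in percent.
SCORES = {
    "GPT-5": (94.2, 91.0, 94.5, 58.2),
    "Claude Opus 4.6": (92.8, 93.4, 91.2, 68.4),
    "Claude Sonnet 4": (90.1, 93.7, 87.3, 62.1),
    "Gemini 2.5 Pro": (91.8, 88.9, 89.5, 60.3),
    "DeepSeek R1": (90.8, 89.5, 97.3, 71.5),
    "DeepSeek V3": (88.5, 92.7, 85.3, 59.1),
}
BENCHMARKS = ("MMLU", "HumanEval", "MATH-500", "GPQA Diamond")


def leaders(scores):
    """Return {benchmark: (model, score)} for the top score on each benchmark."""
    result = {}
    for i, bench in enumerate(BENCHMARKS):
        model = max(scores, key=lambda m: scores[m][i])
        result[bench] = (model, scores[model][i])
    return result


for bench, (model, score) in leaders(SCORES).items():
    print(f"{bench}: {model} ({score}%)")
```

Extending `SCORES` with the other ten rows does not change the result, since none of them tops any column.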

Price-to-performance leaders

DeepSeek V3

92.7% HumanEval at $0.32/M input. Beats GPT-4o Mini on all four benchmarks at about half the output price ($0.48/M vs. $0.90/M).

DeepSeek R1

97.3% MATH-500 and 71.5% GPQA Diamond at $0.34/M input. The best reasoning scores in the table at about 88% lower input cost and 97% lower output cost than GPT-5.

GLM-4 Flash

$0.01/M for both input and output, practically free. 68.5% MMLU is still competitive with many small open-source models. A good fit for classification.
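Because input and output tokens are priced separately, the actual saving depends on the mix of the two in a given workload. The sketch below uses prices copied from the table; the function names and the 3M-input / 1M-output workload are assumptions chosen only for illustration.

```python
# Prices copied from the table, in USD per million tokens: (input, output).
PRICES = {
    "GPT-5": (2.88, 17.25),
    "GPT-4o Mini": (0.23, 0.90),
    "DeepSeek R1": (0.34, 0.50),
    "DeepSeek V3": (0.32, 0.48),
    "GLM-4 Flash": (0.01, 0.01),
}


def job_cost(model, input_tokens, output_tokens):
    """Cost in USD of a workload with the given token counts."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def savings_pct(cheaper, baseline, input_tokens, output_tokens):
    """Percent by which `cheaper` undercuts `baseline` on the same workload."""
    base = job_cost(baseline, input_tokens, output_tokens)
    return 100 * (1 - job_cost(cheaper, input_tokens, output_tokens) / base)


# Hypothetical workload: 3M input tokens and 1M output tokens.
print(round(savings_pct("DeepSeek R1", "GPT-5", 3_000_000, 1_000_000), 1))
```

For this particular mix, DeepSeek R1 comes out roughly 94% cheaper than GPT-5. Output-heavy workloads push the figure higher and input-heavy ones push it lower.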

One API. Benchmark them yourself. 10 trial calls to start.
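A do-it-yourself comparison can start with a tiny exact-match harness like the sketch below. It does not assume any particular AIPower endpoint or client library: `ask` is a hypothetical callable that you would wire to your own client, and `fake_ask` is a stand-in so the example runs offline. Exact match is a simplification; benchmarks such as HumanEval are scored by running generated code against unit tests.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical signature: takes a model ID and a prompt, returns the model's reply.
AskFn = Callable[[str, str], str]


def exact_match_accuracy(
    ask: AskFn, model_id: str, cases: Iterable[Tuple[str, str]]
) -> float:
    """Fraction of (prompt, expected) cases whose trimmed, case-folded reply equals expected."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one test case")
    hits = sum(
        ask(model_id, prompt).strip().casefold() == expected.strip().casefold()
        for prompt, expected in cases
    )
    return hits / len(cases)


def fake_ask(model_id: str, prompt: str) -> str:
    """Stand-in for a real API call; returns canned answers for two prompts."""
    canned = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "")


CASES = [
    ("2 + 2 = ?", "4"),
    ("Capital of France?", "paris"),
    ("Largest planet?", "Jupiter"),
]
print(exact_match_accuracy(fake_ask, "deepseek/deepseek-chat", CASES))
```

Swapping `fake_ask` for a function that calls a real client lets the same cases be run against each model ID from the table.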