Open-Source Models vs Sonnet 4.6 on Coding Tasks

How five leading models stack up on a coding-agent benchmark: the open-source GLM 5.2, MiniMax M3, Kimi K2.7-code and Qwen3.7-Plus, alongside Anthropic's Sonnet 4.6. Scores are shown both with a dedicated coding skill enabled and at baseline (skill disabled), together with task efficiency and cost.

Data

Metric	GLM 5.2	MiniMax M3	Sonnet 4.6	Kimi K2.7-code	Qwen3.7-Plus
Overall score	91.9	91.4	90.8	88.7	82.2
Overall score (baseline, no skill)	71.7	70.5	66.4	69.2	62.7
Overall lift from the skill	+20.2	+20.9	+24.4	+19.5	+19.5
Instruction-following	87.4	87.2	86.1	82.5	77.2
Instruction-following (baseline)	56.2	55.4	49.1	52.8	45.7
Task-completion	97.8	97.0	97.1	96.9	88.9
Turns to complete	18.5	22.7	17.7	27.5	16.5
Output tokens per task	8,813	8,952	6,841	21,787	12,296
List price (input / output, per MTok)	$1.40 / $4.40	$0.30 / $1.20	$3 / $15	$0.95 / $4.00	$0.40 / $1.60
Cost per task	$0.289	$0.207	$0.296	$0.661	$0.068
Points per dollar	318	442	307	134	1,204
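
The two derived rows (lift and points per dollar) can be rechecked from the rows above them. The following is a minimal Python sketch, assuming that lift is the overall score minus the baseline score and that points per dollar is the overall score divided by cost per task. The source does not state either formula, and the names `TABLE` and `derived_rows` are illustrative only.

```python
# Values copied from the table above; no new data is introduced.
# Assumed formulas (not stated in the source):
#   lift              = overall score - baseline score
#   points per dollar = overall score / cost per task
TABLE = {
    # model: (overall, baseline, cost_per_task_usd, listed_lift, listed_points_per_dollar)
    "GLM 5.2":        (91.9, 71.7, 0.289, 20.2, 318),
    "MiniMax M3":     (91.4, 70.5, 0.207, 20.9, 442),
    "Sonnet 4.6":     (90.8, 66.4, 0.296, 24.4, 307),
    "Kimi K2.7-code": (88.7, 69.2, 0.661, 19.5, 134),
    "Qwen3.7-Plus":   (82.2, 62.7, 0.068, 19.5, 1204),
}


def derived_rows(table):
    """Recompute lift and points per dollar for every model."""
    out = {}
    for model, (overall, baseline, cost, _lift, _ppd) in table.items():
        out[model] = {
            "lift": round(overall - baseline, 1),
            "points_per_dollar": overall / cost,
        }
    return out


if __name__ == "__main__":
    for model, row in derived_rows(TABLE).items():
        _, _, _, listed_lift, listed_ppd = TABLE[model]
        print(
            f"{model:15s} lift {row['lift']:+.1f} (listed {listed_lift:+.1f})  "
            f"points/$ {row['points_per_dollar']:.1f} (listed {listed_ppd})"
        )
```

Under these assumptions the lift row is reproduced exactly, and four of the five points-per-dollar values round to the listed figures. Qwen3.7-Plus comes out at about 1,209 rather than 1,204; a gap of that size would be consistent with the table having been computed from an unrounded cost per task, though the source does not say so.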

A short review of the data

At the top, the field is remarkably tight. GLM 5.2 (91.9), MiniMax M3 (91.4) and Sonnet 4.6 (90.8) are separated by just 1.1 points, with Kimi K2.7-code about two points further back at 88.7. On raw quality, the leading open-source models have effectively closed the gap with Sonnet 4.6 on this coding benchmark; Qwen3.7-Plus (82.2) is the only one that trails meaningfully.

The coding skill matters more than the model choice. Every model gains between 19.5 and 24.4 points once it is enabled, at least twice the 9.7-point spread between the best and worst model, and Sonnet 4.6 sees the largest lift (+24.4). The baseline scores tell the real story: without the skill, instruction-following drops to between 45.7 and 56.2 for every model, roughly 30 to 37 points below the skill-enabled figures, so a large share of the headline performance comes from the scaffolding rather than the model alone.
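
To put the "skill versus model choice" comparison in numbers, the sketch below sets the per-model lift against the spread between the best and worst model. It is an added illustration in Python that uses only figures from the table; the names `OVERALL`, `INSTRUCTION`, `lifts` and `spread` are hypothetical.

```python
# Scores copied from the table: (with skill, baseline without skill).
OVERALL = {
    "GLM 5.2":        (91.9, 71.7),
    "MiniMax M3":     (91.4, 70.5),
    "Sonnet 4.6":     (90.8, 66.4),
    "Kimi K2.7-code": (88.7, 69.2),
    "Qwen3.7-Plus":   (82.2, 62.7),
}
INSTRUCTION = {
    "GLM 5.2":        (87.4, 56.2),
    "MiniMax M3":     (87.2, 55.4),
    "Sonnet 4.6":     (86.1, 49.1),
    "Kimi K2.7-code": (82.5, 52.8),
    "Qwen3.7-Plus":   (77.2, 45.7),
}


def lifts(scores):
    """Points gained per model when the skill is enabled."""
    return {m: round(with_skill - base, 1) for m, (with_skill, base) in scores.items()}


def spread(scores, index):
    """Gap between best and worst model; index 0 = with skill, 1 = baseline."""
    values = [pair[index] for pair in scores.values()]
    return round(max(values) - min(values), 1)


if __name__ == "__main__":
    overall_lift = lifts(OVERALL)
    instruction_lift = lifts(INSTRUCTION)
    print("smallest overall lift:       ", min(overall_lift.values()))
    print("model spread, skill enabled: ", spread(OVERALL, 0))
    print("model spread, baseline:      ", spread(OVERALL, 1))
    print("instruction-following lifts: ", instruction_lift)
```

With the table's numbers, the smallest overall lift (19.5) is about twice the skill-enabled spread between the best and worst model (9.7), and the instruction-following lifts range from 29.7 to 37.0 points.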

Efficiency and cost are where the models diverge sharply. Sonnet 4.6 produces the fewest output tokens (6,841) and needs the second-fewest turns (17.7, behind Qwen3.7-Plus at 16.5), but it is the priciest per token at $3 / $15. Kimi K2.7-code is the outlier in the wrong direction: it burns 21,787 output tokens over 27.5 turns, pushing cost per task to $0.661 and points per dollar down to just 134, the lowest in the table. Qwen3.7-Plus flips the equation: despite the lowest score, its $0.068 per task yields 1,204 points per dollar. MiniMax M3 is arguably the sweet spot, with near-top quality at $0.207 a task and 442 points per dollar. The takeaway: if you only care about peak quality, GLM 5.2 and MiniMax M3 now match Sonnet 4.6; if you care about value, MiniMax M3 and Qwen3.7-Plus win comfortably, GLM 5.2 is roughly level with Sonnet 4.6 (318 vs 307 points per dollar), and Kimi K2.7-code is the one open-source option that costs more per task than Sonnet 4.6.
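
One way to make the quality-versus-cost trade-off concrete is a Pareto check: a model is "dominated" if another model scores at least as high and costs no more per task, with at least one of the two strictly better. The Python sketch below is an added illustration, not an analysis from the source. `dominates` and `pareto_frontier` are hypothetical helper names, and only overall score and cost per task are considered (turns and token counts are ignored).

```python
# (overall score, cost per task in USD), copied from the table.
POINTS = {
    "GLM 5.2":        (91.9, 0.289),
    "MiniMax M3":     (91.4, 0.207),
    "Sonnet 4.6":     (90.8, 0.296),
    "Kimi K2.7-code": (88.7, 0.661),
    "Qwen3.7-Plus":   (82.2, 0.068),
}


def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on one.

    Higher score is better; lower cost is better.
    """
    score_a, cost_a = a
    score_b, cost_b = b
    no_worse = score_a >= score_b and cost_a <= cost_b
    strictly_better = score_a > score_b or cost_a < cost_b
    return no_worse and strictly_better


def pareto_frontier(points):
    """Names of models that no other model dominates, in table order."""
    return [
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    ]


if __name__ == "__main__":
    print(pareto_frontier(POINTS))
```

On the table's figures this returns GLM 5.2, MiniMax M3 and Qwen3.7-Plus. Sonnet 4.6 falls off the frontier only narrowly: GLM 5.2 is 1.1 points higher and $0.007 cheaper per task, a margin that may be within benchmark noise, which the source does not quantify.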