Open-Source Models vs Sonnet 4.6 on Coding Tasks
How five leading models — GLM 5.2, MiniMax M3, Kimi K2.7-code and Qwen3.7-Plus alongside Anthropic's Sonnet 4.6 — stack up on a coding-agent benchmark. Scores are shown both with a dedicated coding skill enabled and at baseline, together with task efficiency and cost.
Data
| Metric | GLM 5.2 | MiniMax M3 | Sonnet 4.6 | Kimi K2.7-code | Qwen3.7-Plus |
|---|---|---|---|---|---|
| Overall score | 91.9 | 91.4 | 90.8 | 88.7 | 82.2 |
| Overall score (baseline, no skill) | 71.7 | 70.5 | 66.4 | 69.2 | 62.7 |
| Overall lift from the skill | +20.2 | +20.9 | +24.4 | +19.5 | +19.5 |
| Instruction-following | 87.4 | 87.2 | 86.1 | 82.5 | 77.2 |
| Instruction-following (baseline) | 56.2 | 55.4 | 49.1 | 52.8 | 45.7 |
| Task-completion | 97.8 | 97.0 | 97.1 | 96.9 | 88.9 |
| Turns to complete | 18.5 | 22.7 | 17.7 | 27.5 | 16.5 |
| Output tokens per task | 8,813 | 8,952 | 6,841 | 21,787 | 12,296 |
| List price (input / output, per MTok) | $1.40 / $4.40 | $0.30 / $1.20 | $3 / $15 | $0.95 / $4.00 | $0.40 / $1.60 |
| Cost per task | $0.289 | $0.207 | $0.296 | $0.661 | $0.068 |
| Points per dollar | 318 | 442 | 307 | 134 | 1,204 |
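
The lift and points-per-dollar rows appear to be derived from the other rows. The sketch below recomputes them, assuming lift is the skill-enabled score minus the baseline and points per dollar is the overall score divided by cost per task. The article does not state either formula, and the names `ROWS` and `derived` are illustrative only.

```python
# (score with skill, baseline score, cost per task in USD), copied from the table.
ROWS = {
    "GLM 5.2": (91.9, 71.7, 0.289),
    "MiniMax M3": (91.4, 70.5, 0.207),
    "Sonnet 4.6": (90.8, 66.4, 0.296),
    "Kimi K2.7-code": (88.7, 69.2, 0.661),
    "Qwen3.7-Plus": (82.2, 62.7, 0.068),
}


def derived(rows):
    """Return {model: (lift, points_per_dollar)} from the raw table values."""
    out = {}
    for model, (score, baseline, cost) in rows.items():
        # Assumed formula: lift = skill-enabled score minus baseline score.
        lift = round(score - baseline, 1)
        # Assumed formula: points per dollar = overall score / cost per task.
        points_per_dollar = round(score / cost)
        out[model] = (lift, points_per_dollar)
    return out


if __name__ == "__main__":
    for model, (lift, ppd) in derived(ROWS).items():
        print(f"{model:15s} lift=+{lift:.1f}  points/$={ppd}")
```

Under these assumptions every lift and four of the five points-per-dollar figures match the table. Qwen3.7-Plus recomputes to 1,209 rather than 1,204, which would be consistent with the published figure having been calculated from an unrounded cost per task, though the article does not say so.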
A short review of the data
At the top, the field is tight. GLM 5.2 (91.9), MiniMax M3 (91.4) and Sonnet 4.6 (90.8) sit within 1.1 points of one another, with Kimi K2.7-code about two points further back at 88.7. On raw quality, the leading open-source models have closed the gap with Sonnet 4.6 on this coding benchmark, and two of them edge slightly ahead of it. Qwen3.7-Plus (82.2) is the only one that trails meaningfully.
The coding skill matters more than the model choice. Every model gains between 19.5 and 24.4 points once it is enabled, roughly double the 9.7-point spread between the best and worst skill-enabled scores, and Sonnet 4.6 sees the largest lift (+24.4). The baseline scores tell the real story: without the skill, instruction-following collapses into the 45–56 range for everyone and overall scores fall to between 62.7 and 71.7. The scaffolding therefore supplies roughly a quarter of each headline score, with the rest coming from the model itself.
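The table gives instruction-following scores with and without the skill but no lift row for that metric. As a minimal sketch, the gap can be computed the same way as the overall lift, assuming it is simply the skill-enabled score minus the baseline. The names `INSTRUCTION_FOLLOWING` and `skill_lift` are illustrative.

```python
# (with skill, baseline) instruction-following scores, copied from the table.
INSTRUCTION_FOLLOWING = {
    "GLM 5.2": (87.4, 56.2),
    "MiniMax M3": (87.2, 55.4),
    "Sonnet 4.6": (86.1, 49.1),
    "Kimi K2.7-code": (82.5, 52.8),
    "Qwen3.7-Plus": (77.2, 45.7),
}


def skill_lift(pairs):
    """Return {model: points gained with the skill}, rounded to one decimal."""
    return {
        model: round(with_skill - baseline, 1)
        for model, (with_skill, baseline) in pairs.items()
    }


lifts = skill_lift(INSTRUCTION_FOLLOWING)

# Print models from largest to smallest instruction-following lift.
for model, lift in sorted(lifts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:15s} +{lift:.1f}")
```

On these numbers the instruction-following lift runs from 29.7 to 37.0 points, larger than the overall lift for every model, and Sonnet 4.6 again gains the most.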
Efficiency and cost are where the models diverge sharply. Sonnet 4.6 produces the fewest output tokens (6,841) and needs the second-fewest turns (17.7, behind Qwen3.7-Plus at 16.5), but it is the priciest per token at $3 / $15. Kimi K2.7-code is the outlier in the wrong direction: it burns 21,787 output tokens over 27.5 turns, with a cost per task of $0.661 and just 134 points per dollar, the worst value in the table. Qwen3.7-Plus flips the equation: despite the lowest score, its $0.068 per task yields 1,204 points per dollar. MiniMax M3 is arguably the sweet spot, with near-top quality at $0.207 a task and 442 points per dollar. The takeaway: if you only care about peak quality, GLM 5.2 and MiniMax M3 now match Sonnet 4.6. If you care about value, the picture is mixed: MiniMax M3 and Qwen3.7-Plus win comfortably, GLM 5.2 (318 points per dollar) is only marginally ahead of Sonnet 4.6 (307), and Kimi K2.7-code falls well behind it.
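To see how much of each cost-per-task figure the output tokens could explain, the sketch below multiplies output tokens by the output list price. It assumes the list prices in the table were applied as-is, with no caching or volume discounts, which the article does not confirm. The names `USAGE`, `output_cost` and `output_share` are illustrative.

```python
# (output tokens per task, output price in USD per million tokens,
#  cost per task in USD), copied from the table.
USAGE = {
    "GLM 5.2": (8813, 4.40, 0.289),
    "MiniMax M3": (8952, 1.20, 0.207),
    "Sonnet 4.6": (6841, 15.00, 0.296),
    "Kimi K2.7-code": (21787, 4.00, 0.661),
    "Qwen3.7-Plus": (12296, 1.60, 0.068),
}


def output_cost(tokens, price_per_mtok):
    """Cost in USD of `tokens` output tokens at a per-million-token price."""
    return tokens * price_per_mtok / 1_000_000


def output_share(usage):
    """Return {model: fraction of cost per task attributable to output tokens}."""
    return {
        model: output_cost(tokens, price) / cost
        for model, (tokens, price, cost) in usage.items()
    }


shares = output_share(USAGE)
for model, share in shares.items():
    print(f"{model:15s} output tokens = {share:.0%} of cost per task")
```

Under that assumption, output tokens account for about a third of the cost per task for Sonnet 4.6 and well under a fifth for GLM 5.2, MiniMax M3 and Kimi K2.7-code. The remainder would presumably be input tokens accumulated across turns, but the table does not report input token counts, so this split is an estimate rather than a reported figure.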