03 · Findings

Results & analysis

Across 22 scanners on identical ground truth, a three-tier ordering emerges and holds under every metric. The within-tier rankings, and several second-order effects, are documented below. The full per-scanner table is on the dashboard.

3.1 A three-tier hierarchy

The best Security-Specialized scanner reaches F3 86.5 (strict), at recall 0.89. The best General-Purpose LLM (GPT-5.5, F3 56.7) clears the best rule-based tool (SonarQube, F3 14.4) by a wide margin. The ordering — Security-Specialized > General-Purpose LLM > Rule-Based SAST — holds under F1, F2, and F3.

Security-Specialized systems take the top of the table, led by Kolega DevSec Max V0.0.1. The best General-Purpose LLM trails it by 29.8 F3 points (86.5 vs 56.7), and the best rule-based tool trails it by 72.1 (86.5 vs 14.4).

For the middle of the leaderboard, the choice between standard and strict scoring matters more than the choice between F2 and F3. Strict scoring counts every repository a scanner failed to finish as a miss, so models that completed all 66 repositories rise relative to those that timed out.
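The effect of strict scoring can be illustrated with a minimal sketch. It assumes pooled (micro-averaged) counts across repositories and assumes that standard mode simply skips unfinished repositories; the benchmark's actual aggregation is not specified here and may differ. `RepoResult`, `score`, and the sample counts are hypothetical names and data used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RepoResult:
    """Counts for one repository; `completed` is False if the scan timed out."""
    true_positives: int
    false_positives: int
    ground_truth: int  # number of labelled vulnerabilities in the repository
    completed: bool = True


def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def score(results: list, beta: float = 3.0, strict: bool = True) -> float:
    """Pooled F-beta. In strict mode, unfinished repositories add misses only."""
    tp = fp = fn = 0
    for r in results:
        if r.completed:
            tp += r.true_positives
            fp += r.false_positives
            fn += r.ground_truth - r.true_positives
        elif strict:
            fn += r.ground_truth  # every labelled vulnerability counts as a miss
        # Assumption: standard mode leaves unfinished repositories out entirely.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return f_beta(precision, recall, beta)


# Hypothetical scanner: one repository finished, one timed out.
runs = [
    RepoResult(true_positives=9, false_positives=1, ground_truth=10),
    RepoResult(true_positives=0, false_positives=0, ground_truth=10, completed=False),
]
standard_f3 = score(runs, strict=False)
strict_f3 = score(runs, strict=True)
print(f"standard F3 = {standard_f3:.3f}, strict F3 = {strict_f3:.3f}")
```

In this toy case precision is unchanged between modes, but recall halves under strict scoring because the ten vulnerabilities in the unfinished repository all count as misses.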

Top, F3 strict: Kolega DevSec Max V0.0.1, 86.5

Best LLM, F3 strict: GPT-5.5, 56.7

Best rule-based: SonarQube, 14.4

3.2 Where the advantage originates

The per-CWE breakdown shows LLM-based scanners dominate on classes that require semantic understanding of data flow; rule-based tools remain competitive only on weaknesses that reduce to syntactic patterns.

SQL injection recall: LLM-based 96%, rule-based 37%

Insecure deserialization recall: LLM-based 100%, rule-based 57%

Syntactic patterns

Rule-based tools stay competitive on weaknesses like hardcoded secrets, but their overall recall remains low even there.
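A per-CWE recall breakdown of this kind can be computed along the following lines. This is a sketch under assumptions: the label format, the `recall_by_cwe` helper, and the sample identifiers are hypothetical, and the benchmark's own matching rules for deciding when a finding counts as a hit are not reproduced here.

```python
from collections import defaultdict


def recall_by_cwe(labels, detected_ids):
    """Return {cwe: recall} given labelled vulnerabilities and a scanner's hits.

    labels: iterable of (vuln_id, cwe) pairs from the ground truth.
    detected_ids: set of vuln_ids the scanner reported correctly.
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    for vuln_id, cwe in labels:
        total[cwe] += 1
        if vuln_id in detected_ids:
            hits[cwe] += 1
    return {cwe: hits[cwe] / total[cwe] for cwe in total}


# Hypothetical ground truth: four SQL injection labels (CWE-89) and
# two insecure deserialization labels (CWE-502).
labels = [
    ("v1", "CWE-89"), ("v2", "CWE-89"), ("v3", "CWE-89"), ("v4", "CWE-89"),
    ("v5", "CWE-502"), ("v6", "CWE-502"),
]
detected = {"v1", "v2", "v3", "v5"}  # hypothetical scanner output

per_cwe = recall_by_cwe(labels, detected)
for cwe, recall in sorted(per_cwe.items()):
    print(f"{cwe}: recall {recall:.0%}")
```

Running the same function over each scanner category's pooled findings would give the LLM-based versus rule-based comparison per weakness class.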

3.3 Cost-efficiency

Cost is the total USD spend for the scored run. There is no clean correlation between price and detection: the cheapest capable model lands within a few points of the most expensive, and the priciest LLM does not lead its tier by much.

Scanner	Category	F3 (strict)	Run cost
Kolega.Dev v0.0.1	Security-Specialized	73.0	—
GPT-5.5	GP-LLM	60.2	$66.45
DeepSeek V4 Flash	GP-LLM	56.5	$0.96
Kimi K2.6	GP-LLM	53.9	$6.24
Claude Opus 4.8	GP-LLM	53.6	$35.65
DeepSeek V4 Pro	GP-LLM	52.9	$9.59
Gemini 3.5 Flash	GP-LLM	47.6	$27.99
Kimi K2.5	GP-LLM	46.0	$2.17
Minimax M2.7	GP-LLM	38.2	$1.11
Grok 4.20 Reasoning	GP-LLM	27.7	$16.82
Semgrep	Rule SAST	19.4	free

DeepSeek V4 Flash reaches F3 56.5 for $0.96, which is within four points of the top LLM (GPT-5.5, 60.2) at about 1.4% of its $66.45 cost. Minimax M2.7 runs the full benchmark for $1.11. Spend is not a proxy for detection.
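The comparison can be checked directly from the table. The sketch below copies the F3 (strict) and run-cost columns for the General-Purpose LLMs and ranks them by F3 points per dollar. That ratio is an illustrative measure introduced here, not one the benchmark reports.

```python
# (scanner, F3 strict, run cost in USD), copied from the table above.
# Kolega.Dev and Semgrep are omitted: one has no listed cost, the other is free.
rows = [
    ("GPT-5.5", 60.2, 66.45),
    ("DeepSeek V4 Flash", 56.5, 0.96),
    ("Kimi K2.6", 53.9, 6.24),
    ("Claude Opus 4.8", 53.6, 35.65),
    ("DeepSeek V4 Pro", 52.9, 9.59),
    ("Gemini 3.5 Flash", 47.6, 27.99),
    ("Kimi K2.5", 46.0, 2.17),
    ("Minimax M2.7", 38.2, 1.11),
    ("Grok 4.20 Reasoning", 27.7, 16.82),
]

# Rank by F3 points per dollar (illustrative metric, highest first).
ranked = sorted(rows, key=lambda r: r[1] / r[2], reverse=True)
for name, f3, cost in ranked:
    print(f"{name:<22} F3 {f3:5.1f}  ${cost:6.2f}  {f3 / cost:6.2f} pts/$")

# Compare the cheapest strong model against the top-scoring LLM.
flash = next(r for r in rows if r[0] == "DeepSeek V4 Flash")
top = max(rows, key=lambda r: r[1])
gap = top[1] - flash[1]
cost_ratio = flash[2] / top[2]
print(f"gap to {top[0]}: {gap:.1f} F3 points at {cost_ratio:.1%} of the cost")
```

Points per dollar favours cheap models by construction, so it is best read alongside the absolute F3 column and not as a ranking in its own right.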

3.4 Reliability & the precision–recall trade-off

Reliability

Bigger models do not always win

Models that fail to complete all 66 repositories drop under strict scoring, since unscored repositories count as misses — coverage, not just per-repo accuracy, matters. Several frontier models failed to return results for 15–27% of repositories. Selection should favour reliability-adjusted performance, not peak capability.

The trade-off

Conservatism carries a cost

Grok 4.20 Reasoning posts the highest precision in the benchmark (0.932) but a recall of only 0.26. SonarQube fails differently: reasonable precision (0.611) and negligible recall (0.063). The recall-weighted F3 metric rewards coverage over conservatism.
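To see how heavily recall weighting penalises a conservative scanner, the standard F-beta formula can be applied to the precision and recall figures quoted above. This is a rough sketch: the inputs are the rounded values from the text, and the results need not match the leaderboard's F3 scores, which may be aggregated differently (for example per repository, or under strict scoring).

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) as quoted in the text, rounded.
profiles = {
    "Grok 4.20 Reasoning": (0.932, 0.26),
    "SonarQube": (0.611, 0.063),
}

scores = {
    name: {beta: f_beta(p, r, beta) for beta in (1, 2, 3)}
    for name, (p, r) in profiles.items()
}
for name, by_beta in scores.items():
    line = ", ".join(f"F{beta} = {value:.3f}" for beta, value in by_beta.items())
    print(f"{name}: {line}")
```

Because both scanners have precision well above recall, their scores fall as beta increases: the more the metric weights recall, the less their precision helps them.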

Threats to validity. The target repositories are public, so some may appear in the training data of the evaluated LLMs; memorized vulnerability locations could inflate recall. Two mitigations apply: the 120 false-positive traps test discrimination rather than recall, and the v2 roadmap adds repositories published after model training cutoffs. The full set of limitations — Python-only scope, label subjectivity, LLM non-determinism, and default configurations — is documented in the paper.
