realvuln v1.0
03 · Findings

Results & analysis

Across 24 scanners on identical ground truth, a three-tier ordering emerges and holds under every metric. The within-tier rankings, and several second-order effects, are documented below. The full per-scanner table is on the dashboard.

3.1

A three-tier hierarchy

The best Security-Specialized scanner reaches F3 92.4 (strict), with the highest recall in the benchmark (0.95). The best General-Purpose LLM (GPT-5.5, F3 60.2) achieves more than 3× the F3 of the best rule-based tool (Semgrep, 19.4). The ordering — Security-Specialized > General-Purpose LLM > Rule-Based SAST — holds under F1, F2, and F3.

The two Security-Specialized systems take the top two places: Kolega Enterprise reaches F3 92.4 (strict) at 0.95 recall, and Kolega v0.0.1 follows at F3 73.0. Both clear the field by a wide margin: the best General-Purpose LLM (GPT-5.5, 60.2) trails Kolega v0.0.1 by 12.8 F3 points, and the best rule-based tool (Semgrep, 19.4) trails it by more than 50.

The standard/strict mode matters more than the F2/F3 choice for the leaderboard's middle. Strict scoring counts a scanner's unfinished repositories as misses, so models that completed all 26 repos rise relative to those that timed out — e.g. Claude Opus 4.6 scores F3 59.7 on the 19 repos it finished, but 47.2 under strict scoring.
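The gap between standard and strict scoring can be illustrated with a minimal sketch. It assumes counts are pooled across repositories (micro-averaging) and that strict mode adds every labelled vulnerability in an unfinished repository as a false negative. The `RepoResult` type, the function names, and the toy counts are hypothetical and are not taken from the realvuln harness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RepoResult:
    """Per-repository counts; tp and fp stay None when the scan did not finish."""
    ground_truth: int          # labelled vulnerabilities in the repository
    tp: Optional[int] = None   # true positives, None if unfinished
    fp: Optional[int] = None   # false positives, None if unfinished


def f_beta(tp: int, fp: int, fn: int, beta: float = 3.0) -> float:
    """F-beta from pooled counts; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


def score(results: list[RepoResult], strict: bool, beta: float = 3.0) -> float:
    """Standard mode skips unfinished repos; strict mode counts them as misses."""
    tp = fp = fn = 0
    for r in results:
        if r.tp is None:
            if strict:
                fn += r.ground_truth  # assumption: every labelled vuln is a miss
            continue
        tp += r.tp
        fp += r.fp or 0
        fn += r.ground_truth - r.tp
    return f_beta(tp, fp, fn, beta)


# Toy data: two finished repositories and one that timed out.
toy = [RepoResult(10, tp=6, fp=2), RepoResult(10, tp=7, fp=1), RepoResult(10)]
standard = score(toy, strict=False)
strict_score = score(toy, strict=True)
print(f"standard F3 = {standard:.3f}, strict F3 = {strict_score:.3f}")
```

In this toy case the timed-out repository leaves the standard score untouched but adds ten false negatives under strict scoring, which is the mechanism behind the drop described above.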

Top, F3 strict: Kolega Enterprise, 92.4
Best LLM, F3 strict: GPT-5.5, 60.2
Best rule-based: Semgrep, 19.4

3.2

Where the advantage originates

The per-CWE breakdown shows LLM-based scanners dominate on classes that require semantic understanding of data flow; rule-based tools remain competitive only on weaknesses that reduce to syntactic patterns.

Weakness class (recall)    LLM-based   Rule-based
SQL injection              100%        63%
Insecure deserialization   100%        88%

Syntactic patterns

Rule-based tools stay competitive on weaknesses like hardcoded secrets, but their overall recall remains low even there.
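A per-CWE recall breakdown of this kind could be computed as in the sketch below. It assumes each ground-truth entry carries an identifier and a CWE label, and that a scanner's matched findings are available as a set of identifiers. The function name, the data layout, and the toy entries are hypothetical.

```python
from collections import defaultdict


def recall_by_cwe(ground_truth, detected_ids):
    """Return {cwe: recall} given (vuln_id, cwe) pairs and the set of detected ids."""
    total = defaultdict(int)
    hit = defaultdict(int)
    for vuln_id, cwe in ground_truth:
        total[cwe] += 1
        if vuln_id in detected_ids:
            hit[cwe] += 1
    return {cwe: hit[cwe] / total[cwe] for cwe in total}


# Toy ground truth: CWE-89 (SQL injection), CWE-502 (deserialization),
# CWE-798 (hardcoded credentials).
truth = [("v1", "CWE-89"), ("v2", "CWE-89"), ("v3", "CWE-502"), ("v4", "CWE-798")]
found = {"v1", "v3"}
per_cwe = recall_by_cwe(truth, found)
print(per_cwe)
```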

3.3

Cost-efficiency

Cost is the total USD spend for the scored run. There is no clean correlation between price and detection: the cheapest capable model lands within a few points of the most expensive, and the priciest LLM does not lead its tier by much.

Scanner               Category              F3 (strict)  Run cost
Kolega.Dev v0.0.1     Security-Specialized  73.0
GPT-5.5               GP-LLM                60.2         $66.45
DeepSeek V4 Flash     GP-LLM                56.5         $0.96
Kimi K2.6             GP-LLM                53.9         $6.24
Claude Opus 4.8       GP-LLM                53.6         $35.65
DeepSeek V4 Pro       GP-LLM                52.9         $9.59
Gemini 3.5 Flash      GP-LLM                47.6         $27.99
Kimi K2.5             GP-LLM                46.0         $2.17
Minimax M2.7          GP-LLM                38.2         $1.11
Grok 4.20 Reasoning   GP-LLM                27.7         $16.82
Semgrep               Rule SAST             19.4         free

DeepSeek V4 Flash reaches F3 56.5 for $0.96, within four points of the top LLM (GPT-5.5, 60.2) at about 1.4% of its $66.45 cost. Minimax M2.7 runs the full benchmark for $1.11. Spend is not a proxy for detection.
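The arithmetic behind these comparisons can be reproduced from the table. The sketch below uses the listed F3 (strict) and run-cost values for the General-Purpose LLMs only. The "F3 points per dollar" ratio is an illustrative measure introduced here, not a metric the benchmark reports.

```python
# (scanner, F3 strict, run cost in USD), copied from the cost table.
rows = [
    ("GPT-5.5", 60.2, 66.45),
    ("DeepSeek V4 Flash", 56.5, 0.96),
    ("Kimi K2.6", 53.9, 6.24),
    ("Claude Opus 4.8", 53.6, 35.65),
    ("DeepSeek V4 Pro", 52.9, 9.59),
    ("Gemini 3.5 Flash", 47.6, 27.99),
    ("Kimi K2.5", 46.0, 2.17),
    ("Minimax M2.7", 38.2, 1.11),
    ("Grok 4.20 Reasoning", 27.7, 16.82),
]

# Rank by F3 points per dollar (illustrative ratio, highest first).
ranked = sorted(rows, key=lambda r: r[1] / r[2], reverse=True)

top_name, top_f3, top_cost = max(rows, key=lambda r: r[1])        # highest F3
cheap_name, cheap_f3, cheap_cost = min(rows, key=lambda r: r[2])  # lowest cost

gap = top_f3 - cheap_f3             # F3 points separating the two
cost_ratio = cheap_cost / top_cost  # cheapest run as a fraction of the top run

print(f"{cheap_name} trails {top_name} by {gap:.1f} points "
      f"at {cost_ratio:.1%} of the cost")
for name, f3, cost in ranked[:3]:
    print(f"{name}: {f3 / cost:.1f} F3 points per dollar")
```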

3.4

Reliability & the precision–recall trade-off

Reliability

Bigger models do not always win

Claude Opus 4.6 completed only 19 of 26 repositories, which drops its strict F3 to 47.2, below the 50.9 of Sonnet, whose score reflects fuller coverage. Several frontier models failed to return results for 15–27% of repositories. Selection should favour reliability-adjusted performance, not peak capability.

The trade-off

Conservatism carries a cost

Grok 4.20 Reasoning posts the highest precision in the benchmark (0.932) but recalls only 26% of vulnerabilities. SonarQube fares worse on both axes: reasonable precision (0.611) but negligible recall (0.063). The recall-weighted F3 metric rewards coverage over conservatism.
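How strongly F3 weights recall can be seen by applying the standard F-beta formula to the precision and recall figures quoted above. This is a minimal sketch: the function name is hypothetical, and because the inputs are rounded and the leaderboard uses strict scoring, the results will not match the published F3 values exactly.

```python
def f_beta_from_pr(precision: float, recall: float, beta: float = 3.0) -> float:
    """Standard F-beta: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as quoted in the text (rounded).
grok_f3 = f_beta_from_pr(0.932, 0.26)
grok_f1 = f_beta_from_pr(0.932, 0.26, beta=1.0)
sonar_f3 = f_beta_from_pr(0.611, 0.063)

print(f"Grok 4.20 Reasoning: F1 = {grok_f1:.3f}, F3 = {grok_f3:.3f}")
print(f"SonarQube: F3 = {sonar_f3:.3f}")
```

With beta set to 3, recall is weighted nine times as heavily as precision in the denominator, so a scanner with high precision and low recall scores far lower on F3 than on F1.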

Threats to validity. The target repositories are public, so some may appear in the training data of the evaluated LLMs; memorized vulnerability locations could inflate recall. Two mitigations apply: the 120 false-positive traps test discrimination rather than recall, and the v2 roadmap adds repositories published after model training cutoffs. The full set of limitations — Python-only scope, label subjectivity, LLM non-determinism, and default configurations — is documented in the paper.