Living benchmark · v1.0 · MIT License · arXiv:2604.13764 · March 2026

RealVuln

An open benchmark measuring how Rule-Based SAST, General-Purpose LLM, and Security-Specialized scanners perform on real-world vulnerable code.

697 vulnerabilities · 120 false-positive traps · 26 repositories · 20,062 Python LOC · 25 scanners tested

Leaderboard


24 scanners were run on identical pinned commits across 26 repositories. The headline metric is F3 (strict): recall is weighted nine times as heavily as precision, and unfinished repositories count as misses. Rankings can also be computed under F2.
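The "nine times" figure follows from the usual F-beta definition with beta = 3. The page does not spell out its formula here, so the following assumes the standard form, where P is precision and R is recall:

```latex
F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}
\qquad\Longleftrightarrow\qquad
\frac{1}{F_\beta} = \frac{1}{1+\beta^2}\cdot\frac{1}{P} + \frac{\beta^2}{1+\beta^2}\cdot\frac{1}{R}

F_3 = \frac{10\,P\,R}{9P + R}
\qquad
F_2 = \frac{5\,P\,R}{4P + R}
```

In the weighted harmonic mean on the right, recall carries weight beta^2 relative to precision: 9 for F3 and 4 for F2. This is why a scanner with high recall and modest precision can outrank a more precise but less complete one.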

Columns: # · Scanner · F3 · Recall (%) · Precision (%) · Repos · Cost (USD)

Amber repo counts mark runs that did not complete all 26 repositories; under strict scoring their unscored repos count as misses. Cost is the total USD spend for the scored run; rule-based tools are free or variably priced (—).
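A minimal sketch of how strict scoring could penalize an incomplete run. It assumes counts are pooled across repositories (micro-averaged) and that standard mode simply skips unfinished repositories; neither detail is stated on this page, and `RepoResult`, `f_beta`, and `score` are hypothetical names, not part of the benchmark's code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RepoResult:
    """Per-repository counts for one scanner (hypothetical structure)."""
    labeled_vulns: int                      # ground-truth vulnerabilities in the repo
    true_positives: Optional[int] = None    # None means the run did not complete
    false_positives: Optional[int] = None


def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """Standard F-beta; returns 0.0 when the denominator is zero."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return 0.0 if denom == 0 else (1 + b2) * precision * recall / denom


def score(results: Sequence[RepoResult], strict: bool = True, beta: float = 3.0) -> float:
    """Pool counts across repos, then compute F-beta.

    Assumption: in strict mode every labeled vulnerability in an unfinished
    repo is a false negative; in standard mode the repo is skipped.
    """
    tp = fp = fn = 0
    for r in results:
        if r.true_positives is None:        # unfinished run
            if strict:
                fn += r.labeled_vulns
            continue
        tp += r.true_positives
        fp += r.false_positives or 0
        fn += r.labeled_vulns - r.true_positives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return f_beta(precision, recall, beta)


# Illustrative data only: one finished repo and one unfinished repo.
runs = [RepoResult(10, true_positives=8, false_positives=2), RepoResult(10)]
standard_f3 = score(runs, strict=False)
strict_f3 = score(runs, strict=True)
```

With these made-up counts, the unfinished repository leaves precision unchanged but halves recall under strict mode, so the strict F3 is lower than the standard F3.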

Precision vs. recall, all 24 scanners. Security-Specialized systems (violet) reach the high-recall right; General-Purpose LLMs cluster in the center with strong precision; Rule-Based tools occupy the low-recall left.

Security-Specialized · 2 systems · best F3 92.4 · recall to 0.95 · breadth-driven
General-Purpose LLM · 19 models · best F3 60.2 · range 21.0–60.2 · high variance
Rule-Based SAST · 3 tools · best F3 19.4 · recall ≤ 0.19 · syntactic only

A three-tier ordering (Security-Specialized > General-Purpose LLM > Rule-Based SAST) holds under both F2 and F3, though within-tier rankings shift with the metric and with strict versus standard mode.