RealVuln
An open benchmark measuring how Rule-Based SAST, General-Purpose LLM, and Security-Specialized scanners perform on real-world vulnerable code.
Leaderboard
Open full dashboard →
24 scanners on identical pinned commits across 26 repositories. The headline metric is F3 (strict): an F-beta score with β = 3, so recall is weighted nine times as heavily as precision, and unfinished repositories are counted as misses. Switch between F2 and F3 to re-rank.
Metric definitions →
| # | Scanner | F3 | Recall % | Prec % | Repos | Cost $ |
|---|---|---|---|---|---|---|
Amber repo counts mark runs that did not complete all 26 repositories; under strict scoring their unscored repos count as misses. Cost is the total USD spend for the scored run; rule-based tools are free or variably priced (—).
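As a rough illustration of how a strict, recall-weighted score behaves, the sketch below computes F-beta from raw counts and folds unscored repositories in as misses. The function names, the dictionary layout, the example numbers, and the choice to pool counts across repositories before scoring are all assumptions made for this example; the benchmark's own aggregation is described under Metric definitions and may differ.

```python
def f_beta(tp, fp, fn, beta=3.0):
    """F-beta from raw counts; beta > 1 weights recall more than precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def strict_counts(per_repo, ground_truth_sizes):
    """Pool counts across repos (an assumption of this sketch).

    Every labeled vulnerability in a repo the scanner did not finish
    is added to the false negatives.
    """
    tp = fp = fn = 0
    for repo, total_labeled in ground_truth_sizes.items():
        if repo in per_repo:
            counts = per_repo[repo]
            tp += counts["tp"]
            fp += counts["fp"]
            fn += counts["fn"]
        else:
            fn += total_labeled  # unscored repo: all findings count as misses
    return tp, fp, fn


# Hypothetical data: two repos with 10 labeled findings each,
# and a scanner that only completed repo_a.
ground_truth_sizes = {"repo_a": 10, "repo_b": 10}
per_repo = {"repo_a": {"tp": 8, "fp": 2, "fn": 2}}

standard_f3 = f_beta(8, 2, 2, beta=3)                       # scored repos only
strict_f3 = f_beta(*strict_counts(per_repo, ground_truth_sizes), beta=3)
strict_f2 = f_beta(*strict_counts(per_repo, ground_truth_sizes), beta=2)
```

In this made-up case the scanner drops from 0.80 on the repository it finished to roughly 0.42 under strict F3, which shows why an incomplete run can move a scanner down the table even when its per-repository results are strong.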
Precision vs. recall for all 24 scanners. Security-Specialized systems (violet) reach the high-recall right side of the plot; General-Purpose LLMs cluster in the center with strong precision; Rule-Based tools occupy the low-recall left.
Plot annotations: breadth-driven, high variance, syntactic only.
A three-tier ordering — Security-Specialized > General-Purpose LLM > Rule-Based SAST — holds under both F2 and F3, though within-tier rankings shift with the metric and strict/standard mode. Read the analysis →
Documentation
Benchmark design & scoring
Target-type taxonomy, ground-truth labeling, the matching algorithm, and why the primary metric is recall-weighted F3.
The corpus
26 hand-labeled Python repositories, the ground-truth schema, false-positive traps, and framework coverage.
Results & analysis
The three-tier hierarchy, per-CWE detection, cost-efficiency, reliability, and the precision–recall trade-off.
Living benchmark & contributing
Versioning, the v2 roadmap, the authorship research question, how to contribute, and how to cite.