24
Scanners
3 categories
26
Repositories
Python · Type 1
92.4
Best F3 (strict)
Kolega Enterprise
95.3
Highest recall %
Kolega Enterprise
93.2
Highest precision %
Grok 4.20
Leaderboard
ranked by active metric

| # | Scanner | F3 | Recall % | Prec % | Repos | Cost $ |
|---|---|---|---|---|---|---|
Precision vs. recall
Performance vs. cost
F3 vs cost
Recall ranking
fraction of vulnerabilities found
Precision ranking
fraction of flags that were real
By category
three-tier summary
Detection by vulnerability class
recall %, best by approach
▸ LLM-based scanners dominate classes that need semantic data-flow understanding: SQL injection, command injection, insecure deserialization.
▸ Rule-based tools stay competitive only on syntactic patterns, and even there overall recall remains low.
Dataset composition
697 vulnerabilities · 120 FP traps · 26 repositories
Findings
Real vulnerabilities
FP traps (14.7%)
18
CWE families
20,062
Python LOC
Frameworks (26 repos)
Scanner categories
5
Frameworks
24
Scanners tested
All figures are live RealVuln results across 24 scanners and 26 repositories. F3 is the F-beta score with β = 3, so recall carries nine times the weight of precision (β² = 9); strict mode counts unfinished repositories as misses. Cost is the total spend for the scored run (rule-based tools are free or variably priced).
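To make the scoring note concrete, here is a minimal Python sketch of an F3 computation with a strict mode. It uses the standard F-beta formula; the `RepoResult` structure, the `score` function, the pooling of counts across repositories, and the non-strict behaviour of skipping unfinished repositories are all assumptions made for illustration, not RealVuln's actual implementation. Scores here are fractions, while the page reports them as percentages.

```python
from dataclasses import dataclass


@dataclass
class RepoResult:
    """Per-repository counts for one scanner (hypothetical structure)."""
    total_vulns: int       # ground-truth vulnerabilities in the repository
    true_positives: int    # vulnerabilities the scanner reported
    false_positives: int   # flags that were not real, e.g. FP traps
    finished: bool = True  # False if the scan did not complete


def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """Standard F-beta score; beta = 3 weights recall by beta**2 = 9."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return 0.0 if denom == 0 else (1 + b2) * precision * recall / denom


def score(results, strict=True):
    """Pool counts over repositories and return (precision, recall, F3).

    Assumption: in strict mode every vulnerability in an unfinished
    repository counts as a miss; otherwise the repository is skipped.
    """
    tp = fp = fn = 0
    for r in results:
        if not r.finished:
            if strict:
                fn += r.total_vulns
            continue
        tp += r.true_positives
        fp += r.false_positives
        fn += r.total_vulns - r.true_positives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_beta(precision, recall)


if __name__ == "__main__":
    runs = [RepoResult(10, 8, 2), RepoResult(10, 0, 0, finished=False)]
    print(score(runs, strict=True))
    print(score(runs, strict=False))
```

With these made-up counts, strict mode halves recall (0.4 instead of 0.8) and F3 drops from 0.80 to roughly 0.42, which shows how heavily an unfinished repository is penalised when recall dominates the metric. The asymmetry is also visible directly: `f_beta(0.5, 1.0)` is about 0.91, while `f_beta(1.0, 0.5)` is about 0.53.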