realvuln v1.0
Benchmark dashboard 24 scanners · 26 repositories · ranked by F3 (strict)
24
Scanners
3 categories
26
Repositories
Python · Type 1
92.4
Best F3 (strict)
Kolega Enterprise
95.3
Highest recall %
Kolega Enterprise
93.2
Highest precision %
Grok 4.20

Leaderboard

ranked by active metric
# Scanner F3 Recall % Prec % Repos Cost $

Precision vs. recall


Performance vs. cost

F3 vs cost

Recall ranking

fraction of vulnerabilities found

Precision ranking

fraction of flags that were real

By category

three-tier summary

Detection by vulnerability class

recall %, best by approach

LLM-based scanners dominate classes that need semantic data-flow understanding: SQL injection, command injection, and insecure deserialization. Rule-based tools stay competitive only on syntactic patterns, and even there overall recall remains low.

Dataset composition

697 vulnerabilities · 120 FP traps · 26 repositories

Findings

697
Real vulnerabilities
120
FP traps (14.7%)
18
CWE families
20,062
Python LOC

Frameworks (26 repos)

Flask 15
Django 3
FastAPI 3
aiohttp 1
Tornado 1
custom 3

Scanner categories

GP-LLM 19
Rule SAST 3
Sec.-spec. 2
5
Frameworks
24
Scanners tested

All figures are live RealVuln results across 24 scanners and 26 repositories. F3 is the F-beta score with β = 3, which weights recall nine times as heavily as precision (β² = 9); strict mode counts unfinished repositories as misses. Cost is the total spend for the scored run (rule-based tools are free or variably priced).
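As a rough illustration of the footnote above, the sketch below computes an F-beta score with β = 3 and applies one possible reading of strict mode. The function names, the (tp, fp, fn) tuples, and the rule that every vulnerability in an unfinished repository becomes a false negative are assumptions made for this example. They are not taken from the RealVuln scoring code, and the numbers are invented.

```python
def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """Standard F-beta score; beta = 3 weights recall beta**2 = 9 times as much as precision."""
    if precision <= 0.0 and recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def strict_counts(finished: dict, vulns_per_repo: dict) -> tuple:
    """Aggregate (tp, fp, fn) over all repositories.

    finished:       repo -> (tp, fp, fn) for repositories the scanner completed
    vulns_per_repo: repo -> number of real vulnerabilities in that repository

    ASSUMPTION: a repository missing from `finished` contributes all of its
    vulnerabilities as false negatives (one reading of "counts as misses").
    """
    tp = fp = fn = 0
    for repo, total in vulns_per_repo.items():
        if repo in finished:
            r_tp, r_fp, r_fn = finished[repo]
            tp, fp, fn = tp + r_tp, fp + r_fp, fn + r_fn
        else:
            fn += total  # unfinished repository: every vulnerability is a miss
    return tp, fp, fn


# Hypothetical data: the scanner finished repo "a" but not repo "b".
tp, fp, fn = strict_counts({"a": (8, 2, 2)}, {"a": 10, "b": 10})
precision = tp / (tp + fp)
recall = tp / (tp + fn)
strict_f3 = f_beta(precision, recall)
```

With these made-up counts, precision is unaffected by the unfinished repository while recall is halved, so the strict F3 drops sharply. That is the behaviour one would expect from a metric that favours recall this heavily.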