01 · Methodology
Benchmark design & scoring
Every design choice optimizes for one property: results that can be audited line by line and reproduced exactly. This page documents the taxonomy, labeling, matching, and metrics.
1.3
Target-type taxonomy
Five target types classify the code under test along a code-realism axis. Version 1.0 covers only Type 1, which offers high vulnerability density, auditable labels, and resistance to training-data contamination. Types 2–5 are planned. A second axis records code authorship (human-authored, LLM-assisted, or LLM-generated); all v1.0 targets are human-authored.
- Type 1 · Intentionally vulnerable apps (v1.0 · live)
- Educational & CTF projects: PyGoat, DVPWA, VAmPI. High vulnerability density, diverse CWE coverage.
- Type 2 · Previously-vulnerable platforms (v2 · planned)
- Production apps pinned to pre-patch commits with disclosed CVEs, at realistic vulnerability density.
- Type 3 · Previously-vulnerable libraries (v2 · planned)
- Open-source libraries pinned to known-vulnerable versions that were later patched for disclosed CVEs.
- Type 4 · Benchmark roll-ups (v2 · planned)
- Existing benchmarks (OWASP Benchmark, NIST Juliet) imported and re-scored under RealVuln metrics.
- Type 5 · Academic reproductions (future)
- Published scanner evaluations encoded as reproducible configurations.
1.4
Ground-truth labeling
Every ground-truth entry was produced by manual review. Each labeled entry records the following fields:
- id · is_vulnerable
- A unique identifier and a boolean flag distinguishing real vulnerabilities from false-positive traps.
- primary_cwe
- The most precise weakness, plus an acceptable_cwes list of alternative CWE identifiers a scanner may reasonably report for the same flaw.
- location
- File path and a start_line/end_line range pinpointing the code.
- severity
- One of critical, high, medium, or low.
- evidence
- The annotation source (manual review, CVE id, or published walkthrough) and a free-text rationale for why the code is, or is not, vulnerable.
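The fields above can be sketched as a small Python data model. This is an illustration only, not the benchmark's actual schema: the class names, the nested shape of location and evidence, the accepts_cwe helper, and every value in the example entry are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Tuple

# The four severity levels named in the field list above.
SEVERITIES = ("critical", "high", "medium", "low")


@dataclass(frozen=True)
class Location:
    file_path: str
    start_line: int
    end_line: int


@dataclass(frozen=True)
class Evidence:
    source: str      # "manual review", a CVE id, or a published walkthrough
    rationale: str   # free text: why the code is, or is not, vulnerable


@dataclass(frozen=True)
class LabeledEntry:
    id: str
    is_vulnerable: bool            # False marks a false-positive trap
    primary_cwe: str
    location: Location
    severity: str
    evidence: Evidence
    acceptable_cwes: Tuple[str, ...] = ()

    def __post_init__(self) -> None:
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if self.location.start_line > self.location.end_line:
            raise ValueError("start_line must not exceed end_line")

    def accepts_cwe(self, reported_cwe: str) -> bool:
        """Hypothetical helper: True if a reported CWE is the primary or an accepted alternative."""
        return reported_cwe == self.primary_cwe or reported_cwe in self.acceptable_cwes


# Hypothetical entry; the id, path, line range, and CWE choices are invented.
example = LabeledEntry(
    id="example-001",
    is_vulnerable=True,
    primary_cwe="CWE-89",
    acceptable_cwes=("CWE-943",),
    location=Location("app/views.py", 40, 47),
    severity="high",
    evidence=Evidence("manual review", "User input is concatenated into a SQL string."),
)
print(example.accepts_cwe("CWE-943"))
```

Keeping acceptable_cwes separate from primary_cwe lets the label stay precise while still crediting a scanner that reports a reasonable neighboring weakness.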
False-positive traps. 120 of the 817 entries (14.7%) are code patterns that appear suspicious but are demonstrably safe — for example, a login function passing user input to an ORM's filter_by(), which auto-parameterizes the query. Flagging a trap is penalized as a false positive, and the traps double as true negatives for false-positive-rate computation.
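As a rough sketch of how traps could feed a false-positive rate, the snippet below assumes the conventional definition FPR = FP / (FP + TN) computed over trap entries only: a flagged trap counts as a false positive and an unflagged trap as a true negative. The function name, the id format, and the scanner output are hypothetical, and how the benchmark treats findings that match no labeled entry is not covered here.

```python
from typing import Iterable


def false_positive_rate(trap_ids: Iterable[str], flagged_ids: Iterable[str]) -> float:
    """FPR over trap entries: flagged traps are FPs, unflagged traps are TNs."""
    traps = set(trap_ids)
    if not traps:
        raise ValueError("at least one trap entry is required")
    fp = len(traps & set(flagged_ids))   # traps the scanner wrongly flagged
    tn = len(traps) - fp                 # traps the scanner correctly left alone
    return fp / (fp + tn)


# 120 traps as in the dataset; the 30 flagged ids are invented for illustration.
traps = [f"trap-{i}" for i in range(120)]
flagged = [f"trap-{i}" for i in range(30)] + ["vuln-7", "vuln-9"]
print(false_positive_rate(traps, flagged))
```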
1.6
Scoring
Two base metrics underpin scoring. Precision is the fraction of flagged findings that were real; recall is the fraction of real vulnerabilities found. The Fβ family combines them as a weighted harmonic mean, where β controls the trade-off.
In security the costs are asymmetric: a single missed vulnerability can lead to a breach, while a false positive costs an analyst minutes. F1 (β=1) weights precision and recall equally, implicitly assuming those costs are equal. RealVuln's primary metric is F3 (β=3), which weights recall β² = 9 times as heavily as precision. F1 and F2 are also reported throughout so results can be re-ranked under a different precision/recall preference.
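To make the effect of β concrete, here is a minimal sketch of the standard Fβ formula, Fβ = (1 + β²)·P·R / (β²·P + R), applied to two invented scanners. The scanner names and their precision and recall values are hypothetical and do not come from any RealVuln result.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall; returns a value in [0, 1]."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical scanners: one precise but shallow, one noisy but thorough.
scanners = {
    "precise": (0.90, 0.40),   # (precision, recall)
    "thorough": (0.40, 0.80),
}

for name, (p, r) in scanners.items():
    scores = {f"F{b}": round(100 * f_beta(p, r, b), 1) for b in (1, 2, 3)}
    print(name, scores)
```

Under F1 the precise scanner edges ahead, while under F3 the thorough scanner wins by a wide margin, which is the re-ranking behavior the paragraph above describes.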
Aggregation. The conservative strict_micro mode pools confusion-matrix counts across repositories and treats any repository a scanner failed to complete as all-false-negatives. All headline scores use this strict mode.
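The strict pooling rule can be sketched as follows. This is an assumption-laden illustration rather than the benchmark's scoring code: RepoResult, its fields, and strict_micro_f3 are hypothetical names, and the counts in the example are invented.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RepoResult:
    total_vulnerable: int    # ground-truth vulnerabilities in the repository
    tp: int = 0              # real vulnerabilities the scanner found
    fp: int = 0              # flagged findings that were not real
    completed: bool = True   # False if the scanner failed on this repository


def strict_micro_f3(results: Iterable[RepoResult]) -> float:
    """Pool counts across repositories, then compute F3 scaled to [0, 100]."""
    tp = fp = fn = 0
    for r in results:
        if r.completed:
            tp += r.tp
            fp += r.fp
            fn += r.total_vulnerable - r.tp
        else:
            # Strict mode: a failed repository contributes only false negatives.
            fn += r.total_vulnerable
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return 10 * precision * recall / (9 * precision + recall) * 100


results = [
    RepoResult(total_vulnerable=10, tp=8, fp=2),
    RepoResult(total_vulnerable=10, completed=False),
]
print(round(strict_micro_f3(results), 1))
```

In this invented example the completed repository alone would score 80.0, and the failed repository pulls the pooled score down to roughly 42.1, so a scanner cannot improve its headline number by failing on difficult targets.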
F3 = 10 · P · R / (9P + R) × 100
β = 3 · recall weighted 9× · scaled to [0, 100]
F2 = 5·PR / (4P+R); F3 = 10·PR / (9P+R)
Per-CWE-family and per-severity breakdowns are computed alongside the aggregates.
$ python score.py --repo realvuln-pygoat --all-scanners
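A per-severity breakdown like the one mentioned above could be computed by grouping labeled entries before counting. The sketch below reports recall per group, which is an assumption: the page does not say which metric the breakdowns use. The dictionary keys (in particular "detected") and the sample entries are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def recall_by_group(entries: Iterable[Mapping], key: str) -> Dict[str, float]:
    """Recall per group (e.g. per severity), computed over real vulnerabilities only."""
    found = defaultdict(int)
    total = defaultdict(int)
    for entry in entries:
        if not entry["is_vulnerable"]:
            continue  # traps are not real vulnerabilities, so they do not affect recall
        group = entry[key]
        total[group] += 1
        if entry["detected"]:
            found[group] += 1
    return {group: found[group] / total[group] for group in total}


# Invented entries; "detected" marks whether a scanner matched the entry.
entries = [
    {"severity": "critical", "is_vulnerable": True, "detected": True},
    {"severity": "critical", "is_vulnerable": True, "detected": False},
    {"severity": "high", "is_vulnerable": True, "detected": True},
    {"severity": "low", "is_vulnerable": False, "detected": True},
]
print(recall_by_group(entries, "severity"))
```

The same function would work for a CWE-family breakdown if each entry carried a family key, though the mapping from CWE ids to families is not specified on this page.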