01 · Methodology

Benchmark design & scoring

Every design choice optimizes for one property: results that can be audited line by line and reproduced exactly. This page documents the taxonomy, labeling, matching, and metrics.

1.1

Motivation

Static analysis is core to secure development, but noise has eroded its practical value. A reported false-positive rate of 99.5% for command-injection findings in Python/Flask applications, and NIST's finding that only 8–30% of tool warnings were security-relevant, are representative of the field [Ghost Security; NIST SATE]. A separate 36-scanner study found the best automated tool detected only 22.7% of known vulnerabilities [Fluid Attacks].

LLM-powered scanners claim deeper reasoning about data flow and exploitability, but no open benchmark existed to test that claim on real code. Existing benchmarks each fall short on at least one dimension: synthetic code (OWASP Benchmark, NIST Juliet), closed data (SastBench), vendor-controlled methodology (ZeroPath, Cycode, DryRun), or no false-positive measurement. RealVuln is built to address these simultaneously.

comparison with prior SAST benchmarks

                          Real   Open   Open    FP     Multi   LLM
                          Code   Data   Scoring Test   Scanner Scan.
  OWASP Benchmark          ✕      ✓      ✓       ✓      ✓       ✕
  NIST Juliet              ✕      ✓      ✕       ✕      ✕       ✕
  SastBench                ✓      ✕      ✕       ✓      ✕       ✕
  Fluid Attacks            ✓      ✕      ✕       ✓      ✓       ✕
  Cycode / DryRun          ✓      ~      ✕       ✕      ✓       ✕
  RealVuln (ours)          ✓      ✓      ✓       ✓      ✓       ✓

1.2

Pipeline

01 / collect

Real repositories

66 Python applications — intentionally vulnerable apps and vibe-coded projects — each pinned to a specific commit SHA so every scanner analyzes identical source trees.

02 / label

Hand-built ground truth

Every entry manually reviewed: a unique id, CWE classification, file and line range, severity, and free-text evidence.

03 / match

Three-field matching

A finding matches on file path, CWE (within an acceptable set), and line number (±10 lines). Each entry is consumed at most once.

04 / score

Recall-weighted F3

Unmatched findings become false positives; unmatched vulnerable entries, false negatives. Scored under F2 and F3, each in standard and strict mode.

1.3

Target-type taxonomy

Five target types classify the code under test on a code-realism axis. Version 1.0 covers Type 1 (intentionally vulnerable apps), which offers high vulnerability density, auditable labels, and resistance to training-data contamination. Types 2–5 are planned. A second axis records code authorship (human-authored, LLM-assisted, LLM-generated): the v1.0 corpus is human-authored, and v2.0 adds a vibe-coded (LLM-generated) subset, labeled per repository for authorship analysis.

Intentionally vulnerable apps

Educational & CTF projects — PyGoat, DVPWA, VAmPI. High vulnerability density, diverse CWE coverage.

v1.0 · live

Previously-vulnerable platforms

Production apps pinned to pre-patch commits with disclosed CVEs, at realistic vulnerability density.

v2 · planned

Previously-vulnerable libraries

Open-source libraries pinned to known-vulnerable versions that were later patched for disclosed CVEs.

v2 · planned

Benchmark roll-ups

Existing benchmarks (OWASP Benchmark, NIST Juliet) imported and re-scored under RealVuln metrics.

v2 · planned

Academic reproductions

Published scanner evaluations encoded as reproducible configurations.

future

1.4

Ground-truth labeling

Every ground-truth entry was produced by manual review. Each labeled entry records the following fields:

id · is_vulnerable: A unique identifier and a boolean flag distinguishing real vulnerabilities from false-positive traps.
primary_cwe: The most precise weakness, plus an acceptable_cwes list of alternative CWE identifiers a scanner may reasonably report for the same flaw.
location: File path and a start_line/end_line range pinpointing the code.
severity: One of critical, high, medium, or low.
evidence: The annotation source (manual review, CVE id, or published walkthrough) and a free-text rationale for why the code is, or is not, vulnerable.
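As a rough illustration, the fields above could be held in a record like the one below. This is a minimal sketch, not the benchmark's actual schema: the class name `GroundTruthEntry`, the flattened `file`/`evidence_source`/`evidence_rationale` attributes, the `cwe_set()` helper, and every value in the example entry are assumptions made for this sketch.

```python
from dataclasses import dataclass

SEVERITIES = ("critical", "high", "medium", "low")


@dataclass(frozen=True)
class GroundTruthEntry:
    id: str
    is_vulnerable: bool          # False marks a false-positive trap
    primary_cwe: str
    acceptable_cwes: frozenset   # alternative CWE ids a scanner may report
    file: str
    start_line: int
    end_line: int
    severity: str
    evidence_source: str         # manual review, CVE id, or published walkthrough
    evidence_rationale: str

    def __post_init__(self):
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if self.start_line > self.end_line:
            raise ValueError("start_line must not exceed end_line")

    def cwe_set(self):
        # Assumption: the primary CWE always counts as acceptable.
        return frozenset({self.primary_cwe}) | self.acceptable_cwes


# Illustrative trap entry; all values are invented, not taken from the corpus.
trap = GroundTruthEntry(
    id="example-trap-001",
    is_vulnerable=False,
    primary_cwe="CWE-89",
    acceptable_cwes=frozenset({"CWE-943"}),
    file="app/auth.py",
    start_line=21,
    end_line=24,
    severity="low",
    evidence_source="manual review",
    evidence_rationale="filter_by() parameterizes the query",
)
```

Validating the severity and line range at construction time is a design choice of this sketch; it keeps malformed labels from reaching the matcher.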

False-positive traps. 279 of the 2,182 entries (12.8%) are code patterns that appear suspicious but are demonstrably safe — for example, a login function passing user input to an ORM's filter_by(), which auto-parameterizes the query. Flagging a trap is penalized as a false positive, and the traps double as true negatives for false-positive-rate computation.

1.5

Matching algorithm

Each scanner finding is matched against ground truth on three fields: file path (exact after normalization), CWE (must appear in the entry's acceptable set), and line number (within ±10 lines of the range). When a finding matches multiple entries, real vulnerabilities are preferred over traps. Each entry is consumed at most once.
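The three-field rule can be sketched as follows. This is an illustrative sketch under stated assumptions, not the benchmark's implementation: the `Entry` and `Finding` types, the `normalize` rule (forward slashes, no leading `./` or `/`), the greedy processing of findings in input order, and the tie-break among entries of the same kind (first in list order) are all assumptions. Only the three match fields, the ±10-line tolerance, the vulnerability-over-trap preference, and the consume-once rule come from the text.

```python
import posixpath
from dataclasses import dataclass

LINE_TOLERANCE = 10


@dataclass(frozen=True)
class Entry:
    id: str
    file: str
    start_line: int
    end_line: int
    acceptable_cwes: frozenset
    is_vulnerable: bool


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    cwe: str


def normalize(path):
    # Assumed normalization: forward slashes, no "./" prefix, no leading "/".
    return posixpath.normpath(path.replace("\\", "/")).lstrip("/")


def is_candidate(finding, entry):
    return (
        normalize(finding.file) == normalize(entry.file)
        and finding.cwe in entry.acceptable_cwes
        and entry.start_line - LINE_TOLERANCE
        <= finding.line
        <= entry.end_line + LINE_TOLERANCE
    )


def match(findings, entries):
    """Return, per finding, the entry it consumed (or None)."""
    consumed = set()
    results = []
    for finding in findings:
        candidates = [
            e for e in entries
            if e.id not in consumed and is_candidate(finding, e)
        ]
        # Real vulnerabilities sort ahead of traps; the sort is stable,
        # so list order breaks remaining ties.
        candidates.sort(key=lambda e: not e.is_vulnerable)
        chosen = candidates[0] if candidates else None
        if chosen is not None:
            consumed.add(chosen.id)
        results.append(chosen)
    return results
```

In this sketch a second finding that lands on an already-consumed entry gets no match, so duplicate reports of the same flaw are not credited twice.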

Some scanners report attack chains across several files. A finding may declare alternative locations; if any matches, it is not a false positive, and a finding whose alternatives match distinct entries is credited with multiple true positives — so chain-of-evidence reporting is not penalized.
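One way to read the alternative-locations rule is sketched below. It is a simplified illustration rather than the benchmark's code: entries are plain dicts with hypothetical keys, traps are left out, and the sketch assumes a single CWE applies to every declared location of a finding.

```python
LINE_TOLERANCE = 10


def location_hits(location, cwe, entry):
    file, line = location
    return (
        file == entry["file"]
        and cwe in entry["cwes"]
        and entry["start"] - LINE_TOLERANCE <= line <= entry["end"] + LINE_TOLERANCE
    )


def credit_finding(locations, cwe, entries, consumed):
    """Return (true_positives, false_positives) for one multi-location finding."""
    credited = []
    for location in locations:
        for entry in entries:
            if entry["id"] not in consumed and location_hits(location, cwe, entry):
                consumed.add(entry["id"])   # each entry is consumed at most once
                credited.append(entry["id"])
                break
    # Any matching location rescues the finding from being a false positive;
    # distinct matched entries each count as a true positive.
    return len(credited), (0 if credited else 1)


# Hypothetical entries for a command-injection chain spanning two files.
entries = [
    {"id": "e1", "file": "routes.py", "start": 10, "end": 12, "cwes": {"CWE-78"}},
    {"id": "e2", "file": "db.py", "start": 88, "end": 90, "cwes": {"CWE-78"}},
]
consumed = set()
chain = [("routes.py", 11), ("db.py", 95), ("util.py", 5)]
chain_result = credit_finding(chain, "CWE-78", entries, consumed)
```

Here the chain touches two distinct entries and an unlabeled helper file, so it is credited with two true positives and no false positive.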

True positive

Matches a vulnerable entry.

False positive

Matches a trap, or flags code with no entry.

False negative

A vulnerable entry the scanner missed.

True negative

A trap the scanner correctly ignored.
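The four outcomes can be tallied from matcher output roughly as follows. This is a sketch with hypothetical inputs: `matches` holds, for each finding, the id of the entry it consumed or `None`, and `entries` maps each entry id to its `is_vulnerable` flag.

```python
def tally(matches, entries):
    """Count tp/fp/fn/tn from per-finding matches and the ground truth."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    hit = set()
    for entry_id in matches:
        if entry_id is not None and entries[entry_id]:
            counts["tp"] += 1          # matched a vulnerable entry
        else:
            counts["fp"] += 1          # matched a trap, or matched nothing
        if entry_id is not None:
            hit.add(entry_id)
    for entry_id, is_vulnerable in entries.items():
        if entry_id not in hit:
            # Untouched vulnerabilities are misses; untouched traps are
            # correctly ignored.
            counts["fn" if is_vulnerable else "tn"] += 1
    return counts


# Two vulnerabilities, two traps; three findings, one of them unmatched.
example_counts = tally(
    ["v1", "t1", None],
    {"v1": True, "v2": True, "t1": False, "t2": False},
)
```

In the example, `v1` is found, `t1` is wrongly flagged, one finding hits unlabeled code, `v2` is missed, and `t2` is correctly ignored.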

1.6

Scoring

Two base metrics underpin scoring. Precision is the fraction of flagged findings that were real; recall is the fraction of real vulnerabilities found. The F_β family combines them as a weighted harmonic mean, where β controls the trade-off.
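For reference, the standard definition of the F_β family, written with P for precision and R for recall, is a weighted harmonic mean in which recall receives β² times the weight of precision. Substituting β = 1, 2, 3 gives the three scores used on this page:

```latex
\frac{1}{F_\beta} = \frac{1}{1+\beta^2}\cdot\frac{1}{P} + \frac{\beta^2}{1+\beta^2}\cdot\frac{1}{R}
\quad\Longrightarrow\quad
F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}

F_1 = \frac{2PR}{P+R}, \qquad
F_2 = \frac{5PR}{4P+R}, \qquad
F_3 = \frac{10PR}{9P+R}
```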

In security the costs are asymmetric: a single missed vulnerability can lead to a breach, while a false positive costs an analyst minutes. F1 (β=1) weights both equally, implicitly assuming those costs are equal. RealVuln's primary metric is F3 (β=3), which weights recall nine times as heavily as precision (β² = 9). F1 and F2 are reported throughout so results can be re-ranked under any preference.

Aggregation. The conservative strict_micro mode pools confusion-matrix counts across repositories and treats any repository a scanner failed to complete as all-false-negatives. All headline scores use this strict mode.
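The effect of strict pooling can be illustrated with a small sketch. The function name `strict_micro_f3`, the per-repository dict keys, and the zero-denominator convention are assumptions of this sketch; only the pooling of counts and the all-false-negatives rule for incomplete repositories come from the description above.

```python
def strict_micro_f3(per_repo):
    """Pool counts across repositories, then compute F3 on a 0-100 scale."""
    tp = fp = fn = 0
    for repo in per_repo:
        if repo["completed"]:
            tp += repo["tp"]
            fp += repo["fp"]
            fn += repo["fn"]
        else:
            # Strict mode: every vulnerable entry in a failed repo is a miss.
            fn += repo["n_vulnerable"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = 9 * precision + recall
    # Assumed convention: score 0 when the formula is undefined.
    if denominator == 0:
        return 0.0
    return 100 * 10 * precision * recall / denominator


# Hypothetical counts: one completed repository, one the scanner failed on.
repos = [
    {"completed": True, "tp": 8, "fp": 2, "fn": 2, "n_vulnerable": 10},
    {"completed": False, "tp": 0, "fp": 0, "fn": 0, "n_vulnerable": 10},
]
strict_score = strict_micro_f3(repos)
completed_only_score = strict_micro_f3(repos[:1])
```

With these made-up counts the completed repository alone would score 80, while the strict pooled score falls to about 42.1, which shows how heavily a failed scan is penalized.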

$$F_3 = \frac{10 \cdot P \cdot R}{9P + R} \times 100$$

β = 3 · recall weighted 9× · scaled to [0, 100]

F2 = 5·P·R / (4P + R); F3 = 10·P·R / (9P + R)
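The formulas above are instances of one general expression, which a short function makes easy to check. This is a minimal sketch; the function name `f_beta` and the convention of returning 0 when precision and recall are both 0 are assumptions, and the precision and recall values in the example are arbitrary.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall, in [0, 1]."""
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0  # assumed convention when the score is undefined
    return (1 + beta_sq) * precision * recall / denominator


# Arbitrary example: a scanner with precision 0.5 and recall 0.8.
p, r = 0.5, 0.8
f1 = f_beta(p, r, 1)
f2 = f_beta(p, r, 2)
f3_scaled = f_beta(p, r, 3) * 100   # headline scale of [0, 100]
```

Because recall exceeds precision in this example, the score rises as β grows: F1 is about 0.615, F2 about 0.714, and the scaled F3 about 75.5.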

Per-CWE-family and per-severity breakdowns are computed alongside the aggregates.

$ python score.py --repo realvuln-pygoat --all-scanners
