Living benchmark · v2.0 · Apache 2.0 · arXiv:2604.13764 · May 2026

RealVuln

An open benchmark measuring how Rule-Based SAST, General-Purpose LLM, and Security-Specialized scanners perform on real-world vulnerable code.

John Pellew · Faizan Raza

All ground truth, scanner outputs, and scoring code are released for independent reproduction and audit.

1,903 vulnerabilities · 279 false-positive traps · 66 repositories · 133,782 Python LOC · 22 scanners tested

Leaderboard

22 scanners were run on identical pinned commits across 66 repositories. The headline metric is F3 (strict): recall is weighted nine times over precision, and unfinished repositories count as misses. The leaderboard can be re-ranked under either F2 or F3.
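
As a rough illustration of the "nine times" weighting, the sketch below computes the standard F-beta score from a precision and recall pair. With beta = 3 the recall weight is beta squared, i.e. 9. The function name `f_beta` is hypothetical and is not taken from the benchmark's scoring code, and it returns a fraction in [0, 1], whereas the leaderboard appears to report scores on a 0 to 100 scale.

```python
def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """Standard F-beta score; recall counts beta**2 times as much as precision.

    Hypothetical helper for illustration only, not the benchmark's own code.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        # No true positives at all: define the score as 0 rather than divide by zero.
        return 0.0
    return (1 + b2) * precision * recall / denom


# A high-recall, low-precision scanner scores far better under F3
# than a high-precision, low-recall one.
high_recall = f_beta(precision=0.2, recall=0.9)
high_precision = f_beta(precision=0.9, recall=0.2)
```

Passing `beta=2.0` gives F2, which weights recall four times over precision. That difference in weighting is consistent with rankings shifting between the two metrics.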

F3 weights recall 9× over precision · strict scoring counts unfinished repos as misses

#	Scanner	F3	Recall %	Prec %	Repos	Cost $

Amber repo counts mark runs that did not complete all 66 repositories; under strict scoring their unscored repos count as misses. Cost is the total USD spend for the scored run; rule-based tools are free or variably priced (—).
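
One way to picture the strict rule is the sketch below. It assumes counts are pooled across repositories (micro-averaging) and that standard mode simply skips unfinished repositories. Both are assumptions made for illustration; the benchmark's actual aggregation is defined in its methodology and scoring code. The names `RepoResult` and `aggregate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class RepoResult:
    ground_truth: int          # labeled vulnerabilities in this repository
    tp: Optional[int] = None   # true positives; None means the run did not finish
    fp: int = 0                # false positives reported by the scanner


def aggregate(results: Iterable[RepoResult], strict: bool = True) -> Tuple[float, float]:
    """Return (precision, recall) pooled over repositories.

    Assumption: in strict mode every labeled vulnerability in an unfinished
    repository is a false negative; in standard mode that repository is skipped.
    """
    tp = fp = fn = 0
    for r in results:
        if r.tp is None:
            if strict:
                fn += r.ground_truth
            continue
        tp += r.tp
        fp += r.fp
        fn += r.ground_truth - r.tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# One finished repository (8 of 10 found, 2 false alarms) and one unfinished.
runs = [RepoResult(ground_truth=10, tp=8, fp=2), RepoResult(ground_truth=10)]
strict_p, strict_r = aggregate(runs, strict=True)
standard_p, standard_r = aggregate(runs, strict=False)
```

In this toy case precision is the same in both modes, while recall is lower under strict scoring because the ten unscored vulnerabilities are counted as missed.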

Precision vs. recall for all 22 scanners. Security-Specialized systems (violet) reach the high-recall region on the right; General-Purpose LLMs cluster in the center with strong precision; Rule-Based tools occupy the low-recall region on the left.

Security-Specialized · 6 systems · best F3 86.5 · recall up to 0.89 · breadth-driven

General-Purpose LLM · 14 models · best F3 56.7 · range 23.3–56.7 · high variance

Rule-Based SAST · 2 tools · best F3 14.4 · recall ≤ 0.19 · syntactic only

A three-tier ordering (Security-Specialized > General-Purpose LLM > Rule-Based SAST) holds under both F2 and F3, though within-tier rankings shift with the metric and with strict versus standard mode.

Documentation

01 / methodology

Benchmark design & scoring

Target-type taxonomy, ground-truth labeling, the matching algorithm, and why the primary metric is recall-weighted F3.

02 / dataset

The corpus

66 hand-labeled Python repositories, the ground-truth schema, false-positive traps, and framework coverage.

03 / findings

Results & analysis

The three-tier hierarchy, per-CWE detection, cost-efficiency, reliability, and the precision–recall trade-off.

04 / roadmap

Living benchmark & contributing

Versioning, the v2 roadmap, the authorship research question, how to contribute, and how to cite.