realvuln v1.0
04 · Roadmap

A living benchmark

A static benchmark goes stale the moment the tools it measures improve, so RealVuln is versioned. Every release is identified by a deterministic manifest hash: old results stay valid, and versions can be compared side by side.

4.1

Versioning

A manifest hash is computed from three inputs: the content hash of the ground-truth directory, the pinned commit SHA of each repository, and the hash of the default prompt used for LLM scanners. Any change to labels, code, or prompt produces a new version; scores are always reported against a named manifest, and the dashboard surfaces historical comparisons.
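The three-input scheme described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names `hash_directory` and `manifest_hash`, the choice of SHA-256, and the way the inputs are concatenated are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def hash_directory(root: Path) -> str:
    """Content hash of a directory: relative paths plus file bytes, in sorted order."""
    files = [p for p in root.rglob("*") if p.is_file()]
    # Sort by POSIX-style relative path so the order is the same on every platform.
    files.sort(key=lambda p: p.relative_to(root).as_posix())
    digest = hashlib.sha256()
    for path in files:
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def manifest_hash(ground_truth_hash: str, repo_shas: dict[str, str], prompt: str) -> str:
    """Combine the three inputs (hypothetical layout) into one deterministic identifier."""
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    digest = hashlib.sha256()
    digest.update(ground_truth_hash.encode("utf-8"))
    # Sort by repository name so dict insertion order cannot change the result.
    for name in sorted(repo_shas):
        digest.update(f"\n{name}={repo_shas[name]}".encode("utf-8"))
    digest.update(f"\n{prompt_hash}".encode("utf-8"))
    return digest.hexdigest()
```

Because every input is hashed in a fixed order, editing a label file, moving a pinned commit, or rewording the prompt each yields a different identifier, while re-running on unchanged inputs reproduces the same one.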

4.2

Roadmap

v1.0 · shipped · Mar 2026

Python, Type 1, 24 scanners

  • 26-repo corpus · 26 scored
  • 817 labeled findings
  • F2 / F3 · standard & strict
  • per-CWE & per-severity breakdowns
  • public dashboard
v2 · planned

Multi-language & production CVEs

  • JavaScript / TypeScript
  • Go
  • Java
  • Type 2 — production CVE commits
  • Type 3 — vulnerable libraries
  • agentic sandbox (Docker)

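Type 2 targets introduce realistic, low-density vulnerability profiles. With few true vulnerabilities per repository, each false positive weighs more heavily, which sets a higher bar for precision than intentionally vulnerable apps do.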

research · open question

Scanner performance vs. code authorship

Hypothesis: LLM-based scanners may detect vulnerabilities in LLM-generated code disproportionately better than in human-authored code. RealVuln tags the authorship of every repository. The proposed method runs each scanner against matched pairs of human-authored and LLM-generated code and compares the score deltas. The result would be informative regardless of which scanner ranks highest.
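One way to picture the paired comparison is a per-scanner mean of score differences. The sketch below assumes a hypothetical input shape (a mapping from scanner name to a list of `(human_score, llm_score)` tuples, one per matched pair) and a hypothetical function name, `authorship_deltas`. The source does not specify which score or which statistic the study would use.

```python
from statistics import mean


def authorship_deltas(scores: dict[str, list[tuple[float, float]]]) -> dict[str, float]:
    """Mean (LLM-generated minus human-authored) score per scanner over matched pairs.

    A positive value means the scanner scored higher on LLM-generated code.
    """
    return {
        scanner: mean(llm - human for human, llm in pairs)
        for scanner, pairs in scores.items()
    }
```

A real analysis would likely add a significance test over the paired differences; the mean delta alone only shows the direction and size of the gap.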

4.3

Contributing

RealVuln is maintained as a community resource. Transparency is treated not as a defense against bias but as an invitation to find it. Contribution paths are documented in the repository README.

▸ add

A scanner

Provide the scanner's results as Semgrep-compatible JSON and score them. Unknown scanner slugs fall back to the default parser; other output formats need a parser class registered in parsers/.
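The fallback behaviour can be sketched as a registry lookup with a default. Everything named here is hypothetical (`SemgrepJsonParser`, `PARSERS`, `parser_for`, and the shape of the returned findings) and is not taken from the repository. The only grounded part is the Semgrep JSON layout, which stores findings under `results` with `check_id`, `path`, and `start.line` fields.

```python
import json


class SemgrepJsonParser:
    """Hypothetical default parser: reads findings from Semgrep-style JSON output."""

    def parse(self, raw: str) -> list[dict]:
        findings = []
        for result in json.loads(raw).get("results", []):
            findings.append(
                {
                    "rule": result["check_id"],
                    "path": result["path"],
                    "line": result["start"]["line"],
                }
            )
        return findings


# Hypothetical registry: scanner slug -> parser class for non-Semgrep formats.
PARSERS: dict[str, type] = {}


def parser_for(slug: str):
    """Return a parser instance, falling back to the default for unknown slugs."""
    return PARSERS.get(slug, SemgrepJsonParser)()
```

Under this sketch, a scanner that already emits Semgrep-style JSON needs no registration at all, and only a custom format requires adding an entry to the registry.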

▸ add

A repository

Author a ground-truth manifest following the schema, validate it, and pin a commit SHA. At least one false-positive trap is required per five real findings.
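The trap requirement can be checked before submitting a manifest. The sketch below assumes one reading of the rule: any started group of five real findings needs its own trap, so the count is rounded up. The function names are hypothetical, and the project's validator may interpret the ratio differently.

```python
import math


def required_traps(real_findings: int) -> int:
    """Minimum false-positive traps, assuming one per started group of five real findings."""
    return math.ceil(real_findings / 5)


def has_enough_traps(real_findings: int, traps: int) -> bool:
    """True if the manifest meets the assumed one-trap-per-five-findings rule."""
    return traps >= required_traps(real_findings)
```

Under this reading, a repository with 11 real findings would need three traps, while one with 10 would need two.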

▸ challenge

A label

Every label ships with evidence and rationale. Open an issue to dispute one; corrections produce a new manifest version without invalidating prior results.

reproduce any score
# validate ground-truth schemas
python validate_gt.py

# score one repo against every scanner
python score.py --repo realvuln-pygoat --all-scanners

# regenerate the interactive dashboard
python dashboard.py --scanner-group all
4.4

Cite

If RealVuln is useful in your work, please cite the paper. All artifacts are released under the MIT license.

realvuln.bib
@misc{pellew2026realvuln,
  title  = {RealVuln: Benchmarking Rule-Based, General-Purpose LLM,
            and Security-Specialized Scanners on Real-World Code},
  author = {Pellew, John and Raza, Faizan},
  year   = {2026},
  eprint = {2604.13764},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CR},
  url    = {https://arxiv.org/abs/2604.13764}
}