realvuln v1.0
Scanner deep-dive

GPT-5.5 by OpenAI

General-Purpose LLM · agentic-v1 · scored on 26/26 repositories. Scoring is strict: unfinished repositories are counted as misses.

F3 (strict): 60.2
F2 (strict): 62.1
Recall (strict): 58.4%
Precision: 83.2%
Repos scored: 26/26
Model: gpt-5.5
Total cost: $66
Avg latency: 153s
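
The F2 and F3 figures are F-beta scores, which weight recall more heavily than precision as beta grows. As a minimal sketch, assuming the headline scores are derived from the aggregate precision and recall shown here (the page does not state this), the standard F-beta formula reproduces both values. The function name `f_beta` is illustrative, not part of the benchmark's code.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Headline precision and strict recall from the summary, as fractions.
precision, recall = 0.832, 0.584

f2 = f_beta(precision, recall, beta=2)
f3 = f_beta(precision, recall, beta=3)
print(f"F2 = {100 * f2:.1f}, F3 = {100 * f3:.1f}")  # F2 = 62.1, F3 = 60.2
```

Because recall (58.4%) is well below precision (83.2%), the score drops as beta rises from 2 to 3.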

Per-repository breakdown

Each bar shows true positives, false positives, and misses on one repository; bar length is proportional to that repo's labeled vulnerabilities. Ranked by F2.

[Bar chart: one bar per repository showing true positives, false positives, and misses (FN), each labeled with its rounded F2 score and recall. The same figures appear in the table below.]

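| Repository | TP | FP | FN | Recall % | F2 |
|---|---|---|---|---|---|
| pythonssti | 2 | 0 | 0 | 100.0 | 100.0 |
| vfapi | 8 | 0 | 1 | 88.9 | 90.9 |
| intentionally-vulnerable-python-application | 6 | 0 | 1 | 85.7 | 88.2 |
| vampi | 13 | 0 | 2 | 84.5 | 86.7 |
| insecure-web | 7 | 2 | 2 | 77.8 | 77.2 |
| python-app | 15 | 1 | 5 | 73.3 | 76.4 |
| vulnpy | 59 | 15 | 19 | 75.2 | 76.0 |
| vulnerable-api | 10 | 2 | 4 | 73.8 | 75.6 |
| dsvw | 19 | 1 | 8 | 71.6 | 75.1 |
| dsvpwa | 24 | 8 | 8 | 73.4 | 73.4 |
| dvblab | 15 | 4 | 7 | 68.2 | 70.3 |
| extremely-vulnerable-flask-app | 19 | 0 | 13 | 58.3 | 63.5 |
| owasp-web-playground | 16 | 2 | 12 | 58.9 | 63.0 |
| lets-be-bad-guys | 14 | 2 | 10 | 58.3 | 62.7 |
| vulnerable-flask-app | 12 | 2 | 9 | 57.1 | 61.4 |
| damn-vulnerable-graphql-application | 20 | 4 | 16 | 56.5 | 60.2 |
| dvpwa | 13 | 6 | 9 | 57.6 | 59.3 |
| threatbyte | 14 | 4 | 12 | 53.8 | 57.5 |
| pygoat | 40 | 8 | 37 | 51.9 | 56.2 |
| vulnerable-python-apps | 11 | 0 | 11 | 50.0 | 55.5 |
| damn-vulnerable-flask-application | 7 | 1 | 8 | 48.9 | 53.6 |
| vulnerable-tornado-app | 7 | 3 | 7 | 47.6 | 50.5 |
| djangoat | 23 | 12 | 27 | 46.7 | 49.5 |
| python-insecure-app | 3 | 1 | 5 | 41.7 | 45.3 |
| vulpy | 20 | 4 | 37 | 35.1 | 39.5 |
| flask-xss | 10 | 0 | 20 | 32.2 | 37.3 |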
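
To show how a row's recall and F2 relate to its counts, here is a small sketch that works directly from TP, FP, and FN. It reads the pygoat row as TP=40, FP=8, FN=37, which is an assumption about how the fused digits split. The helper names are illustrative. For this row the result matches the published 51.9% recall and 56.2 F2; other rows can differ by a few tenths, possibly because the published percentages are averaged over several runs (the page does not say).

```python
def recall_from_counts(tp: int, fn: int) -> float:
    """Fraction of labeled vulnerabilities that were found."""
    labeled = tp + fn
    return tp / labeled if labeled else 0.0


def f_beta_from_counts(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta written in terms of counts instead of precision and recall."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# pygoat row, under the assumed split of the table digits.
row = {"repo": "pygoat", "tp": 40, "fp": 8, "fn": 37}

row_recall = recall_from_counts(row["tp"], row["fn"])
row_f2 = f_beta_from_counts(row["tp"], row["fp"], row["fn"], beta=2.0)
print(f"{row['repo']}: recall = {100 * row_recall:.1f}%, F2 = {100 * row_f2:.1f}")
# pygoat: recall = 51.9%, F2 = 56.2
```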

Detection by severity

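| Severity | TP | FP | FN | Recall % |
|---|---|---|---|---|
| Critical | 78 | 0 | 8 | 90.7 |
| High | 162 | 0 | 102 | 61.4 |
| Medium | 151 | 0 | 128 | 54.1 |
| Low | 23 | 0 | 45 | 33.8 |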
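
The severity rows can be checked, and pooled into a single figure, with a few lines of arithmetic. This is a sketch only: the pooled (micro-averaged) value is computed here for illustration and is not a number the page reports. It need not equal the strict headline recall, since strict scoring also counts unfinished repositories as misses.

```python
# (TP, FN) per severity, read from the table above.
severity_counts = {
    "Critical": (78, 8),
    "High": (162, 102),
    "Medium": (151, 128),
    "Low": (23, 45),
}


def recall(tp: int, fn: int) -> float:
    labeled = tp + fn
    return tp / labeled if labeled else 0.0


per_severity = {name: recall(tp, fn) for name, (tp, fn) in severity_counts.items()}

# Pool the counts first, then divide, so larger buckets carry more weight.
total_tp = sum(tp for tp, _ in severity_counts.values())
total_fn = sum(fn for _, fn in severity_counts.values())
pooled = recall(total_tp, total_fn)

for name, value in per_severity.items():
    print(f"{name:<8} {100 * value:.1f}%")
print(f"Pooled   {100 * pooled:.1f}%")
```

Pooling counts differs from averaging the four percentages: the unweighted mean would give the 68 Low-severity findings the same weight as the 279 Medium ones.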

Detection by vulnerability class

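| CWE family | TP | FP | FN | Recall % |
|---|---|---|---|---|
| Open Redirect | 6 | 0 | 0 | 100.0 |
| HTTP Header Injection | 2 | 0 | 0 | 100.0 |
| XPath Injection | 4 | 0 | 0 | 100.0 |
| Denial of Service | 19 | 0 | 1 | 95.0 |
| Insecure Deserialization | 17 | 0 | 2 | 89.5 |
| Command / OS Injection | 15 | 0 | 2 | 88.2 |
| XML External Entities | 7 | 0 | 1 | 87.5 |
| Code Injection / RFI | 12 | 0 | 2 | 85.7 |
| Path Traversal | 22 | 0 | 4 | 84.6 |
| SQL Injection | 39 | 0 | 8 | 83.0 |
| Hardcoded Credentials | 47 | 0 | 14 | 77.0 |
| Broken Access Control / IDOR | 17 | 0 | 7 | 70.8 |
| Cross-Site Scripting | 51 | 0 | 31 | 62.2 |
| Server-Side Request Forgery | 13 | 0 | 11 | 54.2 |
| Missing Authentication / Authorization | 23 | 0 | 24 | 48.9 |
| Security Misconfiguration | 15 | 0 | 18 | 45.5 |
| Other | 84 | 0 | 122 | 40.8 |
| Sensitive Data Exposure | 21 | 0 | 36 | 36.8 |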

LLM operational metrics

Avg input tokens: 71,890
Avg output tokens: 8,588
Avg total tokens: 326,992
Avg latency / repo: 153s
JSON repair rate: 0.0%
Total runs: 78
F2 run-to-run σ: ±15.9
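
A run-to-run σ is a standard deviation of F2 across repeated runs. The page does not say whether it uses the population or the sample form, or how per-repository spreads are combined, so the sketch below simply shows both forms on made-up scores. The three F2 values are hypothetical and are not taken from the benchmark.

```python
import statistics

# Hypothetical F2 scores (in points) from three runs on one repository.
f2_runs = [55.0, 70.0, 85.0]

mean_f2 = statistics.mean(f2_runs)
sigma_population = statistics.pstdev(f2_runs)  # divides by n
sigma_sample = statistics.stdev(f2_runs)       # divides by n - 1

print(f"mean F2 = {mean_f2:.1f}")
print(f"population sigma = {sigma_population:.2f}")
print(f"sample sigma = {sigma_sample:.2f}")
```

With only a handful of runs per repository the two forms differ noticeably, so a reported σ is only comparable across scanners when the same form is used.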

Cost

Total cost: $66
Cost / run: $0.87
Cost / 100 LOC: $0.331
Python LOC scanned: 20,062
Successful runs: 76
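
The unit costs can be approximately reproduced from the totals. This sketch assumes cost per run divides by the 76 successful runs rather than all 78 (an inference from the numbers, not something the page states), and it starts from the rounded $66 total, so the per-100-LOC result lands slightly below the published $0.331.

```python
# Figures from the cost summary; the total is rounded to whole dollars on the page.
total_cost_usd = 66.0
successful_runs = 76
python_loc = 20_062

cost_per_run = total_cost_usd / successful_runs
cost_per_100_loc = total_cost_usd / (python_loc / 100)

print(f"cost / run     = ${cost_per_run:.2f}")       # $0.87
print(f"cost / 100 LOC = ${cost_per_100_loc:.3f}")   # $0.329
```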
