realvuln v1.0
Scanner deep-dive

Gemini 3.1 Pro by Google DeepMind

General-Purpose LLM · agentic-v1 · scored on 24/26 repositories. Strict scoring (unfinished repos counted as misses).
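One plausible reading of strict scoring, sketched with toy numbers: every labeled vulnerability in a repository that never finished scanning counts as a miss, so it stays in the recall denominator. The `RepoResult` type, both function names, and the counts below are hypothetical and are not taken from the benchmark's code or data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RepoResult:
    labeled: int        # ground-truth vulnerabilities in the repository
    tp: Optional[int]   # true positives, or None if the scan never finished


def strict_recall(results: List[RepoResult]) -> float:
    """Recall where an unfinished repo contributes all of its labels as misses."""
    total = sum(r.labeled for r in results)
    found = sum(r.tp or 0 for r in results)
    return found / total if total else 0.0


def finished_only_recall(results: List[RepoResult]) -> float:
    """Recall computed over finished repositories only, for comparison."""
    done = [r for r in results if r.tp is not None]
    total = sum(r.labeled for r in done)
    return sum(r.tp for r in done) / total if total else 0.0


# Toy data (hypothetical): two finished repos and one unfinished repo.
toy = [RepoResult(10, 6), RepoResult(20, 9), RepoResult(10, None)]
strict = strict_recall(toy)            # 15 found out of 40 labels
lenient = finished_only_recall(toy)    # 15 found out of 30 labels
print(f"strict={strict:.3f} finished-only={lenient:.3f}")
```

Under this reading, strict recall can only be lower than or equal to recall over finished repositories.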

F3 (strict): 49.7
F2 (strict): 51.8
Recall (strict): 47.8%
Precision: 77.6%
Repos scored: 24/26
Model: gemini-3.1-pro-preview
Total cost: $27
Avg latency: 170s
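The headline F2 and F3 values are consistent with the standard F-beta formula applied to the reported precision and strict recall. The sketch below assumes that this is how the page derives them; the function name `f_beta` is ours.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score: beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and strict recall as reported on this page.
precision, recall = 0.776, 0.478

f2 = f_beta(precision, recall, beta=2)
f3 = f_beta(precision, recall, beta=3)
print(f"F2 = {100 * f2:.1f}, F3 = {100 * f3:.1f}")
```

Because recall (47.8%) is well below precision (77.6%), giving recall more weight pulls the score down, which is why F3 comes out lower than F2.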

Per-repository breakdown

Each bar shows true positives, false positives, and misses on one repository; bar length is proportional to that repo's labeled vulnerabilities. Ranked by F2.

[Bar chart: one bar per repository showing true positives, false positives, and misses (FN), each labeled with its rounded F2 and recall. The same figures appear in the table that follows.]
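| Repository | TP | FP | FN | Recall % | F2 |
|---|---|---|---|---|---|
| vulnpy | 67 | 6 | 11 | 86.3 | 87.4 |
| intentionally-vulnerable-python-application | 5 | 1 | 2 | 71.4 | 73.5 |
| vampi | 11 | 4 | 4 | 71.1 | 71.0 |
| vulnerable-api | 9 | 2 | 5 | 66.7 | 69.3 |
| python-app | 13 | 3 | 7 | 65.0 | 67.9 |
| vfapi | 6 | 3 | 3 | 66.7 | 67.2 |
| lets-be-bad-guys | 15 | 1 | 9 | 61.1 | 65.8 |
| insecure-web | 6 | 2 | 3 | 63.0 | 64.0 |
| dsvw | 16 | 8 | 11 | 60.5 | 62.0 |
| dvblab | 13 | 4 | 9 | 59.1 | 61.7 |
| vulnerable-tornado-app | 8 | 1 | 6 | 54.8 | 59.3 |
| dsvpwa | 18 | 4 | 14 | 55.2 | 58.8 |
| damn-vulnerable-flask-application | 8 | 3 | 7 | 55.6 | 57.7 |
| vulnerable-flask-app | 11 | 2 | 10 | 52.4 | 56.5 |
| pythonssti | 1 | 0 | 1 | 50.0 | 53.7 |
| flask-xss | 13 | 3 | 17 | 44.5 | 48.9 |
| extremely-vulnerable-flask-app | 14 | 4 | 18 | 43.8 | 48.0 |
| dvpwa | 10 | 5 | 12 | 43.9 | 46.7 |
| threatbyte | 10 | 7 | 16 | 39.8 | 42.6 |
| python-insecure-app | 3 | 1 | 5 | 37.5 | 41.7 |
| damn-vulnerable-graphql-application | 14 | 7 | 22 | 38.0 | 41.4 |
| pygoat | 28 | 11 | 49 | 36.8 | 40.7 |
| vulpy | 19 | 5 | 38 | 33.3 | 37.6 |
| djangoat | 15 | 9 | 35 | 29.3 | 32.8 |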
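For a single row, F2 can also be written directly in terms of the confusion counts. The sketch below uses the python-app row, read as TP 13, FP 3, FN 7. The result lands close to, but not exactly on, the reported 67.9; our assumption is that the page averages per-run scores rather than scoring pooled counts, which would explain small differences of this kind.

```python
def f_beta_from_counts(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from raw counts: (1+b^2)*TP / ((1+b^2)*TP + b^2*FN + FP)."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# python-app row as read from the table above.
tp, fp, fn = 13, 3, 7

recall = tp / (tp + fn)
precision = tp / (tp + fp)
f2 = f_beta_from_counts(tp, fp, fn)
print(f"recall={100 * recall:.1f}% precision={100 * precision:.1f}% F2={100 * f2:.1f}")
```

Note that each false negative costs four times as much as a false positive in the F2 denominator, so a repository with many misses ranks low even when its findings are mostly correct.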

Detection by severity

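| Severity | TP | FP | FN | Recall % |
|---|---|---|---|---|
| Critical | 72 | 0 | 10 | 87.8 |
| High | 135 | 0 | 108 | 55.6 |
| Medium | 132 | 0 | 128 | 50.8 |
| Low | 18 | 0 | 44 | 29.0 |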
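The recall column in this table follows directly from the counts as TP / (TP + FN). A minimal check, with the rows read as Critical 72/10, High 135/108, Medium 132/128, and Low 18/44 (TP/FN):

```python
# (TP, FN) per severity, as read from the table above.
severity_counts = {
    "Critical": (72, 10),
    "High": (135, 108),
    "Medium": (132, 128),
    "Low": (18, 44),
}

# Recall as a percentage, rounded to one decimal like the page.
severity_recall = {
    name: round(100 * tp / (tp + fn), 1)
    for name, (tp, fn) in severity_counts.items()
}

for name, pct in severity_recall.items():
    print(f"{name:<8} {pct:5.1f}%")
```

The FP column is zero in every row, presumably because a false positive has no ground-truth label to assign it a severity; that is our reading, not something the page states.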

Detection by vulnerability class

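| CWE family | TP | FP | FN | Recall % |
|---|---|---|---|---|
| SQL Injection | 39 | 0 | 0 | 100.0 |
| XML External Entities | 8 | 0 | 0 | 100.0 |
| Insecure Deserialization | 16 | 0 | 0 | 100.0 |
| Open Redirect | 6 | 0 | 0 | 100.0 |
| HTTP Header Injection | 2 | 0 | 0 | 100.0 |
| XPath Injection | 4 | 0 | 0 | 100.0 |
| Path Traversal | 24 | 0 | 1 | 96.0 |
| Code Injection / RFI | 13 | 0 | 1 | 92.9 |
| Denial of Service | 17 | 0 | 3 | 85.0 |
| Command / OS Injection | 14 | 0 | 3 | 82.4 |
| Broken Access Control / IDOR | 14 | 0 | 8 | 63.6 |
| Cross-Site Scripting | 48 | 0 | 31 | 60.8 |
| Hardcoded Credentials | 30 | 0 | 23 | 56.6 |
| Server-Side Request Forgery | 12 | 0 | 10 | 54.5 |
| Missing Authentication / Authorization | 18 | 0 | 25 | 41.9 |
| Security Misconfiguration | 12 | 0 | 20 | 37.5 |
| Other | 68 | 0 | 125 | 35.2 |
| Sensitive Data Exposure | 12 | 0 | 40 | 23.1 |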
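When summarizing a per-class table like this one, the choice of average matters. The sketch below contrasts pooled (micro-averaged) recall, which weights each labeled vulnerability equally, with the unweighted mean of per-class recalls (macro average). The variable names are ours, and the page does not say which aggregate, if any, it derives from this table.

```python
# (TP, FN) per CWE family, as read from the table above.
cwe_counts = {
    "SQL Injection": (39, 0),
    "XML External Entities": (8, 0),
    "Insecure Deserialization": (16, 0),
    "Open Redirect": (6, 0),
    "HTTP Header Injection": (2, 0),
    "XPath Injection": (4, 0),
    "Path Traversal": (24, 1),
    "Code Injection / RFI": (13, 1),
    "Denial of Service": (17, 3),
    "Command / OS Injection": (14, 3),
    "Broken Access Control / IDOR": (14, 8),
    "Cross-Site Scripting": (48, 31),
    "Hardcoded Credentials": (30, 23),
    "Server-Side Request Forgery": (12, 10),
    "Missing Authentication / Authorization": (18, 25),
    "Security Misconfiguration": (12, 20),
    "Other": (68, 125),
    "Sensitive Data Exposure": (12, 40),
}

total_tp = sum(tp for tp, _ in cwe_counts.values())
total_fn = sum(fn for _, fn in cwe_counts.values())

# Micro average: pool all counts, then divide once.
micro_recall = total_tp / (total_tp + total_fn)

# Macro average: compute recall per class, then take the plain mean.
per_class = [tp / (tp + fn) for tp, fn in cwe_counts.values()]
macro_recall = sum(per_class) / len(per_class)

print(f"micro={100 * micro_recall:.1f}% macro={100 * macro_recall:.1f}%")
```

The macro average comes out higher here because several small classes with perfect recall count as much as the large, poorly detected "Other" class.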

LLM operational metrics

Avg input tokens: 56,142
Avg output tokens: 4,315
Avg total tokens: 437,119
Avg latency / repo: 170s
JSON repair rate: 0.0%
Total runs: 72
F2 run-to-run σ: ±13.4
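A run-to-run σ of this kind is typically a standard deviation of the F2 score across repeated runs. The sketch below shows both common variants on hypothetical scores; the page does not say whether it uses the sample or the population form, nor how runs are grouped, so none of the values here correspond to the benchmark's data.

```python
import statistics

# Hypothetical F2 scores from three repeated runs (not benchmark data).
f2_runs = [40.0, 50.0, 60.0]

mean_f2 = statistics.mean(f2_runs)
sample_sigma = statistics.stdev(f2_runs)       # divides by n - 1
population_sigma = statistics.pstdev(f2_runs)  # divides by n

print(f"mean={mean_f2:.1f} sample σ={sample_sigma:.2f} population σ={population_sigma:.2f}")
```

With only a handful of runs the two variants differ noticeably, so a reported σ is easier to interpret when the variant is stated alongside it.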

Cost

Total cost: $27
Cost / run: $0.38
Cost / 100 LOC: $0.136
Python LOC scanned: 20,062
Successful runs: 72
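The derived cost figures can be approximately reproduced from the totals shown. This is a sketch using the rounded $27 total; the slight gap against the reported $0.136 per 100 LOC is presumably due to the total being rounded for display, which is our assumption.

```python
# Totals as displayed on this page.
total_cost_usd = 27.0
successful_runs = 72
python_loc = 20_062

cost_per_run = total_cost_usd / successful_runs
cost_per_100_loc = total_cost_usd / python_loc * 100

print(f"cost/run = ${cost_per_run:.3f}")
print(f"cost/100 LOC = ${cost_per_100_loc:.4f}")
```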
