Scanner deep-dive

Gemini 3.1 Pro by Google DeepMind

General-Purpose LLM · agentic-v1 · scored on 24/26 repositories. Strict scoring (unfinished repos counted as misses).

Metric	Value
F3 (strict)	49.7
F2 (strict)	51.8
Recall (strict)	47.8%
Precision	77.6%
Repos scored	24/26
Model	gemini-3.1-pro-preview
Total cost	$27
Avg latency	170s
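
The F2 and F3 headline figures can be cross-checked against the precision and strict recall reported above. The following is a minimal sketch, assuming the leaderboard uses the textbook F-beta definition; the function name `f_beta` is illustrative and not taken from the leaderboard's own code.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Textbook F-beta: recall counts beta times as much as precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


precision = 0.776      # Precision, as reported in the summary
strict_recall = 0.478  # Recall (strict), as reported in the summary

# Scores on the 0-100 scale used by the page (assumed rounding: 1 decimal)
f2 = round(100 * f_beta(precision, strict_recall, beta=2), 1)
f3 = round(100 * f_beta(precision, strict_recall, beta=3), 1)
print(f2, f3)
```

Under that assumption the sketch gives 51.8 and 49.7, which agrees with the reported F2 (strict) and F3 (strict). Because beta is greater than 1, both scores sit closer to the recall than to the precision.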

Per-repository breakdown

Each bar shows true positives, false positives, and misses on one repository; bar length is proportional to that repo's labeled vulnerabilities. Ranked by F2.

[Bar chart: one bar per repository, segmented into True positive, False positive, and Missed (FN). The same per-repository values are listed in the table below.]

Repository	TP	FP	FN	Recall %	F2
vulnpy	67	6	11	86.3	87.4
intentionally-vulnerable-python-application	5	1	2	71.4	73.5
vampi	11	4	4	71.1	71.0
vulnerable-api	9	2	5	66.7	69.3
python-app	13	3	7	65.0	67.9
vfapi	6	3	3	66.7	67.2
lets-be-bad-guys	15	1	9	61.1	65.8
insecure-web	6	2	3	63.0	64.0
dsvw	16	8	11	60.5	62.0
dvblab	13	4	9	59.1	61.7
vulnerable-tornado-app	8	1	6	54.8	59.3
dsvpwa	18	4	14	55.2	58.8
damn-vulnerable-flask-application	8	3	7	55.6	57.7
vulnerable-flask-app	11	2	10	52.4	56.5
pythonssti	1	0	1	50.0	53.7
flask-xss	13	3	17	44.5	48.9
extremely-vulnerable-flask-app	14	4	18	43.8	48.0
dvpwa	10	5	12	43.9	46.7
threatbyte	10	7	16	39.8	42.6
python-insecure-app	3	1	5	37.5	41.7
damn-vulnerable-graphql-application	14	7	22	38.0	41.4
pygoat	28	11	49	36.8	40.7
vulpy	19	5	38	33.3	37.6
djangoat	15	9	35	29.3	32.8
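
The counts in the table above can be pooled to cross-check the headline precision. This is a sketch only: pooling by summing counts across repositories is an assumption about how the page aggregates, and the variable names are illustrative.

```python
# (repository, TP, FP, FN) copied from the per-repository table
rows = [
    ("vulnpy", 67, 6, 11),
    ("intentionally-vulnerable-python-application", 5, 1, 2),
    ("vampi", 11, 4, 4),
    ("vulnerable-api", 9, 2, 5),
    ("python-app", 13, 3, 7),
    ("vfapi", 6, 3, 3),
    ("lets-be-bad-guys", 15, 1, 9),
    ("insecure-web", 6, 2, 3),
    ("dsvw", 16, 8, 11),
    ("dvblab", 13, 4, 9),
    ("vulnerable-tornado-app", 8, 1, 6),
    ("dsvpwa", 18, 4, 14),
    ("damn-vulnerable-flask-application", 8, 3, 7),
    ("vulnerable-flask-app", 11, 2, 10),
    ("pythonssti", 1, 0, 1),
    ("flask-xss", 13, 3, 17),
    ("extremely-vulnerable-flask-app", 14, 4, 18),
    ("dvpwa", 10, 5, 12),
    ("threatbyte", 10, 7, 16),
    ("python-insecure-app", 3, 1, 5),
    ("damn-vulnerable-graphql-application", 14, 7, 22),
    ("pygoat", 28, 11, 49),
    ("vulpy", 19, 5, 38),
    ("djangoat", 15, 9, 35),
]

# Sum each count column over the 24 scored repositories
tp = sum(r[1] for r in rows)
fp = sum(r[2] for r in rows)
fn = sum(r[3] for r in rows)

pooled_precision = round(100 * tp / (tp + fp), 1)
pooled_recall = round(100 * tp / (tp + fn), 1)
print(tp, fp, fn, pooled_precision, pooled_recall)
```

The pooled precision comes out at 77.6%, matching the headline Precision. The pooled recall (51.5%) covers only the 24 scored repositories, so it is not the same quantity as the strict recall, which also counts unfinished repositories as misses. Per-row Recall % and F2 values do not always equal figures recomputed from the displayed counts, so the table may be reporting values averaged across runs; that is a guess, not something the page states.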

Detection by severity

Severity	TP	FN	Recall %
Critical	72	10	87.8
High	135	108	55.6
Medium	132	128	50.8
Low	18	44	29.0
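
The Recall % column in the severity table is consistent with recall = TP / (TP + FN). A short sketch, assuming one-decimal rounding; `recall_pct` is a hypothetical helper name.

```python
# (TP, FN) per severity, copied from the table above
severity_counts = {
    "Critical": (72, 10),
    "High": (135, 108),
    "Medium": (132, 128),
    "Low": (18, 44),
}


def recall_pct(tp: int, fn: int) -> float:
    """Share of labeled vulnerabilities that were detected, in percent."""
    return round(100 * tp / (tp + fn), 1)


recalls = {sev: recall_pct(tp, fn) for sev, (tp, fn) in severity_counts.items()}
print(recalls)
```

This reproduces the four Recall % values in the table (87.8, 55.6, 50.8, 29.0). The same helper applies unchanged to the TP and FN columns of the vulnerability-class table.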

Detection by vulnerability class

CWE family	TP	FN	Recall %
SQL Injection	39	0	100.0
XML External Entities	8	0	100.0
Insecure Deserialization	16	0	100.0
Open Redirect	6	0	100.0
HTTP Header Injection	2	0	100.0
XPath Injection	4	0	100.0
Path Traversal	24	1	96.0
Code Injection / RFI	13	1	92.9
Denial of Service	17	3	85.0
Command / OS Injection	14	3	82.4
Broken Access Control / IDOR	14	8	63.6
Cross-Site Scripting	48	31	60.8
Hardcoded Credentials	30	23	56.6
Server-Side Request Forgery	12	10	54.5
Missing Authentication / Authorization	18	25	41.9
Security Misconfiguration	12	20	37.5
Other	68	125	35.2
Sensitive Data Exposure	12	40	23.1

LLM operational metrics

Metric	Value
Avg input tokens	56,142
Avg output tokens	4,315
Avg total tokens	437,119
Avg latency / repo	170s
JSON repair rate	0.0%
Total runs	(not shown)
F2 run-to-run σ	±13.4

Cost

Metric	Value
Total cost	$27
Cost / run	$0.38
Cost / 100 LOC	$0.136
Python LOC scanned	20,062
Successful runs	(not shown)
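
The cost figures above can be checked against each other. A rough sketch, assuming Cost / 100 LOC is simply total cost divided by Python LOC scanned, times 100; the page does not state the formula, and the variable names are illustrative.

```python
total_cost_usd = 27.0          # displayed total (apparently rounded to whole dollars)
python_loc = 20_062            # Python LOC scanned
cost_per_100_loc_usd = 0.136   # displayed Cost / 100 LOC

# Rate recomputed from the rounded total
recomputed_rate = round(total_cost_usd / python_loc * 100, 3)

# Total implied by the displayed rate
implied_total = round(cost_per_100_loc_usd * python_loc / 100, 2)

print(recomputed_rate, implied_total)
```

The recomputed rate is about $0.135 and the implied total is about $27.28, so the two displayed figures agree once the rounding of the $27 total is taken into account.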
