Scanner deep-dive

Grok 3 by xAI

General-Purpose LLM · agentic-v1 · scored on 21/26 repositories. Strict scoring (unfinished repos counted as misses).

21.0

F3 (strict)

22.9

F2 (strict)

19.3%

Recall (strict)

84.4%

Precision

21/26

Repos scored

xai/grok-3

Model

Total cost

34s

Avg latency
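The F2 and F3 headline scores can be approximately reproduced from the precision and strict recall shown above. This is a minimal sketch that assumes the leaderboard uses the standard F-beta definition; the function name `f_beta` is illustrative and does not come from the benchmark's code.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta score; beta > 1 weights recall more heavily than precision."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom

# Headline values from this page, as fractions.
precision = 0.844   # Precision
recall = 0.193      # Recall (strict)

f2 = 100 * f_beta(precision, recall, 2)
f3 = 100 * f_beta(precision, recall, 3)
print(f"F2 = {f2:.1f}, F3 = {f3:.1f}")
```

Under that assumption the sketch gives about 22.8 and 20.9, against the reported 22.9 and 21.0. The small gap is plausibly rounding in the displayed inputs or averaging across runs, though the page does not say which.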

Per-repository breakdown

Each bar shows true positives, false positives, and misses on one repository; bar length is proportional to that repo's labeled vulnerabilities. Ranked by F2.

[Bar chart, one bar per repository. Legend: True positive · False positive · Missed (FN). The per-repository F2 and recall values are listed in the table below.]

Repository	TP	FP	FN	Recall %	F2
vfapi	6	0	3	66.7	71.3
insecure-web	6	0	3	66.7	70.9
dsvpwa	21	3	11	65.6	69.1
pythonssti	1	0	1	50.0	55.6
vampi	8	1	8	50.0	54.8
dsvw	13	0	14	46.9	52.5
dvblab	10	1	12	43.9	49.0
vulnerable-api	6	1	8	42.9	47.9
python-insecure-app	3	0	5	41.7	47.1
lets-be-bad-guys	10	3	14	39.6	43.8
python-app	6	2	14	31.7	35.7
damn-vulnerable-flask-application	5	1	10	31.1	35.5
vulnerable-tornado-app	4	0	10	28.6	33.3
threatbyte	6	0	20	21.8	25.8
vulnerable-flask-app	5	3	16	22.2	25.5
flask-xss	4	1	26	14.4	17.3
dvpwa	2	0	20	10.6	12.9
djangoat	5	2	45	9.3	11.3
vulpy	5	1	52	8.8	10.6
pygoat	5	4	72	6.9	8.4
vulnpy	4	2	74	5.1	6.2
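For a single row, recall and F2 can be computed directly from the TP, FP, and FN counts. The sketch below assumes the standard count-based F-beta formula; `recall_and_f2` is a hypothetical helper, not the benchmark's own scoring code.

```python
def recall_and_f2(tp, fp, fn, beta=2.0):
    """Return (recall %, F-beta %) from raw counts, rounded to one decimal."""
    b2 = beta * beta
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = (1 + b2) * tp + b2 * fn + fp
    f_score = (1 + b2) * tp / denom if denom else 0.0
    return round(100 * recall, 1), round(100 * f_score, 1)

# (TP, FP, FN) copied from three rows of the table above.
rows = {
    "vfapi": (6, 0, 3),
    "dsvpwa": (21, 3, 11),
    "pythonssti": (1, 0, 1),
}

for name, (tp, fp, fn) in rows.items():
    recall_pct, f2_pct = recall_and_f2(tp, fp, fn)
    print(f"{name}: recall {recall_pct}%, F2 {f2_pct}")
```

For these rows the computed values agree with the table to within 0.1. Other rows differ more (dsvw, for example, gives 13 / 27 ≈ 48.1% against the listed 46.9%), which would be consistent with the listed percentages being averaged over several runs while the counts are rounded. That is an assumption; the page does not state how the counts are aggregated.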

Detection by severity

Severity	TP	FN	Recall %
Critical	36	37	49.3
High	56	156	26.4
Medium	38	194	16.4
Low	3	52	5.5
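The per-severity recall figures follow from TP / (TP + FN), and the same counts can be pooled into one overall recall. A minimal sketch using the counts from the table above; the variable names are illustrative.

```python
# (TP, FN) per severity, copied from the table above.
severity_counts = {
    "Critical": (36, 37),
    "High": (56, 156),
    "Medium": (38, 194),
    "Low": (3, 52),
}

def recall_pct(tp, fn):
    """Recall as a percentage; 0.0 when there are no labeled findings."""
    total = tp + fn
    return 100.0 * tp / total if total else 0.0

per_severity = {
    name: round(recall_pct(tp, fn), 1)
    for name, (tp, fn) in severity_counts.items()
}

# Pooled (micro-averaged) recall over every finding in the table.
total_tp = sum(tp for tp, _ in severity_counts.values())
total_fn = sum(fn for _, fn in severity_counts.values())
pooled = round(recall_pct(total_tp, total_fn), 1)

print(per_severity)
print(f"pooled recall: {pooled}% ({total_tp} of {total_tp + total_fn})")
```

The pooled figure over these counts is about 23.3% (133 of 572), higher than the 19.3% strict recall in the headline. That is what one would expect if this table covers only the scored repositories while strict recall also counts unfinished repositories as misses, though the page does not confirm which findings the table includes.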

Detection by vulnerability class

CWE family	TP	FN	Recall %
SQL Injection	25	10	71.4
Command / OS Injection	9	4	69.2
Open Redirect	4	2	66.7
Insecure Deserialization	9	5	64.3
HTTP Header Injection	1	1	50.0
Code Injection / RFI	6	8	42.9
XML External Entities	3	5	37.5
Path Traversal	8	16	33.3
Hardcoded Credentials	13	35	27.1
XPath Injection	1	3	25.0
Cross-Site Scripting	14	57	19.7
Missing Authentication / Authorization	6	30	16.7
Broken Access Control / IDOR	3	15	16.7
Server-Side Request Forgery	3	17	15.0
Security Misconfiguration	4	23	14.8
Other	21	149	12.4
Denial of Service	1	17	5.6
Sensitive Data Exposure	2	42	4.5

LLM operational metrics

15,856

Avg input tokens

1,369

Avg output tokens

17,535

Avg total tokens

34s

Avg latency / repo

0.0%

JSON repair rate

Total runs

±21.3

F2 run-to-run σ

Cost

Total cost

$0.08

Cost / run

$0.028

Cost / 100 LOC

17,556

Python LOC scanned

Successful runs
