Scanner deep-dive

Grok 4.20 Reasoning by xAI

General-Purpose LLM · agentic-v1 · scored on 24/26 repositories. Strict scoring (unfinished repos counted as misses).

F3 (strict): 27.7
F2 (strict): 30.0
Recall (strict): 25.7%
Precision: 93.2%
Repos scored: 24/26
Model: xai/grok-4.20-reasoning-latest
Total cost: $17
Avg latency: 34s
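The F2 and F3 figures in the summary weight recall more heavily than precision. The page does not state how the scores are aggregated, so the following is only a sketch that assumes they follow the standard F-beta formula applied to the overall precision and strict recall; the function name `f_beta` is illustrative.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Inputs taken from the summary above (already rounded on the page).
precision = 0.932  # Precision: 93.2%
recall = 0.257     # Recall (strict): 25.7%

f2 = 100 * f_beta(precision, recall, beta=2)
f3 = 100 * f_beta(precision, recall, beta=3)
print(f"F2 = {f2:.2f}, F3 = {f3:.2f}")
```

Under that assumption the results land within about 0.1 of the reported 30.0 and 27.7; the small gap is what one would expect from feeding in rounded inputs. Because recall is low and precision is high, both scores sit much closer to the recall figure.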

Per-repository breakdown

Each bar shows true positives, false positives, and misses on one repository; bar length is proportional to that repo's labeled vulnerabilities. Ranked by F2.

[Bar chart: one bar per repository, segmented into true positives, false positives, and misses (FN). The per-repository values are listed in the table below.]

Repository	TP	FP	FN	Recall %	F2
dsvpwa	21	3	11	65.6	69.1
insecure-web	5	0	4	55.6	61.0
vfapi	5	0	4	55.6	60.5
intentionally-vulnerable-python-application	4	1	3	52.4	56.0
pythonssti	1	0	1	50.0	55.6
python-insecure-app	3	0	5	41.7	46.6
dsvw	11	0	16	40.7	46.2
dvblab	9	0	13	40.9	46.1
vulnerable-api	6	0	8	40.5	45.9
vulnerable-tornado-app	5	0	9	38.1	43.5
damn-vulnerable-flask-application	5	0	10	35.6	40.6
python-app	7	0	13	35.0	40.0
vampi	5	0	10	33.3	38.2
extremely-vulnerable-flask-app	11	1	21	33.3	38.0
vulnpy	27	5	51	34.2	35.2
lets-be-bad-guys	6	0	18	23.6	27.9
vulnerable-flask-app	5	1	16	23.8	27.9
dvpwa	5	0	17	22.7	26.9
threatbyte	6	0	20	21.8	25.8
damn-vulnerable-graphql-application	7	1	29	18.5	22.0
flask-xss	5	0	25	16.7	20.0
djangoat	8	0	42	16.0	19.2
pygoat	7	0	70	8.7	10.5
vulpy	5	1	52	8.2	10.0

Detection by severity

Severity	TP	FN	Recall %
Critical	46	36	56.1
High	68	175	28.0
Medium	35	225	13.5
Low	2	60	3.2
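The recall column in the severity table can be reproduced from the TP and FN counts. This is a minimal sketch assuming recall is computed as TP / (TP + FN) on the counts shown; the names `severity_counts` and `recall_pct` are illustrative.

```python
# (TP, FN) pairs copied from the severity table above.
severity_counts = {
    "Critical": (46, 36),
    "High": (68, 175),
    "Medium": (35, 225),
    "Low": (2, 60),
}


def recall_pct(tp: int, fn: int) -> float:
    """Recall as a percentage; returns 0.0 when there are no labeled findings."""
    total = tp + fn
    return 0.0 if total == 0 else 100.0 * tp / total


recalls = {name: recall_pct(tp, fn) for name, (tp, fn) in severity_counts.items()}
for name, value in recalls.items():
    print(f"{name}: {value:.1f}%")
```

The same calculation applies to the vulnerability-class table. It does not reproduce every row of the per-repository table exactly, which suggests those figures may be averaged over several runs.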

Detection by vulnerability class

CWE family	TP	FN	Recall %
SQL Injection	33	6	84.6
Open Redirect	5	1	83.3
Command / OS Injection	11	6	64.7
Insecure Deserialization	10	6	62.5
HTTP Header Injection	1	1	50.0
Path Traversal	11	14	44.0
XML External Entities	3	5	37.5
Hardcoded Credentials	15	38	28.3
XPath Injection	1	3	25.0
Server-Side Request Forgery	5	17	22.7
Code Injection / RFI	3	11	21.4
Security Misconfiguration	6	26	18.8
Broken Access Control / IDOR	4	18	18.2
Cross-Site Scripting	11	68	13.9
Other	26	167	13.5
Missing Authentication / Authorization	5	38	11.6
Sensitive Data Exposure	1	51	1.9
Denial of Service	0	20	0.0

LLM operational metrics

Avg input tokens: 110,646
Avg output tokens: 2,042
Avg total tokens: 112,688
Avg latency / repo: 34s
JSON repair rate: 0.0%

Total runs

F2 run-to-run σ: ±16.0

Cost

Total cost: $17
Cost / run: $0.23
Cost / 100 LOC: $0.084
Python LOC scanned: 20,062

Successful runs
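The cost-per-100-LOC figure appears to be the total cost normalized by the amount of Python code scanned. A rough sketch, assuming it is simply total cost divided by LOC and scaled to 100 lines (the function name `cost_per_100_loc` is illustrative):

```python
def cost_per_100_loc(total_cost_usd: float, loc_scanned: int) -> float:
    """Cost in USD attributed to every 100 lines of scanned code."""
    if loc_scanned <= 0:
        raise ValueError("loc_scanned must be positive")
    return 100.0 * total_cost_usd / loc_scanned


total_cost_usd = 17.0   # Total cost (rounded to whole dollars on the page)
python_loc = 20_062     # Python LOC scanned

estimate = cost_per_100_loc(total_cost_usd, python_loc)
print(f"${estimate:.4f} per 100 LOC")
```

With the rounded $17 total this comes out within a fraction of a cent of the reported $0.084, which is consistent with the page using an unrounded total.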
