Scanner deep-dive
Grok 4.20 Reasoning by xAI
General-Purpose LLM · agentic-v1 · scored on 24/26 repositories. Strict scoring (unfinished repos counted as misses).
| Metric | Value |
|---|---|
| F3 (strict) | 27.7 |
| F2 (strict) | 30.0 |
| Recall (strict) | 25.7% |
| Precision | 93.2% |
| Repos scored | 24/26 |
| Model | xai/grok-4.20-reasoning-latest |
| Total cost | $17 |
| Avg latency | 34s |
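
The page does not state how F2 and F3 are derived. A minimal sketch, assuming the standard F-beta definition applied to the headline precision and strict recall, comes close to the reported scores. The function name `f_beta` is illustrative and not taken from the benchmark.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Headline figures from the summary above, as fractions.
precision = 0.932
recall_strict = 0.257

# Scores on the 0-100 scale used by the page.
f2 = 100 * f_beta(precision, recall_strict, beta=2)
f3 = 100 * f_beta(precision, recall_strict, beta=3)

print(f"F2 = {f2:.2f}, F3 = {f3:.2f}")
```

Because the inputs are already rounded to one decimal place, the results agree with the reported 30.0 and 27.7 only to within about 0.1. With precision above 93% and recall under 26%, both recall-weighted scores sit much closer to recall than to precision.
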
§
Per-repository breakdown
Each row shows true positives (TP), false positives (FP), and misses (FN) for one repository; in the original chart, bar length is proportional to that repo's labeled vulnerabilities. Rows are ranked by F2.
| Repository | TP | FP | FN | Recall % | F2 |
|---|---|---|---|---|---|
| dsvpwa | 21 | 3 | 11 | 65.6 | 69.1 |
| insecure-web | 5 | 0 | 4 | 55.6 | 61.0 |
| vfapi | 5 | 0 | 4 | 55.6 | 60.5 |
| intentionally-vulnerable-python-application | 4 | 1 | 3 | 52.4 | 56.0 |
| pythonssti | 1 | 0 | 1 | 50.0 | 55.6 |
| python-insecure-app | 3 | 0 | 5 | 41.7 | 46.6 |
| dsvw | 11 | 0 | 16 | 40.7 | 46.2 |
| dvblab | 9 | 0 | 13 | 40.9 | 46.1 |
| vulnerable-api | 6 | 0 | 8 | 40.5 | 45.9 |
| vulnerable-tornado-app | 5 | 0 | 9 | 38.1 | 43.5 |
| damn-vulnerable-flask-application | 5 | 0 | 10 | 35.6 | 40.6 |
| python-app | 7 | 0 | 13 | 35.0 | 40.0 |
| vampi | 5 | 0 | 10 | 33.3 | 38.2 |
| extremely-vulnerable-flask-app | 11 | 1 | 21 | 33.3 | 38.0 |
| vulnpy | 27 | 5 | 51 | 34.2 | 35.2 |
| lets-be-bad-guys | 6 | 0 | 18 | 23.6 | 27.9 |
| vulnerable-flask-app | 5 | 1 | 16 | 23.8 | 27.9 |
| dvpwa | 5 | 0 | 17 | 22.7 | 26.9 |
| threatbyte | 6 | 0 | 20 | 21.8 | 25.8 |
| damn-vulnerable-graphql-application | 7 | 1 | 29 | 18.5 | 22.0 |
| flask-xss | 5 | 0 | 25 | 16.7 | 20.0 |
| djangoat | 8 | 0 | 42 | 16.0 | 19.2 |
| pygoat | 7 | 0 | 70 | 8.7 | 10.5 |
| vulpy | 5 | 1 | 52 | 8.2 | 10.0 |
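
As a rough check on the table above, recall and F2 for a single row can be recomputed from its TP, FP, and FN counts. The sketch below assumes the usual definitions (recall = TP / (TP + FN), F2 as the F-beta score with beta = 2); the helper `row_metrics` is hypothetical and not part of the benchmark.

```python
def row_metrics(tp: int, fp: int, fn: int, beta: float = 2.0) -> tuple[float, float]:
    """Return (recall %, F-beta %) computed from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    b2 = beta * beta
    denom = b2 * precision + recall
    f_score = (1 + b2) * precision * recall / denom if denom else 0.0
    return 100 * recall, 100 * f_score


# dsvpwa row: TP = 21, FP = 3, FN = 11
recall_pct, f2_pct = row_metrics(21, 3, 11)
print(f"recall = {recall_pct:.1f}%, F2 = {f2_pct:.1f}")
```

The dsvpwa row reproduces this way (65.6 and 69.1). Some other rows do not: for intentionally-vulnerable-python-application, 4 / (4 + 3) is 57.1%, while the table lists 52.4%. One possible explanation, which the page does not confirm, is that the percentages are averaged over several runs per repository while the counts are rounded averages.
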
§
Detection by severity
| Severity | TP | FP | FN | Recall % |
|---|---|---|---|---|
| Critical | 46 | 0 | 36 | 56.1 |
| High | 68 | 0 | 175 | 28.0 |
| Medium | 35 | 0 | 225 | 13.5 |
| Low | 2 | 0 | 60 | 3.2 |
§
Detection by vulnerability class
| CWE family | TP | FP | FN | Recall % |
|---|---|---|---|---|
| SQL Injection | 33 | 0 | 6 | 84.6 |
| Open Redirect | 5 | 0 | 1 | 83.3 |
| Command / OS Injection | 11 | 0 | 6 | 64.7 |
| Insecure Deserialization | 10 | 0 | 6 | 62.5 |
| HTTP Header Injection | 1 | 0 | 1 | 50.0 |
| Path Traversal | 11 | 0 | 14 | 44.0 |
| XML External Entities | 3 | 0 | 5 | 37.5 |
| Hardcoded Credentials | 15 | 0 | 38 | 28.3 |
| XPath Injection | 1 | 0 | 3 | 25.0 |
| Server-Side Request Forgery | 5 | 0 | 17 | 22.7 |
| Code Injection / RFI | 3 | 0 | 11 | 21.4 |
| Security Misconfiguration | 6 | 0 | 26 | 18.8 |
| Broken Access Control / IDOR | 4 | 0 | 18 | 18.2 |
| Cross-Site Scripting | 11 | 0 | 68 | 13.9 |
| Other | 26 | 0 | 167 | 13.5 |
| Missing Authentication / Authorization | 5 | 0 | 38 | 11.6 |
| Sensitive Data Exposure | 1 | 0 | 51 | 1.9 |
| Denial of Service | 0 | 0 | 20 | 0.0 |
§
LLM operational metrics
| Metric | Value |
|---|---|
| Avg input tokens | 110,646 |
| Avg output tokens | 2,042 |
| Avg total tokens | 112,688 |
| Avg latency / repo | 34s |
| JSON repair rate | 0.0% |
| Total runs | 72 |
| F2 run-to-run σ | ±16.0 |
§
Cost
| Metric | Value |
|---|---|
| Total cost | $17 |
| Cost / run | $0.23 |
| Cost / 100 LOC | $0.084 |
| Python LOC scanned | 20,062 |
| Successful runs | 72 |
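
The per-run and per-100-LOC figures appear to follow from the total cost, the run count, and the lines of code scanned. The sketch below is an assumption about how they relate, not the benchmark's own calculation: it divides the displayed total by the number of runs, and by the LOC figure counted once.

```python
# Displayed values from the cost section; the total is rounded to whole dollars.
total_cost_usd = 17.0
successful_runs = 72
python_loc_scanned = 20_062

# Assumed relationships (not stated on the page).
cost_per_run = total_cost_usd / successful_runs
cost_per_100_loc = total_cost_usd / python_loc_scanned * 100

print(f"cost / run     ~ ${cost_per_run:.3f}")
print(f"cost / 100 LOC ~ ${cost_per_100_loc:.4f}")
```

Since the total is shown only to the nearest dollar, the derived values match the reported $0.23 and $0.084 approximately rather than exactly.
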