Scanner deep-dive
Grok 3 by xAI
General-Purpose LLM · agentic-v1 · scored on 21 of 26 repositories. Strict scoring: unfinished repositories are counted as misses.
| Metric | Value |
|---|---|
| F3 (strict) | 21.0 |
| F2 (strict) | 22.9 |
| Recall (strict) | 19.3% |
| Precision | 84.4% |
| Repos scored | 21/26 |
| Model | xai/grok-3 |
| Total cost | $5 |
| Avg latency | 34s |
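The F2 and F3 figures above weight recall more heavily than precision. As a rough check, the sketch below applies the standard F-beta formula to the rounded headline precision and recall. It assumes the dashboard uses that standard definition, which the page does not state; the function name `f_beta` is illustrative.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Headline figures from the summary above, expressed as fractions.
precision = 0.844
recall = 0.193

f2 = f_beta(precision, recall, beta=2)
f3 = f_beta(precision, recall, beta=3)
print(f"F2 = {100 * f2:.1f}, F3 = {100 * f3:.1f}")
```

With these rounded inputs the results land within about 0.1 of the reported 22.9 and 21.0. The small gap may come from rounding or from averaging across runs, which is a guess rather than something the page confirms.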
§
Per-repository breakdown
Each row shows true positives (TP), false positives (FP), and misses (FN) on one repository. Rows are ranked by F2.
| Repository | TP | FP | FN | Recall % | F2 |
|---|---|---|---|---|---|
| vfapi | 6 | 0 | 3 | 66.7 | 71.3 |
| insecure-web | 6 | 0 | 3 | 66.7 | 70.9 |
| dsvpwa | 21 | 3 | 11 | 65.6 | 69.1 |
| pythonssti | 1 | 0 | 1 | 50.0 | 55.6 |
| vampi | 8 | 1 | 8 | 50.0 | 54.8 |
| dsvw | 13 | 0 | 14 | 46.9 | 52.5 |
| dvblab | 10 | 1 | 12 | 43.9 | 49.0 |
| vulnerable-api | 6 | 1 | 8 | 42.9 | 47.9 |
| python-insecure-app | 3 | 0 | 5 | 41.7 | 47.1 |
| lets-be-bad-guys | 10 | 3 | 14 | 39.6 | 43.8 |
| python-app | 6 | 2 | 14 | 31.7 | 35.7 |
| damn-vulnerable-flask-application | 5 | 1 | 10 | 31.1 | 35.5 |
| vulnerable-tornado-app | 4 | 0 | 10 | 28.6 | 33.3 |
| threatbyte | 6 | 0 | 20 | 21.8 | 25.8 |
| vulnerable-flask-app | 5 | 3 | 16 | 22.2 | 25.5 |
| flask-xss | 4 | 1 | 26 | 14.4 | 17.3 |
| dvpwa | 2 | 0 | 20 | 10.6 | 12.9 |
| djangoat | 5 | 2 | 45 | 9.3 | 11.3 |
| vulpy | 5 | 1 | 52 | 8.8 | 10.6 |
| pygoat | 5 | 4 | 72 | 6.9 | 8.4 |
| vulnpy | 4 | 2 | 74 | 5.1 | 6.2 |
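To see how the Recall % and F2 columns relate to the TP, FP, and FN counts, the sketch below recomputes both for three rows of the table. It assumes the standard count-based definitions; the `Counts` type and helper names are hypothetical, not part of the scanner.

```python
from typing import NamedTuple


class Counts(NamedTuple):
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # misses


def recall_pct(c: Counts) -> float:
    labeled = c.tp + c.fn
    return 100.0 * c.tp / labeled if labeled else 0.0


def f2_pct(c: Counts) -> float:
    # F2 written in terms of counts: 5*TP / (5*TP + 4*FN + FP).
    denom = 5 * c.tp + 4 * c.fn + c.fp
    return 100.0 * 5 * c.tp / denom if denom else 0.0


# Three rows copied from the table above.
rows = {
    "vfapi": Counts(6, 0, 3),
    "vampi": Counts(8, 1, 8),
    "dsvw": Counts(13, 0, 14),
}

for name, c in rows.items():
    print(f"{name}: recall {recall_pct(c):.1f}%, F2 {f2_pct(c):.1f}")
```

The vampi row reproduces exactly (50.0 and 54.8), vfapi comes out at 71.4 against the published 71.3, and dsvw gives 48.1 and 53.7 against 46.9 and 52.5. One possible explanation is that the published percentages are averaged over several runs while the counts are rounded; that reading is an assumption.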
§
Detection by severity
| Severity | TP | FP | FN | Recall % |
|---|---|---|---|---|
| Critical | 36 | 0 | 37 | 49.3 |
| High | 56 | 0 | 156 | 26.4 |
| Medium | 38 | 0 | 194 | 16.4 |
| Low | 3 | 0 | 52 | 5.5 |
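The severity recall values follow directly from the counts as TP / (TP + FN). The sketch below recomputes each row and a pooled recall over all four severities; the helper name and the pooling step are illustrative additions, not something the page reports.

```python
# (TP, FN) per severity, copied from the table above.
severity = {
    "Critical": (36, 37),
    "High": (56, 156),
    "Medium": (38, 194),
    "Low": (3, 52),
}


def recall_pct(tp: int, fn: int) -> float:
    labeled = tp + fn
    return 100.0 * tp / labeled if labeled else 0.0


for name, (tp, fn) in severity.items():
    print(f"{name}: {recall_pct(tp, fn):.1f}%")

# Pool the counts across severities for a single overall recall.
total_tp = sum(tp for tp, _ in severity.values())
total_fn = sum(fn for _, fn in severity.values())
pooled = recall_pct(total_tp, total_fn)
print(f"Pooled: {total_tp}/{total_tp + total_fn} = {pooled:.1f}%")
```

The pooled figure (about 23.3%) is higher than the strict headline recall of 19.3%. One plausible reading, not confirmed by the page, is that this table covers only the scored repositories, while strict scoring also counts unfinished repositories as misses.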
§
Detection by vulnerability class
| CWE family | TP | FP | FN | Recall % |
|---|---|---|---|---|
| SQL Injection | 25 | 0 | 10 | 71.4 |
| Command / OS Injection | 9 | 0 | 4 | 69.2 |
| Open Redirect | 4 | 0 | 2 | 66.7 |
| Insecure Deserialization | 9 | 0 | 5 | 64.3 |
| HTTP Header Injection | 1 | 0 | 1 | 50.0 |
| Code Injection / RFI | 6 | 0 | 8 | 42.9 |
| XML External Entities | 3 | 0 | 5 | 37.5 |
| Path Traversal | 8 | 0 | 16 | 33.3 |
| Hardcoded Credentials | 13 | 0 | 35 | 27.1 |
| XPath Injection | 1 | 0 | 3 | 25.0 |
| Cross-Site Scripting | 14 | 0 | 57 | 19.7 |
| Missing Authentication / Authorization | 6 | 0 | 30 | 16.7 |
| Broken Access Control / IDOR | 3 | 0 | 15 | 16.7 |
| Server-Side Request Forgery | 3 | 0 | 17 | 15.0 |
| Security Misconfiguration | 4 | 0 | 23 | 14.8 |
| Other | 21 | 0 | 149 | 12.4 |
| Denial of Service | 1 | 0 | 17 | 5.6 |
| Sensitive Data Exposure | 2 | 0 | 42 | 4.5 |
§
LLM operational metrics
| Metric | Value |
|---|---|
| Avg input tokens | 15,856 |
| Avg output tokens | 1,369 |
| Avg total tokens | 17,535 |
| Avg latency / repo | 34s |
| JSON repair rate | 0.0% |
| Total runs | 72 |
| F2 run-to-run σ | ±21.3 |
§
Cost
| Metric | Value |
|---|---|
| Total cost | $5 |
| Cost / run | $0.08 |
| Cost / 100 LOC | $0.028 |
| Python LOC scanned | 17,556 |
| Successful runs | 58 |
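The derived cost figures can be approximated from the totals shown on this page. The sketch below is a back-of-the-envelope check: it uses the displayed total of $5, which appears to be rounded, and tries both run counts as the per-run denominator because the page does not say which one is used. Variable names are illustrative.

```python
total_cost_usd = 5.0    # displayed total; appears rounded to whole dollars
python_loc = 17_556     # Python LOC scanned
total_runs = 72
successful_runs = 58

cost_per_100_loc = total_cost_usd / python_loc * 100
cost_per_run_all = total_cost_usd / total_runs
cost_per_run_ok = total_cost_usd / successful_runs

print(f"Cost / 100 LOC:             ${cost_per_100_loc:.3f}")
print(f"Cost / run (all runs):      ${cost_per_run_all:.3f}")
print(f"Cost / run (successful):    ${cost_per_run_ok:.3f}")
```

The per-100-LOC estimate agrees with the reported $0.028. The two per-run estimates, roughly $0.07 and $0.09, bracket the reported $0.08; because the total is rounded, these numbers alone do not settle which denominator the dashboard uses.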