Scanner deep-dive
Gemini 3.1 Pro by Google DeepMind
General-Purpose LLM · agentic-v1 · scored on 24/26 repositories. Strict scoring (unfinished repos counted as misses).
| Metric | Value |
|---|---|
| F3 (strict) | 49.7 |
| F2 (strict) | 51.8 |
| Recall (strict) | 47.8% |
| Precision | 77.6% |
| Repos scored | 24/26 |
| Model | gemini-3.1-pro-preview |
| Total cost | $27 |
| Avg latency | 170s |
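The headline F2 and F3 scores can be cross-checked against the reported precision and strict recall. The sketch below assumes the dashboard uses the standard F-beta definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); the helper name `f_beta` is illustrative and not part of the benchmark's tooling.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Headline values reported on this page, expressed as fractions.
precision = 0.776
recall_strict = 0.478

f2 = 100 * f_beta(precision, recall_strict, beta=2)
f3 = 100 * f_beta(precision, recall_strict, beta=3)
print(f"F2 = {f2:.1f}, F3 = {f3:.1f}")  # F2 = 51.8, F3 = 49.7
```

Under that assumption both headline scores are reproduced. Because recall (47.8%) is well below precision (77.6%), the score falls as beta rises, which is why F3 sits below F2.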
Per-repository breakdown
Each row reports true positives (TP), false positives (FP), and misses (FN) for one repository. Rows are ranked by F2, highest first.
| Repository | TP | FP | FN | Recall % | F2 |
|---|---|---|---|---|---|
| vulnpy | 67 | 6 | 11 | 86.3 | 87.4 |
| intentionally-vulnerable-python-application | 5 | 1 | 2 | 71.4 | 73.5 |
| vampi | 11 | 4 | 4 | 71.1 | 71.0 |
| vulnerable-api | 9 | 2 | 5 | 66.7 | 69.3 |
| python-app | 13 | 3 | 7 | 65.0 | 67.9 |
| vfapi | 6 | 3 | 3 | 66.7 | 67.2 |
| lets-be-bad-guys | 15 | 1 | 9 | 61.1 | 65.8 |
| insecure-web | 6 | 2 | 3 | 63.0 | 64.0 |
| dsvw | 16 | 8 | 11 | 60.5 | 62.0 |
| dvblab | 13 | 4 | 9 | 59.1 | 61.7 |
| vulnerable-tornado-app | 8 | 1 | 6 | 54.8 | 59.3 |
| dsvpwa | 18 | 4 | 14 | 55.2 | 58.8 |
| damn-vulnerable-flask-application | 8 | 3 | 7 | 55.6 | 57.7 |
| vulnerable-flask-app | 11 | 2 | 10 | 52.4 | 56.5 |
| pythonssti | 1 | 0 | 1 | 50.0 | 53.7 |
| flask-xss | 13 | 3 | 17 | 44.5 | 48.9 |
| extremely-vulnerable-flask-app | 14 | 4 | 18 | 43.8 | 48.0 |
| dvpwa | 10 | 5 | 12 | 43.9 | 46.7 |
| threatbyte | 10 | 7 | 16 | 39.8 | 42.6 |
| python-insecure-app | 3 | 1 | 5 | 37.5 | 41.7 |
| damn-vulnerable-graphql-application | 14 | 7 | 22 | 38.0 | 41.4 |
| pygoat | 28 | 11 | 49 | 36.8 | 40.7 |
| vulpy | 19 | 5 | 38 | 33.3 | 37.6 |
| djangoat | 15 | 9 | 35 | 29.3 | 32.8 |
Detection by severity
| Severity | TP | FP | FN | Recall % |
|---|---|---|---|---|
| Critical | 72 | 0 | 10 | 87.8 |
| High | 135 | 0 | 108 | 55.6 |
| Medium | 132 | 0 | 128 | 50.8 |
| Low | 18 | 0 | 44 | 29.0 |
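The recall column in the severity table follows directly from the TP and FN counts as TP / (TP + FN). A minimal sketch that recomputes it from the figures above; the names `severity_counts` and `recall_pct` are illustrative only.

```python
# (TP, FN) per severity, copied from the table above.
severity_counts = {
    "Critical": (72, 10),
    "High": (135, 108),
    "Medium": (132, 128),
    "Low": (18, 44),
}


def recall_pct(tp: int, fn: int) -> float:
    """Recall as a percentage: share of labeled vulnerabilities that were found."""
    total = tp + fn
    return 100.0 * tp / total if total else 0.0


recalls = {sev: recall_pct(tp, fn) for sev, (tp, fn) in severity_counts.items()}
for sev, value in recalls.items():
    print(f"{sev:<8} {value:.1f}")
```

The recomputed values round to the percentages listed in the table (87.8, 55.6, 50.8, 29.0).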
Detection by vulnerability class
| CWE family | TP | FP | FN | Recall % |
|---|---|---|---|---|
| SQL Injection | 39 | 0 | 0 | 100.0 |
| XML External Entities | 8 | 0 | 0 | 100.0 |
| Insecure Deserialization | 16 | 0 | 0 | 100.0 |
| Open Redirect | 6 | 0 | 0 | 100.0 |
| HTTP Header Injection | 2 | 0 | 0 | 100.0 |
| XPath Injection | 4 | 0 | 0 | 100.0 |
| Path Traversal | 24 | 0 | 1 | 96.0 |
| Code Injection / RFI | 13 | 0 | 1 | 92.9 |
| Denial of Service | 17 | 0 | 3 | 85.0 |
| Command / OS Injection | 14 | 0 | 3 | 82.4 |
| Broken Access Control / IDOR | 14 | 0 | 8 | 63.6 |
| Cross-Site Scripting | 48 | 0 | 31 | 60.8 |
| Hardcoded Credentials | 30 | 0 | 23 | 56.6 |
| Server-Side Request Forgery | 12 | 0 | 10 | 54.5 |
| Missing Authentication / Authorization | 18 | 0 | 25 | 41.9 |
| Security Misconfiguration | 12 | 0 | 20 | 37.5 |
| Other | 68 | 0 | 125 | 35.2 |
| Sensitive Data Exposure | 12 | 0 | 40 | 23.1 |
LLM operational metrics
| Metric | Value |
|---|---|
| Avg input tokens | 56,142 |
| Avg output tokens | 4,315 |
| Avg total tokens | 437,119 |
| Avg latency / repo | 170s |
| JSON repair rate | 0.0% |
| Total runs | 72 |
| F2 run-to-run σ | ±13.4 |
Cost
| Metric | Value |
|---|---|
| Total cost | $27 |
| Cost / run | $0.38 |
| Cost / 100 LOC | $0.136 |
| Python LOC scanned | 20,062 |
| Successful runs | 72 |
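The per-run and per-100-LOC costs can be approximated from the totals reported in this section. This is a rough sketch that assumes both are simple ratios of the total cost; the variable names are illustrative.

```python
total_cost_usd = 27.0  # reported total, shown rounded to whole dollars
total_runs = 72
python_loc = 20_062

cost_per_run = total_cost_usd / total_runs
cost_per_100_loc = total_cost_usd / python_loc * 100

print(f"cost / run     = ${cost_per_run:.3f}")
print(f"cost / 100 LOC = ${cost_per_100_loc:.3f}")
```

The per-run result (about $0.375) agrees with the reported $0.38. The per-100-LOC result (about $0.135) is slightly below the reported $0.136, which would be consistent with the displayed $27 total being a rounded figure, though the page does not state the unrounded amount.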