Scanner deep-dive
Gemini 3.5 Flash by Google DeepMind
General-Purpose LLM · agentic-v1 · scored on 26/26 repositories. Strict scoring (unfinished repos counted as misses).
| Metric | Value |
|---|---|
| F3 (strict) | 47.6 |
| F2 (strict) | 50.0 |
| Recall (strict) | 45.4% |
| Precision | 84.0% |
| Repos scored | 26/26 |
| Model | gemini-3.5-flash |
| Total cost | $28 |
| Avg latency | 121s |
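
The F2 and F3 figures can be recovered from the reported precision and strict recall. The page does not state the formula, so the sketch below assumes the standard F-beta definition, in which beta > 1 weights recall more heavily than precision. The function name `f_beta` is illustrative and does not come from the benchmark's code.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score; beta > 1 favors recall over precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Headline values from the summary, expressed as fractions.
precision = 0.840  # Precision
recall = 0.454     # Recall (strict)

f2 = round(100 * f_beta(precision, recall, beta=2), 1)
f3 = round(100 * f_beta(precision, recall, beta=3), 1)
print(f2, f3)  # 50.0 47.6
```

Under that assumption, both results agree with the published F2 (strict) and F3 (strict) values to one decimal place.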
Per-repository breakdown
Each row shows true positives (TP), false positives (FP), and misses (FN) on one repository. Rows are ranked by F2, highest first.
| Repository | TP | FP | FN | Recall % | F2 |
|---|---|---|---|---|---|
| vulnpy | 77 | 22 | 1 | 98.7 | 93.7 |
| intentionally-vulnerable-python-application | 5 | 0 | 2 | 71.4 | 75.5 |
| dsvpwa | 21 | 3 | 11 | 65.6 | 69.1 |
| dsvw | 17 | 1 | 10 | 64.2 | 68.7 |
| vampi | 9 | 1 | 6 | 57.8 | 62.2 |
| dvblab | 12 | 2 | 10 | 56.1 | 60.4 |
| insecure-web | 5 | 1 | 4 | 55.6 | 59.8 |
| pythonssti | 1 | 0 | 1 | 50.0 | 55.6 |
| python-app | 10 | 0 | 10 | 48.3 | 53.7 |
| lets-be-bad-guys | 12 | 1 | 12 | 48.6 | 53.5 |
| owasp-web-playground | 14 | 2 | 14 | 48.2 | 53.2 |
| vulnerable-tornado-app | 6 | 0 | 8 | 45.2 | 50.6 |
| extremely-vulnerable-flask-app | 14 | 2 | 18 | 43.8 | 48.6 |
| vulnerable-flask-app | 8 | 1 | 12 | 40.5 | 45.5 |
| vfapi | 4 | 0 | 5 | 40.7 | 44.1 |
| damn-vulnerable-flask-application | 6 | 1 | 9 | 40.0 | 43.9 |
| threatbyte | 10 | 0 | 16 | 37.2 | 42.5 |
| vulnerable-python-apps | 8 | 1 | 14 | 34.8 | 39.7 |
| pygoat | 25 | 14 | 52 | 32.9 | 34.5 |
| vulnerable-api | 4 | 1 | 10 | 30.9 | 34.4 |
| damn-vulnerable-graphql-application | 10 | 0 | 26 | 29.2 | 33.9 |
| python-insecure-app | 2 | 0 | 6 | 29.2 | 33.5 |
| dvpwa | 6 | 1 | 16 | 28.8 | 33.0 |
| flask-xss | 8 | 1 | 22 | 26.7 | 31.1 |
| vulpy | 14 | 3 | 43 | 25.1 | 29.2 |
| djangoat | 8 | 2 | 42 | 16.0 | 18.6 |
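
The page does not say how the headline precision and recall are aggregated. One reading that is consistent with this table is micro-averaging: pool TP, FP, and FN across all 26 repositories and compute the ratios once. The sketch below checks that reading against the published counts. The variable names are illustrative. Some per-row Recall % values do not equal TP / (TP + FN) for the row as printed (for example, vampi shows 57.8 while 9 / 15 is 60.0), so only the totals are checked here.

```python
# (TP, FP, FN) per repository, copied from the table in row order.
COUNTS = [
    (77, 22, 1), (5, 0, 2), (21, 3, 11), (17, 1, 10), (9, 1, 6),
    (12, 2, 10), (5, 1, 4), (1, 0, 1), (10, 0, 10), (12, 1, 12),
    (14, 2, 14), (6, 0, 8), (14, 2, 18), (8, 1, 12), (4, 0, 5),
    (6, 1, 9), (10, 0, 16), (8, 1, 14), (25, 14, 52), (4, 1, 10),
    (10, 0, 26), (2, 0, 6), (6, 1, 16), (8, 1, 22), (14, 3, 43),
    (8, 2, 42),
]

# Pool the counts across repositories (micro-averaging).
tp = sum(row[0] for row in COUNTS)
fp = sum(row[1] for row in COUNTS)
fn = sum(row[2] for row in COUNTS)

precision = tp / (tp + fp)  # share of reported findings that are real
recall = tp / (tp + fn)     # share of labeled vulnerabilities found

print(tp, fp, fn)                    # 316 60 380
print(round(100 * precision, 1))     # 84.0
print(round(100 * recall, 1))        # 45.4
```

The pooled values match the headline Precision (84.0%) and Recall (strict) (45.4%), which suggests, but does not prove, that the summary is micro-averaged.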
Detection by severity
| Severity | TP | FP | FN | Recall % |
|---|---|---|---|---|
| Critical | 73 | 0 | 13 | 84.9 |
| High | 141 | 2 | 123 | 53.4 |
| Medium | 110 | 1 | 169 | 39.4 |
| Low | 15 | 0 | 53 | 22.1 |
Detection by vulnerability class
| CWE family | TP | FP | FN | Recall % |
|---|---|---|---|---|
| XML External Entities | 8 | 0 | 0 | 100.0 |
| Open Redirect | 6 | 0 | 0 | 100.0 |
| XPath Injection | 4 | 0 | 0 | 100.0 |
| Insecure Deserialization | 18 | 0 | 1 | 94.7 |
| Server-Side Request Forgery | 22 | 2 | 2 | 91.7 |
| SQL Injection | 43 | 0 | 4 | 91.5 |
| Denial of Service | 17 | 0 | 3 | 85.0 |
| Command / OS Injection | 14 | 0 | 3 | 82.4 |
| Path Traversal | 21 | 0 | 5 | 80.8 |
| Code Injection / RFI | 11 | 0 | 3 | 78.6 |
| Broken Access Control / IDOR | 17 | 0 | 7 | 70.8 |
| HTTP Header Injection | 1 | 0 | 1 | 50.0 |
| Cross-Site Scripting | 37 | 0 | 45 | 45.1 |
| Hardcoded Credentials | 27 | 1 | 34 | 44.3 |
| Missing Authentication / Authorization | 19 | 0 | 28 | 40.4 |
| Other | 61 | 0 | 145 | 29.6 |
| Security Misconfiguration | 8 | 0 | 25 | 24.2 |
| Sensitive Data Exposure | 5 | 0 | 52 | 8.8 |
LLM operational metrics
| Metric | Value |
|---|---|
| Avg input tokens | 106,467 |
| Avg output tokens | 4,848 |
| Avg total tokens | 470,869 |
| Avg latency / repo | 121s |
| JSON repair rate | 0.0% |
| Total runs | 74 |
| F2 run-to-run σ | ±16.7 |
Cost
| Metric | Value |
|---|---|
| Total cost | $28 |
| Cost / run | $0.38 |
| Cost / 100 LOC | $0.140 |
| Python LOC scanned | 20,062 |
| Successful runs | 74 |
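
The two unit-cost figures appear to follow from the totals in this section. The sketch below assumes that cost per run divides total cost by the number of successful runs, and that cost per 100 LOC divides total cost by the Python LOC scanned, counted once rather than once per run. The page confirms neither assumption, and the $28 total is itself rounded, so the results are approximate.

```python
total_cost_usd = 28.0   # Total cost (rounded on the page)
successful_runs = 74    # Successful runs
python_loc = 20_062     # Python LOC scanned

cost_per_run = total_cost_usd / successful_runs
cost_per_100_loc = total_cost_usd / (python_loc / 100)

print(f"${cost_per_run:.2f} per run")          # $0.38 per run
print(f"${cost_per_100_loc:.3f} per 100 LOC")  # $0.140 per 100 LOC
```

Both derived values round to the published $0.38 and $0.140.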