Scanner deep-dive
GPT-5.5 by OpenAI
General-Purpose LLM · agentic-v1 · scored on 26/26 repositories. Strict scoring (unfinished repos counted as misses).
| Metric | Value |
|---|---|
| F3 (strict) | 60.2 |
| F2 (strict) | 62.1 |
| Recall (strict) | 58.4% |
| Precision | 83.2% |
| Repos scored | 26/26 |
| Model | gpt-5.5 |
| Total cost | $66 |
| Avg latency | 153s |
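
The F2 and F3 figures above are consistent with the standard F-beta score applied to the reported precision and strict recall. The following is a minimal sketch under that assumption; the source does not state the exact formula the dashboard uses, and the function name `f_beta` is introduced here for illustration only.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score: recall is weighted beta times as heavily as precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and Recall (strict) as reported in the summary, as fractions.
precision, recall = 0.832, 0.584

f2 = f_beta(precision, recall, beta=2)
f3 = f_beta(precision, recall, beta=3)
print(f"F2 = {100 * f2:.1f}, F3 = {100 * f3:.1f}")  # F2 = 62.1, F3 = 60.2
```

Because recall (58.4%) is lower than precision (83.2%), raising beta pulls the score toward recall, which is why F3 comes out below F2.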
§
Per-repository breakdown
Each row gives the true positives (TP), false positives (FP), and misses (FN) on one repository, along with recall and F2; TP + FN is that repo's count of labeled vulnerabilities. Rows are ranked by F2, highest first.
| Repository | TP | FP | FN | Recall % | F2 |
|---|---|---|---|---|---|
| pythonssti | 2 | 0 | 0 | 100.0 | 100.0 |
| vfapi | 8 | 0 | 1 | 88.9 | 90.9 |
| intentionally-vulnerable-python-application | 6 | 0 | 1 | 85.7 | 88.2 |
| vampi | 13 | 0 | 2 | 84.5 | 86.7 |
| insecure-web | 7 | 2 | 2 | 77.8 | 77.2 |
| python-app | 15 | 1 | 5 | 73.3 | 76.4 |
| vulnpy | 59 | 15 | 19 | 75.2 | 76.0 |
| vulnerable-api | 10 | 2 | 4 | 73.8 | 75.6 |
| dsvw | 19 | 1 | 8 | 71.6 | 75.1 |
| dsvpwa | 24 | 8 | 8 | 73.4 | 73.4 |
| dvblab | 15 | 4 | 7 | 68.2 | 70.3 |
| extremely-vulnerable-flask-app | 19 | 0 | 13 | 58.3 | 63.5 |
| owasp-web-playground | 16 | 2 | 12 | 58.9 | 63.0 |
| lets-be-bad-guys | 14 | 2 | 10 | 58.3 | 62.7 |
| vulnerable-flask-app | 12 | 2 | 9 | 57.1 | 61.4 |
| damn-vulnerable-graphql-application | 20 | 4 | 16 | 56.5 | 60.2 |
| dvpwa | 13 | 6 | 9 | 57.6 | 59.3 |
| threatbyte | 14 | 4 | 12 | 53.8 | 57.5 |
| pygoat | 40 | 8 | 37 | 51.9 | 56.2 |
| vulnerable-python-apps | 11 | 0 | 11 | 50.0 | 55.5 |
| damn-vulnerable-flask-application | 7 | 1 | 8 | 48.9 | 53.6 |
| vulnerable-tornado-app | 7 | 3 | 7 | 47.6 | 50.5 |
| djangoat | 23 | 12 | 27 | 46.7 | 49.5 |
| python-insecure-app | 3 | 1 | 5 | 41.7 | 45.3 |
| vulpy | 20 | 4 | 37 | 35.1 | 39.5 |
| flask-xss | 10 | 0 | 20 | 32.2 | 37.3 |
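
For rows where the counts are exact, the F2 column can be reproduced directly from TP, FP, and FN. The sketch below assumes the standard count form of F2, `5*TP / (5*TP + 4*FN + FP)`; the helper name `f2_from_counts` is hypothetical and not taken from the source.

```python
def f2_from_counts(tp: int, fp: int, fn: int) -> float:
    """F2 as a percentage from raw counts: 5*TP / (5*TP + 4*FN + FP)."""
    denom = 5 * tp + 4 * fn + fp
    return 100.0 * 5 * tp / denom if denom else 0.0


# (TP, FP, FN) copied from the first three rows of the table.
rows = {
    "pythonssti": (2, 0, 0),
    "vfapi": (8, 0, 1),
    "intentionally-vulnerable-python-application": (6, 0, 1),
}

for repo, (tp, fp, fn) in rows.items():
    print(f"{repo}: F2 = {f2_from_counts(tp, fp, fn):.1f}")
```

These three rows come out at 100.0, 90.9, and 88.2, matching the table. Some other rows (vampi, for example) do not reproduce exactly from the displayed counts. That would be consistent with the scores being averaged over multiple runs per repository, but the source does not state how runs are aggregated, so treat that explanation as an assumption.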
§
Detection by severity
| Severity | TP | FP | FN | Recall % |
|---|---|---|---|---|
| Critical | 78 | 0 | 8 | 90.7 |
| High | 162 | 0 | 102 | 61.4 |
| Medium | 151 | 0 | 128 | 54.1 |
| Low | 23 | 0 | 45 | 33.8 |
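
The Recall % column in the severity table follows from TP and FN alone. A short sketch, assuming recall is computed as `TP / (TP + FN)` per severity level (the FP column is zero throughout, presumably because false positives are not attributed to a labeled severity; the source does not confirm this):

```python
def recall_pct(tp: int, fn: int) -> float:
    """Recall as a percentage: share of labeled vulnerabilities that were found."""
    total = tp + fn
    return 100.0 * tp / total if total else 0.0


# (TP, FN) per severity level, copied from the table.
severity_counts = {
    "Critical": (78, 8),
    "High": (162, 102),
    "Medium": (151, 128),
    "Low": (23, 45),
}

recalls = {sev: round(recall_pct(tp, fn), 1) for sev, (tp, fn) in severity_counts.items()}
print(recalls)  # {'Critical': 90.7, 'High': 61.4, 'Medium': 54.1, 'Low': 33.8}
```

Recall falls steadily from Critical to Low, so the misses are concentrated in the lower-severity findings.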
§
Detection by vulnerability class
| CWE family | TP | FP | FN | Recall % |
|---|---|---|---|---|
| Open Redirect | 6 | 0 | 0 | 100.0 |
| HTTP Header Injection | 2 | 0 | 0 | 100.0 |
| XPath Injection | 4 | 0 | 0 | 100.0 |
| Denial of Service | 19 | 0 | 1 | 95.0 |
| Insecure Deserialization | 17 | 0 | 2 | 89.5 |
| Command / OS Injection | 15 | 0 | 2 | 88.2 |
| XML External Entities | 7 | 0 | 1 | 87.5 |
| Code Injection / RFI | 12 | 0 | 2 | 85.7 |
| Path Traversal | 22 | 0 | 4 | 84.6 |
| SQL Injection | 39 | 0 | 8 | 83.0 |
| Hardcoded Credentials | 47 | 0 | 14 | 77.0 |
| Broken Access Control / IDOR | 17 | 0 | 7 | 70.8 |
| Cross-Site Scripting | 51 | 0 | 31 | 62.2 |
| Server-Side Request Forgery | 13 | 0 | 11 | 54.2 |
| Missing Authentication / Authorization | 23 | 0 | 24 | 48.9 |
| Security Misconfiguration | 15 | 0 | 18 | 45.5 |
| Other | 84 | 0 | 122 | 40.8 |
| Sensitive Data Exposure | 21 | 0 | 36 | 36.8 |
§
LLM operational metrics
| Metric | Value |
|---|---|
| Avg input tokens | 71,890 |
| Avg output tokens | 8,588 |
| Avg total tokens | 326,992 |
| Avg latency / repo | 153s |
| JSON repair rate | 0.0% |
| Total runs | 78 |
| F2 run-to-run σ | ±15.9 |
§
Cost
| Metric | Value |
|---|---|
| Total cost | $66 |
| Cost / run | $0.87 |
| Cost / 100 LOC | $0.331 |
| Python LOC scanned | 20,062 |
| Successful runs | 76 |
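
The per-run and per-LOC cost figures can be approximately re-derived from the totals. The sketch below assumes that cost per run divides by successful runs (76) rather than total runs (78), since that is the choice that reproduces $0.87; the source does not say which denominator is used. The variable names are illustrative.

```python
total_cost_usd = 66.0   # Total cost as displayed (rounded to whole dollars)
successful_runs = 76    # Successful runs
total_runs = 78         # Total runs, from the operational metrics
python_loc = 20_062     # Python LOC scanned

cost_per_successful_run = total_cost_usd / successful_runs
cost_per_any_run = total_cost_usd / total_runs
cost_per_100_loc = total_cost_usd / python_loc * 100

print(f"per successful run: ${cost_per_successful_run:.2f}")  # $0.87
print(f"per run (all 78):   ${cost_per_any_run:.2f}")         # $0.85
print(f"per 100 LOC:        ${cost_per_100_loc:.3f}")         # $0.329
```

The per-100-LOC result ($0.329) is slightly below the reported $0.331. A gap of that size is what you would expect if the dashboard computes from an unrounded total a little above $66, though that is an inference rather than something the source states.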