Scanner deep-dive
Claude Opus 4.8 by Anthropic
General-Purpose LLM · agentic-v1 · scored on 26/26 repositories. Strict scoring (unfinished repos counted as misses).
| Metric | Value |
|---|---|
| F3 (strict) | 53.6 |
| F2 (strict) | 55.7 |
| Recall (strict) | 51.6% |
| Precision | 80.7% |
| Repos scored | 26/26 |
| Model | claude-opus-4-8 |
| Total cost | $36 |
| Avg latency | 80s |
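
For readers unfamiliar with the F2 and F3 scores, the sketch below shows the standard F-beta formula, which weights recall beta times as heavily as precision. It assumes the scanner scores use this standard definition, which the page does not state; `f_beta` is a hypothetical helper name. Plugging in the headline precision (80.7%) and strict recall (51.6%) gives roughly 55.6 for F2 and 53.5 for F3, close to but not identical to the reported 55.7 and 53.6. The small gap may come from rounding or from averaging per run, which is also an assumption.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score; beta > 1 favours recall over precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Headline values from the summary, expressed as fractions.
precision = 0.807
recall = 0.516

f2 = f_beta(precision, recall, beta=2)
f3 = f_beta(precision, recall, beta=3)
print(f"F2 = {100 * f2:.1f}, F3 = {100 * f3:.1f}")
```

Because recall is well below precision here, raising beta from 2 to 3 pulls the score further toward recall, which is why F3 comes out lower than F2.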
§
Per-repository breakdown
Each row shows true positives (TP), false positives (FP), and misses (FN) on one repository. Rows are ranked by F2, highest first.
| Repository | TP | FP | FN | Recall % | F2 |
|---|---|---|---|---|---|
| vfapi | 8 | 2 | 1 | 85.2 | 83.4 |
| insecure-web | 7 | 3 | 2 | 77.8 | 76.7 |
| vulnerable-api | 10 | 0 | 4 | 71.4 | 75.3 |
| dsvw | 19 | 2 | 8 | 71.6 | 74.9 |
| intentionally-vulnerable-python-application | 5 | 2 | 2 | 71.4 | 72.1 |
| vulnerable-tornado-app | 10 | 2 | 4 | 69.0 | 71.8 |
| dvblab | 15 | 2 | 7 | 66.7 | 70.1 |
| dsvpwa | 21 | 3 | 11 | 66.7 | 70.0 |
| damn-vulnerable-flask-application | 10 | 2 | 5 | 64.4 | 67.8 |
| pythonssti | 1 | 0 | 1 | 66.7 | 67.4 |
| vampi | 9 | 3 | 6 | 62.2 | 64.8 |
| vulnerable-python-apps | 14 | 3 | 8 | 61.4 | 64.6 |
| vulnpy | 47 | 6 | 31 | 59.8 | 63.9 |
| lets-be-bad-guys | 13 | 3 | 11 | 55.5 | 59.3 |
| dvpwa | 12 | 2 | 10 | 54.5 | 58.6 |
| python-insecure-app | 4 | 1 | 4 | 54.2 | 58.5 |
| vulnerable-flask-app | 12 | 4 | 9 | 55.5 | 58.5 |
| owasp-web-playground | 16 | 7 | 12 | 56.0 | 58.0 |
| threatbyte | 13 | 3 | 13 | 50.0 | 54.1 |
| flask-xss | 14 | 2 | 16 | 45.6 | 50.1 |
| pygoat | 35 | 9 | 42 | 45.0 | 49.3 |
| extremely-vulnerable-flask-app | 14 | 1 | 18 | 42.7 | 48.0 |
| python-app | 9 | 8 | 11 | 46.7 | 47.6 |
| djangoat | 17 | 7 | 33 | 33.3 | 37.2 |
| damn-vulnerable-graphql-application | 12 | 6 | 24 | 32.4 | 36.0 |
| vulpy | 13 | 3 | 44 | 22.8 | 26.6 |
§
Detection by severity
| Severity | TP | FP | FN | Recall % |
|---|---|---|---|---|
| Critical | 76 | 1 | 10 | 88.4 |
| High | 135 | 2 | 129 | 51.1 |
| Medium | 112 | 0 | 167 | 40.1 |
| Low | 21 | 0 | 47 | 30.9 |
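
The Recall % column in the severity table is consistent with recall computed as TP / (TP + FN). The following minimal sketch reproduces it from the TP and FN counts above; `recall_pct` and `severity_counts` are illustrative names, not part of the scanner.

```python
def recall_pct(tp: int, fn: int) -> float:
    """Recall as a percentage: share of labeled vulnerabilities that were found."""
    total = tp + fn
    return 100.0 * tp / total if total else 0.0


# (TP, FN) pairs copied from the severity table.
severity_counts = {
    "Critical": (76, 10),
    "High": (135, 129),
    "Medium": (112, 167),
    "Low": (21, 47),
}

recalls = {name: round(recall_pct(tp, fn), 1) for name, (tp, fn) in severity_counts.items()}
for name, value in recalls.items():
    print(f"{name}: {value}")
```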
§
Detection by vulnerability class
| CWE family | TP | FP | FN | Recall % |
|---|---|---|---|---|
| Code Injection / RFI | 14 | 0 | 0 | 100.0 |
| Open Redirect | 6 | 0 | 0 | 100.0 |
| HTTP Header Injection | 2 | 0 | 0 | 100.0 |
| XPath Injection | 4 | 0 | 0 | 100.0 |
| Insecure Deserialization | 17 | 0 | 2 | 89.5 |
| SQL Injection | 42 | 0 | 5 | 89.4 |
| XML External Entities | 7 | 1 | 1 | 87.5 |
| Command / OS Injection | 14 | 0 | 3 | 82.4 |
| Path Traversal | 19 | 1 | 7 | 73.1 |
| Broken Access Control / IDOR | 16 | 0 | 8 | 66.7 |
| Hardcoded Credentials | 36 | 0 | 25 | 59.0 |
| Server-Side Request Forgery | 13 | 0 | 11 | 54.2 |
| Cross-Site Scripting | 44 | 0 | 38 | 53.7 |
| Security Misconfiguration | 15 | 0 | 18 | 45.5 |
| Missing Authentication / Authorization | 16 | 0 | 31 | 34.0 |
| Other | 63 | 1 | 143 | 30.6 |
| Sensitive Data Exposure | 14 | 0 | 43 | 24.6 |
| Denial of Service | 2 | 0 | 18 | 10.0 |
§
LLM operational metrics
| Metric | Value |
|---|---|
| Avg input tokens | 15 |
| Avg output tokens | 6,837 |
| Avg total tokens | 257,687 |
| Avg latency / repo | 80s |
| JSON repair rate | 0.0% |
| Total runs | 77 |
| F2 run-to-run σ | ±13.7 |
§
Cost
| Metric | Value |
|---|---|
| Total cost | $36 |
| Cost / run | $0.46 |
| Cost / 100 LOC | $0.178 |
| Python LOC scanned | 20,062 |
| Successful runs | 77 |
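
The per-run and per-100-LOC figures can be approximated from the totals. The sketch below uses the rounded $36 total, so it lands slightly above the displayed $0.46 and $0.178; the likely explanation, stated here as an assumption, is that the dashboard divides an unrounded total somewhat below $36. Variable names are illustrative.

```python
total_cost_usd = 36.0   # rounded total as displayed
successful_runs = 77
python_loc = 20_062

# Average cost of one scan run.
cost_per_run = total_cost_usd / successful_runs

# Cost normalised to 100 lines of scanned Python.
cost_per_100_loc = total_cost_usd / python_loc * 100

print(f"cost/run = ${cost_per_run:.3f}, cost/100 LOC = ${cost_per_100_loc:.3f}")
```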