Scanner deep-dive
Claude Opus 4.7 by Anthropic
General-Purpose LLM · agentic-v1 · scored on 25/26 repositories. Strict scoring (unfinished repos counted as misses).
| Metric | Value |
|---|---|
| F3 (strict) | 47.5 |
| F2 (strict) | 49.4 |
| Recall (strict) | 45.8% |
| Precision | 71.5% |
| Repos scored | 25/26 |
| Model | claude-opus-4-7 |
| Total cost | $32 |
| Avg latency | 76s |
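The headline F2 and F3 figures are consistent with the standard F-beta score, which weights recall β² times as heavily as precision. The page does not state its exact formula, so the sketch below assumes the textbook definition; `f_beta` is an illustrative name, not part of the benchmark's tooling.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Textbook F-beta: weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Headline precision and strict recall from the summary, as fractions.
precision, recall = 0.715, 0.458

# Scale to the 0-100 range used on this page.
f2 = 100 * f_beta(precision, recall, beta=2)
f3 = 100 * f_beta(precision, recall, beta=3)
print(f"F2 = {f2:.2f}, F3 = {f3:.2f}")
```

With the rounded inputs shown here, both results land within about 0.1 of the reported 49.4 and 47.5. A small gap is expected because the published precision and recall are themselves rounded.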
Per-repository breakdown
Each row shows true positives (TP), false positives (FP), and misses (FN) on one repository. Rows are ranked by F2, highest first.
| Repository | TP | FP | FN | Recall % | F2 |
|---|---|---|---|---|---|
| pythonssti | 2 | 0 | 0 | 100.0 | 100.0 |
| vfapi | 8 | 4 | 1 | 88.9 | 83.9 |
| intentionally-vulnerable-python-application | 6 | 1 | 1 | 81.0 | 81.6 |
| vulnerable-tornado-app | 10 | 2 | 4 | 73.8 | 75.7 |
| dsvw | 19 | 3 | 8 | 70.4 | 73.3 |
| insecure-web | 7 | 3 | 2 | 74.1 | 73.0 |
| python-app | 14 | 6 | 6 | 72.5 | 72.5 |
| vulnerable-api | 10 | 2 | 4 | 69.0 | 71.9 |
| vulnerable-python-apps | 15 | 5 | 7 | 69.7 | 70.2 |
| vampi | 10 | 2 | 5 | 66.7 | 69.1 |
| dvblab | 14 | 3 | 8 | 62.1 | 65.0 |
| dsvpwa | 20 | 8 | 12 | 62.5 | 63.9 |
| damn-vulnerable-flask-application | 9 | 5 | 6 | 62.2 | 62.5 |
| python-insecure-app | 4 | 1 | 4 | 54.2 | 58.5 |
| owasp-web-playground | 16 | 10 | 12 | 55.4 | 56.2 |
| threatbyte | 13 | 3 | 13 | 51.3 | 55.2 |
| pygoat | 40 | 23 | 36 | 52.6 | 54.3 |
| vulnerable-flask-app | 11 | 5 | 10 | 50.8 | 53.5 |
| lets-be-bad-guys | 12 | 4 | 12 | 48.6 | 52.1 |
| extremely-vulnerable-flask-app | 14 | 4 | 18 | 43.8 | 47.9 |
| djangoat | 20 | 13 | 30 | 39.3 | 42.3 |
| damn-vulnerable-graphql-application | 13 | 6 | 23 | 36.1 | 39.8 |
| dvpwa | 8 | 2 | 14 | 34.8 | 39.4 |
| vulpy | 17 | 9 | 40 | 29.8 | 33.5 |
| flask-xss | 7 | 3 | 23 | 23.3 | 25.4 |
Detection by severity
| Severity | TP | FP | FN | Recall % |
|---|---|---|---|---|
| Critical | 69 | 0 | 7 | 90.8 |
| High | 134 | 1 | 103 | 56.5 |
| Medium | 109 | 0 | 139 | 44.0 |
| Low | 17 | 0 | 41 | 29.3 |
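The Recall % column in the severity table can be reproduced from the TP and FN counts as TP / (TP + FN). A minimal sketch, assuming that definition; the names `severity_counts` and `recall_pct` are illustrative.

```python
# (TP, FN) per severity, copied from the table above.
severity_counts = {
    "Critical": (69, 7),
    "High": (134, 103),
    "Medium": (109, 139),
    "Low": (17, 41),
}


def recall_pct(tp: int, fn: int) -> float:
    """Recall as a percentage: share of labeled vulnerabilities that were found."""
    total = tp + fn
    return 100.0 * tp / total if total else 0.0


recalls = {
    name: round(recall_pct(tp, fn), 1)
    for name, (tp, fn) in severity_counts.items()
}
print(recalls)
```

Rounded to one decimal place, the four results match the table's Recall % column.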
Detection by vulnerability class
| CWE family | TP | FP | FN | Recall % |
|---|---|---|---|---|
| Code Injection / RFI | 11 | 0 | 0 | 100.0 |
| HTTP Header Injection | 2 | 0 | 0 | 100.0 |
| XPath Injection | 1 | 0 | 0 | 100.0 |
| SQL Injection | 43 | 0 | 1 | 97.7 |
| Insecure Deserialization | 14 | 0 | 1 | 93.3 |
| Command / OS Injection | 13 | 0 | 1 | 92.9 |
| Path Traversal | 15 | 0 | 3 | 83.3 |
| Open Redirect | 5 | 0 | 1 | 83.3 |
| Server-Side Request Forgery | 9 | 0 | 2 | 81.8 |
| XML External Entities | 4 | 1 | 1 | 80.0 |
| Broken Access Control / IDOR | 18 | 0 | 6 | 75.0 |
| Hardcoded Credentials | 43 | 0 | 18 | 70.5 |
| Security Misconfiguration | 18 | 0 | 13 | 58.1 |
| Missing Authentication / Authorization | 21 | 0 | 26 | 44.7 |
| Cross-Site Scripting | 26 | 0 | 44 | 37.1 |
| Other | 69 | 0 | 129 | 34.8 |
| Sensitive Data Exposure | 16 | 0 | 41 | 28.1 |
| Denial of Service | 1 | 0 | 3 | 25.0 |
LLM operational metrics
| Metric | Value |
|---|---|
| Avg input tokens | 14 |
| Avg output tokens | 5,440 |
| Avg total tokens | 287,918 |
| Avg latency / repo | 76s |
| JSON repair rate | 0.0% |
| Total runs | 78 |
| F2 run-to-run σ | ±17.2 |
Cost
| Metric | Value |
|---|---|
| Total cost | $32 |
| Cost / run | $0.49 |
| Cost / 100 LOC | $0.184 |
| Python LOC scanned | 17,572 |
| Successful runs | 66 |
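The unit costs can be roughly reproduced from the totals. The page does not say which denominators it uses, so this sketch assumes that cost per run divides total cost by successful runs and that cost per 100 LOC divides it by the Python LOC scanned. All variable names are illustrative.

```python
# Figures from the cost summary above. The total is shown rounded to whole dollars.
total_cost_usd = 32.0
successful_runs = 66
python_loc_scanned = 17_572

# Assumption: unit costs are computed from successful runs and scanned LOC.
cost_per_run = total_cost_usd / successful_runs
cost_per_100_loc = total_cost_usd / (python_loc_scanned / 100)

print(f"cost/run ~ ${cost_per_run:.3f}, cost/100 LOC ~ ${cost_per_100_loc:.3f}")
```

Both values come out slightly below the reported $0.49 and $0.184. That is what one would expect if the true total were a little above the rounded $32, though the page itself does not confirm this.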