Scanner deep-dive
Claude Opus 4.6 by Anthropic
General-Purpose LLM · agentic-v1 · scored on 19/26 repositories. Strict scoring (unfinished repos counted as misses).
| Metric | Value |
|---|---|
| F3 (strict) | 47.2 |
| F2 (strict) | 49.4 |
| Recall (strict) | 45.1% |
| Precision | 79.9% |
| Repos scored | 19/26 |
| Model | claude-opus-4-6 |
| Total cost | $22 |
| Avg latency | 763s |
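The headline F-scores agree with the standard F-beta formula applied to the precision and strict recall reported above. The sketch below assumes that is how the two are combined (the page does not state its formula); `f_beta` is a hypothetical helper name, not part of any scanner tooling.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score: recall is weighted beta times as heavily as precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Headline figures from the summary, expressed as fractions.
PRECISION = 0.799
STRICT_RECALL = 0.451

f2 = 100 * f_beta(PRECISION, STRICT_RECALL, beta=2)
f3 = 100 * f_beta(PRECISION, STRICT_RECALL, beta=3)
print(f"F2 = {f2:.1f}, F3 = {f3:.1f}")  # F2 = 49.4, F3 = 47.2
```

Because both scores weight recall more heavily than precision, the 45.1% strict recall pulls them well below the 79.9% precision.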
§
Per-repository breakdown
Each row lists true positives (TP), false positives (FP), and misses (FN) on one repository, with recall and F2. Rows are ranked by F2.
| Repository | TP | FP | FN | Recall % | F2 |
|---|---|---|---|---|---|
| vfapi | 8 | 3 | 1 | 88.9 | 85.1 |
| python-app | 17 | 3 | 3 | 83.3 | 83.6 |
| python-insecure-app | 6 | 0 | 2 | 75.0 | 78.9 |
| lets-be-bad-guys | 18 | 0 | 6 | 75.0 | 78.7 |
| damn-vulnerable-flask-application | 12 | 4 | 3 | 77.8 | 77.0 |
| insecure-web | 7 | 4 | 2 | 77.8 | 74.5 |
| intentionally-vulnerable-python-application | 5 | 1 | 2 | 71.4 | 73.7 |
| vulnpy | 56 | 10 | 22 | 71.4 | 73.6 |
| vulnerable-api | 10 | 2 | 4 | 71.4 | 73.2 |
| vampi | 11 | 4 | 4 | 71.1 | 71.7 |
| vulnerable-flask-app | 14 | 5 | 7 | 65.1 | 66.7 |
| vulnerable-tornado-app | 9 | 4 | 5 | 64.3 | 65.7 |
| threatbyte | 16 | 6 | 10 | 60.3 | 62.5 |
| extremely-vulnerable-flask-app | 16 | 3 | 16 | 50.0 | 54.4 |
| vulpy | 26 | 2 | 31 | 45.6 | 50.7 |
| pygoat | 34 | 12 | 42 | 44.8 | 48.6 |
| djangoat | 21 | 4 | 29 | 42.0 | 46.7 |
| damn-vulnerable-graphql-application | 16 | 10 | 20 | 43.0 | 45.6 |
| flask-xss | 12 | 2 | 18 | 40.0 | 44.8 |
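Pooling the counts in the table above gives a quick cross-check against the headline numbers. This is a minimal sketch that assumes the headline precision is micro-averaged, that is, computed from summed counts (the page does not say so). The names `COUNTS`, `pooled_precision`, and `pooled_recall` are illustrative only.

```python
# (TP, FP, FN) per repository, copied from the per-repository table.
COUNTS = {
    "vfapi": (8, 3, 1),
    "python-app": (17, 3, 3),
    "python-insecure-app": (6, 0, 2),
    "lets-be-bad-guys": (18, 0, 6),
    "damn-vulnerable-flask-application": (12, 4, 3),
    "insecure-web": (7, 4, 2),
    "intentionally-vulnerable-python-application": (5, 1, 2),
    "vulnpy": (56, 10, 22),
    "vulnerable-api": (10, 2, 4),
    "vampi": (11, 4, 4),
    "vulnerable-flask-app": (14, 5, 7),
    "vulnerable-tornado-app": (9, 4, 5),
    "threatbyte": (16, 6, 10),
    "extremely-vulnerable-flask-app": (16, 3, 16),
    "vulpy": (26, 2, 31),
    "pygoat": (34, 12, 42),
    "djangoat": (21, 4, 29),
    "damn-vulnerable-graphql-application": (16, 10, 20),
    "flask-xss": (12, 2, 18),
}

# Sum each column across the 19 scored repositories.
tp = sum(c[0] for c in COUNTS.values())
fp = sum(c[1] for c in COUNTS.values())
fn = sum(c[2] for c in COUNTS.values())

pooled_precision = tp / (tp + fp)
pooled_recall = tp / (tp + fn)  # scored repositories only, not strict

print(f"TP={tp} FP={fp} FN={fn}")
print(f"pooled precision = {100 * pooled_precision:.1f}%")
print(f"pooled recall    = {100 * pooled_recall:.1f}%")
```

The pooled precision comes to 79.9%, matching the headline figure. The pooled recall over scored repositories comes to 58.0%, above the 45.1% strict recall, which is what one would expect if strict scoring also counts the labeled vulnerabilities in the 7 unscored repositories as misses. Several per-row Recall and F2 values do not reproduce exactly from the displayed counts; averaging over multiple runs would be one possible explanation, but the page does not confirm it.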
§
Detection by severity
| Severity | TP | FP | FN | Recall % |
|---|---|---|---|---|
| Critical | 57 | 0 | 8 | 87.7 |
| High | 131 | 1 | 81 | 61.8 |
| Medium | 111 | 0 | 106 | 51.2 |
| Low | 20 | 0 | 28 | 41.7 |
§
Detection by vulnerability class
| CWE family | TP | FP | FN | Recall % |
|---|---|---|---|---|
| Code Injection / RFI | 13 | 0 | 0 | 100.0 |
| SQL Injection | 29 | 0 | 0 | 100.0 |
| Insecure Deserialization | 13 | 0 | 0 | 100.0 |
| Open Redirect | 3 | 0 | 0 | 100.0 |
| HTTP Header Injection | 1 | 0 | 0 | 100.0 |
| XPath Injection | 3 | 0 | 0 | 100.0 |
| Server-Side Request Forgery | 19 | 0 | 1 | 95.0 |
| Command / OS Injection | 13 | 0 | 1 | 92.9 |
| Hardcoded Credentials | 40 | 0 | 6 | 87.0 |
| Path Traversal | 18 | 1 | 4 | 81.8 |
| Broken Access Control / IDOR | 16 | 0 | 4 | 80.0 |
| XML External Entities | 5 | 0 | 2 | 71.4 |
| Cross-Site Scripting | 42 | 0 | 24 | 63.6 |
| Missing Authentication / Authorization | 18 | 0 | 20 | 47.4 |
| Security Misconfiguration | 9 | 0 | 14 | 39.1 |
| Sensitive Data Exposure | 17 | 0 | 28 | 37.8 |
| Other | 58 | 0 | 102 | 36.2 |
| Denial of Service | 2 | 0 | 17 | 10.5 |
§
LLM operational metrics
| Metric | Value |
|---|---|
| Avg input tokens | 7 |
| Avg output tokens | 4,608 |
| Avg total tokens | 176,970 |
| Avg latency / repo | 763s |
| JSON repair rate | 0.0% |
| Total runs | 72 |
| F2 run-to-run σ | ±13.6 |
§
Cost
| Metric | Value |
|---|---|
| Total cost | $22 |
| Cost / run | $0.49 |
| Cost / 100 LOC | $0.123 |
| Python LOC scanned | 18,251 |
| Successful runs | 46 |
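The rounded cost figures can be checked against each other with simple arithmetic. The sketch below back-computes the total from the per-run and per-100-LOC rates. It assumes that cost per run is divided over the 46 successful runs rather than all 72 runs. That is an inference from the numbers, not something the page states, and the variable names are illustrative.

```python
# Figures as displayed on the page (rounded by the dashboard).
COST_PER_RUN = 0.49        # dollars
COST_PER_100_LOC = 0.123   # dollars
SUCCESSFUL_RUNS = 46
TOTAL_RUNS = 72
PYTHON_LOC = 18_251

# Total implied by each rate under the stated assumption.
implied_from_runs = COST_PER_RUN * SUCCESSFUL_RUNS
implied_from_loc = COST_PER_100_LOC * PYTHON_LOC / 100

# Alternative reading: per-run cost spread over every run, finished or not.
implied_if_all_runs = COST_PER_RUN * TOTAL_RUNS

print(f"from cost/run x successful runs: ${implied_from_runs:.2f}")
print(f"from cost/100 LOC x LOC:         ${implied_from_loc:.2f}")
print(f"from cost/run x total runs:      ${implied_if_all_runs:.2f}")
```

The first two estimates land near $22.5, in line with the displayed $22 total, while dividing over all 72 runs would imply about $35. That suggests the per-run figure covers successful runs only.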