Scanner deep-dive
Claude Sonnet 4.6 by Anthropic
General-Purpose LLM · agentic-v1 · scored on 23/26 repositories. Strict scoring (unfinished repos counted as misses).
| Metric | Value |
|---|---|
| F3 (strict) | 50.9 |
| F2 (strict) | 53.0 |
| Recall (strict) | 48.9% |
| Precision | 79.7% |
| Repos scored | 23/26 |
| Model | claude-sonnet-4-6 |
| Total cost | $17 |
| Avg latency | 367s |
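
The headline F2 and F3 scores are consistent with the standard F-beta formula applied to the reported precision and strict recall. The sketch below assumes that standard definition (the page does not state its formula), and `f_beta` is an illustrative helper name, not part of the benchmark.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall; beta > 1 weights recall more."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Headline precision and strict recall from the summary, as fractions.
precision, recall = 0.797, 0.489

f2 = f_beta(precision, recall, beta=2)
f3 = f_beta(precision, recall, beta=3)
print(f"F2 = {100 * f2:.1f}, F3 = {100 * f3:.1f}")  # F2 = 53.0, F3 = 50.9
```

Because recall (48.9%) is well below precision (79.7%), weighting recall more heavily pulls the score down, which is why F3 is lower than F2 here.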
§
Per-repository breakdown
Each row shows true positives (TP), false positives (FP), and misses (FN) on one repository. Rows are ranked by F2.
| Repository | TP | FP | FN | Recall % | F2 |
|---|---|---|---|---|---|
| vampi | 12 | 3 | 3 | 80.0 | 80.0 |
| vfapi | 8 | 4 | 1 | 85.2 | 79.8 |
| dsvw | 20 | 1 | 7 | 74.1 | 77.7 |
| python-app | 14 | 3 | 6 | 71.7 | 73.9 |
| vulnpy | 53 | 5 | 25 | 67.5 | 71.3 |
| insecure-web | 6 | 2 | 3 | 70.4 | 70.9 |
| damn-vulnerable-flask-application | 10 | 3 | 5 | 68.9 | 70.3 |
| dvblab | 15 | 4 | 7 | 68.2 | 70.1 |
| vulnerable-api | 10 | 4 | 4 | 69.0 | 69.7 |
| lets-be-bad-guys | 15 | 1 | 9 | 62.5 | 67.0 |
| vulnerable-flask-app | 12 | 3 | 9 | 58.7 | 61.8 |
| threatbyte | 15 | 2 | 11 | 56.4 | 60.9 |
| intentionally-vulnerable-python-application | 4 | 1 | 3 | 57.1 | 60.6 |
| dsvpwa | 18 | 5 | 14 | 55.2 | 58.5 |
| vulnerable-tornado-app | 8 | 5 | 6 | 57.1 | 58.0 |
| pythonssti | 1 | 0 | 1 | 50.0 | 55.6 |
| extremely-vulnerable-flask-app | 16 | 0 | 16 | 50.0 | 55.4 |
| dvpwa | 11 | 4 | 11 | 50.0 | 53.4 |
| pygoat | 33 | 11 | 44 | 42.4 | 46.5 |
| damn-vulnerable-graphql-application | 14 | 6 | 22 | 38.9 | 42.6 |
| flask-xss | 11 | 2 | 19 | 36.7 | 41.4 |
| djangoat | 17 | 9 | 33 | 34.0 | 37.7 |
| vulpy | 18 | 9 | 39 | 32.2 | 35.9 |
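
As a sanity check on the table above, recall and F2 can be recomputed from raw counts, and the headline precision can be recovered by pooling the TP and FP columns. This is a minimal sketch assuming the standard count-based definitions; it is not the benchmark's scoring code. Some rows (vfapi, for example, where 8 / (8 + 1) is 88.9% rather than the listed 85.2%) do not reproduce exactly from their counts, possibly because the published percentages are averaged across runs, which is an assumption.

```python
def recall_from_counts(tp: int, fn: int) -> float:
    return tp / (tp + fn) if (tp + fn) else 0.0

def f2_from_counts(tp: int, fp: int, fn: int) -> float:
    # F-beta with beta = 2, written in terms of counts:
    # (1 + 4) * TP / ((1 + 4) * TP + 4 * FN + FP)
    denom = 5 * tp + 4 * fn + fp
    return 5 * tp / denom if denom else 0.0

# vampi row: TP = 12, FP = 3, FN = 3
vampi_recall = recall_from_counts(12, 3)
vampi_f2 = f2_from_counts(12, 3, 3)
print(f"vampi: recall = {100 * vampi_recall:.1f}, F2 = {100 * vampi_f2:.1f}")

# (TP, FP) for all 23 scored repositories, in table order.
tp_fp = [
    (12, 3), (8, 4), (20, 1), (14, 3), (53, 5), (6, 2), (10, 3), (15, 4),
    (10, 4), (15, 1), (12, 3), (15, 2), (4, 1), (18, 5), (8, 5), (1, 0),
    (16, 0), (11, 4), (33, 11), (14, 6), (11, 2), (17, 9), (18, 9),
]
total_tp = sum(tp for tp, _ in tp_fp)
total_fp = sum(fp for _, fp in tp_fp)
pooled_precision = total_tp / (total_tp + total_fp)
print(f"pooled precision = {100 * pooled_precision:.1f}%")
```

For vampi both values come out at 80.0, matching the table, and the pooled precision (341 true positives out of 428 reported findings) rounds to the 79.7% shown in the summary.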
§
Detection by severity
| Severity | TP | FP | FN | Recall % |
|---|---|---|---|---|
| Critical | 75 | 0 | 6 | 92.6 |
| High | 141 | 0 | 100 | 58.5 |
| Medium | 111 | 1 | 146 | 43.2 |
| Low | 22 | 0 | 38 | 36.7 |
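
The Recall % column in the severity table follows directly from TP / (TP + FN). The sketch below recomputes it and also pools the four rows into a single recall figure; the dictionary and helper names are illustrative only.

```python
# (TP, FN) per severity, copied from the table above.
severity_counts = {
    "Critical": (75, 6),
    "High": (141, 100),
    "Medium": (111, 146),
    "Low": (22, 38),
}

def recall_pct(tp: int, fn: int) -> float:
    total = tp + fn
    return 100.0 * tp / total if total else 0.0

recalls = {name: round(recall_pct(tp, fn), 1)
           for name, (tp, fn) in severity_counts.items()}
print(recalls)  # {'Critical': 92.6, 'High': 58.5, 'Medium': 43.2, 'Low': 36.7}

# Pool all severities into one recall figure.
pooled_tp = sum(tp for tp, _ in severity_counts.values())
pooled_fn = sum(fn for _, fn in severity_counts.values())
pooled_recall = round(recall_pct(pooled_tp, pooled_fn), 1)
print(pooled_recall)  # 54.6
```

The pooled value (54.6%) is higher than the headline strict recall (48.9%). The page header says strict scoring counts unfinished repositories as misses; whether that accounts for the whole gap is an assumption.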
§
Detection by vulnerability class
| CWE family | TP | FP | FN | Recall % |
|---|---|---|---|---|
| SQL Injection | 39 | 0 | 0 | 100.0 |
| Insecure Deserialization | 16 | 0 | 0 | 100.0 |
| Open Redirect | 6 | 0 | 0 | 100.0 |
| HTTP Header Injection | 2 | 0 | 0 | 100.0 |
| XPath Injection | 4 | 0 | 0 | 100.0 |
| Code Injection / RFI | 13 | 0 | 1 | 92.9 |
| Command / OS Injection | 15 | 0 | 2 | 88.2 |
| XML External Entities | 7 | 0 | 1 | 87.5 |
| Path Traversal | 20 | 0 | 5 | 80.0 |
| Hardcoded Credentials | 37 | 1 | 14 | 72.5 |
| Broken Access Control / IDOR | 15 | 0 | 7 | 68.2 |
| Server-Side Request Forgery | 14 | 0 | 7 | 66.7 |
| Cross-Site Scripting | 44 | 0 | 33 | 57.1 |
| Missing Authentication / Authorization | 21 | 0 | 22 | 48.8 |
| Security Misconfiguration | 12 | 0 | 19 | 38.7 |
| Other | 67 | 0 | 125 | 34.9 |
| Sensitive Data Exposure | 14 | 0 | 37 | 27.5 |
| Denial of Service | 3 | 0 | 17 | 15.0 |
§
LLM operational metrics
| Metric | Value |
|---|---|
| Avg input tokens | 10 |
| Avg output tokens | 5,709 |
| Avg total tokens | 232,970 |
| Avg latency / repo | 367s |
| JSON repair rate | 0.0% |
| Total runs | 72 |
| F2 run-to-run σ | ±13.3 |
§
Cost
| Metric | Value |
|---|---|
| Total cost | $17 |
| Cost / run | $0.29 |
| Cost / 100 LOC | $0.083 |
| Python LOC scanned | 19,983 |
| Successful runs | 58 |
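
The derived cost figures can be approximately reproduced from the totals. The sketch below assumes that cost per run divides by successful runs (58) rather than by total runs (72); that choice is inferred from the numbers, not stated on the page.

```python
total_cost_usd = 17.0      # published total, apparently rounded to whole dollars
successful_runs = 58
total_runs = 72
python_loc = 19_983

cost_per_run = total_cost_usd / successful_runs
cost_per_100_loc = total_cost_usd / python_loc * 100
success_rate = successful_runs / total_runs

print(f"cost / run     = ${cost_per_run:.2f}")      # $0.29
print(f"cost / 100 LOC = ${cost_per_100_loc:.3f}")  # $0.085
print(f"success rate   = {100 * success_rate:.1f}%")  # 80.6%
```

Dividing by 58 matches the published $0.29 per run, whereas dividing by 72 would give about $0.24. The per-100-LOC result (about $0.085) is slightly above the published $0.083, which would be consistent with the unrounded total being a little under $17, though that is an assumption.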