Scanner deep-dive
Qwen 3.5 397B by Alibaba Qwen
General-Purpose LLM · agentic-v1 · scored on 24/26 repositories. Strict scoring (unfinished repos counted as misses).
| Metric | Value |
|---|---|
| F3 (strict) | 38.2 |
| F2 (strict) | 39.9 |
| Recall (strict) | 36.5% |
| Precision | 63.6% |
| Repos scored | 24/26 |
| Model | together_ai/Qwen/Qwen3.5-397B-A17B |
| Total cost | $3 |
| Avg latency | 77s |
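The F2 and F3 figures can be sanity-checked against the displayed precision and recall. The sketch below assumes the headline scores are standard F-beta values computed from those two aggregates; the page does not state the exact aggregation, and the helper name `f_beta` is illustrative only.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall; beta > 1 weights recall more."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Displayed (already rounded) headline values, as fractions.
precision, recall = 0.636, 0.365

f2 = f_beta(precision, recall, 2) * 100
f3 = f_beta(precision, recall, 3) * 100
print(f"F2 = {f2:.1f}, F3 = {f3:.1f}")
```

Because the inputs are rounded to one decimal place, the recomputed scores can differ from the reported ones in the last digit. Both F2 and F3 sit closer to recall than to precision, which is why the scores land near 36.5% rather than 63.6%.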
Per-repository breakdown
Each row shows true positives (TP), false positives (FP), and misses (FN) on one repository. Rows are ranked by F2.
| Repository | TP | FP | FN | Recall % | F2 |
|---|---|---|---|---|---|
| vfapi | 7 | 9 | 2 | 74.1 | 64.6 |
| dvblab | 13 | 1 | 9 | 59.1 | 63.7 |
| insecure-web | 5 | 2 | 4 | 59.3 | 61.0 |
| dsvw | 16 | 6 | 11 | 58.0 | 60.1 |
| intentionally-vulnerable-python-application | 4 | 1 | 3 | 57.1 | 60.0 |
| vulnpy | 42 | 4 | 36 | 54.5 | 59.2 |
| pythonssti | 1 | 0 | 1 | 50.0 | 55.6 |
| vulnerable-api | 7 | 1 | 7 | 50.0 | 54.4 |
| vampi | 8 | 10 | 7 | 53.3 | 51.6 |
| python-insecure-app | 4 | 1 | 4 | 45.8 | 49.6 |
| vulnerable-tornado-app | 6 | 3 | 8 | 45.2 | 48.4 |
| damn-vulnerable-flask-application | 7 | 3 | 8 | 44.4 | 47.4 |
| lets-be-bad-guys | 11 | 8 | 13 | 44.4 | 46.5 |
| pygoat | 34 | 27 | 44 | 43.5 | 45.4 |
| vulnerable-flask-app | 8 | 6 | 13 | 39.7 | 42.2 |
| dsvpwa | 11 | 4 | 21 | 35.4 | 39.6 |
| python-app | 7 | 7 | 13 | 36.7 | 38.8 |
| damn-vulnerable-graphql-application | 13 | 22 | 23 | 37.0 | 36.6 |
| threatbyte | 8 | 5 | 18 | 32.0 | 35.3 |
| dvpwa | 6 | 4 | 16 | 28.8 | 32.2 |
| flask-xss | 8 | 2 | 22 | 26.7 | 30.8 |
| djangoat | 12 | 8 | 38 | 24.7 | 28.0 |
| extremely-vulnerable-flask-app | 7 | 2 | 25 | 22.9 | 26.6 |
| vulpy | 10 | 10 | 47 | 17.0 | 19.5 |
Detection by severity
| Severity | TP | FP | FN | Recall % |
|---|---|---|---|---|
| Critical | 63 | 1 | 19 | 76.8 |
| High | 117 | 6 | 126 | 48.1 |
| Medium | 78 | 0 | 182 | 30.0 |
| Low | 5 | 0 | 57 | 8.1 |
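The Recall % column in the severity table follows directly from the TP and FN counts, as recall = TP / (TP + FN). A minimal sketch that recomputes it from the rows above (the variable and function names are illustrative, not part of the benchmark tooling):

```python
# (TP, FN) per severity level, copied from the table above.
severity_counts = {
    "Critical": (63, 19),
    "High": (117, 126),
    "Medium": (78, 182),
    "Low": (5, 57),
}


def recall_pct(tp: int, fn: int) -> float:
    """Share of labeled vulnerabilities that were found, as a percentage."""
    total = tp + fn
    return 100 * tp / total if total else 0.0


recalls = {name: round(recall_pct(tp, fn), 1) for name, (tp, fn) in severity_counts.items()}
for name, value in recalls.items():
    print(f"{name}: {value}%")
```

The recomputed values match the table, and they make the trend explicit: recall falls steadily from Critical to Low findings.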
Detection by vulnerability class
| CWE family | TP | FP | FN | Recall % |
|---|---|---|---|---|
| XML External Entities | 8 | 0 | 0 | 100.0 |
| XPath Injection | 4 | 0 | 0 | 100.0 |
| SQL Injection | 38 | 3 | 1 | 97.4 |
| Insecure Deserialization | 15 | 1 | 1 | 93.8 |
| Command / OS Injection | 15 | 1 | 2 | 88.2 |
| Code Injection / RFI | 12 | 0 | 2 | 85.7 |
| Path Traversal | 18 | 2 | 7 | 72.0 |
| Open Redirect | 4 | 0 | 2 | 66.7 |
| Broken Access Control / IDOR | 13 | 0 | 9 | 59.1 |
| Hardcoded Credentials | 29 | 0 | 24 | 54.7 |
| Server-Side Request Forgery | 12 | 0 | 10 | 54.5 |
| HTTP Header Injection | 1 | 0 | 1 | 50.0 |
| Cross-Site Scripting | 24 | 0 | 55 | 30.4 |
| Missing Authentication / Authorization | 11 | 0 | 32 | 25.6 |
| Security Misconfiguration | 7 | 0 | 25 | 21.9 |
| Other | 40 | 0 | 153 | 20.7 |
| Sensitive Data Exposure | 9 | 0 | 43 | 17.3 |
| Denial of Service | 3 | 0 | 17 | 15.0 |
LLM operational metrics
| Metric | Value |
|---|---|
| Avg input tokens | 43,965 |
| Avg output tokens | 4,943 |
| Avg total tokens | 121,929 |
| Avg latency / repo | 77s |
| JSON repair rate | 16.7% |
| Total runs | 72 |
| F2 run-to-run σ | ±12.9 |
Cost
| Metric | Value |
|---|---|
| Total cost | $3 |
| Cost / run | $0.05 |
| Cost / 100 LOC | $0.016 |
| Python LOC scanned | 20,062 |
| Successful runs | 69 |