Scanner deep-dive
GLM-5.1 by Z.ai
General-purpose LLM · agentic-v1 · scored on 25 of 26 repositories. Scoring is strict: repositories the scanner did not finish are counted as misses.
| Metric | Value |
|---|---|
| F3 (strict) | 57.1 |
| F2 (strict) | 58.6 |
| Recall (strict) | 55.7% |
| Precision | 74.1% |
| Repos scored | 25/26 |
| Model | glm-5.1 |
| Total cost | $10 |
| Avg latency | 438s |
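
The headline F2 and F3 figures are consistent with the standard F-beta formula applied to the strict recall and the precision listed above. The sketch below checks this, assuming the page uses the usual definition F_β = (1 + β²)·P·R / (β²·P + R). The function name `f_beta` is illustrative and not part of any scanner API.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values copied from the summary: precision 74.1%, strict recall 55.7%.
precision, recall = 0.741, 0.557

f2 = round(100 * f_beta(precision, recall, 2), 1)
f3 = round(100 * f_beta(precision, recall, 3), 1)
print(f2, f3)  # 58.6 57.1
```

Because recall is lower than precision here, raising β pulls the score down toward recall, which is why F3 sits below F2.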
§
Per-repository breakdown
Each row reports true positives (TP), false positives (FP), and misses (FN) on one repository. In the original chart, bar length is proportional to that repository's labeled vulnerabilities. Rows are ranked by F2.
| Repository | TP | FP | FN | Recall % | F2 |
|---|---|---|---|---|---|
| pythonssti | 2 | 0 | 0 | 100.0 | 100.0 |
| vfapi | 9 | 4 | 0 | 96.3 | 90.0 |
| damn-vulnerable-flask-application | 14 | 4 | 2 | 90.0 | 87.7 |
| python-insecure-app | 6 | 0 | 2 | 81.2 | 83.2 |
| dsvpwa | 25 | 6 | 7 | 77.1 | 77.7 |
| vulnerable-api | 11 | 4 | 3 | 78.6 | 76.9 |
| insecure-web | 7 | 3 | 2 | 77.8 | 76.7 |
| vulnerable-tornado-app | 10 | 2 | 4 | 73.8 | 75.6 |
| dsvw | 20 | 5 | 8 | 72.2 | 73.6 |
| python-app | 14 | 6 | 6 | 70.0 | 70.0 |
| dvblab | 15 | 7 | 7 | 69.7 | 69.6 |
| intentionally-vulnerable-python-application | 5 | 4 | 2 | 71.4 | 67.8 |
| vulnpy | 48 | 11 | 30 | 61.5 | 64.7 |
| vulnerable-flask-app | 13 | 4 | 8 | 61.9 | 64.1 |
| lets-be-bad-guys | 14 | 2 | 10 | 59.7 | 64.0 |
| threatbyte | 15 | 4 | 11 | 59.0 | 62.4 |
| dvpwa | 13 | 6 | 9 | 57.6 | 59.6 |
| owasp-web-playground | 16 | 8 | 12 | 57.1 | 59.0 |
| vulnerable-python-apps | 12 | 5 | 10 | 54.5 | 57.0 |
| pygoat | 40 | 14 | 36 | 52.6 | 55.8 |
| extremely-vulnerable-flask-app | 14 | 4 | 18 | 45.3 | 49.3 |
| damn-vulnerable-graphql-application | 17 | 15 | 19 | 47.2 | 48.1 |
| flask-xss | 12 | 1 | 18 | 40.0 | 45.0 |
| djangoat | 18 | 10 | 32 | 37.0 | 40.5 |
| vulpy | 19 | 7 | 38 | 33.9 | 38.0 |
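
The Recall % and F2 columns do not always match what the TP/FP/FN counts in the same row imply. For example, vfapi lists FN = 0 but 96.3% recall. One plausible reading is that the percentages are averaged over several runs while the counts come from a single aggregate. This is an assumption, since the page does not say how the columns are derived. The sketch below computes the count-based values for comparison, using the identity F_β = (1 + β²)·TP / ((1 + β²)·TP + β²·FN + FP). The helper name `scores_from_counts` is hypothetical.

```python
def scores_from_counts(tp: int, fp: int, fn: int, beta: float = 2.0) -> tuple[float, float]:
    """Return (recall %, F-beta %) computed directly from confusion counts."""
    b2 = beta * beta
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = (1 + b2) * tp + b2 * fn + fp
    f_score = (1 + b2) * tp / denom if denom else 0.0
    return round(100 * recall, 1), round(100 * f_score, 1)


# Rows copied from the table as (TP, FP, FN).
print(scores_from_counts(2, 0, 0))  # pythonssti -> (100.0, 100.0), matches the table
print(scores_from_counts(9, 4, 0))  # vfapi -> (100.0, 91.8), table shows 96.3 / 90.0
```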
§
Detection by severity
| Severity | TP | FP | FN | Recall % |
|---|---|---|---|---|
| Critical | 76 | 1 | 9 | 89.4 |
| High | 155 | 2 | 103 | 60.1 |
| Medium | 133 | 0 | 140 | 48.7 |
| Low | 24 | 0 | 42 | 36.4 |
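
In this table the Recall % column follows directly from the counts as TP / (TP + FN). A short check, with the dictionary name `severity_counts` chosen only for illustration:

```python
# (TP, FN) per severity, copied from the table above.
severity_counts = {
    "Critical": (76, 9),
    "High": (155, 103),
    "Medium": (133, 140),
    "Low": (24, 42),
}


def recall_pct(tp: int, fn: int) -> float:
    """Recall as a percentage, rounded to one decimal place."""
    return round(100 * tp / (tp + fn), 1)


recalls = {name: recall_pct(tp, fn) for name, (tp, fn) in severity_counts.items()}
print(recalls)
# {'Critical': 89.4, 'High': 60.1, 'Medium': 48.7, 'Low': 36.4}
```

The same calculation reproduces the Recall % column of the vulnerability-class table.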
§
Detection by vulnerability class
| CWE family | TP | FP | FN | Recall % |
|---|---|---|---|---|
| Code Injection / RFI | 14 | 0 | 0 | 100.0 |
| XML External Entities | 8 | 1 | 0 | 100.0 |
| HTTP Header Injection | 2 | 0 | 0 | 100.0 |
| XPath Injection | 4 | 0 | 0 | 100.0 |
| SQL Injection | 44 | 0 | 2 | 95.7 |
| Insecure Deserialization | 18 | 0 | 1 | 94.7 |
| Path Traversal | 23 | 0 | 3 | 88.5 |
| Command / OS Injection | 15 | 0 | 2 | 88.2 |
| Open Redirect | 5 | 0 | 1 | 83.3 |
| Cross-Site Scripting | 56 | 0 | 26 | 68.3 |
| Broken Access Control / IDOR | 15 | 0 | 7 | 68.2 |
| Hardcoded Credentials | 38 | 0 | 22 | 63.3 |
| Server-Side Request Forgery | 13 | 1 | 11 | 54.2 |
| Missing Authentication / Authorization | 22 | 0 | 23 | 48.9 |
| Security Misconfiguration | 16 | 0 | 17 | 48.5 |
| Other | 75 | 1 | 125 | 37.5 |
| Sensitive Data Exposure | 19 | 0 | 36 | 34.5 |
| Denial of Service | 1 | 0 | 18 | 5.3 |
§
LLM operational metrics
| Metric | Value |
|---|---|
| Avg input tokens | 32,263 |
| Avg output tokens | 9,004 |
| Avg total tokens | 128,705 |
| Avg latency / repo | 438s |
| JSON repair rate | 1.3% |
| Total runs | 78 |
| F2 run-to-run σ | ±15.6 |
§
Cost
| Metric | Value |
|---|---|
| Total cost | $10 |
| Cost / run | $0.16 |
| Cost / 100 LOC | $0.053 |
| Python LOC scanned | 19,610 |
| Successful runs | 63 |
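
The per-unit cost figures can be approximately rederived from the totals. The sketch below assumes that cost per run divides by the 63 successful runs rather than all 78 runs, and that the displayed $10 total is rounded. Both are inferences from the numbers, not statements made by the page.

```python
total_cost_usd = 10.0   # displayed total; assumed to be rounded to the dollar
successful_runs = 63
total_runs = 78
python_loc = 19_610

cost_per_successful_run = round(total_cost_usd / successful_runs, 2)
cost_per_any_run = round(total_cost_usd / total_runs, 2)
cost_per_100_loc = round(total_cost_usd / python_loc * 100, 3)

print(cost_per_successful_run)  # 0.16, matches the displayed cost per run
print(cost_per_any_run)         # 0.13, does not match
print(cost_per_100_loc)         # 0.051, slightly below the displayed 0.053
```

The small gap on cost per 100 LOC is what one would expect if the unrounded total is slightly above $10.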