Scanner deep-dive
Kimi K2.5 by Moonshot AI
General-Purpose LLM · agentic-v1 · scored on 24/26 repositories. Strict scoring (unfinished repos counted as misses).
| Metric | Value |
|---|---|
| F3 (strict) | 46.0 |
| F2 (strict) | 47.8 |
| Recall (strict) | 44.3% |
| Precision | 69.3% |
| Repos scored | 24/26 |
| Model | kimi-k2.5 |
| Total cost | $2 |
| Avg latency | 140s |
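The headline F2 and F3 figures line up with the standard F-beta formula applied to the reported precision and strict recall. The following is a minimal sketch, assuming the usual definition F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); the function name `f_beta` is illustrative and not taken from the benchmark's own code.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Headline values reported on this page, expressed as fractions.
precision = 0.693
recall_strict = 0.443

f2 = 100 * f_beta(precision, recall_strict, beta=2)
f3 = 100 * f_beta(precision, recall_strict, beta=3)
print(f"F2 = {f2:.1f}, F3 = {f3:.1f}")
```

Both results land within about 0.1 point of the displayed 47.8 and 46.0. A small gap is expected because the inputs here are already rounded, and the page does not say whether the headline scores are pooled or averaged per repository.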
Per-repository breakdown
Each row reports true positives (TP), false positives (FP), and misses (FN) on one repository. Rows are ranked by F2.
| Repository | TP | FP | FN | Recall % | F2 |
|---|---|---|---|---|---|
| vfapi | 9 | 10 | 0 | 96.3 | 79.2 |
| intentionally-vulnerable-python-application | 5 | 1 | 2 | 71.4 | 73.2 |
| vulnpy | 53 | 5 | 25 | 68.0 | 71.6 |
| vampi | 10 | 5 | 5 | 66.7 | 66.6 |
| dsvw | 17 | 3 | 10 | 61.7 | 65.5 |
| insecure-web | 6 | 4 | 3 | 66.7 | 65.1 |
| python-app | 12 | 9 | 8 | 61.7 | 61.1 |
| dvblab | 13 | 5 | 9 | 57.6 | 60.1 |
| vulnerable-flask-app | 12 | 5 | 9 | 57.1 | 59.4 |
| lets-be-bad-guys | 14 | 6 | 10 | 57.0 | 59.3 |
| dsvpwa | 17 | 4 | 15 | 52.1 | 56.1 |
| pythonssti | 1 | 0 | 1 | 50.0 | 55.6 |
| vulnerable-api | 7 | 3 | 7 | 52.4 | 55.3 |
| vulnerable-tornado-app | 7 | 3 | 7 | 52.4 | 54.9 |
| python-insecure-app | 4 | 3 | 4 | 50.0 | 51.8 |
| dvpwa | 11 | 5 | 11 | 48.5 | 51.3 |
| damn-vulnerable-flask-application | 7 | 2 | 8 | 46.7 | 50.7 |
| threatbyte | 11 | 9 | 15 | 42.3 | 44.5 |
| extremely-vulnerable-flask-app | 12 | 3 | 20 | 37.5 | 41.8 |
| flask-xss | 11 | 5 | 19 | 37.8 | 41.6 |
| pygoat | 28 | 17 | 49 | 36.4 | 39.4 |
| damn-vulnerable-graphql-application | 13 | 11 | 23 | 36.1 | 38.7 |
| djangoat | 16 | 12 | 34 | 32.0 | 35.1 |
| vulpy | 13 | 7 | 44 | 22.2 | 25.5 |
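The Recall % and F2 columns can be approximated from the TP, FP, and FN counts. The sketch below assumes recall = TP / (TP + FN) and the count form of F2, 5·TP / (5·TP + 4·FN + FP); the helper names are hypothetical. The page reports 72 runs over 24 repositories, so the counts shown may be rounded per-run averages, which would explain why recomputed values can differ slightly from the table (for example, vfapi lists FN = 0 alongside 96.3% recall).

```python
def recall_pct(tp: int, fn: int) -> float:
    """Recall as a percentage: share of labeled vulnerabilities that were found."""
    total = tp + fn
    return 100.0 * tp / total if total else 0.0


def f2_pct(tp: int, fp: int, fn: int) -> float:
    """F2 as a percentage, written directly in terms of counts."""
    denom = 5 * tp + 4 * fn + fp
    return 100.0 * 5 * tp / denom if denom else 0.0


# (TP, FP, FN) copied from three rows of the table above.
rows = {
    "vulnpy": (53, 5, 25),
    "dsvpwa": (17, 4, 15),
    "vulpy": (13, 7, 44),
}

for name, (tp, fp, fn) in rows.items():
    print(f"{name}: recall {recall_pct(tp, fn):.1f}%, F2 {f2_pct(tp, fp, fn):.1f}")
```

For these three rows the recomputed values fall within roughly one point of the table's figures.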
Detection by severity
| Severity | TP | FP | FN | Recall % |
|---|---|---|---|---|
| Critical | 69 | 0 | 13 | 84.1 |
| High | 125 | 3 | 118 | 51.4 |
| Medium | 94 | 2 | 166 | 36.2 |
| Low | 17 | 0 | 45 | 27.4 |
Detection by vulnerability class
| CWE family | TP | FP | FN | Recall % |
|---|---|---|---|---|
| XPath Injection | 4 | 0 | 0 | 100.0 |
| SQL Injection | 38 | 0 | 1 | 97.4 |
| Insecure Deserialization | 15 | 0 | 1 | 93.8 |
| Command / OS Injection | 15 | 1 | 2 | 88.2 |
| XML External Entities | 7 | 0 | 1 | 87.5 |
| Code Injection / RFI | 12 | 0 | 2 | 85.7 |
| Path Traversal | 20 | 2 | 5 | 80.0 |
| Server-Side Request Forgery | 16 | 0 | 6 | 72.7 |
| Open Redirect | 4 | 0 | 2 | 66.7 |
| Cross-Site Scripting | 45 | 2 | 34 | 57.0 |
| Broken Access Control / IDOR | 11 | 0 | 11 | 50.0 |
| HTTP Header Injection | 1 | 0 | 1 | 50.0 |
| Hardcoded Credentials | 26 | 0 | 27 | 49.1 |
| Missing Authentication / Authorization | 17 | 0 | 26 | 39.5 |
| Other | 53 | 0 | 140 | 27.5 |
| Security Misconfiguration | 8 | 0 | 24 | 25.0 |
| Sensitive Data Exposure | 11 | 0 | 41 | 21.2 |
| Denial of Service | 2 | 0 | 18 | 10.0 |
LLM operational metrics
| Metric | Value |
|---|---|
| Avg input tokens | 20,086 |
| Avg output tokens | 5,029 |
| Avg total tokens | 131,193 |
| Avg latency / repo | 140s |
| JSON repair rate | 0.0% |
| Total runs | 72 |
| F2 run-to-run σ | ±13.0 |
Cost
| Metric | Value |
|---|---|
| Total cost | $2 |
| Cost / run | $0.03 |
| Cost / 100 LOC | $0.011 |
| Python LOC scanned | 20,062 |
| Successful runs | 72 |
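The unit costs can be roughly cross-checked from the total cost, the run count, and the lines of code scanned. This is a back-of-the-envelope sketch only: the total is displayed rounded to $2, so the exact figure is unknown, and the variable names are illustrative.

```python
total_cost_usd = 2.0   # displayed value; the unrounded total is not shown on the page
runs = 72
python_loc = 20_062

cost_per_run = total_cost_usd / runs
cost_per_100_loc = total_cost_usd / (python_loc / 100)

print(f"cost/run ~ ${cost_per_run:.3f}, cost/100 LOC ~ ${cost_per_100_loc:.4f}")
```

The per-run figure rounds to the displayed $0.03. The per-100-LOC figure comes out near $0.010 rather than $0.011, which would be consistent with an unrounded total slightly above $2; that is an inference, not something the page states.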