Scanner deep-dive

Qwen 3.5 397B by Alibaba Qwen

General-Purpose LLM · agentic-v1 · scored on 24/26 repositories. Strict scoring (unfinished repos counted as misses).

38.2

F3 (strict)

39.9

F2 (strict)

36.5%

Recall (strict)

63.6%

Precision

24/26

Repos scored

together_ai/Qwen/Qwen3.5-397B-A17B

Model

Total cost

77s

Avg latency
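The headline F2 and F3 scores can be related to the headline precision and recall through the standard F-beta formula. The sketch below assumes the page derives both scores from the pooled precision (63.6%) and strict recall (36.5%) shown above; the page does not state this, and the function name `f_beta` is illustrative only.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Headline values from the page, already rounded to one decimal place.
precision = 0.636
recall = 0.365  # strict recall (unfinished repos counted as misses)

f2 = 100 * f_beta(precision, recall, beta=2)
f3 = 100 * f_beta(precision, recall, beta=3)
print(f"F2 = {f2:.1f}, F3 = {f3:.1f}")
```

Because the inputs are rounded, the results can differ from the reported 39.9 and 38.2 by about 0.1. F3 comes out lower than F2 here because recall is the weaker of the two inputs and a larger beta weights it more heavily.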

Per-repository breakdown

Each bar shows true positives, false positives, and misses on one repository; bar length is proportional to that repo's labeled vulnerabilities. Ranked by F2.

Legend: True positive, False positive, Missed (FN)

[Bar chart: one bar per repository, each labeled with that repository's rounded F2 and recall. The table below lists the same data in full.]

Repository	TP	FP	FN	Recall %	F2
vfapi	7	9	2	74.1	64.6
dvblab	13	1	9	59.1	63.7
insecure-web	5	2	4	59.3	61.0
dsvw	16	6	11	58.0	60.1
intentionally-vulnerable-python-application	4	1	3	57.1	60.0
vulnpy	42	4	36	54.5	59.2
pythonssti	1	0	1	50.0	55.6
vulnerable-api	7	1	7	50.0	54.4
vampi	8	10	7	53.3	51.6
python-insecure-app	4	1	4	45.8	49.6
vulnerable-tornado-app	6	3	8	45.2	48.4
damn-vulnerable-flask-application	7	3	8	44.4	47.4
lets-be-bad-guys	11	8	13	44.4	46.5
pygoat	34	27	44	43.5	45.4
vulnerable-flask-app	8	6	13	39.7	42.2
dsvpwa	11	4	21	35.4	39.6
python-app	7	7	13	36.7	38.8
damn-vulnerable-graphql-application	13	22	23	37.0	36.6
threatbyte	8	5	18	32.0	35.3
dvpwa	6	4	16	28.8	32.2
flask-xss	8	2	22	26.7	30.8
djangoat	12	8	38	24.7	28.0
extremely-vulnerable-flask-app	7	2	25	22.9	26.6
vulpy	10	10	47	17.0	19.5
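As a cross-check, the per-repository counts can be pooled to see how they relate to the headline figures. This is a sketch under the assumption that the headline precision is micro-averaged (total TP over total TP + FP) and that the TP/FP/FN columns can be summed directly; neither is stated on the page.

```python
# Repository, TP, FP, FN copied from the table above.
ROWS = """\
vfapi 7 9 2
dvblab 13 1 9
insecure-web 5 2 4
dsvw 16 6 11
intentionally-vulnerable-python-application 4 1 3
vulnpy 42 4 36
pythonssti 1 0 1
vulnerable-api 7 1 7
vampi 8 10 7
python-insecure-app 4 1 4
vulnerable-tornado-app 6 3 8
damn-vulnerable-flask-application 7 3 8
lets-be-bad-guys 11 8 13
pygoat 34 27 44
vulnerable-flask-app 8 6 13
dsvpwa 11 4 21
python-app 7 7 13
damn-vulnerable-graphql-application 13 22 23
threatbyte 8 5 18
dvpwa 6 4 16
flask-xss 8 2 22
djangoat 12 8 38
extremely-vulnerable-flask-app 7 2 25
vulpy 10 10 47
"""

tp = fp = fn = 0
for line in ROWS.splitlines():
    _name, row_tp, row_fp, row_fn = line.split()
    tp += int(row_tp)
    fp += int(row_fp)
    fn += int(row_fn)

pooled_precision = tp / (tp + fp)
pooled_recall = tp / (tp + fn)  # covers the 24 scored repos only
print(f"TP={tp} FP={fp} FN={fn}")
print(f"pooled precision = {100 * pooled_precision:.1f}%")
print(f"pooled recall    = {100 * pooled_recall:.1f}%")
```

Under these assumptions the pooled precision agrees with the headline 63.6%. The pooled recall over the 24 scored repositories is somewhat higher than the strict 36.5%, which is consistent with strict scoring counting the two unfinished repositories as misses.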

Detection by severity

Severity	TP	FP	FN	Recall %
Critical	63	1	19	76.8
High	117	6	126	48.1
Medium	78	0	182	30.0
Low	5	0	57	8.1
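The Recall % column in the severity table follows directly from the TP and FN counts as TP / (TP + FN). A minimal sketch, with the counts copied from the table and the variable names chosen for illustration:

```python
# severity -> (TP, FN), copied from the table above
counts = {
    "Critical": (63, 19),
    "High": (117, 126),
    "Medium": (78, 182),
    "Low": (5, 57),
}

# Recall is the share of labeled vulnerabilities that were found.
recall_pct = {
    severity: round(100 * tp / (tp + fn), 1)
    for severity, (tp, fn) in counts.items()
}

for severity, pct in recall_pct.items():
    print(f"{severity:<8} {pct:5.1f}%")
```

The same calculation applies to the per-class table that follows.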

Detection by vulnerability class

CWE family	TP	FP	FN	Recall %
XML External Entities	8	0	0	100.0
XPath Injection	4	0	0	100.0
SQL Injection	38	3	1	97.4
Insecure Deserialization	15	1	1	93.8
Command / OS Injection	15	1	2	88.2
Code Injection / RFI	12	0	2	85.7
Path Traversal	18	2	7	72.0
Open Redirect	4	0	2	66.7
Broken Access Control / IDOR	13	0	9	59.1
Hardcoded Credentials	29	0	24	54.7
Server-Side Request Forgery	12	0	10	54.5
HTTP Header Injection	1	0	1	50.0
Cross-Site Scripting	24	0	55	30.4
Missing Authentication / Authorization	11	0	32	25.6
Security Misconfiguration	7	0	25	21.9
Other	40	0	153	20.7
Sensitive Data Exposure	9	0	43	17.3
Denial of Service	3	0	17	15.0

LLM operational metrics

43,965

Avg input tokens

4,943

Avg output tokens

121,929

Avg total tokens

77s

Avg latency / repo

16.7%

JSON repair rate

Total runs

±12.9

F2 run-to-run σ

Cost

Total cost

$0.05

Cost / run

$0.016

Cost / 100 LOC

20,062

Python LOC scanned

Successful runs
