Scanner deep-dive

Claude Haiku 4.5 by Anthropic

General-Purpose LLM · direct-v1 · scored on 23/26 repositories. Strict scoring (unfinished repos counted as misses).

Metric	Value
F3 (strict)	25.4
F2 (strict)	26.8
Recall (strict)	24.1%
Precision	48.7%
Repos scored	23/26
Model	claude-haiku-4-5-20251001
Total cost	(not shown)
Avg latency	19s
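
The F2 and F3 headline scores can be checked against the reported precision and strict recall. The sketch below assumes the leaderboard uses the standard F-beta definition, which the page does not state; the function name `f_beta` is illustrative.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score; recall is weighted beta times as much as precision."""
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Headline precision and strict recall from the summary, as fractions.
precision, recall = 0.487, 0.241

f2 = round(100 * f_beta(precision, recall, beta=2), 1)
f3 = round(100 * f_beta(precision, recall, beta=3), 1)
print(f2, f3)
```

Under that assumption the result is 26.8 and 25.4, matching the F2 and F3 figures shown. Both scores sit close to recall rather than precision because higher beta values weight recall more heavily.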

Per-repository breakdown

Each bar shows true positives, false positives, and misses on one repository; bar length is proportional to that repo's labeled vulnerabilities. Ranked by F2.

[Bar chart omitted. Legend: true positive, false positive, missed (FN). Each bar is labeled with the repository's F2 and recall; the same values appear in the table below.]

Repository	TP	FP	FN	Recall %	F2
intentionally-vulnerable-python-application	4	1	3	61.9	65.7
insecure-web	6	3	3	63.0	63.4
vulnerable-api	8	2	6	59.5	62.8
python-insecure-app	4	0	4	54.2	59.0
pythonssti	1	1	1	50.0	51.9
vulnpy	35	7	43	44.4	49.0
damn-vulnerable-flask-application	7	5	8	46.7	48.2
flask-xss	12	4	18	41.1	45.1
vampi	6	7	9	42.2	43.2
dvblab	8	10	14	36.4	37.6
lets-be-bad-guys	7	10	17	30.6	32.5
vulnerable-flask-app	6	10	15	30.2	31.6
vulnerable-tornado-app	4	4	10	28.6	31.1
dsvpwa	8	8	24	26.0	29.1
vulpy	14	10	43	24.0	27.1
dsvw	7	10	20	24.7	26.8
extremely-vulnerable-flask-app	7	5	25	22.9	26.1
threatbyte	6	13	20	21.8	23.1
dvpwa	4	8	18	19.7	21.7
vfapi	2	15	7	18.5	16.1
damn-vulnerable-graphql-application	3	15	33	8.3	9.2
djangoat	4	14	46	8.0	9.1
pygoat	5	15	72	6.1	7.1

Detection by severity

Severity	TP	FP	FN	Recall %
Critical	28	1	50	35.9
High	63	2	172	26.8
Medium	56	0	197	22.1
Low	6	0	55	9.8
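
The Recall % column in the severity table follows directly from the TP and FN counts as TP / (TP + FN). A minimal sketch, with the counts copied from the table and an illustrative helper name `recall_pct`:

```python
def recall_pct(tp: int, fn: int) -> float:
    """Recall as a percentage, rounded to one decimal place."""
    total = tp + fn
    return round(100 * tp / total, 1) if total else 0.0


# (TP, FN) per severity, copied from the table above.
severity_counts = {
    "Critical": (28, 50),
    "High": (63, 172),
    "Medium": (56, 197),
    "Low": (6, 55),
}

recalls = {name: recall_pct(tp, fn) for name, (tp, fn) in severity_counts.items()}
for name, value in recalls.items():
    print(f"{name}\t{value}")
```

False positives do not enter the calculation, which is why the FP column has no effect on these recall values.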

Detection by vulnerability class

CWE family	TP	FP	FN	Recall %
HTTP Header Injection	2	0	0	100.0
XPath Injection	3	0	1	75.0
SQL Injection	20	2	18	52.6
Path Traversal	12	0	11	52.2
XML External Entities	3	0	4	42.9
Cross-Site Scripting	31	0	46	40.3
Insecure Deserialization	5	0	9	35.7
Hardcoded Credentials	18	0	33	35.3
Command / OS Injection	5	0	11	31.2
Code Injection / RFI	4	0	10	28.6
Broken Access Control / IDOR	5	0	17	22.7
Other	32	1	155	17.1
Open Redirect	1	0	5	16.7
Security Misconfiguration	5	0	27	15.6
Server-Side Request Forgery	2	0	20	9.1
Missing Authentication / Authorization	3	0	39	7.1
Sensitive Data Exposure	2	0	48	4.0
Denial of Service	0	0	20	0.0

LLM operational metrics

Metric	Value
Avg input tokens	54,965
Avg output tokens	3,312
Avg total tokens	58,278
Avg latency / repo	19s
JSON repair rate	0.0%
Total runs	(not shown)
F2 run-to-run σ	±17.8

Cost

Metric	Value
Total cost	(not shown)
Cost / run	$0.07
Cost / 100 LOC	$0.025
Python LOC scanned	19,723
Successful runs	(not shown)
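
The Cost / run figure can be related to the average token counts with a simple per-token estimate. The page does not state the pricing it used, so the two per-million-token prices below are assumptions for illustration only, and `cost_per_run` is a hypothetical helper.

```python
def cost_per_run(input_tokens: int, output_tokens: int,
                 usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Estimated cost in USD of one run from token counts and per-million-token prices."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


# ASSUMED prices in USD per million tokens; not taken from the page.
ASSUMED_USD_PER_M_INPUT = 1.00
ASSUMED_USD_PER_M_OUTPUT = 5.00

# Average token counts from the operational metrics above.
estimate = cost_per_run(54_965, 3_312,
                        ASSUMED_USD_PER_M_INPUT, ASSUMED_USD_PER_M_OUTPUT)
print(f"${estimate:.2f}")
```

With these assumed prices the estimate rounds to $0.07, in line with the Cost / run figure. Input tokens account for most of the total because the average run reads far more tokens than it writes.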
