Scanner deep-dive

Claude Haiku 4.5 by Anthropic

General-Purpose LLM · agentic-v1 · scored on 24/26 repositories. Strict scoring (unfinished repos counted as misses).
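As a rough illustration of what strict scoring implies for recall, the sketch below assumes that every labeled vulnerability in an unfinished repository is added to the false-negative count. That reading of "counted as misses" is an assumption, the function names are hypothetical, and the numbers are toy values rather than benchmark data.

```python
def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN); defined as 0.0 when there is nothing to find."""
    total = tp + fn
    return tp / total if total else 0.0


def strict_recall(tp: int, fn: int, unfinished_labeled: int) -> float:
    """Recall when labeled vulnerabilities in unfinished repos count as misses.

    Assumption: each labeled vulnerability in an unfinished repository is
    treated as one extra false negative.
    """
    return recall(tp, fn + unfinished_labeled)


# Hypothetical toy numbers, not taken from the benchmark.
lenient = recall(tp=6, fn=3)
strict = strict_recall(tp=6, fn=3, unfinished_labeled=3)
print(f"lenient={lenient:.1%} strict={strict:.1%}")
```

Under this assumption strict recall can only be equal to or lower than recall over finished repositories, since the numerator stays fixed while the denominator grows.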

Metric	Value
F3 (strict)	36.4
F2 (strict)	38.6
Recall (strict)	34.4%
Precision	75.2%
Repos scored	24/26
Model	claude-haiku-4-5-20251001
Total cost	(not shown)
Avg latency	56s
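The F2 and F3 figures above are consistent with the standard F-beta formula applied to the reported precision and strict recall. A minimal sketch, assuming the page uses the usual definition F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); the helper name `f_beta` is illustrative:

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily than precision.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


precision = 0.752  # Precision reported on this page
recall = 0.344     # Recall (strict) reported on this page

f2 = f_beta(precision, recall, beta=2)
f3 = f_beta(precision, recall, beta=3)
print(f"F2={100 * f2:.1f} F3={100 * f3:.1f}")
```

Because recall is well below precision here, raising beta from 2 to 3 pulls the score further toward recall, which is why F3 is lower than F2.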

Per-repository breakdown

Each bar shows true positives, false positives, and misses on one repository; bar length is proportional to that repo's labeled vulnerabilities. Ranked by F2.

Legend: true positive, false positive, missed (FN). Bar chart omitted; the per-repository F2 and recall values are listed in the table below.

Repository	TP	FP	FN	Recall %	F2
insecure-web	6	0	3	66.7	70.9
intentionally-vulnerable-python-application	4	1	3	61.9	64.2
vfapi	5	2	4	55.6	57.7
lets-be-bad-guys	13	2	11	52.8	57.2
damn-vulnerable-flask-application	8	4	7	53.3	55.3
dvblab	12	6	10	53.0	55.3
dsvw	13	0	14	49.4	54.8
vulnerable-tornado-app	7	2	7	50.0	53.6
pythonssti	1	1	1	50.0	51.9
python-app	10	4	10	48.3	51.7
vampi	7	4	8	48.9	51.3
python-insecure-app	4	1	4	45.8	50.6
vulnpy	37	7	41	47.0	50.1
vulnerable-api	6	1	8	45.2	49.9
vulnerable-flask-app	10	3	11	46.0	49.8
dsvpwa	12	1	20	37.5	42.6
dvpwa	7	1	15	31.8	36.3
threatbyte	8	5	18	29.5	33.0
pygoat	22	10	55	29.0	32.8
extremely-vulnerable-flask-app	9	3	23	28.1	32.2
flask-xss	8	2	22	26.7	30.7
damn-vulnerable-graphql-application	9	11	27	25.9	28.2
djangoat	12	6	38	24.7	28.2
vulpy	10	2	47	17.5	20.8

Detection by severity

Severity	TP	FP	FN	Recall %
Critical	62	0	20	75.6
High	104	1	139	42.8
Medium	68	3	192	26.2
Low	15	0	47	24.2
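The Recall % column in the severity table can be reproduced from the TP and FN counts. A short sketch, assuming recall is TP / (TP + FN) rounded to one decimal place; the variable names are illustrative:

```python
# (TP, FN) per severity, copied from the table above.
severity_counts = {
    "Critical": (62, 20),
    "High": (104, 139),
    "Medium": (68, 192),
    "Low": (15, 47),
}

# Recall per severity, as a percentage rounded to one decimal place.
recall_pct = {
    severity: round(100 * tp / (tp + fn), 1)
    for severity, (tp, fn) in severity_counts.items()
}

# Recall pooled over all four severity rows.
total_tp = sum(tp for tp, _ in severity_counts.values())
total_fn = sum(fn for _, fn in severity_counts.values())
pooled_pct = round(100 * total_tp / (total_tp + total_fn), 1)

for severity, pct in recall_pct.items():
    print(f"{severity}: {pct}%")
print(f"Pooled: {pooled_pct}%")
```

The same calculation applies to the vulnerability-class table, which uses the same TP, FN, and Recall % columns.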

Detection by vulnerability class

CWE family	TP	FP	FN	Recall %
XML External Entities	8	0	0	100.0
XPath Injection	4	0	0	100.0
SQL Injection	38	0	1	97.4
Insecure Deserialization	15	0	1	93.8
Open Redirect	5	0	1	83.3
Path Traversal	20	1	5	80.0
Code Injection / RFI	11	0	3	78.6
Command / OS Injection	13	0	4	76.5
Server-Side Request Forgery	12	0	10	54.5
Broken Access Control / IDOR	11	0	11	50.0
HTTP Header Injection	1	0	1	50.0
Hardcoded Credentials	20	1	33	37.7
Cross-Site Scripting	26	2	53	32.9
Security Misconfiguration	8	0	24	25.0
Other	42	0	151	21.8
Denial of Service	4	0	16	20.0
Missing Authentication / Authorization	7	0	36	16.3
Sensitive Data Exposure	4	0	48	7.7

LLM operational metrics

Metric	Value
Avg input tokens	(not shown)
Avg output tokens	4,888
Avg total tokens	243,089
Avg latency / repo	56s
JSON repair rate	0.0%
Total runs	(not shown)
F2 run-to-run σ	±12.9

Cost

Metric	Value
Total cost	(not shown)
Cost / run	$0.07
Cost / 100 LOC	$0.026
Python LOC scanned	20,062
Successful runs	(not shown)
