Scanner deep-dive

Claude Opus 4.6 by Anthropic

General-Purpose LLM · agentic-v1 · scored on 19/26 repositories. Strict scoring: the 7 unfinished repositories are counted as misses.

Metric	Value
F3 (strict)	47.2
F2 (strict)	49.4
Recall (strict)	45.1%
Precision	79.9%
Repos scored	19/26
Model	claude-opus-4-6
Total cost	$22
Avg latency	763s
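
The page does not state how F2 and F3 are derived. Assuming they are the standard F-beta scores of the headline precision and strict recall, a minimal sketch reproduces both values (`f_beta` is an illustrative helper, not part of the scanner):

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score; recall counts beta times as much as precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Headline Precision and Recall (strict) from this page, as fractions.
precision, recall = 0.799, 0.451

f2 = 100 * f_beta(precision, recall, beta=2)
f3 = 100 * f_beta(precision, recall, beta=3)
print(f"F2 = {f2:.1f}, F3 = {f3:.1f}")  # F2 = 49.4, F3 = 47.2
```

Because recall (45.1%) is much lower than precision (79.9%), raising beta from 2 to 3 pulls the score further toward recall, which is why F3 sits below F2.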

Per-repository breakdown

Each bar shows true positives, false positives, and misses on one repository; bar length is proportional to that repo's labeled vulnerabilities. Ranked by F2.

[Bar chart: one bar per repository, labeled with F2 and recall. Legend: True positive, False positive, Missed (FN).]

Repository	TP	FP	FN	Recall %	F2
vfapi	8	3	1	88.9	85.1
python-app	17	3	3	83.3	83.6
python-insecure-app	6	0	2	75.0	78.9
lets-be-bad-guys	18	0	6	75.0	78.7
damn-vulnerable-flask-application	12	4	3	77.8	77.0
insecure-web	7	4	2	77.8	74.5
intentionally-vulnerable-python-application	5	1	2	71.4	73.7
vulnpy	56	10	22	71.4	73.6
vulnerable-api	10	2	4	71.4	73.2
vampi	11	4	4	71.1	71.7
vulnerable-flask-app	14	5	7	65.1	66.7
vulnerable-tornado-app	9	4	5	64.3	65.7
threatbyte	16	6	10	60.3	62.5
extremely-vulnerable-flask-app	16	3	16	50.0	54.4
vulpy	26	2	31	45.6	50.7
pygoat	34	12	42	44.8	48.6
djangoat	21	4	29	42.0	46.7
damn-vulnerable-graphql-application	16	10	20	43.0	45.6
flask-xss	12	2	18	40.0	44.8
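
As a rough check on how the Recall % and F2 columns relate to the counts, the sketch below computes both directly from TP, FP, and FN for two rows. The count-based F2 formula is the standard one; whether the page computes its columns this way is an assumption, and `recall_and_f2` is a hypothetical helper.

```python
def recall_and_f2(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (recall %, F2) computed directly from confusion counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = 5 * tp + 4 * fn + fp  # F2 = 5*TP / (5*TP + 4*FN + FP)
    f2 = 5 * tp / denom if denom else 0.0
    return 100 * recall, 100 * f2


# (TP, FP, FN) copied from the table above.
rows = {"vfapi": (8, 3, 1), "python-insecure-app": (6, 0, 2)}

for repo, counts in rows.items():
    r, f2 = recall_and_f2(*counts)
    print(f"{repo}: recall {r:.1f}%, F2 {f2:.1f}")
# vfapi: recall 88.9%, F2 85.1
# python-insecure-app: recall 75.0%, F2 78.9
```

Not every row reproduces exactly from its counts (python-app, for instance, would give 85.0 for both columns), so the published figures may be aggregated differently, perhaps averaged across runs. That explanation is a guess.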

Detection by severity

Severity	TP	FP	FN	Recall %
Critical	57	0	8	87.7
High	131	1	81	61.8
Medium	111	0	106	51.2
Low	20	0	28	41.7
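
Pooling the severity rows gives an overall recall that can be compared with the strict headline figure. This sketch assumes the severity table covers only the scored repositories, which the page does not say explicitly.

```python
# severity: (TP, FN), copied from the table above.
severity = {
    "Critical": (57, 8),
    "High": (131, 81),
    "Medium": (111, 106),
    "Low": (20, 28),
}

total_tp = sum(tp for tp, _ in severity.values())
total_fn = sum(fn for _, fn in severity.values())
pooled_recall = 100 * total_tp / (total_tp + total_fn)

print(total_tp, total_fn, f"{pooled_recall:.1f}%")  # 319 223 58.9%
```

The pooled value (58.9%) is well above the strict recall of 45.1%, which is consistent with strict scoring counting unfinished repositories as misses.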

Detection by vulnerability class

CWE family	TP	FP	FN	Recall %
Code Injection / RFI	13	0	0	100.0
SQL Injection	29	0	0	100.0
Insecure Deserialization	13	0	0	100.0
Open Redirect	3	0	0	100.0
HTTP Header Injection	1	0	0	100.0
XPath Injection	3	0	0	100.0
Server-Side Request Forgery	19	0	1	95.0
Command / OS Injection	13	0	1	92.9
Hardcoded Credentials	40	0	6	87.0
Path Traversal	18	1	4	81.8
Broken Access Control / IDOR	16	0	4	80.0
XML External Entities	5	0	2	71.4
Cross-Site Scripting	42	0	24	63.6
Missing Authentication / Authorization	18	0	20	47.4
Security Misconfiguration	9	0	14	39.1
Sensitive Data Exposure	17	0	28	37.8
Other	58	0	102	36.2
Denial of Service	2	0	17	10.5
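
To see where the misses concentrate, one option is to rank the families by their share of all false negatives. The sketch below only rearranges the FN column above; the variable names are illustrative.

```python
# FN per CWE family, copied from the table above.
fn_by_family = {
    "Code Injection / RFI": 0, "SQL Injection": 0,
    "Insecure Deserialization": 0, "Open Redirect": 0,
    "HTTP Header Injection": 0, "XPath Injection": 0,
    "Server-Side Request Forgery": 1, "Command / OS Injection": 1,
    "Hardcoded Credentials": 6, "Path Traversal": 4,
    "Broken Access Control / IDOR": 4, "XML External Entities": 2,
    "Cross-Site Scripting": 24,
    "Missing Authentication / Authorization": 20,
    "Security Misconfiguration": 14, "Sensitive Data Exposure": 28,
    "Other": 102, "Denial of Service": 17,
}

total_fn = sum(fn_by_family.values())  # 223
ranked = sorted(fn_by_family.items(), key=lambda kv: kv[1], reverse=True)

for family, fn in ranked[:3]:
    print(f"{family}: {100 * fn / total_fn:.1f}% of misses")
# Other: 45.7% of misses
# Sensitive Data Exposure: 12.6% of misses
# Cross-Site Scripting: 10.8% of misses
```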

LLM operational metrics

Metric	Value
Avg input tokens	(not shown)
Avg output tokens	4,608
Avg total tokens	176,970
Avg latency / repo	763s
JSON repair rate	0.0%
Total runs	(not shown)
F2 run-to-run σ	±13.6

Cost

Metric	Value
Total cost	$22
Cost / run	$0.49
Cost / 100 LOC	$0.123
Python LOC scanned	18,251
Successful runs	(not shown)
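
The cost per 100 LOC can be approximated from the other figures, assuming it is simply total cost divided by Python LOC scanned (the page does not define it):

```python
total_cost_usd = 22.0  # displayed total cost, apparently rounded to whole dollars
python_loc = 18_251    # Python LOC scanned

cost_per_100_loc = total_cost_usd / python_loc * 100
print(f"${cost_per_100_loc:.3f} per 100 LOC")  # $0.121 per 100 LOC
```

This lands close to the displayed $0.123. The small gap is probably due to the rounded $22 total, though that is an assumption.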
