Scanner deep-dive

Claude Opus 4.8 by Anthropic

General-Purpose LLM · agentic-v1 · scored on 26/26 repositories. Strict scoring (unfinished repos counted as misses).

Metric	Value
F3 (strict)	53.6
F2 (strict)	55.7
Recall (strict)	51.6%
Precision	80.7%
Repos scored	26/26
Model	claude-opus-4-8
Total cost	$36
Avg latency	80s
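The F2 and F3 figures weight recall more heavily than precision. The page does not spell out its formula, so the following is a minimal sketch assuming the standard F-beta definition; the function name `f_beta` is illustrative. Plugging in the rounded headline precision (80.7%) and recall (51.6%) gives roughly 55.6 and 53.5, within about 0.1 point of the reported 55.7 and 53.6; the small gap is plausibly due to rounding of the inputs.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score: recall is weighted beta times as much as precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Headline precision and recall from the summary, expressed as fractions.
precision, recall = 0.807, 0.516

f2 = f_beta(precision, recall, beta=2)
f3 = f_beta(precision, recall, beta=3)
print(f"F2 = {100 * f2:.1f}, F3 = {100 * f3:.1f}")
```

Because beta is greater than 1 in both cases, the low recall pulls the scores well below the 80.7% precision, and F3 sits below F2.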

Per-repository breakdown

Each bar shows true positives, false positives, and misses on one repository; bar length is proportional to that repo's labeled vulnerabilities. Ranked by F2.

[Bar chart: one bar per repository. Legend: True positive, False positive, Missed (FN). Each bar is labeled with the repository's F2 and recall; the same values appear in the table below.]

Repository	TP	FP	FN	Recall %	F2
vfapi	8	2	1	85.2	83.4
insecure-web	7	3	2	77.8	76.7
vulnerable-api	10	0	4	71.4	75.3
dsvw	19	2	8	71.6	74.9
intentionally-vulnerable-python-application	5	2	2	71.4	72.1
vulnerable-tornado-app	10	2	4	69.0	71.8
dvblab	15	2	7	66.7	70.1
dsvpwa	21	3	11	66.7	70.0
damn-vulnerable-flask-application	10	2	5	64.4	67.8
pythonssti	1	0	1	66.7	67.4
vampi	9	3	6	62.2	64.8
vulnerable-python-apps	14	3	8	61.4	64.6
vulnpy	47	6	31	59.8	63.9
lets-be-bad-guys	13	3	11	55.5	59.3
dvpwa	12	2	10	54.5	58.6
python-insecure-app	4	1	4	54.2	58.5
vulnerable-flask-app	12	4	9	55.5	58.5
owasp-web-playground	16	7	12	56.0	58.0
threatbyte	13	3	13	50.0	54.1
flask-xss	14	2	16	45.6	50.1
pygoat	35	9	42	45.0	49.3
extremely-vulnerable-flask-app	14	1	18	42.7	48.0
python-app	9	8	11	46.7	47.6
djangoat	17	7	33	33.3	37.2
damn-vulnerable-graphql-application	12	6	24	32.4	36.0
vulpy	13	3	44	22.8	26.6
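As a cross-check, the per-repository counts can be pooled into overall precision and recall. This is a sketch, assuming the headline figures are micro-averaged (summed counts across all repositories), which the page does not state explicitly. With the counts as printed, the pooled values come out at 80.7% precision and 51.6% recall, matching the summary. The per-row Recall % and F2 columns are left as printed and are not recomputed here.

```python
# (TP, FP, FN) per repository, copied from the table in row order.
rows = [
    (8, 2, 1), (7, 3, 2), (10, 0, 4), (19, 2, 8), (5, 2, 2), (10, 2, 4),
    (15, 2, 7), (21, 3, 11), (10, 2, 5), (1, 0, 1), (9, 3, 6), (14, 3, 8),
    (47, 6, 31), (13, 3, 11), (12, 2, 10), (4, 1, 4), (12, 4, 9), (16, 7, 12),
    (13, 3, 13), (14, 2, 16), (35, 9, 42), (14, 1, 18), (9, 8, 11),
    (17, 7, 33), (12, 6, 24), (13, 3, 44),
]

# Pool the counts across all repositories (micro-averaging).
tp = sum(r[0] for r in rows)
fp = sum(r[1] for r in rows)
fn = sum(r[2] for r in rows)

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"TP={tp} FP={fp} FN={fn}")
print(f"precision = {100 * precision:.1f}%, recall = {100 * recall:.1f}%")
```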

Detection by severity

Severity	TP	FP	FN	Recall %
Critical	76	1	10	88.4
High	135	2	129	51.1
Medium	112	0	167	40.1
Low	21	0	47	30.9
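The Recall % column here follows directly from the counts as TP / (TP + FN). A short sketch that reproduces it (the helper name `recall_pct` is illustrative):

```python
# (TP, FN) per severity level, copied from the table.
severity = {
    "Critical": (76, 10),
    "High": (135, 129),
    "Medium": (112, 167),
    "Low": (21, 47),
}


def recall_pct(tp: int, fn: int) -> float:
    """Recall as a percentage, rounded to one decimal place."""
    return round(100 * tp / (tp + fn), 1)


recalls = {name: recall_pct(tp, fn) for name, (tp, fn) in severity.items()}
for name, value in recalls.items():
    print(f"{name}: {value}%")
```

Recall falls steadily with severity, from 88.4% on Critical findings to 30.9% on Low.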

Detection by vulnerability class

CWE family	TP	FP	FN	Recall %
Code Injection / RFI	14	0	0	100.0
Open Redirect	6	0	0	100.0
HTTP Header Injection	2	0	0	100.0
XPath Injection	4	0	0	100.0
Insecure Deserialization	17	0	2	89.5
SQL Injection	42	0	5	89.4
XML External Entities	7	1	1	87.5
Command / OS Injection	14	0	3	82.4
Path Traversal	19	1	7	73.1
Broken Access Control / IDOR	16	0	8	66.7
Hardcoded Credentials	36	0	25	59.0
Server-Side Request Forgery	13	0	11	54.2
Cross-Site Scripting	44	0	38	53.7
Security Misconfiguration	15	0	18	45.5
Missing Authentication / Authorization	16	0	31	34.0
Other	63	1	143	30.6
Sensitive Data Exposure	14	0	43	24.6
Denial of Service	2	0	18	10.0

LLM operational metrics

Metric	Value
Avg input tokens	(value missing in source)
Avg output tokens	6,837
Avg total tokens	257,687
Avg latency / repo	80s
JSON repair rate	0.0%
Total runs	(value missing in source)
F2 run-to-run σ	±13.7

Cost

Metric	Value
Total cost	$36
Cost / run	$0.46
Cost / 100 LOC	$0.178
Python LOC scanned	20,062
Successful runs	(value missing in source)
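The cost figures can be sanity-checked against each other. This is a back-of-envelope sketch, assuming cost per 100 LOC is total cost divided by lines scanned and cost per run is total cost divided by the number of runs; the page does not state either definition. The run count it prints is an estimate derived from rounded figures, not a value reported on the page.

```python
total_cost_usd = 36.0       # "Total cost", shown rounded to whole dollars
cost_per_run_usd = 0.46     # "Cost / run"
python_loc = 20_062         # "Python LOC scanned"

# Assumed definition: total cost spread over every 100 lines scanned.
cost_per_100_loc = total_cost_usd / python_loc * 100

# Assumed definition: total cost divided evenly across runs.
implied_runs = total_cost_usd / cost_per_run_usd

print(f"cost / 100 LOC ~ ${cost_per_100_loc:.3f}")
print(f"implied runs ~ {implied_runs:.0f}")
```

The computed cost per 100 LOC lands near $0.179, close to the reported $0.178; the difference is consistent with the total having been rounded to $36.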
