Scanner deep-dive

Claude Sonnet 4.6 by Anthropic

General-Purpose LLM · agentic-v1 · scored on 23/26 repositories. Strict scoring (unfinished repos counted as misses).
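A minimal sketch of what strict scoring implies for recall, assuming that every labeled vulnerability in an unfinished repository is simply added to the miss count. The page does not publish the exact formula, so the function name, its parameters, and the numbers below are all hypothetical.

```python
def strict_recall(tp: int, fn: int, unscored_labeled: int) -> float:
    """Recall when labeled findings in unfinished repos count as misses.

    Assumption (not confirmed by the source): strict scoring adds the
    labeled vulnerabilities of each unfinished repository to the
    false-negative count.
    """
    total = tp + fn + unscored_labeled
    return tp / total if total else 0.0


# Hypothetical counts, not taken from the leaderboard.
lenient = strict_recall(tp=60, fn=40, unscored_labeled=0)
strict = strict_recall(tp=60, fn=40, unscored_labeled=20)
print(f"lenient {lenient:.1%}, strict {strict:.1%}")
```

Under this assumption, strict recall can only be equal to or lower than recall over the scored repositories alone.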

Metric	Value
F3 (strict)	50.9
F2 (strict)	53.0
Recall (strict)	48.9%
Precision	79.7%
Repos scored	23/26
Model	claude-sonnet-4-6
Total cost	$17
Avg latency	367s
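The headline F2 and F3 figures are consistent with the standard F-beta formula applied to the reported precision (79.7%) and strict recall (48.9%). The sketch below assumes that is how they were derived; the helper name `f_beta` is illustrative.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score; beta > 1 weights recall more than precision."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


precision, recall = 0.797, 0.489  # headline precision and strict recall

print(round(100 * f_beta(precision, recall, 2), 1))  # 53.0
print(round(100 * f_beta(precision, recall, 3), 1))  # 50.9
```

Because recall is well below precision here, raising beta from 2 to 3 pulls the score further toward recall, which is why F3 is lower than F2.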

Per-repository breakdown

Each bar shows true positives, false positives, and misses on one repository; bar length is proportional to that repo's labeled vulnerabilities. Ranked by F2.

Legend: true positive, false positive, missed (FN). The chart labels each repository with its F2 score and recall; both values appear in the table below.

Repository	TP	FP	FN	Recall %	F2
vampi	12	3	3	80.0	80.0
vfapi	8	4	1	85.2	79.8
dsvw	20	1	7	74.1	77.7
python-app	14	3	6	71.7	73.9
vulnpy	53	5	25	67.5	71.3
insecure-web	6	2	3	70.4	70.9
damn-vulnerable-flask-application	10	3	5	68.9	70.3
dvblab	15	4	7	68.2	70.1
vulnerable-api	10	4	4	69.0	69.7
lets-be-bad-guys	15	1	9	62.5	67.0
vulnerable-flask-app	12	3	9	58.7	61.8
threatbyte	15	2	11	56.4	60.9
intentionally-vulnerable-python-application	4	1	3	57.1	60.6
dsvpwa	18	5	14	55.2	58.5
vulnerable-tornado-app	8	5	6	57.1	58.0
pythonssti	1	0	1	50.0	55.6
extremely-vulnerable-flask-app	16	0	16	50.0	55.4
dvpwa	11	4	11	50.0	53.4
pygoat	33	11	44	42.4	46.5
damn-vulnerable-graphql-application	14	6	22	38.9	42.6
flask-xss	11	2	19	36.7	41.4
djangoat	17	9	33	34.0	37.7
vulpy	18	9	39	32.2	35.9
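F2 can also be written directly in terms of the counts in this table, as 5·TP / (5·TP + 4·FN + FP). The sketch below applies that form to two rows. It reproduces the vampi figure exactly, while the vfapi figure differs from the reported one; a plausible explanation, which is an assumption, is that the reported scores are averaged over several runs while the counts shown are rounded.

```python
def f2_from_counts(tp: int, fp: int, fn: int) -> float:
    """F2 in count form: 5*TP / (5*TP + 4*FN + FP)."""
    denom = 5 * tp + 4 * fn + fp
    return 5 * tp / denom if denom else 0.0


# (TP, FP, FN) copied from the table rows.
print(round(100 * f2_from_counts(12, 3, 3), 1))  # vampi: 80.0, matches the table
print(round(100 * f2_from_counts(8, 4, 1), 1))   # vfapi: 83.3, table reports 79.8
```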

Detection by severity

Severity	TP	FP	FN	Recall %
Critical	75	0	6	92.6
High	141	0	100	58.5
Medium	111	1	146	43.2
Low	22	0	38	36.7
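The recall column in the severity table follows from TP / (TP + FN). A short check, with the counts copied from the table and an illustrative helper name:

```python
# (TP, FN) per severity, copied from the table above.
severity_counts = {
    "Critical": (75, 6),
    "High": (141, 100),
    "Medium": (111, 146),
    "Low": (22, 38),
}


def recall_pct(tp: int, fn: int) -> float:
    """Recall as a percentage, rounded to one decimal place."""
    return round(100 * tp / (tp + fn), 1)


for name, (tp, fn) in severity_counts.items():
    print(name, recall_pct(tp, fn))  # 92.6, 58.5, 43.2, 36.7

# Recall pooled over all four severities.
total_tp = sum(tp for tp, _ in severity_counts.values())
total_fn = sum(fn for _, fn in severity_counts.values())
print("Pooled", recall_pct(total_tp, total_fn))  # 54.6
```

The pooled figure is higher than the strict headline recall of 48.9%, which would be expected if this table covers only the scored repositories; that reading is an assumption.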

Detection by vulnerability class

CWE family	TP	FP	FN	Recall %
SQL Injection	39	0	0	100.0
Insecure Deserialization	16	0	0	100.0
Open Redirect	6	0	0	100.0
HTTP Header Injection	2	0	0	100.0
XPath Injection	4	0	0	100.0
Code Injection / RFI	13	0	1	92.9
Command / OS Injection	15	0	2	88.2
XML External Entities	7	0	1	87.5
Path Traversal	20	0	5	80.0
Hardcoded Credentials	37	1	14	72.5
Broken Access Control / IDOR	15	0	7	68.2
Server-Side Request Forgery	14	0	7	66.7
Cross-Site Scripting	44	0	33	57.1
Missing Authentication / Authorization	21	0	22	48.8
Security Misconfiguration	12	0	19	38.7
Other	67	0	125	34.9
Sensitive Data Exposure	14	0	37	27.5
Denial of Service	3	0	17	15.0
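To see where most misses come from, the classes can be ranked by FN count rather than by recall. The sketch below uses the TP and FN columns of the table above; variable names are illustrative.

```python
# (TP, FN) per CWE family, copied from the table above.
class_counts = {
    "SQL Injection": (39, 0),
    "Insecure Deserialization": (16, 0),
    "Open Redirect": (6, 0),
    "HTTP Header Injection": (2, 0),
    "XPath Injection": (4, 0),
    "Code Injection / RFI": (13, 1),
    "Command / OS Injection": (15, 2),
    "XML External Entities": (7, 1),
    "Path Traversal": (20, 5),
    "Hardcoded Credentials": (37, 14),
    "Broken Access Control / IDOR": (15, 7),
    "Server-Side Request Forgery": (14, 7),
    "Cross-Site Scripting": (44, 33),
    "Missing Authentication / Authorization": (21, 22),
    "Security Misconfiguration": (12, 19),
    "Other": (67, 125),
    "Sensitive Data Exposure": (14, 37),
    "Denial of Service": (3, 17),
}

# Rank classes by number of missed findings, largest first.
ranked = sorted(class_counts.items(), key=lambda kv: kv[1][1], reverse=True)
total_fn = sum(fn for _, fn in class_counts.values())

top3 = [(name, fn) for name, (_, fn) in ranked[:3]]
share = sum(fn for _, fn in top3) / total_fn

print(top3)            # Other (125), Sensitive Data Exposure (37), Cross-Site Scripting (33)
print(f"{share:.1%}")  # 67.2%
```

Ranking by misses rather than recall keeps a small class such as Denial of Service (15.0% recall, 17 misses) in proportion to larger ones.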

LLM operational metrics

Metric	Value
Avg input tokens	(missing)
Avg output tokens	5,709
Avg total tokens	232,970
Avg latency / repo	367s
JSON repair rate	0.0%
Total runs	(missing)
F2 run-to-run σ	±13.3

Cost

Metric	Value
Total cost	$17
Cost / run	$0.29
Cost / 100 LOC	$0.083
Python LOC scanned	19,983
Successful runs	(missing)
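The cost figures can be cross-checked, assuming cost per 100 LOC is total cost divided by LOC scanned and multiplied by 100. The page does not state the formula, and both helper names below are illustrative.

```python
def cost_per_100_loc(total_cost_usd: float, loc: int) -> float:
    """Cost per 100 lines of code (assumed definition)."""
    return total_cost_usd / (loc / 100)


def implied_total_cost(cost_per_100: float, loc: int) -> float:
    """Total cost implied by a per-100-LOC rate (inverse of the above)."""
    return cost_per_100 * loc / 100


loc_scanned = 19_983

print(round(cost_per_100_loc(17.0, loc_scanned), 3))     # 0.085
print(round(implied_total_cost(0.083, loc_scanned), 2))  # 16.59
```

The rounded $17 total gives about $0.085 per 100 LOC, slightly above the reported $0.083; the reported rate corresponds to an unrounded total near $16.59, which still displays as $17.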
