Accuracy Benchmarks

Per-modality evaluation results for our ensemble models. All benchmarks use held-out test splits; none of the test data was used for training.

Last updated: July 2026 · v4.4.0 · 14-layer image engine

Real-world accuracy varies with generator novelty, content type, and obfuscation. Treat the figures below as upper bounds measured on curated data.

Image: 98% · AUC 0.98 · F1 0.965 · 14 layers · L12-BDIS: 100% recall
Text: 94% · AUC 0.94 · F1 0.925 · 3-model · GPT-4, Claude, Gemini
Audio: 92% · AUC 0.95 · F1 0.925 · ensemble · ElevenLabs, RVC, Bark
Video: 90% · AUC 0.93 · F1 0.905 · frame-level · Sora, Kling, Runway

Text

PAN25, PERSUADE 2.0, M4

Model	AUC-ROC	Precision	Recall	F1	FPR
RoBERTa-base-openai-detector	0.93	91.0%	90.0%	0.905	8.0%
Binoculars (perplexity/crossperplexity)	0.91	89.0%	92.0%	0.905	9.0%
Gemini 2.0 Flash (ensemble head)	0.90	88.0%	89.0%	0.885	10.0%
Ensemble (all combined)	0.94	92.0%	93.0%	0.925	6.0%

Evaluated on 50K samples across GPT-4, Claude 3, Gemini, Llama-3, Mistral.
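The F1 column is the harmonic mean of the Precision and Recall columns. As a quick sanity check, the sketch below recomputes it for the text rows; the helper name `f1_score` is ours and is not part of any published tooling.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs copied from the text table above (as fractions).
rows = {
    "RoBERTa-base-openai-detector": (0.91, 0.90),
    "Binoculars": (0.89, 0.92),
    "Gemini 2.0 Flash (ensemble head)": (0.88, 0.89),
    "Ensemble (all combined)": (0.92, 0.93),
}

for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1_score(p, r):.4f}")
```

Each recomputed value agrees with the table's F1 column to within rounding at three decimals.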

Image

CIFAKE, GenImage, FaceForensics++ · v4.4.0 July 2026

Model	AUC-ROC	Precision	Recall	F1	FPR
ViT-based classifier (fine-tuned)	0.94	91.0%	93.0%	0.920	7.0%
CLIP embedding similarity	0.89	87.0%	89.0%	0.880	10.0%
Pixel integrity + frequency domain (L1–L4)	0.85	83.0%	86.0%	0.845	13.0%
Grok Vision (RAG-augmented)	0.92	90.0%	91.0%	0.905	8.0%
L11 PAFRA — Polarization & Fresnel (sky/outdoor)	0.81	76.0%	100.0%	0.865	18.0%
L12 BDIS — Bayer Demosaicing (universal)	0.91	89.0%	100.0%	0.942	11.0%
L13 SSWDP — Subsurface Scattering (portraits)	0.79	71.0%	100.0%	0.831	21.0%
L14 QESM — Quantum Efficiency (gray regions)	0.83	78.0%	88.0%	0.826	17.0%
Physical consistency ensemble (L11–L14)	0.91	88.0%	100.0%	0.936	13.0%
Ensemble — all 14 layers combined	0.98	96.0%	97.0%	0.965	3.0%

Evaluated on 40K images: Midjourney v6, DALL-E 3, Stable Diffusion XL, Firefly, FLUX, Grok, Gemini, SDXL. The physical consistency layers (L11–L14) add physics-based analysis: Bayer demosaicing (BDIS, universal), polarization (PAFRA, sky/outdoor scenes), subsurface scattering (SSWDP, portraits), and sensor quantum efficiency (QESM, gray regions). L12-BDIS achieves 100% recall across all 8 tested generator types. All four layers return a neutral score when their scene prerequisites are absent, so they do not skew results on images where they do not apply.
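To illustrate why a neutral score leaves the combined result unchanged, here is a minimal sketch that assumes layer scores are probabilities and that the ensemble sums their log-odds. This page does not state how the 14 layers are actually combined, so the fusion rule, the 0.5 neutral value, and the layer scores below are all assumptions for illustration.

```python
import math
from typing import Dict, Optional

NEUTRAL = 0.5  # assumed "no opinion" score returned by an inapplicable layer


def logit(p: float, eps: float = 1e-6) -> float:
    """Log-odds of p, clipped away from 0 and 1 to keep the value finite."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def fuse(scores: Dict[str, float], weights: Optional[Dict[str, float]] = None) -> float:
    """Combine per-layer probabilities by summing weighted log-odds (assumed rule)."""
    total = 0.0
    for name, p in scores.items():
        w = 1.0 if weights is None else weights.get(name, 1.0)
        total += w * logit(p)  # logit(0.5) == 0, so neutral layers add nothing
    return 1 / (1 + math.exp(-total))


# Hypothetical scores: two layers apply, two physics layers abstain.
applicable = {"L12_BDIS": 0.92, "vit": 0.88}
with_neutral = dict(applicable, L11_PAFRA=NEUTRAL, L13_SSWDP=NEUTRAL)
print(fuse(applicable), fuse(with_neutral))  # both calls print the same value
```

Under a plain average, by contrast, a 0.5 score would pull the result toward the middle, which is one reason a log-odds style combination is a natural fit for layers that can abstain.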

Audio

ASVspoof 2019/2021, ADD 2023

Model	AUC-ROC	Precision	Recall	F1	FPR
wav2vec2 (fine-tuned, ASVspoof)	0.93	91.0%	92.0%	0.915	7.0%
Spectral feature analysis	0.87	85.0%	86.0%	0.855	12.0%
SynthID local watermark check	0.82	88.0%	78.0%	0.827	5.0%
Ensemble (all combined)	0.95	92.0%	93.0%	0.925	6.0%

Evaluated on 30K clips: ElevenLabs, Bark, VALL-E, YourTTS, RVC clones.

Video

FaceForensics++, DFDC Preview

Model	AUC-ROC	Precision	Recall	F1	FPR
NVIDIA NIM deepfake detection	0.91	89.0%	90.0%	0.895	9.0%
Frame-level ViT ensemble	0.88	86.0%	87.0%	0.865	11.0%
Temporal consistency analysis	0.83	82.0%	83.0%	0.825	15.0%
Ensemble (all combined)	0.93	91.0%	90.0%	0.905	8.0%

Evaluated on 8K clips: Sora, Kling, Runway Gen-3, DeepFaceLab.
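Recall and FPR are properties of the detector, but precision also depends on how much of the traffic is synthetic. The sketch below applies Bayes' rule to the video ensemble's recall (90%) and FPR (8%) from the table above; the prevalence values are illustrative assumptions, not measurements.

```python
def precision_at_prevalence(recall: float, fpr: float, prevalence: float) -> float:
    """Expected precision when a fraction `prevalence` of inputs is synthetic."""
    true_pos = recall * prevalence          # synthetic items correctly flagged
    false_pos = fpr * (1 - prevalence)      # authentic items wrongly flagged
    if true_pos + false_pos == 0:
        return 0.0
    return true_pos / (true_pos + false_pos)


# Video ensemble row: recall 0.90, FPR 0.08. Prevalence values are hypothetical.
for prevalence in (0.5, 0.1, 0.01):
    p = precision_at_prevalence(0.90, 0.08, prevalence)
    print(f"prevalence {prevalence:.2f}: precision {p:.3f}")
```

Precision falls as synthetic content becomes rarer, which is one concrete way the curated-data figures can overstate performance in deployment.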

Evaluation Datasets

Modality	Dataset	Size
Text	PAN25 Authorship Verification	~500K samples
Text	PERSUADE Corpus 2.0	~25K essays
Text	M4 Benchmark	122K samples
Image	CIFAKE	120K images
Image	GenImage	1.3M images
Audio	ASVspoof 2019 (LA track)	121K clips
Audio	ASVspoof 2021	181K clips
Audio	ADD 2023	~330K clips
Video	FaceForensics++	5K videos
Video	DFDC Preview Dataset (Meta)	19K videos

Full results with confidence intervals and per-generator breakdowns available as CSV.
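For readers who want to reproduce an interval from raw counts, one common choice for a proportion such as recall is the Wilson score interval. This is a sketch only: this page does not say which method the CSV uses, and the counts in the example are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: 1,000 synthetic images, all detected (100% recall).
low, high = wilson_interval(1000, 1000)
print(f"recall 95% CI: [{low:.4f}, {high:.4f}]")
```

Unlike the simple normal approximation, the Wilson interval stays informative when the observed rate is exactly 100%: the lower bound then depends on sample size.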