Accuracy Benchmarks
Per-modality evaluation results across our ensemble models. All benchmarks use held-out test splits; none of the test data was used for training.
Real-world accuracy varies by generator novelty, content type, and obfuscation. Treat these figures as upper bounds measured on curated data.
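
One further reason to read the precision figures conservatively is class balance, which is an assumption added here and not one of the factors listed above. Precision depends on how much of the incoming content is actually AI-generated, while recall and FPR do not. The sketch below applies Bayes' rule to the recall and FPR reported for the text ensemble row; the function name and the prevalence values are hypothetical and chosen only for illustration.

```python
def precision_at_prevalence(recall: float, fpr: float, prevalence: float) -> float:
    """Precision implied by a fixed recall and FPR when a fraction
    `prevalence` of the inputs is AI-generated (Bayes' rule)."""
    true_pos = recall * prevalence          # share of inputs correctly flagged
    false_pos = fpr * (1.0 - prevalence)    # share of inputs wrongly flagged
    return true_pos / (true_pos + false_pos)


# Recall 97.0% and FPR 3.0% are taken from the text ensemble row.
# The prevalence values are illustrative assumptions, not measured data.
for prevalence in (0.5, 0.1, 0.01):
    ppv = precision_at_prevalence(recall=0.97, fpr=0.03, prevalence=prevalence)
    print(f"prevalence {prevalence:>5.0%}: precision {ppv:.1%}")
```

With these inputs, precision is 97% when half the traffic is AI-generated, about 78% at 10% prevalence, and about 25% at 1% prevalence, even though recall and FPR are unchanged.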
Text
Datasets: PAN25, PERSUADE 2.0, M4

| Model | AUC-ROC | Precision | Recall | F1 | FPR |
|---|---|---|---|---|---|
| RoBERTa-base-openai-detector | 0.97 | 95.0% | 94.0% | 0.945 | 4.0% |
| Binoculars (perplexity/crossperplexity) | 0.96 | 93.0% | 96.0% | 0.945 | 5.0% |
| Gemini 2.0 Flash (ensemble head) | 0.95 | 94.0% | 93.0% | 0.935 | 5.0% |
| Ensemble (all combined) | 0.98 | 96.0% | 97.0% | 0.965 | 3.0% |
Evaluated on 50K samples across GPT-4, Claude 3, Gemini, Llama-3, Mistral.
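
For readers checking the columns, the sketch below shows the standard definitions of Precision, Recall, F1, and FPR in terms of confusion-matrix counts, treating "AI-generated" as the positive class. The `Confusion` class, the function names, and the example counts are hypothetical; the source does not publish raw counts, so only the F1 check at the end uses numbers from the table. AUC-ROC is not covered because it depends on the full score distribution, not on a single threshold.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts for a binary detector; positive = AI-generated."""
    tp: int  # AI-generated, flagged
    fp: int  # human-made, flagged
    tn: int  # human-made, passed
    fn: int  # AI-generated, missed


def precision(c: Confusion) -> float:
    return c.tp / (c.tp + c.fp)


def recall(c: Confusion) -> float:
    return c.tp / (c.tp + c.fn)


def false_positive_rate(c: Confusion) -> float:
    return c.fp / (c.fp + c.tn)


def f1_from(p: float, r: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Hypothetical counts, for illustration only.
example = Confusion(tp=90, fp=10, tn=190, fn=10)
print(precision(example), recall(example), false_positive_rate(example))

# Consistency check against the text ensemble row (Precision 96.0%, Recall 97.0%).
print(round(f1_from(0.96, 0.97), 3))
```

The last line reproduces the 0.965 F1 reported for the text ensemble; the same check applies to any row in the tables.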
Image
Datasets: CIFAKE, GenImage, FaceForensics++

| Model | AUC-ROC | Precision | Recall | F1 | FPR |
|---|---|---|---|---|---|
| ViT-based classifier (fine-tuned) | 0.94 | 91.0% | 93.0% | 0.920 | 7.0% |
| CLIP embedding similarity | 0.89 | 87.0% | 89.0% | 0.880 | 10.0% |
| Pixel integrity + frequency domain | 0.85 | 83.0% | 86.0% | 0.845 | 13.0% |
| Grok Vision (RAG-augmented) | 0.92 | 90.0% | 91.0% | 0.905 | 8.0% |
| Ensemble (all combined) | 0.96 | 93.0% | 94.0% | 0.935 | 5.0% |
Evaluated on 40K images: Midjourney v6, DALL-E 3, Stable Diffusion XL, Firefly.
Audio
Datasets: ASVspoof 2019/2021, ADD 2023

| Model | AUC-ROC | Precision | Recall | F1 | FPR |
|---|---|---|---|---|---|
| wav2vec2 (fine-tuned, ASVspoof) | 0.93 | 91.0% | 92.0% | 0.915 | 7.0% |
| Spectral feature analysis | 0.87 | 85.0% | 86.0% | 0.855 | 12.0% |
| SynthID local watermark check | 0.82 | 88.0% | 78.0% | 0.827 | 5.0% |
| Ensemble (all combined) | 0.95 | 92.0% | 93.0% | 0.925 | 6.0% |
Evaluated on 30K clips: ElevenLabs, Bark, VALL-E, YourTTS, RVC clones.
Video
Datasets: FaceForensics++, DFDC Preview

| Model | AUC-ROC | Precision | Recall | F1 | FPR |
|---|---|---|---|---|---|
| NVIDIA NIM deepfake detection | 0.91 | 89.0% | 90.0% | 0.895 | 9.0% |
| Frame-level ViT ensemble | 0.88 | 86.0% | 87.0% | 0.865 | 11.0% |
| Temporal consistency analysis | 0.83 | 82.0% | 83.0% | 0.825 | 15.0% |
| Ensemble (all combined) | 0.93 | 91.0% | 90.0% | 0.905 | 8.0% |
Evaluated on 8K clips: Sora, Kling, Runway Gen-3, DeepFaceLab.
Evaluation Datasets
| Modality | Dataset | Size |
|---|---|---|
| Text | PAN25 Authorship Verification | ~500K samples |
| Text | PERSUADE Corpus 2.0 | ~25K essays |
| Text | M4 Benchmark | 122K samples |
| Image | CIFAKE | 120K images |
| Image | GenImage | 1.3M images |
| Audio | ASVspoof 2019 (LA track) | 121K clips |
| Audio | ASVspoof 2021 | 181K clips |
| Audio | ADD 2023 | ~330K clips |
| Video | FaceForensics++ | 5K videos |
| Video | DFDC Preview Dataset (Meta) | 19K videos |
Full results, including confidence intervals and per-generator breakdowns, are available as CSV.