190 videos. 11 AI generators. Social media compression. The results reveal a massive accuracy gap — especially on the content that matters most for fraud prevention.
There is no standardized benchmark for AI video detection. Detection API providers publish accuracy claims with no public methodology, no shared datasets, and no independent verification. Enterprise buyers — insurance companies, legal teams, platforms — are purchasing detection tools based on unverifiable marketing claims.
We assembled the first independent benchmark from verified Hugging Face sources. We tested 3 commercially available detection APIs on the same 190-video dataset, with the same methodology, and published the results.
3 detection APIs. 190 balanced videos (95 AI-generated + 95 real). Same dataset, same conditions.
| Metric | DeFake v1 | Competitor A | Competitor B |
|---|---|---|---|
| Fakes Detected | 95/95 (100%) | 78/98 (79.6%) | 55/97 (56.7%) |
| Reals Correct | 93/95 (97.9%) | 95/95 (100%) | 91/95 (95.8%) |
| Overall Accuracy | 98.9% | 89.6% | 76.0% |
| False Negatives (missed fakes) | 0 | 20 | 42 |
| False Positives | 2 | 0 | 4 |
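The accuracy and error rows can be re-derived from the detection counts. The sketch below is a minimal check in Python that uses the counts exactly as they appear in the table above; the function and variable names are illustrative and not part of any API. Note that the fake-video denominators differ by column (95, 98, and 97), so each accuracy figure is computed over that column's own total.

```python
# Counts copied from the results table:
# (fakes detected, fakes tested, reals correct, reals tested)
RESULTS = {
    "DeFake v1": (95, 95, 93, 95),
    "Competitor A": (78, 98, 95, 95),
    "Competitor B": (55, 97, 91, 95),
}


def overall_accuracy(fakes_detected, fakes_tested, reals_correct, reals_tested):
    """Fraction of all tested videos that were classified correctly."""
    return (fakes_detected + reals_correct) / (fakes_tested + reals_tested)


def summarize(results):
    """Recompute accuracy, false negatives, and false positives per API."""
    summary = {}
    for name, (tp, fakes, tn, reals) in results.items():
        summary[name] = {
            "accuracy_pct": round(100 * overall_accuracy(tp, fakes, tn, reals), 1),
            "false_negatives": fakes - tp,  # fakes labeled as real
            "false_positives": reals - tn,  # reals labeled as fake
        }
    return summary


if __name__ == "__main__":
    for name, row in summarize(RESULTS).items():
        print(name, row)
```

Run as a script, this prints one line per API with the recomputed accuracy and error counts, which can be compared against the table row by row.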
Social media platforms compress and re-encode uploaded videos, which destroys many of the signals detectors rely on. This is the realistic test for a detection API, because fraudulent evidence typically arrives after passing through a mobile app or platform upload pipeline.
Why this matters: Insurance claimants upload evidence through mobile apps, which compress video much as social media platforms do. A detector that works on raw AI generator output but fails on compressed content is blind to real-world fraud. Competitor B detected only 22% of social media fakes, worse than a coin flip.
We tested across 11 different AI video generators to measure detection breadth. Some APIs have critical blind spots on specific generators.
| Generator | Videos | DeFake | Comp. A | Comp. B |
|---|---|---|---|---|
| Sora | 7 | 100% | 100% | 100% |
| Kling | 10 | 100% | 70% | 100% |
| Minimax | 4 | 100% | 100% | 75% |
| Pika | 9 | 100% | 78% | 89% |
| Veo | 6 | 100% | 33% | 33% |
| Veo3 | 8 | 100% | 75% | 63% |
| Runway | 7 | 100% | 100% | 17% |
| CogVideo | 5 | 100% | 100% | 0% |
| AnimateDiff | 5 | 100% | 100% | 100% |
| StableDiffusion | 5 | 100% | 100% | 100% |
| VideoPoet | 5 | 100% | 100% | 60% |
| All Generators | 71 | 100% | 84.5% | 70.0% |
| Social Media Fakes | 27 | 100% | 66.7% | 22.2% |
Competitor B scored 0% on CogVideo and 17% on Runway. These are critical blind spots, and they may suggest the model was trained mainly on older generator signatures.
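One way to surface blind spots like these systematically is to scan the per-generator rates for anything below a cutoff. The sketch below uses the percentages from the table above; the 50% threshold and the function name are assumptions chosen for illustration, not part of the benchmark methodology.

```python
# Per-generator detection rates (percent), copied from the table:
# (DeFake, Comp. A, Comp. B)
RATES = {
    "Sora": (100, 100, 100),
    "Kling": (100, 70, 100),
    "Minimax": (100, 100, 75),
    "Pika": (100, 78, 89),
    "Veo": (100, 33, 33),
    "Veo3": (100, 75, 63),
    "Runway": (100, 100, 17),
    "CogVideo": (100, 100, 0),
    "AnimateDiff": (100, 100, 100),
    "StableDiffusion": (100, 100, 100),
    "VideoPoet": (100, 100, 60),
}
APIS = ("DeFake", "Comp. A", "Comp. B")


def blind_spots(rates, threshold_pct=50):
    """Map each API to the generators where its rate falls below the threshold."""
    found = {api: [] for api in APIS}
    for generator, row in rates.items():
        for api, pct in zip(APIS, row):
            if pct < threshold_pct:
                found[api].append(generator)
    return found


if __name__ == "__main__":
    print(blind_spots(RATES))
```

With a 50% cutoff, this flags Veo for Competitor A and Veo, Runway, and CogVideo for Competitor B. Raising the threshold widens the list.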
TikTok, Instagram, and YouTube re-encode uploads at lower bitrates. The pixel-level artifacts that detectors rely on get smoothed away, and for many detection approaches the compression noise that remains becomes indistinguishable from AI artifacts.
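To test a detector against this kind of degradation locally, a video can be re-encoded at a reduced bitrate and resolution before it is submitted. The sketch below builds an ffmpeg command for that purpose. The bitrate and height are illustrative guesses, not the actual settings of any platform, and the helper names are hypothetical.

```python
import subprocess


def build_reencode_cmd(src, dst, video_bitrate="800k", height=720):
    """Build an ffmpeg command that downsizes and re-encodes a video with H.264.

    The default bitrate and height are assumptions for illustration only.
    """
    return [
        "ffmpeg", "-y",               # overwrite the output file if it exists
        "-i", src,
        "-vf", f"scale=-2:{height}",  # resize to target height, keep aspect ratio
        "-c:v", "libx264",
        "-b:v", video_bitrate,        # target a low video bitrate (lossy)
        "-c:a", "aac",
        dst,
    ]


def reencode(src, dst, **kwargs):
    """Run the re-encode. Requires ffmpeg to be installed and on PATH."""
    subprocess.run(build_reencode_cmd(src, dst, **kwargs), check=True)
```

Running the same detector on both the original and the re-encoded file gives a rough sense of how much of its accuracy depends on signals that compression removes.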
Some APIs analyze a single snapshot and classify it. This misses temporal artifacts — objects morphing between moments, physics violations over time, anatomy that shifts unnaturally. DeFake's proprietary pipeline analyzes video across the full timeline.
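A toy example shows why a temporal artifact is invisible in any single frame. The sketch assumes some per-frame measurement already exists (here, a hypothetical tracked object width in pixels, with made-up values). It is not DeFake's pipeline or any competitor's method; it only contrasts looking at one frame with comparing consecutive frames.

```python
def max_frame_to_frame_jump(values):
    """Largest absolute change between consecutive frames (needs 2+ frames)."""
    return max(abs(b - a) for a, b in zip(values, values[1:]))


def has_temporal_jump(values, max_allowed_jump):
    """True if the measurement changes more than allowed between two frames."""
    return max_frame_to_frame_jump(values) > max_allowed_jump


# Hypothetical tracked width (pixels) of one object in each frame.
steady = [100, 101, 102, 101, 100, 99]
morphing = [100, 101, 140, 102, 101, 100]

if __name__ == "__main__":
    # Every individual value is a plausible width on its own;
    # only the frame-to-frame comparison exposes the jump.
    print(has_temporal_jump(steady, max_allowed_jump=10))
    print(has_temporal_jump(morphing, max_allowed_jump=10))
```

A classifier that sees only one frame of the `morphing` sequence has nothing to compare against, whichever frame it picks.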
Competitors return a number — a confidence score with no explanation. DeFake returns forensic evidence: specific artifacts found, timestamps, physics violations, anatomy failures. A score can be challenged in court. Specific evidence is harder to dismiss.
Insurance claimants submit photos and videos through mobile apps, which apply compression similar to social media pipelines. A detector that catches only 22% of social media fakes is likely to miss most fraudulent claims that arrive as compressed video.
Transparent, reproducible testing. AI-generated videos sourced from confirmed datasets on Hugging Face and verified TikTok content. Same dataset, same conditions for all APIs.
Each API was tested on the same videos under the same conditions. No cherry-picking.
AI-generated videos sourced from confirmed datasets on Hugging Face with verified generator labels. Social media fakes confirmed via creator flags and platform AI-generated labels. Real videos from long-established accounts with confirmed real content history.
Each API was tested using its recommended integration method. Competitor A extracts a single frame. Competitor B uses async upload processing. DeFake uses a proprietary forensic pipeline with temporal analysis.
95 fake + 95 real = a 50/50 split, which prevents accuracy inflation from imbalanced classes. Additional real videos were assumed correct based on 100% accuracy across all tested social media reals.
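The effect of class balance on accuracy is easy to see with a detector that does nothing useful. The sketch below computes the accuracy of a detector that labels every video as real, first on a 95/95 split and then on a hypothetical 10/180 split (the second split is invented for comparison and is not part of the benchmark).

```python
def accuracy_of_always_real(n_fake, n_real):
    """Accuracy of a useless detector that labels every video as real."""
    return n_real / (n_fake + n_real)


balanced = accuracy_of_always_real(95, 95)      # the benchmark's 50/50 split
imbalanced = accuracy_of_always_real(10, 180)   # hypothetical skewed split

if __name__ == "__main__":
    print(f"balanced:   {balanced:.1%}")
    print(f"imbalanced: {imbalanced:.1%}")
```

On the balanced set the do-nothing detector scores 50%, while on the skewed set it scores about 94.7% without catching a single fake. This is why overall accuracy is only meaningful alongside the class split.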
Detection is just the beginning. DeFake produces forensic evidence — not just a score.
Analyzes video across the full timeline using proprietary forensic models. Detects temporal artifacts — objects morphing unnaturally, physics violations, anatomy inconsistencies — that static analysis misses entirely.
Returns specific artifacts, timestamps, and forensic reasoning — not just a confidence score. Identifies exactly what triggered the detection: morphing textures, impossible physics, anatomy failures, lighting inconsistencies.
Full forensic reports with chain of custody, evidence exhibits, and analysis methodology. Designed for insurance claim denials, legal proceedings, and compliance documentation.
Paste any TikTok or Instagram URL into our scanner. See the forensic analysis in seconds. Vote on whether you think it is real or AI-generated before seeing the AI verdict.
Questions about the benchmark? Contact kim@defakes.com