Independent Benchmark — March 2026

We Tested 3 AI Video Detection APIs on 190 Videos

190 videos. 11 AI generators. Social media compression. The results reveal a massive accuracy gap — especially on the content that matters most for fraud prevention.

190 Videos Tested
11 AI Generators
98.9% DeFake Accuracy
100% Fake Detection

The Problem

There is no standardized benchmark for AI video detection. Detection API providers publish accuracy claims with no public methodology, no shared datasets, and no independent verification. Enterprise buyers — insurance companies, legal teams, platforms — are purchasing detection tools based on unverifiable marketing claims.

We assembled the first independent benchmark from verified Hugging Face sources. We tested 3 commercially available detection APIs on the same 190-video dataset, with the same methodology, and published the results.

Overall Results

3 detection APIs. 190 balanced videos (95 AI-generated + 95 real). Same dataset, same conditions.

Metric                          DeFake v1       Competitor A    Competitor B
Fakes Detected                  95/95 (100%)    78/98 (79.6%)   55/97 (56.7%)
Reals Correct                   93/95 (97.9%)   95/95 (100%)    91/95 (95.8%)
Overall Accuracy                98.9%           89.6%           76.0%
False Negatives (missed fakes)  0               20              42
False Positives                 2               0               4
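The percentages in the table follow from the raw counts. As a minimal sketch (the function and variable names are illustrative, and the counts are copied from the table as published, including each API's own fake total), the arithmetic can be reproduced like this:

```python
# Counts copied from the Overall Results table:
# (fakes detected, fakes tested, reals correct, reals tested).
RESULTS = {
    "DeFake v1": (95, 95, 93, 95),
    "Competitor A": (78, 98, 95, 95),
    "Competitor B": (55, 97, 91, 95),
}


def summarize(fakes_hit, fakes_total, reals_hit, reals_total):
    """Return (fake detection rate, real-video accuracy, overall accuracy) in percent."""
    detection = 100 * fakes_hit / fakes_total
    reals = 100 * reals_hit / reals_total
    overall = 100 * (fakes_hit + reals_hit) / (fakes_total + reals_total)
    return round(detection, 1), round(reals, 1), round(overall, 1)


for name, counts in RESULTS.items():
    print(name, summarize(*counts))
```

False negatives are the fakes tested minus the fakes detected, and false positives are the reals tested minus the reals classified correctly.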

Fake Detection Rate

DeFake v1: 100% (95/95, zero missed fakes)
Competitor A: 79.6% (78/98, missed 20 fakes)
Competitor B: 56.7% (55/97, missed 42 fakes)

Overall Accuracy

DeFake v1: 98.9% (188/190 correct)
Competitor A: 89.6% (173/193 correct)
Competitor B: 76.0% (146/192 correct)

The Social Media Gap

Social media platforms compress and re-encode uploaded videos, destroying the signals most detectors rely on. This is the real test for a detection API, because insurance claimants, scammers, and fraudsters submit evidence through mobile apps that compress video the same way.

Competitor B: 22% (6 of 27 social media fakes detected)
Competitor A: 67% (18 of 27 social media fakes detected)
DeFake v1: 100% (27 of 27 social media fakes detected)

Why this matters: Insurance claimants upload evidence through mobile apps. Social media platforms compress video the same way. A detector that works on raw AI generator output but fails on compressed content is blind to real-world fraud. Competitor B detected only 22% of social media fakes — worse than a coin flip.

Detection by AI Generator

We tested across 11 different AI video generators to measure detection breadth. Some APIs have critical blind spots on specific generators.

Generator            Videos  DeFake  Comp. A  Comp. B
Sora                 7       100%    100%     100%
Kling                10      100%    70%      100%
Minimax              4       100%    100%     75%
Pika                 9       100%    78%      89%
Veo                  6       100%    33%      33%
Veo3                 8       100%    75%      63%
Runway               7       100%    100%     17%
CogVideo             5       100%    100%     0%
AnimateDiff          5       100%    100%     100%
StableDiffusion      5       100%    100%     100%
VideoPoet            5       100%    100%     60%
All Generators       71      100%    84.5%    70.0%
Social Media Fakes   27      100%    66.7%    22.2%

Competitor B scored 0% on CogVideo and 17% on Runway — critical blind spots that suggest training on older generator signatures only.
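One simple way to surface blind spots like these is to flag every generator whose detection rate falls below a cutoff. The sketch below uses Competitor B's per-generator rates from the table; the 50% threshold and the function name are assumptions chosen for illustration, not part of the benchmark.

```python
# Competitor B detection rates (percent) per generator, copied from the table above.
COMP_B = {
    "Sora": 100, "Kling": 100, "Minimax": 75, "Pika": 89, "Veo": 33,
    "Veo3": 63, "Runway": 17, "CogVideo": 0, "AnimateDiff": 100,
    "StableDiffusion": 100, "VideoPoet": 60,
}


def blind_spots(rates, threshold=50):
    """Return (generator, rate) pairs detected below `threshold` percent, worst first."""
    flagged = [(gen, rate) for gen, rate in rates.items() if rate < threshold]
    return sorted(flagged, key=lambda item: item[1])


print(blind_spots(COMP_B))
```

With a 50% cutoff this flags CogVideo, Runway, and Veo for Competitor B.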

Why Detection Fails on Social Media

Compression Destroys Signals

TikTok, Instagram, and YouTube re-encode uploads at lower bitrates. Pixel-level artifacts that detectors rely on get smoothed away. After compression, noise patterns become indistinguishable from AI artifacts for many detection approaches.
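To check how a detector behaves on compressed input, a clean clip can be re-encoded locally before it is submitted. The sketch below builds an ffmpeg command for that purpose. The bitrate, resolution, and codec are assumptions standing in for a typical social upload; the actual encoder settings each platform uses are not stated here, and the helper names are hypothetical.

```python
import subprocess


def build_reencode_command(src, dst, video_bitrate="800k", height=720):
    """Build an ffmpeg command that downsizes and re-encodes a video at a low bitrate."""
    return [
        "ffmpeg", "-y",               # overwrite the output file if it exists
        "-i", src,
        "-vf", f"scale=-2:{height}",  # keep aspect ratio, force an even width
        "-c:v", "libx264",            # H.264 video (assumed codec)
        "-b:v", video_bitrate,        # assumed target video bitrate
        "-c:a", "aac",
        "-b:a", "96k",
        dst,
    ]


def reencode(src, dst, **kwargs):
    """Run the re-encode. Requires ffmpeg to be installed and on PATH."""
    subprocess.run(build_reencode_command(src, dst, **kwargs), check=True)
```

Running the same detector on the original file and on the re-encoded copy gives a rough measure of how much of its accuracy depends on pixel-level detail that compression removes.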

Static vs. Temporal Analysis

Some APIs analyze a single snapshot and classify it. This misses temporal artifacts — objects morphing between moments, physics violations over time, anatomy that shifts unnaturally. DeFake's proprietary pipeline analyzes video across the full timeline.
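The difference between the two approaches starts with which frames are examined at all. The following sketch contrasts a single-snapshot choice with evenly spaced sampling across the clip. It illustrates frame selection only, under assumed function names, and is not a description of DeFake's pipeline or of any competitor's implementation.

```python
def single_frame_index(total_frames):
    """Static analysis: pick one snapshot, here the middle frame."""
    return total_frames // 2


def timeline_frame_indices(total_frames, samples):
    """Temporal analysis: pick `samples` frames spread evenly across the clip."""
    if samples <= 0 or total_frames <= 0:
        return []
    samples = min(samples, total_frames)
    step = total_frames / samples
    # Take the centre of each equal-length segment so early and late frames are both covered.
    return [int(step * i + step / 2) for i in range(samples)]


print(single_frame_index(300))         # one frame from the middle of the clip
print(timeline_frame_indices(300, 5))  # five frames across the whole clip
```

For a 300-frame clip, an artifact visible only around frames 200 to 260 would never appear in the middle snapshot, while the timeline sample includes a frame from that range. Comparing consecutive sampled frames is also what makes morphing or drifting objects detectable in the first place.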

Scores vs. Forensic Evidence

Competitors return a number — a confidence score with no explanation. DeFake returns forensic evidence: specific artifacts found, timestamps, physics violations, anatomy failures. A score can be challenged in court. Specific evidence is harder to dismiss.
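To make the contrast concrete, here is a hypothetical data structure for an evidence-style result. The class names, field names, and example values are all assumptions made for illustration; they are not DeFake's actual response schema.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    kind: str          # e.g. "physics_violation" (illustrative label)
    start_s: float     # where in the video the artifact begins, in seconds
    end_s: float       # where it ends, in seconds
    description: str


@dataclass
class EvidenceReport:
    verdict: str
    confidence: float
    findings: list = field(default_factory=list)

    def summary_lines(self):
        """Render the verdict followed by one line per timestamped finding."""
        lines = [f"{self.verdict} (confidence {self.confidence:.2f})"]
        for f in self.findings:
            lines.append(f"  {f.start_s:.1f}s-{f.end_s:.1f}s {f.kind}: {f.description}")
        return lines


# Made-up example values, not output from any real detector.
report = EvidenceReport(
    verdict="ai_generated",
    confidence=0.97,
    findings=[Finding("physics_violation", 2.0, 3.5, "object passes through a solid surface")],
)
print("\n".join(report.summary_lines()))
```

A score-only response corresponds to the first line of that summary. The findings list is what gives a reviewer something specific to inspect at a given timestamp.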

The Insurance Problem

Insurance claimants submit photos and videos through mobile apps — the same compression pipeline as social media. A detector that scores 22% on social media fakes will miss the majority of fraudulent claims submitted to insurance companies.

Methodology

Transparent, reproducible testing. AI-generated videos sourced from confirmed datasets on Hugging Face and verified TikTok content. Same dataset, same conditions for all APIs.

Dataset Composition

95 AI-Generated Videos

  • 71 from confirmed Hugging Face datasets: Sora (7), Kling (10), Minimax (4), Pika (9), Veo (6), Veo3 (8), Runway (7), CogVideo (5), AnimateDiff (5), StableDiffusion (5), VideoPoet (5) — ground truth confirmed by dataset labels
  • 6 TikTok videos: Flagged by creators or labeled as AI-generated content
  • 15 TikTok URL fakes: Collected Dec 2025–Jan 2026, includes AI-edited videos
  • 6 AI content farm accounts: Flagged by the platform as AI-generated content

95 Real Videos

  • Surveillance & dashcam: 9 videos
  • News & broadcast: 6 videos
  • Social media (verified accounts): 16 videos from established accounts with confirmed real content
  • Hard negatives (CGI, VFX, animations): 10 videos
  • Stock footage: 11 videos
  • Verified old accounts: 33 videos from long-established accounts with confirmed real content history
  • Professional, drone, other: 10 videos

Testing Approach

Same dataset for all APIs

Each API was tested on the same videos under the same conditions. No cherry-picking.
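A harness for this kind of comparison can be small. The sketch below assumes each API is wrapped in a function that takes a video path and returns True when the video is judged AI-generated; those wrappers, and the names used here, are hypothetical and not part of any vendor SDK.

```python
def run_benchmark(detectors, videos):
    """Run every detector over the same labeled videos and tally a confusion matrix.

    detectors: dict mapping name -> callable(path) returning True if judged AI-generated.
    videos: list of (path, is_fake) pairs with known ground truth.
    """
    results = {}
    for name, detect in detectors.items():
        tally = {"tp": 0, "fn": 0, "tn": 0, "fp": 0}
        for path, is_fake in videos:
            flagged = detect(path)
            if is_fake:
                tally["tp" if flagged else "fn"] += 1  # caught fake vs. missed fake
            else:
                tally["fp" if flagged else "tn"] += 1  # false alarm vs. correct real
        results[name] = tally
    return results
```

Because every detector iterates over the same list, any difference in the tallies comes from the detectors rather than from the sample.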

Confirmed ground truth from public datasets

AI-generated videos sourced from confirmed datasets on Hugging Face with verified generator labels. Social media fakes confirmed via creator flags and platform AI-generated labels. Real videos from long-established accounts with confirmed real content history.

API-native testing

Each API was tested using its recommended integration method. Competitor A extracts a single frame. Competitor B uses async upload processing. DeFake uses a proprietary forensic pipeline with temporal analysis.

Balanced dataset

95 fake + 95 real = 50/50 split. Prevents accuracy inflation from imbalanced classes. Additional real videos assumed correct based on 100% accuracy across all tested social media reals.
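A short calculation shows why the split matters. Consider a useless detector that labels every video as real; the second dataset below is a made-up imbalanced example for comparison.

```python
def accuracy_of_always_real(n_fake, n_real):
    """Accuracy of a detector that labels every video as real, regardless of content."""
    return n_real / (n_fake + n_real)


print(accuracy_of_always_real(95, 95))   # balanced split used in this benchmark
print(accuracy_of_always_real(10, 180))  # hypothetical imbalanced split
```

On the balanced split the detector scores 50%, which exposes it immediately. On the imbalanced split it scores about 94.7% while catching no fakes at all.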

Fair Disclosure

  • DeFake uses a proprietary forensic pipeline with temporal video analysis. Competitor A analyzes single extracted frames. Competitor B uses async video processing. Different approaches were tested as each vendor recommends.
  • DeFake had 2 false positives, primarily on heavily compressed broadcast and dashcam footage — a known edge case across current detection systems.
  • This benchmark was conducted by the DeFake team. We invite independent researchers to reproduce our results. Contact kim@defakes.com for the full dataset.

What DeFake Does Differently

Detection is just the beginning. DeFake produces forensic evidence — not just a score.

Temporal Consistency Analysis

Analyzes video across the full timeline using proprietary forensic models. Detects temporal artifacts — objects morphing unnaturally, physics violations, anatomy inconsistencies — that static analysis misses entirely.

Forensic Evidence

Returns specific artifacts, timestamps, and forensic reasoning — not just a confidence score. Identifies exactly what triggered the detection: morphing textures, impossible physics, anatomy failures, lighting inconsistencies.

Enterprise-Grade Reports

Full forensic reports with chain of custody, evidence exhibits, and analysis methodology. Designed for insurance claim denials, legal proceedings, and compliance documentation.

Test It Yourself

Paste any TikTok or Instagram URL into our scanner. See the forensic analysis in seconds. Vote on whether you think it is real or AI-generated before seeing the AI verdict.

Questions about the benchmark? Contact kim@defakes.com

Frequently Asked Questions