AI-based talking-head videoconferencing systems reduce bandwidth by transmitting a latent representation of a speaker’s pose and expression, which the receiver uses to synthesize frames. However, these systems are vulnerable to “puppeteering” attacks, in which an adversary animates another person’s likeness in real time. Traditional deepfake detectors fail here because all video content is synthetic, even in legitimate use. We propose a novel biometric defense that detects identity leakage in the transmitted latent representation. Our metric-learning approach disentangles identity cues from pose and expression, enabling detection of unauthorized identity swaps. Experiments across multiple talking-head models show that our method consistently outperforms prior defenses, operates in real time on consumer GPUs, and generalizes well to out-of-distribution data. By targeting the latent features already shared during normal operation, our method offers a practical and robust safeguard against puppeteering.
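As a rough illustration of the detection idea, the sketch below compares an identity embedding recovered from the transmitted latent with an embedding of the identity being rendered, and flags a mismatch when their similarity is low. It is a minimal sketch under stated assumptions, not the method described here: the function names, the use of cosine similarity, and the threshold value are all hypothetical, and the learned embedding network that would produce these vectors is omitted.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def is_puppeteered(latent_identity, reference_identity, threshold=0.5):
    """Flag a frame when the identity cues in the latent disagree with
    the identity being rendered.

    latent_identity:    hypothetical identity embedding derived from the
                        transmitted pose/expression latent.
    reference_identity: hypothetical embedding of the rendered identity.
    threshold:          assumed decision threshold, chosen for illustration.
    """
    return cosine_similarity(latent_identity, reference_identity) < threshold


# Orthogonal embeddings: the driver's identity does not match the rendered one.
print(is_puppeteered([1.0, 0.0], [0.0, 1.0]))
# Nearly parallel embeddings: driver and rendered identity agree.
print(is_puppeteered([1.0, 0.1], [1.0, 0.0]))
```

In a real system the threshold would be calibrated on held-out data, and the decision would likely be aggregated over many frames rather than made per frame.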