Video Action Differencing with Natural Language

Stanford and UC Berkeley

VidDiff: a new task for describing differences between two videos in natural language

VidDiff pull figure showing the task

The VidDiff task and benchmark

Given two videos of the same action, the Video Action Differencing task is to generate two sets of natural language statements: one set of differences that are more true for video A than video B, and another set that are more true for video B than video A. For example, a video pair may feature an expert and a novice performing a barbell squat, with important differences such as "knees caving in more in video A" and "the squat is deeper in video B". Since this is a new task, we curate VidDiffBench, a benchmark of 656 video pairs with 5,580 carefully annotated differences. The videos span diverse domains where learning a skill requires significant time investment and expert feedback, including fitness exercises, sports, music, and surgery.
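To make the task format concrete, here is a minimal sketch of what one benchmark item and one prediction might look like. The class and field names (VidDiffSample, more_true_for_a, etc.) are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical sketch of the task's input/output format.
# Field names are illustrative, not VidDiffBench's actual schema.
from dataclasses import dataclass, field
from typing import List


@dataclass
class VidDiffSample:
    """One benchmark item: a pair of videos of the same action."""
    action: str      # e.g. "barbell squat"
    video_a: str     # path or URL to video A
    video_b: str     # path or URL to video B


@dataclass
class VidDiffPrediction:
    """Model output: differences attributed to each video."""
    more_true_for_a: List[str] = field(default_factory=list)
    more_true_for_b: List[str] = field(default_factory=list)


sample = VidDiffSample(
    action="barbell squat",
    video_a="videos/expert_squat.mp4",
    video_b="videos/novice_squat.mp4",
)
prediction = VidDiffPrediction(
    more_true_for_a=["knees caving in more"],
    more_true_for_b=["the squat is deeper"],
)
```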

Methods: the VidDiff Framework

VidDiff methods figure

A user-supplied description of the action (e.g., "weighted squat") is passed to the LLM proposer, which generates candidate differences for that action. The frame localizer assigns each candidate difference a set of retrieval strings and the approximate number of frames needed to assess it. Given two videos A and B, this information is used to localize the frames needed to evaluate whether the difference holds between the videos. Using these localized frames, the action differencer converts each difference description into a multiple-choice question posed to a VQA model, which determines whether this variation exists.
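The sketch below illustrates how the three stages could fit together. The helper functions (propose_differences, localize_frames, answer_mcq) are placeholders for an LLM call, a retrieval-based frame localizer, and a VQA model; they and the multiple-choice options shown are assumptions for illustration, not the released implementation.

```python
# Minimal sketch of the three-stage VidDiff pipeline under assumed interfaces.
from typing import Dict, List


def propose_differences(action_description: str) -> List[str]:
    """Stage 1 (LLM proposer): generate candidate differences for the action."""
    # In practice this would prompt an LLM; hard-coded here for illustration.
    return ["the squat is deeper", "knees caving inward"]


def localize_frames(video: str, retrieval_strings: List[str], n_frames: int) -> List[int]:
    """Stage 2 (frame localizer): retrieve frame indices relevant to a difference."""
    # In practice: embed frames and retrieval strings and keep the top matches.
    # Dummy indices returned here to keep the sketch self-contained.
    return list(range(n_frames))


def answer_mcq(frames_a: List[int], frames_b: List[int], question: str) -> str:
    """Stage 3 (action differencer): pose a multiple-choice question to a VQA model."""
    # In practice the localized frames from both videos are passed to a VQA model.
    return "a"


def viddiff_pipeline(action_description: str, video_a: str, video_b: str) -> Dict[str, str]:
    """Run the full pipeline and return an answer per candidate difference."""
    results = {}
    for difference in propose_differences(action_description):
        frames_a = localize_frames(video_a, [difference], n_frames=4)
        frames_b = localize_frames(video_b, [difference], n_frames=4)
        question = (
            f"Which video shows '{difference}' more? "
            f"(a) video A, (b) video B"
        )
        results[difference] = answer_mcq(frames_a, frames_b, question)
    return results


print(viddiff_pipeline("weighted squat", "squat_a.mp4", "squat_b.mp4"))
```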

BibTeX

(coming soon)