Surprising videos, such as funny clips, creative performances, or visual
illusions, attract significant attention. Enjoyment of these videos is not
simply a response to visual stimuli; rather, it hinges on the human capacity to
understand (and appreciate) the commonsense violations they depict. We
introduce FunQA, a challenging video question-answering (QA) dataset
specifically designed to evaluate and enhance the depth of video reasoning
based on counter-intuitive and fun videos. Unlike most video QA benchmarks,
which focus on less surprising contexts (e.g., cooking or instructional videos),
FunQA covers three previously unexplored types of surprising videos: 1)
HumorQA, 2) CreativeQA, and 3) MagicQA. For each subset, we establish rigorous
QA tasks designed to assess a model's capability in counter-intuitive
timestamp localization, detailed video description, and reasoning about
counter-intuitiveness. We also pose higher-level tasks, such as assigning a
fitting and vivid title to the video and scoring its creativity. In
total, the FunQA benchmark consists of 312K free-text QA pairs derived from
4.3K video clips, spanning a total of 24 video hours. Moreover, we propose
FunMentor, an agent designed for Vision-Language Models (VLMs) that uses
multi-turn dialogues to enhance models' understanding of counter-intuitiveness.
Extensive experiments with existing VLMs demonstrate the effectiveness of
FunMentor and reveal significant performance gaps on FunQA videos across
spatial-temporal reasoning, visual-centered reasoning, and free-text
generation.