Human daily activities can be concisely narrated as sequences of routine events (e.g., turning off an alarm) in video streams, forming an event vocabulary. Motivated by this, we introduce **VLog**, a novel video understanding framework that defines video narrations as a vocabulary, going beyond the typical subword vocabularies in existing generative video-language models. Built on the lightweight language model GPT-2, **VLog** features three key innovations:

1. **A Generative Retrieval Model**, marrying the language model's complex reasoning capabilities with contrastive retrieval's efficient similarity search.
2. **A Hierarchical Vocabulary**, derived from large-scale video narrations using our narration pair encoding algorithm, enabling efficient indexing of specific events (e.g., cutting a tomato) by identifying broader scenarios (e.g., kitchen) with expressive postfixes (e.g., by the left hand); see the illustrative sketch below.
3. **A Vocabulary Update Strategy**, leveraging generative models to extend the vocabulary for novel events encountered during inference.

To validate our approach, we introduce **VidCab-Eval**, a development set requiring concise narrations with reasoning relationships (e.g., before and after). Experiments on **EgoSchema**, **COIN**, and **HiREST** further demonstrate the effectiveness of **VLog**, highlighting its ability to generate concise, contextually accurate, and efficient narrations. This offers a novel perspective on video understanding.
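
The narration pair encoding algorithm is only named here, not specified. As a rough illustration, the following Python sketch assumes it behaves analogously to byte pair encoding, iteratively merging the most frequent adjacent narration pair across videos into a composite vocabulary entry; the function name, data layout, and `num_merges` parameter are hypothetical and not taken from the VLog codebase.

```python
# Illustrative sketch only: a BPE-style merge over narration sequences,
# assuming narration pair encoding works analogously to byte pair encoding.
# All identifiers here are hypothetical, not from the VLog implementation.
from collections import Counter

def narration_pair_encoding(videos, num_merges=3):
    """videos: list of narration sequences, e.g. [["enter kitchen", "cut tomato"], ...]"""
    vocab = {n for seq in videos for n in seq}  # start from atomic narrations
    sequences = [list(seq) for seq in videos]
    for _ in range(num_merges):
        # Count adjacent narration pairs across all videos.
        pairs = Counter(
            (seq[i], seq[i + 1]) for seq in sequences for i in range(len(seq) - 1)
        )
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged = f"{a} -> {b}"  # new composite vocabulary entry
        vocab.add(merged)
        # Replace each occurrence of the winning pair with the merged entry.
        for seq in sequences:
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [merged]
                else:
                    i += 1
    return vocab

demo = [
    ["enter kitchen", "cut tomato", "wash hands"],
    ["enter kitchen", "cut tomato", "boil water"],
]
print(narration_pair_encoding(demo, num_merges=2))
```

Under this reading, frequently co-occurring events collapse into higher-level entries (e.g., a kitchen routine), which is one plausible way a hierarchical vocabulary could index specific events under broader scenarios.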