Videos have become an increasingly important part of our daily lives, spanning fields such as entertainment, education, and communication. Understanding the content of videos, however, is a challenging task because videos often contain multiple events occurring at different time scales. For example, a video of a musher attaching dogs to a dog sled before they all race away involves a long event (the dogs pulling the sled) and a short event (the dogs being attached to the sled). One way to advance research on video understanding is through the task of dense video captioning, which consists of the temporal localization and description of all events in a minutes-long video. This differs from single-image captioning and from standard video captioning, which consists of describing short videos with a single sentence.
Dense video captioning systems have wide applications, such as making videos accessible to people with visual or hearing impairments, automatically generating chapters for videos, or improving the search for moments of interest in large video databases. Current dense video captioning approaches, however, have several limitations. For example, they often contain highly specialized task-specific components, which makes them difficult to integrate into powerful foundation models. Furthermore, they are often trained exclusively on manually annotated datasets, which are very difficult to obtain, so this is not a scalable solution.
In this post, we introduce “Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning”, to appear at CVPR 2023. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. To pre-train this unified model, we leverage unlabeled narrated videos by reframing sentence boundaries of transcribed speech as pseudo-event boundaries, and using the transcribed speech sentences as pseudo-event captions. The resulting Vid2Seq model, pre-trained on millions of narrated videos, improves the state of the art on a variety of dense video captioning benchmarks, including YouCook2, ViTT and ActivityNet Captions. Vid2Seq also generalizes well to the few-shot dense video captioning setting, the video paragraph captioning task, and the standard video captioning task. Finally, we have released the code for Vid2Seq here.
|Vid2Seq is a visual language model that predicts dense event captions together with their temporal grounding in a video by generating a single sequence of tokens.|
A visual language model for dense video captioning
Multimodal transformer architectures have improved the state of the art on a wide range of video tasks, such as action recognition. However, it is not straightforward to adapt such an architecture to the complex task of jointly localizing and captioning events in minutes-long videos.
To achieve this, we augment a visual language model with special time tokens (akin to text tokens) that represent discretized timestamps in the video, similar to Pix2Seq in the spatial domain. Given visual inputs, the resulting Vid2Seq model can both take as input and generate sequences of text and time tokens. First, this enables the Vid2Seq model to understand the temporal information of the transcribed speech input, which is cast as a single sequence of tokens. Second, it enables Vid2Seq to jointly predict dense event captions and temporally ground them in the video, all while generating a single sequence of tokens.
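To make the time-token idea concrete, here is a minimal sketch of how event timestamps could be discretized into special tokens that share a vocabulary with text tokens, and how boundaries and captions interleave into one output sequence. The bin count, vocabulary size, and toy tokenizer are illustrative assumptions, not the paper's values.

```python
# Hypothetical sketch of Vid2Seq-style sequence construction: timestamps
# are mapped to discrete time tokens appended after the text vocabulary.
N_BINS = 100             # number of discrete time bins (assumed)
TEXT_VOCAB_SIZE = 32000  # size of the text vocabulary (assumed)

def time_token(t_seconds, video_duration):
    """Map a timestamp to one of N_BINS time-token ids after the text vocab."""
    bin_idx = min(int(t_seconds / video_duration * N_BINS), N_BINS - 1)
    return TEXT_VOCAB_SIZE + bin_idx

def build_target_sequence(events, video_duration, tokenize):
    """Interleave start/end time tokens with caption tokens for each event."""
    seq = []
    for start, end, caption in events:
        seq.append(time_token(start, video_duration))
        seq.append(time_token(end, video_duration))
        seq.extend(tokenize(caption))
    return seq

# Usage with a toy word-level tokenizer (illustration only):
tok = lambda s: [hash(w) % TEXT_VOCAB_SIZE for w in s.split()]
events = [(2.0, 15.0, "the dogs are attached to the sled"),
          (2.0, 58.0, "the dogs pull the sled")]
seq = build_target_sequence(events, video_duration=60.0, tokenize=tok)
```

Because time tokens live in the same vocabulary as text tokens, a single decoder can emit both without any task-specific localization head.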
The Vid2Seq architecture includes a visual encoder and a text encoder, which encode the video frames and the transcribed speech input, respectively. The resulting encodings are passed to a text decoder, which autoregressively predicts the output sequence of dense event captions together with their temporal localization in the video. The architecture is initialized with a powerful visual backbone and a strong language model.
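The encoder-decoder composition described above can be sketched structurally as follows. This is an illustrative skeleton, not the released implementation; all class names, ids, and stub behaviors are assumptions.

```python
# Structural sketch (assumed names) of the Vid2Seq component layout.
BOS, EOS = 0, 1  # assumed special token ids for illustration

class Vid2Seq:
    def __init__(self, visual_encoder, text_encoder, text_decoder):
        self.visual_encoder = visual_encoder  # encodes sampled video frames
        self.text_encoder = text_encoder      # encodes transcribed speech tokens
        self.text_decoder = text_decoder      # predicts the next output token

    def generate(self, frames, speech_tokens, max_len=128):
        # Encode both modalities, then autoregressively decode the output
        # sequence of interleaved time tokens and caption tokens.
        context = (self.visual_encoder(frames), self.text_encoder(speech_tokens))
        out = [BOS]
        while len(out) < max_len and out[-1] != EOS:
            out.append(self.text_decoder(context, out))
        return out[1:]

# Toy usage: stub components that emit three tokens and then stop.
model = Vid2Seq(visual_encoder=lambda f: sum(f),
                text_encoder=lambda s: sum(s),
                text_decoder=lambda ctx, out: 5 if len(out) < 4 else EOS)
tokens = model.generate(frames=[0.1, 0.2], speech_tokens=[7, 8])
```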
Large-scale pretraining on untrimmed narrated videos
Due to the dense nature of the task, the manual collection of annotations for dense video captioning is particularly expensive. Hence, we pre-train the Vid2Seq model using unlabeled narrated videos, which are easily available at scale. Specifically, we use the YT-Temporal-1B dataset, which includes 18 million narrated videos covering a wide range of domains.
We use transcribed speech sentences and their corresponding timestamps as supervision, cast as a single sequence of tokens. We pre-train Vid2Seq with a generative objective that teaches the decoder to predict the transcribed speech sequence given visual inputs only, and a denoising objective that encourages multimodal learning by requiring the model to predict masked tokens given a noisy transcribed speech sequence and visual inputs. In particular, noise is added to the speech sequence by randomly masking out spans of tokens.
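The span-masking noise can be illustrated with a small sketch in the style of T5 span corruption: random spans of the speech token sequence collapse into sentinel mask tokens that the model must then reconstruct. The mask id, span length, and masking probability below are assumptions for illustration, not the values used in the paper.

```python
import random

MASK_ID = -1  # sentinel id standing in for a span-mask token (assumed)

def mask_spans(tokens, span_len=3, mask_prob=0.1, seed=0):
    """Corrupt a token sequence by replacing random spans with MASK_ID.

    Each position may start a masked span with probability mask_prob;
    the whole span collapses to a single sentinel token.
    """
    rng = random.Random(seed)
    noisy, i = [], 0
    while i < len(tokens):
        if rng.random() < mask_prob:
            noisy.append(MASK_ID)  # hide the next span_len tokens
            i += span_len
        else:
            noisy.append(tokens[i])
            i += 1
    return noisy
```

The denoising objective then asks the decoder, given this corrupted sequence plus the visual inputs, to predict the tokens hidden behind each sentinel.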
|Vid2Seq is pre-trained on unlabeled narrated videos with a generative objective (top) and a denoising objective (bottom).|
Results on downstream dense video captioning benchmarks
The resulting pre-trained Vid2Seq model can be fine-tuned on downstream tasks with a simple maximum likelihood objective using teacher forcing (i.e., predicting the next token given the previous ground-truth tokens). After fine-tuning, Vid2Seq substantially improves the state of the art on three standard downstream dense video captioning benchmarks (ActivityNet Captions, YouCook2 and ViTT) and two video clip captioning benchmarks (MSR-VTT, MSVD). In our paper we provide additional ablation studies, qualitative results, as well as results in few-shot settings and on the video paragraph captioning task.
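Teacher forcing with a maximum likelihood objective can be illustrated with a toy example: at every step the decoder is conditioned on the ground-truth prefix rather than on its own predictions, and the loss sums the negative log-probability of each ground-truth next token. The uniform toy model below is an assumption purely for illustration.

```python
import math

def teacher_forcing_nll(target, predict_next):
    """Sum of -log p(token_t | ground-truth tokens < t)."""
    nll = 0.0
    for t in range(1, len(target)):
        probs = predict_next(target[:t])  # distribution given the gold prefix
        nll += -math.log(probs[target[t]])
    return nll

# Usage with a uniform toy model over a 4-token vocabulary: each of the
# three predicted steps contributes -log(0.25) to the loss.
uniform = lambda prefix: {tok: 0.25 for tok in range(4)}
loss = teacher_forcing_nll([0, 1, 2, 3], uniform)
```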
|Comparison to state-of-the-art methods for dense video captioning (left) and for video clip captioning (right), in CIDEr score (higher is better).|
We introduced Vid2Seq, a novel visual language model for dense video captioning that simply predicts all event boundaries and captions as a single sequence of tokens. Vid2Seq can be effectively pre-trained on unlabeled narrated videos at scale, and achieves state-of-the-art results on various downstream dense video captioning benchmarks. Learn more from the paper and grab the code here.
This research was conducted by Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic and Cordelia Schmid.