Vid2Seq: a pre-trained visual language model for multi-event video description – Google AI Blog

Videos have become an increasingly important part of our daily lives, spanning areas such as entertainment, education and communication. Understanding the content of videos is a challenging task, however, because videos often contain multiple events that occur at different time scales. For example, a video showing dogs being hooked to a dog sled before they all run includes a long event (the dogs pulling the sled) and a short event (the dogs being hooked to the sled). One way to advance research in video understanding is the task of dense video captioning, which consists of temporally localizing and describing all events in a minutes-long video. This differs from single-image captioning and standard video captioning, which describe a short video with a single sentence.

Dense video captioning systems have a wide range of applications, such as making videos accessible to people with visual or hearing impairments, automatically generating chapters for videos, or improving the search for video moments in large databases. Current dense video captioning approaches, however, have several limitations. For example, they often contain highly specialized task-specific components that make them difficult to integrate into powerful foundation models. Moreover, they are often trained exclusively on manually annotated datasets, which are very difficult to obtain, so this is not a scalable solution.

In this post, we introduce "Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning," to appear at CVPR 2023. The Vid2Seq architecture augments the language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. To pre-train this unified model, we use unlabeled narrated videos by reformulating the sentence boundaries of transcribed speech as pseudo-event boundaries, and using the transcribed speech sentences as pseudo-event captions. The resulting Vid2Seq model, pre-trained on millions of narrated videos, improves on the state of the art on a number of dense video captioning benchmarks, including YouCook2, ViTT, and ActivityNet Captions. Vid2Seq also generalizes well to the few-shot dense video captioning setting, the video paragraph captioning task, and the standard video captioning task. Finally, we have also released the Vid2Seq code here.

Vid2Seq is a visual language model that predicts dense event captions together with their temporal grounding in the video, generating a single sequence of tokens.

A visual language model for dense video captioning

Multimodal transformer architectures have advanced the state of the art in a wide range of video tasks, such as action recognition. However, it is not straightforward to adapt such an architecture to the complex task of jointly localizing and captioning events in minutes-long videos.

To achieve this, we augment a visual language model with special time tokens (like text tokens) that represent discretized timestamps in the video, similar to Pix2Seq in the spatial domain. Given visual inputs, the resulting Vid2Seq model can both take as input and generate sequences of text and time tokens. First, this enables Vid2Seq to understand the temporal information of the transcribed speech input, which is provided as a single sequence of tokens. Second, this allows Vid2Seq to jointly predict dense event captions and temporally ground them in the video, generating a single sequence of tokens.
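As an illustrative sketch of the time-token idea, timestamps can be quantized into a fixed number of bins appended after the text vocabulary. The bin count and vocabulary size below are made-up values for illustration, not the paper's actual configuration:

```python
# Hypothetical sketch: discretizing continuous timestamps into special
# time tokens, in the spirit of Vid2Seq / Pix2Seq. NUM_TIME_BINS and
# TEXT_VOCAB_SIZE are illustrative assumptions.

NUM_TIME_BINS = 100       # number of special time tokens (assumption)
TEXT_VOCAB_SIZE = 32_000  # size of the base text vocabulary (assumption)

def time_to_token(t_seconds: float, video_duration: float) -> int:
    """Map a timestamp to a discrete time-token id placed after the text vocab."""
    t = min(max(t_seconds, 0.0), video_duration)
    bin_idx = min(int(t / video_duration * NUM_TIME_BINS), NUM_TIME_BINS - 1)
    return TEXT_VOCAB_SIZE + bin_idx

def token_to_time(token_id: int, video_duration: float) -> float:
    """Invert the mapping, returning the start time of the corresponding bin."""
    bin_idx = token_id - TEXT_VOCAB_SIZE
    return bin_idx / NUM_TIME_BINS * video_duration

# In a 60 s video, an event starting at 12 s maps to bin 20:
start_token = time_to_token(12.0, 60.0)  # 32_000 + 20
```

Because the time tokens live in the same vocabulary as text tokens, the decoder can emit event boundaries and captions interchangeably in one output sequence.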

The Vid2Seq architecture includes a visual encoder and a text encoder that encode the video frames and the transcribed speech input, respectively. The resulting encodings are then forwarded to a text decoder that auto-regressively predicts the output sequence of event captions along with their temporal localization in the video. The architecture is initialized with a powerful visual backbone and a strong language model.

Vid2Seq model overview: we formulate dense event captioning as a sequence-to-sequence problem, using special time tokens to allow the model to seamlessly understand and generate sequences of tokens containing both textual semantic information and temporal localization information, grounding each text sentence in the video.
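The serialization described above can be sketched as follows, where each event contributes a start time token, an end time token, and its caption to one flat target sequence (the `<tNN>` token names and the 100-bin quantization are illustrative assumptions):

```python
# Hypothetical sketch: serializing multiple localized events into a single
# output sequence, as in dense video captioning framed as seq2seq.

def serialize_events(events, duration, num_bins=100):
    """events: list of (start_s, end_s, caption) tuples. Returns one flat string."""
    parts = []
    for start, end, caption in sorted(events):
        start_bin = min(int(start / duration * num_bins), num_bins - 1)
        end_bin = min(int(end / duration * num_bins), num_bins - 1)
        parts.append(f"<t{start_bin}> <t{end_bin}> {caption}")
    return " ".join(parts)

events = [
    (12.0, 18.0, "the dogs are hooked to the sled"),
    (18.0, 55.0, "the dogs pull the sled"),
]
print(serialize_events(events, duration=60.0))
# <t20> <t30> the dogs are hooked to the sled <t30> <t91> the dogs pull the sled
```

A single decoder pass over such a sequence yields both the event boundaries and their descriptions, with no task-specific localization head.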

Large-scale pre-training on untrimmed narrated videos

Due to the dense nature of the task, manual collection of dense video caption annotations is particularly expensive. Therefore, we pre-train the Vid2Seq model using unlabeled narrated videos that are easily accessible at scale. Specifically, we use the YT-Temporal-1B dataset, which includes 18 million narrated videos spanning a wide range of domains.

As supervision, we use transcribed speech sentences and their corresponding timestamps, provided as a single sequence of tokens. We pre-train Vid2Seq with a generative objective that trains the decoder to predict the transcribed speech sequence given only visual inputs, and a denoising objective that encourages multimodal learning by requiring the model to predict masked tokens given a noisy transcribed speech sequence and visual inputs. In particular, noise is added to the speech sequence by randomly masking spans of tokens.
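A minimal sketch of span masking, in the T5 style where each masked span is replaced by a sentinel token and the target sequence reconstructs the masked content (the sentinel naming, fixed span length, and explicit span positions are illustrative simplifications; in practice spans are sampled randomly):

```python
# Hypothetical sketch of span corruption for a denoising objective.
# span_starts is passed explicitly here for clarity; a real pipeline
# would sample span positions and lengths at random.

def mask_spans(tokens, span_starts, span_len=3, sentinel_prefix="<extra_id_"):
    """Replace the spans beginning at `span_starts` (each `span_len` tokens long)
    with sentinel tokens. Returns (corrupted_input, target_sequence)."""
    corrupted, target = [], []
    starts = sorted(span_starts)
    i, s = 0, 0
    while i < len(tokens):
        if s < len(starts) and i == starts[s]:
            sentinel = f"{sentinel_prefix}{s}>"
            corrupted.append(sentinel)
            target.append(sentinel)
            target.extend(tokens[i:i + span_len])  # the masked-out content
            i += span_len
            s += 1
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, target

tokens = "<t20> <t30> the dogs pull the sled".split()
corrupted, target = mask_spans(tokens, span_starts=[2])
# corrupted: ['<t20>', '<t30>', '<extra_id_0>', 'the', 'sled']
# target:    ['<extra_id_0>', 'the', 'dogs', 'pull']
```

The model receives the corrupted sequence (plus the video frames) as input and is trained to generate the target sequence, forcing it to use the visual modality to fill in the masked speech.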

Vid2Seq is pre-trained on unlabeled narrated videos with a generative objective (top) and a denoising objective (bottom).

Results on downstream dense video captioning benchmarks

The resulting pre-trained Vid2Seq model can be fine-tuned on downstream tasks with a simple maximum likelihood objective using teacher forcing (i.e., predicting the next token given the previous ground-truth tokens). After fine-tuning, Vid2Seq significantly improves the state of the art on three standard dense video captioning benchmarks (ActivityNet Captions, YouCook2, and ViTT) and two video clip captioning benchmarks (MSR-VTT, MSVD). In our paper, we provide additional ablation studies, qualitative results, as well as results in the few-shot setting and on the video paragraph captioning task.
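The teacher-forcing objective can be illustrated as summing the negative log-probability of each ground-truth token, where the model is conditioned on the gold prefix at every step (the toy vocabulary and probabilities below are made up for illustration):

```python
# Hypothetical sketch of the maximum likelihood / teacher forcing loss:
# at step t the model has seen the ground-truth tokens 0..t-1 and we
# penalize -log p(target_t).

import math

def teacher_forcing_nll(logprobs, target_ids):
    """logprobs[t][v] is log p(token v at step t), computed with the
    ground-truth prefix as decoder input. Returns the summed NLL."""
    return -sum(logprobs[t][v] for t, v in enumerate(target_ids))

# Toy example: a 3-token vocabulary and a 2-step target sequence [0, 1].
logprobs = [
    {0: math.log(0.7), 1: math.log(0.2), 2: math.log(0.1)},
    {0: math.log(0.1), 1: math.log(0.8), 2: math.log(0.1)},
]
loss = teacher_forcing_nll(logprobs, target_ids=[0, 1])
# loss = -(log 0.7 + log 0.8)
```

At inference time, by contrast, the decoder consumes its own previous predictions rather than the ground truth.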

Comparison with state-of-the-art methods for dense video captioning (left) and for video clip captioning (right), on the CIDEr metric (higher is better).


Conclusion

We present Vid2Seq, a new visual language model for dense video captioning that simply predicts all event boundaries and captions as a single sequence of tokens. Vid2Seq can be effectively pre-trained on unlabeled narrated videos at scale, and achieves state-of-the-art results on various downstream dense video captioning benchmarks. Learn more from the paper and grab the code here.


Acknowledgements

This research was conducted by Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic and Cordelia Schmid.
