Pipeline
Long video datasets are difficult to construct due to the substantial manual effort required to select, watch, understand, and annotate long videos with free-form natural language. Answering challenging questions about longer videos is often a multimodal task that may involve listening to the audio track in addition to watching the video. It may also be a non-linear task, because it is sometimes necessary to rewind and rewatch key parts to answer a question. Proposing suitable high-level questions that are not trivially solved by observing just a few frames is also difficult for people to do consistently and with sufficient variety.
To address this problem, we propose a semi-automatic pipeline that first generates candidate multiple-choice questions using several strong vision-language models (VLMs) and large language models (LLMs) with carefully designed prompts, and then lets human annotators filter and correct the proposed questions to reduce errors and bias. To reduce human effort, we leverage automatic tools to (1) find suitable videos, (2) extract useful signals, and then (3) automatically generate video-level captions, questions, and answers.
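The three automated steps could be orchestrated roughly as follows. This is a minimal sketch, not the actual implementation: the callables `is_suitable`, `extract_signals`, and `generate_qa` are hypothetical placeholders for the filtering, signal-extraction, and question-generation tools described in this section.

```python
from typing import Callable, Dict, List


def run_pipeline(
    video_ids: List[str],
    is_suitable: Callable[[str], bool],          # step (1): find suitable videos
    extract_signals: Callable[[str], Dict],      # step (2): ASR, frame captions, ...
    generate_qa: Callable[[Dict], List[Dict]],   # step (3): captions -> QA items
) -> List[Dict]:
    """Skeleton of the automatic stages; human verification happens downstream."""
    dataset: List[Dict] = []
    for vid in video_ids:
        if not is_suitable(vid):
            continue
        signals = extract_signals(vid)
        dataset.extend(generate_qa(signals))
    return dataset
```

Keeping each stage behind a callable keeps the skeleton agnostic to which VLM or LLM backs it, which matches the pipeline's use of several different models.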
Our pipeline begins with the selection of video content. We filter videos to increase visual and demographic diversity. We also remove videos with mostly static content, as well as gaming videos and animated content. In the next stage, we extract two types of captions from the resulting videos: automatic speech recognition (ASR) captions and frame captions. For the latter, we prompt a VLM to describe video frames sampled at one frame per second. The next step summarizes these captions by segmenting the video into shots, grouping them by topic, and prompting an LLM to summarize ASR and frame-level captions into shot-level captions.
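The caption-extraction stage can be sketched as below, under stated assumptions: `vlm_describe` and `llm_summarize` are hypothetical wrappers around real model endpoints, and the prompt text is illustrative rather than the one actually used.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Shot:
    start_s: int                # shot start time, in seconds
    end_s: int                  # shot end time (exclusive)
    frame_captions: List[str]   # one VLM caption per sampled frame
    asr_text: str               # ASR transcript over the shot


def extract_frame_captions(duration_s: int,
                           vlm_describe: Callable[[int], str]) -> List[str]:
    """Caption frames sampled at one frame per second."""
    return [vlm_describe(t) for t in range(duration_s)]


def summarize_shot(shot: Shot, llm_summarize: Callable[[str], str]) -> str:
    """Merge ASR and frame-level captions into a single shot-level caption."""
    prompt = (
        "Summarize this video shot in one caption.\n"
        f"ASR: {shot.asr_text}\n"
        "Frame captions:\n" + "\n".join(shot.frame_captions)
    )
    return llm_summarize(prompt)
```

Note that shot segmentation and topic grouping are assumed to happen upstream; this sketch only shows how per-second frame captions and ASR text feed the shot-level summary.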
Given these captions, the pipeline generates multiple-choice questions in two stages. In the first stage, we prompt an LLM to generate a set of challenging questions and answers, providing it with the video captions as context. In the second stage, we prompt the LLM with a generated question-answer pair and the video captions, and ask it to generate four decoy answers. Decoys should be incorrect but plausible answers to the question. The final stage of the pipeline is human verification, where we ask human raters to filter or correct incorrect questions, answers, and decoys.
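The two-stage question generation could look roughly like this. The prompts, the `Q:`/`A:` output format, and the `llm` callable are all assumptions made for the sake of the sketch, not the paper's actual prompts.

```python
from typing import Callable, List, Tuple


def generate_mcq(captions: str,
                 llm: Callable[[str], str],
                 n_decoys: int = 4) -> Tuple[str, str, List[str]]:
    """Two-stage MCQ generation: question+answer first, then decoys."""
    # Stage 1: a challenging question and its answer, grounded in the captions.
    qa = llm(
        "Given these video captions, write one challenging question and its "
        "answer as 'Q: ...' on one line and 'A: ...' on the next.\n" + captions
    )
    question, answer = [line.split(": ", 1)[1] for line in qa.splitlines()[:2]]
    # Stage 2: decoys conditioned on the QA pair and the captions;
    # each should be incorrect but plausible.
    decoys = [
        llm(
            f"Captions: {captions}\nQuestion: {question}\n"
            f"Correct answer: {answer}\n"
            f"Write plausible but incorrect answer #{i + 1}."
        )
        for i in range(n_decoys)
    ]
    return question, answer, decoys
```

Conditioning the decoy prompt on the correct answer is what pushes the model toward "plausible but incorrect" alternatives; the outputs then go to human raters for filtering and correction.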
