Recent advancements in Large Language Models (LLMs) have driven the development of Video Large Multimodal Models (VLMMs). While Supervised Fine-Tuning (SFT) for multimodal alignment between video and text has shown promise, challenges persist. The primary obstacle is the scarcity of high-quality video-text instruction-tuning data, which often results in responses that are poorly grounded in the video. Addressing this is crucial for successfully applying VLMMs to various real-world video understanding tasks.
We first fine-tune an LLM, e.g., Vicuna, using supervised learning on synthetically generated video-text instruction-tuning data. This involves integrating a vision encoder, two linear projection layers, and additional learnable parameters (via LoRA) into the training process. In particular, we improve the SFT process by expanding the instruction-tuning data and introducing a simple curriculum learning scheme. We refer to this fine-tuned model as the Video Large Multimodal Model with SFT, or VLM-SFT for short.
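To make this concrete, below is a minimal PyTorch sketch of the two trainable pieces described here: a two-layer projector that maps vision-encoder features into the LLM embedding space, and a LoRA adapter wrapped around a frozen linear layer of the LLM. The dimensions, rank, scaling, and the GELU between the projection layers are illustrative assumptions, not the exact values used in our setup.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA adapter: frozen base weight plus a low-rank trainable update."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep the pretrained weight frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as a no-op on top of the base layer
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


class VideoToLLMProjector(nn.Module):
    """Two linear layers mapping vision-encoder features into the LLM embedding space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),                        # assumed activation between the two layers
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, video_feats):           # (batch, num_video_tokens, vision_dim)
        return self.proj(video_feats)          # (batch, num_video_tokens, llm_dim)
```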
A key aspect of RLAIF is leveraging a pre-trained AI model to generate human-like preferences between different responses to the same input. To obtain these preferences, we employ the VLM-SFT as a judge. Once preferences are collected, we train a reward model (RM) on them. The RM assigns a higher scalar reward to the better response and a lower one to the less appropriate response in each pair, thereby guiding the policy model during reinforcement learning.
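As a sketch, such an RM can be trained with a standard pairwise ranking loss (Bradley-Terry style, as commonly used in RLHF), which pushes the preferred response's score above the rejected one's; the hidden size and last-token readout below are assumptions for illustration, not our exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Scalar reward head on top of the backbone's last hidden states."""
    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, last_hidden):            # (batch, seq_len, hidden_dim)
        return self.score(last_hidden[:, -1])   # read the reward at the final token

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss: score the preferred response above the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# usage with hypothetical hidden states for a batch of chosen/rejected response pairs
head = RewardHead()
h_chosen = torch.randn(4, 128, 4096)
h_rejected = torch.randn(4, 128, 4096)
loss = pairwise_reward_loss(head(h_chosen), head(h_rejected))
```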
We finally fine-tune a policy model, initialized from the VLM-SFT, to maximize the scalar reward output of the trained RM via reinforcement learning (PPO). We call this trained model the Video Large Multimodal Model with RLAIF, or VLM-RLAIF for short.
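A common way to combine the RM score with PPO is to subtract a KL penalty that keeps the policy close to its SFT initialization. The sketch below shows this reward shaping under assumed tensor shapes and an assumed coefficient `beta`; it is illustrative of the standard KL-regularized formulation, not our exact recipe.

```python
import torch

def kl_shaped_rewards(rm_score, logprobs_policy, logprobs_sft, beta: float = 0.1):
    """
    Per-response reward used for PPO: the reward model's scalar score minus a
    KL penalty that keeps the policy close to the SFT initialization.
    `beta` is an assumed coefficient, not a value from the paper.
    """
    kl = (logprobs_policy - logprobs_sft).sum(dim=-1)  # approximate KL per response
    return rm_score - beta * kl

# hypothetical per-token log-probabilities for a batch of 4 responses, 32 tokens each
rm_score = torch.randn(4)
lp_policy = torch.randn(4, 32)
lp_sft = torch.randn(4, 32)
rewards = kl_shaped_rewards(rm_score, lp_policy, lp_sft)
```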
For the VLM-SFT to select preferences grounded in the video, we argue that a detailed understanding of the video content is necessary for more accurate and contextually relevant decisions. We propose integrating detailed video descriptions, termed context, into the preference selection workflow to enhance the VLMM's contextual clarity. This context allows the VLM-SFT to better understand the video content and identify the most suitable response. Integrating the context with the instruction inputs using a specific prompt, as shown in the dotted boxes in Figure 2 (right), facilitates the collection of context-aware preferences.
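The snippet below sketches how such a context-aware judge prompt might be assembled; the template wording and field names are hypothetical placeholders, not the exact prompt used in our pipeline.

```python
# A hypothetical prompt template for context-aware preference labeling.
JUDGE_TEMPLATE = """You are given a detailed description (context) of a video.
Context: {context}

Instruction: {instruction}

Response A: {response_a}
Response B: {response_b}

Considering the context, which response answers the instruction more faithfully?
Answer with "A" or "B"."""

def build_judge_prompt(context: str, instruction: str,
                       response_a: str, response_b: str) -> str:
    """Assemble the context-aware judge prompt fed to the VLM-SFT for preference selection."""
    return JUDGE_TEMPLATE.format(
        context=context,
        instruction=instruction,
        response_a=response_a,
        response_b=response_b,
    )
```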
The figure on the right illustrates the three stages of the proposed context-aware reward modeling.
We quantitatively evaluate VLMMs on a video-based generative performance benchmark that measures five criteria of the generated text, demonstrating the effectiveness of our proposed VLM-RLAIF.
We qualitatively compare VLM-SFT and VLM-RLAIF in the figure below, highlighting their multimodal understanding capabilities. VLM-RLAIF consistently yields more accurate answers than VLM-SFT; accurate responses are highlighted in blue and less accurate ones in red.
@inproceedings{ahnCYKC24,
author = {Ahn, Daechul and Choi, Yura and Yu, Youngjae and Kang, Dongyeop and Choi, Jonghyun},
title = {Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback},
booktitle = {ACL},
year = {2024},
}