🔥 [NEW!] Free Lunch from Behavior - We show that training VLMs on receiver behavior data (likes, comments, replay graphs) improves their content understanding abilities.
🔥 [NEW!] BLIFT Dataset - We release BLIFT, a new Behaviour-LLaVA IFT dataset comprising 730k images and videos with their receiver behavior.
🔥 [NEW!] Strong Performance - Our approach outperforms many supervised baselines by up to 150% across 46 different tasks on 26 benchmark datasets.
🔥 [NEW!] Zero Human Annotation - Since receiver behavior is collected by default on the internet, the performance improvements come essentially for free.

Abstract

Communication is defined as "Who says what to whom with what effect." A message from a communicator generates downstream effects on its receivers, known as receiver behavior. Because it is a downstream effect of the message, receiver behavior carries rich signals about the message itself. Despite this, behavior signals are typically ignored when training vision language models (VLMs). We show that training VLMs on receiver behavior can improve their content understanding abilities. Specifically, training VLMs to predict receiver behaviors such as likes, comments, and replay graphs, which are available at scale, enhances their performance on a broad range of downstream content understanding tasks. We demonstrate this improvement over 6 types of behavior and 46 different tasks covering image, video, text, and audio across 26 benchmark datasets, in both 0-shot and fine-tuning settings, outperforming many supervised baselines on diverse tasks ranging from emotion recognition to captioning by up to 150%. Since receiver behavior such as likes, comments, and replay graphs is collected by default on the internet and requires no human annotation to be useful, the performance improvement we get after training on this data is essentially a free lunch. We also release BLIFT, our Behaviour-LLaVA IFT dataset comprising 730k images and videos with their receiver behavior collected from multiple platforms, on which we train our models to achieve these results.

Five Factors of Communication

The diagram depicts the five factors of communication in the context of an example YouTube video, showing where the free lunch lies. The receiver effect contains important signals that can help in understanding content.

The diagram shows how receiver behavior (comments, likes, etc.) contains valuable signals about content, including temporal, cognitive, character, context, and user opinion information. This data, which is collected by default on internet platforms, is often ignored during VLM training but can significantly enhance content understanding.

Performance Across Content Understanding Tasks

Behavior-LLaVA achieves significant improvements across a diverse range of content understanding tasks:

Task | 0-Shot Improvement over LLaMA-VID
LVU | 21.49%
Video Ad Understanding | 43.18%
Video Emotion | 51.85%
Image and Video Memorability | 186.4%
Video QA | 0.6%
Image Emotion | 29.14%
Image Dense Captioning | 4.95%
HVU | 5.88%
Audio Summarization | 30%
Sentiment Analysis | 4.73%

Free Lunch from Receiver Behavior

Examples showing how receiver behavior provides rich content understanding signals:

Example showing how YouTube comments provide temporal, cognitive, character, and context information about a video.

Example showing how like patterns correspond to emotional content in videos.

Performance comparison on downstream tasks:

Comparison of models trained with and without behavior data on emotion recognition tasks.

Zero-shot performance improvements across 26 benchmark datasets.

BLIFT: Behaviour-LLaVA IFT Dataset

We release our BLIFT dataset on Hugging Face [HuggingFace Dataset]. The dataset comprises 730k images and videos with their receiver behavior, collected from multiple platforms.

Behavior Type | Description | Source | Samples
Likes | Content that received varying levels of like engagement | Social media platforms | [Number]
Comments | User comments containing insights about content | YouTube, social media | [Number]
Replay Graphs | Information about how users engage with video content over time | Video platforms | [Number]
Audience Retention | Data on viewer retention patterns | Streaming platforms | [Number]
Sharing Patterns | How content spreads across platforms | Cross-platform analysis | [Number]
User Reactions | Emotional responses to content | Reaction data | [Number]

The BLIFT dataset is unique because it leverages data that is collected by default on internet platforms, requiring no additional human annotation effort. This "free lunch" data contains rich signals that help models better understand content context, emotional impact, and semantic meaning.
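As a rough illustration, the snippet below shows how the dataset could be loaded with the Hugging Face datasets library once the release link is live. The repository ID and the commented field names are placeholders we assume for this sketch, not confirmed identifiers; refer to the official [HuggingFace Dataset] link for the actual repository and schema.

    # Minimal sketch: loading BLIFT via the Hugging Face `datasets` library.
    # NOTE: "behavior-in-the-wild/BLIFT" is a placeholder repository ID; check the
    # official release for the actual identifier and schema.
    from datasets import load_dataset

    blift = load_dataset("behavior-in-the-wild/BLIFT", split="train")

    sample = blift[0]
    print(sample.keys())  # e.g., media reference, likes, comments, replay graph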

Methodology

Model Architecture

Architecture diagram of Behavior-LLaVA showing how the model is trained on receiver behavior signals.

We build on the LLaVA architecture, training the model to predict receiver behavior from content. This enables the model to learn rich representations that capture subtle aspects of content that elicit specific behaviors.
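To make the training signal concrete, here is a minimal sketch of how a single (content, behavior) pair might be phrased as a LLaVA-style instruction-tuning conversation. The prompt wording, field names, and the way behavior is verbalized are our assumptions for illustration, not the authors' released data format.

    # Hedged sketch: turning scraped receiver behavior into a LLaVA-style
    # instruction-tuning sample. Prompt text and JSON layout are assumptions.
    def make_behavior_sample(image_path, likes, views, comments):
        behavior_answer = (
            f"This content received about {likes} likes out of {views} views. "
            "Representative comments: " + " | ".join(comments)
        )
        return {
            "image": image_path,
            "conversations": [
                {"from": "human",
                 "value": "<image>\nPredict the receiver behavior (likes and comments) this content would elicit."},
                {"from": "gpt", "value": behavior_answer},
            ],
        }

    sample = make_behavior_sample("thumbnail.jpg", likes=12000, views=300000,
                                  comments=["Loved the ending!", "0:42 is iconic"])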

Training Approach

Our training approach involves two key phases, illustrated by the sketch after this list:

  1. Behavior Prediction Training: The model is trained to predict receiver behaviors (likes, comments, etc.) from content.
  2. Knowledge Transfer: The learned representations are leveraged for downstream content understanding tasks.
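The toy sketch below restates these two phases in code. The stub finetune function, model state, and data are placeholders standing in for real instruction-tuning runs, not the authors' training pipeline.

    # Schematic only: a stand-in for the two-phase flow described above.
    def finetune(model_state, samples, objective):
        """Placeholder for an instruction-tuning run; records what was trained on."""
        return {**model_state, "trained_on": model_state["trained_on"] + [objective]}

    base_vlm = {"name": "llava-style-vlm", "trained_on": []}

    # Phase 1: behavior prediction training on BLIFT-style (content, behavior) pairs.
    behavior_data = [{"image": "frame.jpg", "target": "likes=12k; top comments=[...]"}]
    vlm = finetune(base_vlm, behavior_data, objective="predict_receiver_behavior")

    # Phase 2: knowledge transfer -- reuse the same weights on downstream
    # content-understanding tasks, either zero-shot or with task-specific tuning.
    downstream_data = [{"image": "ad.jpg", "target": "emotion=joy"}]
    vlm = finetune(vlm, downstream_data, objective="downstream_task")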

Evaluation Protocol

We evaluate across 46 different tasks on 26 benchmark datasets, covering both zero-shot and fine-tuning scenarios. Tasks span multiple modalities including image, video, text, and audio understanding.
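As one illustrative example, a zero-shot pass over a single benchmark can be scored with a loop like the one below. The generate callable and the crude substring-match scoring rule are simplifying assumptions for this sketch, not the paper's actual evaluation harness.

    # Hedged sketch of a generic zero-shot evaluation loop over one benchmark.
    # `generate(image_path, prompt) -> str` is assumed to wrap the trained VLM.
    def zero_shot_accuracy(generate, benchmark):
        """benchmark: list of (image_path, prompt, reference_answer) tuples."""
        correct = 0
        for image_path, prompt, reference in benchmark:
            prediction = generate(image_path, prompt)
            correct += int(reference.lower() in prediction.lower())  # crude match
        return correct / max(len(benchmark), 1)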

Key Findings

Charts showing how training on behavior data improves performance across diverse tasks.
  • Behavior data provides a "free lunch" for improving content understanding
  • Performance improvements are consistent across modalities
  • Even small amounts of behavior data can yield significant improvements
  • Different types of behavior (likes vs. comments vs. replay patterns) contribute differently to various tasks

BibTeX


@article{singh2024teaching,
  title={Teaching Human Behavior Improves Content Understanding Abilities Of LLMs},
  author={Singh, Somesh and SI, Harini and Singla, Yaman K and Baths, Veeky and Shah, Rajiv Ratn and Chen, Changyou and Krishnamurthy, Balaji},
  journal={arXiv preprint arXiv:2405.00942},
  year={2024}
}
      

Terms Of Service

Users are required to agree to the following terms before using the service:
The service is a research preview. It only provides limited safety measures and may generate offensive content. It must not be used for any illegal, harmful, violent, racist, or sexual purposes. Please do not upload any private information. The service collects user dialogue data, including both text and images, and reserves the right to distribute it under a Creative Commons Attribution (CC-BY) or a similar license.
Usage is restricted to research and non-commercial purposes. Users must comply with applicable privacy and data protection laws when uploading any content.

Acknowledgement

We thank Adobe for their generous sponsorship.
We thank the LLaVA team for the foundation model upon which our work is built.
We also thank the teams behind all the datasets and benchmarks used in our evaluation.