Teaching Human Behavior Improves Content Understanding Abilities Of VLMs

Somesh Singh*, Harini S I*, Yaman K Singla*, Changyou Chen, Rajiv Ratn Shah, Veeky Baths, Balaji Krishnamurthy
* Equal Contribution
Adobe Media and Data Science Research (MDSR), IIITD, SUNY at Buffalo, BITS Pilani

ICLR 2025

Get in touch with us at behavior-in-the-wild@googlegroups.com

Abstract

Communication is defined as "Who says what to whom with what effect." A message from a communicator generates downstream receiver effects, also known as behavior. Receiver behavior, being a downstream effect of the message, carries rich signals about it. Despite carrying these signals, behavior is often ignored when training vision language models. We show that training VLMs on receiver behavior can help improve their content-understanding abilities. We demonstrate that training VLMs to predict receiver behaviors, such as likes, comments, and replay graphs, which are available at scale, enhances the VLM's performance across a broad range of downstream content understanding tasks. We show this performance increase over 6 types of behavior and 46 different tasks covering image, video, text, and audio, over 26 benchmark datasets in both 0-shot and fine-tuning settings, outperforming many supervised baselines on diverse tasks ranging from emotion recognition to captioning by up to 150%. We note that since receiver behavior, such as likes, comments, and replay graphs, is collected by default on the internet and does not need any human annotations to be useful, the performance improvement we get after training on this data is essentially a free lunch. We also release BLIFT, our Behaviour-LLaVA IFT dataset comprising 730k images and videos with their receiver behavior collected from multiple platforms, on which we train our models to achieve this.

Five Factors of Communication

The diagram depicts the five factors of communication in the context of an example YouTube video, showing where the free lunch lies. The receiver effect contains important signals that can help in understanding content.

The diagram shows how receiver behavior (comments, likes, etc.) contains valuable signals about content, including temporal, cognitive, character, context, and user opinion information. This data, which is collected by default on internet platforms, is often ignored during VLM training but can significantly enhance content understanding.

Performance Across Content Understanding Tasks

Behavior-LLaVA achieves significant improvements across a diverse range of content understanding tasks:

Task                            0-Shot Improvement over Llama-Vid
LVU                             21.49%
Video Ad Understanding          43.18%
Video Emotion                   51.85%
Image and Video Memorability    186.4%
Video QA                        0.6%
Image Emotion                   29.14%
Image Dense Captioning          4.95%
HVU                             5.88%
Audio Summarization             30%
Sentiment Analysis              4.73%
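
The page does not state how these percentages are computed. Assuming they are relative improvements over the baseline score (an assumption, not something confirmed here), the arithmetic can be sketched as follows. The function name and the example scores are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Percentage change of `improved` relative to `baseline`.

    Assumes a higher-is-better metric; this is an illustrative definition,
    not necessarily the one used for the table above.
    """
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (improved - baseline) / abs(baseline)


# Hypothetical scores, chosen only to illustrate the calculation.
baseline_score = 40.0
new_score = 50.0
print(f"{relative_improvement(baseline_score, new_score):.2f}%")
```

Under this definition, an improvement above 100% means the new score is more than double the baseline score.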

Free Lunch from Receiver Behavior

Below are qualitative examples demonstrating Behavior-LLaVA's understanding of aesthetics, characters, world knowledge, emotion, and spatial relationships. The red text in the descriptions highlights these key aspects captured by the model.

BLIFT: Behaviour-LLaVA IFT Dataset

We are releasing our dataset, BLIFT, on Hugging Face. The dataset comprises 730k images and videos with their receiver behavior collected from multiple platforms.
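
To make the idea of behavior instruction fine-tuning concrete, here is a minimal sketch of how one media item and its receiver behavior might be turned into a prompt/response pair. The record layout, field names, prompt wording, and the `<media>` placeholder are all hypothetical, chosen for illustration; they are not the actual BLIFT schema or the authors' training template.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BehaviorSample:
    """Hypothetical record: one media item plus the receiver behavior it drew."""
    media_path: str   # image or video file (placeholder path)
    title: str        # communicator-side text, e.g. a video title
    likes: int
    views: int
    comments: List[str] = field(default_factory=list)


def to_instruction_pair(sample: BehaviorSample, max_comments: int = 3) -> Dict[str, str]:
    """Build a prompt/response pair whose target is the receiver behavior."""
    prompt = (
        "<media>\n"  # assumed placeholder for the visual tokens
        f"Title: {sample.title}\n"
        "Predict how viewers responded: give the like-to-view percentage "
        "and a few likely comments."
    )
    # Guard against items with no recorded views.
    like_rate = 100.0 * sample.likes / sample.views if sample.views else 0.0
    top_comments = sample.comments[:max_comments]
    response = f"Like rate: {like_rate:.1f}%\n" + "\n".join(
        f"Comment: {c}" for c in top_comments
    )
    return {"media": sample.media_path, "prompt": prompt, "response": response}


# Usage with made-up values.
sample = BehaviorSample(
    media_path="clip.mp4",
    title="Demo video",
    likes=50,
    views=1000,
    comments=["Great editing!", "The ending surprised me."],
)
pair = to_instruction_pair(sample)
print(pair["prompt"])
print(pair["response"])
```

The point of the sketch is that the supervision target comes entirely from behavior logged by the platform (likes, views, comments), so no human annotation step is needed to build the pair.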

BibTeX

@article{singh2024teaching,
  title={Teaching Human Behavior Improves Content Understanding Abilities Of LLMs},
  author={Singh, Somesh and SI, Harini and Singla, Yaman K and Baths, Veeky and Shah, Rajiv Ratn and Chen, Changyou and Krishnamurthy, Balaji},
  journal={arXiv preprint arXiv:2405.00942},
  year={2024}
}

Terms Of Service

Users are required to agree to the following terms before using the service:
The service is a research preview. It only provides limited safety measures and may generate offensive content. It must not be used for any illegal, harmful, violent, racist, or sexual purposes. Please do not upload any private information. The service collects user dialogue data, including both text and images, and reserves the right to distribute it under a Creative Commons Attribution (CC-BY) or a similar license.
Usage is restricted to research and non-commercial purposes. Users must comply with applicable privacy and data protection laws when uploading any content.

Acknowledgement

We thank Adobe for their generous sponsorship.
We thank the LLaVA team for the foundation model upon which our work is built.
We also thank the teams behind all the datasets and benchmarks used in our evaluation.