Get in touch with us at behavior-in-the-wild@googlegroups.com
Communication is defined as "Who says what to whom with what effect." A message from a communicator generates downstream receiver effects, also known as behavior. Receiver behavior, being a downstream effect of the message, carries rich signals about it. Yet despite carrying these signals, receiver behavior is usually ignored when training vision-language models (VLMs). We show that training VLMs on receiver behavior can improve their content-understanding abilities: training VLMs to predict receiver behaviors such as likes, comments, and replay graphs, which are available at scale, enhances performance across a broad range of downstream content-understanding tasks. We demonstrate this improvement over 6 types of behavior and 46 tasks covering image, video, text, and audio on 26 benchmark datasets, in both zero-shot and fine-tuning settings, outperforming many supervised baselines on tasks ranging from emotion recognition to captioning by up to 150%. Since receiver behavior such as likes, comments, and replay graphs is collected by default on the internet and needs no human annotation to be useful, the performance improvement from training on this data is essentially a free lunch. We also release BLIFT, our Behavior-LLaVA IFT dataset comprising 730k images and videos with their receiver behavior collected from multiple platforms, on which we train our models to achieve these results.
The diagram shows how receiver behavior (comments, likes, etc.) contains valuable signals about content, including temporal, cognitive, character, context, and user opinion information. This data, which is collected by default on internet platforms, is often ignored during VLM training but can significantly enhance content understanding.
Behavior-LLaVA achieves significant improvements across a diverse range of content understanding tasks:
| Task | 0-Shot Improvement over LLaMA-VID |
|---|---|
| LVU | 21.49% |
| Video Ad Understanding | 43.18% |
| Video Emotion | 51.85% |
| Image and Video Memorability | 186.4% |
| Video QA | 0.6% |
| Image Emotion | 29.14% |
| Image Dense Captioning | 4.95% |
| HVU | 5.88% |
| Audio Summarization | 30% |
| Sentiment Analysis | 4.73% |
Examples showing how receiver behavior provides rich content-understanding signals.
Performance comparison on downstream tasks.
We are releasing our BLIFT dataset in Hugging Face datasets format [HuggingFace Dataset]. The dataset comprises 730k images and videos with their receiver behavior collected from multiple platforms.
| Behavior Type | Description | Source | Samples |
|---|---|---|---|
| Likes | Content that received varying levels of like engagement | Social media platforms | [Number] |
| Comments | User comments containing insights about content | YouTube, social media | [Number] |
| Replay Graphs | Information about how users engage with video content over time | Video platforms | [Number] |
| Audience Retention | Data on viewer retention patterns | Streaming platforms | [Number] |
| Sharing Patterns | How content spreads across platforms | Cross-platform analysis | [Number] |
| User Reactions | Emotional responses to content | Reaction data | [Number] |
The BLIFT dataset is unique because it leverages data that is collected by default on internet platforms, requiring no additional human annotation effort. This "free lunch" data contains rich signals that help models better understand content context, emotional impact, and semantic meaning.
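As a rough illustration, the snippet below sketches how the released data could be loaded and inspected with the Hugging Face `datasets` library. The repository id and field names are placeholders, not the final schema; consult the dataset card for the actual details.

```python
# Minimal sketch, assuming a repository id like "behavior-in-the-wild/BLIFT" (placeholder)
# and self-descriptive columns. Check the dataset card for the real id and schema.
from datasets import load_dataset

blift = load_dataset("behavior-in-the-wild/BLIFT", split="train")  # placeholder repo id

sample = blift[0]
print(sample.keys())           # e.g. media reference, platform, likes, comments, replay graph
print(sample.get("comments"))  # receiver comments attached to this image/video (assumed field)
```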
We build on the LLaVA architecture, training the model to predict receiver behavior from content. This enables the model to learn rich representations that capture subtle aspects of content that elicit specific behaviors.
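To make the idea concrete, here is a hedged sketch of how a (content, behavior) pair might be turned into a LLaVA-style instruction-tuning conversation. The prompt wording and the record fields (`media_path`, `likes`, `comments`) are illustrative assumptions, not the exact templates or schema used for Behavior-LLaVA.

```python
# Illustrative only: convert one BLIFT-style record into a LLaVA-style training
# conversation that asks the model to predict receiver behavior from the content.
def to_behavior_conversation(record: dict) -> dict:
    top_comments = "\n".join(record["comments"][:3])  # a few receiver comments as targets
    return {
        "image": record["media_path"],  # assumed field name
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nPredict how viewers would react to this content: "
                         "estimate the like count and write the kind of comments it would receive.",
            },
            {
                "from": "gpt",
                "value": f"Estimated likes: {record['likes']}\nLikely comments:\n{top_comments}",
            },
        ],
    }
```

Framing behavior prediction as an ordinary instruction-following conversation lets the standard LLaVA training pipeline be reused without architectural changes.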
Our training approach involves two key phases.
We evaluate across 46 different tasks on 26 benchmark datasets, covering both zero-shot and fine-tuning scenarios. Tasks span multiple modalities including image, video, text, and audio understanding.
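For context, a zero-shot evaluation in this setup typically amounts to prompting the behavior-trained model with the downstream task question and scoring its free-form answer. The sketch below assumes a generic `model.generate(image=..., prompt=...)` interface and generic benchmark fields; it is not the exact evaluation harness used in the paper.

```python
# Hedged sketch of a zero-shot evaluation loop. `model.generate` stands in for whatever
# inference call the released checkpoint exposes; "image", "question", and "answer"
# are generic placeholder fields for a benchmark example.
def zero_shot_accuracy(model, benchmark) -> float:
    correct = 0
    for example in benchmark:
        prediction = model.generate(image=example["image"], prompt=example["question"])
        correct += int(example["answer"].lower() in prediction.lower())
    return correct / len(benchmark)
```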
@article{singh2024teaching,
title={Teaching Human Behavior Improves Content Understanding Abilities Of LLMs},
author={Singh, Somesh and SI, Harini and Singla, Yaman K and Baths, Veeky and Shah, Rajiv Ratn and Chen, Changyou and Krishnamurthy, Balaji},
journal={arXiv preprint arXiv:2405.00942},
year={2024}
}
Users are required to agree to the following terms before using the service:
The service is a research preview. It only provides limited safety measures and may generate offensive content. It must not be used for any illegal, harmful, violent, racist, or sexual purposes. Please do not upload any private information. The service collects user dialogue data, including both text and images, and reserves the right to distribute it under a Creative Commons Attribution (CC-BY) or a similar license.
Usage is restricted to research and non-commercial purposes. Users must comply with applicable privacy and data protection laws when uploading any content.
We thank Adobe for their generous sponsorship.
We thank the LLaVA team for the foundation model upon which our work is built.
We also thank the teams behind all the datasets and benchmarks used in our evaluation.