Get in touch with us at behavior-in-the-wild@googlegroups.com
Communication is defined as "who says what to whom with what effect." A message from a communicator generates downstream effects on its receivers, also known as receiver behavior. Because receiver behavior is a downstream effect of the message, it carries rich signals about the message itself. Yet this behavior signal is routinely ignored when training vision-language models (VLMs). We show that training VLMs on receiver behavior improves their content-understanding abilities: teaching VLMs to predict receiver behaviors such as likes, comments, and replay graphs, which are available at scale, enhances performance across a broad range of downstream content understanding tasks. We demonstrate this improvement over 6 types of behavior and 46 different tasks covering image, video, text, and audio on 26 benchmark datasets, in both 0-shot and fine-tuning settings, outperforming many supervised baselines on tasks ranging from emotion recognition to captioning by up to 150%. Since receiver behavior such as likes, comments, and replay graphs is collected by default on the internet and requires no human annotation, the resulting performance improvement is essentially a free lunch. We also release BLIFT, our Behavior-LLaVA IFT dataset comprising 730k images and videos with their receiver behavior collected from multiple platforms, on which we train our models.
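To make the idea concrete, below is a minimal, hypothetical sketch of how a scraped post and its receiver behavior (likes, comments, replay graph) could be packaged into a LLaVA-style instruction-tuning record. The field names, prompt wording, and thresholds are illustrative assumptions, not the exact BLIFT schema or our training prompts.

```python
# Hypothetical sketch: packaging receiver behavior (likes, comments, replay
# graph) into a LLaVA-style instruction-tuning record. Field names and prompt
# wording are illustrative assumptions, not the exact BLIFT schema.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Post:
    media_path: str                 # path to the image or video file
    caption: str
    likes: int
    views: int
    comments: List[str]
    replay_graph: Optional[List[float]] = None  # per-second replay fraction


def to_ift_sample(post: Post, top_k_comments: int = 5) -> dict:
    """Convert one scraped post into an instruction/response pair that asks
    the model to predict receiver behavior from the media and caption."""
    like_rate = post.likes / max(post.views, 1)
    target = {
        "like_rate": round(like_rate, 4),
        "top_comments": post.comments[:top_k_comments],
    }
    if post.replay_graph is not None:
        # Seconds replayed more than average serve as a proxy for salient moments.
        mean = sum(post.replay_graph) / len(post.replay_graph)
        target["salient_seconds"] = [
            i for i, v in enumerate(post.replay_graph) if v > mean
        ]
    return {
        "media": post.media_path,
        "conversations": [
            {"from": "human",
             "value": "<media>\nCaption: " + post.caption +
                      "\nPredict how viewers will react: likes per view, "
                      "likely comments, and the most replayed moments."},
            {"from": "gpt", "value": str(target)},
        ],
    }
```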
The diagram shows how receiver behavior (comments, likes, etc.) contains valuable signals about content, including temporal, cognitive, character, context, and user opinion information. This data, which is collected by default on internet platforms, is often ignored during VLM training but can significantly enhance content understanding.
Behavior-LLaVA achieves significant improvements across a diverse range of content understanding tasks:
| Task | 0-Shot Improvement over LLaMA-VID |
|---|---|
| LVU | 21.49% |
| Video Ad Understanding | 43.18% |
| Video Emotion | 51.85% |
| Image and Video Memorability | 186.4% |
| Video QA | 0.6% |
| Image Emotion | 29.14% |
| Image Dense Captioning | 4.95% |
| HVU | 5.88% |
| Audio Summarization | 30% |
| Sentiment Analysis | 4.73% |
Below are qualitative examples demonstrating Behavior-LLaVA's understanding of aesthetics, characters, world knowledge, emotion, and spatial relationships. The red text in the descriptions highlights these key aspects captured by the model.
We are releasing our dataset, BLIFT, on HuggingFace. The dataset comprises 730k images and videos with their receiver behavior collected from multiple platforms.
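As a quick start, a sketch like the following could be used to stream the data with the Hugging Face `datasets` library; the repository ID and column names shown here are placeholders, so please check the dataset card on HuggingFace for the actual values.

```python
# Minimal sketch for streaming BLIFT with the Hugging Face `datasets` library.
# NOTE: the repository ID below is a placeholder; see the dataset card for the real one.
from itertools import islice
from datasets import load_dataset

ds = load_dataset("behavior-in-the-wild/BLIFT", split="train", streaming=True)  # hypothetical repo ID

for example in islice(ds, 3):
    print(example.keys())  # inspect available fields (media, caption, behavior, ...)
```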
@article{singh2024teaching,
title={Teaching Human Behavior Improves Content Understanding Abilities Of LLMs},
author={Singh, Somesh and SI, Harini and Singla, Yaman K and Baths, Veeky and Shah, Rajiv Ratn and Chen, Changyou and Krishnamurthy, Balaji},
journal={arXiv preprint arXiv:2405.00942},
year={2024}
}
Users are required to agree to the following terms before using the service:
The service is a research preview. It only provides limited safety measures and may generate offensive content. It must not be used for any illegal, harmful, violent, racist, or sexual purposes. Please do not upload any private information. The service collects user dialogue data, including both text and images, and reserves the right to distribute it under a Creative Commons Attribution (CC-BY) or a similar license.
Usage is restricted to research and non-commercial purposes. Users must comply with applicable privacy and data protection laws when uploading any content.
We thank Adobe for their generous sponsorship.
We thank the LLaVA team for the foundation model upon which our work is built.
We also thank the teams behind all the datasets and benchmarks used in our evaluation.