EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models


1College of Computer Science and Software Engineering, Shenzhen University

2Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)

3SKL-IOTSC, CIS, University of Macau

4Auckland University of Technology

5Institute of Automation, Chinese Academy of Sciences

What is EmoBench-M?

EmoBench-M is a multimodal benchmark that combines video, audio, and text to assess foundational emotion recognition, conversational emotion understanding, and socially complex emotion analysis. Covering 13 real-world scenarios, it addresses the gap left by static or single-modality datasets in dynamic emotion recognition.
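To make the data format concrete, the sketch below shows one way a single multimodal evaluation item could be represented in Python. The field names and example values are illustrative assumptions, not the released schema of EmoBench-M.

from dataclasses import dataclass
from typing import List

@dataclass
class EmoBenchItem:
    # Hypothetical schema for one multimodal evaluation item;
    # field names and values are illustrative, not the released format.
    video_path: str               # short clip carrying visual cues
    audio_path: str               # speech/prosody track for the clip
    transcript: str               # textual content of the utterances
    dimension: str                # one of the three EI dimensions
    scenario: str                 # one of the 13 evaluation scenarios
    candidate_labels: List[str]   # answer options presented to the model
    gold_label: str               # ground-truth emotion or judgment

item = EmoBenchItem(
    video_path="clips/0001.mp4",
    audio_path="clips/0001.wav",
    transcript="I can't believe you remembered!",
    dimension="foundational emotion recognition",
    scenario="basic emotion recognition",
    candidate_labels=["happy", "sad", "angry", "surprised"],
    gold_label="happy",
)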


Taxonomy for Evaluating Emotional Intelligence (EI) Capabilities of Multimodal Large Language Models (MLLMs): The diagram outlines the categories of “Foundational Emotion Recognition”, “Conversational Emotion Understanding”, and “Socially Complex Emotion Analysis”, along with their respective evaluation scenarios. It also presents a performance comparison of different methods on the proposed EmoBench-M dataset. The “random” baseline refers to a heuristic approach that randomly selects a label from the available candidates.
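A minimal sketch of such a random baseline is shown below. The function name, the assumption of uniform sampling, and the 7-way label set are illustrative choices, not specifications from the benchmark.

import random

def random_baseline(candidate_labels, num_items, seed=0):
    # Pick one label per item uniformly at random from the candidate set,
    # mirroring the heuristic that the "random" row refers to.
    rng = random.Random(seed)
    return [rng.choice(candidate_labels) for _ in range(num_items)]

# Example: a hypothetical 7-way emotion recognition scenario.
labels = ["happy", "sad", "angry", "fear", "disgust", "surprise", "neutral"]
predictions = random_baseline(labels, num_items=100)

With uniform sampling over k candidate labels, the expected accuracy of this baseline is 1/k (about 14.3% for seven labels).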

Abstract

With the integration of multimodal large language models (MLLMs) into robotic systems and various AI applications, embedding emotional intelligence (EI) capabilities into these models is essential for enabling robots to effectively address human emotional needs and interact seamlessly in real-world scenarios. While benchmarks for evaluating EI in MLLMs have been developed, they primarily focus on static, text-based, or text-image tasks, overlooking the multimodal complexities of real-world interactions, which often involve video, audio, and dynamic contexts. Based on established psychological theories of EI, we build EmoBench-M, a novel benchmark designed to evaluate the EI capability of MLLMs across 13 evaluation scenarios from three key dimensions: foundational emotion recognition, conversational emotion understanding, and socially complex emotion analysis. Evaluations of open-source and closed-source MLLMs on EmoBench-M reveal a significant performance gap between MLLMs and humans across many scenarios, underscoring the need for further advancements in the EI capabilities of MLLMs. All benchmark resources, including code and datasets, will be publicly released.

Results

We report model performance on EmoBench-M across the three levels of emotional intelligence. The first table covers foundational emotion recognition, the second conversational emotion understanding, and the third socially complex emotion analysis.

Overall, closed-source models outperform open-source counterparts across all dimensions, with Gemini-2.0-Flash achieving the best average performance. While some open-source models like Qwen2-Audio-7B-Instruct and InternVL2.5-38B show competitive results in specific tasks, a significant performance gap remains—particularly in socially complex scenarios—highlighting the challenges of achieving human-level emotional intelligence.
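As a rough illustration of how per-scenario scores roll up into the averages discussed above, the sketch below averages within each dimension and then across dimensions. This is one plausible aggregation, not necessarily the weighting used in the tables, and all names and numbers are placeholders rather than actual benchmark results.

from statistics import mean

def average_scores(scores):
    # scores maps dimension -> {scenario: accuracy in [0, 1]}.
    # Average within each dimension, then across dimensions.
    dim_avg = {dim: mean(s.values()) for dim, s in scores.items()}
    overall = mean(dim_avg.values())
    return dim_avg, overall

example = {
    "foundational emotion recognition": {"scenario_a": 0.70, "scenario_b": 0.65},
    "conversational emotion understanding": {"scenario_c": 0.58},
    "socially complex emotion analysis": {"scenario_d": 0.49, "scenario_e": 0.52},
}
print(average_scores(example))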

BibTeX

@article{hu2025emobench,
  title={EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models},
  author={Hu, He and Zhou, Yucheng and You, Lianzhong and Xu, Hongbo and Wang, Qianning and Lian, Zheng and Yu, Fei Richard and Ma, Fei and Cui, Laizhong},
  journal={arXiv preprint arXiv:2502.04424},
  year={2025}
}