This NTT-funded project, conducted in collaboration with Shiro Kumano and Hiromi Narimatsu at Nippon Telegraph and Telephone (NTT), investigates whether large language models (LLMs) can serve as proxies in emotion research by estimating human emotional states from multimodal responses.
In an initial study presented at ACII 2025, we examined whether GPT-4o could infer emotions from text. Participants viewed artworks, reported their emotional states using the valence–arousal–dominance (VAD) framework, and provided written descriptions of their perceptions. GPT-4o was then prompted to estimate emotional intensity from both the writer’s (self) and a reader’s (third-party) perspective, and its predictions were compared with human self-reports and reader judgments.
The results show strong alignment with the average judgments of human readers for valence, moderate alignment for arousal, and weaker alignment for dominance. Notably, agreement on arousal between the writer and reader perspectives depended on whether the texts shared a similar emotional context, whereas variations in prompt design, including intermediate reasoning steps, did not improve performance. Overall, the findings suggest that, with appropriate task framing, LLMs such as GPT-4o can effectively support pilot studies in cognitive and affective research. We are now extending this work to other media.