Emotion Recognition in Conversation (ERC) is essential for dialogue systems in human-computer interaction. Most existing studies focus on modeling contextual information from historical interactions but often overlook the effective integration of speaker and content information. To address these challenges, we propose the "Three Ws" concept (Who, When, and What, representing speaker, context, and content information) to comprehensively capture emotional cues from historical interactions. Building on this concept, we introduce a novel model for ERC. We also incorporate a speaker similarity loss to enhance speaker information. Experimental results show that our model outperforms baselines, with each component making a significant contribution, particularly context information, and the speaker similarity loss further improves ERC performance. Notably, the "Three Ws" concept remains robust in both single-modal and multimodal scenarios.