IJCAI.2024 - Humans and AI

Total: 16

#1 ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Authors: Gilles Baechler ; Srinivas Sunkara ; Maria Wang ; Fedir Zubach ; Hassan Mansoor ; Vincent Etter ; Victor Carbune ; Jason Lin ; Jindong Chen ; Abhanshu Sharma

Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multipage DocVQA, WebSRC, and MoTIF), and new best-in-class performance on others (ChartQA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.
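
The screen-annotation-to-LLM data generation step can be pictured with the small sketch below; the `UIElement` schema, field names, and output wording are illustrative assumptions, not the paper's actual format. Detected UI elements (type, text, location) are serialized into a textual screen description that a Large Language Model could then turn into QA, navigation, or summarization examples.

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    kind: str    # e.g. "BUTTON", "TEXT", "IMAGE" (illustrative labels)
    text: str    # visible text, empty if none
    box: tuple   # (x0, y0, x1, y1), normalized to [0, 1]

def screen_to_text(elements):
    """Serialize screen annotations into a textual description that could be
    fed to an LLM to generate QA / navigation / summarization data.
    The schema and phrasing are illustrative, not the paper's format."""
    lines = [f"{e.kind} '{e.text}' at {tuple(round(v, 2) for v in e.box)}"
             for e in elements]
    return "\n".join(lines)

screen = [UIElement("TEXT", "Welcome back", (0.20, 0.10, 0.80, 0.18)),
          UIElement("BUTTON", "Sign in", (0.35, 0.80, 0.65, 0.88))]
print(screen_to_text(screen))
```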

#2 Multi-level Disentangling Network for Cross-Subject Emotion Recognition Based on Multimodal Physiological Signals

Authors: Ziyu Jia ; Fengming Zhao ; Yuzhe Guo ; Hairong Chen ; Tianzi Jiang

Emotion recognition based on multimodal physiological signals is attracting more and more attention. However, handling the consistency and heterogeneity of multimodal physiological signals and accounting for individual differences across subjects pose two significant challenges. In this paper, we propose a Multi-level Disentangling Network named MDNet for cross-subject emotion recognition based on multimodal physiological signals. Specifically, MDNet consists of a modality-level disentangling module and a subject-level disentangling module. The modality-level disentangling module projects multimodal physiological signals into a modality-invariant subspace and a modality-specific subspace, capturing modality-invariant features and modality-specific features. The subject-level disentangling module separates subject-shared features and subject-private features among different subjects from multimodal data, which facilitates cross-subject emotion recognition. Experiments on two multimodal emotion datasets demonstrate that MDNet outperforms other state-of-the-art baselines.
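
A minimal sketch of the modality-level disentangling idea, under my own simplifying assumptions: each modality is projected into a shared (modality-invariant) subspace and a private (modality-specific) subspace; the alignment loss that would pull the invariant features together is omitted. The `ModalityDisentangler` class and all dimensions are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class ModalityDisentangler(nn.Module):
    """Toy modality-level disentangling: per-modality projections into a shared
    (modality-invariant) subspace and a private (modality-specific) subspace.
    A similarity loss aligning the invariant features is not shown."""
    def __init__(self, in_dims, hidden=64):
        super().__init__()
        self.shared = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in in_dims.items()})
        self.private = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in in_dims.items()})

    def forward(self, inputs):
        invariant = {m: torch.relu(self.shared[m](x)) for m, x in inputs.items()}
        specific = {m: torch.relu(self.private[m](x)) for m, x in inputs.items()}
        return invariant, specific

# Example: EEG (32-d) and peripheral-signal (4-d) feature vectors for a batch of 8.
model = ModalityDisentangler({"eeg": 32, "peripheral": 4})
inv, spec = model({"eeg": torch.randn(8, 32), "peripheral": torch.randn(8, 4)})
```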

#3 VSGT: Variational Spatial and Gaussian Temporal Graph Models for EEG-based Emotion Recognition

Authors: Chenyu Liu ; Xinliang Zhou ; Jiaping Xiao ; Zhengri Zhu ; Liming Zhai ; Ziyu Jia ; Yang Liu

Electroencephalogram (EEG), which directly reflects the emotional activity of the brain, has been increasingly utilized for emotion recognition. Most works exploit the spatial and temporal dependencies in EEG to learn emotional feature representations, but they still have two limitations that prevent them from reaching their full potential. First, prior knowledge is rarely used to capture the spatial dependency of brain regions. Second, the cross-temporal dependency between consecutive time slices for different brain regions is ignored. To address these limitations, in this paper, we propose Variational Spatial and Gaussian Temporal (VSGT) graph models to investigate the spatial and temporal dependencies for EEG-based emotion recognition. The VSGT has two key components: the Variational Spatial Encoder (VSE) and the Gaussian Temporal Encoder (GTE). The VSE leverages the upper bound theorem to identify the dynamic spatial dependency based on prior knowledge through the variational Bayesian method. In addition, the GTE exploits a conditional Gaussian graph transform that computes the comprehensive temporal dependency between consecutive time slices. Finally, the VSGT utilizes a recurrent structure to calculate the spatial and temporal dependencies for all time slices. Extensive experiments show the superiority of VSGT over state-of-the-art methods on multiple EEG datasets.
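
A conditional-Gaussian transition between consecutive time slices, rolled out recurrently, can be sketched generically as follows; the `GaussianTemporalStep` module and its reparameterized sampling are my own simplification and do not reproduce the GTE's graph transform.

```python
import torch
import torch.nn as nn

class GaussianTemporalStep(nn.Module):
    """Toy conditional-Gaussian transition: the state at the next time slice is
    sampled from N(mu(h_t), sigma(h_t)^2) via the reparameterization trick.
    A generic sketch, not the GTE's exact formulation."""
    def __init__(self, dim=32):
        super().__init__()
        self.mu = nn.Linear(dim, dim)
        self.log_var = nn.Linear(dim, dim)

    def forward(self, h):
        mu, log_var = self.mu(h), self.log_var(h)
        return mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)

step = GaussianTemporalStep()
h = torch.randn(4, 32)          # 4 samples, 32-d latent state (illustrative)
for _ in range(5):              # recurrent roll-out over five time slices
    h = step(h)
```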

#4 Concept-Level Causal Explanation Method for Brain Function Network Classification

Authors: Jinduo Liu ; Feipeng Wang ; Junzhong Ji

Using deep models to classify brain functional networks (BFNs) for the auxiliary diagnosis and treatment of brain diseases has become increasingly popular. However, the lack of explainability of deep models has seriously hindered their application in computer-aided diagnosis. In addition, current explanation methods mostly focus on natural images and cannot be directly used to explain deep models for BFN classification. In this paper, we propose a concept-level causal explanation method for BFN classification called CLCEM. First, CLCEM employs a causal learning method to extract concepts that are meaningful to humans from BFNs. Second, it aggregates the same concepts to obtain the contribution of each concept to the model output. Finally, CLCEM sums the contributions of all concepts to make a diagnosis. The experimental results show that our CLCEM can not only accurately identify brain regions related to specific brain diseases but also make decisions based on the concepts of these brain regions, which enables humans to understand the decision-making process without performance degradation.
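
The aggregation-and-diagnosis step can be illustrated with a toy example; the causal concept extraction itself is not shown, and the function and numbers below are illustrative, not CLCEM's implementation.

```python
import numpy as np

def concept_level_score(contributions, concept_ids, n_concepts):
    """Pool per-feature contributions that share a concept label, then sum the
    concept-level contributions into a single decision score.
    A toy illustration of the aggregation step, not CLCEM itself."""
    per_concept = np.zeros(n_concepts)
    for value, cid in zip(contributions, concept_ids):
        per_concept[cid] += value
    return per_concept, per_concept.sum()

contribs = np.array([0.4, -0.1, 0.3, 0.2])   # contributions of four features
concepts = np.array([0, 1, 0, 2])            # each feature maps to a concept
per_concept, score = concept_level_score(contribs, concepts, n_concepts=3)
print(per_concept, score)
```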

#5 LitE-SNN: Designing Lightweight and Efficient Spiking Neural Network through Spatial-Temporal Compressive Network Search and Joint Optimization

Authors: Qianhui Liu ; Jiaqi Yan ; Malu Zhang ; Gang Pan ; Haizhou Li

Spiking Neural Networks (SNNs) mimic the information-processing mechanisms of the human brain and are highly energy-efficient, making them well-suited for low-power edge devices. However, the pursuit of accuracy in current studies leads to large, long-timestep SNNs, conflicting with the resource constraints of these devices. To design lightweight and efficient SNNs, we propose a new approach named LitE-SNN that incorporates both spatial and temporal compression into the automated network design process. Spatially, we present a novel Compressive Convolution block (CompConv) to expand the search space to support pruning and mixed-precision quantization. Temporally, we are the first to propose a compressive timestep search to identify the optimal number of timesteps under specific computation cost constraints. Finally, we formulate a joint optimization to simultaneously learn the architecture parameters and spatial-temporal compression strategies, achieving high performance while minimizing memory and computation costs. Experimental results on the CIFAR-10, CIFAR-100, and Google Speech Commands datasets demonstrate that our proposed LitE-SNN achieves competitive or even higher accuracy with remarkably smaller model sizes and lower computation costs.
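
A hedged sketch of what a joint objective of this kind can look like: a task loss plus an expected computation cost, with a softmax over candidate timestep counts standing in for the compressive timestep search. The `joint_objective` function, the cost vector, and `lam` are illustrative assumptions, not the paper's formulation.

```python
import torch

def joint_objective(task_loss, timestep_logits, per_timestep_cost, lam=0.01):
    """Toy joint objective: task loss plus an expected computation cost, where a
    softmax over candidate timestep counts makes the timestep choice
    differentiable. Symbols and lambda are illustrative, not the paper's."""
    probs = torch.softmax(timestep_logits, dim=0)
    expected_cost = (probs * per_timestep_cost).sum()
    return task_loss + lam * expected_cost

loss = joint_objective(torch.tensor(0.85),
                       timestep_logits=torch.zeros(4, requires_grad=True),
                       per_timestep_cost=torch.tensor([1.0, 2.0, 3.0, 4.0]))
loss.backward()   # gradients flow into the timestep-choice parameters
```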

#6 Designing Behavior-Aware AI to Improve the Human-AI Team Performance in AI-Assisted Decision Making

Authors: Syed Hasan Amin Mahmood ; Zhuoran Lu ; Ming Yin

With the rapid development of decision aids that are driven by AI models, the practice of AI-assisted decision making has become increasingly prevalent. To improve the human-AI team performance in decision making, earlier studies mostly focus on enhancing humans' capability in better utilizing a given AI-driven decision aid. In this paper, we tackle this challenge through a complementary approach—we aim to train "behavior-aware AI" by adjusting the AI model underlying the decision aid to account for humans' behavior in adopting AI advice. In particular, as humans are observed to accept AI advice more when their confidence in their own judgement is low, we propose to train AI models with a human-confidence-based instance weighting strategy, instead of solving the standard empirical risk minimization problem. Under an assumed, threshold-based model characterizing when humans will adopt the AI advice, we first derive the optimal instance weighting strategy for training AI models. We then validate the efficacy and robustness of our proposed method in improving the human-AI joint decision making performance through systematic experimentation on synthetic datasets. Finally, via randomized experiments with real human subjects along with their actual behavior in adopting the AI advice, we demonstrate that our method can significantly improve the decision making performance of the human-AI team in practice.
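
The instance-weighting idea can be sketched as follows, assuming a simple threshold model in which the human adopts the AI advice only when their self-confidence falls below a threshold; the weighting scheme, `tau`, and `floor` below are illustrative stand-ins, not the optimal weights derived in the paper.

```python
import numpy as np

def confidence_weighted_loss(losses, human_confidence, tau=0.6, floor=0.1):
    """Weighted empirical risk: instances where the human is likely to adopt the
    AI advice (low self-confidence) receive larger weights.
    `tau` and `floor` are illustrative hyper-parameters, not the paper's values."""
    weights = np.where(human_confidence < tau, 1.0, floor)
    return float(np.sum(weights * losses) / np.sum(weights))

# Toy example: per-instance model losses and the decision maker's self-confidence.
losses = np.array([0.9, 0.2, 0.7, 0.4])
conf = np.array([0.3, 0.8, 0.5, 0.9])
print(confidence_weighted_loss(losses, conf))
```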

#7 DBPNet: Dual-Branch Parallel Network with Temporal-Frequency Fusion for Auditory Attention Detection

Authors: Qinke Ni ; Hongyu Zhang ; Cunhang Fan ; Shengbing Pei ; Chang Zhou ; Zhao Lv

Auditory attention decoding (AAD) aims to recognize the attended speaker from electroencephalography (EEG) signals in multi-talker environments. Most AAD methods focus only on the temporal or the frequency domain and neglect the relationships between the two, which makes it impossible to consider time-varying and spectral-spatial information simultaneously. To address this issue, this paper proposes a dual-branch parallel network with temporal-frequency fusion for AAD, named DBPNet, which consists of a temporal attentive branch and a frequency residual branch. Specifically, the temporal attentive branch captures the time-varying features in the EEG time-series signal, while the frequency residual branch extracts spectral-spatial features of multi-band EEG signals through residual convolution. Finally, the two branches are fused so that both the time-varying and the spectral-spatial features of the EEG signals contribute to the classification result. Experimental results show that, compared with the best baseline, DBPNet achieves a relative improvement of 20.4% with a 0.1-second decision window on the MM-AAD dataset, while the number of trainable parameters is reduced by a factor of about 91.
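
A minimal sketch of a dual-branch design in this spirit: a self-attention temporal branch over the EEG sequence and a residual convolutional branch over band-power maps, fused for classification. All layer choices, shapes, and the `DualBranchAAD` class are illustrative assumptions rather than DBPNet's actual architecture.

```python
import torch
import torch.nn as nn

class DualBranchAAD(nn.Module):
    """Toy dual-branch model: a temporal attentive branch over the raw EEG
    sequence and a frequency residual branch over band-power maps, fused for
    binary attended-speaker classification. Dimensions are illustrative."""
    def __init__(self, channels=64, bands=5, d_model=32):
        super().__init__()
        self.proj = nn.Linear(channels, d_model)
        self.temporal = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.freq = nn.Sequential(
            nn.Conv2d(bands, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, bands, 3, padding=1))
        self.head = nn.Linear(d_model + bands * 8 * 8, 2)

    def forward(self, eeg, bandpower):
        # eeg: (B, T, channels); bandpower: (B, bands, 8, 8) spatial map
        t = self.temporal(self.proj(eeg)).mean(dim=1)        # time-varying features
        f = (self.freq(bandpower) + bandpower).flatten(1)     # residual spectral-spatial features
        return self.head(torch.cat([t, f], dim=1))

model = DualBranchAAD()
logits = model(torch.randn(2, 128, 64), torch.randn(2, 5, 8, 8))
```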

#8 ReliaAvatar: A Robust Real-Time Avatar Animator with Integrated Motion Prediction

Authors: Bo Qian ; Zhenhuan Wei ; Jiashuo Li ; Xing Wei

Efficiently estimating the full-body pose with minimal wearable devices presents a worthwhile research direction. Despite significant advancements in this field, most current research neglects full-body avatar estimation under low-quality signal conditions, which are prevalent in practical usage. To bridge this gap, we summarize three scenarios that may be encountered in real-world applications: the standard scenario, the instantaneous data-loss scenario, and the prolonged data-loss scenario, and propose a new evaluation benchmark. Our solution to the data-loss scenarios is to integrate full-body avatar pose estimation with motion prediction. Specifically, we present ReliaAvatar, a real-time, reliable avatar animator with predictive modeling capabilities built on a dual-path architecture. ReliaAvatar runs at 109 frames per second (fps). Extensive comparative evaluations on widely recognized benchmark datasets demonstrate ReliaAvatar's superior performance in both standard and low data-quality conditions. The code is available at https://github.com/MIV-XJTU/ReliaAvatar.
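
The estimate-or-predict idea behind handling data-loss scenarios can be pictured with a toy fallback loop; the constant-velocity predictor and the `animate_frame` helper are my own simplifications, not ReliaAvatar's dual-path model.

```python
import numpy as np

def animate_frame(sensor_pose, history, predictor):
    """Toy dual-path idea: if the wearable signal for this frame is missing
    (a data-loss scenario), fall back to a motion prediction from past poses;
    otherwise use the signal-driven estimate. A simplified sketch only."""
    if sensor_pose is None:                 # instantaneous or prolonged data loss
        estimate = predictor(history)
    else:
        estimate = sensor_pose
    history.append(estimate)
    return estimate

# Constant-velocity fallback predictor (illustrative).
predict_next = lambda hist: hist[-1] + (hist[-1] - hist[-2])
history = [np.zeros(3), np.ones(3) * 0.1]
pose = animate_frame(None, history, predict_next)   # frame with lost signal
```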

#9 TIM: An Efficient Temporal Interaction Module for Spiking Transformer

Authors: Sicheng Shen ; Dongcheng Zhao ; Guobin Shen ; Yi Zeng

Spiking Neural Networks (SNNs), as the third generation of neural networks, have gained prominence for their biological plausibility and computational efficiency, especially in processing diverse datasets. The integration of attention mechanisms, inspired by advancements in neural network architectures, has led to the development of Spiking Transformers. These have shown promise in enhancing SNNs' capabilities, particularly on both static and neuromorphic datasets. Despite this progress, a discernible gap remains in these systems, specifically in the effectiveness of the Spiking Self Attention (SSA) mechanism at leveraging the temporal processing potential of SNNs. To address this, we introduce the Temporal Interaction Module (TIM), a novel, convolution-based enhancement designed to augment the temporal data processing abilities within SNN architectures. TIM's integration into existing SNN frameworks is seamless and efficient, requiring minimal additional parameters while significantly boosting their temporal information handling capabilities. Through rigorous experimentation, TIM has demonstrated its effectiveness in exploiting temporal information, leading to state-of-the-art performance across various neuromorphic datasets. The code is available at https://github.com/BrainCog-X/Brain-Cog/tree/main/examples/TIM.
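
A generic sketch of a convolution-based temporal interaction: a depthwise 1-D convolution along the timestep axis lets each step's query interact with neighbouring steps before attention is computed. The `TemporalInteraction` module, tensor layout, and kernel size are assumptions for illustration, not TIM's exact formulation.

```python
import torch
import torch.nn as nn

class TemporalInteraction(nn.Module):
    """Toy temporal interaction: a depthwise 1-D convolution along the timestep
    axis mixes information across simulation timesteps. A generic sketch, not
    the exact TIM formulation."""
    def __init__(self, dim, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)

    def forward(self, q):
        # q: (timesteps, batch, tokens, dim), a common layout in spiking transformers
        t, b, n, d = q.shape
        x = q.permute(1, 2, 3, 0).reshape(b * n, d, t)      # (B*N, D, T)
        x = self.conv(x).reshape(b, n, d, t).permute(3, 0, 1, 2)
        return x

tim = TemporalInteraction(dim=64)
out = tim(torch.rand(4, 2, 16, 64))    # 4 timesteps, 2 samples, 16 tokens
```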

#10 One-step Spiking Transformer with a Linear Complexity

Authors: Xiaotian Song ; Andy Song ; Rong Xiao ; Yanan Sun

Spiking transformers have recently emerged as a robust alternative in deep learning. One focus of this field is the reduction of energy consumption, given that spiking transformers require lengthy simulation timesteps and complex floating-point attention mechanisms. In this paper, we propose a one-step approach that requires only one timestep and has linear complexity. The proposed One-step Spiking Transformer (OST) incorporates a Time Domain Compression and Compensation (TDCC) component, which can significantly mitigate the spatio-temporal overhead of spiking transformers. Another novel component in OST is the Spiking Linear Transformation (SLT), designed to greatly reduce the number of floating-point multiply-and-accumulate operations. Experiments on both static and neuromorphic images show that OST can perform as well as or better than SOTA methods with just one timestep, even on more difficult tasks. For instance, compared with Spikeformer, OST gains 1.59% in accuracy on ImageNet while being 40.27% more efficient, and gains 0.7% on DVS128 Gesture. The supplementary materials and source code are available at https://github.com/songxt3/OST.
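
Linear-complexity attention in general replaces softmax(QK^T)V with a kernel feature map so that the D-by-D matrix K^T V is computed first; the sketch below uses the common elu+1 feature map and is a generic linear-attention example, not OST's Spiking Linear Transformation.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention with O(N) cost in sequence length: phi(Q) (phi(K)^T V).
    Uses the common elu+1 feature map; a generic linear-attention sketch, not the
    exact OST formulation."""
    phi = lambda x: F.elu(x) + 1.0
    q, k = phi(q), phi(k)                          # (B, N, D)
    kv = torch.einsum("bnd,bne->bde", k, v)        # (B, D, D), computed once
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

out = linear_attention(torch.randn(2, 128, 32), torch.randn(2, 128, 32),
                       torch.randn(2, 128, 32))
```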

#11 MetaJND: A Meta-Learning Approach for Just Noticeable Difference Estimation

Authors: Miaohui Wang ; Yukuan Zhu ; Rong Zhang ; Wuyuan Xie

The modeling of just noticeable difference (JND) in supervised learning for visual signals has made significant progress. However, existing JND models often suffer from limited generalization due to their need for large-scale training data and their restriction to certain image types. Moreover, these models primarily focus on a single RGB modality, ignoring the potential complementary impact of multiple modalities. To address these challenges, we propose a new meta-learning approach for JND modeling, called MetaJND. We introduce two key visual-sensitive modalities, saliency and depth, and leverage a self-attention mechanism to model the interdependence of multi-modal features effectively. Additionally, we incorporate meta-learning for modality alignment, facilitating dynamic weight generation. Furthermore, we perform hierarchical fusion through multi-layer channel and spatial feature rectification. Experimental results on four benchmark datasets demonstrate the effectiveness of our MetaJND. Moreover, we have also evaluated its performance in compression and watermarking applications, observing higher bit-rate savings and better watermark-hiding capabilities.
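
A hedged sketch of multi-modal fusion in this spirit: RGB, saliency, and depth features are treated as three tokens, mixed with self-attention, and re-weighted by a small generator network standing in for the meta-learned alignment. The `ModalityFusion` class and all dimensions are illustrative, not MetaJND's design.

```python
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    """Toy multi-modal fusion: RGB, saliency and depth features are three tokens
    mixed with self-attention; per-modality weights come from a small generator,
    a stand-in for meta-learned alignment. Shapes are illustrative."""
    def __init__(self, dim=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.weight_gen = nn.Linear(dim, 1)

    def forward(self, rgb, sal, depth):
        tokens = torch.stack([rgb, sal, depth], dim=1)    # (B, 3, dim)
        mixed, _ = self.attn(tokens, tokens, tokens)
        w = torch.softmax(self.weight_gen(mixed), dim=1)  # (B, 3, 1) modality weights
        return (w * mixed).sum(dim=1)                     # fused feature (B, dim)

fusion = ModalityFusion()
fused = fusion(torch.randn(2, 64), torch.randn(2, 64), torch.randn(2, 64))
```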

#12 Apprenticeship-Inspired Elegance: Synergistic Knowledge Distillation Empowers Spiking Neural Networks for Efficient Single-Eye Emotion Recognition

Authors: Yang Wang ; Haiyang Mei ; Qirui Bao ; Ziqi Wei ; Mike Zheng Shou ; Haizhou Li ; Bo Dong ; Xin Yang

We introduce a novel multimodality synergistic knowledge distillation scheme tailored for efficient single-eye emotion recognition tasks. This method allows a lightweight, unimodal student spiking neural network (SNN) to extract rich knowledge from an event-frame multimodal teacher network. The core strength of this approach is its ability to utilize the ample, coarser temporal cues found in conventional frames for effective emotion recognition. Consequently, our method adeptly interprets both temporal and spatial information from the conventional frame domain, eliminating the need for specialized sensing devices such as event-based cameras. The effectiveness of our approach is thoroughly demonstrated using both existing and our newly compiled single-eye emotion recognition datasets, achieving unparalleled performance in accuracy and efficiency over existing state-of-the-art methods.
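
Cross-modal knowledge distillation of this kind is commonly implemented by matching the student's softened predictions to the teacher's; the sketch below shows that generic objective, with `T` and `alpha` as illustrative hyper-parameters rather than the paper's values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Generic KD objective: KL divergence between softened teacher and student
    predictions plus cross-entropy on ground-truth labels. T and alpha are
    illustrative hyper-parameters, not the paper's values."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy batch of 8 samples over 7 emotion classes.
loss = distillation_loss(torch.randn(8, 7), torch.randn(8, 7),
                         torch.randint(0, 7, (8,)))
```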

#13 Multi-scale Context-Aware Networks Based on Fragment Association for Human Activity Recognition

Authors: Zhiqiong Wang ; Hanyu Liu ; Boyang Zhao ; Qi Shen ; Mingzhe Li ; Ningfeng Que ; Mingke Yan ; Junchang Xin

Sensor-based Human Activity Recognition (HAR) is a key component of many artificial intelligence applications. Although deep feature extraction techniques continue to improve and deliver excellent results, striking a balance between performance and computational efficiency remains difficult. Through an in-depth exploration of the inherent characteristics of HAR data, we propose a lightweight feature perception model that comprises an internal feature extractor and a contextual feature perceiver. The model consists of two stages. The first stage is a hierarchical multi-scale feature extraction module composed of depthwise separable convolutions and a multi-head attention mechanism; it extracts conventional features for human activity recognition. After a fragment recombination operation, the features are passed into the second-stage Context-Aware module, which is based on the Retentive Transformer and optimized with the DropKey method to efficiently extract the relationships between feature fragments and thereby mine more valuable feature information. Importantly, this does not add much complexity to the model, preventing excessive resource consumption. We conducted extensive experimental validation on multiple publicly available HAR datasets.
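
A minimal sketch of the first-stage idea, combining a depthwise-separable convolution with multi-head attention over a sensor window; the `MultiScaleBlock` class, channel counts, and kernel size are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Toy first-stage block: depthwise-separable convolution for local features
    followed by multi-head attention over the sequence. Layout and sizes are
    illustrative, not the paper's configuration."""
    def __init__(self, channels=6, dim=32, kernel=5):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel,
                                   padding=kernel // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, dim, 1)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, x):
        # x: (B, T, channels) raw sensor window, e.g. accelerometer + gyroscope
        h = self.pointwise(self.depthwise(x.transpose(1, 2))).transpose(1, 2)
        out, _ = self.attn(h, h, h)
        return out

block = MultiScaleBlock()
feats = block(torch.randn(4, 128, 6))   # -> (4, 128, 32)
```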

#14 CoCoG: Controllable Visual Stimuli Generation Based on Human Concept Representations

Authors: Chen Wei ; Jiachen Zou ; Dietmar Heinke ; Quanying Liu

A central question in cognitive science is how humans process visual scenes, i.e., how to uncover the low-dimensional concept representation space humans use from high-dimensional visual stimuli. Generating visual stimuli with controllable concepts is the key. However, there are currently no generative models in AI that solve this problem. Here, we present the Concept-based Controllable Generation (CoCoG) framework. CoCoG consists of two components: a simple yet efficient AI agent for extracting interpretable concepts and predicting human decision-making in visual similarity judgment tasks, and a conditional generation model for generating visual stimuli given the concepts. We quantify the performance of CoCoG from two aspects, the human behavior prediction accuracy and the controllable generation ability. The experiments with CoCoG indicate that 1) the reliable concept embeddings in CoCoG allow human behavior to be predicted with 64.07% accuracy on the THINGS-similarity dataset; 2) CoCoG can generate diverse stimuli through the control of concepts; 3) CoCoG can manipulate human similarity judgment behavior by intervening on key concepts. CoCoG offers visual objects with controllable concepts to advance our understanding of causality in human cognition. The code of the CoCoG framework is available at https://github.com/ncclab-sustech/CoCoG.
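
In THINGS-style triplet similarity judgments, behavior is commonly predicted from low-dimensional concept embeddings by keeping the most similar pair and rejecting the remaining object; the sketch below illustrates that generic decision rule, not CoCoG's agent.

```python
import numpy as np

def predict_odd_one_out(c1, c2, c3):
    """Predict the odd one out in a triplet from concept embeddings: the pair
    with the highest dot-product similarity is judged most alike, and the
    remaining object is rejected. A generic sketch, not CoCoG's agent."""
    sims = [float(c2 @ c3), float(c1 @ c3), float(c1 @ c2)]  # similarity of the pair excluding item i
    return int(np.argmax(sims))  # index (0, 1 or 2) of the predicted odd one out

rng = np.random.default_rng(0)
c = rng.random((3, 42))          # 42-dim non-negative concept embeddings (illustrative)
print(predict_odd_one_out(*c))
```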

#15 Dialogue Cross-Enhanced Central Engagement Attention Model for Real-Time Engagement Estimation

Authors: Jun Yu ; Keda Lu ; Ji Zhao ; Zhihong Wei ; Iek-Heng Chu ; Peng Chang

Real-time engagement estimation has been an important research topic in human-computer interaction in recent years. The emergence of the NOvice eXpert Interaction (NOXI) dataset, enriched with frame-wise engagement annotations, has catalyzed a surge in research efforts in this domain. Existing feature sequence partitioning methods for ultra-long videos have encountered challenges including insufficient information utilization and repetitive inference. Moreover, those studies focus mainly on the target participants' features without taking into account those of the interlocutor. To address these issues, we propose the center-based sliding window method to obtain feature subsequences. The core of these subsequences is modeled using our innovative Central Engagement Attention Model (CEAM). Additionally, we introduce the dialogue cross-enhanced module that effectively incorporates the interlocutor's features via cross-attention. Our proposed method outperforms the current best model, achieving a substantial gain of 1.5% in concordance correlation coefficient (CCC) and establishing a new state-of-the-art result. Our source codes and model checkpoints are available at https://github.com/wujiekd/Dialogue-Cross-Enhanced-CEAM.
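
The center-based sliding window can be illustrated as follows: for every frame, a feature subsequence centered on that frame is extracted and the model predicts engagement only for the center, so each frame is inferred exactly once. The zero-padding at the edges and the window length are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def center_windows(features, win=64):
    """For each frame t, return the feature subsequence centered on t.
    Edges are zero-padded; a model would then predict engagement for the center
    frame of each window only. A simplified sketch of the idea."""
    half = win // 2
    T, D = features.shape
    padded = np.pad(features, ((half, half), (0, 0)))
    return np.stack([padded[t:t + win] for t in range(T)])   # (T, win, D)

windows = center_windows(np.random.rand(1000, 128))   # 1000 frames, 128-d features
print(windows.shape)                                   # (1000, 64, 128)
```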

#16 CausVSR: Causality Inspired Visual Sentiment Recognition

Authors: Xinyue Zhang ; Zhaoxia Wang ; Hailing Wang ; Jing Xiang ; Chunwei Wu ; Guitao Cao

Visual Sentiment Recognition (VSR) is an evolving field that aims to detect emotional tendencies within visual content. Despite its growing significance, detecting the emotions depicted in visual content such as images remains challenging, notably because of misleading or spurious correlations introduced by contextual information. In response to these challenges, we propose a causality-inspired VSR approach, called CausVSR. CausVSR is rooted in the fundamental principles of Emotional Causality theory, mimicking the human process from receiving emotional stimuli to deriving emotional states. It harnesses a structural causal model designed to encapsulate the dynamic causal interplay between visual content and its corresponding pseudo sentiment regions. This approach allows for a deep exploration of contextual information, elevating the accuracy of emotional inference. Additionally, CausVSR utilizes a global category elicitation module to perform front-door adjustment, effectively detecting and handling spurious correlations. Experiments conducted on four widely used datasets demonstrate CausVSR's superiority in enhancing emotion perception within VSR, surpassing existing methods.
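
Front-door adjustment refers to the standard identification formula below, where X is the input image, M a mediator (here one can think of the pseudo sentiment regions or elicited categories), and Y the predicted sentiment; this variable naming is mine, not the paper's notation.

```latex
P\bigl(Y \mid \mathrm{do}(X=x)\bigr)
  = \sum_{m} P(m \mid x) \sum_{x'} P\bigl(Y \mid x', m\bigr)\, P(x')
```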