IJCAI.2021 - Humans and AI

| Total: 8

#1 UIBert: Learning Generic Multimodal Representations for UI Understanding [PDF] [Copy] [Kimi] [REL]

Authors: Chongyang Bai ; Xiaoxue Zang ; Ying Xu ; Srinivas Sunkara ; Abhinav Rastogi ; Jindong Chen ; Blaise Agüera y Arcas

To improve the accessibility of smart devices and to simplify their usage, building models which understand user interfaces (UIs) and assist users to complete their tasks is critical. However, unique challenges are proposed by UI-specific characteristics, such as how to effectively leverage multimodal UI features that involve image, text, and structural metadata and how to achieve good performance when high-quality labeled data is unavailable. To address such challenges we introduce UIBert, a transformer-based joint image-text model trained through novel pre-training tasks on large-scale unlabeled UI data to learn generic feature representations for a UI and its components. Our key intuition is that the heterogeneous features in a UI are self-aligned, i.e., the image and text features of UI components, are predictive of each other. We propose five pretraining tasks utilizing this self-alignment among different features of a UI component and across various components in the same UI. We evaluate our method on nine real-world downstream UI tasks where UIBert outperforms strong multimodal baselines by up to 9.26% accuracy.

#2 Pruning of Deep Spiking Neural Networks through Gradient Rewiring [PDF] [Copy] [Kimi] [REL]

Authors: Yanqi Chen ; Zhaofei Yu ; Wei Fang ; Tiejun Huang ; Yonghong Tian

Spiking Neural Networks (SNNs) have been attached great importance due to their biological plausibility and high energy-efficiency on neuromorphic chips. As these chips are usually resource-constrained, the compression of SNNs is thus crucial along the road of practical use of SNNs. Most existing methods directly apply pruning approaches in artificial neural networks (ANNs) to SNNs, which ignore the difference between ANNs and SNNs, thus limiting the performance of the pruned SNNs. Besides, these methods are only suitable for shallow SNNs. In this paper, inspired by synaptogenesis and synapse elimination in the neural system, we propose gradient rewiring (Grad R), a joint learning algorithm of connectivity and weight for SNNs, that enables us to seamlessly optimize network structure without retraining. Our key innovation is to redefine the gradient to a new synaptic parameter, allowing better exploration of network structures by taking full advantage of the competition between pruning and regrowth of connections. The experimental results show that the proposed method achieves minimal loss of SNNs' performance on MNIST and CIFAR-10 datasets so far. Moreover, it reaches a ~3.5% accuracy loss under unprecedented 0.73% connectivity, which reveals remarkable structure refining capability in SNNs. Our work suggests that there exists extremely high redundancy in deep SNNs. Our codes are available at https://github.com/Yanqi-Chen/Gradient-Rewiring.

#3 Human-AI Collaboration with Bandit Feedback [PDF] [Copy] [Kimi] [REL]

Authors: Ruijiang Gao ; Maytal Saar-Tsechansky ; Maria De-Arteaga ; Ligong Han ; Min Kyung Lee ; Matthew Lease

Human-machine complementarity is important when neither the algorithm nor the human yield dominant performance across all instances in a given domain. Most research on algorithmic decision-making solely centers on the algorithm's performance, while recent work that explores human-machine collaboration has framed the decision-making problems as classification tasks. In this paper, we first propose and then develop a solution for a novel human-machine collaboration problem in a bandit feedback setting. Our solution aims to exploit the human-machine complementarity to maximize decision rewards. We then extend our approach to settings with multiple human decision makers. We demonstrate the effectiveness of our proposed methods using both synthetic and real human responses, and find that our methods outperform both the algorithm and the human when they each make decisions on their own. We also show how personalized routing in the presence of multiple human decision-makers can further improve the human-machine team performance.

#4 Accounting for Confirmation Bias in Crowdsourced Label Aggregation [PDF] [Copy] [Kimi] [REL]

Authors: Meric Altug Gemalmaz ; Ming Yin

Collecting large-scale human-annotated datasets via crowdsourcing to train and improve automated models is a prominent human-in-the-loop approach to integrate human and machine intelligence. However, together with their unique intelligence, humans also come with their biases and subjective beliefs, which may influence the quality of the annotated data and negatively impact the effectiveness of the human-in-the-loop systems. One of the most common types of cognitive biases that humans are subject to is the confirmation bias, which is people's tendency to favor information that confirms their existing beliefs and values. In this paper, we present an algorithmic approach to infer the correct answers of tasks by aggregating the annotations from multiple crowd workers, while taking workers' various levels of confirmation bias into consideration. Evaluations on real-world crowd annotations show that the proposed bias-aware label aggregation algorithm outperforms baseline methods in accurately inferring the ground-truth labels of different tasks when crowd workers indeed exhibit some degree of confirmation bias. Through simulations on synthetic data, we further identify the conditions when the proposed algorithm has the largest advantages over baseline methods.

#5 An Entanglement-driven Fusion Neural Network for Video Sentiment Analysis [PDF] [Copy] [Kimi] [REL]

Authors: Dimitris Gkoumas ; Qiuchi Li ; Yijun Yu ; Dawei Song

Video data is multimodal in its nature, where an utterance can involve linguistic, visual and acoustic information. Therefore, a key challenge for video sentiment analysis is how to combine different modalities for sentiment recognition effectively. The latest neural network approaches achieve state-of-the-art performance, but they neglect to a large degree of how humans understand and reason about sentiment states. By contrast, recent advances in quantum probabilistic neural models have achieved comparable performance to the state-of-the-art, yet with better transparency and increased level of interpretability. However, the existing quantum-inspired models treat quantum states as either a classical mixture or as a separable tensor product across modalities, without triggering their interactions in a way that they are correlated or non-separable (i.e., entangled). This means that the current models have not fully exploited the expressive power of quantum probabilities. To fill this gap, we propose a transparent quantum probabilistic neural model. The model induces different modalities to interact in such a way that they may not be separable, encoding crossmodal information in the form of non-classical correlations. Comprehensive evaluation on two benchmarking datasets for video sentiment analysis shows that the model achieves significant performance improvement. We also show that the degree of non-separability between modalities optimizes the post-hoc interpretability.

#6 Event-based Action Recognition Using Motion Information and Spiking Neural Networks [PDF] [Copy] [Kimi] [REL]

Authors: Qianhui Liu ; Dong Xing ; Huajin Tang ; De Ma ; Gang Pan

Event-based cameras have attracted increasing attention due to their advantages of biologically inspired paradigm and low power consumption. Since event-based cameras record the visual input as asynchronous discrete events, they are inherently suitable to cooperate with the spiking neural network (SNN). Existing works of SNNs for processing events mainly focus on the task of object recognition. However, events from the event-based camera are triggered by dynamic changes, which makes it an ideal choice to capture actions in the visual scene. Inspired by the dorsal stream in visual cortex, we propose a hierarchical SNN architecture for event-based action recognition using motion information. Motion features are extracted and utilized from events to local and finally to global perception for action recognition. To the best of the authors’ knowledge, it is the first attempt of SNN to apply motion information to event-based action recognition. We evaluate our proposed SNN on three event-based action recognition datasets, including our newly published DailyAction-DVS dataset comprising 12 actions collected under diverse recording conditions. Extensive experimental results show the effectiveness of motion information and our proposed SNN architecture for event-based action recognition.

#7 Item Response Ranking for Cognitive Diagnosis [PDF] [Copy] [Kimi] [REL]

Authors: Shiwei Tong ; Qi Liu ; Runlong Yu ; Wei Huang ; Zhenya Huang ; Zachary A. Pardos ; Weijie Jiang

Cognitive diagnosis, a fundamental task in education area, aims at providing an approach to reveal the proficiency level of students on knowledge concepts. Actually, monotonicity is one of the basic conditions in cognitive diagnosis theory, which assumes that student's proficiency is monotonic with the probability of giving the right response to a test item. However, few of previous methods consider the monotonicity during optimization. To this end, we propose Item Response Ranking framework (IRR), aiming at introducing pairwise learning into cognitive diagnosis to well model the monotonicity between item responses. Specifically, we first use an item specific sampling method to sample item responses and construct response pairs based on their partial order, where we propose the two-branch sampling methods to handle the unobserved responses. After that, we use a pairwise objective function to exploit the monotonicity in the pair formulation. In fact, IRR is a general framework which can be applied to most of contemporary cognitive diagnosis models. Extensive experiments demonstrate the effectiveness and interpretability of our method.

#8 Type Anywhere You Want: An Introduction to Invisible Mobile Keyboard [PDF] [Copy] [Kimi] [REL]

Authors: Sahng-Min Yoo ; Ue-Hwan Kim ; Yewon Hwang ; Jong-Hwan Kim

Contemporary soft keyboards possess limitations: the lack of physical feedback results in an increase of typos, and the interface of soft keyboards degrades the utility of the screen. To overcome these limitations, we propose an Invisible Mobile Keyboard (IMK), which lets users freely type on the desired area without any constraints. To facilitate a data-driven IMK decoding task, we have collected the most extensive text-entry dataset (approximately 2M pairs of typing positions and the corresponding characters). Additionally, we propose our baseline decoder along with a semantic typo correction mechanism based on self-attention, which decodes such unconstrained inputs with high accuracy (96.0%). Moreover, the user study reveals that the users could type faster and feel convenience and satisfaction to IMK with our decoder. Lastly, we make the source code and the dataset public to contribute to the research community.