Computers and Society | Cool Papers - Immersive Paper Discovery

#1 Open Source Language Models Can Provide Feedback: Evaluating LLMs' Ability to Help Students Using GPT-4-As-A-Judge

Authors: Charles Koutcheme ; Nicola Dainese ; Sami Sarsa ; Arto Hellas ; Juho Leinonen ; Paul Denny

Large language models (LLMs) have shown great potential for the automatic generation of feedback in a wide range of computing contexts. However, concerns have been voiced around the privacy and ethical implications of sending student work to proprietary models. This has sparked considerable interest in the use of open source LLMs in education, but the quality of the feedback that such open models can produce remains understudied. This is a concern as providing flawed or misleading generated feedback could be detrimental to student learning. Inspired by recent work that has utilised very powerful LLMs, such as GPT-4, to evaluate the outputs produced by less powerful models, we conduct an automated analysis of the quality of the feedback produced by several open source models using a dataset from an introductory programming course. First, we investigate the viability of employing GPT-4 as an automated evaluator by comparing its evaluations with those of a human expert. We observe that GPT-4 demonstrates a bias toward positively rating feedback while exhibiting moderate agreement with human raters, showcasing its potential as a feedback evaluator. Second, we explore the quality of feedback generated by several leading open-source LLMs by using GPT-4 to evaluate the feedback. We find that some models offer competitive performance with popular proprietary LLMs, such as ChatGPT, indicating opportunities for their responsible use in educational settings.
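
The abstract reports "moderate agreement" between GPT-4 and a human rater but does not name the statistic used here. As a minimal sketch, assuming agreement is measured with Cohen's kappa on binary judgements, the computation could look like the following. The function name and the ratings are hypothetical illustrations, not data from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items on which both raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1.0 - expected)


# Hypothetical binary judgements (1 = feedback judged acceptable).
human = [1, 0, 1, 1, 0, 0, 1, 0]
gpt4 = [1, 1, 1, 1, 0, 1, 1, 0]
kappa = cohens_kappa(human, gpt4)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the second rater marks more items as acceptable than the first, which mirrors the kind of positive bias the abstract describes; kappa corrects the raw agreement rate for agreement expected by chance.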

#2 Designing Skill-Compatible AI: Methodologies and Frameworks in Chess

Authors: Karim Hamade ; Reid McIlroy-Young ; Siddhartha Sen ; Jon Kleinberg ; Ashton Anderson

Powerful artificial intelligence systems are often used in settings where they must interact with agents that are computationally much weaker, for example when they work alongside humans or operate in complex environments where some tasks are handled by algorithms, heuristics, or other entities of varying computational power. For AI agents to successfully interact in these settings, however, achieving superhuman performance alone is not sufficient; they also need to account for suboptimal actions or idiosyncratic style from their less-skilled counterparts. We propose a formal evaluation framework for assessing the compatibility of near-optimal AI with interaction partners who may have much lower levels of skill; we use popular collaborative chess variants as model systems to study and develop AI agents that can successfully interact with lower-skill entities. Traditional chess engines designed to output near-optimal moves prove to be inadequate partners when paired with engines of various lower skill levels in this domain, as they are not designed to consider the presence of other agents. We contribute three methodologies to explicitly create skill-compatible AI agents in complex decision-making settings, and two chess game frameworks designed to foster collaboration between powerful AI agents and less-skilled partners. On these frameworks, our agents outperform state-of-the-art chess AI (based on AlphaZero) despite being weaker in conventional chess, demonstrating that skill-compatibility is a tangible trait that is qualitatively and measurably distinct from raw performance. Our evaluations further explore and clarify the mechanisms by which our agents achieve skill-compatibility.

#3 Overcoming Anchoring Bias: The Potential of AI and XAI-based Decision Support

Authors: Felix Haag ; Carlo Stingl ; Katrin Zerfass ; Konstantin Hopf ; Thorsten Staake

Information systems (IS) are frequently designed to leverage the negative effect of anchoring bias to influence individuals' decision-making (e.g., by manipulating purchase decisions). Recent advances in Artificial Intelligence (AI) and the explanations of its decisions through explainable AI (XAI) have opened new opportunities for mitigating biased decisions. So far, the potential of these technological advances to overcome anchoring bias remains largely unclear. To this end, we conducted two online experiments with a total of N=390 participants in the context of purchase decisions to examine the impact of AI and XAI-based decision support on anchoring bias. Our results show that AI alone and its combination with XAI help to mitigate the negative effect of anchoring bias. Ultimately, our findings have implications for the design of AI and XAI-based decision support and IS to overcome cognitive biases.

#4 Enhancing Deep Knowledge Tracing via Diffusion Models for Personalized Adaptive Learning

Authors: Ming Kuo ; Shouvon Sarker ; Lijun Qian ; Yujian Fu ; Xiangfang Li ; Xishuang Dong

In contrast to pedagogies like evidence-based teaching, personalized adaptive learning (PAL) distinguishes itself by closely monitoring the progress of individual students and tailoring the learning path to their unique knowledge and requirements. A crucial technique for effective PAL implementation is knowledge tracing, which models students' evolving knowledge to predict their future performance. Based on these predictions, personalized recommendations for resources and learning paths can be made to meet individual needs. Recent advancements in deep learning have successfully enhanced knowledge tracing through Deep Knowledge Tracing (DKT). This paper introduces generative AI models to further enhance DKT. Generative AI models, rooted in deep learning, are trained to generate synthetic data, addressing data scarcity challenges in various applications across fields such as natural language processing (NLP) and computer vision (CV). This study aims to tackle data shortage issues in student learning records to enhance DKT performance for PAL. Specifically, it employs TabDDPM, a diffusion model, to generate synthetic educational records to augment training data for enhancing DKT. The proposed method's effectiveness is validated through extensive experiments on ASSISTments datasets. The experimental results demonstrate that the synthetic data generated by TabDDPM significantly improves DKT performance, particularly in scenarios with small data for training and large data for testing.

#5 Integrating LSTM and BERT for Long-Sequence Data Analysis in Intelligent Tutoring Systems

Authors: Zhaoxing Li ; Jujie Yang ; Jindi Wang ; Lei Shi ; Sebastian Stein

The field of Knowledge Tracing aims to understand how students learn and master knowledge over time by analyzing their historical behaviour data. To achieve this goal, many researchers have proposed Knowledge Tracing models that use data from Intelligent Tutoring Systems to predict students' subsequent actions. However, with the development of Intelligent Tutoring Systems, large-scale datasets containing long-sequence data began to emerge. Recent deep learning based Knowledge Tracing models face obstacles such as low efficiency, low accuracy, and low interpretability when dealing with large-scale datasets containing long-sequence data. To address these issues and promote the sustainable development of Intelligent Tutoring Systems, we propose an LSTM- and BERT-based Knowledge Tracing model for long-sequence data processing, namely LBKT, which uses a BERT-based architecture with a Rasch model-based embeddings block to handle information about different difficulty levels and an LSTM block to process the sequential characteristics of students' actions. LBKT achieves the best performance on most benchmark datasets on the metrics of ACC and AUC. Additionally, an ablation study is conducted to analyse the impact of each component on LBKT's overall performance. Moreover, we used t-SNE as the visualisation tool to demonstrate the model's embedding strategy. The results indicate that LBKT is faster, more interpretable, and has a lower memory cost than the traditional deep learning based Knowledge Tracing methods.
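
For readers unfamiliar with the Rasch model that the embeddings block is named after, the classical one-parameter form gives the probability of a correct response as a logistic function of student ability minus item difficulty. The sketch below shows only that textbook formula; how LBKT turns it into embeddings is not described in the abstract, and the function and parameter names are assumptions made for illustration.

```python
import math


def rasch_probability(ability, difficulty):
    """Classical Rasch model: P(correct) = sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical values on the usual logit scale.
p_matched = rasch_probability(ability=0.0, difficulty=0.0)
p_easy_item = rasch_probability(ability=1.0, difficulty=-1.0)
p_hard_item = rasch_probability(ability=1.0, difficulty=2.0)
print(p_matched, p_easy_item, p_hard_item)
```

When ability equals difficulty the probability is exactly one half; easier items push it toward one and harder items toward zero.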

#6 Responding to Generative AI Technologies with Research-through-Design: The Ryelands AI Lab as an Exploratory Study

Authors: Jesse Josua Benjamin ; Joseph Lindley ; Elizabeth Edwards ; Elisa Rubegni ; Tim Korjakow ; David Grist ; Rhiannon Sharkey

Generative AI technologies demand new practical and critical competencies, which call on design to respond to and foster these. We present an exploratory study guided by Research-through-Design, in which we partnered with a primary school to develop a constructionist curriculum centered on students interacting with a generative AI technology. We provide a detailed account of the design of and outputs from the curriculum and learning materials, finding centrally that the reflexive and prolonged 'hands-on' approach led to a co-development of students' practical and critical competencies. From the study, we contribute guidance for designing constructionist approaches to generative AI technology education; further arguing to do so with 'critical responsivity.' We then discuss how HCI researchers may leverage constructionist strategies in designing interactions with generative AI technologies; and suggest that Research-through-Design can play an important role as a 'rapid response methodology' capable of reacting to fast-evolving, disruptive technologies such as generative AI.

#7 Physics-based deep learning reveals rising heating demand heightens air pollution in Norwegian cities

Authors: Cong Cao ; Ramit Debnath ; R. Michael Alvarez

Policymakers frequently analyze air quality and climate change in isolation, disregarding their interactions. This study explores the influence of specific climate factors on air quality by contrasting a regression model with K-Means Clustering, Hierarchical Clustering, and Random Forest techniques. We employ Physics-based Deep Learning (PBDL) and Long Short-Term Memory (LSTM) to examine the air pollution predictions. Our analysis utilizes ten years (2009-2018) of daily traffic, weather, and air pollution data from three major cities in Norway. Findings from feature selection reveal a correlation between rising heating degree days and heightened air pollution levels, suggesting increased heating activities in Norway are a contributing factor to worsening air quality. PBDL demonstrates superior accuracy in air pollution predictions compared to LSTM. This paper contributes to the growing literature on PBDL methods for more accurate air pollution predictions using environmental variables, aiding policymakers in formulating effective data-driven climate policies.
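
Heating degree days, the feature highlighted above, are commonly computed by summing how far each day's mean temperature falls below a base temperature. The abstract does not state the base temperature or the exact definition the authors used, so the sketch below assumes the common daily-mean definition and a hypothetical base of 17 °C.

```python
def heating_degree_days(daily_mean_temps_c, base_temp_c=17.0):
    """Sum of daily shortfalls below the base temperature, in degree days.

    base_temp_c is an assumed value; the source does not specify one.
    """
    return sum(max(0.0, base_temp_c - temp) for temp in daily_mean_temps_c)


# Hypothetical daily mean temperatures in degrees Celsius.
week = [10.0, 17.0, 20.0, -3.0]
hdd = heating_degree_days(week)
print(hdd)
```

Days at or above the base temperature contribute nothing, so the total rises only with colder days, which is why the measure serves as a proxy for heating demand.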

#8 Enhancing LLM-Based Feedback: Insights from Intelligent Tutoring Systems and the Learning Sciences

Authors: John Stamper ; Ruiwei Xiao ; Xinying Hou

The field of Artificial Intelligence in Education (AIED) focuses on the intersection of technology, education, and psychology, placing a strong emphasis on supporting learners' needs with compassion and understanding. The growing prominence of Large Language Models (LLMs) has led to the development of scalable solutions within educational settings, including generating different types of feedback in Intelligent Tutoring Systems (ITS). However, the approach to utilizing these models often involves directly formulating prompts to solicit specific information, lacking a solid theoretical foundation for prompt construction and empirical assessments of their impact on learning. This work advocates careful and caring AIED research by reviewing previous research on feedback generation in ITS, with emphasis on the theoretical frameworks it utilized and the efficacy of the corresponding designs in empirical evaluations, and then suggesting opportunities to apply these evidence-based principles to the design, experiment, and evaluation phases of LLM-based feedback generation. The main contributions of this paper include: advocacy for applying more cautious, theoretically grounded methods in feedback generation in the era of generative AI; and practical suggestions on theory- and evidence-based feedback design for LLM-powered ITS.

#9 Verified authors shape X/Twitter discursive communities

Authors: Stefano Guarino ; Ayoub Mounim ; Guido Caldarelli ; Fabio Saracco

Community detection algorithms try to extract a mesoscale structure from the available network data, generally avoiding any explicit assumption regarding the quantity and quality of information conveyed by specific sets of edges. In this paper, we show that the core of ideological/discursive communities on X/Twitter can be effectively identified by uncovering the most informative interactions in an authors-audience bipartite network through a maximum-entropy null model. The analysis is performed considering three X/Twitter datasets related to the main political events of 2022 in Italy, using as benchmarks four state-of-the-art algorithms (three descriptive, one inferential), and manually annotating nearly 300 verified users based on their political affiliation. In terms of information content, the communities obtained with the entropy-based algorithm are comparable to those obtained with some of the benchmarks. However, such a methodology on the authors-audience bipartite network: uses just a small sample of the available data to identify the central users of each community; returns a neater partition of the user set in just a few, easy to interpret, communities; clusters well-known political figures in a way that better matches the political alliances when compared with the benchmarks. Our results provide an important insight into online debates, highlighting that online interaction networks are mostly shaped by the activity of a small set of users who enjoy public visibility even outside social media.

#10 The Dark Side of Dataset Scaling: Evaluating Racial Classification in Multimodal Models

Authors: Abeba Birhane ; Sepehr Dehdashtian ; Vinay Uday Prabhu ; Vishnu Boddeti

"Scale the model, scale the data, scale the GPU farms" is the reigning sentiment in the world of generative AI today. While model scaling has been extensively studied, data scaling and its downstream impacts on model performance remain under-explored. This is particularly important in the context of multimodal datasets whose main source is the World Wide Web, condensed and packaged as the Common Crawl dump, which is known to exhibit numerous drawbacks. In this paper, we evaluate the downstream impact of dataset scaling on 14 visio-linguistic models (VLMs) trained on the LAION-400M and LAION-2B datasets by measuring racial and gender bias using the Chicago Face Dataset (CFD) as the probe. Our results show that as the training data increased, the probability of a pre-trained CLIP model misclassifying human images as offensive non-human classes such as chimpanzee, gorilla, and orangutan decreased, but misclassifying the same images as human offensive classes such as criminal increased. Furthermore, of the 14 Vision Transformer-based VLMs we evaluated, the probability of predicting an image of a Black man and a Latino man as criminal increases by 65% and 69%, respectively, when the dataset is scaled from 400M to 2B samples for the larger ViT-L models. Conversely, for the smaller base ViT-B models, the probability of predicting an image of a Black man and a Latino man as criminal decreases by 20% and 47%, respectively, when the dataset is scaled from 400M to 2B samples. We ground the model audit results in a qualitative and historical analysis, reflect on our findings and their implications for dataset curation practice, and close with a summary of mitigation mechanisms and ways forward. Content warning: This article contains racially dehumanising and offensive descriptions.

#11 Guiding the Way: A Comprehensive Examination of AI Guidelines in Global Media

Authors: M. F. de-Lima-Santos ; W. N. Yeung ; T. Dodds

With the increasing adoption of artificial intelligence (AI) technologies in the news industry, media organizations have begun publishing guidelines that aim to promote the responsible, ethical, and unbiased implementation of AI-based technologies. These guidelines are expected to serve journalists and media workers by establishing best practices and a framework that helps them navigate ever-evolving AI tools. Drawing on institutional theory and digital inequality concepts, this study analyzes 37 AI guidelines for media purposes in 17 countries. Our analysis reveals key thematic areas, such as transparency, accountability, fairness, privacy, and the preservation of journalistic values. Results highlight shared principles and best practices that emerge from these guidelines, including the importance of human oversight, explainability of AI systems, disclosure of automated content, and protection of user data. However, the geographical distribution of these guidelines, highlighting the dominance of Western nations, particularly North America and Europe, can further ongoing concerns about power asymmetries in AI adoption and consequently isomorphism outside these regions. Our results may serve as a resource for news organizations, policymakers, and stakeholders looking to navigate the complex AI development toward creating a more inclusive and equitable digital future for the media industry worldwide.

#12 Ordinal Behavior Classification of Student Online Course Interactions

Author: Thomas Trask

Interaction patterns between students in on-campus and MOOC-style online courses have been broadly studied for the last 11 years. Yet there remains a gap in the literature comparing the habits of students completing the same course offered in both on-campus and MOOC-style online formats. This study examines browser-based usage patterns for students in the Georgia Tech CS1301 edX course, covering both the online course offered to on-campus students and the MOOC-style course offered to anyone, to determine what patterns, if any, exist between the two cohorts.

#13 Improving Automated Distractor Generation for Math Multiple-choice Questions with Overgenerate-and-rank

Authors: Alexander Scarlatos ; Wanyong Feng ; Digory Smith ; Simon Woodhead ; Andrew Lan

Multiple-choice questions (MCQs) are commonly used across all levels of math education since they can be deployed and graded at a large scale. A critical component of MCQs is the distractors, i.e., incorrect answers crafted to reflect student errors or misconceptions. Automatically generating them in math MCQs, e.g., with large language models, has been challenging. In this work, we propose a novel method to enhance the quality of generated distractors through overgenerate-and-rank, training a ranking model to predict how likely distractors are to be selected by real students. Experimental results on a real-world dataset and human evaluation with math teachers show that our ranking model increases alignment with human-authored distractors, although human-authored ones are still preferred over generated ones.
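
The overgenerate-and-rank idea described above can be sketched in a few lines: produce more candidate distractors than needed, score each with a ranking model, and keep the top few. The code below is only an illustration under stated assumptions. The candidate list stands in for language-model output, and the lookup-table scorer is a hypothetical placeholder for the trained ranking model, whose architecture the abstract does not describe.

```python
def overgenerate_and_rank(candidates, score_fn, k=3):
    """Keep the k highest-scoring distinct candidates.

    candidates: over-generated distractors (e.g. sampled from an LLM).
    score_fn:   hypothetical stand-in for a ranking model that estimates
                how likely students are to select a distractor.
    """
    unique = list(dict.fromkeys(candidates))  # drop duplicates, keep order
    ranked = sorted(unique, key=score_fn, reverse=True)
    return ranked[:k]


# Hypothetical scores a ranking model might assign for one question.
toy_scores = {"12": 0.60, "7": 0.10, "15": 0.25, "9": 0.05}
candidates = ["12", "7", "12", "15", "9"]
top_two = overgenerate_and_rank(candidates, toy_scores.get, k=2)
print(top_two)
```

Separating generation from ranking means the generator can be sampled freely, while the ranker alone decides which distractors look most plausible to students.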