AAAI.2025 - AI for Social Impact Track | Cool Papers

#1 Spatial Clustering of Citizen Science Data Improves Downstream Species Distribution Models [PDF] [Copy] [Kimi¹] [REL]

Authors: Nahian Ahmed, Mark Roth, Tyler A. Hallman, W. Douglas Robinson, Rebecca A. Hutchinson

Citizen science biodiversity data present great opportunities for ecology and conservation across vast spatial and temporal scales. However, the opportunistic nature of these data lacks the sampling structure required by modeling methodologies that address a pervasive challenge in ecological data collection: imperfect detection, i.e., the likelihood of under-observing species on field surveys. Occupancy modeling is an example of an approach that accounts for imperfect detection by explicitly modeling the observation process separately from the biological process of habitat selection. This produces species distribution models that speak to the pattern of the species on a landscape after accounting for imperfect detection in the data, rather than the pattern of species observations corrupted by errors. To achieve this benefit, occupancy models require multiple surveys of a site across which the site's status (i.e., occupied or not) is assumed constant. Since citizen science data are not collected under the required repeated-visit protocol, observations may be grouped into sites post hoc. Existing approaches for constructing sites discard some observations and/or consider only geographic distance and not environmental similarity. In this study, we compare ten approaches for site construction in terms of their impact on downstream species distribution models for 31 bird species in Oregon, using observations recorded in the eBird database. We find that occupancy models built on sites constructed by spatial clustering algorithms perform better than existing alternatives.

Subject: AAAI.2025 - AI for Social Impact Track

#2 Generalizable Disaster Damage Assessment via Change Detection with Vision Foundation Model [PDF³] [Copy] [Kimi¹] [REL]

Authors: Kyeongjin Ahn, Sungwon Han, Sungwon Park, Jihee Kim, Sangyoon Park, Meeyoung Cha

The increasing frequency and intensity of natural disasters call for rapid and accurate damage assessment. In response, disaster benchmark datasets from high-resolution satellite imagery have been constructed to develop methods for detecting damaged areas. However, these methods face significant challenges when applied to previously unseen regions due to the limited geographical and disaster-type diversity in the existing datasets. We introduce DAVI (Disaster Assessment with VIsion foundation model), a novel approach that addresses domain disparities and detects structural damage at the building level without requiring ground-truth labels for target regions. DAVI combines task-specific knowledge from a model trained on source regions with task-agnostic knowledge from an image segmentation model to generate pseudo labels indicating potential damage in target regions. It then utilizes a two-stage refinement process, which operate at both pixel and image levels, to accurately identify changes in disaster-affected areas. Our evaluation, including a case study on the 2023 Türkiye earthquake, demonstrates that our model achieves exceptional performance across diverse terrains (e.g., North America, Asia, and the Middle East) and disaster types (e.g., wildfires, hurricanes, and tsunamis). This confirms its robustness in disaster assessment without dependence on ground-truth labels and highlights its practical applicability.

Subject: AAAI.2025 - AI for Social Impact Track

#3 Dynamics-Based Feature Augmentation of Graph Neural Networks for Variant Emergence Prediction [PDF] [Copy] [Kimi²] [REL]

Authors: Majd Al Aawar, Srikar Mutnuri, Mansooreh Montazerin, Ajitesh Srivastava

During the COVID-19 pandemic, a major driver of new surges has been the emergence of new variants. When a new variant emerges in one or more countries, other nations monitor its spread in preparation for its potential arrival. The impact of the new variant and the timings of epidemic peaks in a country highly depend on when the variant arrives. The current methods for predicting the spread of new variants rely on statistical modeling, however, these methods work only when the new variant has already arrived in the region of interest and has a significant prevalence. Can we predict when a variant existing elsewhere will arrive in a given region? To address this question, we propose a variant-dynamics-informed Graph Neural Network (GNN) approach. First, we derive the dynamics of variant prevalence across pairs of regions (countries) that apply to a large class of epidemic models. The dynamics motivate the introduction of certain features in the GNN. We demonstrate that our proposed dynamics-informed GNN outperforms all the baselines, including the currently pervasive framework of Physics-Informed Neural Networks (PINNs). To advance research in this area, we introduce a benchmarking tool to assess a user-defined model's prediction performance across 87 countries and 36 variants.

Subject: AAAI.2025 - AI for Social Impact Track

#4 Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments [PDF] [Copy] [Kimi] [REL]

Authors: Tuka Alhanai, Adam Kasumovic, Mohammad M. Ghassemi, Aven Zitzelberger, Jessica M. Lundin, Guillaume Chabot-Couture

Large Language Models (LLMs) have shown remarkable performance across various tasks, yet significant disparities remain for non-English languages, and especially native African languages. This paper addresses these disparities by creating approximately 1 million human-translated words of new benchmark data in 8 low-resource African languages, covering a population of over 160 million speakers of: Amharic, Bambara, Igbo, Sepedi (Northern Sotho), Shona, Sesotho (Southern Sotho), Setswana, and Tsonga. Our benchmarks are translations of Winogrande and three sections of MMLU: college medicine, clinical knowledge, and virology. Using the translated benchmarks, we report previously unknown performance gaps between state-of-the-art (SOTA) LLMs in English and African languages. Finally, using results from over 400 fine-tuned models, we explore several methods to reduce the LLM performance gap, including high-quality dataset fine-tuning (using an LLM-as-an-Annotator), cross-lingual transfer, and cultural appropriateness adjustments. Key findings include average mono-lingual improvements of 5.6% with fine-tuning (with 5.4% average mono-lingual improvements when using high-quality data over low-quality data), 2.9% average gains from cross-lingual transfer, and a 3.0% out-of-the-box performance boost on culturally appropriate questions. The publicly available benchmarks, translations, and code from this study support further research and development aimed at creating more inclusive and effective language technologies.

Subject: AAAI.2025 - AI for Social Impact Track

#5 Dust-Mamba: An Efficient Dust Storm Detection Network with Multiple Data Sources [PDF¹] [Copy] [Kimi] [REL]

Authors: Cong Bai, Zhonghao Lin, Jinglin Zhang, Shengyong Chen

Accurate detection of dust storms is challenging due to complex meteorological interactions. With the development of deep learning, deep neural networks have been increasingly applied to dust storm detection, offering better learning and generalization capabilities compared to traditional physical modeling. However, existing methods face some limitations, leading to performance bottlenecks in dust storm detection. From the task perspective, existing research focuses on occurrence detection while neglecting intensity detection. From the data perspective, existing research fails to explore the utilization of multi-source data. From the model perspective, most models are built on convolutional neural networks, which have an inherent limitation in capturing long-range dependencies. To address the challenges mentioned, this study proposes Dust-Mamba. To the best of our knowledge, this study is the first attempt to accomplish both the occurrence and intensity detection of dust storms with advanced deep learning technology. In Dust-Mamba, multi-source data is introduced to provide a comprehensive perspective, Mamba and attention are applied to boost feature selection while maintaining long-range modeling capability. Additionally, this study proposes Structure Sharing Transfer Learning Strategies for intensity detection, which further enhances the performance of Dust-Mamba with minimal time cost. As shown by experiments, Dust-Mamba achieves Dice scores of 0.963 for occurrence detection and 0.560 for intensity detection, surpassing several baseline models. In conclusion, this study offers valuable baselines for dust storm detection, with significant reference value and promising application potential.

Subject: AAAI.2025 - AI for Social Impact Track

#6 Rethinking and Improving Student Learning and Forgetting Processes for Attention based Knowledge Tracing Models [PDF²] [Copy] [Kimi¹] [REL]

Authors: Youheng Bai, Xueyi Li, Zitao Liu, Yaying Huang, Mi Tian, Weiqi Luo

Knowledge tracing (KT) models students' knowledge states and predicts their future performance based on their historical interaction data. However, attention based KT models struggle to accurately capture diverse forgetting behaviors in ever-growing interaction sequences. First, existing models use uniform time decay matrices, conflating forgetting representations with problem relevance. Second, the fixed-length window prediction paradigm fails to model continuous forgetting processes in expanding sequences. To address these challenges, this paper introduces LefoKT, a unified architecture that enhances attention based KT models by incorporating proposed relative forgetting attention. LefoKT improves forgetting modeling through relative forgetting attention to decouple forgetting patterns from problem relevance. It also enhances attention based KT models' length extrapolation capability for capturing continuous forgetting processes in ever-growing interaction sequences. Extensive experimental results on three datasets validate the effectiveness of LefoKT.

Subject: AAAI.2025 - AI for Social Impact Track

#7 To Measure or Not: A Cost-Sensitive, Selective Measuring Environment for Agricultural Management Decisions with Reinforcement Learning [PDF] [Copy] [Kimi] [REL]

Authors: Hilmy Baja, Michiel Kallenberg, Ioannis N. Athanasiadis

Farmers rely on in-field observations to make well-informed crop management decisions to maximize profit and minimize adverse environmental impact. However, obtaining real-world crop state measurements is labor-intensive, time-consuming and expensive. In most cases, it is not feasible to gather crop state measurements before every decision moment. Moreover, in previous research pertaining to farm management optimization, these observations are often assumed to be readily available without any cost, which is unrealistic. Hence, enabling optimization without the need to have *temporally complete* crop state observations is important. An approach to that problem is to include measuring as part of decision making. As a solution, we apply reinforcement learning (RL) to recommend opportune moments to simultaneously measure crop features and apply nitrogen fertilizer. With realistic considerations, we design an RL environment with explicit crop feature measuring costs. While balancing costs, we find that an RL agent, trained with recurrent PPO, discovers adaptive measuring policies that follow critical crop development stages, with results aligned by what domain experts would consider a sensible approach. Our results highlight the importance of measuring when crop feature measurements are not readily available.

Subject: AAAI.2025 - AI for Social Impact Track

#8 ScamNet: Toward Explainable Large Language Model-Based Fraudulent Shopping Website Detection [PDF] [Copy] [Kimi] [REL]

Authors: Marzieh Bitaab, Alireza Karimi, Zhuoer Lyu, Ahmadreza Mosallanezhad, Adam Oest, Ruoyu Wang, Tiffany Bao, Yan Shoshitaishvili, Adam Doupé

Fraudulent shopping websites pose a significant threat to online consumers and legitimate businesses: in 2023, victims of such scams reported $392 million in losses to the Federal Trade Commission. This alarming trend not only impacts individuals but also erodes societal trust in e-commerce, necessitating urgent countermeasures. While previous studies have attempted to identify these fraudulent websites at scale, they face limitations such as potential bias in data collection, overreliance on easily manipulated features, and the lack of explainable results. This study explores the potential of Large Language Models (LLMs) in identifying fraudulent shopping websites, revealing that current LLMs underperform compared to existing machine learning models. To address this, we propose ScamNet, a fine-tuned LLM for explainable fraudulent shopping website detection. Our experimental results on real-world datasets demonstrate a breakthrough in detection performance from 22.35% detection rate to 95.59%, particularly in identifying subtle deceptive tactics such as using a legitimate-looking website template. ScamNet offers interpretable insights into its decision-making process, enhancing transparency and overcoming a key limitation of previous approaches.

Subject: AAAI.2025 - AI for Social Impact Track

#9 Evaluating Index-based Treatment Allocation in Underresourced Communities [PDF] [Copy] [Kimi] [REL]

Authors: Niclas Boehmer, Yash Nair, Sanket Shah, Lucas Janson, Aparna Taneja, Milind Tambe

In many applications of AI for Social Impact (e.g., when allocating spots in support programs for underserved communities), resources are scarce and an allocation policy is needed to decide who receives a resource. Before being deployed at scale, a rigorous evaluation of an AI-powered allocation policy is vital. In this paper, we introduce the methods necessary to evaluate index-based allocation policies, which allocate a limited number of resources to those who need them the most. Such policies create dependencies between agents, rendering standard statistical tests invalid and ineffective. Addressing the arising practical and technical challenges, we describe an efficient estimator and methods for drawing valid statistical conclusions. Our extensive experiments validate our methodology in practical settings while also showcasing its statistical power. We conclude by proposing and empirically verifying extensions of our methodology that enable us to reevaluate a past randomized control trial conducted with 10000 beneficiaries for a mHealth program for pregnant women. Our new methodology allows us to draw previously invisible conclusions when comparing two different ML allocation policies.

Subject: AAAI.2025 - AI for Social Impact Track

#10 FoMo: Multi-Modal, Multi-Scale and Multi-Task Remote Sensing Foundation Models for Forest Monitoring [PDF] [Copy] [Kimi] [REL]

Authors: Nikolaos Ioannis Bountos, Arthur Ouaknine, Ioannis Papoutsis, David Rolnick

Forests are vital to ecosystems, supporting biodiversity and essential services, but are rapidly changing due to land use and climate change. Understanding and mitigating negative effects requires parsing data on forests at global scale from a broad array of sensory modalities, and using them in diverse forest monitoring applications. Such diversity in data and applications can be effectively addressed through the development of a large, pre-trained foundation model that serves as a versatile base for various downstream tasks. However, remote sensing modalities, which are an excellent fit for several forest management tasks, are particularly challenging considering the variation in environmental conditions, object scales, image acquisition modes and spatio-temporal resolutions, etc. With that in mind, we present the first unified Forest Monitoring Benchmark (FoMo-Bench), carefully constructed to evaluate foundation models with such flexibility. FoMo-Bench consists of 15 diverse datasets encompassing satellite, aerial, and inventory data, covering a variety of geographical regions, and including multispectral, red-green-blue, synthetic aperture radar and LiDAR data with various temporal, spatial and spectral resolutions. FoMo-Bench includes multiple types of forest-monitoring tasks, spanning classification, segmentation, and object detection. To enhance task and geographic diversity in FoMo-Bench, we introduce TalloS, a global dataset combining satellite imagery with ground-based annotations for tree species classification across 1,000+ categories and hierarchical taxonomic levels. Finally, we propose FoMo-Net, a pre-training framework to develop foundation models with the capacity to process any combination of commonly used modalities and spectral bands in remote sensing. This work aims to inspire research collaborations between machine learning and forest biology researchers in exploring scalable multi-modal and multi-task models for forest monitoring and beyond. All code, data and appendices are published in the repository and on ArXiv.

Subject: AAAI.2025 - AI for Social Impact Track

#11 PhishAgent: A Robust Multimodal Agent for Phishing Webpage Detection [PDF] [Copy] [Kimi] [REL]

Authors: Tri Cao, Chengyu Huang, Yuexin Li, Wang Huilin, Amy He, Nay Oo, Bryan Hooi

Phishing attacks are a major threat to online security, exploiting user vulnerabilities to steal sensitive information. Various methods have been developed to counteract phishing, each with varying levels of accuracy, but they also face notable limitations. In this study, we introduce PhishAgent, a multimodal agent that combines a wide range of tools, integrating both online and offline knowledge bases with Multimodal Large Language Models (MLLMs). This combination leads to broader brand coverage, which enhances brand recognition and recall. Furthermore, we propose a multimodal information retrieval framework designed to extract the relevant top k items from offline knowledge bases, using available information from a webpage, including logos and HTML. Our empirical results, based on three real-world datasets, demonstrate that the proposed framework significantly enhances detection accuracy and reduces both false positives and false negatives, while maintaining model efficiency. Additionally, PhishAgent shows strong resilience against various types of adversarial attacks.

Subject: AAAI.2025 - AI for Social Impact Track

#12 Learning Production Functions for Supply Chains with Graph Neural Networks [PDF] [Copy] [Kimi] [REL]

Authors: Serina Chang, Zhiyin Lin, Benjamin Yan, Swapnil Bembde, Qi Xiu, Chi Heem Wong, Yu Qin, Frank Kloster, Xi Luo, Raj Palleti, Jure Leskovec

The global economy relies on the flow of goods over supply chain networks, with nodes as firms and edges as transactions between firms. While we may observe these external transactions, they are governed by unseen production functions, which determine how firms internally transform the input products they receive into output products that they sell. In this setting, it can be extremely valuable to infer these production functions, to better understand and improve supply chains, and to forecast future transactions more accurately. However, existing graph neural networks (GNNs) cannot capture these hidden relationships between nodes’ inputs and outputs. Here, we introduce a new class of models for this setting, by combining temporal GNNs with a novel inventory module, which learns production functions via attention weights and a special loss function. We evaluate our models extensively on real supply chains data, along with data generated from our new open-source simulator, SupplySim. Our models successfully infer production functions, outperforming the strongest baseline by 6-50% (across datasets), and forecast future transactions, outperforming the strongest baseline by 11-62%.

Subject: AAAI.2025 - AI for Social Impact Track

#13 Enhancing Privacy in the Early Detection of Sexual Predators Through Federated Learning and Differential Privacy [PDF] [Copy] [Kimi] [REL]

Authors: Khaoula Chehbouni, Martine de Cock, Gilles Caporossi, Afaf Taik, Reihaneh Rabbany, Golnoosh Farnadi

The increased screen time and isolation caused by the COVID-19 pandemic have led to a significant surge in cases of online grooming, which is the use of strategies by predators to lure children into sexual exploitation. Previous efforts to detect grooming in industry and academia have involved accessing and monitoring private conversations through centrally-trained models or sending private conversations to a global server. In this work, we implement a privacy-preserving pipeline for the early detection of sexual predators. We leverage federated learning and differential privacy in order to create safer online spaces for children while respecting their privacy. We investigate various privacy-preserving implementations and discuss their benefits and shortcomings. Our extensive evaluation using real-world data proves that privacy and utility can coexist with only a slight reduction in utility.

Subject: AAAI.2025 - AI for Social Impact Track

#14 Sim911: Towards Effective and Equitable 9-1-1 Dispatcher Training with an LLM-Enabled Simulation [PDF] [Copy] [Kimi] [REL]

Authors: Zirong Chen, Elizabeth Chason, Noah Mladenovski, Erin Wilson, Kristin Mullen, Stephen Martini, Meiyi Ma

Emergency response services are vital for enhancing public safety by safeguarding the environment, property, and human lives. As frontline members of these services, 9-1-1 dispatchers have a direct impact on response times and the overall effectiveness of emergency operations. However, traditional dispatcher training methods, which rely on role-playing by experienced personnel, are labor-intensive, time-consuming, and often neglect the specific needs of underserved communities. To address these challenges, we introduce Sim911, the first training simulation for 9-1-1 dispatchers powered by Large Language Models (LLMs). Sim911 enhances training through three key technical innovations: (1) knowledge construction, which utilizes archived 9-1-1 call data to generate simulations that closely mirror real-world scenarios; (2) context-aware controlled generation, which employs dynamic prompts and vector bases to ensure that LLM behavior aligns with training objectives; and (3) validation with looped correction, which filters out low-quality responses and refines the system performance. Experimental results show Sim911's superior performance in effectiveness and equity. Beyond its technical advancements, Sim911 delivers significant social impacts. Successfully deployed in the Metro X of Emergency Communications (MXDEC)(PS: To ensure a double-blind review, we refer to the city as 'City X,' a mid-sized U.S. city with a population of over 700,000. Its Metro Department of Emergency Communications (MXDEC) employs around 80 dispatchers and call-takers. For the rest of the paper, we refer to MXDEC as 'DEC.') Sim911 has been integrated into multiple training sessions, saving time for dispatchers. By supporting a diverse range of incident types and caller tags, Sim911 provides more realistic and inclusive training experiences. In a conducted user study, 90.00 percent of participants found Sim911 to be as effective or even superior to traditional human-led training, making it a valuable tool for emergency communications centers nationwide, particularly those facing staffing challenges.

Subject: AAAI.2025 - AI for Social Impact Track

#15 Uncertainty-aware Knowledge Tracing [PDF] [Copy] [Kimi] [REL]

Authors: Weihua Cheng, Hanwen Du, Chunxiao Li, Ersheng Ni, Liangdi Tan, Tianqi Xu, Yongxin Ni

Knowledge Tracing (KT) is crucial in education assessment, which focuses on depicting students' learning states and assessing students' mastery of subjects. With the rise of modern online learning platforms, particularly massive open online courses (MOOCs), an abundance of interaction data has greatly advanced the development of the KT technology. Previous research commonly adopts deterministic representation to capture students' knowledge states, which neglects the uncertainty during student interactions and thus fails to model the true knowledge state in learning process. In light of this, we propose an Uncertainty-Aware Knowledge Tracing model (UKT) which employs stochastic distribution embeddings to represent the uncertainty in student interactions, with a Wasserstein self-attention mechanism designed to capture the transition of state distribution in student learning behaviors. Additionally, we introduce the aleatory uncertainty-aware contrastive learning loss, which strengthens the model's robustness towards different types of uncertainties. Extensive experiments on six real-world datasets demonstrate that UKT not only significantly surpasses existing deep learning-based models in KT prediction, but also shows unique advantages in handling the uncertainty of student interactions.

Subject: AAAI.2025 - AI for Social Impact Track

#16 Leveraging Computer Vision and Visual LLMs for Cost-Effective and Consistent Street Food Safety Assessment in Kolkata India [PDF] [Copy] [Kimi] [REL]

Authors: Alexey Chernikov, Klaus Ackermann, Caitlin Brown, Denni Tommasi

Ensuring street food safety in developing countries is crucial due to the high prevalence of foodborne illnesses. Traditional methods of food safety assessments face challenges such as resource constraints, logistical issues, and subjective biases influenced by surveyors personal lived experiences, particularly when interacting with local communities. For instance, a local food safety inspector may inadvertently overrate the quality of infrastructure due to prior familiarity or past purchases, thereby compromising objective assessment. This subjectivity highlights the necessity for technologies that reduce human biases and enhance the accuracy of survey data across various domains. This paper proposes a novel approach based on a combination of Computer Vision and a lightweight Visual Large Language Model (VLLM) to automate the detection and analysis of critical food safety infrastructure in street food vendor environments at a field experiment in Kolkata, India. The system utilises a three-stage object extraction pipeline from the video to identify, extract and select unique representations of critical elements such as hand-washing stations, dishwashing areas, garbage bins, and water tanks. These four infrastructure items are crucial for maintaining safe food practices, irrespective of the specific methods employed by the vendors. A VLLM then analyses the extracted representations to assess compliance with food safety standards. Notably, over half of the pipeline can be processed using a user's smartphone, significantly reducing government server workload. By leveraging this decentralised approach, the proposed system decreases the analysis cost by many orders of magnitude compared to alternatives like ChatGPT or Claude 3.5. Additionally, processing data on local government servers provides better privacy and security than cloud platforms, addressing critical ethical considerations. This automated approach significantly improves efficiency, consistency, and scalability, providing a robust solution to enhance public health outcomes in developing regions.

Subject: AAAI.2025 - AI for Social Impact Track

#17 Optimizing Heat Alert Issuance with Reinforcement Learning [PDF] [Copy] [Kimi¹] [REL]

Authors: Ellen M. Considine, Rachel C. Nethery, Gregory A. Wellenius, Francesca Dominici, Mauricio Tec

A key strategy in societal adaptation to climate change is using alert systems to prompt preventative action and reduce the adverse health impacts of extreme heat events. This paper implements and evaluates reinforcement learning (RL) as a tool to optimize the effectiveness of such systems. Our contributions are threefold. First, we introduce a new publicly available RL environment enabling the evaluation of the effectiveness of heat alert policies to reduce heat-related hospitalizations. The rewards model is trained from a comprehensive dataset of historical weather, Medicare health records, and socioeconomic/geographic features. We use scalable Bayesian techniques tailored to the low-signal effects and spatial heterogeneity present in the data. The transition model uses real historical weather patterns enriched by a data augmentation mechanism based on climate region similarity. Second, we use this environment to evaluate standard RL algorithms in the context of heat alert issuance. Our analysis shows that policy constraints are needed to improve RL's initially poor performance. Third, a post-hoc contrastive analysis provides insight into scenarios where our modified heat alert-RL policies yield significant gains/losses over the current National Weather Service alert policy in the United States.

Subject: AAAI.2025 - AI for Social Impact Track

#18 PrecipDiff: Leveraging Image Diffusion Models to Enhance Satellite-Based Precipitation Observations [PDF] [Copy] [Kimi] [REL]

Authors: Ting-Yu Dai, Hayato Ushijima-Mwesigwa

A recent report from the World Meteorological Organization (WMO) highlights that water-related disasters have caused the highest human losses among natural disasters over the past 50 years, with over 91\% of deaths occurring in low-income countries. This disparity is largely due to the lack of adequate ground monitoring stations, such as weather surveillance radars (WSR), which are expensive to install. For example, while the US and Europe combined possess over 600 WSRs, Africa, despite having almost one and half times their landmass, has fewer than 40. To address this issue, satellite-based observations offer a global, near-real-time monitoring solution. However, they face several challenges like accuracy, bias, and low spatial resolution. This study leverages the power of diffusion models and residual learning to address these limitations in a unified framework. We introduce the first diffusion model for correcting the inconsistency between different precipitation products. Our method demonstrates the effectiveness in downscaling satellite precipitation estimates from 10 km to 1 km resolution. Extensive experiments conducted in the Seattle region demonstrate significant improvements in accuracy, bias reduction, and spatial detail. Importantly, our approach achieves these results using only precipitation data, showcasing the potential of a purely computer vision-based approach for enhancing satellite precipitation products and paving the way for further advancements in this domain.

Subject: AAAI.2025 - AI for Social Impact Track

#19 RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios? [PDF] [Copy] [Kimi] [REL]

Authors: Adrian de Wynter, Ishaan Watts, Tua Wongsangaroonsri, Minghui Zhang, Noura Farra, Nektar Ege Altıntoprak, Lena Baur, Samantha Claudet, Pavel Gajdušek, Qilong Gu, Anna Kaminska, Tomasz Kaminski, Ruby Kuo, Akiko Kyuba, Jongho Lee, Kartik Mathur, Petter Merok, Ivana Milovanović, Nani Paananen, Vesa-Matti Paananen, Anna Pavlenko, Bruno Pereira Vidal, Luciano Ivan Strika, Yueh Tsao, Davide Turcato, Oleksandr Vakhno, Judit Velcsov, Anna Vickers, Stéphanie F. Visser, Herdyan Widarmanto, Andrey Zaikin, Si-Qing Chen

Large language models (LLMs) and small language models (SLMs) are being adopted at remarkable speed, although their safety still remains a serious concern. With the advent of multilingual S/LLMs, the question now becomes a matter of scale: can we expand multilingual safety evaluations of these models with the same velocity at which they are deployed? To this end, we introduce RTP-LX, a human-transcreated and human-annotated corpus of toxic prompts and outputs in 28 languages. RTP-LX follows participatory design practices, and a portion of the corpus is especially designed to detect culturally-specific toxic language. We evaluate 10 S/LLMs on their ability to detect toxic content in a culturally-sensitive, multilingual scenario. We find that, although they typically score acceptably in terms of accuracy, they have low agreement with human judges when scoring holistically the toxicity of a prompt; and have difficulty discerning harm in context-dependent scenarios, particularly with subtle-yet-harmful content (e.g. microaggressions, bias). We release this dataset to contribute to further reduce harmful uses of these models and improve their safe deployment.

Subject: AAAI.2025 - AI for Social Impact Track

#20 Reference-Based Post-OCR Processing with LLM for Precise Diacritic Text in Historical Document Recognition [PDF] [Copy] [Kimi] [REL]

Authors: Thao Do, Dinh Phu Tran, An Vo, Daeyoung Kim

Extracting fine-grained OCR text from aged documents in diacritic languages remains challenging due to unexpected artifacts, time-induced degradation, and lack of datasets. While standalone spell correction approaches have been proposed, they show limited performance for historical documents due to numerous possible OCR error combinations and differences between modern and classical corpus distributions. We propose a method utilizing available content-focused ebooks as a reference base to correct imperfect OCR-generated text, supported by large language models. This technique generates high-precision pseudo-page-to-page labels for diacritic languages, where small strokes pose significant challenges in historical conditions. The pipeline eliminates various types of noise from aged documents and addresses issues such as missing characters, words, and disordered sequences. Our post-processing method, which generated a large OCR dataset of classical Vietnamese books, achieved a mean grading score of 8.72 on a 10-point scale. This outperformed the state-of-the-art transformer-based Vietnamese spell correction model, which scored 7.03, when evaluated on a sampled subset of the dataset. We also trained a baseline OCR model to assess and compare it with well-known engines. Experimental results demonstrate the strength of our baseline model compared to widely used open-source solutions. The resulting dataset will be released publicly to support future studies.

Subject: AAAI.2025 - AI for Social Impact Track

#21 Distribution-Free Uncertainty Quantification in Mechanical Ventilation Treatment: A Conformal Deep Q-Learning Framework [PDF] [Copy] [Kimi] [REL]

Authors: Niloufar Eghbali, Tuka Alhanai, Mohammad M. Ghassemi

Mechanical Ventilation (MV) is a critical life-support intervention in intensive care units (ICUs). However, optimal ventilator settings are challenging to determine because of the complexity of balancing patient-specific physiological needs with the risks of adverse outcomes that impact morbidity, mortality, and healthcare costs. This study introduces ConformalDQN, a novel distribution-free conformal deep Q-learning approach for optimizing mechanical ventilation in intensive care units. By integrating conformal prediction with deep reinforcement learning, our method provides reliable uncertainty quantification, addressing the challenges of Q-value overestimation and out-of-distribution actions in offline settings. We trained and evaluated our model using ICU patient records from the MIMIC-IV database. ConformalDQN extends the Double DQN architecture with a conformal predictor and employs a composite loss function that balances Q-learning with well-calibrated probability estimation. This enables uncertainty-aware action selection, allowing the model to avoid potentially harmful actions in unfamiliar states and handle distribution shifts by being more conservative in out-of-distribution scenarios. Evaluation against baseline models, including physician policies, policy constraint methods, and behavior cloning, demonstrates that ConformalDQN consistently makes recommendations within clinically safe and relevant ranges, outperforming other methods by increasing the 90-day survival rate. Notably, our approach provides an interpretable measure of confidence in its decisions, which is crucial for clinical adoption and potential human-in-the-loop implementations.

Subject: AAAI.2025 - AI for Social Impact Track

#22 Multi-Scale Graph Learning for Anti-Sparse Downscaling [PDF] [Copy] [Kimi] [REL]

Authors: Yingda Fan, Runlong Yu, Janet R. Barclay, Alison P. Appling, Yiming Sun, Yiqun Xie, Xiaowei Jia

Water temperature can vary substantially even across short distances within the same sub-watershed. Accurate prediction of stream water temperature at fine spatial resolutions (i.e., fine scales, ≤ 1 km) enables precise interventions to maintain water quality and protect aquatic habitats. Although spatiotemporal models have made substantial progress in spatially coarse time series modeling, challenges persist in predicting at fine spatial scales due to the lack of data at that scale. To address the problem of insufficient fine-scale data, we propose a Multi-Scale Graph Learning (MSGL) method. This method employs a multi-task learning framework where coarse-scale graph learning, bolstered by larger datasets, simultaneously enhances fine-scale graph learning. Although existing multi-scale or multi-resolution methods integrate data from different spatial scales, they often overlook the spatial correspondences across graph structures at various scales. To address this, our MSGL introduces an additional learning task, cross-scale interpolation learning, which leverages the hydrological connectedness of stream locations across coarse- and fine-scale graphs to establish cross-scale connections, thereby enhancing overall model performance. Furthermore, we have broken free from the mindset that multi-scale learning is limited to synchronous training by proposing an Asynchronous Multi-Scale Graph Learning method (ASYNC-MSGL). Extensive experiments demonstrate the state-of-the-art performance of our method for anti-sparse downscaling of daily stream temperatures in the Delaware River Basin, USA, highlighting its potential utility for water resources monitoring and management.

Subject: AAAI.2025 - AI for Social Impact Track

#23 A Spatio-temporal Cluster-aware Supervised Learning Framework for Predicting County-level Drug Overdose Deaths [PDF] [Copy] [Kimi] [REL]

Authors: Zixuan Feng, Qing Ye, Weijun Xie, Qiushi Chen

The soaring drug overdose crisis in the United States has claimed more than half a million lives in the past decade and remains a major public health threat. The ability to predict drug overdose deaths at the county level can help local communities develop action plans in response to emerging changes. Applying off-the-shelf machine learning algorithms for prediction can be challenging due to the heterogeneous risk profiles of the counties and suppressed data in common publicly available data sources. To fill these gaps, we develop a cluster-aware supervised learning (CASL) framework to enhance the prediction of county-level drug overdose deaths. This CASL model simultaneously clusters counties into groups based on geographical and socioeconomic characteristics and minimizes the loss function that accounts for suppressed values and cluster-specific regularization. Our computational study uses real-world data from 2010 to 2021, focusing on the ten states most severely impacted by the drug overdose crisis. The results demonstrate that our proposed CASL framework significantly outperforms state-of-the-art methods by achieving a superior balance in prediction accuracy for both unsuppressed and suppressed observations. The proposed model also identifies different clusters of counties, capturing heterogeneous patterns of overdose mortality among counties of diverse characteristics.

Subject: AAAI.2025 - AI for Social Impact Track

#24 Constraint Optimisation Approaches for Designing Group-Living Captive Breeding Programmes [PDF¹] [Copy] [Kimi] [REL]

Authors: Matthew Forshaw, Rachel Gray, Bridgett vonHoldt, Alexander Ochoa, Joshua M. Miller, Kristin E. Brzeski, Adalgisa Caccone, Evelyn L. Jensen

Captive breeding programs play a critical role in combating the ongoing biodiversity crisis by preserving the most endangered species and supporting reintroduction efforts. Maintaining the genetic health of captive populations requires careful management to prevent inbreeding and maximize the effective population size. Decisions about which males and females should be bred together are guided by the principle of minimizing relatedness between pairs. Methods to select breeding pairs are well developed, however, some species' ecology requires them to live in groups, and evaluating optimal groupings of multiple males and females that would be suitable to breed together is a more complex problem. Current computational tools to support the design of group-living captive breeding programs suffer from challenges of scalability and flexibility. In this paper we demonstrate the applicability of constraint programming (CP) approaches to optimize breeding groups to minimize relatedness. We present the example of the Galapagos giant tortoises as the test case used to develop our approach. Exploration of the needs of this captive breeding program has informed the development of our flexible approach to capture the constraints on viable captive breeding program design. Our findings have directly informed the implementation of new group configurations at the captive breeding centre. We further demonstrate that our approach is broadly applicable in other contexts through a second case study, providing multi-objective optimisation of a breeding program of canids. Through these case studies and an ablation study using synthetic datasets, we show that our constraint optimisation approach provides an expressive and generalizable means to support captive breeding program design, including scaling to large captive populations, which are currently intractable using current computational methods.

Subject: AAAI.2025 - AI for Social Impact Track

#25 PersonalizedUS: Interpretable Breast Cancer Risk Assessment with Local Coverage Uncertainty Quantification [PDF] [Copy] [Kimi] [REL]

Authors: Alek Fröhlich, Thiago Ramos, Gustavo Motta Cabello Dos Santos, Isabela Panzeri Carlotti Buzatto, Rafael Izbicki, Daniel Guimarães Tiezzi

Correctly assessing the malignancy of breast lesions identified during ultrasound examinations is crucial for effective clinical decision-making. However, the current "gold standard" relies on manual BI-RADS scoring by clinicians, often leading to unnecessary biopsies and a significant mental health burden on patients and their families. In this paper, we introduce PersonalizedUS, an interpretable machine learning system that leverages recent advances in conformal prediction to provide precise and personalized risk estimates with local coverage guarantees and sensitivity, specificity, and predictive values above 0.9 across various threshold levels. In particular, we identify meaningful lesion subgroups where distribution-free, model-agnostic conditional coverage holds, with approximately 90% of our prediction sets containing only the ground truth in most lesion subgroups, thus explicitly characterizing for which patients the model is most suitably applied. Moreover, we make available a curated tabular dataset of 1936 biopsied breast lesions from a recent observational multicenter study and benchmark the performance of several state-of-the-art learning algorithms. We also report a successful case study of the deployed system in the same multicenter context. Concrete clinical benefits include up to a 65% reduction in requested biopsies among BI-RADS 4a and 4b lesions, with minimal to no missed cancer cases.

Subject: AAAI.2025 - AI for Social Impact Track