Artificial Intelligence | Cool Papers - Immersive Paper Discovery

#1 Multi-modal AI for comprehensive breast cancer prognostication

Authors: Jan Witowski ; Ken Zeng ; Joseph Cappadona ; Jailan Elayoubi ; Elena Diana Chiru ; Nancy Chan ; Young-Joon Kang ; Frederick Howard ; Irina Ostrovnaya ; Carlos Fernandez-Granda ; Freya Schnabel ; Ugur Ozerdem ; Kangning Liu ; Zoe Steinsnyder ; Nitya Thakore ; Mohammad Sadic ; Frank Yeung ; Elisa Liu ; Theodore Hill ; Benjamin Swett ; Danielle Rigau ; Andrew Clayburn ; Valerie Speirs ; Marcus Vetter ; Lina Sojak ; Simone Muenst Soysal ; Daniel Baumhoer ; Khalil Choucair ; Yu Zong ; Lina Daoud ; Anas Saad ; Waleed Abdulsattar ; Rafic Beydoun ; Jia-Wern Pan ; Haslina Makmur ; Soo-Hwang Teo ; Linda Ma Pak ; Victor Angel ; Dovile Zilenaite-Petrulaitiene ; Arvydas Laurinavicius ; Natalie Klar ; Brian D. Piening ; Carlo Bifulco ; Sun-Young Jun ; Jae Pak Yi ; Su Hyun Lim ; Adam Brufsky ; Francisco J. Esteva ; Lajos Pusztai ; Yann LeCun ; Krzysztof J. Geras

Treatment selection in breast cancer is guided by molecular subtypes and clinical characteristics. Recurrence risk assessment plays a crucial role in personalizing treatment. Current methods, including genomic assays, have limited accuracy and clinical utility, leading to suboptimal decisions for many patients. We developed a test for breast cancer patient stratification based on digital pathology and clinical characteristics using novel AI methods. Specifically, we utilized a vision transformer-based pan-cancer foundation model trained with self-supervised learning to extract features from digitized H&E-stained slides. These features were integrated with clinical data to form a multi-modal AI test predicting cancer recurrence and death. The test was developed and evaluated using data from a total of 8,161 breast cancer patients across 15 cohorts originating from seven countries. Of these, 3,502 patients from five cohorts were used exclusively for evaluation, while the remaining patients were used for training. Our test accurately predicted our primary endpoint, disease-free interval, in the five external cohorts (C-index: 0.71 [0.68-0.75], HR: 3.63 [3.02-4.37, p<0.01]). In a direct comparison (N=858), the AI test was more accurate than Oncotype DX, the standard-of-care 21-gene assay, with a C-index of 0.67 [0.61-0.74] versus 0.61 [0.49-0.73], respectively. Additionally, the AI test added independent information to Oncotype DX in a multivariate analysis (HR: 3.11 [1.91-5.09, p<0.01]). The test demonstrated robust accuracy across all major breast cancer subtypes, including TNBC (C-index: 0.71 [0.62-0.81], HR: 3.81 [2.35-6.17, p=0.02]), where no diagnostic tools are currently recommended by clinical guidelines. These results suggest that our AI test can improve accuracy, extend applicability to a wider range of patients, and enhance access to treatment selection tools.

Subjects: Artificial Intelligence ; Computer Vision and Pattern Recognition ; Image and Video Processing

Publish: 2024-10-28 17:54:29 UTC
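
The abstract of entry #1 reports its main results as C-index values. As a reading aid, here is a minimal sketch of how Harrell's concordance index can be computed for right-censored survival data. It is not the authors' evaluation code; the function name `concordance_index` and the toy inputs are assumptions made for illustration, and the quadratic pair loop is only suitable for small samples.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data (illustrative O(n^2) sketch).

    times  -- observed follow-up time per patient
    events -- 1 if the event (e.g. recurrence) was observed, 0 if censored
    risks  -- predicted risk score; higher means earlier expected event
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the patient with the shorter time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the times differ and the
        # shorter time corresponds to an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0  # higher risk failed first: concordant
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four patients, the third one censored.
c = concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.4, 0.5, 0.1])
print(c)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported 0.71 and 0.67 versus 0.61 should be read.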

#2 Hierarchical Knowledge Graph Construction from Images for Scalable E-Commerce

Authors: Zhantao Yang ; Han Zhang ; Fangyi Chen ; Anudeepsekhar Bolimera ; Marios Savvides

Knowledge graphs (KGs) play an increasingly important role in various AI systems. For e-commerce, an efficient and low-cost automated knowledge graph construction method is the foundation for many successful downstream applications. In this paper, we propose a novel method for constructing structured product knowledge graphs from raw product images. The method jointly leverages recent advances in vision-language models (VLMs) and large language models (LLMs), fully automating the process and allowing timely graph updates. We also present a human-annotated e-commerce product dataset for benchmarking product property extraction in knowledge graph construction. Our method outperforms our baseline on all metrics and evaluated properties, demonstrating its effectiveness and practical potential.

Subject: Artificial Intelligence

Publish: 2024-10-28 17:34:05 UTC

#3 Towards Unifying Evaluation of Counterfactual Explanations: Leveraging Large Language Models for Human-Centric Assessments

Authors: Marharyta Domnich ; Julius Valja ; Rasmus Moorits Veski ; Giacomo Magnifico ; Kadi Tulver ; Eduard Barbu ; Raul Vicente

As machine learning models evolve, maintaining transparency demands more human-centric explainable AI techniques. Counterfactual explanations, with roots in human reasoning, identify the minimal input changes needed to obtain a given output and, hence, are crucial for supporting decision-making. Despite their importance, the evaluation of these explanations often lacks grounding in user studies and remains fragmented, with existing metrics not fully capturing human perspectives. To address this challenge, we developed a diverse set of 30 counterfactual scenarios and collected ratings across 8 evaluation metrics from 206 respondents. Subsequently, we fine-tuned different Large Language Models (LLMs) to predict average or individual human judgment across these metrics. Our methodology allowed LLMs to achieve an accuracy of up to 63% in zero-shot evaluations and 85% (on a three-class prediction task) with fine-tuning across all metrics. The fine-tuned models predicting human ratings offer better comparability and scalability in evaluating different counterfactual explanation frameworks.

Subjects: Artificial Intelligence ; Computation and Language

Publish: 2024-10-28 15:33:37 UTC

#4 Learning to Handle Complex Constraints for Vehicle Routing Problems

Authors: Jieyi Bi ; Yining Ma ; Jianan Zhou ; Wen Song ; Zhiguang Cao ; Yaoxin Wu ; Jie Zhang

Vehicle Routing Problems (VRPs) can model many real-world scenarios and often involve complex constraints. While recent neural methods excel in constructing solutions based on feasibility masking, they struggle with handling complex constraints, especially when obtaining the masking itself is NP-hard. In this paper, we propose a novel Proactive Infeasibility Prevention (PIP) framework to advance the capabilities of neural methods towards more complex VRPs. Our PIP integrates the Lagrangian multiplier as a basis to enhance constraint awareness and introduces preventative infeasibility masking to proactively steer the solution construction process. Moreover, we present PIP-D, which employs an auxiliary decoder and two adaptive strategies to learn and predict these tailored masks, potentially enhancing performance while significantly reducing computational costs during training. To verify our PIP designs, we conduct extensive experiments on the highly challenging Traveling Salesman Problem with Time Window (TSPTW) and TSP with Draft Limit (TSPDL) variants under different constraint hardness levels. Notably, PIP is generic and can boost many neural methods, yielding both a significant reduction in infeasible rate and a substantial improvement in solution quality.

Subjects: Artificial Intelligence ; Machine Learning

Publish: 2024-10-28 14:26:54 UTC
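
To make the notion of feasibility masking in entry #4 concrete, the following sketch builds two masks for a toy TSPTW instance: a plain mask that only checks whether the next node can be reached before its deadline, and a one-step lookahead mask that also rejects moves after which some remaining node could no longer be reached in time. This is a hedged illustration and not the PIP implementation; the function names, the lookahead rule, and the toy instance are assumptions.

```python
def feasible_mask(current, now, visited, travel, windows):
    """True for each node that is unvisited and reachable before its deadline."""
    mask = []
    for node, (_earliest, latest) in enumerate(windows):
        arrival = now + travel[current][node]
        mask.append(node not in visited and arrival <= latest)
    return mask


def lookahead_mask(current, now, visited, travel, windows):
    """One-step lookahead (illustrative assumption): also forbid a move if,
    after making it, some other unvisited node would miss its deadline even
    when visited immediately afterwards."""
    base = feasible_mask(current, now, visited, travel, windows)
    mask = []
    for node, ok in enumerate(base):
        if not ok:
            mask.append(False)
            continue
        # Wait at the node if we arrive before its window opens.
        depart = max(now + travel[current][node], windows[node][0])
        rest = [k for k in range(len(windows)) if k not in visited and k != node]
        mask.append(all(depart + travel[node][k] <= windows[k][1] for k in rest))
    return mask


# Hypothetical 4-node instance; node 0 is the depot and already visited.
travel = [
    [0, 2, 2, 1],
    [2, 0, 2, 1],
    [2, 2, 0, 1],
    [1, 1, 1, 0],
]
windows = [(0, 100), (0, 10), (0, 3), (0, 20)]  # (earliest, latest) per node

print(feasible_mask(0, 0, {0}, travel, windows))
print(lookahead_mask(0, 0, {0}, travel, windows))
```

In this toy instance the plain mask allows moving to node 1, while the lookahead mask forbids it because node 2 (deadline 3) would then be unreachable. Even the lookahead check is only a necessary condition, which hints at why exact masking becomes hard for such constraints.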

#5 Neuro-symbolic Learning Yielding Logical Constraints

Authors: Zenan Li ; Yunpeng Huang ; Zhaoyu Li ; Yuan Yao ; Jingwei Xu ; Taolue Chen ; Xiaoxing Ma ; Jian Lu

Neuro-symbolic systems combine the abilities of neural perception and logical reasoning. However, end-to-end learning of neuro-symbolic systems is still an unsolved challenge. This paper proposes a natural framework that fuses neural network training, symbol grounding, and logical constraint synthesis into a coherent and efficient end-to-end learning process. The capability of this framework comes from the improved interactions between the neural and the symbolic parts of the system in both the training and inference stages. Technically, to bridge the gap between the continuous neural network and the discrete logical constraint, we introduce a difference-of-convex programming technique to relax the logical constraints while maintaining their precision. We also employ cardinality constraints as the language for logical constraint learning and incorporate a trust region method to avoid the degeneracy of logical constraint in learning. Both theoretical analyses and empirical evaluations substantiate the effectiveness of the proposed framework.

Subjects: Artificial Intelligence ; Machine Learning

Publish: 2024-10-28 12:18:25 UTC

#6 Active Legibility in Multiagent Reinforcement Learning

Authors: Yanyu Liu ; Yinghui Pan ; Yifeng Zeng ; Biyang Ma ; Doshi Prashant

Multiagent sequential decision problems arise in many critical applications, including urban transportation, autonomous driving, and military operations. Their widely known solution, multiagent reinforcement learning, has evolved tremendously in recent years. Among the available approaches, the paradigm of modeling other agents attracts our interest; it differs from traditional value decomposition or communication mechanisms. It enables agents to understand and anticipate others' behaviors and facilitates their collaboration. Inspired by recent research on legibility, which allows agents to reveal their intentions through their behavior, we propose a multiagent active legibility framework to improve agents' performance. The legibility-oriented framework allows agents to conduct legible actions so as to help others optimise their behaviors. In addition, we design a series of problem domains that emulate a common scenario and best characterize legibility in multiagent reinforcement learning. The experimental results demonstrate that the new framework is more efficient and requires less training time than several multiagent reinforcement learning algorithms.

Subject: Artificial Intelligence

Publish: 2024-10-28 12:15:49 UTC

#7 FACTS: A Factored State-Space Framework For World Modelling

Authors: Li Nanbo ; Firas Laakom ; Yucheng Xu ; Wenyi Wang ; Jürgen Schmidhuber

World modelling is essential for understanding and predicting the dynamics of complex systems by learning both spatial and temporal dependencies. However, current frameworks, such as Transformers and selective state-space models like Mambas, exhibit limitations in efficiently encoding spatial and temporal structures, particularly in scenarios requiring long-term high-dimensional sequence modelling. To address these issues, we propose a novel recurrent framework, the FACTored State-space (FACTS) model, for spatial-temporal world modelling. The FACTS framework constructs a graph-structured memory with a routing mechanism that learns permutable memory representations, ensuring invariance to input permutations while adapting through selective state-space propagation. Furthermore, FACTS supports parallel computation of high-dimensional sequences. We empirically evaluate FACTS across diverse tasks, including multivariate time series forecasting and object-centric world modelling, demonstrating that it consistently outperforms or matches specialised state-of-the-art models, despite its general-purpose world modelling design.

Subjects: Artificial Intelligence ; Machine Learning

Publish: 2024-10-28 11:04:42 UTC

#8 Active Causal Structure Learning with Latent Variables: Towards Learning to Detour in Autonomous Robots

Authors: Pablo de los Riscos ; Fernando Corbacho

Artificial General Intelligence (AGI) agents and robots must be able to cope with ever-changing environments and tasks. They must be able to actively construct new internal causal models of their interactions with the environment when structural changes take place in it. Thus, we claim that active causal structure learning with latent variables (ACSLWL) is a necessary component for building AGI agents and robots. This paper describes how a complex planning and expectation-based detour behavior can be learned by ACSLWL when, unexpectedly and for the first time, the simulated robot encounters a kind of transparent barrier in its path towards its target. ACSLWL consists of acting in the environment, discovering new causal relations, constructing new causal models, exploiting the causal models to maximize expected utility, detecting possible latent variables when unexpected observations occur, and constructing new structures (internal causal models and optimal estimates of the associated parameters) to cope efficiently with newly encountered situations. That is, the agent must be able to construct new internal causal models that transform a previously unexpected and inefficient (sub-optimal) situation into a predictable situation with an optimal operating plan.

Subjects: Artificial Intelligence ; Machine Learning

Publish: 2024-10-28 10:21:26 UTC

#9 Explainability in AI Based Applications: A Framework for Comparing Different Techniques

Authors: Arne Grobrugge ; Nidhi Mishra ; Johannes Jakubik ; Gerhard Satzger

The integration of artificial intelligence into business processes has significantly enhanced decision-making capabilities across various industries such as finance, healthcare, and retail. However, explaining the decisions made by these AI systems poses a significant challenge due to the opaque nature of recent deep learning models, which typically function as black boxes. To address this opacity, a multitude of explainability techniques have emerged. However, in practical business applications, the challenge lies in selecting an appropriate explainability method that balances comprehensibility with accuracy. This paper addresses the practical need to understand differences in the output of explainability techniques by proposing a novel method for assessing the agreement of different explainability techniques. Based on our proposed method, we provide a comprehensive comparative analysis of six leading explainability techniques to help guide the selection of such techniques in practice. Our proposed general-purpose method is evaluated on one of the most popular deep learning architectures, the Vision Transformer model, which is frequently employed in business applications. Notably, we propose a novel metric to measure the agreement of explainability techniques that can be interpreted visually. By providing a practical framework for understanding the agreement of diverse explainability techniques, our research aims to facilitate the broader integration of interpretable AI systems in business applications.

Subject: Artificial Intelligence

Publish: 2024-10-28 09:45:34 UTC

#10 Implementation and Application of an Intelligibility Protocol for Interaction with an LLM

Authors: Ashwin Srinivasan ; Karan Bania ; Shreyas V ; Harshvardhan Mestha ; Sidong Liu

Our interest is in constructing interactive systems involving a human expert interacting with a machine learning engine on data analysis tasks. This is of relevance when addressing complex problems arising in areas of science, the environment, medicine and so on, which are not immediately amenable to the usual methods of statistical or mathematical modelling. In such situations, it is possible that combining human expertise and creativity with modern machine-learning capabilities of identifying patterns by constructing new internal representations of the data may provide some insight into possible solutions. In this paper, we examine the implementation of an abstract protocol developed for interaction between agents, each capable of constructing predictions and explanations. The PXP protocol, described in [12], is motivated by the notion of "two-way intelligibility" and is specified using a pair of communicating finite-state machines. While the formalisation allows the authors to prove several properties about the protocol, no implementation was presented. Here, we address this shortcoming for the case in which one of the agents acts as a "generator" using a large language model (LLM) and the other is an agent that acts as a "tester" using either a human expert or a proxy for a human expert (for example, a database compiled using human expertise). We believe these use-cases will be a widely applicable form of interaction for problems of the kind mentioned above. We present an algorithmic description of a general-purpose implementation, and conduct preliminary experiments on its use in two different areas (radiology and drug discovery). The experimental results provide early evidence in support of the protocol's capability of capturing one- and two-way intelligibility in human-LLM interaction in the manner proposed in [12].

Subjects: Artificial Intelligence ; Human-Computer Interaction ; Machine Learning ; Multiagent Systems

Publish: 2024-10-27 21:20:18 UTC

#11 AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions

Authors: Ziming Li ; Qianbo Zang ; David Ma ; Jiawei Guo ; Tianyu Zheng ; Minghao Liu ; Xinyao Niu ; Xiang Yue ; Yue Wang ; Jian Yang ; Jiaheng Liu ; Wanjun Zhong ; Wangchunshu Zhou ; Wenhao Huang ; Ge Zhang

Data science tasks involving tabular data present complex challenges that require sophisticated problem-solving approaches. We propose AutoKaggle, a powerful and user-centric framework that assists data scientists in completing daily data pipelines through a collaborative multi-agent system. AutoKaggle implements an iterative development process that combines code execution, debugging, and comprehensive unit testing to ensure code correctness and logic consistency. The framework offers highly customizable workflows, allowing users to intervene at each phase, thus integrating automated intelligence with human expertise. Our universal data science toolkit, comprising validated functions for data cleaning, feature engineering, and modeling, forms the foundation of this solution, enhancing productivity by streamlining common tasks. We selected 8 Kaggle competitions to simulate data processing workflows in real-world application scenarios. Evaluation results demonstrate that AutoKaggle achieves a validation submission rate of 0.85 and a comprehensive score of 0.82 in typical data science pipelines, fully proving its effectiveness and practicality in handling complex data science tasks.

Subjects: Artificial Intelligence ; Computation and Language

Publish: 2024-10-27 12:44:25 UTC

#12 Effective Instruction Parsing Plugin for Complex Logical Query Answering on Knowledge Graphs

Authors: Xingrui Zhuo ; Jiapu Wang ; Gongqing Wu ; Shirui Pan ; Xindong Wu

Knowledge Graph Query Embedding (KGQE) aims to embed First-Order Logic (FOL) queries in a low-dimensional KG space for complex reasoning over incomplete KGs. To enhance the generalization of KGQE models, recent studies integrate various external information (such as entity types and relation context) to better capture the logical semantics of FOL queries. The whole process is commonly referred to as Query Pattern Learning (QPL). However, current QPL methods typically suffer from the pattern-entity alignment bias problem, so that the defective query patterns they learn limit the performance of KGQE models. To address this problem, we propose an effective Query Instruction Parsing Plugin (QIPP) that leverages the context awareness of Pre-trained Language Models (PLMs) to capture latent query patterns from code-like query instructions. Unlike the external information introduced by previous QPL methods, we first propose code-like instructions to express FOL queries in an alternative format. This format utilizes textual variables and nested tuples to convey the logical semantics within FOL queries, serving as raw material for a PLM-based instruction encoder to obtain complete query patterns. Building on this, we design a query-guided instruction decoder to adapt query patterns to KGQE models. To further enhance QIPP's effectiveness across various KGQE models, we propose a query pattern injection mechanism based on compressed optimization boundaries and an adaptive normalization component, allowing KGQE models to utilize query patterns more efficiently. Extensive experiments demonstrate that our plug-and-play method improves the performance of eight basic KGQE models and outperforms two state-of-the-art QPL methods.

Subject: Artificial Intelligence

Publish: 2024-10-27 03:18:52 UTC

#13 SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement

Authors: Antonis Antoniades ; Albert Örwall ; Kexun Zhang ; Yuxi Xie ; Anirudh Goyal ; William Wang

Software engineers operating in complex and dynamic environments must continuously adapt to evolving requirements, learn iteratively from experience, and reconsider their approaches based on new insights. However, current large language model (LLM)-based software agents often rely on rigid processes and tend to repeat ineffective actions without the capacity to evaluate their performance or adapt their strategies over time. To address these challenges, we propose SWE-Search, a multi-agent framework that integrates Monte Carlo Tree Search (MCTS) with a self-improvement mechanism to enhance software agents' performance on repository-level software tasks. SWE-Search extends traditional MCTS by incorporating a hybrid value function that leverages LLMs for both numerical value estimation and qualitative evaluation. This enables self-feedback loops where agents iteratively refine their strategies based on both quantitative numerical evaluations and qualitative natural language assessments of pursued trajectories. The framework includes a SWE-Agent for adaptive exploration, a Value Agent for iterative feedback, and a Discriminator Agent that facilitates multi-agent debate for collaborative decision-making. Applied to the SWE-bench benchmark, our approach demonstrates a 23% relative improvement in performance across five models compared to standard open-source agents without MCTS. Our analysis reveals how performance scales with increased search depth and identifies key factors that facilitate effective self-evaluation in software agents. This work highlights the potential of self-evaluation driven search techniques to enhance agent reasoning and planning in complex, dynamic software engineering environments.

Subject: Artificial Intelligence

Publish: 2024-10-26 22:45:56 UTC
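
Entry #13 builds on Monte Carlo Tree Search. For readers unfamiliar with it, the sketch below shows the standard UCT selection rule that classic MCTS uses to trade off exploitation and exploration when picking a child node. It does not reproduce SWE-Search's hybrid value function or its agents; the function names, the `(total_value, visits)` child representation, and the exploration constant are assumptions chosen for illustration.

```python
import math


def uct_score(total_value, visits, parent_visits, c=1.41):
    """Standard UCT score: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited children first
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits, c=1.41):
    """Return the index of the child with the highest UCT score.

    children -- list of (total_value, visits) tuples (hypothetical layout)
    """
    return max(
        range(len(children)),
        key=lambda i: uct_score(children[i][0], children[i][1], parent_visits, c),
    )


# Hypothetical statistics: a well-explored child and a rarely visited one.
children = [(4.0, 5), (0.5, 1)]
print(select_child(children, parent_visits=6))         # exploration bonus included
print(select_child(children, parent_visits=6, c=0.0))  # pure exploitation
```

With the exploration term enabled, the rarely visited child is preferred despite its lower mean value; with `c=0.0` the choice falls back to the highest mean. In a setup like the one the abstract describes, the value fed into such a rule would come from an LLM-based estimate rather than from random rollouts.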

#14 Rethinking the Uncertainty: A Critical Review and Analysis in the Era of Large Language Models

Authors: Mohammad Beigi ; Sijia Wang ; Ying Shen ; Zihao Lin ; Adithya Kulkarni ; Jianfeng He ; Feng Chen ; Ming Jin ; Jin-Hee Cho ; Dawei Zhou ; Chang-Tien Lu ; Lifu Huang

In recent years, Large Language Models (LLMs) have become fundamental to a broad spectrum of artificial intelligence applications. As the use of LLMs expands, precisely estimating the uncertainty in their predictions has become crucial. Current methods often struggle to accurately identify, measure, and address the true uncertainty, with many focusing primarily on estimating model confidence. This discrepancy is largely due to an incomplete understanding of where, when, and how uncertainties are injected into models. This paper introduces a comprehensive framework specifically designed to identify and understand the types and sources of uncertainty, aligned with the unique characteristics of LLMs. Our framework enhances the understanding of the diverse landscape of uncertainties by systematically categorizing and defining each type, establishing a solid foundation for developing targeted methods that can precisely quantify these uncertainties. We also provide a detailed introduction to key related concepts and examine the limitations of current methods in mission-critical and safety-sensitive applications. The paper concludes with a perspective on future directions aimed at enhancing the reliability and practical adoption of these methods in real-world scenarios.

Subject: Artificial Intelligence

Publish: 2024-10-26 15:07:15 UTC

#15 LLMs Can Evolve Continually on Modality for X-Modal Reasoning

Authors: Jiazuo Yu ; Haomiao Xiong ; Lu Zhang ; Haiwen Diao ; Yunzhi Zhuge ; Lanqing Hong ; Dong Wang ; Huchuan Lu ; You He ; Long Chen

Multimodal Large Language Models (MLLMs) have gained significant attention due to their impressive capabilities in multimodal understanding. However, existing methods rely heavily on extensive modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities. In this paper, we propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities that enables MLLMs to continually EVolve on modalities for X-modal reasoning. We leverage the concept of Continual Learning and develop an incremental training strategy atop pre-trained MLLMs, enabling their expansion to new modalities using uni-modal data, without executing joint-modal pretraining. In detail, a novel Adapter-in-Adapter (AnA) framework is introduced, in which uni-modal and cross-modal adapters are seamlessly integrated to facilitate efficient modality alignment and collaboration. Additionally, an MoE-based gating module is applied between the two types of adapters to further enhance the multimodal interaction. To investigate the proposed method, we establish a challenging benchmark called Continual Learning of Modality (MCL), which consists of high-quality QA data from five distinct modalities: image, video, audio, depth and point cloud. Extensive experiments demonstrate the effectiveness of the proposed AnA framework on learning plasticity and memory stability during continual learning. Furthermore, PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%. Our code is available at https://github.com/JiazuoYu/PathWeave

Subjects: Artificial Intelligence ; Computation and Language ; Computer Vision and Pattern Recognition ; Machine Learning

Publish: 2024-10-26 13:19:57 UTC

#16 MAD-Sherlock: Multi-Agent Debates for Out-of-Context Misinformation Detection

Authors: Kumud Lakara ; Juil Sock ; Christian Rupprecht ; Philip Torr ; John Collomosse ; Christian Schroeder de Witt

One of the most challenging forms of misinformation involves the out-of-context (OOC) use of images paired with misleading text, creating false narratives. Existing AI-driven detection systems lack explainability and require expensive fine-tuning. We address these issues with MAD-Sherlock: a Multi-Agent Debate system for OOC Misinformation Detection. MAD-Sherlock introduces a novel multi-agent debate framework where multimodal agents collaborate to assess contextual consistency and request external information to enhance cross-context reasoning and decision-making. Our framework enables explainable detection with state-of-the-art accuracy even without domain-specific fine-tuning. Extensive ablation studies confirm that external retrieval significantly improves detection accuracy, and user studies demonstrate that MAD-Sherlock boosts performance for both experts and non-experts. These results position MAD-Sherlock as a powerful tool for autonomous and citizen intelligence applications.

Subject: Artificial Intelligence

Publish: 2024-10-26 10:34:22 UTC

#17 Cooperative Strategic Planning Enhances Reasoning Capabilities in Large Language Models

Authors: Danqing Wang ; Zhuorui Ye ; Fei Fang ; Lei Li

Enhancing the reasoning capabilities of large language models (LLMs) is crucial for enabling them to tackle complex, multi-step problems. Multi-agent frameworks have shown great potential in enhancing LLMs' reasoning capabilities. However, the lack of effective cooperation between LLM agents hinders their performance, especially for multi-step reasoning tasks. This paper proposes a novel cooperative multi-agent reasoning framework (CoPlanner) by separating reasoning steps and assigning distinct duties to different agents. CoPlanner consists of two LLM agents: a planning agent and a reasoning agent. The planning agent provides high-level strategic hints, while the reasoning agent follows these hints and infers answers. By training the planning agent's policy through the interactive reasoning process via Proximal Policy Optimization (PPO), the LLaMA-3-8B-based CoPlanner outperforms the previous best method by 9.94% on LogiQA and 3.09% on BBH. Our results demonstrate that the guidance from the planning agent and the effective cooperation between the agents contribute to the superior performance of CoPlanner in tackling multi-step reasoning problems.

Subjects: Artificial Intelligence ; Computation and Language

Publish: 2024-10-25 23:32:48 UTC

#18 Language Agents Meet Causality -- Bridging LLMs and Causal World Models

Authors: John Gkountouras ; Matthias Lindemann ; Phillip Lippe ; Efstratios Gavves ; Ivan Titov

Large Language Models (LLMs) have recently shown great promise in planning and reasoning applications. These tasks demand robust systems, which arguably require a causal understanding of the environment. While LLMs can acquire and reflect common sense causal knowledge from their pretraining data, this information is often incomplete, incorrect, or inapplicable to a specific environment. In contrast, causal representation learning (CRL) focuses on identifying the underlying causal structure within a given environment. We propose a framework that integrates CRLs with LLMs to enable causally-aware reasoning and planning. This framework learns a causal world model, with causal variables linked to natural language expressions. This mapping provides LLMs with a flexible interface to process and generate descriptions of actions and states in text form. Effectively, the causal world model acts as a simulator that the LLM can query and interact with. We evaluate the framework on causal inference and planning tasks across temporal scales and environmental complexities. Our experiments demonstrate the effectiveness of the approach, with the causally-aware method outperforming LLM-based reasoners, especially for longer planning horizons.

Subjects: Artificial Intelligence ; Machine Learning ; Methodology

Publish: 2024-10-25 18:36:37 UTC

#19 Step Guided Reasoning: Improving Mathematical Reasoning using Guidance Generation and Step Reasoning

Authors: Lang Cao ; Chao Peng ; Yitong Li

Mathematical reasoning has been a challenging aspect of large language models (LLMs). However, the introduction of step-by-step Chain-of-Thought (CoT) inference has significantly advanced the mathematical capabilities of LLMs. Despite this progress, current approaches either require massive inference datasets as training datasets or rely on few-shot methods that often sacrifice accuracy. To address this bottleneck in mathematical reasoning, we propose a novel method called Step Guidance Reasoning, which does not involve further model fine-tuning. In this approach, LLMs reflect on small reasoning steps -- similar to how humans deliberate on and focus attention on what to do next. By incorporating this reflective process into the inference stage, LLMs can effectively guide their reasoning from one step to the next. Our method significantly improved math performance, raising the accuracy on the AMC23 dataset from 30% to 57.5%, a relative improvement of 91.7%, and on the sampled level 5 problems of the MATH dataset from 43% to 67%, a relative improvement of 55.8%.

Subjects: Artificial Intelligence ; Computation and Language ; Human-Computer Interaction

Publish: 2024-10-18 01:38:24 UTC

#20 ScreenWriter: Automatic Screenplay Generation and Movie Summarisation

Authors: Louis Mahon ; Mirella Lapata

The proliferation of creative video content has driven demand for textual descriptions or summaries that allow users to recall key plot points or get an overview without watching. The volume of movie content and speed of turnover motivate automatic summarisation, which is nevertheless challenging, requiring identifying character intentions and very long-range temporal dependencies. The few existing methods attempting this task rely heavily on textual screenplays as input, greatly limiting their applicability. In this work, we propose the task of automatic screenplay generation, and a method, ScreenWriter, that operates only on video and produces output which includes dialogue, speaker names, scene breaks, and visual descriptions. ScreenWriter introduces a novel algorithm to segment the video into scenes based on the sequence of visual vectors, and a novel method for the challenging problem of determining character names, based on a database of actors' faces. We further demonstrate how these automatic screenplays can be used to generate plot synopses with a hierarchical summarisation method based on scene breaks. We test the quality of the final summaries on the recent MovieSum dataset, which we augment with videos, and show that they are superior to a number of comparison models which assume access to gold-standard screenplays.

Subjects: Artificial Intelligence ; Computer Vision and Pattern Recognition ; Multimedia

Publish: 2024-10-17 07:59:54 UTC

#21 Integrating Reasoning Systems for Trustworthy AI, Proceedings of the 4th Workshop on Logic and Practice of Programming (LPOP)

Authors: Anil Nerode ; Yanhong A. Liu

This proceedings contains abstracts and position papers for the work to be presented at the fourth Logic and Practice of Programming (LPOP) Workshop. The workshop is to be held in Dallas, Texas, USA, and as a hybrid event, on October 13, 2024, in conjunction with the 40th International Conference on Logic Programming (ICLP). The focus of this workshop is integrating reasoning systems for trustworthy AI, especially including integrating diverse models of programming with rules and constraints.

Subjects: Artificial Intelligence ; Logic in Computer Science ; Programming Languages

Publish: 2024-10-01 19:36:08 UTC

#22 GPT-4o System Card

Authors: OpenAI : Aaron Hurst ; Adam Lerer ; Adam P. Goucher ; Adam Perelman ; Aditya Ramesh ; Aidan Clark ; AJ Ostrow ; Akila Welihinda ; Alan Hayes ; Alec Radford ; Aleksander Mądry ; Alex Baker-Whitcomb ; Alex Beutel ; Alex Borzunov ; Alex Carney ; Alex Chow ; Alex Kirillov ; Alex Nichol ; Alex Paino ; Alex Renzin ; Alex Tachard Passos ; Alexander Kirillov ; Alexi Christakis ; Alexis Conneau ; Ali Kamali ; Allan Jabri ; Allison Moyer ; Allison Tam ; Amadou Crookes ; Amin Tootoochian ; Amin Tootoonchian ; Ananya Kumar ; Andrea Vallone ; Andrej Karpathy ; Andrew Braunstein ; Andrew Cann ; Andrew Codispoti ; Andrew Galu ; Andrew Kondrich ; Andrew Tulloch ; Andrey Mishchenko ; Angela Baek ; Angela Jiang ; Antoine Pelisse ; Antonia Woodford ; Anuj Gosalia ; Arka Dhar ; Ashley Pantuliano ; Avi Nayak ; Avital Oliver ; Barret Zoph ; Behrooz Ghorbani ; Ben Leimberger ; Ben Rossen ; Ben Sokolowsky ; Ben Wang ; Benjamin Zweig ; Beth Hoover ; Blake Samic ; Bob McGrew ; Bobby Spero ; Bogo Giertler ; Bowen Cheng ; Brad Lightcap ; Brandon Walkin ; Brendan Quinn ; Brian Guarraci ; Brian Hsu ; Bright Kellogg ; Brydon Eastman ; Camillo Lugaresi ; Carroll Wainwright ; Cary Bassin ; Cary Hudson ; Casey Chu ; Chad Nelson ; Chak Li ; Chan Jun Shern ; Channing Conger ; Charlotte Barette ; Chelsea Voss ; Chen Ding ; Cheng Lu ; Chong Zhang ; Chris Beaumont ; Chris Hallacy ; Chris Koch ; Christian Gibson ; Christina Kim ; Christine Choi ; Christine McLeavey ; Christopher Hesse ; Claudia Fischer ; Clemens Winter ; Coley Czarnecki ; Colin Jarvis ; Colin Wei ; Constantin Koumouzelis ; Dane Sherburn ; Daniel Kappler ; Daniel Levin ; Daniel Levy ; David Carr ; David Farhi ; David Mely ; David Robinson ; David Sasaki ; Denny Jin ; Dev Valladares ; Dimitris Tsipras ; Doug Li ; Duc Phong Nguyen ; Duncan Findlay ; Edede Oiwoh ; Edmund Wong ; Ehsan Asdar ; Elizabeth Proehl ; Elizabeth Yang ; Eric Antonow ; Eric Kramer ; Eric Peterson ; Eric Sigler ; Eric Wallace ; Eugene Brevdo ; Evan Mays ; Farzad Khorasani ; Felipe Petroski Such ; Filippo Raso ; Francis Zhang ; Fred von Lohmann ; Freddie Sulit ; Gabriel Goh ; Gene Oden ; Geoff Salmon ; Giulio Starace ; Greg Brockman ; Hadi Salman ; Haiming Bao ; Haitang Hu ; Hannah Wong ; Haoyu Wang ; Heather Schmidt ; Heather Whitney ; Heewoo Jun ; Hendrik Kirchner ; Henrique Ponde de Oliveira Pinto ; Hongyu Ren ; Huiwen Chang ; Hyung Won Chung ; Ian Kivlichan ; Ian O'Connell ; Ian Osband ; Ian Silber ; Ian Sohl ; Ibrahim Okuyucu ; Ikai Lan ; Ilya Kostrikov ; Ilya Sutskever ; Ingmar Kanitscheider ; Ishaan Gulrajani ; Jacob Coxon ; Jacob Menick ; Jakub Pachocki ; James Aung ; James Betker ; James Crooks ; James Lennon ; Jamie Kiros ; Jan Leike ; Jane Park ; Jason Kwon ; Jason Phang ; Jason Teplitz ; Jason Wei ; Jason Wolfe ; Jay Chen ; Jeff Harris ; Jenia Varavva ; Jessica Gan Lee ; Jessica Shieh ; Ji Lin ; Jiahui Yu ; Jiayi Weng ; Jie Tang ; Jieqi Yu ; Joanne Jang ; Joaquin Quinonero Candela ; Joe Beutler ; Joe Landers ; Joel Parish ; Johannes Heidecke ; John Schulman ; Jonathan Lachman ; Jonathan McKay ; Jonathan Uesato ; Jonathan Ward ; Jong Wook Kim ; Joost Huizinga ; Jordan Sitkin ; Jos Kraaijeveld ; Josh Gross ; Josh Kaplan ; Josh Snyder ; Joshua Achiam ; Joy Jiao ; Joyce Lee ; Juntang Zhuang ; Justyn Harriman ; Kai Fricke ; Kai Hayashi ; Karan Singhal ; Katy Shi ; Kavin Karthik ; Kayla Wood ; Kendra Rimbach ; Kenny Hsu ; Kenny Nguyen ; Keren Gu-Lemberg ; Kevin Button ; Kevin Liu ; Kiel Howe ; Krithika Muthukumar ; Kyle Luther ; Lama Ahmad ; Larry Kai ; Lauren Itow ; Lauren Workman ; Leher Pathak ; Leo Chen ; Li Jing ; Lia Guy ; Liam Fedus ; Liang Zhou ; Lien Mamitsuka ; Lilian Weng ; Lindsay McCallum ; Lindsey Held ; Long Ouyang ; Louis Feuvrier ; Lu Zhang ; Lukas Kondraciuk ; Lukasz Kaiser ; Luke Hewitt ; Luke Metz ; Lyric Doshi ; Mada Aflak ; Maddie Simens ; Madelaine Boyd ; Madeleine Thompson ; Marat Dukhan ; Mark Chen ; Mark Gray ; Mark Hudnall ; Marvin Zhang ; Marwan Aljubeh ; Mateusz Litwin ; Matthew Zeng ; Max Johnson ; Maya Shetty ; Mayank Gupta ; Meghan Shah ; Mehmet Yatbaz ; Meng Jia Yang ; Mengchao Zhong ; Mia Glaese ; Mianna Chen ; Michael Janner ; Michael Lampe ; Michael Petrov ; Michael Wu ; Michele Wang ; Michelle Fradin ; Michelle Pokrass ; Miguel Castro ; Miguel Oom Temudo de Castro ; Mikhail Pavlov ; Miles Brundage ; Miles Wang ; Minal Khan ; Mira Murati ; Mo Bavarian ; Molly Lin ; Murat Yesildal ; Nacho Soto ; Natalia Gimelshein ; Natalie Cone ; Natalie Staudacher ; Natalie Summers ; Natan LaFontaine ; Neil Chowdhury ; Nick Ryder ; Nick Stathas ; Nick Turley ; Nik Tezak ; Niko Felix ; Nithanth Kudige ; Nitish Keskar ; Noah Deutsch ; Noel Bundick ; Nora Puckett ; Ofir Nachum ; Ola Okelola ; Oleg Boiko ; Oleg Murk ; Oliver Jaffe ; Olivia Watkins ; Olivier Godement ; Owen Campbell-Moore ; Patrick Chao ; Paul McMillan ; Pavel Belov ; Peng Su ; Peter Bak ; Peter Bakkum ; Peter Deng ; Peter Dolan ; Peter Hoeschele ; Peter Welinder ; Phil Tillet ; Philip Pronin ; Philippe Tillet ; Prafulla Dhariwal ; Qiming Yuan ; Rachel Dias ; Rachel Lim ; Rahul Arora ; Rajan Troll ; Randall Lin ; Rapha Gontijo Lopes ; Raul Puri ; Reah Miyara ; Reimar Leike ; Renaud Gaubert ; Reza Zamani ; Ricky Wang ; Rob Donnelly ; Rob Honsby ; Rocky Smith ; Rohan Sahai ; Rohit Ramchandani ; Romain Huet ; Rory Carmichael ; Rowan Zellers ; Roy Chen ; Ruby Chen ; Ruslan Nigmatullin ; Ryan Cheu ; Saachi Jain ; Sam Altman ; Sam Schoenholz ; Sam Toizer ; Samuel Miserendino ; Sandhini Agarwal ; Sara Culver ; Scott Ethersmith ; Scott Gray ; Sean Grove ; Sean Metzger ; Shamez Hermani ; Shantanu Jain ; Shengjia Zhao ; Sherwin Wu ; Shino Jomoto ; Shirong Wu ; Shuaiqi Xia ; Sonia Phene ; Spencer Papay ; Srinivas Narayanan ; Steve Coffey ; Steve Lee ; Stewart Hall ; Suchir Balaji ; Tal Broda ; Tal Stramer ; Tao Xu ; Tarun Gogineni ; Taya Christianson ; Ted Sanders ; Tejal Patwardhan ; Thomas Cunninghman ; Thomas Degry ; Thomas Dimson ; Thomas Raoux ; Thomas Shadwell ; Tianhao Zheng ; Todd Underwood ; Todor Markov ; Toki Sherbakov ; Tom Rubin ; Tom Stasi ; Tomer Kaftan ; Tristan Heywood ; Troy Peterson ; Tyce Walters ; Tyna Eloundou ; Valerie Qi ; Veit Moeller ; Vinnie Monaco ; Vishal Kuo ; Vlad Fomenko ; Wayne Chang ; Weiyi Zheng ; Wenda Zhou ; Wesam Manassra ; Will Sheu ; Wojciech Zaremba ; Yash Patil ; Yilei Qian ; Yongjik Kim ; Youlong Cheng ; Yu Zhang ; Yuchen He ; Yuchen Zhang ; Yujia Jin ; Yunxing Dai ; Yury Malkov

GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.

Subjects: Computation and Language ; Artificial Intelligence ; Computer Vision and Pattern Recognition ; Computers and Society ; Machine Learning ; Sound ; Audio and Speech Processing

Publish: 2024-10-25 17:43:01 UTC

#23 Enhancing Action Recognition by Leveraging the Hierarchical Structure of Actions and Textual Context

Authors: Manuel Benavent-Lledo ; David Mulero-Pérez ; David Ortiz-Perez ; Jose Garcia-Rodriguez ; Antonis Argyros

The sequential execution of actions and their hierarchical structure, consisting of different levels of abstraction, provide features that remain unexplored in the task of action recognition. In this study, we present a novel approach to improve action recognition by exploiting the hierarchical organization of actions and by incorporating contextualized textual information, including location and prior actions, to reflect the sequential context. To achieve this goal, we introduce a novel transformer architecture tailored for action recognition that utilizes both visual and textual features. Visual features are obtained from RGB and optical flow data, while text embeddings represent contextual information. Furthermore, we define a joint loss function to simultaneously train the model for both coarse and fine-grained action recognition, thereby exploiting the hierarchical nature of actions. To demonstrate the effectiveness of our method, we extend the Toyota Smarthome Untrimmed (TSU) dataset with action hierarchies, yielding the Hierarchical TSU dataset. We also conduct an ablation study to assess the impact of different methods for integrating contextual and hierarchical data on action recognition performance. Results show that the proposed approach outperforms pre-trained SOTA methods when trained with the same hyperparameters. Moreover, they also show a 17.12% improvement in top-1 accuracy over the equivalent fine-grained RGB version when using ground-truth contextual information, and a 5.33% improvement when contextual information is obtained from actual predictions.
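The abstract does not spell out the form of the joint loss. One common way to train coarse and fine-grained heads together is a weighted sum of two cross-entropy terms, sketched below. The weighting scheme, the `coarse_weight` parameter, and all function names are assumptions made for illustration, not the authors' formulation.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample, using a numerically stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def joint_loss(coarse_logits, coarse_target, fine_logits, fine_target,
               coarse_weight=0.5):
    """Weighted sum of coarse and fine-grained losses (assumed form)."""
    coarse = cross_entropy(coarse_logits, coarse_target)
    fine = cross_entropy(fine_logits, fine_target)
    return coarse_weight * coarse + (1.0 - coarse_weight) * fine


# Hypothetical example: 2 coarse classes, 4 fine-grained classes.
loss = joint_loss([2.0, 0.5], 0, [0.1, 1.5, 0.3, 0.2], 1)
print(f"joint loss: {loss:.4f}")
```

Because both terms are computed for the same sample, a shared backbone would receive gradients from both levels of the hierarchy in each step.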

Subjects: Computer Vision and Pattern Recognition ; Artificial Intelligence

Publish: 2024-10-28 17:59:35 UTC

#24 EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation

Authors: Shih-Yang Liu ; Huck Yang ; Chein-Yi Wang ; Nai Chit Fung ; Hongxu Yin ; Charbel Sakr ; Saurav Muralidharan ; Kwang-Ting Cheng ; Jan Kautz ; Yu-Chiang Frank Wang ; Pavlo Molchanov ; Min-Hung Chen

In this work, we re-formulate the model compression problem as a customized compensation problem: given a compressed model, we aim to introduce residual low-rank paths to compensate for compression errors under customized requirements from users (e.g., tasks, compression ratios), resulting in greater flexibility in adjusting overall capacity without being constrained by specific compression formats. However, naively applying SVD to derive residual paths causes suboptimal utilization of the low-rank representation capacity. Instead, we propose Training-free Eigenspace Low-Rank Approximation (EoRA), a method that directly minimizes compression-induced errors without requiring gradient-based training, achieving fast optimization in minutes using a small amount of calibration data. EoRA projects compression errors into the eigenspace of input activations, leveraging eigenvalues to effectively prioritize the reconstruction of high-importance error components. Moreover, EoRA can be seamlessly integrated with fine-tuning and quantization to further improve effectiveness and efficiency. EoRA consistently outperforms previous methods in compensating errors for compressed LLaMA2/3 models on various tasks, such as language generation, commonsense reasoning, and math reasoning tasks (e.g., 31.31%/12.88% and 9.69% improvements on ARC-Easy/ARC-Challenge and MathQA when compensating LLaMA3-8B that is quantized to 4-bit and pruned to 2:4 sparsity). EoRA offers a scalable, training-free solution to compensate for compression errors, making it a powerful tool to deploy LLMs under various capacity and efficiency requirements.
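The idea of projecting the compression error into the eigenspace of the input activations can be sketched with NumPy as follows. This is one reading of the abstract, not the authors' implementation: the square-root eigenvalue weighting, the function names, and the synthetic data are all assumptions. The sketch compares the result against a plain truncated SVD of the error on an output-space error measure.

```python
import numpy as np


def eigenspace_low_rank(delta_w, activations, rank, eps=1e-12):
    """Return (b, a) with b @ a approximating delta_w, weighted by activations.

    delta_w:     (out_dim, in_dim) compression error, W_original - W_compressed
    activations: (in_dim, n_samples) calibration inputs to the layer
    """
    # Eigendecomposition of the symmetric input covariance.
    cov = activations @ activations.T / activations.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)
    scale = np.sqrt(np.clip(eigvals, eps, None))  # assumed weighting

    # Project the error into the eigenspace and weight each direction.
    projected = (delta_w @ eigvecs) * scale

    # Truncated SVD in the weighted eigenspace.
    u, s, vt = np.linalg.svd(projected, full_matrices=False)
    b = u[:, :rank] * s[:rank]
    # Undo the weighting and the projection on the right factor.
    a = (vt[:rank] / scale) @ eigvecs.T
    return b, a


def plain_svd_low_rank(delta_w, rank):
    """Baseline: truncated SVD applied directly to the error."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]


def output_error(delta_w, b, a, activations):
    """Frobenius norm of the uncompensated error on the layer outputs."""
    return np.linalg.norm((delta_w - b @ a) @ activations)


rng = np.random.default_rng(0)
in_dim, out_dim, n_samples, rank = 8, 6, 256, 2
# Synthetic calibration inputs with very uneven per-feature scale.
feature_scale = np.logspace(0, 2, in_dim)[:, None]
activations = rng.normal(size=(in_dim, n_samples)) * feature_scale
delta_w = rng.normal(size=(out_dim, in_dim))

b_eig, a_eig = eigenspace_low_rank(delta_w, activations, rank)
b_svd, a_svd = plain_svd_low_rank(delta_w, rank)
err_eig = output_error(delta_w, b_eig, a_eig, activations)
err_svd = output_error(delta_w, b_svd, a_svd, activations)
print(f"eigenspace: {err_eig:.3f}  plain SVD: {err_svd:.3f}")
```

Under this weighting, the truncated SVD in the eigenspace is the best rank-limited fit with respect to the output-space error on the calibration data, so it can never do worse than the plain SVD on that measure.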

Subjects: Computation and Language ; Artificial Intelligence

Publish: 2024-10-28 17:59:03 UTC

#25 LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior

Authors: Hanyu Wang ; Saksham Suri ; Yixuan Ren ; Hao Chen ; Abhinav Shrivastava

We present LARP, a novel video tokenizer designed to overcome limitations in current video tokenization methods for autoregressive (AR) generative models. Unlike traditional patchwise tokenizers that directly encode local visual patches into discrete tokens, LARP introduces a holistic tokenization scheme that gathers information from the visual content using a set of learned holistic queries. This design allows LARP to capture more global and semantic representations, rather than being limited to local patch-level information. Furthermore, it offers flexibility by supporting an arbitrary number of discrete tokens, enabling adaptive and efficient tokenization based on the specific requirements of the task. To align the discrete token space with downstream AR generation tasks, LARP integrates a lightweight AR transformer as a training-time prior model that predicts the next token on its discrete latent space. By incorporating the prior model during training, LARP learns a latent space that is not only optimized for video reconstruction but is also structured in a way that is more conducive to autoregressive generation. Moreover, this process defines a sequential order for the discrete tokens, progressively pushing them toward an optimal configuration during training, ensuring smoother and more accurate AR generation at inference time. Comprehensive experiments demonstrate LARP's strong performance, achieving state-of-the-art FVD on the UCF101 class-conditional video generation benchmark. LARP enhances the compatibility of AR models with videos and opens up the potential to build unified high-fidelity multimodal large language models (MLLMs).

Subjects: Computer Vision and Pattern Recognition ; Artificial Intelligence

Publish: 2024-10-28 17:57:07 UTC