While Large Language Models (LLMs) excel at language processing, Large Agent Models are designed to interact with the environment. This transition poses significant challenges in two areas: understanding low-level visual details, and long-horizon reasoning for effective goal interpretation and decision-making. Despite the impressive performance of LLMs and Vision-Language Models (VLMs) on various benchmarks, these models perceive images as bags of words (semantic concepts): they use semantic understanding as a shortcut but cannot recognize geometric structures or solve spatial problems such as mazes. To ground these models in the physical world, we focus on two dimensions. (1) From high-level semantic to low-level geometric understanding: we introduce a low-level visual description language that serves as geometric tokens, allowing the abstraction of multimodal low-level geometric structures. (2) From fast thinking to slow thinking: we propose to quantify long-horizon reasoning by incorporating decision-making based on Markov Decision Processes (MDPs). The key difference between language models and agent models lies in their decision-making capabilities. This fundamental difference necessitates a shift in how we develop large agent models: one that addresses both geometric understanding and long-term planning to build more capable embodied AI agents.
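To make dimension (1) concrete, the sketch below serializes a 2-D maze into a 1-D stream of low-level geometric tokens, so that grid structure (rather than only semantic labels) is exposed to a sequence model. The token vocabulary (`<wall>`, `<free>`, `<start>`, `<goal>`, `<row>`) and the helper `maze_to_geometric_tokens` are hypothetical illustrations under our own assumptions, not the actual visual description language proposed in the paper.

```python
# A minimal sketch: encoding a grid maze as low-level "geometric tokens".
# The vocabulary below is hypothetical and chosen only for illustration.

MAZE = [
    "#####",
    "#S..#",
    "#.#.#",
    "#..G#",
    "#####",
]

# Hypothetical geometric vocabulary: one token per cell type.
CELL_TOKENS = {"#": "<wall>", ".": "<free>", "S": "<start>", "G": "<goal>"}

def maze_to_geometric_tokens(maze):
    """Serialize a 2-D maze row by row, inserting <row> separators so the
    grid's geometry can be recovered from the 1-D token stream."""
    tokens = []
    for row in maze:
        tokens.extend(CELL_TOKENS[c] for c in row)
        tokens.append("<row>")
    return tokens

if __name__ == "__main__":
    print(" ".join(maze_to_geometric_tokens(MAZE)))
```

The `<row>` separator is the key design choice here: without it, the flattened token stream loses the row boundaries that determine which cells are spatially adjacent.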
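For dimension (2), MDP-based decision-making can be illustrated with finite-horizon value iteration on a toy chain environment: the agent plans over many steps rather than producing a one-shot "fast" answer. The state space, reward, discount factor, and horizon below are assumptions chosen for the example, not values from the paper.

```python
# A minimal sketch of MDP-based long-horizon decision-making:
# finite-horizon value iteration on a toy 1-D chain (illustrative only).

N_STATES = 5          # states 0..4; state 4 is the goal
ACTIONS = (-1, +1)    # step left / step right
GAMMA = 0.95          # discount factor (assumed for this example)
HORIZON = 20          # planning depth, i.e. how far ahead the agent reasons

def step(state, action):
    """Deterministic transition: move along the chain, clipped at the ends.
    Reward 1.0 is given whenever the agent lands on (or stays at) the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def value_iteration():
    """Backward induction over the horizon:
    V_t(s) = max_a [ r(s, a) + GAMMA * V_{t+1}(s') ]."""
    values = [0.0] * N_STATES
    for _ in range(HORIZON):
        values = [
            max(r + GAMMA * values[s2]
                for s2, r in (step(s, a) for a in ACTIONS))
            for s in range(N_STATES)
        ]
    return values

def greedy_policy(values):
    """Extract the action maximizing one-step lookahead at each state."""
    return [
        max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * values[step(s, a)[0]])
        for s in range(N_STATES)
    ]

if __name__ == "__main__":
    v = value_iteration()
    print("values:", [round(x, 3) for x in v])
    print("policy:", greedy_policy(v))
```

The resulting policy moves right toward the goal from every state, which is the "slow-thinking" behavior the MDP formulation quantifies: action values are computed by propagating reward backward over the full horizon rather than reacting to the immediate observation.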