Efficient RL for optimizing conversation level outcomes with an LLM-based tutor


Authors: Hyunji Nam, Omer Gottesman, Amy Zhang, Dean Foster, Emma Brunskill, Lyle Ungar

Large language models (LLMs) built on existing reinforcement learning from human feedback (RLHF) frameworks typically optimize responses based on immediate turn-level human preferences. However, this approach falls short in multi-turn dialogue settings, such as online math tutoring. We propose a method to enhance LLM-based tutors by representing the dialogue history with a lower-dimensional latent state representation of a student and optimizing a long-term policy to determine high-level actions based on the latent state. The goal is to better align the tutor's behavior with the long-term objective of guiding the student towards solving a target math problem on their own. Our model is lightweight, requiring fewer computational resources than prior work that trains the tutor policy end-to-end to directly output the tutor's next utterance. Our experimental results demonstrate that these modifications lead to improved long-term outcomes compared to prompting in LLM-simulated tutoring tasks.

Subjects: Computation and Language, Artificial Intelligence

Published: 2025-07-22 05:56:46 UTC

2507.16252
