| Total: 1000
This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during inference while not having access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 159% on the AIME 2024 with only unlabeled test data. Furthermore, although TTRL is only supervised by the Maj@N metric, TTRL has demonstrated performance to consistently surpass the upper limit of the initial model, and approach the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks, and highlight TTRL's potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
Machine learning models are often used at test-time subject to constraints and trade-offs not present at training-time. For example, a computer vision model operating on an embedded device may need to perform real-time inference, or a translation model operating on a cell phone may wish to bound its average compute time in order to be power-efficient. In this work we describe a mixture-of-experts model and show how to change its test-time resource-usage on a per-input basis using reinforcement learning. We test our method on a small MNIST-based example.
We introduce LADDER (Learning through Autonomous Difficulty-Driven Example Recursion), a framework enabling LLMs to autonomously improve their problem-solving capabilities through self-guided learning. By recursively generating and solving progressively simpler variants of complex problems, LADDER enables models to progressively learn through reinforcement learning how to solve harder problems. This self-improvement process is guided by verifiable reward signals, allowing the model to assess its solutions. Unlike prior approaches requiring curated datasets or human feedback, LADDER leverages the model's own capabilities to easier variants of sample questions. We demonstrate LADDER's effectiveness on mathematical integration tasks, where it improves a Llama 3B model's accuracy from 1\% to 82\% on undergraduate-level problems and enables a 7B parameter model to achieve state-of-the-art performance (70\%) on the MIT Integration Bee examination for it's model size. We also introduce TTRL (Test-Time Reinforcement Learning), a method that generates variants of test problems at inference time and applies reinforcement learning to further improve performance. By further creating and solving related problems during testing, TTRL enables the 7B model to achieve a score of 85\%, surpassing o1. These results showcase how strategic self-directed learning can achieve significant capability improvements without relying on architectural scaling or human supervision.
Careful prompt design is critical to the use of large language models in zero-shot or few-shot learning. As a consequence, there is a growing interest in automated methods to design optimal prompts. In this work, we propose Test-time Prompt Editing using Reinforcement learning (TEMPERA). In contrast to prior prompt generation methods, TEMPERA can efficiently leverage prior knowledge, is adaptive to different queries and provides an interpretable prompt for every query. To achieve this, we design a novel action space that allows flexible editing of the initial prompts covering a wide set of commonly-used components like instructions, few-shot exemplars, and verbalizers. The proposed method achieves significant gains compared with recent SoTA approaches like prompt tuning, AutoPrompt, and RLPrompt, across a variety of tasks including sentiment analysis, topic classification, natural language inference, and reading comprehension. Our method achieves 5.33x on average improvement in sample efficiency when compared to the traditional fine-tuning methods.
Meta reinforcement learning sets a distribution over a set of tasks on which the agent can train at will, then is asked to learn an optimal policy for any test task efficiently. In this paper, we consider a finite set of tasks modeled through Markov decision processes with various dynamics. We assume to have endured a long training phase, from which the set of tasks is perfectly recovered, and we focus on regret minimization against the optimal policy in the unknown test task. Under a separation condition that states the existence of a state-action pair revealing a task against another, Chen et al. (2022) show that O(M2log(H)) regret can be achieved, where M,H are the number of tasks in the set and test episodes, respectively. In our first contribution, we demonstrate that the latter rate is nearly optimal by developing a novel lower bound for test-time regret minimization under separation, showing that a linear dependence with M is unavoidable. Then, we present a family of stronger yet reasonable assumptions beyond separation, which we call strong identifiability, enabling algorithms achieving fast rates log(H) and sublinear dependence with M simultaneously. Our paper provides a new understanding of the statistical barriers of test-time regret minimization and when fast rates can be achieved.
The goal of program synthesis, or code generation, is to generate executable code based on given descriptions. Recently, there has been an increasing number of studies employing reinforcement learning (RL) to improve the performance of large language models (LLMs) for code. However, current representative works either rely solely on offline frameworks, limiting the exploration of new sample spaces, or fall short in the utilization of unit test signals, not accounting for specific error locations within the code. To address these issues, we propose RLTF, i.e., Reinforcement Learning from Unit Test Feedback, a novel online RL framework with unit test feedback of multi-granularity for refining code LLMs. Our approach generates data in real-time during training and simultaneously utilizes fine-grained feedback signals to guide the model towards producing higher-quality code. Extensive experiments show that RLTF achieves state-of-the-art performance on the APPS and the MBPP benchmarks. Our code is available at: https://github.com/Zyq-scut/RLTF.
Training models to effectively use test-time compute is crucial for improving the reasoning performance of LLMs. Current methods mostly do so via fine-tuning on search traces or running RL with 0/1 outcome reward, but do these approaches efficiently utilize test-time compute? Would these approaches continue to scale as the budget improves? In this paper, we try to answer these questions. We formalize the problem of optimizing test-time compute as a meta-reinforcement learning (RL) problem, which provides a principled perspective on spending test-time compute. This perspective enables us to view the long output stream from the LLM as consisting of several episodes run at test time and leads us to use a notion of cumulative regret over output tokens as a way to measure the efficacy of test-time compute. Akin to how RL algorithms can best tradeoff exploration and exploitation over training, minimizing cumulative regret would also provide the best balance between exploration and exploitation in the token stream. While we show that state-of-the-art models do not minimize regret, one can do so by maximizing a dense reward bonus in conjunction with the outcome 0/1 reward RL. This bonus is the ''progress'' made by each subsequent block in the output stream, quantified by the change in the likelihood of eventual success. Using these insights, we develop Meta Reinforcement Fine-Tuning, or MRT, a new class of fine-tuning methods for optimizing test-time compute. MRT leads to a 2-3x relative gain in performance and roughly a 1.5x gain in token efficiency for math reasoning compared to outcome-reward RL.
Effective medical test suggestions benefit both patients and physicians to conserve time and improve diagnosis accuracy. In this work, we show that an agent can learn to suggest effective medical tests. We formulate the problem as a stage-wise Markov decision process and propose a reinforcement learning method to train the agent. We introduce a new representation of multiple action policy along with the training method of the proposed representation. Furthermore, a new exploration scheme is proposed to accelerate the learning of disease distributions. Our experimental results demonstrate that the accuracy of disease diagnosis can be significantly improved with good medical test suggestions.
Testing in Continuous Integration (CI) involves test case prioritization, selection, and execution at each cycle. Selecting the most promising test cases to detect bugs is hard if there are uncertainties on the impact of committed code changes or, if traceability links between code and tests are not available. This paper introduces Retecs, a new method for automatically learning test case selection and prioritization in CI with the goal to minimize the round-trip time between code commits and developer feedback on failed test cases. The Retecs method uses reinforcement learning to select and prioritize test cases according to their duration, previous last execution and failure history. In a constantly changing environment, where new test cases are created and obsolete test cases are deleted, the Retecs method learns to prioritize error-prone test cases higher under guidance of a reward function and by observing previous CI cycles. By applying Retecs on data extracted from three industrial case studies, we show for the first time that reinforcement learning enables fruitful automatic adaptive test case selection and prioritization in CI and regression testing.
Continuous Integration (CI) significantly reduces integration problems, speeds up development time, and shortens release time. However, it also introduces new challenges for quality assurance activities, including regression testing, which is the focus of this work. Though various approaches for test case prioritization have shown to be very promising in the context of regression testing, specific techniques must be designed to deal with the dynamic nature and timing constraints of CI. Recently, Reinforcement Learning (RL) has shown great potential in various challenging scenarios that require continuous adaptation, such as game playing, real-time ads bidding, and recommender systems. Inspired by this line of work and building on initial efforts in supporting test case prioritization with RL techniques, we perform here a comprehensive investigation of RL-based test case prioritization in a CI context. To this end, taking test case prioritization as a ranking problem, we model the sequential interactions between the CI environment and a test case prioritization agent as an RL problem, using three alternative ranking models. We then rely on carefully selected and tailored state-of-the-art RL techniques to automatically and continuously learn a test case prioritization strategy, whose objective is to be as close as possible to the optimal one. Our extensive experimental analysis shows that the best RL solutions provide a significant accuracy improvement over previous RL-based work, with prioritization strategies getting close to being optimal, thus paving the way for using RL to prioritize test cases in a CI context.
In modern software engineering, Continuous Integration (CI) has become an indispensable step towards systematically managing the life cycles of software development. Large companies struggle with keeping the pipeline updated and operational, in useful time, due to the large amount of changes and addition of features, that build on top of each other and have several developers, working on different platforms. Associated with such software changes, there is always a strong component of Testing. As teams and projects grow, exhaustive testing quickly becomes inhibitive, becoming adamant to select the most relevant test cases earlier, without compromising software quality. This paper extends recent studies on applying Reinforcement Learning to optimize testing strategies. We test its ability to adapt to new environments, by testing it on novel data extracted from a financial institution, yielding a Normalized percentage of Fault Detection (NAPFD) of over 0.6 using the Network Approximator and Test Case Failure Reward. Additionally, we studied the impact of using Decision Tree (DT) Approximator as a model for memory representation, which failed to produce significant improvements relative to Artificial Neural Networks.
There have been many studies on automated test generation for mobile Graphical User Interface (GUI) applications. These studies successfully demonstrate how to detect fatal exceptions and achieve high code and activity coverage with fully automated test generation engines. However, it is unclear how many GUI functions these engines manage to test. Furthermore, these engines implement only implicit test oracles. We propose Fully Automated Reinforcement LEArning-Driven Specification-Based Test Generator for Android (FARLEAD-Android). FARLEAD-Android accepts a GUI-level formal specification as a Linear-time Temporal Logic (LTL) formula. By dynamically executing the Application Under Test (AUT), it learns how to generate a test that satisfies the LTL formula using Reinforcement Learning (RL). The LTL formula does not just guide the test generation but also acts as a specified test oracle, enabling the developer to define automated test oracles for a wide variety of GUI functions by changing the formula. Our evaluation shows that FARLEAD-Android is more effective and achieves higher performance in generating tests for specified GUI functions than three known approaches, Random, Monkey, and QBEa. To the best of our knowledge, FARLEAD-Android is the first fully automated mobile GUI testing engine that uses formal specifications.
Software testing is a crucial but time-consuming aspect of software development, and recently, Large Language Models (LLMs) have gained popularity for automated test case generation. However, because LLMs are trained on vast amounts of open-source code, they often generate test cases that do not adhere to best practices and may even contain test smells (anti-patterns). To address this issue, we propose Reinforcement Learning from Static Quality Metrics (RLSQM), wherein we utilize Reinforcement Learning to generate high-quality unit tests based on static analysis-based quality metrics. First, we analyzed LLM-generated tests and show that LLMs frequently do generate undesirable test smells -- up to 37% of the time. Then, we implemented lightweight static analysis-based reward model and trained LLMs using this reward model to optimize for five code quality metrics. Our experimental results demonstrate that the RL-optimized Codex model consistently generated higher-quality test cases than the base LLM, improving quality metrics by up to 23%, and generated nearly 100% syntactically-correct code. RLSQM also outperformed GPT-4 on all code quality metrics, in spite of training a substantially cheaper Codex model. We provide insights into how reliably utilize RL to improve test generation quality and show that RLSQM is a significant step towards enhancing the overall efficiency and reliability of automated software testing. Our data are available at https://doi.org/10.6084/m9.figshare.25983166.
Improving system-level resiliency of networked microgrids is an important aspect with increased population of inverter-based resources (IBRs). This paper (1) presents resilient control design in presence of adversarial cyber-events, and proposes a novel federated reinforcement learning (Fed-RL) approach to tackle (a) model complexities, unknown dynamical behaviors of IBR devices, (b) privacy issues regarding data sharing in multi-party-owned networked grids, and (2) transfers learned controls from simulation to hardware-in-the-loop test-bed, thereby bridging the gap between simulation and real world. With these multi-prong objectives, first, we formulate a reinforcement learning (RL) training setup generating episodic trajectories with adversaries (attack signal) injected at the primary controllers of the grid forming (GFM) inverters where RL agents (or controllers) are being trained to mitigate the injected attacks. For networked microgrids, the horizontal Fed-RL method involving distinct independent environments is not appropriate, leading us to develop vertical variant Federated Soft Actor-Critic (FedSAC) algorithm to grasp the interconnected dynamics of networked microgrid. Next, utilizing OpenAI Gym interface, we built a custom simulation set-up in GridLAB-D/HELICS co-simulation platform, named Resilient RL Co-simulation (ResRLCoSIM), to train the RL agents with IEEE 123-bus benchmark test systems comprising 3 interconnected microgrids. Finally, the learned policies in simulation world are transferred to the real-time hardware-in-the-loop test-bed set-up developed using high-fidelity Hypersim platform. Experiments show that the simulator-trained RL controllers produce convincing results with the real-time test-bed set-up, validating the minimization of sim-to-real gap.
Model-based reinforcement learning (MBRL) approaches rely on discrete-time state transition models whereas physical systems and the vast majority of control tasks operate in continuous-time. To avoid time-discretization approximation of the underlying process, we propose a continuous-time MBRL framework based on a novel actor-critic method. Our approach also infers the unknown state evolution differentials with Bayesian neural ordinary differential equations (ODE) to account for epistemic uncertainty. We implement and test our method on a new ODE-RL suite that explicitly solves continuous-time control systems. Our experiments illustrate that the model is robust against irregular and noisy data, is sample-efficient, and can solve control problems which pose challenges to discrete-time MBRL methods.
Test point insertion (TPI) is a widely used technique for testability enhancement, especially for logic built-in self-test (LBIST) due to its relatively low fault coverage. In this paper, we propose a novel TPI approach based on deep reinforcement learning (DRL), named DeepTPI. Unlike previous learning-based solutions that formulate the TPI task as a supervised-learning problem, we train a novel DRL agent, instantiated as the combination of a graph neural network (GNN) and a Deep Q-Learning network (DQN), to maximize the test coverage improvement. Specifically, we model circuits as directed graphs and design a graph-based value network to estimate the action values for inserting different test points. The policy of the DRL agent is defined as selecting the action with the maximum value. Moreover, we apply the general node embeddings from a pre-trained model to enhance node features, and propose a dedicated testability-aware attention mechanism for the value network. Experimental results on circuits with various scales show that DeepTPI significantly improves test coverage compared to the commercial DFT tool. The code of this work is available at https://github.com/cure-lab/DeepTPI.
One of the fundamental challenges associated with reinforcement learning (RL) is that collecting sufficient data can be both time-consuming and expensive. In this paper, we formalize a concept of time reversal symmetry in a Markov decision process (MDP), which builds upon the established structure of dynamically reversible Markov chains (DRMCs) and time-reversibility in classical physics. Specifically, we investigate the utility of this concept in reducing the sample complexity of reinforcement learning. We observe that utilizing the structure of time reversal in an MDP allows every environment transition experienced by an agent to be transformed into a feasible reverse-time transition, effectively doubling the number of experiences in the environment. To test the usefulness of this newly synthesized data, we develop a novel approach called time symmetric data augmentation (TSDA) and investigate its application in both proprioceptive and pixel-based state within the realm of off-policy, model-free RL. Empirical evaluations showcase how these synthetic transitions can enhance the sample efficiency of RL agents in time reversible scenarios without friction or contact. We also test this method in more realistic environments where these assumptions are not globally satisfied. We find that TSDA can significantly degrade sample efficiency and policy performance, but can also improve sample efficiency under the right conditions. Ultimately we conclude that time symmetry shows promise in enhancing the sample efficiency of reinforcement learning and provide guidance when the environment and reward structures are of an appropriate form for TSDA to be employed effectively.
Software testing is a crucial aspect of software development, and the creation of high-quality tests that adhere to best practices is essential for effective maintenance. Recently, Large Language Models (LLMs) have gained popularity for code generation, including the automated creation of test cases. However, these LLMs are often trained on vast amounts of publicly available code, which may include test cases that do not adhere to best practices and may even contain test smells (anti-patterns). To address this issue, we propose a novel technique called Reinforcement Learning from Static Quality Metrics (RLSQM). To begin, we analyze the anti-patterns generated by the LLM and show that LLMs can generate undesirable test smells. Thus, we train specific reward models for each static quality metric, then utilize Proximal Policy Optimization (PPO) to train models for optimizing a single quality metric at a time. Furthermore, we amalgamate these rewards into a unified reward model aimed at capturing different best practices and quality aspects of tests. By comparing RL-trained models with those trained using supervised learning, we provide insights into how reliably utilize RL to improve test generation quality and into the effects of various training strategies. Our experimental results demonstrate that the RL-optimized model consistently generated high-quality test cases compared to the base LLM, improving the model by up to 21%, and successfully generates nearly 100% syntactically correct code. RLSQM also outperformed GPT-4 on four out of seven metrics. This represents a significant step towards enhancing the overall efficiency and reliability of software testing through Reinforcement Learning and static quality metrics. Our data are available at this link: https://figshare.com/s/ded476c8d4c221222849.
The Multi-Criteria Test Suite Minimization (MCTSM) problem aims to refine test suites by removing redundant test cases, guided by adequacy criteria such as code coverage or fault detection capability. However, current techniques either exhibit a high loss of fault detection ability or face scalability challenges due to the NP-hard nature of the problem, which limits their practical utility. We propose TripRL, a novel technique that integrates traditional criteria such as statement coverage and fault detection ability with test coverage similarity into an Integer Linear Program (ILP), to produce a diverse reduced test suite with high test effectiveness. TripRL leverages bipartite graph representation and its embedding for concise ILP formulation and combines ILP with effective reinforcement learning (RL) training. This combination renders large-scale test suite minimization more scalable and enhances test effectiveness. Our empirical evaluations demonstrate that TripRL's runtime scales linearly with the magnitude of the MCTSM problem. Notably, for large test suites where existing approaches fail to provide solutions within a reasonable time frame, our technique consistently delivers solutions in less than 47 minutes. The reduced test suites produced by TripRL also maintain the original statement coverage and fault detection ability while having a higher potential to detect unknown faults.
Performance testing with the aim of generating an efficient and effective workload to identify performance issues is challenging. Many of the automated approaches mainly rely on analyzing system models, source code, or extracting the usage pattern of the system during the execution. However, such information and artifacts are not always available. Moreover, all the transactions within a generated workload do not impact the performance of the system the same way, a finely tuned workload could accomplish the test objective in an efficient way. Model-free reinforcement learning is widely used for finding the optimal behavior to accomplish an objective in many decision-making problems without relying on a model of the system. This paper proposes that if the optimal policy (way) for generating test workload to meet a test objective can be learned by a test agent, then efficient test automation would be possible without relying on system models or source code. We present a self-adaptive reinforcement learning-driven load testing agent, RELOAD, that learns the optimal policy for test workload generation and generates an effective workload efficiently to meet the test objective. Once the agent learns the optimal policy, it can reuse the learned policy in subsequent testing activities. Our experiments show that the proposed intelligent load test agent can accomplish the test objective with lower test cost compared to common load testing procedures, and results in higher test efficiency.
Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL's exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply "connect-the-dots" with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem's solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework.
In this paper we propose real-time recurrent reinforcement learning (RTRRL), a biologically plausible approach to solving discrete and continuous control tasks in partially-observable markov decision processes (POMDPs). RTRRL consists of three parts: (1) a Meta-RL RNN architecture, implementing on its own an actor-critic algorithm; (2) an outer reinforcement learning algorithm, exploiting temporal difference learning and dutch eligibility traces to train the Meta-RL network; and (3) random-feedback local-online (RFLO) learning, an online automatic differentiation algorithm for computing the gradients with respect to parameters of the network.Our experimental results show that by replacing the optimization algorithm in RTRRL with the biologically implausible back propagation through time (BPTT), or real-time recurrent learning (RTRL), one does not improve returns, while matching the computational complexity for BPTT, and even increasing complexity for RTRL. RTRRL thus serves as a model of learning in biological neural networks, mimicking reward pathways in the basal ganglia.
Test-driven development (TDD) is a widely-employed software development practice that mandates writing test cases based on requirements before writing the actual code. While writing test cases is the centerpiece of TDD, it is time-consuming, expensive, and often shunned by developers. To address these issues associated with TDD, automated test case generation approaches have recently been investigated. Such approaches take source code as input, but not the requirements. Therefore, existing work does not fully support true TDD, as actual code is required to generate test cases. In addition, current deep learning-based test case generation approaches are trained with one learning objective, i.e., to generate test cases that are exactly matched with the ground-truth test cases. However, such approaches may limit the model's ability to generate different yet correct test cases. In this paper, we introduce PyTester, a Text-to-Testcase generation approach that can automatically generate syntactically correct, executable, complete, and effective test cases while being aligned with a given natural language requirement. We evaluate PyTester on the public APPS benchmark dataset, and the results show that our Deep RL approach enables PyTester, a small language model, to outperform much larger language models like GPT3.5, StarCoder, and InCoder. Our findings suggest that future research could consider improving small over large LMs for better resource efficiency by integrating the SE domain knowledge into the design of reinforcement learning architecture.
Deep reinforcement learning has been recognized as a promising tool to address the challenges in real-time control of power systems. However, its deployment in real-world power systems has been hindered by a lack of explicit stability and safety guarantees. In this paper, we propose a stability-constrained reinforcement learning (RL) method for real-time voltage control, that guarantees system stability both during policy learning and deployment of the learned policy. The key idea underlying our approach is an explicitly constructed Lyapunov function that leads to a sufficient structural condition for stabilizing policies, i.e., monotonically decreasing policies guarantee stability. We incorporate this structural constraint with RL, by parameterizing each local voltage controller using a monotone neural network, thus ensuring the stability constraint is satisfied by design. We demonstrate the effectiveness of our approach in both single-phase and three-phase IEEE test feeders, where the proposed method can reduce the transient control cost by more than 25% and shorten the voltage recovery time by 21.5% on average compared to the widely used linear policy, while always achieving voltage stability. In contrast, standard RL methods often fail to achieve voltage stability.
In reinforcement learning, it is common to let an agent interact for a fixed amount of time with its environment before resetting it and repeating the process in a series of episodes. The task that the agent has to learn can either be to maximize its performance over (i) that fixed period, or (ii) an indefinite period where time limits are only used during training to diversify experience. In this paper, we provide a formal account for how time limits could effectively be handled in each of the two cases and explain why not doing so can cause state aliasing and invalidation of experience replay, leading to suboptimal policies and training instability. In case (i), we argue that the terminations due to time limits are in fact part of the environment, and thus a notion of the remaining time should be included as part of the agent's input to avoid violation of the Markov property. In case (ii), the time limits are not part of the environment and are only used to facilitate learning. We argue that this insight should be incorporated by bootstrapping from the value of the state at the end of each partial episode. For both cases, we illustrate empirically the significance of our considerations in improving the performance and stability of existing reinforcement learning algorithms, showing state-of-the-art results on several control tasks.