2025-07-11 | Total: 34
Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Towards a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the intuitive claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Taken together, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.
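To make the distinction in the abstract above concrete, here is a minimal Python sketch of the two reward computations. It assumes the DPO-style definition of the implicit reward (a coefficient `beta` times the log-ratio between the model's and a reference model's probability of the response), which the abstract itself does not spell out. The function names, the toy vectors, and `beta` are illustrative, not taken from the paper.

```python
def explicit_reward(hidden, head_weights):
    """EX-RM sketch: a linear head applied to a hidden representation."""
    return sum(h * w for h, w in zip(hidden, head_weights))


def implicit_reward(policy_token_logps, ref_token_logps, beta=1.0):
    """IM-RM sketch (assumed DPO-style form): beta * log(pi(y|x) / pi_ref(y|x)).

    Each argument holds the per-token log-probabilities of the same response,
    so the sequence log-probability is the sum over tokens.
    """
    return beta * (sum(policy_token_logps) - sum(ref_token_logps))


# Toy inputs: a 2-dimensional hidden state and a 2-token response.
ex = explicit_reward([1.0, 2.0], [0.5, -0.25])
im = implicit_reward([-1.0, -2.0], [-1.5, -2.5], beta=0.5)
```

In this form the implicit reward is built directly from per-token log-probabilities, while the explicit reward only sees the hidden representation. That difference in how the reward is computed is the design choice the abstract studies.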
In this study, we present a novel constraint-based algorithm for causal structure learning specifically designed for nonlinear autoregressive time series. Our algorithm significantly reduces computational complexity compared to existing methods, making it more efficient and scalable to larger problems. We rigorously evaluate its performance on synthetic datasets, demonstrating that our algorithm not only outperforms current techniques, but also excels in scenarios with limited data availability. These results highlight its potential for practical applications in fields requiring efficient and accurate causal inference from nonlinear time series data.
We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms for long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample-efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased n-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.
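The n-step backup mentioned in the abstract above can be sketched as follows. This assumes the standard n-step return with a chunk of h actions, where the critic scores a state together with a whole action chunk; the abstract does not give the update rule, so the function name and arguments are hypothetical.

```python
def chunked_td_target(chunk_rewards, gamma, bootstrap_value):
    """Target for Q(s_t, a_t..a_{t+h-1}) with h = len(chunk_rewards).

    Computes sum_{i<h} gamma^i * r_{t+i} + gamma^h * bootstrap_value, where
    bootstrap_value stands for the critic's estimate at state s_{t+h}
    for the next action chunk.
    """
    target = 0.0
    for i, reward in enumerate(chunk_rewards):
        target += (gamma ** i) * reward
    return target + (gamma ** len(chunk_rewards)) * bootstrap_value


# Sparse reward arriving on the last step of a 3-action chunk.
example_target = chunked_td_target([0.0, 0.0, 1.0], gamma=0.5, bootstrap_value=4.0)
```

Because the rewards inside the chunk were produced by exactly the actions the critic conditions on, the multi-step sum needs no off-policy correction in this sketch. That is one plausible reading of "unbiased" in the abstract.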
Along with accurate prediction, understanding the contribution of each feature to the making of the prediction, i.e., the importance of the feature, is a desirable and arguably necessary component of a machine learning model. For a complex model such as a random forest, such importances are not innate -- as they are, e.g., with linear regression. Efficient methods have been created to provide such capabilities, with one of the most popular among them being permutation feature importance due to its efficiency, model-agnostic nature, and perceived intuitiveness. However, permutation feature importance has been shown to be misleading in the presence of dependent features as a result of the creation of unrealistic observations when permuting the dependent features. In this work, we develop TRIP (Test for Reliable Interpretation via Permutation), a test requiring minimal assumptions that is able to detect unreliable permutation feature importance scores that are the result of model extrapolation. To build on this, we demonstrate how the test can be complemented in order to allow its use in high dimensional settings. Through testing on simulated data and applications, our results show that the test can be used to reliably detect when permutation feature importance scores are unreliable.
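For readers unfamiliar with the quantity being tested in the abstract above, here is a minimal sketch of permutation feature importance itself (not of TRIP). It assumes a model given as a plain callable on a list-valued row and uses mean squared error as the loss; all names are illustrative.

```python
import random


def mse(model, X, y):
    """Mean squared error of a callable model on rows X with targets y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, column, n_repeats=20, seed=0):
    """Average increase in MSE after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in the chosen column.
        X_perm = [row[:column] + [v] + row[column + 1:]
                  for row, v in zip(X, shuffled)]
        increases.append(mse(model, X_perm, y) - baseline)
    return sum(increases) / n_repeats


# Toy data: the model uses feature 0 and ignores feature 1.
X = [[float(i), float(i % 3)] for i in range(10)]
y = [2.0 * row[0] for row in X]
model = lambda row: 2.0 * row[0]

imp0 = permutation_importance(model, X, y, column=0)
imp1 = permutation_importance(model, X, y, column=1)
```

The rows in `X_perm` pair a shuffled value with the untouched values of the other features. When features are dependent, such rows can lie outside the region where the model was trained, which is the extrapolation problem the abstract describes.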
We show that mixtures comprised of multicomponent systems typically are much more structurally complex than the sum of their parts; sometimes, infinitely more complex. We contrast this with the more familiar notion of statistical mixtures, demonstrating how statistical mixtures miss key aspects of emergent hierarchical organization. This leads us to identify a new kind of structural complexity inherent in multicomponent systems and to draw out broad consequences for system ergodicity.
We present GO-CBED, a goal-oriented Bayesian framework for sequential causal experimental design. Unlike conventional approaches that select interventions aimed at inferring the full causal model, GO-CBED directly maximizes the expected information gain (EIG) on user-specified causal quantities of interest, enabling more targeted and efficient experimentation. The framework is both non-myopic, optimizing over entire intervention sequences, and goal-oriented, targeting only model aspects relevant to the causal query. To address the intractability of exact EIG computation, we introduce a variational lower bound estimator, optimized jointly through a transformer-based policy network and normalizing flow-based variational posteriors. The resulting policy enables real-time decision-making via an amortized network. We demonstrate that GO-CBED consistently outperforms existing baselines across various causal reasoning and discovery tasks -- including synthetic structural causal models and semi-synthetic gene regulatory networks -- particularly in settings with limited experimental budgets and complex causal mechanisms. Our results highlight the benefits of aligning experimental design objectives with specific research goals and of forward-looking sequential planning.
Spatial interpolation is a crucial task in geography. As perhaps the most widely used interpolation methods, geostatistical models -- such as Ordinary Kriging (OK) -- assume spatial stationarity, which makes it difficult to capture the nonstationary characteristics of geographic variables. A common solution is trend surface modeling (e.g., Regression Kriging, RK), which relies on external explanatory variables to model the trend and then applies geostatistical interpolation to the residuals. However, this approach requires high-quality and readily available explanatory variables, which are often lacking in many spatial interpolation scenarios -- such as estimating heavy metal concentrations underground. This study proposes a Feature-Free Regression Kriging (FFRK) method, which automatically extracts geospatial features -- including local dependence, local heterogeneity, and geosimilarity -- to construct a regression-based trend surface without requiring external explanatory variables. We conducted experiments on the spatial distribution prediction of three heavy metals in a mining area in Australia. In comparison with 17 classical interpolation methods, the results indicate that FFRK, which does not incorporate any explanatory variables and relies solely on extracted geospatial features, consistently outperforms both conventional Kriging techniques and machine learning models that depend on explanatory variables. This approach effectively addresses spatial nonstationarity while reducing the cost of acquiring explanatory variables, improving both prediction accuracy and generalization ability. This finding suggests that an accurate characterization of geospatial features based on domain knowledge can significantly enhance spatial prediction performance -- potentially yielding greater improvements than merely adopting more advanced statistical models.
Traditional network analysis focuses on binary edges, while real-world relationships are more nuanced, encompassing cooperation, neutrality, and conflict. The rise of negative edges in social media discussions spurred interest in analyzing signed interactions, especially in polarized debates. However, the vast data generated by digital networks presents challenges for traditional methods like Stochastic Block Models (SBM) and Exponential Family Random Graph Models (ERGM), particularly due to the homogeneity assumption and global dependence, which become increasingly unrealistic as network size grows. To address this, we propose a novel method that combines the strengths of SBM and ERGM while mitigating their weaknesses by incorporating local dependence based on non-overlapping blocks. Our approach involves a two-step process: first, decomposing the network into sub-networks using SBM approximation, and then estimating parameters using ERGM methods. We validate our method on large synthetic networks and apply it to a signed Wikipedia network of thousands of editors. Through the use of local dependence, we find patterns consistent with structural balance theory.
We introduce manifolds with kinks, a class of manifolds with possibly singular boundary that notably contains manifolds with smooth boundary and corners. We derive the asymptotic behavior of the Graph Laplace operator with Gaussian kernel and its deterministic limit on these spaces as bandwidth goes to zero. We show that this asymptotic behavior is determined by the inward sector of the tangent space and, as special cases, we derive its behavior near interior and singular points. Lastly, we show the validity of our theoretical results using numerical simulation.
Learning from non-independent and non-identically distributed data poses a persistent challenge in statistical learning. In this study, we introduce data-dependent Bernstein inequalities tailored for vector-valued processes in Hilbert space. Our inequalities apply to both stationary and non-stationary processes and exploit the potential rapid decay of correlations between temporally separated variables to improve estimation. We demonstrate the utility of these bounds by applying them to covariance operator estimation in the Hilbert-Schmidt norm and to operator learning in dynamical systems, achieving novel risk bounds. Finally, we perform numerical experiments to illustrate the practical implications of these bounds in both contexts.
We study a sequential contextual decision-making problem in which certain covariates are missing but can be imputed using a pre-trained AI model. From a theoretical perspective, we analyze how the presence of such a model influences the regret of the decision-making process. We introduce a novel notion called "model elasticity", which quantifies the sensitivity of the reward function to the discrepancy between the true covariate and its imputed counterpart. This concept provides a unified way to characterize the regret incurred due to model imputation, regardless of the underlying missingness mechanism. More surprisingly, we show that under the missing at random (MAR) setting, it is possible to sequentially calibrate the pre-trained model using tools from orthogonal statistical learning and doubly robust regression. This calibration significantly improves the quality of the imputed covariates, leading to much better regret guarantees. Our analysis highlights the practical value of having an accurate pre-trained model in sequential decision-making tasks and suggests that model elasticity may serve as a fundamental metric for understanding and improving the integration of pre-trained models in a wide range of data-driven decision-making problems.
In most real-world applications of artificial intelligence, the distributions of the data and the goals of the learners tend to change over time. The Probably Approximately Correct (PAC) learning framework, which underpins most machine learning algorithms, fails to account for dynamic data distributions and evolving objectives, often resulting in suboptimal performance. Prospective learning is a recently introduced mathematical framework that overcomes some of these limitations. We build on this framework to present preliminary results that improve the algorithm and numerical results, and extend prospective learning to sequential decision-making scenarios, specifically foraging. Code is available at: https://github.com/neurodata/prolearn2.
Estimating spatial regression models on large, irregularly structured datasets poses significant computational hurdles. While Pairwise Likelihood (PL) methods offer a pathway to simplify these estimations, the efficient selection of informative observation pairs remains a critical challenge, particularly as data volume and complexity grow. This paper introduces the Sampled Grid Pairwise Likelihood (SG-PL) method, a novel approach that employs a grid-based sampling strategy to strategically select observation pairs. Simulation studies demonstrate SG-PL's principal advantage: a dramatic reduction in computational time -- often by orders of magnitude -- when compared to benchmark methods. This substantial acceleration is achieved with a manageable trade-off in statistical efficiency. An empirical application further validates SG-PL's practical utility. Consequently, SG-PL emerges as a highly scalable and effective tool for spatial analysis on very large datasets, offering a compelling balance where substantial gains in computational feasibility are realized for a limited cost in statistical precision, a trade-off that increasingly favors SG-PL with larger N.
In this paper, we analyze the behavior of various non-parametric local regression estimators, i.e., estimators that are based on local averaging, for estimating a Lipschitz regression function at a fixed point, or in sup-norm. We first prove some deviation bounds for local estimators that can be indexed by a VC class of sets in the covariates space. We then introduce the general concept of shape-regular local maps, corresponding to the situation where the local averaging is done on sets which, in some sense, have "almost isotropic" shapes. On the one hand, we prove that, in general, shape-regularity is necessary to achieve the minimax rates of convergence. On the other hand, we prove that it is sufficient to ensure the optimal rates, up to some logarithmic factors. Next, we prove some deviation bounds for specific estimators that are based on data-dependent local maps, such as nearest neighbors, their recent prototype variants, as well as a new algorithm, which is a modified and generalized version of CART, and that is minimax rate optimal in sup-norm. In particular, the latter algorithm is based on a random tree construction that depends on both the covariates and the response data. For each of the estimators, we provide insights on the shape-regularity of their respective local maps. Finally, we conclude the paper by establishing some probability bounds for local estimators based on purely random trees, such as centered, uniform or Mondrian trees. Again, we discuss the relations between the rates of the estimators and the shape-regularity of their local maps.
Conformal prediction methods are statistical tools designed to quantify uncertainty and generate predictive sets with guaranteed coverage probabilities. This work introduces an innovative refinement to these methods for classification tasks, specifically tailored for scenarios where multiple observations (multi-inputs) of a single instance are available at prediction time. Our approach is particularly motivated by applications in citizen science, where multiple images of the same plant or animal are captured by individuals. Our method integrates the information from each observation into conformal prediction, enabling a reduction in the size of the predicted label set while preserving the required class-conditional coverage guarantee. The approach is based on the aggregation of conformal p-values computed from each observation of a multi-input. By exploiting the exact distribution of these p-values, we propose a general aggregation framework using an abstract scoring function, encompassing many classical statistical tools. Knowledge of this distribution also enables refined versions of standard strategies, such as majority voting. We evaluate our method on simulated and real data, with a particular focus on Pl@ntNet, a prominent citizen science platform that facilitates the collection and identification of plant species through user-submitted images.
Supervised machine learning pipelines trained on features derived from persistent homology have been experimentally observed to ignore much of the information contained in a persistence diagram. Computing persistence diagrams, however, is often the most computationally demanding step in such a pipeline. To explore this, we introduce several methods to generate topological feature vectors from unreduced boundary matrices. We compared the performance of pipelines trained on vectorizations of unreduced persistence diagrams (PDs) to vectorizations of fully-reduced PDs across several data and task types. Our results indicate that models trained on PDs built from unreduced diagrams can perform on par with, and on some tasks even outperform, those trained on fully-reduced diagrams. This observation suggests that machine learning pipelines that incorporate topology-based features may benefit in terms of computational cost and performance by utilizing information contained in unreduced boundary matrices.
We introduce a novel extension of the influential changes-in-changes (CiC) framework [Athey and Imbens, 2006] to estimate the average treatment effect on the treated (ATT) and distributional causal estimands in panel data settings with unmeasured confounding. While CiC relaxes the parallel trends assumption inherent in difference-in-differences (DiD), existing approaches typically accommodate only a single scalar unobserved confounder and rely on monotonicity assumptions between the confounder and the outcome. Moreover, current formulations lack inference procedures and theoretical guarantees that accommodate continuous covariates. Motivated by the intricate nature of confounding in empirical applications and the need to incorporate continuous covariates in a principled manner, we make two key contributions in this technical report. First, we establish nonparametric identification under a novel set of assumptions that permit high-dimensional unmeasured confounders and non-monotonic relationships between confounders and outcomes. Second, we construct efficient estimators that are Neyman orthogonal to infinite-dimensional nuisance parameters, facilitating valid inference even in the presence of high-dimensional continuous or discrete covariates and flexible machine learning-based nuisance estimation.
The objective of this work is to propose an asymptotic correction method for the estimators of parameters from regression models with covariates subject to classification errors. A correction was developed based on the least squares estimators from regression with erroneous covariates, the marginal probability of the true covariates, and the conditional probability of the erroneous covariates given the true covariates. In this way, we can correct these estimators without the need to correct the erroneous covariates or observe the true covariates. We performed simulations to quantify the performance of the proposed corrections and found that correcting the intercept is crucial for a significant improvement in estimation.
Double descent is a phenomenon of over-parameterized statistical models. Our goal is to view double descent from a Bayesian perspective. Over-parameterized models such as deep neural networks have an interesting re-descending property in their risk characteristics. This is a recent phenomenon in machine learning and has been the subject of many studies. As the complexity of the model increases, there is a U-shaped region corresponding to the traditional bias-variance trade-off, but then as the number of parameters equals the number of observations and the model becomes one of interpolation, the risk can become infinite and then, in the over-parameterized region, it re-descends -- the double descent effect. We show that this has a natural Bayesian interpretation. Moreover, we show that it is not in conflict with the traditional Occam's razor that Bayesian models possess, in that they tend to prefer simpler models when possible. We illustrate the approach with an example of Bayesian model selection in neural networks. Finally, we conclude with directions for future research.
Decisions about managing patients on the heart transplant waitlist are currently made by committees of doctors who consider multiple factors, but the process remains largely ad-hoc. With the growing volume of longitudinal patient, donor, and organ data collected by the United Network for Organ Sharing (UNOS) since 2018, there is increasing interest in analytical approaches to support clinical decision-making at the time of organ availability. In this study, we benchmark machine learning models that leverage longitudinal waitlist history data for time-dependent, time-to-event modeling of waitlist mortality. We train on 23,807 patient records with 77 variables and evaluate both survival prediction and discrimination at a 1-year horizon. Our best model achieves a C-Index of 0.94 and AUROC of 0.89, significantly outperforming previous models. Key predictors align with known risk factors while also revealing novel associations. Our findings can support urgency assessment and policy refinement in heart transplant decision making.
Nonlinear causal effects are prevalent in many research scenarios involving continuous exposures, and instrumental variables (IVs) can be employed to investigate such effects, particularly in the presence of unmeasured confounders. However, common IV methods for nonlinear effect analysis, such as IV regression or the control-function method, have inherent limitations, leading to either low statistical power or potentially misleading conclusions. In this work, we propose an alternative IV framework for nonlinear effect analysis, which has recently emerged in genetic epidemiology and addresses many of the drawbacks of existing IV methods. This framework enables study of the effect function while avoiding unnecessary model assumptions. In particular, it facilitates the identification of change points or threshold values in causal effects. Through a wide variety of simulations, we demonstrate that our framework outperforms other representative nonlinear IV methods in predicting the effect shape when the instrument is weak and can accurately estimate the effect function as well as identify the change point and predict its value under various structural model and effect shape scenarios. We further apply our framework to assess the nonlinear effect of alcohol consumption on systolic blood pressure using a genetic instrument (i.e. Mendelian randomization) with UK Biobank data. Our analysis detects a threshold beyond which alcohol intake exhibits a clear causal effect on the outcome. Our results are consistent with published medical guidelines.
Markov chains provide a foundational framework for modeling sequential stochastic processes, with the transition probability matrix characterizing the dynamics of state evolution. While classical estimation methods such as maximum likelihood and empirical Bayes approaches are effective in finite-state settings, they become inadequate in applications involving countably infinite or dynamically expanding state spaces, which frequently arise in domains such as natural language processing, population dynamics, and behavioral modeling. In this work, we introduce a novel Bayesian nonparametric framework for estimating infinite-dimensional transition probability matrices by employing a new class of priors, termed the Generalized Hierarchical Stick-Breaking prior. This prior extends traditional Dirichlet process and stick-breaking constructions, enabling highly flexible modelling of transition probability matrices. The proposed approach offers a principled methodology for inferring transition probabilities in settings characterized by sparsity, high dimensionality, and unobserved state spaces, thereby contributing to the advancement of statistical inference for infinite-dimensional transition probability matrices.
When performing Bayesian inference using Sequential Monte Carlo (SMC) methods, two considerations arise: the accuracy of the posterior approximation and computational efficiency. To address computational demands, Sequential Monte Carlo Squared (SMC2) is well-suited for high-performance computing (HPC) environments. The design of the proposal distribution within SMC2 can improve accuracy and exploration of the posterior as poor proposals may lead to high variance in importance weights and particle degeneracy. The Metropolis-Adjusted Langevin Algorithm (MALA) uses gradient information so that particles preferentially explore regions of higher probability. In this paper, we extend this idea by incorporating second-order information, specifically the Hessian of the log-target. While second-order proposals have been explored previously in particle Markov Chain Monte Carlo (p-MCMC) methods, we are the first to introduce them within the SMC2 framework. Second-order proposals not only use the gradient (first-order derivative), but also the curvature (second-order derivative) of the target distribution. Experimental results on synthetic models highlight the benefits of our approach in terms of step-size selection and posterior approximation accuracy when compared to other proposals.
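The difference between a gradient-only and a curvature-aware proposal in the abstract above can be illustrated in one dimension. The sketch below uses one common textbook form (a Langevin drift, optionally rescaled by the inverse of the negative second derivative) and is an assumption, not the paper's proposal; it only builds the proposal and omits the SMC weighting and resampling steps. All names are hypothetical.

```python
import math
import random


def first_order_mean(x, grad, step):
    """Langevin proposal mean: drift along the gradient of the log-target."""
    return x + 0.5 * step ** 2 * grad


def second_order_mean(x, grad, hess, step):
    """Curvature-scaled mean; this sketch requires a negative second derivative."""
    if hess >= 0:
        raise ValueError("sketch assumes a locally concave log-target (hess < 0)")
    return x + 0.5 * step ** 2 * grad / (-hess)


def sample_second_order(x, grad, hess, step, rng):
    """Draw one proposal with variance step^2 / (-hess)."""
    mean = second_order_mean(x, grad, hess, step)
    return mean + step * math.sqrt(1.0 / (-hess)) * rng.gauss(0.0, 1.0)


# Toy Gaussian log-target with mean 3 and variance 4, evaluated at x = 7.
x, mu, var = 7.0, 3.0, 4.0
grad = -(x - mu) / var   # first derivative of the log-density at x
hess = -1.0 / var        # second derivative of the log-density

m1 = first_order_mean(x, grad, step=1.0)
m2 = second_order_mean(x, grad, hess, step=1.0)
draw = sample_second_order(x, grad, hess, 1.0, random.Random(0))
```

On this toy target the curvature-scaled mean moves further toward the mode than the gradient-only mean for the same step size, which gives some intuition for why second-order information can ease step-size selection.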
Time-series models like ARIMA remain widely used for forecasting but are limited by linear assumptions and by high computational cost on large and complex datasets. We propose Galerkin-ARIMA, which generalizes the AR component of ARIMA by replacing it with a flexible spline-based function estimated by Galerkin projection. This enables the model to capture nonlinear dependencies in lagged values while retaining the MA component and the Gaussian noise assumption. We derive a closed-form OLS estimator for the Galerkin coefficients and show that the model is asymptotically unbiased and consistent under standard conditions. Our method bridges classical time-series modeling and nonparametric regression, offering improved forecasting performance and computational efficiency.
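The idea of estimating a nonlinear autoregressive function by least-squares projection onto a fixed basis can be sketched as below. This is a simplified stand-in under stated assumptions: a single lag, a quadratic polynomial basis in place of splines, no MA terms, and a noise-free toy series. The helper names are hypothetical and the code is not the paper's estimator.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def basis(lag_value):
    """Illustrative basis functions of the lagged value (polynomial, not spline)."""
    return [1.0, lag_value, lag_value ** 2]


def fit_ols(series):
    """Closed-form least squares via the normal equations (Phi^T Phi) c = Phi^T y."""
    rows = [basis(series[t - 1]) for t in range(1, len(series))]
    targets = series[1:]
    k = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    moment = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve(gram, moment)


# Toy nonlinear AR(1) series: x_{t+1} = 1 - 1.5 * x_t^2, without noise.
series = [0.3]
for _ in range(19):
    series.append(1.0 - 1.5 * series[-1] ** 2)

coeffs = fit_ols(series)  # should be close to [1.0, 0.0, -1.5]
```

A linear AR(1) fit could not represent this series, whereas the basis expansion recovers the generating coefficients with a single linear solve.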
Automated radiology report generation is essential for improving diagnostic efficiency and reducing the workload of medical professionals. However, existing methods face significant challenges, such as disease class imbalance and insufficient cross-modal fusion. To address these issues, we propose the learnable Retrieval Enhanced Visual-Text Alignment and Fusion (REVTAF) framework, which effectively tackles both class imbalance and visual-text fusion in report generation. REVTAF incorporates two core components: (1) a Learnable Retrieval Enhancer (LRE) that utilizes semantic hierarchies from hyperbolic space and intra-batch context through a ranking-based metric. LRE adaptively retrieves the most relevant reference reports, enhancing image representations, particularly for underrepresented (tail) class inputs; and (2) a fine-grained visual-text alignment and fusion strategy that ensures consistency across multi-source cross-attention maps for precise alignment. This component further employs an optimal transport-based cross-attention mechanism to dynamically integrate task-relevant textual knowledge for improved report generation. By combining adaptive retrieval with multi-source alignment and fusion, REVTAF achieves fine-grained visual-text integration under weak image-report level supervision while effectively mitigating data imbalance issues. The experiments demonstrate that REVTAF outperforms state-of-the-art methods, achieving an average improvement of 7.4% on the MIMIC-CXR dataset and 2.9% on the IU X-Ray dataset. Comparisons with mainstream multimodal LLMs (e.g., GPT-series models) further highlight its superiority in radiology report generation. Code is available at: https://github.com/banbooliang/REVTAF-RRG.