Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification

Authors: Eric Zhao, Pranjal Awasthi, Sreenivas Gollapudi

Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one, typically by verifying each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation that uses only random sampling and direct self-verification results in sustained performance improvements that, for example, elevate the Gemini v1.5 Pro model's reasoning capabilities past those of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts: chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-the-box verification capabilities, and we introduce a benchmark to measure progress on these deficiencies.
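The sampling-and-verification loop described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `generate` and `verify` callables, the parameter names, and the scoring rule (fraction of verification attempts passed) are all assumptions introduced here, with toy stand-ins in place of real model calls.

```python
import itertools
from typing import Callable, List, Tuple


def sampling_based_search(
    question: str,
    generate: Callable[[str], str],      # hypothetical: samples one candidate response
    verify: Callable[[str, str], bool],  # hypothetical: one self-verification attempt
    num_candidates: int = 8,
    num_verifications: int = 4,
) -> Tuple[str, float]:
    """Sample candidates, score each by repeated verification, return the best.

    The score is the fraction of verification attempts a candidate passes
    (an assumed scoring rule). Ties go to the earliest-sampled candidate.
    """
    candidates: List[str] = [generate(question) for _ in range(num_candidates)]
    best_response, best_score = candidates[0], -1.0
    for response in candidates:
        passes = sum(verify(question, response) for _ in range(num_verifications))
        score = passes / num_verifications
        if score > best_score:
            best_response, best_score = response, score
    return best_response, best_score


if __name__ == "__main__":
    # Toy stand-ins: the "model" cycles through fixed answers and the
    # "verifier" accepts only the string "42".
    answers = itertools.cycle(["40", "41", "42", "43"])
    best, score = sampling_based_search(
        "What is 6 * 7?",
        generate=lambda q: next(answers),
        verify=lambda q, r: r == "42",
    )
    print(best, score)
```

In this sketch, raising `num_candidates` widens the pool that is searched and raising `num_verifications` spends more compute on scrutinizing each candidate; these are the two axes along which such a search can be scaled.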

Subject: ICML.2025 - Poster

wl3eI4wiE5@OpenReview
