Zhongguancun Institute of Technology Makes New Discovery: Lightweight Validator Can Unlock the Best Option for LLM Reasoning
Posted: November 6, 2025, 16:28
Author: The New Intelligence of Science and Technology
This article was jointly completed by authors from Beijing Zhongguancun University, Harbin Institute of Technology, the Institute of Automation of the Chinese Academy of Sciences, and other institutions. The first author is Yu Bin, a doctoral student in a joint training program.
Research Background: Two Paradigms of Test-Time Scaling
As large language models (LLMs) take on increasingly complex tasks, Test-Time Scaling (TTS) has become a core approach to enhancing model inference capabilities. In simple terms, it means allocating more computational resources at answer time (i.e., at inference) to improve the quality of the model's output.
Internal Test-Time Scaling: large reasoning models, represented by DeepSeek-R1, scale at test time internally by extending the chain of thought.
External Test-Time Scaling: the model performs parallel inference when answering, producing multiple candidate reasoning paths and selecting among them.
As ever more schemes for improving the reasoning chain have been proposed, the internal Test-Time Scaling route is approaching its bottleneck. At this point, a better choice is to turn to the other direction: external Test-Time Scaling.
The Best-of-N paradigm is a typical representative of external test-time scaling: for a mathematical problem, the model generates N inference paths and selects the most likely correct one as the final answer, as shown in the figure below.
Image 7
There are two traditional methods to implement Best-of-N:
Majority Voting: choose the answer that appears most frequently.
Process Reward Model (PRM): use an additional model to score each step, then select the path with the highest total score.
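The two traditional selection rules can be sketched in a few lines of Python. This is a toy illustration, not the paper's code; the function names and the flat score lists are my own simplification:

```python
from collections import Counter

def majority_vote(answers):
    """Pick the answer string that occurs most often among the N samples."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n_prm(paths, step_scores):
    """Pick the path whose summed per-step PRM scores are highest.

    `step_scores[i]` is a list of scalar scores, one per step of path i,
    as produced by an (external, heavyweight) process reward model.
    """
    totals = [sum(scores) for scores in step_scores]
    best = max(range(len(paths)), key=lambda i: totals[i])
    return paths[best]

# Toy example: three sampled answers, majority wins.
print(majority_vote(["42", "41", "42"]))  # 42
```

Note the asymmetry: voting only looks at final answers, while the PRM rule inspects every intermediate step, which is why it costs a full extra model pass per path.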
However, both have their own problems. The voting method is relatively crude, and recent research has found that "the correct answer often lies in the minority", which further exposes its shortcomings in Best-of-N tasks. The PRM approach, for its part, relies on a heavyweight additional model whose deployment and inference costs rival those of the policy model itself.
This study aims to address these shortcomings and proposes the TrajSelector method: a lightweight yet powerful Best-of-N strategy that evaluates the quality of inference paths by reusing the hidden states of the large model, achieving accurate path selection at a fraction of the usual cost.
Image 13
Paper Title: TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model
Paper Link: https://arxiv.org/abs/2510.16449
Project Page: https://zgca-ai4edu.github.io/TrajSelector/
TrajSelector: Leveraging Hidden States of Large Models to Unlock the Optimal Choice in Large Model Inference
The paper first analyzes two fatal flaws in the existing Best-of-N method.
The heavyweight process reward model (PRM) is too costly: the mainstream approach scores each inference step with a 7B-parameter PRM, whose deployment and inference cost is almost the same as that of a policy model such as Qwen3 with 8B parameters.
Why do we need hidden states? Because "self-reflection signals" are often buried in the hidden states of large models. For example, when solving a math problem, the hidden state at a certain step may already encode information about whether that derivation is correct.
The core goal of TrajSelector is to solve these two problems: to fully utilize the hidden states of the policy (sampling) model with the smallest possible parameter overhead, and to achieve an effective and efficient Best-of-N paradigm. The architecture of the method is shown below.
Image 20
The framework of TrajSelector is very concise: in essence, a three-step pipeline of parallel sampling, step scoring, and aggregation-and-selection.
1. Parallel sampling: a frozen policy model samples multiple inference paths in parallel, exposing the hidden states of each path.
2. Step scoring: a lightweight scoring model with only 0.6B parameters (Qwen3-0.6B-Base) scores each inference step by reusing the hidden states of the policy model. Reusing hidden states enables the small model to judge step quality without re-encoding the text.
3. Aggregation and selection: TrajSelector takes the simple arithmetic mean of the step scores as each path's global score, then selects the path with the highest global score as the final answer.
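The three steps above can be sketched as a single selection function. This is a toy sketch under my own assumptions: the trajectory dictionaries and the fixed linear "probe" standing in for the 0.6B scorer are hypothetical, not the paper's implementation:

```python
def trajselector_pick(trajectories):
    """Toy sketch of the TrajSelector pipeline.

    Each trajectory is a dict with:
      - "answer": the final answer string
      - "step_hiddens": one hidden-state vector per reasoning step,
        reused from the frozen policy model (no re-encoding of text).
    """
    def score_step(hidden):
        # Stand-in for the lightweight scorer: a fixed linear probe
        # over the hidden state (the real scorer is Qwen3-0.6B-Base).
        w = [0.5, -0.2, 0.1]
        return sum(wi * hi for wi, hi in zip(w, hidden))

    best_answer, best_score = None, float("-inf")
    for traj in trajectories:
        step_scores = [score_step(h) for h in traj["step_hiddens"]]
        # Aggregation: simple arithmetic mean over the step scores.
        path_score = sum(step_scores) / len(step_scores)
        if path_score > best_score:
            best_answer, best_score = traj["answer"], path_score
    return best_answer
```

The design choice worth noting is that the expensive part (producing hidden states) is already paid for during sampling; the scorer only adds a cheap forward pass per step.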
Training Scheme
Traditional PRM training requires a large number of step-level annotations, such as manually labeling each inference step as correct or incorrect, which is extremely costly. In contrast, TrajSelector requires no manual step annotations: training needs only trajectory-level labels, i.e., whether the final answer is correct.
The core challenge during training is that a trajectory with a correct final answer may not have every step correct (for example, redundant steps with a correct result). If the trajectory label is used directly as the step label, it will introduce label noise.
For trajectories labeled as correct, the model is required to predict that the probabilities of correct + neutral sum to 1 (allowing some steps to be neutral and absorb noise); for trajectories labeled as wrong, it is required to predict that the probabilities of wrong + neutral sum to 1.
Image 29
Such a training scheme removes the reliance on manual process annotations, lets the model learn from data how to "focus on the key points", and yields an intelligent, lightweight process validator under weak supervision.
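The weak-supervision objective described above can be written down as a short loss function. The exact form here is my own reconstruction from the article's description, not the paper's verbatim loss; the three-way (correct/neutral/wrong) step distribution is assumed:

```python
import math

def trajectory_step_loss(step_probs, traj_correct):
    """Sketch of the weakly supervised per-step loss.

    `step_probs` is a list of (p_correct, p_neutral, p_wrong) triples
    from the lightweight scorer, one per reasoning step. Only the
    trajectory-level label `traj_correct` is available; the neutral
    class absorbs steps whose true label is unknown (e.g. redundant
    steps inside a correct trajectory).
    """
    loss = 0.0
    for p_c, p_n, p_w in step_probs:
        if traj_correct:
            # Push p(correct) + p(neutral) toward 1, i.e. p(wrong) toward 0.
            loss += -math.log(p_c + p_n)
        else:
            # Push p(wrong) + p(neutral) toward 1, i.e. p(correct) toward 0.
            loss += -math.log(p_w + p_n)
    return loss / len(step_probs)
```

A redundant-but-harmless step in a correct trajectory can sit entirely in the neutral class and incur no penalty, which is exactly how the noise from trajectory-level labels is absorbed.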
Experimental Effect
The paper presents the model performance under different settings of N values in the Best-of-N task, including N = 1, 5, 10, 16, 32, 64. The benchmarks selected include mainstream ones such as AMC23, AIME24, AIME25, BeyondAIME, HMMT25, BRUMO-25.
The following table shows the Best-of-N performance with N=16 and N=32 based on Qwen3-8B.
Image 34
Averaging performance across the various baselines yields an external Test-Time Scaling curve for the Best-of-N approach, plotted below.
Image 36
Compared with the various baselines, the TrajSelector scheme achieves more stable performance gains as N increases.
Summary
TrajSelector offers an important idea for optimizing large model inference: rather than pursuing an ever larger model, it is better to use the existing model's capabilities more intelligently. It achieves better results with a lightweight verifier of only 0.6B parameters.