Zhongguancun Institute of Technology Makes New Discovery: Lightweight Validator Can Unlock the Best Option for LLM Reasoning
Posted: November 6, 2025, 16:28
Author: The New Intelligence of Science and Technology
This article was jointly completed by authors from Beijing Zhongguancun University, Harbin Institute of Technology, the Institute of Automation of the Chinese Academy of Sciences, and other institutions. The first author is Yu Bin, a doctoral student in a joint training program.
Research Background: Two Paradigms of Test-Time Scaling
As large language models (LLMs) take on increasingly complex tasks, Test-Time Scaling (TTS) has become a core approach to enhancing model inference capabilities. In simple terms, it means allocating more computational resources at answer time (i.e., at inference) to improve the quality of the model's output.
Internal Test-Time Scaling: large reasoning models, represented by DeepSeek-R1, scale at test time internally by extending the chain of thought.
External Test-Time Scaling: the model performs parallel inference when answering, producing multiple candidate reasoning paths and selecting among them.
As ever more schemes for improving the reasoning chain have been proposed, the internal Test-Time Scaling route is approaching its bottleneck. At this point, a better choice is to turn to the other direction: external Test-Time Scaling.
The Best-of-N paradigm is a typical representative of external test-time scaling: for a mathematical problem, the model generates N inference paths and selects the most likely correct one as the final answer, as shown in the figure below.
Image 7
There are two traditional methods to implement Best-of-N:
Majority Voting: choose the answer that appears most frequently.
Process Reward Model (PRM): use an additional model to score each step, then select the path with the highest total score.
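The two traditional selection rules can be sketched in a few lines of Python. This is a toy illustration, not the paper's code; the function names and the flat score lists are my own simplification:

```python
from collections import Counter

def majority_vote(answers):
    """Pick the answer string that occurs most often among the N samples."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n_prm(paths, step_scores):
    """Pick the path whose summed per-step PRM scores are highest.

    `step_scores[i]` is a list of scalar scores, one per step of path i,
    as produced by an (external, heavyweight) process reward model.
    """
    totals = [sum(scores) for scores in step_scores]
    best = max(range(len(paths)), key=lambda i: totals[i])
    return paths[best]

# Toy example: three sampled answers, majority wins.
print(majority_vote(["42", "41", "42"]))  # 42
```

Note the asymmetry: voting only looks at final answers, while the PRM rule inspects every intermediate step, which is why it costs a full extra model pass per path.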
However, both have their own problems. The voting method is relatively crude, and recent research has found that "the correct answer often lies in the minority", which further exposes its shortcomings in Best-of-N tasks. The PRM approach, for its part, relies on a heavyweight additional model whose deployment and inference costs rival those of the policy model itself.
This study aims to address these shortcomings and proposes the TrajSelector method: a lightweight yet powerful Best-of-N strategy that evaluates the quality of inference paths by reusing the hidden states of the large model, achieving accurate path selection at a fraction of the usual cost.
Image 13
Paper Title: TrajSelector: Harnessing Latent Representations for Efficient and Effective Best-of-N in Large Reasoning Model
Paper Link: https://arxiv.org/abs/2510.16449
Project Page: https://zgca-ai4edu.github.io/TrajSelector/
TrajSelector: Leveraging Hidden States of Large Models to Unlock the Optimal Choice in Large Model Inference
The paper first analyzes two fatal flaws in the existing Best-of-N method.
The heavyweight process reward model (PRM) is too costly: the mainstream approach scores each inference step with a 7B-parameter PRM, whose deployment and inference cost is almost the same as that of a policy model such as Qwen3 with 8B parameters.
Why do we need hidden states? Because "self-reflection signals" are often buried in the hidden states of large models. For example, when solving a math problem, the hidden state at a certain step may already encode information about whether that derivation is correct.
The core goal of TrajSelector is to solve these two problems: to fully utilize the hidden states of the policy (sampling) model with the smallest possible parameter overhead, and to achieve an effective and efficient Best-of-N paradigm. The architecture of the method is shown below.
Image 20
The framework of TrajSelector is very concise: in essence, a three-step pipeline of parallel sampling, step scoring, and aggregation-and-selection.
1. Parallel sampling: a frozen policy model samples multiple inference paths in parallel, exposing the hidden states of each path.
2. Step scoring: a lightweight scoring model with only 0.6B parameters (Qwen3-0.6B-Base) scores each inference step by reusing the hidden states of the policy model. Reusing hidden states enables the small model to judge step quality without re-encoding the text.
3. Aggregation and selection: TrajSelector takes the simple arithmetic mean of the step scores as each path's global score, then selects the path with the highest global score as the final answer.
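The three steps above can be sketched as a single selection function. This is a toy sketch under my own assumptions: the trajectory dictionaries and the fixed linear "probe" standing in for the 0.6B scorer are hypothetical, not the paper's implementation:

```python
def trajselector_pick(trajectories):
    """Toy sketch of the TrajSelector pipeline.

    Each trajectory is a dict with:
      - "answer": the final answer string
      - "step_hiddens": one hidden-state vector per reasoning step,
        reused from the frozen policy model (no re-encoding of text).
    """
    def score_step(hidden):
        # Stand-in for the lightweight scorer: a fixed linear probe
        # over the hidden state (the real scorer is Qwen3-0.6B-Base).
        w = [0.5, -0.2, 0.1]
        return sum(wi * hi for wi, hi in zip(w, hidden))

    best_answer, best_score = None, float("-inf")
    for traj in trajectories:
        step_scores = [score_step(h) for h in traj["step_hiddens"]]
        # Aggregation: simple arithmetic mean over the step scores.
        path_score = sum(step_scores) / len(step_scores)
        if path_score > best_score:
            best_answer, best_score = traj["answer"], path_score
    return best_answer
```

The design choice worth noting is that the expensive part (producing hidden states) is already paid for during sampling; the scorer only adds a cheap forward pass per step.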
Training Scheme
Traditional PRM training requires a large number of step-level annotations, such as manually labeling each inference step as correct or incorrect, which is extremely costly. In contrast, TrajSelector requires no manual step annotations: training needs only trajectory-level labels, i.e., whether the final answer is correct.
The core challenge during training is that a trajectory with a correct final answer may not have every step correct (for example, redundant steps with a correct result). If the trajectory label is used directly as the step label, it will introduce label noise.
For trajectories labeled as correct, the model is required to predict that the probabilities of correct + neutral sum to 1 (allowing some steps to be neutral and absorb noise); for trajectories labeled as wrong, it is required to predict that the probabilities of wrong + neutral sum to 1.
Image 29
Such a training scheme removes the reliance on manual process annotations, lets the model learn from data how to "focus on the key points", and yields an intelligent, lightweight process validator under weak supervision.
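The weak-supervision objective described above can be written down as a short loss function. The exact form here is my own reconstruction from the article's description, not the paper's verbatim loss; the three-way (correct/neutral/wrong) step distribution is assumed:

```python
import math

def trajectory_step_loss(step_probs, traj_correct):
    """Sketch of the weakly supervised per-step loss.

    `step_probs` is a list of (p_correct, p_neutral, p_wrong) triples
    from the lightweight scorer, one per reasoning step. Only the
    trajectory-level label `traj_correct` is available; the neutral
    class absorbs steps whose true label is unknown (e.g. redundant
    steps inside a correct trajectory).
    """
    loss = 0.0
    for p_c, p_n, p_w in step_probs:
        if traj_correct:
            # Push p(correct) + p(neutral) toward 1, i.e. p(wrong) toward 0.
            loss += -math.log(p_c + p_n)
        else:
            # Push p(wrong) + p(neutral) toward 1, i.e. p(correct) toward 0.
            loss += -math.log(p_w + p_n)
    return loss / len(step_probs)
```

A redundant-but-harmless step in a correct trajectory can sit entirely in the neutral class and incur no penalty, which is exactly how the noise from trajectory-level labels is absorbed.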
Experimental Effect
The paper presents the model performance under different settings of N values in the Best-of-N task, including N = 1, 5, 10, 16, 32, 64. The benchmarks selected include mainstream ones such as AMC23, AIME24, AIME25, BeyondAIME, HMMT25, BRUMO-25.
The following table shows the Best-of-N performance with N=16 and N=32 based on Qwen3-8B.
Image 34
Averaging performance across the various baselines yields an external Test-Time Scaling curve for the Best-of-N approach, plotted below.
Image 36
Compared with the various baselines, the TrajSelector scheme achieves more stable performance gains as N increases.
Summary
TrajSelector offers an important idea for optimizing large model inference: rather than pursuing an ever larger model, it is better to use the existing model's capabilities more intelligently. It achieves better results with a lightweight verifier of only 0.6B parameters.