Selective Knowledge Distillation with Fine-Grained Token Filtering: AdaSPEC, an Accelerator for Inference Decoding, Is Here
Posted: November 6, 2025, 15:02
Author: The New Intelligence of Science and Technology
The co-first authors of this paper are Dr. Yuezhou Hu of the University of California, Berkeley, and Jiaxin Guo, an undergraduate at Tsinghua University; the corresponding author is Associate Professor Tuo Zhao of the Georgia Institute of Technology.
Speculative Decoding (SD) significantly accelerates the inference of large language models (LLMs) by generating candidate tokens with a smaller draft model and then verifying them with a larger target model. How much acceleration SD delivers depends largely on how well the draft model aligns with the target model, that is, on the token acceptance rate.
Currently, the most advanced alignment method is knowledge distillation (KD), which minimizes the KL divergence over all tokens. However, minimizing the global KL divergence does not necessarily maximize the token acceptance rate: because the draft model's capacity is limited, forcing it to fit the target model on every token leads to suboptimal alignment.
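As a point of reference, the standard KD objective for draft models averages a token-level KL divergence over every position in the sequence. Below is a minimal PyTorch-style sketch of this "distill on all tokens" loss; the function and variable names are illustrative and not taken from the paper's code.

import torch.nn.functional as F

def global_kd_loss(draft_logits, target_logits, pad_mask):
    # draft_logits, target_logits: [batch, seq_len, vocab_size]
    # pad_mask: [batch, seq_len], 1.0 for real tokens, 0.0 for padding
    log_q = F.log_softmax(draft_logits, dim=-1)           # draft distribution (log)
    p = F.softmax(target_logits, dim=-1)                  # target distribution
    kl = (p * (p.clamp_min(1e-9).log() - log_q)).sum(-1)  # per-token KL(target || draft)
    # Average over *all* non-padding tokens, easy and hard alike
    return (kl * pad_mask).sum() / pad_mask.sum()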
To address this issue, the research team from the Georgia Institute of Technology, Tsinghua University, and the University of California, Berkeley proposed AdaSPEC, an innovative distillation method that introduces a selective token filtering mechanism. AdaSPEC uses a reference model to identify and filter out hard-to-fit tokens, so that the draft model spends its limited capacity on the tokens it can actually align with the target model.
This selective distillation strategy significantly improves the overall token acceptance rate without reducing generation quality. We systematically evaluated it on multiple tasks (arithmetic reasoning, instruction following, code generation, and summarization) with draft/target configurations of 31M/1.4B and 350M/2.7B parameters.
Paper Title: AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders
Paper Link: https://arxiv.org/abs/2510.19779
Github Link: https://github.com/yuezhouhu/adaspec
Research Background
Large language models (LLMs) have demonstrated excellent performance in reasoning and generative tasks. However, their autoregressive decoding mechanism leads to high inference latency and computational cost, which has become a major bottleneck for practical deployment.
In recent years, Speculative Decoding (SD) has offered a new path: a lightweight draft model generates multiple candidate tokens in parallel, and the original main (target) model then verifies them in a batch, reducing the number of forward passes through the large model and thereby cutting latency.
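For intuition, the verification step works roughly as follows: the draft proposes a block of tokens, and the target accepts each proposed token with probability min(1, p_target / p_draft), stopping at the first rejection. Below is a simplified sketch of this rule for a single sequence, ignoring batching and KV-cache handling that real systems rely on.

import torch

def verify_block(draft_tokens, p_draft, p_target):
    # draft_tokens: [k] token ids proposed by the draft model
    # p_draft, p_target: [k] probabilities each model assigned to those tokens
    accepted = []
    for i, tok in enumerate(draft_tokens.tolist()):
        accept_prob = torch.clamp(p_target[i] / p_draft[i], max=1.0)
        if torch.rand(()) < accept_prob:
            accepted.append(tok)   # token kept "for free"
        else:
            break                  # rejection discards this token and the rest
    # The target model then samples one corrective token itself, so each round
    # yields len(accepted) + 1 tokens; the acceptance rate is len(accepted) / k.
    return accepted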
One current approach is to use knowledge distillation (KD) to make the draft model mimic the output distribution of the main model. However, the draft model is usually an order of magnitude smaller than the main model and has limited capacity; forcing it to fit the main model's distribution on every token spreads that capacity too thin and yields suboptimal alignment.
In response to this issue, the research team proposed AdaSPEC, a selective knowledge distillation method for speculative decoding. The core idea of AdaSPEC is to let the draft model focus on learning the 'easy' tokens it can genuinely master, while skipping the difficult tokens that exceed its capacity, thereby raising the overall token acceptance rate.
Experiments show that AdaSPEC consistently improves token acceptance rate (up to 15% improvement) across multiple models and tasks, effectively unleashing the acceleration potential of speculative decoding while maintaining generation quality.
Method Overview
The core idea of AdaSPEC is to identify and filter out the tokens that are difficult to learn during the distillation stage (as shown in Figure 1 below), making knowledge transfer more focused and effective.
Figure 1
Selective KD core mechanism
By introducing a reference model, we can automatically filter out the difficult-to-align tokens in the training samples and perform distillation only on the easy-to-learn subset, fundamentally mitigating the draft-target mismatch issue.
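A minimal sketch of what such a filter might look like, assuming we already have per-token losses from the reference model and the draft model on the same batch. The ranking signal (loss gap) and the keep ratio below are illustrative assumptions, not the paper's published settings.

import torch

def select_easy_tokens(ref_loss, draft_loss, keep_ratio=0.8):
    # ref_loss, draft_loss: [num_tokens] per-token losses on the same batch
    # Rank tokens by how much harder they are for the draft than for the
    # reference model, and keep the easiest `keep_ratio` fraction.
    gap = draft_loss - ref_loss
    k = max(1, int(keep_ratio * gap.numel()))
    keep_idx = torch.topk(-gap, k).indices             # smallest gaps first
    mask = torch.zeros_like(gap, dtype=torch.bool)
    mask[keep_idx] = True
    return mask   # the distillation loss is applied only where mask is True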
Two-stage Training Framework
AdaSPEC first performs an initial distillation of the draft model to obtain a reference model. It then uses the reference model to filter the fine-tuning dataset and optimizes the draft model on the filtered subset. This significantly improves the alignment between the draft model and the target model.
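Putting the two stages together, the training loop could be organized roughly as below. This is a sketch under the same assumptions as above: it assumes Hugging Face-style causal LMs that return `.logits`, assumes the reference model starts as a copy of the draft model, reuses `select_easy_tokens` from the previous sketch, and omits padding, scheduling, and optimizer setup.

import torch
import torch.nn.functional as F

def per_token_ce(model, input_ids, labels):
    # Per-token cross-entropy (no reduction), used to rank token difficulty.
    logits = model(input_ids).logits[:, :-1]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels[:, 1:].reshape(-1), reduction="none")

def kd_loss(student_logits, teacher_logits, token_mask=None):
    # Token-level KL(teacher || student), optionally restricted to a token subset.
    log_q = F.log_softmax(student_logits, dim=-1)
    p = F.softmax(teacher_logits, dim=-1)
    kl = (p * (p.clamp_min(1e-9).log() - log_q)).sum(-1).reshape(-1)
    return kl.mean() if token_mask is None else kl[token_mask].mean()

def train_adaspec_style(draft, reference, target, loader, opt_ref, opt_draft):
    # Stage 1: ordinary distillation of the reference model on all tokens.
    for input_ids, labels in loader:
        loss = kd_loss(reference(input_ids).logits,
                       target(input_ids).logits.detach())
        opt_ref.zero_grad()
        loss.backward()
        opt_ref.step()

    # Stage 2: filter tokens with the reference model, then distill the draft
    # model only on the selected (easier) subset.
    for input_ids, labels in loader:
        with torch.no_grad():
            mask = select_easy_tokens(per_token_ce(reference, input_ids, labels),
                                      per_token_ce(draft, input_ids, labels))
        loss = kd_loss(draft(input_ids).logits[:, :-1],
                       target(input_ids).logits[:, :-1].detach(),
                       token_mask=mask)
        opt_draft.zero_grad()
        loss.backward()
        opt_draft.step()
    return draft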
Universal adaptability and lightweight implementation
AdaSPEC is highly modular and compatible, with a clean design that integrates seamlessly with advanced speculative decoding frameworks such as EAGLE and vLLM. Its core implementation is under a hundred lines of code, making it straightforward to adopt and extend.
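As an illustration, an AdaSPEC-distilled draft checkpoint can be plugged into vLLM's built-in speculative decoding. The snippet below is indicative only: the checkpoint paths are hypothetical, and the speculative-decoding arguments (`speculative_model`, `num_speculative_tokens`) follow older vLLM releases; newer releases configure this differently, so check the documentation for your version.

from vllm import LLM, SamplingParams

# Hypothetical checkpoint paths: a 2.7B target model and an AdaSPEC-distilled
# 350M draft model.
llm = LLM(
    model="path/to/target-2.7b",
    speculative_model="path/to/adaspec-draft-350m",
    num_speculative_tokens=5,   # tokens proposed by the draft per verification step
)

sampling = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Question: what is 17 * 24?\nAnswer:"], sampling)
print(outputs[0].outputs[0].text)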
Experimental Evaluation
The research team systematically validated AdaSPEC on multiple model families (Pythia, CodeGen, Phi-2, etc.) and tasks (GSM8K, Alpaca, MBPP, CNN/DailyMail, XSUM), observing consistent and robust improvements across model sizes and task types.
Token acceptance rate surpasses baseline methods across the board: compared with DistillSpec, AdaSPEC improves the acceptance rate by 5–6% on GSM8K and by up to 15% on MBPP. The acceleration also holds in practice: deployed with the vLLM framework, end-to-end inference speed increases by 10–2…
Summary and Outlook
AdaSPEC provides a precise, efficient, and broadly applicable acceleration paradigm for speculative decoding. Through selective distillation and adaptive token filtering, it dynamically aligns the draft and target models, opening up new directions for research on efficient LLM inference.
Two directions remain worth exploring beyond the current work: further study of dynamic token-difficulty estimation to enable finer-grained selective distillation, and applying AdaSPEC to multimodal and reasoning-oriented large models to verify its cross-modal adaptability.