TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection

Wei Wu, Zhuoshi Pan, Kun Fu, Chao Wang, Liyi Chen, Yunchu Bai, Tianfu Wang, Zheng Wang, Hui Xiong [TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection](https://aclanthology.org/2025.emnlp-main.1079/) (Wu et al., EMNLP 2025) ACL materials are Copyright © 1963–2026 ACL; other materials are copyrighted by their respective copyright holders. Materials prior to 2016 here are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 International License. Permission is granted to make copies for the purposes of teaching and research. Materials published in or after 2016 are licensed on a Creative Commons Attribution 4.0 International License.

The ACL Anthology is managed and built by the ACL Anthology team of volunteers. Site last built on 07 March 2026 at 16:51 UTC with commit 824b2f5.

Abstract: The rapid advancement of Large Language Models (LLMs) has driven growing demand for processing extended context sequences in contemporary applications. However, this progress faces two major challenges: performance degradation due to out-of-distribution sequence lengths, and excessively long inference times caused by the quadratic computational complexity of attention. These issues hinder the application of LLMs in long-context scenarios. In this paper, we propose Dynamic Token-Level KV Cache Selection (TokenSelect), a training-free method for efficient and accurate long-context inference.

TokenSelect builds upon the observation of non-contiguous attention sparsity, using Query-Key dot products to measure per-head KV Cache criticality at the token level. Through a per-head soft voting mechanism, TokenSelect selectively involves a small number of critical KV Cache tokens in the attention calculation without sacrificing accuracy. To further accelerate TokenSelect, we design the Selection Cache based on observations of consecutive query similarity and implement an efficient Paged Dot Product Kernel, significantly reducing the selection overhead. A comprehensive evaluation of TokenSelect demonstrates up to 23.84× speedup in attention computation and up to 2.28× acceleration in end-to-end latency, while providing superior performance compared to state-of-the-art long-context inference methods.

TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection. Wei Wu1, Zhuoshi Pan2, Chao Wang1, Liyi Chen1, Yunchu Bai1, Tianfu Wang1, Kun Fu3, Zheng Wang3, Hui Xiong4,5. 1University of Science and Technology of China, 2Tsinghua University, 3Alibaba Cloud Computing, 4Hong Kong University of...

With the rapid development of large language models (LLMs), the number of parameters is no longer the sole factor significantly affecting model performance. The ability to effectively process longer context information has become one of the key metrics for evaluating LLMs' capabilities. The latest applications such as cross-document understanding (Bai et al., 2024), LLM-powered search systems (Sharma et al., 2024), repository-level code completion (Zhang et al., 2023; Di et al., 2024), and complex reasoning (OpenAI) have all placed higher demands on the long-context abilities of LLMs. There are two main difficulties in using pre-trained LLMs for long-context inference.

On one hand, LLMs are limited by the context length used during pre-training (e.g., Llama 3 has a context length of only 8192 tokens). Directly performing inference on longer sequences can lead to severe performance degradation, for reasons including out-of-distribution sequence lengths (Xiao et al., 2024b; Han et al., 2024). On the other hand, even if LLMs possess sufficiently large context lengths, the quadratic computational complexity of attention with respect to sequence length makes the response time for long-context inference unbearable.

Previous works have made numerous attempts to address these difficulties. To extend the context length of LLMs, the current common practice is to perform post-training on long texts (Team et al., 2024; Yang et al., 2024a; GLM et al., 2024). However, this approach comes with significant computational costs, particularly in two aspects: the synthesis of high-quality long-text data and the training process on extended sequences.

To accelerate long-context inference, many studies focus on the sparsity of attention, attempting to reduce the scale of the KV Cache involved in computation. The key to this type of method lies in designing sparse patterns for attention, which can be mainly divided into two categories: one uses predefined sparse patterns (Wang et al., 2019; Zaheer et al., 2020; Xiao et al., 2024b; Han et al., 2024), while the other estimates the potential importance of the KV Cache during inference (Zhang et al., 2024c; Oren et al., 2024; Xiao et al., 2024a; Tang et al., 2024b; Jiang et al., 2024), attempting to select relevant KV Cache tokens into attention calculations. However, the design of these sparse patterns is often heuristically based on historical criticality or coarse-grained criticality estimation of tokens, making it difficult to ensure that the selected tokens are truly critical, thus resulting...

💡 Dynamic Token-Level KV Cache Selection: Use Query-Key dot products to measure per-head KV Cache criticality at the token level.
💡 Per-head Soft Voting Mechanism: Calculate the per-head criticality, normalize through softmax, and sum over all heads, which offers better performance and efficiency.
💡 Selection Cache: Allow consecutive similar queries to share token selection results, thereby reducing the selection frequency while ensuring its effectiveness.
✅ TokenSelect – A model-agnostic, training-free method for efficient and accurate long-context inference. It selectively involves a small number of critical KV Cache tokens in the attention calculation without sacrificing accuracy.
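The dot-product criticality measurement and the per-head soft voting above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the array shapes, the `select_tokens` name, and the use of NumPy are choices made here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_tokens(q, K, k_top):
    """Per-head soft voting over cached tokens.

    q: (H, d)     current query, one vector per head
    K: (H, N, d)  cached keys, N tokens per head
    Returns indices of the k_top most critical tokens (shared across heads).
    """
    # Token-level criticality per head: Query-Key dot products.
    scores = np.einsum('hd,hnd->hn', q, K)   # (H, N)
    # Normalize each head's scores with softmax so heads vote on a common scale.
    votes = softmax(scores, axis=-1)         # (H, N)
    # Sum the votes across heads and keep the top-k tokens.
    total = votes.sum(axis=0)                # (N,)
    return np.argsort(total)[-k_top:][::-1]
```

Only the selected tokens' keys and values would then enter the attention computation, shrinking the effective KV Cache from N to k_top tokens per step.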

📊 Result – Up to $23.84\times$ speedup in attention computation and up to $2.28\times$ acceleration in end-to-end latency!

The paper "TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection" presents a novel approach to address the challenges associated with long-context inference in LLMs. The study targets two primary obstacles: performance degradation when dealing with sequences longer than those seen during training and the high computational costs associated with quadratic attention complexities. The authors introduce TokenSelect, a method that improves the efficiency and accuracy of long-context inference without requiring additional training or model-specific adaptations.
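The Selection Cache, which lets consecutive similar queries reuse token-selection results, can be sketched as follows. This is a hedged illustration: the cosine-similarity test, the 0.9 threshold, and the class and method names are assumptions made here, not the paper's exact design.

```python
import numpy as np

class SelectionCache:
    """Let consecutive, similar queries share token-selection results."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold   # similarity required to reuse a selection
        self.last_query = None
        self.last_selection = None

    def get_or_select(self, q, select_fn):
        # If the new query is close enough to the previous one, skip selection
        # and reuse the cached token indices.
        if self.last_query is not None:
            qf, lf = q.ravel(), self.last_query.ravel()
            cos = float(qf @ lf) / (np.linalg.norm(qf) * np.linalg.norm(lf) + 1e-8)
            if cos >= self.threshold:
                return self.last_selection
        # Otherwise run the (expensive) selection and refresh the cache.
        self.last_query = q
        self.last_selection = select_fn(q)
        return self.last_selection
```

Because consecutive decoding queries tend to be similar, most steps would hit the cache, so the dot-product selection runs only occasionally rather than at every token.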

TokenSelect is predicated on token-level Key-Value (KV) cache selection via dynamic evaluation of token importance, which deviates from more traditional block-level or fixed sparse attention methods. Evaluation was conducted on benchmarking datasets such as InfiniteBench, RULER, and LongBench using various mainstream LLMs, including Qwen2-7B-Instruct, Llama-3-8B-Instruct, and Yi-1.5-6B-Chat. The results demonstrate that TokenSelect delivers superior performance compared to state-of-the-art long-context inference methods while substantially reducing attention computation and end-to-end latency.

