This paper, published in FAST’25, aims to improve the inference performance of large language models (LLMs).

LLM Inference Basics

LLMs are artificial intelligence systems designed to understand and generate human language. They learn patterns in language, grammar, context, and general knowledge by training on vast datasets of text from various sources. An LLM’s inference workflow is the process of generating responses from an input prompt after pre-training. It begins by breaking the input text into tokens, which can be words, subwords, or characters. This tokenization enables the LLM to process diverse vocabularies and writing styles. The tokens are then converted into numerical embeddings (i.e., dense vector representations) that capture the meanings of tokens and the relationships among them.
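
To make the tokenization-then-embedding step concrete, here is a minimal sketch with a toy whitespace tokenizer and a randomly initialized embedding table (real systems use learned subword tokenizers and trained embeddings; none of the names below come from the paper):

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}        # toy vocabulary
embed_dim = 8
embedding_table = np.random.randn(len(vocab), embed_dim)  # learned in practice

def tokenize(text):
    # Real tokenizers split into subwords; here we simply split on whitespace.
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

token_ids = tokenize("The cat sat")
embeddings = embedding_table[token_ids]   # shape: (num_tokens, embed_dim)
print(token_ids, embeddings.shape)
```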

These embeddings enter the LLM’s input layer, which processes them before they move through the transformer architecture. Specifically, the embeddings pass through multiple transformer layers, with each layer containing a self-attention mechanism (which analyzes the relationships between tokens in the input) and a feed-forward neural network (which improves the LLM’s ability to make predictions).
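
The following sketch shows the two components of one layer in their simplest form: single-head self-attention followed by a feed-forward network. It is illustrative only; real layers add residual connections, normalization, and multiple heads:

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])          # token-to-token relevance
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax
    return weights @ v                               # weighted mix of values

def feed_forward(x, W1, W2):
    return np.maximum(0, x @ W1) @ W2                # two-layer MLP with ReLU

d = 8
x = np.random.randn(5, d)                            # 5 token embeddings
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
W1, W2 = np.random.randn(d, 4 * d), np.random.randn(4 * d, d)
out = feed_forward(self_attention(x, Wq, Wk, Wv), W1, W2)
```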

After processing through the stacked transformer layers, the final hidden states pass through an output layer, which converts the high-dimensional representations into a probability distribution over the vocabulary to predict the next token. This newly generated token joins the original input sequence, creating an extended input for the next iteration. The LLM repeatedly processes the concatenated sequence through the same steps: embedding, transformer layer processing, and output layer prediction, until reaching a predefined stopping condition. Finally, the generated sequence of tokens is converted back into human-readable text as the response.
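
This autoregressive loop can be summarized in a few lines. In the sketch below, `model` stands in for the full embedding, transformer, and output-layer stack, and greedy (argmax) decoding is used for brevity:

```python
def generate(model, input_ids, eos_id, max_new_tokens=64):
    tokens = list(input_ids)
    for _ in range(max_new_tokens):
        logits = model(tokens)          # scores over the vocabulary
        next_token = max(range(len(logits)), key=lambda i: logits[i])  # argmax
        tokens.append(next_token)       # extend the sequence for the next step
        if next_token == eos_id:        # predefined stopping condition
            break
    return tokens
```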

Problem

The responsiveness of LLM applications significantly impacts user experience. A key metric is the time-to-first-token (TTFT), which measures the latency from the start of processing the input text until the first output token is generated. However, many LLM applications prepend user queries with context-rich prefixes. While these prefixes improve response quality, they also increase the input sequence length, consequently extending the critical TTFT.
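
A rough sketch of how TTFT could be measured around an inference call is shown below; `generate_first_token` is a hypothetical stand-in for the prefill plus first decoding step, not an API from the paper:

```python
import time

def measure_ttft(generate_first_token, prompt):
    start = time.perf_counter()
    _first_token = generate_first_token(prompt)   # prefill + first decode step
    return time.perf_counter() - start            # seconds until first token
```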

Existing studies often employ prefix key-value (KV) caches, which store the computed states for common prefixes to avoid re-computation when these prefixes are reused. However, due to the substantial storage footprint of prefix KVs, keeping them entirely within the limited memory of GPUs or CPUs is impractical. On the other hand, storing prefix KVs on SSDs incurs significant I/O latency when loading them into the GPU for processing, potentially increasing TTFT even further.
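
A back-of-the-envelope calculation illustrates why the footprint is substantial. The numbers below are assumptions for a LLaMA-7B-like model in fp16, not figures from the paper:

```python
layers, heads, head_dim = 32, 32, 128
bytes_per_value = 2                       # fp16
prefix_tokens = 4096

# Keys and values for every layer, head, and token of the prefix.
kv_bytes = 2 * layers * heads * head_dim * bytes_per_value * prefix_tokens
print(f"{kv_bytes / 2**30:.1f} GiB per cached prefix")   # ~2.0 GiB
```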

This paper aims to improve prefix KV management to reduce TTFT.

Main Idea

This paper builds on the prior observation that the KV pairs associated with different tokens vary in their importance for maintaining LLM accuracy during inference. It focuses on minimizing I/O during prefix reuse and proposes selectively retrieving only the KVs deemed important, thereby avoiding the latency of loading and processing less critical KV pairs from storage.

This paper introduces IMPRESS, an importance-aware prefix KV cache management approach. For an input sequence S comprising a prefix (P) and a query (Q), denoted S = P||Q, IMPRESS first identifies the longest prefix (R) of P that matches previously cached prefixes. It then directs only specific KVs to the LLM for processing: those corresponding to (i) important tokens within the matched prefix R, (ii) tokens unique to the current prefix (P−R), and (iii) the query tokens Q. KVs corresponding to unimportant tokens within R are intentionally omitted to reduce I/O overhead.
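
The selection logic can be sketched as follows. The helpers `load_kv` and `compute_kv`, and the `importance` map are hypothetical stand-ins for the system’s actual interfaces, not the paper’s API:

```python
def shared_prefix_len(a, b):
    # Length of the common prefix of two token sequences.
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def select_kvs(P, Q, cached_prefixes, importance, load_kv, compute_kv):
    # R is the longest prefix of P matching any previously cached prefix.
    R_len = max((shared_prefix_len(P, c) for c in cached_prefixes), default=0)
    kvs = []
    for token in P[:R_len]:
        if importance.get(token, False):      # skip unimportant KVs in R
            kvs.append(load_kv(token))        # read cached KV from storage
    for token in P[R_len:] + Q:               # (P - R) and the query Q
        kvs.append(compute_kv(token))         # no cached KV: compute fresh
    return kvs
```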

Design Summary

IMPRESS faces two design challenges: (i) efficiently identifying important tokens and (ii) optimizing the management of prefix KVs.

To address challenge (i), IMPRESS leverages a key observation: the relative importance ranking among tokens often remains consistent across different attention heads (parallel instances of the self-attention mechanism; a self-attention layer typically consists of multiple attention heads, each capable of learning different patterns or relationships within the input sequence). IMPRESS therefore checks the importance of tokens in a small number of selected heads and uses the important tokens identified there to approximate those of the remaining heads. This design mitigates I/O overhead because only the keys of the selected heads, rather than those of all heads, need to be loaded into the GPU to identify important tokens.
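
A simplified sketch of this approximation is given below; how heads are selected and how per-token importance scores are derived are simplified assumptions, not the paper’s exact procedure:

```python
import numpy as np

def identify_important_tokens(attention_scores, selected_heads, top_k):
    # attention_scores: array of shape (num_heads, num_tokens), where each row
    # holds per-token attention weights for one head.
    avg = attention_scores[selected_heads].mean(axis=0)   # selected heads only
    important = np.argsort(avg)[-top_k:]                  # top-k scoring tokens
    return set(important.tolist())   # reused as the estimate for all heads
```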

To address challenge (ii), IMPRESS introduces two approaches. The first, KV reordering, addresses the inefficiency of retrieving KVs when those of important tokens are scattered across different chunks (the basic unit of storage management). By reordering the KVs within chunks based on token importance, IMPRESS physically consolidates the KVs of important tokens into fewer chunks. This spatial locality allows many important KVs to be retrieved with fewer I/O operations. However, global reordering could disrupt the structure of the radix tree, which IMPRESS uses to efficiently find the longest matching prefix among cached prefixes, so the reordering scope is confined to tokens within the same node of the radix tree.
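
The per-node reordering can be sketched as sorting a node’s KVs by importance and repacking them into fixed-size chunks, so that important KVs cluster in the leading chunks (the chunk layout and score fields below are assumptions):

```python
def reorder_node_kvs(node_kvs, importance_score, chunk_size):
    # node_kvs: list of (token_id, kv) pairs belonging to one radix-tree node.
    ordered = sorted(node_kvs, key=lambda t: importance_score[t[0]], reverse=True)
    # Repack into fixed-size chunks; important KVs now occupy fewer chunks.
    return [ordered[i:i + chunk_size] for i in range(0, len(ordered), chunk_size)]
```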

The second approach, score-based cache management, augments traditional cache replacement policies with token importance. Traditional policies often prioritize based solely on access recency or frequency (e.g., least recently used), which can lead to the premature eviction of chunks containing highly important KVs if those chunks have not been accessed recently. IMPRESS instead assigns each chunk a score that considers both its access frequency and the proportion of important KVs it contains. Cache eviction decisions are then based on this combined score, ensuring that chunks rich in important KVs are preferentially retained in the fast cache tier.
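
A minimal sketch of such a combined score and eviction choice is shown below; the linear weighting is an assumption for illustration, not the paper’s exact formula:

```python
def chunk_score(access_count, num_important, num_total, alpha=0.5):
    # Blend access frequency with the fraction of important KVs in the chunk.
    importance_ratio = num_important / max(num_total, 1)
    return alpha * access_count + (1 - alpha) * importance_ratio

def pick_victim(chunks):
    # chunks: dict mapping chunk_id -> (access_count, num_important, num_total).
    # Evict the chunk with the lowest combined score.
    return min(chunks, key=lambda cid: chunk_score(*chunks[cid]))
```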