This paper is available as a preprint on arXiv, and the accompanying software is open-sourced on GitHub.

Problem

Many tasks require accessing the same large language model (LLM) at different precision levels. For example, fine-tuning is often performed at a higher precision such as FP16, while the fine-tuned model is typically quantized to a lower-precision format such as INT8 for faster inference.

However, maintaining multiple versions of a model at different precisions leads to prohibitive storage costs. A common alternative is to store only the high-precision model and generate low-precision models on demand. This approach is inefficient: retrieving a low-precision model requires loading more data than necessary from the high-precision model and then running a computationally expensive quantization step.
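To make the inefficiency concrete, here is a minimal sketch (not from the paper) of the on-demand path, assuming FP16 weights stored in a NumPy archive and a toy symmetric INT8 quantizer; all function names are hypothetical.

```python
import numpy as np

def load_fp16_weights(path: str) -> dict:
    # Hypothetical loader: reads the entire high-precision checkpoint from disk.
    return dict(np.load(path))

def quantize_int8(w: np.ndarray) -> tuple:
    # Toy symmetric per-tensor quantization, for illustration only.
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def get_int8_model(path: str) -> dict:
    # Serving the INT8 model still reads ~2 bytes per weight (the full FP16
    # checkpoint) and runs quantization math over every tensor.
    fp16 = load_fp16_weights(path)
    return {name: quantize_int8(w) for name, w in fp16.items()}
```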

The core problem is to reduce LLM storage costs while maintaining efficient retrieval of both high- and low-precision models.

Overview

This paper introduces QStore, a novel approach that improves storage efficiency by jointly compressing both high-precision and low-precision versions of the same model. The key insight is that the low-precision model is derived through quantization of the high-precision model, making it possible to leverage the information in the low-precision model to guide the compression of the high-precision model.

Specifically, this paper builds on the observation that two floating-point numbers that quantize to the same value (under the same quantization function) are likely to share many redundant bits. It proposes a two-step grouping approach for compressing the high-precision model. First, it groups the weights of the high-precision model according to the quantization function applied to them. Within each group, it further divides the weights into subgroups so that weights quantizing to the same value in the low-precision model fall into the same subgroup. Compression is then performed on a per-subgroup basis, since weights within the same subgroup are expected to share significant redundancy.
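A minimal sketch of the subgroup-splitting idea is shown below, assuming a single quantization group (e.g., one tensor or channel quantized with the same function) is already at hand; zlib stands in for whatever byte-level encoder QStore actually uses, and the function name is hypothetical.

```python
import zlib  # stand-in for the actual encoder; QStore's own encoding differs
from collections import defaultdict
import numpy as np

def compress_group(weights_fp16: np.ndarray,
                   weights_int8: np.ndarray) -> dict:
    # Bucket the FP16 weights of one quantization group by the INT8 value
    # they map to; weights in the same bucket tend to share many bits.
    buckets = defaultdict(list)
    for w, q in zip(weights_fp16.ravel(), weights_int8.ravel()):
        buckets[int(q)].append(w)
    # Compress each subgroup independently to exploit that redundancy.
    return {q_value: zlib.compress(np.asarray(members, dtype=np.float16).tobytes())
            for q_value, members in buckets.items()}
```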

In addition, the low-precision model is compressed directly using Zstd, a state-of-the-art lossless compression algorithm. The likely reason for not using ZipNN (a lossless compression algorithm designed for LLMs) is that the weights in the low-precision model are integers (e.g., INT8, the format considered in this paper), whereas ZipNN is better suited to LLMs with floating-point parameters.
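For reference, compressing raw INT8 weight bytes with Zstd is straightforward; here is a sketch using the Python zstandard bindings (the specific library, compression level, and function names are my choices, not the paper's).

```python
import numpy as np
import zstandard as zstd

def compress_low_precision(weights_int8: np.ndarray) -> bytes:
    # Lossless Zstd compression of the raw INT8 weight bytes.
    return zstd.ZstdCompressor(level=3).compress(weights_int8.tobytes())

def decompress_low_precision(blob: bytes, shape: tuple) -> np.ndarray:
    # Decompress and restore the original tensor shape.
    raw = zstd.ZstdDecompressor().decompress(blob)
    return np.frombuffer(raw, dtype=np.int8).reshape(shape)
```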

Figure: Compression diagram of QStore (from the original paper, https://arxiv.org/abs/2505.04081).