A preprint of this paper is available on arXiv, and the accompanying software is open-sourced on GitHub.

Motivation

This paper focuses on lossless compression for neural network models, motivated by three key use cases.

  • Model hubs (e.g., Hugging Face, Model Zoo, PyTorch, TensorFlow) host a large number of models and serve numerous download requests for popular models. Lossless compression offers several benefits, including reducing the amount of data transferred during downloads, minimizing the storage footprint of hosted models, and decreasing the time required to upload and download these models.

  • Distributed training addresses the resource limits of single-device training by spreading the computational workload of large models across multiple devices, such as GPUs, TPUs, or servers connected over a network. Lossless compression can play a crucial role in reducing the amount of data (e.g., model weights, optimizer state, and gradients) transferred between devices and in mitigating network bottlenecks.

  • During training, multiple intermediate model versions are saved for various purposes, such as hyperparameter tuning, fault tolerance, analysis, and performance evaluation. Lossless compression helps minimize the storage footprint of these checkpoint versions.

Model Parameters

In neural networks, a layer is a fundamental unit that performs a specific transformation; common examples include fully connected, convolutional, and attention layers. Each layer typically contains multiple tensors (multi-dimensional arrays) that store the model's parameters, such as weights and biases, which are learned during training. This paper focuses on compressing the numeric parameters within these tensors.

The parameters are represented as floating-point numbers, which consist of three key components:

  • Exponent: Indicates the range within which the real number lies.
  • Fraction: Specifies the exact value within the given range.
  • Sign bit: Denotes whether the number is positive or negative.

The real number is computed as (-1)^{sign} * 2^{exponent - bias} * (1.fraction), where the bias is 127 for FP32 and BF16 and 15 for FP16. Models are trained with various standards that represent these floating-point numbers in different ways:

Representation   Sign    Exponent   Fraction
FP32             1 bit   8 bits     23 bits
BF16             1 bit   8 bits     7 bits
FP16             1 bit   5 bits     10 bits
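
To make this layout concrete, here is a small illustrative sketch (not taken from the paper) that uses NumPy to extract the sign, exponent, and fraction fields of an FP32 value and reconstruct it with the formula above; the value 0.15625 is an arbitrary example.

    import numpy as np

    # Reinterpret the four bytes of an FP32 value as an unsigned 32-bit integer.
    x = np.array([0.15625], dtype=np.float32)      # arbitrary example: 0.15625 = 1.25 * 2**-3
    bits = int(x.view(np.uint32)[0])

    sign     = bits >> 31            # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF       # 23 bits

    # Reconstruct the value: (-1)^sign * 2^(exponent - 127) * (1.fraction)
    value = (-1.0) ** sign * 2.0 ** (exponent - 127) * (1.0 + fraction / 2 ** 23)
    print(sign, exponent, fraction, value)         # 0 124 2097152 0.15625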

Key Observations

This paper identifies two key observations regarding the compressibility of parameters:

  • Limited effectiveness of LZ compression. Traditional LZ compression algorithms, which save space by eliminating repeated byte sequences, yield minimal storage savings. This is because neural network tensors are inherently noisy, and their parameters typically show no meaningful affinity with neighboring values.
  • Skewness in the exponent part. The exponent values exhibit a highly skewed distribution: a few exponent values occur with significantly higher probability, while others are rarely observed, as the sketch below illustrates. A plausible explanation is that model weights are initialized within the range [-1, +1], and training rarely pushes them beyond that range.
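
The exponent skew can be checked with a few lines of NumPy, as sketched below; this is not the paper's code, and the scaled random tensor merely stands in for a real weight tensor loaded from a checkpoint.

    import numpy as np
    from collections import Counter

    # Stand-in for a real model tensor; real weights would be loaded from a checkpoint.
    weights = (0.02 * np.random.randn(1_000_000)).astype(np.float32)

    bits = weights.view(np.uint32)
    exponents = (bits >> 23) & 0xFF          # the biased 8-bit exponent of every parameter

    hist = Counter(exponents.tolist())
    for exponent, count in hist.most_common(5):
        print(f"exponent {exponent}: {100 * count / len(weights):.1f}% of parameters")
    # A handful of exponent values typically account for the vast majority of parameters,
    # which is what makes entropy coding of the exponent byte so effective.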

Main Idea

This paper presents ZipNN, an approach designed to compress both regular models (unmodified after training) and clean models (with parameters rounded after training). The core innovation of ZipNN lies in byte grouping, a technique that organizes model parameters for more efficient compression. Specifically:

  • The exponent parts of different parameters are grouped together.
  • Similarly, the corresponding bytes of the fraction parts from different parameters are grouped together, one group per byte position.

After grouping, ZipNN applies either ZSTD (which combines LZ-style matching with entropy coding) or Huffman encoding alone to each group of bytes, achieving significant compression.
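
The following is a deliberately simplified sketch of byte grouping, not the ZipNN implementation; it assumes a float32 tensor on a little-endian machine and uses the zstandard Python package as the ZSTD backend. Note that in FP32 the exponent straddles a byte boundary, so splitting into four byte planes only approximates the exponent/fraction grouping described above.

    import numpy as np
    import zstandard as zstd

    def byte_group_compress(params: np.ndarray) -> list[bytes]:
        """Split a float32 tensor into its four byte planes and compress each plane separately."""
        raw = params.astype(np.float32).tobytes()
        # On little-endian hardware, plane 3 carries the sign bit and the high exponent bits,
        # while planes 0-1 carry the least significant fraction bits.
        planes = [raw[i::4] for i in range(4)]
        cctx = zstd.ZstdCompressor(level=3)
        return [cctx.compress(plane) for plane in planes]

    weights = np.random.randn(1_000_000).astype(np.float32)   # stand-in for real model weights
    for i, blob in enumerate(byte_group_compress(weights)):
        print(f"byte plane {i}: {len(blob) / len(weights):.2f} bytes per parameter")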

The design rationale is twofold. First, guided by the second observation, grouping the exponents together for entropy coding (e.g., Huffman encoding) is expected to yield significant storage savings; this is effective for both regular and clean models. Second, in clean models, rounding the fraction part leaves the least significant bits as zeros, while the most significant bits remain relatively random. Grouping the fraction part byte by byte therefore yields additional storage savings, particularly for the least significant bytes of clean models.
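
The effect of rounding can be illustrated with the small experiment below, an assumption-laden sketch rather than the paper's procedure: a clean model is simulated by truncating FP32 parameters to BF16 precision, which zeroes the 16 least significant fraction bits, and zlib stands in for the entropy coder.

    import numpy as np
    import zlib

    weights = np.random.randn(100_000).astype(np.float32)      # stand-in "regular" model
    # Simulate a "clean" model by truncating each parameter to BF16 precision,
    # i.e. zeroing the 16 least significant fraction bits.
    clean = (weights.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

    for name, tensor in (("regular", weights), ("clean", clean)):
        raw = tensor.tobytes()
        low_planes = b"".join(raw[i::4] for i in (0, 1))        # two least significant bytes per float
        ratio = len(zlib.compress(low_planes)) / len(low_planes)
        print(f"{name}: low fraction bytes compress to {ratio:.1%} of their original size")
    # For the clean model the low byte planes are all zeros and collapse to almost nothing.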

Additional Observations

This paper provides additional insights into the compression of gradients and optimizers, as well as delta compression on checkpoint models.

Gradients represent the direction and magnitude of the change required for each parameter to reduce the training loss. Optimizers use gradients to update the model's parameters during training while maintaining additional state to improve convergence. This paper observes that gradients and optimizer state are more compressible than the model itself, primarily because the portions corresponding to the token embedding layer are extremely compressible.

Delta compression, which XORs consecutive checkpoint models and then compresses the resulting deltas, offers greater storage savings than standalone compression. This paper highlights that although every parameter is updated in each epoch, an increasing number of bytes remain unchanged as the model approaches convergence, which makes delta compression increasingly effective over time.
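
The sketch below mimics this effect with synthetic checkpoints; the tensors, the small update fraction, and zlib as the compressor are all assumptions made purely for illustration. Parameters whose bytes did not change XOR to zero, so the delta consists largely of zero runs and compresses far better than a standalone checkpoint.

    import numpy as np
    import zlib

    checkpoint_a = np.random.randn(100_000).astype(np.float32)
    # Crude stand-in for a later checkpoint whose bytes largely coincide with the previous one:
    # only a small fraction of the parameters receive a tiny update.
    checkpoint_b = checkpoint_a.copy()
    updated = np.random.rand(checkpoint_b.size) < 0.05
    checkpoint_b[updated] += np.float32(1e-6)

    # XOR the raw bytes of consecutive checkpoints; unchanged parameters XOR to zero.
    delta = np.bitwise_xor(checkpoint_a.view(np.uint32), checkpoint_b.view(np.uint32)).tobytes()

    standalone = len(zlib.compress(checkpoint_b.tobytes()))
    xor_delta = len(zlib.compress(delta))
    print(f"standalone checkpoint: {standalone} bytes, XOR delta: {xor_delta} bytes")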