
Model Quantization

The art of making your model smaller, faster, and just a bit dumber.

What is Quantization?

A fancy way of saying "we're cutting corners to make things faster"

Quantization is the process of reducing the precision of the numbers in your model. Instead of using 32-bit floating-point numbers (FP32), we use smaller formats like 16-bit (FP16), 8-bit integers (INT8), or even 4-bit integers (INT4). It's like taking a high-resolution photo and converting it to a pixelated version – it takes up less space, but you might not be able to recognize your aunt Mildred anymore.
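In code, the idea is roughly the following minimal NumPy sketch of symmetric INT8 quantization; the tensor, names, and shapes are purely illustrative, not from any particular library:

```python
import numpy as np

# A toy FP32 "weight" tensor (illustrative only).
weights_fp32 = np.random.randn(4, 4).astype(np.float32)

# Symmetric INT8 quantization: map the FP32 range onto [-127, 127] with one scale.
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.round(weights_fp32 / scale).astype(np.int8)

# Dequantize to see how much precision was lost (the "aunt Mildred" factor).
weights_restored = weights_int8.astype(np.float32) * scale
print("max rounding error:", np.abs(weights_fp32 - weights_restored).max())
```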

Important Note

Quantization is perfect for when your boss says "make it faster" but also "don't spend any money on better hardware." It's the computational equivalent of squinting to see better.

Quantization Types

Different ways to compromise your model's intelligence

Post-Training Quantization (PTQ)

Apply quantization after training is complete. Like compressing a photo to a low-quality JPEG after you've already taken it. Quick and dirty, but it gets the job done if you're not too picky about quality.
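A hedged sketch of the PTQ idea for a single layer, assuming a handful of made-up calibration batches: observe the activation range after training, pick a scale, then reuse that fixed scale at inference.

```python
import numpy as np

def calibrate_scale(calibration_batches, qmax=127):
    # Observe the activation range on a small calibration set; no retraining involved.
    max_abs = max(np.abs(batch).max() for batch in calibration_batches)
    return max_abs / qmax

def quantize(x, scale, qmax=127):
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)

# Pretend these are activations captured from a few inference batches of one layer.
calibration_batches = [np.random.randn(32, 128).astype(np.float32) for _ in range(8)]
scale = calibrate_scale(calibration_batches)

new_batch = np.random.randn(32, 128).astype(np.float32)
q = quantize(new_batch, scale)        # INT8 at inference time
dq = q.astype(np.float32) * scale     # dequantized wherever FP32 is still needed
```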

Quantization-Aware Training (QAT)

Train your model with quantization in mind from the start. Like planning to take a low-res photo from the beginning. More work, but better results. Still worse than the original, though.
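A rough sketch of the core QAT trick, "fake quantization", in NumPy; the function name and shapes are illustrative, and real frameworks handle the gradient plumbing for you:

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    # Quantize and immediately dequantize, so the forward pass "feels" the rounding
    # error while everything stays in FP32. During training, gradients are usually
    # passed through the rounding step unchanged (straight-through estimator).
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(np.abs(x).max(), 1e-8) / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

# In a QAT forward pass, weights (and often activations) go through fake_quantize
# before being used, so the trained model already expects the INT8 error.
w = np.random.randn(64, 64).astype(np.float32)
w_q = fake_quantize(w)
```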

Weight-Only Quantization

Only quantize the model weights, not the activations. Like compressing only half your photo. A compromise between speed and quality, which means it's mediocre at both.
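A minimal sketch of the weight-only idea, assuming a single linear layer with made-up shapes: the weights are stored as INT8 plus per-channel scales, and the activations never leave FP32.

```python
import numpy as np

def quantize_weights_per_channel(w):
    # INT8 weights with one scale per output channel; activations stay FP32.
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    w_int8 = np.round(w / scales).astype(np.int8)
    return w_int8, scales

def linear_weight_only(x_fp32, w_int8, scales):
    # Dequantize the weights on the fly; all the activation math stays in FP32.
    return x_fp32 @ (w_int8.astype(np.float32) * scales).T

w_int8, scales = quantize_weights_per_channel(
    np.random.randn(256, 512).astype(np.float32)
)
y = linear_weight_only(np.random.randn(8, 512).astype(np.float32), w_int8, scales)
```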

Dynamic Quantization

Quantize on-the-fly during inference. Like deciding how much to compress each part of the photo as you're looking at it. Flexible but unpredictable, like that one coworker who's either brilliant or useless depending on the day.
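For a concrete example, PyTorch ships a one-line helper for dynamic quantization of Linear layers; something like the sketch below (the exact namespace and defaults vary by version, e.g. torch.ao.quantization in newer releases):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)

# Weights are quantized to INT8 ahead of time; activation scales are worked out
# per batch at runtime, which is the "deciding as you go" part.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
out = quantized(torch.randn(1, 512))
```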

Quantization Formats

Choose your preferred level of model degradation

  • FP32 – 32-bit floating point: 100% accuracy (the baseline), full size
  • FP16 – 16-bit floating point: ~99% accuracy, 2x smaller
  • INT8 – 8-bit integer: ~95% accuracy, 4x smaller
  • INT4 – 4-bit integer: ~90% accuracy, 8x smaller
  • GGUF – quantized model file format from the GGML/llama.cpp ecosystem: variable accuracy, optimized for inference

The accuracy percentages are approximate and will vary based on your model and task. In reality, your mileage may vary from "barely noticeable difference" to "is this even the same model?"
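To put the "x smaller" claims in perspective, here is a back-of-envelope size calculation for a hypothetical 7-billion-parameter model, ignoring overhead such as scales, zero-points, and activation memory:

```python
# Rough weight size of a hypothetical 7B-parameter model in each format.
params = 7_000_000_000
bits_per_param = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

for fmt, bits in bits_per_param.items():
    gigabytes = params * bits / 8 / 1e9
    print(f"{fmt}: ~{gigabytes:.1f} GB")
# FP32: ~28.0 GB, FP16: ~14.0 GB, INT8: ~7.0 GB, INT4: ~3.5 GB
```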

When to Use Quantization

A quick checklist for deciding if you should compromise your model's intelligence

Use quantization when:

  • Your model needs to run on devices with limited memory or processing power
  • Inference speed is more important than perfect accuracy
  • You need to deploy multiple models on the same hardware
  • Your boss asked why you can't just "make it smaller"

Don't use quantization when:

  • Your task requires extremely high precision (e.g., medical diagnosis)
  • You have unlimited computational resources (congratulations)
  • Your model is already struggling with accuracy
  • You enjoy waiting for things to load