What is Quantization in AI?

From AISApedia, the AI skills & terms encyclopedia

Quantization reduces a language model's memory footprint and computational cost by representing its weights with fewer bits — typically converting 16-bit floating point values to 8-bit, 4-bit, or even lower precision integers. This is the primary technique that enables large language models to run on consumer hardware, trading a small and often imperceptible reduction in output quality for dramatic reductions in memory usage, inference cost, and response latency.

What does reducing numerical precision actually change in a model?

A language model's learned knowledge and capabilities are encoded in its weights: the billions of numerical parameters in its transformer architecture that collectively represent everything the model internalized during training. At the 16-bit baseline that is often loosely called full precision (FP16 or BF16, 16-bit floating point), each weight occupies 16 bits of storage. A 70-billion parameter model at 16 bits therefore requires approximately 140 GB of memory just to load the weights, far exceeding the capacity of consumer GPUs and most workstation configurations. Quantization addresses this by representing each weight with fewer bits, reducing the memory requirement roughly in proportion.

At 8-bit quantization (INT8), the same 70B model requires roughly 70 GB of memory; at 4-bit quantization (Q4), approximately 35 GB; and at 2-bit quantization (Q2), roughly 17.5 GB. The memory savings scale linearly with the bit reduction, making progressively larger models accessible on progressively smaller hardware. A 7-billion parameter model quantized to 4 bits needs about 3.5 GB for the weights alone and typically 4-6 GB in practice once quantization metadata, the context cache, and runtime buffers are included. That fits on most modern laptops with 16 GB of RAM, enabling fully local inference without cloud APIs or specialized hardware.
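The arithmetic above can be sketched in a few lines of Python. This is a rough estimate under stated assumptions: the hypothetical helper `weight_memory_gb` counts only the weights themselves, uses decimal gigabytes, and ignores the scale factors, context cache, and runtime buffers that a real deployment adds on top.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal GB (assumption:
    no quantization metadata, no context cache, no runtime overhead)."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9


# The 70B figures from the text, plus the 7B model at 4 bits.
for bits in (16, 8, 4, 2):
    print(f"70B model at {bits:>2}-bit: {weight_memory_gb(70e9, bits):6.1f} GB")
print(f" 7B model at  4-bit: {weight_memory_gb(7e9, 4):6.1f} GB")
```

The gap between the raw 7B result and the 4-6 GB commonly seen in practice comes from the overhead this sketch deliberately leaves out.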

The precision loss works analogously to lossy compression in other domains. Just as JPEG reduces image file size by approximating pixel values in ways that human vision rarely detects, quantization approximates model weights in ways that typically produce output differences too subtle for users to notice. A 4-bit code can distinguish only 16 levels, so a weight stored as 2.6847 in FP16 might be reconstructed as approximately 2.7 after 4-bit quantization. For the vast majority of inference tasks, this approximation produces outputs that are functionally equivalent to the full-precision version.
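To make the rounding concrete, here is a minimal sketch of one simple scheme: symmetric round-to-nearest quantization with a single shared scale. It is an illustration only, not how GGUF, GPTQ, or AWQ work internally (those use per-block scales and smarter error handling). The function names and the weight values are made up for the example.

```python
def quantize_dequantize(weights, bits):
    """Symmetric round-to-nearest quantization with one shared scale.

    Returns the integer codes that would be stored and the values the
    model would actually compute with after converting them back.
    """
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1                       # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    restored = [c * scale for c in codes]
    return codes, restored


def max_abs_error(weights, bits):
    """Largest gap between an original weight and its restored value."""
    _, restored = quantize_dequantize(weights, bits)
    return max(abs(w - r) for w, r in zip(weights, restored))


weights = [2.6847, -1.302, 0.051, 3.15, -0.87]       # illustrative values only
codes, restored = quantize_dequantize(weights, 4)
print("4-bit codes:   ", codes)
print("4-bit restored:", [round(r, 3) for r in restored])
for bits in (8, 4, 2):
    print(f"{bits}-bit max error: {max_abs_error(weights, bits):.4f}")
```

With these values the first weight comes back as roughly 2.7 at 4 bits, and the worst-case error grows quickly as the bit width shrinks. Weight error is only a proxy for model quality, so it does not replace testing on real tasks.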

How much quality do you actually lose at each quantization level?

The relationship between quantization level and quality degradation is non-linear: the first reduction in precision costs almost nothing, while extreme reductions create noticeable differences. 8-bit quantization (INT8 or Q8_0) typically shows little to no measurable quality difference from full precision on standard benchmarks and practical tasks. This is the most conservative choice for teams that want to cut memory requirements roughly in half while accepting minimal quality risk.

4-bit quantization (Q4_K_M, GPTQ-4bit, AWQ-4bit) is the practical sweet spot where most real-world deployment happens. Quality degradation becomes measurable on fine-grained benchmarks but remains imperceptible in the vast majority of production use cases. Tasks involving factual recall, text classification, entity extraction, summarization, and straightforward instruction-following typically show no user-noticeable difference. Tasks requiring very precise numerical reasoning, subtle stylistic distinctions, or complex multi-step logical chains may show slight, occasional degradation, such as a minor vocabulary substitution, a less optimal phrasing choice, or a reasoning step that requires one additional prompt to complete.

Below 4 bits (Q3_K, Q2_K), quality degradation becomes progressively more apparent. The model may produce less coherent long-form text, make more frequent factual errors, handle nuanced or ambiguous instructions less reliably, and lose capability on the harder end of its task distribution. These extreme quantization levels are useful for experimentation, for running the largest possible model on limited hardware, and for tasks where response speed matters more than peak quality. But they require careful evaluation against your specific quality requirements before production deployment. The optimal quantization level depends on your quality threshold for the specific task — which is why benchmarking quantized models on your actual workload, as described in model benchmarking, is more reliable than relying on general quality claims.

What are the common quantization formats and when should you use each?

GGUF (evolved from the earlier GGML format) is the most widely used quantization format for CPU inference and mixed CPU/GPU inference, popularized by the llama.cpp project and adopted by Ollama and LM Studio. GGUF files use a standardized naming convention indicating the quantization method and level: Q4_K_M means 4-bit quantization using the K-quant medium variant, Q5_K_S means 5-bit with K-quant small, Q8_0 means basic 8-bit. The K-quant variants (K_S, K_M, K_L) use more sophisticated mixed-precision quantization that preserves quality-critical weights at higher precision while aggressively quantizing less important ones, achieving better quality than uniform quantization at the same average bit count.
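The naming convention can be read mechanically. The sketch below parses only the tag shapes mentioned in this article (for example Q4_K_M, Q5_K_S, Q2_K, Q8_0); it is a simplified illustration, the function name `describe_quant_tag` is hypothetical, and llama.cpp defines additional quantization types that this pattern does not cover.

```python
import re

# Matches tags such as Q4_K_M, Q3_K, or Q8_0 (assumption: only these shapes).
_TAG = re.compile(r"^Q(?P<bits>\d)_(?:K(?:_(?P<size>[SML]))?|(?P<legacy>\d))$")
_SIZES = {"S": "small", "M": "medium", "L": "large"}


def describe_quant_tag(tag: str) -> dict:
    """Split a GGUF-style quantization tag into its readable parts."""
    match = _TAG.match(tag.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized quantization tag: {tag!r}")
    is_k_quant = match["legacy"] is None
    return {
        "bits": int(match["bits"]),                   # nominal bits per weight
        "scheme": "k-quant" if is_k_quant else "basic",
        "variant": _SIZES.get(match["size"]),         # None when no S/M/L suffix
    }


for tag in ("Q4_K_M", "Q5_K_S", "Q8_0"):
    print(tag, "->", describe_quant_tag(tag))
```

The nominal bit count in the tag is an average target; because K-quants mix precisions, the effective bits per weight of a real file is usually a little higher.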

GPTQ is a GPU-focused post-training quantization method, widely used in the Hugging Face ecosystem for deployment on NVIDIA GPUs. It quantizes weights with the help of a calibration step that runs a representative text dataset through the model, which allows it to minimize quantization error in a data-aware way. This calibration can produce measurably better quality than simpler methods that round weights without considering how they are actually used during inference. GPTQ models are most commonly available in 4-bit and 8-bit variants and are well-supported by inference frameworks like vLLM and text-generation-inference.

AWQ (Activation-aware Weight Quantization) is a newer approach that identifies which weights matter most for output quality, based on the magnitude of the activations that pass through them, and protects those weights during quantization. Rather than storing them at a higher bit width, it rescales the salient weight channels before quantizing so that they lose less precision, which keeps the whole model at one uniform, hardware-friendly bit width. This activation-aware strategy can achieve notably better quality than plain round-to-nearest quantization at the same bit count, making it particularly valuable at aggressive quantization levels (3-4 bit) where naive approaches show more degradation. For local model deployment, the format choice depends on your target inference engine, hardware configuration, and whether you prioritize CPU flexibility (GGUF) or GPU throughput (GPTQ/AWQ).

How should you decide which quantization level to use?

The decision starts with your hardware constraints and works backward to the largest model you can run at acceptable speed. If your laptop has 16 GB of RAM, you can comfortably run a 7B model at Q4 (about 4-6 GB) or a 13B model at Q4 (about 8-10 GB) with room for the operating system and other applications. If you have a workstation with 64 GB of RAM and a GPU with 24 GB VRAM, a 70B model at Q4 becomes feasible with partial GPU offloading.
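Working backward from a memory budget can be sketched as a small search. The helper names are hypothetical, and the 1.25 overhead factor is an illustrative assumption standing in for quantization metadata, context cache, and runtime buffers; measure real usage on your own setup before relying on it.

```python
def estimated_gb(params_billions, bits, overhead=1.25):
    """Rough memory estimate in GB; `overhead` is an assumed fudge factor."""
    return params_billions * bits / 8 * overhead


def pick_config(budget_gb, model_sizes_b=(7, 13, 70), bit_levels=(8, 5, 4)):
    """Return (params_billions, bits) for the largest model that fits,
    preferring higher precision within that model, or None if nothing fits."""
    for size in sorted(model_sizes_b, reverse=True):
        for bits in sorted(bit_levels, reverse=True):
            if estimated_gb(size, bits) <= budget_gb:
                return size, bits
    return None


# A 16 GB laptop with roughly 10 GB left over for the model (assumed split).
print(pick_config(10))
# A workstation with 64 GB RAM plus 24 GB VRAM, if all of it were usable.
print(pick_config(88))
```

Preferring the largest model first reflects the approach described above; a team that values precision over model size could swap the loop order.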

Once you know the largest model your hardware supports, the question becomes whether that model at that quantization level meets your quality requirements. The only reliable way to answer this is to test on your actual tasks. Run twenty to thirty representative inputs through the quantized model and evaluate the outputs against your quality criteria. Compare against the same model at higher precision or against a cloud API if the full-precision model exceeds your local hardware capacity. If the quantized output meets your bar, you have found your deployment configuration.
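A comparison run like this can be organized with a very small harness. In the sketch below, `reference` and `candidate` are placeholders for whatever functions call your higher-precision and quantized models, and `is_acceptable` stands in for your own quality criteria; all three are assumptions, and the stand-in models exist only to make the example runnable.

```python
def compare_models(prompts, reference, candidate, is_acceptable):
    """Run each prompt through both models and report the pass rate
    plus the prompts where the candidate fell short."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("need at least one prompt")
    failures = []
    for prompt in prompts:
        ref_out = reference(prompt)
        cand_out = candidate(prompt)
        if not is_acceptable(prompt, ref_out, cand_out):
            failures.append(prompt)
    pass_rate = 1 - len(failures) / len(prompts)
    return pass_rate, failures


# Stand-in "models" for demonstration; replace with real inference calls.
def fake_reference(prompt):
    return prompt.upper()


def fake_quantized(prompt):
    return "???" if "hard" in prompt else prompt.upper()


rate, failed = compare_models(
    ["easy one", "easy two", "hard one", "easy three"],
    fake_reference,
    fake_quantized,
    lambda prompt, ref, cand: ref == cand,   # exact match is rarely the right bar
)
print(f"pass rate: {rate:.0%}, failed prompts: {failed}")
```

Reviewing the failed prompts by hand is usually more informative than the pass rate alone, since it shows which kinds of task the quantized model struggles with.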

A common pattern is to maintain two quantization levels: a higher-quality variant (Q5_K_M or Q8_0) for tasks where output quality is paramount and a more aggressive variant (Q4_K_M or Q4_K_S) for high-volume, speed-sensitive tasks where slight quality tradeoffs are acceptable. This mirrors the cloud pattern of using different model tiers for different tasks, but applied locally. The distinction between quality-sensitive and speed-sensitive tasks in your workflow determines how much value this two-tier approach provides.
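The two-tier pattern amounts to a small routing table. The task names below are hypothetical examples, and which tasks count as quality-sensitive is an assumption that each team would need to set from its own evaluation results.

```python
# Quantization variant served by each tier (tags taken from the text).
TIERS = {"quality": "Q8_0", "speed": "Q4_K_M"}

# Hypothetical task types treated as quality-sensitive.
QUALITY_SENSITIVE = {"contract_review", "numerical_reasoning"}


def quant_for_task(task_type: str) -> str:
    """Pick the quantization variant to serve a given task type."""
    tier = "quality" if task_type in QUALITY_SENSITIVE else "speed"
    return TIERS[tier]


for task in ("contract_review", "ticket_tagging"):
    print(task, "->", quant_for_task(task))
```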

Try this yourself

Visit Hugging Face and search for any model you use plus 'GGUF', then download both the Q4_K_M (4-bit) and Q8_0 (8-bit) versions. Run the same prompt through both and look for differences. They are usually subtle enough to go unnoticed in most business use cases.

Real-world example

A legal tech company needed Mistral Large for contract analysis but couldn't justify $50K in GPU costs. The Q4_K_M quantized version runs on their existing hardware, catches 99.2% of the same contract issues, and occasionally suggests 'shall' instead of 'will' — a rounding error their lawyers don't even notice. They invested the saved $50K in better training data instead.