What is Quantization?
TL;DR
A technique that reduces AI model size and computational cost by lowering numerical precision.
Quantization: Definition & Explanation
Quantization is an optimization technique that reduces an AI model's file size and improves inference speed by lowering the numerical precision of its parameters (weights). For example, parameters normally represented in 32-bit floating point (FP32) can be converted to 16-bit (FP16), 8-bit (INT8), or even 4-bit (INT4). This dramatically reduces memory usage, making it possible to run LLMs on ordinary PCs without high-end GPUs.

Various quantization formats exist, including GPTQ, AWQ, and GGUF (used with llama.cpp), and techniques continue to advance in compressing models while minimizing accuracy loss. Ollama uses quantized models for easy local execution, fueling the growth of open-source AI. Quantization is also essential for running AI on edge devices.
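The core idea can be illustrated with a minimal sketch, assuming a simple symmetric per-tensor scheme (real formats such as GPTQ, AWQ, and GGUF use more sophisticated grouping and calibration): FP32 weights are mapped to INT8 integers via a single scale factor, shrinking storage fourfold at the cost of a small rounding error.

```python
import numpy as np

# Illustrative symmetric INT8 quantization, not a production scheme:
# map FP32 values into [-127, 127] using one scale for the whole tensor.
def quantize_int8(weights: np.ndarray):
    scale = float(np.max(np.abs(weights))) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate FP32 values for use at inference time.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.standard_normal(1024).astype(np.float32)

q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

print(f"FP32: {weights.nbytes} bytes, INT8: {q.nbytes} bytes")  # 4x smaller
print(f"max rounding error: {np.max(np.abs(weights - restored)):.6f}")
```

Each element's reconstruction error is bounded by half the scale, which is why accuracy loss stays small for well-behaved weight distributions; lower bit widths like INT4 trade more of that accuracy for further compression.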