What is Inference?
TL;DR
The process by which a trained AI model generates predictions or responses for new inputs.
Inference: Definition & Explanation
Inference is the process by which a trained AI model generates predictions or outputs for new input data. Asking ChatGPT a question and receiving an answer, or giving Stable Diffusion a prompt to generate an image, are both examples of inference.

Inference speed (latency) and cost directly affect the quality of an AI service, so optimization is crucial. A range of acceleration techniques has been developed, including specialized hardware (GPUs, TPUs, NPUs), model compression via quantization and distillation, batch processing, caching mechanisms, and speculative decoding.

Cloud-based inference (e.g., the OpenAI API) and edge-based inference (e.g., running models locally with Ollama) differ in cost structure, so the choice depends on the use case. Reducing inference cost is a key challenge in scaling AI services, and more efficient inference techniques are under active development.
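One of the acceleration techniques above, caching, can be sketched in a few lines of Python. This is a minimal illustration, not a production serving stack: `run_model` is a hypothetical stand-in for a real model call, and the cache simply reuses outputs for prompts that have been seen before.

```python
import time
from functools import lru_cache

def run_model(prompt: str) -> str:
    """Hypothetical stand-in for real model inference (slow)."""
    time.sleep(0.01)  # simulate inference latency
    return f"response to: {prompt}"

@lru_cache(maxsize=1024)
def cached_infer(prompt: str) -> str:
    """Identical prompts skip the model entirely after the first call."""
    return run_model(prompt)
```

Repeated prompts are served from the cache at memory-lookup speed; `cached_infer.cache_info()` reports how many calls were hits versus misses. Real systems extend this idea to partial results as well, e.g. reusing the attention key/value cache across tokens during generation.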