What is Speculative Decoding?

TL;DR

An inference acceleration technique that uses a small model to draft tokens and a large model to verify them. Achieves 2-3x speedup without quality loss.

Speculative Decoding: Definition & Explanation

Speculative decoding is a technique that substantially accelerates LLM inference. Normally, an LLM generates tokens one at a time, and each step requires a full forward pass through the large model, which is slow. With speculative decoding, a small "draft model" rapidly proposes several tokens ahead, and the large "target model" then verifies all of them in a single forward pass. Draft tokens that agree with the target model are accepted as-is; at the first mismatch, the target model's own token is used instead. Because every emitted token is ultimately approved by the target model, output quality matches running the large model alone, while inference is typically 2-3x faster. Google, Anthropic, Meta, and others have implemented the technique in their models, enhancing practicality for real-time applications.
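The draft-then-verify loop described above can be sketched as follows. This is a minimal illustration with toy deterministic "models" (the names `draft_model` and `target_model` are hypothetical stand-ins, not a real API); production systems compare probability distributions and use rejection sampling rather than exact token matching, and the target model verifies all draft positions in one parallel forward pass.

```python
def draft_model(context):
    # Toy draft model (hypothetical): guesses the next token is last + 1,
    # but makes a mistake at one position (predicts 50 instead of 5).
    nxt = context[-1] + 1
    return 50 if nxt == 5 else nxt

def target_model(context):
    # Toy target model: the "ground truth" always counts up by 1.
    return context[-1] + 1

def speculative_decode(context, num_tokens, k=4):
    tokens = list(context)
    while len(tokens) - len(context) < num_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2. The target model verifies the draft (sequential here for
        #    clarity; a real system checks all k positions in one pass).
        accepted = []
        for i in range(k):
            verified = target_model(tokens + accepted)
            if draft[i] == verified:
                accepted.append(draft[i])   # draft token accepted as-is
            else:
                accepted.append(verified)   # first mismatch: take the
                break                       # target's token and stop
        tokens.extend(accepted)
    return tokens[len(context):][:num_tokens]

print(speculative_decode([0], 8))  # → [1, 2, 3, 4, 5, 6, 7, 8]
```

Note that even though the draft model errs at one position, the output is identical to what the target model alone would produce; the speedup comes from accepting several draft tokens per expensive verification step.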
