What is RLHF?
TL;DR
A training method that applies reinforcement learning with human preference feedback to improve AI model outputs. A key technique for AI alignment and safety.
RLHF: Definition & Explanation
RLHF (Reinforcement Learning from Human Feedback) is a training technique that improves the quality of AI model outputs based on human evaluations and preferences. After an LLM is pre-trained (and typically supervised fine-tuned), human evaluators compare and rank multiple model outputs, and this preference data is used to train a reward model. The reward model then supplies the reward signal for a reinforcement learning algorithm (commonly PPO) that aligns the LLM's outputs with human preferences. RLHF is widely credited as a major factor in ChatGPT's success, enabling the suppression of harmful content, more accurate instruction-following, and more natural, helpful responses. Approaches building on RLHF include RLAIF (Reinforcement Learning from AI Feedback), in which AI-generated feedback replaces human labels, and Constitutional AI, both introduced by Anthropic.
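The reward-model step above can be sketched in miniature. The snippet below is a toy illustration, not a production RLHF pipeline: responses are stand-in feature vectors, the "human evaluator" is simulated by a hidden scoring direction, and the reward model is a simple linear scorer trained with the standard Bradley-Terry pairwise loss on (chosen, rejected) pairs. All names and data here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical data): each "response" is a feature vector;
# the simulated human prefers responses with a higher hidden true score.
dim = 8
w_true = rng.normal(size=dim)

def make_pair():
    a, b = rng.normal(size=dim), rng.normal(size=dim)
    # Return as (chosen, rejected) according to the simulated preference.
    return (a, b) if a @ w_true > b @ w_true else (b, a)

pairs = [make_pair() for _ in range(500)]

# Reward model: linear scorer r(x) = w @ x, trained with the
# Bradley-Terry pairwise loss: -log sigmoid(r(chosen) - r(rejected)).
w = np.zeros(dim)
lr = 0.1
for _ in range(200):
    grad = np.zeros(dim)
    for chosen, rejected in pairs:
        margin = w @ chosen - w @ rejected
        p = 1.0 / (1.0 + np.exp(-margin))      # P(chosen preferred)
        grad += (p - 1.0) * (chosen - rejected)  # d(loss)/dw
    w -= lr * grad / len(pairs)

# The learned reward model should now rank held-out pairs like the
# simulated human; its scores would drive the RL stage in full RLHF.
correct = sum((w @ c) > (w @ r) for c, r in (make_pair() for _ in range(200)))
accuracy = correct / 200
print(f"held-out preference accuracy: {accuracy:.2f}")
```

In full RLHF, the trained reward model scores the LLM's generated text during the reinforcement learning stage, replacing the held-out evaluation shown here.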