What is RLHF?

TL;DR

A reinforcement learning technique that uses human feedback to align AI model outputs with human preferences; a core method for making models more helpful and safer.

RLHF: Definition & Explanation

RLHF (Reinforcement Learning from Human Feedback) is a training technique that improves the quality of an AI model's outputs using human evaluations and preferences. After pre-training (and typically supervised fine-tuning) an LLM, human evaluators compare and rank multiple model outputs for the same prompt, and this preference data is used to train a reward model. The reward model then scores new outputs during reinforcement learning (commonly with an algorithm such as PPO), steering the LLM toward responses that humans prefer. RLHF is considered a major factor in ChatGPT's success, enabling the suppression of harmful content, more accurate instruction-following, and more natural, helpful responses. Approaches that build on RLHF include RLAIF (Reinforcement Learning from AI Feedback) and Constitutional AI, both pioneered by Anthropic.
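To make the reward-model step more concrete, the sketch below shows the pairwise (Bradley-Terry style) loss commonly used to train a reward model on human preference rankings: the loss pushes the score of the human-preferred response above the score of the rejected one. This is a minimal, illustrative example; the function name, tensor values, and the choice of PyTorch are assumptions, not part of any specific RLHF implementation.

```python
# Minimal sketch of a pairwise reward-model loss used in RLHF.
# Assumption: we already have scalar reward scores for a "chosen" (preferred)
# and a "rejected" response; values and names here are purely illustrative.
import torch
import torch.nn.functional as F

def pairwise_preference_loss(chosen_rewards: torch.Tensor,
                             rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected),
    averaged over the batch of preference pairs."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: reward scores for 4 preference pairs (made-up numbers).
chosen = torch.tensor([1.2, 0.8, 0.3, 2.0])
rejected = torch.tensor([0.5, 1.0, -0.2, 1.1])
loss = pairwise_preference_loss(chosen, rejected)
print(f"reward-model loss: {loss.item():.4f}")
```

Once trained this way, the reward model's scores serve as the reward signal that the reinforcement learning step optimizes against.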
