What is AI Red Teaming?
TL;DR
A safety evaluation method that deliberately probes AI models for vulnerabilities and harmful outputs.
AI Red Teaming: Definition & Explanation
AI red teaming adapts cybersecurity red team methodologies to AI, where expert teams deliberately attempt to elicit vulnerabilities, biases, and harmful outputs from AI models. Tests include prompt injection attacks, jailbreaking attempts, and bias-inducing questions to identify and fix issues. Major AI companies including OpenAI, Google, and Anthropic conduct red teaming before model releases, and the US government recommends it as part of AI safety evaluation. Automated red teaming (where AI attacks AI) is also evolving, establishing continuous safety improvement as a standard process.