What is CLIP (Contrastive Language-Image Pre-training)?
TL;DR
A multimodal model from OpenAI that learned the relationship between text and images. It is a foundation technology for image search and image generation.
CLIP (Contrastive Language-Image Pre-training): Definition & Explanation
CLIP (Contrastive Language-Image Pre-training) is a multimodal AI model released by OpenAI in 2021. It was trained on 400 million text-image pairs collected from the internet, learning the semantic correspondence between text descriptions and images. Given a text description, it can select the image that best matches it, or conversely classify an image's content using text labels.

CLIP's technology is used in the text-understanding components of image generation AIs such as Stable Diffusion and DALL-E, serving as the foundation for generating images that match a prompt. Its ability to perform zero-shot classification (categorizing images into classes it was never explicitly trained on, simply by comparing the image against candidate text labels) was groundbreaking.
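The zero-shot classification idea can be sketched in a few lines: CLIP encodes the image and each candidate caption into the same embedding space, then picks the caption whose embedding is most similar to the image's. The sketch below uses tiny made-up toy vectors in place of real CLIP encoder outputs (which are 512-dimensional or larger), so the numbers are purely illustrative:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy stand-ins for CLIP encoder outputs (hypothetical values, not real embeddings).
image_embedding = np.array([0.9, 0.1, 0.2])  # imagine: an encoded photo of a dog
text_embeddings = np.array([
    [0.8, 0.2, 0.1],   # encoding of "a photo of a dog"
    [0.1, 0.9, 0.3],   # encoding of "a photo of a cat"
    [0.2, 0.1, 0.9],   # encoding of "a photo of a car"
])
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

# Zero-shot classification: score the image against every candidate caption,
# then turn the similarity scores into a probability distribution.
sims = cosine_sim(image_embedding[None, :], text_embeddings)[0]
probs = softmax(sims * 100)  # CLIP scales similarities by a learned temperature
print(labels[int(np.argmax(probs))])  # the best-matching caption
```

Note that the class labels here are just text strings: to classify against a new category, you only need to add another caption, with no retraining. This is what makes the approach "zero-shot."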