What is VLM (Vision Language Model)?
TL;DR
AI models that understand images and respond in text, integrating visual recognition with language understanding.
VLM (Vision Language Model): Definition & Explanation
A Vision Language Model (VLM) is an AI model that accepts both images and text as input and responds in text after understanding the image content. Representative examples include GPT-4V (Vision), the Claude 3 family (which has built-in vision support), Gemini Pro Vision, and LLaVA. VLMs handle tasks that are impossible for text-only LLMs, such as describing photo contents, analyzing charts and graphs, reading printed or handwritten text (OCR), evaluating UI/UX designs, analyzing medical images, and inspecting product quality. As a core technology of multimodal AI, a VLM typically combines an image encoder (such as CLIP) with an LLM.
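The "image encoder + LLM" combination can be sketched in a few lines of NumPy. This is a toy illustration, not any real model: the encoder and embedding lookup are random stand-ins, the dimensions are made up, and the projection matrix `W_proj` mimics the learned layer that LLaVA-style models use to map image features into the LLM's token space.

```python
import numpy as np

# Toy dimensions (hypothetical; real models are far larger)
NUM_PATCHES = 4   # image split into patches, ViT-style
VISION_DIM = 8    # image-encoder output size
LLM_DIM = 16      # LLM hidden size

rng = np.random.default_rng(0)

def image_encoder(image):
    """Stand-in for a CLIP-style encoder: one embedding per image patch."""
    return rng.standard_normal((NUM_PATCHES, VISION_DIM))

# Learned projection: maps vision features into the LLM's embedding space
W_proj = rng.standard_normal((VISION_DIM, LLM_DIM))

def embed_text(tokens):
    """Stand-in for the LLM's token-embedding lookup."""
    return rng.standard_normal((len(tokens), LLM_DIM))

image = None  # placeholder for pixel data
text_tokens = ["Describe", "this", "image", "."]

vision_embeds = image_encoder(image) @ W_proj  # (4, 16) image "tokens"
text_embeds = embed_text(text_tokens)          # (4, 16) text tokens

# The LLM then attends over image and text tokens as one sequence
sequence = np.concatenate([vision_embeds, text_embeds], axis=0)
print(sequence.shape)  # (8, 16)
```

The key idea the sketch shows is that, after projection, image patches become just another kind of token, so the LLM can reason over pictures and words in a single sequence.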