Can take in images as input and answer questions about them, not just OCR. Something like GPT4 vision, LLaVA, BakLLaVA, CogVLM, Qwen-VL