## Problem ## Expected behavior Can we increase multimodal capabilities? Understanding images can improve efficiency.