Multi-modal (image) support

Multi-modal support is pretty standard at this point.  We should offer it as an optional feature that will send images along with the text to the llm.

E.g. "Describe the animal in this picture" or "Transcribe this diagram into a Mermaid block"