Question: Can Context FMHA be used to implement the Transformer layers in a vision encoder for multimodal models?

I see that the multimodal models in the examples all use TensorRT directly to deploy their vision encoders. Why not use TensorRT-LLM? Are there known issues or challenges with integrating Context FMHA into vision encoders?