VIMA's inference relies strongly on the images in the prompt and the objects in the environment

I tested how robust the VIMA model is to the wording of the prompt.
For example, I changed this task template
`Put the {dragged_texture} object in {scene} into the {base_texture} object.`
into
`jfasfo jdfjs {dragged_texture} aosdj sdfj {scene} asoads jsidf {base_texture} aidfoads.`
which makes no sense to a human reader.

I expected the model to perform poorly; however, the success rate was still almost 100%.

This needs further investigation, but I think the model effectively looks only at the images, i.e. it has overfitted to the image inputs.
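
For anyone who wants to reproduce the perturbation, here is a minimal sketch of how a scrambled template like the one above could be generated. It only rewrites the template string: every letter outside a `{placeholder}` is replaced with a random lowercase letter, and the placeholders and punctuation stay as they are. The helper name `scramble_template` and the `seed` parameter are my own and are not part of the VIMA codebase; how the resulting string is fed into the evaluation depends on your setup and is not shown.

```python
import random
import re
import string

# Matches template placeholders such as {dragged_texture}.
# The capturing group makes re.split() keep the placeholders in its output.
PLACEHOLDER = re.compile(r"(\{[^{}]+\})")


def scramble_template(template: str, seed: int = 0) -> str:
    """Replace every letter outside {placeholders} with a random letter.

    Hypothetical helper for illustration only; not part of VIMA.
    Placeholders, spaces, and punctuation are left unchanged.
    """
    rng = random.Random(seed)  # seeded so the scrambled prompt is reproducible
    parts = PLACEHOLDER.split(template)
    scrambled = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            # Odd positions are the captured placeholders: keep them verbatim.
            scrambled.append(part)
        else:
            # Even positions are plain text: randomize letters only.
            scrambled.append(
                "".join(
                    rng.choice(string.ascii_lowercase) if ch.isalpha() else ch
                    for ch in part
                )
            )
    return "".join(scrambled)


if __name__ == "__main__":
    original = (
        "Put the {dragged_texture} object in {scene} "
        "into the {base_texture} object."
    )
    print(scramble_template(original))
```

Comparing the success rate with the original and the scrambled template, with everything else held fixed, isolates the contribution of the text tokens from that of the image placeholders.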

The inference process of the VIMA strongly relies on the images in the prompts and objects in the environment #44

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

The inference process of the VIMA strongly relies on the images in the prompts and objects in the environment #44

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions