I installed the environment successfully, but when I tested the model on assets/demo.jpg by running the command
bash assets/demo.sh, the results looked incorrect.
Below is the detection result, which includes some incorrect predictions, e.g., chair.

And here is the referring expression comprehension and segmentation result for the expression "person on the left". Nothing is detected.

Has anyone else encountered this problem?