HF Model has visibly lower performance than chat.qwen.ai

I was using"Qwen2.5-VL-32B-Instruct" for person appearance matching, and I realized that the hf model of "Qwen/Qwen2.5-VL-32B-Instruct" running both on my local environment and on "https://huggingface.co/spaces/Qwen/Qwen2.5-VL-32B-Instruct" have visibly lower performance than if I select "Qwen2.5-VL-32B-Instruct" on chat.qwen.ai. Interestingly, my local inference and "https://huggingface.co/spaces/Qwen/Qwen2.5-VL-32B-Instruct" consistently have the same accuracy. The following is the text prompt, due to compliance requirements I cannot share the images. My current suspicion is that chat.qwen.ai's backend ignores my model selection and quietly uses a different, perhaps the bigger 72B model. Can someone from qwen team confirm this?

Prompt: Based on the appearance of the person in each image, are they likely the same person? You should ignore the background and only focus on the person's appearance, clothing, etc. If their clothing is visibly different, they are not the same person. Your output should be a score from 0 to 10, where 0 means definitely not the same person and 10 means definitely the same person. Please only output the score.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

HF Model has visibly lower performance than chat.qwen.ai #509

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

HF Model has visibly lower performance than chat.qwen.ai #509

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions