
VDD Reproduced Results for POPE Experiment #9

@jiyunBae007

Description

Hello,

Thank you for sharing the code and resources for your paper. I have been trying to reproduce the results as described in the paper and have encountered an issue.

Reproduction Details
I followed the settings mentioned in the paper:

python llava_calibrate --noise_step 999 --temperature 1.0 --use_dd --use_dd_unk --cd_alpha 1.0

Using these settings, I was able to reproduce the adversarial-setting results successfully; however, I could not reproduce the results for the random and popular settings. My evaluation output for all three splits is below:

split random
/home/jybae/Project/VDD/experiments/eval/calibrate/temp1.0/v2/random.json
Evaluate the performance in naive setting
F1: 85.42 Accuracy: 87.0 Precision: 97.27 Recall: 76.13 yes: 39.13 unknow: 0.0 number questions 3000 confidence 0.9609135443426154
Evaluate the performance in none setting
F1: 85.5 Accuracy: 87.07 Precision: 97.28 Recall: 76.27 yes: 39.2 unknow: 0.0 number questions 3000 confidence 0.9605988143682389
Evaluate the performance in unk setting
F1: 85.62 Accuracy: 87.17 Precision: 97.37 Recall: 76.4 yes: 39.23 unknow: 0.0 number questions 3000 confidence 0.9605957010643418
Evaluate the performance in none_unk setting
F1: 85.59 Accuracy: 87.13 Precision: 97.28 Recall: 76.4 yes: 39.27 unknow: 0.0 number questions 3000 confidence 0.9605958235006767
split popular
/home/jybae/Project/VDD/experiments/eval/calibrate/temp1.0/v2/popular.json
Evaluate the performance in naive setting
F1: 84.31 Accuracy: 85.83 Precision: 94.46 Recall: 76.13 yes: 40.3 unknow: 0.0 number questions 3000 confidence 0.955709288506554
Evaluate the performance in none setting
F1: 84.33 Accuracy: 85.83 Precision: 94.31 Recall: 76.27 yes: 40.43 unknow: 0.0 number questions 3000 confidence 0.9556657163929004
Evaluate the performance in unk setting
F1: 84.39 Accuracy: 85.87 Precision: 94.24 Recall: 76.4 yes: 40.53 unknow: 0.0 number questions 3000 confidence 0.9555915723565248
Evaluate the performance in none_unk setting
F1: 84.42 Accuracy: 85.9 Precision: 94.32 Recall: 76.4 yes: 40.5 unknow: 0.0 number questions 3000 confidence 0.9556275825174664
split adversarial
/home/jybae/Project/VDD/experiments/eval/calibrate/temp1.0/v2/adversarial.json
Evaluate the performance in naive setting
F1: 82.45 Accuracy: 83.77 Precision: 89.73 Recall: 76.27 yes: 42.5 unknow: 0.0 number questions 3000 confidence 0.9455780747927737
Evaluate the performance in none setting
F1: 82.64 Accuracy: 83.93 Precision: 89.89 Recall: 76.47 yes: 42.53 unknow: 0.0 number questions 3000 confidence 0.9456214791251655
Evaluate the performance in unk setting
F1: 82.69 Accuracy: 83.97 Precision: 89.84 Recall: 76.6 yes: 42.63 unknow: 0.0 number questions 3000 confidence 0.9457334835011352
Evaluate the performance in none_unk setting
F1: 82.72 Accuracy: 84.0 Precision: 89.91 Recall: 76.6 yes: 42.6 unknow: 0.0 number questions 3000 confidence 0.9456783559072836
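For reference, here is a minimal sketch of how the F1, Accuracy, Precision, Recall, and yes-ratio fields in the logs above are typically computed for POPE-style binary yes/no evaluation. This is my own illustration, not the repo's evaluation script; the function name is hypothetical, and the log's `unknow` and `confidence` fields are omitted here.

```python
# Hedged sketch (NOT the repo's evaluation code): POPE-style metrics
# from binary yes/no answers, where 1 = "yes" and 0 = "no".
# Handling of "unknown" answers and the confidence field is omitted.
def pope_metrics(preds, golds):
    tp = sum(p == 1 and g == 1 for p, g in zip(preds, golds))  # correct "yes"
    fp = sum(p == 1 and g == 0 for p, g in zip(preds, golds))  # spurious "yes"
    fn = sum(p == 0 and g == 1 for p, g in zip(preds, golds))  # missed "yes"
    tn = sum(p == 0 and g == 0 for p, g in zip(preds, golds))  # correct "no"
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "f1": f1,
        "accuracy": (tp + tn) / len(golds),
        "precision": precision,
        "recall": recall,
        "yes": (tp + fp) / len(golds),  # fraction of "yes" predictions
    }
```

The logged values appear to be these quantities scaled to percentages; for example, a low `yes` ratio together with high precision and lower recall, as in the random split above, indicates the model answers "yes" conservatively.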

Is there something I might be missing in the setup or configuration for these two settings? I would appreciate it if you could clarify or provide additional details.

Thank you.
