Description
Hello,
Thank you for sharing the code and resources for your paper. I have been trying to reproduce the results as described in the paper and have encountered an issue.
Reproduction Details
I followed the settings mentioned in the paper:
python llava_calibrate --noise_step 999 --temperature 1.0 --use_dd --use_dd_unk --cd_alpha 1.0
Using these settings, I was able to reproduce the results for the adversarial setting successfully. However, I could not reproduce the reported results for the random and popular settings. My evaluation output is below:
split random
/home/jybae/Project/VDD/experiments/eval/calibrate/temp1.0/v2/random.json
Evaluate the performance in naive setting
F1: 85.42 Accuracy: 87.0 Precision: 97.27 Recall: 76.13 yes: 39.13 unknow: 0.0 number questions 3000 confidence 0.9609135443426154
Evaluate the performance in none setting
F1: 85.5 Accuracy: 87.07 Precision: 97.28 Recall: 76.27 yes: 39.2 unknow: 0.0 number questions 3000 confidence 0.9605988143682389
Evaluate the performance in unk setting
F1: 85.62 Accuracy: 87.17 Precision: 97.37 Recall: 76.4 yes: 39.23 unknow: 0.0 number questions 3000 confidence 0.9605957010643418
Evaluate the performance in none_unk setting
F1: 85.59 Accuracy: 87.13 Precision: 97.28 Recall: 76.4 yes: 39.27 unknow: 0.0 number questions 3000 confidence 0.9605958235006767
split popular
/home/jybae/Project/VDD/experiments/eval/calibrate/temp1.0/v2/popular.json
Evaluate the performance in naive setting
F1: 84.31 Accuracy: 85.83 Precision: 94.46 Recall: 76.13 yes: 40.3 unknow: 0.0 number questions 3000 confidence 0.955709288506554
Evaluate the performance in none setting
F1: 84.33 Accuracy: 85.83 Precision: 94.31 Recall: 76.27 yes: 40.43 unknow: 0.0 number questions 3000 confidence 0.9556657163929004
Evaluate the performance in unk setting
F1: 84.39 Accuracy: 85.87 Precision: 94.24 Recall: 76.4 yes: 40.53 unknow: 0.0 number questions 3000 confidence 0.9555915723565248
Evaluate the performance in none_unk setting
F1: 84.42 Accuracy: 85.9 Precision: 94.32 Recall: 76.4 yes: 40.5 unknow: 0.0 number questions 3000 confidence 0.9556275825174664
split adversarial
/home/jybae/Project/VDD/experiments/eval/calibrate/temp1.0/v2/adversarial.json
Evaluate the performance in naive setting
F1: 82.45 Accuracy: 83.77 Precision: 89.73 Recall: 76.27 yes: 42.5 unknow: 0.0 number questions 3000 confidence 0.9455780747927737
Evaluate the performance in none setting
F1: 82.64 Accuracy: 83.93 Precision: 89.89 Recall: 76.47 yes: 42.53 unknow: 0.0 number questions 3000 confidence 0.9456214791251655
Evaluate the performance in unk setting
F1: 82.69 Accuracy: 83.97 Precision: 89.84 Recall: 76.6 yes: 42.63 unknow: 0.0 number questions 3000 confidence 0.9457334835011352
Evaluate the performance in none_unk setting
F1: 82.72 Accuracy: 84.0 Precision: 89.91 Recall: 76.6 yes: 42.6 unknow: 0.0 number questions 3000 confidence 0.9456783559072836
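For reference, this is how I understand the reported metrics are computed from the yes/no answers; it is only a minimal sketch of standard POPE-style scoring (the `answers` list with `label`/`prediction` fields is a placeholder of mine, not the repository's actual answer-file format), in case my scoring differs from yours:

```python
# Minimal sketch of POPE-style yes/no scoring using the standard definitions
# of precision, recall, F1, accuracy, and yes-ratio. The `answers` structure
# (dicts with "label" and "prediction" strings) is a hypothetical placeholder.

def pope_metrics(answers):
    tp = tn = fp = fn = 0
    yes_count = 0
    for item in answers:
        gt = item["label"].strip().lower()          # ground truth: "yes" or "no"
        pred = item["prediction"].strip().lower()   # model answer
        pred_yes = pred.startswith("yes")
        if pred_yes:
            yes_count += 1
        if pred_yes and gt == "yes":
            tp += 1
        elif pred_yes and gt == "no":
            fp += 1
        elif not pred_yes and gt == "yes":
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(answers)
    yes_ratio = yes_count / len(answers)
    return {
        "F1": 100 * f1,
        "Accuracy": 100 * accuracy,
        "Precision": 100 * precision,
        "Recall": 100 * recall,
        "yes": 100 * yes_ratio,
    }
```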
Is there something I might be missing in the setup or configuration for these two settings? Any clarification or additional details you could provide would be appreciated.
Thank you.