Description
Hello,
Thank you for sharing the code and resources for your paper. I have been trying to reproduce the results as described in the paper and have encountered an issue.
Reproduction Details
I followed the settings mentioned in the paper:
python llava_calibrate --noise_step 999 --temperature 1.0 --use_dd --use_dd_unk --cd_alpha 1.0
Using these settings, I was able to reproduce the results for the adversarial setting successfully. However, I could not reproduce the reported results for the random and popular settings. My evaluation output is below:
split random
/home/jybae/Project/VDD/experiments/eval/calibrate/temp1.0/v2/random.json
Evaluate the performance in naive setting
F1: 85.42 Accuracy: 87.0 Precision: 97.27 Recall: 76.13 yes: 39.13 unknow: 0.0 number questions 3000 confidence 0.9609135443426154
Evaluate the performance in none setting
F1: 85.5 Accuracy: 87.07 Precision: 97.28 Recall: 76.27 yes: 39.2 unknow: 0.0 number questions 3000 confidence 0.9605988143682389
Evaluate the performance in unk setting
F1: 85.62 Accuracy: 87.17 Precision: 97.37 Recall: 76.4 yes: 39.23 unknow: 0.0 number questions 3000 confidence 0.9605957010643418
Evaluate the performance in none_unk setting
F1: 85.59 Accuracy: 87.13 Precision: 97.28 Recall: 76.4 yes: 39.27 unknow: 0.0 number questions 3000 confidence 0.9605958235006767
split popular
/home/jybae/Project/VDD/experiments/eval/calibrate/temp1.0/v2/popular.json
Evaluate the performance in naive setting
F1: 84.31 Accuracy: 85.83 Precision: 94.46 Recall: 76.13 yes: 40.3 unknow: 0.0 number questions 3000 confidence 0.955709288506554
Evaluate the performance in none setting
F1: 84.33 Accuracy: 85.83 Precision: 94.31 Recall: 76.27 yes: 40.43 unknow: 0.0 number questions 3000 confidence 0.9556657163929004
Evaluate the performance in unk setting
F1: 84.39 Accuracy: 85.87 Precision: 94.24 Recall: 76.4 yes: 40.53 unknow: 0.0 number questions 3000 confidence 0.9555915723565248
Evaluate the performance in none_unk setting
F1: 84.42 Accuracy: 85.9 Precision: 94.32 Recall: 76.4 yes: 40.5 unknow: 0.0 number questions 3000 confidence 0.9556275825174664
split adversarial
/home/jybae/Project/VDD/experiments/eval/calibrate/temp1.0/v2/adversarial.json
Evaluate the performance in naive setting
F1: 82.45 Accuracy: 83.77 Precision: 89.73 Recall: 76.27 yes: 42.5 unknow: 0.0 number questions 3000 confidence 0.9455780747927737
Evaluate the performance in none setting
F1: 82.64 Accuracy: 83.93 Precision: 89.89 Recall: 76.47 yes: 42.53 unknow: 0.0 number questions 3000 confidence 0.9456214791251655
Evaluate the performance in unk setting
F1: 82.69 Accuracy: 83.97 Precision: 89.84 Recall: 76.6 yes: 42.63 unknow: 0.0 number questions 3000 confidence 0.9457334835011352
Evaluate the performance in none_unk setting
F1: 82.72 Accuracy: 84.0 Precision: 89.91 Recall: 76.6 yes: 42.6 unknow: 0.0 number questions 3000 confidence 0.9456783559072836
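For reference, this is how I understand the reported metrics are computed from the yes/no answers; it is only a minimal sketch of standard POPE-style scoring (the `answers` list with `label`/`prediction` fields is a placeholder of mine, not the repository's actual answer-file format), in case my scoring differs from yours:

```python
# Minimal sketch of POPE-style yes/no scoring using the standard definitions
# of precision, recall, F1, accuracy, and yes-ratio. The `answers` structure
# (dicts with "label" and "prediction" strings) is a hypothetical placeholder.

def pope_metrics(answers):
    tp = tn = fp = fn = 0
    yes_count = 0
    for item in answers:
        gt = item["label"].strip().lower()          # ground truth: "yes" or "no"
        pred = item["prediction"].strip().lower()   # model answer
        pred_yes = pred.startswith("yes")
        if pred_yes:
            yes_count += 1
        if pred_yes and gt == "yes":
            tp += 1
        elif pred_yes and gt == "no":
            fp += 1
        elif not pred_yes and gt == "yes":
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(answers)
    yes_ratio = yes_count / len(answers)
    return {
        "F1": 100 * f1,
        "Accuracy": 100 * accuracy,
        "Precision": 100 * precision,
        "Recall": 100 * recall,
        "yes": 100 * yes_ratio,
    }
```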
Is there something I might be missing in the setup or configuration for these two settings? Any clarification or additional details you could provide would be appreciated.
Thank you.