Thanks for sharing the work! I wonder how you chose the code solution for the unit test generation prompt in data/benchmark/input_humaneval+_ut.jsonl. Is it a randomly selected solution from a previous execution?
BTW, do you have plans to share the data preparation and/or SFT code for reproduction? Thanks!