I ran the evaluation several times with no changes to this awesome repository.
My results on LIVE_SIMPLE for gpt-4o-mini-2024-07-18-FC: 79.457 %
Leaderboard says: 69.77 %
The same issue occurs with gpt-4o-2024-11-20-FC (about +10 percentage points of accuracy in my runs).
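
To make the size of the discrepancy concrete, here is a minimal sketch that computes the gap for gpt-4o-mini-2024-07-18-FC from the two numbers quoted above. The variable names are my own and are not taken from the repository; nothing here calls the evaluation code itself.

```python
# Scores in percent, copied from the report above.
# Variable names are illustrative, not part of the repository.
my_score = 79.457           # my local LIVE_SIMPLE result
leaderboard_score = 69.77   # value shown on the leaderboard

# Absolute gap in percentage points.
gap_points = my_score - leaderboard_score

# Gap relative to the leaderboard value, in percent.
relative_gap = gap_points / leaderboard_score * 100

print(f"gap: {gap_points:.3f} percentage points ({relative_gap:.1f}% relative)")
```

So the difference is just under 10 percentage points in absolute terms, which is too large to be run-to-run noise if the setup is identical.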