[BFCL] score discrepancy for gpt-4o-2024-11-20-FC and gpt-4o-mini-2024-07-18-FC

I ran the evaluation several times without making any changes to this awesome repository.

My results on LIVE_SIMPLE for `gpt-4o-mini-2024-07-18-FC`: 79.457 %

Leaderboard says: 69.77 %
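To make the size of the gap explicit, here is a minimal sketch that computes it from the two numbers above. The variable names are just for illustration and are not part of the BFCL codebase.

```python
# Scores copied from this report; variable names are illustrative only.
my_score = 79.457          # LIVE_SIMPLE accuracy from my runs, in percent
leaderboard_score = 69.77  # LIVE_SIMPLE accuracy shown on the leaderboard, in percent

# Absolute gap, in percentage points.
gap_points = my_score - leaderboard_score

# Gap relative to the leaderboard value, in percent.
gap_relative = gap_points / leaderboard_score * 100

print(f"absolute gap: {gap_points:.3f} percentage points")
print(f"relative gap: {gap_relative:.1f} %")
```

So the difference is a little under 10 percentage points in absolute terms.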

I see the same issue with `gpt-4o-2024-11-20-FC`: my runs score about 10% higher than the leaderboard.