Right now we can only see these models' scores, but I'm very interested in how you evaluate these agents.