lm-eval v0.4.9.1 Release Notes

This v0.4.9.1 release is a quick patch to bring in some new tasks and fixes. Looking ahead, we're gearing up for some bigger updates to tackle common community pain points. We'll do our best to keep things from breaking, but we anticipate that a few changes might not be fully backward-compatible. We're excited to share more soon!

Enhanced Reasoning Model Handling

This release improves support for reasoning models: a new think_end_token argument strips intermediate reasoning from outputs for the hf, vllm, and sglang model backends. A related enable_thinking argument was also added for specific models that support it (e.g., Qwen).
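
To illustrate what stripping intermediate reasoning means in practice, here is a minimal Python sketch. It is not lm-eval's implementation: the helper name strip_reasoning, the default marker "</think>", and the choice to cut at the last occurrence of the marker are all assumptions made for this example.

```python
def strip_reasoning(text: str, think_end_token: str = "</think>") -> str:
    """Return only the text that follows the end-of-thinking marker.

    Hypothetical helper for illustration; not lm-eval's own code.
    """
    # rpartition splits on the LAST occurrence of the marker (an assumption).
    # If the marker is absent it returns ("", "", text), so sep is empty.
    _, sep, answer = text.rpartition(think_end_token)
    if not sep:
        # No marker found: leave the generation untouched.
        return text
    # Drop the whitespace that usually separates reasoning from the answer.
    return answer.lstrip()


raw = "<think>2 + 2 is 4, so double it.</think>\nThe answer is 8."
print(strip_reasoning(raw))
```

A generation without the marker passes through unchanged, so the same post-processing step is harmless for models that do not emit reasoning.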

New Benchmarks & Tasks

EgyMMLU and EgyHellaSwag by @houdaipha in #3063
MultiBLiMP benchmark by @jmichaelov in #3155
LIBRA benchmark for long-context evaluation by @karimovaSvetlana in #2943
Multilingual TruthfulQA in Spanish, Basque, and Galician by @BlancaCalvo in #3062

Fixes & Improvements

Tasks & Benchmarks:

Aligned HumanEval results for Llama-3.1-70B-Instruct with official scores by @userljz, @baberabb, and @idantene. (#3092, #3102, #3201)
Fixed incorrect dataset paths for GLUE and medical benchmarks by @Avelina9X and @idantene. (#3159, #3151)
Removed redundant "Let's think step by step" text from bbh_cot_fewshot prompts by @philipdoldo. (#3140)
Increased max_gen_toks to 2048 for HRM8K math benchmarks by @shing100. (#3124)

Backend & Stability:

Reduced CLI loading time from 2.2s to 0.05s by @stakodiak. (#3099)
Fixed a process hang caused by mp.Pool in bootstrap_stderr and introduced the DISABLE_MULTIPROC environment variable by @ankitgola005 and @neel04. (#3135, #3106)
Added image hashing and the LMEVAL_HASHMM environment variable by @artemorloff. (#2973)
Improved TaskManager include-path precedence handling so that a custom directory takes priority over the default by @parkhs21. (#3068)
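
For readers unfamiliar with the bootstrap_stderr fix, the sketch below shows the general idea of a bootstrap standard-error estimate that runs in a single process, plus an environment-variable switch. It is a simplified illustration, not lm-eval's code: the function names are hypothetical, and treating any non-empty value of DISABLE_MULTIPROC as "disabled" is an assumption; check the project documentation for the accepted values.

```python
import os
import random
import statistics


def multiproc_disabled(env=os.environ) -> bool:
    """Assumption: any non-empty DISABLE_MULTIPROC value disables the pool."""
    return bool(env.get("DISABLE_MULTIPROC"))


def bootstrap_stderr_serial(values, iters=1000, seed=1234) -> float:
    """Estimate the standard error of the mean without worker processes."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(iters):
        # Resample with replacement and record the mean of the resample.
        sample = rng.choices(values, k=n)
        means.append(statistics.fmean(sample))
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(means)


scores = [0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
print("multiprocessing disabled:", multiproc_disabled())
print("bootstrap stderr:", round(bootstrap_stderr_serial(scores), 3))
```

Running the resampling loop serially is slower than a process pool on large result sets, but it cannot hang on worker start-up, which is the trade-off such a switch exposes.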

Housekeeping:

Pinned datasets < 4.0.0 temporarily to maintain compatibility with trust_remote_code by @baberabb. (#3172)
Removed models from Neural Magic and other unneeded files by @baberabb. (#3112, #3113, #3108)

What's Changed

llama3 task: update README.md by @annafontanaa in #3074
Fix Anthropic API compatibility issues in chat completions by @NourFahmy in #3054
Ensure backwards compatibility in fewshot_context by using kwargs by @kiersten-stokes in #3079
[vllm] remove system message if TemplateError for chat_template by @baberabb in #3076
feat / fix: Properly make use of subfolder from HF models by @younesbelkada in #3072
[HF] fix quantization config by @baberabb in #3039
FixBug: Align the Humaneval with official results for Llama-3.1-70B-Instruct by @userljz in #3092
Truthfulqa multi harness by @BlancaCalvo in #3062
Fix: Reduce CLI loading time from 2.2s to 0.05s by @stakodiak in #3099
Humaneval - fix regression by @baberabb in #3102
Bugfix/hf tokenizer gguf override by @ankush13r in #3098
[FIX] Initial code to disable multi-proc for stderr by @neel04 in #3106
fix deps; update hooks by @baberabb in #3107
delete unneeded files by @baberabb in #3108
Fixed #3005: Processes both formats of model_args: string and dictionary by @DebjyotiRay in #3097
add image hashing and LMEVAL_HASHMM envar by @artemorloff in #2973
removal of Neural Magic models by @baberabb in #3112
Neuralmagic by @baberabb in #3113
check pil dep when hashing images by @baberabb in #3114
warning for "chat" pretrained; disable buggy evalita configs by @baberabb in #3127
fix: remove warning by @baberabb in #3128
Adding EgyMMLU and EgyHellaSwag by @houdaipha in #3063
Added mixed_precision_dtype argument to HFLM to enable autocasting by @Avelina9X in #3138
Fix for hang due to mp.Pool in bootstrap_stderr by @ankitgola005 in #3135
when using vllm with lora, it will have some mistakes, now i fix it. by @Jacky-MYQ in #3132
truncate thinking tags in generations by @baberabb in #3145
bbh_cot_fewshot: Removed repeated "Let's think step by step." text from bbh cot prompts by @philipdoldo in #3140
Fix medical benchmarks import by @idantene in #3151
fix request hanging when request api by @mmmans in #3090
Custom request headers | trust_remote_code param fix by @RawthiL in #3069
Bugfix: update path for GLUE by @Avelina9X in #3159
Add the MultiBLiMP benchmark by @jmichaelov in #3155
multiblimp - readme by @baberabb in #3162
[tests] Added missing fixture in test_unitxt_tasks.py by @Avelina9X in #3163
Fix: extended to max_gen_toks 2048 for HRM8K math benchmarks by @shing100 in #3124
feat: Add LIBRA benchmark for long-context evaluation by @karimovaSvetlana in #2943
Added chat_template_args to vllm by @Avelina9X in #3164
Pin datasets < 4.0.0 by @baberabb in #3172
Remove "device" from vllm_causallms.py by @mgoin in #3176
remove trust-remote-code in configs; fix escape sequences by @baberabb in #3180
Fix vllm test issue that call pop() from None by @weireweire in #3182
[hotfix] vllm: pop device from kwargs by @baberabb in #3181
Update vLLM compatibility by @DarkLight1337 in #3024
Fix mmlu_continuation subgroup names to fit Readme and other variants by @lamalunderscore in #3137
Fix humaneval_instruct by @idantene in #3201
Update README.md for mlqa by @newme616 in #3117
improve include-path precedence handling by @parkhs21 in #3068
Bump version to 0.4.9.1 by @baberabb in #3208

New Contributors

@NourFahmy made their first contribution in #3054
@userljz made their first contribution in #3092
@BlancaCalvo made their first contribution in #3062
@stakodiak made their first contribution in #3099
@ankush13r made their first contribution in #3098
@neel04 made their first contribution in #3106
@DebjyotiRay made their first contribution in #3097
@houdaipha made their first contribution in #3063
@ankitgola005 made their first contribution in #3135
@Jacky-MYQ made their first contribution in #3132
@philipdoldo made their first contribution in #3140
@idantene made their first contribution in #3151
@mmmans made their first contribution in #3090
@shing100 made their first contribution in #3124
@karimovaSvetlana made their first contribution in #2943
@weireweire made their first contribution in #3182
@DarkLight1337 made their first contribution in #3024
@lamalunderscore made their first contribution in #3137
@newme616 made their first contribution in #3117
@parkhs21 made their first contribution in #3068

Full Changelog: v0.4.9...v0.4.9.1
