lm-eval v0.4.9.1 Release Notes
This v0.4.9.1 release is a quick patch to bring in some new tasks and fixes. Looking ahead, we're gearing up for some bigger updates to tackle common community pain points. We'll do our best to keep things from breaking, but we anticipate a few changes might not be fully backward-compatible. We're excited to share more soon!
Enhanced Reasoning Model Handling
- Better support for reasoning models with a think_end_token argument to strip intermediate reasoning from outputs for the hf, vllm, and sglang model backends. A related enable_thinking argument was also added for specific models that support it (e.g., Qwen).
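To illustrate what stripping intermediate reasoning means in practice, here is a minimal string-level sketch. It is not the harness's own implementation (the backends may well operate on token IDs rather than text); the function name strip_reasoning and the default "</think>" marker are assumptions made for this example only.

```python
def strip_reasoning(text: str, think_end_token: str = "</think>") -> str:
    """Return only the text that follows the last end-of-thinking marker.

    Hypothetical helper for illustration; not part of lm-eval's API.
    If the marker never appears, the generation is returned unchanged.
    """
    # rpartition splits on the LAST occurrence, so nested or repeated
    # reasoning blocks are all dropped together.
    _, marker, answer = text.rpartition(think_end_token)
    return answer.lstrip() if marker else text


generation = "<think>2 + 2 equals 4.</think>\nThe answer is 4."
print(strip_reasoning(generation))
```

Splitting on the last occurrence is a design choice of this sketch: it keeps only the final answer even if the model emits several reasoning segments.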
New Benchmarks & Tasks
- EgyMMLU and EgyHellaSwag by @houdaipha in #3063
- MultiBLiMP benchmark by @jmichaelov in #3155
- LIBRA benchmark for long-context evaluation by @karimovaSvetlana in #2943
- Multilingual TruthfulQA in Spanish, Basque, and Galician by @BlancaCalvo in #3062
Fixes & Improvements
Tasks & Benchmarks:
- Aligned HumanEval results for Llama-3.1-70B-Instruct with official scores by @userljz, @baberabb, and @idantene. (#3092, #3102, #3201)
- Fixed incorrect dataset paths for GLUE and medical benchmarks by @Avelina9X and @idantene. (#3159, #3151)
- Removed redundant "Let's think step by step" text from bbh_cot_fewshot prompts by @philipdoldo. (#3140)
- Increased max_gen_toks to 2048 for HRM8K math benchmarks by @shing100. (#3124)
Backend & Stability:
- Reduced CLI loading time from 2.2s to 0.05s by @stakodiak. (#3099)
- Fixed a process hang caused by mp.Pool in bootstrap_stderr and introduced the DISABLE_MULTIPROC environment variable by @ankitgola005 and @neel04. (#3135, #3106)
- Added image hashing and the LMEVAL_HASHMM environment variable by @artemorloff. (#2973)
- TaskManager: improved include-path precedence handling to prioritize the custom directory over the default by @parkhs21. (#3068)
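The DISABLE_MULTIPROC item above describes an environment-variable switch around a multiprocessing pool. The sketch below shows that general pattern for a bootstrap standard-error estimate. It is an illustration under assumptions, not the harness's code: the function names, the set of accepted truthy values, and the resampling details are all hypothetical.

```python
import os
import random
import statistics
from multiprocessing import Pool


def _resample_mean(job):
    """Draw one bootstrap resample (with replacement) and return its mean."""
    values, seed = job
    rng = random.Random(seed)
    return statistics.fmean(rng.choice(values) for _ in values)


def multiproc_disabled(env=None):
    """Assumed semantics: '1', 'true', or 'yes' turns the pool off."""
    env = os.environ if env is None else env
    return env.get("DISABLE_MULTIPROC", "").strip().lower() in {"1", "true", "yes"}


def bootstrap_stderr_sketch(values, iters=100, workers=2, env=None):
    """Standard deviation of bootstrap means, computed serially or in a pool."""
    jobs = [(values, seed) for seed in range(iters)]
    if multiproc_disabled(env):
        # Serial fallback: no worker processes, so nothing can hang on a pool.
        means = [_resample_mean(job) for job in jobs]
    else:
        with Pool(workers) as pool:
            means = pool.map(_resample_mean, jobs)
    return statistics.stdev(means)
```

Passing env explicitly makes the switch easy to exercise in tests without mutating the real process environment.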
Housekeeping:
- Pinned datasets < 4.0.0 temporarily to maintain compatibility with trust_remote_code by @baberabb. (#3172)
- Removed models from Neural Magic and other unneeded files by @baberabb. (#3112, #3113, #3108)
What's Changed
- llama3 task: update README.md by @annafontanaa in #3074
- Fix Anthropic API compatibility issues in chat completions by @NourFahmy in #3054
- Ensure backwards compatibility in fewshot_context by using kwargs by @kiersten-stokes in #3079
- [vllm] remove system message if TemplateError for chat_template by @baberabb in #3076
- feat / fix: Properly make use of subfolder from HF models by @younesbelkada in #3072
- [HF] fix quantization config by @baberabb in #3039
- FixBug: Align the Humaneval with official results for Llama-3.1-70B-Instruct by @userljz in #3092
- Truthfulqa multi harness by @BlancaCalvo in #3062
- Fix: Reduce CLI loading time from 2.2s to 0.05s by @stakodiak in #3099
- Humaneval - fix regression by @baberabb in #3102
- Bugfix/hf tokenizer gguf override by @ankush13r in #3098
- [FIX] Initial code to disable multi-proc for stderr by @neel04 in #3106
- fix deps; update hooks by @baberabb in #3107
- delete unneeded files by @baberabb in #3108
- Fixed #3005: Processes both formats of model_args: string and dictionary by @DebjyotiRay in #3097
- add image hashing and LMEVAL_HASHMM envar by @artemorloff in #2973
- removal of Neural Magic models by @baberabb in #3112
- Neuralmagic by @baberabb in #3113
- check pil dep when hashing images by @baberabb in #3114
- warning for "chat" pretrained; disable buggy evalita configs by @baberabb in #3127
- fix: remove warning by @baberabb in #3128
- Adding EgyMMLU and EgyHellaSwag by @houdaipha in #3063
- Added mixed_precision_dtype argument to HFLM to enable autocasting by @Avelina9X in #3138
- Fix for hang due to mp.Pool in bootstrap_stderr by @ankitgola005 in #3135
- when using vllm with lora, it will have some mistakes, now i fix it. by @Jacky-MYQ in #3132
- truncate thinking tags in generations by @baberabb in #3145
- bbh_cot_fewshot: Removed repeated "Let's think step by step." text from bbh cot prompts by @philipdoldo in #3140
- Fix medical benchmarks import by @idantene in #3151
- fix request hanging when request api by @mmmans in #3090
- Custom request headers | trust_remote_code param fix by @RawthiL in #3069
- Bugfix: update path for GLUE by @Avelina9X in #3159
- Add the MultiBLiMP benchmark by @jmichaelov in #3155
- multiblimp - readme by @baberabb in #3162
- [tests] Added missing fixture in test_unitxt_tasks.py by @Avelina9X in #3163
- Fix: extended to max_gen_toks 2048 for HRM8K math benchmarks by @shing100 in #3124
- feat: Add LIBRA benchmark for long-context evaluation by @karimovaSvetlana in #2943
- Added chat_template_args to vllm by @Avelina9X in #3164
- Pin datasets < 4.0.0 by @baberabb in #3172
- Remove "device" from vllm_causallms.py by @mgoin in #3176
- remove trust-remote-code in configs; fix escape sequences by @baberabb in #3180
- Fix vllm test issue that call pop() from None by @weireweire in #3182
- [hotfix] vllm: pop device from kwargs by @baberabb in #3181
- Update vLLM compatibility by @DarkLight1337 in #3024
- Fix mmlu_continuation subgroup names to fit Readme and other variants by @lamalunderscore in #3137
- Fix humaneval_instruct by @idantene in #3201
- Update README.md for mlqa by @newme616 in #3117
- improve include-path precedence handling by @parkhs21 in #3068
- Bump version to 0.4.9.1 by @baberabb in #3208
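One entry in the list above (#3097) concerns accepting model_args either as a comma-separated string or as a dictionary. A rough sketch of that kind of normalization follows. The function name normalize_model_args is hypothetical, and this simplified version keeps every parsed value as a string; the harness's real parser may coerce types or handle edge cases differently.

```python
def normalize_model_args(model_args):
    """Accept 'k1=v1,k2=v2' strings or dicts and always return a dict.

    Hypothetical helper for illustration; values parsed from a string
    stay strings here (no int/bool coercion is attempted).
    """
    if model_args is None:
        return {}
    if isinstance(model_args, dict):
        return dict(model_args)  # shallow copy so the caller's dict is untouched
    if isinstance(model_args, str):
        parsed = {}
        for pair in model_args.split(","):
            pair = pair.strip()
            if not pair:
                continue  # tolerate trailing commas and empty strings
            key, sep, value = pair.partition("=")
            if not sep:
                raise ValueError(f"expected key=value, got {pair!r}")
            parsed[key.strip()] = value.strip()
        return parsed
    raise TypeError(f"unsupported model_args type: {type(model_args).__name__}")


print(normalize_model_args("pretrained=gpt2,dtype=float32"))
print(normalize_model_args({"pretrained": "gpt2"}))
```

Normalizing once at the entry point lets the rest of the code work with a single representation, which is the main benefit of supporting both input formats this way.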
New Contributors
- @NourFahmy made their first contribution in #3054
- @userljz made their first contribution in #3092
- @BlancaCalvo made their first contribution in #3062
- @stakodiak made their first contribution in #3099
- @ankush13r made their first contribution in #3098
- @neel04 made their first contribution in #3106
- @DebjyotiRay made their first contribution in #3097
- @houdaipha made their first contribution in #3063
- @ankitgola005 made their first contribution in #3135
- @Jacky-MYQ made their first contribution in #3132
- @philipdoldo made their first contribution in #3140
- @idantene made their first contribution in #3151
- @mmmans made their first contribution in #3090
- @shing100 made their first contribution in #3124
- @karimovaSvetlana made their first contribution in #2943
- @weireweire made their first contribution in #3182
- @DarkLight1337 made their first contribution in #3024
- @lamalunderscore made their first contribution in #3137
- @newme616 made their first contribution in #3117
- @parkhs21 made their first contribution in #3068
Full Changelog: v0.4.9...v0.4.9.1