*[AlpacaEval](https://github.com/tatsu-lab/alpaca_eval) - AlpacaEval is an automatic evaluator for instruction-following language models.
*[ARES](https://github.com/stanford-futuredata/ARES) - ARES is a framework for automatically evaluating Retrieval-Augmented Generation (RAG) models.
*[AutoML Benchmark](https://github.com/openml/automlbenchmark) - AutoML Benchmark is a framework for evaluating and comparing open-source AutoML systems.
*[Banana-lyzer](https://github.com/reworkd/bananalyzer) - Banana-lyzer is an open-source AI Agent evaluation framework and dataset for web tasks with Playwright.
*[Code Generation LM Evaluation Harness](https://github.com/bigcode-project/bigcode-evaluation-harness) - Code Generation LM Evaluation Harness is a framework for the evaluation of code generation models.
*[continuous-eval](https://github.com/relari-ai/continuous-eval) - continuous-eval is a framework for data-driven evaluation of LLM-powered applications.
*[Deepchecks](https://github.com/deepchecks/deepchecks) - Deepchecks is a holistic open-source solution for all of your AI & ML validation needs, enabling you to thoroughly test your data and models from research to production.
*[DeepEval](https://github.com/confident-ai/deepeval) - DeepEval is a simple-to-use, open-source evaluation framework for LLM applications (a minimal usage sketch follows this list).
*[EvalAI](https://github.com/Cloud-CV/EvalAI) - EvalAI is an open-source platform for evaluating and comparing AI algorithms at scale.
*[Evals](https://github.com/openai/evals) - Evals is a framework for evaluating OpenAI models and an open-source registry of benchmarks.
*[Inspect](https://github.com/UKGovernmentBEIS/inspect_ai) - Inspect is a framework for large language model evaluations.
*[InterCode](https://github.com/princeton-nlp/intercode) - InterCode is a lightweight, flexible, and easy-to-use framework for designing interactive code environments to evaluate language agents that can code.
*[Langfuse](https://github.com/langfuse/langfuse) - Langfuse is an observability & analytics solution for LLM-based applications.
*[LangTest](https://github.com/JohnSnowLabs/langtest) - LangTest is a comprehensive evaluation toolkit for NLP models.
*[Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) - Language Model Evaluation Harness is a framework to test generative language models on a large number of different evaluation tasks.
*[LightEval](https://github.com/huggingface/lighteval) - LightEval is a lightweight LLM evaluation suite.
*[LLMonitor](https://github.com/lunary-ai/lunary) - LLMonitor is an observability & analytics tool for AI apps and agents.
*[LLMPerf](https://github.com/ray-project/llmperf) - LLMPerf is a tool for evaluating the performance of LLM APIs.
*[LLM AutoEval](https://github.com/mlabonne/llm-autoeval) - LLM AutoEval simplifies the process of evaluating LLMs using a convenient Colab notebook.
*[lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) - lmms-eval is an evaluation framework meticulously crafted for consistent and efficient evaluation of large multimodal models (LMMs).
*[MLPerf Inference](https://github.com/mlcommons/inference) - MLPerf Inference is a benchmark suite for measuring how fast systems can run models in a variety of deployment scenarios.
*[mltrace](https://github.com/loglabs/mltrace) - mltrace is a lightweight, open-source Python tool to get "bolt-on" observability in ML pipelines.
*[Tonic Validate](https://github.com/TonicAI/tonic_validate) - Tonic Validate is a high-performance evaluation framework for LLM/RAG outputs.
*[TruLens](https://github.com/truera/trulens) - TruLens provides a set of tools for evaluating and tracking LLM experiments.
*[TrustLLM](https://github.com/HowieHwong/TrustLLM) - TrustLLM is a comprehensive framework to evaluate the trustworthiness of large language models, which includes principles, surveys, and benchmarks.
*[UpTrain](https://github.com/uptrain-ai/uptrain) - UpTrain is an open-source tool for evaluating LLM applications.
*[VBench](https://github.com/Vchitect/VBench) - VBench is a comprehensive benchmark suite for video generative models.
*[VLMEvalKit](https://github.com/open-compass/VLMEvalKit) - VLMEvalKit is an open-source evaluation toolkit for large vision-language models (LVLMs).
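
As a quick illustration of how these evaluation frameworks are typically used, below is a minimal sketch with DeepEval. It assumes `deepeval` is installed (`pip install deepeval`) and that an `OPENAI_API_KEY` is available for the default LLM-as-judge metric; the test case contents and threshold are illustrative only, not part of any framework's required setup.

```python
# Minimal DeepEval sketch: score a single LLM response for answer relevancy.
# Assumes `pip install deepeval` and an OPENAI_API_KEY in the environment,
# since AnswerRelevancyMetric uses an LLM judge by default.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_answer_relevancy():
    # Hypothetical input/output pair; replace with your application's data.
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="You can return them within 30 days for a full refund.",
    )
    metric = AnswerRelevancyMetric(threshold=0.7)
    # Fails the test if the relevancy score falls below the threshold.
    assert_test(test_case, [metric])
```

Run it with `deepeval test run <file>.py` or plain `pytest`; the general pattern of wrapping model inputs and outputs in test cases and scoring them with metrics carries over to many of the frameworks above.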