
Commit 371b7ee

add LangTest (#610)
1 parent 4da74b6 · commit 371b7ee

File tree: 1 file changed (+7, -6 lines)

README.md

Lines changed: 7 additions & 6 deletions
@@ -313,10 +313,10 @@ Please review our [CONTRIBUTING.md](https://github.com/EthicalML/awesome-product
 * [AlpacaEval](https://github.com/tatsu-lab/alpaca_eval) ![](https://img.shields.io/github/stars/tatsu-lab/alpaca_eval.svg?style=social) - AlpacaEval is an automatic evaluator for instruction-following language models.
 * [ARES](https://github.com/stanford-futuredata/ARES) ![](https://img.shields.io/github/stars/stanford-futuredata/ARES.svg?style=social) - ARES is a framework for automatically evaluating Retrieval-Augmented Generation (RAG) models.
 * [AutoML Benchmark](https://github.com/openml/automlbenchmark) ![](https://img.shields.io/github/stars/openml/automlbenchmark.svg?style=social) - AutoML Benchmark is a framework for evaluating and comparing open-source AutoML systems.
-* [Banana-lyzer](https://github.com/reworkd/bananalyzer) ![](https://img.shields.io/github/stars/reworkd/bananalyzer.svg?style=social) - Banana-lyzer is an open source AI Agent evaluation framework and dataset for web tasks with Playwright.
+* [Banana-lyzer](https://github.com/reworkd/bananalyzer) ![](https://img.shields.io/github/stars/reworkd/bananalyzer.svg?style=social) - Banana-lyzer is an open-source AI Agent evaluation framework and dataset for web tasks with Playwright.
 * [Code Generation LM Evaluation Harness](https://github.com/bigcode-project/bigcode-evaluation-harness) ![](https://img.shields.io/github/stars/bigcode-project/bigcode-evaluation-harness.svg?style=social) - Code Generation LM Evaluation Harness is a framework for the evaluation of code generation models.
-* [continuous-eval](https://github.com/relari-ai/continuous-eval) ![](https://img.shields.io/github/stars/relari-ai/continuous-eval.svg?style=social) - continuous-eval is a framework for data-driven evaluation of LLM-powered application.
-* [Deepchecks](https://github.com/deepchecks/deepchecks) ![](https://img.shields.io/github/stars/deepchecks/deepchecks.svg?style=social) - Deepchecks is a holistic open-source solution for all of your AI & ML validation needs, enabling you to thoroughly test your data and models from research to production.
+* [continuous-eval](https://github.com/relari-ai/continuous-eval) ![](https://img.shields.io/github/stars/relari-ai/continuous-eval.svg?style=social) - continuous-eval is a framework for data-driven evaluation of LLM-powered applications.
+* [Deepchecks](https://github.com/deepchecks/deepchecks) ![](https://img.shields.io/github/stars/deepchecks/deepchecks.svg?style=social) - Deepchecks is a holistic open-source solution for all of your AI & ML validation needs, enabling you to test your data and models from research to production thoroughly.
 * [DeepEval](https://github.com/confident-ai/deepeval) ![](https://img.shields.io/github/stars/confident-ai/deepeval.svg?style=social) - DeepEval is a simple-to-use, open-source evaluation framework for LLM applications.
 * [EvalAI](https://github.com/Cloud-CV/EvalAI) ![](https://img.shields.io/github/stars/Cloud-CV/EvalAI.svg?style=social) - EvalAI is an open-source platform for evaluating and comparing AI algorithms at scale.
 * [Evals](https://github.com/openai/evals) ![](https://img.shields.io/github/stars/openai/evals.svg?style=social) - Evals is a framework for evaluating OpenAI models and an open-source registry of benchmarks.
@@ -333,11 +333,12 @@ Please review our [CONTRIBUTING.md](https://github.com/EthicalML/awesome-product
 * [Inspect](https://github.com/UKGovernmentBEIS/inspect_ai) ![](https://img.shields.io/github/stars/UKGovernmentBEIS/inspect_ai.svg?style=social) - Inspect is a framework for large language model evaluations.
 * [InterCode](https://github.com/princeton-nlp/intercode) ![](https://img.shields.io/github/stars/princeton-nlp/intercode.svg?style=social) - InterCode is a lightweight, flexible, and easy-to-use framework for designing interactive code environments to evaluate language agents that can code.
 * [Langfuse](https://github.com/langfuse/langfuse) ![](https://img.shields.io/github/stars/langfuse/langfuse.svg?style=social) - Langfuse is an observability & analytics solution for LLM-based applications.
+* [LangTest](https://github.com/JohnSnowLabs/langtest) ![](https://img.shields.io/github/stars/JohnSnowLabs/langtest.svg?style=social) - LangTest is a comprehensive evaluation toolkit for NLP models.
 * [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) ![](https://img.shields.io/github/stars/EleutherAI/lm-evaluation-harness.svg?style=social) - Language Model Evaluation Harness is a framework to test generative language models on a large number of different evaluation tasks.
 * [LightEval](https://github.com/huggingface/lighteval) ![](https://img.shields.io/github/stars/huggingface/lighteval.svg?style=social) - LightEval is a lightweight LLM evaluation suite.
 * [LLMonitor](https://github.com/lunary-ai/lunary) ![](https://img.shields.io/github/stars/lunary-ai/lunary.svg?style=social) - LLMonitor is an observability & analytics for AI apps and agents.
-* [LLMPerf](https://github.com/ray-project/llmperf) ![](https://img.shields.io/github/stars/ray-project/llmperf.svg?style=social) - LLMPerf is a tool for evaulation the performance of LLM APIs.
-* [LLM AutoEval](https://github.com/mlabonne/llm-autoeval) ![](https://img.shields.io/github/stars/mlabonne/llm-autoeval.svg?style=social) - LLM AutoEval simplifies the process of evaluating LLMs using a convenient Colab notebook. You just need to specify the name of your model, a benchmark, a GPU, and press run!
+* [LLMPerf](https://github.com/ray-project/llmperf) ![](https://img.shields.io/github/stars/ray-project/llmperf.svg?style=social) - LLMPerf is a tool for evaluating the performance of LLM APIs.
+* [LLM AutoEval](https://github.com/mlabonne/llm-autoeval) ![](https://img.shields.io/github/stars/mlabonne/llm-autoeval.svg?style=social) - LLM AutoEval simplifies the process of evaluating LLMs using a convenient Colab notebook.
 * [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) ![](https://img.shields.io/github/stars/EvolvingLMMs-Lab/lmms-eval.svg?style=social) - lmms-eval is an evaluation framework meticulously crafted for consistent and efficient evaluation of LMM.
 * [MLPerf Inference](https://github.com/mlcommons/inference) ![](https://img.shields.io/github/stars/mlcommons/inference.svg?style=social) - MLPerf Inference is a benchmark suite for measuring how fast systems can run models in a variety of deployment scenarios.
 * [mltrace](https://github.com/loglabs/mltrace) ![](https://img.shields.io/github/stars/loglabs/mltrace.svg?style=social) - mltrace is a lightweight, open-source Python tool to get "bolt-on" observability in ML pipelines.
@@ -360,7 +361,7 @@ Please review our [CONTRIBUTING.md](https://github.com/EthicalML/awesome-product
 * [Tonic Validate](https://github.com/TonicAI/tonic_validate) ![](https://img.shields.io/github/stars/TonicAI/tonic_validate.svg?style=social) - Tonic Validate is a high-performance evaluation framework for LLM/RAG outputs.
 * [TruLens](https://github.com/truera/trulens) ![](https://img.shields.io/github/stars/truera/trulens.svg?style=social) - TruLens provides a set of tools for evaluating and tracking LLM experiments.
 * [TrustLLM](https://github.com/HowieHwong/TrustLLM) ![](https://img.shields.io/github/stars/HowieHwong/TrustLLM.svg?style=social) - TrustLLM is a comprehensive framework to evaluate the trustworthiness of large language models, which includes principles, surveys, and benchmarks.
-* [UpTrain](https://github.com/uptrain-ai/uptrain) ![](https://img.shields.io/github/stars/uptrain-ai/uptrain.svg?style=social) - UpTrain is an open-source tool to evaluate LLM applications.
+* [UpTrain](https://github.com/uptrain-ai/uptrain) ![](https://img.shields.io/github/stars/uptrain-ai/uptrain.svg?style=social) - UpTrain is an open-source tool for evaluating LLM applications.
 * [VBench](https://github.com/Vchitect/VBench) ![](https://img.shields.io/github/stars/Vchitect/VBench.svg?style=social) - VBench is a comprehensive benchmark suite for video generative models.
 * [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) ![](https://img.shields.io/github/stars/open-compass/VLMEvalKit.svg?style=social) - VLMEvalKit is an open-source evaluation toolkit of large vision-language models (LVLMs).
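
For context on the entry this commit adds: LangTest evaluations are driven through a `Harness` object. The sketch below follows the quickstart pattern from the LangTest README; the task, model name, and hub values are illustrative assumptions and are not part of this commit.

```python
# Minimal LangTest sketch (assumption: mirrors the Harness quickstart in the
# LangTest README; the task/model/hub chosen here are illustrative only).
from langtest import Harness

# Build a test harness around a Hugging Face NER model (hypothetical choice).
harness = Harness(
    task="ner",
    model={"model": "dslim/bert-base-NER", "hub": "huggingface"},
)

# Generate test cases, run them against the model, and print a pass/fail report.
harness.generate().run().report()
```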
