Conversation

@TigranTigranTigran (Collaborator)

Summary 📝

Tool for evaluating LLM output or search result quality (continuing from PR #44)

Details

  1. Changed the preset evaluators so that each Deepeval DAG has multiple root nodes (TaskNodes) whose outputs are aggregated by a downstream node (NonBinaryJudgementNode)
  2. Created a custom Deepeval BinaryJudgementNode that uses Granite Guardian instead of a judge LLM (the judge criteria are repurposed as a risk definition)
  3. Added checks for compatibility of evaluation inputs and metrics
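Schematically, the DAG shape described in item 1 looks like the sketch below. This is plain Python for illustration only, not the actual Deepeval API: the class names mirror Deepeval's terminology, but the fields and `run` methods are hypothetical stand-ins for the real judge calls.

```python
# Illustrative sketch of "multiple root TaskNodes aggregated by a
# downstream NonBinaryJudgementNode" -- NOT the real Deepeval API.
from dataclasses import dataclass, field


@dataclass
class TaskNode:
    """A root node that evaluates one aspect of the output."""
    name: str

    def run(self, text: str) -> str:
        # Placeholder: a real node would call the judge model here.
        return f"{self.name}:{len(text)}"


@dataclass
class NonBinaryJudgementNode:
    """Downstream node that aggregates the outputs of all root nodes."""
    roots: list = field(default_factory=list)

    def run(self, text: str) -> dict:
        # Fan in: collect each root node's verdict into one result.
        return {root.name: root.run(text) for root in self.roots}


dag = NonBinaryJudgementNode(roots=[TaskNode("structure"), TaskNode("clarity")])
print(dag.run("some model output"))
```

The point of the shape is that each root node judges one criterion independently, and only the aggregator produces the final (non-binary) verdict.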

Usage

The LLMEvaluator uses a locally running model (IBM granite3-dense:8b) but can easily be pointed at any model served locally by Ollama using the following config:
file: .deepeval/.deepeval

{
    "USE_AZURE_OPENAI": "NO",
    "LOCAL_MODEL_NAME": "granite3-dense:8b",
    "LOCAL_MODEL_BASE_URL": "http://localhost:11434",
    "LOCAL_MODEL_API_KEY": "ollama",
    "LOCAL_MODEL_FORMAT": "json",
    "USE_LOCAL_MODEL": "YES"
}
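If you prefer, the same config file can be generated from Python. This is a minimal sketch: the path and keys simply mirror the file shown above.

```python
import json
from pathlib import Path

# Mirrors the .deepeval/.deepeval config file shown above.
config = {
    "USE_AZURE_OPENAI": "NO",
    "LOCAL_MODEL_NAME": "granite3-dense:8b",
    "LOCAL_MODEL_BASE_URL": "http://localhost:11434",
    "LOCAL_MODEL_API_KEY": "ollama",
    "LOCAL_MODEL_FORMAT": "json",
    "USE_LOCAL_MODEL": "YES",
}

config_path = Path(".deepeval") / ".deepeval"
config_path.parent.mkdir(exist_ok=True)   # create the .deepeval dir if needed
config_path.write_text(json.dumps(config, indent=4))
```

Note that http://localhost:11434 is Ollama's default port; change LOCAL_MODEL_NAME to whichever model you have pulled locally.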

The following code demonstrates its usage with a search result and a direct input:

import asyncio
from akd.tools.evaluator.base_evaluator import (
    EvalMetricDefinition,
    LLMEvaluator,
    LLMEvaluatorConfig,
    SearchResultItem,
)

# Create an evaluator with the default configuration
evaluator = LLMEvaluator(config=LLMEvaluatorConfig())

async def test_with_search_result():
    # Define a search result
    search_result = SearchResultItem(
        query="What is long COVID?",
        content="Long COVID refers to symptoms lasting more than four weeks after infection...",
        title="Understanding Long COVID",
        url="https://example.com/long-covid"
    )

    # Run the evaluator on the search result
    result = await evaluator.arun({
        "search_results": [search_result],
        "metrics": [EvalMetricDefinition.STRUCTURE]
    })

    print(result)


async def test_with_direct_inputs():
    # Define direct inputs
    input_data = {
        "input": "Explain quantum computing in simple terms.",
        "output": "Quantum computing uses quantum bits to perform complex calculations faster than classical computers.",
        "retrieval_context": ["A basic overview of quantum computing concepts."]
    }

    # Run the evaluator on direct inputs with multiple metrics
    result = await evaluator.arun({
        **input_data,
        "metrics": [
            EvalMetricDefinition.ACCURACY,
            EvalMetricDefinition.COMPLETENESS,
            EvalMetricDefinition.FAITHFULNESS
        ]
    })

    print(result)

if __name__ == "__main__":
    asyncio.run(test_with_search_result())
    asyncio.run(test_with_direct_inputs())

Checks

  • Closed #798
  • Tested Changes
  • Stakeholder Approval

@@ -0,0 +1,2 @@
DEEPEVAL_ID=c2b8806f-4b6c-46bf-80f4-a300c8ed69e3
Collaborator

Do we need this DeepEval artifact in the git tree?

Collaborator Author

@NISH1001 I'm not sure, but I guess probably not.

@leothomas what do you think?

Collaborator Author

@NISH1001 I've removed .deepeval_telemetry.txt since it's automatically created every time DeepEval is run.

@NISH1001 (Collaborator), Sep 16, 2025

@TigranTigranTigran maybe we can just add this to .gitignore? I also see a .deepeval dir, which might also need to be ignored. Can we remove .deepeval from the git tree and add it to .gitignore? Let's do that, and then we can move forward with this PR.
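For reference, the ignore rules being suggested could look like this (assuming the artifacts sit at the repo root):

```gitignore
# DeepEval artifacts (regenerated on every run)
.deepeval/
.deepeval_telemetry.txt
```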
