Conversation

@TigranTigranTigran (Collaborator)

Summary 📝

Tool for evaluating LLM output or search result quality (continuing from PR #44)

Details

  1. Changed the preset evaluators so that each Deepeval DAG has multiple root nodes (TaskNodes) whose outputs are aggregated by a downstream node (NonBinaryJudgementNode)
  2. Created a custom Deepeval BinaryJudgementNode that uses Granite Guardian instead of a judge LLM (the judge criteria are repurposed as a risk definition)
  3. Added checks for compatibility of evaluation inputs and metrics
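Schematically, the DAG shape described in item 1 looks like the sketch below. This is plain Python for illustration only, not the actual Deepeval API: the class names mirror Deepeval's terminology, but the fields and `run` methods are hypothetical stand-ins for the real judge calls.

```python
# Illustrative sketch of "multiple root TaskNodes aggregated by a
# downstream NonBinaryJudgementNode" -- NOT the real Deepeval API.
from dataclasses import dataclass, field


@dataclass
class TaskNode:
    """A root node that evaluates one aspect of the output."""
    name: str

    def run(self, text: str) -> str:
        # Placeholder: a real node would call the judge model here.
        return f"{self.name}:{len(text)}"


@dataclass
class NonBinaryJudgementNode:
    """Downstream node that aggregates the outputs of all root nodes."""
    roots: list = field(default_factory=list)

    def run(self, text: str) -> dict:
        # Fan in: collect each root node's verdict into one result.
        return {root.name: root.run(text) for root in self.roots}


dag = NonBinaryJudgementNode(roots=[TaskNode("structure"), TaskNode("clarity")])
print(dag.run("some model output"))
```

The point of the shape is that each root node judges one criterion independently, and only the aggregator produces the final (non-binary) verdict.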

Usage

The LLMEvaluator uses a locally running model (IBM granite3-dense:8b) but can easily be pointed at any model served locally by Ollama using the following config:
file: .deepeval/.deepeval

{
    "USE_AZURE_OPENAI": "NO",
    "LOCAL_MODEL_NAME": "granite3-dense:8b",
    "LOCAL_MODEL_BASE_URL": "http://localhost:11434",
    "LOCAL_MODEL_API_KEY": "ollama",
    "LOCAL_MODEL_FORMAT": "json",
    "USE_LOCAL_MODEL": "YES"
}
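If you prefer, the same config file can be generated from Python. This is a minimal sketch: the path and keys simply mirror the file shown above.

```python
import json
from pathlib import Path

# Mirrors the .deepeval/.deepeval config file shown above.
config = {
    "USE_AZURE_OPENAI": "NO",
    "LOCAL_MODEL_NAME": "granite3-dense:8b",
    "LOCAL_MODEL_BASE_URL": "http://localhost:11434",
    "LOCAL_MODEL_API_KEY": "ollama",
    "LOCAL_MODEL_FORMAT": "json",
    "USE_LOCAL_MODEL": "YES",
}

config_path = Path(".deepeval") / ".deepeval"
config_path.parent.mkdir(exist_ok=True)   # create the .deepeval dir if needed
config_path.write_text(json.dumps(config, indent=4))
```

Note that http://localhost:11434 is Ollama's default port; change LOCAL_MODEL_NAME to whichever model you have pulled locally.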

The following code demonstrates its usage with a search result and a direct input:

import asyncio
from akd.tools.evaluator.base_evaluator import (
    EvalMetricDefinition,
    LLMEvaluator,
    LLMEvaluatorConfig,
    SearchResultItem,
)

# Create an evaluator with the default configuration
evaluator = LLMEvaluator(config=LLMEvaluatorConfig())

async def test_with_search_result():
    # Define a search result
    search_result = SearchResultItem(
        query="What is long COVID?",
        content="Long COVID refers to symptoms lasting more than four weeks after infection...",
        title="Understanding Long COVID",
        url="https://example.com/long-covid"
    )

    # Run the evaluator on the search result
    result = await evaluator.arun({
        "search_results": [search_result],
        "metrics": [EvalMetricDefinition.STRUCTURE]
    })

    print(result)


async def test_with_direct_inputs():
    # Define direct inputs
    input_data = {
        "input": "Explain quantum computing in simple terms.",
        "output": "Quantum computing uses quantum bits to perform complex calculations faster than classical computers.",
        "retrieval_context": ["A basic overview of quantum computing concepts."]
    }

    # Run the evaluator on direct inputs with multiple metrics
    result = await evaluator.arun({
        **input_data,
        "metrics": [
            EvalMetricDefinition.ACCURACY,
            EvalMetricDefinition.COMPLETENESS,
            EvalMetricDefinition.FAITHFULNESS
        ]
    })

    print(result)

if __name__ == "__main__":
    asyncio.run(test_with_search_result())
    asyncio.run(test_with_direct_inputs())

Checks

  • Closed #798
  • Tested Changes
  • Stakeholder Approval

@@ -0,0 +1,2 @@
DEEPEVAL_ID=c2b8806f-4b6c-46bf-80f4-a300c8ed69e3
Collaborator

Do we need this DeepEval artifact in the git tree?

Collaborator Author

@NISH1001 I'm not sure, but I guess probably not.

@leothomas what do you think?

Collaborator Author

@NISH1001 I've removed .deepeval_telemetry.txt since it's automatically created every time DeepEval is run.

@NISH1001 (Collaborator), Sep 16, 2025

@TigranTigranTigran maybe we can just add this to .gitignore? I also see a .deepeval dir, which might also need to be ignored. Can we remove .deepeval from the git tree and add it to .gitignore? Let's do that, and then we can move forward with this PR.
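For reference, the ignore rules being suggested could look like this (assuming the artifacts sit at the repo root):

```gitignore
# DeepEval artifacts (regenerated on every run)
.deepeval/
.deepeval_telemetry.txt
```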
