Prigoistic
Contributor

Issue Link / Problem Description

  • Fixes #2330
  • Evaluating a LlamaIndex query engine raised a runtime NameError (name 'EvaluationResult' is not defined) because EvaluationResult was imported only under t.TYPE_CHECKING. Intermittent LlamaIndex execution failures also caused an IndexError during result collection, because the collected results and the dataset ended up with mismatched lengths.
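
For context, the NameError can be reproduced without ragas. The sketch below uses decimal.Decimal as a stand-in for EvaluationResult (an assumption made only so the snippet is self-contained); it is not the actual code in src/ragas/integrations/llama_index.py.

```python
import typing as t

if t.TYPE_CHECKING:
    # Only static type checkers follow this import; at runtime
    # t.TYPE_CHECKING is False, so the name is never bound.
    from decimal import Decimal  # stand-in for EvaluationResult


def build_result() -> "Decimal":
    # The string annotation is harmless, but constructing the object
    # needs the real name at runtime, and it was never imported.
    return Decimal("1.5")


try:
    build_result()
    error_message = ""
except NameError as exc:
    error_message = str(exc)

print(error_message)
```

Moving the import out of the `if t.TYPE_CHECKING:` block, which is what this PR does for EvaluationResult, binds the name at runtime and removes the error.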

Changes Made

  • Import EvaluationResult at runtime from ragas.dataset_schema in src/ragas/integrations/llama_index.py.
  • Make response/context collection robust:
    • Handle failed executor jobs (recorded as NaN placeholders) by inserting an empty response and an empty context list, so the collected results stay aligned with the dataset size.
    • Prevent the IndexError during dataset augmentation.
  • Add light defensive checks so evaluation stays stable even when some query-engine calls fail.
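
A rough sketch of the collection logic described above, assuming (as the PR description does) that a failed executor job comes back as a float NaN. FakeNode, FakeSourceNode, FakeResponse, and collect_outputs are hypothetical stand-ins so the snippet runs without LlamaIndex; only the `response` and `source_nodes[i].node.text` attribute shapes are taken from the diff.

```python
import math
import typing as t
from dataclasses import dataclass, field


@dataclass
class FakeNode:
    text: str


@dataclass
class FakeSourceNode:
    node: FakeNode


@dataclass
class FakeResponse:
    response: t.Optional[str]
    source_nodes: t.List[FakeSourceNode] = field(default_factory=list)


def collect_outputs(
    results: t.Sequence[t.Union[FakeResponse, float]],
) -> t.Tuple[t.List[str], t.List[t.List[str]]]:
    responses: t.List[str] = []
    retrieved_contexts: t.List[t.List[str]] = []
    for r in results:
        if isinstance(r, float) and math.isnan(r):
            # Failed job: add placeholders so both lists keep one
            # entry per dataset row.
            responses.append("")
            retrieved_contexts.append([])
        else:
            responses.append(r.response or "")
            retrieved_contexts.append([n.node.text for n in r.source_nodes])
    return responses, retrieved_contexts


results = [
    FakeResponse("Paris", [FakeSourceNode(FakeNode("The capital of France is Paris."))]),
    float("nan"),  # simulated failed query-engine call
]
responses, retrieved_contexts = collect_outputs(results)
```

Because every result contributes exactly one entry to each list, indexing the lists by dataset row can no longer run past the end.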

Testing

  • Automated tests added/updated

How to Test

  • Manual testing steps:
  1. Install for local dev: uv run pip install -e . -e ./examples
  2. Follow the LlamaIndex integration guide (docs) to set up a query_engine and an EvaluationDataset.
  3. Ensure LlamaIndex LLM is configured with n=1 (or unset) to avoid “n values greater than 1 not support” warnings.
  4. Run an evaluation that previously failed; it should complete without the NameError and without IndexError during result collection.
  5. Optional: run the linter with uv run ruff check .

References

  • Related issues: #2330
  • Documentation: LlamaIndex integration how-to (link)

Screenshots/Examples (if applicable)

  • N/A

@dosubot dosubot bot added the size:S label (This PR changes 10-29 lines, ignoring generated files.) Sep 30, 2025
@jjmachan
Member

jjmachan commented Oct 3, 2025

hey @Prigoistic I've fixed the CI - could you take a look and see if everything looks good?
we'll merge it in after that 🙂

retrieved_contexts.append([n.node.text for n in r.source_nodes])
# Handle failed jobs which are recorded as NaN in the executor
if isinstance(r, float) and math.isnan(r):
    responses.append("")
Contributor

I think it's better to fail loudly than silently.

If we still need to pass through, better to keep None. The later metrics can skip None or handle them explicitly.

responses.append(None)
retrieved_contexts.append(None)
logger.warning(f"Query engine failed for query {i}: '{queries[i]}'")
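
One way the suggestion above could look end to end, as a sketch only: failures are kept as None, logged, and a downstream step skips them explicitly. collect_with_none and answered_rows are hypothetical names, and results are simplified to plain strings rather than LlamaIndex response objects.

```python
import logging
import math
import typing as t

logger = logging.getLogger(__name__)


def collect_with_none(
    queries: t.Sequence[str],
    results: t.Sequence[t.Union[str, float]],
) -> t.List[t.Optional[str]]:
    responses: t.List[t.Optional[str]] = []
    for i, r in enumerate(results):
        if isinstance(r, float) and math.isnan(r):
            # Keep the slot so rows stay aligned, but make the failure visible.
            logger.warning("Query engine failed for query %d: %r", i, queries[i])
            responses.append(None)
        else:
            responses.append(t.cast(str, r))
    return responses


def answered_rows(
    queries: t.Sequence[str],
    responses: t.Sequence[t.Optional[str]],
) -> t.List[t.Tuple[str, str]]:
    # Downstream code decides what to do with failures; here they are skipped.
    return [(q, a) for q, a in zip(queries, responses) if a is not None]


queries = ["q0", "q1"]
responses = collect_with_none(queries, ["a0", float("nan")])
rows = answered_rows(queries, responses)
```

Unlike an empty string, None cannot be mistaken for a genuinely empty answer, at the cost of every consumer having to handle it.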

    retrieved_contexts.append([])
else:
    # Cast to LlamaIndex Response type for proper type checking
    response = t.cast("LlamaIndexResponse", r)
Contributor

This'll be hard on type hints.

Probably better to import the class and cast to it directly: from llama_index.core.base.response.schema import Response as LlamaIndexResponse
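
A small illustration of what the cast does and does not do: typing.cast returns its argument unchanged and never validates it, so the string and class forms are identical at runtime, and the difference only matters to a static type checker, which has to be able to resolve the name. Fraction below is just a stand-in for LlamaIndexResponse so the snippet runs without LlamaIndex installed.

```python
import typing as t
from fractions import Fraction  # stand-in for LlamaIndexResponse

raw: object = float("nan")  # e.g. the placeholder left by a failed job

via_string = t.cast("Fraction", raw)  # string form, as in the diff
via_class = t.cast(Fraction, raw)     # class form, as suggested above

# Both calls hand back the very same object; nothing is converted or checked.
same_object = via_string is raw and via_class is raw
is_fraction = isinstance(via_class, Fraction)
```

Here same_object is True and is_fraction is False, which is why the NaN check has to come before the cast rather than rely on it.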

else:
    # Cast to LlamaIndex Response type for proper type checking
    response = t.cast("LlamaIndexResponse", r)
    responses.append(response.response or "")
Contributor

@anistark anistark Oct 3, 2025

Make this more explicit?

responses.append(response.response if response.response is not None else "")
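
For what it's worth, the two spellings agree for str-or-None values and only diverge for other falsy values. A quick sketch (the helper names are hypothetical):

```python
import typing as t


def coalesce_or(value: t.Any) -> t.Any:
    # Replaces every falsy value (None, "", 0, [], ...) with "".
    return value or ""


def coalesce_explicit(value: t.Any) -> t.Any:
    # Replaces only None with "".
    return value if value is not None else ""


string_cases = [None, "", "answer"]
same_for_strings = all(
    coalesce_or(v) == coalesce_explicit(v) for v in string_cases
)
differs_for_zero = coalesce_or(0) != coalesce_explicit(0)
```

So for a field that is only ever a string or None the change is about readability, not behavior.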

@Prigoistic
Contributor Author

@jjmachan yes everything looks good to me

@Prigoistic
Contributor Author

I see no conflicts so far and all the checks have passed too, so you can go ahead and merge this :)

@anistark
Contributor

@Prigoistic I don't see any changes addressing the review comments.

@Prigoistic
Contributor Author

@anistark oh shoot i forgot to push the changes gimme a sec

@dosubot dosubot bot added the size:M label (This PR changes 30-99 lines, ignoring generated files.) and removed the size:S label (This PR changes 10-29 lines, ignoring generated files.) Oct 11, 2025
@Prigoistic
Contributor Author

@anistark pushed the changes as per the comments :) please check it once

@dosubot dosubot bot added the size:S label (This PR changes 10-29 lines, ignoring generated files.) and removed the size:M label (This PR changes 30-99 lines, ignoring generated files.) Oct 11, 2025

Labels

size:S This PR changes 10-29 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

NameError when evaluating the llamaindex query engine

3 participants