sfc-gh-mvashishtha
Contributor

@sfc-gh-mvashishtha sfc-gh-mvashishtha commented Sep 3, 2025

Currently we use the pandas eval() and query() implementations almost entirely as is. Using them unchanged is not good practice in general, and #7657 shows a performance issue in the current implementation that affects Modin but not pandas.

In this commit, fork the query() and eval() implementations and eliminate the .values call that causes NumPy materialization.
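To illustrate why removing a `.values` call matters, here is a minimal, library-free sketch. `LazyColumn`, `eval_with_values`, and `eval_native` are hypothetical names invented for this illustration; they are not Modin or pandas APIs, and the exact call site changed in the forked code is not shown here. The sketch only assumes that unwrapping an operand through `.values` forces the data into local memory, while applying the operator to the column object itself does not.

```python
import operator


class LazyColumn:
    """Hypothetical stand-in for a distributed column; not a Modin class."""

    materializations = 0  # how many times data was pulled into local memory

    def __init__(self, data):
        self._data = list(data)

    @property
    def values(self):
        # Simulates the expensive step: copy everything into a local container.
        LazyColumn.materializations += 1
        return list(self._data)

    def _binary(self, op, other):
        # Apply op element-wise and stay inside the LazyColumn abstraction.
        if isinstance(other, LazyColumn):
            other_data = other._data
        else:
            other_data = [other] * len(self._data)
        return LazyColumn(op(a, b) for a, b in zip(self._data, other_data))

    def __add__(self, other):
        return self._binary(operator.add, other)

    def __gt__(self, other):
        return self._binary(operator.gt, other)


def eval_with_values(op, left, right):
    # Assumed "before" pattern: unwrap both operands via .values first.
    return [op(a, b) for a, b in zip(left.values, right.values)]


def eval_native(op, left, right):
    # Assumed "after" pattern: let the column objects handle the operator.
    return op(left, right)


a = LazyColumn([1, 2, 3])
b = LazyColumn([10, 20, 30])

eager = eval_with_values(operator.add, a, b)
count_after_eager = LazyColumn.materializations  # both operands materialized

lazy = eval_native(operator.add, a, b)
count_after_lazy = LazyColumn.materializations  # unchanged by the native path
```

In this toy model the native path returns another `LazyColumn` and leaves the materialization counter untouched, which is the behavior the fork is aiming for in spirit.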

The code here is mostly copied from pandas/pandas/core/computation, except:

Resolves #7657

Contributor

@github-advanced-security github-advanced-security bot left a comment

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

Signed-off-by: sfc-gh-mvashishtha <[email protected]>
Signed-off-by: sfc-gh-mvashishtha <[email protected]>
@sfc-gh-mvashishtha sfc-gh-mvashishtha changed the title from "PERF-#7657: Fork pandas eval() implementation." to "PERF-#7657: Fork pandas eval and query implementation to improve performance." on Sep 3, 2025
Signed-off-by: sfc-gh-mvashishtha <[email protected]>
Contributor

@sfc-gh-joshi sfc-gh-joshi left a comment

One question regarding the license for forked code. Also, how much of the pandas code did you need to change besides pointing import paths to modin equivalents of pandas modules? If it was very little then we may want to make clearer from folder naming that this is essentially vendored pandas code.

Contributor Author

@sfc-gh-mvashishtha sfc-gh-mvashishtha left a comment

> Also, how much of the pandas code did you need to change besides pointing import paths to modin equivalents of pandas modules? If it was very little then we may want to make clearer from folder naming that this is essentially vendored pandas code.

@sfc-gh-joshi I did make very few changes, but I don't think it's that important to point out with the directory structure that some code has been vendored from pandas. If we were to vendor an entire package I think it would make sense to put it in a new vendored directory. We are also putting some modified code in dataframe.py and other modified code in modin/core/computation/.

@sfc-gh-joshi
Contributor

> I did make very few changes, but I don't think it's that important to point out with the directory structure that some code has been vendored from pandas.

In that case, it should be fine as long as all relevant files contain something we can grep for when we want to pull in upstream changes.
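As a rough sketch of what such a lookup could look like, the snippet below scans a source tree for a marker string. The marker text `"Forked from pandas"` and the helper name `find_forked_files` are assumptions made for illustration; the actual string carried by the forked files is not stated in this thread.

```python
from pathlib import Path

# Hypothetical marker text; the real files may carry a different string.
MARKER = "Forked from pandas"


def find_forked_files(root, marker=MARKER):
    """Return sorted paths of .py files under root whose text contains marker."""
    matches = []
    for path in Path(root).rglob("*.py"):
        try:
            text = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError):
            continue  # skip unreadable files rather than abort the scan
        if marker in text:
            matches.append(path)
    return sorted(matches)
```

Calling `find_forked_files("modin/core/computation")` from the repository root would then list the files to compare against upstream, assuming they carry that marker.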

@sfc-gh-mvashishtha
Contributor Author

@devin-petersohn could you PTAL at the licensing changes? Thanks!

@sfc-gh-mvashishtha sfc-gh-mvashishtha merged commit 5ed69b5 into modin-project:main Sep 8, 2025
40 checks passed

Development

Successfully merging this pull request may close these issues.

PERF: Fork pandas eval() and query() implementation to reduce to_numpy() calls

3 participants