-
Notifications
You must be signed in to change notification settings - Fork 665
PERF-#7657: Fork pandas eval and query implementation to improve performance. #7658
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF-#7657: Fork pandas eval and query implementation to improve performance. #7658
Conversation
Signed-off-by: sfc-gh-mvashishtha <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
Signed-off-by: sfc-gh-mvashishtha <[email protected]>
Signed-off-by: sfc-gh-mvashishtha <[email protected]>
Signed-off-by: sfc-gh-mvashishtha <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question regarding the license for forked code. Also, how much of the pandas code did you need to change besides pointing import paths to modin equivalents of pandas modules? If it was very little then we may want to make clearer from folder naming that this is essentially vendored pandas code.
Signed-off-by: sfc-gh-mvashishtha <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, how much of the pandas code did you need to change besides pointing import paths to modin equivalents of pandas modules? If it was very little then we may want to make clearer from folder naming that this is essentially vendored pandas code.
@sfc-gh-joshi I did make very few changes, but I don't think it's that important to point out with the directory structure that some code has been vendored from pandas. If we were to vendor an entire package I think it would make sense to put it in a new vendored
directory. We are also putting some modified code in dataframe.py and other modified code in modin/core/computation/.
In that case, as long as all relevant files have something we can grep for if we want to pull in upstream changes it should be fine. |
@devin-petersohn could you PTAL at the licensing changes? Thanks! |
Co-authored-by: Devin Petersohn <[email protected]>
Currently we use the pandas eval() and query() implementations almost entirely as is. That's not good practice in general, and #7657 shows a performance issue that applies to Modin but not pandas in the current implementation.
In this commit, fork the query() and eval() implementation and eliminate the
.values
call that causes numpy materialization.The code here is mostly copied from pandas/pandas/core/computation, except:
.values
call that causes the performance issue in PERF: Fork pandas eval() and query() implementation to reduce to_numpy() calls #7657.Resolves #7657