-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-7539: [Java] FieldVector getFieldBuffers API should not set reader/writer indices #6156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@jacques-n @emkornfield Please help take a look at this one, after then we could rebase and continue to do #6133, thanks! |
|
@TheNeuralBit do you have time to do a first pass on this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this if check came from an earlier refactor that happened in Arrow and several tests had failed in Dremio. I think we should keep it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the old implementation, VectorUnloader.java checks the typeBufferCount with getFieldBuffers, it's ok since getFieldBuffers returns all buffers without check buffer size. If we keep this if, then the expected buffer count check in VectorUnloader is invalid, should we remove that check directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the old implementation, VectorUnloader.java checks the typeBufferCount with getFieldBuffers, it's ok since getFieldBuffers returns all buffers without check buffer size. If we keep this if, then the expected buffer count check in VectorUnloader is invalid, should we remove that check directly?
I am okay with this change but it would be great if someone from Dremio can bless this (based on my previous comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we keep this 'if' here, then I guess we should make some change in VectorUnloader.
If vector.getBufferSize ==0, then no buffers would be sent via IPC, however, the VectorLoader depends on TypeBufferCount to decide how many buffers to load into a vector, and in this case, something is wrong.
To solve this, we may add a check in VectorUnloader, if vector bufferSize==0, we should append it's field buffers(by call getFieldBuffers) also, even their writerIndex/readerIndex=0. In this way, we could keep the 'if (getBufferSize() == 0)' in vectors and the IPC also works well.
Do you think we should keep this PR or update it as I suggested above?
…der/writer indices
e2b077f to
f1b1dae
Compare
|
I'm asking someone to review this from Dremio. |
|
@tianchen92 Tried running this against Dremio and ran into tons of failures, can you please hold while i figure which part is broken.. |
ok |
thanks @tianchen92 |
|
@praveenbingo did you have a chance to investigate? |
|
ping |
|
@jacques-n @rymurr do you know the progress of this internal to Dremio? It has been blocked a while on feedback, if we don't here back by Monday, I think we should rebase and merge. |
|
Let me look at this change and the impact on dremio. I will update it by tomorrow |
| } | ||
| List<ArrowBuf> list = new ArrayList<>(); | ||
| list.add(validityBuffer); | ||
| list.add(offsetBuffer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this change, the order of validity and offset buffers is changed in the getBuffers method for ListVector which creates problems with serialization/deserialization resulting in test failures in Dremio. This would break backward compatibility with existing serialised files.
Also keeping the existing order in #getBuffers will also break tests since this PR replaces #getFieldBuffers with #getBuffers in VectorUnloader. #getFieldBuffers and #getBuffers currently have different order of buffers while #getFieldBuffers and #loadFieldBuffers same order, and the order should be same for VectorLoader and VectorUnloader.
cc @pravindra @tianchen92
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@projjal Thanks for your testing result!
Seems like a legacy problem here, and I think validityBuffer->offsetBuffer should be the right order considering other vectors.
@jacques-n @pravindra any thoughts on how to resolve the conflicts? :)
|
Does this impact IPC? |
seems not, IPC used getFieldBuffers which has the right buffer order, this PR is going to replace getFieldBuffers with getBuffers (getBuffers has wrong buffer order witch will break Dremio tests) |
|
Can we leave the old method in place and mark it as deprecated and remove in a later release? |
I am afraid it's not reasonable. since we need the right order in IPC and Dremio need the old wrong order in getBuffers to avoid test failure, unless we correct the behavior in getBuffers and rename the old method like getBuffers2 which seems ugly . @projjal Any thought? :) |
|
Given how long this PR has been open and approved, I think we should aim to check it in next Tuesday, unless we can come up with a concrete plan by then to help mitigate impact to dremio. CC @jacques-n |
|
This doesn't just break Dremio tests, it breaks Dremio functionally. A little history lesson: ValueVector.getBuffers() has existed for a much longer time than FieldVector.getFieldBuffers(). It behaves differently, uses less memory and is on a higher level abstraction. I don't think it makes sense to remove this without more thought around all those issues and I definitely think it should be done with step 1 being deprecation and step 2 being removal. Let's constrain this patch to what it was originally intended to solve and not try to consolidate the two different buffer retrieval interfaces. There are definitely scenarios where we want to view the buffers unchanged (getFieldBuffers) and where we want to export them for writing (getBuffers). We should figure out what the consolidated interface is before removing something useful. |
|
@tianchen92 does @jacques-n proposal make sense? |
Hmm, does this mean
|
|
@tianchen92 rereading, after rereading all the comments. I think we should
@jacques-n or someone else from dremio can maybe provide additional insight into how getBuffers is used and whether we really need to keep it |
I agree with item 1 on your list. I think 2-4 need more conversation about what we want to expose. I'd definitely avoid introducing a new method (3 on your list) until we figure out what sets of functionality we want. getBuffers() may actually be exactly what we want (with a changed order as necessary). I think it would be valuable to rationalize getFieldBuffers() in this context. Remember that FieldVector and getFieldBuffers() were introduced when we had separate non-nullable and nullable vectors but wanted to treat the non-nullable ones as internal (and thus they didn't expose the FieldVector interface). It seems like we have several operational needs: getFieldBuffers: get the list of buffers cheaply with no modifications to do some low level buffer operations (e.g. hand written addition logic) |
@jacques-n
If this is the case I think introducing a new method and moving away from getBuffers is the least bad option. Silently breaking the contract of getBuffers doesn't seem to be good idea (as witnessed by the length of time this PR has dragged out). IIRC correctly I think the jpython code had to work around getBuffers being inconsistent as well, so a discussion would be good. @jacques-n since you have the most context and historical knowledge would you mind starting a thread on dev@?
I think this might be have been during a portion of time that I stepped away from the project. |
@emkornfield sorry for the delay, your understanding is right :) |
|
@tianchen92 would you mind starting a thread on the ML, it seems that @jacques-n might not have bandwidth. |
ok, started already. |
d4608a9 to
356c300
Compare
|
@liyafan82 @emkornfield What is the status of this PR? |
@pitrou I looked through the comments, and it seems it has been a long time since this issue was last updated. |
|
I think we can consider it abandoned. |
Related to ARROW-7539.
Per discussion #6133 (comment).
The fact that we have reader/writer settings in getFieldBuffers is wrong. To clarify, getFieldBuffers is distinct from getBuffers. The former should be for getting access to underlying data for higher-performance algorithms. The latter is for sending the data over the wire. Seems we've mixed up use of both.
Currently in VectorUnloader, we used getFieldBuffers to create ArrowRecordBatch that’s why we keep writer/reader indices in getFieldBuffers, we should use getBuffers instead.