Skip to content

Conversation

@tianchen92
Copy link
Contributor

Related to ARROW-7539.

Per discussion #6133 (comment).

The fact that we have reader/writer settings in getFieldBuffers is wrong. To clarify, getFieldBuffers is distinct from getBuffers. The former should be for getting access to underlying data for higher-performance algorithms. The latter is for sending the data over the wire. Seems we've mixed up use of both.
Currently in VectorUnloader, we used getFieldBuffers to create ArrowRecordBatch that’s why we keep writer/reader indices in getFieldBuffers, we should use getBuffers instead.

@tianchen92
Copy link
Contributor Author

tianchen92 commented Jan 10, 2020

@jacques-n @emkornfield Please help take a look at this one, after then we could rebase and continue to do #6133, thanks!

@github-actions
Copy link

@emkornfield
Copy link
Contributor

@TheNeuralBit do you have time to do a first pass on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this if check came from an earlier refactor that happened in Arrow and several tests had failed in Dremio. I think we should keep it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the old implementation, VectorUnloader.java checks the typeBufferCount with getFieldBuffers, it's ok since getFieldBuffers returns all buffers without check buffer size. If we keep this if, then the expected buffer count check in VectorUnloader is invalid, should we remove that check directly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the old implementation, VectorUnloader.java checks the typeBufferCount with getFieldBuffers, it's ok since getFieldBuffers returns all buffers without check buffer size. If we keep this if, then the expected buffer count check in VectorUnloader is invalid, should we remove that check directly?

I am okay with this change but it would be great if someone from Dremio can bless this (based on my previous comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we keep this 'if' here, then I guess we should make some change in VectorUnloader.
If vector.getBufferSize ==0, then no buffers would be sent via IPC, however, the VectorLoader depends on TypeBufferCount to decide how many buffers to load into a vector, and in this case, something is wrong.
To solve this, we may add a check in VectorUnloader, if vector bufferSize==0, we should append it's field buffers(by call getFieldBuffers) also, even their writerIndex/readerIndex=0. In this way, we could keep the 'if (getBufferSize() == 0)' in vectors and the IPC also works well.

Do you think we should keep this PR or update it as I suggested above?

@tianchen92
Copy link
Contributor Author

ping @siddharthteotia @jacques-n

@jacques-n
Copy link
Contributor

I'm asking someone to review this from Dremio.

@praveenbingo
Copy link
Contributor

@tianchen92 Tried running this against Dremio and ran into tons of failures, can you please hold while i figure which part is broken..

@tianchen92
Copy link
Contributor Author

@tianchen92 Tried running this against Dremio and ran into tons of failures, can you please hold while i figure which part is broken..

ok

@praveenbingo
Copy link
Contributor

@tianchen92 Tried running this against Dremio and ran into tons of failures, can you please hold while i figure which part is broken..

ok

thanks @tianchen92

@emkornfield
Copy link
Contributor

@praveenbingo did you have a chance to investigate?

@wesm
Copy link
Member

wesm commented Jun 12, 2020

ping

@emkornfield
Copy link
Contributor

@jacques-n @rymurr do you know the progress of this internal to Dremio? It has been blocked a while on feedback, if we don't here back by Monday, I think we should rebase and merge.

@projjal
Copy link
Contributor

projjal commented Jun 15, 2020

Let me look at this change and the impact on dremio. I will update it by tomorrow

}
List<ArrowBuf> list = new ArrayList<>();
list.add(validityBuffer);
list.add(offsetBuffer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this change, the order of validity and offset buffers is changed in the getBuffers method for ListVector which creates problems with serialization/deserialization resulting in test failures in Dremio. This would break backward compatibility with existing serialised files.
Also keeping the existing order in #getBuffers will also break tests since this PR replaces #getFieldBuffers with #getBuffers in VectorUnloader. #getFieldBuffers and #getBuffers currently have different order of buffers while #getFieldBuffers and #loadFieldBuffers same order, and the order should be same for VectorLoader and VectorUnloader.
cc @pravindra @tianchen92

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@projjal Thanks for your testing result!
Seems like a legacy problem here, and I think validityBuffer->offsetBuffer should be the right order considering other vectors.
@jacques-n @pravindra any thoughts on how to resolve the conflicts? :)

@wesm
Copy link
Member

wesm commented Jun 24, 2020

Does this impact IPC?

@tianchen92
Copy link
Contributor Author

Does this impact IPC?

seems not, IPC used getFieldBuffers which has the right buffer order, this PR is going to replace getFieldBuffers with getBuffers (getBuffers has wrong buffer order witch will break Dremio tests)

@emkornfield
Copy link
Contributor

Can we leave the old method in place and mark it as deprecated and remove in a later release?

@tianchen92
Copy link
Contributor Author

Can we leave the old method in place and mark it as deprecated and remove in a later release?

I am afraid it's not reasonable. since we need the right order in IPC and Dremio need the old wrong order in getBuffers to avoid test failure, unless we correct the behavior in getBuffers and rename the old method like getBuffers2 which seems ugly . @projjal Any thought? :)

@emkornfield
Copy link
Contributor

Given how long this PR has been open and approved, I think we should aim to check it in next Tuesday, unless we can come up with a concrete plan by then to help mitigate impact to dremio. CC @jacques-n

@jacques-n
Copy link
Contributor

This doesn't just break Dremio tests, it breaks Dremio functionally.

A little history lesson: ValueVector.getBuffers() has existed for a much longer time than FieldVector.getFieldBuffers(). It behaves differently, uses less memory and is on a higher level abstraction. I don't think it makes sense to remove this without more thought around all those issues and I definitely think it should be done with step 1 being deprecation and step 2 being removal. Let's constrain this patch to what it was originally intended to solve and not try to consolidate the two different buffer retrieval interfaces.

There are definitely scenarios where we want to view the buffers unchanged (getFieldBuffers) and where we want to export them for writing (getBuffers). We should figure out what the consolidated interface is before removing something useful.

@emkornfield
Copy link
Contributor

@tianchen92 does @jacques-n proposal make sense?

@tianchen92
Copy link
Contributor Author

@tianchen92 does @jacques-n proposal make sense?

step 1 being deprecation and step 2 being removal

Hmm, does this mean

  1. leave old method in place with deprecation and add a new method(such as getBuffersNew) which should be used in IPC
    or
  2. totally revert this change and just mark getBuffers as deprecated
    ?

@emkornfield
Copy link
Contributor

@tianchen92 rereading, after rereading all the comments. I think we should

  1. Remove setReaderWriterIndeces in getFieldBuffers
  2. Deprecate getBuffers
  3. Introduce a new getIpcBuffers which is unambiguously used for writing record batches (i.e. in VectorUnloader).
  4. Update documentation where it makes sense based on all this conversation.

@jacques-n or someone else from dremio can maybe provide additional insight into how getBuffers is used and whether we really need to keep it

@jacques-n
Copy link
Contributor

@tianchen92 rereading, after rereading all the comments. I think we should

  1. Remove setReaderWriterIndeces in getFieldBuffers
  2. Deprecate getBuffers
  3. Introduce a new getIpcBuffers which is unambiguously used for writing record batches (i.e. in VectorUnloader).
  4. Update documentation where it makes sense based on all this conversation.

@jacques-n or someone else from dremio can maybe provide additional insight into how getBuffers is used and whether we really need to keep it

I agree with item 1 on your list.

I think 2-4 need more conversation about what we want to expose. I'd definitely avoid introducing a new method (3 on your list) until we figure out what sets of functionality we want. getBuffers() may actually be exactly what we want (with a changed order as necessary). I think it would be valuable to rationalize getFieldBuffers() in this context. Remember that FieldVector and getFieldBuffers() were introduced when we had separate non-nullable and nullable vectors but wanted to treat the non-nullable ones as internal (and thus they didn't expose the FieldVector interface). It seems like we have several operational needs:

getFieldBuffers: get the list of buffers cheaply with no modifications to do some low level buffer operations (e.g. hand written addition logic)
getBuffers: export the list of buffers for writing (with or without removing them-- why both...)?

@emkornfield
Copy link
Contributor

I think 2-4 need more conversation about what we want to expose. I'd definitely avoid introducing a new method (3 on your list) until we figure out what sets of functionality we want. getBuffers() may actually be exactly what we want (with a changed order as necessary). I think it would be valuable to rationalize getFieldBuffers() in this context.

@jacques-n
My current understanding is there is a cycle here which needs to be broken (@tianchen92 please let me know if understand the issues correcty).

  1. IPC VectorUnloader currently relies on getFieldBuffers. It shouldn't.
  2. Because of how getFieldBuffers should be used, we shouldn't be setting read/writer indices. But we can't remove the indices setting because it is used in VectorUnloader.
  3. we can't use getBuffers in place of getFieldBuffer in VectorUnloader because it does not return buffers in the same order.

If this is the case I think introducing a new method and moving away from getBuffers is the least bad option. Silently breaking the contract of getBuffers doesn't seem to be good idea (as witnessed by the length of time this PR has dragged out). IIRC correctly I think the jpython code had to work around getBuffers being inconsistent as well, so a discussion would be good. @jacques-n since you have the most context and historical knowledge would you mind starting a thread on dev@?

Remember that FieldVector and getFieldBuffers() were introduced when we had separate non-nullable and nullable vectors but wanted to treat the non-nullable ones as internal (and thus they didn't expose the FieldVector interface). It seems like we have several operational needs

I think this might be have been during a portion of time that I stepped away from the project.

@tianchen92
Copy link
Contributor Author

My current understanding is there is a cycle here which needs to be broken (@tianchen92 please let me know if understand the issues correcty).

@emkornfield sorry for the delay, your understanding is right :)

@emkornfield
Copy link
Contributor

@tianchen92 would you mind starting a thread on the ML, it seems that @jacques-n might not have bandwidth.

@tianchen92
Copy link
Contributor Author

@tianchen92 would you mind starting a thread on the ML, it seems that @jacques-n might not have bandwidth.

ok, started already.

@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 25, 2020
@pitrou
Copy link
Member

pitrou commented Jul 27, 2021

@liyafan82 @emkornfield What is the status of this PR?

@liyafan82
Copy link
Contributor

@liyafan82 @emkornfield What is the status of this PR?

@pitrou I looked through the comments, and it seems it has been a long time since this issue was last updated.
Since this change involves some fundamnetal changes, which may break Dremio function (also possibly break client code), I think we should at least get confirm from Dremio (@jacques-n @praveenbingo) before we can go ahead?

@emkornfield
Copy link
Contributor

I think we can consider it abandoned.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: Java needs-rebase A PR that needs to be rebased by the author

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants