ARROW-7539: [Java] FieldVector getFieldBuffers API should not set reader/writer indices #6156

tianchen92 · 2020-01-10T04:25:36Z

The fact that we have reader/writer settings in getFieldBuffers is wrong. To clarify, getFieldBuffers is distinct from getBuffers. The former should be for getting access to underlying data for higher-performance algorithms. The latter is for sending the data over the wire. Seems we've mixed up use of both.
Currently, VectorUnloader uses getFieldBuffers to create the ArrowRecordBatch, which is why getFieldBuffers still sets the writer/reader indices. VectorUnloader should use getBuffers instead.
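
To make the distinction concrete, here is a minimal sketch of the two access paths. It is a toy model only: ToyBuf and ToyIntVector are hypothetical stand-ins, not Arrow's ArrowBuf or any real vector class, and the index arithmetic assumes a nullable vector of 4-byte values.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model, not Arrow's real classes: a side-effect-free accessor vs. an export accessor. */
final class BufferAccessSketch {

  /** Hypothetical stand-in for ArrowBuf: a capacity plus reader/writer indices. */
  static final class ToyBuf {
    final int capacity;
    int readerIndex = 0;
    int writerIndex = 0;

    ToyBuf(int capacity) {
      this.capacity = capacity;
    }
  }

  /** Hypothetical stand-in for a nullable fixed-width vector of 4-byte values. */
  static final class ToyIntVector {
    final ToyBuf validity = new ToyBuf(64);
    final ToyBuf data = new ToyBuf(4096);
    int valueCount = 0;

    /** Low-level access: returns the buffers untouched, with no index bookkeeping. */
    List<ToyBuf> getFieldBuffers() {
      List<ToyBuf> result = new ArrayList<>(2);
      result.add(validity);
      result.add(data);
      return result;
    }

    /** Export for the wire: marks how many bytes of each buffer are meaningful. */
    List<ToyBuf> getBuffers() {
      validity.readerIndex = 0;
      validity.writerIndex = (valueCount + 7) / 8; // one validity bit per value
      data.readerIndex = 0;
      data.writerIndex = valueCount * 4;           // four bytes per value
      return getFieldBuffers();
    }
  }

  public static void main(String[] args) {
    ToyIntVector vector = new ToyIntVector();
    vector.valueCount = 10;
    vector.getFieldBuffers();
    System.out.println("after getFieldBuffers: " + vector.data.writerIndex);
    vector.getBuffers();
    System.out.println("after getBuffers: " + vector.data.writerIndex);
  }
}
```

In this model a caller doing low-level work can call getFieldBuffers repeatedly without disturbing any state, while only the export path touches the indices.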

tianchen92 · 2020-01-10T04:30:53Z

@jacques-n @emkornfield Please take a look at this one. After that we can rebase and continue with #6133. Thanks!

github-actions · 2020-01-10T04:31:45Z

https://issues.apache.org/jira/browse/ARROW-7539

emkornfield · 2020-02-03T04:15:40Z

@TheNeuralBit do you have time to do a first pass on this?

siddharthteotia · 2020-02-04T05:54:01Z

java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java

I believe this if check came from an earlier refactor that happened in Arrow and several tests had failed in Dremio. I think we should keep it.

In the old implementation, VectorUnloader.java checks typeBufferCount against getFieldBuffers, which is fine because getFieldBuffers returns all buffers without checking the buffer size. If we keep this 'if', the expected buffer count check in VectorUnloader becomes invalid. Should we simply remove that check?

I am okay with this change but it would be great if someone from Dremio can bless this (based on my previous comment)

If we keep this 'if' here, then I guess we should make some changes in VectorUnloader.
If vector.getBufferSize() == 0, no buffers would be sent via IPC. However, VectorLoader depends on TypeBufferCount to decide how many buffers to load into a vector, so the buffer count would not match in this case.
To solve this, we could add a check in VectorUnloader: if the vector's buffer size is 0, also append its field buffers (by calling getFieldBuffers), even though their writerIndex/readerIndex are 0. This way we can keep the 'if (getBufferSize() == 0)' in the vectors and IPC still works.

Do you think we should keep this PR or update it as I suggested above?
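
The fallback suggested in this thread can be sketched roughly as follows. This is a toy model under stated assumptions: ToyVector, unloadNaive, unloadWithFallback, loaderAccepts and EXPECTED_BUFFER_COUNT are all hypothetical names, buffers are represented by plain strings, and the real VectorUnloader/VectorLoader logic is more involved.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the empty-vector case; none of these classes are Arrow's real ones. */
final class EmptyVectorUnloadSketch {

  /** Hypothetical stand-in for the per-type buffer count the loader expects. */
  static final int EXPECTED_BUFFER_COUNT = 2;

  /** Toy vector whose buffers are modeled as names; sizeInBytes models getBufferSize(). */
  static final class ToyVector {
    final int sizeInBytes;

    ToyVector(int sizeInBytes) {
      this.sizeInBytes = sizeInBytes;
    }

    /** Always returns every buffer, regardless of size. */
    List<String> getFieldBuffers() {
      List<String> buffers = new ArrayList<>();
      buffers.add("validity");
      buffers.add("data");
      return buffers;
    }

    /** Mirrors the 'if (getBufferSize() == 0)' shortcut: an empty vector exports nothing. */
    List<String> getBuffers() {
      if (sizeInBytes == 0) {
        return new ArrayList<>();
      }
      return getFieldBuffers();
    }
  }

  /** Unloader without a fallback: an empty vector contributes zero buffers. */
  static List<String> unloadNaive(ToyVector vector) {
    return vector.getBuffers();
  }

  /** Unloader with the suggested fallback: empty vectors still contribute their field buffers. */
  static List<String> unloadWithFallback(ToyVector vector) {
    if (vector.sizeInBytes == 0) {
      return vector.getFieldBuffers();
    }
    return vector.getBuffers();
  }

  /** The loader side only accepts exactly the expected number of buffers. */
  static boolean loaderAccepts(List<String> buffers) {
    return buffers.size() == EXPECTED_BUFFER_COUNT;
  }

  public static void main(String[] args) {
    ToyVector empty = new ToyVector(0);
    System.out.println("naive unload accepted: " + loaderAccepts(unloadNaive(empty)));
    System.out.println("fallback unload accepted: " + loaderAccepts(unloadWithFallback(empty)));
  }
}
```

The point of the sketch is only that the count seen by the loader stays stable when the unloader special-cases empty vectors; whether that special case belongs in the unloader or the vector is exactly what the thread is debating.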

tianchen92 · 2020-02-28T02:48:59Z

ping @siddharthteotia @jacques-n

jacques-n · 2020-03-13T04:28:18Z

I'm asking someone to review this from Dremio.

praveenbingo · 2020-03-16T04:26:03Z

@tianchen92 I tried running this against Dremio and ran into tons of failures. Can you please hold while I figure out which part is broken?

tianchen92 · 2020-03-16T04:34:19Z

ok

praveenbingo · 2020-03-16T07:30:06Z

thanks @tianchen92

emkornfield · 2020-05-15T04:17:25Z

@praveenbingo did you have a chance to investigate?

wesm · 2020-06-12T01:51:32Z

ping

emkornfield · 2020-06-12T04:35:41Z

@jacques-n @rymurr do you know the progress of this internally at Dremio? It has been blocked for a while on feedback. If we don't hear back by Monday, I think we should rebase and merge.

projjal · 2020-06-15T04:42:08Z

Let me look at this change and its impact on Dremio. I will post an update by tomorrow.

projjal · 2020-06-17T10:18:38Z

java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java

-    }
+    List<ArrowBuf> list = new ArrayList<>();
+    list.add(validityBuffer);
+    list.add(offsetBuffer);


In this change, the order of the validity and offset buffers is changed in the getBuffers method for ListVector, which creates problems with serialization/deserialization and results in test failures in Dremio. This would break backward compatibility with existing serialised files.
Keeping the existing order in #getBuffers would also break tests, since this PR replaces #getFieldBuffers with #getBuffers in VectorUnloader. #getFieldBuffers and #getBuffers currently return buffers in different orders, while #getFieldBuffers and #loadFieldBuffers use the same order, and the order must be the same for VectorLoader and VectorUnloader.
cc @pravindra @tianchen92

@projjal Thanks for your testing result!
Seems like a legacy problem here, and I think validityBuffer->offsetBuffer should be the right order considering other vectors.
@jacques-n @pravindra any thoughts on how to resolve the conflicts? :)
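
Why the order matters can be shown with a small positional-loading sketch. It is a toy model, not Arrow's ListVector: Kind, LoadedList and load are hypothetical names, and the only assumption carried over from the discussion is that the loading side assigns buffers by position.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of positional buffer loading; not Arrow's real ListVector. */
final class BufferOrderSketch {

  enum Kind { VALIDITY, OFFSET }

  /** What a positional loader ends up with after reading two buffers. */
  static final class LoadedList {
    final Kind validitySlot;
    final Kind offsetSlot;

    LoadedList(Kind validitySlot, Kind offsetSlot) {
      this.validitySlot = validitySlot;
      this.offsetSlot = offsetSlot;
    }

    /** True only when each slot received the kind of buffer it expects. */
    boolean isConsistent() {
      return validitySlot == Kind.VALIDITY && offsetSlot == Kind.OFFSET;
    }
  }

  /** The loader is purely positional: index 0 is treated as validity, index 1 as offsets. */
  static LoadedList load(List<Kind> written) {
    return new LoadedList(written.get(0), written.get(1));
  }

  public static void main(String[] args) {
    List<Kind> sameOrder = Arrays.asList(Kind.VALIDITY, Kind.OFFSET);
    List<Kind> swappedOrder = Arrays.asList(Kind.OFFSET, Kind.VALIDITY);
    System.out.println("same order consistent: " + load(sameOrder).isConsistent());
    System.out.println("swapped order consistent: " + load(swappedOrder).isConsistent());
  }
}
```

Nothing in the written data says which buffer is which, so a writer and a reader that disagree on order fail silently. The same applies to files written with one order and read back with the other.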

wesm · 2020-06-24T00:38:23Z

Does this impact IPC?

tianchen92 · 2020-06-28T03:37:42Z

It seems not. IPC uses getFieldBuffers, which has the right buffer order. This PR is going to replace getFieldBuffers with getBuffers (getBuffers has the wrong buffer order, which will break Dremio tests).

emkornfield · 2020-06-30T03:40:19Z

Can we leave the old method in place and mark it as deprecated and remove in a later release?

tianchen92 · 2020-06-30T10:25:53Z

I am afraid that's not reasonable, since we need the right order in IPC and Dremio needs the old, wrong order in getBuffers to avoid test failures, unless we correct the behavior of getBuffers and rename the old method to something like getBuffers2, which seems ugly. @projjal Any thoughts? :)

emkornfield · 2020-07-03T03:31:51Z

Given how long this PR has been open and approved, I think we should aim to check it in next Tuesday, unless we can come up with a concrete plan by then to help mitigate the impact on Dremio. CC @jacques-n

jacques-n · 2020-07-03T05:02:18Z

This doesn't just break Dremio tests, it breaks Dremio functionally.

A little history lesson: ValueVector.getBuffers() has existed for a much longer time than FieldVector.getFieldBuffers(). It behaves differently, uses less memory and sits at a higher level of abstraction. I don't think it makes sense to remove this without more thought around all those issues, and I definitely think it should be done with step 1 being deprecation and step 2 being removal. Let's constrain this patch to what it was originally intended to solve and not try to consolidate the two different buffer retrieval interfaces.

There are definitely scenarios where we want to view the buffers unchanged (getFieldBuffers) and where we want to export them for writing (getBuffers). We should figure out what the consolidated interface is before removing something useful.

emkornfield · 2020-07-04T05:04:35Z

@tianchen92 does @jacques-n proposal make sense?

tianchen92 · 2020-07-06T03:57:38Z

step 1 being deprecation and step 2 being removal

Hmm, does this mean

leave the old method in place with deprecation and add a new method (such as getBuffersNew) which should be used in IPC,
or
totally revert this change and just mark getBuffers as deprecated?

emkornfield · 2020-07-10T05:30:27Z

@tianchen92 after rereading all the comments, I think we should:

1. Remove setReaderWriterIndeces in getFieldBuffers.
2. Deprecate getBuffers.
3. Introduce a new getIpcBuffers which is unambiguously used for writing record batches (i.e. in VectorUnloader).
4. Update documentation where it makes sense based on all this conversation.

@jacques-n or someone else from Dremio can maybe provide additional insight into how getBuffers is used and whether we really need to keep it.
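
For illustration only, the shape of this proposal might look like the sketch below. Every name in it is an assumption: ToyBuf, ToyFieldVector and ToyVector are hypothetical, getIpcBuffers does not exist in Arrow and is only the name floated here, and the real getBuffers has different behavior than this toy version.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy rendering of the proposal above; every name here is illustrative, not Arrow API. */
final class IpcBuffersProposalSketch {

  /** Hypothetical stand-in for ArrowBuf. */
  static final class ToyBuf {
    final int usedBytes;
    int writerIndex = 0;

    ToyBuf(int usedBytes) {
      this.usedBytes = usedBytes;
    }
  }

  interface ToyFieldVector {
    /** A plain view of the buffers that leaves the indices alone. */
    List<ToyBuf> getFieldBuffers();

    /** Kept for existing callers, flagged for later removal. */
    @Deprecated
    List<ToyBuf> getBuffers();

    /** Hypothetical export used only when writing record batches. */
    default List<ToyBuf> getIpcBuffers() {
      List<ToyBuf> exported = new ArrayList<>(getFieldBuffers());
      for (ToyBuf buf : exported) {
        buf.writerIndex = buf.usedBytes; // index bookkeeping happens only on this path
      }
      return exported;
    }
  }

  static final class ToyVector implements ToyFieldVector {
    private final List<ToyBuf> buffers = new ArrayList<>();

    ToyVector(int validityBytes, int dataBytes) {
      buffers.add(new ToyBuf(validityBytes));
      buffers.add(new ToyBuf(dataBytes));
    }

    @Override
    public List<ToyBuf> getFieldBuffers() {
      return new ArrayList<>(buffers);
    }

    @Override
    @Deprecated
    public List<ToyBuf> getBuffers() {
      return getIpcBuffers(); // toy behavior: the old method simply delegates
    }
  }

  public static void main(String[] args) {
    ToyVector vector = new ToyVector(2, 40);
    System.out.println("before export: " + vector.getFieldBuffers().get(1).writerIndex);
    System.out.println("after export: " + vector.getIpcBuffers().get(1).writerIndex);
  }
}
```

The sketch keeps the deprecated method callable so existing users keep compiling, which is the intent of a deprecate-then-remove path. It does not settle whether a third method is the right consolidation.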

jacques-n · 2020-07-11T17:04:33Z

I agree with item 1 on your list.

I think 2-4 need more conversation about what we want to expose. I'd definitely avoid introducing a new method (3 on your list) until we figure out what sets of functionality we want. getBuffers() may actually be exactly what we want (with a changed order as necessary). I think it would be valuable to rationalize getFieldBuffers() in this context. Remember that FieldVector and getFieldBuffers() were introduced when we had separate non-nullable and nullable vectors but wanted to treat the non-nullable ones as internal (and thus they didn't expose the FieldVector interface). It seems like we have several operational needs:

getFieldBuffers: get the list of buffers cheaply with no modifications to do some low level buffer operations (e.g. hand written addition logic)
getBuffers: export the list of buffers for writing (with or without removing them... why both?)

emkornfield · 2020-07-12T04:14:36Z

I think 2-4 need more conversation about what we want to expose. I'd definitely avoid introducing a new method (3 on your list) until we figure out what sets of functionality we want. getBuffers() may actually be exactly what we want (with a changed order as necessary). I think it would be valuable to rationalize getFieldBuffers() in this context.

@jacques-n
My current understanding is that there is a cycle here which needs to be broken (@tianchen92 please let me know if I understand the issues correctly).

IPC VectorUnloader currently relies on getFieldBuffers. It shouldn't.
Because of how getFieldBuffers should be used, we shouldn't be setting reader/writer indices. But we can't remove the index setting because it is used in VectorUnloader.
We can't use getBuffers in place of getFieldBuffers in VectorUnloader because it does not return buffers in the same order.

If this is the case, I think introducing a new method and moving away from getBuffers is the least bad option. Silently breaking the contract of getBuffers doesn't seem to be a good idea (as witnessed by the length of time this PR has dragged out). IIRC, the jpython code had to work around getBuffers being inconsistent as well, so a discussion would be good. @jacques-n since you have the most context and historical knowledge, would you mind starting a thread on dev@?

Remember that FieldVector and getFieldBuffers() were introduced when we had separate non-nullable and nullable vectors but wanted to treat the non-nullable ones as internal (and thus they didn't expose the FieldVector interface). It seems like we have several operational needs

I think this might have been during a period when I had stepped away from the project.

tianchen92 · 2020-07-16T07:38:24Z

My current understanding is that there is a cycle here which needs to be broken (@tianchen92 please let me know if I understand the issues correctly).

@emkornfield sorry for the delay, your understanding is right :)

emkornfield · 2020-08-01T02:45:12Z

@tianchen92 would you mind starting a thread on the ML? It seems that @jacques-n might not have bandwidth.

tianchen92 · 2020-08-04T07:58:24Z

ok, started already.

pitrou · 2021-07-27T08:12:06Z

@liyafan82 @emkornfield What is the status of this PR?

liyafan82 · 2021-07-28T02:18:01Z

@pitrou I looked through the comments, and it seems it has been a long time since this issue was last updated.
Since this involves some fundamental changes, which may break Dremio functionality (and possibly client code), I think we should at least get confirmation from Dremio (@jacques-n @praveenbingo) before we go ahead.

emkornfield · 2021-07-28T03:50:11Z

I think we can consider it abandoned.

tianchen92 mentioned this pull request Jan 10, 2020

ARROW-7494: [Java] Remove reader index and writer index from ArrowBuf #6133

Closed

fsaintjacques added the Component: Java label Jan 16, 2020

tianchen92 requested a review from jacques-n February 1, 2020 06:02

siddharthteotia reviewed Feb 4, 2020

TheNeuralBit self-requested a review February 5, 2020 23:41

kszucs force-pushed the ARROW-7539 branch from 31cf05e to e2b077f Compare February 7, 2020 10:14

tianchen92 requested a review from siddharthteotia February 19, 2020 06:54

tianchen92 closed this Feb 26, 2020

tianchen92 reopened this Feb 26, 2020

ARROW-7539: [Java] FieldVector getFieldBuffers API should not set rea…

f1b1dae

…der/writer indices

tianchen92 force-pushed the ARROW-7539 branch from e2b077f to f1b1dae Compare February 26, 2020 08:41

siddharthteotia approved these changes Mar 12, 2020

tianchen92 requested a review from siddharthteotia March 12, 2020 10:18

wesm force-pushed the master branch from 5fe5b88 to aa55967 Compare April 19, 2020 22:47

kszucs force-pushed the master branch from 1b71ca7 to 5093b80 Compare April 20, 2020 19:21

projjal reviewed Jun 17, 2020

github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 25, 2020

jorgecarleitao force-pushed the master branch from d4608a9 to 356c300 Compare February 14, 2021 12:09

emkornfield closed this Jul 28, 2021

asfimport mentioned this pull request Nov 26, 2024

[Java] FieldVector getFieldBuffers API should not set reader/writer indices apache/arrow-java#270

Open
