Skip to content

Conversation

@alexwilcoxson-rel
Copy link
Contributor

@alexwilcoxson-rel alexwilcoxson-rel commented Oct 5, 2023

Which issue does this PR close?

Closes #4895

Rationale for this change

see #4895

What changes are included in this PR?

Adds with_send_dictionaries to FlightDataEncoderBuilder to opt in to always sending dictionary messages with each record batch encoded.

Are there any user-facing changes?

with_send_dictionaries added to FlightDataEncoderBuilder and FlightDataEncoder. Doc strings updated.

@github-actions github-actions bot added arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate labels Oct 5, 2023
@alamb
Copy link
Contributor

alamb commented Oct 6, 2023

Thank you for this PR @alexwilcoxson-rel -- I hope to find time to review it properly in the next few days

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alexwilcoxson-rel -- I think this is a really nice improvement and it could be merged as is.

I was thinking to a future when we might have proper support for resending dictionaries and trying to minimize the API impact as well as adding a nice place for more documentation. What do you think about using an enum instead of a bool like this:

So instead of

pub struct FlightDataEncoder {
...
    send_dictionaries: bool,
...
}

Something like

pub struct FlightDataEncoder {
...
    dictionary_handling: DictionaryHandling,
...
}

/// Defines how a [`FlightDataEncoder`] encodes [`DictionaryArray`]s
pub enum DictionaryHandling {
  /// Expands to the underlying type (default). This likely sends more data over the network
  /// but requires less memory (dictionaries are not tracked) and is  more compatible
  /// with other arrow flight client implementations that may not support `DictionaryEncoding`
  /// see [`hydrate_dictionary`] for more details.
  Hydrate,
  /// Send dictionary FlightData with every RecordBatch that contains a [`DictionaryArray`].
  /// See [`Self::Hydrate`] for more tradeoffs. No attempt is made to skip sending the same (logical)
  /// dictionary values twice. 
  Resend, 
}

}

#[tokio::test]
async fn test_dictionary_hydration() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


let arr_one: Arc<DictionaryArray<UInt16Type>> =
Arc::new(vec!["a", "a", "b"].into_iter().collect());
let arr_two: Arc<DictionaryArray<UInt16Type>> =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌

@alamb
Copy link
Contributor

alamb commented Oct 10, 2023

BTW it looks like some documentation also needs to be updated, like

/// # Caveats

I can handle doing so, but wanted to mention it in case @alexwilcoxson-rel was going to change this one more

@alexwilcoxson-rel
Copy link
Contributor Author

BTW it looks like some documentation also needs to be updated, like

/// # Caveats

I can handle doing so, but wanted to mention it in case @alexwilcoxson-rel was going to change this one more

Updated these docs

@alexwilcoxson-rel
Copy link
Contributor Author

Updated to use Enum per @alamb feedback

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alexwilcoxson-rel -- this looks great

@alamb
Copy link
Contributor

alamb commented Oct 11, 2023

The CI seems to be failing the doc checks. I'll try and fix that

@alamb
Copy link
Contributor

alamb commented Oct 11, 2023

Here is a proposed PR to your branch for improved documentation relativityone#1 🙏

Improve docs for sending flight dictionaries
@alexwilcoxson-rel
Copy link
Contributor Author

Here is a proposed PR to your branch for improved documentation relativityone#1 🙏

LGTM, merged! please rerun CI when you can.

@alamb alamb changed the title Add option to FlightDataEncoder to always resend batch dictionaries Add option to FlightDataEncoder to always resend batch dictionaries Oct 12, 2023
@alamb alamb merged commit d5a655d into apache:master Oct 12, 2023
@alamb
Copy link
Contributor

alamb commented Oct 12, 2023

Thanks again @alexwilcoxson-rel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add option to FlightDataEncoder to always send dictionaries

2 participants