-
Notifications
You must be signed in to change notification settings - Fork 513
feat: support FixedSizeList<Struct> #5593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
sorry, mislabeled this breaking. Should not be breaking. |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
|
I will update this patch once #5591 is merged - there are some related tests I want to add here that currently fail. |
10ef8c2 to
23ba011
Compare
|
Should this encoder require 2.2? Our general rule of thumb has been "any file written by any version of 2.1 should be readable by any version of 2.1" which is perhaps stricter than normal backwards compatibility rules. Since this allows you to create files that old 2.1 readers will not be able to read I think this will need to be a 2.2 feature. Or maybe you have that check and I just missed it. Also, does this handle |
|
Sounds like this should require 2.2, thanks for catching that. I'll push an update shortly. |
|
@westonpace thanks, this is updated, I think the failures are unrelated. |
👍 #5646 |
wjones127
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have one concern about handle corrupt schemas, but otherwise looks good.
| let size: i32 = | ||
| lt.0.split(':') | ||
| .next_back() | ||
| .expect("fixed_size_list:struct logical type missing size suffix") | ||
| .parse() | ||
| .expect("fixed_size_list:struct logical type has invalid size"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean we can panic if we read a dataset that has a corrupt schema?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the lance schema. I'm not entirely sure users are able to create their own lance schema. So I think the only way this could happen is if there was some kind of corrupt protobuf. Also, there is a significant panic potential down below at lt => DataType::try_from(lt).unwrap().
I suppose it is technically a valid concern but this method has many many callsites and changing it to result returning should probably be a PR on its own or else this one is going to get real confusing.
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nicely done. A few minor thoughts but looks good.
| } | ||
|
|
||
| #[derive(Debug)] | ||
| struct RandomMapGenerator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you document this (or the rand_map method). Just a brief comment somewhere to describe that it will randomly generate maps with 0-4 items.
| let total_entries = lengths.values().iter().sum::<i32>() as u64; | ||
| let offsets = OffsetBuffer::from_lengths(lengths.values().iter().map(|v| *v as usize)); | ||
|
|
||
| let keys = self.keys_gen.generate(RowCount::from(total_entries), rng)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really relevant but I wonder if keys need to be unique within a map? I guess not.
rust/lance-encoding/src/decoder.rs
Outdated
| let child = field | ||
| .children | ||
| .first() | ||
| .expect("FixedSizeList field must have a child"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use expect_ok here so we get a result and not a panic?
rust/lance-encoding/src/encoder.rs
Outdated
| let child = field | ||
| .children | ||
| .first() | ||
| .expect("FixedSizeList should have a child"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expect_ok
| } | ||
|
|
||
| /// Filters garbage from nested FSL arrays that contain list-like children. | ||
| fn filter_nested_fsl_garbage( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work here.
| async fn test_fsl_struct_random( | ||
| #[case] struct_fields: Fields, | ||
| #[case] dimension: i32, | ||
| #[case] min_version: LanceFileVersion, | ||
| #[values(STRUCTURAL_ENCODING_MINIBLOCK, STRUCTURAL_ENCODING_FULLZIP)] | ||
| structural_encoding: &str, | ||
| ) { | ||
| let data_type = make_fsl_struct_type(struct_fields, dimension); | ||
| let mut field_metadata = HashMap::new(); | ||
| field_metadata.insert( | ||
| STRUCTURAL_ENCODING_META_KEY.to_string(), | ||
| structural_encoding.into(), | ||
| ); | ||
| let field = Field::new("", data_type, true).with_metadata(field_metadata); | ||
| let test_cases = TestCases::basic().with_min_file_version(min_version); | ||
| check_specific_random(field, test_cases).await; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice compact and comprehensive test
| } | ||
|
|
||
| #[test] | ||
| #[should_panic(expected = "Unsupported logical type: map")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice if unsupported type changed to err instead of panic. I'll make a ticket and we can handle in a follow-up.
Prior to this commit, FixedSizeList was only supported with primitive element types (e.g., FSL<Float32> for vectors). This adds structural encoding support for FSL<Struct>, enabling use cases like fixed-size arrays of bounding boxes, coordinate tuples, or other structured data. Key changes: - New `FixedSizeListStructuralEncoder` that encodes FSL validity to rep/def and delegates child encoding to the struct encoder - New `StructuralFixedSizeListScheduler` that scales row ranges by the FSL dimension when scheduling reads - New `StructuralFixedSizeListDecoder` that reconstructs FSL arrays from child data and rep/def validity A key challenge is "garbage filtering": unlike variable-length lists which can omit children under null entries, FSL children always exist. When an FSL row is null, any nested list-like types within its children contain undefined "garbage" data. The encoder normalizes these to empty null lists before encoding.
690cac0 to
3d0064b
Compare
Prior to this commit, FixedSizeList was only supported with primitive element types (e.g.,
FSL<Float32>for vectors). This adds structural encoding support forFSL<Struct>, enabling use cases like fixed-size arrays of bounding boxes, coordinate tuples, or other structured data.Key changes:
FixedSizeListStructuralEncoderthat encodes FSL validity to rep/def and delegates child encoding to the struct encoderStructuralFixedSizeListSchedulerthat scales row ranges by the FSL dimension when scheduling readsStructuralFixedSizeListDecoderthat reconstructs FSL arrays from child data and rep/def validityA key challenge is "garbage filtering": unlike variable-length lists which can omit children under null entries, FSL children always exist. When an FSL row is null, any nested list-like types within its children contain undefined "garbage" data. The encoder normalizes these to empty null lists before encoding.