-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-16085: [C++][R] InMemoryDataset::ReplaceSchema does not alter scan output #13088
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -62,9 +62,13 @@ TEST_F(TestInMemoryDataset, ReplaceSchema) { | |
| schema_, RecordBatchVector{static_cast<size_t>(kNumberBatches), batch}); | ||
|
|
||
| // drop field | ||
| ASSERT_OK(dataset->ReplaceSchema(schema({field("i32", int32())})).status()); | ||
| auto new_schema = schema({field("i32", int32())}); | ||
| ASSERT_OK_AND_ASSIGN(auto new_dataset, dataset->ReplaceSchema(new_schema)); | ||
| AssertDatasetHasSchema(new_dataset, new_schema); | ||
| // add field (will be materialized as null during projection) | ||
| ASSERT_OK(dataset->ReplaceSchema(schema({field("str", utf8())})).status()); | ||
| new_schema = schema({field("str", utf8())}); | ||
| ASSERT_OK_AND_ASSIGN(new_dataset, dataset->ReplaceSchema(new_schema)); | ||
| AssertDatasetHasSchema(new_dataset, new_schema); | ||
| // incompatible type | ||
| ASSERT_RAISES(TypeError, | ||
| dataset->ReplaceSchema(schema({field("i32", utf8())})).status()); | ||
|
|
@@ -107,6 +111,40 @@ TEST_F(TestInMemoryDataset, InMemoryFragment) { | |
| AssertSchemaEqual(batch->schema(), schema); | ||
| } | ||
|
|
||
| TEST_F(TestInMemoryDataset, HandlesDifferingSchemas) { | ||
| constexpr int64_t kBatchSize = 1024; | ||
|
|
||
| // These schemas can be merged | ||
| SetSchema({field("i32", int32()), field("f64", float64())}); | ||
| auto batch1 = ConstantArrayGenerator::Zeroes(kBatchSize, schema_); | ||
| SetSchema({field("i32", int32())}); | ||
| auto batch2 = ConstantArrayGenerator::Zeroes(kBatchSize, schema_); | ||
| RecordBatchVector batches{batch1, batch2}; | ||
|
|
||
| auto dataset = std::make_shared<InMemoryDataset>(schema_, batches); | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we actually want this to be valid though? I would expect the batches of a dataset to have a consistent schema
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In file fragments, it's totally normal to have a physical schema that is different from the dataset schema. This came up when I realized we could create a union dataset out of filesystem ones but not in-memory ones if the schemas differed.
I thought about that, but then are we materializing the projected batches before any scan is started? It seems more efficient for the projection to happen as part of the scan.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm, good point about the fragments. I was thinking InMemoryDataset already has all the data in memory, so it's not a big deal anyways. But yes, that's unnecessary work compared to this. |
||
|
|
||
| ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan()); | ||
| ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish()); | ||
| ASSERT_OK_AND_ASSIGN(auto table, scanner->ToTable()); | ||
| ASSERT_EQ(*table->schema(), *schema_); | ||
| ASSERT_EQ(table->num_rows(), 2 * kBatchSize); | ||
|
|
||
| // These cannot be merged | ||
| SetSchema({field("i32", int32()), field("f64", float64())}); | ||
| batch1 = ConstantArrayGenerator::Zeroes(kBatchSize, schema_); | ||
| SetSchema({field("i32", struct_({field("x", date32())}))}); | ||
| batch2 = ConstantArrayGenerator::Zeroes(kBatchSize, schema_); | ||
| batches = RecordBatchVector({batch1, batch2}); | ||
|
|
||
| dataset = std::make_shared<InMemoryDataset>(schema_, batches); | ||
|
|
||
| ASSERT_OK_AND_ASSIGN(scanner_builder, dataset->NewScan()); | ||
| ASSERT_OK_AND_ASSIGN(scanner, scanner_builder->Finish()); | ||
| EXPECT_RAISES_WITH_MESSAGE_THAT( | ||
| TypeError, testing::HasSubstr("fields had matching names but differing types"), | ||
| scanner->ToTable()); | ||
| } | ||
|
|
||
| class TestUnionDataset : public DatasetFixtureMixin {}; | ||
|
|
||
| TEST_F(TestUnionDataset, ReplaceSchema) { | ||
|
|
@@ -131,9 +169,13 @@ TEST_F(TestUnionDataset, ReplaceSchema) { | |
| AssertDatasetEquals(reader.get(), dataset.get()); | ||
|
|
||
| // drop field | ||
| ASSERT_OK(dataset->ReplaceSchema(schema({field("i32", int32())})).status()); | ||
| auto new_schema = schema({field("i32", int32())}); | ||
| ASSERT_OK_AND_ASSIGN(auto new_dataset, dataset->ReplaceSchema(new_schema)); | ||
| AssertDatasetHasSchema(new_dataset, new_schema); | ||
| // add nullable field (will be materialized as null during projection) | ||
| ASSERT_OK(dataset->ReplaceSchema(schema({field("str", utf8())})).status()); | ||
| new_schema = schema({field("str", utf8())}); | ||
| ASSERT_OK_AND_ASSIGN(new_dataset, dataset->ReplaceSchema(new_schema)); | ||
| AssertDatasetHasSchema(new_dataset, new_schema); | ||
| // incompatible type | ||
| ASSERT_RAISES(TypeError, | ||
| dataset->ReplaceSchema(schema({field("i32", utf8())})).status()); | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like this could be a construction-time check to avoid repeated checking except there is no way to return a Status there, unfortunately. (Not a big deal, though.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about that, but would have to change this to a
::Make()method and didn't want to go that far here.