PR2 (nullability bug): adding new OH SparkCatalog which enables preserving non-nullable schemas #288
base: main
Conversation
Approach is good, but I'd prefer not to introduce this in OSS; let's chat more in DM.
@@ -0,0 +1,8 @@
package com.linkedin.openhouse.spark;

public class SparkCatalog extends org.apache.iceberg.spark.SparkCatalog {
I'd strongly prefer we not introduce this layer (i.e. SparkCatalog) in the OSS codebase (if it's a must, a better place would be li-wrapper).
Reasons:
- Iceberg solves this in 1.7.x (https://iceberg.apache.org/docs/1.7.1/spark-configuration/#catalog-configuration) with a configuration (see the sketch after this list). Ideally we upgrade to 1.7.x and this problem is auto-solved for us.
- If we override this conf, the Iceberg behavior will be ignored.
- Ideally we do not touch Iceberg connector code (i.e. SparkCatalog, FlinkCatalog, etc.); it's crucial that OHCatalog stays minimal in its interfacing, which keeps it easy for us to upgrade Iceberg/Spark/other dependencies.
- Adding this layer would allow other Iceberg confs to be overridden, which we shouldn't allow.
- I'm concerned about it getting misused and our forks/behavior thereby deviating from OSS.
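For reference, a minimal sketch of the configuration-only approach the first bullet refers to, assuming Iceberg 1.7.x and that the relevant catalog property is `use-nullable-query-schema` as listed in the linked 1.7.1 catalog-configuration docs (the exact key and default should be verified there); the catalog name `openhouse` is illustrative:

```java
import org.apache.spark.sql.SparkSession;

public class NullableCtasConfigSketch {
  public static void main(String[] args) {
    // Sketch only: with Iceberg 1.7.x, CTAS/RTAS nullability handling is a catalog
    // property, so no custom SparkCatalog subclass would be needed. The property
    // name below is taken from the linked 1.7.1 docs and should be verified there;
    // "openhouse" is an illustrative catalog name.
    SparkSession spark =
        SparkSession.builder()
            .config("spark.sql.catalog.openhouse", "org.apache.iceberg.spark.SparkCatalog")
            .config("spark.sql.catalog.openhouse.use-nullable-query-schema", "false")
            .getOrCreate();
    spark.stop();
  }
}
```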
I also don't see this as a perfect option and appreciate the points here.
- We can't upgrade to 1.7.x because it depends on higher Spark versions than the Spark clients currently use.
- What exactly will be ignored/changed? Since we are simply extending and adding, everything else should be preserved.
- This can be codified among the contributors as code not to edit.
@@ -37,7 +37,7 @@ public static void configureCatalogs(
    builder
        .config(
            String.format("spark.sql.catalog.%s", catalogName),
-           "org.apache.iceberg.spark.SparkCatalog")
+           "com.linkedin.openhouse.spark.SparkCatalog")
We'd need to change Docker code too, and all other references to this connector.
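For example, any deployment or image config that registers the catalog by class name would need the same swap; a hypothetical spark-defaults.conf entry (catalog name and file location are illustrative, not taken from the repo) would change like this:

```
# hypothetical spark-defaults.conf entry; catalog name and file are illustrative
# before: spark.sql.catalog.openhouse  org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.openhouse  com.linkedin.openhouse.spark.SparkCatalog
```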
  // Verify id column is preserved in good catalog, not preserved in bad catalog
  assertFalse(sourceSchema.apply("id").nullable(), "Source table id column should be required");
  assertTrue(
-     targetSchema.apply("id").nullable(),
+     targetSchemaBroken.apply("id").nullable(),
      "Target table id column required should not be preserved -- due to 1) the CTAS non-nullable preservation is off by default and 2) OS spark3.1 catalyst connector lack of support for non-null CTAS");
If both are nullable, how is the SparkCatalog helping?
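To illustrate the question, a sketch of the assertions one might expect the new catalog to make pass (variable names are borrowed from the diff above; this is not code from the PR):

```java
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.apache.spark.sql.types.StructType;

class NullabilityAcrossCatalogsSketch {
  // Sketch only: targetSchema / targetSchemaBroken are assumed to be the schemas read
  // back from tables created via CTAS through the new OH SparkCatalog and the stock
  // Iceberg SparkCatalog respectively (names borrowed from the diff above).
  static void verify(StructType targetSchema, StructType targetSchemaBroken) {
    // If the new catalog preserves non-nullability, id should come back required...
    assertFalse(targetSchema.apply("id").nullable(),
        "id should stay required when CTAS goes through the new catalog");
    // ...while the stock Iceberg catalog path is expected to keep dropping the constraint.
    assertTrue(targetSchemaBroken.apply("id").nullable(),
        "id becomes nullable when CTAS goes through the stock Iceberg SparkCatalog");
  }
}
```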
Summary
Problem: the OpenHouse Spark catalog does not preserve non-null fields requested by user DataFrames, so tables are saved with the wrong schema. This problem only affects CTAS.
Solution: this PR provides a new SparkCatalog with configuration to enable non-nullable preservation ✅; we then require all Spark clients to use this SparkCatalog 🕐.
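For context, a minimal sketch of the client-side flow this change targets (the catalog name `openhouse` and the table names are illustrative; the config key pattern follows the configureCatalogs diff above):

```java
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructField;

public class NonNullableCtasSketch {
  public static void main(String[] args) {
    // Register the new OpenHouse catalog class instead of the stock Iceberg one.
    // "openhouse" and the table names below are illustrative.
    SparkSession spark =
        SparkSession.builder()
            .config("spark.sql.catalog.openhouse", "com.linkedin.openhouse.spark.SparkCatalog")
            .getOrCreate();

    // CTAS from a source table whose id column is NOT NULL; the bug is that the
    // default path re-creates id as nullable in the target table.
    spark.sql("CREATE TABLE openhouse.db.target AS SELECT * FROM openhouse.db.source");

    // Inspect whether the NOT NULL constraint survived the CTAS. As the test message
    // above notes, this also depends on the Spark client/connector version.
    StructField id = spark.table("openhouse.db.target").schema().apply("id");
    System.out.println("id nullable after CTAS: " + id.nullable());

    spark.stop();
  }
}
```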
Changes
Testing Done
Additional Information