[HUDI-9620] Refactor HoodieHadoopFsRelationFactory #13527
override val fileIndex = new HoodieIncrementalFileIndex(
  sparkSession, metaClient, schemaSpec, options, FileStatusCache.getOrCreate(sparkSession), false, true)
private val incrementalFileIndex = new HoodieIncrementalFileIndex(
Should this be lazy as well?
It doesn't matter because we always use it right away
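For context on the exchange above, here is a minimal Scala sketch of the difference between `val` and `lazy val`. `FakeIndex` and `Holder` are hypothetical stand-ins, not Hudi classes. A plain `val` runs its initializer during construction, while a `lazy val` defers it until first access, so when the value is read immediately the two behave much the same.

```scala
import scala.collection.mutable.ArrayBuffer

object LazyValSketch {
  // Hypothetical stand-in for an expensive-to-build index; records when it is built.
  final class FakeIndex(name: String, log: ArrayBuffer[String]) {
    log += s"built $name"
  }

  final class Holder(log: ArrayBuffer[String]) {
    // Initialized while Holder is being constructed.
    val eager = new FakeIndex("eager", log)
    // Initialized only on first access, then cached.
    lazy val deferred = new FakeIndex("deferred", log)
  }
}
```

If `deferred` is accessed right after construction, the only practical cost of `lazy` is the extra initialization check on each access.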
Assignment can only be done to committers and PMC members. Assigning this to Ethan, but let @the-other-tim-brown do the first review and then hand it off to Ethan.
Typo in the ticket number. HUDI-9620 is correct: https://issues.apache.org/jira/browse/HUDI-9620
Change Logs
Refactor HoodieHadoopFsRelationFactory.
HoodieCDCFileIndex extends HoodieIncrementalFileIndex but doesn't use any of its methods. The only reason for the extension is that, in HoodieHadoopFsRelationFactory, getRequiredFilters and fileIndex.isInstanceOf[HoodieCDCFileIndex] are passed as params to HoodieFileGroupReaderBasedParquetFileFormat. Additionally, buildFileFormat() has a lot of duplicated code between classes and is difficult to read with all the boolean params. Finally, everything extends from MOR snapshot, and a bunch of vals are overridden.
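As general Scala background (not a claim about what this code base actually hit), overridden vals can be fragile because a superclass initializer runs before the subclass val is assigned. The sketch below uses hypothetical class names to contrast overriding a `val` with overriding a `def`.

```scala
object ValOverrideSketch {
  // Hypothetical classes for illustration only.
  abstract class Base {
    val label: String = "base"
    // Reads `label` through the overridden accessor while Base is still initializing.
    val description: String = "relation for " + label
  }

  class Child extends Base {
    // Assigned only after Base's constructor has finished.
    override val label: String = "child"
  }

  abstract class BaseWithDef {
    def label: String = "base"
    val description: String = "relation for " + label
  }

  class ChildWithDef extends BaseWithDef {
    // A method has no field to initialize, so it is usable during Base construction.
    override def label: String = "child"
  }
}
```

With the `val` version, `new Child().description` sees `label` as `null`. The `def` version yields the subclass value.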
This refactor removes the dependence between HoodieCDCFileIndex and HoodieIncrementalFileIndex. It reduces duplicated code by moving as much as possible into the abstract class and overriding methods instead of vals. The refactored code is better structured and splits MOR and COW in a more reasonable manner. We need to dependency-inject HoodieIncrementalFileIndex with v1 and v2 relations, which is part of the motivation for this refactor. Additionally, it was very tedious to change anything in HoodieHadoopFsRelationFactory because there was a chain of constructors about 8 classes deep that all needed to be changed.
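A rough sketch of the shape described here, with shared logic in an abstract class, subclasses overriding methods instead of vals, and the incremental index injected through the constructor. All names below (`BaseFactory`, `FileIndexLike`, `describeFileFormat`, and so on) are hypothetical and do not mirror the actual signatures in this PR.

```scala
object RelationFactorySketch {
  // Hypothetical stand-ins for the file index types.
  trait FileIndexLike { def name: String }
  final class SnapshotIndex extends FileIndexLike { val name = "snapshot" }
  final class IncrementalIndex extends FileIndexLike { val name = "incremental" }

  abstract class BaseFactory {
    // Subclasses override methods, so nothing depends on val initialization order.
    protected def buildFileIndex(): FileIndexLike
    protected def isIncremental: Boolean = false
    protected def isCDC: Boolean = false

    // Shared logic lives once in the abstract class instead of being copied per subclass.
    final def describeFileFormat(): String = {
      val index = buildFileIndex()
      s"format(index=${index.name}, incremental=$isIncremental, cdc=$isCDC)"
    }
  }

  final class SnapshotFactory extends BaseFactory {
    override protected def buildFileIndex(): FileIndexLike = new SnapshotIndex
  }

  // The index is passed in by the caller rather than constructed internally.
  final class IncrementalFactory(index: IncrementalIndex) extends BaseFactory {
    override protected def buildFileIndex(): FileIndexLike = index
    override protected def isIncremental: Boolean = true
  }
}
```

Named boolean methods such as `isIncremental` also replace positional `true`/`false` arguments, which is one way to address the readability concern about boolean params.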
Before this patch:

After this patch:

Impact
The code is cleaner.
Risk level (write none, low, medium or high below)
low
Documentation Update
N/A
Contributor's checklist