Skip to content

[HUDI-9620] Refactor HoodieHadoopFsRelationFactory #13527

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 22, 2025

Conversation

jonvex
Copy link
Contributor

@jonvex jonvex commented Jul 7, 2025

Change Logs

refactor hadoopfs relation factory

HoodieCDCFileIndex extends HoodieIncrementalFileIndex but doesn't use any of the methods, the only reason is because the extension in the HoodieHadoopFsRelationFactory getRequiredFilters and fileIndex.isInstanceOf[HoodieCDCFileIndex] are used as params to HoodieFileGroupReaderBasedParquetFileFormat. Additionally, buildFileFormat() has a lot of duplicate code between classes, and it is difficult to read with all the boolean params. Finally, everything is extended from mor snapshot, and a bunch of vals are overridden.

This refactor we get rid of the dependence between HoodieCDCFileIndex and HoodieIncrementalFileIndex. We reduce duplicated code by moving as much as we can to the abstract class and use methods to override instead of vals. The refactored code has good structure and breaks mor and cow up in a more reasonable manner. We need to dependency inject HoodieIncrementalFileIndex with v1 and v2 relations, so that is some of the motivation for this refactor. Additionally, it was very tedious to change anything in HoodieHadoopFsRelationFactory because there was a chain of constructors like 8 classes deep that all needed to be changed.

Before this patch:
before_refactoring_class_diagram

After this patch:
after_refactoring

Impact

code is more clean

Risk level (write none, low medium or high below)

low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Jul 7, 2025

override val fileIndex = new HoodieIncrementalFileIndex(
sparkSession, metaClient, schemaSpec, options, FileStatusCache.getOrCreate(sparkSession), false, true)
private val incrementalFileIndex = new HoodieIncrementalFileIndex(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be lazy as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't matter because we always use it right away

@nsivabalan
Copy link
Contributor

assignment can only be done to committers and PMCs. assigning this to Ethan. but let @the-other-tim-brown do the first review and hand it off to ethan

@jonvex jonvex marked this pull request as ready for review July 19, 2025 19:12
@jonvex jonvex changed the title [MINOR] refactor hadoopfsrelation factory [HUDI-6920] refactor hadoopfsrelation factory Jul 22, 2025
@jonvex jonvex changed the title [HUDI-6920] refactor hadoopfsrelation factory [HUDI-6920] Refactor HoodieHadoopFsRelationFactory Jul 22, 2025
@apache apache deleted a comment from hudi-bot Jul 22, 2025
@jonvex
Copy link
Contributor Author

jonvex commented Jul 22, 2025

image image azure ci all passing

@apache apache deleted a comment from hudi-bot Jul 22, 2025
@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@jonvex jonvex merged commit 74dfc94 into apache:master Jul 22, 2025
64 of 65 checks passed
@jonvex
Copy link
Contributor Author

jonvex commented Jul 22, 2025

Typo in the ticket number. HUDI-9620 is correct: https://issues.apache.org/jira/browse/HUDI-9620

@jonvex jonvex changed the title [HUDI-6920] Refactor HoodieHadoopFsRelationFactory [HUDI-9620] Refactor HoodieHadoopFsRelationFactory Jul 22, 2025
rahil-c pushed a commit to rahil-c/hudi that referenced this pull request Aug 5, 2025
rahil-c pushed a commit to rahil-c/hudi that referenced this pull request Aug 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:M PR with lines of changes in (100, 300]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants