[DRAFT][DNM] CoW write handle with file group reader #13699


Draft · the-other-tim-brown wants to merge 28 commits into master from cow-merge-handle-to-fgr-3

Conversation

the-other-tim-brown (Contributor):

Change Logs

Describe context and summary for this change. Highlight if any code was copied.

Impact

Describe any public API or user-facing feature change or any performance impact.

Risk level (write none, low, medium, or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions bot added the size:XL label (PR with lines of changes > 1000) on Aug 9, 2025
@@ -119,6 +119,9 @@ public static Iterator<List<WriteStatus>> runMerge(HoodieMergeHandle<?, ?, ?, ?>
"Error in finding the old file path at commit " + instantTime + " for fileId: " + fileId);
} else {
mergeHandle.doMerge();
if (mergeHandle instanceof FileGroupReaderBasedMergeHandle) {
mergeHandle.close();

Contributor Author:
Open question: Is there any reason to avoid calling close on the other merge handles?
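
For reference, a minimal sketch of the alternative the question raises: close every handle unconditionally after the merge. The interface below is a hypothetical stand-in, and whether close() is actually safe for the other handle types is exactly the open question.

```java
// Hypothetical stand-in for the merge handle API; not the actual Hudi classes.
final class MergeHandleCloseSketch {
  interface MergeHandle extends AutoCloseable {
    void doMerge();
  }

  static void runMerge(MergeHandle mergeHandle) throws Exception {
    mergeHandle.doMerge();
    // Close unconditionally instead of guarding with
    // `if (mergeHandle instanceof FileGroupReaderBasedMergeHandle)`.
    mergeHandle.close();
  }
}
```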

indexStats.addHoodieRecordDelegate(HoodieRecordDelegate.fromHoodieRecord(record));
}
updateStatsForSuccess(optionalRecordMetadata);
}

public void manuallyTrackSuccess() {
this.manuallyTrackIndexUpdates = true;

Contributor:
We can just set trackSuccessRecords to false here.

@@ -430,16 +430,11 @@ private static <R> Option<HoodieRecord<R>> mergeIncomingWithExistingRecordWithEx
//the record was deleted
return Option.empty();
}
if (mergeResult.getRecord() == null) {
// SENTINEL case: the record did not match and merge case and should not be modified
if (mergeResult.getRecord() == null || mergeResult == existingBufferedRecord) {

Contributor:
If this is only for MERGE INTO, and MERGE INTO is only ever used with the EVENT_TIME merge mode, then the reference-equality check against the buffered record is valid; otherwise, it would be better to use record equality here.
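
A minimal sketch of the two checks under discussion, using a hypothetical record type; the reviewer's point is that value equality also covers a merger that returns an equal copy rather than the same instance, assuming the record type implements equals() meaningfully.

```java
// Hypothetical stand-in; BufferedRecord's actual equals semantics are not assumed here.
final class SentinelCheckSketch {
  // Current draft: reference equality, relies on the merger returning the same instance
  // for the "no match, leave unmodified" case.
  static <T> boolean isSentinelByReference(T mergeResult, T existingBufferedRecord) {
    return mergeResult == existingBufferedRecord;
  }

  // Reviewer's suggestion: value equality, which also detects an equal copy returned by the merger.
  static <T> boolean isSentinelByValue(T mergeResult, T existingBufferedRecord) {
    return mergeResult.equals(existingBufferedRecord);
  }
}
```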

*/
public FileGroupReaderBasedMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTable<T, I, K, O> hoodieTable,
Iterator<HoodieRecord<T>> recordItr, String partitionPath, String fileId,
TaskContextSupplier taskContextSupplier, Option<BaseKeyGenerator> keyGeneratorOpt, HoodieReaderContext<T> readerContext) {

Contributor:
Do we need to pass around the readerContext explicitly here? Can we use hoodieTable.getContext().getReaderContextFactoryForWrite() instead?

Contributor Author:
The issue is that the merge handles are created on the executors in Spark, so hoodieTable.getContext() will always return a local engine context instead of a Spark engine context when one is required.

Contributor:
Quoting the above ("always return a local engine context instead of a spark engine context when required"): can we fix that, e.g. with something like hoodieTable.getContextForWrite?

Contributor Author:
Is getContextForWrite returning an EngineContext here?

init(operation, this.partitionPath);
this.props = TypedProperties.copy(config.getProps());
this.isCompaction = true;
initRecordIndexCallback();

Contributor:
Do we even need to track RLI for compactions?

initRecordTypeAndCdcLogger(enginRecordType);
init(operation, this.partitionPath);
this.props = TypedProperties.copy(config.getProps());
this.isCompaction = true;

danny0405 (Contributor), Aug 11, 2025:
We already have the preserveMetadata flag to distinguish table-service writes from regular writes; can we continue to use that? Some functions, like SI tracing, already rely on the preserveMetadata flag. And it seems clustering also uses this constructor.
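
A small sketch of the suggestion, with hypothetical field names: reuse the existing preserveMetadata flag (which SI tracing already relies on) to identify table-service writes such as compaction and clustering, rather than adding a separate isCompaction flag.

```java
// Hypothetical stand-in for the handle's state; names are illustrative only.
final class TableServiceFlagSketch {
  private final boolean preserveMetadata;

  TableServiceFlagSketch(boolean preserveMetadata) {
    this.preserveMetadata = preserveMetadata;
  }

  // Table services (compaction, clustering) construct the handle with preserveMetadata = true,
  // so the same flag can distinguish them from regular writes.
  boolean isTableServiceWrite() {
    return preserveMetadata;
  }
}
```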

}

private void initRecordIndexCallback() {
if (this.writeStatus.isTrackingSuccessfulWrites()) {

Contributor:
The isTrackingSuccessfulWrites flag in the write status comes from hoodieTable.shouldTrackSuccessRecords(), which is true when RLI or partitioned RLI is enabled. We should skip the location tracing for compaction, where it is redundant.
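
A sketch of the suggested guard, with hypothetical names: skip wiring the record-index callback when the handle is performing compaction, even if the write status is tracking successful writes.

```java
// Hypothetical stand-ins; the real handle holds these as instance state.
final class RecordIndexCallbackSketch {
  private final boolean isCompaction;
  private final boolean trackingSuccessfulWrites;

  RecordIndexCallbackSketch(boolean isCompaction, boolean trackingSuccessfulWrites) {
    this.isCompaction = isCompaction;
    this.trackingSuccessfulWrites = trackingSuccessfulWrites;
  }

  // Compaction keeps records in the same file group, so RLI location tracing adds nothing new.
  boolean shouldInitRecordIndexCallback() {
    return !isCompaction && trackingSuccessfulWrites;
  }
}
```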

private void populateIncomingRecordsMapIterator(Iterator<HoodieRecord<T>> newRecordsItr) {
if (!isCompaction) {
// avoid populating external spillable in base {@link HoodieWriteMergeHandle)
this.incomingRecordsItr = new MappingIterator<>(newRecordsItr, record -> (HoodieRecord) record);

Contributor:
Is this still needed?

this.secondaryIndexCallbackOpt = Option.empty();
}
secondaryIndexCallbackOpt.ifPresent(callbacks::add);
return callbacks.isEmpty() ? Option.empty() : Option.of(CompositeCallback.of(callbacks));

Contributor:
All the callbacks can be initialized as local variables, and there is no need to use CompositeCallback when there is only a single callback.
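
A sketch of the suggestion using hypothetical stand-in types: return the single callback directly and only build a composite when there is more than one.

```java
import java.util.List;
import java.util.Optional;

final class CallbackCompositionSketch {
  // Hypothetical callback interface standing in for the Hudi callback type.
  interface Callback {
    void onEvent(String recordKey);
  }

  static Optional<Callback> compose(List<Callback> callbacks) {
    if (callbacks.isEmpty()) {
      return Optional.empty();
    }
    if (callbacks.size() == 1) {
      // No need to wrap a single callback in a composite.
      return Optional.of(callbacks.get(0));
    }
    // Fan out to all callbacks only when there are several.
    Callback composite = key -> callbacks.forEach(cb -> cb.onEvent(key));
    return Optional.of(composite);
  }
}
```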

*/
public ReaderContextFactory<?> getReaderContextFactoryForWrite(HoodieTableMetaClient metaClient, HoodieRecord.HoodieRecordType recordType,
TypedProperties properties) {
TypedProperties properties, boolean outputsCustomPayloads) {

Contributor:
This flag is only meaningful for the Avro reader context. Is there any way we can constrain it to just AvroReaderContextFactory?

Contributor Author:
I didn't find a good way right now. This flag really represents two different stages of the writer path: the dedupe/indexing stages and the final write. In the final write, we never want to use the payload-based records, since we just want the final indexed representation of the record.

Contributor:
We have a plan to abandon the payload-based records in the writer path, right? So this should be just a temporary solution?

Contributor Author:
We'll still need it for ExpressionPayload and for any user-provided payload, so it is not temporary, but these restrictions may allow us to clean things up further.


@Override
public void onUpdate(String recordKey, BufferedRecord<T> previousRecord, BufferedRecord<T> mergedRecord) {
writeStatus.addRecordDelegate(HoodieRecordDelegate.create(recordKey, partitionPath, fileRecordLocation, fileRecordLocation, mergedRecord.getHoodieOperation() == HoodieOperation.UPDATE_BEFORE));

Contributor:
Do we even need to add the delegate when mergedRecord.getHoodieOperation() == HoodieOperation.UPDATE_BEFORE is true?

Contributor Author:
The write status will still be updated in the current code with this record delegate even though ignoreIndexUpdate is true. This is keeping parity with the old system but I am not sure of the context for this.

Contributor:
The flag is used when all the delegates are collected on the driver and used to calculate the RLI index items for the MDT. Delegates with ignoreIndexUpdate set to true are just dropped directly, so there is no need to even generate and collect them.
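
A sketch of the point being made, with hypothetical stand-ins: the driver drops delegates flagged with ignoreIndexUpdate before computing RLI entries for the MDT, so the executor could avoid generating them at all.

```java
import java.util.ArrayList;
import java.util.List;

final class DelegateEmissionSketch {
  // Hypothetical stand-in for HoodieRecordDelegate; only the flag matters here.
  record Delegate(String recordKey, boolean ignoreIndexUpdate) {}

  // Driver side today: flagged delegates are filtered out before the RLI update.
  static List<Delegate> buildRliUpdates(List<Delegate> collected) {
    List<Delegate> updates = new ArrayList<>();
    for (Delegate d : collected) {
      if (!d.ignoreIndexUpdate()) {
        updates.add(d);
      }
    }
    return updates;
  }

  // Executor side (suggested): skip emitting the delegate when it would be ignored anyway.
  static void emitIfRelevant(List<Delegate> out, String recordKey, boolean ignoreIndexUpdate) {
    if (!ignoreIndexUpdate) {
      out.add(new Delegate(recordKey, false));
    }
  }
}
```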


@Override
public void onInsert(String recordKey, BufferedRecord<T> newRecord) {
writeStatus.addRecordDelegate(HoodieRecordDelegate.create(recordKey, partitionPath, null, fileRecordLocation, newRecord.getHoodieOperation() == HoodieOperation.UPDATE_BEFORE));

Contributor:
newRecord.getHoodieOperation() == HoodieOperation.UPDATE_BEFORE is always false.

Contributor Author:
It's always false today, but do we want to keep this in case there is a future scenario where it no longer holds?

Contributor:
Regarding "but do we want to keep this in case there is some future case": I don't think so; the hoodie operation is designed to be force-set there.

public void onDelete(String recordKey, BufferedRecord<T> previousRecord, HoodieOperation hoodieOperation) {
// The update before operation is used when a deletion is being sent to the old File Group in a different partition.
// In this case, we do not want to delete the record metadata from the index.
writeStatus.addRecordDelegate(HoodieRecordDelegate.create(recordKey, partitionPath, fileRecordLocation, null, hoodieOperation == HoodieOperation.UPDATE_BEFORE));

Contributor:
hoodieOperation == HoodieOperation.UPDATE_BEFORE is always false.
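
If the observation holds for both the insert and delete paths, the delegate flag can be hard-coded rather than re-derived per record; a sketch with hypothetical stand-ins:

```java
final class DelegateFlagSketch {
  // Hypothetical stand-in for HoodieRecordDelegate; the boolean mirrors the ignoreIndexUpdate flag.
  record Delegate(String recordKey, boolean ignoreIndexUpdate) {}

  // On the insert path the operation can never be UPDATE_BEFORE, so the flag is simply false.
  static Delegate onInsertDelegate(String recordKey) {
    return new Delegate(recordKey, false);
  }

  // Same reasoning for the delete path, per the reviewer's comment above.
  static Delegate onDeleteDelegate(String recordKey) {
    return new Delegate(recordKey, false);
  }
}
```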

} else {
Schema readerSchema = readerContext.getSchemaHandler().getRequestedSchema();
// If the record schema is different from the reader schema, rewrite the record using the payload methods to ensure consistency with legacy writer paths
if (!readerSchema.equals(recordSchema)) {

danny0405 (Contributor), Aug 11, 2025:
This could be super costly. Can it be simplified by checking the number of fields?
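
One reading of the suggestion, sketched below with Avro's Schema API: if the only difference that can occur on this path is a projected subset of fields, comparing field counts is a much cheaper test than a deep Schema.equals. Whether that assumption holds for this code path would need to be confirmed.

```java
import org.apache.avro.Schema;

final class SchemaRewriteCheckSketch {
  // Cheap proxy for "schemas differ": compare top-level field counts instead of deep equality.
  // Only valid if the record schema can differ from the reader schema solely by projection,
  // and both schemas are record schemas.
  static boolean needsRewrite(Schema readerSchema, Schema recordSchema) {
    return readerSchema.getFields().size() != recordSchema.getFields().size();
  }
}
```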

static <T> StreamingFileGroupRecordBufferLoader<T> getInstance() {
return INSTANCE;
StreamingFileGroupRecordBufferLoader(Schema recordSchema) {
this.recordSchema = recordSchema;

Contributor:
There is no need to pass around the schema explicitly. It is actually the writeSchema, which equals schemaHandler.requestedSchema minus the metadata fields; we already have a utility method for it: HoodieAvroUtils.removeMetadataFields.
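
A sketch of the suggested derivation; HoodieAvroUtils.removeMetadataFields is the utility named above, and the requestedSchema parameter stands in for the schema handler's requested schema.

```java
import org.apache.avro.Schema;
import org.apache.hudi.avro.HoodieAvroUtils;

final class WriteSchemaSketch {
  // Derive the write schema by stripping Hudi metadata fields from the requested schema,
  // instead of threading a separate schema through the loader constructor.
  static Schema deriveWriteSchema(Schema requestedSchema) {
    return HoodieAvroUtils.removeMetadataFields(requestedSchema);
  }
}
```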

* @param writeStatus The Write status
* @param secondaryIndexDefns Definitions for secondary index which need to be updated
*/
static <T> void trackSecondaryIndexStats(HoodieKey hoodieKey, Option<BufferedRecord<T>> combinedRecordOpt, @Nullable BufferedRecord<T> oldRecord, boolean isDelete,

Contributor Author:
This method mirrors the one above it but operates directly on BufferedRecord instead of converting to HoodieRecord.

@the-other-tim-brown the-other-tim-brown force-pushed the cow-merge-handle-to-fgr-3 branch from 50022dc to 47f8303 Compare August 11, 2025 16:18
import static org.apache.hudi.common.table.log.block.HoodieLogBlock.HeaderMetadataType.INSTANT_TIME;

abstract class FileGroupRecordBuffer<T> implements HoodieFileGroupRecordBuffer<T> {
protected final Set<String> usedKeys = new HashSet<>();

Contributor Author:
There is a possibility of duplicate keys in a file, and there is an expectation that updates are applied to both rows. See the test "Test only insert for source table in dup key without preCombineField" for an example. We need to figure out if there is a better way to handle this.

@hudi-bot

CI report:

Bot commands — @hudi-bot supports the following commands:
  • @hudi-bot run azure — re-run the last Azure build
