feat: Iceberg scan based serializing FileScanTasks to iceberg-rust #2528

mbutrovich · 2025-10-06T02:37:21Z

This is mostly for discussion at the moment. Slides from the 10/9/25 Iceberg-Rust community call, where I presented this effort, are available here.

Rationale for this change

I was inspired by @RussellSpitzer's recent talk and wanted to revisit the abstraction layer at which Comet integrates with Iceberg. We have the iceberg_compat codepath for Iceberg integration, but it requires code changes in Iceberg Java to hook into Parquet reader instantiation. Instead, this prototype works at the FileScanTask layer, after planning. It is a first step toward fully native Iceberg scans that match our native_datafusion Parquet scan logic, without any changes in upstream Iceberg Java code.

What changes are included in this PR?

New CometIcebergNativeScanExec node on the Scala side.
Use reflection to extract scan properties (mostly FileScanTasks) and serialize them to native code.
New IcebergScanExec on native side that uses FileScanTasks to perform reads in iceberg-rust.
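
To make the serialization step above concrete, here is a minimal Rust sketch of passing scan tasks across the JVM/native boundary as bytes. Everything in it is an assumption for illustration: the `ScanTask` struct, its three fields, and the length-prefixed encoding are hypothetical and are not the types or wire format used in this PR. A real Iceberg FileScanTask also carries more than a path and byte range (for example delete files and a residual filter).

```rust
use std::convert::TryInto;

/// Hypothetical, simplified stand-in for one Iceberg FileScanTask.
#[derive(Debug, Clone, PartialEq)]
pub struct ScanTask {
    pub data_file_path: String,
    pub start: u64,
    pub length: u64,
}

/// Split `n` bytes off the front of `buf`, or return None if it is too short.
fn take(buf: &[u8], n: usize) -> Option<(&[u8], &[u8])> {
    if buf.len() < n {
        None
    } else {
        Some(buf.split_at(n))
    }
}

/// Encode tasks as: u32 task count, then per task a u32 path length,
/// the UTF-8 path bytes, and two little-endian u64 values (start, length).
pub fn encode_tasks(tasks: &[ScanTask]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(tasks.len() as u32).to_le_bytes());
    for task in tasks {
        let path = task.data_file_path.as_bytes();
        out.extend_from_slice(&(path.len() as u32).to_le_bytes());
        out.extend_from_slice(path);
        out.extend_from_slice(&task.start.to_le_bytes());
        out.extend_from_slice(&task.length.to_le_bytes());
    }
    out
}

/// Decode the format written by `encode_tasks`. Returns None on truncated
/// input, invalid UTF-8 in a path, or unexpected trailing bytes.
pub fn decode_tasks(buf: &[u8]) -> Option<Vec<ScanTask>> {
    let (count_bytes, mut rest) = take(buf, 4)?;
    let count = u32::from_le_bytes(count_bytes.try_into().ok()?);
    let mut tasks = Vec::new();
    for _ in 0..count {
        let (len_bytes, r) = take(rest, 4)?;
        let path_len = u32::from_le_bytes(len_bytes.try_into().ok()?) as usize;
        let (path_bytes, r) = take(r, path_len)?;
        let (start_bytes, r) = take(r, 8)?;
        let (length_bytes, r) = take(r, 8)?;
        tasks.push(ScanTask {
            data_file_path: String::from_utf8(path_bytes.to_vec()).ok()?,
            start: u64::from_le_bytes(start_bytes.try_into().ok()?),
            length: u64::from_le_bytes(length_bytes.try_into().ok()?),
        });
        rest = r;
    }
    if rest.is_empty() {
        Some(tasks)
    } else {
        None
    }
}
```

In this sketch the producer would call `encode_tasks` once per partition and the consumer would call `decode_tasks` before opening any files, so a malformed buffer is rejected up front rather than partway through a read.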

How are these changes tested?

New CometIcebergNativeSuite.

Benefits over `iceberg_compat`?

No upstream code changes needed in Iceberg Java, no references to Comet needed in Iceberg anymore.
Better parallelism for file reading, more similar to native_datafusion.
No separate DataFusion runtime: these scans run in the same context as other operators (unlike iceberg_compat).
Better testing for iceberg-rust. I think I already found a shortcoming with row group pruning logic.
Tested with Iceberg 1.5, 1.7, 1.10.

Current Limitations/Concerns?

I lied about no upstream changes. I need one line changed in iceberg-rust and will open a PR there to make an API public. Currently this PR relies on my fork of iceberg-rust.
Need to try running Iceberg Java tests with this. I need to look at our current pipelines, since in theory we don’t want to apply the diff for iceberg_compat to Iceberg.
Need to explore/validate OpenDAL support for credential providers.
We'd need to try to keep iceberg-rust in sync with Comet's DataFusion dependency. I also had to bump my iceberg-rust fork to DataFusion 50.
We've already entangled Comet and Iceberg Java code, what would the deprecation of that code look like?
iceberg-rust uses RecordBatchTransformer instead of DataFusion's SchemaAdapter/PhysicalExprAdapter. Need to understand the compatibility gap there.
Don't have access to ArrowReaderOptions yet (needed for proper Spark-compatible INT96 handling). See https://github.com/apache/iceberg-rust/blob/dc349284a4204c1a56af47fb3177ace6f9e899a0/crates/iceberg/src/arrow/reader.rs#L1384
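
For background on the INT96 item: Parquet's legacy INT96 timestamp is 12 bytes, with the nanoseconds within the day in the first 8 bytes and a Julian day number in the last 4 (both little-endian), while Spark timestamps have microsecond precision. The sketch below is illustrative only and is not code from Comet, arrow-rs, or iceberg-rust; all function names are made up. It shows why converting to microseconds covers dates that a 64-bit nanosecond timestamp cannot represent.

```rust
/// Julian day number of the Unix epoch (1970-01-01).
const JULIAN_DAY_OF_UNIX_EPOCH: i64 = 2_440_588;
const MICROS_PER_DAY: i64 = 86_400_000_000;
const NANOS_PER_DAY: i64 = 86_400_000_000_000;

/// Build a raw INT96 value (helper for the example only).
pub fn make_int96(julian_day: i32, nanos_of_day: i64) -> [u8; 12] {
    let mut raw = [0u8; 12];
    raw[..8].copy_from_slice(&nanos_of_day.to_le_bytes());
    raw[8..].copy_from_slice(&julian_day.to_le_bytes());
    raw
}

/// Split a raw INT96 value into (julian_day, nanos_of_day).
fn split_int96(raw: [u8; 12]) -> (i64, i64) {
    let mut nanos = [0u8; 8];
    nanos.copy_from_slice(&raw[..8]);
    let mut day = [0u8; 4];
    day.copy_from_slice(&raw[8..]);
    (i64::from(i32::from_le_bytes(day)), i64::from_le_bytes(nanos))
}

/// Microseconds since the Unix epoch; sub-microsecond digits are dropped.
/// Returns None only if the result does not fit in an i64.
pub fn int96_to_micros(raw: [u8; 12]) -> Option<i64> {
    let (day, nanos) = split_int96(raw);
    (day - JULIAN_DAY_OF_UNIX_EPOCH)
        .checked_mul(MICROS_PER_DAY)?
        .checked_add(nanos / 1_000)
}

/// Nanoseconds since the Unix epoch. Returns None on i64 overflow, which
/// happens for dates roughly 292 years or more away from 1970.
pub fn int96_to_nanos(raw: [u8; 12]) -> Option<i64> {
    let (day, nanos) = split_int96(raw);
    (day - JULIAN_DAY_OF_UNIX_EPOCH)
        .checked_mul(NANOS_PER_DAY)?
        .checked_add(nanos)
}
```

With these definitions, a value 200,000 days past the epoch (around the year 2517) converts to microseconds but yields `None` from `int96_to_nanos`, which is the kind of mismatch a reader has to avoid to stay compatible with Spark.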

codecov-commenter · 2025-10-06T02:54:32Z

Codecov Report

❌ Patch coverage is 76.47059% with 92 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.71%. Comparing base (f09f8af) to head (236b339).
⚠️ Report is 631 commits behind head on main.

Files with missing lines	Patch %	Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala	74.34%	56 Missing and 13 partials ⚠️
...e/spark/sql/comet/CometIcebergNativeScanExec.scala	85.10%	3 Missing and 11 partials ⚠️
...n/scala/org/apache/comet/rules/CometExecRule.scala	53.84%	3 Missing and 3 partials ⚠️
...n/scala/org/apache/comet/rules/CometScanRule.scala	60.00%	0 Missing and 2 partials ⚠️
...la/org/apache/comet/objectstore/NativeConfig.scala	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2528      +/-   ##
============================================
+ Coverage     56.12%   59.71%   +3.58%     
- Complexity      976     1461     +485     
============================================
  Files           119      148      +29     
  Lines         11743    14117    +2374     
  Branches       2251     2423     +172     
============================================
+ Hits           6591     8430    +1839     
- Misses         4012     4435     +423     
- Partials       1140     1252     +112


comphead · 2025-10-06T15:22:35Z

It is promising!

## Which issue does this PR close?

- Part of #1749.

## What changes are included in this PR?

- Change `ArrowReaderBuilder::new` to be `pub` instead of `pub(crate)`.

## Are these changes tested?

- No new tests for this. Currently being used in DataFusion Comet: apache/datafusion-comet#2528

mbutrovich added 3 commits October 5, 2025 21:53

CometNativeIcebergScan with iceberg-rust using FileScanTasks.

cded0ad

Clean up tests a little.

4f3004b

Remove old comment.

4afec43

mbutrovich added 6 commits October 6, 2025 06:58

Fix machete and missing suite CI failures.

fc97ce9

Fix unused variables.

cca4911

Spark 4.0 needs Iceberg 1.10, let's see if that works in CI.

93f466d

Remove errant println.

970b692

Remove old path() code path.

c44973b

Update old comment.

0f83fd4

mbutrovich added 2 commits October 6, 2025 11:49

Iceberg 1.5.x compatible reflection. Use 1.5.2 for Spark 3.4 and 3.5.

6cbbd09

Fix scalastyle issues.

6966a12

mbutrovich changed the title ~~feat: Iceberg scan based serializing FileScanTasks to iceberg-rust~~ feat: [iceberg] Scan based serializing FileScanTasks to iceberg-rust Oct 6, 2025

mbutrovich force-pushed the iceberg-rust branch from 227332c to 6966a12 Compare October 6, 2025 20:03

mbutrovich changed the title ~~feat: [iceberg] Scan based serializing FileScanTasks to iceberg-rust~~ feat: Iceberg scan based serializing FileScanTasks to iceberg-rust Oct 6, 2025

mbutrovich added 7 commits October 7, 2025 13:03

Merge branch 'main' into iceberg-rust

1153d71

# Conflicts: # native/Cargo.lock # spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala

Remove unused import.

a0f4d63

Clean up docs a bit.

a9cebfd

Refactor and cleanup.

6b2175a

Refactor and cleanup.

3618407

Add IcebergFileStream based on DataFusion, add benchmark. Bump the Iceberg version back to 1.8.1 after hitting known segfaults with old versions.

8091a81

Fix CometReadBenchmark.

880599e

This was referenced Oct 15, 2025

feat(reader): Make ArrowReaderBuilder::new public apache/iceberg-rust#1748

Merged

ArrowReader enhancements for Apache DataFusion Comet apache/iceberg-rust#1749

Open

mbutrovich added 4 commits October 16, 2025 16:04

Merge branch 'main' into iceberg-rust

5127e1c

# Conflicts: # docs/source/user-guide/latest/configs.md # native/Cargo.lock # native/Cargo.toml # native/core/Cargo.toml

Fixes after bringing in upstream/main.

878c971

Basic complex type support.

e66799e

CometFuzzIceberg stuff.

4f2f3b8

mbutrovich mentioned this pull request Oct 21, 2025

tests: FuzzDataGenerator instead of Parquet-specific generator #2616

Merged

mbutrovich added 5 commits October 21, 2025 11:24

Merge branch 'main' into iceberg-rust

71df65c

# Conflicts: # native/Cargo.lock

format and fix conflicts.

3371cc1

Basic S3 test and properties support

1c40d43

Fix NPE.

40c9a07

Merge branch 'main' into iceberg-rust

19797f3

# Conflicts: # spark/src/main/scala/org/apache/comet/testing/FuzzDataGenerator.scala

mbutrovich mentioned this pull request Oct 22, 2025

feat(reader): position-based column projection for Parquet files without field IDs (migrated tables) apache/iceberg-rust#1777

Open

Support migrated tables via apache/iceberg-rust#1777.

236b339
