Skip to content

Conversation

mbutrovich
Copy link
Contributor

@mbutrovich mbutrovich commented Oct 6, 2025

This is mostly for discussion at the moment. There are slides from the 10/9/25 Iceberg-Rust community call here where I presented this effort here.

Rationale for this change

I was inspired by @RussellSpitzer's recent talk and wanted to revisit the abstraction layer at which Comet integrates with Iceberg. We have the iceberg_compat codepath for Iceberg integration, but this requires code changes in Iceberg Java to integrate with Parquet reader instantiation. Instead, this prototype works at the FileScanTask layer after planning. This prototype starts us toward fully-native Iceberg scans to match our Parquet logic with native_datafusion scans without any changes in upstream Iceberg Java code.

What changes are included in this PR?

  • New CometIcebergNativeScanExec node on the Scala side.
  • Use reflection to extract scan properties, mostly FileScanTasks and serialize to native code.
  • New IcebergScanExec on native side that uses FileScanTasks to perform reads in iceberg-rust.

How are these changes tested?

New CometIcebergNativeSuite.

Benefits over iceberg_compat?

  • No upstream code changes needed in Iceberg Java, no references to Comet needed in Iceberg anymore.
  • Better parallelism for file reading, more similar to native_datafusion.
  • No separate DataFusion runtime, these run in the same context as other operators (compared to iceberg_compat).
  • Better testing for iceberg-rust. I think I already found a shortcoming with row group pruning logic.
  • Tested with Iceberg 1.5, 1.7, 1.10.

Current Limitations/Concerns?

  • I lied about no upstream changes. I need one line changed in iceberg-rust and will open a PR there to make an API public. Currently this PR relies on my fork of iceberg-rust.
  • Need to try running Iceberg Java tests with this. I need to look at our current pipelines, since in theory we don’t want to apply the diff for iceberg_compat to Iceberg.
  • Need to explore/validate OpenDAL support for credential providers.
  • We'd need to try to keep iceberg-rust in sync with Comet's DataFusion dependency. I also had to bump my iceberg-rust fork to DataFusion 50.
  • We've already entangled Comet and Iceberg Java code, what would the deprecation of that code look like?
  • RecordBatchTransformer instead of SchemaAdapter/PhysicalExprAdapter. Need to understand the compatibility gap there.
  • Don't have access to ArrowReaderOptions yet (needed for proper Spark-compatible INT96 handling) https://github.com/apache/iceberg-rust/blob/dc349284a4204c1a56af47fb3177ace6f9e899a0/crates/iceberg/src/arrow/reader.rs#L1384.

@codecov-commenter
Copy link

codecov-commenter commented Oct 6, 2025

Codecov Report

❌ Patch coverage is 76.47059% with 92 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.71%. Comparing base (f09f8af) to head (236b339).
⚠️ Report is 631 commits behind head on main.

Files with missing lines Patch % Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala 74.34% 56 Missing and 13 partials ⚠️
...e/spark/sql/comet/CometIcebergNativeScanExec.scala 85.10% 3 Missing and 11 partials ⚠️
...n/scala/org/apache/comet/rules/CometExecRule.scala 53.84% 3 Missing and 3 partials ⚠️
...n/scala/org/apache/comet/rules/CometScanRule.scala 60.00% 0 Missing and 2 partials ⚠️
...la/org/apache/comet/objectstore/NativeConfig.scala 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2528      +/-   ##
============================================
+ Coverage     56.12%   59.71%   +3.58%     
- Complexity      976     1461     +485     
============================================
  Files           119      148      +29     
  Lines         11743    14117    +2374     
  Branches       2251     2423     +172     
============================================
+ Hits           6591     8430    +1839     
- Misses         4012     4435     +423     
- Partials       1140     1252     +112     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@comphead
Copy link
Contributor

comphead commented Oct 6, 2025

It is promising!

@mbutrovich mbutrovich changed the title feat: Iceberg scan based serializing FileScanTasks to iceberg-rust feat: [iceberg] Scan based serializing FileScanTasks to iceberg-rust Oct 6, 2025
@mbutrovich mbutrovich changed the title feat: [iceberg] Scan based serializing FileScanTasks to iceberg-rust feat: Iceberg scan based serializing FileScanTasks to iceberg-rust Oct 6, 2025
# Conflicts:
#	native/Cargo.lock
#	spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala
…eberg version back to 1.8.1 after hitting known segfaults with old versions.
liurenjie1024 pushed a commit to apache/iceberg-rust that referenced this pull request Oct 16, 2025
## Which issue does this PR close?


- Part of #1749.

## What changes are included in this PR?

- Change `ArrowReaderBuilder::new` to be `pub` instead of `pub(crate)`.

## Are these changes tested?

- No new tests for this. Currently being used in DataFusion Comet:
apache/datafusion-comet#2528
# Conflicts:
#	docs/source/user-guide/latest/configs.md
#	native/Cargo.lock
#	native/Cargo.toml
#	native/core/Cargo.toml
# Conflicts:
#	native/Cargo.lock
# Conflicts:
#	spark/src/main/scala/org/apache/comet/testing/FuzzDataGenerator.scala
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants