Spark view creation is slow as it performs redundant table scan planning #14680

@jbewing

Description

Apache Iceberg version

1.10.0 (latest release)

Query engine

Spark

Please describe the bug 🐞

What

Currently, when creating a view via Apache Spark (I'm running a fork of 3.5), I've observed that view creation via

SparkSession spark = ...;
spark.sql("CREATE VIEW ... ")

is slow, and the slowness generally scales with the size of the underlying table. When creating views over a large table (hundreds of TBs), creating even a modest number of views (a couple thousand) takes roughly 12 hours and requires a moderately sized Spark cluster (~100 CPUs).

This seems excessive. Based on logs and flamegraphs, we can see that the overhead comes from Spark performing unnecessary optimization of the view body. For the following unit test:

  @TestTemplate
  public void createViewWithGroupByOrdinal() throws NoSuchTableException {
    insertRows(3);
    insertRows(2);
    String viewName = viewName("createViewWithGroupByOrdinal");
    sql("CREATE VIEW %s AS SELECT id, count(1) FROM %s GROUP BY 1", viewName, tableName);

    assertThat(sql("SELECT * FROM %s", viewName))
        .hasSize(3)
        .containsExactlyInAnyOrder(row(1, 2L), row(2, 2L), row(3, 1L));
  }

You can observe logs such as:

[Test worker] INFO org.apache.spark.sql.execution.datasources.v2.AppendDataExec - Data source write support IcebergBatchWrite(table=spark_with_views.default.table, format=PARQUET) committed.
[Test worker] INFO org.apache.iceberg.BaseMetastoreTableOperations - Refreshing table metadata from new version: /default/table/metadata/00002-b53fe6fa-4245-4fb3-abea-81a98c8cc849.metadata.json
[Test worker] INFO org.apache.iceberg.BaseMetastoreCatalog - Table loaded by catalog: spark_with_views.default.table
[Test worker] INFO org.apache.iceberg.spark.source.SparkScanBuilder - Skipping aggregate pushdown: group by aggregation push down is not supported
[Test worker] INFO org.apache.spark.sql.execution.datasources.v2.V2ScanRelationPushDown - 
Output: id#13
           
[Test worker] INFO org.apache.iceberg.SnapshotScan - Scanning table spark_with_views.default.table snapshot 2224231829492223177 created at 2025-11-25T03:47:31.504+00:00 with filter true
[Test worker] INFO org.apache.iceberg.BaseDistributedDataScan - Planning file tasks locally for table spark_with_views.default.table
[Test worker] INFO org.apache.iceberg.spark.source.SparkPartitioningAwareScan - Reporting UnknownPartitioning with 1 partition(s) for table spark_with_views.default.table
[Test worker] INFO org.apache.iceberg.view.BaseMetastoreViewCatalog - View properties set at catalog level through catalog properties: {}
[Test worker] INFO org.apache.iceberg.view.BaseMetastoreViewCatalog - View properties enforced at catalog level through catalog properties: {}
[Test worker] INFO org.apache.iceberg.view.BaseViewOperations - Successfully committed to view spark_with_views.default.createViewWithGroupByOrdinal989199 in 2 ms
[Test worker] INFO org.apache.iceberg.view.BaseViewOperations - Refreshing view metadata from new version: /default/createViewWithGroupByOrdinal989199/metadata/00000-66728e6c-ae56-400f-96dd-dbd36dba8ed9.gz.metadata.json
[Test worker] INFO org.apache.iceberg.view.BaseViewOperations - Refreshing view metadata from new version: /default/createViewWithGroupByOrdinal989199/metadata/00000-66728e6c-ae56-400f-96dd-dbd36dba8ed9.gz.metadata.json
[Test worker] INFO org.apache.iceberg.BaseMetastoreTableOperations - Refreshing table metadata from new version: /default/table/metadata/00002-b53fe6fa-4245-4fb3-abea-81a98c8cc849.metadata.json
[Test worker] INFO org.apache.iceberg.BaseMetastoreCatalog - Table loaded by catalog: spark_with_views.default.table
[Test worker] INFO org.apache.iceberg.spark.source.SparkScanBuilder - Skipping aggregate pushdown: group by aggregation push down is not supported
[Test worker] INFO org.apache.spark.sql.execution.datasources.v2.V2ScanRelationPushDown - 

which shows the table optimization step running before view creation; this triggers a heavyweight scan-planning operation that may require distributed planning.

Why

Looking at upstream Spark, we can see that the equivalent view creation logic contains explicit code that causes the view body to be analyzed, but not optimized.

In Iceberg, we override the regular upstream Spark code with our own variants of view creation, and those variants don't carry over this analyze-only behavior. Given that table scan planning is both redundant and slow in this case, we should update the internal Iceberg view creation code so that the view body goes through Spark's analysis phase only, not the optimization phase.
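As a rough illustration of the analysis/optimization split (not the actual Iceberg code path), the distinction is visible through Spark's public `QueryExecution` API: constructing a `Dataset` runs parsing and analysis eagerly, while the optimized plan (where V2 scan planning kicks in) is a lazy field that is only computed on demand. The class name and example query below are hypothetical; this is a minimal sketch assuming a local Spark 3.5 session.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan;

public class AnalyzeOnlySketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[1]")
        .appName("analyze-only-sketch")
        .getOrCreate();

    // Creating the Dataset runs parsing + analysis (name/type resolution) eagerly...
    Dataset<Row> df = spark.sql("SELECT id, count(1) AS cnt FROM range(10) GROUP BY id");
    LogicalPlan analyzed = df.queryExecution().analyzed();
    System.out.println("resolved: " + analyzed.resolved());

    // ...but the optimized plan is lazy and is only computed when forced,
    // e.g. by an action. This is the expensive step a view-creation path
    // that stores only the analyzed SQL/schema never needs to trigger:
    // df.queryExecution().optimizedPlan();

    spark.stop();
  }
}
```

A fix along these lines would have the Iceberg view-creation command consume only the analyzed plan (for schema and column resolution), leaving the optimizer and scan planning untouched.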

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
