
Performance Issue with rewrite_data_files Spark procedure #14679

@VinhVu95

Description


Apache Iceberg version

None

Query engine

Spark

Please describe the bug 🐞

Setup

I am running the rewrite_data_files procedure with the following setup:

  • Spark 3.5.3 on EMR 7.7.0 - 30 executor nodes of m5.8xlarge instance type
  • The procedure runs with mostly default settings, plus target-file-size-bytes=1G and max-file-group-size-bytes=10GB (roughly the call I run is sketched after this list)
  • Spark configuration:
    spark.dynamicAllocation.enabled=true
    spark.driver.memory=96g # a smaller number caused OOM on driver 
    spark.executor.memory=96g
    spark.executor.cores=4 # default setting in EMR
    
    conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    conf.set("spark.sql.catalog.glue_catalog.warehouse", f"s3://{ICEBERG_S3_LOCATION}/{ICEBERG_DATABASE}/")
    conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    
  • Iceberg table with > 100K data files (~800 GB total, ~4 billion rows, unpartitioned)
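
This is roughly how I invoke the procedure (a sketch; my_db.my_table is a placeholder for the actual table name, glue_catalog is the catalog configured above):

# Sketch of the procedure call; table name is a placeholder
spark.sql("""
    CALL glue_catalog.system.rewrite_data_files(
        table => 'my_db.my_table',
        options => map(
            'target-file-size-bytes',    '1073741824',   -- 1 GB
            'max-file-group-size-bytes', '10737418240'   -- 10 GB
        )
    )
""")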

Problem

At first I was encountering the following exception in executor tasks:

Caused by: software.amazon.awssdk.thirdparty.org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
	at software.amazon.awssdk.thirdparty.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:316)
	at software.amazon.awssdk.thirdparty.org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:282)
	at software.amazon.awssdk.http.apache.internal.conn.ClientConnectionRequestFactory$DelegatingConnectionRequest.get(ClientConnectionRequestFactory.java:92)
	at software.amazon.awssdk.http.apache.internal.conn.ClientConnectionRequestFactory$InstrumentedConnectionRequest.get(ClientConnectionRequestFactory.java:69)
	at software.amazon.awssdk.thirdparty.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
	at software.amazon.awssdk.thirdparty.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
	at software.amazon.awssdk.thirdparty.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
	at software.amazon.awssdk.thirdparty.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
	at software.amazon.awssdk.thirdparty.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
	at software.amazon.awssdk.http.apache.internal.impl.ApacheSdkHttpClient.execute(ApacheSdkHttpClient.java:72)
	at software.amazon.awssdk.http.apache.ApacheHttpClient.execute(ApacheHttpClient.java:254)
	at software.amazon.awssdk.http.apache.ApacheHttpClient.access$500(ApacheHttpClient.java:104)
	at software.amazon.awssdk.http.apache.ApacheHttpClient$1.call(ApacheHttpClient.java:231)
	at software.amazon.awssdk.http.apache.ApacheHttpClient$1.call(ApacheHttpClient.java:228)
	at software.amazon.awssdk.core.internal.util.MetricUtils.measureDurationUnsafe(MetricUtils.java:102)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.executeHttpRequest(MakeHttpRequestStage.java:79)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.execute(MakeHttpRequestStage.java:57)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeHttpRequestStage.execute(MakeHttpRequestStage.java:40)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:74)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:43)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:79)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:41)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:55)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:39)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage2.executeRequest(RetryableStage2.java:93)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage2.execute(RetryableStage2.java:56)

After that I followed the recommendation in https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/performance.html#Thread_and_connection_pool_settings ("Heavy load. Increase pool size and acquisition timeout fs.s3a.connection.acquisition.timeout") and specified the Apache HTTP client configuration as instructed in https://iceberg.apache.org/docs/latest/aws/#apache-http-client-configurations for the AWS integration and the S3FileIO connector:

conf.set("spark.sql.catalog.glue_catalog.http-client.apache.max-connections", "1000")
conf.set("spark.sql.catalog.glue_catalog.http-client.apache.connection-timeout-ms", "60000")
conf.set("spark.sql.catalog.glue_catalog.http-client.apache.socket-timeout-ms", "120000")
conf.set("spark.sql.catalog.glue_catalog.http-client.apache.connection-acquisition-timeout-ms", "60000")

All Spark tasks then still failed, with containers exiting with status code 137 (OOM), even after increasing executor memory up to 96g. My guess was that too many S3 connections were open at the same time: each job tries to rewrite 1500-3000 files, and with max-concurrent-file-group-rewrites defaulting to 5, five file groups are rewritten concurrently. As the next step I changed the compaction options and the connection pool size to reduce the number of concurrent S3 connections:

COMPACTION_OPTIONS = {
    'target-file-size-bytes': '536870912',         # 512 MB, half the previous value
    'max-file-group-size-bytes': '5368709120',     # 5 GB, half the previous value
    'partial-progress.max-commits': '10',
    'partial-progress.enabled': 'true',
    'min-input-files': '5',
    'max-concurrent-file-group-rewrites': '1'      # reduced from the default of 5
}

# set the connection pool size to an even smaller number
conf.set("spark.sql.catalog.glue_catalog.http-client.apache.max-connections", "100")

There were then a few successful tasks, but most failed; even though the rewrite jobs were scheduled sequentially (each rewriting ~800 files), they were eventually unable to complete.
So I would like to understand what causes the memory pressure on the executors (and how to avoid it), and how to run compaction efficiently on a table this large.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
