Allow Feast Spark materialization to use staging for large dataset spill (avoid driver OOM) #5671

@chimeyrock999

Is your feature request related to a problem? Please describe.

When running store.materialize_incremental() with a Spark-based offline store, Feast currently converts Spark DataFrames to Arrow tables using toPandas() or collect().
Both calls load the entire dataset into driver memory, so large datasets fail with OutOfMemoryError or exceed spark.driver.maxResultSize.

Although Feast already supports a staging configuration for exporting data, it is not yet used during materialization. As a result, large-scale jobs cannot leverage staging to spill data to disk or remote storage.

Describe the solution you'd like

Enable the existing staging configuration to be used for materialization in the Spark offline store.
When enabled, Feast should write intermediate Spark DataFrames to the staging location (e.g. local disk or S3) before reading them back as Arrow tables for online ingestion.

Example configuration:

offline_store:
  type: spark
  staging_location: s3://bucket/tmp/feast_arrow
  staging_allow_materialize: true

This would allow Feast to handle large datasets safely without driver OOM, by spilling intermediate data to the configured staging location.

Describe alternatives you've considered

  • Increasing driver memory (spark.driver.memory, spark.driver.maxResultSize) — only delays the problem.
  • Using toLocalIterator() — too slow and still limited by memory.

Additional context

During feast materialize with large Spark datasets, jobs fail due to driver OOM even though executors have available resources.
Example error:

Total size of serialized results (2.1 GiB) is bigger than spark.driver.maxResultSize (2.0 GiB)
Caused by: java.lang.OutOfMemoryError: Java heap space

Allowing materialization to use staging would make Feast’s Spark integration far more scalable and production-ready.
