**Is your feature request related to a problem? Please describe.**
When running `store.materialize_incremental()` with a Spark-based offline store, Feast currently converts Spark DataFrames to Arrow tables using `toPandas()` or `collect()`.
This loads the entire dataset into driver memory, causing `OutOfMemoryError` or `spark.driver.maxResultSize exceeded` failures for large datasets.
Although Feast already supports a staging configuration for exporting data, it is not yet used during materialization. As a result, large-scale jobs cannot leverage staging to spill data to disk or remote storage.
**Describe the solution you'd like**
Enable the existing staging configuration to be used for materialization in the Spark offline store.
When enabled, Feast should write intermediate Spark DataFrames to the staging location (e.g. local disk or S3) before reading them back as Arrow tables for online ingestion.
Example configuration:
```yaml
offline_store:
  type: spark
  staging_location: s3://bucket/tmp/feast_arrow
  staging_allow_materialize: true
```

This would allow Feast to handle large datasets safely without driver OOM, by spilling intermediate data to the configured staging location.
**Describe alternatives you've considered**
- Increasing driver memory (`spark.driver.memory`, `spark.driver.maxResultSize`): only delays the problem.
- Using `toLocalIterator()`: too slow and still limited by memory.
**Additional context**
During `feast materialize` with large Spark datasets, jobs fail due to driver OOM even though executors have available resources.
Example error:

```
Total size of serialized results (2.1 GiB) is bigger than spark.driver.maxResultSize (2.0 GiB)
Caused by: java.lang.OutOfMemoryError: Java heap space
```
Allowing materialization to use staging would make Feast’s Spark integration far more scalable and production-ready.