10 changes: 5 additions & 5 deletions docs/SUMMARY.md
@@ -36,7 +36,7 @@
* [Offline store](getting-started/components/offline-store.md)
* [Online store](getting-started/components/online-store.md)
* [Feature server](getting-started/components/feature-server.md)
* [Batch Materialization Engine](getting-started/components/batch-materialization-engine.md)
* [Compute Engine](getting-started/components/compute-engine.md)
* [Provider](getting-started/components/provider.md)
* [Authorization Manager](getting-started/components/authz_manager.md)
* [OpenTelemetry Integration](getting-started/components/open-telemetry.md)
@@ -139,10 +139,10 @@
* [Google Cloud Platform](reference/providers/google-cloud-platform.md)
* [Amazon Web Services](reference/providers/amazon-web-services.md)
* [Azure](reference/providers/azure.md)
* [Batch Materialization Engines](reference/batch-materialization/README.md)
* [Snowflake](reference/batch-materialization/snowflake.md)
* [AWS Lambda (alpha)](reference/batch-materialization/lambda.md)
* [Spark (contrib)](reference/batch-materialization/spark.md)
* [Compute Engines](reference/compute-engine/README.md)
* [Snowflake](reference/compute-engine/snowflake.md)
* [AWS Lambda (alpha)](reference/compute-engine/lambda.md)
* [Spark (contrib)](reference/compute-engine/spark.md)
* [Feature repository](reference/feature-repository/README.md)
* [feature\_store.yaml](reference/feature-repository/feature-store-yaml.md)
* [.feastignore](reference/feature-repository/feast-ignore.md)
2 changes: 1 addition & 1 deletion docs/getting-started/architecture/write-patterns.md
@@ -16,7 +16,7 @@ There are two ways a client (or Data Producer) can *_send_* data to the online s
- Using a synchronous API call for a small number of entities or a single entity (e.g., using the [`push` or `write_to_online_store` methods](../../reference/data-sources/push.md#pushing-data) or the Feature Server's [`push` endpoint](../../reference/feature-servers/python-feature-server.md#pushing-features-to-the-online-and-offline-stores)), as shown in the sketch after this list
2. Asynchronously
- Using an asynchronous API call for a small number of entities or a single entity (e.g., using the [`push` or `write_to_online_store` methods](../../reference/data-sources/push.md#pushing-data) or the Feature Server's [`push` endpoint](../../reference/feature-servers/python-feature-server.md#pushing-features-to-the-online-and-offline-stores))
- Using a "batch job" for a large number of entities (e.g., using a [batch materialization engine](../components/batch-materialization-engine.md))
- Using a "batch job" for a large number of entities (e.g., using a [compute engine](../components/compute-engine.md))

Note, in some contexts, developers may "batch" a group of entities together and write them to the online store in a
single API call. This is a common pattern when writing data to the online store to reduce write loads but we would
58 changes: 53 additions & 5 deletions docs/getting-started/components/compute-engine.md
@@ -1,4 +1,4 @@
# Compute Engine (Batch Materialization Engine)
# Compute Engine

Note: Materialization is now constructed via the unified compute engine interface.

@@ -20,8 +20,9 @@ engines.
```markdown
| Compute Engine | Description | Supported | Link |
|-------------------------|-------------------------------------------------------------------------------------------------|------------|------|
| LocalComputeEngine | Runs on Arrow + Pandas/Polars/Dask etc., designed for lightweight transformation. | ✅ | |
| SparkComputeEngine | Runs on Apache Spark, designed for large-scale distributed feature generation. | ✅ | |
| SnowflakeComputeEngine | Runs on Snowflake, designed for scalable feature generation using Snowflake SQL. | ✅ | |
| LambdaComputeEngine | Runs on AWS Lambda, designed for serverless feature generation. | ✅ | |
| FlinkComputeEngine | Runs on Apache Flink, designed for stream processing and real-time feature generation. | ❌ | |
| RayComputeEngine | Runs on Ray, designed for distributed feature generation and machine learning workloads. | ❌ | |
@@ -31,7 +32,7 @@ engines.
The batch engine is configured in the `feature_store.yaml` file and serves as the default configuration for all materialization and historical retrieval tasks; it can also be set per view via the `batch_engine` config on a `BatchFeatureView`. For example:
```yaml
batch_engine:
type: SparkComputeEngine
type: spark.engine
config:
spark_master: "local[*]"
spark_app_name: "Feast Batch Engine"
@@ -59,7 +60,7 @@ Then, when you materialize the feature view, it will use the batch_engine config
The stream engine is configured in the `feature_store.yaml` file and serves as the default configuration for all stream materialization and historical retrieval tasks; it can also be set per view via the `stream_engine` config on a `FeatureView`. For example:
```yaml
stream_engine:
type: SparkComputeEngine
type: spark.engine
config:
spark_master: "local[*]"
spark_app_name: "Feast Stream Engine"
@@ -108,4 +109,51 @@ defined in the DAG. It handles the execution of transformations, aggregations, j

The Feature Resolver is the core component of the compute engine that constructs the execution plan for feature
generation. It takes the definitions from feature views and builds a directed acyclic graph (DAG) of operations that
need to be performed to generate the features.

#### DAG
The DAG represents the directed acyclic graph of operations that need to be performed to generate the features. It
contains nodes for each operation, such as transformations, aggregations, joins, and filters. The DAG is built by the
Feature Resolver and executed by the Feature Builder.

DAG nodes are defined as follows:
```
+---------------------+
|   SourceReadNode    | <- Read data from offline store (e.g. Snowflake, BigQuery, etc. or custom source)
+---------------------+
           |
           v
+--------------------------------------+
|  TransformationNode / JoinNode (*)   | <- Merge data sources, custom transformations by user, or default join
+--------------------------------------+
           |
           v
+---------------------+
|     FilterNode      | <- used for point-in-time filtering
+---------------------+
           |
           v
+---------------------+
| AggregationNode (*) | <- only if aggregations are defined
+---------------------+
           |
           v
+---------------------+
|  DeduplicationNode  | <- used if no aggregation, for historical retrieval
+---------------------+
           |
           v
+---------------------+
| ValidationNode (*)  | <- optional validation checks
+---------------------+
           |
           v
     +----------+
     |  Output  |
     +----------+
       /      \
      v        v
+------------------+  +-------------------+
| OnlineStoreWrite |  | OfflineStoreWrite |
+------------------+  +-------------------+
```
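
The node names above come from the diagram; the classes and `execute` API below are a hypothetical illustration of how a resolver-built DAG can be executed by a builder, not Feast's actual implementation:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DAGNode:
    """One operation in the plan (read, transform, filter, ...)."""

    name: str
    op: Callable[..., object]
    inputs: List["DAGNode"] = field(default_factory=list)

    def execute(self, cache: Dict[str, object]) -> object:
        # Post-order traversal: run upstream nodes first, memoize by name.
        if self.name not in cache:
            upstream = [node.execute(cache) for node in self.inputs]
            cache[self.name] = self.op(*upstream)
        return cache[self.name]


# Wiring that mirrors the diagram: source -> transformation -> filter.
rows = [{"driver_id": 1, "trips": 3}, {"driver_id": 2, "trips": 0}]
source = DAGNode("SourceReadNode", op=lambda: rows)
transform = DAGNode(
    "TransformationNode",
    op=lambda rs: [{**r, "trips_x2": r["trips"] * 2} for r in rs],
    inputs=[source],
)
filtered = DAGNode(
    "FilterNode",
    op=lambda rs: [r for r in rs if r["trips"] > 0],
    inputs=[transform],
)
print(filtered.execute(cache={}))
# [{'driver_id': 1, 'trips': 3, 'trips_x2': 6}]
```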
2 changes: 1 addition & 1 deletion docs/getting-started/components/overview.md
@@ -27,7 +27,7 @@ A complete Feast deployment contains the following components:
* Retrieve online features.
* **Feature Server:** The Feature Server is a REST API server that serves feature values for a given entity key and feature reference. The Feature Server is designed to be horizontally scalable and can be deployed in a distributed manner.
* **Stream Processor:** The Stream Processor can be used to ingest feature data from streams and write it into the online or offline stores. Currently, there's an experimental Spark processor that's able to consume data from Kafka.
* **Batch Materialization Engine:** The [Batch Materialization Engine](batch-materialization-engine.md) component launches a process which loads data into the online store from the offline store. By default, Feast uses a local in-process engine implementation to materialize data. However, additional infrastructure can be used for a more scalable materialization process.
* **Compute Engine:** The [Compute Engine](compute-engine.md) component launches a process which loads data into the online store from the offline store. By default, Feast uses a local in-process engine implementation to materialize data. However, additional infrastructure can be used for a more scalable materialization process.
* **Online Store:** The online store is a database that stores only the latest feature values for each entity. The online store is either populated through materialization jobs or through [stream ingestion](../../reference/data-sources/push.md).
* **Offline Store:** The offline store persists batch data that has been ingested into Feast. This data is used for producing training datasets. For feature retrieval and materialization, Feast does not manage the offline store directly, but runs queries against it. However, offline stores can be configured to support writes if Feast configures logging functionality of served features.
* **Authorization Manager**: The authorization manager detects authentication tokens from client requests to Feast servers and uses this information to enforce permission policies on the requested services.
2 changes: 1 addition & 1 deletion docs/getting-started/genai.md
@@ -162,4 +162,4 @@ For more detailed information and examples:
* [MCP Feature Server Reference](../reference/feature-servers/mcp-feature-server.md)
* [Spark Data Source](../reference/data-sources/spark.md)
* [Spark Offline Store](../reference/offline-stores/spark.md)
* [Spark Batch Materialization](../reference/batch-materialization/spark.md)
* [Spark Compute Engine](../reference/compute-engine/spark.md)
4 changes: 2 additions & 2 deletions docs/how-to-guides/running-feast-in-production.md
@@ -57,8 +57,8 @@ To keep your online store up to date, you need to run a job that loads feature d
Out of the box, Feast's materialization process uses an in-process materialization engine. This engine loads all the data being materialized into memory from the offline store, and writes it into the online store.

This approach may not scale to large amounts of data, which users of Feast may be dealing with in production.
In this case, we recommend using one of the more [scalable materialization engines](./scaling-feast.md#scaling-materialization), such as [Snowflake Materialization Engine](../reference/batch-materialization/snowflake.md).
Users may also need to [write a custom materialization engine](../how-to-guides/customizing-feast/creating-a-custom-materialization-engine.md) to work on their existing infrastructure.
In this case, we recommend using one of the more [scalable compute engines](./scaling-feast.md#scaling-materialization), such as [Snowflake Compute Engine](../reference/compute-engine/snowflake.md).
Users may also need to [write a custom compute engine](../how-to-guides/customizing-feast/creating-a-custom-compute-engine.md) to work on their existing infrastructure.
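
For example, a scheduled job can drive materialization through the Python SDK; the engine used comes from `feature_store.yaml`, so this code is unchanged when you swap engines (a minimal sketch, assuming a feature repo in the current directory):

```python
from datetime import datetime, timezone

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Load feature values from the offline store into the online store,
# resuming from the end of the previous materialization run.
store.materialize_incremental(end_date=datetime.now(timezone.utc))
```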


### 2.2 Scheduled materialization with Airflow
2 changes: 1 addition & 1 deletion docs/how-to-guides/scaling-feast.md
@@ -20,7 +20,7 @@ The recommended solution in this case is to use the [SQL based registry](../tuto
The default Feast materialization process is an in-memory process, which pulls data from the offline store before writing it to the online store.
However, this process does not scale for large data sets, since it's executed in a single process.

Feast supports pluggable [Materialization Engines](../getting-started/components/batch-materialization-engine.md), that allow the materialization process to be scaled up.
Feast supports pluggable [Compute Engines](../getting-started/components/compute-engine.md) that allow the materialization process to be scaled up.
Aside from the local process, Feast supports a [Lambda-based materialization engine](https://rtd.feast.dev/en/master/#alpha-lambda-based-engine), and a [Bytewax-based materialization engine](https://rtd.feast.dev/en/master/#bytewax-engine).

Users may also be able to build an engine to scale up materialization using existing infrastructure in their organizations.
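
A sketch of the skeleton such an engine might have. The method names mirror the operations described in these docs (`materialize` and `get_historical_features`), but the actual Feast base class and signatures should be taken from the `feast.infra` sources for your version; everything below is illustrative, not a verified API:

```python
class MyOrgComputeEngine:  # in practice, subclass Feast's ComputeEngine base class
    """Hypothetical engine that runs materialization on in-house infrastructure."""

    def materialize(self, feature_view, start_date, end_date):
        # 1. Read rows for [start_date, end_date) from the offline store.
        # 2. Submit transformations/aggregations to the in-house compute cluster.
        # 3. Write the resulting feature values to the online store.
        raise NotImplementedError

    def get_historical_features(self, entity_df, feature_refs):
        # Point-in-time join of entity_df against the offline store.
        raise NotImplementedError
```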
11 changes: 0 additions & 11 deletions docs/reference/batch-materialization/README.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/reference/codebase-structure.md
@@ -34,7 +34,7 @@ There are also several important submodules:
* `ui/` contains the embedded Web UI, launched via the `feast ui` command.

Of these submodules, `infra/` is the most important.
It contains the interfaces for the [provider](getting-started/components/provider.md), [offline store](getting-started/components/offline-store.md), [online store](getting-started/components/online-store.md), [batch materialization engine](getting-started/components/batch-materialization-engine.md), and [registry](getting-started/components/registry.md), as well as all of their individual implementations.
It contains the interfaces for the [provider](getting-started/components/provider.md), [offline store](getting-started/components/offline-store.md), [online store](getting-started/components/online-store.md), [compute engine](getting-started/components/compute-engine.md), and [registry](getting-started/components/registry.md), as well as all of their individual implementations.

```
$ tree --dirsfirst -L 1 infra
17 changes: 17 additions & 0 deletions docs/reference/compute-engine/README.md
@@ -48,18 +48,35 @@ An example of built output from FeatureBuilder:

## ✨ Available Engines


### 🔥 SparkComputeEngine

{% page-ref page="spark.md" %}

- Distributed DAG execution via Apache Spark
- Supports point-in-time joins and large-scale materialization
- Integrates with `SparkOfflineStore` and `SparkMaterializationJob`

### 🧪 LocalComputeEngine

{% page-ref page="local.md" %}

- Runs on Arrow + a specified backend (e.g., Pandas, Polars)
- Designed for local dev, testing, or lightweight feature generation
- Supports `LocalMaterializationJob` and `LocalHistoricalRetrievalJob`

### 🧊 SnowflakeComputeEngine

- Runs entirely in Snowflake
- Supports Snowflake SQL for feature transformations and aggregations
- Integrates with `SnowflakeOfflineStore` and `SnowflakeMaterializationJob`

{% page-ref page="snowflake.md" %}

### LambdaComputeEngine

{% page-ref page="lambda.md" %}

---

## 🛠️ Feature Builder Flow
docs/reference/compute-engine/snowflake.md
@@ -2,7 +2,7 @@

## Description

The [Snowflake](https://trial.snowflake.com) batch materialization engine provides a highly scalable and parallel execution engine using a Snowflake Warehouse for batch materializations operations (`materialize` and `materialize-incremental`) when using a `SnowflakeSource`.
The [Snowflake](https://trial.snowflake.com) compute engine provides a highly scalable and parallel execution engine using a Snowflake Warehouse for batch materialization operations (`materialize` and `materialize-incremental`) when using a `SnowflakeSource`.

The engine requires no additional configuration other than for you to supply Snowflake's standard login and context details. The engine leverages custom (automatically deployed for you) Python UDFs to do the proper serialization of your offline store data to your online serving tables.

docs/reference/compute-engine/spark.md
@@ -1,10 +1,19 @@
# Spark (alpha)
# Spark

## Description

The Spark batch materialization engine is considered alpha status. It relies on the offline store to output feature values to S3 via `to_remote_storage`, and then loads them into the online store.
Spark Compute Engine provides a distributed execution engine for batch materialization operations (`materialize` and `materialize-incremental`) and historical retrieval operations (`get_historical_features`).

It is designed to handle large-scale data processing and can be used with various offline stores, such as Snowflake, BigQuery, and Spark SQL.

### Design
The Spark Compute Engine is implemented as a subclass of `feast.infra.compute_engine.ComputeEngine`.
The offline store is used to read and write data, while the Spark engine performs transformations and aggregations on the data.
The engine supports the following features:
- Support for reading different data sources, such as Spark SQL, BigQuery, and Snowflake.
- Distributed execution of feature transformations and aggregations.
- Support for custom transformations using Spark SQL or UDFs.

See [SparkMaterializationEngine](https://rtd.feast.dev/en/master/index.html?highlight=SparkMaterializationEngine#feast.infra.materialization.spark.spark_materialization_engine.SparkMaterializationEngineConfig) for configuration options.

## Example

@@ -16,7 +25,12 @@ offline_store:
...
batch_engine:
type: spark.engine
partitions: [optional num partitions to use to write to online store]
partitions: 10 # number of partitions when writing to the online or offline store
spark_conf:
spark.master: "local[*]"
spark.app.name: "Feast Spark Engine"
spark.sql.shuffle.partitions: 100
spark.executor.memory: "4g"
```
{% endcode %}
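
With a `batch_engine` like the one above in place, the standard materialization entry points run as distributed Spark jobs instead of in-process (a minimal sketch, assuming a feature repo in the current directory):

```python
from datetime import datetime, timedelta, timezone

from feast import FeatureStore

store = FeatureStore(repo_path=".")

end = datetime.now(timezone.utc)
# Executed by the configured Spark engine rather than the default
# in-process engine.
store.materialize(start_date=end - timedelta(days=1), end_date=end)
```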
