docs/getting-started/architecture/feature-transformation.md
Deciding when to use which transformation engine/communication pattern is extremely critical to
the success of your implementation.

In general, we recommend transformation engines and network calls to be chosen by aligning them with what is most
appropriate for the data producer, feature/model usage, and overall product.

## API

### feature_transformation

`feature_transformation` or `udf` are the core APIs for defining feature transformations in Feast. They allow you to specify custom logic that can be applied to the data during materialization or retrieval. Examples include:
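
As a sketch of the `udf` style (illustrative only: the surrounding feature view wiring is omitted, and `fare`/`distance` are hypothetical columns), a transformation is simply a function over a DataFrame:

```python
import pandas as pd

def add_fare_ratio(df: pd.DataFrame) -> pd.DataFrame:
    # Custom logic applied during materialization or retrieval:
    # derive a new feature column from two raw columns.
    df = df.copy()
    df["fare_ratio"] = df["fare"] / df["distance"]
    return df
```

A function like this would be passed to a feature view as its `udf`/`feature_transformation`; the exact wiring varies by Feast version.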

### Aggregation

`Aggregation` is a built-in API for defining batch or streaming aggregations on data. It allows you to specify how to aggregate data over a time window, such as calculating the average or sum of a feature over a specified period. Examples include:

```python
from datetime import timedelta

from feast import Aggregation, FeatureView

feature_view = FeatureView(
    aggregations=[
        Aggregation(
            column="amount",
            function="sum",
        ),
        Aggregation(
            column="amount",
            function="avg",
            time_window=timedelta(hours=1),
        ),
    ],
    ...
)
```
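
The declared aggregations behave like windowed grouping over event-timestamped rows. As a rough pandas analogue of the `sum` and 1-hour `avg` aggregations above (illustrative only, not Feast code):

```python
import pandas as pd

df = pd.DataFrame({
    "event_timestamp": pd.to_datetime([
        "2024-01-01 00:10", "2024-01-01 00:40", "2024-01-01 01:20",
    ]),
    "amount": [10.0, 20.0, 30.0],
})

# Hourly windows over `amount`: analogous to sum/avg aggregations.
hourly = df.set_index("event_timestamp")["amount"].resample("1h").agg(["sum", "mean"])
print(hourly["sum"].tolist())   # [30.0, 30.0]
print(hourly["mean"].tolist())  # [15.0, 30.0]
```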

### Filter

`ttl`: The amount of time that features remain available for materialization or retrieval. Entity rows whose timestamp is greater than the current time minus the `ttl` are used to filter the features. This is useful for ensuring that only recent data is used in feature calculations. Examples include:

```python
from datetime import timedelta

feature_view = FeatureView(
    ttl=timedelta(days=1),  # Features will be available for 1 day
    ...
)
```
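
Conceptually, the TTL filter keeps only rows whose event timestamp falls within `now - ttl` (a pandas sketch of the idea, not Feast internals):

```python
from datetime import datetime, timedelta

import pandas as pd

now = datetime(2024, 1, 2, 0, 0)
ttl = timedelta(days=1)
rows = pd.DataFrame({
    "entity_id": [1, 2, 3],
    "event_timestamp": [
        datetime(2024, 1, 1, 12, 0),   # within the 1-day window
        datetime(2023, 12, 30, 0, 0),  # too old, filtered out
        datetime(2024, 1, 2, 0, 0),    # right at "now"
    ],
})

fresh = rows[rows["event_timestamp"] >= now - ttl]
print(fresh["entity_id"].tolist())  # [1, 3]
```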

### Join

Feast can join multiple feature views together to create a composite feature view. This allows you to combine features from different sources or views into a single view. Examples include:

```python
feature_view = FeatureView(
    name="composite_feature_view",
    entities=["entity_id"],
    source=[
        FeatureView(
            name="feature_view_1",
            features=["feature_1", "feature_2"],
            ...
        ),
        FeatureView(
            name="feature_view_2",
            features=["feature_3", "feature_4"],
            ...
        ),
    ],
    ...
)
```

The underlying implementation of the join is an inner join by default, and the join key is the entity ID.
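
In pandas terms, the default join semantics resemble an inner merge on the entity key (illustrative only):

```python
import pandas as pd

fv1 = pd.DataFrame({"entity_id": [1, 2], "feature_1": [0.1, 0.2]})
fv2 = pd.DataFrame({"entity_id": [2, 3], "feature_3": [7, 9]})

# Inner join on the entity id: only entities present in both views survive.
joined = fv1.merge(fv2, on="entity_id", how="inner")
print(joined["entity_id"].tolist())  # [2]
```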

# BatchFeatureView

`BatchFeatureView` is a flexible abstraction in Feast that allows users to define features derived from batch data sources or even other `FeatureView`s, enabling composable and reusable feature pipelines. It is an extension of the `FeatureView` class, with support for user-defined transformations, aggregations, and recursive chaining of feature logic.

---

## ✅ Key Capabilities

- **Composable DAG of FeatureViews**: Supports defining a `BatchFeatureView` on top of one or more other `FeatureView`s.
- **Transformations**: Apply [transformation](../../getting-started/architecture/feature-transformation.md) logic (`feature_transformation` or `udf`) to a raw data source; this can also be used to combine multiple data sources.
- **Aggregations**: Define time-windowed aggregations (e.g. `sum`, `avg`) over event-timestamped data.
- **Feature resolution & execution**: Automatically resolves and executes DAGs of dependent views during materialization or retrieval. More details in the [Compute engine documentation](../../reference/compute-engine/README.md).
- **Materialization Sink Customization**: Specify a custom `sink_source` to define where derived feature data should be persisted.

```python
from feast import BatchFeatureView, FileSource
from feast.data_format import ParquetFormat

daily_driver_stats = BatchFeatureView(
    ...
    sink_source=FileSource(  # Required to specify where to sink the derived view
        name="daily_driver_stats_sink",
        path="s3://bucket/daily_stats/",
        file_format=ParquetFormat(),
        timestamp_field="event_timestamp",
        created_timestamp_column="created",
    ),
)
```

---

## 🔄 Execution Flow

Feast automatically resolves the DAG of `BatchFeatureView` dependencies during:

- `materialize()`: recursively resolves and executes the feature view graph.
- `get_historical_features()`: builds the execution plan for retrieving point-in-time correct features.
- `apply()`: registers the feature view DAG structure to the registry.

Each transformation and aggregation is turned into a DAG node (e.g., `SparkTransformationNode`, `SparkAggregationNode`) executed by the compute engine (e.g., `SparkComputeEngine`).

---

## ⚙️ How Materialization Works

- If the `BatchFeatureView` is backed by a base source (`FileSource`, `BigQuerySource`, `SparkSource`, etc.), the `batch_source` is used directly.
- If the source is another feature view (i.e., chained views), the `sink_source` must be provided to define the materialization target data source.
- During DAG planning, `SparkWriteNode` uses the `sink_source` as the batch sink.
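
The source-selection rule above can be sketched as follows (illustrative pseudologic with dict stand-ins for data sources, not Feast internals):

```python
def resolve_sink(source, sink_source=None):
    # A plain batch data source (dict stand-in here for FileSource,
    # BigQuerySource, etc.) is used directly as the sink.
    if isinstance(source, dict):
        return source
    # A chained feature view (any other source) requires an explicit sink.
    if sink_source is None:
        raise ValueError("sink_source is required when chaining feature views")
    return sink_source

base = {"path": "s3://bucket/raw/"}
print(resolve_sink(base))  # {'path': 's3://bucket/raw/'}
```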

---

## 🧪 Example Tests

See:

- `test_spark_dag_materialize_recursive_view()`: Validates chaining of two feature views and the materialized output.
- `test_spark_compute_engine_materialize()`: Validates transformation and writing of features into the offline and online stores.

---

## 🛑 Gotchas

- `sink_source` is **required** when chaining views (i.e., when `source` is another `FeatureView` or a list of them).
- Schema fields must be consistent with the `sink_source` schema, and with `batch_source.field_mapping` if field mappings exist.
- Aggregation logic must reference columns present in the raw source or transformed inputs.

---

## 🔮 Future Directions

- Support additional offline stores (e.g., Snowflake, Redshift) with auto-generated sink sources.

# Compute Engine

| Component | Description | Reference |
| --- | --- | --- |
| `ComputeEngine` | Interface for executing materialization and retrieval tasks | [link](https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/compute_engines/base.py) |
| `FeatureBuilder` | Constructs a DAG from Feature View definition for a specific backend | [link](https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/compute_engines/feature_builder.py) |
| `FeatureResolver` | Resolves feature DAG by topological order for execution | [link](https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/compute_engines/feature_resolver.py) |
| `DAG` | Represents a logical DAG operation (read, aggregate, join, etc.) | [link](https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/compute_engines/dag/README.md) |
| `ExecutionPlan` | Executes nodes in dependency order and stores intermediate outputs | [link](https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/compute_engines/dag/README.md) |
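
As a rough sketch of what `ExecutionPlan` does (illustrative classes, not Feast's actual implementation): nodes run in dependency order, and each node's output is stored so downstream nodes can consume it:

```python
class Node:
    def __init__(self, name, fn, inputs=()):
        self.name, self.fn, self.inputs = name, fn, tuple(inputs)

def execute(plan):
    # `plan` is assumed to already be in topological order.
    outputs = {}
    for node in plan:
        args = [outputs[dep] for dep in node.inputs]       # intermediate outputs
        outputs[node.name] = node.fn(*args)                # run and cache result
    return outputs

plan = [
    Node("source", lambda: [1, 2, 3]),
    Node("transform", lambda rows: [r * 2 for r in rows], inputs=["source"]),
    Node("output", lambda rows: sum(rows), inputs=["transform"]),
]
print(execute(plan)["output"])  # 12
```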

The `FeatureBuilder` initializes a `FeatureResolver` that extracts a DAG from the `FeatureView` definitions, resolving dependencies and ensuring the correct execution order. \
A `FeatureView` represents a logical data source, while a `DataSource` represents the physical data source (e.g., BigQuery, Spark, etc.). \
When defining a `FeatureView`, the source can be a physical `DataSource`, a derived `FeatureView`, or a list of `FeatureView`s. \
The `FeatureResolver` walks through the `FeatureView` sources, topologically sorts the DAG nodes based on dependencies, and returns a head node that represents the final output of the DAG. \
Subsequently, the `FeatureBuilder` builds the DAG nodes from the resolved head node, creating a `DAGNode` for each operation (read, join, filter, aggregate, etc.).

An example of the built output from `FeatureBuilder`:

```markdown
- Output(Agg(daily_driver_stats))
  - Agg(daily_driver_stats)
    - Filter(daily_driver_stats)
      - Transform(daily_driver_stats)
        - Agg(hourly_driver_stats)
          - Filter(hourly_driver_stats)
            - Transform(hourly_driver_stats)
              - Source(hourly_driver_stats)
```
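
The topological resolution behind this output can be sketched as a post-order walk over view dependencies (illustrative only; `resolve` and the dict-based graph are not Feast APIs):

```python
def resolve(view, graph, order=None, seen=None):
    # Post-order DFS: a view's sources are emitted before the view itself,
    # which yields a valid topological execution order for the DAG.
    if order is None:
        order, seen = [], set()
    for dep in graph.get(view, []):
        if dep not in seen:
            resolve(dep, graph, order, seen)
    if view not in seen:
        seen.add(view)
        order.append(view)
    return order

# Dependency graph mirroring the example above:
graph = {"daily_driver_stats": ["hourly_driver_stats"], "hourly_driver_stats": []}
print(resolve("daily_driver_stats", graph))  # ['hourly_driver_stats', 'daily_driver_stats']
```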

## Diagram

(Feature builder DAG diagram; image not recoverable from this excerpt.)

## ✨ Available Engines

### 🔥 SparkComputeEngine

This system builds and executes DAGs (Directed Acyclic Graphs) of typed operations:

```
SourceReadNode
   |
   v
TransformationNode (If feature_transformation is defined) | JoinNode (default behavior for multiple sources)
   |
   v
FilterNode (Always included; applies TTL or user-defined filters)
   |
   v
AggregationNode (If aggregations are defined in FeatureView)
   |
   v
DeduplicationNode (If no aggregation is defined for get_historical_features)
   |
   v
ValidationNode (If enable_validation = True)
   |
   v
...
```

To create your own compute engine:

```python
from typing import Sequence, Union

from feast.batch_feature_view import BatchFeatureView
from feast.entity import Entity
from feast.feature_view import FeatureView
from feast.infra.common.materialization_job import (
    MaterializationJob,
    MaterializationTask,
)
from feast.infra.common.retrieval_task import HistoricalRetrievalTask
from feast.infra.compute_engines.base import ComputeEngine
from feast.infra.offline_stores.offline_store import RetrievalJob
from feast.infra.registry.base_registry import BaseRegistry
from feast.on_demand_feature_view import OnDemandFeatureView
from feast.stream_feature_view import StreamFeatureView
```
0 commit comments