|
# 🧠 ComputeEngine (WIP)

The `ComputeEngine` is Feast’s pluggable abstraction for executing feature pipelines, including transformations, aggregations, joins, and materializations, on a backend of your choice (e.g., Spark, PyArrow, Pandas, Ray).

It powers both:

- `materialize()` – batch and stream materialization of features to the offline/online stores
- `get_historical_features()` – point-in-time correct retrieval of training datasets

This system builds and executes DAGs (Directed Acyclic Graphs) of typed operations, enabling modular and scalable feature workflows.
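
Both entry points are invoked through the standard `FeatureStore` API. A minimal sketch, using placeholder entity and feature names in the style of the Feast quickstart:

```python
from datetime import datetime, timedelta

import pandas as pd

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# materialize() runs the feature pipeline on the configured compute engine
# and writes the results to the online/offline stores.
store.materialize(
    start_date=datetime.utcnow() - timedelta(days=1),
    end_date=datetime.utcnow(),
)

# get_historical_features() performs a point-in-time correct join against
# the entity dataframe and returns a training dataset.
entity_df = pd.DataFrame(
    {"driver_id": [1001], "event_timestamp": [datetime.utcnow()]}
)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:conv_rate"],  # placeholder feature reference
).to_df()
```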

---

## 🧠 Core Concepts

| Component          | Description                                                              |
|--------------------|--------------------------------------------------------------------------|
| `ComputeEngine`    | Interface for executing materialization and retrieval tasks             |
| `FeatureBuilder`   | Constructs a DAG from a Feature View definition for a specific backend  |
| `DAGNode`          | Represents a logical operation (read, aggregate, join, etc.)            |
| `ExecutionPlan`    | Executes nodes in dependency order and stores intermediate outputs      |
| `ExecutionContext` | Holds config, registry, stores, entity data, and node outputs           |
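
As a mental model, the `ExecutionContext` can be pictured as a plain container for the items listed above. This is an illustrative shape only, not Feast’s actual class definition:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

import pandas as pd


@dataclass
class ExecutionContext:
    """Illustrative only: mirrors the fields described in the table above."""
    repo_config: Any                 # Feast repo configuration
    registry: Any                    # registry handle for feature view metadata
    offline_store: Any
    online_store: Any
    entity_df: Optional[pd.DataFrame] = None     # entity rows, for retrieval
    node_outputs: Dict[str, Any] = field(default_factory=dict)
```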

---

## ✨ Available Engines

### 🔥 SparkComputeEngine

- Distributed DAG execution via Apache Spark
- Supports point-in-time joins and large-scale materialization
- Integrates with `SparkOfflineStore` and `SparkMaterializationJob`

### 🧪 LocalComputeEngine (WIP)

- Runs on Arrow + Pandas (or optionally DuckDB)
- Designed for local dev, testing, or lightweight feature generation

---

## 🛠️ Feature Builder Flow

```text
SourceReadNode
      |
      v
JoinNode (Only for get_historical_features with entity df)
      |
      v
FilterNode (Always included; applies TTL or user-defined filters)
      |
      v
AggregationNode (If aggregations are defined in FeatureView)
      |
      v
DeduplicationNode (If no aggregation is defined for get_historical_features)
      |
      v
TransformationNode (If feature_transformation is defined)
      |
      v
ValidationNode (If enable_validation = True)
      |
      v
Output
   ├──> RetrievalOutput (For get_historical_features)
   └──> OnlineStoreWrite / OfflineStoreWrite (For materialize)
```

Each step is implemented as a `DAGNode`. An `ExecutionPlan` executes these nodes in topological order, caching `DAGValue` outputs.
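
The execution model behind this flow is easiest to see in miniature. The sketch below uses simplified stand-ins for Feast’s internals (real nodes carry typed `DAGValue`s and a richer context), showing how a plan walks nodes in dependency order and caches each output:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DAGNode:
    """Simplified stand-in for Feast's DAG node abstraction."""
    name: str
    inputs: List["DAGNode"] = field(default_factory=list)

    def execute(self, context: Any, input_values: List[Any]) -> Any:
        """Concrete nodes (read, join, aggregate, ...) override this."""
        raise NotImplementedError


class ExecutionPlan:
    """Runs nodes in topological order, caching each node's output by name."""

    def __init__(self, nodes: List[DAGNode]):
        self.nodes = nodes  # assumed to be topologically sorted

    def execute(self, context: Any) -> Dict[str, Any]:
        outputs: Dict[str, Any] = {}
        for node in self.nodes:
            upstream = [outputs[dep.name] for dep in node.inputs]
            outputs[node.name] = node.execute(context, upstream)
        return outputs
```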

---

## 🧩 Implementing a Custom Compute Engine

To create your own compute engine:

1. **Implement the interface**

```python
from feast.infra.compute_engines.base import ComputeEngine
from feast.infra.compute_engines.tasks import HistoricalRetrievalTask
from feast.infra.materialization.batch_materialization_engine import MaterializationTask, MaterializationJob
from feast.infra.offline_stores.offline_store import RetrievalJob


class MyComputeEngine(ComputeEngine):
    def materialize(self, task: MaterializationTask) -> MaterializationJob:
        ...

    def get_historical_features(self, task: HistoricalRetrievalTask) -> RetrievalJob:
        ...
```

2. **Create a `FeatureBuilder`**

```python
from feast.infra.compute_engines.feature_builder import FeatureBuilder


class CustomFeatureBuilder(FeatureBuilder):
    def build_source_node(self): ...
    def build_aggregation_node(self, input_node): ...
    def build_join_node(self, input_node): ...
    def build_filter_node(self, input_node): ...
    def build_dedup_node(self, input_node): ...
    def build_transformation_node(self, input_node): ...
    def build_output_nodes(self, input_node): ...
```
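
How these hooks compose is easiest to see as a driver function that mirrors the flow diagram above. This is illustrative pseudocode, not the base-class API; the `task` attribute names are invented for the example:

```python
def build_feature_dag(builder, task):
    """Chain the builder hooks to mirror the Feature Builder Flow (illustrative)."""
    node = builder.build_source_node()
    if task.needs_entity_join:               # hypothetical: retrieval with entity df
        node = builder.build_join_node(node)
    node = builder.build_filter_node(node)   # always included (TTL / user filters)
    if task.has_aggregations:                # hypothetical flag
        node = builder.build_aggregation_node(node)
    elif task.needs_entity_join:             # dedup only for get_historical_features
        node = builder.build_dedup_node(node)
    if task.has_transformation:              # hypothetical flag
        node = builder.build_transformation_node(node)
    return builder.build_output_nodes(node)
```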

3. **Define `DAGNode` subclasses** (see the sketch after this list)
   * `ReadNode`, `AggregationNode`, `JoinNode`, `WriteNode`, etc.
   * Each node implements `execute(context) -> DAGValue`

4. **Return an `ExecutionPlan`**
   * The `ExecutionPlan` stores DAG nodes in topological order
   * It automatically handles caching of intermediate values
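
As a concrete illustration of steps 3 and 4, a minimal pandas-backed node could look like the following, reusing the simplified `DAGNode` sketched earlier (the `event_timestamp` column name is an assumption for the example):

```python
import pandas as pd


class TTLFilterNode(DAGNode):
    """Illustrative filter node: drops rows older than the feature view's TTL."""

    def __init__(self, name, input_node, ttl):
        super().__init__(name, inputs=[input_node])
        self.ttl = ttl  # a datetime.timedelta

    def execute(self, context, input_values):
        df: pd.DataFrame = input_values[0]
        cutoff = pd.Timestamp.utcnow() - self.ttl
        # Assumes a tz-aware event_timestamp column (hypothetical column name)
        return df[df["event_timestamp"] >= cutoff]
```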

## 🚧 Roadmap

- [x] Modular, backend-agnostic DAG execution framework
- [x] Spark engine with native support for materialization + PIT joins
- [ ] PyArrow + Pandas engine for local compute
- [ ] Native multi-feature-view DAG optimization
- [ ] DAG validation, metrics, and debug output
- [ ] Scalable distributed backend via Ray or Polars