Conversation


@HaoXuAI HaoXuAI commented Jun 30, 2025

What this PR does / why we need it:

Non-breaking change; backward compatible.

Support multiple feature views as a source, e.g.:

  1. Chained: feature_view -> base feature_view -> data_source
base_fv = BatchFeatureView(
    name="hourly_driver_stats",
    entities=[driver],
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
        Field(name="driver_id", dtype=Int32),
    ],
    online=True,
    offline=True,
    source=source,
)

chained_fv = BatchFeatureView(
    name="daily_driver_stats",
    entities=[driver],
    udf=transform_feature,
    udf_string="transform",
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="driver_id", dtype=Int32),
    ],
    online=True,
    offline=True,
    source=base_fv,
    sink_source=SparkSource(
        name="daily_driver_stats_sink",
        path="/tmp/daily_driver_stats_sink",
        file_format="parquet",
        timestamp_field="event_timestamp",
        created_timestamp_column="created",
    ),
)
  2. Fan-in: feature_view -> [feature_view1, feature_view2]
multi_view = BatchFeatureView(
    name="multi_view",
    entities=[driver],
    schema=[
        Field(name="driver_id", dtype=Int32),
        Field(name="daily_driver_stats__conv_rate", dtype=Float32),
        Field(name="daily_driver_stats__acc_rate", dtype=Float32),
    ],
    online=True,
    offline=True,
    source=[base_fv, chained_fv],
    sink_source=SparkSource(
        name="multi_view_sink",
        path="/tmp/multi_view_sink",
        file_format="parquet",
        timestamp_field="daily_driver_stats__event_timestamp",
        created_timestamp_column="daily_driver_stats__created",
    ),
)

Diagram: (architecture diagram image omitted)

APIs:

  1. sink_source: Used when you have multiple views as the source. It is required so that materialization knows what output to write to the offline store.
  2. Underlying default join operation: Sources can be merged either via a transformation udf you specify yourself, or via the default join operation: an inner join across each FeatureView's features, and a left join onto the entity df.
  3. Works for both BatchFeatureView and StreamFeatureView.
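To make the default join concrete, here is a minimal pandas sketch of the described behavior. The function and frame names are hypothetical illustrations, not the compute engine's actual code:

```python
import pandas as pd

def default_join(entity_df, view_outputs, join_keys):
    """Hypothetical sketch of the default merge: inner-join the feature
    views' outputs on the shared join keys, then left-join the combined
    features onto the entity df."""
    merged = view_outputs[0]
    for df in view_outputs[1:]:
        merged = merged.merge(df, on=join_keys, how="inner")
    return entity_df.merge(merged, on=join_keys, how="left")

entity_df = pd.DataFrame({"driver_id": [1, 2, 3]})
hourly = pd.DataFrame({"driver_id": [1, 2], "conv_rate": [0.9, 0.7]})
daily = pd.DataFrame({"driver_id": [1, 2], "acc_rate": [0.5, 0.6]})

result = default_join(entity_df, [hourly, daily], ["driver_id"])
# driver 3 keeps its row (left join on the entity df) with missing features
```

The inner join keeps only entities present in every source view, while the final left join preserves every row of the entity df.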

This unlocks the requested ability to join multiple data sources, such as SparkSource + SnowflakeSource.
You can do this with the following setup:

  1. Create one Snowflake FeatureView for SnowflakeSource
  2. Create one Spark FeatureView for SparkSource
  3. Create one FeatureView that uses those FeatureViews as its source.

Which issue(s) this PR fixes:

#5444 (comment)

Misc

TODO:

  1. Not every feature view node needs to be recomputed. Features already computed and loaded into the offline store can be skipped during materialization. This is also known as a stateful store.
  2. The default join operation should be customizable.

@HaoXuAI HaoXuAI requested a review from a team as a code owner June 30, 2025 20:20
HaoXuAI added 2 commits July 7, 2025 00:12
Signed-off-by: HaoXuAI <[email protected]>
Signed-off-by: HaoXuAI <[email protected]>
HaoXuAI added 2 commits July 7, 2025 21:53
Signed-off-by: HaoXuAI <[email protected]>
Signed-off-by: HaoXuAI <[email protected]>
@HaoXuAI HaoXuAI changed the title Draft: Compute engine multi sources and feature view as source Feat: compute engine multi sources and feature view as source Jul 8, 2025
@HaoXuAI HaoXuAI changed the title Feat: compute engine multi sources and feature view as source feat: Compute engine multi sources and feature view as source Jul 8, 2025
HaoXuAI added 5 commits July 7, 2025 23:39
Signed-off-by: HaoXuAI <[email protected]>
Signed-off-by: HaoXuAI <[email protected]>
Signed-off-by: HaoXuAI <[email protected]>
Signed-off-by: HaoXuAI <[email protected]>
Signed-off-by: HaoXuAI <[email protected]>
@HaoXuAI HaoXuAI changed the title feat: Compute engine multi sources and feature view as source feat: Support compute engine use multi feature views as source Jul 8, 2025
@HaoXuAI HaoXuAI changed the title feat: Support compute engine use multi feature views as source feat: Support compute engine to use multi feature views as source Jul 8, 2025
HaoXuAI added 2 commits July 8, 2025 09:06
Signed-off-by: HaoXuAI <[email protected]>
Signed-off-by: HaoXuAI <[email protected]>
entities: List[str]
ttl: Optional[timedelta]
source: DataSource
sink_source: Optional[DataSource] = None

Maybe we should just call it sink?

Collaborator Author

Sounds good, will update it

Member

@franciscojavierarceo franciscojavierarceo left a comment

Some small nits but this mostly lgtm

Can you add a page to the docs before merging this PR? It would be great to share with the community how to use it.

from feast.infra.compute_engines.dag.node import DAGNode


def topo_sort(root: DAGNode) -> List[DAGNode]:

Why not call it topological_sort?
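(For illustration, a topological sort over such DAG nodes can be sketched as below; this DAGNode is a minimal hypothetical stand-in for the imported class, not Feast's actual implementation.)

```python
from typing import List

class DAGNode:
    """Minimal stand-in: a named node with upstream input nodes."""
    def __init__(self, name: str, inputs: List["DAGNode"] = ()):
        self.name = name
        self.inputs = list(inputs)

def topological_sort(root: DAGNode) -> List[DAGNode]:
    """Post-order DFS: every node appears after all of its inputs."""
    order, seen = [], set()
    def visit(node: DAGNode) -> None:
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.inputs:
            visit(parent)
        order.append(node)
    visit(root)
    return order

src = DAGNode("data_source")
base = DAGNode("base_fv", [src])
chained = DAGNode("chained_fv", [base])
multi = DAGNode("multi_view", [base, chained])
print([n.name for n in topological_sort(multi)])
# ['data_source', 'base_fv', 'chained_fv', 'multi_view']
```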

else None
)
source_views = [
FeatureView.from_proto(FeatureViewProto(spec=view_spec, meta=None))
Member

The from_proto() method recursively calls itself for each source view without any depth limit. While cycle detection exists in FeatureResolver, it only runs when the compute engine is used, whereas proto deserialization happens much earlier, during API/registry loading.

We might need to handle this in FeatureView.from_proto().

Member

Also, don't we need to store metadata for nested feature views (meta=None)?

Member

We need both cycle detection and de-duplication during serialization. It may not cause issues for a few feature views, but with many feature views it could cause slowness.

A -> [B, C]
B -> [D, E]
C -> [D, E]

When serializing FeatureViewA, FeatureViewD and FeatureViewE each get serialized twice.

Collaborator Author

Right, that makes sense.
I don't have any metadata required for the compute engine at the moment; what do you think would be useful?
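The diamond case above can be handled with a visiting stack for cycle detection plus a memo for de-duplication. A hedged sketch, using plain dicts in place of protos (View and serialize are hypothetical names, not Feast's API):

```python
from typing import Dict, List, Optional, Set

class View:
    """Hypothetical stand-in for a FeatureView with nested source views."""
    def __init__(self, name: str, source_views: Optional[List["View"]] = None):
        self.name = name
        self.source_views = source_views or []

def serialize(view: View, stack: Optional[Set[str]] = None,
              memo: Optional[Dict[str, dict]] = None) -> dict:
    """Raise on a back-edge (cycle); serialize each view at most once."""
    stack = set() if stack is None else stack
    memo = {} if memo is None else memo
    if view.name in stack:
        raise ValueError(f"cycle detected at {view.name}")
    if view.name in memo:  # de-duplication: reuse the earlier result
        return memo[view.name]
    stack.add(view.name)
    proto = {"name": view.name,
             "source_views": [serialize(v, stack, memo)
                              for v in view.source_views]}
    stack.remove(view.name)
    memo[view.name] = proto
    return proto

# A -> [B, C], B -> [D, E], C -> [D, E]: D and E are serialized once each.
d, e = View("D"), View("E")
a = View("A", [View("B", [d, e]), View("C", [d, e])])
proto = serialize(a)
```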

ttl: Optional[timedelta]
batch_source: DataSource
stream_source: Optional[DataSource]
source_views: Optional[List["FeatureView"]]
Member

In __eq__, I think we also need to compare source_views; otherwise two FeatureViews with different source dependencies will be considered equal.

Same for __copy__.

Collaborator Author

good catch
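A minimal sketch of the suggested fix, on a trimmed, hypothetical FeatureView (the real class compares many more fields):

```python
import copy

class FeatureView:
    """Trimmed illustration of including source_views in __eq__/__copy__."""
    def __init__(self, name, source_views=None):
        self.name = name
        self.source_views = source_views or []

    def __eq__(self, other):
        if not isinstance(other, FeatureView):
            return NotImplemented
        # Without the source_views clause, two views with different
        # source dependencies would compare equal.
        return (self.name == other.name
                and self.source_views == other.source_views)

    def __copy__(self):
        fv = FeatureView(self.name)
        fv.source_views = list(self.source_views)  # carried over in copies
        return fv

base = FeatureView("base")
a = FeatureView("v", source_views=[base])
b = FeatureView("v", source_views=[])
assert a != b             # differ only in source dependencies
assert copy.copy(a) == a  # copy preserves source dependencies
```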

@HaoXuAI HaoXuAI merged commit b9ac90b into master Jul 10, 2025
17 checks passed
@HaoXuAI
Collaborator Author

HaoXuAI commented Jul 10, 2025

Merged it; will open a new PR with the following suggestions.

franciscojavierarceo pushed a commit that referenced this pull request Jul 21, 2025
# [0.51.0](v0.50.0...v0.51.0) (2025-07-21)

### Bug Fixes

* FeatureView serialization with cycle detection ([#5502](#5502)) ([f287ca5](f287ca5))
* Fix current version in publish workflow ([#5499](#5499)) ([0af6e94](0af6e94))
* Fix NPM authentication ([#5506](#5506)) ([9f85892](9f85892))
* Fix verify wheels workflow for macos14 ([#5486](#5486)) ([07174cc](07174cc))
* Fixed error thrown for invalid project name on features api ([#5525](#5525)) ([4a9a5d0](4a9a5d0))
* Fixed ODFV on-write transformations ([271ef74](271ef74))
* Move Install OS X dependencies before python setup ([#5488](#5488)) ([35f211c](35f211c))
* Normalize current version by removing 'v' prefix if present ([#5500](#5500)) ([43f3d52](43f3d52))
* Skip macOS 14 with Python 3.10 due to gettext library ([#5490](#5490)) ([41d4977](41d4977))
* Standalone Web UI Publish Workflow ([#5498](#5498)) ([c47b134](c47b134))

### Features

* Added endpoints to allow user to get data for all projects ([4e06965](4e06965))
* Added grpc and rest endpoint for features ([#5519](#5519)) ([0a75696](0a75696))
* Added relationship support to all API endpoints ([#5496](#5496)) ([bea83e7](bea83e7))
* Continue updating doc ([#5523](#5523)) ([ea53b2b](ea53b2b))
* Hybrid offline store ([#5510](#5510)) ([8f1af55](8f1af55))
* Populate created and updated timestamp on data sources ([af3056b](af3056b))
* Provide ready-to-use Python definitions in api ([37628d9](37628d9))
* Snowflake source. fetch MAX in a single query ([#5387](#5387)) ([b49cea1](b49cea1))
* Support compute engine to use multi feature views as source ([#5482](#5482)) ([b9ac90b](b9ac90b))
* Support pagination and sorting on registry apis ([#5495](#5495)) ([c4b6fbe](c4b6fbe))
* Update doc ([#5521](#5521)) ([2808ce1](2808ce1))
HaoXuAI pushed a commit that referenced this pull request Aug 11, 2025