Skip to content

Undefined features should be rejected when being fetched via get_historical_features / get_online_features #2576

@benjamintanweihao

Description

@benjamintanweihao

Context

I want to create versioned feature views. Through various versions, features could be added or removed.

Expected Behavior

When doing feast.get_historical_features, features that are not defined should be rejected.

Current Behavior

The features get returned even though they have not been defined.

Steps to reproduce

  1. Initialize a new feast repository
  2. Define the features:
driver_hourly_stats = FileSource(
    path="/home/benjamintan/workspace/feast-workflow-demo/feature_repo/data/driver_stats.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

driver = Entity(name="driver_id", value_type=ValueType.INT64, description="driver id",)

driver_hourly_stats_view_v1 = FeatureView(
    name="driver_hourly_stats_v1",
    entities=["driver_id"],
    ttl=timedelta(days=1),
    schema=[
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    online=True,
    batch_source=driver_hourly_stats,
    tags={},
)

driver_hourly_stats_view_v2 = FeatureView(
    name="driver_hourly_stats_v2",
    entities=["driver_id"],
    ttl=timedelta(days=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    online=True,
    batch_source=driver_hourly_stats,
    tags={},
)
  1. feast apply
  2. Querying Feast:
fs = FeatureStore(repo_path='.')

entity_df = pd.DataFrame(
    {
        "event_timestamp": [
            pd.Timestamp(dt, unit="ms", tz="UTC").round("ms")
            for dt in pd.date_range(
                start=datetime.now() - timedelta(days=3),
                end=datetime.now(),
                periods=3,
            )
        ],
        "driver_id": [1001, 1002, 1003],
    }
)

I do not expect the following to work:

# THIS PART SHOULDN'T WORK
features_wrong = ['driver_hourly_stats_v1:conv_rate',  # doesn't exist in V1
                              'driver_hourly_stats_v1:acc_rate',   # doesn't exist in V1
                              'driver_hourly_stats_v1:avg_daily_trips',
                             ]

hist_features_wrong = fs.get_historical_features(
    entity_df=entity_df,
    features=features_wrong,
)

But I do get results:

                   event_timestamp  driver_id  ...  acc_rate  avg_daily_trips
0 2022-04-17 09:35:35.658000+00:00       1001  ...  0.536431            742.0
1 2022-04-18 21:35:35.658000+00:00       1002  ...  0.496901            678.0
2 2022-04-20 09:35:35.658000+00:00       1003  ...       NaN     

I do not expect this to work because driver_hourly_stats_v1:conv_rate and driver_hourly_stats_v1:acc_rate were not defined in the driver_hourly_stats_view_v1 FeatureView.

And just to double check that driver_hourly_stats_v1 only has avg_daily_trips defined:

➜ feast feature-views describe driver_hourly_stats_v1
spec:
  name: driver_hourly_stats_v1
  entities:
  - driver_id
  features:
  - name: avg_daily_trips
    valueType: INT64
  ttl: 86400s

Specifications

  • Version: 0.20 (tested this on 0.19 and 0.18)
  • Platform: Linux
  • Subsystem: Ubuntu

Possible Solution

The list of features being passed in should be checked against the registry. Currently the feature view name and feature name pairs are not validated together. Here's an example that modifies get_historical_features:

    @log_exceptions_and_usage
    def get_historical_features(
        self,
        entity_df: Union[pd.DataFrame, str],
        features: Union[List[str], FeatureService],
        full_feature_names: bool = False,
    ) -> RetrievalJob:

        # Build a dictionary of feature view names -> feature names (not sure if this function already exists ...)
        fv_name_features = dict([(fv.name, [f.name.split('-')[0] for f in fv.features]) for fv in self.list_feature_views()])
        
        # Check that input features are found in the `fv_name_features` dictionary
        feature_views_not_found = []
        for feature in features:
            k, v = feature.split(":")
            if v not in fv_name_features[k]:
                feature_views_not_found.append(f'{k}:{v}')

        if feature_views_not_found:
            raise FeatureViewNotFoundException(', '.join(feature_views_not_found))

This returns:

feast.errors.FeatureViewNotFoundException: Feature view driver_hourly_stats_v1:conv_rate, driver_hourly_stats_v1:acc_rate does not exist

This doesn't handle the case when a FeatureService is passed in but it shouldn't be too hard.

This should also apply to get_online_features.

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions