-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
Context
I want to create versioned feature views. Through various versions, features could be added or removed.
Expected Behavior
When doing feast.get_historical_features
, features that are not defined should be rejected.
Current Behavior
The features get returned even though they have not been defined.
Steps to reproduce
- Initialize a new feast repository
- Define the features:
driver_hourly_stats = FileSource(
path="/home/benjamintan/workspace/feast-workflow-demo/feature_repo/data/driver_stats.parquet",
timestamp_field="event_timestamp",
created_timestamp_column="created",
)
driver = Entity(name="driver_id", value_type=ValueType.INT64, description="driver id",)
driver_hourly_stats_view_v1 = FeatureView(
name="driver_hourly_stats_v1",
entities=["driver_id"],
ttl=timedelta(days=1),
schema=[
Field(name="avg_daily_trips", dtype=Int64),
],
online=True,
batch_source=driver_hourly_stats,
tags={},
)
driver_hourly_stats_view_v2 = FeatureView(
name="driver_hourly_stats_v2",
entities=["driver_id"],
ttl=timedelta(days=1),
schema=[
Field(name="conv_rate", dtype=Float32),
Field(name="acc_rate", dtype=Float32),
Field(name="avg_daily_trips", dtype=Int64),
],
online=True,
batch_source=driver_hourly_stats,
tags={},
)
feast apply
- Querying Feast:
fs = FeatureStore(repo_path='.')
entity_df = pd.DataFrame(
{
"event_timestamp": [
pd.Timestamp(dt, unit="ms", tz="UTC").round("ms")
for dt in pd.date_range(
start=datetime.now() - timedelta(days=3),
end=datetime.now(),
periods=3,
)
],
"driver_id": [1001, 1002, 1003],
}
)
I do not expect the following to work:
# THIS PART SHOULDN'T WORK
features_wrong = ['driver_hourly_stats_v1:conv_rate', # doesn't exist in V1
'driver_hourly_stats_v1:acc_rate', # doesn't exist in V1
'driver_hourly_stats_v1:avg_daily_trips',
]
hist_features_wrong = fs.get_historical_features(
entity_df=entity_df,
features=features_wrong,
)
But I do get results:
event_timestamp driver_id ... acc_rate avg_daily_trips
0 2022-04-17 09:35:35.658000+00:00 1001 ... 0.536431 742.0
1 2022-04-18 21:35:35.658000+00:00 1002 ... 0.496901 678.0
2 2022-04-20 09:35:35.658000+00:00 1003 ... NaN
I do not expect this to work because driver_hourly_stats_v1:conv_rate
and driver_hourly_stats_v1:acc_rate
were not defined in the driver_hourly_stats_view_v1
FeatureView.
And just to double check that driver_hourly_stats_v1
only has avg_daily_trips
defined:
➜ feast feature-views describe driver_hourly_stats_v1
spec:
name: driver_hourly_stats_v1
entities:
- driver_id
features:
- name: avg_daily_trips
valueType: INT64
ttl: 86400s
Specifications
- Version: 0.20 (tested this on 0.19 and 0.18)
- Platform: Linux
- Subsystem: Ubuntu
Possible Solution
The list of features being passed in should be checked against the registry. Currently the feature view name and feature name pairs are not validated together. Here's an example that modifies get_historical_features
:
@log_exceptions_and_usage
def get_historical_features(
self,
entity_df: Union[pd.DataFrame, str],
features: Union[List[str], FeatureService],
full_feature_names: bool = False,
) -> RetrievalJob:
# Build a dictionary of feature view names -> feature names (not sure if this function already exists ...)
fv_name_features = dict([(fv.name, [f.name.split('-')[0] for f in fv.features]) for fv in self.list_feature_views()])
# Check that input features are found in the `fv_name_features` dictionary
feature_views_not_found = []
for feature in features:
k, v = feature.split(":")
if v not in fv_name_features[k]:
feature_views_not_found.append(f'{k}:{v}')
if feature_views_not_found:
raise FeatureViewNotFoundException(', '.join(feature_views_not_found))
This returns:
feast.errors.FeatureViewNotFoundException: Feature view driver_hourly_stats_v1:conv_rate, driver_hourly_stats_v1:acc_rate does not exist
This doesn't handle the case when a FeatureService
is passed in but it shouldn't be too hard.
This should also apply to get_online_features
.