[HOPSWORKS-2134] Extend Query constructor with filter capability #174
Merged
16 commits
3f48a4e restructure constructor code (moritzmeister)
cf05d94 add filter object (moritzmeister)
9d8bcec refactor classes into less files to avoid circular imports (moritzmeister)
6aad389 add filter class (moritzmeister)
c47e0f7 progress (moritzmeister)
4300d43 fix reprs (moritzmeister)
b397918 add more examples (moritzmeister)
d6c63fa fix or bug (moritzmeister)
a802a85 fix setattr after update (moritzmeister)
043fcff refactor filter logic to a tree (moritzmeister)
33594a0 move filter to base feature group class (moritzmeister)
fec74cb Merge branch 'master' into HOPSWORKS-2134 (moritzmeister)
fbdbd43 Merge branch 'master' into HOPSWORKS-2134 (moritzmeister)
c321356 overwrite getattr and getitem instead of actually setting features as… (moritzmeister)
79f114c add documentation + fix some links (moritzmeister)
0a5f67e improve docs (moritzmeister)
@@ -0,0 +1,70 @@
# Query vs DataFrame

HSFS provides a DataFrame API to ingest data into the Hopsworks Feature Store. You can also retrieve feature data as a DataFrame, which can either be used directly to train models or be [materialized to file(s)](training_dataset.md) for later use in training.

The idea of the Feature Store is to have pre-computed features available for both training and serving models. The key operations required to generate training datasets from reusable features are feature selection, joins, filters and point-in-time queries. To enable this functionality, we are introducing a new expressive Query abstraction with `HSFS` that provides these operations and guarantees reproducible creation of training datasets from features in the Feature Store.

The new joining functionality is heavily inspired by the APIs used by Pandas to merge DataFrames. The APIs allow you to specify which features to select from which feature group, how to join them and which features to use in join conditions.

```python
# create a query
feature_join = (
    rain_fg.select_all()
    .join(temperature_fg.select_all(), on=["date", "location_id"])
    .join(location_fg.select_all())
)

td = fs.create_training_dataset(
    "rain_dataset",
    version=1,
    label="weekly_rain",
    data_format="tfrecords",
)

# materialize query in the specified file format
td.save(feature_join)

# use materialized training dataset for training, possibly in a different environment
td = fs.get_training_dataset("rain_dataset", version=1)

# get TFRecordDataset to use in a TensorFlow model
dataset = td.tf_data().tf_record_dataset(batch_size=32, num_epochs=100)

# reproduce query for online feature store and drop label for inference
jdbc_querystring = td.get_query(online=True, with_label=False)
```

If a data scientist wants to use a new feature that is not available in the Feature Store, she can write code to compute the new feature (using existing features or external data) and ingest the new feature values into the Feature Store. If the new feature is based solely on existing feature values in the Feature Store, we call it a derived feature. The same HSFS APIs can be used to compute derived features as well as features using external data sources.

## The Query Abstraction

Most operations performed on `FeatureGroup` metadata objects will return a `Query` with the applied operation.

### Examples

For example, selecting features from a feature group is a lazy operation, returning a query with the selected
features only:

```python
rain_fg = fs.get_feature_group("rain_fg")

# Returns Query
feature_join = rain_fg.select(["location_id", "weekly_rainfall"])
```

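To make "lazy" concrete, here is a minimal sketch of the idea in plain Python. `SketchFeatureGroup`, `SketchQuery` and the `reads` counter are hypothetical stand-ins invented for this illustration; they are not the HSFS classes and say nothing about how HSFS executes a query internally.

```python
# Hypothetical stand-ins for illustration only; these are not the HSFS classes.
class SketchQuery:
    """Records which features were selected without touching any data."""

    def __init__(self, feature_group, features):
        self.feature_group = feature_group
        self.features = list(features)

    def read(self):
        # Data is only accessed here, when the result is actually requested.
        self.feature_group.reads += 1
        return [
            {name: row[name] for name in self.features}
            for row in self.feature_group.rows
        ]


class SketchFeatureGroup:
    def __init__(self, rows):
        self.rows = rows
        self.reads = 0  # counts how often the underlying rows were accessed

    def select(self, features):
        # Lazy: build and return a query description, do not read any rows.
        return SketchQuery(self, features)


rain_fg = SketchFeatureGroup(
    [
        {"location_id": 10, "weekly_rainfall": 12.5, "date": "2020-01-01"},
        {"location_id": 11, "weekly_rainfall": 3.0, "date": "2020-01-01"},
    ]
)

query = rain_fg.select(["location_id", "weekly_rainfall"])
reads_before = rain_fg.reads  # nothing has been read at this point
result = query.read()  # the selection is applied only now
```

In this sketch, `select` only stores the list of feature names, so building a query is cheap no matter how large the feature group is.
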
Similarly, joins return queries:

```python
feature_join = (
    rain_fg.select_all()
    .join(temperature_fg.select_all(), on=["date", "location_id"])
    .join(location_fg.select_all())
)
```

Filters return queries as well:

```python
feature_join = rain_fg.filter(rain_fg.location_id == 10)
```

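An expression such as `rain_fg.location_id == 10` can only be passed to `filter` if the comparison produces a filter description rather than a plain boolean. The sketch below shows one way to get that behaviour with Python operator overloading. `SketchFeature`, `SketchFilter` and `SketchFeatureGroup` are hypothetical names made up for this example, and the attribute lookup via `__getattr__` is an assumption based on the commit messages of this pull request, not a copy of the HSFS implementation. The condition strings are the ones defined on the `Filter` class in this change.

```python
# Hypothetical stand-ins for illustration only; these are not the HSFS classes.
class SketchFilter:
    def __init__(self, feature, condition, value):
        self.feature = feature
        self.condition = condition
        self.value = value


class SketchFeature:
    def __init__(self, name):
        self.name = name

    # Comparison operators return a filter description instead of a bool.
    def __eq__(self, other):
        return SketchFilter(self.name, "EQUALS", other)

    def __ge__(self, other):
        return SketchFilter(self.name, "GREATER_THAN_OR_EQUAL", other)

    def __lt__(self, other):
        return SketchFilter(self.name, "LESS_THAN", other)


class SketchFeatureGroup:
    def __init__(self, feature_names):
        self._features = {name: SketchFeature(name) for name in feature_names}

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so `_features`
        # itself is found the regular way once __init__ has set it.
        try:
            return self.__dict__["_features"][name]
        except KeyError:
            raise AttributeError(name) from None


rain_fg = SketchFeatureGroup(["location_id", "weekly_rainfall"])

location_filter = rain_fg.location_id == 10
rain_filter = rain_fg.weekly_rainfall >= 5.0
```

Because the comparison returns an object, it can be stored, combined with other conditions and serialized later, which is what makes it usable inside a lazy query.
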
## Methods

{{query_methods}}

## Properties

{{query_properties}}
@@ -0,0 +1,15 @@
#
#   Copyright 2020 Logical Clocks AB
#
#   Licensed under the Apache License, Version 2.0 (the "License");
#   you may not use this file except in compliance with the License.
#   You may obtain a copy of the License at
#
#       http://www.apache.org/licenses/LICENSE-2.0
#
#   Unless required by applicable law or agreed to in writing, software
#   distributed under the License is distributed on an "AS IS" BASIS,
#   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#   See the License for the specific language governing permissions and
#   limitations under the License.
#
@@ -0,0 +1,140 @@
#
#   Copyright 2020 Logical Clocks AB
#
#   Licensed under the Apache License, Version 2.0 (the "License");
#   you may not use this file except in compliance with the License.
#   You may obtain a copy of the License at
#
#       http://www.apache.org/licenses/LICENSE-2.0
#
#   Unless required by applicable law or agreed to in writing, software
#   distributed under the License is distributed on an "AS IS" BASIS,
#   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#   See the License for the specific language governing permissions and
#   limitations under the License.
#

import json

from hsfs import util


class Filter:
    GE = "GREATER_THAN_OR_EQUAL"
    GT = "GREATER_THAN"
    NE = "NOT_EQUALS"
    EQ = "EQUALS"
    LE = "LESS_THAN_OR_EQUAL"
    LT = "LESS_THAN"

    def __init__(self, feature, condition, value):
        self._feature = feature
        self._condition = condition
        self._value = value

    def json(self):
        return json.dumps(self, cls=util.FeatureStoreEncoder)

    def to_dict(self):
        return {
            "feature": self._feature,
            "condition": self._condition,
            "value": str(self._value),
        }

    def __and__(self, other):
        if isinstance(other, Filter):
            return Logic.And(left_f=self, right_f=other)
        elif isinstance(other, Logic):
            return Logic.And(left_f=self, right_l=other)
        else:
            raise TypeError(
                "Operator `&` expected type `Filter` or `Logic`, got `{}`".format(
                    type(other)
                )
            )

    def __or__(self, other):
        if isinstance(other, Filter):
            return Logic.Or(left_f=self, right_f=other)
        elif isinstance(other, Logic):
            return Logic.Or(left_f=self, right_l=other)
        else:
            raise TypeError(
                "Operator `|` expected type `Filter` or `Logic`, got `{}`".format(
                    type(other)
                )
            )

    def __repr__(self):
        return f"Filter({self._feature!r}, {self._condition!r}, {self._value!r})"

    def __str__(self):
        return self.json()


class Logic:
    AND = "AND"
    OR = "OR"
    SINGLE = "SINGLE"

    def __init__(self, type, left_f=None, right_f=None, left_l=None, right_l=None):
        self._type = type
        self._left_f = left_f
        self._right_f = right_f
        self._left_l = left_l
        self._right_l = right_l

    def json(self):
        return json.dumps(self, cls=util.FeatureStoreEncoder)

    def to_dict(self):
        return {
            "type": self._type,
            "leftFilter": self._left_f,
            "rightFilter": self._right_f,
            "leftLogic": self._left_l,
            "rightLogic": self._right_l,
        }

    @classmethod
    def And(cls, left_f=None, right_f=None, left_l=None, right_l=None):
        return cls(cls.AND, left_f, right_f, left_l, right_l)

    @classmethod
    def Or(cls, left_f=None, right_f=None, left_l=None, right_l=None):
        return cls(cls.OR, left_f, right_f, left_l, right_l)

    @classmethod
    def Single(cls, left_f):
        return cls(cls.SINGLE, left_f)

    def __and__(self, other):
        if isinstance(other, Filter):
            return Logic.And(left_l=self, right_f=other)
        elif isinstance(other, Logic):
            return Logic.And(left_l=self, right_l=other)
        else:
            raise TypeError(
                "Operator `&` expected type `Filter` or `Logic`, got `{}`".format(
                    type(other)
                )
            )

    def __or__(self, other):
        if isinstance(other, Filter):
            return Logic.Or(left_l=self, right_f=other)
        elif isinstance(other, Logic):
            return Logic.Or(left_l=self, right_l=other)
        else:
            raise TypeError(
                "Operator `|` expected type `Filter` or `Logic`, got `{}`".format(
                    type(other)
                )
            )

    def __repr__(self):
        return f"Logic({self._type!r}, {self._left_f!r}, {self._right_f!r}, {self._left_l!r}, {self._right_l!r})"

    def __str__(self):
        return self.json()
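
The `&` and `|` operators in the two classes above build a binary tree: each `Logic` node holds its operands either as filters (`leftFilter`, `rightFilter`) or as nested logic nodes (`leftLogic`, `rightLogic`). The following condensed, self-contained sketch reproduces that behaviour so the resulting tree can be inspected. Two things are assumptions: `SketchEncoder` stands in for `util.FeatureStoreEncoder`, which is not part of the diff shown here and is assumed to call `to_dict()` on objects that provide it, and the feature is passed as a plain string for brevity.

```python
import json


class Filter:
    def __init__(self, feature, condition, value):
        self._feature, self._condition, self._value = feature, condition, value

    def to_dict(self):
        return {
            "feature": self._feature,
            "condition": self._condition,
            "value": str(self._value),
        }

    def _combine(self, logic_type, other):
        # A filter on the left always goes into the left filter slot.
        if isinstance(other, Filter):
            return Logic(logic_type, left_f=self, right_f=other)
        if isinstance(other, Logic):
            return Logic(logic_type, left_f=self, right_l=other)
        raise TypeError(f"expected `Filter` or `Logic`, got `{type(other)}`")

    def __and__(self, other):
        return self._combine(Logic.AND, other)

    def __or__(self, other):
        return self._combine(Logic.OR, other)


class Logic:
    AND = "AND"
    OR = "OR"

    def __init__(self, type, left_f=None, right_f=None, left_l=None, right_l=None):
        self._type = type
        self._left_f, self._right_f = left_f, right_f
        self._left_l, self._right_l = left_l, right_l

    def to_dict(self):
        return {
            "type": self._type,
            "leftFilter": self._left_f,
            "rightFilter": self._right_f,
            "leftLogic": self._left_l,
            "rightLogic": self._right_l,
        }


class SketchEncoder(json.JSONEncoder):
    # Assumption: stands in for util.FeatureStoreEncoder, which is not shown here.
    def default(self, o):
        if hasattr(o, "to_dict"):
            return o.to_dict()
        return super().default(o)


near = Filter("location_id", "EQUALS", 10)
wet = Filter("weekly_rainfall", "GREATER_THAN", 5)
dry = Filter("weekly_rainfall", "LESS_THAN", 1)

# Parentheses are needed in general, since `&` binds tighter than `|` in Python.
tree = near & (wet | dry)
payload = json.loads(json.dumps(tree, cls=SketchEncoder))
```

Here `payload` is a nested dictionary whose root node has type `AND`, with `near` in `leftFilter` and the `OR` node in `rightLogic`; unused slots serialize as `null`.
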
File renamed without changes.
File renamed without changes.
I'd keep this first example simple and also mention how the joining key is selected.
Added more examples, also with Scala equivalents.