Refreshable http getters #882
Conversation
Codecov Report ❌ Patch coverage is

Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main     #882    +/-  ##
=========================================
- Coverage   96.94%   95.97%   -0.98%
=========================================
  Files         202      207       +5
  Lines       12681    13597     +916
=========================================
+ Hits        12294    13050     +756
- Misses        387      547     +160
```
# Conflicts: # CHANGELOG.md
```diff
 validator=validators.deep_mapping(
     key_validator=validators.instance_of(str),
-    value_validator=validators.instance_of((str, bool, list)),
+    value_validator=validators.instance_of((str, bool, list, int, float, type(None))),
```
Why do we need this and in particular type(None)? Same for Line 176.
JSON can have null values, which were not supported before, so I added this along with the other missing types.
I can see that int and float make sense here. However, null / None does not seem to be supported, leading to the key not being added as a field.
I think it would make sense to have a test per added datatype to ensure that the processor works as intended.
I removed the support for None type and added tests for the remaining types.
```python
In that case, only one file must exist."""

_base_add: dict = field(default={}, eq=False)
"""Stores original add fields for future refreshes of getters"""
```
Suggested change:

```diff
-"""Stores original add fields for future refreshes of getters"""
+"""Stores original add fields (as provided in the config) for future refreshes of getters"""
```
As I understand it, we support configuring add and add_from_file simultaneously and merge the data into one dict. _base_add helps us remember the original data from config.add as a baseline to combine with the refreshed getter contents. If this is correct, then it would be great to add to the documentation that the combination of sources is allowed.
This is correct, I will document it.
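The merge described above can be sketched as follows (the helper name is hypothetical, not the Logprep API; which side wins on key collisions is an assumption of this sketch, here the refreshed file/endpoint data overrides the static config values):

```python
def merge_add_fields(base_add: dict, refreshed: dict) -> dict:
    """Combine the original config `add` fields with refreshed getter contents.

    `base_add` plays the role of `_base_add`: the literal mapping from the
    config, kept so each refresh can rebuild the combined dict from a clean
    baseline instead of accumulating stale keys.
    """
    merged = dict(base_add)   # start from the original config values
    merged.update(refreshed)  # assumed precedence: refreshed data wins
    return merged
```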
```python
def refresh_getters():
    """Refreshes all refreshable getters"""
    HttpGetter.refresh()
```
Suggested change:

```diff
-    HttpGetter.refresh()
+    RefreshableGetter.refresh()
```
This would not work, since each class has its own _shared variable; _shared is defined as an abstract property in the superclass.
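A minimal sketch of the pattern being discussed (names mirror the PR, but the bodies and the `reload` method are illustrative, not the Logprep source). It shows why `RefreshableGetter.refresh()` alone cannot work: on the abstract base, `_shared` is only the property object, so there is no concrete registry to iterate.

```python
from abc import ABC, abstractmethod


class RefreshableGetter(ABC):
    """Interface for getters that refresh their value periodically (sketch)."""

    @property
    @abstractmethod
    def _shared(self):
        """Per-subclass registry of getters to refresh."""

    @classmethod
    def refresh(cls):
        # `cls._shared` resolves against the concrete class this is called
        # on; on the abstract base it is just the property object, which is
        # not iterable -- hence refresh() must be called per subclass.
        for getter in cls._shared:
            getter.reload()


class HttpGetter(RefreshableGetter):
    _shared: list = []  # each subclass carries its own class-level registry

    def reload(self):
        self.reloaded = True
```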
```python
class RefreshableGetter(Getter, ABC):
    """Interface for getters that refresh their value periodically"""

    _logger = logging.getLogger("console")
```
This should not be "console", right?
It did not print any logs when it crashed on startup unless the logger name was "console". That is why I kept it like that, even though another logger name would be more fitting.
I see. But I think this is rather a problem with the logger configuration. We should definitely keep the logs, as these are useful for debugging and operations purposes. We can surely find a way to make this work with another name.
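One way the "another name" suggestion could look (the logger name "refreshable_getter" and this setup are assumptions for illustration, not from the Logprep source): attach a handler to the named logger early, so messages emitted during a startup crash are still visible even before the full logging configuration is applied.

```python
import logging
import sys

# Hypothetical logger name; the point is that any named logger emits startup
# messages as long as a handler is attached before the first log call.
logger = logging.getLogger("refreshable_getter")
if not logger.handlers:  # guard against duplicate handlers on re-import
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

logger.info("refreshable getters initialized")
```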
I found some minor points.
```diff
     return FileGetter(protocol=protocol, target=target)  # type: ignore
 if protocol == "http":
-    return HttpGetter(protocol=protocol, target=target)
+    return HttpGetter(protocol=protocol, target=target)  # type: ignore
```
This throws a pylint error: "Abstract class 'HttpGetter' with abstract methods instantiated".
RefreshableGetter declares _shared as an abstract method (line 143), but HttpGetter provides it as a class variable (line 371).
I did this to ensure that _shared has to be defined in all subclasses, since you can't have abstract class variables.
I did not find a better way. Do you maybe have an idea how else to do it?
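One possible alternative (a sketch, not something proposed in the PR): enforce the class variable in `__init_subclass__` instead of via an abstract property. This avoids instantiating a class pylint considers abstract, while still failing fast, at class-definition time, when a subclass forgets to define its own `_shared`.

```python
class RefreshableGetter:
    """Base class enforcing a per-subclass `_shared` registry without ABC."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Require every subclass to define its *own* `_shared`; checking
        # cls.__dict__ (not getattr) prevents silently inheriting the
        # registry of another subclass.
        if "_shared" not in cls.__dict__:
            raise TypeError(
                f"{cls.__name__} must define its own `_shared` class variable"
            )


class HttpGetter(RefreshableGetter):
    _shared: list = []
```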
The return of default values looks good to me.
logprep/util/configuration.py (outdated)
```python
        get_variable_values(val, variable_values)
    elif isinstance(value, str):
        if value.startswith(("http://", "https://", "file://")):
            variable_values.append(GetterFactory.from_string(value).get_json())
```
I am struggling to grasp the implications of this change and see a potential risk here.
What the change does is to follow all references (http/https endpoints and files), collect their contents (get_json) and include this data in the check whether the configuration has changed. Semantically I can see where this is going -- configuration changes might be hidden behind files/endpoints and we want to make sure that the config is also reloaded when the referenced data changes.
Potential problems I see:

1. We are always using `get_json` which will probably fail (=Exception) for referenced yaml-files. Solution: use something more flexible like `get_dict` instead
2. We are traversing the whole config tree, effectively pulling in all referenced values (e.g. processor-specific rule configuration data)
   2.1. Who is responsible for reloading processor-specific data? Does the processor register callbacks on its getters (as this PR does it for the generic resolver) and update itself, or does the config reload everything and initiate pipeline reloads? In any case we shouldn't mix approaches. (I btw favor the variant that the processor takes the responsibility and reloads itself.)
   2.2. We are potentially iterating over a lot of references, taking quite some time in the manager process (I/O) and collecting a significant amount of data (caches + parsed data). We even need to preserve all the data for comparison purposes. This might block the manager process and even lead to OOMs.
   2.3. The manager process suddenly collects data which otherwise would only be loaded in the pipeline processes, leading to duplicated data across processes and thus a higher overall memory usage.
3. In the end, the pipelines are only being reloaded if the configuration version has changed. So I wonder if loading the complete configuration is actually required and at least the pipelines-subtree of the configuration could be excluded.
   3.1. I am not sure about how other things (like metrics on the runner- or manager-level) are being reloaded and maybe these things take place immediately and without a version change. However, this would be a weird mixture of paradigms again.
> I am struggling to grasp the implications of this change and see a potential risk here.
> What the change does is to follow all references (http/https endpoints and files), collect their contents (`get_json`) and include this data in the check whether the configuration has changed. Semantically I can see where this is going -- configuration changes might be hidden behind files/endpoints and we want to make sure that the config is also reloaded when the referenced data changes.
>
> Potential problems I see:
>
> 1. We are always using `get_json` which will probably fail (=Exception) for referenced yaml-files. Solution: use something more flexible like `get_dict` instead
Yes, I will change it to get_dict.
> 2. We are traversing the whole config tree, effectively pulling in all referenced values (e.g. processor-specific rule configuration data) 2.1. Who is responsible for reloading processor-specific data? Does the processor register callbacks on its getters (as [this PR does it for the generic resolver](https://github.com/fkie-cad/Logprep/pull/882/files#diff-003f0669d3897777df80e086070b66beee9a7fad11a894b7c47c58cf7ae43cfbR196)) and update itself, or does the config reload everything and initiate pipeline reloads? In any case we shouldn't mix approaches. (I btw favor the variant that the processor takes the responsibility and reloads itself.)
A reload rebuilds all pipelines, so it would reload all processors with their rules. This only happens on config version changes when the configuration data has changed, so it is triggered intentionally. The only difference from before this pull request is that it also checks for differences in the data behind paths, not just the paths themselves. Shifting the responsibility for reloading processor-specific state on changes of values resolved from paths solely to callbacks might be better.
> 2.2. We are potentially iterating over a lot of references, taking quite some time in the manager process (I/O) and collecting a significant amount of data (caches + parsed data). We even need to preserve all the data for comparison purposes. This might block the manager process and even lead to OOMs.
Yes, maybe it would be better to check for a config version change before parsing the whole config if we want to keep this functionality.
> 2.3. The manager process suddenly collects data which otherwise would only be loaded in the pipeline processes, leading to duplicated data across processes and thus a higher overall memory usage.
Alternatively, one could always reload if the config version changes even if the config content did not change. Then we don't have to keep track of variable values, but every config version change would cause a reload, which would probably happen anyways, since the version is usually changed when the content is also changed.
> 3. In the end, [the pipelines are only being reloaded if the configuration version has changed](https://github.com/fkie-cad/Logprep/blob/main/logprep/runner.py#L122). So I wonder if loading the complete configuration is actually required and at least the pipelines-subtree of the configuration could be excluded.
Yes, we could exclude the pipeline from resolving paths and rely only on callbacks for processor specific reloads triggered by changes from resolved paths.
> 3.1. I am not sure about how other things (like metrics on the runner- or manager-level) are being reloaded and maybe these things take place immediately and without a version change. However, this would be a weird mixture of paradigms again.
Many thanks for your replies. I think it makes sense for this PR to not collect file-/endpoint-data and rather have a second PR down the road (which we can handle) which does this:
> Alternatively, one could always reload if the config version changes even if the config content did not change. Then we don't have to keep track of variable values, but every config version change would cause a reload, which would probably happen anyways, since the version is usually changed when the content is also changed.
Fine for you?
Yes, that would be fine for me.
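The agreed follow-up direction can be sketched as a version-only comparison (the helper name is hypothetical, not the Logprep runner API):

```python
def needs_reload(active_config: dict, fetched_config: dict) -> bool:
    """Reload whenever the declared `version` field differs.

    Referenced files/endpoints are deliberately not resolved here: content
    changes without a version bump do NOT trigger a reload under this scheme,
    which avoids the I/O and memory concerns raised in the review.
    """
    return active_config.get("version") != fetched_config.get("version")
```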
# Conflicts: # tests/unit/framework/test_pipeline_manager.py