
Conversation


@ppcad ppcad commented Sep 23, 2025

No description provided.

@ppcad ppcad self-assigned this Sep 23, 2025
@ppcad ppcad linked an issue Oct 1, 2025 that may be closed by this pull request
@ppcad ppcad force-pushed the dev-refreshable-http-getters branch from 9069687 to d2c092a on October 2, 2025 12:42
@codecov-commenter

Codecov Report

❌ Patch coverage is 92.06349% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.97%. Comparing base (ba7aa1a) to head (81dae9b).
⚠️ Report is 3 commits behind head on main.

Files with missing lines                     Patch %   Lines
logprep/util/getter.py                       89.72%    15 Missing ⚠️
logprep/processor/generic_resolver/rule.py   92.10%     3 Missing ⚠️
logprep/util/configuration.py                93.33%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #882      +/-   ##
==========================================
- Coverage   96.94%   95.97%   -0.98%     
==========================================
  Files         202      207       +5     
  Lines       12681    13597     +916     
==========================================
+ Hits        12294    13050     +756     
- Misses        387      547     +160     


@ppcad ppcad requested a review from kaya-david October 15, 2025 07:21
@ppcad ppcad marked this pull request as ready for review October 15, 2025 07:31
@kaya-david kaya-david requested a review from kmeinerz October 15, 2025 07:50
@kaya-david
Collaborator

@kmeinerz Could you please take over the PRs related to the http getters topic?
@ppcad @kmeinerz is currently working on this topic - please assign him as reviewer for this type of PR.

validator=validators.deep_mapping(
key_validator=validators.instance_of(str),
value_validator=validators.instance_of((str, bool, list)),
value_validator=validators.instance_of((str, bool, list, int, float, type(None))),
Collaborator
@mhoff mhoff commented Oct 27, 2025

Why do we need this and in particular type(None)? Same for Line 176.

Collaborator Author

JSON can have null values, which were not supported before, so I added this along with the other missing types.

Collaborator

I can see that int and float make sense here. However, null/None does not seem to be supported, leading to the key not being added as a field.

I think it would make sense to have a test per added datatype to ensure that the processor works as intended.

Collaborator Author

I removed the support for None type and added tests for the remaining types.
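The resolution above (int and float kept, None support dropped) can be illustrated with a plain-Python stand-in for the check that validators.deep_mapping performs on the add mapping. The function name is hypothetical; only the allowed type set mirrors the diff:

```python
def validate_add_mapping(add: dict) -> None:
    """Sketch of what the extended deep_mapping validator enforces:
    str keys, and values restricted to str, bool, list, int, float
    (None is rejected, per the review resolution)."""
    allowed = (str, bool, list, int, float)
    for key, value in add.items():
        if not isinstance(key, str):
            raise TypeError(f"key {key!r} must be a str")
        if not isinstance(value, allowed):
            raise TypeError(f"value {value!r} is not an allowed type")
```

Note that bool passes the int check anyway since bool subclasses int; listing it explicitly just keeps the intent readable.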

In that case, only one file must exist."""

_base_add: dict = field(default={}, eq=False)
"""Stores original add fields for future refreshes of getters"""
Collaborator

Suggested change
"""Stores original add fields for future refreshes of getters"""
"""Stores original add fields (as provided in the config) for future refreshes of getters"""

As I understand it, we support configuring add and add_from_file simultaneously and merge the data into one dict. _base_add helps us remember the original data from config.add as a baseline to combine it with the refreshed getter contents. If this is correct, then it would be great to add to the documentation that this combination of sources is allowed.

Collaborator Author

This is correct, I will document it.
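The merge semantics described above can be sketched as follows; the function and parameter names are assumptions, not the PR's actual code, but `base_add` plays the role of _base_add:

```python
def merged_add(base_add: dict, fetched: dict) -> dict:
    """Combine both add sources on a refresh: start from the add entries
    exactly as written in the config (the _base_add baseline), then lay
    the freshly fetched add_from_file contents on top."""
    merged = dict(base_add)  # copy so the baseline survives future refreshes
    merged.update(fetched)   # fetched entries win on key collisions (assumed)
    return merged
```

Keeping the baseline as a separate copy is what allows repeated refreshes: each refresh recomputes the merge from the pristine config data instead of mutating it.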


def refresh_getters():
"""Refreshes all refreshable getters"""
HttpGetter.refresh()
Collaborator

Suggested change
HttpGetter.refresh()
RefreshableGetter.refresh()

Collaborator Author

This would not work, since each class has its own _shared variable.
_shared is defined as an abstract property in the superclass.
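A minimal sketch of this per-class registry pattern, with the method bodies assumed rather than taken from the PR: because each concrete getter class owns its own _shared list, refresh() must be invoked on the subclass (HttpGetter.refresh()), never on the base class itself.

```python
from abc import ABC

class RefreshableGetter(ABC):
    """Base class; deliberately has no _shared of its own."""

    @classmethod
    def refresh(cls) -> None:
        # cls._shared resolves to the *calling subclass's* registry;
        # on RefreshableGetter itself this raises AttributeError.
        for getter in cls._shared:
            getter.refreshed = True

class HttpGetter(RefreshableGetter):
    _shared: list = []  # registry local to HttpGetter

    def __init__(self) -> None:
        self.refreshed = False
        HttpGetter._shared.append(self)
```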

class RefreshableGetter(Getter, ABC):
"""Interface for getters that refresh their value periodically"""

_logger = logging.getLogger("console")
Collaborator

This should not be "console", right?

Collaborator Author

If the logger was not "console", it did not print any logs when Logprep crashed on startup.
That is why I kept it like this, even though another logger name would be more fitting.

Collaborator

I see. But I think this is rather a problem with the logger configuration. We should definitely keep the logs, as they are useful for debugging and operations purposes. We can surely find a way to make this work with another name.
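One possible direction, assuming the startup logs vanished because no handler was attached to any logger other than "console" during early startup (an assumption about the root cause, not a confirmed diagnosis): pick the fitting name and add a defensive fallback handler only when nothing is configured yet.

```python
import logging

# Hypothetical logger name; not taken from the Logprep codebase.
logger = logging.getLogger("logprep.util.getter")

if not logging.getLogger().handlers:
    # No handlers configured yet (early startup): install a basic one so
    # crash messages are not silently dropped. A later call to the real
    # logging configuration can still replace this.
    logging.basicConfig(level=logging.INFO)
```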

Collaborator
@kmeinerz kmeinerz left a comment

I found some minor points.

return FileGetter(protocol=protocol, target=target) # type: ignore
if protocol == "http":
return HttpGetter(protocol=protocol, target=target)
return HttpGetter(protocol=protocol, target=target) # type: ignore
Collaborator

This throws a pylint error: "Abstract class 'HttpGetter' with abstract methods instantiated".
RefreshableGetter implements _shared as an abstract method (line 143), but in HttpGetter it is provided as a class variable (line 371).

Collaborator Author

I did this to ensure that _shared has to be defined in all subclasses, since you can't have abstract class variables.
I did not find a better way. Do you maybe have an idea how else to do it?
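One alternative worth considering (a sketch, not the PR's code): enforce the class variable in __init_subclass__ instead of declaring _shared abstract. This keeps pylint happy, because HttpGetter is no longer "abstract with abstract methods", while still failing loudly for any subclass that forgets its own registry. Note it would also force intermediate abstract subclasses to define _shared.

```python
from abc import ABC

class RefreshableGetter(ABC):
    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        # Check cls.__dict__, not hasattr(): each subclass must define its
        # *own* _shared rather than inheriting one from a parent class.
        if "_shared" not in cls.__dict__:
            raise TypeError(f"{cls.__name__} must define a _shared class variable")

class HttpGetter(RefreshableGetter):
    _shared: list = []
```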

@kmeinerz
Collaborator

I added an option for refreshable getters to return default values. Can you please take a look? @kmeinerz

The return of default values looks good to me.

get_variable_values(val, variable_values)
elif isinstance(value, str):
if value.startswith(("http://", "https://", "file://")):
variable_values.append(GetterFactory.from_string(value).get_json())
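The traversal in the snippet above can be sketched in isolation as follows. The `fetch` parameter stands in for `GetterFactory.from_string(value).get_json()`, and the list branch is an assumption about the surrounding (not shown) code:

```python
def get_variable_values(value, variable_values: list, fetch) -> None:
    """Recursively walk a config tree and collect the content behind
    every http/https/file getter reference into variable_values."""
    if isinstance(value, dict):
        for val in value.values():
            get_variable_values(val, variable_values, fetch)
    elif isinstance(value, list):
        for val in value:
            get_variable_values(val, variable_values, fetch)
    elif isinstance(value, str):
        if value.startswith(("http://", "https://", "file://")):
            variable_values.append(fetch(value))
```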
Collaborator

I am struggling to grasp the implications of this change and see a potential risk here.

What the change does is to follow all references (http/https endpoints and files), collect their contents (get_json) and include this data in the check whether the configuration has changed. Semantically I can see where this is going -- configuration changes might be hidden behind files/endpoints and we want to make sure that the config is also reloaded when the referenced data changes.

Potential problems I see:

  1. We are always using get_json which will probably fail (=Exception) for referenced yaml-files. Solution: use something more flexible like get_dict instead
  2. We are traversing the whole config tree, effectively pulling in all referenced values (e.g. processor-specific rule configuration data)
    2.1. Who is responsible for reloading processor-specific data? Does the processor register callbacks on its getters (as this PR does it for the generic resolver) and update itself, or does the config reload everything and initiate pipeline reloads? In any case we shouldn't mix approaches. (I btw favor the variant that the processor takes the responsibility and reloads itself).
    2.2. We are potentially iterating over a lot of references, taking quite some time in the manager process (I/O) and collecting a significant amount of data (caches + parsed data). We even need to preserve all the data for comparison purposes. This might block the manager process and even lead to OOMs.
    2.3. The manager process suddenly collects data which otherwise would only be loaded in the pipeline processes, leading to duplicated data across processes and thus a higher overall memory usage.
  3. In the end, the pipelines are only being reloaded if the configuration version has changed. So I wonder if loading the complete configuration is actually required and at least the pipelines-subtree of the configuration could be excluded.
    3.1. I am not sure about how other things (like metrics on the runner- or manager-level) are being reloaded and maybe these things take place immediately and without a version change. However, this would be a weird mixture of paradigms again.

Collaborator Author

I am struggling to grasp the implications of this change and see a potential risk here.

What the change does is to follow all references (http/https endpoints and files), collect their contents (get_json) and include this data in the check whether the configuration has changed. Semantically I can see where this is going -- configuration changes might be hidden behind files/endpoints and we want to make sure that the config is also reloaded when the referenced data changes.

Potential problems I see:

1. We are always using `get_json` which will probably fail (=Exception) for referenced yaml-files. Solution: use something more flexible like `get_dict` instead

Yes, I will change it to get_dict.

2. We are traversing the whole config tree, effectively pulling in all referenced values (e.g. processor-specific rule configuration data)
   2.1. Who is responsible for reloading processor-specific data? Does the processor register callbacks on its getters (as [this PR does it for the generic resolver](https://github.com/fkie-cad/Logprep/pull/882/files#diff-003f0669d3897777df80e086070b66beee9a7fad11a894b7c47c58cf7ae43cfbR196)) and update itself, or does the config reload everything and initiate pipeline reloads? In any case we shouldn't mix approaches. (I btw favor the variant that the processor takes the responsibility and reloads itself).

A reload rebuilds all pipelines, so it would reload all processors with their rules. This only happens on config version changes when the configuration data has changed, so it is triggered intentionally. The only difference to before this pull request is that it also checks whether the data behind referenced paths has changed, not just the paths themselves. Shifting the responsibility for reloading processor-specific state on changes of values resolved from paths solely to callbacks might be better.

   2.2. We are potentially iterating over a lot of references, taking quite some time in the manager process (I/O) and collecting a significant amount of data (caches + parsed data). We even need to preserve all the data for comparison purposes. This might block the manager process and even lead to OOMs.

Yes, maybe it would be better to check for a config version change before parsing the whole config if we want to keep this functionality.

   2.3. The manager process suddenly collects data which otherwise would only be loaded in the pipeline processes, leading to duplicated data across processes and thus a higher overall memory usage.

Alternatively, one could always reload if the config version changes, even if the config content did not change. Then we don't have to keep track of variable values, but every config version change would cause a reload, which would probably happen anyway, since the version is usually changed when the content also changes.

3. In the end, [the pipelines are only being reloaded if the configuration version has changed](https://github.com/fkie-cad/Logprep/blob/main/logprep/runner.py#L122). So I wonder if loading the complete configuration is actually required and at least the pipelines-subtree of the configuration could be excluded.

Yes, we could exclude the pipeline from resolving paths and rely only on callbacks for processor-specific reloads triggered by changes in resolved paths.

   3.1. I am not sure about how other things (like metrics on the runner- or manager-level) are being reloaded and maybe these things take place immediately and without a version change. However, this would be a weird mixture of paradigms again.

Collaborator

Many thanks for your replies. I think it makes sense for this PR to not collect file-/endpoint-data and rather have a second PR down the road (which we can handle) which does this:

Alternatively, one could always reload if the config version changes even if the config content did not change. Then we don't have to keep track of variable values, but every config version change would cause a reload, which would probably happen anyways, since the version is usually changed when the content is also changed.

Fine for you?

Collaborator Author

Yes, that would be fine for me.



Development

Successfully merging this pull request may close these issues.

Refreshable HTTP getters

6 participants