
Conversation


@ppcad ppcad commented Sep 23, 2025

No description provided.

@ppcad ppcad self-assigned this Sep 23, 2025
@ppcad ppcad linked an issue Oct 1, 2025 that may be closed by this pull request
@ppcad ppcad force-pushed the dev-refreshable-http-getters branch from 9069687 to d2c092a on October 2, 2025 12:42
@codecov-commenter

Codecov Report

❌ Patch coverage is 92.06349% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.97%. Comparing base (ba7aa1a) to head (81dae9b).
⚠️ Report is 3 commits behind head on main.

Files with missing lines                     Patch %   Lines
logprep/util/getter.py                       89.72%    15 Missing ⚠️
logprep/processor/generic_resolver/rule.py   92.10%     3 Missing ⚠️
logprep/util/configuration.py                93.33%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #882      +/-   ##
==========================================
- Coverage   96.94%   95.97%   -0.98%     
==========================================
  Files         202      207       +5     
  Lines       12681    13597     +916     
==========================================
+ Hits        12294    13050     +756     
- Misses        387      547     +160     


@ppcad ppcad requested a review from kaya-david October 15, 2025 07:21
@ppcad ppcad marked this pull request as ready for review October 15, 2025 07:31
@kaya-david kaya-david requested a review from kmeinerz October 15, 2025 07:50
@kaya-david
Collaborator

@kmeinerz Could you please take over the PRs related to the http getters topic?
@ppcad @kmeinerz is currently working on this topic - please assign him as reviewer for this type of PR.

validator=validators.deep_mapping(
key_validator=validators.instance_of(str),
value_validator=validators.instance_of((str, bool, list)),
value_validator=validators.instance_of((str, bool, list, int, float, type(None))),
Collaborator
@mhoff mhoff commented Oct 27, 2025

Why do we need this and in particular type(None)? Same for Line 176.

Collaborator Author

JSON can have null values, which were not supported before, so I added this along with the other missing types.

Collaborator

I can see that int and float make sense here. However, null/None does not seem to be supported, leading to the key not being added as a field.

I think it would make sense to have a test per added datatype to ensure that the processor works as intended.

Collaborator Author

I removed the support for None type and added tests for the remaining types.
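The resolution above (int and float kept, None support dropped) can be illustrated with a plain-Python stand-in for the check that validators.deep_mapping performs on the add mapping. The function name is hypothetical; only the allowed type set mirrors the diff:

```python
def validate_add_mapping(add: dict) -> None:
    """Sketch of what the extended deep_mapping validator enforces:
    str keys, and values restricted to str, bool, list, int, float
    (None is rejected, per the review resolution)."""
    allowed = (str, bool, list, int, float)
    for key, value in add.items():
        if not isinstance(key, str):
            raise TypeError(f"key {key!r} must be a str")
        if not isinstance(value, allowed):
            raise TypeError(f"value {value!r} is not an allowed type")
```

Note that bool passes the int check anyway since bool subclasses int; listing it explicitly just keeps the intent readable.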

In that case, only one file must exist."""

_base_add: dict = field(default={}, eq=False)
"""Stores original add fields for future refreshes of getters"""
Collaborator

Suggested change
"""Stores original add fields for future refreshes of getters"""
"""Stores original add fields (as provided in the config) for future refreshes of getters"""

As I understand it, we support configuring add and add_from_file simultaneously and merge the data into one dict. _base_add helps us remember the original data from config.add as a baseline to combine it with the refreshed getter contents. If this is correct, then it would be great to add to the documentation that this combination of sources is allowed.

Collaborator Author

This is correct, I will document it.
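The merge semantics described above can be sketched as follows; the function and parameter names are assumptions, not the PR's actual code, but `base_add` plays the role of _base_add:

```python
def merged_add(base_add: dict, fetched: dict) -> dict:
    """Combine both add sources on a refresh: start from the add entries
    exactly as written in the config (the _base_add baseline), then lay
    the freshly fetched add_from_file contents on top."""
    merged = dict(base_add)  # copy so the baseline survives future refreshes
    merged.update(fetched)   # fetched entries win on key collisions (assumed)
    return merged
```

Keeping the baseline as a separate copy is what allows repeated refreshes: each refresh recomputes the merge from the pristine config data instead of mutating it.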


def refresh_getters():
"""Refreshes all refreshable getters"""
HttpGetter.refresh()
Collaborator

Suggested change
HttpGetter.refresh()
RefreshableGetter.refresh()

Collaborator Author

This would not work, since each class has its own _shared variable.
_shared is defined as an abstract property in the superclass.
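A minimal sketch of this per-class registry pattern, with the method bodies assumed rather than taken from the PR: because each concrete getter class owns its own _shared list, refresh() must be invoked on the subclass (HttpGetter.refresh()), never on the base class itself.

```python
from abc import ABC

class RefreshableGetter(ABC):
    """Base class; deliberately has no _shared of its own."""

    @classmethod
    def refresh(cls) -> None:
        # cls._shared resolves to the *calling subclass's* registry;
        # on RefreshableGetter itself this raises AttributeError.
        for getter in cls._shared:
            getter.refreshed = True

class HttpGetter(RefreshableGetter):
    _shared: list = []  # registry local to HttpGetter

    def __init__(self) -> None:
        self.refreshed = False
        HttpGetter._shared.append(self)
```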

class RefreshableGetter(Getter, ABC):
"""Interface for getters that refresh their value periodically"""

_logger = logging.getLogger("console")
Collaborator

This should not be "console", right?

Collaborator Author

If the logger was not "console", it did not print any logs when Logprep crashed on startup.
That is why I kept it like this, even though another logger name would be more fitting.

Collaborator

I see. But I think this is rather a problem with the logger configuration. We should definitely keep the logs, as they are useful for debugging and operations purposes. We can surely find a way to make this work with another name.
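One possible direction, assuming the startup logs vanished because no handler was attached to any logger other than "console" during early startup (an assumption about the root cause, not a confirmed diagnosis): pick the fitting name and add a defensive fallback handler only when nothing is configured yet.

```python
import logging

# Hypothetical logger name; not taken from the Logprep codebase.
logger = logging.getLogger("logprep.util.getter")

if not logging.getLogger().handlers:
    # No handlers configured yet (early startup): install a basic one so
    # crash messages are not silently dropped. A later call to the real
    # logging configuration can still replace this.
    logging.basicConfig(level=logging.INFO)
```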

Collaborator
@kmeinerz kmeinerz left a comment

I found some minor points.

return FileGetter(protocol=protocol, target=target) # type: ignore
if protocol == "http":
return HttpGetter(protocol=protocol, target=target)
return HttpGetter(protocol=protocol, target=target) # type: ignore
Collaborator

This throws a pylint error: "Abstract class 'HttpGetter' with abstract methods instantiated".
RefreshableGetter implements _shared as an abstract method (line 143), but in HttpGetter it is provided as a class variable (line 371).

Collaborator Author

I did this to ensure that _shared has to be defined in all subclasses, since you can't have abstract class variables.
I did not find a better way. Do you maybe have an idea how else to do it?
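One alternative worth considering (a sketch, not the PR's code): enforce the class variable in __init_subclass__ instead of declaring _shared abstract. This keeps pylint happy, because HttpGetter is no longer "abstract with abstract methods", while still failing loudly for any subclass that forgets its own registry. Note it would also force intermediate abstract subclasses to define _shared.

```python
from abc import ABC

class RefreshableGetter(ABC):
    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        # Check cls.__dict__, not hasattr(): each subclass must define its
        # *own* _shared rather than inheriting one from a parent class.
        if "_shared" not in cls.__dict__:
            raise TypeError(f"{cls.__name__} must define a _shared class variable")

class HttpGetter(RefreshableGetter):
    _shared: list = []
```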

@kmeinerz
Collaborator

I added an option for refreshable getters to return default values. Can you please take a look? @kmeinerz

The return of default values looks good to me.

get_variable_values(val, variable_values)
elif isinstance(value, str):
if value.startswith(("http://", "https://", "file://")):
variable_values.append(GetterFactory.from_string(value).get_json())
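The traversal in the snippet above can be sketched in isolation as follows. The `fetch` parameter stands in for `GetterFactory.from_string(value).get_json()`, and the list branch is an assumption about the surrounding (not shown) code:

```python
def get_variable_values(value, variable_values: list, fetch) -> None:
    """Recursively walk a config tree and collect the content behind
    every http/https/file getter reference into variable_values."""
    if isinstance(value, dict):
        for val in value.values():
            get_variable_values(val, variable_values, fetch)
    elif isinstance(value, list):
        for val in value:
            get_variable_values(val, variable_values, fetch)
    elif isinstance(value, str):
        if value.startswith(("http://", "https://", "file://")):
            variable_values.append(fetch(value))
```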
Collaborator

I am struggling to grasp the implications of this change and see a potential risk here.

What the change does is to follow all references (http/https endpoints and files), collect their contents (get_json) and include this data in the check whether the configuration has changed. Semantically I can see where this is going -- configuration changes might be hidden behind files/endpoints and we want to make sure that the config is also reloaded when the referenced data changes.

Potential problems I see:

  1. We are always using get_json which will probably fail (=Exception) for referenced yaml-files. Solution: use something more flexible like get_dict instead
  2. We are traversing the whole config tree, effectively pulling in all referenced values (e.g. processor-specific rule configuration data)
    2.1. Who is responsible for reloading processor-specific data? Does the processor register callbacks on its getters (as this PR does it for the generic resolver) and update itself, or does the config reload everything and initiate pipeline reloads? In any case we shouldn't mix approaches. (I btw favor the variant that the processor takes the responsibility and reloads itself).
    2.2. We are potentially iterating over a lot of references, taking quite some time in the manager process (I/O) and collecting a significant amount of data (caches + parsed data). We even need to preserve all the data for comparison purposes. This might block the manager process and even lead to OOMs.
    2.3. The manager process suddenly collects data which otherwise would only be loaded in the pipeline processes, leading to duplicated data across processes and thus a higher overall memory usage.
  3. In the end, the pipelines are only being reloaded if the configuration version has changed. So I wonder if loading the complete configuration is actually required and at least the pipelines-subtree of the configuration could be excluded.
    3.1. I am not sure about how other things (like metrics on the runner- or manager-level) are being reloaded and maybe these things take place immediately and without a version change. However, this would be a weird mixture of paradigms again.

Collaborator Author

I am struggling to grasp the implications of this change and see a potential risk here.

What the change does is to follow all references (http/https endpoints and files), collect their contents (get_json) and include this data in the check whether the configuration has changed. Semantically I can see where this is going -- configuration changes might be hidden behind files/endpoints and we want to make sure that the config is also reloaded when the referenced data changes.

Potential problems I see:

1. We are always using `get_json` which will probably fail (=Exception) for referenced yaml-files. Solution: use something more flexible like `get_dict` instead

Yes, I will change it to get_dict.

2. We are traversing the whole config tree, effectively pulling in all referenced values (e.g. processor-specific rule configuration data)
   2.1. Who is responsible for reloading processor-specific data? Does the processor register callbacks on its getters (as [this PR does it for the generic resolver](https://github.com/fkie-cad/Logprep/pull/882/files#diff-003f0669d3897777df80e086070b66beee9a7fad11a894b7c47c58cf7ae43cfbR196)) and update itself, or does the config reload everything and initiate pipeline reloads? In any case we shouldn't mix approaches. (I btw favor the variant that the processor takes the responsibility and reloads itself).

A reload rebuilds all pipelines, so it would reload all processors with their rules. This only happens on config version changes when the configuration data has changed, so it is triggered intentionally. The only difference to before this pull request is that it also checks whether the data behind referenced paths has changed, not just the paths themselves. Shifting the responsibility for reloading processor-specific state on changes of values resolved from paths solely to callbacks might be better.

   2.2. We are potentially iterating over a lot of references, taking quite some time in the manager process (I/O) and collecting a significant amount of data (caches + parsed data). We even need to preserve all the data for comparison purposes. This might block the manager process and even lead to OOMs.

Yes, maybe it would be better to check for a config version change before parsing the whole config if we want to keep this functionality.

   2.3. The manager process suddenly collects data which otherwise would only be loaded in the pipeline processes, leading to duplicated data across processes and thus a higher overall memory usage.

Alternatively, one could always reload if the config version changes, even if the config content did not change. Then we don't have to keep track of variable values, but every config version change would cause a reload, which would probably happen anyway, since the version is usually changed when the content also changes.

3. In the end, [the pipelines are only being reloaded if the configuration version has changed](https://github.com/fkie-cad/Logprep/blob/main/logprep/runner.py#L122). So I wonder if loading the complete configuration is actually required and at least the pipelines-subtree of the configuration could be excluded.

Yes, we could exclude the pipeline from resolving paths and rely only on callbacks for processor-specific reloads triggered by changes in resolved paths.

   3.1. I am not sure about how other things (like metrics on the runner- or manager-level) are being reloaded and maybe these things take place immediately and without a version change. However, this would be a weird mixture of paradigms again.

Collaborator

Many thanks for your replies. I think it makes sense for this PR to not collect file-/endpoint-data and rather have a second PR down the road (which we can handle) which does this:

Alternatively, one could always reload if the config version changes even if the config content did not change. Then we don't have to keep track of variable values, but every config version change would cause a reload, which would probably happen anyways, since the version is usually changed when the content is also changed.

Fine for you?

Collaborator Author

Yes, that would be fine for me.



Development

Successfully merging this pull request may close these issues.

Refreshable HTTP getters

6 participants