-
Notifications
You must be signed in to change notification settings - Fork 110
feature(sdcm/sct_events): Real-time Argus Events #12095
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
dc8ca62 to
d24fbbf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Run also test with Argus disabled (e.g. locally) so we're sure it does not regress local tries. Add a comment that you tested it (or in description)
sdcm/sct_events/argus.py
Outdated
|
|
||
|
|
||
| ARGUS_EVENT_AGGREGATOR_TIME_WINDOW: float = 90 # seconds | ||
| ARGUS_EVENT_AGGREGATOR_MAX_DUPLICATES: int = 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe 1 is enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These values were taken from grafana implementation, we could reduce it, I'll ask people who decided on this number
| LOGGER.info("Initializing Argus connection...") | ||
| try: | ||
| cls._argus_client = get_argus_client(run_id=cls.test_id() if not test_id else test_id) | ||
| enable_argus_posting() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should start it along with other devices in sdcm.sct_events.setup.start_events_device
so we're not starting it before main events device is there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We aren't - this is to release the lock on the poster, which would either be started or not already by main events device - the reason this exception handler is there is to guard against situations where events device isn't initialized, for example inside the create_argus_run cli command
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
start_events_device also is not run when create_argus_run is fired - is it?
We start other devices there - so this place looks natural.
Also this enable_argus_posting would be redundant then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are currently 3 steps that need to be taken to enable argus events:
- Events device, done in
start_events_device, which initializes the argus events pipeline enable_posting_posting, which sets the client for the postmanstart_posting_argus_events, which releases the lock on the postman to start submitting the events
Latter two can be combined into one, but the split here is warranted as test_config logic decides whether or not argus is enabled and does not enable argus events if they aren't, the pipeline in this case just idles - same as grafana.
| LOGGER = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| class Argus: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need that? can we create argus client when enabling posting in ArgusEventPostman?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need that? can we create argus client when enabling posting in
ArgusEventPostman?
This means having to instantiate SCTConfiguration (circular dependency) or reading env vars and then also having to manage logic for when argus isn't available (which again, necessitates importing SCTConfiguration), where this singleton class should help other parts of SCT to get already initialized argus client without either resolving the dependency chain or having to manually initialize it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have TestConfig class which should have been initialized when trying to send event to Argus.
Can we use sdcm.test_config.TestConfig.argus_client?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Circular dependency as well here as importing TestConfig into argus.py means we can't import enablement functions into the test config (NOT the initialization function, those are imported into the events setup.py).
28d08b0 to
1ac6219
Compare
3df1eb5 to
6acc335
Compare
| self.outbound_queue.put(evt) | ||
|
|
||
|
|
||
| class ArgusEventAggregator(EventsProcessPipe[SCTArgusEvent, SCTArgusEvent]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, I dislike this algorithm: it delays event by time_window and in case of burst of 1000 events it will send anyway 200 of them (with 5 max duplicates set).
I think it would be better to fire off event straight away and for some time period silence the other events with the same event_key.
Something like this (needs testing with e.g. unit test):
event_key = self.unique_key(event)
if time.perf_counter() - time_window_counters.get(event_key, 0) > TIME_WINDOW # time window e.g. 30 seconds
# not seen event from sometime or ever
time_window_counters[event_key] = time.perf_counter()
else:
# recently seen
continue
This solution will bring some memory burden (proportional to no of distinct event_keys) - but assuming even 1M events, should fit memory anyway.
@k0machi can you try this approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, sounds better, I'll try it (probably should also figure out a way to add argus events to unit_tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see how other events devices are tested
6acc335 to
b4b73ae
Compare
|
Before merge: Needs rebase after #12091 is merged |
sdcm/sct_events/argus.py
Outdated
| time_window = ARGUS_EVENT_AGGREGATOR_TIME_WINDOW | ||
| max_duplicates = ARGUS_EVENT_AGGREGATOR_MAX_DUPLICATES |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are not needed anymore.
@k0machi can you clean up the code a bit and paste links to runs with this new approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
New approach runs:
https://argus.scylladb.com/tests/scylla-cluster-tests/5eec5069-2332-42ea-9c4b-5d686db45555
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try again, with views-with-tablets experimental feature
b4b73ae to
8ea8455
Compare
This commit adds new sct_events related module, which allows posting events as they come from events device into Argus. It is based on the previously available grafana pipeline. To support earlier (and always-available) posting, the argus utils module was extended to provide a singleton argus instance class upon first time initialization of argus client. Task: scylladb/argus#787
8ea8455 to
f267dff
Compare
feature(sdcm/sct_events): Real-time Argus Events
This commit adds new sct_events related module, which allows posting
events as they come from events device into Argus. It is based on the
previously available grafana pipeline. To support earlier (and
always-available) posting, the argus utils module was extended to
provide a singleton argus instance class upon first time initialization
of argus client.
Task: scylladb/argus#787
Testing
PR pre-checks (self review)
backportlabelsReminders
sdcm/sct_config.py)unit-test/folder)