Skip to content

Conversation

@k0machi
Copy link
Contributor

@k0machi k0machi commented Oct 3, 2025

feature(sdcm/sct_events): Real-time Argus Events

This commit adds new sct_events related module, which allows posting
events as they come from events device into Argus. It is based on the
previously available grafana pipeline. To support earlier (and
always-available) posting, the argus utils module was extended to
provide a singleton argus instance class upon first time initialization
of argus client.

Task: scylladb/argus#787

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add New configuration option and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@k0machi k0machi self-assigned this Oct 3, 2025
@k0machi k0machi requested review from fruch and soyacz October 3, 2025 08:49
@k0machi k0machi force-pushed the sct-realtime-argus-events branch 9 times, most recently from dc8ca62 to d24fbbf Compare October 3, 2025 11:18
Copy link
Contributor

@soyacz soyacz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Run also test with Argus disabled (e.g. locally) so we're sure it does not regress local tries. Add a comment that you tested it (or in description)



ARGUS_EVENT_AGGREGATOR_TIME_WINDOW: float = 90 # seconds
ARGUS_EVENT_AGGREGATOR_MAX_DUPLICATES: int = 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe 1 is enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These values were taken from grafana implementation, we could reduce it, I'll ask people who decided on this number

LOGGER.info("Initializing Argus connection...")
try:
cls._argus_client = get_argus_client(run_id=cls.test_id() if not test_id else test_id)
enable_argus_posting()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should start it along with other devices in sdcm.sct_events.setup.start_events_device
so we're not starting it before main events device is there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We aren't - this is to release the lock on the poster, which would either be started or not already by main events device - the reason this exception handler is there is to guard against situations where events device isn't initialized, for example inside the create_argus_run cli command

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start_events_device also is not run when create_argus_run is fired - is it?
We start other devices there - so this place looks natural.
Also this enable_argus_posting would be redundant then

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are currently 3 steps that need to be taken to enable argus events:

  • Events device, done in start_events_device, which initializes the argus events pipeline
  • enable_posting_posting, which sets the client for the postman
  • start_posting_argus_events, which releases the lock on the postman to start submitting the events

Latter two can be combined into one, but the split here is warranted as test_config logic decides whether or not argus is enabled and does not enable argus events if they aren't, the pipeline in this case just idles - same as grafana.

LOGGER = logging.getLogger(__name__)


class Argus:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need that? can we create argus client when enabling posting in ArgusEventPostman?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need that? can we create argus client when enabling posting in ArgusEventPostman?

This means having to instantiate SCTConfiguration (circular dependency) or reading env vars and then also having to manage logic for when argus isn't available (which again, necessitates importing SCTConfiguration), where this singleton class should help other parts of SCT to get already initialized argus client without either resolving the dependency chain or having to manually initialize it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have TestConfig class which should have been initialized when trying to send event to Argus.
Can we use sdcm.test_config.TestConfig.argus_client?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Circular dependency as well here as importing TestConfig into argus.py means we can't import enablement functions into the test config (NOT the initialization function, those are imported into the events setup.py).

@k0machi k0machi force-pushed the sct-realtime-argus-events branch 2 times, most recently from 28d08b0 to 1ac6219 Compare October 3, 2025 15:08
@k0machi k0machi marked this pull request as ready for review October 3, 2025 15:11
@k0machi k0machi requested a review from soyacz October 3, 2025 15:13
@k0machi k0machi force-pushed the sct-realtime-argus-events branch 2 times, most recently from 3df1eb5 to 6acc335 Compare October 3, 2025 15:35
self.outbound_queue.put(evt)


class ArgusEventAggregator(EventsProcessPipe[SCTArgusEvent, SCTArgusEvent]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, I dislike this algorithm: it delays event by time_window and in case of burst of 1000 events it will send anyway 200 of them (with 5 max duplicates set).

I think it would be better to fire off event straight away and for some time period silence the other events with the same event_key.
Something like this (needs testing with e.g. unit test):

event_key = self.unique_key(event)
if time.perf_counter() - time_window_counters.get(event_key, 0) > TIME_WINDOW  # time window e.g. 30 seconds
    # not seen event from sometime or ever 
    time_window_counters[event_key] = time.perf_counter()
else:
        # recently seen
        continue

This solution will bring some memory burden (proportional to no of distinct event_keys) - but assuming even 1M events, should fit memory anyway.

@k0machi can you try this approach?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, sounds better, I'll try it (probably should also figure out a way to add argus events to unit_tests)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see how other events devices are tested

@k0machi
Copy link
Contributor Author

k0machi commented Oct 17, 2025

Before merge: Needs rebase after #12091 is merged

Comment on lines 70 to 71
time_window = ARGUS_EVENT_AGGREGATOR_TIME_WINDOW
max_duplicates = ARGUS_EVENT_AGGREGATOR_MAX_DUPLICATES
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are not needed anymore.
@k0machi can you clean up the code a bit and paste links to runs with this new approach?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try again, with views-with-tablets experimental feature

@k0machi k0machi force-pushed the sct-realtime-argus-events branch from b4b73ae to 8ea8455 Compare October 17, 2025 12:38
This commit adds new sct_events related module, which allows posting
events as they come from events device into Argus. It is based on the
previously available grafana pipeline. To support earlier (and
always-available) posting, the argus utils module was extended to
provide a singleton argus instance class upon first time initialization
of argus client.

Task: scylladb/argus#787
@k0machi k0machi force-pushed the sct-realtime-argus-events branch from 8ea8455 to f267dff Compare October 17, 2025 12:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants