@cbcoutinho cbcoutinho commented Apr 27, 2025

SUMMARY

Fixes #15179.

The original PR (#14775) that introduced this bug defined a non-empty _METRICSLIST in the Metrics class, which was subsequently subclassed twice, by DispatcherMetrics and CallbackMetrics, causing duplicate metrics. The issue doesn't show up immediately on startup; once a workflow starts, the affected metrics appear twice in the /metrics endpoint.
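To illustrate the mechanism described above, here is a minimal sketch, not the actual AWX code. The class names come from this PR, but the class bodies, the `names()` method, and the helper functions are hypothetical simplifications.

```python
from collections import Counter


class Metrics:
    # Hypothetical stand-in for the base class: because the list is
    # non-empty here, every subclass inherits the same metric names.
    _METRICSLIST = ["subsystem_metrics_pipe_execute_seconds"]

    def names(self):
        return list(self._METRICSLIST)


class DispatcherMetrics(Metrics):
    pass


class CallbackMetrics(Metrics):
    pass


def exposed_names(*collectors):
    """Concatenate the names each collector would report to /metrics."""
    return [name for collector in collectors for name in collector.names()]


def duplicated(names):
    """Return the names that occur more than once, sorted."""
    return sorted(n for n, count in Counter(names).items() if count > 1)


# Both subclasses report the inherited name, so it is exposed twice.
reported = exposed_names(DispatcherMetrics(), CallbackMetrics())
print(duplicated(reported))
```

In this sketch the duplicate only exists once both collectors report, which is consistent with the issue not appearing immediately on startup.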

This PR renames the subsystem_* metrics into separate metrics for both the CallbackReceiver and the Dispatcher.

This is technically a breaking change in the metrics API: three metrics have been removed and replaced by three metrics for the CallbackReceiver and three for the Dispatcher.

ISSUE TYPE
  • Bug fix in results from /metrics endpoint

Prometheus logs:
Without this change, Prometheus drops the duplicate metrics and logs errors:

prometheus-observability-kube-prometh-prometheus-0 prometheus time=2025-04-27T11:25:48.202Z level=WARN source=scrape.go:1884 msg="Error on ingesting samples with different value but same timestamp" component="scrape manager" scrape_pool=serviceMonitor/awx/awx-web-monitor/0 target=http://10.1.76.143:8052/api/v2/metrics/ num_dropped=3
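A quick way to confirm the duplicates before Prometheus drops them is to scan the scraped text for repeated series. The following is a rough sketch under simplifying assumptions: it treats everything before the last space on a line as the series identity, so it does not handle optional timestamps or label values containing spaces. The sample payload, including the `node` label and the `other_metric` name, is made up for illustration.

```python
from collections import Counter


def duplicate_series(exposition_text):
    """Return series (name plus labels) that appear more than once."""
    series = []
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comments.
        if not line or line.startswith("#"):
            continue
        # The sample value is the last space-separated field.
        identity, _, _value = line.rpartition(" ")
        series.append(identity)
    return sorted(s for s, count in Counter(series).items() if count > 1)


# Hypothetical payload resembling what /metrics might return.
sample = """\
# TYPE subsystem_metrics_pipe_execute_seconds gauge
subsystem_metrics_pipe_execute_seconds{node="awx-1"} 1.5
subsystem_metrics_pipe_execute_seconds{node="awx-1"} 0.2
other_metric{node="awx-1"} 0
"""

print(duplicate_series(sample))
```

Two samples with the same identity but different values in a single scrape match the "different value but same timestamp" warning above.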
COMPONENT NAME
  • Other
AWX VERSION
awx: 24.6.2.dev314+g52da7b86df
ADDITIONAL INFORMATION

See #15179 (comment) for context

codecov bot commented May 15, 2025

⚠️ Parser warning

The parser emitted a warning. Please review your JUnit XML file:

Warning while parsing testcase attributes: Limit of string is 1000 chars, for name, we got 1932 at 21405:26 in /home/runner/work/awx/awx/reports/junit.xml

@jessicamack jessicamack force-pushed the fix/duplicate-prometheus-metrics branch from a1de394 to 72f1bb5 Compare May 16, 2025 20:37
@AlanCoding
Member

I checked the logs from one of the awx_collection tests and it gave this:

2025-05-16 20:42:23,602 ERROR    [-] awx.main.commands.run_callback_receiver Exception on worker 3, reconnecting: 
Traceback (most recent call last):
  File "/awx_devel/awx/main/dispatch/worker/base.py", line 315, in work_loop
    body = self.read(queue)
           ^^^^^^^^^^^^^^^^
  File "/awx_devel/awx/main/dispatch/worker/callback.py", line 113, in read
    self.record_read_metrics()
  File "/awx_devel/awx/main/dispatch/worker/callback.py", line 123, in record_read_metrics
    self.subsystem_metrics.pipe_execute()
  File "/awx_devel/awx/main/analytics/subsystem_metrics.py", line 279, in pipe_execute
    self.METRICS['subsystem_metrics_pipe_execute_seconds'].inc(duration_to_save)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'subsystem_metrics_pipe_execute_seconds'

I think those artifacts are publicly available. Click on the check output, then Summary, and the artifacts are listed at the bottom. They contain the logs, which often show what actually went wrong, like the trace above.

@cbcoutinho
Author

cbcoutinho commented May 29, 2025

Hi @AlanCoding thanks - I've updated my approach based on the failing tests. My initial naive solution was to remove the metrics from one of the subsystems to eliminate the duplicates, but that isn't sufficient.

In my recent changes, I renamed the subsystem_* metrics into separate metrics for both the Callback Receiver and the Dispatcher.

This is technically a breaking change in the metrics API because three metrics have been removed, being replaced by three metrics for Callback Receiver and three for the Dispatcher.
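The difference between the two approaches can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `FloatMetric`, the class bodies, and the `dispatcher_` / `callback_receiver_` metric names are hypothetical. Only the `METRICS[...]` lookup inside `pipe_execute` mirrors the traceback above.

```python
class FloatMetric:
    """Minimal in-memory counter standing in for a real metric object."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount):
        self.value += amount


class SharedNameMetrics:
    # The shared method looks up a fixed name, as in the traceback.
    _METRICSLIST = ["subsystem_metrics_pipe_execute_seconds"]

    def __init__(self):
        self.METRICS = {name: FloatMetric() for name in self._METRICSLIST}

    def pipe_execute(self, duration):
        self.METRICS["subsystem_metrics_pipe_execute_seconds"].inc(duration)


class NaiveCallbackMetrics(SharedNameMetrics):
    # Naive fix: drop the metric from one subsystem to avoid the duplicate.
    _METRICSLIST = []


class PerSubsystemMetrics:
    prefix = ""  # hypothetical attribute, set by each subclass

    def __init__(self):
        self.timer_name = f"{self.prefix}_metrics_pipe_execute_seconds"
        self.METRICS = {self.timer_name: FloatMetric()}

    def pipe_execute(self, duration):
        # The lookup uses the subsystem's own name, so it always exists.
        self.METRICS[self.timer_name].inc(duration)


class DispatcherMetrics(PerSubsystemMetrics):
    prefix = "dispatcher"


class CallbackReceiverMetrics(PerSubsystemMetrics):
    prefix = "callback_receiver"


def naive_fix_fails():
    """Return True if the naive variant raises KeyError in pipe_execute."""
    try:
        NaiveCallbackMetrics().pipe_execute(0.25)
    except KeyError:
        return True
    return False
```

With per-subsystem names, each collector owns a distinct series, so nothing collides in the endpoint and the shared code path no longer depends on a name that one subsystem might not define.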

Can you please enable the checks again?

@Sispheor

Sispheor commented Jun 6, 2025

Hello there!
We would love to see this fix merged 😁. What is the status?

@cbcoutinho
Author

Hi @AlanCoding can you please re-run the tests?

@cbcoutinho
Author

Friendly ping @AlanCoding @chrismeyersfsu @kdelee

Successfully merging this pull request may close these issues.

duplicate timeseries coming from the metrics endpoint