See #8737 for some background.
When free disk space monitoring fails for any reason, e.g. the command it runs is blocked by a security mechanism, the computed metric value is `'NaN'` (an atom). This value is then rendered as is to Prometheus format scrapers, causing some to fail since, well, the value is (as it says) not a number.
Why exactly `rabbit_disk_monitor` runs into an exception does not really matter. On different OSes the conditions are different, and usually fairly environment-specific.
To make this worthy of an issue, let's describe several other solutions considered:
- We cannot return a `null` or `undefined`, or rather, that would not help in any way, and would require CLI tools (namely `rabbitmq-diagnostics status`) and management UI (the table of nodes) to filter out the metric or special case value formatting
- `0` is not a value we can return as it would immediately trigger a disk alarm on the node, blocking publishers across the entire cluster
- Any arbitrary positive value would not make much sense
- We obviously cannot expect a contribution to the Prometheus scraper to handle `NaN`s for numerical data types (gauges, counters) to be considered or get wide adoption in the foreseeable future
- A workaround like `rabbitmqctl eval 'rabbit_disk_monitor:set_enabled(false).'` does not help since disabled disk monitoring would not prevent a `NaN` from being returned by `rabbit_disk_monitor:get_disk_free/0`
So the only solution left is to make `rabbitmq_prometheus` leave the metric out entirely when the value is `NaN`.
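
To illustrate the idea of leaving a metric out rather than rendering a non-numeric value, here is a minimal, language-neutral sketch in Python. It is not the actual `rabbitmq_prometheus` code (which is Erlang); the function names `is_renderable` and `render_gauges` and the `example_*` metric names are hypothetical and exist only for this example.

```python
import math


def is_renderable(value):
    """Return True only for real numbers that are not NaN.

    Anything else (the string "NaN" standing in for the 'NaN' atom,
    None, a float NaN) is treated as "no value available".
    """
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return True
    if isinstance(value, float):
        return not math.isnan(value)
    return False


def render_gauges(samples):
    """Render (name, value) pairs as Prometheus text format gauges.

    Samples without a usable numeric value are skipped entirely,
    so a scraper never sees a line it cannot parse as a number.
    """
    lines = []
    for name, value in samples:
        if not is_renderable(value):
            continue  # leave the metric out instead of emitting "NaN"
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + ("\n" if lines else "")


if __name__ == "__main__":
    # Hypothetical metric names; "NaN" stands in for a failed disk check.
    print(render_gauges([
        ("example_disk_free_bytes", "NaN"),
        ("example_uptime_seconds", 42),
    ]), end="")
```

With this approach the failed disk space sample simply does not appear in the scrape output, while the other metrics are rendered as usual, so no consumer has to special-case a placeholder value.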