
Prometheus: exclude rabbitmq_disk_space_available_bytes if it cannot be computed (retrieved) #8740

@michaelklishin


See #8737 for some background.

When free disk space monitoring fails for any reason, e.g. the command it runs is blocked by a security mechanism, the computed metric value is 'NaN' (an atom). This value is then rendered as is in the Prometheus-format output, causing some scrapers to fail since, well, the value is (as it says) not a number.

Why exactly rabbit_disk_monitor runs into an exception does not really matter. On different
OSes the conditions are different, and usually fairly environment-specific.

To make this worthy of an issue, let's describe several other solutions considered:

  • Returning null or undefined would not help in any way: it would require CLI tools (namely rabbitmq-diagnostics status) and the management UI (the table of nodes) to filter out the metric or special-case value formatting
  • 0 is not a value we can return as it would immediately trigger a disk alarm on the node, blocking publishers across the entire cluster
  • Any arbitrary positive value would not make much sense
  • We obviously cannot expect a contribution that makes the Prometheus scraper handle NaNs for numerical data types (gauges, counters) to be considered or to get wide adoption in the foreseeable future
  • A workaround like rabbitmqctl eval 'rabbit_disk_monitor:set_enabled(false).' does not help since disabling disk monitoring does not prevent a NaN from being returned by rabbit_disk_monitor:get_disk_free/0

So the only solution left is to make rabbitmq_prometheus leave the metric out entirely when the value is NaN.
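
The filtering idea can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the rabbitmq_prometheus implementation (which is Erlang). The function names `is_renderable` and `render_gauges`, the metric name `example_other_gauge`, and the use of the string "NaN" as a stand-in for the Erlang atom are all assumptions made for this example.

```python
import math


def is_renderable(value):
    """Return True only for numeric values that are not NaN.

    The string "NaN" stands in for the Erlang atom 'NaN' (an assumption
    of this sketch); any non-numeric value is treated as not renderable.
    """
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return not math.isnan(value)


def render_gauges(metrics):
    """Render a dict of gauge values as Prometheus-style text lines.

    Metrics whose value cannot be rendered as a number are left out
    entirely rather than emitted with a placeholder value.
    """
    lines = []
    for name, value in metrics.items():
        if not is_renderable(value):
            continue  # omit the metric instead of printing NaN, 0, or null
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + ("\n" if lines else "")


if __name__ == "__main__":
    sample = {
        "rabbitmq_disk_space_available_bytes": "NaN",  # monitoring failed
        "example_other_gauge": 42,  # hypothetical metric for illustration
    }
    print(render_gauges(sample), end="")
```

With this approach the scraper never sees a non-numeric sample, and nothing else that reads the disk free value (CLI tools, the management UI, the disk alarm logic) has to change.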
