illumos/solaris CPU usage is reported in ticks, not seconds #1837

@davepacheco

Host operating system: output of uname -a

$ uname -a
SunOS lennier 5.11 omnios-r151034-0d278a0cc5 i86pc i386 i86pc

node_exporter version: output of node_exporter --version

$ ./node_exporter --version
node_exporter, version 1.0.1 (branch: master, revision: d8a1585f59ef1169837d08979ecc92dcea8aa58a)
  build user:       dap@lennier
  build date:       20200904-20:16:54
  go version:       go1.14.7

node_exporter command line flags

No command-line flags passed (node_exporter)

Are you running node_exporter in Docker?

No.

What did you do that produced an error?

Viewed stat node_cpu_seconds_total.

What did you expect to see?

I expected to see the total number of seconds of idle time for this CPU since boot.

What did you see instead?

I saw the total number of idle ticks for this CPU since boot.


It's easier to look at all the data in one place:

# curl -s localhost:9100/metrics | grep cpu.*idle; kstat -p -m cpu -i 0 -n sys | grep cpu.*idle; kstat | grep nsec_per_tick
node_cpu_seconds_total{cpu="0",mode="idle"} 8.238178e+06
node_cpu_seconds_total{cpu="1",mode="idle"} 8.344892e+06
cpu:0:sys:cpu_nsec_idle 8238179276443
cpu:0:sys:cpu_ticks_idle        8238179
cpu:0:sys:idlethread    3961542
        nsec_per_tick                   1000000

What we see in this snippet is that:

  • node_exporter is reporting 8238178 for "node_cpu_seconds_total" for cpu=0 mode="idle". This stat is documented to be measured in seconds.
  • According to the underlying kstats, the CPU has been idle for 8238179276443 nanoseconds, or 8238.179276443 seconds. The stat is off by a factor of 1,000.
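
For anyone who wants to check the size of the discrepancy, here is a minimal Go sketch that uses only the numbers from the output above. The helper name nsecToSeconds is made up for this example and is not part of node_exporter.

```go
package main

import "fmt"

// nsecToSeconds converts a nanosecond counter, such as the cpu_nsec_idle
// kstat, to seconds. Hypothetical helper, for illustration only.
func nsecToSeconds(nsec uint64) float64 {
	return float64(nsec) / 1e9
}

func main() {
	// Value exported as node_cpu_seconds_total{cpu="0",mode="idle"}.
	reported := 8.238178e+06
	// Value of cpu:0:sys:cpu_nsec_idle, converted to seconds.
	actual := nsecToSeconds(8238179276443)

	fmt.Printf("actual idle seconds: %.3f\n", actual)
	fmt.Printf("reported / actual:   %.1f\n", reported/actual)
}
```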

Looking at the source, it's pretty clear why:

"idle": "cpu_ticks_idle",
"kernel": "cpu_ticks_kernel",
"user": "cpu_ticks_user",
"wait": "cpu_ticks_wait",

It's pulling the "cpu_ticks_idle" kstat, which is measured in ticks. Ticks are related to seconds by "nsec_per_tick". The above output shows that nsec_per_tick is 1,000,000 on this system, so each tick is one millisecond and there are 1,000 ticks per second. That explains why our output is off by a factor of 1,000.
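
The relationship between ticks and seconds can be sketched in Go as follows. This is an illustration under the assumption that a tick count times nsec_per_tick gives nanoseconds; ticksToSeconds is a hypothetical name, not something node_exporter defines.

```go
package main

import "fmt"

// ticksToSeconds converts a tick counter (such as cpu_ticks_idle) to seconds,
// given the tick length in nanoseconds (the nsec_per_tick kstat).
// Hypothetical helper, for illustration only.
func ticksToSeconds(ticks, nsecPerTick uint64) float64 {
	return float64(ticks) * float64(nsecPerTick) / 1e9
}

func main() {
	// cpu:0:sys:cpu_ticks_idle and nsec_per_tick from the output above.
	fmt.Printf("%.3f\n", ticksToSeconds(8238179, 1000000))
}
```

The result should agree with cpu_nsec_idle to within the resolution of one tick.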

As far as I can tell, this has always been wrong in this way. My guess is that users don't see this if they're always graphing a ratio of the CPU time metrics (e.g., idle / sum_of_all_of_them), because the constant factor cancels out. You see this if you're trying to calculate idle percent as 100 * node_cpu_seconds_total{mode="idle"}, which should work.

The straightforward solution would be to use the cpu_nsec_{idle,kernel,user,wait} kstats instead of the cpu_ticks_{idle,kernel,user,wait} kstats. I don't know if we'd be worried about this being a breaking change.
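
A rough sketch of what that change might look like, written as standalone Go rather than as a patch against the collector. The kstat names follow the cpu_nsec_* pattern above and I have not verified that each one exists on every system. The raw map stands in for values read from kstat, and secondsByMode is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// nsecStats maps each mode label to the nanosecond-based kstat proposed above.
// Assumption: names follow the cpu_nsec_* pattern; not verified per system.
var nsecStats = map[string]string{
	"idle":   "cpu_nsec_idle",
	"kernel": "cpu_nsec_kernel",
	"user":   "cpu_nsec_user",
	"wait":   "cpu_nsec_wait",
}

// secondsByMode converts raw nanosecond kstat values (keyed by kstat name)
// to seconds keyed by mode. Kstats missing from raw are skipped.
func secondsByMode(raw map[string]uint64) map[string]float64 {
	out := make(map[string]float64, len(nsecStats))
	for mode, stat := range nsecStats {
		if v, ok := raw[stat]; ok {
			out[mode] = float64(v) / 1e9
		}
	}
	return out
}

func main() {
	// Stand-in for a kstat lookup; the value is taken from the output above.
	raw := map[string]uint64{"cpu_nsec_idle": 8238179276443}
	secs := secondsByMode(raw)

	// Sort the modes so the output order is stable.
	modes := make([]string, 0, len(secs))
	for m := range secs {
		modes = append(modes, m)
	}
	sort.Strings(modes)
	for _, m := range modes {
		fmt.Printf("%s %.3f\n", m, secs[m])
	}
}
```

Dividing by 1e9 in one place means the result no longer depends on nsec_per_tick.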

CC @dsnt02518 (because you seem to be doing related work in #1803), @jpds (maybe I've misunderstood something here?)
