Description
Host operating system: output of uname -a
$ uname -a
SunOS lennier 5.11 omnios-r151034-0d278a0cc5 i86pc i386 i86pc
node_exporter version: output of node_exporter --version
$ ./node_exporter --version
node_exporter, version 1.0.1 (branch: master, revision: d8a1585f59ef1169837d08979ecc92dcea8aa58a)
build user: dap@lennier
build date: 20200904-20:16:54
go version: go1.14.7
node_exporter command line flags
No command-line flags passed (node_exporter).
Are you running node_exporter in Docker?
No.
What did you do that produced an error?
Viewed stat node_cpu_seconds_total.
What did you expect to see?
I expected to see the total number of seconds of idle time for this CPU since boot.
What did you see instead?
I saw the total number of idle ticks for this CPU since boot.
It's easier to look at all the data in one place:
# curl -s localhost:9100/metrics | grep cpu.*idle; kstat -p -m cpu -i 0 -n sys | grep cpu.*idle; kstat | grep nsec_per_tick
node_cpu_seconds_total{cpu="0",mode="idle"} 8.238178e+06
node_cpu_seconds_total{cpu="1",mode="idle"} 8.344892e+06
cpu:0:sys:cpu_nsec_idle 8238179276443
cpu:0:sys:cpu_ticks_idle 8238179
cpu:0:sys:idlethread 3961542
nsec_per_tick 1000000
What we see in this snippet is that:
- node_exporter is reporting 8238178 for "node_cpu_seconds_total" for cpu="0", mode="idle". This stat is documented to be measured in seconds.
- According to the underlying kstats, the CPU has been idle for 8238179276443 nanoseconds, or 8238.179276443 seconds. The reported value is therefore too large by a factor of 1,000 (8238178 / 8238.179 ≈ 1,000).
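The comparison above can be checked with a few lines of Go. This is only a sketch of the arithmetic, using the two values from the output above; the helper name nsecToSeconds is made up for illustration and is not part of node_exporter.

```go
package main

import "fmt"

// nsecToSeconds converts a nanosecond counter (such as the
// cpu_nsec_idle kstat) to seconds. Hypothetical helper for illustration.
func nsecToSeconds(nsec uint64) float64 {
	return float64(nsec) / 1e9
}

func main() {
	reported := 8.238178e+06                // node_cpu_seconds_total{cpu="0",mode="idle"}
	actual := nsecToSeconds(8238179276443) // cpu:0:sys:cpu_nsec_idle

	fmt.Printf("actual idle seconds: %.3f\n", actual)
	fmt.Printf("reported / actual:   %.1f\n", reported/actual)
}
```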
Looking at the source, it's pretty clear why:
node_exporter/collector/cpu_solaris.go
Lines 63 to 66 in d8a1585
	"idle": "cpu_ticks_idle",
	"kernel": "cpu_ticks_kernel",
	"user": "cpu_ticks_user",
	"wait": "cpu_ticks_wait",
It's pulling the "cpu_ticks_idle" kstat, which is measured in ticks, not seconds. Ticks are related to seconds by "nsec_per_tick". The above output shows that nsec_per_tick is 1,000,000 on this system, i.e. 1,000 ticks per second, which explains why our output is too large by a factor of 1,000.
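For reference, the relationship between ticks and seconds can be sketched as follows. The function name ticksToSeconds is hypothetical; the inputs are the cpu_ticks_idle and nsec_per_tick values from the kstat output above.

```go
package main

import "fmt"

// ticksToSeconds converts a tick counter to seconds, given the number of
// nanoseconds per tick. Hypothetical helper, not part of node_exporter.
func ticksToSeconds(ticks, nsecPerTick uint64) float64 {
	return float64(ticks) * float64(nsecPerTick) / 1e9
}

func main() {
	// cpu:0:sys:cpu_ticks_idle = 8238179, nsec_per_tick = 1000000
	fmt.Printf("%.3f\n", ticksToSeconds(8238179, 1000000)) // prints 8238.179
}
```

The result agrees with the cpu_nsec_idle kstat to within one tick (one millisecond on this system).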
As far as I can tell, this has always been wrong in this way. My guess is that users don't see this if they're always graphing a ratio of the CPU time metrics (e.g., idle / sum_of_all_of_them). You see this if you're trying to calculate idle percent as 100 * node_cpu_seconds_total{mode="idle"}, which should work.
The straightforward solution would be to use the cpu_nsec_{idle,kernel,user,wait} kstats instead of the cpu_ticks_{idle,kernel,user,wait} kstats. I don't know if we'd be worried about this being a breaking change.
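A rough sketch of what that change might look like, written without the kstat library so it can run anywhere. The names nsecStatForMode and cpuSecondsByMode are hypothetical, the plain map stands in for whatever kstat lookup the collector actually performs, and the kstat names are taken from the proposal above rather than verified against every platform (missing ones are simply skipped).

```go
package main

import "fmt"

// nsecStatForMode maps the exporter's mode label to the nanosecond kstat
// proposed above. Assumed names; not verified on every illumos/Solaris release.
var nsecStatForMode = map[string]string{
	"idle":   "cpu_nsec_idle",
	"kernel": "cpu_nsec_kernel",
	"user":   "cpu_nsec_user",
	"wait":   "cpu_nsec_wait",
}

// cpuSecondsByMode converts the nanosecond counters found in kstats to
// seconds, keyed by mode. kstats is a stand-in for a real kstat lookup.
func cpuSecondsByMode(kstats map[string]uint64) map[string]float64 {
	out := make(map[string]float64)
	for mode, name := range nsecStatForMode {
		if v, ok := kstats[name]; ok {
			out[mode] = float64(v) / 1e9 // nanoseconds to seconds
		}
	}
	return out
}

func main() {
	// Only the idle value is known from the output in this report.
	sample := map[string]uint64{"cpu_nsec_idle": 8238179276443}
	fmt.Printf("idle: %.3f s\n", cpuSecondsByMode(sample)["idle"])
}
```

Reading the nanosecond counters directly avoids depending on nsec_per_tick at all, which is why it seems simpler than rescaling the tick counters.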
CC @dsnt02518 (because you seem to be doing related work in #1803), @jpds (maybe I've misunderstood something here?)