# Jobstats setup

Below is an outline of the steps needed to set up the Jobstats platform for a Slurm cluster:

- Switch to cgroup-based job accounting from Linux process accounting
- Set up the exporters: cgroup, node and GPU (on the compute nodes) and, optionally, GPFS (centrally)
- Set up the prolog.d and epilog.d scripts on the GPU nodes
- Set up the Prometheus server and configure it to scrape data from the compute nodes and all configured exporters
- Set up the slurmctldepilog.sh script for long-term job summary retention
- Lastly, configure Grafana and Open OnDemand
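The first step, moving from process accounting to cgroup accounting, comes down to a few slurm.conf and cgroup.conf settings. A minimal sketch (these are standard Slurm parameters, but the exact values depend on your site and Slurm version):

```
# slurm.conf
ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/cgroup
TaskPlugin=task/affinity,task/cgroup

# cgroup.conf
ConstrainCores=yes
ConstrainRAMSpace=yes
```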
## Exporters

The following exporters are used:

- node exporter: https://github.com/prometheus/node_exporter
- cgroup exporter: https://github.com/plazonic/cgroup_exporter
- NVIDIA GPU exporter: https://github.com/plazonic/nvidia_gpu_prometheus_exporter
- GPFS exporter: https://github.com/plazonic/gpfs-exporter
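All of these exporters serve metrics in the Prometheus text exposition format, so their output can be inspected with curl or parsed directly. A minimal sketch of parsing a single sample line (the metric names in the comments are illustrative, not taken from a specific exporter):

```python
# Parse one sample line of the Prometheus text exposition format, e.g.
#   node_memory_MemTotal_bytes 3.3675993088e+10
#   some_metric{jobid="123"} 1024
# Returns (metric name, raw label string, value).
def parse_sample(line):
    name_part, value = line.rsplit(" ", 1)
    if "{" in name_part:
        name, labels = name_part.split("{", 1)
        labels = labels.rstrip("}")
    else:
        name, labels = name_part, ""
    return name, labels, float(value)

print(parse_sample('some_metric{jobid="123"} 1024'))
```

Real exposition output also contains `# HELP` and `# TYPE` comment lines, which a full parser would skip.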
## Basic Prometheus Configuration

What follows is an example of the production configuration used for the Tiger cluster, which has both regular and GPU nodes:

```
---
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: master
scrape_configs:
  - job_name: Tiger Nodes
    scrape_interval: 30s
    scrape_timeout: 30s
    file_sd_configs:
      - files:
          - "/etc/prometheus/local_files_sd_config.d/tigernodes.json"
    metric_relabel_configs:
      - source_labels:
          - __name__
        regex: "^go_.*"
        action: drop
  - job_name: TigerGPU Nodes
    scrape_interval: 30s
    scrape_timeout: 30s
    file_sd_configs:
      - files:
          - "/etc/prometheus/local_files_sd_config.d/tigergpus.json"
    metric_relabel_configs:
      - source_labels:
          - __name__
        regex: "^go_.*"
        action: drop
```
tigernodes.json looks like:

```
[
  {
    "labels": {
      "cluster": "tiger"
    },
    "targets": [
      "tiger-h19c1n10:9100",
      "tiger-h19c1n10:9306",
      ...
    ]
  }
]
```

where both node_exporter (port 9100) and cgroup_exporter (port 9306) are listed for all of tiger's nodes. tigergpus.json looks very similar, except that it collects data from nvidia_gpu_prometheus_exporter on port 9445.

Note the additional `cluster` label.
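Since these target files are plain JSON, they are easy to generate from a node list. A sketch (the function name is illustrative; the ports follow this document):

```python
import json

# Build a file_sd_configs target file in the same shape as tigernodes.json,
# pairing every node with the node_exporter (9100) and cgroup_exporter (9306) ports.
def make_sd_file(cluster, nodes, ports=(9100, 9306)):
    targets = [f"{node}:{port}" for node in nodes for port in ports]
    return [{"labels": {"cluster": cluster}, "targets": targets}]

print(json.dumps(make_sd_file("tiger", ["tiger-h19c1n10"]), indent=2))
```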
## GPU Job Ownership Helper

To correctly track which GPU is assigned to which jobid, we use Slurm prolog and epilog scripts to create files in the `/run/gpustat` directory, named either after the GPU ordinal number (0, 1, ...) or, in the case of MIG cards, after the MIG UUID. Each file contains the space-separated jobid and uid of the user, e.g.:

```
# cat /run/gpustat/MIG-265a219d-a49f-578a-825d-222c72699c16
45916256 262563
```

These two scripts can be found in the slurm directory. For example, slurm/epilog.d/gpustats_helper.sh could be installed as /etc/slurm/epilog.d/gpustats_helper.sh and slurm/prolog.d/gpustats_helper.sh as /etc/slurm/prolog.d/gpustats_helper.sh, with these slurm.conf configuration statements:

```
Prolog=/etc/slurm/prolog.d/*.sh
Epilog=/etc/slurm/epilog.d/*.sh
```
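The shipped helpers are shell scripts, but the file layout they produce is simple enough to sketch. This illustrative Python equivalent writes one file per GPU in the format shown above (the directory and parameter names are assumptions for the sketch, not taken from the actual scripts):

```python
import os

# Write one /run/gpustat-style file per GPU: the file name is the GPU
# ordinal (or MIG UUID) and the content is "<jobid> <uid>".
def write_gpustat(rundir, jobid, uid, gpus):
    os.makedirs(rundir, exist_ok=True)
    for gpu in gpus:
        with open(os.path.join(rundir, str(gpu)), "w") as f:
            f.write(f"{jobid} {uid}")

write_gpustat("/tmp/gpustat-demo", 45916256, 262563, [0, 1])
```

The matching epilog step would simply remove the files for the job's GPUs.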
## Grafana

A Grafana dashboard JSON that uses all of the exporters is included in the grafana subdirectory. It expects one parameter, `jobid`. As it may not be easy to find the correct time range, we also use an Open OnDemand job stats helper that generates it for a given jobid; this helper is documented in the next section.

The following image illustrates what the dashboard looks like in use:

<center><img src="https://tigress-web.princeton.edu/~jdh4/jobstats_grafana.png"></center>
## Open OnDemand JobStats Helper

The ood-jobstats-helper subdirectory contains an Open OnDemand app that, given a jobid, uses sacct to generate a full Grafana URL with the job's jobid, start time and end time.
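The core of such a helper is turning the job's start and end timestamps into the epoch-millisecond from/to range that Grafana URLs use. A sketch (the base URL and `var-jobid` parameter name are assumptions about the dashboard, and timestamps are treated as UTC for simplicity):

```python
from datetime import datetime, timezone

# Build a Grafana dashboard URL for a job from sacct-style
# "YYYY-MM-DDTHH:MM:SS" timestamps, converted to epoch milliseconds.
def grafana_url(jobid, start, end, base="https://grafana.example.edu/d/jobstats"):
    def ms(ts):
        dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
        return int(dt.timestamp() * 1000)
    return f"{base}?var-jobid={jobid}&from={ms(start)}&to={ms(end)}"

print(grafana_url(45916256, "2024-01-01T00:00:00", "2024-01-01T01:00:00"))
```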
## Generating Job Summaries

Job summaries, as described above, are generated and stored in the Slurm database at the end of each job by the slurmctld epilog script, e.g.:

```
EpilogSlurmctld=/usr/local/sbin/slurmctldepilog.sh
```

The script can be found in the slurm subdirectory, named "slurmctldepilog.sh".

For processing old jobs, where the slurmctld epilog script did not run or failed, there is a per-cluster ingest jobstats service. This is a Python-based script that runs on the slurmdbd host as a systemd timer and service, querying and modifying the Slurm database directly. The script (ingest_jobstats.py) and the systemd timer and service files are in the slurm directory.

We made heavy use of this script to generate job summaries for older jobs, but with the current version of the epilog script it should no longer be needed.
## Job email script

For completed jobs, the data is taken from a call to sacct with several fields, including AdminComment. For running jobs, the Prometheus database must be queried.

Importantly, the `jobstats` command is also used to replace `smail`, the Slurm executable used for sending email reports based on `seff`. This means that users receive emails that are the exact output of `jobstats`, including the notes.

We use slurm/jobstats_mail.sh as Slurm's mail program, e.g. in slurm.conf:

```
MailProg=/usr/local/bin/jobstats_mail.sh
```

This will include jobstats information for jobs that have requested email notifications on completion.