2.4.7. Site-specific Slurm info#

While the Running jobs in Slurm pages provide generic information regarding Slurm, there are additional points to consider when using Slurm on Tier-2 clusters hosted at KU Leuven.

Compute credits#

When submitting a job, you need to specify, via the -A/--account option, a valid Slurm credit account that holds enough compute credits for the job. For more information, please consult the documentation pages on compute credits.
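
For example, a batch job can be charged to a particular credit account as follows (the account name lp_myproject is only a placeholder for one of your own credit accounts):

sbatch -A lp_myproject -M wice jobscript.slurm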

Job shell#

For batch jobs we strongly recommend using #!/bin/bash -l as the shebang at the top of your jobscript. The -l option (hyphen lowercase L) is needed to ensure that your ~/.bashrc settings are applied and the appropriate cluster module is loaded at the start of the job. This is not strictly needed for interactive jobs: srun ... --pty bash and srun ... --pty bash -l will give essentially identical environments.

Note that we still discourage loading modules in your ~/.bashrc file and recommend doing that in your jobscripts instead (see also the Compiling software for wICE paragraph for an example).
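
A minimal jobscript following these recommendations could look as follows (the module name, resource options and executable are only placeholders):

#!/bin/bash -l
#SBATCH --account=<YOUR_ACCOUNT>
#SBATCH --clusters=wice
#SBATCH --time=01:00:00

# load modules in the jobscript rather than in ~/.bashrc
module load foss/2023a

./my_program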

Cluster choice#

Many Slurm commands (like sbatch, srun, scontrol, squeue, scancel, …) accept a -M/--clusters option which selects one or more clusters. The default value depends on where the command is executed (Genius for the Genius compute nodes and login nodes; wICE for the wICE compute nodes). This means that if you are connected to a (Genius) login node, you will need to add -M wice in order to select wICE instead of Genius. Also note that some of these commands (such as squeue and sacct) accept -M all which selects all available clusters.

To avoid potential mistakes, we have made the -M/--clusters option mandatory when submitting jobs.
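
For example, when connected to a Genius login node, the following commands submit a job to wICE and inspect your jobs (the jobscript name is a placeholder):

sbatch -M wice jobscript.slurm
squeue -M wice -u $USER
squeue -M all -u $USER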

Monitoring jobs#

For monitoring or debugging jobs, you can look into the following Slurm tools:

  • scontrol to view Slurm configuration and state

  • squeue to get information about jobs in the scheduling queue

  • sacct to display information about finished jobs

Note

Don’t forget the -M/--clusters option for these commands, as mentioned in the Cluster choice paragraph.
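
For example, to inspect a specific job on wICE (the job ID below is a placeholder):

scontrol -M wice show job 12345678
squeue -M wice -j 12345678
sacct -M wice -j 12345678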

For convenience, we provide the slurm_jobinfo tool, which runs the Slurm tools mentioned above and parses their output into a format that is easier to read. Simply use slurm_jobinfo <jobid>, where <jobid> has to be replaced by the 8-digit number that identifies your job.

To get a compact overview of the current state of the cluster, execute slurmtop on any KU Leuven Tier-2 node. Use slurmtop --help to explore its functionality.

Environment propagation#

Slurm jobs start in a clean environment which corresponds to your login environment, i.e. with only those additional variables that you defined in your ~/.bashrc file (see also the Job shell paragraph above). Environment variables that happen to be set in the session from which you submit the job are not propagated to the job.

If needed, you can modify this default behaviour with the --export option. When doing so, keep in mind that you will need to include the default minimal environment as well. For example, to pass an additional environment variable FOO with value bar, use --export=HOME,USER,TERM,PATH=/bin:/sbin,FOO=bar.
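
As a full command, such a submission could look as follows (FOO=bar and the jobscript name are only illustrations):

sbatch --export=HOME,USER,TERM,PATH=/bin:/sbin,FOO=bar -M wice jobscript.slurm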

MPI applications#

MPI launchers#

We recommend starting MPI applications with the launcher that comes with the MPI implementation (typically called mpirun). The present Slurm installation has not been configured with PMI support, which may cause applications to hang when launched via srun. The main use for srun on our clusters is to request an interactive job.
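
In a jobscript this typically boils down to something like the following (the module and executable names are only placeholders):

module load intel/2023a
mpirun ./my_mpi_program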

Intel MPI pinning#

The Intel MPI library does not always play well with the Slurm scheduler. Specifically, when launching a job from a compute node (for instance from inside an interactive job), processes are not pinned correctly. This issue can be overcome by setting the environment variable I_MPI_PIN_RESPECT_CPUSET=off or equivalently adding the option -env I_MPI_PIN_RESPECT_CPUSET=off to your mpirun command. To check that processes are pinned correctly to physical cores, set the environment variable I_MPI_DEBUG=5 to get more verbose output. Note that this issue does not occur with the Open MPI library.
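
For example, within an interactive job, an Intel MPI application could be launched as follows (the executable name is a placeholder):

export I_MPI_DEBUG=5
mpirun -env I_MPI_PIN_RESPECT_CPUSET=off ./my_mpi_program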

Setting the GPU compute mode#

NVIDIA GPUs support multiple compute modes. By default, the compute mode is set to Exclusive-process on our clusters (which is the best setting in the majority of cases), but you can choose another compute mode at job submission time. This is done by making use of a plugin for our Slurm job scheduler:

$ sbatch --help
...
Options provided by plugins:

      --gpu_cmode=<shared|exclusive|prohibited>
                              Set the GPU compute mode on the allocated GPUs to
                              shared, exclusive or prohibited. Default is
                              exclusive

To submit a batch job with the compute mode of your NVIDIA GPU(s) set to shared, use:

sbatch --export=ALL --gpu_cmode=shared jobscript.slurm

An interactive job can be launched as follows:

srun --ntasks-per-node=9 --nodes=1 --gpus-per-node=1 --account=<YOUR_ACCOUNT> \
     --clusters=wice --time=01:00:00 --partition=gpu_a100 --gpu_cmode=shared \
     --pty /bin/bash -l

A few notes on this feature:

  • To check that the behaviour is as expected, execute nvidia-smi in your job (see also the example after this list).

  • Runs with GPUs on multiple nodes are not supported. Contact the helpdesk if you think you have a use case where this would be necessary.

  • The GPU compute mode does not apply when multi-instance GPU partitioning (MIG) is used. This is for instance the case on the wICE Slurm partition called interactive. For jobs on that partition this feature is irrelevant.
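
As an illustration, the compute mode of the allocated GPU(s) can be verified inside the job with a command like the following:

nvidia-smi -q | grep "Compute Mode"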