Job scripts: Advanced topics#
This page contains some more advanced commands and additional ways to request resources. Make sure to first read our Running jobs in Slurm page.
Requesting resources#
Requesting nodes#
It is possible to specify the number of nodes for a job, or a range of node counts. However, those options should always be used in combination with other options.
In combination with --ntasks and --cpus-per-task#
In addition to specifying the number of tasks and cores per task for a job, it is also possible to specify the number of nodes that should be used. This is very useful on clusters that may spread your tasks over more nodes than the minimum needed, which can happen on clusters (or partitions) that allow multiple users per node, or even multiple jobs of a single user per node. Parallel jobs that get spread over more nodes than really needed may show decreased performance due to more communication over the cluster interconnect (rather than through shared memory within the node) or interference with other jobs on the node (which, e.g., might cause cache thrashing if they have a bad memory access pattern).
Specifying the number of nodes that will be used can be done with either --nodes=<number of nodes> or -N <number of nodes>. If it is specified, this is the exact number of nodes that will be allocated. It is also possible to specify both a minimum and a maximum number of nodes using either --nodes=<minimum number>-<maximum number> or -N <minimum number>-<maximum number>.
Note that you should be very careful when specifying the number of nodes:
Requesting more nodes than really needed is very asocial behaviour as it will decrease the efficiency of cluster use and lengthen the waiting time for other users.
If you request fewer nodes than are required to accommodate the tasks (when also specifying the number of tasks and of cores per task), your job will be refused by the system.
If you only request nodes and forget to further specify the number of tasks and cores per task, you will get the default of 1 core per node.
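As an illustration, a batch script for this combination could contain lines along the following lines. This is only a sketch: the resource values are arbitrary examples and my_program is a hypothetical executable, not part of an actual cluster setup.
#!/bin/bash
# Request an exact number of nodes (a range such as --nodes=2-4 is also possible),
# the total number of tasks and the number of cores per task.
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=16
#SBATCH --time=1:00:00
# my_program is a hypothetical executable.
srun ./my_program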
In combination with --ntasks-per-node and --cpus-per-task#
In this scenario, a user specifies:
The number of nodes needed for a job, using either --nodes=<number of nodes> or -N <number of nodes>. Just as above, it is possible to specify a minimum and maximum value. The system will allocate the maximum value if possible, but given the high load on a typical cluster, you’re more likely to get the minimum number. Note also that you may then need to detect the number of tasks etc. that you can actually run if your code does not automatically adapt but requires parameters on the command line or in the input file to specify the node configuration.
The number of tasks per node, using --ntasks-per-node=<number of tasks per node>. The default is one task per node.
The number of cores per task, using --cpus-per-task=<cores per task> or -c <cores per task>. Failing to specify this value results in getting one core per task.
In this case too, one should be very careful when specifying the parameters:
Not filling up the nodes optimally is very asocial behaviour as it will decrease the efficiency of the cluster use and lengthen the waiting time for others.
Requesting a number of tasks per node in combination with a number of cores per task that cannot be accommodated on the cluster will lead to the job being refused by Slurm.
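A sketch of a job script for this scenario (again, the values are arbitrary examples and hybrid_program is a hypothetical executable):
#!/bin/bash
# Request nodes, the number of tasks started on each node and the cores per task.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --time=1:00:00
# hybrid_program is a hypothetical executable.
srun ./hybrid_program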
Minimizing distance on the network for faster communications#
The nodes in a cluster are connected with each other through a number of switches that form a hierarchical network tree (also called a topology). On typical VSC clusters, the topology is a variant of a tree: all nodes are connected to an edge switch and hence split up naturally into groups corresponding to the edge switch they are connected to. These edge switches are then connected through a number of top-level switches so that all nodes can communicate with each other.
This means that traffic between two nodes either passes through just a single switch (if the nodes are connected to the same edge switch) or through a series of three switches (edge - top - edge). Obviously, traffic that passes through multiple switches will incur a higher latency. Slurm tries to minimize network contention by identifying the lowest level switch in the hierarchy that can satisfy a job’s request and then allocates the requested resources, based on a best-fit algorithm and the currently available resources. This may result in an allocation with more than the optimum number of switches when the cluster is under a heavy load.
A user can request a maximum number of switches for a job using the --switches=<count> option. Usually, one would specify --switches=1 here to make sure the job runs on nodes that are all connected to the same edge switch. Note that this may increase the wait time of the job (because the scheduler has fewer nodes to choose from).
Optionally, a maximum time delay can be specified as well, using --switches=<count>@<max-time>, to tell the scheduler that it can delay the job until it either finds an allocation with the desired switch count or the time limit expires, at which point the switch request is ignored.
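For example, the following lines would ask for all nodes on a single edge switch, but allow Slurm to fall back to more switches if that cannot be realized within 12 hours (the resource values and the 12-hour delay are arbitrary choices in this sketch):
# Prefer an allocation on a single edge switch, but give up on that after 12 hours.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --switches=1@12:00:00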
Advanced Slurm commands#
The scontrol command#
Slurm scontrol manual page on the web
The scontrol command is mostly meant for system administrators. In fact, most of the actions that scontrol can perform cannot be performed by regular users.
One good use for regular users, though, is to get more information about one of your jobs: scontrol -d show job <jobid> will show extensive information for the job with job ID <jobid>.
Some software might require a list of hostnames: scontrol show hostnames, in the context of a job, returns a list of (unqualified) hostnames of the allocated nodes. This can be used, e.g., with bash scripting to generate a nodelist file in the format required by the software.
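A minimal sketch of how this could be used inside a job script (the file name nodefile.txt and the one-hostname-per-line format are just examples of what a particular application might expect):
# Write one allocated hostname per line to a file for the application to read.
scontrol show hostnames "$SLURM_JOB_NODELIST" > nodefile.txt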
The sinfo command#
Slurm sinfo manual page on the web
The sinfo command can return a lot of information about a Slurm cluster. Understanding its output does require a good understanding of the Slurm concepts (see our basic Slurm use page).
The command sinfo will print a list of partitions with their availability, the nodes that will be used to run jobs within that partition, the number of nodes in that partition, and the state. The default partition will be marked with an asterisk behind the partition name.
The command sinfo -N -l will return a more node-oriented output. You’ll see node groups, the partition they belong to, and the number of cores, the memory (in MB), and the temporary disk space available on that node group.
By specifying additional command line arguments it is possible to further customize the output format. See the sinfo manual page.
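As a sketch, a custom format can be requested with the -o (or --format) option. The specifiers used below, %P (partition), %D (number of nodes), %c (cores per node), %m (memory per node in MB) and %t (node state), are standard sinfo format specifiers, but check the manual page for the full list:
login$ sinfo -o "%P %D %c %m %t"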
salloc#
You can use salloc to create a resource allocation. salloc will wait until the resources are available and then return a shell prompt. Note however that that shell is running on the node where you ran salloc (likely a login node). Contrary to srun, the shell is not running on the allocated resources. You can however run commands on the allocated resources via srun.
Note
Note that, in particular on clusters with multiple CPU architectures, you need to understand Linux environments and the way they interact with Slurm very well, as you are now executing commands in two potentially incompatible sections of the cluster that require different settings in the environment. If you execute a command in the wrong environment, it may run inefficiently or it may simply fail.
Interactive jobs with salloc#
Non-X11 programs#
The following command requests the resources to run a shared memory program on 16 cores:
login$ salloc --ntasks=1 --cpus-per-task=16 --time=1:00:00 --mem-per-cpu=3g
salloc executed on a login node will open a shell that runs on the login node but that can be used to execute tasks on the allocated resources. Therefore, it is recommended to re-create the environment, as some modules will use different settings when they load in a Slurm environment.
Example: Let’s use the previous salloc shell to start the demo program omp_hello on the allocated resources:
login$ # Build the environment (UAntwerp example)
login$ module --force purge
login$ module load calcua/2020a vsc-tutorial
login$ # Launch the program
login$ srun omp_hello
login$ exit
It is essential in this example that omp_hello is started through srun, as otherwise it would be running on the login node rather than on the allocated resources. Also, do not forget to leave the shell when you have finished your interactive work!
For an MPI or hybrid MPI/OpenMP program you would proceed in exactly the same way, except that the resource request is different so that all MPI tasks are allocated. E.g., to run the demo program mpi_omp_hello in an interactive shell using 16 MPI processes and 8 threads per MPI rank, you’d allocate the resources through
login$ salloc --ntasks=16 --cpus-per-task=8 --time=1:00:00 --mem-per-cpu=3g
and then run the program with
login$ # Build the environment (UAntwerp example)
login$ module --force purge
login$ module load calcua/2020a vsc-tutorial
login$ # Launch the program
login$ srun mpi_omp_hello
login$ exit
Note that since we are using all allocated resources, we don’t need to specify the number of tasks or cores to srun. It will take care of properly distributing the job according to the options specified when calling salloc.
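If you only want to use part of the allocation for a particular program run, you can pass explicit counts to srun, e.g. (a sketch using the allocation from above):
login$ srun --ntasks=4 --cpus-per-task=8 mpi_omp_hello
srun then launches only that many tasks within the existing allocation.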