Submission of job scripts: sbatch#
Slurm sbatch manual page
sbatch is the command in Slurm to submit a job script.
A job script first contains a list of resources and other instructions to
Slurm, and this is followed by a set of commands that will be executed on the
first node of the job. When the submission succeeds,
sbatch will print a
message containing the unique job ID for the job.
Resource specifications and other instructions can be specified in three
different ways: command line options, environment variables, and
lines in the job script.
sbatch has a lot of command line options. We will only list the most important command line options below. Command line options of Slurm take precedence over environment variables and
#SBATCH lines in the job script.
Some command line options can also be passed to
sbatch through environment variables instead. A list of those can be found in the sbatch manual page. The name of those variables starts with
SBATCH_ and the remaining part is derived from the matching command line option. However, be careful when using these variables or hiding them in
.bash_profile: they are easily forgotten, yet they take precedence over
#SBATCH lines in the job script, which is the most commonly used mechanism to specify resources.
All command line options can also be passed in
#SBATCH lines in the job script. These lines should follow immediately below the shebang in the first block of comment lines (lines that start with
#), as otherwise they will be ignored by Slurm.
Note that all
sbatch command line options should be specified
before the name of the job script. All command line parameters specified
after the name of the job script will be passed as command line arguments
to the job script when it executes.
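To illustrate, a minimal job script could look as follows. The resource values, module name, program and file names are placeholders for this sketch, not a recommendation for any particular cluster:
    #!/bin/bash
    #SBATCH --ntasks=1           # 1 task
    #SBATCH --cpus-per-task=1    # 1 core for that task
    #SBATCH --mem-per-cpu=1g     # 1 GB of RAM per core
    #SBATCH --time=10:00         # 10 minutes of wall time

    # Commands executed on the first node of the job:
    module --force purge
    module load example_toolchain/2023a   # hypothetical module name
    ./my_program                          # placeholder for the actual application
Submitting the script, with sbatch options before the script name and the script's own arguments after it:
    $ sbatch --time=30:00 jobscript.slurm input.dat
    Submitted batch job 123456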
Requesting compute resources#
Slurm supports several ways to request cores and/or GPUs for a job. The easiest way to request cores is by following the “task”-idea of Slurm and specifying the number of parallel tasks and cores per task that you need. By specifying resources this way, it is very easy afterwards to start OpenMP, MPI and hybrid MPI/OpenMP programs in the right configuration.
The number of tasks is specified by
--ntasks=<number of tasks> or
-n <number of tasks>. The
=-sign in the long option format can be replaced by a space, and in the short form (
-n) the space between the flag and the value can also be omitted (in effect, this holds for all options).
The number of cores per task is specified by
--cpus-per-task=<value> or -c <value>. The cores assigned to a task will all reside on the same node.
Slurm uses the term 'CPUs' to refer to processing units which correspond to what we usually refer to as 'cores'. If hyperthreading were enabled (which is usually not the case on VSC clusters), these units would correspond to hardware threads.
Make sure to request a valid combination of tasks and/or cores per task. Otherwise, your job may be rejected, or it could end up in the partition queue but never start (in that case, check the reason code, as explained later in this document in the section on checking the queue).
If these options are set, the Slurm controller will set the corresponding variables
SLURM_NTASKS and SLURM_CPUS_PER_TASK in the
environment of the running job.
If they are not set, the default values of 1 task and 1 core per task are applied.
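As an illustration, a hybrid MPI/OpenMP job that needs 8 tasks with 4 cores each could request its resources as follows (the numbers are purely illustrative):
    #SBATCH --ntasks=8          # 8 parallel tasks, e.g., MPI ranks
    #SBATCH --cpus-per-task=4   # 4 cores per task, e.g., OpenMP threads per rank
The same request can be made on the command line with sbatch -n 8 -c 4 jobscript.slurm.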
It is possible to request one or more GPUs for your job on the VSC clusters that provide them. For cluster-specific usage instructions, please consult the documentation of the cluster in question.
Jobs can also request an amount of RAM space (resident memory). This is
commonly done by specifying an amount of memory per core with
--mem-per-cpu=<amount><unit> (e.g., --mem-per-cpu=1g). The amount is an integer, optionally followed by a unit:
<unit> can be either
k for kilobytes,
m for megabytes or
g for gigabytes (megabytes are assumed if no unit is given). Note that it is not possible in Slurm to specify
an amount of memory per task (only per core, per node or per GPU).
A job will be rejected if the final amount of memory requested cannot be
satisfied. This can happen if
--mem-per-cpu times the number of cores per
node is greater than the total memory on that node that is available for job
allocations. Note that this total amount of allocatable memory is usually
slightly less than the node’s total physical memory to leave sufficient RAM
for the OS and for file system buffers.
If --mem-per-cpu is not set, a default value will be used, which is
usually equal to the total memory available for job allocations on that node
divided by the number of cores.
The amount of memory per core is made available via the variable
SLURM_MEM_PER_CPU, as an integer in megabytes, in the
environment of the running job.
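For example, a sketch of a memory request for a job with 4 tasks of 2 cores each, assuming 2 GB per core is sufficient for the application:
    #SBATCH --ntasks=4
    #SBATCH --cpus-per-task=2
    #SBATCH --mem-per-cpu=2g    # 4 tasks x 2 cores x 2 GB = 16 GB in total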
Requesting wall time#
The requested compute time is specified using
--time=<time> or -t <time>.
<time> is specified in mm (minutes), mm:ss (minutes and seconds), hh:mm:ss
(hours, minutes and seconds), d-hh (days and hours), d-hh:mm (days, hours and minutes)
or d-hh:mm:ss (days, hours, minutes and seconds) format. The
- between days and hours is not a typo!
If not set, a default wall time of 1 hour will be assigned.
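Some illustrative wall time requests (a job script should of course contain only one --time line):
    #SBATCH --time=30:00        # 30 minutes
    #SBATCH --time=2:00:00      # 2 hours
    #SBATCH --time=1-12:00:00   # 1 day and 12 hours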
Specifying a partition#
Slurm jobs can be submitted to a certain partition of compute nodes. Indicating
the kind of job in this manner imposes some additional restrictions on resources
and time, but may let the job start sooner. The partition can be specified
using --partition=<partition> or -p <partition>.
If not set, the default partition will be used.
The name of the partition is available in the variable
SLURM_JOB_PARTITION in the environment of the running job.
Specifying a job name#
The default name of a job is the name of the job script. The name can however be changed
using --job-name=<name> or -J <name>.
The name of the job is available in the variable
SLURM_JOB_NAME in the environment of the running job.
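For instance, the following lines give the job a recognisable name and send it to a specific partition (the partition name debug is just a placeholder; the available partitions differ from cluster to cluster):
    #SBATCH --job-name=preprocessing
    #SBATCH --partition=debug    # placeholder name, check your cluster's documentation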
Redirecting stdout and stderr#
By default, Slurm redirects both stdout and stderr to the same file, named slurm-<jobid>.out.
There are two flags to
sbatch to change this behaviour:
-o <output file> (long form --output=<output file>) will redirect all output to the file specified by
<output file> rather than the default.
-e <error file> (long form --error=<error file>) will redirect output sent to stderr to the file specified by
<error file>. Output sent to stdout is still sent to the default file, unless
--output is also used.
To summarise:
If neither --output nor --error is specified: stdout and stderr are both sent to the default output file slurm-<jobid>.out.
If --output is specified but no
--error: stdout and stderr are both sent to the file pointed to by --output.
If --error is specified but no --output: stdout is redirected to the default output file
slurm-<jobid>.out while stderr is redirected to the file pointed to by --error.
If both --output and --error are specified: stdout is redirected to the file pointed to by
--output and stderr is redirected to the file pointed to by --error.
The file name can (and usually will) be a template. It can contain replacement symbols preceded
by a % that allow you to use the job ID etc. in the name of the file to ensure unique file names.
The most useful of these symbols is
%j, which will be replaced by the unique job ID.
A full list of replacement symbols can be found in the sbatch manual page.
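For example, to write stdout and stderr of each run to files tagged with the job ID (the file name prefix is arbitrary):
    #SBATCH --output=myjob-%j.out   # stdout; %j is replaced by the job ID
    #SBATCH --error=myjob-%j.err    # stderr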
Sending mail at specific events#
Slurm can send mail when a job starts, fails or ends normally, and on a number of other occasions. Two flags influence this behaviour:
--mail-type=<type> specifies when mail should be sent.
<type> is a comma-separated list of type values. Type values include BEGIN, END and FAIL to denote respectively the start of a job, end of a job and failure of a job, but there are many other options that can be found in the sbatch manual page.
--mail-user=<mail address> specifies to which mail address the mails should be sent. The default value is the mail address associated with the VSC-account of the submitting user.
If not set, no mail will be sent.
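A typical combination, with a placeholder mail address, could be:
    #SBATCH --mail-type=BEGIN,END,FAIL
    #SBATCH --mail-user=firstname.lastname@example.com   # placeholder address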
Specifying job dependencies#
Job dependencies can be used to defer the start of a job until the specified dependencies have been satisfied. They are very helpful when implementing a workflow consisting of steps with different requirements for each job in the workflow.
The basic way of specifying a job dependency is through
--dependency=<type>:<jobid>, with types such as afterok, afterany, etc. For (almost) each type one can specify one or more job IDs, and it is also possible
to specify multiple types of dependencies. The table below lists the most common types; an example follows the table.
Dependency type: What it does
after: Job can begin after all specified jobs have started (or are cancelled)
afterany: Job can begin after all specified jobs have terminated
afterok: Job can begin after the specified jobs have successfully completed
afternotok: Job can begin after the specified jobs have failed
singleton: Job can start after all previously launched jobs with the same name and same user have ended. This can be useful to collate results after running a batch of related jobs.
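As a sketch of a simple two-step workflow: the job ID of the first step can be captured with the --parsable option of sbatch, which prints only the job ID, and then passed to --dependency (the script names are placeholders):
    # Submit the first step and capture its job ID.
    jobid=$(sbatch --parsable preprocess.slurm)
    # The second step only starts if the first one completed successfully.
    sbatch --dependency=afterok:${jobid} compute.slurm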
The job environment#
The sbatch command by default copies the environment in which the job script was submitted
(at least, the environment seen by the
sbatch command, so all exported variables and functions).
This implies that, e.g., all modules that were loaded when you submitted the job script, will
be loaded in your job environment. This poses a number of risks:
Some modules adapt their behaviour to the environment in which they were loaded. One important example is the modules that provide MPI on the cluster. When loaded in a Slurm job environment, some environment variables are set to ensure maximal integration with Slurm. However, when loaded on the login nodes these variables are not set, as otherwise running an MPI program as a regular program without
mpirun (and launching just a single process) would fail. The latter is a problem for, e.g., Python when some module loads the Python MPI package.
You may be working in a different environment than the one you used the previous time you ran the job script, and as a consequence of this your job script that previously functioned well may now function differently.
Paths may be different on the login nodes and compute nodes. This can happen during OS upgrades of the cluster. These can often be done without downtime or interrupting work on the cluster, but that implies that some nodes will be running one version while other nodes will be running another version of the OS setup.
To alleviate these issues, we set a minimal environment for jobs by default. This means that, along with the SLURM_* variables, only the HOME, USER and TERM environment variables are exported to the job. The PATH environment variable is set to a minimum in the job environment. This implies that the desired software modules must be loaded in your job scripts for use during the execution of the job.
In case you would attempt to use the
--export sbatch option to override this behaviour,
we advise you to apply one of the following solutions to avoid accidental mistakes (a minimal example job script follows this list):
Clear the environment in your job script by reloading all modules that are needed and ensuring that all environment variables that you need are set. Cleaning the module environment can be done by calling
module --force purge
and then reloading the right software stack and application modules in the job. This should be a common practice in all your job scripts.
Use the option
--export=NONE (either with the
sbatch command or, preferably, as a
#SBATCH line in the job script). This tells
sbatch not to propagate the environment in which it executes, but to reconstruct the environment that you would get when you log on to the cluster. Be aware though that any environment variables already set in the environment will still take precedence over any environment variables in the user's login environment. There is also a difference with what you get when executing your
.bash_profile script: the environment only contains exported variables and functions, and no aliases or variables or functions that are not exported. Note that
--export=NONE will only define SLURM_* variables from the user environment. When using this option, one must use an absolute path to the binary to be executed (which could then be used to further define the environment), and it is not possible to pass environment variables to the job script.
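A minimal sketch of a job script that combines both precautions (the module and program names are hypothetical and should be adapted to your cluster):
    #!/bin/bash
    #SBATCH --export=NONE        # do not propagate the submission environment
    #SBATCH --ntasks=1
    #SBATCH --time=10:00

    # Rebuild a clean, well-defined module environment inside the job:
    module --force purge
    module load example_toolchain/2023a   # hypothetical module name

    ./my_program                          # placeholder for the actual application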
The Slurm controller also sets several SLURM_* variables in the environment of the running job. Some of these variables are only available if the corresponding option has been explicitly set, while other variables are always set (with default values filled in, if appropriate). Several of these variables are mentioned on our PBS-to-Slurm conversion tables page. A full list of all SLURM_* environment variables can be found in the sbatch manual page (in the section on "OUTPUT ENVIRONMENT VARIABLES").
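As a small illustration, a job script could print some of the variables mentioned earlier in this text:
    echo "Partition:       ${SLURM_JOB_PARTITION}"
    echo "Job name:        ${SLURM_JOB_NAME}"
    echo "Memory per core: ${SLURM_MEM_PER_CPU} MB"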
Starting multiple copies of a process in a job script: srun#
The srun command is used to start a new job step in a job script. The most common case is
to start a parallel application.
srun integrates well with major MPI implementations and
can be used instead of
mpiexec to start a parallel MPI application. It then
takes your resource requests and allocated resources into account and does a very good job
of starting each MPI rank on the right set of cores even without having to use additional
command line options. Further down this page there are a couple of examples that
demonstrate the power of the
srun command. The advantage of this way of working is that
all processes run under the strict control of Slurm, ensuring that if something goes wrong,
they are also cleaned up properly.
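As a brief preview of those examples, a hybrid job script might start its parallel application like this (module and program names are again placeholders):
    #!/bin/bash
    #SBATCH --ntasks=8
    #SBATCH --cpus-per-task=4
    #SBATCH --time=1:00:00

    module --force purge
    module load example_mpi_toolchain/2023a   # hypothetical module name

    # srun starts one instance of the program per requested task,
    # each confined to the cores allocated for that task.
    srun ./my_mpi_program                      # placeholder MPI binary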
The srun command can also be used outside of a resource allocation, i.e., at the command
line of the login nodes, outside a job script or an allocation obtained with
salloc (discussed further in the text). It will then first create the resource allocation before executing the
command given as an argument to
srun. One useful case which we discuss further down in this
text is to start an interactive session. Most of the command line options of
sbatch to specify the properties of the allocation can also be used with
srun. Note that srun will propagate the environment. When
srun is used in
a job script to start a parallel application, this is also very sensible and desired
behaviour as it ensures the processes started with Slurm run in the right environment
created by the job script.