Running jobs in Torque#

VSC clusters using the Torque job scheduler:

UGent

Note

Other clusters might use the Slurm job scheduler

The workflow on the HPC is straightforward:

  1. Create a job script

  2. Submit it as a job to the scheduler

  3. Wait for the computation to run and finish

Job script#

A job script is essentially a Bash script, augmented with information for the scheduler. As an example, consider a file hello_world.pbs as shown below.

 1#!/usr/bin/env bash
 2
 3#PBS -l nodes=1:ppn=1
 4#PBS -l walltime=00:05:00
 5#PBS -l mem=1gb
 6
 7cd $PBS_O_WORKDIR
 8
 9module purge
10module load Python/3.12.3-GCCcore-13.3.0
11
12python hello_world.py

We discuss this script line by line.

  • Line 1 is a shebang line indicating that this is a Bash script.

  • Lines 3-5 inform the scheduler about the resources required by this job.

    • It requires a single node (nodes=1), and a single core (ppn=1) on that node.

    • It will run for at most 5 minutes (walltime=00:05:00).

    • It will use at most 1 GB of RAM (mem=1gb).

  • Line 7 changes the working directory to the directory from which the job was submitted (that directory is stored in the $PBS_O_WORKDIR environment variable when the job runs).

  • Lines 9 and 10 set up the environment by loading the appropriate modules.

  • Line 12 performs the actual computation, i.e., running a Python script.

Every job script has the same basic structure.
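
The resource requests in the #PBS lines can also be passed to qsub on the command line when submitting; command-line options take precedence over the directives in the script. For the example above, an equivalent submission would be:

$ qsub -l nodes=1:ppn=1 -l walltime=00:05:00 -l mem=1gb hello_world.pbs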

Note

Although you can use any file name extension you want, it is good practice to use .pbs since that allows support staff to easily identify your job script.

See also

More information is available in Using the module system.

Submitting and monitoring a job#

Once you have created your job script, and transferred all required input data if necessary, you can submit your job to the scheduler:

$ qsub hello_world.pbs
11549090

The qsub command returns a job ID, a unique identifier that you can use to manage your job.
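
Since qsub prints the job ID, you can capture it in a shell variable for later use, for instance to delete the job with qdel if it is no longer needed:

$ jobid=$(qsub hello_world.pbs)
$ echo $jobid
11549090
$ qdel $jobid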

Once submitted, you can monitor the status of your job using the qstat command.

$ qstat
Job ID     Name             User            Time Use S Queue
---------- ---------------- --------------- -------- - -------
11549090   hello_world.pbs  vsc30140        0:00:10  C cpu...

The status of your job is given in the S column. The most common values are listed below.

status   meaning
------   -------------------------------------------
Q        job is queued, i.e., waiting to be executed
R        job is running
C        job is completed, i.e., finished
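
On a busy cluster, qstat without arguments can list many jobs; to show only your own jobs, pass your user name:

$ qstat -u $USER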


Job output#

By default, the output of your job is saved to two files.

<job_name>.o<jobid>

This file contains all text written to standard output, as well as some information about your job.

<job_name>.e<jobid>

This file contains all text written to standard error, if any. If your job fails, or doesn’t produce the expected output, this is the first place to look.
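
For the example job above (job name hello_world.pbs, job ID 11549090), you can inspect these files once the job has completed:

$ cat hello_world.pbs.o11549090
$ cat hello_world.pbs.e11549090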

Troubleshooting#

Advanced topics#

  • Monitoring memory and CPU usage of programs, which helps you choose appropriate values for your job's resource requirements.

  • worker quick start: to manage many small jobs on a cluster. The scheduler isn't designed to handle large numbers of small jobs: each job creates overhead, so it is better to bundle them into larger sets.

  • The Checkpointing framework can be used to run programs that take longer than the maximum time allowed by the queue. It can break a long job in shorter jobs, saving the state at the end to automatically start the next job from the point where the previous job was interrupted.