My jobs seem to run, but I don’t see any output or errors?#
You ran out of time#
It is possible that the job exceeded the walltime specified as part of the required resources, or the default value if none was given.
If this is the case, the resource manager will terminate your job, and the job’s output file will contain a line similar to:
=>> PBS: job killed: walltime <value in seconds> exceeded limit <value in seconds>
Try to submit your job specifying a larger walltime.
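For example, assuming a PBS/TORQUE-style qsub (consistent with the PBS message above) and a job script named my_job.pbs (the name is just an illustration), you can request four hours of walltime:
$ qsub -l walltime=4:00:00 my_job.pbs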
You ran out of disk space#
You may have exceeded the disk quota for your home directory, i.e., the total size of the files it contains is just too large.
When a job runs, it needs to store temporary output and error files in your home directory. If it fails to do so, the program will crash, and you won’t get any feedback, since that feedback would be in the error file that can’t be written.
See the FAQs listed below to check the amount of disk space you are currently using, and for a few hints on what data to store where.
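For a quick check from the command line, the standard du utility reports the total size of your home directory (this may take a while if you have many files):
$ du -sh $VSC_HOME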
However, your home directory may unexpectedly fill up in two ways:
a running program produces large amounts of output or errors;
a program crashes and produces a core dump.
Note
A single job that produces output or a core dump that is too large for the file system quota will most probably cause all your queued jobs to fail.
Large amounts of output or errors#
To deal with the first issue, simply redirect the standard output of the command to a file in your data or scratch directory, or, if you don’t need that output anyway, redirect it to /dev/null. A few examples that can be used in your job scripts that execute, e.g., my-prog, are given below.
To send standard output to a file, you can use:
my-prog > $VSC_DATA/my-large-output.txt
If you want to redirect both standard output and standard error, use:
my-prog > $VSC_DATA/my-large-output.txt \
2> $VSC_DATA/my-large-error.txt
To redirect both standard output and standard error to the same file, use:
my-prog &> $VSC_DATA/my-large-output-error.txt
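Note that &> is a bash shorthand; if your job script runs under a plain POSIX shell, the equivalent portable form is:
my-prog > $VSC_DATA/my-large-output-error.txt 2>&1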
If you don’t care about the standard output, simply write:
my-prog >/dev/null
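If neither the standard output nor the standard error is of interest, you can discard both at once:
my-prog >/dev/null 2>&1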
Core dump#
When a program crashes, a core file is generated. This can be used to try to analyze the cause of the crash. However, if you don’t need core dumps for post-mortem analysis, simply add the following line to your .bashrc file:
ulimit -c 0
This can be done more selectively by adding this line to your job script prior to invoking your program.
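For example, a job script could disable core dumps just for the run of the program that might crash, following the examples above:
ulimit -c 0
my-prog > $VSC_DATA/my-large-output.txt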
You can find all the core dumps in your home directory using:
$ find $VSC_HOME -name "core.*"
They can be removed (make sure that only unwanted core files are removed by checking with the command above) using:
$ find $VSC_HOME -name "core.*" -exec rm {} +
You ran out of memory (RAM)#
The resource manager monitors the memory usage of your application and will automatically terminate your job when that usage exceeds a limit. This limit is either the value specified in the resource request using pmem or pvmem, or the default value otherwise.
You may find an indication that this is the case by looking at the job’s output file. The epilogue information lists the resources used by the job, including memory:
Resources Used : cput=00:00:00,vmem=110357kb,walltime=00:34:02,mem=984584kb
If the value of mem is close to the limit, this may indicate that the application used too much memory.
The reported resource usage is only a rough indication, and the reported value can be lower than the actual peak if the application’s memory usage increased rapidly. Hence it is prudent to monitor the memory consumption of your job in more detail.
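A simple way to do this, assuming GNU time is available as /usr/bin/time on the cluster, is to run your program under it; the verbose report, written to standard error, includes the maximum resident set size:
/usr/bin/time -v my-prog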
You can try to resubmit your job specifying more memory per core.
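For example, again assuming a PBS/TORQUE-style qsub and a placeholder script name my_job.pbs, a request for 4 GB of physical memory per process could look like:
$ qsub -l pmem=4gb my_job.pbs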