Why doesn’t my job start?

Jobs are submitted to a queue system, which is monitored by a scheduler that determines when a job can be executed.

When a job starts depends on two factors:

  1. the priority assigned to the job by the scheduler, and the priorities of the other jobs already in the queue, and

  2. the availability of the resources required to run the job.

The priority of a job is calculated using a formula that takes into account a number of criteria:

  1. User credentials: at the moment, all users are equal.

  2. Fair share: this takes into account the walltime the user has used over the last seven days. The more used, the lower the resulting priority.

  3. Time queued: the longer a job spends in the queue, the higher its priority becomes, so that it will run eventually.

  4. Requested resources: larger jobs get a higher priority.

At each iteration, the scheduler combines these factors into a weighted sum that determines the job’s priority. Because the time queued and the fair share both change, the priority is not static but evolves while the job is in the queue.
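To make the idea of a weighted sum concrete, here is a minimal sketch in Python. It is not the scheduler's actual formula: the weights, the field names and the `priority` function are all hypothetical and chosen only to show how the four criteria could combine and why the result changes as a job waits.

```python
from dataclasses import dataclass

# Hypothetical weights, for illustration only. Real clusters configure
# their own values and may use a different formula altogether.
WEIGHTS = {
    "credentials": 1.0,
    "fair_share": 2.0,
    "time_queued": 1.0,
    "resources": 0.5,
}


@dataclass
class Job:
    credential_score: float       # identical for all users at the moment
    recent_walltime_hours: float  # walltime the owner used in the last seven days
    hours_queued: float           # time this job has spent in the queue
    cores_requested: int          # stand-in for "requested resources"


def priority(job: Job) -> float:
    """Weighted sum of the four criteria (assumed form, not the real one)."""
    return (
        WEIGHTS["credentials"] * job.credential_score
        # Fair share: more recent usage lowers the priority.
        - WEIGHTS["fair_share"] * job.recent_walltime_hours
        # Time queued: waiting longer raises the priority.
        + WEIGHTS["time_queued"] * job.hours_queued
        # Requested resources: larger jobs get a higher priority.
        + WEIGHTS["resources"] * job.cores_requested
    )


if __name__ == "__main__":
    # The same job, re-evaluated after waiting 0, 12 and 24 hours.
    for hours in (0, 12, 24):
        job = Job(0.0, 100.0, float(hours), 16)
        print(hours, priority(job))
```

Re-evaluating the same job with a larger `hours_queued` yields a higher value, which mirrors how a queued job eventually gets its turn.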

Different clusters use different policies as some clusters are optimized for a particular type of job.

Also, don’t try to outsmart the scheduler by explicitly specifying nodes that seem empty when you submit your job. The scheduler may be reserving these nodes for a job that requires multiple nodes, so your job will likely spend even more time in the queue, since the scheduler will not launch your job on another node which may be available sooner.

Remember that the cluster is not intended as a replacement for a decent desktop PC. Short, sequential jobs may spend quite some time in the queue, but this type of calculation is atypical from an HPC perspective. If you have large batches of sequential jobs, even relatively short ones, you can pack them into longer sequential or even parallel jobs and get them to run sooner. User support can help you with that, or see the page “How can I run many similar computations conveniently?”
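As a rough illustration of packing, the sketch below groups many short tasks into batches that each fill a target walltime, so that every batch can be submitted as a single longer job. The function name `pack_tasks`, the task durations and the walltime target are hypothetical; how each batch is actually submitted depends on the cluster and is not shown.

```python
def pack_tasks(durations_min, target_min):
    """Group tasks into batches whose total duration stays within target_min.

    durations_min: estimated run time of each task, in minutes.
    Returns a list of batches, each a list of task indices in input order.
    A task longer than target_min ends up in a batch of its own.
    """
    batches, current, total = [], [], 0.0
    for index, duration in enumerate(durations_min):
        # Close the current batch if adding this task would exceed the target.
        if current and total + duration > target_min:
            batches.append(current)
            current, total = [], 0.0
        current.append(index)
        total += duration
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Five 10-minute tasks packed into jobs of at most 30 minutes each.
    print(pack_tasks([10, 10, 10, 10, 10], 30))
```

Each batch would then run its tasks one after another inside one job, replacing many separate short submissions.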