wICE advanced guide#
Compiling software#
Compared to the SkyLake and CascadeLake CPUs on Genius, the wICE nodes feature more recent CPU models such as Intel IceLake, Intel Sapphire Rapids and AMD Genoa. While architectural differences between SkyLake and CascadeLake CPUs can be neglected, the differences with newer CPU models are more substantial. When it comes to GPUs there are also significant differences in the capabilities of P100, V100, A100 and H100 GPUs.
When locally installing software yourself, we therefore recommend keeping separate installations for the different types of CPUs and (if applicable) GPUs on which you intend to run the software. This rule applies most strongly to performance-critical code which is compiled from source. It applies less strongly to precompiled binaries or interpreted code such as pure Python scripts (meaning that it is typically not necessary to e.g. create different Conda environments for different CPU types).
Remember that precompiled binaries (as is often the case when e.g. Conda or PyPI are involved) are not guaranteed to deliver optimal performance for the target device. In case of doubt, performance-critical parts of an application should not rely on precompiled binaries and instead use optimized binaries as provided by the centrally installed modules and/or by local installations from source.
Older toolchains (compilers, BLAS libraries, …) may not be able to take full advantage of newer CPU models and so we typically recommend using the most recent available toolchains. Sapphire Rapids CPUs provide new ‘AMX’ instructions, for example, which may be useful for AI applications. When using GNU compilers, however, the GCC version needs to be sufficiently recent (>= v11) in order to generate AMX code.
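As a rough illustration of the version requirement above, the following sketch checks whether a GCC version string is at least v11 before you attempt to build AMX code. The function name `gcc_can_target_amx` is hypothetical, and the check only looks at the major version number reported by `gcc -dumpversion`; it does not verify that the CPU itself supports AMX.

```bash
#!/bin/bash
# Succeed if a GCC version string (e.g. "12.2.0" or "11") is at least v11.
# Hypothetical helper, not part of any VSC tooling.
gcc_can_target_amx() {
    local version="$1"
    local major="${version%%.*}"   # keep everything before the first dot
    [[ "${major}" =~ ^[0-9]+$ ]] && (( 10#${major} >= 11 ))
}

# Only query the compiler if one is actually on the PATH.
if command -v gcc >/dev/null 2>&1; then
    gcc_version="$(gcc -dumpversion)"
    if gcc_can_target_amx "${gcc_version}"; then
        echo "GCC ${gcc_version}: recent enough to generate AMX code"
    else
        echo "GCC ${gcc_version}: too old for AMX, consider a newer toolchain"
    fi
fi
```

Running this right after loading a toolchain module gives a quick sanity check before starting a long build.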
To let jobs use the correct installation at runtime, you can make use of
predefined environment variables such as ${VSC_ARCH_LOCAL}
(and possibly ${VSC_ARCH_SUFFIX}) to organize your
installations (see the examples below).
For software using CPUs, the different installations would be:
one for SkyLake and CascadeLake CPUs (if needed, you can control this for your jobs by e.g. adding a --constraint=cascadelake Slurm option)
one for IceLake CPUs
one for Sapphire Rapids CPUs
For software which also uses GPUs, this would be:
one for SkyLake CPUs with P100 GPUs
one for CascadeLake CPUs with V100 GPUs
one for IceLake CPUs with A100 GPUs
one for AMD Genoa CPUs with H100 GPUs
Unless mentioned otherwise, ${VSC_ARCH_SUFFIX} corresponds to an
empty string. You can check which CPU and GPU models are present in which
partitions on the Genius hardware and wICE hardware pages.
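One possible way to derive a per-architecture installation directory from these variables is sketched below. The function name `arch_install_dir` and the `my_software` directory are hypothetical choices, and the fallback values exist only so that the snippet also runs outside the cluster, where the VSC variables are not defined.

```bash
#!/bin/bash
# Build an architecture-specific installation prefix.
# Arguments: base directory, architecture name, optional suffix.
# Hypothetical helper for illustration only.
arch_install_dir() {
    local base="$1" arch="$2" suffix="${3:-}"
    printf '%s/%s%s\n' "${base}" "${arch}" "${suffix}"
}

# On a cluster node the VSC variables are predefined; the fallbacks are
# placeholders so that the sketch does not fail elsewhere.
installdir="$(arch_install_dir "${VSC_DATA:-${HOME:-/tmp}}/my_software" \
    "${VSC_ARCH_LOCAL:-unknown}" "${VSC_ARCH_SUFFIX:-}")"
echo "Installation prefix: ${installdir}"
```

Because the prefix is computed from the node's own environment, the same lines can be reused unchanged in both the build job and the run job for every CPU type.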
Many dependencies you might need are centrally installed. The modules
that are optimized for wICE are available when the appropriate
cluster module is loaded. In most cases this will
happen automatically, but in case of problems it is a good idea to double check
the $MODULEPATH environment variable; it should contain paths
starting with /apps/leuven/rocky8/${VSC_ARCH_LOCAL}${VSC_ARCH_SUFFIX},
where ${VSC_ARCH_LOCAL} indicates the architecture of the
node in question.
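The check described above could be automated along the following lines. This is a sketch that assumes the module search path is stored in `MODULEPATH` as a colon-separated list (the convention used by Lmod and Environment Modules); the function name `path_list_has_prefix` is hypothetical.

```bash
#!/bin/bash
# Succeed if any entry of a colon-separated path list starts with the prefix.
# Hypothetical helper for illustration only.
path_list_has_prefix() {
    local list="$1" prefix="$2" entry
    local IFS=':'
    for entry in ${list}; do
        [[ "${entry}" == "${prefix}"* ]] && return 0
    done
    return 1
}

# Expected location of the modules built for the current node type.
expected="/apps/leuven/rocky8/${VSC_ARCH_LOCAL:-}${VSC_ARCH_SUFFIX:-}"
if path_list_has_prefix "${MODULEPATH:-}" "${expected}"; then
    echo "MODULEPATH contains entries under ${expected}"
else
    echo "No MODULEPATH entry under ${expected}; check the loaded cluster module"
fi
```

Placing such a check at the top of a jobscript makes a mismatch between node type and module tree visible in the job output.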
Similar to other VSC clusters, wICE supports two families of common toolchains: FOSS and Intel. Next to that, various subtoolchains are available. For more general information on software development on the VSC, have a look at this overview.
The following jobscripts show one of the ways you can put this into practice to compile and then run your software (to be repeated for each CPU model that you intend to use):
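#!/bin/bash -l
#SBATCH --clusters=...
#SBATCH --partition=...
# Job 1: compile and install
module load intel/2022b # just an example
# Example location (an assumption, adapt to your own layout): one directory per architecture
installdir="${VSC_DATA}/my_software/${VSC_ARCH_LOCAL}${VSC_ARCH_SUFFIX}"
mkdir -p "${installdir}"
# build the software, installing the binaries in ${installdir}/bin

#!/bin/bash -l
#SBATCH --clusters=...
#SBATCH --partition=...
# Job 2: run, using the same module and installation directory as the compile job
module load intel/2022b
installdir="${VSC_DATA}/my_software/${VSC_ARCH_LOCAL}${VSC_ARCH_SUFFIX}"
export PATH="${installdir}/bin:${PATH}"
# run the software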
Memory hierarchy#
When running applications in parallel it is often a good idea to take the
memory hierarchy into account (for example when pinning MPI processes
in hybrid MPI/OpenMP calculations).
The nodes in the batch
partition on Genius and wICE are the simpler ones
with a single NUMA domain and L3 cache per CPU, with the usual core-private
L1 and L2 caches. Other node types may feature more than one NUMA domain per
CPU and (in the case of AMD CPUs) more than one L3 cache per CPU.
The 48 cores in a Sapphire Rapids CPU, for example, share a large L3 cache
but are organized in 4 groups of 12 cores, each group associated with one
NUMA domain. For a complete overview, please consult the
Genius hardware and wICE hardware pages.
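To make the Sapphire Rapids example concrete, the sketch below works out the number of cores per NUMA domain and uses it as the OpenMP thread count for a hybrid run with one MPI rank per NUMA domain. That rank layout is only one reasonable choice, not a recommendation from this guide, and the function name `cores_per_numa_domain` is hypothetical. `OMP_NUM_THREADS`, `OMP_PLACES` and `OMP_PROC_BIND` are standard OpenMP environment variables.

```bash
#!/bin/bash
# Cores per NUMA domain = cores per CPU / NUMA domains per CPU.
# Hypothetical helper for illustration only.
cores_per_numa_domain() {
    local cores="$1" domains="$2"
    if (( cores <= 0 || domains <= 0 || cores % domains != 0 )); then
        echo "core count must be a positive multiple of the domain count" >&2
        return 1
    fi
    echo $(( cores / domains ))
}

# Sapphire Rapids figures quoted above: 48 cores in 4 NUMA domains per CPU.
threads="$(cores_per_numa_domain 48 4)"
export OMP_NUM_THREADS="${threads}"   # assumes one MPI rank per NUMA domain
export OMP_PLACES=cores               # one place per core
export OMP_PROC_BIND=close            # keep a rank's threads on neighbouring places
echo "Use ${threads} OpenMP threads per MPI rank (e.g. --cpus-per-task=${threads})"
```

With these settings each rank's threads stay within one group of 12 cores, so memory accesses remain local to a single NUMA domain, provided the MPI launcher also places each rank on its own domain.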
You can also retrieve this information using the lstopo-no-graphics
command. When on a compute node, keep in mind that the output will only
be complete if all available cores have been allocated to your job.
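A quick way to see whether the topology output is likely to be complete is to compare the processing units available to the job with those installed in the node, as in the sketch below. It assumes a Linux node with GNU coreutils and assumes that Slurm restricts the job's CPU affinity to the allocated cores. Note that `nproc` counts processing units, which equals the core count only when hyperthreading is disabled.

```bash
#!/bin/bash
# Processing units usable by this process. OMP_NUM_THREADS and
# OMP_THREAD_LIMIT are unset for the call because nproc honours them.
allocated="$(env -u OMP_NUM_THREADS -u OMP_THREAD_LIMIT nproc)"
# Processing units installed in the node, regardless of the allocation.
installed="$(nproc --all)"

if (( allocated < installed )); then
    echo "Only ${allocated} of ${installed} processing units allocated: topology output will be partial"
else
    echo "All ${installed} processing units allocated: topology output will be complete"
fi

# lstopo-no-graphics is part of hwloc; skip it when it is not on the PATH.
if command -v lstopo-no-graphics >/dev/null 2>&1; then
    lstopo-no-graphics
fi
```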
The Worker framework, which allows you to conveniently parameterize simulations, is available on wICE. Note that if you want to launch Worker jobs from the Genius login nodes, you will need to use a specific module:
$ module load worker/1.6.12-foss-2021a-wice
If instead you want to launch Worker jobs from an interactive job running on
wICE, you can use the worker/1.6.12-foss-2021a module. But do make sure
this is the version installed specifically for wICE, which you can check
by looking at the installation directory of worker. For example, the path
returned by which worker should start with /apps/leuven/rocky8/icelake,
/apps/leuven/rocky8/sapphirerapids or /apps/leuven/rocky8/zen4-h100.
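The path check described above can be scripted roughly as follows. The function name `is_wice_install_path` is hypothetical, and the list of prefixes is simply the three directories named in the text.

```bash
#!/bin/bash
# Succeed if a path lies under one of the wICE installation trees listed above.
# Hypothetical helper for illustration only.
is_wice_install_path() {
    local path="$1" arch
    for arch in icelake sapphirerapids zen4-h100; do
        [[ "${path}" == "/apps/leuven/rocky8/${arch}/"* ]] && return 0
    done
    return 1
}

# Look up the worker executable; nothing is printed if no Worker module is loaded.
if worker_path="$(command -v worker)"; then
    if is_wice_install_path "${worker_path}"; then
        echo "wICE installation of Worker: ${worker_path}"
    else
        echo "Not a wICE installation: ${worker_path}"
    fi
fi
```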
Also note that the Worker support for Slurm is not yet complete. Both
the -master option for wsub and the wresume tool currently
only work for PBS/Torque and hence should not be used in the case of Slurm.
All the resources furthermore need to be specified inside the Slurm script used as input for Worker (passing resources via the command line is not supported). Various examples can be found in a development branch.