doduo

This page includes important information that you should be aware of when using the new HPC-UGent Tier-2 cluster "doduo".

The doduo cluster is a replacement for golett (which was shut down in October 2020) and has been accessible to all VSC accounts since March 18th, 2021.



Resources

The doduo cluster consists of 128 workernodes, each with:

  • 2x 48-core AMD EPYC 7552 processor (AMD Zen2 microarchitecture, a.k.a. AMD Rome)
  • 250 GiB of usable RAM
  • 180 GB of local storage (SSD)

This amounts to over 12,000 cores of additional compute power!
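
As a quick check, the totals follow directly from the per-node figures listed above. The sketch below uses plain shell arithmetic; the variable and function names are illustrative only.

```shell
#!/bin/bash
# Per-node figures taken from the list above.
NODES=128
SOCKETS_PER_NODE=2
CORES_PER_SOCKET=48
RAM_GIB_PER_NODE=250

# Total number of cores in the cluster.
total_cores() {
    echo $(( NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET ))
}

# Usable RAM per core in MiB (integer division) when a full workernode is used.
ram_mib_per_core() {
    echo $(( RAM_GIB_PER_NODE * 1024 / (SOCKETS_PER_NODE * CORES_PER_SOCKET) ))
}

echo "total cores:        $(total_cores)"
echo "RAM per core (MiB): $(ram_mib_per_core)"
```

This works out to 12,288 cores in total, and roughly 2.6 GiB of usable RAM per core when all 96 cores of a workernode are in use.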

A high performance interconnect (HDR-100 InfiniBand) and a fast connection to the shared filesystems make this cluster well suited for large MPI jobs.


Usage

To start using doduo, just swap to the corresponding cluster module (but also take into account the different operating system, see below):

module swap cluster/doduo


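The doduo cluster is a bit different from the other HPC-UGent Tier-2 clusters, so please take the following things into account:

  • it runs the RHEL 8 operating system, and has a separate RHEL 8 login node
  • it uses AMD processors rather than Intel processors
  • multi-core workloads may scale differently
  • software is only installed with a 2019b or newer compiler toolchain
  • the job command wrappers have been re-implemented

More information on each of these aspects is available below.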

Please take a minute to carefully read this information if you plan to use the doduo cluster!


RHEL8 operating system + login node

The workernodes in the doduo cluster are running Red Hat Enterprise Linux version 8.2 (RHEL 8.2) as operating system.
This is different from the other Tier-2 clusters and the (default) HPC-UGent login nodes, which are running CentOS Linux version 7.9.

A separate login node which is also running RHEL 8 has been set up, and we strongly advise using this login node when using the doduo cluster,
especially when running tests with software that was installed for doduo, or when using alternate job submission tools like worker.

If you try running software that was installed for doduo on the current default login nodes which are still running CentOS 7, you may run into problems.

To log in to a login node that runs RHEL 8, first log in as usual (via login.hpc.ugent.be), and then run the following command:

ssh login8

You should not notice any difference when using a RHEL 8 login node compared to the default login nodes, other than avoiding some problems when using doduo.

Note that for the other Tier-2 clusters still running CentOS 7, we recommend that you keep using the default login nodes (which are also running CentOS 7).

The 'cluster' modules have been adjusted to warn you about a mismatch in operating system between the login node you are using and the cluster you are targeting.
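
If you want to check for yourself which operating system version a login node is running, you can inspect /etc/os-release, which is present on both CentOS 7 and RHEL 8. The sketch below is a hypothetical helper and not part of the 'cluster' modules; the function names are made up for illustration, and only the major version number is compared.

```shell
#!/bin/bash
# Print the major OS version found in an os-release style file
# (defaults to /etc/os-release). Prints nothing if it cannot be determined.
os_major_version() {
    local file="${1:-/etc/os-release}"
    # VERSION_ID looks like VERSION_ID="8.2" (RHEL 8) or VERSION_ID="7" (CentOS 7).
    sed -n 's/^VERSION_ID="\{0,1\}\([0-9][0-9]*\).*/\1/p' "$file" 2>/dev/null
}

# Succeed if the major OS version equals the expected one.
os_matches() {
    local expected="$1"
    local file="${2:-/etc/os-release}"
    [ "$(os_major_version "$file")" = "$expected" ]
}

if os_matches 8; then
    echo "Major OS version is 8, the same as on the doduo workernodes."
else
    echo "Major OS version differs from doduo; consider running: ssh login8"
fi
```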


AMD processors

The doduo cluster is powered by AMD processors, which belong to the same processor family as the Intel processors in the other HPC-UGent Tier-2 clusters
(both support the x86_64 instruction set), but they are different in several ways that you may need to be aware of.

Some software that works well on Intel processors is known to under-perform (sometimes significantly) on AMD processors, or in some cases even produce outright incorrect results.

A noteworthy example is the Intel MKL library, which is part of the 'intel' compiler toolchain and provides both linear algebra (BLAS, LAPACK) functionality and Fast Fourier Transform (FFT) routines.

For some use cases, we have observed disappointing performance, and sometimes even outright incorrect results, with recent versions of Intel MKL on the AMD processors in doduo.
To mitigate this, we have changed the composition of the 'intel' toolchain on doduo. We will keep evaluating this going forward.

These issues have been taken into account for the centrally provided software installations, and we will continue to do so for new software installations as well.

It is important that you are aware of these issues too, and take them into account in your own work.

Please double check the results you obtain on the doduo cluster, and make sure they align well with results obtained on the other clusters using Intel CPUs.
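
One simple way to compare a numerical result obtained on doduo with a reference value obtained on an Intel-based cluster is a relative-tolerance check. The sketch below uses awk for the floating-point arithmetic; the function name and the default tolerance are assumptions that you should adapt to your application.

```shell
#!/bin/bash
# Succeed if two numbers agree within a relative tolerance (default 1e-6).
# Usage: results_agree <value_doduo> <value_reference> [rel_tol]
results_agree() {
    awk -v a="$1" -v b="$2" -v tol="${3:-1e-6}" '
        function abs(x) { return (x < 0) ? -x : x }
        BEGIN {
            # Scale the tolerance by the larger magnitude of the two values;
            # two exact zeros are treated as equal.
            scale = (abs(a) > abs(b)) ? abs(a) : abs(b)
            exit !(abs(a - b) <= tol * scale)
        }'
}

# Example with made-up values:
if results_agree 1.0000001 1.0 1e-6; then
    echo "results agree within tolerance"
else
    echo "results differ, please investigate"
fi
```

What counts as an acceptable tolerance depends on the software and the problem at hand; small differences in the last digits are normal when different processors and math libraries are involved.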


Scaling of multi-core workloads

Keep an eye on the performance of your jobs, and be careful with (large) multi-core jobs, especially when running software that is memory bandwidth intensive.

The cache hierarchy of recent AMD processors is significantly different compared to the Intel processors in the other HPC-UGent Tier-2 clusters.
Each group of 4 cores in the AMD processors used in doduo shares a part of the L3 cache.

The average memory bandwidth per core for these AMD processors is lower than for those Intel processors (but the total memory bandwidth is higher).

Detailed technical information on the AMD Zen2 microarchitecture is available here and here.

You may well see (almost) no speedup when using a full doduo workernode (96 cores) compared to half a workernode (48 cores) or less.
OpenFOAM is one particular example of software that behaves this way.

You should not blindly use full workernodes (ppn=all) on doduo just because you do so on the Intel-based Tier-2 clusters;
first evaluate whether you see significant performance improvements when using more cores.
You may even see better performance when using fewer cores!
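
To decide how many cores to request, you can time the same workload at a few core counts, and compute the speedup and parallel efficiency relative to the smallest run. The sketch below is a minimal illustration; the function name is hypothetical, and the timings mentioned afterwards are made up rather than measured on doduo.

```shell
#!/bin/bash
# Print speedup and parallel efficiency of a run relative to a baseline run.
# Usage: scaling <base_cores> <base_seconds> <cores> <seconds>
scaling() {
    awk -v n0="$1" -v t0="$2" -v n="$3" -v t="$4" 'BEGIN {
        speedup    = t0 / t               # how much faster than the baseline
        efficiency = speedup / (n / n0)   # 1.00 means perfect scaling
        printf "speedup=%.2f efficiency=%.2f\n", speedup, efficiency
    }'
}

# Made-up timings: 1000 seconds on 48 cores, 950 seconds on 96 cores.
scaling 48 1000 96 950
```

With these made-up timings the full workernode gives a speedup of only 1.05 at an efficiency of 0.53, which would suggest that requesting half a workernode is the better choice for that workload.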


Software

A large amount of software is already centrally installed on doduo, and should work as expected.
Just like on the other Tier-2 clusters, this software can be accessed via the 'module' command.

Because of both the newer operating system (RHEL 8) and the AMD processors, we can only install software with a sufficiently recent compiler toolchain on doduo.

We will only install software on doduo with a 2019b compiler toolchain, or newer.
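
The sketch below shows one way to check whether a given toolchain version meets this requirement. It assumes toolchain versions of the form year plus 'a' or 'b' (for example 2019b or 2020a), which sort correctly as plain strings; the function name is hypothetical.

```shell
#!/bin/bash
# Succeed if a toolchain version such as "2020a" is 2019b or newer.
# Returns 2 if the version does not look like <year><a|b>.
toolchain_supported() {
    local version="$1"
    local minimum="2019b"
    case "$version" in
        [0-9][0-9][0-9][0-9][ab]) ;;   # expected format, continue
        *) return 2 ;;                 # unrecognised format
    esac
    # The minimum must sort first (or be equal) for the version to qualify.
    [ "$(printf '%s\n%s\n' "$minimum" "$version" | LC_ALL=C sort | head -n 1)" = "$minimum" ]
}

for v in 2018b 2019a 2019b 2020a; do
    if toolchain_supported "$v"; then
        echo "$v: eligible for doduo"
    else
        echo "$v: too old for doduo"
    fi
done
```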

Older installations that are still available on the other HPC-UGent Tier-2 clusters will have to be updated and installed with one of these recent toolchains (if compatible).

Again, please double check the results you obtain through jobs on doduo, and keep an eye open for performance issues, especially for multi-core workloads.

If you notice that software which you would like to use is not available yet for doduo (check via "module avail" after loading the cluster/doduo module),
please submit a software installation request via https://www.ugent.be/hpc/en/support/software-installation-request .

If you are compiling software yourself in order to use it on doduo, make sure you do so on a doduo workernode (not on a login node).


Job command wrappers (jobcli)

Doduo is using Slurm as the resource manager and job scheduler, just like the other HPC-UGent Tier-2 clusters.

We still provide wrappers for the trusted job submission and management commands (qsub, qstat, qdel, etc.), and will continue doing so for the foreseeable future.

However, the job command wrappers have been re-implemented from scratch, mainly to make it easier for us to maintain and improve them in the future.

You should not notice any differences when using these commands compared to the other HPC-UGent Tier-2 clusters,
except perhaps for minor differences in the output of the qstat command (which hopefully won't affect your work).

If you do notice anything unexpected, or if you run into something that doesn't work, please let us know and we will try to resolve the issue as soon as possible.


Relation to Hortense (Tier-1)

Because it features the same operating system (RHEL 8) as well as similar AMD processors,
doduo is in some sense a precursor to the upcoming VSC Tier-1 system "Hortense", which will be installed at Ghent University.


Questions

Don't hesitate to contact us via hpc@ugent.be in case of questions, or to report problems with using the new doduo cluster.