HPC-UGent system status

Status

Login nodes: Online
Compute nodes: Online
Storage: Online

Known issues

(none)

Planned maintenance

Maintenance window 2020Q4: 30/11/2020 - 4/12/2020
All clusters, login nodes and storage systems will be unavailable during this maintenance

  • [Fri 4 Dec 2020 15h45]
    • Maintenance is completed
    • All clusters and login nodes are available again
  • [Fri 4 Dec 2020 10h40]
    • Further analysis, among others with IBM, shows that the filesystem bug is only triggered when a considerable portion of the HPC-UGent infrastructure is rebooted at the same time (e.g. after a power failure).
    • IBM confirms the fix will not be available before the new release, scheduled late Monday 7 Dec 2020.
    • As the risk of data integrity loss is low, we have decided to start releasing schedulers.
    • Login nodes should come online soon.
  • [Fri 4 Dec 2020 9h20]
    • We have discovered a bug in the parallel filesystem software, which was seemingly triggered by network reboots. This bug can make the filesystem unstable, with potential impact on data integrity. This issue currently prevents us from going live.
    • We are in communication with our vendor (IBM) to get a fix for this bug as soon as possible.
      • IBM will include and deploy a fix in a new release of GPFS, but that release is only expected late Monday 7 Dec 2020.
      • We are trying to obtain an early release of the fix, if possible even today.
  • [Thu 3 Dec 2020 19h00]
    • Recompilation of software on updated clusters finalized
    • Updates to authentication services
    • Further stress tests of clusters and storage
  • [Wed 2 Dec 2020 18h10]
    • Electricity works in datacenter finalized
    • InfiniBand network actions completed
    • Stress tests of InfiniBand network and storage started
    • Tier1 cloud setup rebooted
    • Recompilation of software on clusters ongoing
  • [Wed 2 Dec 2020 15h40]
    • Software recompilation on updated clusters progressing
    • Electricity works in datacenter ongoing
  • [Tue 1 Dec 2020 20h15]
    • Core network switches failover tested and upgraded
    • Firmware and OS updates on supporting services
    • InfiniBand network tests
    • OS updates finalized for
      • cluster skitty
      • cluster swalot
  • [Tue 1 Dec 2020 10h00]
    • OS updates finalized for
      • login nodes
      • cluster kirlia
      • cluster joltik
      • cluster victini
  • [Mon 30 Nov 2020 18h00]
    • Storage firmware upgrades completed
    • Rewiring of network ongoing
    • Belnet network modifications implemented
    • All OS updates initiated
    • OS updates finalized for
      • cluster phanpy
  • [Mon 30 Nov 2020 8h00] Maintenance has started

Reminders

  • [Tue 17 Nov 2020] Datacenter works are scheduled for 30/11/2020 - 4/12/2020
    All clusters, login nodes and storage systems will be unavailable during this maintenance
  • [Mon 9 Nov 2020] Cluster golett was decommissioned
  • [Fri 23 Oct 2020 10h00] All worker nodes in the joltik GPU cluster have been updated to the latest GPU drivers (supporting CUDA 11).

Cluster load

Consult http://hpc.ugent.be/clusterstate/

(only available within the UGent network)