Tech talks on scientific computing

Thanks to the FOSDEM open source event in Brussels, we have been able to invite several international speakers to give technical talks on a variety of topics related to scientific computing.

The intended audience is system administrators and researchers, as well as anyone interested in the covered topics.

Date

Mon 4 February 2019 from 14:00 to 18:00

Venue

Multimedia room, building S9, Campus De Sterre, Krijgslaan 281, 9000 Gent

Registration

Attendance is free but registration is required for practical reasons.
Follow this link to register for the next session: https://webappsx.ugent.be/eventManager/events/postfosdem

Program

Organized by HPC-UGent

 A playlist with recordings of the talks is available at https://www.youtube.com/playlist?list=PLrmNhuZo9sgYR3qY1-xw5CbzBnhIEQwf9.

 

  • Towards self-managed, re-configurable streaming dataflow systems (Vasia Kalavri, ETH Zürich) [slides, recording]

    Next-generation stream processing systems will not only be scalable and reliable, but also autonomous, flexible, and able to automatically re-configure running applications without downtime. Automatic re-configuration will allow stream processors to dynamically mitigate skew and stragglers, switch execution plans on-the-fly, vary resource allocation, and support code updates without the need to halt and restart applications.

    In this talk I will share my research group’s recent work in this area. I will present SnailTrail (NSDI’18), an online critical path analysis module that detects bottlenecks and provides insights on streaming application performance, and DS2 (OSDI’18), an automatic scaling controller for streaming systems which uses lightweight instrumentation to estimate the true processing and output rates of individual dataflow operators.

    Vasiliki (Vasia) Kalavri is a postdoctoral fellow at the Systems Group, ETH Zurich, where she is working on distributed stream processing and large-scale graph analytics. Vasia did her PhD at KTH, Stockholm, and UCLouvain, Belgium, where she was an EMJD-DC fellow. Her thesis, "Performance Optimization Techniques and Tools for Distributed Graph Processing", received the IBM Innovation Award 2017 from FNRS. During her PhD, she also spent time at DIMA TU Berlin, Telefonica Research Barcelona, and data Artisans. Vasia is a PMC member of Apache Flink and a core developer of its graph processing API, Gelly.

 

  • The convergence of HPC and BigData - What does it mean for HPC sysadmins? (Damien François, UCLouvain) [slides, recording]

    There are mainly two types of people in the scientific computing world: those who produce data and those who consume it. The former have models and generate data from those models, a process known as 'simulation'; the latter have data and infer models from that data ('analytics'). Simulation practitioners often originate from disciplines such as engineering, physics, or climatology, while analytics practitioners are most often active in remote sensing, bioinformatics, sociology, or management.

    Simulations often require large amounts of computation, so they are typically run on generic High-Performance Computing (HPC) infrastructures: clusters of powerful high-end machines linked together by high-bandwidth, low-latency networks. The cluster is often augmented with hardware accelerators (co-processors such as GPUs or FPGAs) and a large, fast parallel filesystem, all set up and tuned by systems administrators. By contrast, in analytics the focus is on the storage and access of the data, so analytics is often performed on a BigData infrastructure suited to the problem at hand. Such infrastructures offer specific data stores and are often installed in a more or less self-service way on a public or private 'Cloud', typically built on top of 'commodity' hardware.

    Those two worlds, the world of HPC and the world of BigData, are slowly but surely converging. The HPC world realises that there is more to data storage than just files and that 'self-service' ideas are tempting. In the meantime, the BigData world realises that co-processors and fast networks can really speed up analytics. Indeed, all major public Cloud services now have an HPC offering, and many academic HPC centres are starting to offer Cloud infrastructures and BigData-related tools.

    This talk will focus on the latter point of view and review the tools originating from the BigData world and the ideas from the Cloud that can be implemented in an HPC context to enlarge the offering for scientific computing in universities and research centres.

  • HPC on OpenStack (Petar Forai, IMP (Vienna)) [slides, recording]

    This talk mainly targets infrastructure and data center owners, architects and operations personnel.

    OpenStack, a free and open source software platform for infrastructure-as-a-service clouds, emerged from a collaboration between a commercial web hosting provider and a research entity, with the goal of providing programmable automation interfaces for data center resources such as compute, network and storage, as well as other higher-level services.
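
    To give a feel for what such programmable interfaces look like, here is a minimal sketch in Python using the openstacksdk library; the cloud entry, image, flavor and network names are hypothetical placeholders, and the exact lookups would of course depend on the deployment at hand.

        import openstack

        # Connect using credentials from a clouds.yaml entry (name is a placeholder).
        conn = openstack.connect(cloud='my-hpc-cloud')

        # Look up an image, a flavor and a network by name (all names are assumptions).
        image = conn.compute.find_image('centos-7')
        flavor = conn.compute.find_flavor('m1.large')
        network = conn.network.find_network('cluster-net')

        # Boot a compute instance on the chosen network and wait until it is active.
        server = conn.compute.create_server(
            name='compute-node-01',
            image_id=image.id,
            flavor_id=flavor.id,
            networks=[{'uuid': network.id}],
        )
        server = conn.compute.wait_for_server(server)
        print(server.status)

    The same kind of API calls can be scripted for networks, volumes and images, which is what makes the platform attractive as an automation layer for data center resources.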

    The project has been around for almost a decade now, and the increasing demand for automation in data centers has fueled an increase in the adoption of OpenStack as an automation and cloud platform for on-premise data centers.

    Deploying such a cloud platform for HPC environments is becoming an increasingly popular choice for HPC sites across the globe, due to the obvious advantages promised by the framework. At the same time, the convergence of cloud-style infrastructures and HPC infrastructure is an active research question.

    In this talk I'm going to present and describe CLIP (CLoud Infrastructure Project). The main goal of this project is the consolidation of multiple independent computing environments and HPC infrastructures onto one common platform suitable for a wide variety of academic computing use cases.

    The talk will introduce the more detailed goals of the project, the timeline, and efforts involved.

    Furthermore, the deployment choices and a detailed outline of the system architecture will be presented, with special emphasis placed on the following points:

    - Integration of the OpenStack platform in existing data center services (networking, storage, etc.)
    - Architecture of an 'infrastructure as code' based approach towards the automated installation of OpenStack itself (including automated validation and verification of the platform)
    - A horizontally scalable monitoring system suitable for the cloud platform itself and the payload systems running on top of it
    - Operational issues involved in developing the platform further and keeping it updated
    - Payload HPC environment architecture, management and automation.

    Petar Forai is the deputy head of IT, the shared services department for the research institutes Institute of Molecular Pathology (IMP), Institute of Molecular Biotechnology (IMBA) and the Gregor Mendel Institute (GMI) at the Vienna Biocenter.
    He works on systems engineering, IT infrastructure, networking design and architecture questions. Petar has designed and deployed several HPC environments and has been working as an IT infrastructure engineer for 10 years.

  • System Administration at Fred Hutchinson Cancer Research Center (John F. Dey, https://www.fredhutch.org) [slides, recording]

    Partly Cloudy
    Seattle is known for rain and the forecast is always Partly Cloudy. The Hutch Scientific Computing group is actively using cloud offerings from Amazon, Google and Microsoft. I will present an overview of past and future cloud projects. We manage an on-site cluster which has been expanded to both AWS EC2 and Google Compute Engine. Hutch is using AWS S3 and Azure storage. Last October we hosted a two-day conference with other life-sciences-focused institutes to share cloud experiences. In November we hosted our first joint hackathon with Amazon.

    John F. Dey is a system administrator at the Fred Hutchinson Cancer Research Center in Seattle (US). He has over thirty years of experience with Unix/Linux as a system administrator and hacker supporting scientific computing, including ten years with Merck/Rosetta supporting systems used for bioinformatics. He designed a Top500 supercomputer, ranked at position 100 in November 2016, and has worked for the last three years at Fred Hutch in Scientific Computing.

  • ReFrame: A Regression Testing and Continuous Integration Framework for HPC systems (Vasileios Karakasis, CSCS) [slides, recording]

    Regression testing of HPC systems is of crucial importance when it comes to ensuring the quality of service offered to the end users. At the same time, it poses a great challenge to systems and application engineers to continuously maintain regression tests that cover as many aspects of the user experience as possible.

    In this presentation, we present ReFrame, a new framework for writing regression tests for HPC systems. ReFrame is designed to abstract away the complexity of the interactions with the system and separate the logic of a regression test from the low-level details, which pertain to the system configuration and setup. Regression tests in ReFrame are simple Python classes that specify the basic parameters of the test plus any additional logic. The framework will load the test and send it down a well-defined pipeline which will take care of its execution. All the system interaction details, such as programming environment switching, compilation, job submission, job status query, sanity checking and performance assessment, are performed by the different pipeline stages.
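
    As an illustration of what such a test class can look like, here is a minimal sketch along the lines of ReFrame's own tutorial examples; the test name, source file and sanity pattern are placeholders, and exact attribute names may vary between ReFrame versions.

        import reframe as rfm
        import reframe.utility.sanity as sn


        @rfm.simple_test
        class HelloTest(rfm.RegressionTest):
            def __init__(self):
                self.descr = 'Minimal hello-world regression check'
                # Run on any configured system and programming environment.
                self.valid_systems = ['*']
                self.valid_prog_environs = ['*']
                # Source file to build; compilation, job submission and status
                # queries are handled by the framework's pipeline stages.
                self.sourcepath = 'hello.c'
                # Sanity check: the job output must contain this string.
                self.sanity_patterns = sn.assert_found(r'Hello, World!', self.stdout)

    Given such a class, the framework takes care of compiling the source, submitting the job and checking its output for every selected system and programming environment, as described above.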

    Thanks to its high-level abstractions and modular design, ReFrame can also serve as a tool for continuous integration (CI) of scientific software, complementary to other well-known CI solutions. Finally, we present the use cases of two large HPC centers that have adopted or are now adopting ReFrame for regression testing of their computing facilities.