Skills and experience we are looking for in candidates

INTJ is looking for the following skills, knowledge and experience in candidates we would like to contract with:

Operating Systems: Linux

Activities: Installing Linux operating systems, adding users and groups, configuring yum repositories, installing and configuring multimedia packages on CentOS, partitioning hard drives, configuring RAID arrays, adding storage and other desktop Linux support activities.

Programming & Scripting Languages: C, C++, OpenMPI, Bash, Csh and Bourne shell scripts

Activities: Writing shell scripts to support HPC activities, PBS queue monitoring scripts, and operating system and application monitoring scripts in order to extend and monitor open source vendor applications. Writing scripts to work around bugs in open source vendor-supplied applications. OpenMPI programming in support of scientific job scripts, particularly OpenFOAM and ANSYS Fluent job scripts.
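As an illustration of the kind of queue monitoring script described above, here is a minimal sketch in Python (rather than shell). It assumes the default column layout of the `qstat` job listing, with the state letter in the second-to-last column. The sample text, the function names and the alert threshold are all hypothetical and chosen only for the example.

```python
from collections import Counter

# Illustrative sample only: the column layout assumed here follows the
# default `qstat` listing (Job ID, Name, User, Time Use, S, Queue).
SAMPLE_QSTAT = """\
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
101.head                  foam_case1       alice           01:12:40 R batch
102.head                  fluent_run       bob             00:00:00 Q batch
103.head                  foam_case2       alice           00:00:00 Q batch
104.head                  mesh_gen         carol           00:03:10 R short
"""


def count_job_states(qstat_text):
    """Count jobs per state letter (e.g. R = running, Q = queued)."""
    counts = Counter()
    for line in qstat_text.splitlines():
        fields = line.split()
        # Skip blanks, the header row and the dashed separator row.
        if len(fields) < 6 or fields[0] == "Job" or set(fields[0]) == {"-"}:
            continue
        counts[fields[-2]] += 1  # state is the second-to-last column
    return counts


def queued_alert(counts, max_queued):
    """Return a warning string when too many jobs are waiting, else None."""
    queued = counts.get("Q", 0)
    if queued > max_queued:
        return f"WARNING: {queued} jobs queued (limit {max_queued})"
    return None


if __name__ == "__main__":
    states = count_job_states(SAMPLE_QSTAT)
    print(dict(states))
    print(queued_alert(states, max_queued=1))
```

In practice the text would come from running `qstat` on the cluster (for example via Python's `subprocess` module) rather than from a hard-coded sample, and the parsing may need adjusting for a site's scheduler version and output options.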

Programming Support: Git and version control

Activities: Administer and maintain Git, Subversion, Artifactory and other code repositories to ensure effective management of the HPC code environment.

Scientific Application Support: FLUENT, OpenFOAM, Pointwise, Tecplot, FieldView

Activities: Downloading, installing, configuring, compiling and licensing scientific applications. Updating scientific applications on lab Linux PCs and on the HPC global file system.

HPC Application Support: Torque, PBS, Maui, Moab, Slurm

Activities: Administer and support queueing systems including Torque, PBS, Maui, Moab and Slurm.

Licensing: FlexNet Licence Manager

Activities: Installing the FlexNet licence manager, installing and updating licence files for OpenFOAM, Pointwise, Tecplot and FieldView, and updating the licence manager software as new versions are released.

HPC Support: Cray and HPE Supercomputers

Activities: Log system support calls with Cray and HP Enterprise, package up and return faulty compute nodes to Cray and HPE, liaise with Cray and HPE support technicians and HPC management, monitor compute nodes, reboot compute nodes as required, reboot the entire cluster as required, start up and shut down the cluster as scheduled, monitor scientists' HPC jobs, investigate job logs to determine the causes of job crashes and other faults, and administer and monitor the cluster using Cray ACE Management software.

Networks: InfiniBand

Activities: Monitoring the HPC InfiniBand network, bringing networking up and down as required, investigating networking faults that cause jobs to fail, replacing faulty InfiniBand cables, and installing and configuring Mellanox switches as required.

Storage: Dell

Activities: Monitoring storage space on the cluster storage array, licence server and individual lab Linux PCs, partitioning hard drives of lab PCs, configuring RAID arrays, adding storage, and updating firmware on storage array hardware. Support of parallel file systems (Lustre, plus some GPFS / Spectrum file system testing for research activities).
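The storage space monitoring mentioned above can be sketched in a few lines of Python using the standard library's `shutil.disk_usage`. This is only a minimal sketch: the function names, the 90% threshold and the list of mount points are assumptions made for illustration, not part of any particular site's tooling.

```python
import shutil


def percent_used(total_bytes, used_bytes):
    """Return used space as a percentage of total, rounded to one decimal."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return round(100.0 * used_bytes / total_bytes, 1)


def check_mounts(mount_points, threshold_pct=90.0):
    """Return (mount, pct) pairs for mounts at or above the threshold."""
    over = []
    for mount in mount_points:
        usage = shutil.disk_usage(mount)  # named tuple: total, used, free
        pct = percent_used(usage.total, usage.used)
        if pct >= threshold_pct:
            over.append((mount, pct))
    return over


if __name__ == "__main__":
    # "/" is used so the sketch runs anywhere; a real list might include
    # the cluster scratch or home file systems (hypothetical paths).
    for mount, pct in check_mounts(["/"], threshold_pct=0.0):
        print(f"{mount}: {pct}% used")
```

A script like this could be run on a schedule, with the returned list fed into whatever alerting the site already uses.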

HPC General: Support

Activities: Responding to scientists' requests for HPC system, operating system and scientific application support in a timely and effective manner, attending HPC-related events, conferences and get-togethers both domestically and internationally, effective and detailed HPC change management using HP Service Desk, and writing effective and clearly understood documentation of HPC system administration and support.

 

 

What are the skills that are useful for a HPC System Administrator to have?

 

Speech Transcript:

Clarke Towson

Tuesday, April 19th 2016

Hello Everyone,

I would like to talk to you today about the kinds of skills and experience that, generally speaking, are valuable to Australian employers looking to contract or employ full-time High Performance Computing system administrators, with the high-level purpose of generating wealth and making scientific discoveries through the use of HPC systems for our great country of Australia.

First and foremost, the most useful base skill, I believe, is knowledge of and hands-on technical skill with Linux. Linux is the cluster operating system of choice, thanks to its scalability and performance and the wide variety of open source software and development tools available for it. The activities that a HPC system administrator will generally perform include installing Linux on laboratory PCs, adding users and groups, configuring yum repositories, installing and configuring multimedia packages (particularly on Linux flavours such as CentOS that don't have great multimedia support out of the box), partitioning hard drives, configuring RAID arrays, adding storage and other desktop-related Linux support activities.

A theoretical understanding of, and current hands-on technical skills in, computer programming are very useful for HPC system administrators to possess. In terms of programming and scripting languages, a good overall knowledge of C, C++, OpenMPI, Bash, Csh and Bourne shell scripts is very useful. Day-to-day activities generally come down to writing shell scripts to support HPC tasks, creating PBS (Torque Resource Manager) queue monitoring and maintenance scripts, and writing operating system and application monitoring scripts in order to extend and monitor open source vendor applications. Other important tasks include writing scripts to work around bugs in open source vendor-supplied applications. Working at the cutting edge of technology means you will be finding bugs and submitting bug reports back to vendors. OpenMPI programming in support of scientific job scripts, particularly OpenFOAM and ANSYS Fluent job scripts, is also important.

At the cutting edge of HPC there are many things that can go wrong with jobs: operating system issues, application errors, crashes and the like. Computer programmers who have the ability to work through frustration and come up with innovative solutions to complex problems will be valued highly. Although Linux is a widely praised operating system, renowned and lauded (deservedly, of course) for its stability, the probability of system crashes and, more commonly, application crashes can be higher on a HPC system. Users will thrash the system and try to get as much performance from the compute nodes as possible. Hardware issues can quickly become apparent and cause jobs to fail, so the ability to investigate system and application crashes and implement workarounds or solutions, which may sometimes involve hardware removal or replacement, will be required. A very good programmer will generally also be a very good bug catcher!

Programming support activities are essential when working in HPC. A good working knowledge of, and practical hands-on skills in, version control using Git is very useful. Activities can include administering and maintaining Git, Subversion, Artifactory and other code repositories to ensure effective management of the HPC code environment.

A HPC system administrator can really set themselves apart if they have a working knowledge of, or experience in using, scientific applications, in particular applications like ANSYS Fluent. Other important applications are OpenFOAM, Pointwise, Tecplot and FieldView, which are used in the field of Computational Fluid Dynamics. Some of the day-to-day application-related activities can include downloading, installing, configuring, compiling and licensing these scientific applications. There are some very time-consuming activities which centre around application licensing. The FlexNet Licence Manager is used by many scientific applications, and common activities include updating this licence manager and installing licence files for multiple versions of OpenFOAM, Pointwise, Tecplot and FieldView, keeping each version up to date with the appropriate licence file. In general, the users of HPC systems are scientific researchers, engineers, academic institutions and government agencies, and they are all looking to run advanced application programs like the ones I have mentioned efficiently, reliably and quickly in a parallel fashion. Manufacturers are also one of the biggest markets for HPC. Leading automotive, aerospace and heavy equipment manufacturers are users of HPC systems. There are two broad application categories: structural analysis and fluid dynamics analysis. The top 5 HPC applications according to top500.org are LS-DYNA, Abaqus, ANSYS Fluent, STAR-CCM+ and OpenFOAM.

There are some very important system applications used in HPC that a system administrator really needs to have a good working and theoretical understanding of. Adaptive Computing is the creator of the Torque Resource Manager (PBS) as well as Maui and Moab, the job queueing systems used on many HPC systems. Slurm is another popular queueing system which is becoming the de facto standard, particularly on Cray supercomputers. When queueing system problems begin, jobs start to fail, and this can cause frustrated users and lost productivity. It's still early days for HPC, and as time goes on queueing systems will be improved and made more reliable.

In terms of general HPC support of Cray and HP Enterprise supercomputers, activities can include logging system support calls with Cray and HP, packaging up and returning faulty compute nodes to Cray and HP, liaising with Cray and HP support technicians and company management on pressing technical issues, monitoring compute nodes, rebooting compute nodes as required, rebooting the entire cluster as required, starting up and shutting down the cluster at scheduled times, monitoring scientists' HPC jobs, investigating job logs to determine the causes of job crashes and other faults, and administering and monitoring the cluster using Cray ACE Management software. It's useful to become highly efficient at issuing ACE Management commands from the command line, but the GUI is also useful at times.

In terms of networking knowledge, monitoring HPC InfiniBand networks is an important day-to-day activity. Other activities include bringing the InfiniBand networking up and down as required, investigating networking faults that cause scientists' jobs to fail, replacing faulty InfiniBand network cables, and installing and configuring Mellanox switches as required.

HPC systems come with storage, and storage is often a bottleneck. Generally, system administrators will spend time monitoring storage space on the cluster storage array, the licence server and individual lab Linux PCs. Additional work will include partitioning hard drives of lab PCs, configuring RAID arrays, adding storage and updating firmware on storage array hardware when vendors release new firmware versions. Good file system knowledge is important for the support of parallel file systems (particularly Lustre, plus some GPFS / Spectrum file system testing for research activities).

The rest of the time (if there is any, which is rare) system administrators will concentrate on responding to scientists' requests for HPC system, operating system and scientific application support in a timely and effective manner, and on attending HPC-related events, conferences and get-togethers, both domestically and, increasingly, internationally. A very important task that is often not done very well is effective and detailed HPC change management. A comprehensive ITIL-based change management structure surrounding HPC system support is important, especially given that whatever can go wrong generally will go wrong on large HPC systems.

Good system administrators are in the habit of creating clear and concise documentation. This is particularly important for HPC system administrators because of the complexity of these systems and the need to impart knowledge to other staff in this very hard-to-recruit area of computing. Growing the skills of existing staff is often seen as the only way to ensure knowledge and skills are retained in an organisation.

Anyway, that's it from me for today. Talking any more would waste valuable seconds, and time, of course, is crucial when working in the High Performance Computing field.

I hope that you enjoyed my discussion on the skills that are useful for system administrators in the High Performance Computing area.

Until next time – all the best!

I'm Clarke Towson

The views and opinions expressed are the views and opinions of Clarke Towson and INTJ Billing Pty Ltd

If you are considering work as a HPC contractor and you are interested in contracting with INTJ, please contact Clarke. The INTJ Independent Contractor Agreement is available to download as a PDF.