Scientific computing — lessons learned the hard way

Aliaksei Mikhailiuk
Towards Data Science
12 min read · Apr 15, 2021

A deep dive into scientific computing, from CPUs, GPUs and HPC clusters to the most widely used frameworks: OpenMP, MPI and CUDA.

Image by Author

I used to love videos of car crash tests. They are especially good in slow motion: the car speeds up and, just before it hits the barrier, the slow-mo starts. Glass is everywhere, doors fly wide open and, finally, an explosion! Although spectacular to watch and very important for road safety, these tests are also very expensive. It is simply impossible to physically test every possible collision scenario and verify that the car is safe to drive.

At the same time, collision simulations are cheap and can provide early insight into possible safety problems. In these simulations the motion, trajectory and deformation of each part of the car are computed and presented to the user, along with statistics on the pressure exerted by each component, the temperature change and so on. The benefits of simulation are obvious! Not so obvious, however, are the ways of making the simulation precise and fast. Just imagine the amount of computation needed to simulate the behaviour of a few million particles for a single second of a collision!

And here we can get some help from scientific computing, the field that develops computational techniques and ways of efficiently using computational hardware to solve big problems. It might be surprising, but many of our daily activities are affected by scientific computing. Consider, for example, airflow simulations for airplanes, rockets and cars, blood-flow simulations for medical applications such as forecasting how the body will behave during complex surgery, weather forecasting, explosion simulations for demolishing buildings and many, many more!

Much of the work in scientific computing is focused on parallelising and distributing the computations across multiple independent machines.

I had not imagined how easy it would be to write my first scientific computing program: a simple OpenMP-based parallelisation of a for loop (I will talk about the details of OpenMP later in this article). Although I got a significant speed-up, it was nowhere near the full 8x I would expect on my eight-core machine. It turned out that getting to the edge of the hardware's performance is more difficult than I first thought. After two years of working on distributed and parallel code I have only touched the tip of a giant iceberg.

Looking back to when I was starting, I can now identify the areas that were key to getting up to speed with writing fast parallel and distributed code. The first is knowing how to write efficient serial code, which requires understanding hardware architectures and the memory hierarchy. Knowing these details helps to find better ways to, for example, store data so that it can be accessed quickly. The second is knowing the parallelisation frameworks: their strengths, the problems they suit best and their performance bottlenecks.

In this article I will therefore cover the essentials for getting started with scientific computing, a summary I wish I had had when I was starting out: hardware architectures, the memory hierarchy, ways of organising processors into clusters and parallelisation frameworks.

Hardware Architectures

To make full use of the available machinery it helps to know how the inside works! There are two key components: the processing unit, which performs the computations, and the memory, which stores the data, intermediate results and final results. Separate processing units can be organised into clusters.

Central Processing Unit (CPU)

A simple CPU consists of a single core and global memory. A core is a hardware entity which, depending on the architecture, includes a number of arithmetic logic units (ALUs) and local memory. An ALU is responsible for computations such as addition, subtraction, multiplication, division and bitwise logical operations. Depending on the architecture, the functions performed by an ALU can be more or less complex; a complex ALU can, for example, perform an addition and a division in a single step. The speed of a CPU is defined in terms of clock cycles. The duration of a clock cycle is controlled by an oscillator, which sends electric pulses to ensure synchronous operation. The shorter the clock cycle, the higher the frequency of the oscillator and the faster the execution. CPUs with more than one core are called multi-core CPUs.

Simplified architecture of a multi-core CPU. Image by Author.

Graphics Processing Unit (GPU)

GPUs were initially aimed at graphics computations, but are now increasingly popular for general-purpose computations; GPUs used for General Purpose (GP) applications are often referred to as GP-GPUs. Unlike multi-core CPUs, which rarely have more than a few tens of cores, GPUs consist of hundreds of cores. GPU cores have a more limited instruction set, a lower clock frequency and less memory than CPU cores. GPUs are not standalone and need an associated CPU to operate. Modern architectures allow a single GPU to be shared by multiple CPUs, as well as a single CPU to drive multiple dependent GPUs.

GPU memory model. Image by Author.

The CPU and its memory are called the host, and the GPU with its associated memory the device. The host can both launch functions, called kernels, on the device and manage the device’s memory. Data transfer between the host and the device is very expensive and should not happen too often.

GPU programming generally follows the Single Instruction Multiple Data (SIMD) model from Flynn’s taxonomy. GPUs are therefore very powerful when each core is given the same simple task to perform, while branching or more complex operations can drastically slow down processing.

Memory Hierarchy

Both the GPU and the CPU have the same memory hierarchy: faster but smaller memory is placed close to the cores, whereas larger but slower memory sits further away from them.

Registers are the fastest and smallest memory, followed by the cache. Registers are used to store values immediately required for computations. Cache memory comes in three levels, L1, L2 and L3, with L1 the fastest and smallest and L3 the slowest and largest. The cache is used to store values that have been used recently or may shortly be needed. The slowest but largest memory is Random-Access Memory (RAM), where most of the data is stored.

Typically every core has its own cache and shares a common RAM. Data is transferred from RAM to the cache in blocks called cache lines, which are groups of adjacent RAM cells; every core thus holds a local copy of part of the data in RAM. When a core needs new data, the cache is checked first. If the data is not there, a cache miss occurs and the core requests the data from RAM, which is far more expensive. A programmer should therefore ensure that immediately required or frequently used data is in the cache.
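As a minimal illustration (not from the original article; the matrix size is arbitrary), the two loops below sum the same matrix. The row-by-row version walks through consecutive memory addresses and reuses every loaded cache line, while the column-by-column version jumps N elements at a time and triggers far more cache misses, so it usually runs several times slower.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 2048   /* arbitrary matrix size */

    /* Row-major traversal: consecutive elements of a row are adjacent in
       memory, so each loaded cache line is fully used. */
    double sum_row_major(const double *a) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i * N + j];
        return s;
    }

    /* Column-major traversal: successive accesses are N elements apart,
       so almost every access touches a different cache line. */
    double sum_col_major(const double *a) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i * N + j];
        return s;
    }

    int main(void) {
        double *a = calloc((size_t)N * N, sizeof(double));
        printf("%f %f\n", sum_row_major(a), sum_col_major(a));
        free(a);
        return 0;
    }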

Processor clusters

High-performance systems consist of collections of multiple processors, which can be organised into different computer architectures. There are four basic types: Symmetric Multiprocessing (SMP), Non-Uniform Memory Access (NUMA), distributed and hybrid systems.

Computer Architectures. Image by Author.

In SMP, every CPU has its own private cache, but all processors share one large RAM. In NUMA, all CPUs also have access to a shared memory; unlike SMP, however, the memory is split between the CPUs and physically placed closer to particular units. A CPU therefore accesses its own nearby memory quickly, but takes longer to access memory attached to another processing unit. Both SMP and NUMA are shared-memory systems. Distributed systems, on the other hand, give every CPU its own memory accessible only by it; communication and data sharing happen via messages. For large problems where the computations performed by separate units overlap, communication can therefore become a bottleneck. Hybrid systems are a mixture of distributed and shared-memory systems.

Parallelisation frameworks

Knowing the basics of the hardware, we can now exploit parallelisation frameworks. Below I talk about the three most commonly used ones: OpenMP for shared-memory parallel computing, MPI for distributed computing and CUDA for general-purpose use of GPUs.

OpenMP

Applications executed on shared-memory architectures can be parallelised with the OpenMP API. OpenMP is a collection of library routines, compiler directives and environment variables for parallelism in C, C++ and Fortran. OpenMP is portable and has the advantage of preserving correctness when its compiler directives are ignored: a program parallelised with OpenMP will still execute correctly as a sequential program.

OpenMP execution model resides on three pillars: shared memory, synchronisation and workload distribution.

General structure
Parallel entities are called threads. A thread is an execution entity with a stack and associated static memory. OpenMP uses the fork-join model of parallel execution. Several threads are forked before the parallel region and joined after. Work in the parallel region is split between the threads.

Forking and joining of threads in OpenMP API. Image by Author.

Program execution starts with a single “master” thread. The master thread launches (forks) a team of parallel “worker” threads at the start of the parallel region. When the threads reach the end of the parallel construct, they synchronise and terminate (join), leaving only the master thread.

One disadvantage of this model is that a significant amount of time is spent on forking, joining and synchronisation. Thus, where possible, the parallel region should be kept open and any redundant synchronisation avoided.
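A minimal OpenMP sketch of the fork-join model (the array size is arbitrary): a team of worker threads is forked at the parallel-for directive, the loop iterations are split between them, and the threads join at the end of the loop. Compiled without OpenMP support, the pragma is simply ignored and the code runs sequentially.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        static double a[N];

        /* Fork a team of threads; the loop iterations are split between
           them and the threads join at the end of the loop. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;

        printf("a[10] = %f, threads available: %d\n",
               a[10], omp_get_max_threads());
        return 0;
    }

With GCC this compiles with the -fopenmp flag (e.g. gcc -fopenmp example.c).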

Memory model
OpenMP is aimed at shared-memory parallel applications and distinguishes between shared and private memory. The former can be accessed and modified by any thread, while the latter is maintained by each thread separately. Threads can therefore communicate through shared variables. Two requirements must be met: cache coherency and memory consistency.

Cache coherency means that all processors see the same data at a particular memory address. One major problem that arises here is false sharing: individual memory cells in the same cache line are simultaneously updated by different processors and, even though these updates are logically independent of each other, the entire cache line is invalidated each time.
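A minimal sketch of how false sharing can be avoided (the cache-line size and thread limit are assumptions): if each thread incremented an element of a plain long array, several counters would share one cache line and every increment would invalidate that line for the other threads; padding each counter to its own cache line removes the problem.

    #include <omp.h>
    #include <stdio.h>

    #define MAX_THREADS 64   /* assumed upper bound on the thread count */
    #define CACHE_LINE 64    /* assumed cache-line size in bytes */

    /* Padding each counter to a full cache line keeps the threads'
       counters in separate cache lines and avoids false sharing. */
    struct padded { long value; char pad[CACHE_LINE - sizeof(long)]; };

    int main(void) {
        struct padded counters[MAX_THREADS] = {{0}};

        #pragma omp parallel
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < 10000000; i++)
                counters[id].value++;   /* each thread touches its own line */
        }

        printf("thread 0 counted %ld\n", counters[0].value);
        return 0;
    }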

Memory consistency in a multithreaded application means an ordered access and update of the same memory location by different threads. Memory inconsistency may lead to race conditions, in which several threads try to write to the same memory location simultaneously.

OpenMP memory model. Image by Author.

Workload distribution
OpenMP provides several ways of distributing the work of for loops between threads, controlled by the schedule clauses “static”, “dynamic” and “guided”. With “static”, loop iterations are divided equally among all threads. With “dynamic”, the workload is distributed at runtime: a thread is given the next chunk of iterations as soon as it finishes its previous one. Similarly to “dynamic”, “guided” allocates chunks of contiguous iterations to threads dynamically; unlike “dynamic”, however, the size of the allocated chunks decreases exponentially with each successive allocation.
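A minimal sketch of the schedule clause (the chunk size of 4 and the workload are arbitrary): the same loop can be rescheduled simply by changing the clause, e.g. to schedule(static) or schedule(guided).

    #include <omp.h>
    #include <math.h>
    #include <stdio.h>

    #define N 100000

    int main(void) {
        static double a[N];

        /* Iterations are handed out in chunks of 4 at runtime, so threads
           that finish early simply grab the next available chunk. */
        #pragma omp parallel for schedule(dynamic, 4)
        for (int i = 0; i < N; i++)
            a[i] = sin(i) * cos(i);   /* per-iteration cost may vary */

        printf("a[N-1] = %f\n", a[N - 1]);
        return 0;
    }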

Synchronisation
Threads perform computation asynchronously unless specified otherwise. For consistency, OpenMP provides a range of synchronisation directives, which can be used to avoid race conditions. The main ones are “master”, “critical”, “atomic”, “barrier” and “nowait”. When “master” is encountered, the enclosed region is executed only by the master thread. Code enclosed in a “critical” section is executed by one thread at a time. Similarly to “critical”, “atomic” specifies a single statement that is updated atomically, by one thread at a time. When “barrier” is encountered, each thread waits until all of them have reached it, and only then does execution continue. With “nowait”, threads are told not to synchronise at points of implicit synchronisation, e.g. at the end of a for loop.
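A minimal sketch of the synchronisation directives: every thread atomically adds its id to a shared sum, all threads wait at a barrier, and then each thread prints the result one at a time inside a critical section.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        long sum = 0;

        #pragma omp parallel
        {
            int id = omp_get_thread_num();

            /* "atomic": the update of the shared variable happens one
               thread at a time, avoiding a race condition. */
            #pragma omp atomic
            sum += id;

            /* "barrier": all threads wait here before continuing. */
            #pragma omp barrier

            /* "critical": only one thread at a time executes this block. */
            #pragma omp critical
            printf("thread %d sees sum = %ld\n", id, sum);
        }
        return 0;
    }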

Message Passing Interface (MPI)

As the name suggests, MPI is based on the exchange of messages. The basic parallel entity is a process. A group of processes capable of exchanging messages with each other is called a communicator. Every process within a communicator has a unique rank, an integer id assigned to it at creation. Processes communicate explicitly with each other by referring to these ranks. Explicit communication is necessary because MPI targets distributed-memory systems.
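A minimal MPI sketch: each process queries its rank and the size of the default communicator MPI_COMM_WORLD. Compiled with mpicc and launched with, e.g., mpirun -np 4, it prints one line per process.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */

        printf("Hello from process %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }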

Communication
MPI supports message exchange both between two individual processes and among multiple processes simultaneously.

Point-to-point communication: in point-to-point communication one process performs a send operation and another a receive operation, either synchronously or asynchronously. A synchronous send completes only after an acknowledgement that the receiving process has safely received the message. Such communication is called blocking: the process halts until the operation has successfully completed. In asynchronous, non-blocking communication, the sending process does not wait for an acknowledgement and simply continues after notifying the MPI library that data needs to be sent. The receiving process likewise does not wait for the data, but is interrupted by the system when the data is ready. Because the MPI library performs the operation on the processes' behalf, the user cannot predict exactly when the data becomes available in the receiving process.
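A minimal sketch of blocking point-to-point communication (run with at least two processes): rank 0 sends one integer to rank 1 with MPI_Send, and rank 1 receives it with MPI_Recv; both calls block until they complete.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int value = 42;
            /* Blocking send: returns once the buffer can safely be reused. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int value;
            /* Blocking receive: waits until the message has arrived. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }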

Collective communication: MPI provides a framework for managing multiple processes within the same communicator. The framework provides synchronisation points, data movement and collective computation; examples include broadcast, scatter, gather and reduction.

Collective communication types. Image by Author.
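A minimal sketch of collective operations: the root process broadcasts a value to every process with MPI_Bcast, each process computes its own partial result, and MPI_Reduce sums the partial results back at the root.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int base = (rank == 0) ? 10 : 0;
        /* Broadcast: afterwards every process holds the root's value. */
        MPI_Bcast(&base, 1, MPI_INT, 0, MPI_COMM_WORLD);

        int partial = base + rank;   /* each process contributes a term */
        int total = 0;
        /* Reduction: sum the partial values of all processes at rank 0. */
        MPI_Reduce(&partial, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("total = %d\n", total);

        MPI_Finalize();
        return 0;
    }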

Two notions associated with MPI are bandwidth and latency. Since messages are exchanged over a physical channel, i.e. the network, the speed of communication is finite. Bandwidth imposes a physical limit on how fast data can be moved. Intuitively, small messages take less time to transfer than large ones; however, even a very small message takes time to be processed and moved. Latency is a lower bound on the transmission time of any message. Several computer-networking standards have been proposed in recent years; the most commonly used in high-performance computing is InfiniBand (IB), which offers both high bandwidth and low latency.
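These two quantities can be estimated with a simple ping-pong test between two ranks (a sketch; the message size and repetition count are arbitrary): a buffer bounces back and forth, and the average round-trip time gives a rough bandwidth estimate, while repeating the test with very small messages would instead expose the latency floor.

    #include <mpi.h>
    #include <stdio.h>

    #define BYTES 1048576   /* 1 MiB message */
    #define REPS  100

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        static char buf[BYTES];
        double start = MPI_Wtime();

        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        if (rank == 0) {
            double avg = (MPI_Wtime() - start) / REPS;   /* round trip */
            printf("avg round trip %f s, ~%.1f MB/s one way\n",
                   avg, 2.0 * BYTES / avg / 1e6);
        }

        MPI_Finalize();
        return 0;
    }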

CUDA

CUDA is an Application Programming Interface (API) developed by NVIDIA to give access to GP-GPUs. CUDA provides direct access to the GPU's virtual instruction set and compute cores. GPUs are not standalone devices and must have an associated CPU, so CUDA follows a heterogeneous programming model. CUDA relies on the notions of threads, blocks and grids.

A thread is a single executing instance of a kernel. Every thread executes the same code concurrently but on different data and has a small but very fast private memory in the form of registers.

Threads are organised into blocks. Every block has associated shared memory, accessible only by the threads within that block. Threads can communicate within the same block, but not with threads from other blocks. To identify the data they work on and to communicate, threads have a unique id made up of a thread index and a block index. Physically, a block is mapped to a single streaming multiprocessor (SM) of the GPU, which is a collection of GPU cores, often referred to as Stream Processors (SPs). The number of SPs within an SM varies between processors. If memory is sufficient, multiple blocks can be mapped to the same SM.

Grids are collections of blocks. A grid has an associated global memory, which is large but very slow and is accessible by every thread. The GPU distinguishes between global memory and constant memory: both are accessible by every thread, but the latter is read-only.
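A minimal CUDA C sketch that puts these pieces together (the array size and block size are arbitrary): the host allocates device memory, copies data to the device, launches a kernel on a grid of blocks, and copies the result back. Each thread derives the element it processes from its block and thread indices.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Kernel: each thread computes one element, identified by its block
    // and thread indices.
    __global__ void add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes),
              *hc = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) { ha[i] = (float)i; hb[i] = 2.0f * i; }

        float *da, *db, *dc;
        cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);

        // Host-to-device transfers are expensive: do them once, not per kernel.
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        int block = 256;                     // threads per block
        int grid = (n + block - 1) / block;  // blocks per grid
        add<<<grid, block>>>(da, db, dc, n);

        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
        printf("c[100] = %f\n", hc[100]);

        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(ha); free(hb); free(hc);
        return 0;
    }

This compiles with nvcc; in real code the return status of each CUDA call would also be checked.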

Summary

Parallelisation can be a powerful tool if used properly; however, the development and debugging time can often outweigh the computational gains.

When parallelising code it is very important to keep the problem being solved in mind, for example whether precision matters or whether the structure of the code can be changed. Very often it is more important to speed up the serial code first.

To write good parallel and distributed code, knowledge of the frameworks alone is not enough. The highest gains are achieved when the code is optimised for a specific hardware architecture.

If you liked this article, share it with a friend! To read more on machine learning, data science, computer vision and image processing, press subscribe!

Have I missed anything? Do not hesitate to leave a note, comment or message me directly!
