## Parallel wave modelling on a 8-GPU monster machine

A few years ago we had the rare opportunity to program on a 8-GPU monster machine provided by NVIDIA. That was an experience! The goal was to parallelise a 3D wave modelling tool in the frequency domain (Helmholtz equation) on a machine with multiple GPUs. The motivation was to run a problem of a realistic size that are usually too large to fit in the memory of one GPU. Actually, we are facing the same issue as 20 years ago with CPUs.

###### The monster machine

For our numerical experiments NVIDIA provided a Westmere based 12-cores machine connected to 8 GPUs Tesla 2050 as shown on the figure above. The 12-core machine has 48 GB of RAM. Each socket has 6 CPU cores Intel(R) Xeon(R) CPU X5670 @ 2.93GHz and is connected trough 2 PCI- buses to 4 graphics cards. Note that two GPUs are sharing one PCI-bus connected to a socket. Each GPU consist of 448 cores with a clock rate of 1.5 GHz and has 3 GB of memory.

###### Multi-GPU approach

There are several approaches to deal with multi-GPU:

1. The data-parallel approach, where all matrix-vector and vector-vector operations are split between multiple GPUs. The advantage of this approach is that it is relatively easy to implement. However, matrix- vector multiplication requires exchange of the data between different GPUs, that can lead to significant data transfer times if the computational part is small.
2. Split of the algorithm, where different parts of algorithm are executed on different devices. For instance, the solver is executed on one GPU and the preconditioner on another one. In this way the communication between GPUs will be minimized. However this approach requires individual solution for each algorithm.
3. Domain-Decomposition approach, where the original continuous or discrete problem is decomposed into parts which are executed on different GPUs and the overlapping information (halos) is exchanged by data transfer. For our wave simulation in frequency domain this approach can however have difficulties with convergence for higher frequencies (see Ernst, Gander).

We have chosen the data-parallel approach and split of the algorithms and made a comparison between multi-core and multi- GPUs. We leave out the domain decomposition approach because the convergence of the Helmholtz solver is not guaranteed.

###### Computations on Multi-GPU

You can do computations on multi-GPU by either:

• pushing different context to different GPUs, or
• creating multiple threads on CPU, where each of them will communicate with one GPU.

For our purposes we have chosen the second option, since it was easier to understand and implement.

###### Issues

Implementation on multi-GPUs requires careful consideration of possibilities and optimization options. The issues we encountered during our work are listed below:

•  Multi-threading implementation, where the life of a thread should be as long as the application. This is crucial for the multi-threading way of implementation on multi-GPU. Note that in case of pushing contexts this is not an issue.
• Because of limited GPU memory size, large problems need multiple GPUs.
• Efficient memory reusage to avoid allocation/deallocation. Due to memory limitations the memory should be reused as much as possible, especially in the multigrid method.
In our work we create a pool of vectors on the GPU and reuse them during the whole solution time.
• Limit communications CPUGPU and GPUCPU.
• Back then, it was beneficial to use texture memory when possible, but it was not easy as each GPU needs its own texture reference.
• Coalescing is difficult since each matrix row has a different number of elements.
###### Helmholtz solver

The Helmholtz equation represents the time-harmonic wave propagation in the frequency domain and has applications in many fields of science and technology, e.g. in aeronautics, marine technology, geophysics, and optical problems. In particular we consider the Helmholtz equation discretized by a second order finite difference scheme.

As a solver for the Helmholtz equation (wave equation in frequency domain) we use Bi-CGSTAB with a shifted Laplacian multigrid preconditioner.

Since the Bi-CGSTAB is a collection of vector additions, dot products and matrix-vector multiplications, the multi-GPU version of the Bi-CGSTAB is straight forward. We have seen that the speedup on multi-GPUs is smaller than on single GPU due to the data transfer between CPU and GPU. However it is possible to compute a problem of a realistic size on multi-GPUs and the computation on multi-GPU is still many times faster than 12-core Westmere.

The shifted Laplace preconditioner consists of a coarse grid correction based on the Galerkin method with matrix-dependent prolongation and Gauss-Seidel as a smoother. The implementation of coarse grid correction on multi-GPU is straight forward, since the main ingredient of the coarse grid correction is matrix-vector multiplication. The coarse grid matrices are constructed on CPU and then transferred to the GPUs.

The Gauss-Seidel smoother on multi-GPU requires adaptation of the algorithm. We use eight color Gauss-Seidel, since the Helmholtz equation is given in three dimensions and computations at each discretization point should be done independent on the neighbours to allow parallelism.

###### Conclusions

We implemented the three-dimensional Helmholtz wave equation on a multi-GPU. To keep the double precision convergence the solver (Bi-CGSTAB method) is implemented on GPU in double precision and the preconditioner in single precision.

Two multi-GPU approaches have been considered: data parallel approach and a split of the algorithm.

For the data parallel approach, we were able to solve larger problems than on one GPU
and get a better performance than multi-threaded CPU implementation. However due to
the communication between GPUs and a CPU the resulting speedups have been considerably smaller
compared to the single-GPU implementation.

To minimize the communication but still be able to solve large problems we have introduced split of the  algorithm. In this case the speedup on multi-GPUs is similar to the single GPU compared to the multi-core implementation.

See full comparison of the multi-GPU implementation to a single-GPU and a multi-threaded CPU implementation on a realistic problem size here or request a copy by filling in the contact form or by sending us an email.

We would like to thank again NVIDIA Corporation (in particular François Courteille) for access to the their many-core-multi-GPU architecture.

This work has been presented during the following conferences:

• GTC 2012, San Jose, US
• ENUMATH 2011, Leicester, UK

## Epidemic Simulation: Will You Survive a Virus Outbreak in the Netherlands?

Imagine a virus outbreak starting in the Hague (Netherlands) where you might live. It might be a seasonal flu, it might be the Zika virus, it mights be any other virus. How do you predict whether this virus outbreak will start an epidemy covering your whole country or it will be localised? The answer is to run an epidemic simulation of the virus outbreak and play with different scenarios.

###### Where it all started

It started at the gym when talking to Rory de Vries who is a Taekwondo teacher and has a PhD in virology. Rory is working at the Erasmus MC doing research on vaccins for the influenza virus. With Rory’s feedback we thought of a simulation model that should be more realistic than existing models.

###### OpenStreetMap

The first thing we did is to download the data from OpenStreetMap. Basically it is a list of nodes, ways and relations. We have downloaded the map of the Netherlands (which is about 1 GB, for the whole world is about 30 GB). Then the list of nodes was decimated to get about 15 million nodes (about the same number as the population in the Netherlands). The assumption is that the node density of the map is the same as the population density, meaning that areas with high population will also have high node density.

###### Epidemic model

We start with the discretization of the map. For this we define mobility as the average distance that a person travels per day. This is one of the parameters that you can easily modify in the user interface. As an example and default value we took 10 km. The mobility defines the size of the discretization cell: the larger mobility the larger the cell size. The cell size is calculated as a product of the mobility and the time step, which is set by the user.

The next step is to define a contact rate, we denote is C, which means how many contacts on average has a person during infectious time. It is assumed to be constant over the population.

Another parameter is the probability (P) that this person will infect its contacts. For example, in case of the influenza 1 person infects on average 3 other persons. Here we assume a uniform probability function for the population.

Now that we have defined the global parameters for the epidemic simulation, let us have a look at the infection stages per person. We distinguish the following stages:

• period before infection
• infectious period
• sickness
• recovery and immunity

Actually we combined some of the infection stages compared to the medical definition (here), for simplicity of the simulation and notations.

Basically, the disease development can be described with the following function   , where  can be modified for a specific virus and  is expressed in days.

###### Epidemic Simulation

At the beginning of the epidemic simulation we see a map of the population distribution in the Netherlands (people are represented by white dots).  On the right hand site you see a panel with the simulation time, transparency slider, plot options and global parameters that we described above. The plot options are infection stage (blue colour means infectious period and red colour means recovered people that are immune to the disease) or sickness (front of the spreading of the disease). The disease outbreak starts in the Hague. By clicking on the map we have added another disease source in Groningen.

### Summary

The Netherlands is a very densely populated country. Therefore, it is almost impossible to avoid an epidemic. We also have seen it in the movie above. The idea of this post is that we can model almost anything, whether it will reflect the truth, the history will tell us.