Skip to content

ukri-bench/benchmark-dolfinx

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

119 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DOLFINx benchmark

Spack build and run Test benchmark

This benchmark tests the performance of an unstructured grid finite element solver. It solves the Poisson equation on a mesh of hexahedral cells using a matrix-free method. Low- and high-degree finite elements bases are supported. Being matrix-free and supporting high-degree finite elements makes this benchmark suitable for CPU and GPU architectures. The finite element implementation uses sum factorisation.

Parallel communication between nodes/devices uses MPI.

Status

Under development.

Maintainers

@chrisrichardson, @garth-wells

Overview

Main code/library

DOLFINx

Architectures

CPU (in progress), GPU.

Languages and programming models

C++, CUDA, HIP, MPI.

Seven 'dwarfs'

  • Dense linear algebra
  • Sparse linear algebra
  • Spectral methods
  • N-body methods
  • Structured grids
  • Unstructured grids
  • Monte Carlo

Building

The benchmark can be built using Spack or manually using CMake.

Spack

A Spack package is provided in the repository https://github.com/ukri-bench/spack-packages. To view the package options:

spack repo add --name bench_pkgs https://github.com/ukri-bench/spack-packages.git bench_pkgs
spack repo add --name fenics https://github.com/FEniCS/spack-fenics.git fenics
spack info bench-dolfinx

Options are used to specify CPU and GPU (AMD or CUDA) builds, e.g. +cuda cuda_arch=80 or +rocm amdgpu_target=gfx90a. The benchmark builds an executable bench_dolfinx.

CMake

The benchmark depends on the library DOLFINx v0.10.0 and can be built using CMake. See the benchmark Spack package file and the Spack dependencies for a comprehensive list of dependencies.

When building the benchmark using CMake, the following benchmark-specific CMake options are available:

  • -DHIP_ARCH=[target] builds using HIP for the specific GPU architecture [target]
  • -DCUDA_ARCH=[target] builds using CUDA for the specific GPU architecture [target]

Potential compilation issues

Running the benchmarks

Selecting a GPU device and binding to CPU NUMA regions

The bench_dolfinx code is designed to run on one CPU MPI rank per GPU device. In order to correctly map devices to cores, it is usually necessary to include a GPU binding script between mpirun and bench_dolfinx. There is an example of how to do this on LUMI-G for ROCm and CSD3 for CUDA. It is also important to bind the CPU cores to the correct NUMA regions, as also described in these links. Additionally, MPI must have GPU support enabled (e.g. export MPICH_GPU_SUPPORT_ENABLED=1 for Cray-MPICH).

Running on a HPC system

The benchmark will often be run on a HPC system using a batch queueing system, such as SLURM. A typical submission script is shown below:

#!/bin/bash
#SBATCH -p partition
#SBATCH --nodes=16
#SBATCH --gpus=64
#SBATCH --exclusive
#SBATCH --job-name=benchmark
#SBATCH --ntasks-per-node=4
#SBATCH --hint=nomultithread
#SBATCH --time=00:20:00

source /project/spack/share/spack/setup-env.sh
spack env activate bench10
module load libfabric/1.22.0

# Check correctness compared to matrix
srun -N ${SLURM_NNODES} -n ${SLURM_NTASKS} ./select_gpu ./bench_dolfinx --nreps=1 --mat_comp --ndofs_global=100000 --degree=3 --json mat_comp-${SLURM_NNODES}.json

# Run Q3 problem with 300M dofs/device
srun --mem-bind=local --cpu-bind=map_cpu:0,72,144,216 -N ${SLURM_NNODES} -n ${SLURM_NTASKS} ./select_gpu ./bench_dolfinx --ndofs=300000000 --degree=3 --cg --json Q3-300M.json
# Run Q6 problem with 500M dofs/device
srun --mem-bind=local --cpu-bind=map_cpu:0,72,144,216 -N ${SLURM_NNODES} -n ${SLURM_NTASKS} ./select_gpu ./bench_dolfinx --ndofs=500000000 --degree=6 --cg --json Q6-500M.json

See examples for example input and output files.

Command line options

The program lists the available options with the -h option.

bench_dolfinx -h

Correctness tests

Compare against the same computation by assembling a matrix:

bench_dolfinx --mat_comp --ndofs_global=10000 --degree=3

This test can be used to verify the matrix-free GPU algorithm is giving the same results as an assembled matrix. Because the matrix is assembled on CPU, it can be very slow for large problems and high polynomial degree. Recommended settings are 10000 global dofs, and degree 3. Results should be the same in parallel with mpirun. The console output should report Norm of error with a small (machine precision) number e.g. for float64 about 1e-15.

Performance tests

The following tests are recommended. A problem size of at least 10M dofs is needed to overcome the GPU launch latency. Problem size per-GPU can be increased until the memory limit is reached. The number of repetitions defaults to 1000.

Single-GPU performance test (10M dofs)

bench_dolfinx --float=64 --degree=6 --ndofs=10000000

Multi-GPU performance test (10M dofs per GPU)

mpirun -n 4 bench_dolfinx --float=64  --degree=6 --ndofs=10000000

Adding the --cg flag will also test additional axpy and global reduce on every iteration. The --float=32 flag will test at 32-bit precision. Changing the --degree will affect the balance of computation and communication (e.g. degree 6 is more computationally efficient, but results in more inter-GPU data transfer on each iteration).

Figures of merit

The main Figure of Merit (FoM) is the computational throughput in GDoF/s. The throughput represents the amount of useful computation that is done by the operator (or Conjugate Gradient) algorithm, and is reported for the whole system. Thus, to get the throughput per GPU, divide by the number of GPUs used. It is printed at the end of each run, and can also be saved in a JSON file by adding the --json filename.json flag.

License

The benchmark code is released under the MIT license.

About

CPU and GPU benchmarks for unstructured grid finite element computations based on the DOLFINx library.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Contributors