A comparison of the performance obtained on Linux Mint Maya with ViennaCL 1.4.1 for the CUDA, OpenCL, and OpenMP backends is given in the following. All tests are run in double precision in order to reflect typical use cases. Note that the selected operations are limited by memory bandwidth, as is the case with most scientific codes using finite element or finite volume methods. Their performance characteristics therefore differ from those of the typical compute-bound matrix-matrix multiplication benchmarks used by companies to promote their hardware.
Benchmarks on the Intel MIC are performed using beta hardware as well as the beta version of the Intel OpenCL SDK XE. Retail hardware is likely to show different performance characteristics.
Vector Addition
The benchmark for vector addition is a good measure of the available memory bandwidth and is one of the operations in the STREAM benchmark. We have compared the operation x = y + z for vectors x, y, and z of varying size. At vector sizes below about 10,000, the unavoidable GPU kernel launch overheads due to PCI-Express latency are readily visible. The wiggles at vector sizes around 8192 on the CPU and MIC are due to the use of OpenMP for vector sizes above 5000. The AMD GPU shows wiggles in the range of 10k to 1M entries, which may be due to issues in the driver. We obtain about 75 GB/s of memory bandwidth on the Xeon Phi, while bandwidths of up to 150 GB/s using additional tricks have already been reported here.
Sparse Matrix-Vector Multiplication
This benchmark deals with the workhorse of many iterative solvers, namely the sparse matrix-vector multiplication. Matrices in CSR format, with optional padding of the number of entries in each row to multiples of 4 and 8, are considered, and the fastest execution is recorded. While the operation is still mostly limited by memory bandwidth, the access pattern is fairly irregular. Note that the AMD GPU ultimately benefits from its high memory bandwidth. The OpenCL performance on the Xeon Phi suffers from high latency, which may be due to the beta status of the hardware and software.
Conjugate Gradient Solver Iterations
Finally, the unpreconditioned conjugate gradient solver implemented in ViennaCL is benchmarked on different hardware. In addition to vector additions and sparse matrix-vector operations, communication with the host as well as inner products computed via reductions are required. Again, a finite difference discretization of the Poisson equation on the unit square is considered as a toy example.
Best performance for small systems (below ~10k unknowns) is obtained on the CPU. For large systems (above ~200k unknowns), GPUs show better performance due to their higher memory bandwidth. CUDA shows considerably lower latency than OpenCL on NVIDIA hardware, and the OpenCL latency on NVIDIA hardware is in turn lower than that on the AMD GPU.