C++ AMP Performance and Compute Shader Performance
Edit (April 23rd 2012):
The AMP team has updated the N-Body Simulation code, turning it into a clean port that relates to the Compute Shader original in a comprehensible way. It now has performance comparable to the original (optimized) version: both versions achieve >330 gFLOPS at >30 fps for 23,040 particles on my PC.
I’m impressed. For one, by the attitude of the AMP people, who energetically reacted to issues that other people or teams might well have dismissed as unimportant. Then there is the point that you get maximum performance from a set of very powerful processors with code that is very short compared to the DirectCompute code you would otherwise have to write, and this code, by AMP design, is very elegant as well.
Of course, there is a risk in short and elegant code: subtle differences in code can make substantial differences in performance, hence developing AMP code is rather knowledge intensive. But I kind of like that.
Edit (April 16th 2012):
The results below were brought to the C++ AMP forum for discussion. Daniel Moth advised updating the driver of the graphics card. This update made a tremendous difference for two of the three programs mentioned below: for those, C++ AMP performance is now equal to or better than Compute Shader performance.
The discussion on the N-Body Simulation program, which is heavily optimized in the Compute Shader version, is still open, mainly because the required information is not yet available. I expect that in this case too, C++ AMP will prove to be the equal of the Compute Shader program.
Now, what have we learned from this exercise? For one, a lot about Compute Shader optimization and the mechanisms of GPU computing performance, which is an interesting and instructive subject. I have also learned that C++ AMP performance is comparable to Compute Shader performance. However, I do not (yet) understand if and how this will always and necessarily be the case, and that still itches a bit.
Results as they are standing now:
| Program | Measure | AMP | CS |
| --- | --- | --- | --- |
| Guide | Average time (ms, 10 it.) | 2,650 | 2,995 |
| | gFLOPS | 36.9 | 32.7 |
| | Max. Data Load (Kb) | 714,432 | 691,200 |
| Vector Addition | Average time (ms, 10 it.) | 6,017 | 8,155 |
| | gFLOPS | 0.03 | 0.02 |
| | Max. Data Load (Kb) | 1,781,248 | 2,039,056 |
| N-Body Simulation | Number of Particles | 16,128 | 16,128 |
| | Frame rate | 44.4 | 63.4 |
| | gFLOPS | 229 | 329 |
To date, I find that Compute Shader based programs outperform C++ AMP programs both in time and space. Results of the example programs I explored, which were created by the respective product teams, tend to show substantially better performance for the Compute Shader programs. These programs are the N-Body Simulation Sample, Basic Summation, and the matrix multiplication programs from the “C++ AMP for the DirectCompute Programmer” guide. Hyperlinks are provided in the sections below.
So, the question is: can there be an AMP program that performs substantially better in time and space on, let’s say, large matrix multiplication (or large matrix-vector multiplication) than a Compute Shader program? C++ AMP has been built on DirectCompute, so the answer is: not likely.
Should we, alternatively, draw the conclusion that a direct compute program categorically has better performance?
N-Body Simulation
The first pair of programs compared consisted of:
- The NBodyGravityCS11 Sample from the June 2010 DirectX SDK. According to its documentation, it has been used for demonstration at GDC09.
- The N-Body Simulation Sample port for C++ AMP. According to the post on the Parallel Programming in Native Code blog, it has been used for demonstrations at the AMD Fusion Developer Summit and at the Microsoft Build conference.
Performance is expressed in gFLOPS. The code that computes the gFLOPS was copied from the C++ AMP version to the Compute Shader version. I also changed the Compute Shader version to make it write the gFLOPS and the number of particles to the screen.
First, I tweaked the particle count parameter to get the best gFLOPS count from either program; both peak at 16,128 particles on my PC. Then the following results (gFLOPS) were obtained for release builds, running without debugging (this was also the configuration in the comparisons below).
| | C++ AMP | Compute Shader | More (%) | Less (%) |
| --- | --- | --- | --- | --- |
| Number of particles | 16,128 | 16,128 | | |
| Frames per second | 43.46 | 57.38 | 32.03 | 24.26 |
| gFLOPS | 226.07 | 298.51 | 32.04 | 24.27 |
A note on the More and Less columns: The Compute Shader version delivers 32.03% more frames per second, and the C++ AMP version 24.26% less. So crudely: the Compute Shader version is about 30% faster.
Vector Addition
The second pair of programs compared consisted of:
- The BasicCompute11 Sample from the June 2010 DirectX SDK.
- An adaptation of the first example from Overview of C++ Accelerated Massive Parallelism (C++ AMP). This is also a vector addition.
The C++ AMP code was adapted as follows:
- It was made to work with the same structs as the BasicCompute11 sample. This struct consists of an int and a float.
- The arrays were made global variables.
- A loop was added to fill the input arrays.
- The verification code from the BasicCompute11 sample was added.
Timing code was added to both programs; it comes from this post in the Parallel Programming in Native Code blog.
For the timing measurements, the code was adapted as follows: in the Compute Shader program, timing covers the code from the Dispatch call to the Map call; in the AMP program, timing covers the lambda expression and an added array_view::synchronize() call on the “sum” array_view.
In the experiments I first pushed the size until, in the case of the Compute Shader version, the result-verifying code reported “failure”, and in the case of the C++ AMP program, it either didn’t compile or produced a runtime error.
Then I measured time and gFLOPS. The experiments yielded the following result.
| | C++ AMP | Compute Shader | More (%) | Less (%) |
| --- | --- | --- | --- | --- |
| Number of array elements | 76*10^6 | 87*10^6 | 14.47 | 12.64 |
| Total data size (Kb) | 1,781,250 | 2,039,062.5 | | |
| Time (ms) | 6,868 | 8,182 | | |
| gFLOPS | 0.022 | 0.021 | | |
gFLOPS were measured as: 2*n / (10^6 * ms), where n is the number of elements in an array and ms is the measured time in milliseconds.
It seems to me that the time results are too similar to call them different. The Compute Shader version has a slight space advantage.
Note that since the total data size in both cases is larger than the RAM the graphics card has on board, there is some automatic sectioning going on.
Matrix Multiplication
Both programs in this comparison come from the C++ AMP for the DirectCompute Programmer guide. This guide can be obtained from a post on the official MSDN Parallel Programming in Native Code blog. The C++ AMP program is a transformation of the Compute Shader program.
The code for the starting point of the transformation is not entirely complete, so I added standard code from the BasicCompute11 Sample that loads and compiles the compute shader.
The following results were obtained.
| | C++ AMP | Compute Shader | More (%) | Less (%) |
| --- | --- | --- | --- | --- |
| Matrix dimension (n) | 4,608 | 7,616 | 65.28 | 39.50 |
| Total data size (Kb) | 248,832 | 679,728 | 173.17 | 63.39 |
| Av. processing time (ms, 10 runs) | 11,742 | 12,804 | | |
| gFLOPS | 8.3 | 34.5 | 315.66 | 75.94 |
Notes:
- Both programs measure the time spent in the “mm” function, using the timing code referred to above. This includes uploading the data to the GPU and downloading the results from it.
- For both programs, any higher multiple of 64 for the matrix dimension crashes the display driver.
- gFLOPS are measured as: n^3 / (10^6 * ms), where:
- n is the size of a matrix dimension (the matrices are square).
- ms is the measured processing time in milliseconds, averaged over 10 iterations.
Conclusions
Three program pairs have been compared, informally and semi-systematically, for their performance in time and space.
In the case of the N-Body simulation, the data load that is optimal for time performance was selected. That resulted in about 30% better time performance for the Compute Shader program.
In the case of vector addition – about the simplest program imaginable in this context – the time performance was measured for maximum data load. This resulted in practically equal time performance for both programs. The Compute Shader version can load some more data.
Finally, the programs from the AMP guide for Compute Shader programmers were implemented, and the time performance was again measured for maximum data load. This resulted in a time performance of the Compute Shader program that is three times as good as that of the AMP program.
So, in conclusion, it seems that if you want to get the most from your GPU, a Compute Shader is still the way to go.