
This is nr. 3 of a series of blog posts on Parallel Stream Compaction. In previous posts we saw that the implementation by Billeter et al. is a constant factor faster than the implementations by Orange Owls and Spataro, and that this multiplicative relation holds even across different graphics cards. We wondered whether the essential difference between the implementations is that the former uses loops that process the input in subsequences, and the latter two do not.

In this post we will examine loops and sequences in a Cuda copy program, a simpler context than stream compaction. The program takes a long array and returns a copy. Simple as that. The question here is whether the use of loops can account for the difference in performance mentioned above, and if so, whether we can say something about optimal parameter configurations.

This investigation compares loops and subsequences set up according to Billeter et al. with: not using loops at all, “grid stride loops” as advocated by Nvidia [1], and warp stride loops, a generalization of the way Billeter et al. organize their loops. For the latter three alternatives we will do a parameter search to (by and large) optimize parameter settings. Recall that the Billeter et al. algorithm has fixed values for the number of threads per thread-block and the number of blocks.

A link to sample code can be found at the bottom of this post. (not yet, but on its way!)

No loops, Grid stride loops, Billeter et al. loops, and Warp stride loops

The task will be to copy an array of 2^24 elements. The elements are generated using the C++ rand() function.

No loops

If we do not use loops or sequences we can copy the array with a function like the one below. It is a template, so we can easily experiment with different types of arguments.

template <typename T>
__global__ void device_copy_kernel1(T* d_in, T* d_out)
{
	// Determine global index into the arrays
	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	// Copy the element
	d_out[idx] = d_in[idx];
}

I used arrays of unsigned int in the experiments described here.
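For completeness, here is a minimal sketch of how this kernel could be launched (my own launch code, not taken from the sample program); d_in and d_out are assumed to be device pointers to ARRAY_SIZE = 2^24 unsigned ints, allocated with cudaMalloc:

// The grid must cover the array exactly: one thread per element.
const unsigned int threads = 128;
const unsigned int blocks = ARRAY_SIZE / threads;   // 131,072 blocks for 2^24 elements
device_copy_kernel1<unsigned int><<<blocks, threads>>>(d_in, d_out);
cudaDeviceSynchronize();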

Grid stride loops

When using grid stride loops, the step size equals the entire grid of threads. The size of a grid is

number-of-threads-per-block x number-of-blocks

i.e., the entire number of threads used by the program. We can choose the total number of threads freely, but if we also choose wisely, it is a divisor of the array size, and we choose the block size to be a multiple of the warp size: 32 for current hardware. The maximum block size is 1024.

A simple function to copy an array using grid stride loops is the following.

template <typename T>
__global__ void device_copy_kernel2(T* d_in, T* d_out)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	for (int i = idx; i < ARRAY_SIZE; i += blockDim.x * gridDim.x)
	{
		d_out[i] = d_in[i];
	}
}
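As a usage sketch (again my own code, not from the sample program): with a grid stride loop the grid no longer has to cover the array, so the number of blocks can be chosen freely, for instance one of the well-performing settings found in the search below:

const unsigned int threads = 256;
const unsigned int blocks = 32768;   // free choice; each thread strides over the remainder
device_copy_kernel2<unsigned int><<<blocks, threads>>>(d_in, d_out);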

Billeter et al. loops, Warp stride loops

Billeter et al. have set the number of threads per block to 128, and the number of blocks to 120. They also have a complex way to cut up an array into subarrays. Well, for now I will cut up the array in equal sized chunks. If performance results are close, there is always the option to go into more complexity.

A distinguishing feature of the Billeter et al. loop is that it has a warp-size stride. We might find it interesting to explore this feature further, so let’s use the following function.

template <typename T>
__global__ void device_copy_kernel3(const T* d_in, T* d_out, const unsigned int seq_size)
{
	const auto idx = blockIdx.x * blockDim.x + threadIdx.x;

	// index within warp, the name lane refers to the hardware implementation
	const auto lane = threadIdx.x % WARP_SIZE;
	// global index for warp, sequence
	const auto sidx = idx / WARP_SIZE;

	// set sequence begin and end
	const auto seq_begin = seq_size * sidx;
	const auto seq_end = seq_begin + seq_size;

	// copy the sequence
	for (auto i = seq_begin + lane; i < seq_end; i += WARP_SIZE)
	{
		d_out[i] = d_in[i];
	}
}

The seq_size parameter is defined as:

const unsigned int seq_size = ARRAY_SIZE / warp_cnt;

where warp_cnt is:

const unsigned int warp_cnt = (threads * blocks) / WARP_SIZE; // WARP_SIZE = 32
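Putting the pieces together, a hedged launch sketch for this kernel (my own code; the fixed configuration of Billeter et al. is used as an example, and d_in, d_out and ARRAY_SIZE are as before):

const unsigned int threads = 128;                          // the Billeter et al. setting
const unsigned int blocks  = 120;
const unsigned int warp_cnt = (threads * blocks) / 32;     // WARP_SIZE = 32: 480 warps = 480 sequences
const unsigned int seq_size = ARRAY_SIZE / warp_cnt;       // elements per sequence
// note: for full coverage ARRAY_SIZE should be divisible by warp_cnt
device_copy_kernel3<unsigned int><<<blocks, threads>>>(d_in, d_out, seq_size);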

Benchmarks

Time measurements are averaged over 1,000 calls of the kernel. The kernels are compiled as Visual Studio release builds and run without debugging, on an Asus GeForce GTX 690.
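For reference, a hedged sketch of how such an average could be obtained with CUDA events (my own helper code, not necessarily the harness used for these measurements):

const unsigned int threads = 128;
const unsigned int blocks = ARRAY_SIZE / threads;

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
for (int i = 0; i < 1000; ++i)
{
	device_copy_kernel1<unsigned int><<<blocks, threads>>>(d_in, d_out);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float total_ms = 0.0f;
cudaEventElapsedTime(&total_ms, start, stop);        // total time in milliseconds
float average_us = total_ms * 1000.0f / 1000.0f;     // per-call average in microseconds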

No loops

For the No loops case we let the number of threads per block run from 32 to 1024 in steps of 32; the number of blocks is simply ARRAY_SIZE / number-of-threads-per-block.

This gives us the following graph (I have subtracted 850 µs from each time measurement, in order to enhance the differences; all time measurements are larger than 850 µs).

There is a global minimum around the fourth measurement, i.e. in this experiment, the fourth measurement provides the best time performance.

Grid stride loops

The Grid stride loop case has a huge number of possible parameter values, so we do a somewhat coarse search. The set of numbers of threads per block is {64, 128, 256, 512, 1024}. The parameter space for the number of blocks is divided into 32 values (where we disregard the case for zero blocks).

This gives us the following graph (I have again subtracted 850 µs from each time measurement).

There are several, let’s say, regional minima: the 47th, 78th, 109th, and 139th measurements.

| Measurement nr. | Threads | Blocks | Time (µs) |
|---|---|---|---|
| 47 | 128 | 65,536 | 862.3 |
| 78 | 256 | 32,768 | 861.8 |
| 109 | 512 | 16,384 | 864.7 |
| 139 | 1024 | 7,680 | 871.2 |

Interestingly, this is a very regular series: the number of threads doubles while the number of blocks halves. The last measurement is just one step before the 8,192 blocks that would continue the pattern (2 x 8,192 = 16,384).

Warp stride loops, Billeter et al. loops

Nvidia hardware executes a warp of threads in parallel on a single multiprocessor, so it makes sense to set up sequences that are each processed by a single warp of threads. Such sequences are processed by warp stride loops. The number of warps then determines the number of sequences, and hence their size. Nvidia warns not to write programs that critically depend on the size of a warp, so production software that uses the warp size as a parameter should always query the hardware at runtime for the actual warp size (for the foreseeable future, expect it to be either 16, 32, or 64).
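Querying the warp size at runtime is straightforward with the standard CUDA runtime API (a small sketch):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);     // properties of device 0
const int warp_size = prop.warpSize;   // 32 on current Nvidia hardware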

Billeter et al. have a warp stride loop with 120 blocks of 128 threads. I have measured the time performance and registered it in the table in the Summary section.

The time performance is not particularly impressive: 1,804.7 µs, the slowest option in the pack.

In order to see if the warp stride loop variant has any merits, I performed a parameter search, similar to the grid stride loop case. Note that the array length, the warp size, the number of threads per block, and the number of blocks together completely determine the size and number of sequences.

This resulted in the following graph (again, 850 µs has been subtracted from each measurement).

The 47th measurement has the shortest time: 863.7 µs.

Summary

The best time performance and bandwidth results are in the table below.

|  | Time (µs) | Bandwidth (GB/s) | Threads | Blocks | Sequence size |
|---|---|---|---|---|---|
| No Sequences | 870.9 | 154.1 | 128 | 131,072 | – |
| Grid Stride Loops | 861.8 | 155.7 | 800 | 1,024 | array size |
| Billeter et al. Sequences | 1,804.7 | 74.4 | 128 | 120 | 32,768 |
| Warp Stride Loops | 863.7 | 155.4 | 128 | 65,536 | 64 |

All results are within a few microseconds of each other, except the Billeter et al. case, which is significantly slower.

At this point we see no (dis)advantage of using sequences!

Vectorized Memory Access

Another question is whether any of these alternatives can benefit from vectorized memory access [2]. The hardware can execute memory access operations on scalars, but also on small vectors of 2 components (pairs) or 4 components (quads). To see what the effects of vectorized memory access are, I adapted the software accordingly. Note that if the program processes vectors of e.g. 4 components, it needs only a quarter of the reads and writes of the scalar case.
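As an illustration, here is a hedged sketch of a quad variant of the no-loops copy kernel, along the lines of [2] (my own code; it assumes ARRAY_SIZE is a multiple of 4, and that the pointers come from cudaMalloc and are therefore suitably aligned):

__global__ void device_copy_kernel1_quad(const uint4* d_in, uint4* d_out)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	// one uint4 load and one uint4 store move four unsigned ints at once
	d_out[idx] = d_in[idx];
}

// launch with ARRAY_SIZE / 4 threads in total:
// device_copy_kernel1_quad<<<(ARRAY_SIZE / 4) / threads, threads>>>(
//     reinterpret_cast<const uint4*>(d_in), reinterpret_cast<uint4*>(d_out));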

Vectorized memory access with pairs

The test program proceeds just as described above. The table below summarizes the results obtained.

|  | Time (µs) | Bandwidth (GB/s) | Threads | Blocks | Sequence size |
|---|---|---|---|---|---|
| No Sequences | 857.9 | 156.5 | 704 | 11,915 | – |
| Grid Stride Loops | 859.7 | 156.1 | 1,024 | 8,192 | array size |
| Billeter et al. Sequences | 1,434.6 | 93.6 | 128 | 120 | 17,464 |
| Warp Stride Loops | 820.9 | 163.5 | 256 | 38,912 | 26 |

It turns out that warp stride loops benefit most from vectorized memory access with pairs. This includes Billeter et al. sequences.

Vectorized memory access with quads

The table below summarizes the results obtained.

|  | Time (µs) | Bandwidth (GB/s) | Threads | Blocks | Sequence size |
|---|---|---|---|---|---|
| No Sequences | 862.3 | 155.7 | 1,024 | 4,096 | – |
| Grid Stride Loops | 862.4 | 155.6 | 64 | 65,536 | array size |
| Billeter et al. Sequences | 1,335.7 | 100.5 | 128 | 120 | 8,738 |
| Warp Stride Loops | 831.7 | 161.4 | 256 | 38,912 | 26 |

So, it seems that vectorized memory access with quads, as opposed to pairs, is beneficial for Billeter et al. sequences, but not for the other options.

Conclusions

We conclude that warp stride loops offer the best performance. Better than using no loops, to be sure, if combined with vectorized memory access for pairs (not quads).

We have seen that the performance graphs of the various options show many local minima. So, it pays to do a search for the parameters that maximize performance.

Then, of course, there is the question whether sequences are what makes Billeter et al. stream compaction the world champion. The answer is: definitely not!

Next

If the use of sequences does not set stream compaction according to Billeter et al. apart from the rest, then what does? Is it the difference in the size of the metadata?

References

[1] https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/

[2] https://devblogs.nvidia.com/cuda-pro-tip-increase-performance-with-vectorized-memory-access/ .


This is number 2 of a series of blog posts on Parallel Stream Compaction. In part 1 we compared a small number of fast implementations. The implementation by Billeter, Olsson and Assarsson came up as the fastest. In a digression on different graphics cards we saw that this performance ordering holds across cards of different generations. Now we want to know why the implementation by Billeter et al. is so distinctly faster than the other two.

In this part we measure the time performance of the individual steps that comprise the algorithms, and discuss the results before we dig deeper in further episodes. The code we use here is the code discussed in part 1.1: the digression comparing various graphics cards.

Measuring time performance of the three processing steps

All algorithms discussed in part 1 have three sequential steps, where each step is a Cuda kernel:

  1. Count the number of valid entries. The Count step.
  2. Determine the global offsets in the output for the entries. The Prefix step.
  3. Compact the valid entries into the output using the offsets. This is the Move step.

So, the first two steps provide metadata, and the third step uses the metadata to compact the stream.

The time performance of the separate steps may show us bottlenecks in the implementations.

We benchmark the individual steps with structured data (see part 1) on a GeForce GTX 690. We use the software from part 1.1, but restrict the size to 2^24. In the case of the Orange Owls and Spataro implementations, we restrict the threading configuration to 1024 threads per thread-block. This latter choice came out as the fastest thread configuration for the specified input length.

Processing time is measured as follows. The individual steps are measured indirectly: first we measure step 1, then we measure steps 1 and 2 together and subtract the time for step 1, etc. Time is measured in microseconds (µs = 10^-6 seconds). Measurement times are averages over 1,000 invocations.

We then see the following results.

|  | Count (µs) | Prefix (µs) | Move (µs) | Total (µs) |
|---|---|---|---|---|
| Billeter et al. | 455.5 | 7.3 | 820.0 | 1,282.8 |
| Orange Owls | 893.6 | 47.2 | 1,665.8 | 2,606.6 |
| Spataro | 946.6 | 86.6 | 2,107.9 | 3,141.1 |

Let’s discuss this a bit.

The Prefix Step

The prefix step by Billeter et al. is much faster than the prefix step in the other two implementations. Curiously, both Orange Owls and Spataro employ the thrust::exclusive_scan function. This function is part of the Thrust library that is packed and distributed with the Cuda SDK(!). Ok, we already know from part 1 that Thrust is not the ultimate in performant GPU software, but the difference with Billeter et al. is really too large.

We will discuss Billeter et al.’s prefix step in a later episode in detail.

The Count and Move steps: loops

A major difference between the implementation by Billeter et al. and the other two is that the former employs loops.

Loops, as in sequential processing? In massively parallel software? And then having a performance advantage??? Yes!

Here is a count function from the source code by Billeter et al. that loops through an array.

template< class Pred > static _CHAG_PP_DEV SizeType count(
    const T* aStart, const T* aEnd, const Pred& aPred, volatile SizeType* aSmReduce)
{
    SizeType partialCountA = 0u, partialCountB = 0u;
    for(const _Tp* elem = _Tpp(aStart) + _Env::lane(); elem < _Tpp(aEnd);
        elem += _Env::SIMD)
    {
        _Tp e = *elem;

        if( aPred( e.x ) )
            ++partialCountA;
        if( aPred( e.y ) )
            ++partialCountB;
    }

    SizeType sum = partialCountA + partialCountB;

    return _Unit::reduce( sum, op::Add<SizeType>(), aSmReduce );
}

Notice the for loop: it defines how each thread of a warp iterates through the input from aStart (plus the thread’s offset within the warp, its lane) up to aEnd, in increments of the warp size (so threads do not get in each other’s way).

Here is the count function from Orange Owls. No loops at all.

template <typename T, typename Predicate>
__global__ void computePredicateTruePerBlock(const T * __restrict__ d_input,
    const int N, int * __restrict__ d_BlockCounts, Predicate predicate)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;

    if (tid < N)
    {

        int pred = predicate(d_input[tid]);
        int BC = __syncthreads_count(pred);

        if (threadIdx.x == 0) { d_BlockCounts[blockIdx.x] = BC; }
    }
}

What does this code do? First it determines a global index for the current thread (tid). Then, if the index denotes a location in the input, it determines whether the predicate holds for the designated input element, setting a variable to 1 if it does and to 0 if it doesn’t. __syncthreads_count then sums these outcomes over the entire thread-block, and the first thread of the block stores the result.

Default threading configuration in Billeter et al.

The threading configuration in Billeter et al. depends on compile-time parameters. The default configuration for step 1 (Count) and step 3 (Move) is that processing takes 120 blocks of 128 threads each. This is surprisingly low! Can we even speak of data parallelism? The 128 threads of a block are subdivided into 4 warps of 32 threads: dim3(32, 4, 1). In contrast: the Prefix step uses 480 threads in 1 block.

The total number of threads (128 x 120) is far less than the size of the input stream. So, the algorithm employs sequences: ordered, contiguous, nonoverlapping subarrays; an order preserving concatenation of the sequences equals the input stream.

Number of sequences

The parallelism in GPU computing is per warp. So, the number of sequences in this algorithm is the total number of warps, which is the number of warps per thread-block x the number of blocks (4 x 120 = 480). The Count step counts the valid entries of the input per sequence and outputs an array of the counts, just as the Move step moves the valid entries to the output per sequence. The Prefix step takes as input the output of the Count step which has length 480. Since the Prefix step has 480 threads, there are no sequences in the Prefix step.

Note that the small numbers of thread-blocks and threads per block result in small data structures for the counts-of-valid-elements-per-sequence, and the global-offsets-per-sequence. I think that having small metadata data structures is a performance factor in itself.

Size of the sequences

Billeter et al. compute the size of the sequences using a rather complicated recipe:

  1. Determine the size of the data block a warp processes in one call. For instance, if we use the uint2 short vector type (of 2 unsigned ints), this is the warp size x the number of unsigned ints in a uint2. Hence, 32 x 2 = 64.
  2. Compute the basic sequence size as the stream size divided by the data block size, and then divided by the number of sequences. Note that these divisions are integer divisions. In our case: 2^24 / 64 = 262,144, then 262,144 / 480 = 546 with a remainder of 64.
  3. Distribute the remainder R over the first R sequences: each of these gets one extra data block.

So, for our input of size 2^24, the first 64 sequences have size 35,008, and the other 480 – 64 = 416 sequences have size 34,944.
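In code, the recipe boils down to something like this (a small sketch with my own names, not the library’s actual code):

const unsigned int data_block  = 32 * 2;                  // warp size x uints per uint2
const unsigned int n_sequences = 4 * 120;                 // warps per block x blocks = 480
const unsigned int n_blocks    = (1u << 24) / data_block; // 262,144 data blocks
const unsigned int base_size   = n_blocks / n_sequences;  // 546 data blocks per sequence
const unsigned int remainder   = n_blocks % n_sequences;  // 64 data blocks left over
// the first 'remainder' sequences process base_size + 1 data blocks (547 x 64 = 35,008 elements),
// the remaining 416 sequences process base_size data blocks (546 x 64 = 34,944 elements)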

The code is flexible and can work with input streams of arbitrary length. If the length of an input array is not a multiple of the warp size, the code calculates a range of auxiliary input elements and processes these separately. I will argue against this solution elsewhere; for now we can ignore this facility since the (large) size we use for experiments is a power of 2, hence a multiple of 32.

Next

If we want to conclude that the performance difference between the implementation of Billeter et al. on the one hand, and Orange Owls and Spataro on the other hand is indeed the use of sequences, we should be able to replicate this effect in a simpler setting. So, next up is a comparison of a standard copy algorithm and a variant that uses sequences. Would that reveal the difference?

References

For references to articles and software download sites, see part 1 and part 1.1 of this blog post collection.

This is part 1.1 of a number of blog posts on parallel stream compaction.

In part 1 we saw that the parallel stream compaction Cuda software by Billeter et al. is faster than the software by Orange Owls, which in turn is faster than the software by Spataro. Commenter Qubey then asked if this relative order is preserved under migration to other graphics cards, notably to cards with newer generations of GPUs. This is an important question because it addresses the validity of a choice for a particular algorithm (and its implementation) for future hardware. This validity depends largely on adherence to an architectural model in successive hardware generations and the software runtime environment.

We set out to experiment. All three implementations will be run with structured data for streams of size 2^10 up to and including 2^26. Structured data is of the form 1, 0, 3, 0, 5, 0, 7, 0, … Values are limited to [0, 2^16 – 1]. We have seen in part 1 that the algorithms are not sensitive to structure in the input data. This was also found in the extensive experiments reported on here (although we will not discuss it further). The implementations of Orange Owls and Spataro will be run for 32, 64, …, 1024 threads per block. The thread layout of the Billeter et al. implementation is fixed. We ran this software on various graphics cards: a GeForce GTX 660, 690, 960 and 1080. Time performance measurements are averaged over 1,000 calls of the kernel.

The questions we would like to see answered are:

  1. Is the relative ordering in time performance of the implementations preserved over the different graphics cards?
  2. Is the time performance relation between the fastest implementation and the runner-up constant over the different graphics cards? Can we say that implementation X is Y times faster than implementation Z?

Relative order

The first question can best be answered by presenting time performance graphs of the cards involved.

Explanation of the numbers on the x-axis: x=1 means the input length is 2^10, x=2: input size is 2^11, … x=17: input size = 2^26. This holds for all graphs below.

It is obvious from the graphs that relative ordering of performance is preserved over the cards.

Note, by the way, the enormous improvement in processing time over the generations. The GTX 660 takes almost 8 ms to process the largest stream (2^26 elements) using the Billeter et al. algorithm, whereas the GTX 1080 needs only 3 ms.

Magnitude

The data from part 1 suggests that there is a multiplicative relation between the fastest algorithm: Billeter et al., and the runner up: Orange Owls. The data from the current test set support this suggestion, as the following graphs illustrate.

The graphs below show (i) the Orange Owls time performance measurements divided by the Billeter et al. measurements; (ii) Spataro divided by Billeter et al.; and (iii) Spataro divided by Orange Owls.

We see that the Orange Owls data divided by the Billeter et al. data reasonably approximates a straight line, indicating an approximately constant factor on all cards. This will allow us to say something like “The implementation by Billeter et al. is X times faster than the implementation by Orange Owls.” The other two relations are clearly not of this nature.

We see that there is indeed a multiplicative relation between the Billeter et al. and Orange Owls time performance data. So what is the magnitude of these relations for the different cards, and are they more or less the same? Take a look at the table below.

| Card | Mean | Standard deviation |
|---|---|---|
| GTX 660 | 1.6 | 0.2 |
| GTX 690 | 1.7 | 0.2 |
| GTX 960 | 1.3 | 0.1 |
| GTX 1080 | 1.6 | 0.3 |
| All data | 1.5 | 0.3 |

As you can see, mean and standard deviation are similar over the cards, with the exception of the GTX 960. I’ll get to that. Based on these results, I’m inclined to say that Billeter et al. stream compaction is about 1.5 times as fast as Orange Owls stream compaction.

On the other hand, the magnitude of the multiplicative relation seems to be about 1.6, with the notable exception of the GTX 960. We note that the theoretical memory bus bandwidth of the GTX 960 is (only) 112 GB/s for the reference card and 120 GB/s for OEM cards (Wikipedia), whereas the bandwidth of the GTX 660 is 144 GB/s, and for the 690: 192 GB/s (per GPU).

The time performance comparison chart for all cards on structured data, Billeter et al. implementation looks like this:

It shows that the GTX 960 is relatively less well suited for stream compaction than the other cards.

Wrapping up

In part 1: Introduction, based on one graphics card and a single input stream size, Billeter et al. came out twice as fast as Orange Owls. This finding has now been refined to a factor of about 1.5 by introducing a broad spectrum of cards and a far broader range of input stream sizes. We have seen that Billeter et al.’s implementation of stream compaction is the fastest, for a substantial number of input stream lengths and over a number of graphics card generations.

Next

Next we will start digging into the inner workings of the algorithms, as promised in the introduction.

Thanks

I would like to express my gratitude to Qubey for his sharp questions and for his willingness to conduct experiments on the GTX 660, 960 and 1080 cards.

What is Stream Compaction

Stream compaction is simply copying only the nonempty (valid) entries from an input array to a contiguous output array. There is, of course, the option to not preserve the order of the input, but we will skip that one.

In C++ the definition is simple. Given a sparse std::vector<T> v_in and an equally sized, zero-initialized std::vector<T> v_out:

auto j = 0u;
for (const auto& e : v_in)
{
	if (e) v_out[j++] = e;
}

where T is the value type of both v_in and v_out.
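As an aside, the standard library expresses the same operation as a one-liner with std::copy_if (an equivalent sketch, assuming the same v_in and v_out and that T is convertible to bool, as above):

#include <algorithm>

std::copy_if(v_in.begin(), v_in.end(), v_out.begin(),
             [](const auto& e) { return static_cast<bool>(e); });

Its return value points one past the last element written, so it also gives you the compacted length.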

If you apply this sequential algorithm to real time graphics tasks, you will find that it is too slow (we will get to the numbers below).

On the other hand, stream compaction is a very important algorithm in general purpose GPU (GPGPU) computing, and/or data parallel algorithms. Why? Typically, GPGPU algorithms have fixed output addresses assigned to each of the many parallel threads (I will not explain data parallel algorithms here, please do an internet search if new to this subject).

Not all threads produce (valid) output, giving rise to sparse output arrays. These sparse output arrays constitute poor quality input arrays for subsequent parallel processing steps: it makes threads process void input. Understandably, the deterioration may increase with an increasing number of processing steps. Hence the need for stream compaction.

World Champion Parallel Stream Compaction

Some time ago I needed a GPU stream compaction algorithm. Of course, initially I was unaware of this term, just looking for a way to remove the empty entries from a large Direct3D buffer. Internet research taught that there are a few fast implementations: by Hughes et al. [3], Spataro [2], and Billeter et al.[1]. And let’s not forget the Cuda Thrust library which contains a copy_if function (Cuda release 8). Software implementations can be downloaded from [6], [5], [4], and [7] respectively.

I’ve benchmarked the implementations on my Asus Geforce GTX 690, also including an implementation of Spataro’s algorithm by Orange Owls [8]. Two input vectors have been used:

  1. A structured vector [1, 0, 3, 0, 5, 0, 7, … ].
  2. A vector of pseudo random unsigned shorts, selected by rand(), with an approximate probability of 50% to be zero (decided using rand() also).

Both vectors have size 2^24 (almost 16.8 million).
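For concreteness, the two inputs could be generated along these lines (a hedged sketch with my own helper names; the benchmark’s actual generation code may differ):

#include <cstdlib>
#include <vector>

std::vector<unsigned short> make_structured(size_t n)
{
	std::vector<unsigned short> v(n);
	for (size_t i = 0; i < n; ++i)
		v[i] = (i % 2 == 0) ? static_cast<unsigned short>(i + 1) : 0;   // 1, 0, 3, 0, 5, 0, ...
	return v;
}

std::vector<unsigned short> make_random(size_t n)
{
	std::vector<unsigned short> v(n);
	for (size_t i = 0; i < n; ++i)
		v[i] = (rand() % 2 == 0) ? 0 : static_cast<unsigned short>(rand());   // roughly 50% zeros
	return v;
}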

The table below displays the results of running standard Visual Studio 2015 release builds without debugging. Measurements are averaged over 1,000 executions of the involved kernels. Measuring code directly surrounds the kernel calls. Outcomes have been checked for correctness by comparison with the outcome of the sequential algorithm above. All algorithms produce correct results.

| Implementation | Structured data (ms) | Rand() data (ms) |
|---|---|---|
| CPU method (C++) | 7.4 | 55.1 |
| Billeter, Olsson, Assarsson | 1.3 | 1.4 |
| Orange Owls (3 phases approach) | 2.6 | 2.6 |
| Spataro | 3.6 | 3.6 |
| Cuda Thrust 1.8 | 4.3 | 4.4 |
| Hughes, Lim, Jones, Knoll, Spencer | 112.3 | 112.8 |

So what do we see?

We see that the CPU code produces strongly varying results for the two input vectors. Parallel implementations do not suffer from this variance (or do not benefit from structure that is inherent in the data!).

The algorithm by Billeter et al. is at least twice as fast as the other algorithms. It is a step ahead of Spataro, Orange Owls, and Thrust.

Obviously, there is something wrong with Hughes et al.’s algorithm, or its implementation. According to the article [3], it should be faster than, or on par with Billeter et al.’s. Obviously, it isn’t. Inspection using the NVidia Visual Profiler shows that the threads are mainly (over 90%) ‘Inactive’, which explains its lack of performance.

Having read the articles referred to above, I decided to see if I myself could become world champion parallel stream compaction, by writing a new algorithm based on some ideas not found in the articles. So, could I be the new world champion? No. I got results in between Orange Owls’ and Spataro’s, but could not get any faster.

So, the software by Markus Billeter, Ola Olsson, and Ulf Assarsson is the fastest parallel stream compaction algorithm in the world; they are world champion stream compaction, and we first have to learn why exactly, before we can surpass it, if at all. The question then is:

“What makes Billeter, Olsson, and Assarsson’s parallel stream compaction Cuda program at least 2x as fast as its competitors?”

Next

The implementation of Billeter et al.’s algorithm is an optimized library. Optimized also with respect to maintenance: no duplicate code, which makes it fairly cryptic, thus hard to decipher its operational details. Next up is a general description of their program, and its main parameters. The algorithm has three main steps which will be discussed in subsequent posts. Along the way I hope to disclose why their code is at least twice as fast as the other algorithm implementations.

References

[1] Billeter M., Olsson O., Assarsson U.: Efficient Stream Compaction on Wide SIMD Many-Core Architectures. In: Proceedings of the Conference on High Performance Graphics 2009, pp. 159-166, New Orleans, Louisiana, August 01-03, 2009. ACM New York, NY, USA. (http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=4FE2F7D1EBA4C616804F53FEF5A95DE2?doi=10.1.1.152.6594&rep=rep1&type=pdf).

[2] Spataro Davide: Stream Compaction on GPU – Efficient implementation – CUDA (Blog 23-05-2015: http://www.davidespataro.it/cuda-stream-compaction-efficient-implementation/ ).

[3] Hughes D.M., Lim I.S., Jones M.W., Knoll A., Spencer B.: InK-Compact: In-Kernel Stream Compaction and Its Application to Multi-Kernel Data Visualization on General-Purpose GPUs. In: Computer Graphics Forum, Volume 32, Issue 6, September 2013, pp. 178-188. (https://github.com/tpn/pdfs/blob/master/InK-Compact-%20In-Kernel%20Stream%20Compaction%20and%20Its%20Application%20to%20Multi-Kernel%20Data%20Visualization%20on%20General-Purpose%20GPUs%20-%202013.pdf).

[4] Source code Billeter, Olsson, and Assarsson: https://newq.net/archived/www.cse.chalmers.se/pub/pp/index.html .

[5] Source code Spataro: https://github.com/knotman90/cuStreamComp

[6] Source code Hughes, Lim, Jones, Knoll, and Spencer: https://sourceforge.net/projects/inkc/

[7] Cuda 8 (September 2016 release, includes Thrust): https://developer.nvidia.com/cuda-toolkit .

[8] Orange Owls Solutions implementation of Spataro’s: https://github.com/OrangeOwlSolutions/streamCompaction

For those interested: I have prepared three programs that experiment with the Cuda stream compaction implementions of Billeter et al., Spataro and Orange Owls. You can request download links (executables and sources) by creating a comment.

Harder to C++: Monads for Mortals [7], Monad Composition

Early authors on monads, e.g. [Moggi], [Wadler], describe several types of monads, notably for side-effects, state, exceptions and IO. Of course we want to write programs that have all these features. There are, in principle, two possible approaches. First: write a separate monad each time we think up a new piece of non-pure-functional … (shall we dare call it functionality?). Second: compose different standard monads into larger wholes? In [part 2] of his excellent tutorial on monads Mike Vanier writes:

‘Functional programmers talk about “composability” all the time, with the implication that if some aspect of a programming language isn’t composable, it’s probably not worth much.’

So, clearly composition is key and we will see some monads composed into larger wholes here. To be clear: I don’t mean function composition, I mean monad composition. In C++. To be precise, I will define:

  • A monad that executes a function which maintains a state (side-effect).
  • A monad that catches and handles exceptions.
  • A monad that writes the result of function composition to console or an error message in case an exception occurred.

Then I will compose those monads into a larger whole, with preservation of imperative ‘functionality’. We will see that it actually works. Then I will show a very general imperative C++ function, that has the same functionality, and we will compare size and performance of both approaches.

Monad Composition

We will use a single monad template and instantiate it for a State type, an Exception type, and a std::cout wrapper. The monad template looks like this:

//V: wrapped value, S: state of some sort
template<typename V, typename S>
struct monad
{
	V value;
	S state;

	monad(const V& v) : value(v) {}
	monad(V& v) : value(std::move(v)) {}
	monad(const V& v, const S& s) : value(v), state(s) {}
	monad(V& v, S& s) : value(std::move(v)), state(std::move(s)) {}
	monad(V&& v, S&& s) : value(std::move(v)), state(std::move(s)) {}
	template<typename T, typename W>
	monad(const monad<T, W>& m) : value(m.value), state(m.state) {}
	monad(monad&& m) : value(std::move(m.value)), state(std::move(m.state)) {}
	~monad() {}
	monad& operator=(const monad& m)
	{
		if(this != &m)
		{
			value = m.value;
			state = m.state;
		}
		return *this;
	}
	monad& operator=(monad&& m)
	{
		if(this != &m)
		{
			value = m.value;
			m.value = V();

			state = m.state;
			m.state = S();
		}
		return *this;
	}
};

The template consists of a wrapped value and a field to hold state of some sort, e.g. actual state, an exception, or an io stream. The rest of the code is just what is generally referred to as ‘copy control’.

The ‘S’ parameter will be instantiated using three classes: State, Exception, and Cout:

struct State
{
	int count = 1;
	void update(const int seed) { count += seed; }
};
struct Exception : std::exception
{
	Exception() : message(""), errorCode(0) {}

	Exception(string msg, int errorCode) : message(msg), errorCode(errorCode) { }
	string ErrorMessage() const
	{
		stringstream ost;
		ost << message << " " << errorCode;
		return ost.str();
	}

private:
	string message;
	int errorCode;
};
struct Cout
{
	ostream& os;
	Cout() : os(cout) {}
};

Cout is a std::cout wrapper. We need it because io streams cannot be copied or assigned, but by the design of our monad template (S is not a reference or pointer) we need something that actually can be copied or assigned.

For the monad template we define three overloads of the bind function, one for each instantiation of the template:

// bind function overload: State
template<typename A, typename R>
monad<R, State> operator| (const monad<A, State>& mnd, monad<R, State>(*func)(const A&))
{
	auto tmp = func(mnd.value);
	tmp.state.update(mnd.state.count);
	return tmp;
}

// bind function overload: Exception
template<typename A, typename R>
monad<R, Exception> operator| (const monad<A, Exception>& mnd, monad<R, Exception>(*func)(const A&))
{
	try
	{
		return func(mnd.value);
	}
	catch(Exception& ex)
	{
		auto tmp = monad<R, Exception>(R(mnd.value));
		tmp.state = ex;
		return tmp;
	}
}

// bind function overload: Cout
template<typename A, typename R>
monad<R, Cout> operator|(const monad<A, Cout>& mnd, monad<R, Cout>(*func)(const A&))
{
	auto tmp = func(mnd.value);
	// Need to create a new tmp, because we cannot assign, copy ostream.
	// When initializing, we use a ref to ostream
	auto tmp2 = monad<R, Cout>(tmp.value, mnd.state);
	tmp2.state.os << tmp2 << endl;
	return tmp2;
}

In the last bind function we see the familiar output operator. In order to make it work, we need overloads of this operator for the monad template, and the three auxiliary classes (or ‘structs’ if you like):

// << overload for monad
template<typename V, typename S>
ostream& operator<<(ostream& os, const monad<V, S>& m)
{
	os << "Value: " << m.value << " State: " << m.state;
	return os;
}
// << overload for State
ostream& operator<<(ostream& os, const State& s)
{
	os << s.count;
	return os;
}
// << overload for Exception
ostream& operator<<(ostream& os, const Exception& e)
{
	if(e.ErrorMessage() != " 0")
		os << e.ErrorMessage();

	return os;
}
// << overload for Cout, more or less a dummy (but required)
ostream& operator<<(ostream& os, const Cout&)
{
	os << "Cout";
	return os;
}

The last thing we need is some functions that can be applied by the bind functions. These functions cannot be type agnostic: the return type differs from the parameter type. So we have three overloads, and they all do exactly the same thing. Let’s first define some handy type aliases:

typedef monad<int, State> is_monad;
typedef monad<is_monad, Exception> ise_monad;
typedef monad<ise_monad, Cout> isec_monad;

These typedefs nest a monad template instantiation, which is a type, within another. Now the functions to be used divide a number by 2:

is_monad divideby2(const int& i)
{
	return is_monad(i / 2);
}

ise_monad divideby2(const is_monad& m)
{
	if(m.value < 2)
		// Somewhat artificial
		throw Exception("Division by zero", -1);

	auto m2 = m | divideby2;
	return ise_monad(m2);
}

isec_monad divideby2(const ise_monad& m)
{
	auto m2 = m | divideby2;
	return isec_monad(m2);
}

As you can see, there is composition, but there is no generality. That is, monad composition is in fact ugly and painful and scaling is out of the question.

We run this, in a by now familiar way:

int _tmain(int argc, _TCHAR* argv[])
{
	{
		int i = 8;
		is_monad m1(i);
		ise_monad m3(m1);
		isec_monad m5(m3);

		auto m2 = m5 | divideby2 | divideby2 | divideby2 |divideby2;
	}

	cin.get();
	return 0;
}

and get the result we desired:

After a code exposé of just 200 lines :-).

A Very Ordinary C++ Function

The same result (well, without of course the funny output that so points out recursion) can be obtained by the following function:

void VeryOrdinary()
{
	int num = 8;
	int cnt = 1;

	try
	{
		for(unsigned int i = 0; i< 4; ++i)
		{
			num /= 2;
			++cnt;

			cout << "Value : " << num << " State: " << cnt << endl;

			if(num < 2)
				// Somewhat artificial
				throw Exception("Division by zero", -1);
		}
	}
	catch(Exception& ex)
	{
		cout << "Value : " << num << " State: " << cnt << " " << 					ex.ErrorMessage() << endl;
	}
}

We run it by simply calling it:

int _tmain(int argc, _TCHAR* argv[])
{
	SetConsoleTitle(L"Harder to C++: Monads [7]");

	VeryOrdinary();

	cin.get();
	return 0;
}

And the result is:

After a code exposé of 34 lines (it pays to remember the approximate factor of 6 for later).

Let’s compare the sizes of both approaches. Well, the monad composition is ridiculously much larger than the ordinary solution.

Performance

Performance measurements were set up in a way comparable to measurements in part 6: the code above is run a large number of iterations, and time is measured. This is done for 10 runs, and the average over the times is calculated. Before starting measurements, the monad composition is allowed a warming up run, to prevent an outlier. The default release build of the resulting program was run from a command console.

Now, time performance results are dominated almost completely by the use of std::cout and by throwing exceptions. If both are used, times are comparable. If, however, we do not log to the console and no exceptions are thrown (we set the starting value to 32), so that we measure the naked difference between the constructed code paths, the regular code takes only 1.6% of the time the monad solution takes; i.e. the regular solution is about 60 times as fast. See the image below for a typical performance run.

In this particular case the ordinary C++ function is 38.8 / 0.64 = 60.6 times as fast. The time it takes the regular function to execute is 100% x 0.64 / 38.8 = 1.6% of the time it takes the monad composition to execute. Please note that the variation in measured times is limited.

Conclusions

Let’s be brief. The code size and time performance of the monad composition are not in sensible proportion to size and performance of the regular function: 6 : 1 and 60 : 1 respectively. The code size of monad composition is ridiculous, monad composition kills C++ performance.

Wrap Up of Sequential Monads

We are now at the end of 7 installments of experimenting with monads, so let’s wrap up what we have so far. The question is: “Can monads be useful in day-to-day C++ programming?”. And the answer is “No”.

We have looked briefly at a C++ template meta programming (which constitutes a simple functional language) approach. The disadvantages are:

  • Complex syntax, hence diminished simplicity and clarity (one day your code will be maintained by another developer!)
  • Although it is possible to use this approach, the language does not support it. So it is unreasonably hard. (free after Stroustrup’s argument that OO development is possible in C, or assembly, but the language… )
  • There are no debugging facilities, which makes it very hard to develop substantial amounts of code as meta programs.

We have seen a simple and short implementation of the monad in mainstream C++. However, this implementation has the drawback that its performance dies if it is applied to large data structures.

The drawback could be remedied by replacing the unit function by a complete copy control set for the monadic template.

Lazy monads can be elegantly implemented as well, but they have no performance whatsoever, due to the fact that code is evaluated twice, once to create the expression to evaluate, and another time to indeed evaluate the expression.

Furthermore, we have seen, in this episode that if we want to write somewhat larger programs using monad composition, we get very large programs without performance.

I think I’m done now with at least the sequential monad. I have no further interest. It is tedious, needlessly complicated, and leading to inefficient results when trying to construct meaningful programs from different monads in C++.

I wrote ‘sequential monad’ above, because there is still the continuation monad to investigate; the suggested solution to Callback Hell in concurrency contexts. So, for starters, what exactly is Callback Hell? We will see in part 8.

A GPU Bilateral Filter Implementation

This post reports on a bilateral filter implementation that improves processing time from 32ms to 0.25ms.

Introduction

The Kinect (for Windows) depth data are subject to some uncertainty that comes with its resolution. Depth estimates are defined in millimeters, and typically, subsequent depth measurements by the Kinect vary by a fixed amount.

Consider the graphs below. The x-axis counts the number of measurements, the y-axis represents distance measurements of a single point. The top graph shows connected dots, the lower graph shows just the dots.

The graphs show two tendencies. One is that measurements lie one unit above or one unit below the average practically all of the time; the second is that the average changes a bit before it stabilizes. Here we see it change from about 3.76 m via 3.8 m to about 3.84 m.

If the Kinect depth data is projected onto an image this variation translates into a nervous jitter. Since I do not particularly care for a nervous jitter, I would like to stabilize the depth data a bit.

Stabilizing Kinect Depth Data – Temporal Approach

The Kinect for Windows SDK (1.6) contains a whitepaper on skeletal joint smoothing. The paper deals with the reduction of noise in the Kinect skeletal tracking system. This tracking system employs the same depth data, and therefore suffers from the same problem.

The proposed solution is to filter the data over time. The depth measurement z(x,y)(t) of a location (x, y) at time t can be averaged over a number of measurements in the past at the same location: z(x,y)(t-i) where i is in [1, n]. The suggestion is to take n not too large, say 5.
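In its simplest form such a temporal filter is just a moving average over the previous n frames; in formula form:

\bar{z}_{(x,y)}(t) = \frac{1}{n} \sum_{i=1}^{n} z_{(x,y)}(t-i)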

Averaging can also be over measurements in the future. This implies that one or two frames are included in averaging before an image based on the depth image is rendered, hence there is a latency in rendering equal to the number of ‘future’ frames included in averaging. The advantage of considering the ‘future’ is that if the measured scene changes (or a player changes position – in skeletal tracking), another type of averaging can be applied, one that is better suited for changes and e.g. puts a heavier weight on recent measurements.

I’ve done an experiment with temporal filtering, but it was not satisfactory. The fast and nervous jitter just turns into a slower one that is even more disturbing because short periods of stability make changes seem more abrupt.

Stabilizing Kinect Depth Data – Spatial Approach

Another approach is not to average over measurements at the same location through time, but to average within one frame, over several proximate measurements. A standard solution for this kind of filtering is the Bilateral filter. The Bilateral Filter is generally attributed to Carlo Tomasi and Roberto Manduchi. But see this site where it is explained that there were several independent discoveries.

The idea behind the Bilateral Filter is that the weight of a measurement in the average is a Gaussian function of both the distance and the similarity (in color, intensity, or, as in our case, depth value). The similarity term prevents edges from being ‘averaged out’.
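Written as a formula (the standard form of the bilateral filter), the filtered value at pixel p is a normalized, weighted sum over a neighborhood S, with one Gaussian G_{\sigma_s} on spatial distance and one Gaussian G_{\sigma_r} on the difference in value (depth, in our case):

BF[I]_p = \frac{1}{W_p} \sum_{q \in S} G_{\sigma_s}(\lVert p - q \rVert) \, G_{\sigma_r}(\lvert I_p - I_q \rvert) \, I_q,
\qquad W_p = \sum_{q \in S} G_{\sigma_s}(\lVert p - q \rVert) \, G_{\sigma_r}(\lvert I_p - I_q \rvert)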

The Bilateral Filter works well; the only drawback it has is its computational complexity: O(N^2), where N is the (large!) number of pixels in the image. So, several people have been working on fast algorithms to alleviate the computational burden. To me it seems that Ben Weiss provided a good solution, but it is not generally available. The solution by Frédo Durand and Julie Dorsey (2002), and the elaboration of this work by Sylvain Paris and Frédo Durand (2006), all from MIT, seems to be the leading solution, and is generally available – both the theory and example software. Their method has a project site that is here.

In a nutshell, the method by Sylvain Paris and Frédo Durand reduces processing time by first downsampling the image, then applying a convolution to compute the averages, and finally scaling the image up again while clamping out-of-range values. So, in essence, it operates on a (cleverly) reduced version of the image.

I’ve downloaded and compiled the software – the really fast version with the truncated kernel – and it requires about 0.032s to process a ppm image of 640×480 pixels (grayscale values), where the spatial neighborhood is set to 16 (pixels) and the ‘similarity’ neighborhood is set to 0.1, so grayscale colors that differ more than 0.1 after transformation to normalized double representation, are not considered in the average. See the image below for a screen shot.

The processing time is, of course, computer dependent, but my pc is not really slow. Although 32ms is a fine performance, it is too slow for real-time image processing. The Kinect produces a frame 30 times per second, i.e. every 33ms, and we do not want to create a latency of about one frame just because of the Bilateral Filter.

GPU implementation: C++ AMP

In order to improve on the processing time of this fast algorithm I’ve written a C++ AMP program inspired by the CPU implementation, this program runs on the GPU, instead of on the CPU. For information on C++ AMP, see here and here. What I think is great about AMP is that it provides a completely general access to General Purpose GPU computing. Having said that, I must also warn the reader that I do not master it to the degree that I could guarantee that my implementation of the Bilateral Filter in C++ AMP is representative of what could be achieved with C++ AMP.

The result of my efforts is that the ppm image above can now be processed in little over 1 ms. Consider the picture below, made with my ATI Radeon HD 5700 Graphics card.

What you see here is a variety of timings of the computational phases. The top cycle takes 1.1 ms, the middle one takes 1.19 ms, and the bottom cycle takes 1.07 ms. So, what is in the cycle?

1. The image is loaded into the GPU, and data structures are initialized. If you want to know more on ‘warming up’ the data and the code, see here. Since it takes 0.5 to 0.6 ms, it is obviously the bottleneck.

2. Down sampling the image to a smaller version takes around 0.1 ms.

3. Computing the convolution takes 0.35 ms. This is the real work.

4. Up scaling and clamping takes again 0.1 ms.

A processing time of about 1 ms is satisfactory as a real-time processing time. Moreover, since we may assume the data is already in GPU memory (we need it there to render it to the screen), GPU upload time is not an attribute of an application of the Bilateral Filter in this context. So we may think of the processing time as being about 0.55 ms, which is absolutely fabulous.

New Graphics Card

At about this time, I bought a new graphics card, an Asus NVidia GTX690 (which for the purposes of this application yields the same results as a GTX 680, I know). This card was installed in my pc. Ok, I didn’t buy a new motherboard, so data is still being uploaded through PCI-e 2.0 and not through PCI-e 3.0 16x (but in time…). So, will this make a difference? Yes, it does. Look at the screen shot below.

I rearranged the timings a bit, to gain better oversight. We see that:

1. Data uploading and the warming up process now takes about 0.45 ms.

2. Filtering now takes about 0.25 ms.

From 32ms to 0.25ms. Most satisfying!

Vector-Matrix Inner Product with Compute Shader and C++ AMP

Large vector-matrix inner products by the GPU are 250 times faster than straightforward CPU implementations on my PC. Using C++ AMP or a Compute Shader, the GPU realized a performance of over 30 gFLOPS. That is a huge increase, but my GPU has a “computational power” (whatever that may be) of 1 teraFLOP, and 30 gFLOPS is still a long way from 1000 gFLOPS.

This article presents a general architectural view of the GPU and some details of a particular exemplar: the Ati Radeon HD5750. Then code examples follow that show various approaches to large vector-matrix products. Of course, the algorithm at the end of the article is the fastest. It is also the simplest.

Unified View of the GPU Architecture

Programming the GPU is based on an architectural view of the GPU. The purpose of this architectural view is to provide a unified perspective on GPUs from various vendors, hence with different hardware setup. It is this unified architecture that’s being programmed against using DirectX11. A good source of information on Direct Compute and Compute Shaders is the Microsoft Direct Compute BLog. The architecture described below is based on information from Chas Boyd’s talk at PDC09, as published on Channel9. Of course, this blog post only presents some fragments of the information found there.

A GPU is considered to be built from a number of SIMD cores. SIMD means: Single Instruction Multiple Data. By the way, the pictures below are hyperlinks to their source.

The idea is that a single instruction is executed on a lot of data, in parallel. The SIMD processing unit is particularly fit for “data parallel” algorithms. A GPU may consist of 32 SIMD cores (yes, the image shows 40 cores) that access memory with 32 floats at a time (128 bit bus width). Typically the processor runs at 1Ghz, and has a (theoretical) computational power of about 1 TeraFLOP.

A SIMD core uses several kinds of memory:

  • 16 Kbyte of (32-bit) registers. Used for local variables
  • 8 Kbyte SIMD shared memory, L1 cache.
  • L2 cache

The GPU as a whole typically has 1 Gb of general RAM. Memory access bandwidth is typically of the order of 100 GB/s.

Programming Model

A GPU is programmed using a Compute Shader or C++ AMP. Developers can write compute shaders in HLSL (looks like C) to be executed on the GPU. AMP is a C++ library. The GPU can run up to 1024 threads per SIMD. A thread is a line of execution through code. The SIMD shared memory is shared among the threads of a SIMD. It is programmable in the sense that you can declare variables (arrays) as “groupshared” and they will be stored in the Local Data Share. Note, however, that over-allocation will spill the variables to general RAM, thus reducing performance. Local variables in shader code will be stored in registers.

Tactics

The GPU architecture suggests programming tactics that will optimize performance.

  1. Do your program logic on the CPU, send the data to the GPU for operations that apply to (about) all of the data and contain a minimal number of alternative processing paths.
  2. Load as much data as possible into the GPU general RAM, so as to prevent the GPU waiting for data from CPU memory.
  3. Declare registers to store isolated local variables
  4. Cache data that you reuse in “groupshared” Memory. Don’t cache data you don’t reuse. Keep in mind that you can share cached data among the threads of a single group only.
  5. Use as many threads as possible. This requires that you use only small amounts of cache memory per thread.
  6. Utilize the GPU as efficiently as possible by offering it many more threads than it can process in a small amount of time.
  7. Plan the use of threads and memory ahead, then experiment to optimize.

Loading data from CPU memory into GPU memory passes the PCIe bridge which has a bandwidth, typically of order 1GBit/s; that is, it is a bottleneck.

So, you really want to load as much data as possible into GPU memory before executing your code.

The trick in planning your parallelism is to chop up (schedule, that is 🙂) the work into SIMD-sized chunks. You can declare groups of threads; the size of the groups and the number of groups. A group is typically executed by a single SIMD. To optimize performance, use Group Shared Memory, and set up the memory consumption of your thread group so it will fit into the available Group Shared Memory. That is: restrict the number of threads per group, and make sure you have a sufficient number of groups. Thread groups are three dimensional. My hypothesis at this time is that it is best to fit the dimensionality of the thread groups to match the structure of the end result. More about this below. Synchronization of the threads within a thread group flushes the Group Shared Memory of the SIMD.

A register typically has a lifetime that is bound to a thread. Individual threads are member of several groups – depending on how you program stuff. So, intermediate results aggregated by thread groups can be stored in registers.

Does My ATI Radeon HD5750 GPU Look Like This Architecture… A Bit?

The picture below (from here) is of the HD5770, which has 10 SIMD cores, one more than the HD5750.

What do we see here?

  • SIMD engines. We see 10 cores for the HD5770, but there are 9 in the HD5750. Each core consists of 16 red blocks (streaming cores) and 4 yellow blocks (texture units).
  • Registers (light red lines between the red blocks).
  • L1 Textures caches, 18Kbyte per SIMD.
  • Local Data Share, 32 Kbyte per SIMD.
  • L2 caches, 8 Kbyte each.

Not visible is the 1Gb general RAM.

The processing unit runs at 700 MHz, memory runs at 1,150 MHz; overclocking is possible, however. The computational power is 1.008 TeraFLOP. Memory bandwidth is 73.6 GB/s.

So, my GPU is quite a lot less powerful than the reference model. At first, a bit disappointing but on the other hand: much software I write for this GPU cannot run on the PCs of most people I know – their PCs are too old.

Various Approaches to Vector-Matrix Multiplication

Below we will see a number of approaches to vector-matrix multiplication discussed. They will include measurements of time and capacity. So, how do we execute the code and what do we measure?

Times measured include a number of iterations that each multiply the vector by the matrix. Usually this is 100 iterations, but fast alternatives get 1000 iterations. The faster the alternative, the more we are interested in variance and overhead.

Measurements:

  • Do not include data upload and download times.
  • Concern an equal data load, 12,288 input elements if the alternative can handle it.
  • Correctness check; computation is also performed by CPU code, reference code.
  • Run a release build from Visual Studio, without debugging.
  • Allow AMP programs get a warming up run.

Vector-Matrix Product by CPU: Reference Measurement

In order to determine the performance gain, we measure the time it takes the CPU to perform the product. The algorithm, hence the code is straightforward:

In this particular case rows = cols = 12,288. The average over 100 runs is 2,452 ms, or 2.45 seconds. This amounts to a time performance of 0.12 gFLOPS (giga FLOPS: FLoating point Operations Per Second). We restrict floating point operations to addition and multiplication (yes, that includes subtraction and division). We calculate gFLOPS as:

gFLOPS = 2 x Rows x Cols / (ms x 10^6), where ms is the average time in milliseconds.

The result of the test is correct.
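The measured CPU code itself is not reproduced here; as a point of reference, a straightforward product could look like the following sketch (my own code and names, with the matrix stored row-major as a flat array; not necessarily identical to the measured code):

// y = A * x, with A a rows x cols matrix: 2 * rows * cols floating point operations
void cpu_vector_matrix_product(const float* A, const float* x, float* y,
                               int rows, int cols)
{
	for (int r = 0; r < rows; ++r)
	{
		float sum = 0.0f;
		for (int c = 0; c < cols; ++c)
		{
			sum += A[r * cols + c] * x[c];   // one multiply and one add per element
		}
		y[r] = sum;
	}
}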

Parallel Patterns Library

Although this blog post is about GPU performance, I took a quick look at PPL performance. We then see a performance gain of a factor 2, but the result is incorrect, that is, the above code leads to indeterminacy in a parallel_for loop. I left it at that, for now.

Matrix-Matrix Product

We can of course, view a vector as a matrix with a single column. The C++ AMP documentation has a running code example of a matrix multiplication. There is also an accompanying compute shader analog.

AMP

To the standard AMP example I’ve added some optimizing changes, and measured the performance. The AMP code looks like this:

Here: amp is an alias for the Concurrency namespace. The tile size TS has been set to 32, which is the maximum; the product of the dimensional extents of a compute domain should not exceed 1024. The extent of the compute domain has been changed to depend on B, the matrix, instead of the output vector. The loop that sums element products has been unrolled in order to further improve performance.

As mentioned above, we start with a warming up. As is clear from the code we do not measure data transport to and from the GPU. Time measurements are over 100 iterations. The average run time obtained is 9,266.6 ms, hence 0.01 gFLOPS. The result after the test run was correct.

The data load is limited to 7 * 1024 = 7,168; that is, 8 * 1024 is unstable.

Compute Shader

The above code was adapted to also run as a compute shader. The code looks like this:

The variables Group_SIZE_X and Group_SIZE_Y are passed into the shader at compile time, and are set to 32 each.

Time measurements are over 100 iterations. The average run time obtained is 11,468.3 ms, hence 0.01 gFLOPS. The result after the test run was correct. The data load is limited to 7 * 1024 = 7,168; that is, 8 * 1024 is unstable.

Analysis

The performance of the compute shader is slightly worse than that of the AMP variant. Analysis with the Visual Studio 11 Concurrency Visualizer shows that, in the case of the compute shader program, the work on the GPU is executed in small spurts, separated by small periods of idleness, whereas in the AMP program the work is executed by the GPU in one contiguous period of time.

Nevertheless, performance is bad, worse than the CPU alternative. Why? Take a look at the picture below:

For any value of t_idx.global[0] – which is based on the extent of the matrix – that is unequal to zero, vector A does not have a value. So, in fact, if N is the number of elements in the vector, we do O(N^3) retrievals but only O(N^2) computations. So, we need an algorithm that is based on the extent of a vector, say the output vector.

Vector-Matrix Product

Somehow, it proved easier to develop the vector-matrix product as a compute shader. This is in spite of the fact that unlike AMP, it is not possible (yet?) to trace a running compute shader in Visual Studio. The idea of the algorithm is that we tile the vector in one dimension, and the matrix in two, thus obtaining the effect that the vector tile can be reused in multiplications with the matrix tile.

Compute Shader

A new compute shader was developed. This compute shader caches vector and matrix data in Group Shared memory. The HLSL code looks like this:

This program can handle much larger amounts of data. Indeed, it runs problem free for an input vector of 12,288 elements and a total data size of 576 Mbyte. The time performance is 10.3 ms per run, averaged over 1,000 runs, which amounts to 29.3 gFLOPS. The result of the final run was reported to be correct.

AMP

In analogy to the compute shader above I wrote (and borrowed 🙂 ) a C++ AMP program. The main method looks like this:

The matrix is a vector with size * size elements. The tile size was chosen to be 128, because that setting yields optimal performance. The program was run on an input vector of 12,288 elements again, with a total data size of 576 Mbyte. The time performance is 10.1 ms per run, averaged over 1,000 runs, which amounts to 30.0 gFLOPS. The result of the final run was reported to be correct.

Analysis

We see here that the performance has much improved. When compared to the reference case, we can now do it (in milliseconds) 2,452 : 10.1 = 243 : 1, hence 243 times faster.

Simpler

Then, I read an MSDN Magazine article on AMP tiling by Daniel Moth, and it reminded me that caching is useless if you do not reuse the data. Well, the above algorithm does not reuse the cached matrix data. So I adapted the Compute Shader program to retrieve matrix data from central GPU memory directly. The HLSL code looks like this:

Note the tileSize of 512 (!). This program was run for a vector of 12,288 elements and a total data size of 576 Mbyte. The time performance is again 10.3 ms for a multiplication, which amounts to 29.3 gFLOPS (averaged over 1,000 runs). The result of the final run was reported to be correct. So, indeed, caching the matrix data does not add any performance improvement.

AMP

For completeness, the AMP version:

Time performance is optimal for a tile size of 128, in case the number of vector elements is 12,288. We obtain an average run time of 9.7 ms (averaged over 1,000 runs), and a corresponding 31.1 gFLOPS. The result of the final run was correct. This program is 2452 / 9.7 = 252.8 times as fast as the reference implementation.

Conclusions

Developing an algorithm for the vector-matrix inner product has demonstrated comparable performance for Compute Shaders and AMP, but much better tooling support for AMP: we can step through AMP code while debugging, and the Concurrency Visualizer has an AMP line. This better tool support helped very well in analyzing the performance of a first shot at the algorithm. The final algorithm proved over 250 times faster than a straightforward CPU program for the same functionality.

Detailed knowledge of the GPU architecture, or the hardware model, proved of limited value. When trying to run the program with either the maximum number of threads per group, or the maximum amount of data per Group Shared Memory, I ran into parameter value limits, instabilities, performance loss, and incorrect results. I guess you will have to leave the detailed optimization to the GPU driver and to the AMP compiler.

One question keeps bothering me though: Where is my TeraFLOP?

I mean, Direct Compute was introduced with the slogan “A teraFLOP for every one of us”, AMP is built on top of Direct Compute, and my GPU has a computational power of 1.008 TeraFLOP. Am I not ‘one of us’?