This is part 2 of a series of blog posts on Parallel Stream Compaction. In part 1 we compared a small number of fast implementations. The implementation by Billeter, Olsson, and Assarsson came out as the fastest. In a digression on different graphics cards we saw that this performance order holds across cards and across generations. Now we want to know why the implementation by Billeter et al. is so distinctly faster than the other two.
In this part we measure the time performance of the individual steps that make up the algorithms, and discuss the results before digging deeper in further episodes. The code we use here is the code discussed in part 1.1: the digression comparing various graphics cards.
Measuring time performance of the three processing steps
All algorithms discussed in part 1 have three sequential steps, where each step is a Cuda kernel:
 Count the number of valid entries. The Count step.
 Determine the global offsets in the output for the entries. The Prefix step.
 Compact the valid entries into the output using the offsets. This is the Move step.
So, the first two steps provide metadata, and the third step uses the metadata to compact the stream.
The time performance of the separate steps may show us bottlenecks in the implementations.
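To fix the three-step structure in mind, here is a minimal sequential sketch in C++, with each "chunk" standing in for the work of one warp or thread block. The function name and the chunking scheme are illustrative, not taken from any of the benchmarked implementations.

```cpp
#include <algorithm>
#include <vector>

// Sequential sketch of the three-step compaction pipeline.
std::vector<int> compact(const std::vector<int>& in, int chunks,
                         bool (*valid)(int))
{
    long n = (long)in.size();
    long per = (n + chunks - 1) / chunks;   // chunk length, rounded up

    // Step 1 (Count): number of valid entries per chunk.
    std::vector<long> count(chunks, 0);
    for (int c = 0; c < chunks; ++c)
        for (long i = c * per; i < std::min(n, (c + 1) * per); ++i)
            if (valid(in[i])) ++count[c];

    // Step 2 (Prefix): an exclusive scan of the counts gives each
    // chunk its global write offset in the output.
    std::vector<long> offset(chunks, 0);
    for (int c = 1; c < chunks; ++c)
        offset[c] = offset[c - 1] + count[c - 1];

    // Step 3 (Move): each chunk copies its valid entries to the
    // output, starting at its own offset.
    std::vector<int> out((size_t)(offset.back() + count.back()));
    for (int c = 0; c < chunks; ++c) {
        long w = offset[c];
        for (long i = c * per; i < std::min(n, (c + 1) * per); ++i)
            if (valid(in[i])) out[w++] = in[i];
    }
    return out;
}
```

On the GPU the three steps are separate kernels, and steps 1 and 3 run their chunks in parallel; only the metadata (counts and offsets) couples them.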
We benchmark the individual steps with structured data (see part 1) on a GeForce GTX 690. We use the software from part 1.1, but restrict the size to 2^24. In the case of the Orange Owls and Spataro implementations, we restrict the threading configuration to 1024 threads per threadblock. This latter choice came out as the fastest thread configuration for the specified input length.
Processing time is measured as follows. The individual steps are measured indirectly: first we measure step 1, then we measure steps 1 and 2 together and subtract step 1, and so on. Time is measured in microseconds (µs = 10^-6 seconds). Reported times are averages over 1,000 invocations.
We then see the following results.
                  Count (µs)   Prefix (µs)   Move (µs)   Total (µs)
Billeter et al.        455.5           7.3       820.0       1282.8
Orange Owls            893.6          47.2      1665.8       2606.6
Spataro                946.6          86.6      2107.9       3141.1
Let’s discuss this a bit.
The Prefix Step
The prefix step by Billeter et al. is much faster than the prefix step in the other two implementations. Curiously, both Orange Owls and Spataro employ the thrust::exclusive_scan function. This function is part of the Thrust library that is packaged and distributed with the Cuda SDK(!). Ok, we already know from part 1 that Thrust is not the ultimate in performant GPU software, but the difference with Billeter et al. is too large to be explained by that alone.
We will discuss Billeter et al.’s prefix step in a later episode in detail.
The Count and Move steps: loops
A major difference between the implementation by Billeter et al. and the other two is that the former employs loops.
Loops, as in sequential processing? In massively parallel software? And then having a performance advantage??? Yes!
Here is a count function from the source code by Billeter et al. that loops through an array.
template< class Pred >
static _CHAG_PP_DEV SizeType count(
    const T* aStart, const T* aEnd,
    const Pred& aPred,
    volatile SizeType* aSmReduce )
{
    SizeType partialCountA = 0u, partialCountB = 0u;

    // Each lane strides through the sequence in steps of the warp size.
    for( const _Tp* elem = _Tpp(aStart) + _Env::lane();
         elem < _Tpp(aEnd); elem += _Env::SIMD )
    {
        _Tp e = *elem;
        if( aPred( e.x ) ) ++partialCountA;
        if( aPred( e.y ) ) ++partialCountB;
    }

    SizeType sum = partialCountA + partialCountB;
    return _Unit::reduce( sum, op::Add<SizeType>(), aSmReduce );
}
Notice the for loop, it defines how each thread of a warp iterates through the input from aStart (plus the offset for the thread) up to aEnd, with increments of the warp size (so threads do not get in each other’s way).
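The access pattern of that loop can be modeled sequentially. In the sketch below (names are mine), lane l of a warp visits elements l, l + SIMD, l + 2*SIMD, and so on, so the lanes of a warp cover the sequence without overlap; on the GPU, SIMD is the warp size (32) and the per-lane partial counts are summed afterwards by a warp reduce.

```cpp
#include <vector>

// Sequential model of one lane's warp-strided traversal.
int countLane(const std::vector<int>& seq, int lane, int simd,
              bool (*pred)(int))
{
    int partial = 0;
    for (long i = lane; i < (long)seq.size(); i += simd)
        if (pred(seq[i])) ++partial;
    return partial;   // a warp reduce sums the lanes' partials
}
```

Summing countLane over all lanes of the warp gives the full count for the sequence, which is exactly what the _Unit::reduce call in the original code does.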
Here is the count function from Orange Owls. No loops at all.
template <typename T, typename Predicate>
__global__ void computePredicateTruePerBlock(
    const T * __restrict__ d_input, const int N,
    int * __restrict__ d_BlockCounts, Predicate predicate)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N)
    {
        int pred = predicate(d_input[tid]);
        int BC = __syncthreads_count(pred);
        if (threadIdx.x == 0)
        {
            d_BlockCounts[blockIdx.x] = BC;
        }
    }
}
What does this code do? First it determines a global index for the current thread (tid). Then, if the index denotes a location in the input, it evaluates the predicate for the designated element: 1 if it holds, 0 if it doesn't. The call to __syncthreads_count then sums these outcomes over all threads in the thread block, and thread 0 stores the result for the block.
Default threading configuration in Billeter et al.
The threading configuration in Billeter et al. depends on compile-time parameters. The default configuration for step 1 (Count) and step 3 (Move) is 120 blocks of 128 threads each. This is surprisingly low! Can we even speak of data parallelism? The 128 threads of a block are subdivided into 4 warps of 32 threads: dim3(32, 4, 1). In contrast, the Prefix step uses 480 threads in 1 block.
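As a launch-configuration fragment, the shapes described above look like this. The kernel names here are hypothetical placeholders; the actual values in the Billeter et al. source come from compile-time parameters.

```cuda
dim3 block(32, 4, 1);    // 4 warps of 32 lanes = 128 threads per block
dim3 grid(120, 1, 1);    // 120 blocks -> 480 warps in total

countKernel<<<grid, block>>>( /* ... */ );   // step 1; step 3 uses the same shape
prefixKernel<<<1, 480>>>( /* ... */ );       // step 2: one block of 480 threads
```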
The total number of threads (128 x 120) is far less than the size of the input stream. So, the algorithm employs sequences: ordered, contiguous, non-overlapping subarrays; an order-preserving concatenation of the sequences equals the input stream.
Number of sequences
The parallelism in GPU computing is per warp. So, the number of sequences in this algorithm is the total number of warps, which is the number of warps per threadblock x the number of blocks (4 x 120 = 480). The Count step counts the valid entries of the input per sequence and outputs an array of the counts, just as the Move step moves the valid entries to the output per sequence. The Prefix step takes as input the output of the Count step which has length 480. Since the Prefix step has 480 threads, there are no sequences in the Prefix step.
Note that the small numbers of threadblocks and threads per block result in small metadata structures: the counts of valid elements per sequence, and the global offsets per sequence. I think that having small metadata structures is a performance factor in itself.
Size of the sequences
Billeter et al. compute the size of the sequences using a rather complicated recipe:
 Determine the size of a data block a warp processes in one call. For instance, if we use the uint2 short vector type (of 2 unsigned ints), this is the warp size x the number of unsigned ints in a uint2. Hence, 32 x 2 = 64.
 Compute the basic sequence size as the stream size divided by the data block size, and then divided by the number of sequences. Note that these divisions are integer divisions. In our case: 2^24 / 64 = 262,144, then 262,144 / 480 = 546 with a remainder of 64.
 Distribute the remainder R over the sequences: the first R sequences each get one extra data block.
So, for our input of size 2^24, the first 64 sequences have size 35,008 elements (547 data blocks), and the other 480 – 64 = 416 sequences have size 34,944 elements (546 data blocks).
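The recipe can be checked with a short C++ function (the name and signature are mine, not from the source): divide the stream into data blocks, give every sequence the base number of blocks, and hand one extra block to each of the first R sequences.

```cpp
#include <vector>

// Sequence sizes (in elements) per the recipe above.
// dataBlock = warp size * elements per load (32 * 2 = 64 for uint2).
std::vector<long> sequenceSizes(long n, int numSeq, int dataBlock)
{
    long blocks = n / dataBlock;    // integer division
    long base = blocks / numSeq;    // data blocks per sequence
    long rem = blocks % numSeq;     // remainder R

    std::vector<long> sizes(numSeq, base * dataBlock);
    for (long s = 0; s < rem; ++s)
        sizes[s] += dataBlock;      // first R sequences get one extra block
    return sizes;
}
```

For n = 2^24, 480 sequences, and a data block of 64 elements, this reproduces the numbers in the text: the first 64 sequences get 35,008 elements, the remaining 416 get 34,944, and the sizes sum to 2^24 exactly.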
The code is flexible and can work with arbitrary-length input streams. If the length of an input array is not a multiple of the warp size, the code calculates a range of auxiliary input elements and processes these separately. I will argue against this solution elsewhere; for now we can ignore this facility, since the (large) size we use for the experiments is a power of 2 – hence a multiple of 32.
Next
If we want to conclude that the performance difference between the implementation of Billeter et al. on the one hand, and Orange Owls and Spataro on the other hand is indeed the use of sequences, we should be able to replicate this effect in a simpler setting. So, next up is a comparison of a standard copy algorithm and a variant that uses sequences. Would that reveal the difference?
References
For references to articles and software download sites, see part 1 and part 1.1 of this blog post collection.