Archive for the ‘Crazy on Cuda’ Category

This is nr. 3 of some blog posts on Parallel Stream Compaction. In previous posts we saw that the implementation by Billeter et al. is a factor faster than implementations by Orange Owls and Spataro, and that a multiplicative relation exists, even when using different graphics cards. We wondered if the essential difference between the implementations is that the former uses loops that process the input in subsequences, and the latters do not.

In this post we will examine loops and sequences in a Cuda copy program, a simpler context than stream compaction. The program takes a long array and returns a copy. Simple as that. The question here is if the use of loops can account for the difference in performance mentioned above, and if so, can we say something about optimal parameter configurations.

This investigation compares loops and subsequences set up according to Billeter et al. to not using loops at all, “grid stride loops” as advocated by Nvidia [1], and warp stride loops which is a generalization of the way Billeter et al. organize loops. For the latter three algorithms we will do a parameter search to (by and large) optimize parameter settings. Recall that the former algorithm has fixed parameters for the number of threads per thread-block and the number of blocks

A link to sample code can be found at the bottom of this post. (not yet, but on its way!)

No loops, Grid stride loops, Billeter et al. loops, and Warp stride loops

The task will be to copy an array of 2^24 elements. The elements are chosen using c++ rand();

No loops

If we do not use loops or sequences we can copy the array with a function like the one below. It is a template, so we can easily experiment with different types of arguments.

__global__ void device_copy_kernel1(T* d_in, T* d_out)
	// Determine global index into the arrays
	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	// Copy the element
	d_out[idx] = d_in[idx];

I used arrays of unsigned int in the experiments described here.

Grid stride loops

When using grid stride loops, the step size equals the entire grid of threads. The size of a grid is

number-of-threads-per-block x number-of-blocks

i.e., the entire number of threads used by the program. We can choose the total number of threads freely, but if we also choose wisely, it is a divisor of the array size, and we choose the block size to be a multiple of the warp size: 32 for current hardware. The maximum block size is 1024.

A simple function to copy an array using grid stride loops is the following.

__global__ void device_copy_kernel2(T* d_in, T* d_out)
	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	for (int i = idx; i < ARRAY_SIZE; i += blockDim.x * gridDim.x)
		d_out[i] = d_in[i];

Billeter et al. loops, Warp stride loops

Billeter et al. have set the number of threads per block to 128, and the number of blocks to 120. They also have a complex way to cut up an array into subarrays. Well, for now I will cut up the array in equal sized chunks. If performance results are close, there is always the option to go into more complexity.

A distinguishing feature of the Billeter et al. loop is that is has a warp size stride. We might find it interesting to explore this feature further, so let’s use the following function.

__global__ void device_copy_kernel3(const T* d_in, T* d_out, const unsigned int seq_size)
	const auto idx = blockIdx.x * blockDim.x + threadIdx.x;

	// index within warp, the name lane refers to the hardware implementation
	const auto lane = threadIdx.x % WARP_SIZE;
	// global index for warp, sequence
	const auto sidx = idx / WARP_SIZE;

	// set sequence begin and end
	const auto seq_begin = seq_size * sidx;
	const auto seq_end = seq_begin + seq_size;

	// copy the sequence
	for (auto i = seq_begin + lane; i < seq_end; i += WARP_SIZE)
		d_out[i] = d_in[i];

the seq_size parameter is defined as:

const unsigned int seq_size = ARRAY_SIZE / warp_cnt;

where warp_cnt is:

const unsigned int warp_cnt = (threads * blocks) / WARP_SIZE; // WARP_SIZE = 32


Time measurements are averaged over 1000 calls of the kernel; Visual Studio release builds that are run without debugging, on an Asus Geforce GTX 690.

No loops

For the No loops case we cases we let the number of threads per block run from 32 – 1024 in steps of 32, the number of blocks is just ARRAY-SIZE / number-of-threads-per-block.

This gives us the following graph (I have subtracted 850 µs from each time measurement, in order to enhance the differences; all time measurements are larger than 850 µs).

There is a global minimum around the fourth measurement, i.e. in this experiment, the fourth measurement provides the best time performance.

Grid stride loops

The Grid stride loop case has a huge number of possible parameter values, so we do a somewhat coarse search. The set of numbers of threads per block is {64, 128, 256, 512, 1024}. The parameter space for the number of blocks is divided into 32 values (where we disregard the case for zero blocks).

This gives us the following graph (I have again subtracted 850 µs from each time measurement).

The are several, let’s say, regional minima: the 47th, 78th, 109th, and 139th measurement.

Measurement nr.



Time (µs)

















Interestingly, this is a very regular series. The last measurement is just 1 step before the step with 8192 blocks (2 x 8192 = 16384).

Warp stride loops, Billeter et al. loops

Nvidia hardware executes a warp of threads in parallel on a single multiprocessor. So it makes sense to set up sequences that are processed by a single warp of threads. Such sequences are processed by warp stride loops. The number of warps then determines the number of sequences, hence their size. NVidia warns not to write programs that critically depend on the size of a warp. So, production software that uses the warp size as a parameter should always query the hardware at runtime to get the actual warp size (for the overseeable past and future expect it to be either 16, 32, or 64).

Billeter et al. have a warp stride loop with 120 blocks of 128 threads. I have measured the time performance and registered it in the table in the Summary section.

The time performance is not ultimately impressive: 1804.7 µs; the slowest option in the pack .

In order to see if the warp stride loop variant has any merits, I performed a parameter search, similar to the grid stride loop case. Note that the array length, the warp size, the number of threads per block, and the number of blocks together completely determine the size and number of sequences.

This resulted in the following graph (each measurement got again 850 subtracted).

The 47th measurement has the shortest time: 863.7 µs.


The best time performance and bandwidth results are in the table below.

Time (µs)

Bandwidth (GB/s)



Sequence size

No Sequences





Grid Stride Loops





array size

Billeter et al. Sequences






Warp Stride Loops






Time is in microseconds, so all results are about equal, except the Billeter et al. case which is significantly slower.

At this point we see no (dis)advantage of using sequences!

Vectorized Memory Access

Another question is if any of these alternatives can can benefit from using vectorized memory access [2]. The hardware can execute memory access operations on scalars, but also on small vectors of 2 components (pairs) or 4 components (quads). To see what the effects of vectorized memory access are, I adapted the software accordingly. Note that if the program processes vectors of e.g. 4 components, it need only 1/4th of the reads and writes of the scalar case.

Vectorized memory access with pairs

The test program proceeds just as described above. The table below summarizes the results obtained.

Time (µs)

Bandwidth (GB/s)



Sequence size

No Sequences





Grid Stride Loops





array size

Billeter et al. Sequences






Warp Stride Loops






It turns out that warp stride loops benefit most from vectorized memory access with pairs. This includes Billeter et al. sequences.

Vectorized memory access with quads

The table below summarizes the results obtained.

Time (µs)

Bandwidth (GB/s)



Sequence size

No Sequences





Grid Stride Loops





array size

Billeter et al. Sequences






Warp Stride Loops






So, it seems that vectorized memory access with quads, as opposed to pairs, is beneficial for Billeter et al. sequences, but not for the other options.


We conclude that warp stride loops offer the best performance. Better than using no loops, to be sure, if combined with vectorized memory access for pairs (not quads).

We have seen that the performance graphs of the various options show many local minima. So, it pays to do a search for the parameters that maximize performance.

Then, of course, there is the question whether sequences is what makes Billeter et al stream compaction the world champion. The answer is: Definitely not!


If the use of sequences does not set stream compaction according to Billeter et al. apart from the rest, then what does? Is it the difference in size the meta data volume?



[2] .


This is number 2 of some blog posts on Parallel Stream Compaction. In part 1 we saw a small number of fast implementations compared. The implementation by Billeter Olsson and Assarsson came up as the fastest. In a digression of different graphics cards we saw that the order in performance holds over different cards, over different generations. Now we want to know why the implementation by Billeter et al. is so distinctively faster than the other two.

In this part we measure time performance of the individual steps that comprise the algorithms, and discuss the results before we dig in deeper in further episodes. The code we use here is the code discussed in part 1.1: the digression comparing various graphics cards.

Measuring time performance of the three processing steps

All algorithms discussed in part 1 have three sequential steps, where each step is a Cuda kernel:

  1. Count the number of valid entries. The Count step.
  2. Determine the global offsets in the output for the entries. The Prefix step.
  3. Compact the valid entries into the output using the offsets. This is the Move step.

So, the first two steps provide metadata, and the third step uses the metadata to compact the stream.

The time performance of the separate steps may show us bottlenecks in the implementations.

We benchmark the individual steps with structured data (see part 1) on a GeForce GTX 690. We use the software from part 1.1, but restrict the size to 2^24. In the case of the Orange Owls and Spataro implementations, we restrict the threading configuration to 1024 threads per thread-block. This latter choice came out as the fastest thread configuration for the specified input length.

Measuring processing time is as follows. The individual steps are measured indirectly: first we measure step 1, then we measure step 1 and 2, and subtract step 1, etc. Time is measured in microseconds (µs = 10^-6 seconds). Measurement times are averages over 1,000 invocations.

We then see the following results.

Count (µs)

Prefix (µs)

Move (µs)

Total (µs)

Billeter et al.





Orange Owls










Let’s discuss this a bit.

The Prefix Step

The prefix step by Billeter et al. is much faster than the prefix step in the other two implementations. Curiously, both Orange Owls and Spataro employ the thrust::exclusive_scan function. This function is part of the Thrust library that is packed and distributed with the Cuda SDK(!). Ok, we already know from part 1 that Thrust is not the ultimate in performant GPU software, but the difference with Billeter et al. is really too large.

We will discuss Billeter et al.’s prefix step in a later episode in detail.

The Count and Move steps: loops

A major difference between the implementation by Billeter et al. and the other two is that the former employs loops.

Loops, as in sequential processing? In massively parallel software? And then having a performance advantage??? Yes!

Here is a count function from the source code by Billeter at al. that loops through an array.

template< class Pred > static _CHAG_PP_DEV SizeType count(
    const T* aStart, const T* aEnd, const Pred& aPred, volatile SizeType* aSmReduce)
    SizeType partialCountA = 0u, partialCountB = 0u;
    for(const _Tp* elem = _Tpp(aStart) + _Env::lane(); elem < _Tpp(aEnd);
        elem += _Env::SIMD)
        _Tp e = *elem;
        if( aPred( e.x ) )
        if( aPred( e.y ) )
    SizeType sum = partialCountA + partialCountB;

    return _Unit::reduce( sum, op::Add<SizeType>(), aSmReduce );

Notice the for loop, it defines how each thread of a warp iterates through the input from aStart (plus the offset for the thread) up to aEnd, with increments of the warp size (so threads do not get in each other’s way).

Here is the count function from Orange Owls. No loops at all.

template <typename T, typename Predicate>
__global__ void computePredicateTruePerBlock(const T * __restrict__ d_input,
    const int N, int * __restrict__ d_BlockCounts, Predicate predicate)
    int tid = threadIdx.x + blockIdx.x*blockDim.x;

    if (tid < N)

        int pred = predicate(d_input[tid]);
        int BC = __syncthreads_count(pred);

        if (threadIdx.x == 0) { d_BlockCounts[blockIdx.x] = BC; }

What does this code do? First it determines a global index for the current thread (tid), then, if the index denotes a location in the input, it determines if a predicate holds for the designated input, by setting a variable to 1 if it does, and to 0 if it doesn’t. It then sums all the outcomes in a warp, and stores the result.

Default threading configuration in Billeter et al.

The threading configuration in Billeter et al. depends on compile time parameters. The default configuration for step 1 (Count) and step 3 (Move) is that processing takes 120 blocks of 128 threads each. This is surprisingly low! Can we even speak of data parallelism? The 128 threads of a block have been subdivided in 4 warps of 32 threads: dim3(32, 4, 1). In contrast: the Prefix step uses 480 threads in 1 block.

The total number of threads (128 x 120) is far less than the size of the input stream. So, the algorithm employs sequences: ordered, contiguous, nonoverlapping subarrays; an order preserving concatenation of the sequences equals the input stream.

Number of sequences

The parallelism in GPU computing is per warp. So, the number of sequences in this algorithm is the total number of warps, which is the number of warps per thread-block x the number of blocks (4 x 120 = 480). The Count step counts the valid entries of the input per sequence and outputs an array of the counts, just as the Move step moves the valid entries to the output per sequence. The Prefix step takes as input the output of the Count step which has length 480. Since the Prefix step has 480 threads, there are no sequences in the Prefix step.

Note that the small numbers of thread-blocks and threads per block result in small data structures for the counts-of-valid-elements-per-sequence, and the global-offsets-per-sequence. I think that having small metadata data structures is a performance factor in itself.

Size of the sequences

Billeter et al. compute the size of the sequences using a rather complicated recipe:

  1. Determine the size of a data block a warp processes in one call. For instance, if we use the uint2 short vector type (of 2 unsigned ints), this is the warp size x the number of unsigned ints in a uint2. Hence, 32 x 2 = 64.
  1. Compute the basic sequence size as the stream size divided by the data block size, and then divided by the number of sequences. Note that these divisions are integer divisions. In our case: 2^24 / 64 = 262,144, then 262,144 / 480 = 546 with a remainder of 64.
  2. Add the remainder R to the size of the first R sequences.

So, in our input of size 2^24, the first 64 sequences have size 35008, and the other 480 – 64 = 416 sequences have size 34944.

The code is flexible and can work with very arbitrary length input streams. If the length of an input array is not a multiple of the warp size, the code calculates a range of auxiliary input elements and processes these separately. I will argue against this solution elsewhere, for now we can ignore this facility since the (large) size we use for experiments is a power of 2 – hence a multiple of 32.


If we want to conclude that the performance difference between the implementation of Billeter et al. on the one hand, and Orange Owls and Spataro on the other hand is indeed the use of sequences, we should be able to replicate this effect in a simpler setting. So, next up is a comparison of a standard copy algorithm and a variant that uses sequences. Would that reveal the difference?


For references to articles and software download sites, see part 1 and part 1.1 of this blog post collection.

This is part 1.1 of a number of blog posts on parallel stream compaction

In part 1 we saw that the parallel stream compaction Cuda software by Billeter et al. is faster than the software by Orange Owls , which in turn is faster than the software by Spataro. Commenter Qubey then asked if this relative order is preserved under migration to other graphics cards, notably to graphics cards with newer generations GPU’s. This is an important question because it addresses the validity of a choice for a particular algorithm (and its implementation) for future hardware. This validity depends largely on adherence to an architectural model in successive hardware generations and the software runtime environment.

We set out to experiment. All three implementations will be run with structured data for streams of size 2^10 up to and including 2^26. Structured data is of the form 1, 0, 3, 0, 5, 0, 7, 0, … Values are limited to [0, 2^16 – 1]. We have seen in part 1 that the algorithms are not sensitive to structure in the input data. This was also found in the extensive experiments reported on here (although we will not discuss it further). The implementations of Orange Owls and Spataro will be run for 32, 64, …, 1024 threads per block. The thread layout of the Billeter et al. implementation is fixed. We ran this software on various graphics cards: a GeForce GTX 660, 690, 960 and 1080. Time performance measurements are averaged over 1,000 calls of the kernel.

The questions we would like to see answered are:

  1. Is the relative ordering in time performance of the implementations preserved over the different graphics cards?
  2. Is the time performance relation between the fastest implementation and the runner up implementation constant over the different graphics cards? Can we say that implementation X Is Y times faster than implementation Z?

Relative order

The first question can best be answered by presenting time performance graphs of the cards involved.

Explanation of the numbers on the x-axis: x=1 means the input length is 2^10, x=2: input size is 2^11, … x=17: input size = 2^26. This holds for all graphs below.

It is obvious from the graphs that relative ordering of performance is preserved over the cards.

Note by the way the enormous increase in time performance. The GTX 660 takes almost 8ms to process the largest stream (2^26 elements) using the Billeter et al. algorithm, whereas the GTX 1080 needs only 3ms.


The data from part 1 suggests that there is a multiplicative relation between the fastest algorithm: Billeter et al., and the runner up: Orange Owls. The data from the current test set support this suggestion, as the following graphs illustrate.

The graphs below show (i) the Orange Owls time performance measurements divided by the Billeter et al. measurements; (ii) Spataro divided by Billeter et al.; and (iii) Spataro divided by Orange Owls.

We see that the division of the Orange Owls data divided by the Billeter et al. data, reasonably approximates a straight line, indicating an approximately constant factor on all cards. This will allow us to say something like “The implementation by Billeter et al. is X times faster than the implementation by Orange Owls.” The other two relations are clearly not of this nature.

We see that there is indeed a multiplicative relation between the Billeter et al. and Orange Owls time performance data. So what is the magnitude of these relations for the different cards, and are they more or less the same? Take a look at the table below.

Card Mean Standard deviation
GTX 660



GTX 690



GTX 960



GTX 1080



All data



As you can see, mean and standard deviation are similar over the cards, with the exception of the GTX 960. I’ll get to that. Based on these results, I’m inclined to say that Billeter et al. stream compaction is about 1.5 times as fast as Orange Owls stream compaction.

On the other hand, the magnitude of the multiplicative relation seems to be about 1.6, with the notable exception of the GTX 960. We note that the theoretical memory bus bandwidth of the GTX 960 is (only) 112 GB/s for the reference card and 120 GB/s for OEM cards (Wikipedia), whereas the bandwidth of the GTX 660 is 144 GB/s, and for the 690: 192 GB/s (per GPU).

The time performance comparison chart for all cards on structured data, Billeter et al. implementation looks like this:

Which shows that the GTX 960 is relatively less suited for stream compaction compared to the other cards.

Wrapping up

In part 1: Introduction, based on one graphics card and a single input stream size, Billeter et al. came out twice as fast as Orange Owls. This finding has now been developed into a factor 1.5 by the introduction of a broad spectrum of cards and a far broader scope of input streams. We have seen that Billeter et al.’s implementation of stream compaction is the fastest, for a substantial number of input stream lengths and over a number of graphics cards generations.


Next we will start digging into the inner workings of the algorithms, as promised in the introduction.


I would like to express my gratitude to Qubey for his sharp questions and for his willingness to conduct experiments on the GTX 660, 960 and 1080 cards.

What is Stream Compaction

Stream compaction is simply copying only the nonempty (valid) entries from an input array to a contiguous output array. There is, of course, the option to not preserve the order of the input, but we will skip that one.

In C++ the definition is simple. Given a sparse std::vector<T> v_in, and an equal sized, zero std::vector<T> v_out:

auto j = 0u;
for (const auto& e : v_in)
	if (e) v_out[j++] = e;

where T is the value type of both v_in and v_out.

If you apply this sequential algorithm to real time graphics tasks, you will find that it is too slow (we will get to the numbers below).

On the other hand, stream compaction is a very important algorithm in general purpose GPU (GPGPU) computing, and/or data parallel algorithms. Why? Typically, GPGPU algorithms have fixed output addresses assigned to each of the many parallel threads (I will not explain data parallel algorithms here, please do an internet search if new to this subject).

Not all threads produce (valid) output, giving rise to sparse output arrays. These sparse output arrays constitute poor quality input arrays for subsequent parallel processing steps: it makes threads process void input. Understandably, the deterioration may increase with an increasing number of processing steps. Hence the need for stream compaction.

World Champion Parallel Stream Compaction

Some time ago I needed a GPU stream compaction algorithm. Of course, initially I was unaware of this term, just looking for a way to remove the empty entries from a large Direct3D buffer. Internet research taught that there are a few fast implementations: by Hughes et al. [3], Spataro [2], and Billeter et al.[1]. And let’s not forget the Cuda Thrust library which contains a copy_if function (Cuda release 8). Software implementations can be downloaded from [6], [5], [4], and [7] respectively.

I’ve benchmarked the implementations on my Asus Geforce GTX 690, also including an implementation of Spataro’s algorithm by Orange Owls [8]. Two input vectors have been used:

  1. A structured vector [1, 0, 3, 0, 5, 0, 7, … ].
  2. A vector of pseudo random unsigned shorts, selected by rand(), with an approximate probability of 50% to be zero (decided using rand() also).

Both vectors have size 2^24 (almost 16.8 million).

The table below displays the results of running standard Visual Studio 2015 release builds without debugging. Measurements are averaged over 1,000 executions of the involved kernels. Measuring code directly surrounds the kernel calls. Outcomes have been checked for correctness by comparison with the outcome of the sequential algorithm above. All algorithms produce correct results.


Structured data (ms)

Rand() Data (ms)

CPU method (C++)



Billiter, Olsson, Assarsson



Orange Owls (3 phases approach)






Cuda Thrust 1.8



Hughes, Lim, Jones, Knoll, Spencer



So what do we see?

We see that the CPU code produces strongly varying results for the two input vectors. Parallel implementations do not suffer from this variance (or do not benefit from structure that is inherent in the data!).

The algorithm by Billeter et al.’s is at least twice as fast as the other algorithms. It is a step ahead of Spataro, Orange Owls, and Thrust.

Obviously, there is something wrong with Hughes et al.’s algorithm, or its implementation. According to the article [3], it should be faster than, or on par with Billeter et al.’s. Obviously, it isn’t. Inspection using the NVidia Visual Profiler shows that the threads are mainly (over 90%) ‘Inactive’, which explains its lack of performance.

Having read the articles referred to above, I decided to see if I myself could become world champion parallel stream compaction, by writing a new algorithm based on some ideas not found in the articles. So, could I be the new world champion? No. I got results in between Orange Owl’s and Spataro’s, but could not get any faster.

So, the software by Markus Billeter, Ola Olsson, and Ulf Assarsson is the fastest parallel stream compaction algorithm in the world, they are world champion stream compaction, and we have to first learn why exactly, before we can surpass it, if at all. The question that then is:

“What makes Billeter, Olsson, and Assarsson’s parallel stream compaction Cuda program at least 2x as fast as its competitors?


The implementation of Billeter et al.’s algorithm is an optimized library. Optimized also with respect to maintenance: no duplicate code, which makes it fairly cryptic, thus hard to decipher its operational details. Next up is a general description of their program, and its main parameters. The algorithm has three main steps which will be discussed in subsequent posts. Along the way I hope to disclose why their code is at least twice as fast as the other algorithm implementations.


[1] Billeter M, Olsson O, Assarsson U: Efficient Stream Compaction on Wide SIMD Many-Core Architectures. In Proceedings of the Conference on High Performance Graphics Vol. 2009 (2009), p. 159-166.
New Orleans, Louisiana — August 01 – 03, 2009. ACM New York, NY, USA. (;jsessionid=4FE2F7D1EBA4C616804F53FEF5A95DE2?doi= ).

[2] Spataro Davide: Stream Compaction on GPU – Efficient implementation – CUDA (Blog 23-05-2015: ).

[3] Hughes D.M. Lim I.S. Jones M.W. Knoll A. Spencer B.: InK-Compact: In-Kernel Stream Compaction and Its Application to
Multi-Kernel Data Visualization on General-Purpose GPUs. In: Computer Graphics Forum, Volume 32 Issue 6 September 2013 Pages 178 – 188. ( ).

[4] Source code Billeter, Olsson, and Assarsson: .

[5] Source code Spataro:

[6] Source code Hughes, Lim, Jones, Knoll, and Spencer:

[7] Cuda 8 (September 2016 release, includes Thrust): .

[8] Orange Owls Solutions implementation of Spataro’s:

For those interested: I have prepared three programs that experiment with the Cuda stream compaction implementions of Billeter et al., Spataro and Orange Owls. You can request download links (executables and sources) by creating a comment.