Parallel Stream Compaction: [4] Metadata Volume

Posted: June 15, 2018 in Crazy on Cuda

This is no. 4 in a collection of blog posts on Parallel Stream Compaction. In previous posts we saw that the implementation by Billeter, Olsson, and Assarsson is faster than the implementations by Orange Owls and Spataro; a multiplicative relation holds, even across different graphics cards. We also saw that the essential difference between the fastest algorithm, by Billeter et al., and the others is that the former uses sequences and a very small metadata structure, whereas the other programs do not.

In the previous post we saw that the introduction of sequences in itself does not provide a performance advantage. The question we address here is: Can the sheer volume of the metadata account for the performance differences?

The size of the metadata structure

The element counts of the metadata structures corresponding to best time performance on a stream of 2^24 elements are:

  • 512 unsigned ints in Billeter, Olsson, and Assarsson’s algorithm.
  • 16,384 ints in Spataro’s algorithm.
  • 16,384 ints in Orange Owls’ implementation of Spataro.

Billeter et al. have a metadata structure of 512 elements (of which they use 480). The size is set at compile time and is part of a configuration the authors deem optimal. If you do not change the code, the size is fixed.

The size of the metadata structure in Spataro’s and Orange Owls’ code equals the number of blocks, hence it varies with the threading configuration. In the first post in this series we determined the best time performance for a stream of 2^24 ints for both programs. The number of elements in the metadata structure corresponding to the best performance was 16,384 in both cases; with 16,384 blocks on a stream of 2^24 elements, each block handles 1,024 elements.

The large difference in size is a consequence of design: Spataro’s and Orange Owls’ algorithms collect metadata per block of threads, whereas Billeter et al.’s code collects metadata per sequence.

Parameter search

To find out whether there are significant differences, I did a parameter search for the optimal number of threads per block and number of thread blocks, for two approaches:

  1. With sequences; the number of elements in the metadata array equals 512.
  2. Without sequences; the number of elements in the metadata array equals the number of blocks.

The test code that makes the comparisons again uses a stream of 2^24 unsigned ints, generated with rand(). The fraction of valid elements is about one half, again decided by rand(). Both approaches were tested with scalars, pairs, and quads.
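
For concreteness, a minimal sketch of how such a test stream can be set up on the host. The buffer names and the exact use of rand() are my own choices here, not taken from the demo code; timing of the kernels is done separately with CUDA events.

#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

int main()
{
	const unsigned int N = 1u << 24;              // 2^24 stream elements

	// Host stream: roughly half of the elements are valid (nonzero), decided by rand()
	std::vector<unsigned int> h_in(N);
	for (auto& v : h_in)
		v = (rand() % 2) ? static_cast<unsigned int>(rand()) + 1u : 0u;

	// Copy the stream to the device
	unsigned int* d_in = nullptr;
	cudaMalloc(&d_in, N * sizeof(unsigned int));
	cudaMemcpy(d_in, h_in.data(), N * sizeof(unsigned int), cudaMemcpyHostToDevice);

	// ... launch and time the count kernels for the various threading configurations ...

	cudaFree(d_in);
	return 0;
}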

Sequence wise algorithm

The small-volume-metadata code uses a sequence algorithm with a warp-size stride. Starting from 128 threads per block with 128 blocks (a generalization of Billeter et al.’s threading configuration), I varied the numbers while keeping the product constant: 128 x 128 = 64 x 256 = 32 x 512, and so on. With 16,384 threads there are 16,384 / 32 = 512 warps, and since each warp works one sequence, a 512 element metadata structure suffices.

The main code for scalars looks like this:


// Nvidia example code
template <typename T>
__inline__ __device__ T warpReduceSum(T val)
{
	for (auto offset = warpSize / 2; offset > 0; offset /= 2)
	{
		// Accumulate values over the warp
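		// (__shfl_down is the pre-CUDA 9 intrinsic; on CUDA 9 and later this is
		// __shfl_down_sync(0xffffffff, val, offset))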
		val += __shfl_down(val, offset);
	}

	return val;
}
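
// WARP_SIZE is assumed to be defined (as 32, the warp size on NVIDIA GPUs) in the
// accompanying demo code; the guard below only makes this snippet self-contained.
#ifndef WARP_SIZE
#define WARP_SIZE 32
#endif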

__global__ void device_prefix_kernel2(const unsigned int* d_in, unsigned int* d_out,
const unsigned int seq_size)
{
	// Global thread index
	const auto idx = blockIdx.x * blockDim.x + threadIdx.x;

	// index within warp
	const auto lane = threadIdx.x % WARP_SIZE;
	// global index for warp, sequence
	const auto sidx = idx / WARP_SIZE;
	// Start of the sequence in the input array
	const auto seq_begin = seq_size * sidx;
	// End of the sequence
	const auto seq_end = seq_begin + seq_size;

	// Work the sequence in a warp
	auto sum = 0u;
	for (auto i = seq_begin + lane; i < seq_end; i += WARP_SIZE)
	{
		sum += d_in[i] > 0;
	}

	// Accumulate the warp local sums
	const auto cnt = warpReduceSum(sum);

	// Write output
	if (lane == 0)
		d_out[sidx] = cnt;
}
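
A launch of this kernel along the lines of the threading configuration above might look as follows. This is an illustration, not the demo harness; d_in and d_counts are hypothetical device buffers.

// Illustrative launch: 128 blocks x 128 threads = 16,384 threads = 512 warps.
// One sequence per warp, so the metadata array d_counts needs only 512 entries.
const unsigned int N = 1u << 24;                              // stream size
const unsigned int threads = 128;
const unsigned int blocks = 128;
const unsigned int sequences = blocks * threads / WARP_SIZE;  // 512
const unsigned int seq_size = N / sequences;                  // 32,768 elements per sequence

device_prefix_kernel2<<<blocks, threads>>>(d_in, d_counts, seq_size);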


Block wise algorithm

The large-volume-metadata main code for scalars looks like this:


// After Spataro
__global__ void device_count_kernel1(unsigned int* d_in, unsigned int* d_out)
{
	// Determine global thread index
	const auto idx = blockIdx.x * blockDim.x + threadIdx.x;

	// Read input value
	const auto val = d_in[idx];
	// Count all positive values 
	const auto cnt = __syncthreads_count(val > 0);
	// Write the output
	if (threadIdx.x == 0)
		d_out[blockIdx.x] = cnt;
}
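
For comparison, a launch of the block-wise kernel might look like this. Again d_in and d_counts are hypothetical names, and the configuration is only an example, not necessarily the measured optimum.

// Illustrative launch: one input element per thread, so the grid size, and with it
// the metadata array, follows from the block size. With 1,024 threads per block a
// stream of 2^24 elements needs 16,384 blocks, hence 16,384 metadata entries.
const unsigned int N = 1u << 24;
const unsigned int threads = 1024;
const unsigned int blocks = N / threads;                      // 16,384

device_count_kernel1<<<blocks, threads>>>(d_in, d_counts);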

A link to the complete demo code can be found at the bottom.

Time performance results

The table below shows the performance of both alternatives, for the thread counts that seem to fit each algorithm best.

            Block Wise (µs)        Sequence Wise (µs)
Threads     512 (16,384 blocks)    64 (256 blocks)
Scalar      771.0                  502.1
uint2       509.6                  445.1
uint4       448.2                  445.4

We see that the block wise algorithm can take great advantage of vectorized reads. For uint4 the algorithm is just as fast as the sequence wise algorithm. The latter can take only limited advantage of vectorized reads, but is in itself blazingly fast.
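
To make “vectorized reads” concrete, here is a sketch of what a uint4 (quad) variant of the block-wise count kernel could look like. This is my own illustration, not the kernel from the demo code; it simply replaces the single scalar read and predicate count with a 128-bit read and one __syncthreads_count per vector component.

// Sketch of a uint4 (quad) variant of the block-wise count kernel. Each thread
// reads four consecutive unsigned ints with a single 128-bit load, so the kernel
// is launched with a quarter of the threads of the scalar version.
__global__ void device_count_kernel1_quad(const uint4* d_in, unsigned int* d_out)
{
	// Global thread index; thread i owns input elements 4i .. 4i + 3
	const auto idx = blockIdx.x * blockDim.x + threadIdx.x;

	// One vectorized read instead of four scalar reads
	const auto val = d_in[idx];

	// Count the valid (nonzero) elements of the block, one component at a time
	auto cnt = __syncthreads_count(val.x > 0);
	cnt += __syncthreads_count(val.y > 0);
	cnt += __syncthreads_count(val.z > 0);
	cnt += __syncthreads_count(val.w > 0);

	// Write the per-block count to the metadata array
	if (threadIdx.x == 0)
		d_out[blockIdx.x] = cnt;
}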

Lesson Learned

Compare these results to the table in part 2 of this series. In that table Billeter et al. have 455.5 µs for the count phase. They use vectorized reads with pairs. Notice how similar the time measurements in both tables are for this IO configuration. Spataro and Orange Owls, however, use scalar IO, which is much slower. This, then, is the explanation why the code by Spataro and Orange Owls is slower than that by Billeter et al.: they do not use vectorized IO! It also explains why the difference in time performance is a factor, i.e. has a multiplicative character: Billeter et al. read and write twice as much data in a single memory transaction.

So, the code by Spataro and Orange Owls could have been just as fast as the code by Billeter et al., if only they had used vectorized IO. It is neither the use of sequences nor the size of the metadata structure that makes the difference.

This is an important lesson: it is knowledge of the language and platform facilities that makes one program the fastest and another a slower one.

Demo Code

The complete demo code for this blog post can be downloaded from here.

Next

Next up: Now that it is clear that all three implementations of parallel stream compaction explored in this blog post series could have had about equal performance, I would like to combine the software I wrote or adapted from the sources investigated into a short and fast program for parallel stream compaction.
