This is number 3 of a series of blog posts on Parallel Stream Compaction. In previous posts we saw that the implementation by Billeter et al. is a constant factor faster than the implementations by Orange Owls and Spataro, and that this multiplicative relation holds even across different graphics cards. We wondered whether the essential difference between the implementations is that the former uses loops that process the input in subsequences, and the latter two do not.

In this post we will examine loops and sequences in a Cuda copy program, a simpler context than stream compaction. The program takes a long array and returns a copy. Simple as that. The question here is whether the use of loops can account for the performance difference mentioned above, and if so, whether we can say something about optimal parameter configurations.

This investigation compares loops and subsequences set up according to Billeter et al. against not using loops at all, against “grid stride loops” as advocated by Nvidia [1], and against warp stride loops, a generalization of the way Billeter et al. organize loops. For the latter three variants we will do a parameter search to (by and large) optimize the parameter settings. Recall that the Billeter et al. algorithm has fixed values for the number of threads per thread-block and the number of blocks.

A link to sample code can be found at the bottom of this post. (not yet, but on its way!)

No loops, Grid stride loops, Billeter et al. loops, and Warp stride loops

The task will be to copy an array of 2^24 elements. The elements are generated with C++’s rand().

No loops

If we do not use loops or sequences, we can copy the array with a function like the one below. It is a template, so we can easily experiment with different argument types.

template <typename T>
__global__ void device_copy_kernel1(T* d_in, T* d_out)
{
	// Determine global index into the arrays
	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	// Copy the element
	d_out[idx] = d_in[idx];
}

I used arrays of unsigned int in the experiments described here.
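
For completeness, a minimal launch sketch for this kernel; this is my own reconstruction, assuming d_in, d_out, and ARRAY_SIZE are set up elsewhere, with the block count chosen so that the grid exactly covers the array.

// Hypothetical launch: one thread per element, so the grid must cover the array.
const unsigned int threads_per_block = 128;                  // a multiple of the warp size
const unsigned int blocks = ARRAY_SIZE / threads_per_block;  // ARRAY_SIZE = 2^24
device_copy_kernel1<unsigned int><<<blocks, threads_per_block>>>(d_in, d_out);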

Grid stride loops

When using grid stride loops, the step size equals the entire grid of threads. The size of a grid is

number-of-threads-per-block x number-of-blocks

i.e., the total number of threads used by the program. We can choose the total number of threads freely, but a wise choice makes it a divisor of the array size, with the block size a multiple of the warp size: 32 for current hardware. The maximum block size is 1024.

A simple function to copy an array using grid stride loops is the following.

template <typename T>
__global__ void device_copy_kernel2(T* d_in, T* d_out)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	for (int i = idx; i < ARRAY_SIZE; i += blockDim.x * gridDim.x)
	{
		d_out[i] = d_in[i];
	}
}
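
A hypothetical launch for this kernel: with a grid stride loop the grid no longer has to cover the whole array. The configuration shown here is the best one found in the benchmarks below (the 78th measurement: 256 threads, 32,768 blocks).

// The grid covers half the array (256 x 32,768 = 2^23 threads),
// so each thread executes the loop body exactly twice.
device_copy_kernel2<unsigned int><<<32768, 256>>>(d_in, d_out);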

Billeter et al. loops, Warp stride loops

Billeter et al. set the number of threads per block to 128, and the number of blocks to 120. They also have a rather involved way of cutting up an array into subarrays. For now I will simply cut up the array into equal-sized chunks. If the performance results are close, there is always the option to adopt the more complex scheme.

A distinguishing feature of the Billeter et al. loop is that it has a warp size stride. We might find it interesting to explore this feature further, so let’s use the following function.

template <typename T>
__global__ void device_copy_kernel3(const T* d_in, T* d_out, const unsigned int seq_size)
{
	const auto idx = blockIdx.x * blockDim.x + threadIdx.x;

	// index within warp, the name lane refers to the hardware implementation
	const auto lane = threadIdx.x % WARP_SIZE;
	// global index for warp, sequence
	const auto sidx = idx / WARP_SIZE;

	// set sequence begin and end
	const auto seq_begin = seq_size * sidx;
	const auto seq_end = seq_begin + seq_size;

	// copy the sequence
	for (auto i = seq_begin + lane; i < seq_end; i += WARP_SIZE)
	{
		d_out[i] = d_in[i];
	}
}

The seq_size parameter is defined as:

const unsigned int seq_size = ARRAY_SIZE / warp_cnt;

where warp_cnt is:

const unsigned int warp_cnt = (threads * blocks) / WARP_SIZE; // WARP_SIZE = 32
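
Putting this together, a host-side sketch (an assumption on my part, using the best configuration found in the benchmarks below) could look like this:

// Hypothetical setup for the warp stride kernel; WARP_SIZE = 32, ARRAY_SIZE = 2^24.
const unsigned int threads = 128;
const unsigned int blocks = 65536;
const unsigned int warp_cnt = (threads * blocks) / WARP_SIZE;  // 262,144 warps = sequences
const unsigned int seq_size = ARRAY_SIZE / warp_cnt;           // 2^24 / 262,144 = 64 elements
device_copy_kernel3<unsigned int><<<blocks, threads>>>(d_in, d_out, seq_size);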

Benchmarks

Time measurements are averaged over 1000 calls of the kernel; Visual Studio release builds that are run without debugging, on an Asus Geforce GTX 690.

No loops

For the No loops case we let the number of threads per block run from 32 to 1024 in steps of 32; the number of blocks is simply ARRAY_SIZE / number-of-threads-per-block.

This gives us the following graph (I have subtracted 850 µs from each time measurement in order to accentuate the differences; all time measurements are larger than 850 µs).

There is a global minimum around the fourth measurement (128 threads per block), i.e. in this experiment the fourth measurement provides the best time performance.

Grid stride loops

The Grid stride loop case has a huge number of possible parameter values, so we do a somewhat coarse search. The set of numbers of threads per block is {64, 128, 256, 512, 1024}. For each thread count, the parameter space for the number of blocks is divided into 32 values (disregarding the case of zero blocks).

This gives us the following graph (I have again subtracted 850 µs from each time measurement).

There are several, let’s say, regional minima: the 47th, 78th, 109th, and 139th measurements.

Measurement nr. | Threads | Blocks | Time (µs)
47              | 128     | 65,536 | 862.3
78              | 256     | 32,768 | 861.8
109             | 512     | 16,384 | 864.7
139             | 1024    | 7,680  | 871.2

Interestingly, this is a very regular series: in the first three minima the total number of threads is 128 x 65,536 = 256 x 32,768 = 512 x 16,384 = 2^23, half the array size, so each thread copies exactly two elements. The last minimum, at 7,680 blocks, lies just one step before the configuration with 8,192 blocks that would continue this pattern (1024 x 8,192 = 2^23).

Warp stride loops, Billeter et al. loops

Nvidia hardware executes a warp of threads in parallel on a single multiprocessor. So it makes sense to set up sequences that are processed by a single warp of threads; such sequences are processed by warp stride loops. The number of warps then determines the number of sequences, and hence their size. Nvidia warns against writing programs that critically depend on the size of a warp. So, production software that uses the warp size as a parameter should always query the hardware at runtime for the actual warp size (for the foreseeable past and future, expect it to be either 16, 32, or 64).
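
Querying the warp size at runtime is a one-liner with the Cuda runtime API:

// Ask device 0 for its actual warp size instead of hard-coding 32.
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
const int warp_size = prop.warpSize;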

Billeter et al. have a warp stride loop with 120 blocks of 128 threads. I have measured the time performance and registered it in the table in the Summary section.

The time performance is not exactly impressive: 1,804.7 µs, the slowest option in the pack.

In order to see whether the warp stride loop variant has any merit, I performed a parameter search similar to the grid stride loop case. Note that the array length, the warp size, the number of threads per block, and the number of blocks together completely determine the number and size of the sequences.

This resulted in the following graph (again with 850 µs subtracted from each measurement).

The 47th measurement has the shortest time: 863.7 µs.

Summary

The best time performance and bandwidth results are in the table below.

                          | Time (µs) | Bandwidth (GB/s) | Threads | Blocks  | Sequence size
No Sequences              | 870.9     | 154.1            | 128     | 131,072 | –
Grid Stride Loops         | 861.8     | 155.7            | 800     | 1,024   | array size
Billeter et al. Sequences | 1,804.7   | 74.4             | 128     | 120     | 32,768
Warp Stride Loops         | 863.7     | 155.4            | 128     | 65,536  | 64

The differences are a matter of microseconds, so all results are about equal, except the Billeter et al. case, which is significantly slower.
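
As a sanity check: the bandwidth column follows directly from the time column, since each of the 2^24 unsigned ints is read once and written once.

// Bandwidth = bytes moved / time. For the No Sequences row:
// 2 x 2^24 elements x 4 bytes = 134,217,728 bytes,
// and 134,217,728 bytes / 870.9e-6 s ≈ 154.1 GB/s.
const double bytes = 2.0 * (1u << 24) * sizeof(unsigned int);
const double bandwidth_gbs = bytes / 870.9e-6 / 1e9;  // ≈ 154.1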

At this point we see no (dis)advantage of using sequences!

Vectorized Memory Access

Another question is whether any of these alternatives can benefit from using vectorized memory access [2]. The hardware can execute memory access operations on scalars, but also on small vectors of 2 components (pairs) or 4 components (quads). To see what the effects of vectorized memory access are, I adapted the software accordingly. Note that if the program processes vectors of e.g. 4 components, it needs only a quarter of the reads and writes of the scalar case.
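
As an illustration, here is a minimal sketch of a quad-based copy kernel in the style of [2]. This is my own reconstruction, not the actual test code; it assumes the array size is a multiple of 4, the pointers are 16-byte aligned (cudaMalloc guarantees this), and the grid covers ARRAY_SIZE / 4 quads.

// Hypothetical vectorized copy: each thread moves one uint4 (16 bytes),
// so only a quarter of the scalar load/store instructions are issued.
__global__ void device_copy_kernel_vec4(const unsigned int* d_in, unsigned int* d_out)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	const uint4* in4 = reinterpret_cast<const uint4*>(d_in);
	uint4* out4 = reinterpret_cast<uint4*>(d_out);

	out4[idx] = in4[idx];
}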

Vectorized memory access with pairs

The test program proceeds just as described above. The table below summarizes the results obtained.

                          | Time (µs) | Bandwidth (GB/s) | Threads | Blocks | Sequence size
No Sequences              | 857.9     | 156.5            | 704     | 11,915 | –
Grid Stride Loops         | 859.7     | 156.1            | 1,024   | 8,192  | array size
Billeter et al. Sequences | 1,434.6   | 93.6             | 128     | 120    | 17,464
Warp Stride Loops         | 820.9     | 163.5            | 256     | 38,912 | 26

It turns out that warp stride loops benefit most from vectorized memory access with pairs. This includes Billeter et al. sequences.

Vectorized memory access with quads

The table below summarizes the results obtained.

                          | Time (µs) | Bandwidth (GB/s) | Threads | Blocks | Sequence size
No Sequences              | 862.3     | 155.7            | 1,024   | 4,096  | –
Grid Stride Loops         | 862.4     | 155.6            | 64      | 65,536 | array size
Billeter et al. Sequences | 1,335.7   | 100.5            | 128     | 120    | 8,738
Warp Stride Loops         | 831.7     | 161.4            | 256     | 38,912 | 26

So, it seems that vectorized memory access with quads, as opposed to pairs, is beneficial for Billeter et al. sequences, but not for the other options.

Conclusions

We conclude that warp stride loops offer the best performance: better than using no loops at all, provided they are combined with vectorized memory access for pairs (not quads).

We have seen that the performance graphs of the various options show many local minima. So, it pays to do a search for the parameters that maximize performance.

Then, of course, there is the question whether sequences are what makes Billeter et al. stream compaction the world champion. The answer is: definitely not!

Next

If the use of sequences does not set stream compaction according to Billeter et al. apart from the rest, then what does? Is it the difference in the size of the metadata volume?

References

[1] https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/

[2] https://devblogs.nvidia.com/cuda-pro-tip-increase-performance-with-vectorized-memory-access/

This is number 2 of a series of blog posts on Parallel Stream Compaction. In part 1 we compared a small number of fast implementations; the implementation by Billeter, Olsson, and Assarsson came out as the fastest. In a digression on different graphics cards we saw that this performance order holds across different cards and different generations. Now we want to know why the implementation by Billeter et al. is so distinctly faster than the other two.

In this part we measure the time performance of the individual steps that comprise the algorithms, and discuss the results before digging in deeper in further episodes. The code we use here is the code discussed in part 1.1: the digression comparing various graphics cards.

Measuring time performance of the three processing steps

All algorithms discussed in part 1 have three sequential steps, where each step is a Cuda kernel:

  1. Count the number of valid entries. The Count step.
  2. Determine the global offsets in the output for the entries. The Prefix step.
  3. Compact the valid entries into the output using the offsets. This is the Move step.

So, the first two steps provide metadata, and the third step uses the metadata to compact the stream.
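
Schematically, with hypothetical kernel names and signatures (the actual names differ per implementation), the pipeline looks like this:

// Three sequential kernels; names and signatures are illustrative only.
count_valid<<<blocks, threads>>>(d_in, n, d_counts);        // step 1: counts per block/sequence
prefix_sum<<<1, scan_threads>>>(d_counts, d_offsets);       // step 2: exclusive scan of the counts
move_valid<<<blocks, threads>>>(d_in, n, d_offsets, d_out); // step 3: compact into the output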

The time performance of the separate steps may show us bottlenecks in the implementations.

We benchmark the individual steps with structured data (see part 1) on a GeForce GTX 690. We use the software from part 1.1, but restrict the size to 2^24. In the case of the Orange Owls and Spataro implementations, we restrict the threading configuration to 1024 threads per thread-block. This latter choice came out as the fastest thread configuration for the specified input length.

Measuring processing time is as follows. The individual steps are measured indirectly: first we measure step 1, then we measure steps 1 and 2 together and subtract step 1, etc. Time is measured in microseconds (µs = 10^-6 seconds). Measurements are averaged over 1,000 invocations.
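
A sketch of how one such measurement can be set up with Cuda events; the kernel name and its arguments are placeholders.

// Average the time of 1,000 invocations with Cuda events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
for (int i = 0; i < 1000; ++i)
	count_valid<<<blocks, threads>>>(d_in, n, d_counts);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float total_ms = 0.0f;
cudaEventElapsedTime(&total_ms, start, stop);
// total_ms / 1,000 calls x 1,000 µs/ms: the average in µs equals total_ms numerically
float average_us = total_ms;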

We then see the following results.

                | Count (µs) | Prefix (µs) | Move (µs) | Total (µs)
Billeter et al. | 455.5      | 7.3         | 820.0     | 1,282.8
Orange Owls     | 893.6      | 47.2        | 1,665.8   | 2,606.6
Spataro         | 946.6      | 86.6        | 2,107.9   | 3,141.1

Let’s discuss this a bit.

The Prefix Step

The prefix step by Billeter et al. is much faster than the prefix step in the other two implementations. Curiously, both Orange Owls and Spataro employ the thrust::exclusive_scan function. This function is part of the Thrust library that is packaged and distributed with the Cuda SDK(!). Ok, we already know from part 1 that Thrust is not the ultimate in performant GPU software, but the difference with Billeter et al. is really too large.

We will discuss Billeter et al.’s prefix step in a later episode in detail.

The Count and Move steps: loops

A major difference between the implementation by Billeter et al. and the other two is that the former employs loops.

Loops, as in sequential processing? In massively parallel software? And then having a performance advantage??? Yes!

Here is a count function from the source code by Billeter et al. that loops through an array.

template< class Pred > static _CHAG_PP_DEV SizeType count(
    const T* aStart, const T* aEnd, const Pred& aPred, volatile SizeType* aSmReduce)
{
    SizeType partialCountA = 0u, partialCountB = 0u;
    for(const _Tp* elem = _Tpp(aStart) + _Env::lane(); elem < _Tpp(aEnd);
        elem += _Env::SIMD)
    {
        _Tp e = *elem;

        if( aPred( e.x ) )
            ++partialCountA;
        if( aPred( e.y ) )
            ++partialCountB;
    }

    SizeType sum = partialCountA + partialCountB;

    return _Unit::reduce( sum, op::Add<SizeType>(), aSmReduce );
}

Notice the for loop: it defines how each thread of a warp iterates through the input from aStart (plus the offset for the thread) up to aEnd, in increments of the warp size (so threads do not get in each other’s way). Note also that each iteration processes a two-component vector (e.x and e.y).

Here is the count function from Orange Owls. No loops at all.

template <typename T, typename Predicate>
__global__ void computePredicateTruePerBlock(const T * __restrict__ d_input,
    const int N, int * __restrict__ d_BlockCounts, Predicate predicate)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;

    if (tid < N)
    {

        int pred = predicate(d_input[tid]);
        int BC = __syncthreads_count(pred);

        if (threadIdx.x == 0) { d_BlockCounts[blockIdx.x] = BC; }
    }
}

What does this code do? First it determines a global index for the current thread (tid). Then, if the index denotes a location in the input, it determines whether a predicate holds for the designated input element, setting a variable to 1 if it does, and to 0 if it doesn’t. It then counts the nonzero outcomes over the entire thread block (__syncthreads_count), and thread 0 stores the per-block result.

Default threading configuration in Billeter et al.

The threading configuration in Billeter et al. depends on compile time parameters. The default configuration for step 1 (Count) and step 3 (Move) is that processing takes 120 blocks of 128 threads each. This is surprisingly low! Can we even speak of data parallelism? The 128 threads of a block are subdivided into 4 warps of 32 threads: dim3(32, 4, 1). In contrast, the Prefix step uses 480 threads in 1 block.

The total number of threads (128 x 120 = 15,360) is far less than the size of the input stream. So the algorithm employs sequences: ordered, contiguous, nonoverlapping subarrays whose order preserving concatenation equals the input stream.

Number of sequences

The parallelism in GPU computing is per warp. So, the number of sequences in this algorithm is the total number of warps, which is the number of warps per thread-block x the number of blocks (4 x 120 = 480). The Count step counts the valid entries of the input per sequence and outputs an array of the counts, just as the Move step moves the valid entries to the output per sequence. The Prefix step takes as input the output of the Count step which has length 480. Since the Prefix step has 480 threads, there are no sequences in the Prefix step.

Note that the small numbers of thread-blocks and threads per block result in small data structures for the counts-of-valid-elements-per-sequence, and the global-offsets-per-sequence. I think that having small metadata data structures is a performance factor in itself.

Size of the sequences

Billeter et al. compute the size of the sequences using a rather complicated recipe:

  1. Determine the size of the data block a warp processes in one iteration. For instance, if we use the uint2 short vector type (of 2 unsigned ints), this is the warp size x the number of unsigned ints in a uint2: 32 x 2 = 64.
  2. Compute the basic sequence size as the stream size divided by the data block size, and then divided by the number of sequences. Note that these divisions are integer divisions. In our case: 2^24 / 64 = 262,144, then 262,144 / 480 = 546 with a remainder of 64.
  3. Distribute the remainder R over the first R sequences: each of them gets one extra data block.

So, for our input of size 2^24, the first 64 sequences have size 35,008, and the other 480 – 64 = 416 sequences have size 34,944.
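
In code, the recipe boils down to the following host-side sketch (variable names are mine, not the library’s):

// Sequence sizing for ARRAY_SIZE = 2^24, uint2 data blocks, 480 sequences.
const unsigned int data_block   = 32 * 2;                  // warp size x components of a uint2 = 64
const unsigned int num_seq      = 4 * 120;                 // warps per block x blocks = 480
const unsigned int total_blocks = (1u << 24) / data_block; // 262,144 data blocks
const unsigned int base         = total_blocks / num_seq;  // 546 data blocks per sequence
const unsigned int rem          = total_blocks % num_seq;  // 64 data blocks left over
// first 64 sequences: (546 + 1) x 64 = 35,008 elements each,
// remaining 416 sequences: 546 x 64 = 34,944 elements each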

The code is flexible and can work with arbitrary-length input streams. If the length of an input array is not a multiple of the warp size, the code calculates a range of auxiliary input elements and processes these separately. I will argue against this solution elsewhere; for now we can ignore this facility since the (large) size we use for experiments is a power of 2, hence a multiple of 32.

Next

If we want to conclude that the performance difference between the implementation of Billeter et al. on the one hand, and Orange Owls and Spataro on the other hand is indeed the use of sequences, we should be able to replicate this effect in a simpler setting. So, next up is a comparison of a standard copy algorithm and a variant that uses sequences. Would that reveal the difference?

References

For references to articles and software download sites, see part 1 and part 1.1 of this blog post collection.

This is part 1.1 of a number of blog posts on parallel stream compaction.

In part 1 we saw that the parallel stream compaction Cuda software by Billeter et al. is faster than the software by Orange Owls, which in turn is faster than the software by Spataro. Commenter Qubey then asked if this relative order is preserved under migration to other graphics cards, notably cards with newer generations of GPUs. This is an important question because it addresses the validity of a choice for a particular algorithm (and its implementation) for future hardware. This validity depends largely on adherence to an architectural model in successive hardware generations and the software runtime environment.

We set out to experiment. All three implementations will be run with structured data for streams of size 2^10 up to and including 2^26. Structured data is of the form 1, 0, 3, 0, 5, 0, 7, 0, …; values are limited to [0, 2^16 – 1]. We have seen in part 1 that the algorithms are not sensitive to structure in the input data; this was also found in the extensive experiments reported on here (although we will not discuss it further). The implementations of Orange Owls and Spataro will be run for 32, 64, …, 1024 threads per block; the thread layout of the Billeter et al. implementation is fixed. We ran this software on various graphics cards: a GeForce GTX 660, 690, 960, and 1080. Time performance measurements are averaged over 1,000 calls of the kernel.

The questions we would like to see answered are:

  1. Is the relative ordering in time performance of the implementations preserved over the different graphics cards?
  2. Is the time performance relation between the fastest implementation and the runner-up constant over the different graphics cards? Can we say that implementation X is Y times faster than implementation Z?

Relative order

The first question can best be answered by presenting time performance graphs of the cards involved.

Explanation of the numbers on the x-axis: x=1 means the input length is 2^10, x=2: input size is 2^11, … x=17: input size = 2^26. This holds for all graphs below.

It is obvious from the graphs that relative ordering of performance is preserved over the cards.

Note, by the way, the enormous increase in time performance across generations: the GTX 660 takes almost 8 ms to process the largest stream (2^26 elements) using the Billeter et al. algorithm, whereas the GTX 1080 needs only 3 ms.

Magnitude

The data from part 1 suggest that there is a multiplicative relation between the fastest algorithm, Billeter et al., and the runner-up, Orange Owls. The data from the current test set support this suggestion, as the following graphs illustrate.

The graphs below show (i) the Orange Owls time performance measurements divided by the Billeter et al. measurements; (ii) Spataro divided by Billeter et al.; and (iii) Spataro divided by Orange Owls.

We see that the Orange Owls data divided by the Billeter et al. data reasonably approximates a straight line, indicating an approximately constant factor on all cards. This allows us to say something like “The implementation by Billeter et al. is X times faster than the implementation by Orange Owls.” The other two relations are clearly not of this nature.

We see that there is indeed a multiplicative relation between the Billeter et al. and Orange Owls time performance data. So what is the magnitude of this relation for the different cards, and is it more or less the same across them? Take a look at the table below.

Card     | Mean | Standard deviation
GTX 660  | 1.6  | 0.2
GTX 690  | 1.7  | 0.2
GTX 960  | 1.3  | 0.1
GTX 1080 | 1.6  | 0.3
All data | 1.5  | 0.3

As you can see, mean and standard deviation are similar over the cards, with the exception of the GTX 960. I’ll get to that. Based on these results, I’m inclined to say that Billeter et al. stream compaction is about 1.5 times as fast as Orange Owls stream compaction.

On the other hand, disregarding the GTX 960, the magnitude of the multiplicative relation seems to be about 1.6. We note that the theoretical memory bus bandwidth of the GTX 960 is (only) 112 GB/s for the reference card and 120 GB/s for OEM cards (Wikipedia), whereas the bandwidth of the GTX 660 is 144 GB/s, and that of the 690 is 192 GB/s (per GPU).

The time performance comparison chart for all cards on structured data, Billeter et al. implementation looks like this:

It shows that the GTX 960 is relatively less suited for stream compaction than the other cards.

Wrapping up

In part 1: Introduction, based on one graphics card and a single input stream size, Billeter et al. came out twice as fast as Orange Owls. With a broad spectrum of cards and a far broader range of input sizes, this finding has now been refined to a factor of about 1.5. We have seen that Billeter et al.’s implementation of stream compaction is the fastest, for a substantial number of input stream lengths and over a number of graphics card generations.

Next

Next we will start digging into the inner workings of the algorithms, as promised in the introduction.

Thanks

I would like to express my gratitude to Qubey for his sharp questions and for his willingness to conduct experiments on the GTX 660, 960 and 1080 cards.

What is Stream Compaction

Stream compaction is simply copying only the nonempty (valid) entries from an input array to a contiguous output array. There is, of course, the option to not preserve the order of the input, but we will skip that one.

In C++ the definition is simple. Given a sparse std::vector<T> v_in, and an equal-sized, zero-initialized std::vector<T> v_out:

auto j = 0u;
for (const auto& e : v_in)
{
	if (e) v_out[j++] = e;
}

where T is the value type of both v_in and v_out.

If you apply this sequential algorithm to real time graphics tasks, you will find that it is too slow (we will get to the numbers below).

On the other hand, stream compaction is a very important algorithm in general purpose GPU (GPGPU) computing, and/or data parallel algorithms. Why? Typically, GPGPU algorithms have fixed output addresses assigned to each of the many parallel threads (I will not explain data parallel algorithms here, please do an internet search if new to this subject).

Not all threads produce (valid) output, giving rise to sparse output arrays. These sparse output arrays constitute poor-quality input for subsequent parallel processing steps: they make threads process void input. Understandably, the deterioration may increase with an increasing number of processing steps. Hence the need for stream compaction.

World Champion Parallel Stream Compaction

Some time ago I needed a GPU stream compaction algorithm. Of course, initially I was unaware of this term; I was just looking for a way to remove the empty entries from a large Direct3D buffer. Internet research taught me that there are a few fast implementations: by Hughes et al. [3], Spataro [2], and Billeter et al. [1]. And let’s not forget the Cuda Thrust library, which contains a copy_if function (Cuda release 8). Software implementations can be downloaded from [6], [5], [4], and [7] respectively.

I’ve benchmarked the implementations on my Asus Geforce GTX 690, also including an implementation of Spataro’s algorithm by Orange Owls [8]. Two input vectors have been used:

  1. A structured vector [1, 0, 3, 0, 5, 0, 7, … ].
  2. A vector of pseudo random unsigned shorts, selected by rand(), with an approximate probability of 50% to be zero (decided using rand() also).

Both vectors have size 2^24 (almost 16.8 million).
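
The two inputs might be generated along the following lines. This is a sketch under my own assumptions; the actual test code may differ in details.

#include <cstdlib>
#include <vector>

// Structured: 1, 0, 3, 0, 5, 0, 7, ... (values wrap at 2^16).
std::vector<unsigned short> make_structured(size_t n)
{
	std::vector<unsigned short> v(n);
	for (size_t i = 0; i < n; ++i)
		v[i] = (i % 2 == 0) ? static_cast<unsigned short>(i + 1) : 0;
	return v;
}

// Pseudo random unsigned shorts, roughly 50% zeros (zero or not also decided by rand()).
std::vector<unsigned short> make_random(size_t n)
{
	std::vector<unsigned short> v(n);
	for (auto& e : v)
		e = (rand() % 2 == 0) ? static_cast<unsigned short>(rand()) : 0;
	return v;
}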

The table below displays the results of running standard Visual Studio 2015 release builds without debugging. Measurements are averaged over 1,000 executions of the involved kernels. Measuring code directly surrounds the kernel calls. Outcomes have been checked for correctness by comparison with the outcome of the sequential algorithm above. All algorithms produce correct results.

Implementation                     | Structured data (ms) | Rand() data (ms)
CPU method (C++)                   | 7.4                  | 55.1
Billeter, Olsson, Assarsson        | 1.3                  | 1.4
Orange Owls (3 phases approach)    | 2.6                  | 2.6
Spataro                            | 3.6                  | 3.6
Cuda Thrust 1.8                    | 4.3                  | 4.4
Hughes, Lim, Jones, Knoll, Spencer | 112.3                | 112.8

So what do we see?

We see that the CPU code produces strongly varying results for the two input vectors. Parallel implementations do not suffer from this variance (or do not benefit from structure that is inherent in the data!).

Billeter et al.’s algorithm is at least twice as fast as the other algorithms. It is a step ahead of Spataro, Orange Owls, and Thrust.

Obviously, there is something wrong with Hughes et al.’s algorithm, or its implementation. According to the article [3] it should be faster than, or on par with, Billeter et al.’s. Obviously, it isn’t. Inspection using the Nvidia Visual Profiler shows that the threads are mainly (over 90%) ‘Inactive’, which explains the lack of performance.

Having read the articles referred to above, I decided to see if I myself could become world champion parallel stream compaction, by writing a new algorithm based on some ideas not found in the articles. So, could I be the new world champion? No. I got results in between Orange Owl’s and Spataro’s, but could not get any faster.

So, the software by Markus Billeter, Ola Olsson, and Ulf Assarsson is the fastest parallel stream compaction algorithm in the world; they are world champions of stream compaction, and we first have to learn why exactly, before we can surpass it, if at all. The question then is:

“What makes Billeter, Olsson, and Assarsson’s parallel stream compaction Cuda program at least 2x as fast as its competitors?”

Next

The implementation of Billeter et al.’s algorithm is an optimized library. It is optimized also with respect to maintenance: no duplicate code, which makes it fairly cryptic, and thus hard to decipher its operational details. Next up is a general description of their program and its main parameters. The algorithm has three main steps, which will be discussed in subsequent posts. Along the way I hope to disclose why their code is at least twice as fast as the other implementations.

References

[1] Billeter M., Olsson O., Assarsson U.: Efficient Stream Compaction on Wide SIMD Many-Core Architectures. In: Proceedings of the Conference on High Performance Graphics 2009, pp. 159–166, New Orleans, Louisiana, August 1–3, 2009. ACM, New York, NY, USA. (http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=4FE2F7D1EBA4C616804F53FEF5A95DE2?doi=10.1.1.152.6594&rep=rep1&type=pdf).

[2] Spataro Davide: Stream Compaction on GPU – Efficient implementation – CUDA (Blog 23-05-2015: http://www.davidespataro.it/cuda-stream-compaction-efficient-implementation/ ).

[3] Hughes D.M., Lim I.S., Jones M.W., Knoll A., Spencer B.: InK-Compact: In-Kernel Stream Compaction and Its Application to Multi-Kernel Data Visualization on General-Purpose GPUs. In: Computer Graphics Forum, Volume 32, Issue 6, September 2013, pp. 178–188. (https://github.com/tpn/pdfs/blob/master/InK-Compact-%20In-Kernel%20Stream%20Compaction%20and%20Its%20Application%20to%20Multi-Kernel%20Data%20Visualization%20on%20General-Purpose%20GPUs%20-%202013.pdf).

[4] Source code Billeter, Olsson, and Assarsson: https://newq.net/archived/www.cse.chalmers.se/pub/pp/index.html .

[5] Source code Spataro: https://github.com/knotman90/cuStreamComp

[6] Source code Hughes, Lim, Jones, Knoll, and Spencer: https://sourceforge.net/projects/inkc/

[7] Cuda 8 (September 2016 release, includes Thrust): https://developer.nvidia.com/cuda-toolkit .

[8] Orange Owls Solutions implementation of Spataro’s: https://github.com/OrangeOwlSolutions/streamCompaction

For those interested: I have prepared three programs that experiment with the Cuda stream compaction implementations of Billeter et al., Spataro, and Orange Owls. You can request download links (executables and sources) by leaving a comment.

Harder to C++: Monads for Mortals [7], Monad Composition

Early authors on monads, e.g. [Moggi], [Wadler], describe several types of monads, notably for side effects, state, exceptions, and IO. Of course we want to write programs that have all these features. There are, in principle, two possible approaches. First: write a separate monad each time we think up a new piece of non-pure-functional … (shall we dare call it functionality?). Second: compose different standard monads into larger wholes. In [part 2] of his excellent tutorial on monads, Mike Vanier writes:

“Functional programmers talk about ‘composability’ all the time, with the implication that if some aspect of a programming language isn’t composable, it’s probably not worth much.”

So, clearly composition is key and we will see some monads composed into larger wholes here. To be clear: I don’t mean function composition, I mean monad composition. In C++. To be precise, I will define:

  • A monad that executes a function which maintains a state (side-effect).
  • A monad that catches and handles exceptions.
  • A monad that writes the result of function composition to console or an error message in case an exception occurred.

Then I will compose those monads into a larger whole, with preservation of imperative ‘functionality’. We will see that it actually works. Then I will show a very general imperative C++ function, that has the same functionality, and we will compare size and performance of both approaches.

Monad Composition

We will use a single monad template and instantiate it for a State type, an Exception type, and a std::cout wrapper. The monad template looks like this:

//V: wrapped value, S: state of some sort
template<typename V, typename S>
struct monad
{
	V value;
	S state;

	monad(const V& v) : value(v) {}
	monad(V& v) : value(std::move(v)) {}
	monad(const V& v, const S& s) : value(v), state(s) {}
	monad(V& v, S& s) : value(std::move(v)), state(std::move(s)) {}
	monad(V&& v, S&& s) : value(std::move(v)), state(std::move(s)) {}
	template<typename T, typename W>
	monad(const monad<T, W>& m) : value(m.value), state(m.state) {}
	monad(monad&& m) : value(std::move(m.value)), state(std::move(m.state)) {}
	~monad() {}
	monad& operator=(const monad& m)
	{
		if(this != &m)
		{
			value = m.value;
			state = m.state;
		}
		return *this;
	}
	monad& operator=(monad&& m)
	{
		if(this != &m)
		{
			value = m.value;
			m.value = V();

			state = m.state;
			m.state = S();
		}
		return *this;
	}
};

The template consists of a wrapped value and a field to hold state of some sort, e.g. actual state, an exception, or an io stream. The rest of the code is just what is generally referred to as ‘copy control’.

The ‘S’ parameter will be instantiated using three classes: State, Exception, and Cout.

struct State
{
	int count = 1;
	void update(const int seed) { count += seed; }
};
struct Exception : std::exception
{
	Exception() : message(""), errorCode(0) {}

	Exception(string msg, int errorCode) : message(msg), errorCode(errorCode) { }
	string ErrorMessage() const
	{
		stringstream ost;
		ost << message << " " << errorCode;
		return ost.str();
	}

private:
	string message;
	int errorCode;
};
struct Cout
{
	ostream& os;
	Cout() : os(cout) {}
};

Cout is a std::cout wrapper. We need it because io streams cannot be copied or assigned, but by the design of our monad template (S is not a reference or pointer) we need something that actually can be copied or assigned.

For the monad template we define three overloads of the bind function, one for each instantiation of the template:

// bind function overload: State
template<typename A, typename R>
monad<R, State> operator| (const monad<A, State>& mnd, monad<R, State>(*func)(const A&))
{
	auto tmp = func(mnd.value);
	tmp.state.update(mnd.state.count);
	return tmp;
}

// bind function overload: Exception
template<typename A, typename R>
monad<R, Exception> operator| (const monad<A, Exception>& mnd, monad<R, Exception>(*func)(const A&))
{
	try
	{
		return func(mnd.value);
	}
	catch(Exception& ex)
	{
		auto tmp = monad<R, Exception>(R(mnd.value));
		tmp.state = ex;
		return tmp;
	}
}

// bind function overload: Cout
template<typename A, typename R>
monad<R, Cout> operator|(const monad<A, Cout>& mnd, monad<R, Cout>(*func)(const A&))
{
	auto tmp = func(mnd.value);
	// Need to create a new tmp, because we cannot assign, copy ostream.
	// When initializing, we use a ref to ostream
	auto tmp2 = monad<R, Cout>(tmp.value, mnd.state);
	tmp2.state.os << tmp2 << endl;
	return tmp2;
}

In the last bind function we see the familiar output operator. In order to make it work, we need overloads of this operator for the monad template, and the three auxiliary classes (or ‘structs’ if you like):

// << overload for monad
template<typename V, typename S>
ostream& operator<<(ostream& os, const monad<V, S>& m)
{
	os << "Value: " << m.value << " State: " << m.state;
	return os;
}
// << overload for State
ostream& operator<<(ostream& os, const State& s)
{
	os << s.count;
	return os;
}
// << overload for Exception
ostream& operator<<(ostream& os, const Exception& e)
{
	if(e.ErrorMessage() != " 0")
		os << e.ErrorMessage();

	return os;
}
// << overload for Cout, more or less a dummy (but required)
ostream& operator<<(ostream& os, const Cout&)
{
	os << "Cout";
	return os;
}

The last thing we need is some function that can be applied by the bind functions. These functions cannot be type agnostic: the return type differs from the parameter type. So we have three overloads of them, and they do exactly the same. Let’s first define some handy type aliases:

typedef monad<int, State> is_monad;
typedef monad<is_monad, Exception> ise_monad;
typedef monad<ise_monad, Cout> isec_monad;

These typedefs nest a monad template instantiation, which is a type, within another. Now the functions to be used divide a number by 2:

is_monad divideby2(const int& i)
{
	return is_monad(i / 2);
}

ise_monad divideby2(const is_monad& m)
{
	if(m.value < 2)
		// Somewhat artificial
		throw Exception("Division by zero", -1);

	auto m2 = m | divideby2;
	return ise_monad(m2);
}

isec_monad divideby2(const ise_monad& m)
{
	auto m2 = m | divideby2;
	return isec_monad(m2);
}

As you can see, there is composition, but there is no generality. That is, monad composition is in fact ugly and painful and scaling is out of the question.

We run this, in a by now familiar way:

int _tmain(int argc, _TCHAR* argv[])
{
	{
		int i = 8;
		is_monad m1(i);
		ise_monad m3(m1);
		isec_monad m5(m3);

		auto m2 = m5 | divideby2 | divideby2 | divideby2 | divideby2;
	}

	cin.get();
	return 0;
}

and get the result we desired:

After a code exposé of just 200 lines :-).

A Very Ordinary C++ Function

The same result (well, without of course the funny output that so points out recursion) can be obtained by the following function:

void VeryOrdinary()
{
	int num = 8;
	int cnt = 1;

	try
	{
		for(unsigned int i = 0; i< 4; ++i)
		{
			num /= 2;
			++cnt;

			cout << "Value : " << num << " State: " << cnt << endl;

			if(num < 2)
				// Somewhat artificial
				throw Exception("Division by zero", -1);
		}
	}
	catch(Exception& ex)
	{
		cout << "Value : " << num << " State: " << cnt << " " << 					ex.ErrorMessage() << endl;
	}
}

We run it by simply calling it:

int _tmain(int argc, _TCHAR* argv[])
{
	SetConsoleTitle(L"Harder to C++: Monads [7]");

	VeryOrdinary();

	cin.get();
	return 0;
}

And the result is:

After a code exposé of 34 lines (it pays to remember the approximate factor of 6 for later).

Let’s compare the sizes of both approaches. Well, the monad composition is ridiculously much larger than the ordinary solution.

Performance

Performance measurements were set up in a way comparable to measurements in part 6: the code above is run a large number of iterations, and time is measured. This is done for 10 runs, and the average over the times is calculated. Before starting measurements, the monad composition is allowed a warming up run, to prevent an outlier. The default release build of the resulting program was run from a command console.

Now, time performance results are dominated almost completely by the use of std::cout and throwing exceptions. If both are used, or occur, times are comparable. If however, we do not log to the console, and have no exceptions thrown (we set value to 32), that is, we measure the performance at the naked difference of the constructed code, the regular code takes only 1.6% of the time the monad solution takes, i.e. the regular solution is about 60 times as fast, see the image below for a typical performance run.

In this particular case the ordinary C++ function is 38.8 / 0.64 = 60.6 times as fast. The time it takes the regular function to execute is 100% x 0.64 / 38.8 = 1.6% of the time it takes the monad composition to execute. Please note that the variation in measured times is limited.

Conclusions

Let’s be brief. The code size and time performance of the monad composition are not in sensible proportion to the size and performance of the regular function: 6 : 1 and 60 : 1 respectively. The code size of the monad composition is ridiculous, and monad composition kills C++ performance.

Wrap Up of Sequential Monads

We are a now at the end of 7 installments experimenting with monads, so let’s wrap up what we have so far. The question is: “Can monads be useful in day-to-day C++ programming?”. And the answer is “No”.

We have looked briefly at a C++ template meta programming (which constitutes a simple functional language) approach. The disadvantages are:

  • Complex syntax, hence diminished simplicity and clarity (one day your code will be maintained by another developer!)
  • Although it is possible to use this approach, the language does not support it. So it is unreasonably hard. (free after Stroustrup’s argument that OO development is possible in C, or assembly, but the language… )
  • There are no debugging facilities, which makes it very hard to develop substantial amounts of code as meta programs.

We have seen a simple and short implementation of the monad in mainstream C++. However, this implementation has the drawback that its performance dies if it is applied to large data structures.

The drawback could be remedied by replacing the unit function by a complete copy control set for the monadic template.

Lazy monads can be elegantly implemented as well, but they have no performance whatsoever, due to the fact that code is evaluated twice: once to create the expression to evaluate, and another time to actually evaluate the expression.

Furthermore, we have seen, in this episode that if we want to write somewhat larger programs using monad composition, we get very large programs without performance.

I think I’m done now with at least the sequential monad. I have no further interest: it is tedious, needlessly complicated, and leads to inefficient results when trying to construct meaningful programs from different monads in C++.

I wrote ‘sequential monad’ above, because there is still the continuation monad to investigate; the suggested solution to Callback Hell in concurrency contexts. So, for starters, what exactly is Callback Hell? We will see in part 8.

Harder to C++: Monads for Mortals [6], Performance of Basic Implementations

In parts 2 – 5 we have seen four different implementations of the monad. Now it is time for a shootout. I want to see which version has the best time performance, and compare it with a regular, non-monadic implementation to evaluate the cost of using a monad.

The monad implementations are:

  1. The first implementation that copies all its data.
  2. The lazy implementation that evaluates at explicit call.
  3. The references and move semantics implementation.
  4. The pointer based implementation.

The Task at Hand

The challenge is to create a task that the compiler cannot optimize by cleverly removing parts of the functionality. Imagine we have an algorithm that fills a vector with arbitrary data, which is then thrown away unused because the vector goes out of scope. We wouldn’t want the compiler to decide that it is useless to create the vector altogether and skip the code.

Further, I will make the task less predictable by extensive use of the rand() function. I know, rand() is not a great pseudo-random engine. However, rand() is not used here to simulate randomness, but to get values that are available only at runtime, not at compile time – when the code optimizer runs. This way, the code cannot be optimized based on compile time knowledge of constant values.

The algorithm that will be implemented by the functions that are ‘applied’ by the monad, and that will also be implemented by two regular functions is as follows (we will be working with the ValueObject class used in earlier episodes, but with a method added that retrieves a pointer to the payload):

  1. Receive a (pointer / reference to a) ValueObject object.
  2. Retrieve a char from the ValueObject object’s payload, with a random index, and store it in a global data structure.
  3. Create a size for a new ValueObject object, that depends on the size of the received object, and the pseudo random generator.
  4. Create a (monad with a new) ValueObject object and return (a pointer to) it.

Step 2 is to ensure that the payload is not optimized away because it will never be accessed. Step 3 serves to make the size of the payload somewhat unpredictable, and also wildly varying.

This is the version for the reference and move semantics based implementation:

	// Creates one more ValueObject object
	monad<ValueObject> onemore(const ValueObject& v)
	{
		// samples is a file scope vector<char>
		samples.push_back(*(v.getload() + rand() % (v.size()+1)));

		do
		{
			rsize = rand();
		} while (rsize == 0);

		vsize = v.size() + 1;
		rsize = rsize > vsize ? rsize / vsize : vsize / rsize;

		return monad<ValueObject>(ValueObject(rsize));
	}

Main Routine

In the main routine, we have a composition of 25 applications of the onemore function by the bind function. We execute this composition iterations times, which we call a run. We measure the time of each run, average over the runs, and print the result to screen.

The number of iterations and runs are given by rand(), which is seeded with a number obtained from the user.

Scoping is such that all objects are created and destroyed in measured time.

This is the code for the reference and move semantics based implementation:

{
	using namespace I3;

	srand(seed);
	unsigned int iterations = rand();
	unsigned int runs = rand() % 1000;

	cout << "Runs: " << runs << endl
		<< "Iterations per run: " << iterations << endl
		<< "Applications per iteration: 25" << endl;

	cout << "References and move semantics." << endl;
	total = 0;

	for (unsigned int i = 0; i < runs; ++i)
	{
		t.Start();

		for (unsigned int j = 0; j < iterations; ++j)
		{
			unsigned int size1 = rand();

			auto m1 = monad<ValueObject>(ValueObject(size1));

			auto m2 = m1
				| onemore | onemore | onemore | onemore | onemore
				| onemore | onemore | onemore | onemore | onemore
				| onemore | onemore | onemore | onemore | onemore
				| onemore | onemore | onemore | onemore | onemore
				| onemore | onemore | onemore | onemore | onemore;
		}

		t.Stop();
		total += t.Elapsed();

		cout << ". ";
	}

	cout << endl << "Average over "
	<< runs << " differently sized runs of 25 x "
	<< iterations << " applications: " << total / runs << " ms"
	<< endl << endl;
}

As earlier, the timer code is from Simon Wybranski. The outer braces are just to assure different implementations do not interfere.

The iterations for the lazy monad differ from the above in that each iteration also makes a call, m2(), to evaluate the constructed expression.

Regular Algorithms

The monad implementations are compared to regular algorithms that are functionally equivalent, but do not use monads. We will use one variant with references, and one variant with pointers and heap allocation.

The algorithm used is the same as above. The call in the main loop is terrible:

auto m2 = m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m1)
))))))))))))))))))))))));

This also illustrates why the name onemore was abbreviated to m.

For references (&), the m function was implemented as:

	ValueObject m(const ValueObject& v)
	{
		samples.push_back(*(v.getload() + rand() % (v.size()+1)));

		do
		{
			rsize = rand();
		} while (rsize == 0);

		vsize = v.size() + 1;
		rsize = rsize > vsize ? rsize / vsize : vsize / rsize;

		// First creating a tmp, then returning it coerces move semantics
		// instead of copy semantics
		auto tmp = ValueObject(rsize);

		return tmp;
	}

Procedure

Time performance was measured using a release build, with default compiler switches for release builds, and the runs were executed outside of Visual Studio. That is, the program was started from a console window, not from Visual Studio.

Results

And now the results:

So, what do we see?

  1. The monad implementation based on references and move semantics (yellow) is the fastest. But the pointer based implementation has about the same performance.
  2. The regular implementations are only slightly faster (red, orange).
  3. The lazy monad (blue) is very slow, as expected.

Conclusions

The choice is clear. In following episodes, we will work with the monad built on references and move semantics. If required, we might use the pointer based variant as well. I will not use the lazy monad, unless it is clear that very, very special circumstances make it a favorable choice.

I was surprised by the small difference (if any) between the fast monad implementation and the regular algorithms; the monad comprises about twice the source code of the regular function – it also includes the bind function. So (hypothesis): the cost of using the references and move semantics monad is negligible.

Next

Next time we will start looking at various types of monads and how we could build compositions of them.

Harder to C++: Monads for Mortals [5], Pointers

In part 2 we have seen a small and elegant implementation of the monad. In part 4 we have seen how references and move semantics can be used to make the monad’s performance independent of the size of the wrapped value. In this part we will see another implementation of the monad, based on a smart pointer.

Implementation of a Monad with a Smart Pointer

We have seen in part 2 that the type constructor is represented by a template in C++. In part 2 this is a simple struct, in part 3 it is a std::function holding a lambda expression (so as to implement delayed evaluation), and in this part it is a smart pointer, the std::unique_ptr to be precise.

So, now the type constructor is:

// Monad type constructor
template<typename T>
using monad = unique_ptr<T>;

We do not use a separate unit function, the unique_ptr‘s constructor will do. The bind function is:

// Bitwise OR operator overload, Monad bind function
template<typename A, typename R>
monad<R> operator|(monad<A>& mnd, monad<R>(*func)(const A*))
{
	log("---Function bind");

	return func(mnd.get());
}

Since the applied functions do not manage the life cycle of the existing A object, we send in a raw pointer, not a smart pointer.

Given the simple functions we used earlier, this implementation of the monad is used as follows:

// divides arg by 2 and returns result as monadic template
monad<ValueObject> divideby2(const ValueObject* v)
{
	log("---Function divideby2");

	return monad<ValueObject>(new ValueObject(v->size() / 2));
}

// checks if arg is even and returns result as monadic template
monad<bool> even(const ValueObject* v)
{
	log("---Function even");

	return monad<bool>(new bool(v->size() % 2 == 0));
}

void valueobjectmain()
{
	{
		auto m1 = monad<ValueObject>(new ValueObject(16));

		auto m2 = m1 | divideby2 | divideby2 | divideby2 | even;

		cout << boolalpha << *m2 << endl;
	}
}

The ValueObject class is the same as used in part 4.

Running the above code results in output:

I think the most important result here is what we don’t see: calls to the move constructor and the accompanying call to the destructor of the ValueObject.

Pro’s, Con’s, and Questions

The above image of the output is nicely constrained. Functions return the unique_ptr by value, i.e. it gets moved (a unique_ptr cannot be copied), which is ok, and the bind function takes a reference to a unique_ptr from which to elicit the raw pointer. We see the calls to the ValueObject destructor occur after the assignment to m2, and when leaving the anonymous scope.

So, pro’s for this implementation are its simplicity, clarity, and efficiency: the overhead of a unique_ptr is comparable to the overhead of a raw pointer. Values are not copied, so performance is size independent.

You may wonder, though, whether it is a good idea to create all the monad’s values on the heap, which is not so efficient. In part 6 we take a look at what we have so far and see what works best.