Posts Tagged ‘Performance’

Harder to C++: Monads for Mortals [7], Monad Composition

Early authors on monads, e.g. [Moggi], [Wadler], describe several types of monads, notably for side-effects, state, exceptions and IO. Of course we want to write programs that have all these features. There are, in principle, two possible approaches. First: write a separate monad each time we think up a new piece of non-pure-functional … (shall we dare call it functionality?). Second: compose different standard monads into larger wholes. In [part 2] of his excellent tutorial on monads Mike Vanier writes:

Functional programmers talk about “composability” all the time, with the implication that if some aspect of a programming language isn’t composable, it’s probably not worth much.

So, clearly composition is key and we will see some monads composed into larger wholes here. To be clear: I don’t mean function composition, I mean monad composition. In C++. To be precise, I will define:

  • A monad that executes a function which maintains a state (side-effect).
  • A monad that catches and handles exceptions.
  • A monad that writes the result of the function composition to the console, or an error message in case an exception occurred.

Then I will compose those monads into a larger whole, with preservation of imperative ‘functionality’. We will see that it actually works. Then I will show a very general imperative C++ function, that has the same functionality, and we will compare size and performance of both approaches.

Monad Composition

We will use a single monad template and instantiate it for a State type, an Exception type, and a std::cout wrapper. The monad template looks like this:

//V: wrapped value, S: state of some sort
template<typename V, typename S>
struct monad
{
	V value;
	S state;

	monad(const V& v) : value(v) {}
	monad(V&& v) : value(std::move(v)) {}
	monad(const V& v, const S& s) : value(v), state(s) {}
	monad(V&& v, S&& s) : value(std::move(v)), state(std::move(s)) {}
	template<typename T, typename W>
	monad(const monad<T, W>& m) : value(m.value), state(m.state) {}
	monad(monad&& m) : value(std::move(m.value)), state(std::move(m.state)) {}
	~monad() {}
	monad& operator=(const monad& m)
	{
		if(this != &m)
		{
			value = m.value;
			state = m.state;
		}
		return *this;
	}
	monad& operator=(monad&& m)
	{
		if(this != &m)
		{
			value = std::move(m.value);
			m.value = V();

			state = std::move(m.state);
			m.state = S();
		}
		return *this;
	}
};

The template consists of a wrapped value and a field to hold state of some sort, e.g. actual state, an exception, or an io stream. The rest of the code is just what is generally referred to as ‘copy control’.

The ‘S’ parameter will be instantiated using three classes: State, Exception, and Cout.

struct State
{
	int count = 1;
	void update(const int seed) { count += seed; }
};
struct Exception : std::exception
{
	Exception() : message(""), errorCode(0) {}

	Exception(string msg, int errorCode) : message(msg), errorCode(errorCode) { }
	string ErrorMessage() const
	{
		stringstream ost;
		ost << message << " " << errorCode;
		return ost.str();
	}

private:
	string message;
	int errorCode;
};
struct Cout
{
	ostream& os;
	Cout() : os(cout) {}
};

Cout is a std::cout wrapper. We need it because IO streams cannot be copied or assigned, but by the design of our monad template (S is not a reference or pointer) we need something that can actually be copied or assigned.

For the monad template we define three overloads of the bind function, one for each instantiation of the template:

// bind function overload: State
template<typename A, typename R>
monad<R, State> operator| (const monad<A, State>& mnd, monad<R, State>(*func)(const A&))
{
	auto tmp = func(mnd.value);
	tmp.state.update(mnd.state.count);
	return tmp;
}

// bind function overload: Exception
template<typename A, typename R>
monad<R, Exception> operator| (const monad<A, Exception>& mnd, monad<R, Exception>(*func)(const A&))
{
	try
	{
		return func(mnd.value);
	}
	catch(Exception& ex)
	{
		auto tmp = monad<R, Exception>(R(mnd.value));
		tmp.state = ex;
		return tmp;
	}
}

// bind function overload: Cout
template<typename A, typename R>
monad<R, Cout> operator|(const monad<A, Cout>& mnd, monad<R, Cout>(*func)(const A&))
{
	auto tmp = func(mnd.value);
	// Need to create a new tmp, because we cannot assign, copy ostream.
	// When initializing, we use a ref to ostream
	auto tmp2 = monad<R, Cout>(tmp.value, mnd.state);
	tmp2.state.os << tmp2 << endl;
	return tmp2;
}

In the last bind function we see the familiar output operator. In order to make it work, we need overloads of this operator for the monad template, and the three auxiliary classes (or ‘structs’ if you like):

// << overload for monad
template<typename V, typename S>
ostream& operator<<(ostream& os, const monad<V, S>& m)
{
	os << "Value: " << m.value << " State: " << m.state;
	return os;
}
// << overload for State
ostream& operator<<(ostream& os, const State& s)
{
	os << s.count;
	return os;
}
// << overload for Exception
ostream& operator<<(ostream& os, const Exception& e)
{
	if(e.ErrorMessage() != " 0")
		os << e.ErrorMessage();

	return os;
}
// << overload for Cout, more or less a dummy (but required)
ostream& operator<<(ostream& os, const Cout&)
{
	os << "Cout";
	return os;
}

The last thing we need is some functions that can be applied by the bind functions. These functions cannot be type agnostic: the return type differs from the parameter type. So we have three overloads, and they all do exactly the same thing. Let’s first define some handy type aliases:

typedef monad<int, State> is_monad;
typedef monad<is_monad, Exception> ise_monad;
typedef monad<ise_monad, Cout> isec_monad;

These typedefs nest a monad template instantiation, which is a type, within another. Now the functions to be used divide a number by 2:

is_monad divideby2(const int& i)
{
	return is_monad(i / 2);
}

ise_monad divideby2(const is_monad& m)
{
	if(m.value < 2)
		// Somewhat artificial
		throw Exception("Division by zero", -1);

	auto m2 = m | divideby2;
	return ise_monad(m2);
}

isec_monad divideby2(const ise_monad& m)
{
	auto m2 = m | divideby2;
	return isec_monad(m2);
}

As you can see, there is composition, but there is no generality. That is, monad composition is in fact ugly and painful and scaling is out of the question.

We run this, in a by now familiar way:

int _tmain(int argc, _TCHAR* argv[])
{
	{
		int i = 8;
		is_monad m1(i);
		ise_monad m3(m1);
		isec_monad m5(m3);

		auto m2 = m5 | divideby2 | divideby2 | divideby2 | divideby2;
	}

	cin.get();
	return 0;
}

and get the result we desired:

After a code exposé of just 200 lines :-).

A Very Ordinary C++ Function

The same result (well, without, of course, the funny output that so nicely points out the recursion) can be obtained by the following function:

void VeryOrdinary()
{
	int num = 8;
	int cnt = 1;

	try
	{
		for(unsigned int i = 0; i< 4; ++i)
		{
			num /= 2;
			++cnt;

			cout << "Value : " << num << " State: " << cnt << endl;

			if(num < 2)
				// Somewhat artificial
				throw Exception("Division by zero", -1);
		}
	}
	catch(Exception& ex)
	{
		cout << "Value : " << num << " State: " << cnt << " " << ex.ErrorMessage() << endl;
	}
}

We run it by simply calling it:

int _tmain(int argc, _TCHAR* argv[])
{
	SetConsoleTitle(L"Harder to C++: Monads [7]");

	VeryOrdinary();

	cin.get();
	return 0;
}

And the result is:

After a code exposé of 34 lines (it pays to remember the approximate factor of 6 for later).

Let’s compare the sizes of both approaches. Well, the monad composition is ridiculously larger than the ordinary solution.

Performance

Performance measurements were set up in a way comparable to the measurements in part 6: the code above is run for a large number of iterations, and the time is measured. This is done for 10 runs, and the average over the times is calculated. Before starting measurements, the monad composition is allowed a warming-up run, to prevent an outlier. The default release build of the resulting program was run from a command console.
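The measurement harness itself is not shown here; a minimal sketch of such a harness, using std::chrono (the function under test and the iteration counts are illustrative), could look like this:

#include <chrono>

// Hypothetical stand-in for the code under test: the monad composition chain
// or the very ordinary C++ function below.
void RunCodeUnderTest()
{
	// ... composition or plain function goes here ...
}

double MeasureAverageMs(int runs, int iterationsPerRun)
{
	using namespace std::chrono;

	RunCodeUnderTest(); // warming up, not measured, to prevent an outlier

	double totalMs = 0.0;
	for(int r = 0; r < runs; ++r)
	{
		auto start = high_resolution_clock::now();
		for(int i = 0; i < iterationsPerRun; ++i)
			RunCodeUnderTest();
		auto stop = high_resolution_clock::now();
		totalMs += duration_cast<duration<double, std::milli>>(stop - start).count();
	}
	return totalMs / runs; // average time per run, e.g. over 10 runs
}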

Now, the time performance results are dominated almost completely by the use of std::cout and by throwing exceptions. If both are used (or occur), the times are comparable. If, however, we do not log to the console and no exceptions are thrown (we set the initial value to 32) – that is, we measure the naked difference of the constructed code – the regular code takes only 1.6% of the time the monad solution takes, i.e. the regular solution is about 60 times as fast; see the image below for a typical performance run.

In this particular case the ordinary C++ function is 38.8 / 0.64 = 60.6 times as fast. The time it takes the regular function to execute is 100% x 0.64 / 38.8 = 1.6% of the time it takes the monad composition to execute. Please note that the variation in measured times is limited.

Conclusions

Let’s be brief. The code size and time performance of the monad composition are not in sensible proportion to the size and performance of the regular function: 6 : 1 and 60 : 1 respectively. The code size of the monad composition is ridiculous, and monad composition kills C++ performance.

Wrap Up of Sequential Monads

We are a now at the end of 7 installments experimenting with monads, so let’s wrap up what we have so far. The question is: “Can monads be useful in day-to-day C++ programming?”. And the answer is “No”.

We have looked briefly at a C++ template metaprogramming approach (template metaprogramming constitutes a simple functional language). The disadvantages are:

  • Complex syntax, hence diminished simplicity and clarity (one day your code will be maintained by another developer!)
  • Although it is possible to use this approach, the language does not support it. So it is unreasonably hard. (free after Stroustrup’s argument that OO development is possible in C, or assembly, but the language… )
  • There are no debugging facilities, which makes it very hard to develop substantial amounts of code as meta programs.

We have seen a simple and short implementation of the monad in mainstream C++. However, this implementation has the drawback that its performance dies if it is applied to large data structures.

The drawback could be remedied by replacing the unit function by a complete copy control set for the monadic template.

Lazy monads can be elegantly implemented as well, but they have no performance whatsoever, because the code is evaluated twice: once to create the expression to evaluate, and once more to actually evaluate that expression.

Furthermore, we have seen in this episode that if we want to write somewhat larger programs using monad composition, we get very large programs with poor performance.

I think I’m done now with at least the sequential monad. I have no further interest. It is tedious, needlessly complicated, and leads to inefficient results when trying to construct meaningful programs from different monads in C++.

I wrote ‘sequential monad’ above, because there is still the continuation monad to investigate; the suggested solution to Callback Hell in concurrency contexts. So, for starters, what exactly is Callback Hell? We will see in part 8.

A GPU Bilateral Filter Implementation

This post reports on a bilateral filter implementation that improves processing time from 32ms to 0.25ms.

Introduction

The Kinect (for Windows) depth data are subject to some uncertainty that comes with its resolution. Depth estimates are defined in millimeters, and typically, subsequent depth measurements by the Kinect vary by a fixed amount.

Consider the graphs below. The x-axis counts the number of measurements, the y-axis represents distance measurements of a single point. The top graph shows connected dots, the lower graph shows just the dots.

The graphs show two tendencies. One is that the variance is one unit above, or one unit below, the average practically all of the time; the second tendency is that the average changes a bit before it stabilizes. Here we see it change from about 3.76m via 3.8m to about 3.84m.

If the Kinect depth data is projected onto an image this variation translates into a nervous jitter. Since I do not particularly care for a nervous jitter, I would like to stabilize the depth data a bit.

Stabilizing Kinect Depth Data – Temporal Approach

The Kinect for Windows SDK (1.6) contains a whitepaper on skeletal joint smoothing. The paper deals with the reduction of noise in the Kinect skeletal tracking system. This tracking system employs the same depth data, and therefore suffers from the same problem.

The proposed solution is to filter the data over time. The depth measurement z(x,y)(t) of a location (x, y) at time t can be averaged over a number of measurements in the past at the same location: z(x,y)(t-i) where i is in [1, n]. The suggestion is to take n not too large, say 5.
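Written out (the formula is mine and merely restates the description above), the temporally filtered depth is the mean over the last n frames: z̄(x,y)(t) = (1/n) · Σ_{i=1..n} z(x,y)(t−i), with for instance n = 5.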

Averaging can also be over measurements in the future. This implies that one or two frames are included in averaging before an image based on the depth image is rendered, hence there is a latency in rendering equal to the number of ‘future’ frames included in averaging. The advantage of considering the ‘future’ is that if the measured scene changes (or a player changes position – in skeletal tracking), another type of averaging can be applied, one that is better suited for changes and e.g. puts a heavier weight on recent measurements.

I’ve done an experiment with temporal filtering, but it was not satisfactory. The fast and nervous jitter just turns into a slower one that is even more disturbing because short periods of stability make changes seem more abrupt.

Stabilizing Kinect Depth Data – Spatial Approach

Another approach is not to average over measurements at the same location through time, but to average within one frame, over several proximate measurements. A standard solution for this kind of filtering is the Bilateral filter. The Bilateral Filter is generally attributed to Carlo Tomasi and Roberto Manduchi. But see this site where it is explained that there were several independent discoveries.

The idea behind the Bilateral Filter is that the weight of a measurement in the average is a Gaussian function of both the distance and the similarity (in color, intensity, or as in our case: depth value). The similarity term prevents edges from being ‘averaged out’.
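To make that concrete, here is a naive, textbook-style bilateral filter over a depth map in C++. This is my own illustration, not the code of the implementations discussed below; sigmaSpace and sigmaRange are the spatial and similarity widths, and radius is the half size of the neighborhood.

#include <cmath>
#include <vector>

// Naive bilateral filter: weight = Gaussian(spatial distance) * Gaussian(depth difference).
std::vector<float> BilateralFilter(const std::vector<float>& depth, int width, int height,
	int radius, float sigmaSpace, float sigmaRange)
{
	std::vector<float> result(depth.size());
	for(int y = 0; y < height; ++y)
	for(int x = 0; x < width; ++x)
	{
		const float center = depth[y * width + x];
		float sum = 0.0f, weightSum = 0.0f;
		for(int dy = -radius; dy <= radius; ++dy)
		for(int dx = -radius; dx <= radius; ++dx)
		{
			const int nx = x + dx, ny = y + dy;
			if(nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
			const float neighbor = depth[ny * width + nx];
			// Spatial term: Gaussian of the distance to the center pixel.
			const float wSpace = std::exp(-(dx * dx + dy * dy) / (2.0f * sigmaSpace * sigmaSpace));
			// Similarity term: Gaussian of the difference in depth; this keeps edges intact.
			const float diff = neighbor - center;
			const float wRange = std::exp(-(diff * diff) / (2.0f * sigmaRange * sigmaRange));
			sum += wSpace * wRange * neighbor;
			weightSum += wSpace * wRange;
		}
		result[y * width + x] = sum / weightSum;
	}
	return result;
}

The cost of this direct form – a full neighborhood visited per pixel – is exactly the computational burden the fast algorithms discussed below try to avoid.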

The Bilateral Filter works well, the only drawback it has is its computational complexity: O(N^2) where N is the (large!) number of pixels in the image. So, several people have been working on fast algorithms to alleviate the computational burden. To me it seems that Ben Weiss provided a good solution, but it is not generally available. The solution by Frédo Durand and Julie Dorsey (2002), and the elaboration of this work by Sylvain Paris and Frédo Durand (2006), all from MIT, seems to be the leading solution, and is generally available – both the theory and example software. Their method has a project site that is here.

In a nutshell, the method by Sylvain Paris and Frédo Durand reduces processing time by first downsampling the image, then applying a convolution to compute the averages, and finally scaling the image up again while clamping out-of-bounds values. So in essence, it operates on a (cleverly) reduced version of the image.

I’ve downloaded and compiled the software – the really fast version with the truncated kernel – and it requires about 0.032s to process a ppm image of 640×480 pixels (grayscale values), where the spatial neighborhood is set to 16 (pixels) and the ‘similarity’ neighborhood is set to 0.1, so grayscale colors that differ more than 0.1 after transformation to normalized double representation are not considered in the average. See the image below for a screen shot.

The processing time is, of course, computer dependent, but my pc is not really slow. Although 32ms is a fine performance, it is too slow for real-time image processing. The Kinect produces a frame 30 times per second, i.e. every 33ms, and we do not want to create a latency of about one frame just because of the Bilateral Filter.

GPU implementation: C++ AMP

In order to improve on the processing time of this fast algorithm I’ve written a C++ AMP program inspired by the CPU implementation, this program runs on the GPU, instead of on the CPU. For information on C++ AMP, see here and here. What I think is great about AMP is that it provides a completely general access to General Purpose GPU computing. Having said that, I must also warn the reader that I do not master it to the degree that I could guarantee that my implementation of the Bilateral Filter in C++ AMP is representative of what could be achieved with C++ AMP.

The result of my efforts is that the ppm image above can now be processed in little over 1 ms. Consider the picture below, made with my ATI Radeon HD 5700 Graphics card.

What you see here is a variety of timings of the computational phases. The top cycle takes 1.1ms, the middle one takes 1.19ms, and the bottom cycle takes 1.07ms. So, what is in the cycle?

1. The image is loaded into the GPU, and data structures are initialized. If you want to know more on ‘warming up’ the data and the code, see here. Since it takes 0.5 to 0.6 ms it is obviously the bottleneck.

2. Down sampling the image to a smaller version takes around 0.1 ms.

3. Computing the convolution takes 0.35 ms. This is the real work.

4. Up scaling and clamping takes again 0.1 ms.

A processing time of about 1 ms is satisfactory as a real-time processing time. Moreover, since we may assume the data is already in GPU memory (we need it there to render it to the screen), GPU upload time is not an attribute of an application of the Bilateral Filter in this context. So we may think of the processing time as being about 0.55 ms, which is absolutely fabulous.

New Graphics Card

At about this time, I bought a new graphics card, an Asus NVidia GTX690 (which for the purposes of this application yields the same results as a GTX 680, I know). This card was installed in my pc. Ok, I didn’t buy a new motherboard, so data is still being uploaded through PCI-e 2.0 and not through PCI-e 3.0 16x (but in time…). So, will this make a difference? Yes, it does. Look at the screen shot below.

I rearranged the timings a bit, to gain better oversight. We see that:

1. Data uploading and the warming up process now takes about 0.45 ms.

2. Filtering now takes about 0.25 ms.

From 32ms to 0.25ms. Most satisfying!

Vector-Matrix Inner Product with Compute Shader and C++ AMP

Large vector-matrix inner products by the GPU are 250 times faster than straightforward CPU implementations on my PC. Using C++ AMP or a Compute Shader the GPU realized a performance of over 30 gFLOPS. That is a huge increase, but my GPU has a “computational power” (whatever that may be) of 1 teraFLOP, and 30 gFLOPS is still a long way from 1000 gFLOPS.

This article presents a general architectural view of the GPU and some details of a particular exemplar: the Ati Radeon HD5750. Then code examples follow that show various approaches to large vector-matrix products. Of course the algorithm at the end of the article is the fastest. It is also the simplest.

Unified View of the GPU Architecture

Programming the GPU is based on an architectural view of the GPU. The purpose of this architectural view is to provide a unified perspective on GPUs from various vendors, hence with different hardware setup. It is this unified architecture that’s being programmed against using DirectX11. A good source of information on Direct Compute and Compute Shaders is the Microsoft Direct Compute BLog. The architecture described below is based on information from Chas Boyd’s talk at PDC09, as published on Channel9. Of course, this blog post only presents some fragments of the information found there.

A GPU is considered to be built from a number of SIMD cores. SIMD means: Single Instruction, Multiple Data. By the way, the pictures below are hyperlinks to their source.

The idea is that a single instruction is executed on a lot of data, in parallel. The SIMD processing unit is particularly fit for “data parallel” algorithms. A GPU may consist of 32 SIMD cores (yes, the image shows 40 cores) that access memory with 32 floats at a time (128 bit bus width). Typically the processor runs at 1Ghz, and has a (theoretical) computational power of about 1 TeraFLOP.

A SIMD core uses several kinds of memory:

  • 16 Kbyte of (32-bit) registers. Used for local variables
  • 8 Kbyte SIMD shared memory, L1 cache.
  • L2 cache

The GPU as a whole has typically 1Gb of general RAM. Memory access bandwidth is typically of order 100GBit/s.

Programming Model

A GPU is programmed using a Compute Shader or C++ AMP. Developers can write compute shaders in HLSL (which looks like C) to be executed on the GPU. AMP is a C++ library. The GPU can run up to 1024 threads per SIMD. A thread is a line of execution through code. The SIMD shared memory is shared among the threads of a SIMD. It is programmable in the sense that you can declare variables (arrays) as “groupshared”, and they will be stored in the Local Data Share. Note, however, that over-allocation will spill the variables to general RAM, thus reducing performance. Local variables in shader code will be stored in registers.

Tactics

The GPU architecture suggests programming tactics that will optimize performance.

  1. Do your program logic on the CPU, send the data to the GPU for operations that apply to (about) all of the data and contain a minimal number of alternative processing paths.
  2. Load as much data as possible into the GPU general RAM, so as to prevent the GPU waiting for data from CPU memory.
  3. Declare registers to store isolated local variables.
  4. Cache data that you reuse in “groupshared” memory. Don’t cache data you don’t reuse. Keep in mind that you can share cached data among the threads of a single group only.
  5. Use as many threads as possible. This requires that you use only small amounts of cache memory per thread.
  6. Utilize the GPU as efficiently as possible by offering it many more threads than it can process in a small amount of time.
  7. Plan the use of threads and memory ahead, then experiment to optimize.

Loading data from CPU memory into GPU memory passes through the PCIe bridge, which has a bandwidth typically of the order of 1GBit/s; that is, it is a bottleneck.

So, you really want to load as much data as possible into GPU memory before executing your code.

The trick in planning your parallelism is to chop up (schedule, that is :-)) the work in SIMD-size chunks. You can declare groups of threads: the size of the groups and the number of groups. A group is typically executed by a single SIMD. To optimize performance, use Group Shared Memory, and set up the memory consumption of your thread group so it will fit into the available Group Shared Memory. That is: restrict the number of threads per group, and make sure you have a sufficient number of groups. Thread groups are three dimensional. My hypothesis at this time is that it is best to fit the dimensionality of the thread groups to match the structure of the end result. More about this below. Synchronization of the threads within a thread group flushes the Group Shared Memory of the SIMD.

A register typically has a lifetime that is bound to a thread. Individual threads are members of several groups – depending on how you program stuff. So, intermediate results aggregated by thread groups can be stored in registers.

Does My ATI Radeon HD5750 GPU Look Like This Architecture… A Bit?

The picture below (from here) is of the HD5770, which has 10 SIMD cores, one more than the HD5750.

What do we see here?

  • SIMD engines. We see 10 cores for the HD5770, but there are 9 in the HD5750. Each core consists of 16 red blocks (streaming cores) and 4 yellow blocks (texture units).
  • Registers (light red lines between the red blocks).
  • L1 texture caches, 18 Kbyte per SIMD.
  • Local Data Share, 32 Kbyte per SIMD.
  • L2 caches, 8 Kbyte each.

Not visible is the 1Gb general RAM.

The processing unit runs at 700Mhz, memory runs at 1,150Mhz. Overclocking is possible however. The computational power is 1.008 TeraFLOP. Memory bandwidth is 73.6 GB/s.

So, my GPU is quite a lot less powerful than the reference model. At first, a bit disappointing but on the other hand: much software I write for this GPU cannot run on the PCs of most people I know – their PCs are too old.

Various Approaches to Vector-Matrix Multiplication

Below, a number of approaches to vector-matrix multiplication will be discussed. They will include measurements of time and capacity. So, how do we execute the code and what do we measure?

Times measured include a number of iterations that each multiply the vector by the matrix. Usually this is 100 iterations, but fast alternatives get 1000 iterations. The faster the alternative, the more we are interested in variance and overhead.

Measurements:

  • Do not include data upload and download times.
  • Concern an equal data load, 12,288 input elements if the alternative can handle it.
  • Include a correctness check: the computation is also performed by CPU reference code.
  • Run a release build from Visual Studio, without debugging.
  • Allow AMP programs a warming-up run.

Vector-Matrix Product by CPU: Reference Measurement

In order to determine the performance gain, we measure the time it takes the CPU to perform the product. The algorithm, hence the code, is straightforward:
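The listing itself appears as an image in the original post; a plain CPU version along the described lines (function and variable names are mine) might look like this:

#include <vector>

// Straightforward CPU reference: y = x * M, with M stored row major.
// In the measurement below rows == cols == 12,288.
void VectorMatrixProductCpu(const std::vector<float>& x, const std::vector<float>& M,
	std::vector<float>& y, int rows, int cols)
{
	for(int c = 0; c < cols; ++c)
		y[c] = 0.0f;

	for(int r = 0; r < rows; ++r)
		for(int c = 0; c < cols; ++c)
			y[c] += x[r] * M[r * cols + c]; // one multiplication and one addition per element
}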

In this particular case rows = cols = 12,288. The average over 100 runs is 2,452 ms, or 2.45 seconds. This amounts to a time performance of 0.12 gFLOPS (giga FLOPS: FLoating point Operations Per Second). We restrict floating point operations to addition and multiplication (yes, that includes subtraction and division). We calculate gFLOPS as:

gFLOPS = 2 × Rows × Cols / (ms × 10^6), where ms is the average time in milliseconds.
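With the numbers above: 2 × 12,288 × 12,288 / (2,452 × 10^6) ≈ 0.12 gFLOPS, which is indeed the figure reported.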

The result of the test is correct.

Parallel Patterns Library

Although this blog post is about GPU performance, I took a quick look at PPL performance. We then see a performance gain of a factor of 2, but the result is incorrect; that is, the above code leads to indeterminacy in a parallel_for loop. I left it at that, for now.

Matrix-Matrix Product

We can of course, view a vector as a matrix with a single column. The C++ AMP documentation has a running code example of a matrix multiplication. There is also an accompanying compute shader analog.

AMP

To the standard AMP example I’ve added some optimizing changes, and measured the performance. The AMP code looks like this:

Here: amp is an alias for the Concurrency namespace. The tile size TS has been set to 32, which is the maximum; the product of the dimensional extents of a compute domain should not exceed 1024. The extent of the compute domain has been changed to depend on B, the matrix, instead of the output vector. The loop that sums element products has been unrolled in order to further improve performance.

As mentioned above, we start with a warming up. As is clear from the code we do not measure data transport to and from the GPU. Time measurements are over 100 iterations. The average run time obtained is 9,266.6 ms, hence 0.01 gFLOPS. The result after the test run was correct.

The data load is limited to 7*1024 = 7,168 elements; that is, 8*1024 is unstable.

Compute Shader

The above code was adapted to also run as a compute shader. The code looks like this:

The variables Group_SIZE_X and Group_SIZE_Y are passed into the shader at compile time, and are set to 32 each.

Time measurements are over 100 iterations. The average run time obtained is 11,468.3 ms, hence 0.01 gFLOPS. The result after the test run was correct. The data load is limited to 7*1024 = 7,168 elements; that is, 8*1024 is unstable.

Analysis

The performance of the compute shader is slightly worse than that of the AMP variant. Analysis with the Visual Studio 11 Concurrency Visualizer shows that, in the case of the compute shader program, the work on the GPU is executed in small spurts, separated by small periods of idleness, whereas in the AMP program the work is executed by the GPU in one contiguous period of time.

Nevertheless, performance is bad, worse than the CPU alternative. Why? Take a look at the picture below:

For any value of t_idx.global[0] – which is based on the extent of the matrix – that is unequal to zero, vector A does not have a value. So, in fact, if N is the number of elements in the vector, we do O(N³) retrievals but only O(N²) computations. So, we need an algorithm that is based on the extent of a vector, say the output vector.

Vector-Matrix Product

Somehow, it proved easier to develop the vector-matrix product as a compute shader. This is in spite of the fact that unlike AMP, it is not possible (yet?) to trace a running compute shader in Visual Studio. The idea of the algorithm is that we tile the vector in one dimension, and the matrix in two, thus obtaining the effect that the vector tile can be reused in multiplications with the matrix tile.

Compute Shader

A new compute shader was developed. This compute shader caches vector and matrix data in Group Shared memory. The HLSL code looks like this:

This program can handle much larger amounts of data. Indeed, this program runs problem free for an input vector of 12,288 elements and a total data size of 576 Mbyte. The time performance is 10.3 ms per run, averaged over 1,000 runs, which amounts to 29.3 gFLOPS. The result of the final run was reported to be correct.

AMP

In analogy to the compute shader above I wrote (and borrowed 🙂 ) a C++ AMP program. The main method looks like this:

The matrix is a vector with size * size elements. The tile size was chosen to be 128, because that setting yields optimal performance. The program was run on an input vector of 12,288 elements again, with a total data size of 576 Mbyte. The time performance is 10.1 ms per run, averaged over 1,000 runs, which amounts to 30.0 gFLOPS. The result of the final run was reported to be correct.

Analysis

We see here that the performance has much improved. When compared to the reference case, we can now do it (in milliseconds) 2,452 : 10.1 = 243 : 1, hence 243 times faster.

Simpler

Then, I read an MSDN Magazine article on AMP tiling by Daniel Moth, and it reminded me that caching is useless if you do not reuse the data. Well, the above algorithm does not reuse the cached matrix data. So I adapted the Compute Shader program to retrieve matrix data from central GPU memory directly. The HLSL code looks like this:

Note the tileSize of 512 (!). This program was run for a vector of 12,288 elements and a total data size of 576 Mbyte. The time performance is again 10.3 ms for a multiplication, which amounts to 29.3 gFLOPS (averaged over 1,000 runs). The result of the final run was reported to be correct. So, indeed, caching the matrix data does not add any performance improvement.

AMP

For completeness, the AMP version:
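The listing is again an image in the original post; the sketch below is my own reconstruction of a kernel of this shape – the vector tile cached in tile_static memory, the matrix read directly from GPU memory – so names and details are assumptions, not the original code:

#include <amp.h>
#include <vector>
using namespace concurrency;

static const int TileSize = 128; // the tile size that proved optimal below

// y[j] = sum over i of x[i] * M[i * size + j]; size must be a multiple of TileSize.
void VectorMatrixProductAmp(const std::vector<float>& x, const std::vector<float>& M,
	std::vector<float>& y, int size)
{
	array_view<const float, 1> xView(size, x);
	array_view<const float, 2> mView(size, size, M);
	array_view<float, 1> yView(size, y);
	yView.discard_data();

	parallel_for_each(yView.extent.tile<TileSize>(),
		[=](tiled_index<TileSize> idx) restrict(amp)
	{
		tile_static float xTile[TileSize]; // vector chunk shared by all threads in the tile
		float sum = 0.0f;
		for(int t = 0; t < size; t += TileSize)
		{
			// Each thread loads one vector element into the shared tile...
			xTile[idx.local[0]] = xView[t + idx.local[0]];
			idx.barrier.wait();
			// ...and every thread reuses the whole tile, reading the matrix directly.
			for(int k = 0; k < TileSize; ++k)
				sum += xTile[k] * mView[t + k][idx.global[0]];
			idx.barrier.wait();
		}
		yView[idx.global[0]] = sum;
	});
	yView.synchronize();
}

The point of the structure is that each vector tile is loaded once and reused TileSize times, whereas each matrix element is read exactly once – caching the matrix would not buy anything, as noted above.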

Time performance is optimal for a tile size of 128, in case the number of vector elements is 12,288. We obtain an average run time of 9.7 ms (averaged over 1,000 runs), and a corresponding 31.1 gFLOPS. The result of the final run was correct. This program is 2452 / 9.7 = 252.8 times as fast as the reference implementation.

Conclusions

Developing an algorithm for the vector-matrix inner product has demonstrated comparable performance for Compute Shaders and AMP, but much better tooling support for AMP: we can step through AMP code while debugging, and the Concurrency Visualizer has an AMP line. This better tool support helped very well in analyzing the performance of a first shot at the algorithm. The final algorithm proved over 250 times faster than a straightforward CPU program for the same functionality.

Detailed knowledge of the GPU architecture, or the hardware model, proved of limited value. When trying to run the program with either the maximum number of threads per group, or the maximum amount of data per Group Shared Memory, I ran into parameter value limits, instabilities, performance loss, and incorrect results. I guess you will have to leave the detailed optimization to the GPU driver and the AMP compiler.

One question keeps bothering me though: Where is my TeraFLOP?

I mean, Direct Compute was introduced with the slogan “A teraFLOP for every one of us”, AMP is built on top of Direct Compute, and my GPU has a computational power of 1.008 TeraFLOP. Am I not ‘one of us’?

C++ AMP Performance and Compute Shader Performance

Edit (April 23rd 2012):

The AMP team has updated the N-Body Simulation code to turn it into a clean port that relates to the Compute Shader original in a comprehensible way. Now it has comparable performance to the original (optimized) version (both versions do >330 gFLOPS at >30 fps for 23,040 particles on my pc).

I’m impressed. For one, by the attitude of the AMP people, who energetically reacted to issues which other people / teams might well have dismissed as unimportant. Then there is the point that you get maximum performance from a set of very powerful processors with code that is very short compared to the Direct Compute code you would have to write otherwise, and this code, by AMP design, is very elegant as well.

Of course, there is a risk in short and elegant code: subtle differences in code can make substantial differences in performance, hence developing AMP code is rather knowledge intensive. But I kind of like that.

Edit (April 16th 2012):

The results below were brought to the C++ AMP forum for discussion. Daniel Moth advised to update the driver of the graphics card. This update made a tremendous difference for two of the three programs mentioned below, for which C++ AMP performance is now equal to or better than Compute Shader performance.

The discussion on the N-Body Simulation program, which is heavily optimized in the Compute Shader version is still open, mainly because the required information is not available yet. I expect that also in this case C++ AMP will prove to be equipotent to Compute Shader programs.

Now, what have we learned from this exercise? For one, a lot about Compute Shader optimization and the mechanisms of GPU computing performance. This is an interesting and instructive subject. I also have learned that C++ AMP performance is comparable to Compute Shader performance. However, I do not (yet) understand if and how this will always and necessarily be the case, and that still itches a bit.

Results as they are standing now:

Program                          AMP          CS
Guide
  Average time (ms, 10 it.)      2,650        2,995
  gFLOPS                         36.9         32.7
  Max. Data Load (Kb)            714,432      691,200
Vector Addition
  Average time (ms, 10 it.)      6,017        8,155
  gFLOPS                         0.03         0.02
  Max. Data Load (Kb)            1,781,248    2,039,056
N-Body Simulation
  Number of Particles            16,128       16,128
  Frame rate                     44.4         63.4
  gFLOPS                         229          329

To date, I find that Compute Shader based programs outperform C++ AMP programs both in time and space. Results of the example programs I explored, which have been created by the respective product teams, tend to show substantially better performance by the Compute Shader programs. These programs are the N-Body Simulation Sample, Basic Summation, and the matrix multiplication programs from the “C++ AMP for the DirectCompute Programmer” guide. Hyperlinks are provided in the sections below.

So, the question is: can there be an AMP program that performs substantially better in time and space on, let’s say, large matrix multiplication (or large matrix-vector multiplication) than a Compute Shader program? C++ AMP has been built upon Direct Compute, so the answer is: not likely.

Should we, alternatively, draw the conclusion that a direct compute program categorically has better performance?

N-Body Simulation

The first pair of programs compared, consisted of:

Performance is expressed in gFLOPS. The code for the gFLOPS was copied from the C++ AMP version to the Compute Shader version. I also changed the Compute Shader version to make it write gFLOPS and the number of particles to the screen.

First, I tweaked the particle count parameter to get the best gFLOP count from either program; they both peak at 16,128 particles on my PC. Then the following results (gFLOPS) were obtained for release builds, running without debugging (this was also the configuration in the comparisons below).

C++ AMP Compute Shader More (%) Less(%)
Number of particles 16,128 16,128
Frames per second 43.46 57.38 32.03 24.26
gFLOPS 226.07 298.51 32.04 24.27

A note on the More and Less columns: The Compute Shader version delivers 32.03% more frames per second, and the C++ AMP version 24.26% less. So crudely: the Compute Shader version is about 30% faster.

Vector Addition

The second pair of programs compared consisted of:

The C++ AMP code was adapted as follows:

  • It was made to work with the same structs as the BasicCompute11 sample. This struct consists of an int and a float.
  • The arrays were made global variables.
  • A loop was added to fill the input arrays.
  • The verification code from the BasicCompute11 sample was added.

For timing, timing code was added to both programs. This timing code is from this post in the Parallel Programming in Native Code blog.

For timing measurements the code was adapted as follows: In the Compute Shader program timing covers code from the Dispatch call to the Map call. In the AMP program timing covers the lambda expression, and an added array_view::Synchronize() call on the “sum” array_view.

In experiments I first pushed the size until, in the case of the Compute Shader version, the output of the result-verifying code became “failure”, and in the case of the C++ AMP program, it either didn’t compile or produced a runtime error.

Then I measured time and gFLOPS. The experiments yielded the following result.

C++ AMP Compute Shader More (%) Less(%)
Number array elements 76*10^6 87*10^6 14.47 12.64
Total data size (Kb) 1,781,250 2,039,062.5
Time (ms) 6,868 8,182
gFLOPS 0.022 0.021

gFLOPS were measured as: 2*n / (10^6 * ms), where n is the number of elements in an array.

It seems to me that the time results are too similar to call them different. The Compute Shader version has a slight space advantage.

Note that since the total data size in both cases is larger than the RAM the graphics card has on board, there is some automatic sectioning going on.

Matrix Multiplication

Both programs in this comparison come from the C++ AMP for the DirectCompute Programmer guide. This guide can be obtained from a post on the official MSDN Parallel Programming in Native Code blog. The C++ AMP program is a transformation of the Compute Shader program.

The code for the starting point of the transformation is not entirely complete, so I added standard code from the BasicCompute11 Sample that loads and compiles the compute shader.

The following results were obtained.

C++ AMP Compute Shader More (%) Less(%)
Number array elements 4,608 7,616 65.28 39.50
Total data size (Kb) 248,832 679,728 173.17 63.39
Av Processing time (ms, 10 runs) 11,742 12,804
gFLOPS 8.3 34.5 315.66 75.94

Notes:

  • Both programs measure the time spent in the “mm” function, using the timing code referred to above. This includes uploading and offloading the data onto and from the GPU.
  • For both programs we have that any higher multiple of 64 in the number of array elements crashes the display driver.
  • gFLOPS are measured as: n^3 / (10^6 * ms), where:
      • n is the size of a matrix dimension (the matrices are square).
      • ms is the averaged (over 10 iterations) measured processing time in milliseconds.

Conclusions

Three program pairs have been compared, informally and semi-systematically, for their performance in time and space.

In the case of the N-Body simulation, the data load was selected that is optimal for time performance. That resulted in about 30% better time performance for the Compute Shader program.

In the case of vector addition – about the simplest program imaginable in this context – the time performance was measured for maximum data load. This resulted in practically equal time performance for both programs. The Compute Shader version can load some more data.

Finally, the programs from the AMP guide for Compute Shader programmers were implemented, and the time performance was again measured for maximum data load. This resulted in a time performance of the Compute Shader that is three times as good as the time performance of the AMP program.

So, in conclusion, it seems that if you want to get the max from your GPU, a Compute Shader is still the way to go.

Pixel Shader Based Expansion

Best animation performance in Silverlight 4 is obtained from the combination of procedural animation and a pixel shader, as reported in a previous blog. A pixel shader is not really meant to be used for spatial manipulation, but in Silverlight 4 vertex and geometry shaders are not available. Also, pixel shaders are limited to Model 2 shaders in Silverlight 4, and only limited data exchange is possible. The question then is: “how far can we push pixel shaders model 2?” Another previous blog post discussed a preliminary version of dispersion. This post is about Expansion. The effect is much like Dispersion, but the implementation is quite different – better, if I may say so. Expansion means that a surface, in this case a playing video, is divided up into blocks. These blocks then move toward, and seemingly beyond, the edges of the surface.

The Pixel Shader

The pixel shader was created using Shazzam. The first step is to reduce the image within its drawing surface, thus creating room for expansion. Expansion is in fact just translation of all the blocks in distinct directions that depend on the location of the block relative to the center of the reduced image.

In the pixel shader below, parameter S is for scaling, reducing, the pixel surface. Parameter N defines the number of blocks along an axis. So if N = 20, expansion concerns 400 blocks. In the demo App I’ve set the maximum for N to 32, which results in 1024 blocks tops. Parameter D defines the distance over which the reduced image is expanded. If the maximum value for D is the number of blocks, all the blocks (if N is even), or all but the center block (N is odd) just ‘move off’ the surface.

The code has been commented extensively, so it should be self explanatory. The big picture is that the reduced, centered and then translated location of the blocks is calculated. Then we test whether a texel is in a block, not in an inter-block gap. If the test is positive, we sample the unreduced, uncentered and untranslated image for a value to assign to the texel.

sampler2D input : register(s0);

// new HLSL shader

/// <summary>Reduces (S: Scales) the image</summary>
/// <minValue>0.01</minValue>
/// <maxValue>1.0</maxValue>
/// <defaultValue>0.3</defaultValue>
float S : register(C0);

/// <summary>Number of blocks along the X or Y axis</summary>
/// <minValue>1</minValue>
/// <maxValue>10</maxValue>
/// <defaultValue>5</defaultValue>
float N : register(C2);

/// <summary>Displaces (d) the coordinate of the reduced image along the X or Y axis</summary>
/// <minValue>0.0</minValue>
/// <maxValue>10.0</maxValue> // Max should be N
/// <defaultValue>0.2</defaultValue>
float D : register(C1);

float4 main(float2 uv : TEXCOORD) : COLOR
{
	/* Helpers */
	float4 Background = {0.937f, 0.937f, 0.937f, 1.0f};
	// Length of the side of a block
	float L = S / N;
	// Total displacement within a Period, an inter block gap
	float d = D / N;
	// Period
	float P = L + d;
	// Offset. d is subtracted because a Period also holds a d
	float o = (1.0f - S - D - d) / 2.0f;
	// Minimum coord value
	float Min = d + o;
	// Maximum coord value
	float Max = S + D + o;

	// First filter out the texels that will not sample anyway
	if (uv.x >= Min && uv.x <= Max && uv.y >= Min && uv.y <= Max)
	{
		// i is the index of the block a texel belongs to
		float2 i = floor( float2( (Max - uv.x ) / P , (Max - uv.y ) / P  ));

		// iM is a kind of macro, reduces calculations.
		float2 iM = Max - i * P;
		// if a texel is in a centered block,
		// as opposed to a centered gap
		if (uv.x >= (iM.x - L) && uv.x <= iM.x &&
		    uv.y >= (iM.y - L) && uv.y <= iM.y)
		{
			// sample the source, but first undo the translation
			return tex2D(input, (uv - o - d * (N -i)) / S );
		}
		else
			return Background;
	}
	else
		return Background;
}

Client Application

The above shader is used in an App that runs a video fragment, and can be explored at my App Shop. The application has controls for image size, number of blocks, and distance. The video used in the application is a fragment of “Big Buck Bunny”, an open source animated video, made using only open source software tools.

Animation

Each of the above controls can be animated independently. Animation is implemented using Storyboards for the slider control values. Hence you‘ll see them slide during animation. The App is configured to take advantage of GPU acceleration, where possible. It really needs that for smooth animation. Also the maximum frame rate has been raised to 1000.

Performance Statistics

The animations run at about 225 FPS on my pc. This requires significant effort from the CPU – about 50% of the processor time. The required memory approaches 2.3Mb.

Pixel Shader Based Translation

Best animation performance in Silverlight 4 is obtained from the combination of procedural animation and a pixel shader, as reported in a previous blog. I know, a pixel shader is not really meant to be used for spatial manipulation. However, in Silverlight 4 vertex and geometry shaders are not available. Also, pixel shaders are limited to Model 2 shaders, and only limited data exchange is possible. The question then is: “how far can we push pixel shaders model 2?” Another previous blog post discussed a preliminary version of dispersion, which is fairly complicated. This post is about translation. Translation means here that the coordinates in the 2-dimensional plane of a reduced image, in this case a video, are changed.

The Pixel Shader

The pixel shader was created using Shazzam. The first step is to reduce the image within its drawing surface, thus creating the room required for translation. Translation is in fact just changing the coordinates of the reduced images while correcting for the change in coordinates when sampling the source texture.

In the pixel shader below, the if statement does the filtering. Left and Right are parameterized offset and cutoff values. The offset ‘Left’ is also used to correct for the displacement in sampling. Sampling is done by the ‘tex2D’ function.

sampler2D input : register(s0);

// new HLSL shader

/// <summary>Reduces the image</summary>
/// <minValue>0.01</minValue>
/// <maxValue>1.0</maxValue>
/// <defaultValue>0.33</defaultValue>
float Scale : register(C0);

/// <summary>Changes the coordinate of the reduced image along the X axis</summary>
/// <minValue>0.0</minValue>
/// <maxValue>1.0</maxValue>
/// <defaultValue>0.5</defaultValue>
float OriginX : register(C1);

/// <summary>Changes the coordinate of the reduced image along the Y axis</summary>
/// <minValue>0.0</minValue>
/// <maxValue>1.0</maxValue>
/// <defaultValue>0.5</defaultValue>
float OriginY : register(C2);

float4 main(float2 uv : TEXCOORD) : COLOR
{
	float4 Background = {0.937f, 0.937f, 0.937f, 1.0f};
	float2 Origin = {OriginX, OriginY};
	// Reduce nr. of computations
	float halfScale = 0.5 * Scale;
	float2 Left = Origin - halfScale;
	float2 Right = Origin + halfScale;

	// Filter out texels that do not sample (the correct location)
	if (uv.x >= Left.x && uv.x <= Right.x && uv.y >= Left.y && uv.y <= Right.y)
		return tex2D(input, (uv - Left) / Scale);
	else
		return Background;
}

Client Application

The above shader is used in an App that runs a video fragment, and can be explored at my App Shop. The application has controls for video surface reduction, translation along the X-axis, and translation along the Y-axis. The video used in the application is a fragment of “Big Buck Bunny”, an open source animated video, made using only open source software tools.

Animation

Each of the above controls can be animated independently. Animation is implemented using Storyboards for the slider control values. Hence you’ll see them slide during animation. I didn’t use procedural animation since, as discussed in the aforementioned previous blog post, it is hindered by the implementation of the MediaElement.

The App is configured to take advantage of GPU acceleration, where possible. It really needs that for smooth animation. Also the maximum frame rate has been raised to 1000.

Performance Statistics

The animations run at about 250 FPS on my pc. This requires significant effort from both the GPU and the CPU. The required memory reaches a little over 3.5Mb.

Spatial Transformation by Pixel Shader

As reported in a previous blog, best animation performance in Silverlight 4 is obtained from the combination of procedural animation and a pixel shader. A pixel shader is meant to be used for the manipulation of colors and transparency, resulting also in lighting and structural effects that can be expressed pixel-wise.

A pixel shader is not really meant to be used for spatial manipulation – spatial information is hard to come by, but see e.g. this discussion on the ShaderEffect.DdxUvDdyUvRegisterIndex property. Vertex shaders and geometry shaders exist to scale, translate, rotate and filter vertices and topological primitives respectively. Another limitation of the use of shaders in Silverlight 4 is that only model 2 pixel shaders can be used, so that only 64 arithmetic slots are available. Finally, you can apply only one pixel shader to a UIElement.

In order to explore the limitations on spatial manipulation and the number of arithmetic slots, I decided to build a Silverlight 4 application that does some spatial manipulation of a playing video by means of procedural animation and a pixel shader. Well, this is also my first encounter with pixel shaders, so I gathered that a real challenge would show me many sides of pixel shaders (and indeed, no disappointments here).

The manipulation consists of two steps. First the video surface is reduced in size in order to create some space for the second step. In the second step, the video is divided up into a (large) number of rectangles which drift apart – disperse – while reducing in size. The second step is animated.

To be frank, I’m not very excited about the results so far. It definitely needs improvement, but I need some time and a well defined starting point for further elaboration.

The completed application

You can explore the completed application, called ‘Dispersion Demo’ in my App Shop. The application is controlled by 4 parameters:

  1. Initial Reduction. This parameter reduces the video surface. The higher the value, the greater the reduction.
  2. Dispersion. The Dispersion parameter controls the amount of dispersion of the blocks. If you set both Initial Reduction and Dispersion to 1, you can see the untouched video.
  3. Resolution. The Resolution is a measure for the number of blocks along both the X-axis, and the Y-axis.
  4. Duration. Controls the duration of the dispersion animation.

The application also has three buttons to Start, Stop, and Pause the animation of the Dispersion. The idea is that you first select an initial Reduction and a Resolution and then click Start to animate the Dispersion.

The video used in the application is a fragment of “Big Buck Bunny”, an open source animated video, made using only open source software tools (and an astonishing amount of talent).

Performance statistics

The use of some procedural animation and the pixel shader results on my pc in a frame rate of about 198 FPS, and a footprint of about 2Mb video memory. This is good (though ~400FPS would be exciting), and leaves plenty of time for other manipulations of the video image.

The pixel shader

The pixel shader was created using Shazzam, an absolutely fabulous tool, and perfect for the job. The pixel shader shown below centers the image or video snapshot and reduces it. Then it enlarges it again according to the Dispersion factor. The farther away from the center, the larger the gaps are between the blocks displaying video, and the smaller the blocks themselves are.

sampler2D input : register(s0);

// new HLSL shader

/// <summary>Reduces image to starting point.</summary>
/// <minValue>1.0</minValue>
/// <maxValue>20.0</maxValue>
/// <defaultValue>10.0</defaultValue>
float Scale : register(C0);

/// <summary>Expands / enlarges the image.</summary>
/// <minValue>1</minValue>
/// <maxValue>20</maxValue>
/// <defaultValue>6</defaultValue>
float Dispersion : register(C1); //Dispersion should never get bigger than Scale!!!

/// <summary>Measure for number of blocks</summary>
/// <minValue>100</minValue>
/// <maxValue>800</maxValue>
/// <defaultValue>200</defaultValue>
float Resolution : register(C2);

float4 main(float2 uv : TEXCOORD) : COLOR
{
       float ratio = Dispersion / Scale;
       float offset = (1 - ratio) / 2;
       float cutoff = 1 - offset;
       float center = 0.5;
       float4 background = {0.937f, 0.937f, 0.937f, 1.0f}; // Same as demo program's Grid

       float2 dispersed = (uv - offset)  / ratio;
       float2 limit = 4 * (abs(uv - center) * (Dispersion - 1) / (Scale - 1)) - 1;

       if(uv.x >= offset && uv.x <= cutoff && uv.y >= offset && uv.y <= cutoff)
       {
             if (sin(dispersed.x * Resolution) > limit.x &&
                 sin(dispersed.y * Resolution) > limit.y)
             {
                    return tex2D(input, dispersed);
             }
             else return background;
       }
       else return background;
}

In the code above, the outer “if” does the scaling and centering. The sine function and Limit variable take care of the gaps and block size. See the graph below, the Limit rises to cut off increasingly many values of the sine, thus creating smaller blocks and larger gaps.

Evaluation of limitations

Although this is a solution for the challenge set, it is not the solution. This solution uses scaling for dispersion, but what you really would like to have is a translation per block of pixels that disperse. Needing this reveals another limitation: in Silverlight 4 you cannot provide your pixel shader with an array of structs that define your blocks layout, hence it is very hard to identify the block a certain pixel belongs to, and thus what its translation vector is.

At this point I had the idea to code the block layout into a texture which is made available to the pixel shader (you can have up to three extra textures (PNG images) in your pixel shader). To create such a lookup map you have to create a writeable bitmap with the values you want, then convert this to a PNG image as Andy Beaulieu shows, using Joe Stegman’s PNG encoder. The hard part will be the encoding (and decoding in the shader). You have at your disposal 4 unsigned bytes, and you will have to encode 2D translation vectors that also consist of negative numbers. The encoding / decoding will involve a signed – unsigned conversion, and encoding shorts (16-bit integers) as 2 bytes. This will, of course, be a feasible but tedious chore.
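Just to illustrate the packing itself (a sketch of mine, in C++ for brevity; the Silverlight app would do the equivalent in C# when filling the WriteableBitmap): two signed 16-bit components fit exactly in the 4 bytes of an RGBA texel.

#include <cstdint>

// Pack a 2D translation vector (two signed 16-bit components) into one RGBA texel.
// The signed/unsigned conversion is a simple offset by 32768.
void PackTranslation(int16_t tx, int16_t ty, uint8_t rgba[4])
{
	const uint16_t ux = static_cast<uint16_t>(tx + 32768);
	const uint16_t uy = static_cast<uint16_t>(ty + 32768);
	rgba[0] = static_cast<uint8_t>(ux >> 8);   // R: high byte of x
	rgba[1] = static_cast<uint8_t>(ux & 0xFF); // G: low byte of x
	rgba[2] = static_cast<uint8_t>(uy >> 8);   // B: high byte of y
	rgba[3] = static_cast<uint8_t>(uy & 0xFF); // A: low byte of y
}

// The shader would reverse this: v = (high * 256 + low) - 32768, after scaling the
// sampled 0..1 channel values back to 0..255.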

A better move would be to invest some effort in getting acquainted with Silverlight 5 Beta’s integration with XNA, by which also vertex shaders come available.

As far as the limit of 64 arithmetic slots is concerned, it seems that one is bound to hit it. But then, it always seems possible to rethink and simplify your code, thus reducing its size. This limitation turned out not to be very prohibitive.

Procedural animation

The above pixel shader is integrated in the Silverlight application using the C# code generated by Shazzam. Then when implementing procedural animation, another limitation came up concerning manipulating videos.

The initial plan was to display the video using a collection of WriteableBitmaps that could be manipulated using the high performance extensions from the WriteableBitmapEx library. However, it turns out that to take a snapshot from a MediaElement, you have to create a new WriteableBitmap per snapshot. This is expensive since we now have to create at least about 30 WriteableBitmaps per second. Indeed, running the Dispersion application (at 198 FPS) sets the CPU load to about 50% at my pc. This is not really what we are looking for.

A way out is to implement the abstract MediaStreamSource class, both for the vision part as well as for the sound part of the video to be exposed. By the looks of it, this seems to be quite a challenge. However, Pete Brown has provided examples for both video and sound. And there is also the CodePlex project that provides the ManagedMediaHelpers. So, we add this challenge to the list.

Another experiment might be to see if the XNA sound and video facilities can be accessed when exploiting the Silverlight – XNA integration in Silverlight 5.

Conclusions

The challenge set turned out to be very instructive. I’ve learned a lot about writing pixel shaders, though there will be much more to learn. No doubt about that.

It is absolutely true that spatial manipulation is hard in pixel shaders :-), the main cause being the extraordinary effort required to provide the shader with sufficient data, or sufficiently elaborate data. The current shader needs improvement, the next challenge in writing shaders will address Silverlight 5 (beta) and XNA.

I noticed that you can achieve a lot in pixel shaders using trigonometry, and (no doubt) the math of signals.

Finally, in order to really do procedural animation on surfaces playing parts of a video, you seem to need to either implement MediaStreamSource or use the XNA media facilities.