Posts Tagged ‘DirectX’

Data Compression for the Kinect

Transmitting uncompressed Kinect depth and color data requires a network bandwidth of about 460Mbit/s. Using the RleCodec or the LZ4 library we achieve tremendous compression – a compression ratio of 10 or 22 respectively, at lightning speed – over 1600Mbytes/s. We achieve this not so much by the compression algorithms, but by removing undesirable effects (jitter, by the DiscreteMedianFilter) and redundancy (already sent data, by taking the Delta).

Introduction

From the start, one goal of the Kinect Client Server (KCS) project was to provide a version of the KCS viewer, called 3D-TV, from the Windows Store. Because of certification requirement 3.1 (V5.0)

“Windows Store apps must not communicate with local desktop applications or services via local mechanisms,..”

3D-TV has to connect to a KinectColorDepth server application on another PC. In practice, the network bandwidth that is required to transfer uncompressed Kinect depth and color data over Ethernet LAN using TCP is about 460Mbit/s, see e.g. the blog post on the jitter filter. This is a lot, and we would like to reduce it using data compression.

This is the final post in a series of three on the Kinect Client Server system, an Open Source project at CodePlex, where the source code of the software discussed here can be obtained.

What Do We Need?

Let’s first clarify the units of measurements we use.

Mega and Giga

There is some confusion regarding the terms Mega and Giga, as in Megabit.

  • For files 1 Megabit = 2^20 bits, and 1 Gigabit = 2^30 bits.
  • For network bandwidth, 1 megabit = 10^6 bits, and 1 gigabit = 10^9 bits.

Here we will use the units for network bandwidth. The Kinect produces a color frame of 4 bytes and a depth frame of 2 bytes at 30 frames per second. This amounts to:

30 FPS x 640×48 resolution x (2 depth bytes + 4 color bytes) x 8 bits = 442.4 Mbit/s = 55.3 Mbyte/s.

We have seen before that the actual network bandwidth is about 460Mbit/s, so network transport creates about 18Mbit/s overhead, or about 4%.

Since we are dealing with data here that is streamed at 30 FPS, time performance is more important than compression rate. It turns out that a compression rate of at least 2 (compressed file size is at most 50% of uncompressed size), in about 5ms satisfies all requirements.

Data Compression: Background, Existing Solutions

Theory

If you are looking for an introduction to data compression, you might want to take a look at Rui del-Negro’s excellent 3 part introduction to data compression. In short: there are lossless compression techniques and lossy compression techniques. The lossy ones achieve better compression, but at the expense of some loss of the original data. This loss can be a nuisance, or irrelevant, e.g. because it defines information that cannot be detected by our senses. Both types of compression are applied, often in combination, to images, video and sound.

The simplest compression technique is Run Length Encoding, a lossless compression technique. It simply replaces a sequence of identical tokens by one occurrence of the token and the count of occurrences. A very popular somewhat more complex family of compression techniques is the LZ (Lempel-Ziv) family (e.g. LZ, LZ77, LZ78, LZW) which is a dictionary based, lossless compression. For video, the MPEG family of codecs is a well known solution.

Existing Solutions

There are many, many data compression libraries, see e.g. Stephan Busch’s Squeeze Chart for an overview and bench marks. Before I decided to roll my own implementation of a compression algorithm, I checked out two other solutions: The Windows Media Codecs In Window Media Foundation. But consider the following fragment from the Windows Media Codecs documentation:

It seems as if the codecs are way too slow: max 8Mbit/s where we need 442Mbit/s. The WMF codecs obviously serve a different purpose.

The compression used for higher compression levels seems primarily of the lossy type. Since we have only 640×480 pixels I’m not sure whether it is a good idea to go ‘lossy’. It also seems that not all versions of Windows 8 support the WMF.

LZ4, an open source compression library. This is a very fast compression library, see image below, from the website:

So, LZ4 can compress a file to 50% at over 400MByte/s. That is absolutely great! I’ve downloaded the source and tried it on Kinect color and depth data. Results are shown and compared in the Performance section.

The RleCodec

I decided to write my own data compression codec, and chose the Run Length Encoding algorithm as a starting point. Why?

Well, I expected a custom algorithm, tailored to the situation at hand would outperform the general purpose LZ4 library. And the assumption turned out to be correct. A prototype implementation of the RleCodec supported by both the DiscreteMedianFilter and creating a Delta before compressing data really outperformed the LZ4 reference implementation, as can be read from the performance data in the Performance section.

It only dawned on me much later that removing undesired effects (like jitter, by the DiscreteMedianFilter) and redundant information (already sent data, by taking the Delta) before compressing and transmitting data is not an improvement of just the RLE algorithm, but should be applied before any compression and transmission takes place. So, I adjusted my approach and in the performance comparison below, we compare the core RLE and LZ4 algorithms, and see that LZ4 is indeed the better algorithm.

However, I expect that in time the data compression codec will be implemented as a GPU program, because there might be other image pre-processing steps that will also have to be executed at the GPU on the server machine; to copy the data to the GPU just for compression requires too much overhead. It seems to me that for GPU implementations the RLE algorithm will turn out the better choice. The LZ4 algorithm is a dictionary based algorithm, and creating and consulting a dictionary requires intensive interaction with global memory on the GPU, which is relatively expensive. An algorithm that can do its computations purely locally has an advantage.

Design

Lossless or Lossy

Shall we call the RleCodec a lossy or lossless compression codec? Of course, RLE is lossless, but when compressing the data, the KinectColorDepth server also applies the DiscreteMedianFilter and takes the Delta with the previous frame. Both reduce the information contained in the data. Since these steps are responsible for enormous reduction of compressed size, I am inclined to consider the resulting library a lossy compression library, noting that it only looses information we would like to lose, i.e. the jitter and data we already sent over the wire.

Implementation

Algorithm

In compressing, transmitting, and decompressing data the KinectColorDepth server application takes the following steps:

  1. Apply the DiscreteMedianFilter.
  2. Take the Delta of the current input with the previous input.
  3. Compress the data.
  4. Transmit the data over Ethernet using TCP.
  5. Decompress the data at the client side.
  6. Update the previous frame with the Delta.

Since the first frame has no predecessor, it is a Delta itself and send over the network as a whole.

Code

The RleCodec was implemented in C++ as a template class. Like with the DiscreteMedianFilter, traits classes have been defined to inject the properties that are specific to color and depth data at compile time.

The interface consists of:

  • A declaration that take the value type as the template argument.
  • A constructor that takes the number of elements (not the number of bytes) as an argument.
  • The size method that returns the byte size of the compressed data.
  • The data method that returns a shared_ptr to the compressed data.
  • The encode method that takes a vector of the data to compress, and stores the result in a private array.
  • The decode method that takes a vector, of sufficient size, to write the decompressed data into.

Like the DiscreteMedianFilter, the RleCodec uses channels and an offset to control the level of parallelism and to skip channels that do not contain information (specifically, the A (alpha or opacity) channel of the color data). Parallelism is implemented using concurrency::parallel_for from the PPL library.

Meta Programming

The RleCodec contains some meta programming in the form of template classes that roll out loops over the channels and offset during compilation. The idea is that removing logic that loops over channels and checks if a specific channel has to be processed or skipped will provide a performance gain. However, it turned out that this gain is only marginal (and a really thin margin 🙂 ). It seems as if the compiler obviates meta programming, except perhaps for very complicated cases.

Performance

How does our custom RLE codec perform in test environment and in the practice of transmitting Kinect data over a network? How does its performance compare to that of LZ4?. Let’s find out.

In Vitro Performance

In vitro performance tests evaluate in controlled and comparable circumstances the speed at which the algorithms compress and decompress data.

Timing

In order to get comparable timings for the two codecs, we measure the performance within the context of a small test program I wrote, the same program is used for both codecs. This makes the results comparable and sheds light on the usefulness of the codec in the context of transmitting Kinect data over a network.

Algorithm

In comparing the RleCodec and LZ4, both algorithms take advantage of working the delta of the input with the previous input. We use 3 subsequent depth frames and 3 subsequent color frames for a depth test and a color test. In a test the frames are processed following the sequence below:

  1. Compute delta of current frame from previous frame.
  2. Compress the delta.
  3. Measure the time it takes to compress the delta
  4. Decompress the delta
  5. Measure the time it takes to decompress the delta
  6. Update previous frame with the delta
  7. Check correctness, i.e. the equality of the current input with the updated previous input.

We run the sequence 1000 times and average the results. Averaging is useful for processing times, the compression factor will be the same in each run. The frames we used are not filtered (DiscreteMedianFilter).

Let’s first establish that the performance of these libraries are of very different order than the performance of the WMF codecs. Let’s also establish that compression speed and decompression speed is much more than sufficient: as noted above 50Mbyte/s would do.

For depth data, we see that the RleCodec delivers fastest compression. LZ4 delivers faster decompression, and obtains stronger compression.

The RleCodec was tested twice with the color data. In the first test we interpreted the color data as of type unsigned char. We used 3 channels with offset 4 to compress it. In the second test we interpreted the color data as unsigned int. We used 4 channels, with offset 4. We see that now the RleCodec runs about 4 times as fast as with char data. The compression strength is the same, however. So, for color data, the same picture arises as with depth data: times are of the same order, but LZ4 creates stronger compression.

The difference in naked compression ratios has limited relevance, however. We will see in the section on In Vivo testing that the effects of working with a Delta, an in particular of the DiscreteMedianFilter dwarfs these differences.

We note that the first depth frame yields lesser results both for time performance and compression ratio. The lesser (de)compression speed is due to the initialization of the PPL concurrency library. The lesser compression ratio is illustrative of the effect of processing a Delta: the first frame has no predecessor, hence there is no Delta and the full frame is compressed. The second frame does have a Delta, and the compression ratio improves by a factor of 2.4 – 2.5.

In Vivo Performance

In Vivo tests measure, in fact, the effect of the DiscreteMedianFilter on the data compression. In Vivo performance testing measures the required network bandwidth in various configurations:

  1. No compression, no filter (DiscreteMedianFilter).
  2. With compression, no filter.
  3. With compression, with filter, with noise.
  4. With compression, with filter, without noise.

We measure the use of the RleCodec and the LZ4 libraries. Measurements are made for a static scene. Measuring means here to read values from the Task Manager’s Performance tab (update speed = low).

Using a static scene is looking for rock bottom results. Since activity in the scene- people moving around can be expressed as a noise level, resulting compression will always be somewhere between the noiseless level and the “no compression, no filter” level. Increase will be about linear, given the definition of noise as a percentage of changed pixels.

The measurements in the table below are in Megabits per second, whereas the table above shows measurements in Megabytes per second. So, in order to compare the numbers, if so required, the entries in the table below have to be divided by 8. Note that 460Mbit/s is 57.5Mbyte/s.

What we see is that:

  • Compression of the delta reduces network bandwidth width 13* (RLE), or 33% (LZ4).
  • Application of the filter reduces it further with 53% (RLE), or 39% (LZ4).
  • Cancelling noise reduces it further with 24% (RLE), or 23% (LZ4).
  • We end up with a compression factor of about 10 (RLE), or 22 (LZ4).

:-).

What do we transmit at the no noise level? Just the jitter beyond the breadth of the DiscreteMedianFilter, a lot of zeroes, and some networking overhead.

As noted above, the differences between the core RleCodec and LZ4 are insignificant compared to the effects of the DiscreteMedianFilter and taking the Delta.

Conclusions

Using the RleCodec or the LZ4 library we achieve tremendous compression, a compression ratio of 10 or 22 respectively , at lightning speed – over 1600Mbytes/s. We achieve this not so much by the compression algorithms, but by removing undesirable effects (jitter, by the DiscreteMedianFilter) and redundancy (already sent data, by taking the Delta).

ToDo

There is always more to wish for, so what more do we want?

Navigation needs to be improved. At the moment it is somewhat jerky because of the reduction in depth information. If we periodically, say once a second, send a full (delta yes, filter no) frame, in order to have all information at the client’s end, we might remedy this effect.

A GPU implementation. Compute shader or C++ AMP based. But only in combination with other processing steps that really need the GPU.

Improve on RLE. RLE compresses only sequences of the same token. What would it take to store each literal only once, and insert a reference to the first occurrence at reencountering it? Or would that be reinventing LZ?

Advertisement

A Jitter Filter for the Kinect

This blog post introduces a filter for the jitter caused by the Kinect depth sensor. The filter works essentially by applying a dynamic threshold. Experience shows that a threshold works much better than averaging, which has the disadvantage of negatively influencing motion detection, and has only moderate results. The presented DiscreteMedianFilter removes the jitter. A problem that remains to be solved is the manifestation of depth shadows. Performance of the filter is fine. Performance is great in the absence of depth shadow countermeasures.

Introduction

Kinect depth images show considerable jitter, see e.g. the depth samples from the SDK. Jitter degrades image quality. But it also makes compression(Run Length Encoding) harder; compression for the Kinect Server System will be discussed in a separate blog post. For these reasons we want to reduce the jitter, if not eliminate it.

Kinect Depth Data

What are the characteristics of Kinect depth data?

Literature on Statistical Analysis of the Depth Sensor

Internet search delivers a number of papers reporting on thorough analysis of the depth sensor. In particular:

[1] A very extensive and accessible technical report by M.R. Andersen, T. Jensen, P. Lisouski, A.K. Mortensen, M.K. Hansen, T. Gregersen and P. Ahrendt: Kinect Depth Sensor Evaluation for Computer Vision Applications.

[2] An also well readable article by Kourosh Khoshelham and Sander Oude Elberink.

[3] A more technically oriented article by Jae-Han Park, Yong-Deuk Shin, Ji-Hun Bae and Moon-Hong Baeg.

[4] Of course, there is always the Wikipedia

These articles discuss the Kinect 360. I’ve not found any evidence that these results do not carry over to the Kinect for Windows, within the range ([0.8m – 4m]) of the Default mode.

Depth Data

We are interested in the depth properties of the 640×480 spatial image that the Kinect produces at 30 FPS in the Default range. From the he SDK documentation we know that the Kinect provides depth measurements in millimeters. A dept value measures the distance between a coordinate in the spatial image and the corresponding coordinate in the parallel plane at the depth sensor, see image below from the Kinect SDK Documentation.

Some characteristics:

1. Spatial resolution: at 0.8m the 640×480 (width x height) depth coordinates cover an approximately 87cmx63cm plane. The resolution is inversely proportional with the squared distance from the depth sensor. The sensor has an angular field of view of 57° horizontally and 43° vertically.

2. Depth resolution: the real world distance between 2 subsequent depth values the Kinect can deliver is about 2mm at 1m from the Kinect, about 2.5cm at 3m, and about 7cm at 5m. Resolution decreases quadratically as a function of the distance.

Jitter

The Kinect depth measurements are characterized by some uncertainty that is expressible as a random error. One can distinguish between errors in the x,y-plane on the one hand, and on the z-axis (depth values) on the other hand. It is the latter that is referenced to as the depth jitter. The random error in the x,y-plane is much smaller than the depth jitter. I suspect it manifests itself as the color jitter in the KinectColorDepthServer through the mapping of color onto depth, but that still has to be sorted out. Nevertheless, the filter described here is also applied to the color data, after mapping onto depth.

The depth jitter has the following characteristics:

1. The error in depth measurements: The jitter is about a few millimeters at 0.5m, up to 4cm at 5m, increasing quadratically over the distance from the camera.

2. The jitter is a walk over a small number of nearby discrete values. We have already seen this in a previous post in The Byte Kitchen Blog, where we found a variance over mainly 3 different values, and incidentally 4 different values. Of course, for very long measuring times, we may expect an increased variance.

3. The jitter is much larger at the boundaries of depth shadows. Visually, this is a pretty disturbing effect, but the explanation is simple. The Kinect emits an infra red beam for depth measurements which, of course, creates a shadow. The jitter on the edges of a depth shadow jumps from an object to its shadow on the background which is usually much further away. We cannot remove this jitter without removing the difference between an object and its background, so for now, I’ve left it as is.

The miniature below is a link to a graph in [2] (page 1450, or 14 of 18) of the depth resolution (blue) and size of the theoretical depth measurement error (red).

 

A Kinect Produces a Limited Set of Discrete Depth Values

It is not the goal of the current project to correct the Kinect depth data, we just want to send it over an Ethernet network. What helps a lot is, and you could see this one coming:

The Kinect produces a limited set of depth values.

The Kinect for Windows produces 345 different depth values in the Default range, not counting the special values for unknown, and out of range measurements. The depth values for my Kinect for Windows are (divide by 8 to get the depth distance in mm):

This is the number of values for the Kinect for Windows. I also checked the number of unique values for my Kinect 360, and it has a larger but also stable set of unique depth values. The number of depth values is 781, due to the larger range.

The fact that a Kinect produces a limited and small set of depth values makes live a lot easier: we can use a lookup table were we would otherwise have to use function approximation. The availability of a lookup table is also good news for the time performance of the filter.

A question might be: can you use the depth table of an arbitrary Kinect to work with any other Kinect? I assume that each Kinect has a slightly different table, and this assumption is based on the fact that my Kinect for Windows has slightly different values than my Kinect 360, for the same sub range. However, if you use an upper bound search (this filter uses std::upper_bound), you will find the first value equal to or larger than a value from an arbitrary Kinect, which will usually be a working approximation (better than having the jitter). Of course, an adaptive table would be better, and it is on the ToDo list.

Design

I’ve experimented with several approaches: sliding window of temporal averages, Bilateral Filter. But these were unsatisfactory:

– Reduction of Jitter is much less good compared to applying a threshold.

– Movement detection is as much reduced as the jitter, which is an undesirable effect.

A simple threshold, of about the size of the breadth of the error function proved the best solution. As noted above, the jitter typically is limited to a few values above and below the ‘real’ value. We could call the ‘real’ value the median of the jitter range, and describe this jitter range not in terms of the depth values themselves but of the enumeration of the sorted list of discrete depth values (see table). We get a Discrete Median Filter if we map all discrete values within the range onto the median of that range (minding the asymmetry of the sub ranges at the boundaries of our sorted list).

The DiscreteMedianFilter Removes Jitter

In practice we see no jitter anymore when the filter is applied: The DiscreteMedianFilter ends the jitter (period). However, the filter is not applicable to (edges of) depth shadows.

Noise

Actually, it turned out that this filter is in fact too good. If the Kinect registers a moving object, we get a moving depth shadow. The filter cannot deal with illegal depth values, so we are stuck with a depth shadow smear.

A modest level of noise solves this problem. In each frame 10% of the pixels the filter skips is selected at random, and updated. This works fine, but it should be regarded as a temporal solution: the real problem is, of course, the depth shadow, and that should be taken up.

Implementation

The Discrete Median Filter was implemented in C++, as a template class, with a traits template class (struct, actually); one specialization for the depth value type and one specialization for the color value type, to set the parameters that are typical for each data type, and a policy template which holds the variant of the algorithm that is typical for the depth and color data respectively. For evaluation purposes, I also implemented traits and policy classes for unsigned int.

So, the DiscreteMedianFilter class is a template that takes a value type argument. Its interface consists of a parameter free constructor and the Filter method that takes a pointer to an input array and a pointer to a state. The method changes the state where the input deviates from the state more than a specified radius.

Channels and Offset

Color data is made up of RGBA data channels: e.g. R is a channel. Working with channels is inspired on data compression. More on this subject in the blog post on data compression.

The advantages of working with channels for the DiscreteMedianFilter are:

1. We can skip the A channel since it is all zeroes

2. Per channel, changes of color due to sensor errors are usually expressible as small changes, which is not the case for the color as a whole, so we get better filtering results.

3. There are less values: 3 x 256 vs 2^32, so the probability of equal values is higher.

Note that the more values are found to be within the range of values that do not need change, the better time performance we get.

The individual channels are not processed in parallel (unlike in the compression library). We will show below that parallelism should not be required (but actually is, as long as we have noise).

Code

The code is complex at points, so it seems to me that printing the code here would raise more questions than it would answer. Interested people may download the code from The Byte Kitchen Open Sources at Codeplex. If you have a question about the code, please post a comment.

Performance

How much space and time do we need for filtering?

A small test program was built to run the filter on a number of generated arrays simulating subsequent depth and color frames. The program size never gets over 25.4 megabytes. The processing speed (without noise) is:

– About 0.77ms for an array of 1228800 bytes (640×480=307200 ARGB color values); 1530Mbyte/s

– About 0.36ms for an array of 640×480=307200 unsigned short depth values; 1615Mbyte/s.

So, this is all very fast, on a relatively small footprint: in a little more than 1ms we have filtered both the depth and the color data of a Kinect frame.

The simple test program and its synthetic data are not suitable for use with noise. So, we measured the time the KinectColorDepthServer needs for calls to the DiscreteMedianFilter. In this case the noise is set at 10% of the values that would otherwise be skipped. The times below are averaged over 25,000 calls:

– Depth: 6.7ms per call.

– Color: 19.3ms per call.

So, we may conclude that the noise is really a performance killer. Another reason to tackle the depth shadow issue. Nevertheless, we are still within the 33ms time window that is available for each frame at 30 FPS.

Does the DiscreteMedianFilter has any effect on the required network capacity? No, in both cases a capacity of 460Mbit/s is required for a completely static scene, compression is off. I do have the impression that the filter has a smoothing effect on instantaneous (as opposed to average) required network capacity.

To do

There is always more to wish for, so what more do we want?

– Resolve the need for the noise factor, i.e. do something about the depth shadows. This will increase performance greatly.

– Because of filtering, navigation through the scene is jerky. There are much less depth values in the image, so movement is not so smooth. This is to be helped by sending a full depth image every now and then. After all, the filter only replaces values that differ more than a threshold, the rest of the image is retained. Refreshing the overall picture now and then retains the richness of the depth image, helps to make noise superfluous, and smoothes navigation.

– Make the table of depth values adaptive. If a value is not present we replace the nearest value. Of course, we would then also like to save the new table to file, and load it at any subsequent program starts.

Kinect Client Server System V0.2

The Kinect Client Server System V0.2 adds the possibility to V0.1 to watch Kinect Color and Depth data over a network, and to navigate the rendered 3D scene.

To support data transfer over TCP, the Kinect Client Server System (KCS system) contains a custom build implementation of Run Length Encoding compression.

To both maximize compression and improve image quality the KCS system uses a jitter filter

Introduction

Version 0.1 of the KCS system allowed the display of Kinect data in a Windows Store app. This is a restricted scenario: for security reasons, a Windows Store app cannot make a network connection to the same PC it is running on, unless in software development scenarios. Version 0.2 overcomes this restriction by:

1. Support for viewing Kinect data from another PC.

2. Providing the 3D-TV viewer from the Windows Store (free of charge).

Of course, V0.2 is an open Source project, the code and binaries can be downloaded from The Byte Kitchen’s open Source project at CodePlex.

Usage

The easiest way to start using the KCS system v0.2 is to download 3D-TV from the Windows Store, navigate to the Help-About screen (via the ‘Settings’ popup), click on the link to the online manual and follow the stepwise instructions.

The general usage schema is depicted below.

The PC called Kinect server runs the KinectColorDepthServer application. In order for the client PC running 3D-TV to find the server PC on the network, the server PC must be connected to gigabit Ethernet (cabled computer network, say) with a network adapter carrying the IP address 192.168.0.20. The IP address of the Ethernet adaptor of the PC running 3D-TV must satisfy 192.168.0.xxx, where 0 <= x <=255. It is wise to expect that required data capacity is well over 100Megabit/second.

The online manual also provides a complete description of the keyboard keys to be used for navigating around a Kinect scene.

More information

In a few more blog posts, I will discuss more technical details of the new features in the KCS system V0.2, specifically the jitter filter that is to stabilize the Kinect imagery and help in data compression, and the data compression itself.

A GPU Bilateral Filter Implementation

This post reports on a bilateral filter implementation that improves processing time from 32ms to 0.25ms.

Introduction

The Kinect (for Windows) depth data are subject to some uncertainty that comes with its resolution. Depth estimates are defined in millimeters, and typically, subsequent depth measurements by the Kinect vary by a fixed amount.

Consider the graphs below. The x-axis counts the number of measurements, the y-axis represents distance measurements of a single point. The top graph shows connected dots, the lower graph shows

just the dots.

De graphs show two tendencies. One is that variance is one unit above, or one unit below the average practically all of the time, the second tendency is that the average changes a bit before it stabilizes. Here we see it change from about 3.76m via 3.8m to about 3.84m.

If the Kinect depth data is projected onto an image this variation translates into a nervous jitter. Since I do not particularly care for a nervous jitter, I would like to stabilize the depth data a bit.

Stabilizing Kinect Depth Data – Temporal Approach

The Kinect for Windows SDK (1.6) contains a whitepaper on skeletal joint smoothing. The paper deals with the reduction of noise in the Kinect skeletal tracking system. This tracking system employs the same depth data, and therefore suffers from the same problem.

The proposed solution is to filter the data over time. The depth measurement z(x,y)(t) of a location (x, y) at time t can be averaged over a number of measurements in the past at the same location: z(x,y)(t-i) where i is in [1, n]. The suggestion is to take n not too large, say 5.

Averaging can also be over measurements in the future. This implies that one or two frames are included in averaging before an image based on the depth image is rendered, hence there is a latency in rendering equal to the number of ‘future’ frames included in averaging. The advantage of considering the ‘future’ is that if the measured scene changes (or a player changes position – in skeletal tracking), another type of averaging can be applied, one that is better suited for changes and e.g. puts a heavier weight on recent measurements.

I’ve done an experiment with temporal filtering, but it was not satisfactory. The fast and nervous jitter just turns into a slower one that is even more disturbing because short periods of stability make changes seem more abrupt.

Stabilizing Kinect Depth Data – Spatial Approach

Another approach is not to average over measurements at the same location through time, but to average within one frame, over several proximate measurements. A standard solution for this kind of filtering is the Bilateral filter. The Bilateral Filter is generally attributed to Carlo Tomasi and Roberto Manduchi. But see this site where it is explained that there were several independent discoveries.

The idea behind the Bilateral Filter is that the weight of a measurement in the average is a Gaussian function of both the distance and the similarity (in color, intensity, or as in our case: depth value). The similarity term prevents edges to be ‘averaged out’.

The Bilateral Filter works well, the only drawback it has is its computational complexity: O(N^2) where N is the (large!) number of pixels in the image. So, several people have been working on fast algorithms to alleviate the computational burden. To me it seems that Ben Weiss provided a good solution, but it is not generally available. The solution by Frédo Durand and Julie Dorsey (2002), and the elaboration of this work by Sylvain Paris and Frédo Durand (2006), all from MIT, seems to be the leading solution, and is general available – both the theory and example software. Their method has a project site that is here.

In a nutshell, the method by Sylvain Paris and Frédo Durand reduces processing time by first down sampling the image, then applying a convolution to compute the averages, and finally scaling up the image again while clamping over out-of-bounds values. So in essence, it operates on a (cleverly) reduced version of the image.

I’ve downloaded and compiled the software – the really fast version with the truncated kernel – and it requires about 0.032s to process a ppm image of 640×480 pixels (grayscale values), where the spatial neighborhood is set to 16 (pixels) and the ‘similarity’ neighborhood is set to 0.1, so grayscale colors that differ more than 0.1 after transformation to normalized double representation, are not considered in the average. See the image below for a screen shot.

The processing time is, of course, computer dependent, but my pc is not really slow. Although 32ms is a fine performance, it is too slow for real-time image processing. The Kinect produces a frame 30 times per second, i.e. every 33ms, and we do not want to create a latency of about one frame just because of the Bilateral Filter.

GPU implementation: C++ AMP

In order to improve on the processing time of this fast algorithm I’ve written a C++ AMP program inspired by the CPU implementation, this program runs on the GPU, instead of on the CPU. For information on C++ AMP, see here and here. What I think is great about AMP is that it provides a completely general access to General Purpose GPU computing. Having said that, I must also warn the reader that I do not master it to the degree that I could guarantee that my implementation of the Bilateral Filter in C++ AMP is representative of what could be achieved with C++ AMP.

The result of my efforts is that the ppm image above can now be processed in little over 1 ms. Consider

the picture below, made with my ATI Radeon HD 5700 Graphics card.

What you see here is a variety of timings of the computational phases. The top cycle takes 1.1ms, the middle one takes 1.19, and the bottom cycle takes 1.07ms. So, what is in the cycle?

1. The image is loaded into the GPU, and data structures are initialized. If you want to know more on ‘warming up’ the data and the code, see here. Since it takes 0.5 to 0.6 ms it is obviously the bottle neck.

2. Down sampling the image to a smaller version takes around 0.1 ms.

3. Computing the convolution takes 0.35 ms. This is the real work.

4. Up scaling and clamping takes again 0.1 ms.

A processing time of about 1 ms is satisfactory as a real-time processing time. Moreover, since we may assume the data is already in GPU memory (we need it there to render it to the screen), GPU upload time is not an attribute of an application of the Bilateral Filter in this context. So we may think of the processing time as being about 0.55 ms. which is absolutely fabulous.

New Graphics Card

At about this time, I bought a new graphics card, an Asus NVidia GTX690 (which for the purposes of this application yields the same results as a GTX 680, I know). This card was installed in my pc. Ok, I didn’t buy a new motherboard, so data is still being uploaded through PCI-e 2.0 and not through PCI-e 3.0 16x (but in time…). So, will this make a difference? Yes, it does. Look at the screen shot below.

I rearranged the timings a bit, to gain better oversight. We see that:

1. Data uploading and the warming up process now takes about 0.45 ms.

2. Filtering now takes about 0.25 ms.

From 32ms to 0.25ms. Most satisfying!

Viewing Kinect Data in the New Windows 8 UI

Introduction

The Kinect SDK is not compatible with WinRT in the sense that software developed using the SDK cannot have a WinRT (Windows RunTime) UI. The reason is that the Kinect SDK is .Net software and you cannot run (full) managed code on the WinRT.

Nevertheless, I want to create software that can show Kinect data in a WinRT UI. For multiple reasons, one being that software written for the WinRT can run on a PC, a tablet, very large screens, now called a Surface, and a Windows Phone. A survey of other solutions, see below, reveals that solutions to this problem are based on networking. Networking allows us to deliver Kinect data anywhere. This then is another reason to work on separating the source of Kinect data from its presentation.

The Solution

The general solution is to make a client-server system. The server lives in the classic Windows environment, the client is a WinRT app. Communication between client and server is realized using networking technology; preferably the fastest available. The server receives the data from the Kinect and does any processing that involves the Kinect SDK. The client prepares the data for presentation on the screen. If multiple servers are involved, it integrates and time-synchronizes data from several servers. Since I’m a C++, DirectX guy, the server and client are built on just these platforms

Other Solutions

Several other solution already exist. Without pretending to be exhaustive, and in any order:

– The KinectMetro App by the WiseTeam

– ‘Using Kinect in a Windows 8 / Metro App’ by InterKnowlogy

– The Kinect Service from Coding4Fun

The KinectMetro App by the WiseTeam

The application by the WiseTeam is described in this blog post. The software is available at Codeplex. The software was written for the Windows 8 Consumer Preview as part of a MS Imagine Cup participation. I’ve downloaded the software, but couldn’t get it to run on the Windows 8 RTM version. The application is based on event aggregation, as found in PRISM, and on WebSockets.

‘Using Kinect in a Windows 8 / Metro App’ by InterKnowlogy

The approach InterKnowlogy took is blogged here. This is the entry point to several blog posts, some videos (Vimeo and Youtube), and a little bit of code. This solution is also written in C# .Net, using WebSockets.

The Kinect Service by Coding4Fun

This software is available from Codeplex. It is not aimed at the WinRT, it aims at distributing Kinect data to a wider spectrum of clients. Hence it can also be used as a base to target the WinRT. Apart from the server, it consists of a WPF client and a phone client. This looks like a high standards, well written solution. Neat! Data transport uses WinSockets (not WebSockets). The code is available in both C# and VB.

Evaluation

In theory, WebSockets are slower than WinSockets. There can be much discussion about what would be the fastest solution under which circumstances. I expect WinSockets to be the fastest solution, therefore I prefer WinSockets.

Also, in theory, a C++ program is faster, and smaller, than an equivalent C# program. There can be much discussion … , therefore I prefer a program written in C++.

Of course, we should do asynchronously, or in parallel, whatever can be done quicker in parallel.

Approach

So, what’s a smart way to develop a client server system to show Kinect color and depth data in a WinRT app? For one, we start from SDK samples:

– A sample from the Kinect SDK that elaborates processing depth and color data together.

– A sample the shows how to use WinSockets (in C++).

– A sample that show how to use the WinRT StreamSocket using PPL tasks (yes we will exploit parallelism extensively 🙂 .

– Windows Service (optional, see below).

The use of a Windows service is an option for later use. To work with a service instead of a simple console application requires that the server is capable of handling all kinds of exceptional situations, if only by resetting itself. Consider e.g. the case that no Kinect is connected, or if the Kinect is malfunctioning? Etc.? And apart from that, I expect the Kinect SDK to be made available for WinRT applications in due time.

Architecture

Server side architecture

The general software architecture looks like this:

The test application instantiates the KinectColorDepthServer DLL. The idea is that in case the DLL is run by a service, the DLL can be loaded and dropped easily / frequently so as to prevent problems that relate to long running processes. So every time the client closes the WinSock connection, the application (or service), drops the DLL and creates a new instance.

The KinectColorDepthServer has a simple interface; you can Run it, Stop it and Destroy it. The interface has this neutral character so we can use the same interface for other data sources, like a stereoscopic camera. The server instantiates a Kinect DataSource on a separate thread, and waits until the connection is closed. The Server also creates two WinSock servers and hands references of these servers to the Kinect DataSource. The WinSock servers are created at a relatively high level, so we can configure them at a high level in the call chain. Lifecycle management of the WinSock servers is in parallel.

The Kinect DataSource contains those parts of the Kinect sample that contain, or refer to Kinect SDK code (which cannot be run in the WinRT client). The Kinect DataSource sends pairs of a depth image and a color image in parallel to the client. The main method in the Kinect DataSource deals with mapping the color data to the depth data.

The WinSock server is just the basic WinSock server sample from the Windows SDK documentation.

Client Side Architecture

The general software architecture looks like this:

The WinRT UI application class manages the lifecycle of the application. The MainPage manages the state of the user interface.

The MainPage references the Scene1 class that inherits from the Scene class in My DirectX Framework. This framework organizes standard WinRT DirectX11.1 code in a structure that is similar to the XNA architecture. This latter architecture support easy creation and management of graphical components very well. So, it keeps my code clean and well organized under a growing number of components. I like that, because I like to have oversight.

The Scene1 class refers to the KinectColorDepthclient, which provides the data, and the KinectImage class which contains the DirectX code (a WinRT port) from the Kinect SDK sample, which it uses to display the Kinect data on the screen, using a SwapChainBackgroundPanel. The Scene1 class also references a Camera class (not shown in the diagram) that allows the user to navigate through the 3D scene.

The KinectColorDepthClient creates two StreamSockets, one for depth data, and one for color data. Reception of depth and color images is parallel, then synchronized so as to keep matching color and depth images together. The resulting data is handed over to the KinectImage.

One goal of this architecture is that the KinectColorDepthClient class can be easily replaced by another class, e.g. when Microsoft decides to release a version of the Kinect SDK that is compatible with WinRT. For this reason it has a limited and general interface.

Parallelism is coded making extensive use of PPL task parallelism. PPL Tasks is really a pleasure to use in code.

WinSock2 sockets cannot be used in the WinRT, as it turns out. The alternative at hand is the StreamSocket. However, the StreamSocket still contains a bug. Closing a StreamSocket is done in C++ by calling delete on a StreamSocket object. This however, raises an unhandled exception (that I did not succeed in catching, by the way). It does not only do this in my code, but also in the StreamSocket sample that can be downloaded from MSDN (12 October 2012). A bug report has been filed.

Performance

So, now that we have this nice software, just what is the performance, that is, how quick is it, and how large are the programs involved?

Dry testing the transmission speed

To gain an idea of the speed with which data can be transported from one process to another, I sent a 1Mbyte blob from a Winsock2 server to a Winsock2 client 10.000 times, and averaged the transmission time.

Clocking was done using the ‘QueryPerformanceCounter’ function, which is quite high res. The performance counter was queried just before the start of transmission at the server, and just after arrival of the last blob at the client. The difference between the tick counts is then divided by ‘QueryPerformanceFrequency’, which give you the result in seconds. So multiply by 1000 (ms) and divide by the number of cycles (10.000). This shows that transmission of 1Mbyte takes about 1.5 ms (release build).

Now, we are planning to send 640 x 480 pixels (4 bytes each), and an equal nr. of depth values (2 bytes each) over the line. This will take us about 1.5 * (1.843.200 / 1.048.576) = 2.6 ms (wow!). The conclusion is that there will be no noticeable latency.

Visual Studio Performance Analysis

This tool is about finding bottlenecks in your code, so you may remove them. In an analysis run of the server, 5595 samples were taken. The CPU was found executing code I wrote / copied myself in 21.4% of the samples, all in one method. It is possible to examine which lines of code take the most time in that method. I measured an average processing time of these lines of code, and they typically take 1.7 ms (release build, debugger attached) to execute. Well, what can I say? Although I suspect the 21.4% could be improved, we will just leave it as it is.

In a second analysis run, the client application was scrutinized. In this run 2357 samples were taken – I guess it turned out harder to take samples. As little as 2.64% of the samples were in ‘my code’ (that is: 58 samples). Another 8.10% is taken up by DirectX – running shaders for my program, I think. So, in all about 11%. Since the rest of the samples hit code that I cannot touch the source code of, and that we may assume is already well optimized, this is a very fine result.

Footprint

And how about the size, the footprint? The release build shows a client that has a working set of around 40 Mbyte, and a server with a working set of about 95 Mbyte. Together about 135 Mbyte. Well, that’s not small, but what should we compare it to? The Kinect Service by Coding4Fun, of course!

I downloaded and ran the WPF sample (pre-built). It turns out that the server usually stays under 130Mb, and the client will stay under 67 Mb. Together slightly less than 200Mb.

In conclusion: the footprint of the C++ application is smaller. Its size is 2/3 of the .Net application size, but it is not dramatically smaller.

Demo video

Below you’ll find a link (picture) to a video demonstrating the Kinect client-server system. First the server is started in a Windows desktop environment, then the user (me 🙂 ) switches over to the Start window to start up the client. You can see the client connect to the server – watch the log window at the lower left – and then see the Kinect data on the screen. The stream is stopped and then restarted. That is, in fact, all. The video has been made using Microsoft Expression Encoder Screen Capture. The screen capture has been processed with Encoder, with which we also made the snapshot that serves as the hyperlink to the download site (SkyDrive – Cloud!).

The jitter in the picture is caused by the depth stream. The depth stream consists of depth measurements expressed as the distance from the camera along the normal emanating from the camera, in mm. These measurements are subject to a certain error, or uncertainty, which causes fluctuations in measurements, hence the jitter in the stream.

Filtering away the jitter is high on my agenda.

The Windows 8 Metro SwapChainBackgroundPanel

Microsoft has provided a nice facility for inter-operation between XAML user interface elements and DirectX graphics: the SwapChainBackgroundPanel. In fact they have provided three alternatives, but here we focus on the high performance alternative that also leaves most control to the developer.

Microsoft was kind enough to provide a sample program that shows how to use the SwapChainBackgroundPanel. However, this program also does a fairly large number of other things. So, I decided to create a small project in which the use of the SwapChainBackgroundPanel is central, but that can also be used as a starting point for a larger program.

You can download the Visual Studio 2012 project from here. You will need Windows 8 (Release Preview) and Visual Studio 2012 (Release Candidate) to build and run the application.

The starter project combines elements from the XAML DirectX 3D shooting game sample (which exemplifies the use of the SwapChainBackgroundPanel, with elements of the standard Visual Studio Metro DirectX application template. All the application does is show a rotating colored cube.

Well, that is not entirely true. Couldn’t resist the temptation to add a slider (and a data bound TextBox that shows the value) that controls the rotation speed and direction of the cube.

The behavior of the slider is not (yet) as desired, see this screen capture video; the slider moves uncontrollably back and forth (albeit once in each direction) when the setting has changed. I’ve issued a feedback item for this, and trust that this problem will be solved in the RTM version.

Some other controls also suffer from this type of problems concerning the sharing of screen ’real estate’ between raw DirectX and the XAML render engine, try e.g. the ComboBox control.

The project setup follows a specific pattern. A Visual C++ project may collect files in filters – much like folders, but not physical. A blank Metro style project already contains the Assets and Common filters, for Metro specific files. I found it is becoming standard practice to collect basic DirectX code under a DirectXBase filter. This filter hides all DirectX related code the can easily be reused in other projects. The Precompiled Headers filter hides just what it says it will hide. It advances build performance (pretty much) to collect all standard and / or multiply used headers in pch.h. For application specific rendering you create your own render engine – hidden by the Render Engines filter. Your render engine will use Shaders – see the Shaders filter. Finally, application specific DirectX render logic, like your standard Update method, is situated in the custom Controller class, hidden by the Controllers filter. Architecturally speaking, the Controller class inherits from RenderEngine, which in turn inherits from DirectXBase. The App class is responsible for application management, and the MainPage class is responsible for management of the visual state.

The intended architecture is also depicted in the UML class diagram below.

This setup is a copy of the shooting game sample. It seems, however, more natural to attach the controller to the MainPage, since the SwapChainBackgroundPanel, which provide the render surface for the DirectX code is in the MainPage as well.

Of course, If you really want to do a clean job, you could separate off the DirectX part into a WinRT dll. This would allow for reuse and interop with C# code. Alternatively, the controller for the SwapChainBackgroundPanel could be attached to, conforming to the MVVM pattern. At this point, however, I was happy to have a working first application and left pimping up the project for another occasion.

Vector –Matrix Inner Product with Computer Shader and C++ AMP

Large vector-matrix inner products by the GPU are 250 times faster than straight forward CPU implementations on my PC. Using C++ AMP or a Compute Shader the GPU realized a performance of over 30 gFLOPS. That is a huge increase, but my GPU has a “computational power” (whatever that may be) of 1 teraFLOP, and 30 gFLOPS is still a long way from 1000 gFLOPS.

This article presents a general architectural view of the GPU and some details of a particular exemplar: the Ati Radeon HD5750. Then code examples follow that show various approaches to large vector-matrix products. Of course the algorithms at the end of the article are the fastest. It is also the simplest.

Unified View of the GPU Architecture

Programming the GPU is based on an architectural view of the GPU. The purpose of this architectural view is to provide a unified perspective on GPUs from various vendors, hence with different hardware setup. It is this unified architecture that’s being programmed against using DirectX11. A good source of information on Direct Compute and Compute Shaders is the Microsoft Direct Compute BLog. The architecture described below is based on information from Chas Boyd’s talk at PDC09, as published on Channel9. Of course, this blog post only presents some fragments of the information found there.

A GPU is considered to be build from a number of SIMD cores. SIMD means: Single Instruction Multiple Data. By the way, the pictures below are hyperlinks to their source.

The idea is that a single instruction is executed on a lot of data, in parallel. The SIMD processing unit is particularly fit for “data parallel” algorithms. A GPU may consist of 32 SIMD cores (yes, the image shows 40 cores) that access memory with 32 floats at a time (128 bit bus width). Typically the processor runs at 1Ghz, and has a (theoretical) computational power of about 1 TeraFLOP.

A SIMD core uses several kinds of memory:

  • 16 Kbyte of (32-bit) registers. Used for local variables
  • 8 Kbyte SIMD shared memory, L1 cache.
  • L2 cache

The GPU as a whole has typically 1Gb of general RAM. Memory access bandwidth is typically of order 100GBit/s.

Programming Model

A GPU is programmed using a Compute Shader or C++ AMP. Developers can write compute shaders in HLSL (Looks like C) to be executed on the GPU. AMD is a C++ library. The GPU can run up to 1024 threads per SIMD. A thread is a line of execution through code. The SIMD shared memory is shared among the threads of a SIMD. It is programmable in the sense that you can declare variables (arrays) as “groupshared” and they will be stored in the Local Data Share. Note however, that over-allocation will spill the variables to general RAM, thus reducing performance. Local variables in shader code will be stored in registers.

Tactics

The GPU architecture suggests programming tactics that will optimize performance.

  1. Do your program logic on the CPU, send the data to the GPU for operations that apply to (about) all of the data and contain a minimal number of alternative processing paths.
  2. Load as much data as possible into the GPU general RAM, so as to prevent the GPU waiting for data from CPU memory.
  3. Declare registers to store isolated local variables
  4. Cache data that you reuse in “groupshared” Memory. Don’t cache data you don’t reuse. Keep in mind that you can share cached data among the threads of a single group only.
  5. Use as much threads as possible. This requires you use only small amounts of cache memory per thread.
  6. Utilize the GPU as efficiently as possible by offering much more threads to it than it can process in a small amount of time.
  7. Plan the use of threads and memory ahead, then experiment to optimize.

Loading data from CPU memory into GPU memory passes the PCIe bridge which has a bandwidth, typically of order 1GBit/s; that is, it is a bottleneck.

So, you really like to load as much data onto GPU memory before executing your code.

The trick in planning your parallelism is to chop up (schedule, that is J ) the work in SIMD size chunks. You can declare groups of threads; the size of the groups and the number of groups. A group is typically executed by a single SIMD. To optimize performance, use Group Shared Memory, and set up the memory consumption of your thread group so it will fit into the available Group Shared Memory. That is: restrict the number of threads per group, and make sure you have a sufficient number of groups. Thread groups are three dimensional. My hypothesis at this time is that it is best to fit the dimensionality of the thread groups to match the structure of the end result. More about this below. Synchronization of the threads within a thread group flushes the GroupShared Memory of the SIMD.

A register typically has a lifetime that is bound to a thread. Individual threads are member of several groups – depending on how you program stuff. So, intermediate results aggregated by thread groups can be stored in registers.

Does My ATI Radeon HD5750 GPU Look Like This Architecture… A Bit?

The picture below (from here) is of the HD5770, which has 10 SIMD cores, one more than the HD5750.

What do we see here?

  • SIMD engines. We see 10 cores for the HD5770, but there are 9 in the HD5750. Each core consists of 16 red blocks (streaming cores) and 4 yellow blocks (texture units).
  • Registers (light red lines between the red blocks).
  • L1 Textures caches, 18Kbyte per SIMD.
  • Local Data Share, 32 Kbyte per SIMD.
  • L2 caches, 8 Kbyte each.

Not visible is the 1Gb general RAM.

The processing unit runs at 700Mhz, memory runs at 1,150Mhz. Over clocking is possible however. The computational power is 1,008 TeraFLOP. Memory bandwidth is 73.6 GBit/s.

So, my GPU is quite a lot less powerful than the reference model. At first, a bit disappointing but on the other hand: much software I write for this GPU cannot run on the PCs of most people I know – their PCs are too old.

Various Approaches to Vector-Matrix Multiplication

Below we will see a number of approaches to vector-matrix multiplication discussed. The will include measurements of time and capacity. So, how do we execute the code and what do we measure?

Times measured include a number of iterations that each multiply the vector by the matrix. Usually this is 100 iterations, but fast alternatives get 1000 iterations. The faster the alternative, the more we are interested in variance and overhead.

Measurements:

  • Do not include data upload and download times.
  • Concern an equal data load, 12,288 input elements if the alternative can handle it.
  • Correctness check; computation is also performed by CPU code, reference code.
  • Run a release build from Visual Studio, without debugging.
  • Allow AMP programs get a warming up run.

Vector-Matrix Product by CPU: Reference Measurement

In order to determine the performance gain, we measure the time it takes the CPU to perform the product. The algorithm, hence the code is straightforward:

In this particular case rows = cols = 12,288. The average over 100 runs is 2,452 ms, or 2.45 seconds. This amounts to a time performance of 0.12 gFLOPS (giga FLOPS: FLoating point Operations Per Second). We restrict floating point operations to addition and multiplication (yes, that includes subtraction and division). We calculate gFLOPS as:

2 / ms x Rows / 1000 x Cols / 1000, where ms is the average time in milliseconds.

The result of the test is correct.

Parallel Patterns Library

Although this blog post is about GPU performance, I took a quick look at PPL performance. We then see a performance gain of a factor 2, but the result is incorrect, that is, the above code leads to indeterminacy in a parallel_for loop. I left it at that, for now.

Matrix-Matrix Product

We can of course, view a vector as a matrix with a single column. The C++ AMP documentation has a running code example of a matrix multiplication. There is also an accompanying compute shader analog.

AMP

To the standard AMP example I’ve added some optimizing changes, and measured the performance. The AMP code look like this:

Here: amp is an alias for the Concurrency namespace. The tile size TS has been set to 32, which is the maximum; the product of the dimensional extents of a compute domain should not exceed 1024. The extent of the compute domain has been changed to depend on B, the matrix, instead of the output vector. The loop that sums element products has been unrolled in order to further improve performance.

As mentioned above, we start with a warming up. As is clear from the code we do not measure data transport to and from the GPU. Time measurements are over 100 iterations. The average run time obtained is 9,266.6 ms, hence 0.01 gFLOPS. The result after the test run was correct.

The data load is limited to 7*1024 = 7,168; that is 8*1024 is unstable.

Compute Shader

The above code was adapted to also run as a compute shader. The code looks like this:

The variables Group_SIZE_X and Group_SIZE_Y are passed into the shader at compile time, and are set to 32 each.

Time measurements are over 100 iterations. The average run time obtained is 11,468.3 ms, hence 0.01 gFLOPS. The result after the test run was correct. The data load is limited to 7*1024 = 7,168; that is 8*1024 is unstable.

Analysis

The performance of the compute shader is slightly worse that the AMP variant. Analysis with the Visual Studio 11 Concurrency Visualizer shows that work by the GPU in case of the compute shader program is executes in small spurts, separated by small periods of idleness, whereas in the AMP program the work is executed by the GPU in one contiguous period of time.

Nevertheless, performance is bad, worse than the CPU alternative. Why? Take a look at the picture below:

For any value of t_idx.global[0] – which is based on the extent of the matrix- that is unequal to zero, vector A does not have a value. So, in fact, if N is the number of elements in the vector, we do O( N3)retrievals but only O(N2) computations. So, we need an algorithm that is based on the extent of a vector, say the output vector.

Vector-Matrix Product

Somehow, it proved easier to develop the vector-matrix product as a compute shader. This is in spite of the fact that unlike AMP, it is not possible (yet?) to trace a running compute shader in Visual Studio. The idea of the algorithm is that we tile the vector in one dimension, and the matrix in two, thus obtaining the effect that the vector tile can be reused in multiplications with the matrix tile.

Compute Shader

A new compute shader was developed. This compute shader caches vector and matrix data in Group Shared memory. The HLSL code looks like this:

This program can handle much larger amounts of data. Indeed, this program runs problem free for a vector of 12,288 elements and a total data size of 576 Mbyte. Using an input vector of 12,288 elements, with total data size of 576 Mbyte. The time performance is 10.3 ms per run, averaged over 1,000 runs, which amounts to 29.3 gFLOPS. The result of the final run was reported to be correct.

AMP

In analogy to the compute shader above I wrote (and borrowed 🙂 ) a C++ AMP program. The main method looks like this:

The matrix is a vector with size * size elements. He tile size was chosen to be 128, because that setting yields optimal performance. The program was run on an input vector of 12,288 elements again, with total data size of 576 Mbyte. The time performance is 10.1 ms per run, averaged over 1000 runs, which amounts to 30.0 gFLOPS. The result of the final run was reported to be correct.

Analysis

We see here that the performance has much improved. When compared to the reference case, we can now do it (in milliseconds) 2,452 : 10.1 = 243 : 1, hence 243 times faster.

Simpler

Then, I read an MSDN Magazine article on AMP tiling by Daniel Moth, and it reminded me that caching is useless if you do not reuse the data. Well, the above algorithm does not reuse the cached matrix data. So I adapted the Compute Shader program to retrieve matrix data from central GPU memory directly. The HLSL code looks like this:

Note the tileSize of 512(!). This program was run for a vector of 12,288 elements and a total data size of 576 Mbyte. The time performance is again 10.3 ms for a multiplication which amounts to 29,3 gFLOPS (averaged over 1000 runs). The result of the final run was reported to be correct. So, indeed, caching the matrix data does not add any performance improvement.

AMP

For completeness, the AMP version:

Time performance is optimal for a tile size of 128, in case the number of vector elements is 12,288. We obtain an average run time of 9.7 ms (averaged over 1,000 runs), and a corresponding 31.1 gFLOPS. The result of the final run was correct. This program is 2452 / 9.7 = 252.8 times as fast as the reference implementation.

Conclusions

Developing an algorithm for vector-matrix inner product has demonstrated comparable performance for Compute Shaders and AMP, but much better tooling support for AMP: we can step through AMP code while debugging, and the Concurrency Visualizer has an AMP line. This better tool support helped very well in analyzing performance of a first shot at the algorithm. The final algorithm proved over 250 times faster than a straight forward CPU program for the same functionality.

Detailed knowledge of the GPU architecture, or the hardware model, proved of limited value. When trying to run the program with either the maximum nr of threads per group, or the maximum amount of data per Group Shared Memory, I ran into parameter value limits, instabilities, performance loss, and incorrect results. I guess, you will have to leave the detailed optimization to the GPU driver and to the AMP compiler.

One question keeps bothering me though: Where is my TeraFLOP?

I mean, Direct Compute was introduced with the slogan “A teraFLOP for every one of us”, AMP is built on top of Direct Compute, and my GPU has a computational power of 1.08 TeraFLOP. Am I not ‘one of us’?

C++ AMP Performance and Compute Shader Performance

Edit (April 23rd 2012):

The AMP team has updated the N-Body Simulation code to turn it into a clean port that relates to the Compute Shader original in a comprehensible way. Now it has comparable performance to the original (optimized) version (both versions do >330 gFLOPS at >30 fps for 23,040 particles on my pc).

I’m impressed. For one, by the attitude of the AMP people that energetically reacted to issues which other people / teams might well have dismissed as unimportant. Then there is the point that you get maximum performance from a set of very powerfull processors with code that is very short compared to the direct compute code you had to write otherwise, and this code, by AMP design, is very elegant as well.

Of course, there is a risk in short and elegant code: subtle differences in code can make substantial differences in performance, hence developing AMP code is rather knowledge intensive. But I kind of like that.

Edit (April 16th 2012):

The results below were brought to the C++ AMP forum for discussion. Daniel Moth advised to update the driver of the graphics card. This update made a tremendous difference for two of the three programs mentioned below for which now C++ AMP performance is equal to or better than Compute Shader performance.

The discussion on the N-Body Simulation program, which is heavily optimized in the Compute Shader version is still open, mainly because the required information is not available yet. I expect that also in this case C++ AMP will prove to be equipotent to Compute Shader programs.

Now, what have we learned from this exercise? For one, a lot about Compute Shader optimization and the mechanisms of GPU computing performance. This is an interesting and instructive subject. I also have learned that C++ AMP performance is comparable to Compute Shader performance. However, I do not (yet) understand if and how this will always and necessarily be the case, and that still itches a bit.

Results as they are standing now:

 

Program

AMP

CS

 

Guide

 

 

Average time (ms, 10 it.)

 

2,650

2,995

gFLOPS

 

36.9

32.7

Max. Data Load (Kb)

 

714,432

691,200

 

Vector Addition

 

 

Average time (ms, 10 it.)

 

6,017

8,155

gFLOPS

 

0.03

0.02

Max. Data Load (Kb)

 

1,781,248

2,039,056

 

N-Body Simulation

 

 

Number of Particles

 

16,128

16,128

Frame rate

 

44.4

63.4

gFLOPS

 

229

329

Up to date, I find that Compute Shader based programs outperform C++ APM programs both in time and space. Results of example programs I explored, which have been created by the respective product teams tend to show substantially better performance by the Compute Shader programs. These programs are the N-Body Simulation Sample; Basic Summation; and the matrix multiplication programs from the “C++ AMP for the DirectCompute Programmer” guide. Hyperlinks are provided in the sections below.

So, the question is: can there be an AMP program that performs substantially better in time and space on, let’s say, large matrix multiplication (or large matrix-vector multiplication) than a Compute Shader program? C++ AMP has been built upon Direct Compute, so the answer is: not likely.

Should we, alternatively, draw the conclusion that a direct compute program categorically has better performance?

N-Body Simulation

The first pair of programs compared, consisted of:

Performance is expressed in gFLOPS. The code for the gFLOPS was copied from the C++ AMP version to the Compute Shader version. I also changed the Compute Shader version to make it write gFLOPS and the number of particles to the screen.

First, I tweaked the particle count parameter to get the best gFLOP count from either program; they both peak at 16,128 particles on my PC. Then the following results (gFLOPS) were obtained for release builds, running without debugging (this was also the configuration in the comparisons below).

C++ AMP Compute Shader More (%) Less(%)
Number of particles 16,128 16,128
Frames per second 43.46 57.38 32.03 24.26
gFLOPS 226.07 298.51 32.04 24.27

A note on the More and Less columns: The Compute Shader version delivers 32.03% more frames per second, and the C++ AMP version 24.26% less. So crudely: the Compute Shader version is about 30% faster.

Vector Addition

The second pair of programs compared consisted of:

The C++ AMP code was adapted as follows:

  • It was made to work with the same structs as the BasicCompute11 sample. This struct consists of an int and a float.
  • The arrays were made global variables.
  • A loop was added to fill the input arrays.
  • The verification code from the BasicCompute11 sample was added.

For timing, timing code was added to both programs. This timing code is from this post in the Parallel Programming in Native Code blog.

For timing measurements the code was adapted as follows: In the Compute Shader program timing covers code from the Dispatch call to the Map call. In the AMP program timing covers the lambda expression, and an added array_view::Synchronize() call on the “sum” array_view.

In experiments I first pushed the size until, in the case of the Compute Shader version, the output of the result verifying code became “failure”,

and in the case of the C++ AMP program, it either didn’t compile or produced a runtime error.

Then I measured time and gFLOPS. The experiments yielded the following result.

C++ AMP Compute Shader More (%) Less(%)
Number array elements 76*10^6 87*10^6 14.47 12.64
Total data size (Kb) 1,781,250 2,039,062.5
Time (ms) 6,868 8,182
gFLOPS 0.022 0.021

gFLOPS were measured as: 2*n / (10^6 * ms), where n is the number of elements in an array.

It seems to me that the time results are too similar to call them different. The Compute Shader version has a slight space advantage.

Note that since the total data size in both cases is larger than the RAM the graphics card has on board, there is some automatic sectioning going on.

Matrix Multiplication

Both programs in this comparison come from the C++ AMP for the DirectCompute Programmer guide. This guide can be obtained from a post on the official MSDN Parallel Programming in Native Code blog. The C++ AMP program is a transformation of the Compute Shader program.

The code for the starting point of the transformation is not entirely complete, so I added standard code from the BasicCompute11 Sample that loads and compiles the compute shader.

The following results were obtained.

C++ AMP Compute Shader More (%) Less(%)
Number array elements 4,608 7,616 65.28 39.50
Total data size (Kb) 248,832 679,728 173.17 63.39
Av Processing time (ms, 10 runs) 11,742 12,804
gFLOPS 8.3 34.5 315.66 75.94

Notes:

  • Both programs measure the time spent in the “mm” function, using the timing code referred to above. This includes uploading and offloading the data onto and from the GPU.
  • For both programs we have that any higher multiple of 64 in the number of array elements crashes the display driver.

  • gFLOPS are measured as: n^3 / (10^6 * ms) where:
  • n is the size of a matrix dimension (the matrices are square).
  • Ms is the averaged (over 10 iterations) measured processing time in milliseconds.

Conclusions

Three program pairs have been compared, informally and semi-systematically, for their performance in time and space.

In the case of the N-Body simulation, the data load was selected that is optimal for time performance. That resulted in an about 30% better time performance of the Compute Shader Program.

In the case of vector addition – about the simplest program imaginable in this context – the time performance was measured for maximum data load. This resulted in practically equal time performance for both programs. The Compute Shader version can load some more data.

Finally, the programs from the AMP guide for Compute Shader programmers were implemented, and the time performance was again measured for maximum data load. This resulted in a time performance of the Compute Shader that is three times as good as the time performance of the AMP program.

So, conclusion, it seems that if you want to get the max from your GPU, a Compute Shader is still the way to go.

PInvoking DirectX from Silverlight

Before moving on to Windows 8 development, I decided to write some legacy software. Well actually, this legacy software is perfectly up-to-date Windows 7 level software; tricks presented here will be useful for years to come. It’s just that Windows 8 (Consumer Preview) provides standard solutions to the problems solved here. This blog post discusses the use of a DirectX application, packaged as a DLL, by a Silverlight application, via PInvoke.

The problems tackled here stem from the desire to have Rich Internet Applications (RIAs) for Windows, that use computational resources on the client computer. In particular DirectX for 3D-graphics, X3dAudio, for 3D-audio, and also the GPU (Graphics Processing Unit – a powerful, highly parallel processor). Silverlight provides the facilities to write RIAs, but has a somewhat outdated 3D-graphics library: a subset of XNA – a managed wrapper for DirectX9 (but we want DirectX11, at least!). This Silverlight 3D-graphics library is not very extensive, it lacks e.g. 3D-audio.

On the other hand, Silverlight does provide facilities for interoperability with native code, e.g. by means of PInvoke: the invocation of native code in Dynamic Link Libraries (DLLs). PInvoke is here the bridge between Silverlight and DirectX code.

This blog post presents:

  • A sample DirectX11 application, and its transformation into a DLL to be used from Silverlight.
  • A Silverlight application that calls methods in the dll.
  • How to install and uninstall the DLL, and how to manage its lifetime explicitly, so the DLL may be uninstalled by the Silverlight application itself.
  • Performance aspects of the Silverlight-DirectX application, and a comparison with a Silverlight application that uses the Silverlight 3D-graphics library for the same task.
  • Concluding remarks, for one thing that this application should have had 3D-audio to decisively mark the advantage of the approach presented here (but at some point, you just have to round up).

The DirectX 11 Sample Application

The DirectX 11 Tutorial05 sample application will serve as the application a user wants to run on his or hers PC, and that uses resources already present on that PC. This DirectX application is the most simple application that contains some animation, and it has also a part – the small cube – that we can multiply in order to generate data for various performance loads.

To that end we transform it into a DLL with as much unnecessary functionality stripped, and an adequate interface added, including the code to transfer the data we need in the Silverlight application. Let’s take a look at the main changes.

Minimizing Window Management Code

For starters, We do not need a window, we use the DirectX application only to compute the 3D-graphics we present in the Silverlight application. The wWinMain (application entry point) function now looks like this:

Sample code like above is entered into the text as pictures. If you would like to have the code, just leave a comment on this blog with an e-mail address and I will ship it to you.

The function has no “Windows” parameters any more, nor has it a main message loop. The InitWindow function has been reduced to:

We do need to create a window in order to create a swap chain, and only for that reason, so we keep it as simple and small as possible. Note that the wcex.lpfnWndProc is assigned the DefWindowProc. That is: the application has no WindowProc of its own.

Create Texture to be Used in Export

In order to export the 3D-graphics data, an additional texture (a texture is a pixel array) called g_pOutputImage is created in the InitDevice function:

This texture has usage “Staging”, so the CPU can access it, and we specified CPU access as “Read”. With these settings we can’t bind the texture to the DeviceContext anymore, so no BindFlags. Note that we cannot have a texture that the GPU writes to, and the CPU reads from. If that would have been possible we could have had a data structure that both DirectX and Silverlight could have used directly. Since this is impossible we will have to perform expensive copy operations. Alas.

A final change in this same function is that we do not release the pointer to the back buffer, but keep it alive in order to export the graphics data in the Render function.

Rendering 3D-Graphics

The Render function has a loop added so we can have multiple small cubes. The idea is to compute a World matrix for each additional small cube. That is, we have only one cube, but draw it multiple times at different locations. Like this:

and:

Converting and Exporting 3D-Graphics Data

Finally, we want to copy the 3D-graphics data into an array the Silverlight client has provided, so that the client can show it to the user. This is done like so:

The above is standard code, I obtained it around here (the direct link seems broken). The ConvertToARGB function, however is a custom addition, replacing the memcpy call (more about that in the section on performance). This ConvertToARGB converts the RGBA format of DirectX to the premultiplied (PM) ARGB format used in Silverlight. This PM ARGB format is considered legacy now. The conversion step is a real performance hit as anyone can imagine. The function looks like this:

Essentially this OR-s 4 integers, the first one is constructed by byte-shifting the A (transparency) byte all to the left, then 3 integers are created by pushing the RGB bytes in place. This is a fast algorithm since shifting is a quick operation. I found it here. After the conversion, the pixels are in the correct format in an array that is owned by the Silverlight client application.

The DLL Interface

The interface has the following methods:

And for performance measurements:

The above functions return an average time over the Render function, and an average time over the conversion and export respectively. Details will be discussed below. The

decoration results in a clean export of the function names. Without the decoration, the C++ compiler will add a number of tokens (among which at least a few like @#$%^&*) to the function name in order to make it unique. The problem with this is that you’ll have a hard time retrieving the actual function name for use in the Silverlight client.

The Silverlight Client

General Architecture

The application has the following structure:

The App class is the application entry point (as usual). The Application_Startup event handler, depicted below,

first checks if the application is running out-of-browser (OOB). Running OOB is the intended normal use of this application. If so, a MainPage control is instantiated which will run the DirectX code. If the application is running in-browser, it still needs to be installed. Only after installation, the application has access to the file system – required to save and load the dll – and to the GPU. The application requires Windows 7 or higher and bails out if a lower level Windows or Apple OS is found.

The install page offers to install the application on the user’s PC, as depicted below,

or tells the user that the application is already installed, and hints at ways to uninstall the application if so desired.

If the user installs the application, it starts running out of browser and shows the MainPage with the DirectX animation.

Installing, Uninstalling, and Managing DLL Lifetimes

Installing includes saving the DirectX application in the DLL to a file on the user’s PC. The DLL is packaged with the Silverlight application as a resource. For execution, the DLL has to be loaded in memory, or be present on the PC as a file. Saving the DLL to file is done with code after an example from the NESL application. We store the application at “<SystemDrive>ProgramDataRealManMonths PInvokeDirectXTutorial05”.

Once the DLL is saved to file we load it into memory using the LoadLibrary function from the kernel32.dll. The reason we manage the dll’s lifetime explicitly instead of implicitly by importing the dll, and calling its functions, is that we need to be able to explicitly remove the dll from memory when exiting the application, see below. Loading into memory requires a dll import declaration:

And a call of this function, in the MainPage_Loaded event handler:

Where DllPath is just the path specified above. Is that all? Yes, that’s all.

When the application is exited, we use the handleToDll to release the library with repeated calls to FreeLibrary. Declaration:

Then we call it in the Application_Exit event handler as follows:

The point is that each method we import from the dll increases the reference count. As long as the reference count is larger than zero we cannot unload the DLL, nor delete its file. Not being able to delete the file means we cannot properly uninstall the application – we would leave a mess. Once the ref count is zero, FreeLibrary unloads the library from memory.

The final question in this section is why we delete the dll file every time we exit the application, and create the file every time we start it up. The reason is that if we do not do that, and the user uninstalls the application from the InstallPage (running in-browser), the application does not have the permissions to access the file system, hence the DLL file will not be deleted. So, all these file manipulations are bound to the runtime of the application in order to have a clean install and uninstall experience for the user.

PInvoking the DirectX Functions

Now that the application can be installed, functions from the DirectX application interface can be declared and executed.

[DllImport(DLL_NAME, SetLastError = true, CallingConvention = CallingConvention.Cdecl)]

public extern static int Init(int width, int height,

[MarshalAs(UnmanagedType.LPWStr)] String effectFilePath);

[DllImport(DLL_NAME, SetLastError = true, CallingConvention = CallingConvention.Cdecl)]

public extern static void Render([In, Out] int[] array);

[DllImport(DLL_NAME, SetLastError = true,CallingConvention = CallingConvention.Cdecl)]

public extern static int Cleanup();

[DllImport(DLL_NAME, SetLastError = true, CallingConvention = CallingConvention.Cdecl)]

public extern static void GetRenderTimerAv(ref double pArOut);

[DllImport(DLL_NAME, SetLastError = true, CallingConvention = CallingConvention.Cdecl)]

public extern static void GetTransferTimerAv(ref double pArOut);

We make a call to the Init function in the MainPage_Loaded event handler, calls to the dll Render function, in the local Render method, and a call to CleanUp in the Application_Exit event handler.

Calls to the timer functions are made when the user clicks the “Get Timing Av” button on the MainPage.

Debugging PInvoke DLLs

At times you may want to trace the flow of control from the Silverlight client application into the native code of the DLL. This, however is not possible in Silverlight. Silverlight projects have no option to enable debugging native code. Manually editing the project file doesn’t help at this point. Now what?

A work around is to create a Windows Presentation Foundation (WPF) client. I did this for the current application. This WPF application does not show the graphics data the DirectX library returns, it just gets an array of integers.

To trace the flow of control into the DLL you need to uncheck ”Enable Just My Code (Managed only)” at (in the menu bar) Tools | Options| Debugging, and to check (in the project properties) “Enable native code debugging” at Properties | Debug | Enable Debuggers.

If you now set a breakpoint in the native code and start debugging from the WPF application, program execution stops at your breakpoint.

Reactive Extensions

In order to have a stable program execution, the calls to the dll’s Render method are made on a worker thread. We use two WriteableBitmaps, one is returned to the UI thread upon entering the Silverlight method that calls the dll’s Render method, the other WriteableBitmap is then rendered to by the DLL. After rendering, the worker thread pauses to fill up a time slot of 16.67ms (60 fps).

Thread management and processing the indices that point into the WriteableBitmap array (implementation detail J ) is done using Reactive Extensions (RX). The idea is that the stream of indices the method returns is interpreted by RX as an Observable collection and ‘observed’ such that it takes the last index upon arrival, and uses the index to render the corresponding WriteableBitmap to screen. This results in elegant and clean code, as presented below.

The first statement create an observable collection from a method that returns an IEnumerable. Note that ‘observing’ is on the UI thread (referred to by the ‘DispatcherScheduler’)

The SubscribeOn(ScheduleNewThread)-clause creates a new thread for the render process. The lambda expression defines the action if a new int (index) is observed.

Rendering on the worker thread proceeds as follows:

To stop rendering we just put IsRunning to “false”. And that’s it.

Performance

DirectX applications – by definition – have higher performance than .Net applications. However, if you pull out the data from a DirectX application and send it elsewhere, there is a performance penalty. You will be doing something like this:

CPU -> GPU -> CPU -> GPU -> Screen instead of CPU -> GPU -> Screen

The extra actions: copying data from the GPU to CPU accessible memory and converting to Premultiplied ARGB will take time. So the questions are:

  1. How much time is involved in these actions?
  2. Will the extra required time pose a problem?
  3. How does performance compare to the Silverlight 3D-graphics library?
  4. Are there space (footprint) consequences as well?

Before we dive into answering the questions, note that:

– The use of DirectX will be primarily motivated by the need to use features that are not present in the Silverlight 3D-Graphics / -Audio library at all. In such cases comparative performance is not at all relevant. Performance is relevant if the use of DirectX becomes prohibitively slow.

For the measurements I let the system run without fixed frequency; usually you would let the system run at a frequency of 60Hz, since this is fast enough to make animations fluent. At top speed, the frequency is typically around 110Hz. I found no significant performance differences between debug builds and release builds.

Visual Studio 11CP Performance analysis: Sampling

If we run a sampling performance analysis – this involves the CPU only, the bottleneck in the process becomes clear immediately: The conversion from RGBA to premultiplied ARGB (and I’m not even pre-multiplying) takes 96.5% of CPU time.

It is, of course, disturbing that the bulk of the time is spent in some stupid conversion. On the other hand, work done by the GPU is not considered here.

To investigate the contribution of the conversion further, I replaced the conversion by a memcpy call. Then we get a different color palette J, like this:

But look, the frequency jumps up to 185 fps (80% more). The analysis then yields:

That is: much improved results, but shoving data around is still the main time consumer. Note that the change of color palette by the crude reinterpretation of the pixel array is a problem we could solve at compile time, by pro-actively re-coloring the assets.

Compare to a Silverlight 5 3D-library application

Would the performance of our application hold up to the performance of a Silverlight application using the regular 3D-graphics library? To find out I transformed the standard Silverlight 3D-graphics starter application to a functional equivalent of our Silverlight-DirectX application, as depicted below – one large cube and 5 small cubes orbiting around it (yes, one small cube is hidden behind the large one).

If we click the “Get Timing Av button”, we typically get a “Client Time Average” (average time per Draw event handler call) of 16.6.. ms, corresponding to the 60 fps. The time it takes to actually render the scene averages to 3.3 ms. This latter time is 0.8ms without conversion, and 2.8ms with conversion for the Silverlight – DirectX application (if we let it run at max frequency). So, the Silverlight-DirectX alternative can be regarded as quicker.

If we look at the footprint, we see that the Silverlight-DirectX application uses 1,880K of video memory, and has an image of 50,048K in the Task Manager. The regular Silverlight application uses 5,883K of video memory, and has a 37,708K image. Both in SLlauncher. So, the regular Silverlight application is smaller.

Concluding Remarks

For one, it is feasible to use DirectX from Silverlight. PInvoke is a useful way to bridge the gap. This opens up the road to use of more, if not all, parts of the DirectX libraries. In the example studied here, the Silverlight-DirectX application is faster, but has a larger footprint.

We can provide the user with a clean install and uninstall experience that covers handling and lifetime management of the native dll.

Threading can be well covered with Reactive extensions.

There is a demo application here. This application requires the installation of the DirectX 11 and the Visual C++ 2010SP1 runtime packages (links are provided at the demo application site). I’ve kept these prerequisites separate, instead of integrating their deployment in the demo application installation the NESL way, mainly because the DirectX runtime package has no uninstaller.

If you would like to have the source code for the example program, just create a comment on this blog to request for the source code, I’ll send it to you if you provide an e-mail address.

Silverlight massive animation performance

As it turns out, Storyboard animations in Silverlight have limited performance capability. Presumably this system has been designed for ease of use and developer / designer productivity. If you want to create massive amounts of animations, like for instance in particle systems, you soon hit the performance limits of the rendering, graphics, animation subsystem.

Hey, that’s interesting!

Of course, now we want to know what the performance limits are, and how we can get around them. When I first hit the aforementioned performance limits, I had no clue as to how to improve performance. In this article you will find some articles I found on the World Wide Web concerning the subject. Great stuff. Some solutions found are about 20 times faster than others, and current developments of Silverlight 5 seems to promise to take it a step further.

Growing trees

The first article encountered was How I let the trees grow by Peter Kuhn. He describes how he ran into performance problems creating a tree that grows by splitting branches into smaller branches, terminating in leaves. At some point he finds his software trying to render over 20k paths, which is ‘massive’ enough to create performance problems. The solution is found in the use of the WriteableBitmapEx CodePlex project. The WriteableBitmapEx contains (among others) a fast Blit operation for copy operations (claimed to be 20-30 times faster than the standard Silverlight operation – I believe it). You can draw on Bitmaps that are not shown yet, thus prepare images for the screen, and then quickly shove them into vision when ready. The (in browser – IE9) solution presented performs well.

What we do not get from this article are clear figures about standard Silverlight performance and improved performance. So let’s discuss another article.

Procedural animations

The WriteableBitmapEx CodePlex project contains a reference to Advanced Animation: Animating 15,000 Visuals in Silverlight by Eric Klimczak. He tells us that if we want to animate ~50 objects concurrently, we need additional performance measures over Storyboards and Timelines. The main performance measures he employs extend the ones mentioned above with: “procedural animations”.

In Procedural Animation within the context of Silverlight you code an Update() and Draw() Loop that is driven by the Windows.Media.CompositionTarget.Rendering event. Essentially, you now code the new position, color, or any other attribute, in the Update() method, and Blit it to the render target in the Draw() method – thus putting it on screen.

This works very well! Eric Klimczak has provided source code with his article, among which a program that animates moving particles that respond quickly to mouse actions (in browser).

For 3000 particles the program renders at ~200 frames per second (FPS), tops, and 15.000(!) particles are rendered at a still pleasant 36 – 46 FPS. I’ve used the Silverlight FPS counter for all Silverlight programs in this article in order to get comparable measurements. See the fps counter in the status bar of the IE screenshot below.

Curiously, there is no maxFrameRate setting in his code. About this maxFrameRate setting the Silverlight documentation writes:A value of approximately 60 frames-per-second should be reliable on all platforms. The 1000 and 30000 frames-per-second range is where the maximum frame rate could differ between platforms”. So, the obvious step is to set the maxFrameRate to 1.000 – both in code as well as in the html host, which showed a factor 3 performance increase compared to the original article software, for the 3K particle case (screenshot above). The Silverlight Documentation also states that the enableGPUAcceleration setting doesn’t work for the WriteableBitmap, so I skipped that one.

It seems to me that this approach solves most problems. However, procedural animation – a gaming software approach – opens the door to other, even more apt approaches.

Note that this approach does not employ the GPU. All rendering is done using the CPU.

Pixel shaders

An approach that takes performance a step further is Silverlight 3 WriteableBitmap Performance Follow-Up by René Schulte. In this article a number of approaches are compared. All approaches yield comparable results, except the pixel shader approach, which yields a factor ~20 better performance compared to the WriteableBitmap (WOW!).

How does it work? UIElement descendants have an Effect property. You can create custom Effects using a pixel shader written in HLSL (a .fx file) which you compile using e.g. fxc.exe – the DirectX HLSL compiler, or Shazzam. The compiled shader effect is loaded as a resource by a descendant of the ShaderEffect class. The article by René Schulte uses a custom derived class thereby showing how to transfer data into the shader during program execution. The loaded shader should be attached to the UIElement’s Effect Property. The shader will be executed for each pixel to be rendered. This gives you great control over the UIElement. You can modify many attributes of each pixel, for instance color and opacity. Do not forget that dropshadows are implemented as shaders, so you can also duplicate the UIElement visual.

According to René Schulte, the program / pixel shader is not executed on the GPU. That may have been true for Silverlight 3, but in Silverlight 4 it is absolutely possible to put the GPU to work. So, with a bit of tweaking the code here and there we find a maximum performance of >450 FPS.

I’ve registered the GPU invocation for specific tasks using the Catalyst utility of my graphics card, see the fields ‘GPU Clock’ and ‘Memory Clock’ at the bottom of the screen shot below. Regular values are 157 and 300 respectively.

Silverlight 5 Beta and XNA

Recently (April 13th 2011), Silverlight 5 Beta was released. It includes the DrawingSurface control which is a gateway into XNA functionality. A little experimenting reveals that like XNA the default drawing frequency is at 60 FPS, and you can’t seem to get it up by recurring calls to the OnDraw() event handler.

In Silverlight the frequency is raised as described above. In XNA the default of 60 FPS can be lifted by setting both the Game’s object ‘IsFixedTimeStep’ property and the GraphicsDeviceManager’s ‘SynchronizeWithVerticalRetrace’ property to false.

From the MIX demo video it is clear that performance is very good, however, at 60 FPS within Silverlight.  The performce step is ‘made’ by the shaders and realized on the GPU. It is currently not clear to me how to measure that performance, so this exercise ends here for now.

Non Silverlight performance

What performance can we expect? Is Silverlight slow, despite the extra tricks? What is the promise hidden in Silverlight 5? We now know that for demanding graphics we can turn to the integration of Silverlight with XNA. XNA, in turn is built upon DirectX.

Below you’ll find a screenshot of a DirectX11 particle demo. For 16K particles (reminiscent of the 15K in the above particle demo) we see a performance of ~620 FPS (not measured with the same frame counter as with the other programs, however), immediately requiring maximum performance from the GPU. For 8K particles performance rises to ~1175 FPS.

One conclusion I would like to draw here is that this performance correlates to the performance of the pixel shader used as a custom Effect. So, we may conclude that the real performance enhancement lies with the use of shaders.

Will this performance hold up in XNA? Yes, a particle simulation in XNA (from the XNA community, with small adaptation) brings us a ~1000 FPS performance, see screenshot.

Conclusion

The above is an exploration of techniques and approaches to realize massive animation performance in Silverlight. It is not a methodological, comparative study. A more rigorous investigation into performance (of what exactly?) might be subject for a later article that builds on the findings presented here.

Here we have learnt that in order to have massive animation in Silverlight we use the WriteableBitmap, the Blit operation from the WriteableBitmapEx Codeplex project, Procedural Animation programming, and pixel shaders (do not forget the enableGPUAcceleration setting, when applicable). We have seen that the exposure of XNA, built on DirectX, in Silverlight will most likely bring us further performance improvements.

Today we can have a very powerful massive animation performance of around 400-500 FPS, and the future is bright.