Posts Tagged ‘WinRT’

Data Compression for the Kinect

Transmitting uncompressed Kinect depth and color data requires a network bandwidth of about 460Mbit/s. Using the RleCodec or the LZ4 library we achieve tremendous compression – a compression ratio of 10 or 22 respectively, at lightning speed – over 1600Mbytes/s. We achieve this not so much by the compression algorithms, but by removing undesirable effects (jitter, by the DiscreteMedianFilter) and redundancy (already sent data, by taking the Delta).


From the start, one goal of the Kinect Client Server (KCS) project was to provide a version of the KCS viewer, called 3D-TV, from the Windows Store. Because of certification requirement 3.1 (V5.0)

“Windows Store apps must not communicate with local desktop applications or services via local mechanisms,..”

3D-TV has to connect to a KinectColorDepth server application on another PC. In practice, the network bandwidth that is required to transfer uncompressed Kinect depth and color data over Ethernet LAN using TCP is about 460Mbit/s, see e.g. the blog post on the jitter filter. This is a lot, and we would like to reduce it using data compression.

This is the final post in a series of three on the Kinect Client Server system, an Open Source project at CodePlex, where the source code of the software discussed here can be obtained.

What Do We Need?

Let’s first clarify the units of measurements we use.

Mega and Giga

There is some confusion regarding the terms Mega and Giga, as in Megabit.

  • For files 1 Megabit = 2^20 bits, and 1 Gigabit = 2^30 bits.
  • For network bandwidth, 1 megabit = 10^6 bits, and 1 gigabit = 10^9 bits.

Here we will use the units for network bandwidth. The Kinect produces a color frame of 4 bytes and a depth frame of 2 bytes at 30 frames per second. This amounts to:

30 FPS x 640×48 resolution x (2 depth bytes + 4 color bytes) x 8 bits = 442.4 Mbit/s = 55.3 Mbyte/s.

We have seen before that the actual network bandwidth is about 460Mbit/s, so network transport creates about 18Mbit/s overhead, or about 4%.

Since we are dealing with data here that is streamed at 30 FPS, time performance is more important than compression rate. It turns out that a compression rate of at least 2 (compressed file size is at most 50% of uncompressed size), in about 5ms satisfies all requirements.

Data Compression: Background, Existing Solutions


If you are looking for an introduction to data compression, you might want to take a look at Rui del-Negro’s excellent 3 part introduction to data compression. In short: there are lossless compression techniques and lossy compression techniques. The lossy ones achieve better compression, but at the expense of some loss of the original data. This loss can be a nuisance, or irrelevant, e.g. because it defines information that cannot be detected by our senses. Both types of compression are applied, often in combination, to images, video and sound.

The simplest compression technique is Run Length Encoding, a lossless compression technique. It simply replaces a sequence of identical tokens by one occurrence of the token and the count of occurrences. A very popular somewhat more complex family of compression techniques is the LZ (Lempel-Ziv) family (e.g. LZ, LZ77, LZ78, LZW) which is a dictionary based, lossless compression. For video, the MPEG family of codecs is a well known solution.

Existing Solutions

There are many, many data compression libraries, see e.g. Stephan Busch’s Squeeze Chart for an overview and bench marks. Before I decided to roll my own implementation of a compression algorithm, I checked out two other solutions: The Windows Media Codecs In Window Media Foundation. But consider the following fragment from the Windows Media Codecs documentation:

It seems as if the codecs are way too slow: max 8Mbit/s where we need 442Mbit/s. The WMF codecs obviously serve a different purpose.

The compression used for higher compression levels seems primarily of the lossy type. Since we have only 640×480 pixels I’m not sure whether it is a good idea to go ‘lossy’. It also seems that not all versions of Windows 8 support the WMF.

LZ4, an open source compression library. This is a very fast compression library, see image below, from the website:

So, LZ4 can compress a file to 50% at over 400MByte/s. That is absolutely great! I’ve downloaded the source and tried it on Kinect color and depth data. Results are shown and compared in the Performance section.

The RleCodec

I decided to write my own data compression codec, and chose the Run Length Encoding algorithm as a starting point. Why?

Well, I expected a custom algorithm, tailored to the situation at hand would outperform the general purpose LZ4 library. And the assumption turned out to be correct. A prototype implementation of the RleCodec supported by both the DiscreteMedianFilter and creating a Delta before compressing data really outperformed the LZ4 reference implementation, as can be read from the performance data in the Performance section.

It only dawned on me much later that removing undesired effects (like jitter, by the DiscreteMedianFilter) and redundant information (already sent data, by taking the Delta) before compressing and transmitting data is not an improvement of just the RLE algorithm, but should be applied before any compression and transmission takes place. So, I adjusted my approach and in the performance comparison below, we compare the core RLE and LZ4 algorithms, and see that LZ4 is indeed the better algorithm.

However, I expect that in time the data compression codec will be implemented as a GPU program, because there might be other image pre-processing steps that will also have to be executed at the GPU on the server machine; to copy the data to the GPU just for compression requires too much overhead. It seems to me that for GPU implementations the RLE algorithm will turn out the better choice. The LZ4 algorithm is a dictionary based algorithm, and creating and consulting a dictionary requires intensive interaction with global memory on the GPU, which is relatively expensive. An algorithm that can do its computations purely locally has an advantage.


Lossless or Lossy

Shall we call the RleCodec a lossy or lossless compression codec? Of course, RLE is lossless, but when compressing the data, the KinectColorDepth server also applies the DiscreteMedianFilter and takes the Delta with the previous frame. Both reduce the information contained in the data. Since these steps are responsible for enormous reduction of compressed size, I am inclined to consider the resulting library a lossy compression library, noting that it only looses information we would like to lose, i.e. the jitter and data we already sent over the wire.



In compressing, transmitting, and decompressing data the KinectColorDepth server application takes the following steps:

  1. Apply the DiscreteMedianFilter.
  2. Take the Delta of the current input with the previous input.
  3. Compress the data.
  4. Transmit the data over Ethernet using TCP.
  5. Decompress the data at the client side.
  6. Update the previous frame with the Delta.

Since the first frame has no predecessor, it is a Delta itself and send over the network as a whole.


The RleCodec was implemented in C++ as a template class. Like with the DiscreteMedianFilter, traits classes have been defined to inject the properties that are specific to color and depth data at compile time.

The interface consists of:

  • A declaration that take the value type as the template argument.
  • A constructor that takes the number of elements (not the number of bytes) as an argument.
  • The size method that returns the byte size of the compressed data.
  • The data method that returns a shared_ptr to the compressed data.
  • The encode method that takes a vector of the data to compress, and stores the result in a private array.
  • The decode method that takes a vector, of sufficient size, to write the decompressed data into.

Like the DiscreteMedianFilter, the RleCodec uses channels and an offset to control the level of parallelism and to skip channels that do not contain information (specifically, the A (alpha or opacity) channel of the color data). Parallelism is implemented using concurrency::parallel_for from the PPL library.

Meta Programming

The RleCodec contains some meta programming in the form of template classes that roll out loops over the channels and offset during compilation. The idea is that removing logic that loops over channels and checks if a specific channel has to be processed or skipped will provide a performance gain. However, it turned out that this gain is only marginal (and a really thin margin 🙂 ). It seems as if the compiler obviates meta programming, except perhaps for very complicated cases.


How does our custom RLE codec perform in test environment and in the practice of transmitting Kinect data over a network? How does its performance compare to that of LZ4?. Let’s find out.

In Vitro Performance

In vitro performance tests evaluate in controlled and comparable circumstances the speed at which the algorithms compress and decompress data.


In order to get comparable timings for the two codecs, we measure the performance within the context of a small test program I wrote, the same program is used for both codecs. This makes the results comparable and sheds light on the usefulness of the codec in the context of transmitting Kinect data over a network.


In comparing the RleCodec and LZ4, both algorithms take advantage of working the delta of the input with the previous input. We use 3 subsequent depth frames and 3 subsequent color frames for a depth test and a color test. In a test the frames are processed following the sequence below:

  1. Compute delta of current frame from previous frame.
  2. Compress the delta.
  3. Measure the time it takes to compress the delta
  4. Decompress the delta
  5. Measure the time it takes to decompress the delta
  6. Update previous frame with the delta
  7. Check correctness, i.e. the equality of the current input with the updated previous input.

We run the sequence 1000 times and average the results. Averaging is useful for processing times, the compression factor will be the same in each run. The frames we used are not filtered (DiscreteMedianFilter).

Let’s first establish that the performance of these libraries are of very different order than the performance of the WMF codecs. Let’s also establish that compression speed and decompression speed is much more than sufficient: as noted above 50Mbyte/s would do.

For depth data, we see that the RleCodec delivers fastest compression. LZ4 delivers faster decompression, and obtains stronger compression.

The RleCodec was tested twice with the color data. In the first test we interpreted the color data as of type unsigned char. We used 3 channels with offset 4 to compress it. In the second test we interpreted the color data as unsigned int. We used 4 channels, with offset 4. We see that now the RleCodec runs about 4 times as fast as with char data. The compression strength is the same, however. So, for color data, the same picture arises as with depth data: times are of the same order, but LZ4 creates stronger compression.

The difference in naked compression ratios has limited relevance, however. We will see in the section on In Vivo testing that the effects of working with a Delta, an in particular of the DiscreteMedianFilter dwarfs these differences.

We note that the first depth frame yields lesser results both for time performance and compression ratio. The lesser (de)compression speed is due to the initialization of the PPL concurrency library. The lesser compression ratio is illustrative of the effect of processing a Delta: the first frame has no predecessor, hence there is no Delta and the full frame is compressed. The second frame does have a Delta, and the compression ratio improves by a factor of 2.4 – 2.5.

In Vivo Performance

In Vivo tests measure, in fact, the effect of the DiscreteMedianFilter on the data compression. In Vivo performance testing measures the required network bandwidth in various configurations:

  1. No compression, no filter (DiscreteMedianFilter).
  2. With compression, no filter.
  3. With compression, with filter, with noise.
  4. With compression, with filter, without noise.

We measure the use of the RleCodec and the LZ4 libraries. Measurements are made for a static scene. Measuring means here to read values from the Task Manager’s Performance tab (update speed = low).

Using a static scene is looking for rock bottom results. Since activity in the scene- people moving around can be expressed as a noise level, resulting compression will always be somewhere between the noiseless level and the “no compression, no filter” level. Increase will be about linear, given the definition of noise as a percentage of changed pixels.

The measurements in the table below are in Megabits per second, whereas the table above shows measurements in Megabytes per second. So, in order to compare the numbers, if so required, the entries in the table below have to be divided by 8. Note that 460Mbit/s is 57.5Mbyte/s.

What we see is that:

  • Compression of the delta reduces network bandwidth width 13* (RLE), or 33% (LZ4).
  • Application of the filter reduces it further with 53% (RLE), or 39% (LZ4).
  • Cancelling noise reduces it further with 24% (RLE), or 23% (LZ4).
  • We end up with a compression factor of about 10 (RLE), or 22 (LZ4).


What do we transmit at the no noise level? Just the jitter beyond the breadth of the DiscreteMedianFilter, a lot of zeroes, and some networking overhead.

As noted above, the differences between the core RleCodec and LZ4 are insignificant compared to the effects of the DiscreteMedianFilter and taking the Delta.


Using the RleCodec or the LZ4 library we achieve tremendous compression, a compression ratio of 10 or 22 respectively , at lightning speed – over 1600Mbytes/s. We achieve this not so much by the compression algorithms, but by removing undesirable effects (jitter, by the DiscreteMedianFilter) and redundancy (already sent data, by taking the Delta).


There is always more to wish for, so what more do we want?

Navigation needs to be improved. At the moment it is somewhat jerky because of the reduction in depth information. If we periodically, say once a second, send a full (delta yes, filter no) frame, in order to have all information at the client’s end, we might remedy this effect.

A GPU implementation. Compute shader or C++ AMP based. But only in combination with other processing steps that really need the GPU.

Improve on RLE. RLE compresses only sequences of the same token. What would it take to store each literal only once, and insert a reference to the first occurrence at reencountering it? Or would that be reinventing LZ?


A Jitter Filter for the Kinect

This blog post introduces a filter for the jitter caused by the Kinect depth sensor. The filter works essentially by applying a dynamic threshold. Experience shows that a threshold works much better than averaging, which has the disadvantage of negatively influencing motion detection, and has only moderate results. The presented DiscreteMedianFilter removes the jitter. A problem that remains to be solved is the manifestation of depth shadows. Performance of the filter is fine. Performance is great in the absence of depth shadow countermeasures.


Kinect depth images show considerable jitter, see e.g. the depth samples from the SDK. Jitter degrades image quality. But it also makes compression(Run Length Encoding) harder; compression for the Kinect Server System will be discussed in a separate blog post. For these reasons we want to reduce the jitter, if not eliminate it.

Kinect Depth Data

What are the characteristics of Kinect depth data?

Literature on Statistical Analysis of the Depth Sensor

Internet search delivers a number of papers reporting on thorough analysis of the depth sensor. In particular:

[1] A very extensive and accessible technical report by M.R. Andersen, T. Jensen, P. Lisouski, A.K. Mortensen, M.K. Hansen, T. Gregersen and P. Ahrendt: Kinect Depth Sensor Evaluation for Computer Vision Applications.

[2] An also well readable article by Kourosh Khoshelham and Sander Oude Elberink.

[3] A more technically oriented article by Jae-Han Park, Yong-Deuk Shin, Ji-Hun Bae and Moon-Hong Baeg.

[4] Of course, there is always the Wikipedia

These articles discuss the Kinect 360. I’ve not found any evidence that these results do not carry over to the Kinect for Windows, within the range ([0.8m – 4m]) of the Default mode.

Depth Data

We are interested in the depth properties of the 640×480 spatial image that the Kinect produces at 30 FPS in the Default range. From the he SDK documentation we know that the Kinect provides depth measurements in millimeters. A dept value measures the distance between a coordinate in the spatial image and the corresponding coordinate in the parallel plane at the depth sensor, see image below from the Kinect SDK Documentation.

Some characteristics:

1. Spatial resolution: at 0.8m the 640×480 (width x height) depth coordinates cover an approximately 87cmx63cm plane. The resolution is inversely proportional with the squared distance from the depth sensor. The sensor has an angular field of view of 57° horizontally and 43° vertically.

2. Depth resolution: the real world distance between 2 subsequent depth values the Kinect can deliver is about 2mm at 1m from the Kinect, about 2.5cm at 3m, and about 7cm at 5m. Resolution decreases quadratically as a function of the distance.


The Kinect depth measurements are characterized by some uncertainty that is expressible as a random error. One can distinguish between errors in the x,y-plane on the one hand, and on the z-axis (depth values) on the other hand. It is the latter that is referenced to as the depth jitter. The random error in the x,y-plane is much smaller than the depth jitter. I suspect it manifests itself as the color jitter in the KinectColorDepthServer through the mapping of color onto depth, but that still has to be sorted out. Nevertheless, the filter described here is also applied to the color data, after mapping onto depth.

The depth jitter has the following characteristics:

1. The error in depth measurements: The jitter is about a few millimeters at 0.5m, up to 4cm at 5m, increasing quadratically over the distance from the camera.

2. The jitter is a walk over a small number of nearby discrete values. We have already seen this in a previous post in The Byte Kitchen Blog, where we found a variance over mainly 3 different values, and incidentally 4 different values. Of course, for very long measuring times, we may expect an increased variance.

3. The jitter is much larger at the boundaries of depth shadows. Visually, this is a pretty disturbing effect, but the explanation is simple. The Kinect emits an infra red beam for depth measurements which, of course, creates a shadow. The jitter on the edges of a depth shadow jumps from an object to its shadow on the background which is usually much further away. We cannot remove this jitter without removing the difference between an object and its background, so for now, I’ve left it as is.

The miniature below is a link to a graph in [2] (page 1450, or 14 of 18) of the depth resolution (blue) and size of the theoretical depth measurement error (red).


A Kinect Produces a Limited Set of Discrete Depth Values

It is not the goal of the current project to correct the Kinect depth data, we just want to send it over an Ethernet network. What helps a lot is, and you could see this one coming:

The Kinect produces a limited set of depth values.

The Kinect for Windows produces 345 different depth values in the Default range, not counting the special values for unknown, and out of range measurements. The depth values for my Kinect for Windows are (divide by 8 to get the depth distance in mm):

This is the number of values for the Kinect for Windows. I also checked the number of unique values for my Kinect 360, and it has a larger but also stable set of unique depth values. The number of depth values is 781, due to the larger range.

The fact that a Kinect produces a limited and small set of depth values makes live a lot easier: we can use a lookup table were we would otherwise have to use function approximation. The availability of a lookup table is also good news for the time performance of the filter.

A question might be: can you use the depth table of an arbitrary Kinect to work with any other Kinect? I assume that each Kinect has a slightly different table, and this assumption is based on the fact that my Kinect for Windows has slightly different values than my Kinect 360, for the same sub range. However, if you use an upper bound search (this filter uses std::upper_bound), you will find the first value equal to or larger than a value from an arbitrary Kinect, which will usually be a working approximation (better than having the jitter). Of course, an adaptive table would be better, and it is on the ToDo list.


I’ve experimented with several approaches: sliding window of temporal averages, Bilateral Filter. But these were unsatisfactory:

– Reduction of Jitter is much less good compared to applying a threshold.

– Movement detection is as much reduced as the jitter, which is an undesirable effect.

A simple threshold, of about the size of the breadth of the error function proved the best solution. As noted above, the jitter typically is limited to a few values above and below the ‘real’ value. We could call the ‘real’ value the median of the jitter range, and describe this jitter range not in terms of the depth values themselves but of the enumeration of the sorted list of discrete depth values (see table). We get a Discrete Median Filter if we map all discrete values within the range onto the median of that range (minding the asymmetry of the sub ranges at the boundaries of our sorted list).

The DiscreteMedianFilter Removes Jitter

In practice we see no jitter anymore when the filter is applied: The DiscreteMedianFilter ends the jitter (period). However, the filter is not applicable to (edges of) depth shadows.


Actually, it turned out that this filter is in fact too good. If the Kinect registers a moving object, we get a moving depth shadow. The filter cannot deal with illegal depth values, so we are stuck with a depth shadow smear.

A modest level of noise solves this problem. In each frame 10% of the pixels the filter skips is selected at random, and updated. This works fine, but it should be regarded as a temporal solution: the real problem is, of course, the depth shadow, and that should be taken up.


The Discrete Median Filter was implemented in C++, as a template class, with a traits template class (struct, actually); one specialization for the depth value type and one specialization for the color value type, to set the parameters that are typical for each data type, and a policy template which holds the variant of the algorithm that is typical for the depth and color data respectively. For evaluation purposes, I also implemented traits and policy classes for unsigned int.

So, the DiscreteMedianFilter class is a template that takes a value type argument. Its interface consists of a parameter free constructor and the Filter method that takes a pointer to an input array and a pointer to a state. The method changes the state where the input deviates from the state more than a specified radius.

Channels and Offset

Color data is made up of RGBA data channels: e.g. R is a channel. Working with channels is inspired on data compression. More on this subject in the blog post on data compression.

The advantages of working with channels for the DiscreteMedianFilter are:

1. We can skip the A channel since it is all zeroes

2. Per channel, changes of color due to sensor errors are usually expressible as small changes, which is not the case for the color as a whole, so we get better filtering results.

3. There are less values: 3 x 256 vs 2^32, so the probability of equal values is higher.

Note that the more values are found to be within the range of values that do not need change, the better time performance we get.

The individual channels are not processed in parallel (unlike in the compression library). We will show below that parallelism should not be required (but actually is, as long as we have noise).


The code is complex at points, so it seems to me that printing the code here would raise more questions than it would answer. Interested people may download the code from The Byte Kitchen Open Sources at Codeplex. If you have a question about the code, please post a comment.


How much space and time do we need for filtering?

A small test program was built to run the filter on a number of generated arrays simulating subsequent depth and color frames. The program size never gets over 25.4 megabytes. The processing speed (without noise) is:

– About 0.77ms for an array of 1228800 bytes (640×480=307200 ARGB color values); 1530Mbyte/s

– About 0.36ms for an array of 640×480=307200 unsigned short depth values; 1615Mbyte/s.

So, this is all very fast, on a relatively small footprint: in a little more than 1ms we have filtered both the depth and the color data of a Kinect frame.

The simple test program and its synthetic data are not suitable for use with noise. So, we measured the time the KinectColorDepthServer needs for calls to the DiscreteMedianFilter. In this case the noise is set at 10% of the values that would otherwise be skipped. The times below are averaged over 25,000 calls:

– Depth: 6.7ms per call.

– Color: 19.3ms per call.

So, we may conclude that the noise is really a performance killer. Another reason to tackle the depth shadow issue. Nevertheless, we are still within the 33ms time window that is available for each frame at 30 FPS.

Does the DiscreteMedianFilter has any effect on the required network capacity? No, in both cases a capacity of 460Mbit/s is required for a completely static scene, compression is off. I do have the impression that the filter has a smoothing effect on instantaneous (as opposed to average) required network capacity.

To do

There is always more to wish for, so what more do we want?

– Resolve the need for the noise factor, i.e. do something about the depth shadows. This will increase performance greatly.

– Because of filtering, navigation through the scene is jerky. There are much less depth values in the image, so movement is not so smooth. This is to be helped by sending a full depth image every now and then. After all, the filter only replaces values that differ more than a threshold, the rest of the image is retained. Refreshing the overall picture now and then retains the richness of the depth image, helps to make noise superfluous, and smoothes navigation.

– Make the table of depth values adaptive. If a value is not present we replace the nearest value. Of course, we would then also like to save the new table to file, and load it at any subsequent program starts.

Kinect Client Server System V0.2

The Kinect Client Server System V0.2 adds the possibility to V0.1 to watch Kinect Color and Depth data over a network, and to navigate the rendered 3D scene.

To support data transfer over TCP, the Kinect Client Server System (KCS system) contains a custom build implementation of Run Length Encoding compression.

To both maximize compression and improve image quality the KCS system uses a jitter filter


Version 0.1 of the KCS system allowed the display of Kinect data in a Windows Store app. This is a restricted scenario: for security reasons, a Windows Store app cannot make a network connection to the same PC it is running on, unless in software development scenarios. Version 0.2 overcomes this restriction by:

1. Support for viewing Kinect data from another PC.

2. Providing the 3D-TV viewer from the Windows Store (free of charge).

Of course, V0.2 is an open Source project, the code and binaries can be downloaded from The Byte Kitchen’s open Source project at CodePlex.


The easiest way to start using the KCS system v0.2 is to download 3D-TV from the Windows Store, navigate to the Help-About screen (via the ‘Settings’ popup), click on the link to the online manual and follow the stepwise instructions.

The general usage schema is depicted below.

The PC called Kinect server runs the KinectColorDepthServer application. In order for the client PC running 3D-TV to find the server PC on the network, the server PC must be connected to gigabit Ethernet (cabled computer network, say) with a network adapter carrying the IP address The IP address of the Ethernet adaptor of the PC running 3D-TV must satisfy, where 0 <= x <=255. It is wise to expect that required data capacity is well over 100Megabit/second.

The online manual also provides a complete description of the keyboard keys to be used for navigating around a Kinect scene.

More information

In a few more blog posts, I will discuss more technical details of the new features in the KCS system V0.2, specifically the jitter filter that is to stabilize the Kinect imagery and help in data compression, and the data compression itself.

Viewing Kinect Data in the New Windows 8 UI


The Kinect SDK is not compatible with WinRT in the sense that software developed using the SDK cannot have a WinRT (Windows RunTime) UI. The reason is that the Kinect SDK is .Net software and you cannot run (full) managed code on the WinRT.

Nevertheless, I want to create software that can show Kinect data in a WinRT UI. For multiple reasons, one being that software written for the WinRT can run on a PC, a tablet, very large screens, now called a Surface, and a Windows Phone. A survey of other solutions, see below, reveals that solutions to this problem are based on networking. Networking allows us to deliver Kinect data anywhere. This then is another reason to work on separating the source of Kinect data from its presentation.

The Solution

The general solution is to make a client-server system. The server lives in the classic Windows environment, the client is a WinRT app. Communication between client and server is realized using networking technology; preferably the fastest available. The server receives the data from the Kinect and does any processing that involves the Kinect SDK. The client prepares the data for presentation on the screen. If multiple servers are involved, it integrates and time-synchronizes data from several servers. Since I’m a C++, DirectX guy, the server and client are built on just these platforms

Other Solutions

Several other solution already exist. Without pretending to be exhaustive, and in any order:

– The KinectMetro App by the WiseTeam

– ‘Using Kinect in a Windows 8 / Metro App’ by InterKnowlogy

– The Kinect Service from Coding4Fun

The KinectMetro App by the WiseTeam

The application by the WiseTeam is described in this blog post. The software is available at Codeplex. The software was written for the Windows 8 Consumer Preview as part of a MS Imagine Cup participation. I’ve downloaded the software, but couldn’t get it to run on the Windows 8 RTM version. The application is based on event aggregation, as found in PRISM, and on WebSockets.

‘Using Kinect in a Windows 8 / Metro App’ by InterKnowlogy

The approach InterKnowlogy took is blogged here. This is the entry point to several blog posts, some videos (Vimeo and Youtube), and a little bit of code. This solution is also written in C# .Net, using WebSockets.

The Kinect Service by Coding4Fun

This software is available from Codeplex. It is not aimed at the WinRT, it aims at distributing Kinect data to a wider spectrum of clients. Hence it can also be used as a base to target the WinRT. Apart from the server, it consists of a WPF client and a phone client. This looks like a high standards, well written solution. Neat! Data transport uses WinSockets (not WebSockets). The code is available in both C# and VB.


In theory, WebSockets are slower than WinSockets. There can be much discussion about what would be the fastest solution under which circumstances. I expect WinSockets to be the fastest solution, therefore I prefer WinSockets.

Also, in theory, a C++ program is faster, and smaller, than an equivalent C# program. There can be much discussion … , therefore I prefer a program written in C++.

Of course, we should do asynchronously, or in parallel, whatever can be done quicker in parallel.


So, what’s a smart way to develop a client server system to show Kinect color and depth data in a WinRT app? For one, we start from SDK samples:

– A sample from the Kinect SDK that elaborates processing depth and color data together.

– A sample the shows how to use WinSockets (in C++).

– A sample that show how to use the WinRT StreamSocket using PPL tasks (yes we will exploit parallelism extensively 🙂 .

– Windows Service (optional, see below).

The use of a Windows service is an option for later use. To work with a service instead of a simple console application requires that the server is capable of handling all kinds of exceptional situations, if only by resetting itself. Consider e.g. the case that no Kinect is connected, or if the Kinect is malfunctioning? Etc.? And apart from that, I expect the Kinect SDK to be made available for WinRT applications in due time.


Server side architecture

The general software architecture looks like this:

The test application instantiates the KinectColorDepthServer DLL. The idea is that in case the DLL is run by a service, the DLL can be loaded and dropped easily / frequently so as to prevent problems that relate to long running processes. So every time the client closes the WinSock connection, the application (or service), drops the DLL and creates a new instance.

The KinectColorDepthServer has a simple interface; you can Run it, Stop it and Destroy it. The interface has this neutral character so we can use the same interface for other data sources, like a stereoscopic camera. The server instantiates a Kinect DataSource on a separate thread, and waits until the connection is closed. The Server also creates two WinSock servers and hands references of these servers to the Kinect DataSource. The WinSock servers are created at a relatively high level, so we can configure them at a high level in the call chain. Lifecycle management of the WinSock servers is in parallel.

The Kinect DataSource contains those parts of the Kinect sample that contain, or refer to Kinect SDK code (which cannot be run in the WinRT client). The Kinect DataSource sends pairs of a depth image and a color image in parallel to the client. The main method in the Kinect DataSource deals with mapping the color data to the depth data.

The WinSock server is just the basic WinSock server sample from the Windows SDK documentation.

Client Side Architecture

The general software architecture looks like this:

The WinRT UI application class manages the lifecycle of the application. The MainPage manages the state of the user interface.

The MainPage references the Scene1 class that inherits from the Scene class in My DirectX Framework. This framework organizes standard WinRT DirectX11.1 code in a structure that is similar to the XNA architecture. This latter architecture support easy creation and management of graphical components very well. So, it keeps my code clean and well organized under a growing number of components. I like that, because I like to have oversight.

The Scene1 class refers to the KinectColorDepthclient, which provides the data, and the KinectImage class which contains the DirectX code (a WinRT port) from the Kinect SDK sample, which it uses to display the Kinect data on the screen, using a SwapChainBackgroundPanel. The Scene1 class also references a Camera class (not shown in the diagram) that allows the user to navigate through the 3D scene.

The KinectColorDepthClient creates two StreamSockets, one for depth data, and one for color data. Reception of depth and color images is parallel, then synchronized so as to keep matching color and depth images together. The resulting data is handed over to the KinectImage.

One goal of this architecture is that the KinectColorDepthClient class can be easily replaced by another class, e.g. when Microsoft decides to release a version of the Kinect SDK that is compatible with WinRT. For this reason it has a limited and general interface.

Parallelism is coded making extensive use of PPL task parallelism. PPL Tasks is really a pleasure to use in code.

WinSock2 sockets cannot be used in the WinRT, as it turns out. The alternative at hand is the StreamSocket. However, the StreamSocket still contains a bug. Closing a StreamSocket is done in C++ by calling delete on a StreamSocket object. This however, raises an unhandled exception (that I did not succeed in catching, by the way). It does not only do this in my code, but also in the StreamSocket sample that can be downloaded from MSDN (12 October 2012). A bug report has been filed.


So, now that we have this nice software, just what is the performance, that is, how quick is it, and how large are the programs involved?

Dry testing the transmission speed

To gain an idea of the speed with which data can be transported from one process to another, I sent a 1Mbyte blob from a Winsock2 server to a Winsock2 client 10.000 times, and averaged the transmission time.

Clocking was done using the ‘QueryPerformanceCounter’ function, which is quite high res. The performance counter was queried just before the start of transmission at the server, and just after arrival of the last blob at the client. The difference between the tick counts is then divided by ‘QueryPerformanceFrequency’, which give you the result in seconds. So multiply by 1000 (ms) and divide by the number of cycles (10.000). This shows that transmission of 1Mbyte takes about 1.5 ms (release build).

Now, we are planning to send 640 x 480 pixels (4 bytes each), and an equal nr. of depth values (2 bytes each) over the line. This will take us about 1.5 * (1.843.200 / 1.048.576) = 2.6 ms (wow!). The conclusion is that there will be no noticeable latency.

Visual Studio Performance Analysis

This tool is about finding bottlenecks in your code, so you may remove them. In an analysis run of the server, 5595 samples were taken. The CPU was found executing code I wrote / copied myself in 21.4% of the samples, all in one method. It is possible to examine which lines of code take the most time in that method. I measured an average processing time of these lines of code, and they typically take 1.7 ms (release build, debugger attached) to execute. Well, what can I say? Although I suspect the 21.4% could be improved, we will just leave it as it is.

In a second analysis run, the client application was scrutinized. In this run 2357 samples were taken – I guess it turned out harder to take samples. As little as 2.64% of the samples were in ‘my code’ (that is: 58 samples). Another 8.10% is taken up by DirectX – running shaders for my program, I think. So, in all about 11%. Since the rest of the samples hit code that I cannot touch the source code of, and that we may assume is already well optimized, this is a very fine result.


And how about the size, the footprint? The release build shows a client that has a working set of around 40 Mbyte, and a server with a working set of about 95 Mbyte. Together about 135 Mbyte. Well, that’s not small, but what should we compare it to? The Kinect Service by Coding4Fun, of course!

I downloaded and ran the WPF sample (pre-built). It turns out that the server usually stays under 130Mb, and the client will stay under 67 Mb. Together slightly less than 200Mb.

In conclusion: the footprint of the C++ application is smaller. Its size is 2/3 of the .Net application size, but it is not dramatically smaller.

Demo video

Below you’ll find a link (picture) to a video demonstrating the Kinect client-server system. First the server is started in a Windows desktop environment, then the user (me 🙂 ) switches over to the Start window to start up the client. You can see the client connect to the server – watch the log window at the lower left – and then see the Kinect data on the screen. The stream is stopped and then restarted. That is, in fact, all. The video has been made using Microsoft Expression Encoder Screen Capture. The screen capture has been processed with Encoder, with which we also made the snapshot that serves as the hyperlink to the download site (SkyDrive – Cloud!).

The jitter in the picture is caused by the depth stream. The depth stream consists of depth measurements expressed as the distance from the camera along the normal emanating from the camera, in mm. These measurements are subject to a certain error, or uncertainty, which causes fluctuations in measurements, hence the jitter in the stream.

Filtering away the jitter is high on my agenda.

Silverlight and COM

Silverlight supports interoperability with COM, and that is a good thing. It does not support registration-free COM, and that seems to me a missed opportunity. This lack of support for registration-free COM has, however, no impact on the importance of COM.


Many books have been written about COM, and none of them can give a short definition of COM that is also comprehensible to people that not already know what COM is. So, here we go, and take the same route J: COM is a software development discipline to develop reusable components. This reusability is at the binary (not human readable) level. There is a standard way of interacting with COM components. Interaction with COM components is programming language and location independent. COM components may (depending on how they are built) support parallelism.


The relevance of COM is enormous. Microsoft has written a great deal of its software (Windows) as COM components. For instance, DirectX consists of a number of COM components. Windows 8 is also being built upon COM. The Windows 8 WinRT layer is an intermediate layer of components that are built on COM (are COM components). Each WinRT component implements the IInspectable interface, which inherits from IUnknown. However, working with WinRT is unlike working with classical COM. See this article for more. The message is that COM, in some form or another will be with us for quite some time to come. Yes we are talking about the Windows platform here. The market share of Windows among computers (including phones and tablets) that access the internet is at time of writing about 84%. Some see this go down, I think it will go up with the introduction of Windows 8 and its emphasis on tablets and phones, and its uniform user experience over a large range of devices.

The good thing for us is that it is far easier to use COM Components than it is to write COM components. Using WinRT components is even easier. On the other hand, let’s not dramatize writing COM. COM programming concerns only a very thin layer in you code; the interface with the outside world.

Writing COM

COM hinges on a few principles that are not very hard to understand. A good book on COM is Don Box’s. Programming COM is quite a bit tougher. To my opinion this has to do with the large amount of macros (#define, #ifdef), typedef-s and odd naming conventions. I understand the typedef-s and macros’ are there to make things simpler. Point is that there are so many simplifications that they pose a problem; why can’t I just use the C++ I already know?


But then, if you really want to write COM, you will use ATL (Active Template Library). This makes writing COM much easier (‘easier’ as in “I don’t know what I’m doing”) writing ATL is a science in itself. I find it very hard to recognize the COM in ATL. An online introduction on ATL is here. The latest version number of ATL is unknown (to me). Visual Studio 2005 carried ATL 8.0. I’ve found no version numbers specific to VS2008 or VS2010.

Attributed ATL

There is a special flavor of ATL called “Attributed ATL” or ATL and Attributes. The attributes were added to make programming ATL simpler. And I think it does. However, the general public doesn’t seem to have adopted this style of programming due to diminished control over the ATL / COM code – the attribute would typically lead to source code insertion. See also the introduction into ATL referred to above. For an introductory article into Attributes ATL see here. In community forums the dominant emotion is that Attributed ATL is deprecated.

Registration-Free COM

Registration-free COM objects are a special case of Isolated Applications and Side-by-Side Assemblies, which is deployment and servicing technology for both native and .Net applications. Introductions into Registration-Free COM are here, and here. Isolated applications and Side-by-Side assemblies depend on the use of manifests. An assembly is an abstract entity defined by the manifest. Then, of course people choke on manifest creation. A handy tool is Regsvr42.

How does registration-free COM activation work? When a client application loads a dll holding a COM component, the loader checks if the application has a manifest before it checks the registry. If there is an application manifest, the loader searches for (a) library manifest(s). If found the COM component gets loaded and executed. This approach makes registration of COM components unnecessary. Requirements for registration-free COM to work are that the client is an application (something that has a process) and that this application has either an embedded or an external manifest associated with it. An embedded manifest cannot be overridden by an external manifest (except on Windows XP).


Silverlight supports COM interoperability. Wouldn’t it be nice then, if we could also ship COM components with the Silverlight Xap over the internet for execution elsewhere. COM components that we made ourselves, for instance. Of course, executing a COM component that is received over the web needs some user initiated installation in order to maintain the necessary protection against evils from internet. So, we do not expect to be able to execute our COM component within the context of our Silverlight application’s resources.

The current mission is to use the DirectX runtimes that are installed on every Windows 7+ systems from a Silverlight application. In this we can take the NESL approach: to install native software, like COM components, if the user installs the Silverlight application (installation of the Silverlight application is, for all practical purposes, required for use of the GPU). The installed application being the custom DirectX application to be used for ultimate 3D graphics and media effects in Silverlight. We could improve the user experience by skipping the installation of COM components, by using registration-free COM. Silverlight would then not require a search in the registry for a ProgId, the loader could read the manifest and simply load the required dll-s.

However, registration-free COM activation doesn’t work for OOB Silverlight Applications. These applications run in the process of the sllauncher.exe, which has an embedded manifest, as inspection with a simple text editor reveals. So, that ends the story of registration-free COM and Silverlight.

What can we do have?

We can have Silverlight applications, for instance Out-of-Browser applications, that consume COM components. If we bring our own component, we have to install it, registration-free activation will not work. Installation can be done the NESL way.

In some community forums thread someone wrote that only interoperability with out-of-process COM servers is possible. This is not correct. The consumed COM component should have a ProgId in the registry. It is not a requirement that is also is an application (has its own process). I’ve tested this statement with a simple in-process COM server that returns a string appended to its input string parameter: “Marc” -> “Hello Marc”. A 100.000 calls take about 760ms on my machine. By contrast, an out-of-Process server that does the same takes about 19 seconds. Both debug builds, by the way.

Since the Windows 8 Windows Runtime is built on COM, investment in COM expertise also seems to be future proof.