Posts Tagged ‘C++’

Harder to C++: Monads for Mortals [7], Monad Composition

Early authors on monads, e.g. [Moggi], [Wadler], describe several types of monads, notably for side-effects, state, exceptions and IO. Of course we want to write programs that have all these features. There are, in principle, two possible approaches. First: write a separate monad each time we think up a new piece of non-pure-functional … (shall we dare call it functionality?). Second: compose different standard monads into larger wholes? In [part 2] of his excellent tutorial on monads Mike Vanier writes:

Functional programmers talk about “composability” all the time, with the implication that if some aspect of a programming language isn’t composable, it’s probably not worth much.’

So, clearly composition is key and we will see some monads composed into larger wholes here. To be clear: I don’t mean function composition, I mean monad composition. In C++. To be precise, I will define:

  • A monad that executes a function which maintains a state (side-effect).
  • A monad that catches and handles exceptions.
  • A monad that writes the result of function composition to console or an error message in case an exception occurred.

Then I will compose those monads into a larger whole, with preservation of imperative ‘functionality’. We will see that it actually works. Then I will show a very general imperative C++ function, that has the same functionality, and we will compare size and performance of both approaches.

Monad Composition

We will use a single monad template and instantiate it for a State type, an Exception type, and a std::cout wrapper. The monad template looks like this:

//V: wrapped value, S: state of some sort
template<typename V, typename S>
struct monad
{
	V value;
	S state;

	monad(const V& v) : value(v) {}
	monad(V& v) : value(std::move(v)) {}
	monad(const V& v, const S& s) : value(v), state(s) {}
	monad(V& v, S& s) : value(std::move(v)), state(std::move(s)) {}
	monad(V&& v, S&& s) : value(std::move(v)), state(std::move(s)) {}
	template<typename T, typename W>
	monad(const monad<T, W>& m) : value(m.value), state(m.state) {}
	monad(monad&& m) : value(std::move(m.value)), state(std::move(m.state)) {}
	~monad() {}
	monad& operator=(const monad& m)
	{
		if(this != &m)
		{
			value = m.value;
			state = m.state;
		}
		return *this;
	}
	monad& operator=(monad&& m)
	{
		if(this != &m)
		{
			value = m.value;
			m.value = V();

			state = m.state;
			m.state = mystate();
		}
		return *this;
	}
};

The template consists of a wrapped value and a field to hold state of some sort, e.g. actual state, an exception, or an io stream. The rest of the code is just what is generally referred to as ‘copy control’.

The ‘S’ parameter will be instantiated using three classes: State, Exception, and Cout

struct State
{
	int count = 1;
	void update(const int seed) { count += seed; }
};
struct Exception : std::exception
{
	Exception() : message(""), errorCode(0) {}

	Exception(string msg, int errorCode) : message(msg), errorCode(errorCode) { }
	string ErrorMessage() const
	{
		stringstream ost;
		ost << message << " " << errorCode;
		return ost.str();
	}

private:
	string message;
	int errorCode;
};
struct Cout
{
	ostream& os;
	Cout() : os(cout) {}
};

Cout is a std::cout wrapper. We need it because io streams cannot be copied or assigned, but by the design of our monad template (S is not a reference or pointer) we need something that actually can be copied or assigned.

For the monad template we define three overloads of the bind function, one for each instantiation of the template:

// bind function overload: State
template<typename A, typename R>
monad<R, State> operator| (const monad<A, State>& mnd, monad<R, State>(*func)(const A&))
{
	auto tmp = func(mnd.value);
	tmp.state.update(mnd.state.count);
	return tmp;
}

// bind function overload: Exception
template<typename A, typename R>
monad<R, Exception> operator| (const monad<A, Exception>& mnd, monad<R, Exception>(*func)(const A&))
{
	try
	{
		return func(mnd.value);
	}
	catch(Exception& ex)
	{
		auto tmp = monad<R, Exception>(R(mnd.value));
		tmp.state = ex;
		return tmp;
	}
}

// bind function overload: Cout
template<typename A, typename R>
monad<R, Cout> operator|(const monad<A, Cout>& mnd, monad<R, Cout>(*func)(const A&))
{
	auto tmp = func(mnd.value);
	// Need to create a new tmp, because we cannot assign, copy ostream.
	// When initializing, we use a ref to ostream
	auto tmp2 = monad<R, Cout>(tmp.value, mnd.state);
	tmp2.state.os << tmp2 << endl;
	return tmp2;
}

In the last bind function we see the familiar output operator. In order to make it work, we need overloads of this operator for the monad template, and the three auxiliary classes (or ‘structs’ if you like):

// << overload for monad
template<typename V, typename S>
ostream& operator<<(ostream& os, const monad<V, S>& m)
{
	os << "Value: " << m.value << " State: " << m.state;
	return os;
}
// << overload for State
ostream& operator<<(ostream& os, const State& s)
{
	os << s.count;
	return os;
}
// << overload for Exception
ostream& operator<<(ostream& os, const Exception& e)
{
	if(e.ErrorMessage() != " 0")
		os << e.ErrorMessage();

	return os;
}
// << overload for Cout, more or less a dummy (but required)
ostream& operator<<(ostream& os, const Cout&)
{
	os << "Cout";
	return os;
}

The last thing we need is some function that can be applied by the bind functions. These functions cannot be type agnostic: the return type differs from the parameter type. So we have three overloads of them, and they do exactly the same. Let’s first define some handy type aliases:

typedef monad<int, State> is_monad;
typedef monad<is_monad, Exception> ise_monad;
typedef monad<ise_monad, Cout> isec_monad;

These typedefs nest a monad template instantiation, which is a type, within another. Now the functions to be used divide a number by 2:

is_monad divideby2(const int& i)
{
	return is_monad(i / 2);
}

ise_monad divideby2(const is_monad& m)
{
	if(m.value < 2)
		// Somewhat artificial
		throw Exception("Division by zero", -1);

	auto m2 = m | divideby2;
	return ise_monad(m2);
}

isec_monad divideby2(const ise_monad& m)
{
	auto m2 = m | divideby2;
	return isec_monad(m2);
}

As you can see, there is composition, but there is no generality. That is, monad composition is in fact ugly and painful and scaling is out of the question.

We run this, in a by now familiar way:

int _tmain(int argc, _TCHAR* argv[])
{
	{
		int i = 8;
		is_monad m1(i);
		ise_monad m3(m1);
		isec_monad m5(m3);

		auto m2 = m5 | divideby2 | divideby2 | divideby2 |divideby2;
	}

	cin.get();
	return 0;
}

and get the result we desired:

After a code exposé of just 200 lines :-).

A Very Ordinary C++ Function

The same result (well, without of course the funny output that so points out recursion) can be obtained by the following function:

void VeryOrdinary()
{
	int num = 8;
	int cnt = 1;

	try
	{
		for(unsigned int i = 0; i< 4; ++i)
		{
			num /= 2;
			++cnt;

			cout << "Value : " << num << " State: " << cnt << endl;

			if(num < 2)
				// Somewhat artificial
				throw Exception("Division by zero", -1);
		}
	}
	catch(Exception& ex)
	{
		cout << "Value : " << num << " State: " << cnt << " " << 					ex.ErrorMessage() << endl;
	}
}

We run it by simply calling it:

int _tmain(int argc, _TCHAR* argv[])
{
	SetConsoleTitle(L"Harder to C++: Monads [7]");

	VeryOrdinary();

	cin.get();
	return 0;
}

And the result is:

After a code exposé of 34 lines (it pays to remember the approximate factor of 6 for later).

Let’s compare the sizes of both approaches. Well, the monad composition is ridiculously much larger than the ordinary solution.

Performance

Performance measurements were set up in a way comparable to measurements in part 6: the code above is run a large number of iterations, and time is measured. This is done for 10 runs, and the average over the times is calculated. Before starting measurements, the monad composition is allowed a warming up run, to prevent an outlier. The default release build of the resulting program was run from a command console.

Now, time performance results are dominated almost completely by the use of std::cout and throwing exceptions. If both are used, or occur, times are comparable. If however, we do not log to the console, and have no exceptions thrown (we set value to 32), that is, we measure the performance at the naked difference of the constructed code, the regular code takes only 1.6% of the time the monad solution takes, i.e. the regular solution is about 60 times as fast, see the image below for a typical performance run.

in this particular case the ordinary C++ function is 38.8 / 0.64 = 60.6 times as fast. The time it takes the regular function to execute is 100% x 0.64 / 38.8 = 1.6% of the time it takes the monad composition to execute. Please note that the variation in measured times is limited.

Conclusions

Let’s be brief. The code size and time performance of the monad composition are not in sensible proportion to size and performance of the regular function: 6 : 1 and 60 : 1 respectively. The code size of monad composition is ridiculous, monad composition kills C++ performance.

Wrap Up of Sequential Monads

We are a now at the end of 7 installments experimenting with monads, so let’s wrap up what we have so far. The question is: “Can monads be useful in day-to-day C++ programming?”. And the answer is “No”.

We have looked briefly at a C++ template meta programming (which constitutes a simple functional language) approach. The disadvantages are:

  • Complex syntax, hence diminished simplicity and clarity (one day your code will be maintained by another developer!)
  • Although it is possible to use this approach, the language does not support it. So it is unreasonably hard. (free after Stroustrup’s argument that OO development is possible in C, or assembly, but the language… )
  • There are no debugging facilities, which makes it very hard to develop substantial amounts of code as meta programs.

We have seen a simple and short implementation of the monad in mainstream C++. However, this implementations has the drawback that its performance dies if it is applied to large data structures.

The drawback could be remedied by replacing the unit function by a complete copy control set for the monadic template.

Lazy monads can be elegantly implemented as well, but they have no performance whatsoever, due to the fact that code is evaluated twice, once to create the expression to evaluate, and another time to indeed evaluate the expression.

Furthermore, we have seen, in this episode that if we want to write somewhat larger programs using monad composition, we get very large programs without performance.

I think I’m done now with at least the sequential monad. I have no further interest. It is tedious, needlessly complicated, and leading to inefficient results when trying to construct meaningful programs from different monads in C++.

I wrote ‘sequential monad’ above, because there is still the continuation monad to investigate; the suggested solution to Callback Hell in concurrency contexts. So, for starters, what exactly is Callback Hell? We will see in part 8.

Advertisement

Harder to C++: Monads for Mortals [6], Performance of Basic Implementations

In parts 2 – 5 we have seen four different implementations of the monad. Now, it is time for a shootout. I want to see which version has best time performance and compare it with a regular task to evaluate the cost of using a monad.

The monad implementations are:

  1. The first implementation that copies all its data.
  2. The lazy implementation that evaluates at explicit call.
  3. The references and move semantics implementation.
  4. The pointer based implementation.

The Task at Hand

The challenge is to create a task that is hard to optimize by cleverly removing parts of the functionality. Imagine we have an algorithm that fills a vector with arbitrary data, which is then thrown away unused because the vector goes out of scope. We wouldn’t want the compiler to decide that it is useless to create the vector altogether and skip the code.

Further, I will make the task less predictable by extensive use of the rand() function. I know, rand() is not a great pseudo random engine. However, rand() is not used here to simulate randomness, but to get values that are available only at runtime, not at compile time – when the code optimizer runs. This way, the code cannot have optimizations based on compile time knowledge of constant values.

The algorithm that will be implemented by the functions that are ‘applied’ by the monad, and that will also be implemented by two regular functions is as follows (we will be working with the ValueObject class used in earlier episodes, but with a method added that retrieves a pointer to the payload):

  1. Receive a (pointer / reference to a) ValueObject object.
  2. Retrieve a char from the ValueObject object’s payload, with a random index, stored it in a global data structure.
  3. Create a size for a new ValueObject object, that depends on the size of the received object, and the pseudo random generator.
  4. Create a (monad with a new) ValueObject object and return (a pointer to) it.

Step 2 is to ensure that the payload is not optimized away because it will never be accessed. Step 3 serves to make the size of the payload somewhat unpredictable, and also wildly varying.

This is the version for the reference and move semantics based implementation:

	// Creates one more ValueObject object
	monad<ValueObject> onemore(const ValueObject& v)
	{
		// samples is a file scope vector<char>
		samples.push_back(*(v.getload() + rand() % (v.size()+1)));

		do
		{
			rsize = rand();
		} while (rsize == 0);

		vsize = v.size() + 1;
		rsize = rsize > vsize ? rsize / vsize : vsize / rsize;

		return monad<ValueObject>(ValueObject(rsize));
	}

Main Routine

In the main routine, we have a composition of 25 applications of the onemore function by the bind function. We execute this composition iterations times, which we call a run. We measure the time of each run. We average over the runs, and print the result tot screen.

The number of iterations and runs are given by rand(), which is seeded with a number obtained from the user.

Scoping is such that all objects are created and destroyed in measured time.

This is the code for the reference and move semantics based implementation:

{
	using namespace I3;

	srand(seed);
	unsigned int iterations = rand();
	unsigned int runs = rand() % 1000;

	cout << "Runs: " << runs << endl
		<< "Iterations per run: " << iterations << endl
		<< "Applications per iteration: 25" << endl;

	cout << "References and move semantics." << endl;
	total = 0;

	for (unsigned int i = 0; i < runs; ++i)
	{
		t.Start();

		for (unsigned int j = 0; j < iterations; ++j)
		{
			unsigned int size1 = rand();

			auto m1 = monad<ValueObject>(ValueObject(size1));

			auto m2 = m1
				| onemore | onemore | onemore | onemore | onemore
				| onemore | onemore | onemore | onemore | onemore
				| onemore | onemore | onemore | onemore | onemore
				| onemore | onemore | onemore | onemore | onemore
				| onemore | onemore | onemore | onemore | onemore;			}

		t.Stop();
		total += t.Elapsed();

		cout << ". ";
	}

	cout << endl << "Average over "
	<< runs << " differently sized runs of 25 x "
	<< iterations << " applications: " << total / runs << " ms"
	<< endl << endl;
}

As earlier, the timer code is from Simon Wybranski. The outer braces are just to assure different implementations do not interfere.

The iterations for the lazy monad differ from the above articulation in that it also makes a call: m2() to evaluate the constructed expression.

Regular Algorithms

The monad implementations are compared to regular algorithms that are functionally equivalent, but do not use monads. We will use one variant with references, and one variant with pointers and heap allocation.

The algorithm used is the same as above. The call in the main loop is terrible:

auto m2 = m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m(m1)
))))))))))))))))))))))));

Which also illustrates that the name onemore was abbreviated to m.

For references (&), the m function was implemented as:

	ValueObject m(const ValueObject& v)
	{
		samples.push_back(*(v.getload() + rand() % (v.size()+1)));

		do
		{
			rsize = rand();
		} while (rsize == 0);

		vsize = v.size() + 1;
		rsize = rsize > vsize ? rsize / vsize : vsize / rsize;

		// First creating a tmp, then returning it coerces move semantics
		// instead of copy semantics
		auto tmp = ValueObject(rsize);

		return tmp;
	}

Procedure

Time performance was measured using a release build, with default compiler switches for release builds, and the runs were executed outside of Visual Studio. That is, the program was started from a console window, not from Visual Studio.

Results

And now the results:

So, what do we see?

  1. The monad implementation based on references and move semantics (yellow) is the fastest. But the pointer based implementation has about the same performance.
  2. The regular implementations are only slightly faster (red, orange).
  3. The lazy monad (blue) is very slow, as expected.

Conclusions

The choice is clear. In following episodes, we will work with the monad built on references and move semantics. If required, we might use the pointer based variant as well. I will not use the lazy monad, unless it is clear that very, very special circumstances make it a favorable choice.

I was surprised by the small difference (if at all) between the fast monad implementation and the regular algorithms; the monad covers about twice the source code as the regular function – it also includes the bind function. So (hypothesis), the cost of using the references and move semantics monad is negligible.

Next

Next time we will start looking at various types of monads and how we could build compositions of them.

Harder to C++: Monads for Mortals [5], Pointers

In part 2 we have seen a small and elegant implementation of the monad. In part 4 we have seen how references and move semantics can be used to make the monad’s performance independent of the size of the wrapped value. In this part we will see another implementation the monad based on a pointer.

Implementation of a Monad with a Smart Pointer

We have seen in part 2 that the type constructor is represented by a template in C++. In part 2 this is a simple struct, in part 3 it is a std::function holding a lambda expression – so as to implement delayed evaluation, and in this part, it is a smart pointer, the std::unique_ptr to be precise.

So, now the type constructor is:

// Monad type constructor
template<typename T>
using monad = unique_ptr<T>;

We do not use a separate unit function, the unique_ptr‘s constructor will do. The bind function is:

// Bitwise OR operator overload, Monad bind function
template<typename A, typename R>
monad<R> operator|(monad<A>& mnd, monad<R>(*func)(const A*))
{
	log("---Function bind");

	return func(mnd.get());
}

Since in the applied functions we do not manage the life cycle of the existing A object, we send in a raw pointer, not a smart pointer.

Given the simple functions we used earlier, this implementation of the monad is used as follows:

// divides arg by 2 and returns result as monadic template
monad<ValueObject> divideby2(const ValueObject* v)
{
	log("---Function divideby2");

	return monad<ValueObject>(new ValueObject(v->size() / 2));
}

// checks if arg is even and returns result as monadic template
monad<bool> even(const ValueObject* v)
{
	log("---Function even");

	return monad<bool>(new bool(v->size() % 2 == 0));
}

void valueobjectmain()
{
	{
		auto m1 = monad<ValueObject>(new ValueObject(16));

		auto m2 = m1 | divideby2 | divideby2 | divideby2 | even;

		cout << boolalpha << *m2 << endl;
	}
}

The ValueObject class is the same as used in part 4.

Running the above code results in output:

I think the most important result here is what we don’t see: calls to the move constructor and the accompanying call to the destructor of the ValueObject.

Pro’s, Con’s, and Questions

The above image of the output is nicely constrained. Functions return the unique_ptr by value, i.e. it gets copied which is ok, and the bind function takes a reference to a unique_ptr to elicit the raw pointer from. We see the calls to ValueObject destructor occur after assignment to m2, and when leaving the anonymous scope.

So, pro’s for this implementation are its simplicity, clarity, and efficiency: the overhead of a unique_ptr is comparable to the overhead of a raw pointer. Values are not copied, so performance is size independent.

You may wonder, though, whether it is a good idea to create all the monad’s values on the heap, which is not so efficient. In part 6 we take a look at what we have so far and see what works best.

Harder to C++: Monads for Mortals [4], References and Move Semantics

In part 3 we have seen a simple monad with delayed evaluation. Until we find a use for delayed evaluation that adds value to mainstream imperative practices, we will focus on the eager or strict evaluating monad. Until the shootouts, for sure.

In part 2 we have seen a simple implementation of the eager monad. A drawback of that specific implementation is that it copies its value around, which will lead to poor performance in case of large data structures.

In this part we will examine an improved implementation that has the same time performance for any data size (that fits in RAM memory). We will improve the design of part 1 by using references and move semantics. To me, a good read on move semantics seems this article on StackOverflow.

Of course, we want to see the move semantics in action, so we can be sure the presented code indeed uses move semantics and references, hence is a performant solution. To be able to trace how an object, representing the monad value is moved around, we construct a class that will serve as the value, and which contains all necessary constructors, destructors and assignment operators, each fitted with an output statement reporting its execution.

We will add the same trace instruments to the monad template,. This requires that the unit function is integrated into the monad template in the form of various constructors (hence destructors) and assignment operators. To integrate the unit function into the monad template this way is not different from the practice Eric Lippert exercises in his tutorial on monads. There will be no use of a so called swap function in applicable constructors and assignment operators; that would again merge what we are trying to separate here.

Implementation of a Monad with References and Move Semantics

The Value Class

We will use the following class as the value in the monad to evaluate the use of refs and move semantics.

The constructors, destructor and assignment operators are all textbook examples. The class has a payload which we will use in performance measurements. It is a char array of size sz. The ValueObject class has a method (at the bottom) that returns size sz.

class ValueObject
{
    unsigned int sz;
    char* payload;
public:
    // default
    ValueObject();
    // Ref argument constructor
    ValueObject(const int& i);
    // Copy constructor
    ValueObject(const ValueObject& v);
    // Move constructor
    ValueObject(ValueObject&& v);
    // Destructor
    ~ValueObject();
    // copy assignment
    ValueObject& operator=(const ValueObject& v);
    // move assignment
    ValueObject& operator=(ValueObject&& v);

    unsigned int size() const;
};

Each constructor, destructor and assignment operator writes a line to the console to mark that it has been called.

To comfortably output the trace we use an output operator overload:

ostream& operator<<(ostream& os, const ValueObject& v)
{
	os << v.size();
	return os;
}

We test the ValueObject class with code that exercises all methods above.

int _tmain(int argc, _TCHAR* argv[])
{
	SetConsoleTitle(L"Harder to C++: Monads [4]");

	{
		ValueObject v1;
		ValueObject v2(10);
		ValueObject v3(v2);
		ValueObject v4(std::move(v3));
		v4 = v1;
		v2 = std::move(v1);
	}
	// The destructors get called at the end of the scope

	cin.get();
	return 0;
}

Which results in the following output.

The Monad Template

The monad type constructor does double duty as a unit function, with a broad spectrum of constructors. Below you will find the declarations in the template definition.

template<typename V> struct monad
{
	// Wrapped value
	V value;

	// std::move is here to coerce the use of v's move constructor, if any
	// The parameter V& is *not* const. Const coerces the use of V's copy
	// constructor
	monad(V& v); // : value(std::move(v))
	// std::move is here to coerce the use of v's move constructor, if any
	monad(V&& v); // : value(std::move(v))
	// copy constructor
	monad(const monad& m);
	// move constructor
	monad(monad&& m); // : value(std::move(m.value))
	//destructor
	~monad();
	// copy assignment
	monad& operator=(const monad& m);
	// move assignment
	monad& operator=(monad&& m);
};

Most implementations are standard. Note that there is no parameterless default constructor. The signature of the first constructor is special in that one would expect a const V& parameter, like so

    monad(const V& v) : value(std::move(v))

But doing that would coerce the use of the ValueObject’s copy constructor instead of the desired move constructor; the copy constructor has a const parameter, whereas the move constructor has a non-const parameter.

As with the ValueObject class, each method writes a line to the console to mark that it has been executed.

We test the monad template with code that exercises all methods above.

int _tmain(int argc, _TCHAR* argv[])
{
	SetConsoleTitle(L"Harder to C++: Monads [4]");

	{
		int i = 11;
		monad<int> m1(i);
		monad<int> m2(10);
		monad<int> m3(m2);
		monad<int> m4 = std::move(m2);
		m4 = m3;
		m4 = std::move(m1);
	}
	// The destructors get called at the end of the scope

	cin.get();
	return 0;
}

Which results in the following output.

Monad Bind Function

The Bind function has not really changed, compared to previous versions, but it does report on its use now.

// Bitwise OR operator overload: bind function
template<typename A, typename R>
monad<R> operator| (const monad<A>& mnd, monad<R>(*func)(const A&))
{
	cout << line++ << "\t" << "---Function bind" << endl;

	return func(mnd.value);
}

For use with the ValueObject, we will add two example functions. In analogy to previous installments, the first function divides the size of the payload by 2, the second function tells us if the size is even.

// divides arg by 2 and returns result as monadic template
monad<ValueObject> divideby2(const ValueObject& v)
{
	cout << line++ << "\t" << "---Function divideby2" << endl;

	return monad<ValueObject>(ValueObject(v.size() / 2));
}

// checks if arg is even and returns result as monadic template
monad<bool> even(const ValueObject& v)
{
	cout << line++ << "\t" << "Function even" << endl;

	return monad<bool>(v.size() % 2 == 0);
}

Tracing Moves

We are now in a position to trace the application of functions by the monad, and verify that indeed only references and move semantics are applied. We do this with the following code.

int _tmain(int argc, _TCHAR* argv[])
{
	SetConsoleTitle(L"Harder to C++: Monads [4]");

	{
		cout << line++ << "\t" << "---Object creation" << endl;

		ValueObject v1(22);
		monad<ValueObject> m1(v1);

		cout << line++ << "\t" << "---Creation of m2" << endl;

		auto m2 = m1 | divideby2 | even;

		cout << line++ << "\t" << boolalpha << m2.value << endl;

		cout << line++ << "\t" << "---Leaving scope" << endl;
	}
	// The destructors get called at the end of the scope

	cin.get();
	return 0;
}

Which results in the following output.

The good news is: the code indeed only uses references and move semantics.

But OMG, What A Shocking Lot Of Function Calls! What a Hydrocephalic Overhead!!!

Look what happens between entering the function divideby2 and the subsequent call to the bind function: No less than 8 calls to constructors and destructors. What a shocking waste of time and effort. To create 2 objects, apply 2 functions, and then clean up the mess, 24 function calls are required. What circumstances could ever justify such an overhead?

But of course, we cannot blame just the monad for this, it is also the move semantics. And that is what really shocks me. How could they do that? Now I understand why the committee is ready to embrace the monad. Who cares about the overhead of the monad if you already have the overhead of move semantics?

Minimal Verification that Time Performance is Independent of Size

Ok, let’s put the drama behind us and see if this solution is indeed size independent. We will compare a payload of size 1 to a payload of 10,000,000,000.

To make the test size independent we will use an identity function. This function does not modify the data, it just moves the data on – Yes it is still immutable data. The identity function will not muddle time measurements with memory (de-)allocations for the actual data, so we will only measure the overhead for the monad and the move semantics.

Time measurements will be made with the high resolution performance timer by Simon Wybranski for a release win32 build.

The identity function looks like this.

monad<ValueObject> identity(const ValueObject& v)
{
	return monad<ValueObject>(const_cast<ValueObject&>(v));
}

The test code is as follows.

int _tmain(int argc, _TCHAR* argv[])
{
	SetConsoleTitle(L"Harder to C++: Monads [4]");

	double total = 0;
	for (unsigned int k = 0; k < 10; ++k)
	{
		//ValueObject v1(1);
		ValueObject v1(10000000000);
		monad<ValueObject> m1(v1);

		double elapsed = 0;
		for (unsigned int i = 0; i < 10000; ++i)
		{
			t.Start();

			auto m2 = m1
			| identity | identity | identity | identity | identity
			| identity | identity | identity | identity | identity
			| identity | identity | identity | identity | identity
			| identity | identity | identity | identity | identity
			| identity | identity | identity | identity | identity
			| identity | identity | identity | identity | identity;

			t.Stop();
			elapsed += t.Elapsed();
		}
		// The destructors get called at the end of the scope

		cout << "Time 30 x 10000 applications: " << elapsed << " ms" << endl;
		total += elapsed;
	}

	cout << "Average time over 30 x 10000 iterations: " << total / 10 << " ms"
		<< endl;

	cin.get();
	return 0;
}

A typical output is

Comparable values (about 2 ms) are indeed obtained for both the large and the small payload.

Pros, Cons, and Questions

So, a strict monad that is based on references and move semantics takes 2 ms to do nothing 30,000 times, independently of size of the data structure 🙂 .

The amount of time per function application is small, so we might conclude that references and move semantics indeed provide an advantage over code that always copies its data structures, as we had in parts 1 and 2.

On the other hand, I find the amount of constructor and destructor calls distressing: returning a monad from a function takes two constructor calls and two destructor calls. Can we do better than that? Is it worth the trouble to try and do better than that?

I think it surely is worth the trouble. There is always the potential comparison between 2 programs that are functionally equivalent, but one program needs only half the instructions the other program needs. The advantages of the smaller program are of great importance. It can run equally fast on much less expensive hardware, it can save 50% battery power compared to the other program. On equal hardware it leaves room for the execution of other instructions that can contribute to a significantly better UX. So, finding the best performance brings you a competitive edge that can make a world of differences in the long run, or at large scale.

Next episode we will investigate if a pointer based monad can do any better than the constructs presented here.

Harder to C++: Monads for Mortals [3], Lazy implementation

This blog post is part 3 in a series. This part presents a simple implementation of a lazy evaluating monad. It is a minimal variation of the implementation in part 2.

Below we will see how the implementation is constructed, then an example is given that provides initial verification that the construction is indeed a monad.

Implementation of a Lazy Monad

Implementation starts again with some includes and a using statement. The first important construct is the monad type constructor, a C++ template which in this particular case takes the form of a std::function template:


// Monad type constructor: a std::function template
template<typename T>
using monad = function < T() >;

The using statement defines ‘monad’ as a type alias.

Of course, a suitably formed pointer to function object would suffice as well. The std::function template has the advantage that it looks neat, and we can instantiate it in an elegant way using a lambda expression. Thus upholding some of the functional character of the monad’s roots.

The unit function, which creates instances of the ‘monadic’ function object is as follows:

// Monad unit function, wraps a value in a lambda
template<typename V>
monad<V> unit(V val)
{
       return[=](){ return val; };
}

The unit function returns a lambda expression the returns a value, when executed (later some time, if any).

The bind function follows the description of bind operators to the letter. It evaluates the monad argument, and returns a function that applies the function argument to the result of that evaluation:


// Monad bind operator
template<typename A, typename R>
monad<R> operator|(monad<A>& mnd, monad<R>(*func)(A))
{
       // return a callable to evaluate later
       return[=]()
       {
             A unwrapped = mnd();
             monad<R> result = func(unwrapped);
             return result(); //parentheses, because we define a function that returns a value
       };
}

The returned function is not evaluated before function call is applied to it. Hence using this bind operator we can construct long and complex function compositions, which get evaluated only after we explicitly do so.

The above triple is the complete implementation of the monad. Yes I know, there should be two additional operations defined as well: the sequencing operator and the fail operator (see e.g. Mike Vanier, part 3. However, many texts discussing monads indicate that these operations are uninteresting. So, we will skip them in this series.

Let’s add the same functions as in part 2, and run a simple composition:


// divides arg by 2 and returns result as monadic template
monad<int> divideby2(int i)
{
       return unit<int>(i / 2);
}

// checks if arg is even and returns result as monadic template
monad<bool> even(int i)
{
       return unit<bool>(i % 2 == 0);
}

int _tmain(int argc, _TCHAR* argv[])
{
       SetConsoleTitle(L"Harder to C++: Monads [3]");

       // Wrap 22 in monadic template
       auto m1 = unit<int>(22);

       // Apply bind 4 times in function composition
       auto m2 = m1 | divideby2 | divideby2 | divideby2 | even;

       cout << boolalpha << m2() << endl;

       cin.get();
       return 0;
}

Note that the functions ‘divideby2’ and ‘even’ have not changed. The only change in the _tmain function is that in the output statement m2.value has been replaced by m2(). Note the parentheses, which denote lazy evaluation of the constructed composition.

Minimal Verification that Monadic Laws Hold

The minimal verification is the lazy analog to verification in the eager case. It looks like this:


int _tmain(int argc, _TCHAR* argv[])
{
       SetConsoleTitle(L"Harder to C++: Monads [3]");

       // Wrap 22 in monadic template
       auto m1 = unit<int>(22);

       // Left identity
       // First explicitly convert to function pointer (because of template)
       auto pf1 = &unit < int >;
       // then apply bind operator
       auto m2 = m1 | pf1;

       cout << "Left Identity result: " << m2() << endl;

       // Right identity
       auto m3 = m1 | divideby2;
       auto m4 = divideby2(22);

       cout << boolalpha << "Right Identity result : " << (m3() == m4()) << endl;

       // Associativity
       // Case 1: equivalent to bind(m1, divideby2);bind(m5, even)
       auto m5 = m1 | divideby2;
       auto m6 = m5 | even;

       // Case 2: equivalent to Bind(m1, divideby2;bind(..., even));
       // Create function composition
       auto f = [](int i){ return divideby2(i) | even; };

       // Explicitly convert f to function pointer
       auto pf2 = static_cast<monad<bool>(*)(int)>(f);
       auto m7 = m1 | pf2;

       cout << boolalpha << "Associativity result : " << (m6() == m7()) << endl;

       cin.get();
       return 0;
}

which generates output:

image0011

as expected (and desired).

Pros, Cons

Some texts declare that lazy implementation provides advantages over eager or strict evaluation. Advantages presented are:

  1. Calculations are only made when required and in as far as required.
  2. ‘Lazy’ allows he evaluation of lists that are unbounded (endless, infinite)
  3. Delayed evaluation of function compositions allows for an optimization before evaluation of the composition takes place.

What to think of this?

The first advantage strikes me as odd; I usually do not write programs that make calculations which’ results are not required.

The second advantage is odd as well. My computer is as powerful as a Turing machine, hence it can process input lists of unboundedly length. Without lazy evaluation.

The third advantage could be a real advantage, but I wouldn’t know how to do this in C++; the advantage presupposes that it is possible to create code at runtime and then compile and execute it, which isn’t true for C++.

I do see a disadvantage, though, and we will get to it in detail when doing some shootouts between the various alternative monad implementations. The disadvantage is that the software does extra work. It first creates an expression, then evaluates it. So in C++, for each lazy evaluating program, there is an equivalent eager evaluating program that is faster and smaller, hence more competitive.

In the next part of this series we will work on an (eager) implementation that has the same time performance characteristics for any size value. There will be many references and move semantics. Hmm … mjummy 😉

Harder to C++: Monads for Mortals [2], First implementation

This blog post is part 2 in a series. Part 1 is a general introduction. This part discusses a first implementation of the monad in C++.

There are two fundamentally different approaches to implementing monads in C++. Let’s first discuss the approach we will not pursue.

A Way We Will Not to Implement a Monad in C++

The C++ template mechanism is a full functional language (this discovery, by Erwin Unruh, was a bit of a surprise). As a consequence C++ templates are sometimes used for template meta programming: programs can be written using (only) C++ templates and are as such evaluated / executed, at compile time. There are not many practical uses for template meta programming. One practical use was to unroll (small) loops in order to gain performance. This gain, however, can nowadays also be obtained by modern, optimizing C++ compilers.

The discovery that the C++ template mechanism constitutes a functional language, also has lead some people to implement monads in a template meta programming based functional language. Examples are the Cee Plus Plus Monad Example, and Implementing Monads for C++ Template Metaprograms. However, these implementations live in a part of C++ that is little used. Moreover, the required syntax is complex, which makes it unattractive for general use.

The above approach and the goal pursued here are, I think, trying to answer different questions. The above examples seem to try to mimic functional programming with C++, and implement monads in that context. Contrary to bringing C++ to functional programming, I want to know if bringing the functional programming construct ‘monad’ to C++ will provide me with an advantage, and to what cost.

Below we will start with a simple implementation of a monad in mainstream C++ and show that it is indeed a monad. Then we will see what we like or dislike. Note that we will not be doing any functional programming, which, in fact, is quite in the spirit of Eugenio Moggi’s intentions.

A Simple, Mainstream C++ Implementation of the Monad

Implementation starts with some includes and a using statement. The first important construct is the monad type constructor, or rather: the C++ template.

#include <iostream>;

using namespace std;

// monad type constructor
template<typename V>; struct monad
{
	// Wrapped value
	V value;
};

We saw in part 1 that the result of the unit function has the semantics of a computation. As you can see, this C++ class / struct template does not have any computations, it is a trivial computation. For one thing, we could add a function application operator to it ( operator() ), that would perform some non trivial computation, and we might indeed at some point do so.

The next item to define is the unit function, the factory:

// factory: unit function
template<typename V> monad<V> unit(V val)
{
	return monad < V > { val };
}

It is simplicity itself. An alternative could have been to implement the unit function as the monad struct constructor. That would have been more efficient, so we will do that in a number of following implementations.

I have given the bind function the form of an overload of the bitwise OR operator ” | “.

// Bitwise OR operator overload: bind function
template<typename A, typename R>
monad<R> operator| (monad<A>& mnd, monad<R>(*func)(A))
{
	return func(mnd.value);
}

I got the idea to use an operator instead of a function called ‘bind’ from the CeePlusPlus example referred to above. However, that code overloads the ” >>= ” operator, in order to make the code resemble Haskell somewhat. I prefer the ” | ” because it is the same symbol as the Unix / Linux shell pipe operator. And indeed, we will see shortly that in function composition, bind can well be considered to pipe results from function to function.

The above bind function first ‘unwraps the value’, by referencing the value of the monad object, it then applies the function ‘func’ to it, and returns the result wrapped in a new monad template object.

At this point we have a first implementation of a monad! Note how very small and simple it is. Let’s add some functions and give it a spin.

// divides arg by 2 and returns result as monadic template
monad<int> divideby2(int i)
{
	return unit<int>(i/2);
}

// checks if arg is even and returns result as monadic template
monad<bool> even(int i)
{
	return unit<bool>(i % 2 == 0);
}

int _tmain(int argc, _TCHAR* argv[])
{
	// Wrap 22 in monadic template
	auto m1 = unit<int>(22);

	// Apply bind 4 times in function composition
	auto m2 = m1 | divideby2 | divideby2 | divideby2 | even;

	cout << boolalpha << m2.value << endl;

	cin.get();
	return 0;
}

Note the elegant function composition offered by the bind operator.

Minimal Verification that Monadic Laws Hold

By means of a single example we will verify that the monadic laws (see part1) hold:

int _tmain(int argc, _TCHAR* argv[])
{
	// Wrap 22 in monadic template
	auto m1 = unit<int>(22);

	// Left identity
	// First explicitly convert to function pointer (because of template)
	auto pf1 = &unit < int >;
	// then apply bind operator
	auto m2 = m1 | pf1;

	cout << "Left Identity result: " << m2.value << endl;

	// Right identity
	auto m3 = m1 | divideby2;
	auto m4 = divideby2(22);

	cout << boolalpha << "Right Identity result : " << (m3.value == m4.value)
	     << endl;

	// Associativity
	// Case 1: equivalent to bind(m1, divideby2);bind(m5, even)
	auto m5 = m1 | divideby2;
	auto m6 = m5 | even;

	// Case 2: equivalent to Bind(m1, divideby2;bind(..., even));
	// Create function composition
	auto f = [](int i){ return divideby2(i) | even; };
	// Verbosely convert f to function pointer
	auto pf2 = static_cast<monad<bool>(*)(int)>(f);
	auto m7 = m1 | pf2;

	cout << boolalpha << "Associativity result : " << (m6.value == m7.value)
	     << endl;

	cin.get();
	return 0;
}

which generates output:

The funny thing about the associative law is that the intuitive use of the bind operator like in:

auto m2 = m1 | divideby2 | even;

is equivalent to:

auto m2 = bind(bind(m1, divideby2), even);

which is very obviously equivalent to Case 1 above. So indeed, parentheses do not matter as long as we pipe the result of applying divideby2 to m1 into event.

Pros, Cons, and Questions

So, now that we have made a start, what to think of it?

The ‘pros’ are:

  • The small size
  • The very elegant function composition.
  • From a functional programming perspective one might add that this implementation works with immutable data.

The most visible ‘cons’ are:

  • The code copies function arguments and function results, so it is inefficient, and performance for large objects will be bad.
  • The const qualifier is completely absent from the code.
  • It would be more efficient to make the unit function the constructor in the monadic template (as it is in the examples in Eric Lippert’s tutorial on monads in C#).

There is also this question about the bind operator:

  • Why is the bind operator not a member of the monadic template like with Lippert? Note that doing so would turn the monadic template into the complete monad. Well, this is C++, and I like to follow C++ best practices as accumulated by Sutter and Alexandrescu here, see item 44 “Prefer writing nonmember nonfriend functions”. There are more reasons, having to do with composition of different types of monads. But that’s for later…

Next time: a simple implementation of the monad for lazy evaluation.

Harder to C++: Monads for Mortals [1]

Posted: October 22, 2014 in Harder to C++
Tags: ,

Harder to C++: Monads for Mortals [1]

The monad is a programming construct, a coding idiom, or perhaps even a design pattern. Its roots lie in Category Theory, coming to programming via programming language semantics and functional languages. I have been reading up on monads (there are multiple, closely related variants), specifically monads for C++. It turns out that the concept of a monad is quit slippery; one minute you think you have a nice understanding, the next minute you are back at zero.

It seems to me that for a thorough understanding of what monads are, you need some knowledge of Category Theory and neighboring fields in logics and algebra, and sound knowledge of the functional programming language Haskell. But alas, I’m just a simple mortal programmer. Moreover:

  • I don’t want to know about Category Theory!
  • I don’t care about have time for Haskell !
  • I just want to write programs in C++!

An alternative route to understanding monads that suits me (simple mortal soul) better, then, is to experiment with various implementations of monads in C++, and evaluate the results, experiment with an improved implementation. And so on.

I have been busy running this experimental cycle, and will describe the condensed experiences in a number of blog posts.

Many interesting things happened, but I can’t say I found a satisfying implementation. Indeed, I have come to doubt the usefulness of monads for application in C++, despite its conceptual and syntactical beauty. Even worse, I have come to doubt the usefulness of some other additions to the C++ language, new since C++11.

But monads are hot!, They must be useful in C++! Well, let us be clear about our motive to write programs in C++. The unique selling points of C++ are (i) the fastest code, against (ii) the smallest footprint. I think people buy in on C++ for exactly these selling points.

Let’s compare with driving a car (what a boyish thing to do). Programming in C++ is then, of course, driving a formula 1 race car; fast, small, unsafe, and uncomfortable. You want comfort, drive a large American saloon, i.e. code in a language designed for developer productivity. You want to be safe, correct? Drive a Toyota Prius, i.e. program a functional language.

Of course, Formula 1 drivers want to be as safe and comfortable as possible. But not at the expense of performance!

So, the monad can be a contribution to C++, if(f) it has no time, space performance cost.

Some Background on Monads

Let’s see what monads are, where they come from, and how they got into programming.

For a C++ developer, a monad is a triple <M, u, b> where:

  • M is a template (yes, m for monad 🙂 ).
  • u is a factory function, in general called unitof signature a -> M<a>, that given an object or value of some type a creates a template object of type M<a>.
  • b is a function application function called bind of signature (M<a>, f) -> M<b> that:
    • Retrieves the object of type a from M<a>
    • Applies a function f: a -> M<b> (from type a to a template of type M<b>)
    • Returns a template of type M<b>

This triple has to satisfy three laws in order to be a monad:

  1. Left identity: bind(M<a>, u) = M<a> (so left means left argument for bind)
  2. Right identity: for an f: a->M<b>, bind(M<a>, f) = f(a)
  3. Associativity: bind(x, f); bind(y, g) = bind(x, f; bind(y, g))

In the description of the associative law:

  • x is a variable of type M<a>
  • y is the result of bind(x, f)
  • the ‘;’ operator denotes left to right sequencing of operations
  • f is a function of type a -> M<b> and g is a function of type b -> M<c>

Now. let’s add some intuition.

According to Eugenio Moggi, the semantics of M is that of a computation. He means to include a general class of computations, e.g. partial computations, nondeterministic computations, side-effects, exceptions, continuations, and interactive input and output, and possibly more.

Further, u is the inclusion of value a in a computation, as a parameter. So, we may think of M as a computation.

The application function b (applied to function f) is the extension of a function f from values to computations to a function from computations to computations, which first evaluates a computation and then applies f to the resulting value.

Philip Wadler took the monad to Haskell and showed it can be used to do bad things in a good way. That is, to perform imperative programming language operations in a pure functional language way. Haskell is a pure functional language. In a pure functional language all functions are pure, i.e. functions in which the input-output relation cannot be augmented by things like state, or context, or whatever; the domain – codomain relation is fixed. In a pure functional language you can thus have no side effects, or interactive input / output, etc.

The monad solves this problem by encapsulating ‘imperative operations’ in the M computation. According to Wadler a function

f: a -> M<b>

is then said to map type a onto type b, with a possible additional effect captured by M. The monad takes the ‘imperative operations’ out of the (pure) function application mechanism, and the functional composition chain, and thus saves the pure character of the pure functional language.

Does this talk about functional language has any value for C++? That is debatable. C++ is an imperative language, the idea to improve it by a construct that makes a functional language do imperative operations is preposterous. On the other hand, Bartosz Milewski is convinced that concurrency in C++ would be improved greatly if it would be structured the monadic way (think of the continuations mentioned above). Concurrency is new territory for C++, so we have to investigate this further. I intend to discuss this viewpoint somewhere in this series of articles on monads. First I want to develop a thorough understanding of monad implementations in C++.

Next in this series: a first C++ implementation of a monad.

Links to Texts on Monads

Who writes or wrote about monads, and what do they think is the merit of monads?

  1. Eugenio Moggi wrote Notions of computation and monads, about. This paper introduces monads as the categorical semantics for computations. It is very hard to read, because full of funny mathematical words that have a very precise meaning which then, of course, hardly anybody knows exactly. According to Moggi, monads allow mathematicians to reason about the equivalence of a very general class of programs.
  2. Philip Wadler introduced monads into Haskell in order to perform ‘imperative operations’ in a decent way. Very readable text.
  3. Mike Vanier wrote an extensive introduction into monads. Hard to read because of much Haskell code. After Philip Wadler he states that monads are key in allowing programmers to write side-effecting programs in pure functional languages.
  4. Bartosz Milewski is a fervent supporter of application of monads to C++, specifically in the context of concurrency. He has written much texts that might make reading Moggi’s text easier. Milewski stresses the idea that std::fucture should be monadic.
  5. Eric Lippert wrote a very (very) accessible introduction to monads for C#. He saw merit of monads in the construction of LINQ-to-Objects.

Boolean Fields in a Struct: Another Complex PInvoke Construct

Adjacent Boolean fields in a structure constitute a problem in marshaling a C++ struct to C# .Net. Point is that the memory layout of structs is maintained in marshalling. Using the MarshalAs function or the StructLayout Pack field doesn’t solve the problem. Sharing of the bool aggregation memory word by several Properties, in combination with bit-masks and a bitwise operator, does however do resolve the situation.

Introduction

In the previous blog post we saw how to use PInvoke to copy an array of structs from C++ to C#, using a callback. In this blog post we build on the code developed in that previous post to solve a problem in working with Boolean fields (fields of type bool).

I know, it should be easy. However, it is complicated.

Point is: if we extend our structure with a single Boolean field, marshaling works fine, but if we add two adjacent Boolean fields, marshaling breaks down. Why?

Let’s first see what happens, then analyze the problem , and with the analysis in hand evaluate solution alternatives. And yes, the problem will be solved.

Adding Boolean Fields to a Struct, Then Marshal it in PInvoke

Assume we would like to extend our C++ MeteoInfo (meteorological Information) struct with a Boolean field indicating whether it is raining (whether there is precipitation) at the meteo station at time of measurement. In C++ we get:

struct MeteoInfo
{
    wchar_t* DisplayName;
    wchar_t* UniqueID;
    int Temp;
    bool IsRaining;
    double Humidity;
};

In our WPF C# GUI we would meet that with:

public struct MMeteoInfo
{
// Properties, not fields to serve data binding
    public string DisplayName { get; private set; }
    public string UniqueID { get; private set; }
    public int Temp { get; private set; }
    public bool IsRaining { get; private set; }
    public double Humidity { get; private set; }
}

 

And this will work.

Then, encouraged by our success, we add two more, but adjacent Boolean fields: IsOperational and IsOnline, which indicate whether the meteo station is known to function properly, and can be reached by internet, respectively. In C++:

struct MeteoInfo
{
    wchar_t* DisplayName;
    wchar_t* UniqueID;
    bool IsOperational;
    bool IsOnline;
    int Temp;
    bool IsRaining;
    double Humidity;
};

And in C#:

public struct MMeteoInfo
{
// Properties, not fields to serve data binding
    public string DisplayName { get; private set; }
    public string UniqueID { get; private set; }
    public bool IsOperational { get; private set; }
    public bool IsOnline { get; private set; }
    public int Temp { get; private set; }
    public bool IsRaining { get; private set; }
    public double Humidity { get; private set; }
}

But now we are in trouble. We create an array of MeteoInfos as:

MeteoInfo m_infos[] =
{
    { L"Meteo1", L"123-123-123", true, true, 25, true, 60.3 },
    { L"Meteo2", L"456-456-456", true, false, 27, false, 81.25 },
    { L"Meteo3", L"789-789-789", false, true, 33, true, 36.7 }
};

And it shows up in the GUI like:

Which is completely wrong!

So, what happened?

Analysis: Struct Memory Layout

From an issue posted a connect.microsoft.com (MS’s feedback site). We learn that:

  • In Marshaling a type with layout from C++, default alignment/padding of types is the same as in C++.
  • Alignment requirement is by default 8 but can be changed using the Pack field on the StructLayoutAttribute.

To complete the picture we should add that:

  • The (minimum) byte size of the bool type has not been defined in C++, so it is implementation dependent. In VC++ sizeof(bool)=1, as it is in C#.

Now let’s take a look at the memory layout of the structure in C++, hence in C# .Net, and compare this with the layout we would expect in C# based on aligned type byte sizes.

Field Byte offset in C++, C# Type Size CLR Aligned type size based offset
DisplayName 0 8 0
UniqueID 8 8 8
Temp 16 4 16
IsRaining 20 1 (padding=3) 20
Humidity 24 8 24
One Past End of struct (size) 32 32 32

So indeed, no problems in sight; the actual offset are equal to the offsets based on aligned managed type sizes.

How did I get these data? The C++ structure memory layout can be read off the Memory window while debugging in VS2013. However, in order to obtain the offset numbers for the C# struct (MMeteoInfo) you have to use a trick, see here: assign each field of a struct object to a suitably typed variable (i.e. var 🙂 ), and evaluate the offset of the source value. (I’m open to suggestions on utilities that show memory layout of complex types in managed memory, how ever volatile.)

When we add two (or more) adjacent Boolean fields, we are in trouble

Field Byte offset in C++, C# Type Size CLR Aligned type size based offset
DisplayName 0 8 0
UniqueID 8 8 8
IsOperational 16 1 16
IsOnline 17 1 20
Temp 20 (aligned at word) 4 24
IsRaining 24 1 28
Humidity 32 (aligned at 8 multiple) 8 32
One Past End of struct(size) 40 40 40

This shows that the two adjacent Boolean fields are packed together in a single word. This is the doing of default marshaling, but clearly the CLR cannot adapt to this.

We can explain, or predict the errors now:

  • An assignment of IsOperational grabs four bytes, so it also copies IsOnline, hence with our initializations the result is always true. Indeed, if we set both to False, IsOperational comes out false too.
  • An assignment of IsOnline also grabs four bytes, so also the first byte of Temp. Temp is positive in our initializations, so IsOnline is always true. Indeed, if we set Temp to zero, IsOnline comes out False.
  • Assignment of Temp really grabs the value of IsRaining at offset 24, so we get to see the value of IsRaining instead of the value of Temp.
  • The value for IsRaining is copied from the empty padding bytes at offset 28, rendering it the value 0.
  • The value or Humidity, a double, is taken from the first multiple of 8 bytes, at offset 32, hence it is correct.

So, now we know the MSIL instructions copy too many bytes, or bytes from the wrong location, how do we repair this error?

Solutions and Non-Solutions

A solution we do not want is to change the C++ code by using #pragma pack(…). #pragma is used to create non-portable code, pack is used to adapt the memory layout of structures. I don’t want to make code non-portable, just because I would like to add a GUI to that code.

Let’s review a number of approaches that might seem reasonable, but in fact will not work.

The MarshalAs Method

We could insert a clause like:

MarshalAs(UnmanagedType.Bool)

However, that one is meant for export of data to C++, targeting the Windows BOOL type, which is four bytes. For our 1 byte bool type we use:

MarshalAs(UnmanagedType.U1)

Then we change the definition of MMeteoInfo so we can marshal the bool variables, if we would like to:

[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
public struct MMeteoInfo
{
    public string DisplayName { get; private set; }
    public string UniqueID { get; private set; }

    [MarshalAs(UnmanagedType.U1)]
    bool isOperational;
    public bool IsOperational
    {
        get { return isOperational; }
        private set { isOperational = value; }
    }

    [MarshalAs(UnmanagedType.U1)]
    bool isOnline;
    public bool IsOnline
    {
        get { return isOnline; }
        private set { isOnline = value; }
    }

    public int Temp { get; private set; }

    [MarshalAs(UnmanagedType.U1)]
    bool isRaining;
    public bool IsRaining
    {
        get { return isRaining; }
        private set { isRaining = value; }
    }

    public double Humidity { get; private set; }
}

 

But if we run the program, we get an error:

Let’s say this is a marshaling error.

The StructLayoutAttribute.Pack Field

If we set Packing to 4, all our troubles should be over; all fields should start at word boundary. However, mirroring the C++ layout takes precedence – the two Boolean fields keep being marshaled as two adjacent bytes, and we still get the same error message. The same result is obtained for each packing value we may choose.

Use the StructLayout Attribute with LayoutKind.Explicit

So, we cannot change the structure’s memory layout by merely specifying a packing. Next step is to explicitly specify the layout per structure field. If we do so, the definition of the structure becomes like this:

[StructLayout(LayoutKind.Explicit, CharSet = CharSet.Unicode)]
public struct MMeteoInfo
{
    [FieldOffset(0)]
    string displayName;
    public string DisplayName
    {
        get { return displayName; }
        private set { displayName = value; }
    }

    [FieldOffset(8)]
    string uniqueID;
    public string UniqueID
    {
        get { return uniqueID; }
        private set { uniqueID = value; }
    }

    [FieldOffset(16)]
    bool isOperational;
    public bool IsOperational
    {
        get { return isOperational; }
        private set { isOperational = value; }
    }

    public bool IsOnline
    {
        get { return isOperational; }
        private set { isOperational = value; }
    }

    [FieldOffset(20)]
    int temp;
    public int Temp
    {
        get { return temp; }
        private set { temp = value; }
    }

    [FieldOffset(24)]
    bool isRaining;
    public bool IsRaining
    {
        get { return isRaining; }
        private set { isRaining = value; }
    }

    // double gets laid out on a multiple of 8 bytes.
    [FieldOffset(32)]
    double humidity;
    public double Humidity
    {
        get { return humidity; }
        private set { humidity = value; }
    }
}

Note that the property IsOnline does not have a backing field of its own, It shares a backing field with IsOperational. After all, both bools are encoded in these same 4 bytes. This code works, except that IsOperational, and IsOnline are now dependent values.

So, how do we get the correct value for each bool out of the 4 bytes in the isOperational field?

We use a mask and a Boolean bitwise operator, like this:

[FieldOffset(16)]
int booleans;
public bool IsOperational
{
    get
    {
        int mask = 1;
        // The first byte of booleans contains the value of IsOperational
        int tmp = booleans & mask;
        return tmp > 0;
    }
    private set { ; } // dummy
}

public bool IsOnline
{
    get
    {
        int mask = 256;
        // The second byte of booleans contains the value of IsOnline
        int tmp = booleans & mask;
        return tmp > 255;
    }
    private set { ; } // dummy
}

And it works!

Now we have the same values as in the initialization code above.

A WPF GUI for a C++ DLL: A Complex PInvoke Construct

Introduction

Suppose you have a C++ DLL that needs to relay some data and possibly some error messages, and you want to present this data in a graphical user interface (GUI) that is compatible with Windows 7 and up?

Then you have quite some options, among which:

– A win32 GUI in C++.

– A GUI in WPF, connected to the DLL by C++ interop, i.e. managed C++.

– A GUI in WPF, connected to the DLL by PInvoke.

– A GUI in Qt or another 3rd party GUI library for C++.

You might want to choose the option that you are familiar with, and that has a future, also. In my case that is a choice for WPF (familiarity) and PInvoke (has a future). That is, I am not familiar with QT or other 3rd party GUI libraries, and a win32 GUI seems too tedious. I also happen to think that managed C++ is, or will be orphaned. Marshalling (which we use for PInvoke) on the other hand, has been enhanced even in .Net 4.5.1, see e.g. the Marshal.SizeOf<T>() method here.

In this blog post we will explore a WPF GUI that sends a pointer-to-callback to a C++ DLL, which uses the callback to send an array of structures (containing strings) back to the GUI, which the GUI presents to the user by means of data binding. The case will be made for a simple weather forecast system that displays some data from (imaginary) distributed measurements collected by the DLL. We will use a separate callback to allow the DLL to log meta level messages.

The reason for writing this article on PInvoke is that in my research for the solution described here, I found many explanations of advanced PInvoke techniques, but few integrated examples, and even less example of connecting PInvoke to data binding.

So, the software presented here combines a number of advanced PInvoke techniques into a coherent, easy-to-use, functional whole, connected to data binding. The complete example code (a vs2013 x64 solution) can be downloaded using the link at the bottom of this article.

Sources

I will not assert that this article contains substantial original work. It just integrates insights published (abundantly) on the internet. Places to look for adequate information on PInvoke and Marshalling are:

– The MSDN Library on PInvoke and Marshalling.

– Stack Overflow on PInvoke.

Manski’s P/Invoke Tutorial .

Particularly informative articles (to me) were:

A support article by David Stucki on wrapping a C++ class for use by managed code.

An article by JeffB on CodeProject on marshalling a C++ class.

Below we will first explore how to data bind to a string obtained from the DLL, then we will explore the case of an array of structures.

Getting a Log Message String from the DLL by Callback

And data binding it to a TextBlock in the GUI.

In this section we will first review the C++ DLL involved. Then we will discuss the case of obtaining a log message. We will thereby ignore all aspects of the C++ code that have to do with marshalling an array of structures; that will be the subject of the next section.

The Native DLL

The DLL consists of an Interface file of exported symbols, a header file that defines the implementation class, and a cpp file with the implementations.

The DLL interface is defined as follows:

The MeteoInfo structure will be presented in the next section.

We note the logCallback function pointer, the lifetime management functions CreateMeteo and DestroyMeteo, and a Send function that we use to signal the DLL to send data to the WPF GUI using callbacks.

The Header file for the implementation of the interface is as follows.

We see a simple class derived from the Interface. Note the fields, initialized by the constructor, that hold the callback function pointers for local, repeated use.

The cpp implementation file is as follows:

So basically, the Meteo::Send method invokes the log callback (m_lcb) with a wchar_t constant.

Then, what do we need to data bind the result of this call to a TextBlock in the WPF GUI?

WPF GUI: The Logging Text Part

First of all we need to declare that we will use methods from the DLL in the C# code to create a Meteo object that holds a pointer to the log callback, and to signal the DLL to send data to the WPF GUI. In this simple scenario, the DLL responds to invoking the Send method by invoking the callback once, but in a realistic setting the Send method would start a process that repeatedly invokes the callback autonomously – according to its own logic.

The argument for Send is the instance we obtained from CreateMeteo. Obviously (being the argument for CreateMeteo) , we need a delegate called LogDelegate:

Note the CharSet specifier. We need this since the default is ANSI. Next we bind the delegate to an actual method:

Here we copy the native string referred to by the IntPtr pointer to a managed string.

LogString is a property, not a field. We use a property so we can data bind a TextBlock to it. The property is:

The OnPropertyChanged call is part of a standard INotifyPropertyChanged implementation.

The data binding in XAML is just:

And that’s all.

Getting an Array of Structures from the DLL by Callback

The relative simplicity of the above was somewhat surprising to me, since explanations of PInvoking strings, in articles on the internet can be dauntingly complex. The good news is that applying the same pattern to an array of structures, that hold strings as well as other data types, does not lead to much additional complexity.

Recall the C# DLL function declaration that creates a Meteo object:

It also holds a delegate for relaying data that we use to update the GUI. GuiUpdateDelegate is defined as:

The delegate’s parameters are a pointer to a native array, and an int that designates the size of the array.

The array is a simple array of MeteoInfo structures. In order to relay the data we need corresponding structures in C++ and in C# (managed code). The native structure is:

The corresponding managed structure is:

You might expect fields here, instead of properties. We should keep in mind, however, that a property is just some code around a field. If we would define the structure with fields, the size in bytes of an object of the structure would be the same. However, we need (public) properties in order to support data binding.

The delegate specified above is bound to an actual method that produces an array of MMeteoInfo objects:

The Marshal.PtrToStructure method copies the native data into a managed structure.

The InfoList object mentioned in the code above is a ListView defined in XAML

And we’re done. Output from the sample application looks like this:

Link to Sample Code

Sample Code

Data Compression for the Kinect

Transmitting uncompressed Kinect depth and color data requires a network bandwidth of about 460Mbit/s. Using the RleCodec or the LZ4 library we achieve tremendous compression – a compression ratio of 10 or 22 respectively, at lightning speed – over 1600Mbytes/s. We achieve this not so much by the compression algorithms, but by removing undesirable effects (jitter, by the DiscreteMedianFilter) and redundancy (already sent data, by taking the Delta).

Introduction

From the start, one goal of the Kinect Client Server (KCS) project was to provide a version of the KCS viewer, called 3D-TV, from the Windows Store. Because of certification requirement 3.1 (V5.0)

“Windows Store apps must not communicate with local desktop applications or services via local mechanisms,..”

3D-TV has to connect to a KinectColorDepth server application on another PC. In practice, the network bandwidth that is required to transfer uncompressed Kinect depth and color data over Ethernet LAN using TCP is about 460Mbit/s, see e.g. the blog post on the jitter filter. This is a lot, and we would like to reduce it using data compression.

This is the final post in a series of three on the Kinect Client Server system, an Open Source project at CodePlex, where the source code of the software discussed here can be obtained.

What Do We Need?

Let’s first clarify the units of measurements we use.

Mega and Giga

There is some confusion regarding the terms Mega and Giga, as in Megabit.

  • For files 1 Megabit = 2^20 bits, and 1 Gigabit = 2^30 bits.
  • For network bandwidth, 1 megabit = 10^6 bits, and 1 gigabit = 10^9 bits.

Here we will use the units for network bandwidth. The Kinect produces a color frame of 4 bytes and a depth frame of 2 bytes at 30 frames per second. This amounts to:

30 FPS x 640×48 resolution x (2 depth bytes + 4 color bytes) x 8 bits = 442.4 Mbit/s = 55.3 Mbyte/s.

We have seen before that the actual network bandwidth is about 460Mbit/s, so network transport creates about 18Mbit/s overhead, or about 4%.

Since we are dealing with data here that is streamed at 30 FPS, time performance is more important than compression rate. It turns out that a compression rate of at least 2 (compressed file size is at most 50% of uncompressed size), in about 5ms satisfies all requirements.

Data Compression: Background, Existing Solutions

Theory

If you are looking for an introduction to data compression, you might want to take a look at Rui del-Negro’s excellent 3 part introduction to data compression. In short: there are lossless compression techniques and lossy compression techniques. The lossy ones achieve better compression, but at the expense of some loss of the original data. This loss can be a nuisance, or irrelevant, e.g. because it defines information that cannot be detected by our senses. Both types of compression are applied, often in combination, to images, video and sound.

The simplest compression technique is Run Length Encoding, a lossless compression technique. It simply replaces a sequence of identical tokens by one occurrence of the token and the count of occurrences. A very popular somewhat more complex family of compression techniques is the LZ (Lempel-Ziv) family (e.g. LZ, LZ77, LZ78, LZW) which is a dictionary based, lossless compression. For video, the MPEG family of codecs is a well known solution.

Existing Solutions

There are many, many data compression libraries, see e.g. Stephan Busch’s Squeeze Chart for an overview and bench marks. Before I decided to roll my own implementation of a compression algorithm, I checked out two other solutions: The Windows Media Codecs In Window Media Foundation. But consider the following fragment from the Windows Media Codecs documentation:

It seems as if the codecs are way too slow: max 8Mbit/s where we need 442Mbit/s. The WMF codecs obviously serve a different purpose.

The compression used for higher compression levels seems primarily of the lossy type. Since we have only 640×480 pixels I’m not sure whether it is a good idea to go ‘lossy’. It also seems that not all versions of Windows 8 support the WMF.

LZ4, an open source compression library. This is a very fast compression library, see image below, from the website:

So, LZ4 can compress a file to 50% at over 400MByte/s. That is absolutely great! I’ve downloaded the source and tried it on Kinect color and depth data. Results are shown and compared in the Performance section.

The RleCodec

I decided to write my own data compression codec, and chose the Run Length Encoding algorithm as a starting point. Why?

Well, I expected a custom algorithm, tailored to the situation at hand would outperform the general purpose LZ4 library. And the assumption turned out to be correct. A prototype implementation of the RleCodec supported by both the DiscreteMedianFilter and creating a Delta before compressing data really outperformed the LZ4 reference implementation, as can be read from the performance data in the Performance section.

It only dawned on me much later that removing undesired effects (like jitter, by the DiscreteMedianFilter) and redundant information (already sent data, by taking the Delta) before compressing and transmitting data is not an improvement of just the RLE algorithm, but should be applied before any compression and transmission takes place. So, I adjusted my approach and in the performance comparison below, we compare the core RLE and LZ4 algorithms, and see that LZ4 is indeed the better algorithm.

However, I expect that in time the data compression codec will be implemented as a GPU program, because there might be other image pre-processing steps that will also have to be executed at the GPU on the server machine; to copy the data to the GPU just for compression requires too much overhead. It seems to me that for GPU implementations the RLE algorithm will turn out the better choice. The LZ4 algorithm is a dictionary based algorithm, and creating and consulting a dictionary requires intensive interaction with global memory on the GPU, which is relatively expensive. An algorithm that can do its computations purely locally has an advantage.

Design

Lossless or Lossy

Shall we call the RleCodec a lossy or lossless compression codec? Of course, RLE is lossless, but when compressing the data, the KinectColorDepth server also applies the DiscreteMedianFilter and takes the Delta with the previous frame. Both reduce the information contained in the data. Since these steps are responsible for enormous reduction of compressed size, I am inclined to consider the resulting library a lossy compression library, noting that it only looses information we would like to lose, i.e. the jitter and data we already sent over the wire.

Implementation

Algorithm

In compressing, transmitting, and decompressing data the KinectColorDepth server application takes the following steps:

  1. Apply the DiscreteMedianFilter.
  2. Take the Delta of the current input with the previous input.
  3. Compress the data.
  4. Transmit the data over Ethernet using TCP.
  5. Decompress the data at the client side.
  6. Update the previous frame with the Delta.

Since the first frame has no predecessor, it is a Delta itself and send over the network as a whole.

Code

The RleCodec was implemented in C++ as a template class. Like with the DiscreteMedianFilter, traits classes have been defined to inject the properties that are specific to color and depth data at compile time.

The interface consists of:

  • A declaration that take the value type as the template argument.
  • A constructor that takes the number of elements (not the number of bytes) as an argument.
  • The size method that returns the byte size of the compressed data.
  • The data method that returns a shared_ptr to the compressed data.
  • The encode method that takes a vector of the data to compress, and stores the result in a private array.
  • The decode method that takes a vector, of sufficient size, to write the decompressed data into.

Like the DiscreteMedianFilter, the RleCodec uses channels and an offset to control the level of parallelism and to skip channels that do not contain information (specifically, the A (alpha or opacity) channel of the color data). Parallelism is implemented using concurrency::parallel_for from the PPL library.

Meta Programming

The RleCodec contains some meta programming in the form of template classes that roll out loops over the channels and offset during compilation. The idea is that removing logic that loops over channels and checks if a specific channel has to be processed or skipped will provide a performance gain. However, it turned out that this gain is only marginal (and a really thin margin 🙂 ). It seems as if the compiler obviates meta programming, except perhaps for very complicated cases.

Performance

How does our custom RLE codec perform in test environment and in the practice of transmitting Kinect data over a network? How does its performance compare to that of LZ4?. Let’s find out.

In Vitro Performance

In vitro performance tests evaluate in controlled and comparable circumstances the speed at which the algorithms compress and decompress data.

Timing

In order to get comparable timings for the two codecs, we measure the performance within the context of a small test program I wrote, the same program is used for both codecs. This makes the results comparable and sheds light on the usefulness of the codec in the context of transmitting Kinect data over a network.

Algorithm

In comparing the RleCodec and LZ4, both algorithms take advantage of working the delta of the input with the previous input. We use 3 subsequent depth frames and 3 subsequent color frames for a depth test and a color test. In a test the frames are processed following the sequence below:

  1. Compute delta of current frame from previous frame.
  2. Compress the delta.
  3. Measure the time it takes to compress the delta
  4. Decompress the delta
  5. Measure the time it takes to decompress the delta
  6. Update previous frame with the delta
  7. Check correctness, i.e. the equality of the current input with the updated previous input.

We run the sequence 1000 times and average the results. Averaging is useful for processing times, the compression factor will be the same in each run. The frames we used are not filtered (DiscreteMedianFilter).

Let’s first establish that the performance of these libraries are of very different order than the performance of the WMF codecs. Let’s also establish that compression speed and decompression speed is much more than sufficient: as noted above 50Mbyte/s would do.

For depth data, we see that the RleCodec delivers fastest compression. LZ4 delivers faster decompression, and obtains stronger compression.

The RleCodec was tested twice with the color data. In the first test we interpreted the color data as of type unsigned char. We used 3 channels with offset 4 to compress it. In the second test we interpreted the color data as unsigned int. We used 4 channels, with offset 4. We see that now the RleCodec runs about 4 times as fast as with char data. The compression strength is the same, however. So, for color data, the same picture arises as with depth data: times are of the same order, but LZ4 creates stronger compression.

The difference in naked compression ratios has limited relevance, however. We will see in the section on In Vivo testing that the effects of working with a Delta, an in particular of the DiscreteMedianFilter dwarfs these differences.

We note that the first depth frame yields lesser results both for time performance and compression ratio. The lesser (de)compression speed is due to the initialization of the PPL concurrency library. The lesser compression ratio is illustrative of the effect of processing a Delta: the first frame has no predecessor, hence there is no Delta and the full frame is compressed. The second frame does have a Delta, and the compression ratio improves by a factor of 2.4 – 2.5.

In Vivo Performance

In Vivo tests measure, in fact, the effect of the DiscreteMedianFilter on the data compression. In Vivo performance testing measures the required network bandwidth in various configurations:

  1. No compression, no filter (DiscreteMedianFilter).
  2. With compression, no filter.
  3. With compression, with filter, with noise.
  4. With compression, with filter, without noise.

We measure the use of the RleCodec and the LZ4 libraries. Measurements are made for a static scene. Measuring means here to read values from the Task Manager’s Performance tab (update speed = low).

Using a static scene is looking for rock bottom results. Since activity in the scene- people moving around can be expressed as a noise level, resulting compression will always be somewhere between the noiseless level and the “no compression, no filter” level. Increase will be about linear, given the definition of noise as a percentage of changed pixels.

The measurements in the table below are in Megabits per second, whereas the table above shows measurements in Megabytes per second. So, in order to compare the numbers, if so required, the entries in the table below have to be divided by 8. Note that 460Mbit/s is 57.5Mbyte/s.

What we see is that:

  • Compression of the delta reduces network bandwidth width 13* (RLE), or 33% (LZ4).
  • Application of the filter reduces it further with 53% (RLE), or 39% (LZ4).
  • Cancelling noise reduces it further with 24% (RLE), or 23% (LZ4).
  • We end up with a compression factor of about 10 (RLE), or 22 (LZ4).

:-).

What do we transmit at the no noise level? Just the jitter beyond the breadth of the DiscreteMedianFilter, a lot of zeroes, and some networking overhead.

As noted above, the differences between the core RleCodec and LZ4 are insignificant compared to the effects of the DiscreteMedianFilter and taking the Delta.

Conclusions

Using the RleCodec or the LZ4 library we achieve tremendous compression, a compression ratio of 10 or 22 respectively , at lightning speed – over 1600Mbytes/s. We achieve this not so much by the compression algorithms, but by removing undesirable effects (jitter, by the DiscreteMedianFilter) and redundancy (already sent data, by taking the Delta).

ToDo

There is always more to wish for, so what more do we want?

Navigation needs to be improved. At the moment it is somewhat jerky because of the reduction in depth information. If we periodically, say once a second, send a full (delta yes, filter no) frame, in order to have all information at the client’s end, we might remedy this effect.

A GPU implementation. Compute shader or C++ AMP based. But only in combination with other processing steps that really need the GPU.

Improve on RLE. RLE compresses only sequences of the same token. What would it take to store each literal only once, and insert a reference to the first occurrence at reencountering it? Or would that be reinventing LZ?