Ah, C++, how much I didn't quite miss thee

In my current project, I'm writing quite a bit of C++ code. Granted, it's front-end C++ code with Qt, so it's not what one would call "vanilla C++", but it has been enough to remind me of what I like and what I really don't miss in the language. Let's try to build a list:

Things I didn't miss

  1. Unhelpful compiler error messages: C++ is a very powerful language, but quite low level, and it's the compiler that suffers most from this power. Sometimes it's quite clueless about pointing out what is actually wrong. I'll give an example (although I don't have the exact error message strings with me right now): I had added a lot of different lines of code that used some class, but barely touched that class's definition, and then suddenly I started getting errors when compiling the class, complaining that "QObject::QObject cannot access private member declared in class QObject". QObject is the parent class of my class. The error message pointed to a line of code in QObject that was something like:

    private:
        Q_DISABLE_COPY(QObject);

    So what could I do if the problem was in somebody else's code? And what did I do to cause this? Well, the answer was that I had accidentally written a line like this:

    MyClass obj = MyClass("construct-me");

    What it means is that it was creating a new instance of my class and then creating yet another one using the copy constructor. However, the copy constructor is private in QObject, so the compiler reported an error. The tricky thing is that it doesn't report the error in the code that was calling it, but within the QObject code. And there went 30 minutes of my life tracking this problem...

  2. Slow compilation: I think this is mostly because I've been doing a lot of Java lately using Eclipse, which has a pretty good incremental compiler that lets me, most of the time, not worry about compilation at all. I can finish writing something, click save and then click run, and it's running. There is no wait of multiple seconds, or sometimes minutes if I touch a file that has too many dependencies.
  3. No garbage collection: I'm not against the concept of not having a garbage collector. But it does make coding slower, especially when you want to build things that are a little bit more dynamic. You have to keep references to things around and remember that somewhere you need to release them. And depending on how the code goes, you might change that decision midway and get a nice crash from trying to delete the same object twice. The result is that I find myself writing less clean code, or over-copying objects, simply because I know that this way I can always track my references.
  4. Weaker library support: some C++ libraries are quite powerful, Qt included. However, it's very common for me to find myself in situations where I have to write my own solution for simple things, like JSON output, simply because the libraries that are out there don't quite support the data structures that I'm using, so it takes more code to keep converting between data structures than to write the whole thing myself.
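
To make item 1 concrete, here is a minimal sketch that reproduces the situation without Qt. NonCopyable is a hypothetical stand-in for a base class like QObject that forbids copying, and MyClass and demo() are invented names, so treat this as an illustration rather than what my real code looked like. The point is that direct initialization never asks for a copy constructor, while the "= MyClass(...)" form needs an accessible one on compilers older than C++17 (from C++17 on, that copy is elided and the line compiles).

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a base class (such as QObject) that forbids copying.
class NonCopyable {
public:
    NonCopyable() = default;
    NonCopyable(const NonCopyable&) = delete;
    NonCopyable& operator=(const NonCopyable&) = delete;
};

// Invented example class; inheriting from NonCopyable makes it non-copyable too.
class MyClass : public NonCopyable {
public:
    explicit MyClass(std::string name) : name_(std::move(name)) {}
    const std::string& name() const { return name_; }
private:
    std::string name_;
};

// The derived class inherits the restriction: it has no usable copy constructor.
static_assert(!std::is_copy_constructible<MyClass>::value,
              "MyClass should not be copyable");

inline std::string demo() {
    // Direct initialization: the object is built in place, no copy is requested.
    MyClass obj("construct-me");
    // MyClass copy = MyClass("construct-me");  // pre-C++17: needs the copy constructor
    return obj.name();
}
```

With the static_assert in place, the restriction is at least stated next to the class, which is more than the original error message gave me.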

What I like:

  1. Binary control: sometimes you just want to do some binary operations and know exactly how many bits you are sending and reading (very important when you are doing cryptographic signing and CRC calculations), so it's nice to know the size of the things you are handling, and to be able to move easily between the higher-level object representation and the actual bytes that are being stored.
  2. Binary understandability: this is actually a little bad at the same time. A couple of weeks ago I was trying to track down why something that I wanted to do wasn't actually working. The piece that wasn't working was inside a piece of third-party code (WebKit, to be more exact) and I wasn't quite able to tell what was going on. I had access to the source code, but not the same version that my libraries were built against. My solution: look at the generated assembly and then map it back to the code that I had. It wasn't that hard to do, and it gave me precise information about what was going on (a "feature" in WebKit) and partial information on how to get around it.
  3. No 10 layers of frameworks: that's probably something that is more typical of Java than of other languages. When you get into a "production-level" Java codebase that you don't know, you will usually find yourself in layers and layers of framework code that tie your code together. These frameworks greatly reduce the amount of code that you need to write to get some things done, but make understanding the execution flow quite hard. So working on an existing codebase can be daunting. In C++ there is some framework code too, but usually not as much (partly because of reasons #3 and #4 above). This makes it much easier to get into the code and figure out what is going on and when.
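
As a small illustration of the "binary control" point in item 1, here is a sketch of moving a value between its typed form and its exact bytes. The function names are invented for this example, and big-endian order is just an assumption about what the wire format wants.

```cpp
#include <array>
#include <cstdint>

// Serialize a 32-bit value into exactly four bytes, most significant byte first.
inline std::array<std::uint8_t, 4> toBigEndian(std::uint32_t value) {
    return {{
        static_cast<std::uint8_t>(value >> 24),
        static_cast<std::uint8_t>(value >> 16),
        static_cast<std::uint8_t>(value >> 8),
        static_cast<std::uint8_t>(value)
    }};
}

// Rebuild the 32-bit value from its four big-endian bytes.
inline std::uint32_t fromBigEndian(const std::array<std::uint8_t, 4>& bytes) {
    return (static_cast<std::uint32_t>(bytes[0]) << 24) |
           (static_cast<std::uint32_t>(bytes[1]) << 16) |
           (static_cast<std::uint32_t>(bytes[2]) << 8) |
            static_cast<std::uint32_t>(bytes[3]);
}
```

The fixed-width types from <cstdint> are what give the "I know exactly how many bits this is" guarantee; the shifts make the byte order explicit instead of depending on the machine.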

So, would I choose C++ as a preferred programming language? Probably not, but I do have some appreciation for people that actually do. What I'm not looking forward to is writing C for one of my future projects...

When text templating can fail you

There are lots of systems out there to generate HTML pages, emails and reports in which you set up a specific template and then a pre/post-processing step fills the "blanks" with data from a database or some other backing data structure. The problem with this approach is that it doesn't fail very gracefully. Because the original page is itself a valid output page, if the process that fills in the gaps misses one of them, it might not be able to tell and will just output things that look like gibberish. Just like the email that I received this morning from Babbel:

Remember what {Trans:LearningLang [ENG]} word for thank you is?

Apparently I'm not learning any language right now (I haven't really touched that site in a few months), so it couldn't find my current learning language translated to English. Instead of failing to send the email, because {Trans:LearningLang[ENG]} is valid text in the title of an email, the templating engine falls back to believing that that's what the user meant to provide, and gives me a funny-looking email.

What is the solution to this? Well, there are many ways templating can be made safer. The first is to disable fallbacks and actually throw exceptions when something that looks like a template "blank" turns out to have no backing data. The problem with this solution is that if you actually want to write text that looks like the template (very useful if you are trying to generate documentation for your engine), you now need to provide escaping mechanisms, and if the person forgets to escape their markup, they will get errors or very weird results.
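
A minimal sketch of that first option, under invented conventions: blanks are written as {Name}, a doubled "{{" is the escape for a literal "{", and renderStrict is a made-up function, not the API of any real templating engine.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Fill {Name} blanks from `values`; a doubled "{{" is an escaped literal "{".
// Throws instead of silently leaving an unfilled blank in the output.
inline std::string renderStrict(const std::string& text,
                                const std::map<std::string, std::string>& values) {
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] != '{') {  // ordinary character: copy it through
            out += text[i++];
            continue;
        }
        if (i + 1 < text.size() && text[i + 1] == '{') {  // escaped brace
            out += '{';
            i += 2;
            continue;
        }
        const std::size_t close = text.find('}', i);
        if (close == std::string::npos) {
            throw std::runtime_error("unterminated blank at offset " + std::to_string(i));
        }
        const std::string key = text.substr(i + 1, close - i - 1);
        const auto found = values.find(key);
        if (found == values.end()) {
            throw std::runtime_error("no data for blank {" + key + "}");
        }
        out += found->second;
        i = close + 1;
    }
    return out;
}
```

With this, the Babbel subject line would have raised an error at send time instead of reaching my inbox, and documentation can still show a blank by writing "{{Name}".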

Another option is to start not with plain text but with structured data. This provides a "type-safe" way of saying: this is text and this is a "blank". The problem with this approach is that you need a special UI for building your templates and not just a text processor. So things become much more expensive to maintain (because whenever you also have to maintain a UI, any project becomes a never-ending project, as the contexts where UIs are used are constantly changing and adapting to different input behavior and operating system choices).

My preferred solution is somewhere in between. I think you should be explicit about what you want to replace. At the beginning of your template, you can say something like:

{REQUIRED_REPLACE:*} or {REQUIRED_REPLACE:Trans:*}

This will make the engine fail if it sees a template entry that starts with the given prefix (or any entry at all, with the bare *) and has no data to fill it. You still need escaping mechanisms for the documentation use case (you probably need that anyway). The problem that this still doesn't solve is the case where somebody made a typo and, instead of starting their template entry with "{", they started it with, let's say, "(". In the middle of everything, they didn't realize it.
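
To show how little machinery the REQUIRED_REPLACE idea needs, here is a sketch under my own assumptions: the directive syntax is the one proposed above, entries are written as {Name}, and requiredPrefix and missingRequired are hypothetical helpers, not part of any existing engine.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Extract the prefix from a directive such as "{REQUIRED_REPLACE:Trans:*}".
// "{REQUIRED_REPLACE:*}" yields an empty prefix, which matches every entry.
inline std::string requiredPrefix(const std::string& directive) {
    const std::string head = "{REQUIRED_REPLACE:";
    const std::string tail = "*}";
    if (directive.size() < head.size() + tail.size() ||
        directive.compare(0, head.size(), head) != 0 ||
        directive.compare(directive.size() - tail.size(), tail.size(), tail) != 0) {
        throw std::invalid_argument("not a REQUIRED_REPLACE directive: " + directive);
    }
    return directive.substr(head.size(), directive.size() - head.size() - tail.size());
}

// List every {entry} in `body` that starts with `prefix` but has no value.
inline std::vector<std::string> missingRequired(
        const std::string& body, const std::string& prefix,
        const std::map<std::string, std::string>& values) {
    std::vector<std::string> missing;
    std::size_t open = body.find('{');
    while (open != std::string::npos) {
        const std::size_t close = body.find('}', open);
        if (close == std::string::npos) break;  // unterminated entry: stop scanning
        const std::string entry = body.substr(open + 1, close - open - 1);
        if (entry.compare(0, prefix.size(), prefix) == 0 && values.count(entry) == 0) {
            missing.push_back(entry);
        }
        open = body.find('{', close + 1);
    }
    return missing;
}
```

The engine would refuse to send anything when missingRequired returns a non-empty list, while entries outside the declared prefix keep today's lenient behavior.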

That's a trickier problem. A solution would require a spelling-correction-type warning system, which would look for things that look like a valid template entry except for one character. Or just look for common mistakes. This quickly becomes a hard problem if the templating system is very powerful and allows for a lot of different ways of representing templates (like many HTML-generating template frameworks). And in practice it will require a warnings report, which is also not straightforward to integrate with a build process (how do you ensure that people are actually looking at the report, for example?).
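
The "just look for common mistakes" variant can be sketched very cheaply. The function below is hypothetical and deliberately naive: it only flags places where a known entry prefix (such as "Trans:") shows up without a "{" right before it, and it will happily produce false positives on ordinary prose, which is why it belongs in a warnings report and not in a hard failure.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Report offsets where a known entry prefix (e.g. "Trans:") appears without
// the expected '{' right before it, as in "(Trans:LearningLang}".
inline std::vector<std::size_t> suspiciousEntries(const std::string& body,
                                                  const std::string& prefix) {
    std::vector<std::size_t> offsets;
    if (prefix.empty()) return offsets;  // nothing to look for
    std::size_t at = body.find(prefix);
    while (at != std::string::npos) {
        if (at == 0 || body[at - 1] != '{') {
            offsets.push_back(at);
        }
        at = body.find(prefix, at + 1);
    }
    return offsets;
}
```

A check like this only works when entries share a recognizable prefix, so it fits nicely with the REQUIRED_REPLACE declaration, which already tells the engine which prefixes matter.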

Anyway, it's funny how simple emails can trigger long posts when I'm in an introspective mood.

An ode to immutability

(WARNING: this is a mostly technical post. If you don't really care about that, look at the categories of the post before spending your time trying to understand what I mean by "immutability")

I've been working lately on a very large and interesting project at work that is taking a lot of my brain cycles. There are lots of technical challenges that I'm trying to resolve, the biggest of them being performance. Lots of weekends and late nights later, it's still here...

Anyway, for the last week or so, I was working on a piece of code to improve performance of feature extraction from large amounts of text. My features are mostly textual and what I need to do is identify where in the text terms appear. Sounds easy, right? Well... I'll skip the details of my approaches so far to try to jump right into my immutability ode here on my latest approach.

So I have an iterator over the parsed string that identifies where in the graph it is. The iterator pattern is easy, but in its by-the-OO-patterns-book form it is not stateless at all. The challenge is that there are parts of the string that I could interpret in multiple ways, i.e., try multiple directions in my parsing state machine.

At this point a CS purist will say that if I don't really need to have any memory (which I don't), I could just increase the size of my parse graph and still keep moving in one direction only. That is true, but that would basically blow up the size of my graph by a factor of N^2, which is not a very pleasant solution.

So I found myself changing my code to be recursive and creating an Iterator.clone() which would copy the state of the iterator, so that I could handle those different cases.

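void handleEntry(ItemIterator it) {
    if (!it.hasNext()) {
        return;  // end of input (assumes ItemIterator offers hasNext())
    }
    Item item = it.next();
    if (shouldSplit(item)) {
        // Explore the alternative interpretation on a copy of the iterator state.
        handleEntry(split(it.clone()));
    }
    handleEntry(it);
}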

As this became more complicated, deciding whether to clone the iterator or not became trickier, and the decision had to be made after the "handleEntry(it)" call (because it depended on the result of handleEntry()), so I found myself just always cloning my iterators. Then, if I was cloning it all the time, why not make it immutable? And that's what I did, and it made this code cleaner, with only one minor detail: the iterator became a little "fatter":

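void handleEntry(ItemIterator it) {
    if (!it.hasNext()) {
        return;  // end of input (assumes ItemIterator offers hasNext())
    }
    it = it.next();  // next() returns a new iterator; the old one is untouched
    Item item = it.getItem();
    if (shouldSplit(item)) {
        handleEntry(split(it));  // no clone needed: split() cannot disturb "it"
    }
    handleEntry(it);
}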

So now the Iterator actually keeps the Item around rather than just producing it and moving on. Implementation-wise, though, that doesn't really affect things much.

The interesting thing is that not only did this code become cleaner, but I suddenly had code that I could switch from being recursive to being iterative without having to worry about state management: the piece that contained most of the state could be pushed onto a stack and retrieved later without my having to worry about it being in a different state by then. It saved me a lot of work! So go immutable objects! Maybe I should start coding in Haskell or Erlang (I played with Erlang before and it was interesting - more on that some other day)...
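
For the curious, here is roughly what the recursive-to-iterative switch looks like. This is a C++ sketch rather than my actual code (to stay in the same language as the examples in the other posts), and ItemCursor, isAmbiguous and countInterpretations are invented for it. A real split() would adjust the parser state for the alternative reading; here both branches simply advance, and the function just counts how many complete readings exist.

```cpp
#include <cstddef>
#include <stack>
#include <string>
#include <vector>

// Immutable cursor: next() returns a new cursor and never changes this one.
class ItemCursor {
public:
    ItemCursor(const std::vector<std::string>* items, std::size_t pos)
        : items_(items), pos_(pos) {}
    bool atEnd() const { return pos_ >= items_->size(); }
    const std::string& item() const { return (*items_)[pos_]; }
    ItemCursor next() const { return ItemCursor(items_, pos_ + 1); }
private:
    const std::vector<std::string>* items_;
    std::size_t pos_;
};

// Hypothetical rule for the sketch: an item containing '|' can be read two ways.
inline bool isAmbiguous(const std::string& item) {
    return item.find('|') != std::string::npos;
}

// Iterative version of the recursive walk: pending branches live on a stack.
inline std::size_t countInterpretations(const std::vector<std::string>& items) {
    std::size_t complete = 0;
    std::stack<ItemCursor> pending;
    pending.push(ItemCursor(&items, 0));
    while (!pending.empty()) {
        const ItemCursor cursor = pending.top();
        pending.pop();
        if (cursor.atEnd()) {
            ++complete;  // one full reading of the input
            continue;
        }
        if (isAmbiguous(cursor.item())) {
            pending.push(cursor.next());  // the alternative reading of this item
        }
        pending.push(cursor.next());      // the default reading
    }
    return complete;
}
```

Because a cursor can never change after it is created, whatever sits on the stack is guaranteed to be in the same state when it is popped, which is the whole trick.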