Thursday, April 3, 2008

Alternatives to Object Oriented Programming

Object oriented programming is extremely popular these days. It's so popular that some people aren't even aware of alternatives to it. The reason it's popular is clear: OOP works well a lot of the time. Traditional object oriented programming styles have some significant disadvantages in some circumstances, however. Some of the alternatives are worth looking into when those disadvantages become awkward.

As part of the static analysis project that I'm working on, I'm trying to use alternatives wherever I can. In part, this is because the advantages are substantial for the project I'm working on, but I'm also curious how far it can go and what will happen when I push it too far.

So far, I love it. Inheritance is a relationship between classes that is only a small step from C++ friendship. The derived class becomes extremely tightly coupled to the base class, and, in some cases, the base class can even become coupled to the derived class. The temptation to add functionality to the base class can be large as well.

I'll talk about two ways of partially avoiding inheritance. Both of them rely on compile time polymorphism and cannot (or should not) be used if runtime polymorphism is required.

How often do OO programmers actually need the runtime polymorphism that they enjoy so much? Certainly it's needed some of the time, but it's not nearly as useful as it looks at first. For example, my static analysis project contains a different class for each of several dozen different Lua constructions. This sounds a lot like a case for inheritance, but it's only got one virtual function in the whole project, and that's there only to merge the C and C++ code. If the whole thing had been in C++, it wouldn't have been needed.

Free Functions


Many programmers in the OOP world will happily add new functions to the public interface of classes whenever they can. For example, the C++ string class that PC-Doctor uses has a public member function to return a uuencoded version of the string. This is an extreme case, but it's still worth looking at.

In this case, we've got some functionality that we want to give to a small number of users of a class. There aren't many places in our code base that use this function, and essentially everything uses the string class. Furthermore, the function can easily be implemented using the existing public interface to the string. In this case, it's pretty obvious that everyone would be happier if the function was a free function. As a free function, most users of the string won't ever have to see it. In fact, they don't have to include the appropriate header file or even link to the appropriate module. The interface to string becomes a bit easier to understand for the majority of string users, too.

Furthermore, if you wanted to implement uuencoding for multiple types of strings, you could still do it. For example, if you had different types for UTF-8 and UTF-16 strings, you could overload the free function for both of them.
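
Here's a minimal sketch of what that might look like. The string types and the uuencode signatures are hypothetical, not PC-Doctor's actual classes:

struct Utf8String { /* ... */ };   // hypothetical UTF-8 string type
struct Utf16String { /* ... */ };  // hypothetical UTF-16 string type

// Overloaded free functions: neither string class needs to know anything
// about uuencoding, and only the code that wants it includes this header.
Utf8String uuencode( const Utf8String& s );
Utf16String uuencode( const Utf16String& s );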

As I said, this was an extreme case. There are disadvantages to free functions, however. Let's look at the other extreme for a bit. For this, let's see what Herb Sutter has to say about free functions in Exceptional C++ Style. He claims that you should construct your interfaces to classes by exposing the minimal set of public member functions and making everything else a free function. His argument in the book also looks at a string class, std::basic_string.

Strings, however, are a bad example to generalize from. Some of his arguments get much weaker if you look at a highly specialized class that will only be used in a few places in a single program. Since the vast majority of code that I write is exactly that, I'd rather study this case.

With that in mind, I've made the 40 classes that form the abstract syntax tree for Lua into classes that only have constructors, destructors, and access to internal data. This is, perhaps, a bit more than even Herb Sutter would recommend. I give public access to all of the member data so that there isn't a chance that any member functions could be needed.

I've written most of the code, and it works fine. However, it hasn't been through the maintenance phase yet, and it's not clear that it will survive it.

However, I'm not going to change the way I write normal application code. At PC-Doctor, I'll write many more classes that get used by only a few other classes. I won't create many free functions for them. Why?

First, free functions clutter the namespace they reside in. I only want that function to be seen by the small number of programmers that might be interested in the class I just wrote. Considerably more people will see the namespace that the class is in.

Let's look at an example. Suppose I want to write a function that appends an alert to a collection of alerts that will be shown to the user. The function might be named "append". If this is a free function, it'll get lost. I want users of my AlertCollection class to know that they can append alerts to it. I don't want them to have to dig around. My users are not going to be using it for long, and they don't want to remember the details later. They just want a complete catalog of everything they can do so they can quickly do what needs to be done.
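
To make the trade-off concrete, here's a sketch of the two options. AlertCollection and Alert are hypothetical names, not our real API:

class Alert { /* ... */ };

class AlertCollection {
public:
    // Option 1: a member function. It shows up in the class's interface,
    // so users browsing the header can't miss it.
    void append( const Alert& alert );
    // ...
};

// Option 2: a free function. The class stays minimal, but users have to
// know to go looking for this function somewhere else.
void append( AlertCollection& alerts, const Alert& alert );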

If a programmer at PC-Doctor became a fan of handling alerts and wanted to actually remember my API, then putting some of the functions as free functions is, again, a problem. There's one extra bit of information that the user has to remember for each function.

Finally, languages like Java and C# can't have free functions. This is their loss, unfortunately. While I can't give a hard and fast rule for when to use free functions, I can say that they are a valuable tool when developing some APIs.

Visitor Pattern


Well, that was more than I wanted to write in one week. (Remember, I'm a programmer!) The visitor pattern will have to wait until next week. It's probably worth a whole post on its own anyway. It's tricky to use well, but it's occasionally extremely effective.

This originally appeared on PC-Doctor's blog.

Friday, March 21, 2008

Requirements of a Language to Statically Analyze

For those of you just tuning in, I'm working on a project to statically analyze an untyped code base to try to bring some of the advantages of typed languages to the code base.

The first step is to figure out which language I should write a static analysis tool for. This is obviously an important decision with quite a few implications both at the beginning, during the creation of the parser, and at the end when we try to find an actively developed code base to look at.

The first requirement is that it be a relatively popular, untyped language. Let's look at some languages:

1. JavaScript, ECMAScript, or JScript. This is used by millions.
2. Lua. This is a popular language to embed in applications. I know it extremely well, too.
3. Python. This is really popular, and it's a pretty clean language, too.
4. Ruby. This is getting to be a popular language. (I really like the language, too.)
5. PHP. This is also a popular language.
6. Perl. I've gone through life without having to learn COBOL, Fortran, or Perl. I'm a really happy guy. Do people still use it? I wouldn't know.

Yes, there are some others, but I'll limit my discussion to these.

I'll reject Perl immediately because it's designed to be complex. This is just a hobby project, and I'm not going to be able to spend lots of time dealing with unneeded complexities. I also have no interest in learning it.

I'm also going to reject PHP for similar reasons. I just have no interest in it. Now we're down to harder choices.

We're down to four languages: Ruby, Python, JavaScript, and Lua. These are all languages that I've enjoyed using, and some of them I know pretty well.

Let's start with Ruby. Ruby is a wonderful scripting language. I love it for its terse, expressive syntax. I'm a bit worried by the quantity of syntactic sugar in it, but most of that can be stripped away after parsing it.

However, Ruby code is frequently written in a different style than conventional, object oriented code. Ruby programmers are used to doing things that would be utterly opaque to any reasonable static analysis tool. Here's an extreme example of some Ruby on Rails code that scares me:

lang_code = 'en'
obj = MyModel.find_by_name 'name of obj'
result = obj.send :"language_#{lang_code}"

For the non-Ruby geeks out there, this pulls a MyModel object out of the my_models table in the database. The object is populated with some members based on the database schema. That's mildly scary all by itself, since I'd rather not spend my time dealing with DB schemas. However, the last line is even worse! This calls the language_en member function on obj, but the only way for a program like mine to figure that out is to look at the value of lang_code. I'd much rather ignore values and only look at types. (Many scripting languages treat functions as ordinary values that can be moved around. As we'll see in future installments, this isn't always a big problem.)

Other languages can do this, too, but Ruby programmers enjoy doing it a bit too much.

Python is another well designed language that would be fun to use. I can't claim that I'm an expert at it, so there may be other problems with it. However, the standard libraries for it scare me a bit. There are a huge number of them. As far as I can tell, Python programmers use whatever they feel like using from these, so an automated way of deducing the argument types and return values of functions would be required. I'm sure that a doxygen summary of the functions could be found somewhere, but I'd prefer to not rely too heavily on infrastructure that doesn't sound like much fun to work on.

Lua is an extremely simple language that's designed to be embedded in applications. As such, it avoids both the large amount of syntactic sugar in Ruby and, at least in most cases, the huge library of Python. However, it does mean that you'd have to find the correct application of Lua to look at. Let's assume that such an application exists.

Like most (or all?) of these languages, Lua treats functions as first class values. Because of this, it is possible to pick the function to call based on a variable. Here's an example:

selector = 'a'
obj = { a = function(num) return num+1 end, b = function(str) return str..'!' end }
obj[selector]('string')

The last line is the worrisome one. This calls a function, passing a string into a function that probably requires a number. (There are ways to modify the behavior of types in Lua. Off the top of my head, I don't remember if the latest versions of Lua allow strings to have overloaded operators.) If we assume that this isn't possible, we'd like to detect such an error. That's extremely hard.

Realistically, however, if we reject Lua because of this problem, then we'll also reject all other scripting languages that treat functions as normal values. As far as I know, that's all of the ones on my list. Fortunately, this behavior is not a frequent pattern in the Lua code that I've seen and written. There are other ways of accomplishing the same task in Lua that are significantly simpler, more powerful, and, therefore, more popular.

The last language on the list is JavaScript. JavaScript can be used in a lot of different ways, but by far the most common and exciting way is in client side scripting on web pages. This provides an enormous body of code that could be analyzed. In addition, there are other tools out there that create type safe JavaScript by cross compiling from another language. This code would make an excellent test case for my tool.

JavaScript doesn't suffer from the same problems that Ruby and Python have, either. While you can do what I described in the Lua example above in JavaScript as well, it's not especially common for exactly the same reason. (Both Lua and JavaScript support closures. Closures are much more powerful, and, even though Internet Explorer has some nasty bugs related to them, they are extremely popular.)

Web browsers do have a mildly large library embedded in them, but the security restrictions are so large that the library is inherently limited. ActiveX controls and Mozilla plugins extend the API in annoying ways, but the body of code out there is so large and varied that pages that use plugins that I don't want to support can be thrown out.

The one real disadvantage of JavaScript is the optional semicolon rule. The parser is required to insert semicolons in a bunch of places, and this does sound mildly annoying to implement. It can't be that bad, though. (Lua has a similar rule involving newlines, but it's much more obscure.)

Up until now, JavaScript looks like an exciting possibility. Unfortunately, I do a lot of work here at PC-Doctor using JavaScript, and I'd rather do something completely different. I really don't want a useless hobby project to feel like work! If I need another excuse, it's that others have already written limited tools like this for JavaScript.

Lua 5.1 is it.

This originally appeared on PC-Doctor's blog.

Tuesday, March 18, 2008

Mixing C and C++ Code

One of the early steps to my static analysis project is to parse the language that I'm going to analyze. I'd like to form a relatively clean Abstract Syntax Tree that I can play with later.

C++ has a lot of advantages over C for this sort of thing. It's got an enormous amount of machinery that can be used to build high level abstractions without incurring more runtime overhead than you're willing to pay for.

However, C is also still a useful language. For example, it's extremely simple when compared to C++. Perhaps in part because of this, a lot of code generators out there still create C code. If you want to use one of these code generators with anything but C or C++, you can't do much to tweak the output they generate. C++ could well be unique in its ability to mix the old style C code with modern programming techniques.

I've been using some code generated by Flex and ACCENT recently, and it's nice being able to use a lot of C++'s features with the code. (ACCENT still uses K&R function definitions. It's pretty old and primitive, I guess.)

I've worked on a few projects that required a mix of the two, but I can't claim that I've ever thought about exactly how it should be done. Part of the problem is that it's extremely easy to mix the two. C++ allows much more integration with a C library than most languages. In C#, for example, about all you can do when dealing with a C library is make calls into it. C++ allows you to define types that can be used by the C library, you can play with #defines just like a C programmer can, and you can pass function pointers around without worrying whether they're from C or C++. (Though std::tr1::function is better!) You can even include random header files and make calls into the internals of a library.

Clearly, some thought should be done before mixing the two. I'm embarrassed to say that I haven't ever done this deliberately.

Currently, I'm using Flex, a library that exposes more than simply an API to call into. You are supposed to compile the code into your module, and it is possible to modify the code by creating some C compatible types and passing them in through #defines.

It's all very clunky and C-like, but it works great, and that's all I care about today. I'm happy to use it, but I don't really want to pass in a POD type. I want a full fledged C++ class to be used by Flex.

Mixing Flex's generated code and modern C++ code is entirely possible, but there is a lot of type safety that gets thrown away when going to the C code. For example, Flex thinks that each programming language construct and all of its parts are all made up of Symbol* objects. In C++, I don't want to have an abstract base class that defines all behavior that all of these constructs might ever need. Instead, each different type of construct is a different type in C++. This means, for example, that it's not possible to pass an object that isn't an expression to a function that requires an expression. I like that a lot, but Flex wants all of these objects to be the same.

The C code, therefore, will go through a facade that does a bunch of careful type conversions and then passes the resulting, typed objects through to the C++ code.

This is a frequent theme in other projects that I've done that involve mixing C and C++ code. C++ is capable of supporting a much richer set of types than C is, and there always seems to be some work involved in preserving C++'s type expressiveness in C.

Incidentally, the facade is incredibly ugly at the moment. There's lots of typeid operators and long if..then..else if...etc constructs generated by preprocessor macros. However, the rest of the C++ code always knows exactly what the type is, so it's all worth it.

Flex and Accent's generated code do have to perform operations on these objects, however. They end up using a long list of "member functions" that always take Symbol pointers. For example, there might be a function to create a C++ expression object by taking a unary operator and another expression object. In Accent's world, this looks like this:

Symbol* CreateUnaryOperation( Symbol* op, Symbol* expression );

That's essentially a constructor for the Symbol* object. (Flex doesn't have to worry about whether or not it's a pointer or a struct. To Flex, it's just a copyable typedef.)
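
Here's a rough, self-contained sketch of what the facade can look like. The type names are hypothetical, and I've used dynamic_cast where the real code uses macro-generated typeid chains:

#include <cassert>

// Hypothetical typed C++ AST nodes; the real project has several dozen.
struct AstNode { virtual ~AstNode() {} };
struct Expression : AstNode {};
struct UnaryOperator : AstNode { char symbol; };
struct UnaryOperation : Expression {
    UnaryOperation( UnaryOperator* op, Expression* operand )
        : op( op ), operand( operand ) {}
    UnaryOperator* op;
    Expression* operand;
};

// The opaque, untyped handle that the generated C code passes around.
struct Symbol { AstNode* node; };

// Facade conversions: fail loudly if the wrong kind of node shows up.
Expression* toExpression( Symbol* s )
{
    Expression* e = dynamic_cast< Expression* >( s->node );
    assert( e != 0 );
    return e;
}

UnaryOperator* toUnaryOperator( Symbol* s )
{
    UnaryOperator* op = dynamic_cast< UnaryOperator* >( s->node );
    assert( op != 0 );
    return op;
}

// The C-callable "constructor" that the generated parser sees.
extern "C" Symbol* CreateUnaryOperation( Symbol* op, Symbol* expression )
{
    Symbol* result = new Symbol;
    result->node = new UnaryOperation( toUnaryOperator( op ),
                                       toExpression( expression ) );
    return result;  // ownership quietly passes to the generated C code
}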

This is another common theme of mixing C and C++. You don't have to stop object oriented programming in C. You just have to wrap it all up in a bunch of free functions. It's even possible to have virtual functions by manually creating the vtbl.
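
For the curious, here's a minimal sketch of that technique with hypothetical shape types (nothing from this project). It's plain C, which also compiles unchanged as C++:

typedef struct Shape Shape;

/* The hand-rolled vtbl: a struct of function pointers. */
typedef struct ShapeVtbl {
    double (*area)( const Shape* self );
} ShapeVtbl;

struct Shape {
    const ShapeVtbl* vtbl;  /* every "derived" object carries its vtbl */
};

typedef struct Circle {
    Shape base;    /* "inheritance": the base must be the first member */
    double radius;
} Circle;

static double circleArea( const Shape* self )
{
    const Circle* circle = (const Circle*)self;
    return 3.141592653589793 * circle->radius * circle->radius;
}

static const ShapeVtbl circleVtbl = { &circleArea };

/* The "virtual call": dispatch through the function pointer. */
double shapeArea( const Shape* shape )
{
    return shape->vtbl->area( shape );
}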

It also nicely displays another annoying theme of mixing C and C++. In C++, you always know what the rules for the lifetime of an object are. If a function returns a value directly, then the caller gets control over the lifetime. If the function returns a smart pointer, then the smart pointer gets control over the lifetime. (Smart pointers also come with a set of rules about what the caller is allowed to do with the object.) If a reference is returned, then the object's lifetime is controlled elsewhere. It's all wonderfully unambiguous.

In C you always get either a value or a pointer. It's true in my project, too. All of my fake constructor functions return a Symbol*. I really want it to be an auto_ptr, though! If Flex or Accent were to misplace a pointer, then it would be lost forever. So far, they don't seem to do that, but it's certainly filled with ambiguity.

I haven't come up with a satisfactory solution to this problem. My current solution is to always do exactly the same thing with all of the objects returned by my facade's functions. If the programmer has to remember a set of rules at all times, then it'd better be a simple set of rules.

It looks like this is the beginning of officially working on statically analyzing an untyped language. Next week, I might talk about what language I picked and why. Or perhaps I'll talk about my crazy idea for a completely clean decorator pattern implementation so that I don't have to mess up my nice, clean AST.

This originally appeared on PC-Doctor's blog.

Monday, March 10, 2008

Static Analysis of Untyped Languages

A while ago I wrote about whether or not untyped languages were a good idea or a bad idea. I didn't come to any real conclusions at the time, and it's bothered me. I'd like to outline a way to gather some real conclusions now. I still won't be able to come to a conclusion, but I think this approach sounds interesting.

There are a number of advantages and disadvantages to untyped languages. One of the disadvantages is that, because variables have no type, you can assign the wrong type to them. A strongly typed language can avoid most of this problem. (You might still have problems involving type conversion, but a good programmer or a strict language can avoid those problems.)

I'd like to propose a tool that can analyze some untyped code without running it. (This is called static analysis.) The tool will try to deduce the types of variables. In theory, it is possible to analyze what possible types a value has at any given time. We couldn't be as strict as a typed language, but we can make some conclusions. For example, let's look at the following Python code:

class Duck:
  def __init__(self):
    self.hungerLevel = 0

daffy = Duck()
feed(daffy)

The first few lines define a type with a single numeric member variable. We can tell this before we run the program. The last two lines create a Duck and pass it to the feed function. We can tell at compile time that daffy is a variable of type Duck and feed is a function that will receive Ducks. Functions may receive different types at different times, but we can record all of the possible inputs and make sure the types are compatible at each point.
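
Here's a sketch of the bookkeeping this implies, in C++ with hypothetical names: for each function, record every type observed at each call site, so a checker can later verify each use against the recorded set.

#include <map>
#include <set>
#include <string>

typedef std::set< std::string > TypeSet;             // e.g. { "Duck" }
typedef std::map< std::string, TypeSet > CallTable;  // function -> argument types

// Record one observed call: somewhere, feed() was passed a Duck.
void recordCall( CallTable& calls, const std::string& function,
                 const std::string& argumentType )
{
    calls[ function ].insert( argumentType );
}

// Later, a checker walks each function body and verifies that every
// operation it performs is legal for all of the recorded argument types.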

This cannot work perfectly, of course. For example, there might be code paths that are simply impossible to go through because of the value of variables rather than the type of variables. Determining all of these cases cannot be done without running the program. (My theoretical computer science is pretty weak, but I vaguely remember some theorems that could prove that.) However, if there's a program out there that is type safe but requires some of the code paths to be impossible to reach, it's probably not a particularly well written program.

I'd like to write something like this, but haven't fully justified it, yet. If I had such a tool, I'd try to run it on a body of existing, open source, untyped code.

I can imagine two possible ways to use this to strongly suggest that untyped languages are dangerous despite the claims of untyped language fans. First, real problems could be found in the code base. If the author(s) of the untyped code would admit that problems were there, then, almost by definition, it seems as though the problems exist. Second, the author(s) could become a fan of the tool and use it during development. Perhaps the final code doesn't have type safety problems, but the code has those problems in its early stages. This could also be used as evidence that there's a problem with untyped languages.

Unfortunately, it's hard to imagine how this experiment could prove that untyped languages don't have a problem. Furthermore, it's possible that a tool like this works well enough that the problem largely goes away after programmers start using it. (The latter seems unlikely, but it's certainly a potential complication.)

Don't expect rapid progress here, but I'd guess that, at the very least, I'll have some follow up posts about this project as it progresses.

This originally appeared in PC-Doctor's blog.

Wednesday, February 13, 2008

Tools for Writing a Parser

People have been writing tools to generate scanners and parsers for decades. YACC is probably the most famous. It was created in the 1970s, and, since it stands for Yet Another Compiler Compiler, it probably wasn't the first attempt at the problem.

YACC is a pain to use, though. It uses a parsing algorithm that has great worst case performance but causes massive headaches for programmers. Essentially, you have to ensure that your grammar conforms to the LALR(1) rules. You'd better know what that means before you use it, too!

Well, I wanted to accomplish a task instead of learning about parsing algorithms. I looked around for alternatives. Wow! There's a lot of possibilities out there!

If you look at that link, though, most of them use the same algorithm as YACC or a similar one. I ended up looking at two that didn't: Accent and Spirit.

I've used Spirit a few times before, and it works great. It's a great C++ library created by Joel de Guzman and others. It lets you write the scanner and parser using a C++ expression template library. It's extremely flexible, fairly slow, and mostly easy to use. I decided against it, though, because I was curious what it was competing against.
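
As a taste of what Spirit looks like, here's a parser for a comma-separated list of numbers, close to the canonical example from the classic Spirit documentation:

#include <boost/spirit/core.hpp>
using namespace boost::spirit;

// The grammar is ordinary C++: real_p matches a number, >> sequences two
// parsers, * means zero-or-more, and space_p skips whitespace between tokens.
bool parseNumberList( const char* input )
{
    return parse( input, real_p >> *( ',' >> real_p ), space_p ).full;
}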

Accent appears to be its major competition if I didn't want to learn about LL(1) grammars and left/reduce vs reduce/reduce conflicts. (Yeah, I got that wrong. Feel free to prove me wrong by explaining why in a few simple sentences!)

Accent is a fairly small program that generates some pretty nice C code. (I tried to embed some C++ in the semantic actions and failed. Just accept that you'll have to manage memory manually for a few moments.) Anyway, it's pretty easy to use. On my first try, I created a grammar that turned out to be "ambiguous". I didn't bother to figure out its algorithm. I just added a bit of logic to disambiguate it, and I moved on with my life. Very nice!

Oh, one more thing. Spirit uses the same technology to build both the scanner and the parser. It's really pretty slick. If you're creating an AST, it's even possible to make a trivial scanner and do both the tokenizing and the parsing in one grammar. This really slows things down a lot, but it is possible! (Don't use Spirit if speed is your goal. Code generators are still better. Spirit 2.0 may change that, though. We'll see.)

Anyway, Accent uses Flex to generate the scanner. It's another file format you'll have to learn, but it's extremely simple. Spirit isn't really gaining much there.

Which one is better? Spirit is incredibly flexible, but Accent produces much faster code. They're both pretty easy to use.

This originally appeared in PC-Doctor's blog.

Tuesday, February 5, 2008

Interfaces, Friends, and the .NET Framework

The .NET Framework has a lot of really great things in it. I've just started playing with a few corners of it, and I love the amount of stuff that it's got in it. Some things really irritate me, though, and it's a lot more satisfying to talk about that stuff!

C# and the CLR make it really hard to hide information. First of all, the .NET framework is built around inheritance. Everything is inherited from something else, and if you want to extend an object, then you're going to inherit, too. Inheritance hides almost nothing, but you already knew that, and, presumably, you don't use it as much as the programmers in Redmond.

That's not what I want to talk about here. I'm going to complain about the member access rights that C# and the CLR support.

In C#, there are four different member access rights. The usual private, protected, and public member access rights are supported. They also support "internal", which is the same as not exporting a function from your DLL in C++. The CLR supports one additional one called "Family and Assembly" which is the same as a protected member that is not exported from the DLL.

It's missing two extremely useful access rights, however: Java's package level scope and C++'s friendship. Both of these allow a programmer to grant access to a limited number of functions and classes. Java's package level scope is the optimal one, in my opinion. This allows a programmer to give limited access to a limited number of classes. Friendship allows complete access to individual classes and functions.
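
For reference, here's what friendship looks like in C++, with hypothetical class names:

class Parser;  // forward declaration

class SymbolTable {
public:
    // ... the public interface, visible to everyone ...
private:
    friend class Parser;  // Parser alone gets complete access to the internals
    int internalState_;
};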

In the C# world, you're expected to grant public access to functions that should only be used by one other function outside your class. You're supposed to be happy about it, too!

This originally appeared on the PC-Doctor blog.

Monday, January 14, 2008

Exception Safety in C#

C# vs. C++


C# tries to perform resource management through finally clauses and the garbage collector. This is radically different from C++'s approach. C++ executes destructors (which do all of the resource cleanup in modern C++) in a deterministic order.

While exception safety is certainly possible in C#, it is potentially error-prone. It frequently relies on the user of a class putting the correct code in a finally block. (This is code that the compiler frequently figures out in C++/CLI, so it's not a problem with the CLR.) C++, on the other hand, relies on the author of a class to put the correct code in the constructor and destructor.

C# requires that resource management be done every time a programmer uses a class. C++ requires that it be done once. This lies at the heart of my complaint with C#, but it gets worse, as we'll see.

To be fair, C# does handle one resource automatically. Memory management doesn't require any work since it's all cleaned up by the garbage collector. All other resources require some extra effort on the part of the programmer. C++ requires that programmers put all memory resources inside an appropriate smart pointer.
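
For contrast, here's a minimal C++ RAII sketch with a hypothetical File class: the cleanup is written once, in the destructor, by the class author.

#include <cstdio>
#include <stdexcept>

class File {
public:
    explicit File( const char* path )
        : handle_( std::fopen( path, "r" ) )
    {
        if( handle_ == 0 )
            throw std::runtime_error( "open failed" );
    }
    ~File() { std::fclose( handle_ ); }  // runs even during exception unwinding
private:
    File( const File& );             // non-copyable (pre-C++11 idiom)
    File& operator=( const File& );
    std::FILE* handle_;
};

void useFile()
{
    File file( "data.txt" );
    // ... code that may throw; no try/finally needed ...
}   // file's destructor closes the handle deterministically here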

The conventional C# approach


Code that allocates a non-memory resource in C# is supposed to look something like this (using a FileStream as a concrete example):

// Declare all variables that may require cleanup at the top of your function.
// Also include enough state information that you'll know how to undo
// partial changes that were made.
FileStream stream = null;
try {
  // Use the variables.
  stream = new FileStream( "data.txt", FileMode.Create );
  stream.WriteByte( 42 );
} catch( Exception ) {
  // Roll back any changes that were made, then let the exception continue.
  throw;
} finally {
  // Call Dispose on all of the objects created that implement IDisposable.
  if( stream != null ) {
    stream.Dispose();
  }
}

Needless to say, this sucks. It requires that you know which objects require disposal. It forces you to declare all of your variables at the top of the function and move them away from where they are used. It's also a lot of extra text that makes it harder to figure out what the function is actually doing.

However, assuming that the programmer didn't screw up, it will work.

What happens when someone screws up?


Unfortunately, the burden of cleaning up resources is placed on the user of the resource rather than the implementor of the class that handles the resource. This means that you can't be sure that Dispose will be called every time that it has to be called. Furthermore, it means that resource cleanup has to be performed correctly more than once per class.

The CLR has anticipated this problem to some extent. It allows a class to declare a finalize function which gets called by the GC right before the object gets destroyed. You won't know when this function will be called, but at least your resource will get cleaned up eventually if you use this.

Is Finalize enough?


Unfortunately, Finalize has some serious problems. Because the garbage collector destroys the objects in a non-deterministic order, some of the member variables of the object being finalized might have been destroyed already.

The CLR provides some information about the order in which objects are destroyed, however. The following assumptions can be used:
1. Objects that derive from CriticalFinalizerObject and have a finalization method can use references to objects without a finalization method.

2. Objects that do not derive from CriticalFinalizerObject can use references to objects that do derive from CriticalFinalizerObject and also objects that do not have finalization methods.

Microsoft strongly recommends that you don't derive anything from CriticalFinalizerObject. However, this is the only mechanism they give you to order the finalization methods. If you have one object that must finalize before another, then you either have to use this, or you have to do what Microsoft did for System.IO.StreamWriter. This class does not flush its buffer during finalization because it may contain a FileStream object that must be closed after the StreamWriter flushes its buffer. In other words, the StreamWriter simply doesn't work unless it is Disposed correctly.

Of course, if you have three objects whose finalization order must be determined ahead of time, then even deriving from CriticalFinalizerObject won't work. In other words, even if you didn't care exactly when a resource was cleaned up, finalize isn't enough to make all classes safe to use. Sometimes you must rely on the user of your class to make it safe in C#. The fact that it works a lot of the time is only going to make the users careless.

Personally, I find this scary.

This originally appeared on the PC-Doctor blog.