Friday, March 21, 2008

Requirements of a Language to Statically Analyze

For those of you just tuning in, I'm working on a project to statically analyze an untyped code base in order to bring some of the advantages of typed languages to it.

The first step is to figure out which language I should write a static analysis tool for. This is an important decision with implications at both ends of the project: at the beginning, when I write the parser, and at the end, when I try to find an actively developed code base to analyze.

The first requirement is that it be a relatively popular, untyped language. Let's look at some languages:

1. JavaScript, ECMAScript, or JScript. This is used by millions.
2. Lua. This is a popular language to embed in applications. I know it extremely well, too.
3. Python. This is really popular, and it's a pretty clean language, too.
4. Ruby. This is getting to be a popular language. (I really like the language, too.)
5. PHP. This is also a popular language.
6. Perl. I've gone through life without having to learn COBOL, Fortran, or Perl. I'm a really happy guy. Do people still use it? I wouldn't know.

Yes, there are some others, but I'll limit my discussion to these.

I'll reject Perl immediately because it's designed to be complex. This is just a hobby project, and I'm not going to be able to spend lots of time dealing with unneeded complexities. I also have no interest in learning it.

I'm also going to reject PHP for similar reasons. I just have no interest in it. Now we're down to harder choices.

We're down to four languages: Ruby, Python, JavaScript, and Lua. These are all languages that I've enjoyed using, and some of them I know pretty well.

Let's start with Ruby. Ruby is a wonderful scripting language. I love it for its terse, expressive syntax. I'm a bit worried by the quantity of syntactic sugar in it, but most of that can be stripped away after parsing it.

However, Ruby code is frequently written in a different style than conventional, object oriented code. Ruby programmers are used to doing things that would be utterly opaque to any reasonable static analysis tool. Here's an extreme example of some Ruby on Rails code that scares me:

lang_code = 'en'
# Rails synthesizes find_by_name from the database schema at runtime.
obj = MyModel.find_by_name 'name of obj'
# Builds a method name from a runtime value, then calls it: here, :language_en.
result = obj.send :"language_#{lang_code}"

For the non-Ruby geeks out there, this pulls a MyModel object out of the my_models table in the database. The object is populated with some members based on the database schema. That's mildly scary all by itself, since I'd rather not spend my time dealing with DB schemas. However, the last line is even worse! It calls the language_en member function on obj, but the only way for a program like mine to figure that out is to look at the value of lang_code. I'd much rather ignore values and only look at types. (Many scripting languages treat functions as ordinary values that can be moved around. As we'll see in future installments, this isn't always a big problem.)

Other languages can do this, too, but Ruby programmers enjoy doing it a bit too much.

Python is another well-designed language that would be fun to use. I can't claim to be an expert at it, so there may be other problems with it. However, the standard libraries for it scare me a bit. There are a huge number of them. As far as I can tell, Python programmers use whatever they feel like using from these, so an automated way of deducing the argument and return types of library functions would be required. I'm sure that a doxygen-style summary of the functions could be found somewhere, but I'd prefer not to rely too heavily on infrastructure that doesn't sound like much fun to work on.

Lua is an extremely simple language that's designed to be embedded in applications. As such, it avoids both the large amount of syntactic sugar in Ruby and, at least in most cases, the huge library of Python. However, it does mean that I'd have to find the right application: one that embeds Lua and has a sizable body of scripts to look at. Let's assume that such an application exists.

Like most (or all?) of these languages, Lua treats functions as first-class values. Because of this, it's possible to pick the function to call based on the value of a variable. Here's an example:

selector = 'a'
obj = {
  a = function(num) return num + 1 end,    -- expects a number
  b = function(str) return str .. '!' end, -- expects a string
}
obj[selector]('string')  -- calls the numeric version with a string

The last line is the worrisome one. It calls a function, passing a string into a function that probably requires a number. (Lua does let you modify the behavior of types through metatables, and in Lua 5.1 all strings share a single metatable, so string operators can, in principle, be overloaded.) If we assume that nobody has done that, we'd like to detect this kind of error. That's extremely hard.

Realistically, however, if we reject Lua because of this problem, then we'll also reject every other scripting language that treats functions as normal values. As far as I know, that's all of the ones on my list. Fortunately, this is not a frequent pattern in the Lua code that I've seen and written. There are other ways of accomplishing the same task in Lua (metatables, in particular) that are significantly simpler, more powerful, and, therefore, more popular.

The last language on the list is JavaScript. JavaScript can be used in a lot of different ways, but by far the most common and exciting way is in client-side scripting on web pages. This provides an enormous body of code that could be analyzed. In addition, there are tools out there that create type-safe JavaScript by cross-compiling from another language (GWT, for example). This code would make an excellent test case for my tool.

JavaScript doesn't suffer from the same problems that Ruby and Python have, either. While you can do what I described in the Lua example above in JavaScript as well, it's not especially common, for exactly the same reason. (Both Lua and JavaScript support closures. Closures are much more powerful than dispatching on strings, and, even though Internet Explorer has some nasty memory leak bugs related to them, they are extremely popular.)

Web browsers do have a mildly large library embedded in them, but the security restrictions are so strict that the library is inherently limited. ActiveX controls and Mozilla plugins extend the API in annoying ways, but the body of code out there is so large and varied that pages using plugins I don't want to support can simply be thrown out.

The one real disadvantage of JavaScript is its automatic semicolon insertion rule. The grammar requires the parser to insert semicolons in a bunch of places, and that does sound mildly annoying to implement. It can't be that bad, though. (Lua has a similar ambiguity involving newlines, but it's much more obscure.)

Up until now, JavaScript looks like an exciting possibility. Unfortunately, I do a lot of work here at PC-Doctor using JavaScript, and I'd rather do something completely different. I really don't want a useless hobby project to feel like work! If I need another excuse, it's that others have already written limited tools like this for JavaScript.

Lua 5.1 is it.

This originally appeared on PC-Doctor's blog.

Tuesday, March 18, 2008

Mixing C and C++ Code

One of the early steps in my static analysis project is to parse the language that I'm going to analyze. I'd like to end up with a relatively clean abstract syntax tree (AST) that I can play with later.

C++ has a lot of advantages over C for this sort of thing. It's got an enormous amount of machinery that can be used to build high-level abstractions without incurring any more runtime overhead than you're willing to pay for.

However, C is also still a useful language. For example, it's extremely simple when compared to C++. Perhaps in part because of this, a lot of code generators out there still create C code. If you want to use one of these code generators from anything but C or C++, you can't do much to tweak the output they generate. C++ could well be unique in its ability to mix old-style C code with modern programming techniques.

I've been using some code generated by Flex and Accent recently, and it's nice being able to use a lot of C++'s features with the code. (Accent still uses K&R function definitions. It's pretty old and primitive, I guess.)

I've worked on a few projects that required a mix of the two, but I can't claim that I've ever thought about exactly how it should be done. Part of the problem is that it's extremely easy to mix the two. C++ allows much more integration with a C library than most languages do. In C#, for example, about all you can do with a C library is make calls into it. C++ lets you define types that the C library can use, play with #defines just like a C programmer, and pass function pointers around without worrying about whether they come from C or C++. (Though std::tr1::function is better!) You can even include random header files and make calls into the internals of a library.
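
To make the function pointer point concrete, here's a minimal sketch against a hypothetical C logging API. Every name in it is invented, but the trampoline pattern is the standard way to hand a stateful C++ object to a C callback:

#include <iostream>
#include <string>

// Hypothetical C API (all names made up for illustration): registration
// takes a plain function pointer plus an opaque context pointer.
extern "C" {
    typedef void (*log_fn)(const char* msg, void* context);
    void c_library_set_logger(log_fn fn, void* context);
}

// A stateful C++ object that the C library should end up calling into.
class Logger {
public:
    void write(const std::string& msg) { std::cout << msg << '\n'; }
};

// A trampoline with C linkage forwards the call to the C++ object.
extern "C" void log_trampoline(const char* msg, void* context) {
    static_cast<Logger*>(context)->write(msg);
}

// A stub standing in for the real C library so the sketch runs on its own.
extern "C" void c_library_set_logger(log_fn fn, void* context) {
    fn("hello from C", context);  // the real library would store these
}

int main() {
    Logger logger;
    c_library_set_logger(&log_trampoline, &logger);  // prints "hello from C"
}

The context pointer is what makes this work: the C side carries the object around as a void*, and the trampoline casts it back.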

Clearly, some thought should be given to how to mix the two. I'm embarrassed to say that I haven't ever done this deliberately.

Currently, I'm using Flex, a tool that exposes more than simply an API to call into. You are supposed to compile the generated code into your module, and you can modify its behavior by creating some C-compatible types and passing them in through #defines.
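
For instance, Flex's YY_DECL hook replaces the declaration of the generated yylex() function. TokenSink below is a made-up name; any C-compatible type, including a pointer to a full C++ class, can go there:

/* Redefine the scanner's signature; by default it's int yylex(void). */
#define YY_DECL int yylex(TokenSink* sink)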

It's all very clunky and C-like, but it works great, and that's all I care about today. I'm happy to use it, but I don't really want to pass in a POD type. I want a full-fledged C++ class to be used by Flex.

Mixing Flex's generated code and modern C++ code is entirely possible, but a lot of type safety gets thrown away on the way to the C code. For example, Flex thinks that each programming language construct, and every part of one, is a Symbol* object. In C++, I don't want a single abstract base class that defines all the behavior these constructs might ever need. Instead, each kind of construct is a different C++ type. This means, for example, that it's impossible to pass an object that isn't an expression to a function that requires an expression. I like that a lot, but Flex wants all of these objects to be the same.

The C code, therefore, will go through a facade that does a bunch of careful type conversions and then passes the resulting typed objects through to the C++ code.

This is a frequent theme in other projects that I've done that involve mixing C and C++ code. C++ is capable of supporting a much richer set of types than C is, and there always seems to be some work involved in preserving C++'s type expressiveness in C.

Incidentally, the facade is incredibly ugly at the moment. There are lots of typeid operators and long if...else if... chains generated by preprocessor macros. However, the rest of the C++ code always knows exactly what the type is, so it's all worth it.

Flex and Accent's generated code does have to perform operations on these objects, however. It ends up using a long list of "member functions" that always take Symbol pointers. For example, there might be a function that creates a C++ expression object from a unary operator and another expression object. In Accent's world, that looks like this:

Symbol* CreateUnaryOperation(Symbol* op, Symbol* expression);

That's essentially a constructor for the Symbol* object. (Flex doesn't have to worry about whether it's a pointer or a struct. To Flex, it's just a copyable typedef.)
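
Here's a rough sketch of the shape my facade takes around a function like that. The class names are placeholders, and the real thing uses macro-generated typeid chains rather than a tidy dynamic_cast, but the idea is the same:

#include <stdexcept>

// The opaque type that the generated C code traffics in.
struct Symbol { virtual ~Symbol() {} };

// The typed side of the AST (placeholder names, not my real tree).
struct Expression : Symbol {};
struct UnaryOperator : Symbol {};
struct UnaryExpression : Expression {
    UnaryExpression(UnaryOperator&, Expression&) {}  // store the children here
};

// Checked downcast: fail loudly if the parser hands over the wrong thing.
template <typename T>
T& symbol_cast(Symbol* s) {
    T* t = dynamic_cast<T*>(s);
    if (!t) throw std::logic_error("unexpected symbol type");
    return *t;
}

// The C-visible facade converts exactly once, at the boundary; nothing
// past this point ever sees a bare Symbol*.
extern "C" Symbol* CreateUnaryOperation(Symbol* op, Symbol* expression) {
    return new UnaryExpression(symbol_cast<UnaryOperator>(op),
                               symbol_cast<Expression>(expression));
}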

These fake constructors illustrate another common theme of mixing C and C++. You don't have to give up object-oriented programming in C. You just have to wrap it all up in a bunch of free functions. It's even possible to have virtual functions by manually creating the vtbl, as the sketch below shows.
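
Here's a tiny sketch of that manual vtbl trick, written in C style (it compiles as either C or C++) with invented names; it's also exactly what a C++ compiler builds for you behind the scenes:

#include <stdio.h>

/* A struct of function pointers plays the role of the vtbl. */
struct CSymbol;
struct SymbolVtbl {
    void (*print)(const struct CSymbol* self);
};
struct CSymbol {
    const struct SymbolVtbl* vtbl;  /* every "object" carries its vtbl */
};

static void print_expression(const struct CSymbol* self) {
    (void)self;  /* a real override would use its own data here */
    printf("expression\n");
}
static const struct SymbolVtbl expression_vtbl = { print_expression };

/* "Virtual" dispatch: look the function up through the object's vtbl. */
static void Symbol_print(const struct CSymbol* s) { s->vtbl->print(s); }

int main(void) {
    struct CSymbol sym = { &expression_vtbl };
    Symbol_print(&sym);  /* prints "expression" */
    return 0;
}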

The same free-function approach also nicely displays another annoying theme of mixing C and C++. In C++, you always know what the rules for the lifetime of an object are. If a function returns a value directly, then the caller gets control over the lifetime. If the function returns a smart pointer, then the smart pointer gets control over the lifetime. (Smart pointers also come with a set of rules about what the caller is allowed to do with the object.) If a reference is returned, then the object's lifetime is controlled elsewhere. It's all wonderfully unambiguous.

In C, you always get either a value or a pointer. That's true in my project, too: all of my fake constructor functions return a Symbol*. I really want it to be an auto_ptr, though! If Flex or Accent were to misplace a pointer, the object would be leaked forever. So far, they don't seem to do that, but the interface is certainly filled with ambiguity.
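
To show the contrast, here's a sketch of the two kinds of signature side by side (names invented; modern C++ would use unique_ptr instead of auto_ptr):

#include <memory>

struct Symbol { virtual ~Symbol() {} };

// What the C interface is stuck saying: a raw pointer, with no hint
// about who is responsible for deleting it.
Symbol* CreateSomething();

// What a C++ interface could say instead: the auto_ptr return makes the
// ownership transfer explicit, and a returned value that the caller
// drops on the floor is destroyed rather than leaked.
std::auto_ptr<Symbol> CreateSomethingCpp() {
    return std::auto_ptr<Symbol>(new Symbol);
}

Since the generated C code can't traffic in auto_ptr, though, this stays a wish.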

I haven't come up with a satisfactory solution to this problem. My current solution is to always do exactly the same thing with all of the objects returned by my facade's functions. If the programmer has to remember a set of rules at all times, then it'd better be a simple set of rules.

It looks like this is the beginning of officially working on statically analyzing an untyped language. Next week, I might talk about what language I picked and why. Or perhaps I'll talk about my crazy idea for a completely clean decorator pattern implementation so that I don't have to mess up my nice, clean AST.

This originally appeared on PC-Doctor's blog.

Monday, March 10, 2008

Static Analysis of Untyped Languages

A while ago I wrote about whether untyped languages were a good idea or a bad one. I didn't come to any real conclusions at the time, and it's bothered me. I'd like to outline a way to gather some real evidence now. I still won't be able to come to a conclusion, but I think this approach sounds interesting.

There are a number of advantages and disadvantages to untyped languages. One of the disadvantages is that, because variables have no type, you can assign the wrong type to them. A strongly typed language avoids most of this problem. (You might still have problems involving type conversion, but a good programmer or a strict language can avoid those.)

I'd like to propose a tool that can analyze some untyped code without running it. (This is called static analysis.) The tool will try to deduce the types of variables. In theory, it's possible to work out the set of types a value could have at any given point. We couldn't be as strict as a typed language, but we can make some conclusions. For example, let's look at the following Python code:

class Duck:
    def __init__(self):
        self.hungerLevel = 0

def feed(duck):  # stub; in the real program, feed is defined elsewhere
    duck.hungerLevel = 0

daffy = Duck()
feed(daffy)

The first few lines define a type with a single numeric member variable. We can tell this before we run the program. The last two lines create a Duck and pass it to the feed function. We can tell at compile time that daffy is a variable of type Duck and feed is a function that will receive Ducks. Functions may receive different types at different times, but we can record all of the possible inputs and make sure the types are compatible at each point.
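
To sketch what that bookkeeping might look like: if the tool were written in C++, its core could be little more than a map from each function parameter to the set of types observed flowing into it. This is a toy with invented names, and a real tool would need far more than strings, but it shows the idea:

#include <iostream>
#include <map>
#include <set>
#include <string>

// For each function parameter, accumulate the set of types that call
// sites have been seen to pass to it.
typedef std::set<std::string> TypeSet;
typedef std::map<std::string, TypeSet> ParamTypes;  // "feed/arg0" -> {"Duck"}

void observe_argument(ParamTypes& table, const std::string& param,
                      const std::string& type) {
    table[param].insert(type);
}

// A later pass asks whether a parameter only ever receives one type.
bool only_receives(const ParamTypes& table, const std::string& param,
                   const std::string& type) {
    ParamTypes::const_iterator it = table.find(param);
    return it != table.end()
        && it->second.size() == 1
        && it->second.count(type) == 1;
}

int main() {
    ParamTypes table;
    observe_argument(table, "feed/arg0", "Duck");  // from feed(daffy)
    std::cout << only_receives(table, "feed/arg0", "Duck") << '\n';  // 1
}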

This cannot work perfectly, of course. For example, there might be code paths that are impossible to reach because of the values of variables rather than their types. Determining all of these cases cannot be done without running the program. (My theoretical computer science is pretty weak, but this is essentially the halting problem: in general, you can't decide whether a given code path will ever be reached.) However, if there's a program out there that is type safe only because some of its code paths are impossible to reach, it's probably not a particularly well written program.

I'd like to write something like this, but I haven't fully justified it yet. If I had such a tool, I'd try to run it on a body of existing, open source, untyped code.

I can imagine two ways this could strongly suggest that untyped languages are dangerous, despite the claims of untyped language fans. First, real problems could be found in the code base. If the author(s) of the untyped code admit that the problems are real, then, almost by definition, the problems exist. Second, the author(s) could become fans of the tool and use it during development. Perhaps the final code doesn't have type safety problems, but the code did have them in its early stages. This could also be used as evidence that there's a problem with untyped languages.

Unfortunately, it's hard to imagine how this experiment could prove that untyped languages don't have a problem. Furthermore, it's possible that a tool like this works well enough that the problem largely goes away after programmers start using it. (The latter seems unlikely, but it's certainly a potential complication.)

Don't expect rapid progress here, but I'd guess that, at the very least, I'll have some follow up posts about this project as it progresses.

This originally appeared on PC-Doctor's blog.