Friday, March 21, 2008

Requirements of a Language to Statically Analyze

For those of you just tuning in, I'm working on a project to statically analyze an untyped code base to try to bring some of the advantages of typed languages to the code base.

The first step is to figure out which language I should write a static analysis tool for. This is obviously an important decision with quite a few implications both at the beginning, during the creation of the parser, and at the end when we try to find an actively developed code base to look at.

The first requirement is that it be a relatively popular, untyped language. Let's look at some languages:

1. JavaScript, ECMAScript, or JScript. This is used by millions.
2. Lua. This is a popular language to embed in applications. I know it extremely well, too.
3. Python. This is really popular, and it's a pretty clean language, too.
4. Ruby. This is getting to be a popular language. (I really like the language, too.)
5. PHP. This is also a popular language.
6. Perl. I've gone through life without having to learn COBOL, Fortran, or Perl. I'm a really happy guy. Do people still use it? I wouldn't know.

Yes, there are some others, but I'll limit my discussion to these.

I'll reject Perl immediately because it's designed to be complex. This is just a hobby project, and I'm not going to be able to spend lots of time dealing with unneeded complexities. I also have no interest in learning it.

I'm also going to reject PHP for similar reasons. I just have no interest in it. Now we're down to harder choices.

We're down to four languages: Ruby, Python, JavaScript, and Lua. These are all languages that I've enjoyed using, and some of them I know pretty well.

Let's start with Ruby. Ruby is a wonderful scripting language. I love it for its terse, expressive syntax. I'm a bit worried by the quantity of syntactic sugar in it, but most of that can be stripped away after parsing it.

However, Ruby code is frequently written in a different style than conventional, object-oriented code. Ruby programmers are used to doing things that would be utterly opaque to any reasonable static analysis tool. Here's an extreme example of some Ruby on Rails code that scares me:

lang_code = 'en'
obj = MyModel.find_by_name 'name of obj'
result = obj.send :"language_#{lang_code}"

For the non-Ruby geeks out there, this pulls a MyModel object out of the my_models table in the database. The object is populated with some members based on the database schema. That's mildly scary all by itself, since I'd rather not spend my time dealing with DB schemas. However, the last line is even worse! It calls the language_en method on obj, but the only way for a program like mine to figure that out is to look at the value of lang_code. I'd much rather ignore values and only look at types. (Many scripting languages treat functions as ordinary values that can be moved around. As we'll see in future installments, this isn't always a big problem.)

Other languages can do this, too, but Ruby programmers enjoy doing it a bit too much.
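To see why this is hard without dragging Rails into it, here's a minimal, self-contained sketch. The Greeting class, its language_en and language_de methods, and the translate helper are all hypothetical stand-ins that I made up for illustration; none of it is real Rails API.

```ruby
# Hypothetical stand-in for the Rails model above.
class Greeting
  def language_en
    'hello'
  end

  def language_de
    'hallo'
  end
end

# The method name is assembled from a runtime string, so a tool that
# only tracks types sees "send with some Symbol" and nothing more.
def translate(obj, lang_code)
  obj.send(:"language_#{lang_code}")
end

puts translate(Greeting.new, 'en')
puts translate(Greeting.new, 'de')
```

Both calls have exactly the same types going in, yet they end up in different methods. An analyzer would have to track the value of lang_code to tell them apart.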

Python is another well-designed language that would be fun to use. I can't claim that I'm an expert at it, so there may be other problems with it. However, the standard libraries for it scare me a bit. There are a huge number of them. As far as I can tell, Python programmers use whatever they feel like using from these, so an automated way of deducing the argument types and return values of functions would be required. I'm sure that a Doxygen summary of the functions could be found somewhere, but I'd prefer to not rely too heavily on infrastructure that doesn't sound like much fun to work on.

Lua is an extremely simple language that's designed to be embedded in applications. As such, it avoids both the large amount of syntactic sugar in Ruby and, at least in most cases, the huge library of Python. However, it does mean that you'd have to find the correct application of Lua to look at. Let's assume that such an application exists.

Like most (or all?) of these languages, Lua treats functions as first-class values. Because of this, it is possible to pick the function to call based on a variable. Here's an example:

selector = 'a'
obj = { a = function(num) return num+1 end, b = function(str) return str..'!' end }
obj[selector]('string')

The last line is the worrisome one. It passes a string to a function that probably requires a number. (There are ways to modify the behavior of types in Lua. Off the top of my head, I don't remember if the latest versions of Lua allow strings to have overloaded operators.) If we assume that they don't, this call is an error that we'd like to detect. That's extremely hard.

Realistically, however, if we reject Lua because of this problem, then we'll also reject all other scripting languages that treat functions as normal values. As far as I know, that's all of the ones on my list. Fortunately, this behavior is not a frequent pattern in the Lua code that I've seen and written. There are other ways of accomplishing the same task in Lua that are significantly simpler, more powerful, and, therefore, more popular.
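To illustrate that this isn't specific to Lua, here's the same table-of-functions pattern sketched in Ruby. The HANDLERS constant and the dispatch helper are names I invented for this sketch, and the rescue is only there to show what happens at run time.

```ruby
# Same shape as the Lua table above: two functions keyed by name.
HANDLERS = {
  'a' => ->(num) { num + 1 },    # expects a number
  'b' => ->(str) { str + '!' }   # expects a string
}

# Call whichever function the runtime value of selector picks.
def dispatch(selector, arg)
  HANDLERS[selector].call(arg)
end

puts dispatch('b', 'string')     # the string goes to the string function

begin
  dispatch('a', 'string')        # the string goes to the number function
rescue TypeError => e
  puts "run-time type error: #{e.message}"
end
```

Nothing in the types of selector or arg says which function runs. The mistake only shows up once the value 'a' selects the function that wants a number.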

The last language on the list is JavaScript. JavaScript can be used in a lot of different ways, but by far the most common and exciting way is in client-side scripting on web pages. This provides an enormous body of code that could be analyzed. In addition, there are other tools out there that create type-safe JavaScript by cross-compiling from another language. This code would make an excellent test case for my tool.

JavaScript doesn't suffer from the same problems that Ruby and Python have, either. While you can do what I described in the Lua example above in JavaScript as well, it's not especially common for exactly the same reason. (Both Lua and JavaScript support closures. Closures are much more powerful, and, even though Internet Explorer has some nasty bugs related to them, they are extremely popular.)

Web browsers do have a mildly large library embedded in them, but the security restrictions are so strict that the library is inherently limited. ActiveX controls and Mozilla plugins extend the API in annoying ways, but the body of code out there is so large and varied that I can simply throw out pages that use plugins I don't want to support.

The one real disadvantage of JavaScript is the optional semicolon rule in the parser. The parser is required to insert semicolons in a bunch of places, and this does sound mildly annoying to parse. It can't be that bad, though. (Lua has a similar rule involving newlines, but it's much more obscure.)

Up until now, JavaScript looks like an exciting possibility. Unfortunately, I do a lot of work here at PC-Doctor using JavaScript, and I'd rather do something completely different. I really don't want a useless hobby project to feel like work! If I need another excuse, it's that others have already written limited tools like this for JavaScript.

Lua 5.1 is it.

This originally appeared on PC-Doctor's blog.
