One of the early steps in my static analysis project is to parse the language that I'm going to analyze. I'd like to build a relatively clean abstract syntax tree (AST) that I can play with later.
C++ has a lot of advantages over C for this sort of thing. It has an enormous amount of machinery for building high-level abstractions without paying more runtime overhead than you're willing to accept.
However, C is also still a useful language. For example, it's extremely simple compared to C++. Perhaps in part because of this, a lot of code generators out there still create C code. If you want to use one of these code generators with anything but C or C++, you can't do much to tweak the output they generate. C++ could well be unique in its ability to mix old-style C code with modern programming techniques.
I've been using some code generated by Flex and ACCENT recently, and it's nice being able to use a lot of C++'s features with the code. (ACCENT still uses K&R function definitions. It's pretty old and primitive, I guess.)
I've worked on a few projects that required a mix of the two, but I can't claim that I've ever thought about exactly how it should be done. Part of the problem is that it's extremely easy to mix the two. C++ allows much more integration with a C library than most languages. In C#, for example, about all you can do when dealing with a C library is make calls into it. C++ allows you to define types that can be used by the C library, you can play with #defines just like a C programmer can, and you can pass function pointers around without worrying whether they're from C or C++. (Though std::tr1::function is better!) You can even include random header files and make calls into the internals of a library.
Clearly, some thought should be given to this before mixing the two. I'm embarrassed to say that I haven't ever done this deliberately.
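To make the function pointer point concrete, here is a minimal sketch that hands a callback written in a C++ source file to the C library's qsort. The names compare_ints and sort_ints are made up for illustration. The comparator is declared extern "C" because language linkage is formally part of a function's type, even though mainstream compilers tend to accept either kind of function here.

```cpp
#include <cassert>
#include <cstdlib>

// Comparator with C linkage, so it has exactly the function type that a C
// library expects for a callback. (Hypothetical name, for illustration.)
extern "C" int compare_ints(const void* lhs, const void* rhs) {
    const int a = *static_cast<const int*>(lhs);
    const int b = *static_cast<const int*>(rhs);
    return (a > b) - (a < b);  // negative, zero, or positive, as qsort expects
}

// Sorts an array in place by passing the comparator to the C library's qsort.
void sort_ints(int* values, std::size_t count) {
    assert(values != nullptr || count == 0);
    std::qsort(values, count, sizeof(int), compare_ints);
}
```

Nothing about the call site reveals which language the callback was written in, which is both the convenience and the danger.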
Currently, I'm using Flex, which gives you more than simply an API to call into. You are supposed to compile the generated code into your module, and you can customize that code by creating some C-compatible types and passing them in through #defines.
It's all very clunky and C-like, but it works great, and that's all I care about today. I'm happy to use it, but I don't really want to pass in a POD type. I want a full-fledged C++ class to be used by Flex.
Mixing Flex's generated code and modern C++ code is entirely possible, but there is a lot of type safety that gets thrown away when going to the C code. For example, Flex thinks that each programming language construct and all of its parts are all made up of Symbol* objects. In C++, I don't want to have an abstract base class that defines all behavior that all of these constructs might ever need. Instead, each different type of construct is a different type in C++. This means, for example, that it's not possible to pass an object that isn't an expression to a function that requires an expression. I like that a lot, but Flex wants all of these objects to be the same.
The C code, therefore, will go through a facade that does a bunch of careful type conversions and then passes the resulting, typed objects through to the C++ code.
This is a frequent theme in other projects that I've done that involve mixing C and C++ code. C++ is capable of supporting a much richer set of types than C is, and there always seems to be some work involved in preserving C++'s type expressiveness in C.
Incidentally, the facade is incredibly ugly at the moment. There are lots of typeid operators and long if ... else if ... chains generated by preprocessor macros. However, the rest of the C++ code always knows exactly what the type is, so it's all worth it.
Flex and Accent's generated code does have to perform operations on these objects, however. It ends up using a long list of "member functions" that always take Symbol pointers. For example, there might be a function that creates a C++ expression object from a unary operator and another expression object. In Accent's world, it looks like this:
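A rough sketch of what such a facade's type recovery could look like, written by hand rather than generated by macros. Expression, Operator, As and KindOf are hypothetical names, not the real code; the only thing taken from the project is that the C side sees nothing but Symbol pointers.

```cpp
#include <cassert>
#include <string>
#include <typeinfo>

// Hypothetical stand-ins: the generated C code only ever sees Symbol*, while
// the C++ side works with the concrete classes.
struct Symbol {
    virtual ~Symbol() {}
};
struct Expression : Symbol {};
struct Operator : Symbol {};

// Checked conversion used at the boundary: recover the concrete type or fail
// loudly, so that code past the facade never has to guess.
template <typename T>
T& As(Symbol* symbol) {
    assert(symbol != nullptr);  // a null Symbol is a bug in the caller
    T* typed = dynamic_cast<T*>(symbol);
    if (typed == nullptr) throw std::bad_cast();
    return *typed;
}

// The kind of if / else-if chain on typeid that the facade is full of.
std::string KindOf(const Symbol& symbol) {
    if (typeid(symbol) == typeid(Expression)) return "expression";
    else if (typeid(symbol) == typeid(Operator)) return "operator";
    else return "unknown";
}
```

The ugliness stays in one place: everything downstream of As<Expression>() receives an Expression& and never needs to look at a Symbol again.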
Symbol* CreateUnaryOperation( Symbol* op, Symbol* expression );
That's essentially a constructor for the Symbol* object. (Flex doesn't have to worry about whether or not it's a pointer or a struct. To Flex, it's just a copyable typedef.)
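One way a function with that signature could be implemented is sketched below. Only the signature comes from the text above; the class names, the members, and the choice to return a null pointer on a type mismatch are assumptions made for the example. A null return is used instead of an exception because letting a C++ exception propagate through C stack frames is generally not safe.

```cpp
#include <cassert>
#include <string>

// Hypothetical class hierarchy hidden behind Symbol*.
struct Symbol {
    virtual ~Symbol() {}
};
struct Expression : Symbol {};
struct Operator : Symbol {
    explicit Operator(const std::string& spelling) : text(spelling) {}
    std::string text;
};
struct Literal : Expression {
    explicit Literal(int v) : value(v) {}
    int value;
};
struct UnaryOperation : Expression {
    UnaryOperation(const Operator* unaryOp, const Expression* expr)
        : op(unaryOp), operand(expr) {
        assert(op != nullptr && operand != nullptr);
    }
    const Operator* op;         // not owned
    const Expression* operand;  // not owned
};

// The "constructor" the generated C code calls. It recovers the real types,
// refuses mismatched arguments, and hands back an untyped Symbol* again.
extern "C" Symbol* CreateUnaryOperation(Symbol* op, Symbol* expression) {
    Operator* typedOp = dynamic_cast<Operator*>(op);
    Expression* typedExpression = dynamic_cast<Expression*>(expression);
    if (typedOp == nullptr || typedExpression == nullptr) {
        return nullptr;  // wrong kind of Symbol: report failure, don't throw
    }
    return new UnaryOperation(typedOp, typedExpression);
}
```

Note that nothing in the signature says who deletes the returned object, or how long op and expression must stay alive.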
This is another common theme of mixing C and C++. You don't have to give up object-oriented programming in C. You just have to wrap it all up in a bunch of free functions. It's even possible to have virtual functions by building the vtbl (virtual function table) by hand.
It also nicely displays another annoying theme of mixing C and C++. In C++, you always know what the rules for the lifetime of an object are. If a function returns a value directly, then the caller gets control over the lifetime. If the function returns a smart pointer, then the smart pointer gets control over the lifetime. (Smart pointers also come with a set of rules about what the caller is allowed to do with the object.) If a reference is returned, then the object's lifetime is controlled elsewhere. It's all wonderfully unambiguous.
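For readers who haven't seen a hand-built vtbl, here is a small sketch. It is deliberately written in the subset shared by C and C++, and every name in it (Node, NodeVtbl, Literal, Negate and the functions) is invented for the example.

```cpp
#include <assert.h>

struct Node;  /* forward declaration so the table can mention it */

/* The hand-written "vtbl": one function pointer per virtual function. */
struct NodeVtbl {
    int (*evaluate)(const struct Node* self);
};

/* Every "class" starts with a pointer to its table. */
struct Node {
    const struct NodeVtbl* vtbl;
};

/* Virtual dispatch by hand: look the function up in the object's table. */
int Node_evaluate(const struct Node* node) {
    assert(node != 0 && node->vtbl != 0);
    return node->vtbl->evaluate(node);
}

/* Two "derived classes". The base must be the first member so that a
   pointer to the base can be cast back to the derived struct. */
struct Literal { struct Node base; int value; };
struct Negate  { struct Node base; const struct Node* operand; };

static int Literal_evaluate(const struct Node* self) {
    return ((const struct Literal*)self)->value;
}
static int Negate_evaluate(const struct Node* self) {
    return -Node_evaluate(((const struct Negate*)self)->operand);
}

static const struct NodeVtbl kLiteralVtbl = { Literal_evaluate };
static const struct NodeVtbl kNegateVtbl  = { Negate_evaluate };

/* "Constructors" are free functions that fill in the table pointer. */
struct Literal Literal_make(int value) {
    struct Literal literal = { { &kLiteralVtbl }, value };
    return literal;
}
struct Negate Negate_make(const struct Node* operand) {
    struct Negate negate = { { &kNegateVtbl }, operand };
    return negate;
}
```

Calling Node_evaluate on a Negate that wraps a Literal dispatches twice through the tables, which is what a C++ compiler would otherwise arrange for you.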
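The three conventions can be read straight off the signatures. The sketch below uses made-up names (Token, TokenTable and the functions) and std::unique_ptr as the smart pointer, which assumes a C++11 or later compiler.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical type, for illustration only.
struct Token {
    explicit Token(const std::string& t) : text(t) {}
    std::string text;
};

// By value: the caller gets its own object and controls its lifetime.
Token MakeToken(const std::string& text) {
    return Token(text);
}

// Smart pointer: ownership travels with the pointer, and the object is
// destroyed when the last owner lets go of it.
std::unique_ptr<Token> MakeHeapToken(const std::string& text) {
    return std::unique_ptr<Token>(new Token(text));
}

// Reference: the object is owned elsewhere (here, by the table), and the
// caller only borrows it for as long as the table lives.
class TokenTable {
public:
    explicit TokenTable(const std::string& first) : first_(first) {}
    const Token& First() const { return first_; }
private:
    Token first_;
};

// Usage: each call site can tell from the return type alone who owns what.
void LifetimeExample() {
    Token copy = MakeToken("a");
    std::unique_ptr<Token> owned = MakeHeapToken("b");
    TokenTable table("c");
    const Token& borrowed = table.First();
    assert(copy.text == "a" && owned->text == "b" && borrowed.text == "c");
}
```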
In C you always get either a value or a pointer. It's true in my project, too. All of my fake constructor functions return a Symbol*. I really want it to be an auto_ptr, though! If Flex or Accent were to misplace a pointer, then it would be lost forever. So far, they don't seem to do that, but it's certainly filled with ambiguity.
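One possible way to narrow the gap, sketched under assumptions: keep a smart pointer in charge on the C++ side right up to the boundary, release it only when handing the object to the C code, and adopt it again as soon as it comes back. The names CreateLiteral, Adopt and LiveSymbolCount are invented here, and std::unique_ptr stands in for auto_ptr, which was deprecated in C++11 and removed in C++17. The live-object counter is only there to make a misplaced pointer detectable.

```cpp
#include <cassert>
#include <memory>

static int g_liveSymbols = 0;  // number of Symbol objects currently alive

struct Symbol {
    Symbol() { ++g_liveSymbols; }
    Symbol(const Symbol&) = delete;
    Symbol& operator=(const Symbol&) = delete;
    virtual ~Symbol() {
        assert(g_liveSymbols > 0);
        --g_liveSymbols;
    }
};

struct Literal : Symbol {
    explicit Literal(int v) : value(v) {}
    int value;
};

typedef std::unique_ptr<Symbol> SymbolPtr;

// C-facing "constructor" (hypothetical). The smart pointer owns the object
// until the very last line, so an early exit cannot leak it.
extern "C" Symbol* CreateLiteral(int value) {
    SymbolPtr literal(new Literal(value));
    // ... any further setup that might fail would go here ...
    return literal.release();  // from here on the C code holds the only pointer
}

// C++-facing: take ownership of a raw pointer that came back from the C code.
SymbolPtr Adopt(Symbol* raw) {
    return SymbolPtr(raw);
}

// A nonzero count once parsing has finished means a pointer went missing.
int LiveSymbolCount() {
    return g_liveSymbols;
}
```

This doesn't remove the ambiguity while the pointer is in the generated code's hands, but it does shrink the unprotected window to the C side only.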
I haven't come up with a satisfactory solution to this problem. My current solution is to always do exactly the same thing with all of the objects returned by my facade's functions. If the programmer has to remember a set of rules at all times, then it'd better be a simple set of rules.
It looks like this is the beginning of officially working on statically analyzing an untyped language. Next week, I might talk about what language I picked and why. Or perhaps I'll talk about my crazy idea for a completely clean decorator pattern implementation so that I don't have to mess up my nice, clean AST.
This originally appeared on PC-Doctor's blog.