Tuesday, February 14, 2006

Reducing boilerplate by embedding XSLT

I've recently been working on a custom IDL compiler that can target Python and C++. Think COM, only portable between platforms and targeted at a select few languages. A lot of patterns immediately appear in both the frontend and backends of the compiler that are just boilerplate work. Consider the classic implementation of an IDL parser (or any programming-language parser, for that matter):
  1. Parse the input, building an abstract syntax tree of the declarations.
  2. Build a type graph and interconnect the type nodes so that types can be resolved.
  3. Resolve all type references (using fixups, or in a separate pass if forward references are allowed).
  4. Invoke one or more backends if the resulting tree is valid.
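The four phases above can be condensed into a toy driver. Every name in this sketch (the `Interface` node, `parse`, `resolve`, `emit_python`) is hypothetical, invented for illustration rather than taken from any real compiler:

```python
# Toy sketch of the classic four-phase compiler driver; all names hypothetical.
from dataclasses import dataclass, field

@dataclass
class Interface:                      # one AST node kind, for illustration
    name: str
    bases: list = field(default_factory=list)   # base-type names, resolved later

def parse(source):
    # 1. Build the AST: "interface Foo : Bar" -> Interface("Foo", ["Bar"])
    ast = []
    for line in source.splitlines():
        head, _, bases = line.partition(":")
        ast.append(Interface(head.split()[1],
                             [b.strip() for b in bases.split(",") if b.strip()]))
    return ast

def resolve(ast):
    # 2-3. Build the type graph and resolve references (a KeyError here
    # would signal an unknown type)
    by_name = {i.name: i for i in ast}
    for i in ast:
        i.bases = [by_name[b] for b in i.bases]
    return ast

def emit_python(ast):
    # 4. One backend: generate Python stubs from the resolved tree
    return "\n".join(
        f"class {i.name}({', '.join(b.name for b in i.bases) or 'object'}): pass"
        for i in ast)

print(emit_python(resolve(parse("interface Base\ninterface Calc : Base"))))
```

Each phase is a separate function only touching the output of the previous one, which is the same pipelining idea the rest of this post builds on.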
The backends in turn walk the tree (via some variation of the Visitor pattern) and produce output. This quickly becomes repetitive, mundane work, and Visitor is well known for the tight coupling and strong binding it introduces between entities.

In another sandbox project of mine I tried out a completely different approach. Instead of writing yet another AST hierarchy to interface with the frontend, I experimented with XML as an intermediate format in the compiler itself. Given an input file, my LALR parser essentially emits a well-formed XML document using the DOM APIs. This in itself doesn't buy much, but it provides tremendous flexibility down the road. Think of it as capturing the entire AST in a portable format. Indeed, the GCC project has recently added an AST-to-XML converter as part of their (excellent) compiler frontend, so more people are thinking along these lines.

With the AST swimming around in an XML DOM tree, it becomes possible to implement the repetitive and boring tasks with XML query tools that are designed for those jobs, rather than in a general-purpose programming language such as C++. Whipping up flexible type resolvers and type and syntax checkers with a decent XSLT processor is a lot of fun, and it can be done with a surprisingly small amount of XSLT code. XQuery processors can easily build statistics and metrics with minimal code.

Another interesting thing about keeping the AST in XML is that it becomes trivial to attach whatever metadata you like in a matter of seconds. That data can travel through the filters (for example, in an XML namespace of its own) all the way to the backend, where it can be used to tweak the output. Also, changing the XML AST doesn't require recompiling hundreds of AST classes!

The key concept that makes this work is pipelining, which is a common design pattern in compilers anyway.
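To make the "parser emits XML via the DOM APIs" idea concrete, here is a minimal sketch (not the actual parser from this project) using Python's standard `xml.dom.minidom`; the element and attribute names are assumptions about what such an AST document might look like:

```python
# Minimal sketch: a frontend emitting its AST directly as an XML document.
# The <idl>/<interface>/<method> vocabulary is invented for illustration.
from xml.dom.minidom import getDOMImplementation

def interface_to_xml(name, methods):
    doc = getDOMImplementation().createDocument(None, "idl", None)
    iface = doc.createElement("interface")
    iface.setAttribute("name", name)
    for m in methods:
        meth = doc.createElement("method")
        meth.setAttribute("name", m)
        iface.appendChild(meth)
    doc.documentElement.appendChild(iface)
    return doc

doc = interface_to_xml("Calculator", ["add", "sub"])
print(doc.documentElement.toxml())
# -> <idl><interface name="Calculator"><method name="add"/><method name="sub"/></interface></idl>
```

Once the parser's semantic actions build nodes like this instead of C++ objects, everything downstream can be an off-the-shelf XML tool.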
Connecting the output of one XSLT processing filter to the input of the next nicely captures the pipelining pattern, while at the same time enforcing the idea that one piece of code should have a single, well-defined purpose. XSLT also works wonders for code-generation backends, because it is designed for transforming data, something C++, Java and Python simply aren't.

So what are the drawbacks? Simply put: speed and memory consumption. Clearly, a hand-written implementation of, say, a type resolver will have lower memory consumption and better processing speed than an XSLT-based implementation. Also, while XSLT is a very useful language (especially with version 2), it's not very strong at error reporting or string handling, to name a few areas. If you're willing to strike a deal with the portability devil and settle on a single XSLT library, it's possible to work around most of these problems. Given the prototyping advantages, I still think this is a viable implementation strategy for compiler projects.
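The filter-pipeline shape can be sketched without any XSLT dependency. In a real pipeline each filter would be a stylesheet run through an XSLT processor; here, to keep the example standard-library-only, each "filter" is a plain Python function from one XML tree to the next, and the filter names and XML vocabulary are invented for illustration:

```python
# Sketch of chaining tree-to-tree filters, XSLT-pipeline style.
# Real filters would be XSLT stylesheets; these toy functions stand in.
import xml.etree.ElementTree as ET
from functools import reduce

def resolve_types(root):
    # Toy "type resolver": mark each <param> whose type is a declared interface
    declared = {i.get("name") for i in root.iter("interface")}
    for p in root.iter("param"):
        p.set("resolved", str(p.get("type") in declared).lower())
    return root

def strip_comments(root):
    # Toy cleanup filter: drop <comment> nodes before the backend runs
    for parent in root.iter():
        for child in list(parent):
            if child.tag == "comment":
                parent.remove(child)
    return root

def run_pipeline(xml_text, filters):
    # Feed the output tree of each filter into the next one
    return reduce(lambda tree, f: f(tree), filters, ET.fromstring(xml_text))

ast = """<idl>
  <interface name="Calculator"/>
  <method name="add"><param type="Calculator"/><comment/></method>
</idl>"""
out = run_pipeline(ast, [resolve_types, strip_comments])
print(ET.tostring(out, encoding="unicode"))
```

Because every stage consumes and produces the same representation, filters can be reordered, dropped, or swapped for XSLT stylesheets without touching the rest of the pipeline.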