Monday, May 7, 2007

Language Extensibility In CDT 4.0

This screenshot is really cool. But why?


We'll get back to the screenshot, I promise.

Well, CDT 4.0 RC0 just went out the door last week, marking our first feature-complete build for Europa. Confusingly enough, our next build this week is going to be marked as M7, so we have the odd situation of a milestone build coming after our first RC. The team felt it was important to keep the naming convention consistent with the Europa build we will be a part of, and we didn't want users getting confused about which build of CDT to use with Europa M7.

There are a few cool new features that my team here at IBM has been working on that a number of ISVs and other language tool authors will hopefully find useful. It's always been fairly easy to add support to CDT for compiling different languages via CDT's Managed Build and Standard Make projects, but we've been working recently to make it easier to integrate new C-like languages into the Core so that cool features like search, open declaration, and content assist all work.

For a while now, it's been possible to contribute definitions for new languages into the CDT core. Circa CDT 3.1, we added an extension point to CDT to allow you to contribute new languages via the ILanguage interface, and to map those ILanguages to an Eclipse content type. Each ILanguage has methods it must provide that let you parse a file and get an Abstract Syntax Tree (AST) out of it as a result. Once you have an AST, all those cool features I mentioned earlier start working, provided you use CDT's DOM AST APIs.

This worked great for clients such as the Photran project (who do the Fortran language IDE integration for Eclipse), but it was a bit problematic if you actually wanted to override what the language was for C and C++ files. CDT would look for extensions to the extension point, but would stop looking once it found the first one for any given content type (I'm simplifying things but this is how it would appear from the user's point of view). Hence, there was no deterministic way to make sure that your language was the one that got used.

In CDT 4.0 we've now added the concept of language mappings to the workbench preferences and the project properties. What this means is that the user can go in and change which language is mapped to which content type, even down to the level of the individual file.
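To make the idea concrete, here is a minimal sketch of how a layered lookup like this could behave. It is a toy model, not CDT's actual API: the class, the method names, the content type and language IDs, and the precedence order (file first, then project, then workspace, then the default) are all assumptions made for illustration, and it only models a single project.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of layered language mappings. Hypothetical; not CDT's real API. */
public class LanguageMappingSketch {
    // Content type ID -> language ID, one table per level (all IDs are made up).
    private final Map<String, String> defaults = new HashMap<>();
    private final Map<String, String> workspace = new HashMap<>();
    private final Map<String, String> project = new HashMap<>();
    // File path -> language ID, for overrides on individual files.
    private final Map<String, String> perFile = new HashMap<>();

    void setDefault(String contentType, String language) { defaults.put(contentType, language); }
    void mapInWorkspace(String contentType, String language) { workspace.put(contentType, language); }
    void mapInProject(String contentType, String language) { project.put(contentType, language); }
    void mapFile(String filePath, String language) { perFile.put(filePath, language); }

    /**
     * Assumed precedence: the most specific mapping wins.
     * Returns null if nothing at any level maps the content type.
     */
    String resolve(String filePath, String contentType) {
        if (perFile.containsKey(filePath)) return perFile.get(filePath);
        if (project.containsKey(contentType)) return project.get(contentType);
        if (workspace.containsKey(contentType)) return workspace.get(contentType);
        return defaults.get(contentType);
    }

    public static void main(String[] args) {
        LanguageMappingSketch mappings = new LanguageMappingSketch();
        mappings.setDefault("cSource", "gnu-c");          // contributed default
        mappings.mapInProject("cSource", "c99");          // user changes the project mapping
        mappings.mapFile("src/parallel.c", "upc");        // user overrides a single file
        System.out.println(mappings.resolve("src/main.c", "cSource"));
        System.out.println(mappings.resolve("src/parallel.c", "cSource"));
    }
}
```

The point of the layering is that the result no longer depends on which contributed language happens to be found first; the user's most specific choice decides.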

The language mapping feature is great for those that compile the same project on multiple platforms with different compilers that all support slightly different variations of C or C++. Now if you have a build configuration for each platform you can set the language mappings on each configuration individually, and your code will be parsed properly in each configuration (provided that you have an ILanguage to handle those scenarios). It's also great for embedded vendors, as most of them tend to have slightly different variations on the C programming language to enable you to do some cool things like handle interrupts, etc. This way they can define their own ILanguage which can handle these differences.

Another big thing we've been working on is making it easier to create the parsers for those language variants. The most common use case here comes from people like the embedded vendors I mentioned. For the most part the language they are implementing is nearly identical to C or C++, and they just need to add a couple of keywords or a few new constructs. Up until now they've pretty much had to write a whole new parser for that from the ground up. The GNU C and C++ parsers that are bundled with CDT are lean, mean, hardcoded state machines, and they are pretty difficult to get your head around if you are brave enough to crack open the code; difficult enough that most people who wanted to integrate a new language variant into CDT pretty much gave up right there. Don't get me wrong, those parsers are great at what they do (and without them we'd have been parserless for years), but they were designed with performance in mind, and not readability or maintainability. If you tried to extend the concrete classes in order to modify the behaviour of the parser, you'd end up overriding the big gnarly functions that do most of the work, so if we ever fixed a bug in the original parser it probably wouldn't trickle down to your code unless you looked for the fix and cut and pasted it into the parser you created.

Enter the new parsers that my team has been working on. One of our core requirements from our internal customers here at IBM was support for new language variants. Since we knew we were going to have several variants to support over the next few years, it seemed like a worthwhile investment to create some kind of extensible parser framework. To keep things "simple" we started with C. The goal was to create a basic C parser based on the ISO C99 standard, and to make it reusable to support other language variants. In theory, language implementers would then get C parsing for "free" and could concentrate on just defining the delta of their language compared to the base language.

It seemed natural for us to use a parser generator for this. A parser generator takes as input a grammar which specifies the rules of a language, and from that grammar it generates a parser that can handle that language. Just having a grammar will let you recognize whether a set of input abides by the rules of the language, but generally you want to do more than that. Typically you would also define semantic actions in your grammar that do interesting things, which in our case was to build up an AST with CDT's DOM AST APIs, so that once the language was parsed all those cool tools I mentioned earlier could recognize the structure of the code and do Cool Things(TM) with it.
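If semantic actions are new to you, here is a tiny sketch of the idea. Note the substitutions: this is a hand-written, top-down parser for a made-up one-rule grammar, not generated LPG output (LPG parsers are bottom-up), and the Node class is a stand-in invented for this example rather than CDT's DOM AST. The only thing it is meant to show is the difference between recognizing input and running an action at each rule to build a tree.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy parser for: sum ::= NUMBER ( '+' NUMBER )*  with AST-building actions. */
public class SemanticActionSketch {
    /** Minimal stand-in AST node (hypothetical; not a CDT type). */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node kid : kids) children.add(kid);
        }

        @Override public String toString() {
            if (children.isEmpty()) return label;
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node child : children) sb.append(' ').append(child);
            return sb.append(')').toString();
        }
    }

    private final String[] tokens; // whitespace-separated tokens
    private int pos = 0;

    SemanticActionSketch(String input) {
        this.tokens = input.trim().split("\\s+");
    }

    Node parseSum() {
        Node left = parseNumber();
        while (pos < tokens.length && tokens[pos].equals("+")) {
            pos++; // consume '+'
            Node right = parseNumber();
            // Semantic action: the rule matched, so build an AST node for it.
            left = new Node("+", left, right);
        }
        if (pos != tokens.length) {
            throw new IllegalArgumentException("unexpected token: " + tokens[pos]);
        }
        return left;
    }

    private Node parseNumber() {
        if (pos >= tokens.length || !tokens[pos].matches("\\d+")) {
            throw new IllegalArgumentException("expected a number at token " + pos);
        }
        // Semantic action: wrap the matched token in a leaf node.
        return new Node(tokens[pos++]);
    }

    static Node parse(String input) {
        return new SemanticActionSketch(input).parseSum();
    }

    public static void main(String[] args) {
        System.out.println(parse("1 + 2 + 3")); // prints (+ (+ 1 2) 3)
    }
}
```

Take out the two lines marked as semantic actions and you still have something that accepts or rejects input, but nothing downstream tools could work with.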

So, what we did was create a C99 parser using the LPG parser generator, with semantic actions in it to build up an AST for CDT. LPG is a parser generator built by some folks at IBM Research, which is being used for the parser in Eclipse's JDT, as well as for the SAFARI IDE Generator. The cool thing about LPG is that it has a notion of language inheritance. What this means is that if you take our C99 grammar file, you can do the equivalent of a #include in your own grammar for your own language to pull in our grammar. You can then add new rules or override our rules as you see fit, i.e. you get C parsing "for free".
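As a rough mental model of that inheritance, you can think of a grammar as a table of rules, where a derived grammar starts from a copy of the base table and then adds or replaces entries. The sketch below is exactly that and nothing more: it is plain Java, not LPG grammar syntax, the Grammar class and its methods are invented for this example, and the productions are simplified strings rather than anything taken from our real C99 grammar.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of grammar inheritance. Hypothetical; not LPG syntax or API. */
public class GrammarInheritanceSketch {
    static final class Grammar {
        // Nonterminal -> list of alternative productions (simplified to strings).
        private final Map<String, List<String>> rules = new LinkedHashMap<>();

        /** Start from a deep copy of the base rules: the analogue of including its grammar file. */
        static Grammar extending(Grammar base) {
            Grammar derived = new Grammar();
            for (Map.Entry<String, List<String>> e : base.rules.entrySet()) {
                derived.rules.put(e.getKey(), new ArrayList<>(e.getValue()));
            }
            return derived;
        }

        /** Add one more alternative to a rule, keeping the inherited ones. */
        Grammar addAlternative(String nonterminal, String production) {
            rules.computeIfAbsent(nonterminal, k -> new ArrayList<>()).add(production);
            return this;
        }

        /** Replace all inherited alternatives of a rule. */
        Grammar override(String nonterminal, String... productions) {
            rules.put(nonterminal, new ArrayList<>(List.of(productions)));
            return this;
        }

        List<String> alternatives(String nonterminal) {
            return rules.getOrDefault(nonterminal, List.of());
        }
    }

    /** A two-rule stand-in for the base language. */
    static Grammar c99() {
        return new Grammar()
            .addAlternative("iteration_statement", "while ( expr ) statement")
            .addAlternative("iteration_statement", "for ( expr ; expr ; expr ) statement");
    }

    /** The variant only states its delta: one extra loop form with four expressions. */
    static Grammar upc() {
        return Grammar.extending(c99())
            .addAlternative("iteration_statement",
                            "upc_forall ( expr ; expr ; expr ; expr ) statement");
    }

    public static void main(String[] args) {
        for (String production : upc().alternatives("iteration_statement")) {
            System.out.println(production);
        }
    }
}
```

The derived grammar only had to state what is new. Everything it did not mention, including any later fix to the base rules, comes along with the include.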

The results of this were pretty amazing. One of our requirements which we got from the Eclipse Parallel Tools Platform people was to support a new programming language coming out, Unified Parallel C, which is a variant of C for massively parallel applications. The language adds new keywords and constructs which allow you to control the parallelization and synchronization of your program. By including the C99 grammar in our UPC grammar, we were able to get UPC working in a matter of days. Time to go back to our screenshot of the CDT editor, with a UPC file open:

There's a whole lot of cool stuff going on there:
  • Syntax highlighting of new keywords (upc_forall)
  • Outline view works
  • Content assist is finding constructs in the code
  • Content assist is working on constructs that are not normally legal C!!! It's a subtle point, but take a look at where the caret is in the upc_forall statement. This construct takes four expressions, not the usual three that your plain old ordinary for loop takes. Yet, content assist in that fourth expression just plain works!
Doing all this with the old parser would have taken a long time and been very error-prone. I would definitely say that thus far this effort has been a resounding success.

After CDT 4.0 is out the door we're going to start looking at doing some more interesting things with parsers.

  • Firstly, we want to write a GNU C language variant on top of our C99 parser and see how that stacks up against the existing GNU parser in CDT in terms of correctness and performance. We're already re-using all of the parser JUnits on our parser, so I already have a warm fuzzy about correctness. If the speed is good enough then I would love to replace the old parser with one based on ours because then it will be a lot easier to maintain.
  • Secondly, we're going to start tackling C++. Parsing C++ properly is a very difficult problem, given all the ambiguities in the language itself. I know personally of teams of people using bottom-up parsing techniques to parse C++ so I know it can be done (LPG is bottom-up too), but we have to figure out how feasible this is to do with LPG. Luckily we have a good line of communication with the LPG authors, and they are keen to see LPG being used successfully on C++, so if we encounter any roadblocks hopefully we can work together to smash them down.
The future for language support in CDT is looking very bright :-)

5 comments:

Unknown said...

Sounds great but where can I find the c99 grammar file etc?

Chris Recoskie said...

It's all in the CDT CVS under org.eclipse.cdt/c99

Pop me an email if you have problems getting setup... recoskie at ca.ibm.com

Unknown said...

Hi Chris,

For a long time now I have been trying to figure out how to actually use the DOM AST. I was able to get the AST from the translation unit and was also able to write the visit functions for IASTName. But then how do I recognize what type I have to use? For example, when I have to check for an array declaration, the code does not recognize IASTArrayDeclarator. Is there any way I can understand how the AST for C works? I have worked with Java, and even then I am not able to figure it out for C.

Chris Recoskie said...

Megha,

You should post your question to the CDT newsgroup, which you can find on the Eclipse.org website. It's kind of hard to answer the question properly in blog comments :-)

Unknown said...

Hello Chris,
I am having a problem while extending the CDT managed builder.
Basically, I am creating a new project wizard page, which creates a project using "createCDTproject()";
after that I add the MinGW project type and toolchain to it, along with all the configuration.
When I build the project, I successfully get the required DLL files.
Now I want to change the build path directory so that the DLL is created in my desired location, not in "/project/debug/".
Apart from this, I want to do some pre-build steps, like compiling other files specific to my project.
I am new to CDT, but I came to know that I need to create my own makefile for this. Please correct me if I am wrong.
So I used
setmanagedbuild(false)
setbuildcommand("make.....")
setbuildpath(".....")

but this is not working for me. When I look at the project's C/C++ build properties, the build command is updated but the path is not. Moreover, the "build makefile automatically" button is selected by default.

When I build, it creates some makefiles inside the "../project/debug/" directory.

Is there something I am missing?


Thanks
Padam