Authors: Paul Gazzillo, Robert Grimm Title: Parsing All of C by Taming the Preprocessor Abstract Given the continuing popularity of C for building large-scale pro- grams, such as Linux, Apache, and Bind, it is critical to provide effective tool support, including, for example, code browsing, bug finding, and automated refactoring. Common to all such tools is a need to parse C. But C programs contain not only the C lan- guage proper but also preprocessor invocations for file inclusion (#include), conditional compilation (#if, #ifdef, and so on), and macro definition/expansion (#define). Worse, the preproces- sor is a textual substitution system, which is oblivious to C con- structs and operates on individual tokens. At the same time, the preprocessor is indispensable for improving C's expressivity, ab- stracting over software/hardware dependencies, and deriving vari- ations from the same code base. The x86 version of the Linux ker- nel, for example, depends on about 7,600 header files for file in- clusion, 7,000 configuration variables for conditional compilation, and 520,000 macros for code expansion.
In this paper, we present a new tool for parsing all of C, in- cluding arbitrary preprocessor use. Our tool, which is called Su- perC, is based on a systematic analysis of all interactions between lexing, preprocessing, and parsing to ensure completeness. It first lexes and preprocesses source code while preserving conditionals. It then parses the result using a novel variant of LR parsing, which automatically forks parsers when encountering a conditional and merges them again when reaching the same input in the same state. The result is a well-formed AST, containing static choice nodes for conditionals. While the parsing algorithm and engine are new, nei- ther grammar nor LR parser table generator need to change. We discuss the results of our problem analysis, the parsing algorithm itself, the pragmatics of building a real-world tool, and a demon- stration on the x86 version of the Linux kernel.