Parsing All of C by Taming the Preprocessor
Authors: Paul Gazzillo, Robert Grimm

Title: Parsing All of C by Taming the Preprocessor

Given the continuing popularity of C for building large-scale pro-
grams, such as Linux, Apache, and Bind, it is critical to provide
effective tool support, including, for example, code browsing, bug
finding, and automated refactoring. Common to all such tools is
a need to parse C. But C programs contain not only the C lan-
guage proper but also preprocessor invocations for file inclusion
(#include), conditional compilation (#if, #ifdef, and so on),
and macro definition/expansion (#define). Worse, the preproces-
sor is a textual substitution system, which is oblivious to C con-
structs and operates on individual tokens. At the same time, the
preprocessor is indispensable for improving C's expressivity, ab-
stracting over software/hardware dependencies, and deriving vari-
ations from the same code base. The x86 version of the Linux ker-
nel, for example, depends on about 7,600 header files for file in-
clusion, 7,000 configuration variables for conditional compilation,
and 520,000 macros for code expansion.
In this paper, we present a new tool for parsing all of C, in- cluding arbitrary preprocessor use. Our tool, which is called Su- perC, is based on a systematic analysis of all interactions between lexing, preprocessing, and parsing to ensure completeness. It first lexes and preprocesses source code while preserving conditionals. It then parses the result using a novel variant of LR parsing, which automatically forks parsers when encountering a conditional and merges them again when reaching the same input in the same state. The result is a well-formed AST, containing static choice nodes for conditionals. While the parsing algorithm and engine are new, nei- ther grammar nor LR parser table generator need to change. We discuss the results of our problem analysis, the parsing algorithm itself, the pragmatics of building a real-world tool, and a demon- stration on the x86 version of the Linux kernel.