A tour of GNAT

The GNAT system has four main components:

  • a front-end, written in Ada, which parses, analyzes, and expands Ada95 programs. The output of the front-end is an annotated abstract syntax tree for an Ada compilation unit.

  • A tree translator, which takes the output of the front-end and translates it into gcc internal trees. This component of GNAT is written in C, and is familiarly known as Gigi (the gnat-to-gnu translator, aka GG).

  • the GCC back-end, which generates assembly code for the target machine.

  • a run-time library, written mostly in Ada, which contains the predefined units of the language (I/O packages, elementary functions, complex types, random-number generation, etc) as well as support for tasking on various target operating systems.

    The front-end consists of over 400 K lines of Ada, distributed among 909 files. Gigi consists of 21 K lines of C in 30 files. The GCC back-end is more than 500K lines of C, and includes code generators for most processors in use today. The front-end and Gigi are fully machine-independent. The back-end of GCC is completely language independent, and interestingly much of it is machine- independent as well.

    The front-end

    The front-end of GNAT consists of the following traditional compiler phases:

  • the lexical scanner.
  • the parser.
  • the semantic analyzer.
  • the expander.

    These phases are invoked by the procedure Frontend. The reader will notice that only the parser and the semantics are called explicitly. This is because the lexical scanner is invoked from the parser, and supplies successive tokens on demand, and because the expander phase is recursively entwined with the semantic phase.

    The lexical scanner

    The lexical scanner supplies successive tokens to the parser. Scanner and parser communicate through global data structures that describe the current token, including its lexical kind and its position in the source. The scanner reads the compilation unit into memory in full, so that only one input-output operation needs to be performed per compilation unit.

    Global data structures

    The interface between the scanner and the parser is found in file scans.ads. The enumeration Token_Type lists the token kinds needed to describe the lexical structures of Ada95. Each delimiter, each operator symbol, and each reserved word is a distinct Token_Type. This convention makes the code more compact and efficient.
    The variable Token indicates the current token being examined by the parser. The procedure Scan moves past the current Token and recognizes the next one. The variable Prev_Token indicates the immediately preceeding token. Uses such as the following should be self-explanatory (this is from the parsing of short-circuit expressions):
             if (Prev_Token = Tok_And and then Token = Tok_Then)
                  or else
                (Prev_Token = Tok_Or  and then Token = Tok_Else)
             then
                Scan;
             end if;
    

    Algorithm

    The lexical scanner proper is found in files scn and its subunits scn-nlit, scn-slit. The main procedure, called Scan, processes one token. The input is held in a character array called Source. The variable Scan_Ptr is the index of the character currently being examined. The algorithm is driven by the value of Source (Scan_Ptr). In many compilers the algorithm is structured as a case statement, but here for efficiency it is an extended if-statement. The first entries take care of one- and two-character tokens. The non-trivial cases are identifiers, keywords, numeric literals, and string literals. The code is complicated only by the need to handle an extended character set that may include 16-bit wide characters, and by possible style-checks (optional) that verify the consistency of casing conventions.

    If the first character in a token is a letter, it indicates the start of an identifier or keyword. The code that recognizes identifiers is headed by the label Scan_Identifier. The presence of a label may seem anomalous in code that otherwise is as well-structured as one could wish, but it reflects the fact that a lexical scanner is a finite-state machine, and the simplest way to program an FSM is to use a labelled code fragment for each state.

    If the token is an identifier, it is entered into the Names table, a global data structure used by all phases of the compiler. The global variable Token_Name is an index into that table. This table is a hash-table and it stores only one occurrence of each string. When the parser constructs the syntax tree for a given construct, it places the token_name in the corresponding node. Visibility analysis, and all error message processing, use these names, rather than the tokens themselves. The tokens only exist while the parser is building the AST.

    If the first character in the token is a digit, this indicates the start of a numeric literal, either integer or real, and possibly given with an explicit base notation. The rest of the literal is processed in procedure Nlit. The subsidiary procedure Scan_Integer is used for integer literals and for the integer and fractional parts of real numbers.

    If the first character of the token is a double-quote, it is the beginning of a string literal. The processing is straightforward, except for the need to handle extended characters, and for the fact that in certain contexts in Ada a string may denote an operator, as in the declaration:
           function "+" (M1, M2 : Matrix) return Matrix;
    
    The lexical scanner is not always able to recognize whether a string is an operator, so it converts all strings that may denote an operator into an operator token. Semantic analysis may undo this choice if the context indicates that the construct is just a character string, as in the concatenation: "and" & " so what?".

    Performance Issues

    The lexical scanner must examine every single character in the program, and therefore no occasion to minimize the amount of work per character should be ignored. The reader will notice several places where loops have been unrolled to speed up some common processing (for example skipping blanks at the beginning of the line) Various tables indexed by characters are used to determine the lexical category in which a character belongs, by means of a direct table lookup.

    The Parser

    The GNAT parser is a recursive descent parser: the structure of the parser, and the details of the code, reflect directly the syntax of the language, as described in the Ada95 Reference manual. The core of the parser consists of a series of functions, each one of each is a recognizer for one production in the language. In order to reflect closely the language definition, the parser is organized into separate files which are in 1-1 correspondence with the chapters of the ARM. The driver is found in file par.adb. Subunits par-ch2, par-ch3,..par-ch12 contain the actual recognizer routines. For example, par-ch3 processes all type and object declarations, par-ch5 handles various statement forms, and par-ch9 handles all language constructs involving tasks and protected objects. For documentation purposes, the header of each procedure includes the production which it recognizes. For example, here is the full procedure for recognizing assignments:
       -------------------------------
       -- 5.2  Assignment Statement --
       -------------------------------
    
       --  ASSIGNMENT_STATEMENT ::=
       --    variable_NAME := EXPRESSION;
    
       --  Error recovery: can raise Error_Resync
    
       function P_Assignment_Statement (Lhs : Node_Id) return Node_Id is
          Assign_Node : Node_Id;
    
       begin
          Assign_Node := New_Node (N_Assignment_Statement, Prev_Token_Ptr);
          Set_Name (Assign_Node, LHS);
          Set_Expression (Assign_Node, P_Expression_No_Right_Paren);
          TF_Semicolon;
          return Assign_Node;
       end P_Assignment_Statement;
    
    The parser is not just a recognizer: it builds the syntax tree for the program. Thus, each subprogram is a function that returns the tree fragment corresponding to the construct that it recognizes. Each call to New_Node indicates the construction of a tree node corresponding to some non-terminal. The descendants of the node are constructed by calls to the corresponding functions. In this case, the assignment statement has two descendants: a name (the left-hand side of the assignment) and an expression. This function is called after the left-hand side and the assignment symbol have been processed, so the tree corresponding to the left-hand side is a parameter in the call (it has been built already). The right-hand side is recognized by P_Expression_No_Right_Paren (a variant of P_Expression that checks that the expression is not followed by a right parenthesis) followed by a semicolon. The assignment node is then returned to the caller.

    Procedures with names of the form T_xxx, where Tok_xxx is a token name, check that the current token matches the required token, and if so, scan past it. If not, an error is issued indicating that the required token is not present (xxx expected). In most cases, the scan pointer is not moved in the not-found case.

    Procedures with names of the form TF_xxx, where Tok_xxx is a token name, check that the current token matches the required token, and if so, scan past it. If not, an error message is issued indicating that the required token is not present (xxx expected). For example, the assignment statement, like all other statements, is semicolon-terminated, and TF_Semicolon will find the expected symbol or else complain.