These phases are invoked by the procedure Frontend. The reader will notice
that only the parser and the semantic analyzer are called explicitly. This is
because the lexical scanner is invoked from the parser, and supplies successive
tokens on demand, and because the expander phase is recursively entwined with
the semantic analyzer.
The lexical scanner
The lexical scanner supplies successive tokens to the parser. Scanner and
parser communicate through global data structures that describe the current
token, including its lexical kind and its position in the source. The
scanner reads the compilation unit into memory in full, so that only one
input-output operation needs to be performed per compilation unit.
Global data structures
The interface between the scanner and the parser is found in file
scans.ads. The enumeration Token_Type lists the token
kinds needed to describe the lexical structures of Ada95. Each delimiter,
each operator symbol, and each reserved word is a distinct Token_Type. This
convention makes the code more compact and efficient.
The variable Token indicates the current token being examined by the parser.
The procedure Scan moves past the current Token and recognizes the next one.
The variable Prev_Token indicates the immediately preceding token. Uses such
as the following should be self-explanatory (this is from the parsing of
short-circuit control forms):

   if (Prev_Token = Tok_And and then Token = Tok_Then)
     or else (Prev_Token = Tok_Or and then Token = Tok_Else)
The lexical scanner proper is found in files scn and
its subunits scn-nlit, scn-slit. The main procedure,
called Scan, processes one token. The input is held in a character array
called Source. The variable Scan_Ptr is the index of the
character currently being examined. The algorithm is driven by the value of
Source (Scan_Ptr). In many compilers the algorithm is structured as a case
statement, but here for efficiency it is an extended if-statement. The first
entries take care of one- and two-character tokens. The non-trivial cases
are identifiers, keywords, numeric literals, and string literals. The code
is complicated only by the need to handle an extended character set that
may include 16-bit wide characters, and by optional style checks that
verify the consistency of casing conventions.
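The dispatch on Source (Scan_Ptr) can be sketched as below. This is a deliberately tiny illustration in Python, not GNAT code: the token names and the handful of cases are invented, and only a few of the one-, two-, and multi-character cases are shown.

```python
# Illustrative sketch: the main scanning routine dispatches on the character
# at the scan pointer, using an extended if-statement rather than a case
# statement, as described in the text.
def scan_token(source: str, ptr: int):
    """Return (token_kind, lexeme, new_ptr) for the token starting at ptr."""
    c = source[ptr]
    if c == '(':                                     # one-character token
        return ('Tok_Left_Paren', c, ptr + 1)
    elif c == ':' and ptr + 1 < len(source) and source[ptr + 1] == '=':
        return ('Tok_Colon_Equal', ':=', ptr + 2)    # two-character token
    elif c.isalpha():                                # cf. label Scan_Identifier
        start = ptr
        while ptr < len(source) and (source[ptr].isalnum() or source[ptr] == '_'):
            ptr += 1
        return ('Tok_Identifier', source[start:ptr], ptr)
    elif c.isdigit():                                # cf. procedure Nlit
        start = ptr
        while ptr < len(source) and source[ptr].isdigit():
            ptr += 1
        return ('Tok_Integer_Literal', source[start:ptr], ptr)
    else:
        raise ValueError(f"unexpected character {c!r}")
```

Each branch both classifies the token and advances the scan pointer past it, which is exactly the contract of GNAT's Scan.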
If the first character in a token is a letter, it indicates the start of an
identifier or keyword. The code that recognizes identifiers is headed by the
label Scan_Identifier. The presence of a label may seem anomalous in code
that otherwise is as well-structured as one could wish, but it reflects the
fact that a lexical scanner is a finite-state machine, and the simplest way
to program an FSM is to use a labelled code fragment for each state.
If the token is an identifier, it is entered into the Names table, a global
data structure used by all phases of the compiler. The global variable
Token_Name is an index into that table. This table is a hash-table
and it stores only one occurrence of each string. When the parser constructs
the syntax tree for a given construct, it places the Token_Name in the
corresponding node. Visibility analysis, and all error message processing, use
these names, rather than the tokens themselves. The tokens only exist while the
parser is building the AST.
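The names table is an instance of string interning: each distinct identifier string is stored once, and all later occurrences map to the same index. A minimal sketch of the technique (in Python, not GNAT's actual Namet package; the method names are invented):

```python
# Illustrative sketch of a names table: a hash table that stores only one
# occurrence of each identifier string and hands out stable indices.
class NamesTable:
    def __init__(self):
        self._index_of = {}   # string -> name index
        self._strings = []    # name index -> string

    def name_enter(self, s: str) -> int:
        """Return the index for s, entering it on first occurrence only."""
        idx = self._index_of.get(s)
        if idx is None:
            idx = len(self._strings)
            self._strings.append(s)
            self._index_of[s] = idx
        return idx

    def name_string(self, idx: int) -> str:
        """Recover the original string, e.g. for error messages."""
        return self._strings[idx]
```

Because two occurrences of the same identifier yield the same index, visibility analysis can compare names by comparing integers rather than strings.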
If the first character in the token is a digit, this indicates the start of
a numeric literal, either integer or real, and possibly given with an explicit
base notation. The rest of the literal is processed in procedure Nlit.
The subsidiary procedure Scan_Integer is used for integer literals and for
the integer and fractional parts of real numbers.
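The division of labor between Nlit and Scan_Integer can be sketched as follows. This is an illustrative Python sketch, not GNAT code: it handles only plain decimal literals (with Ada's embedded underscores), not based notation or exponents.

```python
# Illustrative sketch: a digit-string scanner reused for the integer part
# and the fractional part of a real literal, cf. Scan_Integer.
def scan_integer(source: str, ptr: int):
    """Scan digits (and embedded underscores) starting at ptr."""
    start = ptr
    while ptr < len(source) and (source[ptr].isdigit() or source[ptr] == '_'):
        ptr += 1
    return source[start:ptr].replace('_', ''), ptr

def scan_numeric_literal(source: str, ptr: int):
    """Scan an integer or real literal, cf. procedure Nlit."""
    int_part, ptr = scan_integer(source, ptr)
    if ptr < len(source) and source[ptr] == '.':
        frac_part, ptr2 = scan_integer(source, ptr + 1)
        if frac_part:                    # Ada requires digits after the point
            return float(int_part + '.' + frac_part), ptr2
    return int(int_part), ptr
```

Note that the literal is classified as integer or real only after the scan: the presence of a point followed by digits decides.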
If the first character of the token is a double-quote, it is the beginning of
a string literal. The processing is straightforward, except for the need to
handle extended characters, and for the fact that in certain contexts in
Ada a string may denote an operator, as in the declaration:
function "+" (M1, M2 : Matrix) return Matrix;
The lexical scanner is not always able to recognize whether a string is an
operator, so it converts all strings that may denote an operator into an
operator token. Semantic analysis may undo this choice if the context indicates
that the construct is just a character string, as in the concatenation:
"and" & " so what?".
The lexical scanner must examine every single character in the program, and
therefore no opportunity to minimize the amount of work per character should
be missed. The reader will notice several places where loops have been
unrolled to speed up common processing (for example, skipping blanks at the
beginning of a line). Various tables indexed by character are used to
determine, by a direct table lookup, the lexical category to which a
character belongs.
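The character-category tables work as sketched below. This is an illustration in Python, not GNAT code; the category names and the set of delimiter characters are invented for the example.

```python
import string

# Illustrative sketch: a table indexed by character code gives the lexical
# category of each character in a single lookup.
LETTER, DIGIT, SPACE, DELIM, OTHER = range(5)

CATEGORY = [OTHER] * 256
for c in string.ascii_letters:
    CATEGORY[ord(c)] = LETTER
for c in string.digits:
    CATEGORY[ord(c)] = DIGIT
for c in ' \t\r\n':
    CATEGORY[ord(c)] = SPACE
for c in '()+-*/.,;:&<>=\'"':
    CATEGORY[ord(c)] = DELIM

def skip_blanks(source: str, ptr: int) -> int:
    """Advance past leading blanks; in GNAT this hot loop is unrolled."""
    while ptr < len(source) and CATEGORY[ord(source[ptr])] == SPACE:
        ptr += 1
    return ptr
```

A table lookup replaces a chain of comparisons, which matters in a loop executed once per source character.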
The parser
The GNAT parser is a recursive descent parser: the structure of the parser,
and the details of the code, directly reflect the syntax of the language,
as described in the Ada95 Reference Manual. The core of the parser consists
of a series of functions, each of which is a recognizer for one production
of the language. In order to reflect the language definition closely, the
parser is organized into separate files that are in one-to-one correspondence
with the chapters of the ARM. The driver is found in file par.adb.
Subunits par-ch2, par-ch3,..par-ch12 contain the actual
recognizer routines. For example, par-ch3 processes all
type and object declarations, par-ch5 handles various
statement forms, and par-ch9 handles all language constructs
involving tasks and protected objects. For documentation purposes, the header
of each procedure includes the production which it recognizes. For example,
here is the full procedure for recognizing assignments:
   -- 5.2 Assignment Statement --

   --  ASSIGNMENT_STATEMENT ::=
   --    variable_NAME := EXPRESSION;

   --  Error recovery: can raise Error_Resync

   function P_Assignment_Statement (Lhs : Node_Id) return Node_Id is
      Assign_Node : Node_Id;
   begin
      Assign_Node := New_Node (N_Assignment_Statement, Prev_Token_Ptr);
      Set_Name (Assign_Node, Lhs);
      Set_Expression (Assign_Node, P_Expression_No_Right_Paren);
      TF_Semicolon;
      return Assign_Node;
   end P_Assignment_Statement;
The parser is not just a recognizer: it builds the syntax tree for the
program. Thus, each subprogram is a function that returns the tree fragment
corresponding to the construct that it recognizes. Each call to New_Node
indicates the construction of a tree node corresponding to some non-terminal.
The descendants of the node are constructed by calls to the corresponding
functions. In this case, the assignment statement has two descendants: a
name (the left-hand side of the assignment) and an expression. This function
is called after the left-hand side and the assignment symbol have been
processed, so the tree corresponding to the left-hand side is a parameter in
the call (it has been built already). The right-hand side is recognized by
P_Expression_No_Right_Paren (a variant of P_Expression that checks that the
expression is not followed by a right parenthesis) followed by a semicolon.
The assignment node is then returned to the caller.
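The tree-building pattern just described can be sketched in miniature. This is an illustrative Python sketch, not GNAT code: tokens are (kind, value) pairs, the node shapes are invented, and the "expression" recognizer is reduced to a single token so that the shape of the recursion stays visible.

```python
# Illustrative sketch: each P_xxx function recognizes one production and
# returns the tree fragment for it, mirroring P_Assignment_Statement above.

def p_assignment_statement(tokens, pos, lhs):
    """ASSIGNMENT_STATEMENT ::= variable_NAME := EXPRESSION;
    Called after the left-hand side and ':=' have been scanned, so the
    tree for the left-hand side arrives as a parameter."""
    expr, pos = p_expression(tokens, pos)
    pos = tf_semicolon(tokens, pos)
    return {'kind': 'N_Assignment_Statement',
            'name': lhs, 'expression': expr}, pos

def p_expression(tokens, pos):
    # Toy expression recognizer: a lone identifier or integer literal.
    kind, value = tokens[pos]
    if kind in ('identifier', 'integer'):
        return {'kind': 'N_' + kind.capitalize(), 'value': value}, pos + 1
    raise SyntaxError(f"expression expected at token {pos}")

def tf_semicolon(tokens, pos):
    """Require and scan past ';', cf. TF_Semicolon."""
    if tokens[pos][0] == ';':
        return pos + 1
    raise SyntaxError('";" expected')
```

As in GNAT, the descendants of the assignment node are built by calls to the corresponding recognizer functions, and the finished node is returned to the caller.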
Procedures with names of the form T_xxx, where Tok_xxx is a token
name, check that the current token matches the required token, and
if so, scan past it. If not, an error is issued indicating that
the required token is not present (xxx expected). In most cases, the
scan pointer is not moved in the not-found case.
Procedures with names of the form TF_xxx, where Tok_xxx is a token name,
behave like the corresponding T_xxx procedures, except that when the
required token is not at the current position, the parser also attempts to
resynchronize by searching ahead for the missing token and, if it is found,
scanning past it. For example, the assignment statement, like all other
statements, is semicolon-terminated, and TF_Semicolon will find the expected
symbol or else complain.
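The difference between the two helper families can be made concrete with a small sketch. This is illustrative Python, not GNAT code; errors are collected in a list rather than reported, so that the difference in recovery behavior is observable.

```python
# Illustrative sketch of the T_xxx and TF_xxx conventions described above.
errors = []

def t_token(tokens, pos, expected):
    """T_xxx style: scan past the expected token if present; otherwise
    report it missing and leave the scan position unchanged."""
    if pos < len(tokens) and tokens[pos] == expected:
        return pos + 1
    errors.append(f'"{expected}" expected')
    return pos

def tf_token(tokens, pos, expected):
    """TF_xxx style: like T_xxx, but on failure search forward for the
    expected token and resynchronize past it if found."""
    if pos < len(tokens) and tokens[pos] == expected:
        return pos + 1
    errors.append(f'"{expected}" expected')
    ahead = pos
    while ahead < len(tokens):
        if tokens[ahead] == expected:
            return ahead + 1      # found it later: skip to just past it
        ahead += 1
    return pos                    # not found at all: stay put
```

Staying put (T_xxx) suits contexts where the caller has its own recovery; skipping ahead (TF_xxx) suits required terminators such as the semicolon.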