The Pattern Classes

Overview:  two representations of patterns

A PatternCollection consists of a set of (named) patterns and a sequence of PatternSets.  Each PatternSet consists in turn of a sequence of PatternRules.  A PatternRule specifies a pattern to look for in the input text, and a sequence of Actions to be performed if this pattern is found.  It has the structure
when pattern, action, action, ...
The basic top-level pattern-matching operation in JET is the application of a PatternSet to a text segment, by the method PatternSet.apply().

There are two different representations used internally for patterns and rules.  The first representation (the pattern and rule representation) corresponds closely to the structure of a pattern file:  each pattern is represented separately as a sequence of pattern elements, and a pattern set is represented as a sequence of rules, where each rule consists of a pattern (name) and actions.  This representation is used when the pattern file is being read in.  It would also be suitable in the future if we provide some facility for editing patterns interactively.

Once the patterns have all been read in (or, in the future, after the patterns are modified), each pattern set is converted to a pattern graph.  The graph represents all the patterns in a PatternSet as a single directed graph.  Optional elements, repeated elements, and references within one pattern to other patterns are "expanded" in the graph;  this simplifies the process of pattern matching.  More importantly, however, the graph is created with a view to graph optimizations which may be performed in the future.  Identical arcs leading from a node can be merged into a single arc (this corresponds to identifying common pattern prefixes).  In addition, if a large number of arcs leading from a single node match different strings, these can be reduced to a hash table to avoid sequential matching of the current token against each string.
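The hash-table optimization mentioned above can be illustrated with a minimal sketch.  It is not JET code:  the class StringArcIndexSketch, its methods, and the use of integer node ids as arc targets are all assumptions made for the example.  It contrasts testing the current token against each string arc in turn with a single lookup in a table built once per node.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; none of these names come from JET.
class StringArcIndexSketch {

    /** Sequential matching: compare the token with each arc label in turn. */
    static int sequentialTarget(String[] labels, int[] targets, String token) {
        for (int i = 0; i < labels.length; i++) {
            if (labels[i].equals(token)) {
                return targets[i];
            }
        }
        return -1; // no arc matches this token
    }

    /** Build the table once per node: arc label -> id of the target node. */
    static Map<String, Integer> buildIndex(String[] labels, int[] targets) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < labels.length; i++) {
            // Assumes identical arcs were already merged; otherwise the first one wins.
            index.putIfAbsent(labels[i], targets[i]);
        }
        return index;
    }

    /** Hashed matching: one lookup, however many string arcs leave the node. */
    static int indexedTarget(Map<String, Integer> index, String token) {
        return index.getOrDefault(token, -1);
    }

    public static void main(String[] args) {
        String[] labels = {"Mr.", "Mrs.", "Dr.", "Prof."};
        int[] targets = {1, 2, 3, 4};
        Map<String, Integer> index = buildIndex(labels, targets);
        System.out.println(sequentialTarget(labels, targets, "Dr."));
        System.out.println(indexedTarget(index, "Dr."));
    }
}
```

Both lookups return the same target for any token;  the difference is that the cost of the hashed version does not grow with the number of arcs leaving the node.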

AtomicPatternElements

AtomicPatternElements (an abstract class) are PatternElements which do not contain embedded references to other PatternElements.  Examples are TokenStringPatternElement (which matches a particular string appearing as a token), AnnotationPatternElement (which matches an annotation on the document), and IntegerPatternElement (which matches an integer token).  In the graph representation of the pattern set, each arc is labeled with an AtomicPatternElement.

PatternElements

In the pattern and rule representation, the patterns are represented as nested sets of PatternElements.  PatternElement is the abstract class which includes both AtomicPatternElements (described just above) and various classes for composing pattern elements, such as PatternRepetition, PatternSequence, and PatternAlternation.  For example, the pattern
p1 p2 | p3*
is represented as a PatternAlternation with two alternatives;   the first alternative ("p1 p2") is a PatternSequence with two elements, p1 and p2;  while the second alternative ("p3*") is a PatternRepetition.
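
The nesting just described can be sketched with a few stand-in classes.  These are not JET's actual class definitions:  Element, Ref, Sequence, Alternation, and Star are simplified, hypothetical counterparts of PatternElement, a pattern reference, PatternSequence, PatternAlternation, and PatternRepetition, and their constructors are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative stand-ins only; these are not JET's actual class definitions.
class PatternTreeSketch {

    /** Stand-in for the abstract PatternElement class. */
    interface Element {
        String show();
    }

    /** A leaf that refers to a named pattern such as p1. */
    static final class Ref implements Element {
        final String name;
        Ref(String name) { this.name = name; }
        public String show() { return name; }
    }

    /** Stand-in for PatternSequence: elements matched one after another. */
    static final class Sequence implements Element {
        final List<Element> elements;
        Sequence(Element... elements) { this.elements = Arrays.asList(elements); }
        public String show() { return "Sequence" + join(elements); }
    }

    /** Stand-in for PatternAlternation: any one alternative may match. */
    static final class Alternation implements Element {
        final List<Element> options;
        Alternation(Element... options) { this.options = Arrays.asList(options); }
        public String show() { return "Alternation" + join(options); }
    }

    /** Stand-in for PatternRepetition; only the "*" form is modeled here. */
    static final class Star implements Element {
        final Element body;
        Star(Element body) { this.body = body; }
        public String show() { return "Repetition(" + body.show() + ")"; }
    }

    /** Render a list of elements as "(a, b, ...)". */
    static String join(List<Element> parts) {
        List<String> shown = new ArrayList<>();
        for (Element e : parts) {
            shown.add(e.show());
        }
        return "(" + String.join(", ", shown) + ")";
    }

    /** The nested structure for the pattern  p1 p2 | p3*  */
    static Element example() {
        return new Alternation(
                new Sequence(new Ref("p1"), new Ref("p2")),
                new Star(new Ref("p3")));
    }

    public static void main(String[] args) {
        // prints: Alternation(Sequence(p1, p2), Repetition(p3))
        System.out.println(example().show());
    }
}
```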

Every class of PatternElement has a toGraph method for converting the element into a pattern graph.

PatternGraphs

A pattern graph consists of PatternNodes connected by PatternArcs.  Associated with each PatternArc is an AtomicPatternElement -- the condition under which this arc may be traversed when matching the pattern graph against the text.  PatternNodes are divided into InternalPatternNodes and FinalPatternNodes.  InternalPatternNodes have outgoing edges (the alternative pattern elements to be matched in the next step of pattern matching).  FinalPatternNodes have sets of actions -- the actions to be performed if the pattern matching process successfully reaches this point.

Associated with PatternNodes is an eval method which matches the graph rooted at that node against the text.  The eval method on a node invokes the eval method on each PatternArc leaving the node.  The eval method on the arc invokes in turn the eval method on the AtomicPatternElement associated with the arc;  the latter eval methods actually test the document (for the presence of a particular token, for example).
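
The chain of eval calls described above can be sketched as follows.  This is a simplified illustration, not JET's implementation:  the class names, the eval signatures, the representation of the text as a list of token strings, and the recording of actions as strings are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of node -> arc -> atomic-element evaluation; not JET code.
class GraphEvalSketch {

    /** Stand-in for AtomicPatternElement: tests a single token. */
    interface Atomic {
        boolean eval(String token);
    }

    /** Stand-in for PatternNode. */
    static abstract class Node {
        abstract void eval(List<String> tokens, int pos, List<String> fired);
    }

    /** Stand-in for PatternArc: an atomic test plus the node it leads to. */
    static final class Arc {
        final Atomic element;
        final Node target;
        Arc(Atomic element, Node target) {
            this.element = element;
            this.target = target;
        }
        void eval(List<String> tokens, int pos, List<String> fired) {
            // The arc delegates the actual test to its atomic element.
            if (pos < tokens.size() && element.eval(tokens.get(pos))) {
                target.eval(tokens, pos + 1, fired);
            }
        }
    }

    /** Stand-in for InternalPatternNode: tries every outgoing arc. */
    static final class InternalNode extends Node {
        final List<Arc> arcs = new ArrayList<>();
        InternalNode arc(Atomic element, Node target) {
            arcs.add(new Arc(element, target));
            return this;
        }
        @Override
        void eval(List<String> tokens, int pos, List<String> fired) {
            for (Arc arc : arcs) {
                arc.eval(tokens, pos, fired);
            }
        }
    }

    /** Stand-in for FinalPatternNode: records its action when reached. */
    static final class FinalNode extends Node {
        final String action;
        FinalNode(String action) { this.action = action; }
        @Override
        void eval(List<String> tokens, int pos, List<String> fired) {
            fired.add(action + " at " + pos);
        }
    }

    /** Match a small graph for  "New" "York"  and  "New" <integer>. */
    static List<String> match(String... tokens) {
        InternalNode afterNew = new InternalNode()
                .arc(t -> t.equals("York"), new FinalNode("city"))
                .arc(t -> t.matches("\\d+"), new FinalNode("numbered"));
        InternalNode root = new InternalNode()
                .arc(t -> t.equals("New"), afterNew);
        List<String> fired = new ArrayList<>();
        root.eval(Arrays.asList(tokens), 0, fired);
        return fired;
    }

    public static void main(String[] args) {
        System.out.println(match("New", "York")); // prints [city at 2]
        System.out.println(match("New", "42"));   // prints [numbered at 2]
        System.out.println(match("Old", "York")); // prints []
    }
}
```

In this sketch the two patterns share the arc for "New", which is the kind of common-prefix merging the graph representation is meant to allow.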

Translating Repetitions

In the pattern-and-rule representation, a repetition (A?, A+, or A*) is represented by a PatternRepetition object.  This is translated into a graph structure.  A* is translated into

while A+ is translated into