next up previous
Next: 6.5 Exceptions Up: 6. Conclusions Previous: 6.3 Types

6.4 String Handling

As remarked in Section 2.14.1 [Matching by Regular Expression], the extensions to SETL's string slicing and subscripting forms to allow selections to be made on the basis of patterns described by regular expressions when magic is true, and the functions sub, gsub, mark, gmark, and split, which likewise accept regular expression patterns, have proven to be very useful in their own right, but sub/gsub and the assigning forms s(p) := r and s(p1..p2) := r lack the means to refer to matched substrings conveniently in the construction of replacement strings.

The way this is done in the standard Unix 98 editing tools, which effectively support the one-pattern forms (e.g. s(p) := r but not s(p1..p2) := r), is to take an ampersand (&) or backslash-escaped zero (\0) in the replacement string to mean the entire matched substring, and to take \1\2, etc. to mean substrings that are framed in the pattern by backslash-escaped parentheses, i.e. \( and \), where \k in the replacement string refers to the the kth such framed pattern. The backslash-escaped parentheses are not themselves matched against characters in the subject string, but can be nested, and are numbered according to where the left parenthesis begins in the pattern.

SNOBOL [99] and subsequently the MTS [188] editor allow variables to be assigned as a side-effect of pattern matching, by providing syntax that associates variables directly with subpatterns. The ``immediate value assignment'' form causes these variables to be assigned during the course of the matching process, while ``conditional value assignment'' defers such assignments until matching is complete (and no assignment will be made if the pattern fails to match the subject string). There is generally no practical reason for preferring conditional value assignment, though efficiency was originally (and probably wrongly) cited. In any case, the resulting values of the variables are available for use in constructing a replacement string or for any other subsequent purpose.

SNOBOL pattern matching is very general. For example, it is fully backtracking, and deferred-value expressions with embedded function calls can be incorporated in patterns. There is a whole suite of predefined patterns and pattern-matching functions. Substring starting positions as well as substrings can be assigned to variables during matching.

Since regular expressions have already been introduced into SETL and are familiar to Unix 98 users, the ideal pattern-matching facility for SETL will integrate their use with some SNOBOL-like ability to incorporate matched substrings in expressions defining replacement strings. Given immediate value assignment into SETL variables, there should be no need for the rather arcane and error-prone ampersand and backslash escape conventions of replacement strings in the regular expression regime of Unix editors, however, since such strings will be able to be constructed by much more powerful means.

Patterns logically form a distinct type, values of which should be produced by certain functions and operators.

For example, the exclamation mark (!) might be used as a binary operator that takes a pattern argument p on the left and an arbitrary assignment target t on the right, and yields a pattern which, when evaluated in the course of matching, assigns to t the substring matched by p. This is like SNOBOL's immediate value assignment. Such operators should also be overloaded for other types of p that serve as primitive patterns, such as string, ordered pair of strings (consistent with the overloading of sub, etc.), and routine (a pattern-matching function).

The at-sign (@) might be a unary operator that takes an arbitrary assignment target t and produces a pattern which matches the null string and assigns to t the ``cursor position'' (string index) of the current location of matching. Again, this corresponds to a SNOBOL primitive operator.

Several primitive functions are also worth borrowing from SNOBOL. For example, arbno should take a pattern argument p (or string etc. that can serve as a pattern) and return a pattern that matches zero or more occurrences of p. The functions pos and rpos should each take an integer argument n and match the null string if and only if the matching cursor is currently n characters from the beginning and end of the string, respectively. Unary overloadings of any, break, len, notany, span, and rany through rspan, all of which already appear in SETL as binary functions, are also available for use as pattern-producing functions.

The ``unevaluated expressions'' of SNOBOL, which are useful in all sorts of pattern-matching situations, could be approximated readily. With no extra syntax, any tuple whose first element is a routine could serve as a pattern in which subsequent elements, evaluated prior to matching, would be passed as arguments to the routine when the pattern was encountered during matching. The asterisk (*) could be given similar significance as a unary special form when it appears before a global function name f that might or might not be followed by a parenthesized list of arguments: *f(x1x2, ...) would be equivalent to [routine fx1x2, ...] except that it would produce an actual pattern instead of just a tuple that can serve as a pattern. Likewise, *f would be a pattern equivalent to routine f or [routine f].

The reason for insisting that the arguments intended for f in the foregoing be evaluated when the pattern expression is being constructed rather than when it is being used in matching is that to do otherwise would be to invite dynamic scope violations: patterns can certainly be constructed and yielded by functions. This is also why f must be a global function name (as all function names are in the current version of SETL). The asterisk notation could easily be extended to apply to global variables, where it would defer their evaluation until pattern matching time. Nesting, as in *f(x1, *g(*y1y2), *x3), is of course perfectly feasible.

Pattern-matching functions yield what they match from the pattern matcher's point of view. They also advance a cursor when they succeed. There are several ways in which user-defined pattern-matching functions could be stereotyped in order to map to this behavior. For example, the subject string and cursor could always be supplied as parameters, or they could be predefined variables that the system pushes and pops when necessary to accommodate pattern matching nested within such a function. The function could indicate failure by yielding om and success by yielding the number of characters matched, or perhaps yield the matched substring itself on success. Alternatively, the function could be a predicate, advancing the cursor and yielding true on success, but on failure leaving it alone and yielding false. ``Off-the-shelf'' predicates could then be used in patterns to match the null string and allow matching to continue (success), or match nothing and cause the pattern matcher to back up and seek alternatives (failure).

Existing SETL operator symbols are also overloadable for operations such as pattern concatenation and alternation. Alternation could use a new symbol or overload an old one, but concatenation should surely be done with ``+'', since that is how strings are concatenated. This raises the slight problem that the plus-sign is also used for tuple concatenation, and although some tuples can serve as patterns, not all their concatenations make sense as patterns. For this reason, and because programmers should be encouraged to write code that is free from even the hint of ambiguity, there should be a pat operator that converts its argument to a pattern or helpfully complains.

In expressions s(p) := r, where s is a string, p is a pattern or equivalent (s(p1..p2) may be read as s([p1p2]) under this proposal), and r is a replacement string, it is important to specify something about SETL semantics that was previously left open, namely which of s(p) and r is evaluated first. Clearly, if immediate value assignments in p are to produce values that will be of use in constructing the replacement string r, then s(p) must be evaluated first. Note that s was already specified as the very first part of such forms to be evaluated [181, p. 93], and the last to be assigned. Its value is effectively copied into a temporary t initially. Then matching is performed, and substring replacement is performed on t. Finally the result in t is copied to s. In a complex statement such as *

s(p)(i..j) := r;
which is equivalent to *
temp := s(p);
temp(i..j) := r;
s(p) := temp;
the expression s(p)is defined to be evaluated twice, and the programmer must as always beware of unintended side-effects. The best policy is to ensure that the side-effects in s(p) consist of nothing more than assignments which can occur either once or twice with equivalent effect. This neither runs afoul of the semantic assumptions nor interferes with the machinations of optimizers.

next up previous
Next: 6.5 Exceptions Up: 6. Conclusions Previous: 6.3 Types
David Bacon