Next: 2.15 Field Selection Syntax for Maps Up: 2. Environmentally Friendly I/O Previous: 2.13 Normal and Abnormal Endings



  
2.14 Strings

Ultimately, all input and output reduces to the communication of strings. The importance of string handling in data processing languages was appreciated in both CIMS SETL [181] and SETL2 [190], which went beyond the already powerful string slicing operations and introduced a set of built-in procedures inspired by the intrinsic pattern-matching functions of SNOBOL.

  
2.14.1 Matching by Regular Expression

I have gone a step further and extended the string slicing operations themselves so that wherever an integer subscript is required, a regular expression may be used instead. The regular expression is itself just a string in which certain characters called ``metacharacters'' are not meant to be taken literally, but act as patterns. The pattern-defining sublanguage is very similar to that accepted by the GNU egrep command. The predefined boolean variable

magic
may be assigned false to make all the metacharacters literal instead of special. Because magic is a global variable that defaults to true, the SETL programmer should normally set it back that way after any code sequence that requires it to be false, so for convenience there is also
old_magic := set_magic (new_magic);
where old_magic and new_magic are boolean. For example, a piece of code in a subroutine should set magic according to its local needs and then restore it:
saved_magic := set_magic (false);   -- we need metacharacters turned off
  ...   --   ... pattern-matching activity ...
set_magic (saved_magic);          -- restore prevailing value of magic
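
The save-and-restore idiom above can be modelled outside SETL. The following is a minimal Python sketch, not SETL code: the module-level flag `magic` and the helper `literal_scan` are hypothetical stand-ins, and the `try`/`finally` guard is an addition of the sketch rather than something the SETL fragment provides.

```python
# Hypothetical Python model of the global magic flag (not SETL itself).
magic = True  # defaults to true, as described in the text


def set_magic(new_magic: bool) -> bool:
    """Install a new value of magic and return the previous one."""
    global magic
    old_magic, magic = magic, new_magic
    return old_magic


def literal_scan() -> bool:
    """Stand-in for a subroutine that needs metacharacters turned off."""
    saved_magic = set_magic(False)
    try:
        return magic            # pattern-matching activity would go here
    finally:
        set_magic(saved_magic)  # restore the prevailing value of magic
```

Because `set_magic` returns the old value, nested callers can each restore exactly what they found without knowing the default.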

The string slicing extensions work as follows. Given string s and regular expression pattern p, the expression

s(p)
refers to the leftmost substring of s that satisfies p (and p itself will be ``greedy'' in what it matches wherever the Kleene star or other unbounded subpattern occurs). This expression may be used in store or fetch positions as usual, replacing or producing a substring accordingly. If there are no occurrences of p in s, then s(p) has the value om, and assigning to s(p) has no effect.

Given s and two regular expression patterns p1 and p2, the expression

s(p1..p2)
refers to the substring of s which begins with the leftmost substring satisfying p1 and ends with the first substring to the right of that satisfying p2. For example, if s contains the text of a C program, the assignment
s(`/\\*'..`\\*/') := ` ';
replaces its first C comment (if any) with a blank.

We see here a consequence of the fact that the backslash is the ``literal next character'' indicator both in SETL strings and in the regular expression sublanguage. To match an actual asterisk, rather than have the asterisk in the pattern interpreted as a ``0 or more occurrences'' suffix operator (Kleene star), it is necessary to double the backslash. This produces a single backslash in the string value corresponding to the raw denotation, and this backslash in turn protects the asterisk in the regular expression.

Alternatively, of course, magic could be set to false so that

s(`/*'..`*/') := ` ';
would have the desired effect.
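
To make the fetch and store semantics of s(p) and s(p1..p2) concrete, here is a small Python model built on the standard `re` module. It is a sketch under assumptions: the function names (`find_span`, `fetch`, `store`) are invented for illustration, om is modelled as `None`, the search for p2 is assumed to start where the match of p1 ends, and Python's regular expressions differ from egrep's in some respects (for instance, alternation is leftmost-first rather than leftmost-longest).

```python
import re


def compile_pattern(p: str, magic: bool = True) -> re.Pattern:
    """With magic off, every metacharacter is taken literally."""
    return re.compile(p if magic else re.escape(p))


def find_span(s, p1, p2=None, magic=True):
    """Return the (start, end) span of s(p1) or s(p1..p2), or None."""
    first = compile_pattern(p1, magic).search(s)  # leftmost match of p1
    if first is None:
        return None
    if p2 is None:
        return first.span()
    # Assumption: p2 is sought to the right of the p1 match.
    last = compile_pattern(p2, magic).search(s, first.end())
    if last is None:
        return None
    return first.start(), last.end()


def fetch(s, p1, p2=None, magic=True):
    """Model of the expression in fetch position; None plays the role of om."""
    span = find_span(s, p1, p2, magic)
    return None if span is None else s[span[0]:span[1]]


def store(s, replacement, p1, p2=None, magic=True):
    """Model of assignment to the slice; no match means no effect."""
    span = find_span(s, p1, p2, magic)
    if span is None:
        return s
    return s[:span[0]] + replacement + s[span[1]:]
```

As in SETL, a Python string literal such as `"/\\*"` holds a single backslash, which then protects the asterisk in the pattern, so `store(text, " ", "/\\*", "\\*/")` and `store(text, " ", "/*", "*/", magic=False)` are intended to behave alike.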

Although I have found regular expressions for string slicing to be very useful, they do not provide an easy way to construct replacement strings as expressions in terms of matched substrings. This virtue is possessed by SNOBOL and by the standard editing tools in Unix, and is useful enough that I plan to add such a capability to SETL (see Section 6.4 [String Handling]).

Meanwhile, the mark, gmark, sub, gsub, and split built-in routines for scanning and modifying strings help to cover much of the need for a more complete pattern-matching facility:

[i, j] := mark (s, p);                      -- i, j := integers such that s(i..j) = s(p)
[[i1, j1], [i2, j2], ...] := gmark (s, p);  -- all occurrences of p in s
x := sub (s, p, r);                         -- x := s(p); s(p) := r; [if no side-effects in p]
x := sub (s, p);                            -- x := s(p); s(p) := `'; [if no side-effects in p]
[x1, x2, ...] := gsub (s, p, r);            -- all (replaced) occurrences of p in s
[x1, x2, ...] := gsub (s, p);               -- all (deleted) occurrences of p in s
t := split (s, p);                          -- t := tuple of p-delimited substrings of s
t := split (s);                             -- t := split (s, `[ \f\n\r\t\v]+'); [whitespace delim.]
Each pattern argument is denoted p in this synopsis. It may be either a regular expression as with the string slicing extensions, or an ordered pair [p1, p2] of regular expressions, where p1 and p2 behave, in terms of matching, exactly like the p1 and p2 in the slicing form s(p1..p2) just reviewed. As a matter of fact, p1 and/or p2 can be integers in all these forms, for full orthogonality in expressions like s(p1..p2) or mark(s, [p1, p2]). Gsub returns the tuple of substrings of s that are replaced by r. Gmark does not rewrite s but returns a tuple of ordered pairs of (integer) indices such that every pair [ik, jk] frames a substring of s that is entirely matched by the pattern p, or more precisely, such that s(ik..jk)(p) = s(ik..jk).
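
The behaviour summarized in this synopsis can be approximated in Python as follows. This is an illustrative model only: it handles just the single-pattern form (not the [p1, p2] pair), it returns the updated string alongside the result because Python strings are immutable, it uses 1-based inclusive indices to mirror SETL, and its treatment of edge cases such as empty matches or leading delimiters in split is Python's rather than anything the text specifies.

```python
import re

WHITESPACE = "[ \f\n\r\t\v]+"  # default delimiter pattern for split


def mark(s, p):
    """1-based inclusive [i, j] of the leftmost match of p, or None."""
    m = re.search(p, s)
    return None if m is None else [m.start() + 1, m.end()]


def gmark(s, p):
    """Index pairs for every non-overlapping match of p in s."""
    return [[m.start() + 1, m.end()] for m in re.finditer(p, s)]


def sub(s, p, r=""):
    """Return (x, new_s): the leftmost match and s with it replaced by r."""
    m = re.search(p, s)
    if m is None:
        return None, s
    return m.group(0), s[:m.start()] + r + s[m.end():]


def gsub(s, p, r=""):
    """Return (matches, new_s) with every match of p replaced by r."""
    matches = [m.group(0) for m in re.finditer(p, s)]
    # A function is passed so that r is inserted literally.
    return matches, re.sub(p, lambda m: r, s)


def split(s, p=WHITESPACE):
    """Tuple (here a list) of the p-delimited substrings of s."""
    return re.split(p, s)
```

For example, `gmark("hello world", "o")` yields `[[5, 5], [8, 8]]`, and each pair frames a substring that the pattern matches in its entirety.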

More information on these and myriad other intrinsic operations comprising the SETL ``library'' can be found on the World Wide Web [19].

  
2.14.2 Formatting and Extracting Values

The next 3 routines, for formatting numbers in decimal, are named after functions in Algol 68 [137].

The string-valued expression

whole (i, width)
represents the integer i in decimal, with a possible leading minus sign. If the absolute value of the integer width is more than the number of characters in this converted number, then in the manner of printf, if width is positive, the number is right-justified in a field of width characters, and if negative, it is left-justified in a field of -width characters. If i is real, an integer nearest to i takes its place.

For a string that includes a possible decimal point and subsequent digits as well,

fixed (x, width, prec)
takes a real or integer x, a width that functions exactly as in whole, and a non-negative integer prec stating the number of digits to follow the decimal point. If prec is zero, fixed omits the decimal point as well, and in fact acts just like whole then.

For scientific notation, there is

floating (x, width, prec)
which differs from fixed only in that the character `E', followed by a sign and at least 2 decimal digits, is appended to represent the power of 10 by which the part before the `E' is understood to be multiplied. That initial segment will have just one digit before the decimal point (if any). The width specification applies to the entire string.
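
As a rough illustration of these three routines, the sketch below models them with Python's format specifications. It is an approximation under assumptions: the helper `justify` is invented for the sketch, ties in rounding follow Python's rules, and details such as the rendering of negative zero are Python's and may differ from a SETL implementation.

```python
def justify(text: str, width: int) -> str:
    """Right-justify for positive width, left-justify for negative width."""
    return text.rjust(width) if width >= 0 else text.ljust(-width)


def whole(i, width: int) -> str:
    """Decimal integer; a real argument is replaced by a nearest integer."""
    return justify(str(round(i)), width)


def fixed(x, width: int, prec: int) -> str:
    """prec digits after the decimal point; no point at all when prec is 0."""
    return justify(f"{x:.{prec}f}", width)


def floating(x, width: int, prec: int) -> str:
    """Scientific notation: one leading digit, then E, a sign, and digits."""
    return justify(f"{x:.{prec}E}", width)
```

A width whose absolute value is smaller than the converted number has no effect, so `whole(123456, 3)` simply gives `123456`.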

Integers can also be rendered in explicit-radix form. The call

strad (x, radix)
for integers x and radix, given a value of radix in the range 2 to 36, produces a string of the form `radix#digits', where the radix part is in decimal and the digits part consists of digits in the given radix. The convention is that the letters `a' through `z' are digits representing the values 10 through 35, respectively. Here are some examples:
strad (10, 10) = `10#10'
strad (10, 16) = `16#a'
strad (10, 2) = `2#1010'
strad (-899, 36) = `-36#oz'
The contents of the strings produced by strad would be acceptable as integer denotations if compiled as part of a SETL program, and would also be acceptable to the read, reada, and reads routines mentioned in Sections 2.3.1 [Sequential Reading and Writing] and 2.3.2 [String I/O], as well as to the unstr, val, and denotype operators described below. In all of these cases, another sharp sign (`#') may optionally be appended to the literal without changing its meaning.
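
The explicit-radix form is easy to reproduce. The following Python sketch mirrors strad and adds a hypothetical inverse, `unstrad`, that accepts the optional trailing sharp sign; the letter `a' stands for ten, as the examples above show. The name `unstrad` and the use of `ValueError` for an out-of-range radix are assumptions of the sketch.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def strad(x: int, radix: int) -> str:
    """Render x as `radix#digits', with a leading minus sign if negative."""
    if not 2 <= radix <= 36:
        raise ValueError("radix must be in the range 2 to 36")
    n, digits = abs(x), ""
    while True:
        n, d = divmod(n, radix)
        digits = DIGITS[d] + digits
        if n == 0:
            break
    sign = "-" if x < 0 else ""
    return f"{sign}{radix}#{digits}"


def unstrad(text: str) -> int:
    """Parse `radix#digits', allowing one optional trailing sharp sign."""
    body = text.strip()
    negative = body.startswith("-")
    if negative:
        body = body[1:]
    if body.endswith("#"):
        body = body[:-1]
    radix, digits = body.split("#")
    value = int(digits, int(radix))
    return -value if negative else value
```

Round-tripping through these two functions reproduces the integer, which is the property the identities later in this section rely on.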

Finally, the programmer can always use the general-purpose str operator to let the system choose how to format a given number. For integers, this will always be a decimal string, preceded by a minus sign (`-') if appropriate. Str also occurred in CIMS SETL and in SETL2.

As introduced in SETL2, the unstr operator is approximately the inverse of str. It cannot produce an atom (only newat can do that), nor a procedure value (only the routine operator can do that). Also, it is not guaranteed that (1/3) = unstr str (1/3), because there is no guarantee about how many digits str will produce. It is merely implementation advice that the number of significant digits yielded be close to, but not exceed, the precision of the machine representation of a SETL real, which should normally have at least 50 bits of mantissa.

When the programmer wishes to determine whether a given string s consists of a single valid numeric denotation (with possible leading and/or trailing whitespace), and obtain the corresponding value if it does,

val s            -- this is real or integer or om
will yield the appropriate value. Note that the following identities hold for any integers x and width, and any radix in the range 2 to 36:
x = val str x
x = val whole (x, width)
x = val strad (x, radix)
x = unstr strad (x, radix)

By design, an important difference between unstr and val is that val is defined to return om when its argument is a string but does not consist of a numeric denotation, whereas the behavior of unstr is unspecified for invalid arguments. The intent is that SETL implementations raise some kind of exception when unstr cannot recognize a SETL denotation in its argument. At the time of this writing, there is no formally defined exception mechanism for SETL, though see Section 6.5 [Exceptions]. Meanwhile, checking implementations are expected to handle this kind of error in some manner helpful to programmers. For example, when my SETL implementation [19] detects an error at run time, it highlights a source line, points to a relevant token, and displays a subroutine traceback.

In order to determine whether a string would be acceptable to unstr,

denotype s
is defined as type unstr s if s consists of a valid SETL value denotation, but om otherwise. No exceptions!
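
The contrast between val, which answers om for a non-numeric string, and the intended exception-raising behaviour of unstr can be sketched in Python. Everything here is an assumption made for illustration: the accepted syntax is a simplified guess (optional minus sign, plain or radix integers, and reals written with digits on both sides of the decimal point), `None` stands for om, and `unstr_number` covers numbers only, whereas the real unstr handles every kind of SETL denotation.

```python
import re

INTEGER = re.compile(r"-?[0-9]+")
RADIX = re.compile(r"(-?)([0-9]+)#([0-9a-z]+)#?")
REAL = re.compile(r"-?[0-9]+\.[0-9]+([eE][+-]?[0-9]+)?")


def val(s: str):
    """Numeric value of s, or None (om) if s is not a numeric denotation."""
    text = s.strip()  # leading and trailing whitespace is permitted
    if INTEGER.fullmatch(text):
        return int(text)
    m = RADIX.fullmatch(text)
    if m:
        sign, radix, digits = m.group(1), int(m.group(2)), m.group(3)
        if not 2 <= radix <= 36:
            return None
        try:
            n = int(digits, radix)
        except ValueError:  # a digit too large for the radix
            return None
        return -n if sign else n
    if REAL.fullmatch(text):
        return float(text)
    return None


def unstr_number(s: str):
    """Numbers-only stand-in for unstr that raises on invalid input."""
    value = val(s)
    if value is None:
        raise ValueError(f"not a numeric denotation: {s!r}")
    return value
```

A caller that cannot tolerate an exception tests first, in the spirit of denotype, with a check such as `val(s) is not None`.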

  
2.14.3 Printable Strings

When str is confronted with a string argument, it increases the quoting level if necessary by surrounding the string with quote marks and doubling internal ones, but leaves all ``unprintable'' characters as they are. (The reason it may not be necessary to add quote marks is that the string may have the form of a SETL identifier--an alphabetic character followed by alphanumeric and underscore characters. Str and unstr are identity operators on strings with content restricted in exactly this way.)

The expression

pretty s
formats a string s such that all characters are represented as ``printable'' characters. Quotes and backslashes are doubled, the ``control'' characters are represented in C or SETL denotation form as shown in the following table, all other unprintable characters are rendered as a backslash followed by octal digits, and the remaining characters, all printable, are left as they are:

FORM    FUNCTION

`\a'    audible alarm
`\b'    backspace
`\f'    formfeed
`\n'    newline (linefeed)
`\r'    carriage return
`\t'    horizontal tab
`\v'    vertical tab
The pretty operator also encloses the result string in quotes.
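
A Python sketch of the transformation just described may help fix the rules. Several details are assumptions not settled by the text: the enclosing quote is taken to be the apostrophe (and that is the quote that gets doubled), ``printable'' is taken to mean ASCII codes 32 through 126, and unprintable characters are written with exactly three octal digits.

```python
CONTROL_ESCAPES = {
    "\a": "\\a", "\b": "\\b", "\f": "\\f", "\n": "\\n",
    "\r": "\\r", "\t": "\\t", "\v": "\\v",
}


def pretty(s: str) -> str:
    """Render s using printable characters only, enclosed in quotes."""
    out = []
    for ch in s:
        if ch == "'":
            out.append("''")                 # quotes are doubled
        elif ch == "\\":
            out.append("\\\\")               # backslashes are doubled
        elif ch in CONTROL_ESCAPES:
            out.append(CONTROL_ESCAPES[ch])  # control characters from the table
        elif 32 <= ord(ch) < 127:
            out.append(ch)                   # printable characters are kept
        else:
            out.append(f"\\{ord(ch):03o}")   # backslash plus octal digits
    return "'" + "".join(out) + "'"
```

For instance, a string holding the letter a, a tab, and the letter b comes out as the seven characters `'a\tb'`, quotes included.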

Conversely,

unpretty p
takes a pretty string p and performs the inverse operation. It is of course liberal enough even to accept some strings that pretty would not produce, though it does insist on the enclosing quotes (single or double).

Another operator which converts a string to another string having all characters printable is

hex s
which has the inverse
unhex s
so that unhex hex s = s. Unhex returns om if its argument fails to consist of an even number of hexadecimal characters, those being the decimal digits and the letters `a' through `f' in upper or lower case.
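
The hex and unhex pair corresponds closely to facilities in Python's standard library, which makes for a compact model. In this sketch the SETL string is represented as a Python `bytes` value, `None` stands for om, and the function is named `hex_string` only to avoid shadowing Python's built-in `hex`; whether a SETL implementation emits lower-case or upper-case digits is not stated in the text, and Python happens to emit lower case.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def hex_string(s: bytes) -> str:
    """Two hexadecimal digits per character of s."""
    return s.hex()


def unhex(text: str):
    """Inverse of hex_string; None (om) unless text is an even number of hex digits."""
    if len(text) % 2 != 0 or not set(text) <= HEX_DIGITS:
        return None
    return bytes.fromhex(text)
```

A device command can then be written readably in program text as something like `unhex("0241420d")` instead of a literal full of escapes.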

Hex is particularly useful for instrumenting low-level code in which special string encodings are used, such as when a serial-line device has a predefined command protocol. The Canon VC-C3 [35] videocamera system to which the control service described in Section 4.1.2 [Camera Control Services] interfaces is a perfect example. Similarly, unhex makes it very easy to set up a low-level diagnostic tool, to allow the prober to throw arbitrary strings at the device. In programs such as vc-model.setl, listed in Section A.27 [vc-model.setl], unhex can also be seen to serve the rather trivial but welcome purpose of facilitating the use of hexadecimal string denotations in the program text itself, thereby avoiding the need for `\x' escapes to be repeatedly embedded in string literals--a low-level aid to readability.

2.14.4 Case Conversions and Character Encodings

The expression

to_upper s
is the same as s except that all lowercase characters are converted to their corresponding uppercase forms, and
to_lower s
is the obvious complement. These case conversion operators are useful for canonicalizing a string such as might occur in an input command to a program, because then all subsequent tests or map lookups on the converted string can be effectively case-insensitive.

Following the CIMS version of SETL, the asterisk is overloaded to allow a string s to be ``multiplied'' by a non-negative integer n to produce the concatenation of n copies of s. The arguments can be in either order. For example, a row of 70 dashes can be specified as (70 * `-') or (`-' * 70).

Likewise, lpad(s, n) and rpad(s, n) yield copies of s padded with blanks as necessary on the left or right, respectively, to make up n characters.

As in CIMS SETL, the char operator takes an integer that is the internal code of some character, and returns that character as a string of length 1. The abs operator is overloaded to act as char's inverse, and

ichar s
is introduced in SETL as the equivalent of abs s when s is a string (and it is an error for s not to be a string).
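
The operations of this subsection have near-direct counterparts in Python, shown here as a brief sketch. The wrapper names follow the SETL operators, `lookup_command` is a hypothetical helper illustrating case-insensitive map lookup, and the 8-bit character assumption discussed in this section is not enforced by the sketch.

```python
def repeat(s: str, n: int) -> str:
    """Concatenation of n copies of s, as with n * s or s * n in SETL."""
    return n * s


def lpad(s: str, n: int) -> str:
    """Pad with blanks on the left, as necessary, to make up n characters."""
    return s.rjust(n)


def rpad(s: str, n: int) -> str:
    """Pad with blanks on the right, as necessary, to make up n characters."""
    return s.ljust(n)


def char(code: int) -> str:
    """One-character string whose internal code is the given integer."""
    return chr(code)


def ichar(s: str) -> int:
    """Internal code of a one-character string; non-strings are an error."""
    if not isinstance(s, str):
        raise TypeError("ichar requires a string")
    return ord(s)


def lookup_command(table: dict, command: str):
    """Case-insensitive lookup: canonicalize the key before the map access."""
    return table.get(command.upper())
```

With the table keyed in upper case, `lookup_command({"START": 1}, "Start")` finds the entry regardless of how the user typed the command.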

The up-to-date reader will note that no distinction has been made between bytes and characters for SETL strings. In effect, only the ``POSIX locale'' defined in Unix 98 is accommodated by the current design of SETL, and characters are assumed to occupy 8 bits. However, the language is not strongly tied to this assumption, and can be expected to evolve gracefully toward support for ``wider'' characters and for contemporary internationalization and localization standards. Areas of the language for which compatibility issues will arise (although the new definitions should largely be upwardly compatible with the existing ones) include the char and ichar operators of this section, hex and unhex, escape sequences in string denotations, and direct-access I/O operations. Which characters are considered ``printable'', the collating order among strings, case conversions, the decimal point symbol, and the format of times and dates should all ultimately become locale-dependent. If the locale can be changed by a SETL program during its own execution, which does not seem unreasonable, there will also be dynamic convertibility concerns to be addressed.

2.14.5 Concatenation and a Note on Defaults

String concatenation is a very common operation, particularly when used for building up output strings. In principle, it is possible to require the SETL programmer to apply str to every value that is not already a string when building up a string, but in practice, it is much more convenient for the programmer if str is invoked implicitly. This is not by any means a context in which type mistakes are likely to be disastrous. Moreover, many strings are built for the sake of producing error messages, so insisting on an explicit str in front of every non-string expression in a long concatenation is more likely to cause an important diagnostic to be lost in a gratuitous crash than to prevent a critical type error from going unnoticed and having its deadly effects propagated far. For example, expressions that evaluate to om in this situation will show up as `*' in the concatenated string, and this itself conveys useful information.

It happens that the ``+:='' operator is overloaded to support om as the initial value of its (writable) left-hand operand when the right-hand operand is an integer, real, string, set, or tuple, in which case it acts as if it had been initialized to the appropriate identity element (0, 0.0, `', {}, or [ ], respectively). This is helpful in loops, such as when tallies are being recorded against keys in a map, e.g.:

tally_map := {};
for x ... loop
  tally_map(x) +:= 1;
    ...
end loop;
Without the identity-element default, the statement ``tally_map(x) +:= 1;'' above would have to be preceded by ``tally_map(x) ?:= 0;'', which in practice is a nuisance that is hard to justify by the need for protection against failure-to-initialize errors. But since the expression om + `a' is supposed to be equivalent to str om + `a' (which has the value `*a') by the implicit-str rule, the question arises: should x +:= `a' for uninitialized x mean x := `a', or x := `*a'? It is unusual to want to form a string starting with the converted value of om (indeed, the very use of ``+:='' in a string-building expression that is itself meant to be copied somewhere is stylistically questionable), but it is not at all unusual to want automatic initialization of a string that will be accumulated by concatenation, so the decision is easily made in favor of the former interpretation.
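
The difference between plain ``+'' and ``+:='' on an om operand can be modelled in a few lines of Python. This is a sketch under assumptions: `None` stands for om, only numbers, strings, and lists (for tuples) are modelled, sets are left out, and the implicit conversion of numbers uses Python's `str` rather than SETL's.

```python
def concat(left, right):
    """Model of `+' with the implicit-str rule; om is rendered as `*'."""
    def as_string(v):
        if isinstance(v, str):
            return v
        return "*" if v is None else str(v)
    return as_string(left) + as_string(right)


def plus_assign(left, right):
    """Model of `left +:= right'; an om left operand acts as the identity."""
    if left is None:
        return right
    if isinstance(left, str) or isinstance(right, str):
        return concat(left, right)
    return left + right


def tally(items):
    """Count occurrences without initializing each map entry first."""
    tally_map = {}
    for x in items:
        tally_map[x] = plus_assign(tally_map.get(x), 1)
    return tally_map
```

Here `concat(None, "a")` gives `*a`, while `plus_assign(None, "a")` gives `a`, so a string accumulated from an uninitialized variable starts out empty.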


David Bacon
1999-12-10