next up previous
Next: 6.10 Summary Up: 6. Conclusions Previous: 6.8 SETL Implementations

6.9 Comparison with Other Languages

The main thing that distinguishes SETL from other high-level languages is its syntax. Its roots in mathematical notation, developed and refined by and for people long before computers arrived on the scene, lend it the ability to express a wide variety of computational processes naturally, elegantly, and above all readably.

  
6.9.1 Perl

Perl [182,205,155] is perhaps the most commonly used ``high-level'' language for data processing. It sprang from Larry Wall's frustration with the insufficiency of awk for these purposes, and has grown into a baroque monstrosity through endless patching. The syntax is arcane and turgid (practically every variable reference has to start with a special symbol), and the semantics are just as bad: all undeclared variables are global, and the system interface is so implementation-dependent as to make a mockery of the claims of portability usually made for Perl. Worse, the interface to Unix is so thin that, for example, select cannot be used with the normal buffered I/O streams but only with raw file descriptors, and the semantics of signals depend on which strain of Unix is hosting the Perl implementation. The latter means that re-establishing the signal handler may well suffer from the historical System V race condition that led to the BSD-style and eventually Posix signal handling definitions. Unix routines that are not re-entrant have the same potential for causing disaster if called from signal handlers in Perl as they would in C, and to top it all off, signals will be fielded asynchronously regardless of the state of the garbage-collecting memory manager, leaving the documentation no choice but to advise programmers to ``do as little as possible'' in signal handlers to minimize the probability of disaster--no guarantee can be made in this regard, no matter how simple the handler is kept.

SETL, of course, strictly shields the high-level programmer from such nonsense, by providing a high-level interface to select, by permitting only synchronous access to signals, and in general by keeping all issues of portability internal to the run-time system rather than exposing them at an inappropriate level. Moreover SETL, unlike Perl, provides bidirectional pipe streams (called pumps in SETL), a powerful inter-process communication facility whose usefulness is amply demonstrated by the WEBeye case study of Chapter 4 [WEBeye: A Case Study].

Perl is often cited for conciseness of expression, but I have never found it to beat SETL on that score either. Indeed, the opposite is often true. Here, for example, is a TCP client program derived rather directly from the one in the ``llama'' book [182, p. 224], where it is described as a ``simple client''. This version is modified to be functionally identical to the one-line SETL program in Section 3.1.1 [A Client], and is also similar to the Perl sample program that can be found on the perlipc manual page on most Unix systems:

$port = 13;
$them = 'galt.cs.nyu.edu';

use Socket;

$sockaddr = 'S n a4 x8';

($name, $aliases, $proto) = getprotobyname('tcp');

($name, $aliases, $type, $len, $thisaddr) = gethostbyname('localhost');
($name, $aliases, $type, $len, $thataddr) = gethostbyname($them);

$this = pack($sockaddr, AF_INET, 0, $thisaddr);
$that = pack($sockaddr, AF_INET, $port, $thataddr);

socket(S, PF_INET, SOCK_STREAM, $proto) || die "socket: $!";
bind(S, $this) || die "bind: $!";
connect(S, $that) || die "connect: $!";

print <S>;
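For a rough contrast in a third language, the same connect-and-read job might be sketched in Python; the function name fetch_all and the buffered read loop are illustrative choices, not a canonical client:

```python
import socket

def fetch_all(host, port):
    """Connect to (host, port) over TCP and return everything the server
    sends before closing, decoded as text -- the same connect-and-print
    job the Perl client above performs."""
    with socket.create_connection((host, port)) as s:
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()

# e.g. print(fetch_all("galt.cs.nyu.edu", 13), end="")   # daytime service
```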

Kernighan and Pike [133] code the classic Markov chain algorithm in C, Java, C++, Awk, and Perl to compare them on a typical small data processing task. The program sizes range from 150 lines of source code for the C version down to 18 lines for the Perl version, reproduced here:

# markov.pl:  markov chain algorithm for 2-word prefixes

$MAXGEN = 10000;
$NONWORD = "\n";
$w1 = $w2 = $NONWORD;               # initial state
while (<>) {                       # read each line of input
    foreach (split) {
        push(@{$statetab{$w1}{$w2}}, $_);
        ($w1, $w2) = ($w2, $_);      # multiple assignment
    }
}
push(@{$statetab{$w1}{$w2}}, $NONWORD);      # add tail

$w1 = $w2 = $NONWORD;
for ($i = 0; $i < $MAXGEN; $i++) {
    $suf = $statetab{$w1}{$w2};   # array reference
    $r = int(rand @$suf);         # @$suf is number of elems
    exit if (($t = $suf->[$r]) eq $NONWORD);
    print "$t\n";
    ($w1, $w2) = ($w2, $t);       # advance chain
}
The authors comment that ``the Perl and Awk programs are short compared to the three earlier versions, but they are harder to adapt to handle prefixes that are not exactly two words'' [133, p. 80]--indeed, in the versions written in the lower-level languages, the prefix size was governed by a named integer constant. Perl is very much on home turf here, and in fact this is exactly the sort of program Perl programmers love to write. The following SETL version is only slightly shorter than the above Perl program on which it is modeled, but is more general (the number of prefix words is controlled by a constant) and more direct (for example, the tuple of words recorded in each statetab entry is built up by tuple concatenation rather than with some mysterious push operator). To me, it also looks like a real program instead of a wild festival of glyphs:
-- Markov chain demo
 
const maxgen = 10000;
const npfx = 2;              -- number of prefix words
const nonword = '\n';
 
statetab := {};
words := npfx * [nonword];              -- initial state
for word in split (getfile stdin) loop
    statetab(words) +:= [word];
    words := words(2..) + [word];
end loop;
statetab(words) +:= [nonword];          -- add tail
 
words := npfx * [nonword];
for i in [1..maxgen] loop
    if (word := random statetab(words)) = nonword then stop; end if;
    print(word);
    words := words(2..) + [word];
end loop;
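The same algorithm also transcribes readily into Python, with the prefix length again a parameter; this is a sketch with invented names (markov and its defaults), not code from either book:

```python
import random
from collections import defaultdict

NONWORD = "\n"

def markov(words, npfx=2, maxgen=10000):
    """Markov-chain generation with an npfx-word prefix, following the
    Kernighan & Pike algorithm quoted above.  With npfx a parameter,
    the two-word restriction of the Perl version disappears."""
    statetab = defaultdict(list)
    state = (NONWORD,) * npfx
    for word in words:                      # build the state table
        statetab[state].append(word)
        state = state[1:] + (word,)
    statetab[state].append(NONWORD)         # add tail

    out = []
    state = (NONWORD,) * npfx
    for _ in range(maxgen):                 # generate
        word = random.choice(statetab[state])
        if word == NONWORD:
            break
        out.append(word)
        state = state[1:] + (word,)
    return out
```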

  
6.9.2 Icon

Icon [97,96] is an interesting high-level language for programming in the small. It has no type declarations, and its variable declarations and scope rules are similar to those of SETL. It represents something of an extreme in programming language design in that its expression evaluation mechanism is fully backtracking over generators which suspend and produce a value when first encountered or when asked to resume by being backed into. Unless it is abandoned before it is exhausted, a generator must ultimately fail. Together with some rules which limit backtracking in appropriate contexts, this provides a convenient basis for control flow, because the assignment of expression results into variables typically occurs at the same time as successful evaluation, eliminating the need for a distinct Boolean type.

Icon's pervasive backtracking is a generalization of the backtracking that was used in SNOBOL string pattern matching. I have used both Icon and SNOBOL's near-identical twin SPITBOL extensively, and although the tight interweaving of control flow with expression evaluation in Icon sometimes produces welcome effects, such as the fact that i < j < k accomplishes what in most languages would have to be written as (i < j) and (j < k), the extra ``liveness'' of the ubiquitous generators turns out less often to be useful than to demand vigilance. The backtracking that worked so well for pattern matching in SNOBOL is perhaps at its best in that limited domain. Stranger still, Icon's syntactic refinements of pattern matching never turned out to be quite as comfortable as the original SNOBOL forms.
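Python generators can loosely mimic the suspend-and-resume behaviour described above; in this sketch (names invented), find_all plays the role of Icon's find, and first stands in for taking the first suspended value or failing:

```python
def find_all(s, sub):
    """Loose analogue of Icon's find(sub, s): a generator that suspends
    each match position in turn (1-based, as in Icon) and is exhausted
    -- "fails" -- when no match remains."""
    i = s.find(sub)
    while i >= 0:
        yield i + 1
        i = s.find(sub, i + 1)

def first(gen):
    """Goal-directed flavour: the first suspended value, or None
    (standing in for failure) when the generator produces nothing."""
    return next(gen, None)
```

Incidentally, Python's chained comparisons provide the i < j < k convenience noted above directly, though without any backtracking machinery behind them.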

Recently, Icon has proven itself to be particularly well suited to graphics programming [98], and of course it has always been strong on string manipulation. My feeling is that Icon is a close second to SETL for data processing. Its syntax is fairly elegant, though not as close to that of mathematics as SETL's (in particular, it doesn't have set and tuple formers), and is rather heavy with operator forms--for example, ``not equals'' is spelled ~=, ~==, or ~===, depending on whether numbers, strings, or general objects such as lists are being compared. Icon also lacks SETL's value semantics, as is evident in the fact that the outcome of === or ~=== depends on whether the objects are identical, which for lists means both operands refer to the same list, but for scalars simply means that they have the same value.

Icon allows record definitions which, like procedures, are global. The fieldnames are not typed, and may be addressed by name or by position. Records are lumped in with lists, sets, and tables as the aggregates of values collectively called structures in Icon terminology. They all have pointer semantics, and a copy primitive performs ``one-level'' copying. It is tempting to adopt a record model for SETL that is as simple as Icon's, perhaps borrowing its cavalier attitude towards field typing and reference though not its pointer semantics. I think this should wait, however, until an Ada-based type system such as that suggested in Section 6.3 [Types] has at least been investigated far enough to ensure that no unfortunate incompatibilities will be introduced.

6.9.3 Functional Languages

LISP and derivatives such as Scheme [128], and the purely functional languages such as ML [146] and Haskell [197], are of more theoretical than practical interest for data processing, though Haskell's list comprehensions are very similar to SETL's tuple formers, and its lazy evaluation admits of concise expressions that can represent infinite lists much more directly than functions can. Haskell's strong typing, combined with the ability to dispatch to a routine of the correct signature based on generalized pattern matching, lends it great potential for clarity and elegance. Haskell does not seem to have found its way into the world of data processing to any significant extent. I do not know why, but can only surmise that the lack of assignment is disconcerting to many. In the case of servers, it is not clear that tail recursion is the most natural way to express an infinite main loop, nor that state information such as a map of current client records is best represented as a succession of values, each map (say) being constructed from the previous map and passed as a parameter to the next round of recursion. On the other hand, if one assumes that the optimizer and garbage collector in a Haskell implementation are able to recognize the opportunity for conversion of tail recursion to iteration and the possibility of suppressing copy proliferation, this approach does offer the significant advantage of avoiding the hazards associated with the updating of variables in place.
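The succession-of-values style just described can be sketched concretely (in Python for uniformity with the other examples here; the event vocabulary is invented): each recursive call receives a freshly built map rather than updating one in place.

```python
def serve(clients, events):
    """Carry server state functionally: each recursive call receives a
    fresh mapping built from its predecessor, and nothing is updated in
    place.  The "join"/"leave" event names are invented for the sketch."""
    if not events:
        return clients
    name, action = events[0]
    if action == "join":
        clients = {**clients, name: {}}     # new map; the old one is untouched
    else:
        clients = {k: v for k, v in clients.items() if k != name}
    return serve(clients, events[1:])
```

Python, of course, will not convert this tail call into iteration, which is exactly the obligation the paragraph above places on a Haskell implementation's optimizer.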

  
6.9.4 Python

Python [201] is another language that is close in level to SETL, and is deliberately oriented towards what I have been calling data processing in this dissertation. It descends from ABC [193], a language which Guido van Rossum, the designer and implementer of Python, helped design in the 1980s. Python has a number of odd and in some cases counterintuitive aspects, such as its scoping rules [147]. One of the interesting features it inherits from ABC is the use of indentation as the only syntactic mechanism for control-structure grouping. Python suffers from the regrettable pleonasm of simultaneously offering lists, which are tuples with pointer semantics, and tuples, which are tuples with value semantics. Tuple denotations are marked by parentheses, and the problem of how to denote a singleton is unfortunately solved by writing, e.g., ``(x,)''.

The typical assertions of conciseness for a high-level language are made on behalf of Python [200], but it does not have anything approaching SETL's set and tuple formers. As a consequence, even a ``showpiece'' subroutine such as the following one [199] by van Rossum can be written more concisely and readably in SETL:

def find_path(graph, start, end, path=[ ]):
    path = path + [start]
    if start == end:
        return path
    if not graph.has_key(start):
        return None
    for node in graph[start]:
        if node not in path:
            newpath = find_path(graph, node, end, path)
            if newpath: return newpath
    return None
In SETL, the equivalent function may be written with an existential predicate in place of the for-loop:
proc find_path (graph, first, last);
    if first = last then
        return [first];
    end if;
    if exists next in graph{first} |
          (path := find_path (graph lessf first, next, last)) /= om then
        return [first] + path;
    end if;
    return om;
end find_path;
This assumes a multi-map graph, but can be modified to accommodate an adjacency list by changing ``graph{first}'' to ``graph(first) ? {}''. The difference in languages becomes even more apparent in the following example, again by van Rossum:
def find_all_paths(graph, start, end, path=[ ]):
    path = path + [start]
    if start == end:
        return [path]
    if not graph.has_key(start):
        return [ ]
    paths = [ ]
    for node in graph[start]:
        if node not in path:
            newpaths = find_all_paths(graph, node, end, path)
            for newpath in newpaths:
                paths.append(newpath)
    return paths
In SETL, this becomes
proc find_all_paths (graph, first, last);
    if first = last then
        return {[first]};
    end if;
    return {[first] + path :
                   path in +/{find_all_paths (graph lessf first, next, last) :
                                       next in graph{first}}};
end find_all_paths;
where the structure of the result, a set of paths that are tuples, is more obvious and at the same time more natural than the list of lists returned by the Python function. And of course in SETL, if we are content to have paths represented as subgraphs and are more interested in a concise definition than in micro-managing the optimization process, we can simply write
proc find_all_paths (graph, first, last);
    return {s in pow graph | is_path (s, first, last)};
end proc;
where is_path might be defined as
proc is_path (graph, first, last);
    vertices := domain graph + range graph;
    return #graph = #vertices - 1  and
      domain graph = vertices less last  and
      range graph = vertices less first;
end proc;
which tests for graph being a tree whose domain is all vertices except last and whose range is all vertices except first. Finally, find_path is easily implemented as arb applied to the result of find_all_paths.
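The pow-based definition transcribes almost literally into a brute-force Python sketch (edges as (u, v) pairs; function names follow the SETL above, but the transcription is mine):

```python
from itertools import chain, combinations

def is_path(graph, first, last):
    """Transcription of the SETL is_path above: graph, a set of (u, v)
    edges, is a simple path from first to last iff it is a tree whose
    domain omits only last and whose range omits only first."""
    dom = {u for u, v in graph}
    rng = {v for u, v in graph}
    vertices = dom | rng
    return (len(graph) == len(vertices) - 1
            and dom == vertices - {last}
            and rng == vertices - {first})

def find_all_paths(graph, first, last):
    """Brute force over every subset of the edge set (pow graph)."""
    edges = list(graph)
    subsets = chain.from_iterable(
        combinations(edges, n) for n in range(len(edges) + 1))
    return [set(s) for s in subsets if is_path(set(s), first, last)]
```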

  
6.9.5 Rexx

Rexx [49], originally a shell language that began to supersede the primitive EXEC job control language in the CMS component of IBM's VM/370 operating system during the 1980s, has been used for many years now as a general-purpose high-level language, and has been ported to a wide variety of systems. It retains many of the characteristics of a shell programming language. For example, any commands that are not recognized as designating built-in or user-defined operations are passed to the environmental command interpreter, such as CMS or an editing environment such as that operating system's standard XEDIT.

Although all values manipulated by a Rexx program are strings (even arithmetic is defined only on strings representing numbers), Rexx has both simple variable names such as x and compound ones such as x.y. The latter serve as a general associative device, because the y in x.y can be a number, a simple variable name, or a symbol that is neither a number nor a variable name. Moreover, if y is the name of an uninitialized variable, its default value is `Y', i.e., its own name in uppercase.

This gives considerable power to a language that is fundamentally very simple. Its parser has to be present at run time, because any string can be treated as an expression to be evaluated. Its scopes are unrepentantly dynamic. Rexx is a good ``glue'' language, suitable for high-level control, but it lacks both the mathematical syntax and data structuring capabilities of SETL. Although its compound variable names allow it to be used directly for data processing, these shortcomings and the usual pitfalls of a highly dynamic language are likely to be felt more acutely as program size increases.
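The defaulting rule for uninitialized compound variables can be mimicked in a few lines; this Python sketch (the Stem class is my invention) yields the derived name in uppercase for any unassigned tail:

```python
class Stem(dict):
    """Mimic a Rexx compound-variable stem: an unassigned tail evaluates
    to the derived name in uppercase (x.foo -> 'X.FOO'), while assigned
    tails behave as an ordinary associative table."""
    def __init__(self, name):
        super().__init__()
        self.name = name

    def __missing__(self, key):
        return "{}.{}".format(self.name, key).upper()
```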

  
6.9.6 Java

There are many more languages which could be compared with SETL as high-level data processing languages, but let us conclude by reflecting on one that is really of intermediate level but has gained much currency in recent years: Java. This is a language which took advantage of the widespread popularity of C++ to make an immediate appeal to programmers. It is backed by huge financial resources, is well documented and widely implemented, and (perhaps most important of all) has APIs for a great variety of application domains defined for it.

The brilliance of Java as a language is really its reductionism. Its syntax is very thin--it doesn't have the operator declarations of C++, Algol 68, or SETL, and certainly nothing close to SETL's set and tuple formers. Its object inheritance facilities are also very streamlined. For this, however, it probably deserves more praise than criticism. A class can only be derived from one other class, but it can also be derived from any number of interfaces, which are purely abstract descriptors with no data of their own. This eliminates the troublesome questions that arise in general multiple inheritance, such as what happens when a base class with data occurs more than once in the hierarchy of ancestors, yet the Java model still allows an object to play multiple roles, such as identifying itself as a party interested in several different kinds of event.

Everything is an object in Java, except for a few built-in types of scalar value. This is a less welcome aspect of its reductionism, as it encourages the creation of aliases instead of independent values (cf. Sections 5.3.5 [On Aliases] and 6.7.1 [Pointers]). Its threads encourage the asynchronous sharing of variables, which is worse, as outlined in Section 6.7.3 [Threads and Fine-Grained Concurrency].

Java's approach to the handling of exceptions may seem a little severe in its tendency to force programmers to consider errors they might not be interested in, but this is certainly to be preferred to the approach which doesn't allow errors to be caught at all, especially in long-running programs intended for public use.

The best thing to do with Java is probably to follow the lead of the Intermetrics corporation in adapting a better language, Ada 95 in that company's case, to Java's magnificent suite of APIs, and use the Java language as the mid-level, symbolic form of what is undeniably a highly portable machine language, Java Byte Code. The maintainers of Rexx have also taken this approach. JPython [115] goes even farther, being an entirely new implementation of Python written in Java and targeted to the Java run-time environment. Each of these systems takes advantage of and reflects Java package structure in some way, and SETL would do well to follow suit.


David Bacon
1999-12-10