Next: 2.3 Sequential and Direct-Access I/O Up: 2. Environmentally Friendly I/O Previous: 2.1 Invocation Environment


2.2 Filters, Pipes, and Pumps

One of the great contributions of Unix to the world of data processing was its tendency to encourage the use of small, modular processes that can be connected together in pipelines, also known as filter chains, where the standard (default) output stream of one process would feed into the standard input of another, the whole chain thus forming a larger module whose standard input and output would be available for further redirection to any file, device, or process. Programs designed to act as pipeline elements are naturally called filters. Filtering is usually a one-shot event in the sense that a filter will typically read all its input and then quit, having produced some output along the way. All Unix shells in popular use have the same basic syntax for connecting filters into pipelines: a vertical bar between two simple command (program) invocations indicates that the standard output of the first is to be passed into the standard input of the second. For example,

cat *.txt | fmt -60 | wc
concatenates all files with a .txt suffix in the current directory into a stream which is fed into a command, fmt, that treats its input as text to be left-justified in paragraphs not to exceed 60 characters in width. The output of fmt in turn passes into wc, whose output is simply a report of the number of lines, words, and characters in its input. If this pipeline had been typed in at an interactive text command shell, with its output left attached to the display, the output of wc, a single line in this case, would simply be displayed as text looking something like this:

89 378 1429
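For readers who want to see the plumbing a shell sets up for such a pipeline, the following Python sketch chains processes by hand with the standard subprocess module. It is an illustration only: the helper name run_pipeline is hypothetical, the stages in the usage note (tr and sort) are chosen for the example and do not come from the text, and a real shell does considerably more.

```python
import subprocess
import threading


def run_pipeline(stages, data):
    """Run argv lists in `stages` as a pipeline, like `a | b | c` (hypothetical helper)."""
    procs = []
    for i, argv in enumerate(stages):
        # The first stage reads from a pipe we fill; later stages read the
        # previous stage's standard output.
        stdin = subprocess.PIPE if i == 0 else procs[-1].stdout
        proc = subprocess.Popen(argv, stdin=stdin, stdout=subprocess.PIPE)
        if i > 0:
            # Drop the parent's copy so end-of-file propagates down the chain.
            procs[-1].stdout.close()
        procs.append(proc)

    def feed():
        procs[0].stdin.write(data)
        procs[0].stdin.close()  # signals end-of-file to the first filter

    # Feed in a separate thread so that reading and writing cannot deadlock.
    feeder = threading.Thread(target=feed)
    feeder.start()
    output = procs[-1].stdout.read()
    feeder.join()
    for proc in procs:
        proc.wait()
    return output
```

For instance, run_pipeline([["tr", "a-z", "A-Z"], ["sort"]], b"b\na\n") passes two lines through an upper-casing filter and then a sorting filter.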

It is also possible for programs to engage other programs as child processes with communication arrangements such that the parent's I/O handle is connected to the standard input and/or standard output of the child. The unidirectional channels, called pipes, are typically used by programs to read or write some data through a filter in situations where it is convenient for the parent to execute several I/O operations in the course of reading or writing the data.

The bidirectional case is the pump stream, where a program's I/O handle is connected to both the standard input and standard output of the child process. There is generally some sort of protocol involved in this case, and the parent-child interaction will often span considerable real time. We will see many examples of this in Chapter 3 [Internet Sockets], where processes called servers deal with clients only through child processes, each of which normally exists for the duration of a client connection. Pump streams are also useful outside the context of networks, as for example when a program spawns local GUI (graphical user interface) processes.

2.2.1 Filters

Because SETL is good at handling strings, it is convenient to use it both for processing strings and for passing them to other programs. The SETL statement

output := filter (cmd, input);
causes cmd to be submitted to the standard Unix 98 [154] ``shell'' command language interpreter, sh, through its -c (command) argument. The command specified by cmd, which may internally contain pipeline and other I/O redirection indicators, is run in a child process. The string input, which defaults to the null string, is fed into this child's standard input, and the string output receives everything that issues from its standard output stream.

If filter is unable to create a child process due to resource exhaustion, it returns om. When the string input is non-null, two child processes may need to be created: one to run the command cmd, and one to feed input into the command. The parent SETL process remains as the ``consumer'' that builds a string containing the command's output, to be returned to the caller of filter.
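The behavior just described can be approximated outside SETL. The Python sketch below is an analogue, not SETL's implementation: the name filter_like is hypothetical, None stands in for om, and only a failure to start the child is mapped to that value.

```python
import subprocess


def filter_like(cmd, input_text=""):
    """Rough analogue of SETL's filter(cmd, input); the name is hypothetical."""
    try:
        # Hand the command to sh through its -c argument, feed the input
        # string to the child's standard input, and collect its standard output.
        result = subprocess.run(
            ["sh", "-c", cmd],
            input=input_text,
            stdout=subprocess.PIPE,
            text=True,
        )
    except OSError:
        return None  # assumption: plays the role of om when no child can be created
    return result.stdout
```

Here subprocess.run interleaves feeding and collecting internally, which is the job the text assigns to a separate feeder process. A call such as filter_like("tr a-z A-Z", "hello\n") returns the upper-cased string, and cmd may itself contain a pipeline because it is interpreted by sh.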

The following is a simple example of the use of filter to format and left-justify text so that it fits within a prescribed width such as might be imposed by a user's text window. The program in Section A.33 [vc-ptz.setl] uses this technique. This subroutine runs an external command, fmt, to insert end-of-line characters in the appropriate places:

proc fill_message (text, width);              -- wrap text
  return filter (`fmt -' + str width, text);
end proc;
In this example, str is used to convert the presumed positive integer parameter width to a decimal string, which is appended to `fmt -' to form the whole command including the command-line parameter. The string text is filtered through fmt, and the formatted result is returned.

2.2.2 Pipes

A unidirectional stream connected to the standard input or standard output of a child process is called a pipe.

In SETL, here is how to start an external command as a child process and open an input pipe stream connected to its standard output:

fd := open (cmd, `pipe-from');   -- or `pipe-in'
To launch an external process with an output pipe stream connected to its standard input, use
fd := open (cmd, `pipe-to');   -- or `pipe-out'
In both of these cases, cmd can be any string that makes sense to the environmental command interpreter (the shell), just as for filter.

The stream handle returned by open in the above prototypes is assigned to the variable fd, as a mnemonic for file descriptor. The SETL programmer should treat it as opaque and certainly never do arithmetic on it, but may wish to be aware, especially when setting up communication with programs written in languages other than SETL, that the SETL file descriptor is exactly the integer that is assigned by the kernel as the result of open and related calls, and is called a file descriptor throughout the Unix literature. SETL implementations are expected to provide buffering over this handle, as detailed in Section 2.2.4 [Buffering]. If open fails to create a child process due to resource exhaustion, then it returns om instead of a valid file descriptor.

Below is an example of the use of `pipe-from', where an input pipe stream is connected to the standard output of the Unix ls command to obtain a list of files in the current working directory, one filename per line. To each filename read from the ls process, the SETL program applies the fsize operator to discover the size of the named file in bytes, and prints the resulting integer right-justified in a 10-character field beside the left-justified filename, separated by a space:

fd := open (`ls', `pipe-from');           -- open file-listing subprocess
while (name := getline fd) /= om loop     -- loop for each input name
  print (whole (fsize name, 10), name);   -- print file size and name
end loop;
close (fd);                               -- close child process
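The same pattern of reading a child's standard output one line at a time can be sketched in Python. This is an analogue under stated assumptions, not a translation of SETL's internals: list_sizes is a hypothetical helper, and os.path.getsize stands in for the fsize operator.

```python
import os
import subprocess


def list_sizes(directory="."):
    """Return 'size name' rows for the files that ls reports in `directory`."""
    # Start ls through the shell with its standard output attached to a pipe.
    child = subprocess.Popen(
        ["sh", "-c", "ls"], cwd=directory, stdout=subprocess.PIPE, text=True
    )
    rows = []
    for line in child.stdout:               # one filename per line
        name = line.rstrip("\n")
        size = os.path.getsize(os.path.join(directory, name))
        rows.append(f"{size:>10} {name}")   # size right-justified in 10 columns
    child.stdout.close()
    child.wait()                            # reap the child process
    return rows
```

Calling print("\n".join(list_sizes())) would display the listing for the current working directory.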

Here is an example of `pipe-to', where the SETL program opens a stream to a print spooler, lpr:

log_fd := open (`lpr', `pipe-to');
printa (log_fd, `Log begins at', date);

There are also primitives named pipe_from_child and pipe_to_child which are essentially degenerate forms of the pump primitive described in Section 2.2.3 [Pumps].

2.2.3 Pumps

An external command can be started as a child process with its standard input and output connected to a bidirectional stream in the parent SETL program as follows:

fd := open (cmd, `pump');
Even without the direct appearance of sockets, this is a powerful tool for distributed computing, because the string cmd can specify an invocation of rsh to execute, for example, the spitbol command on a remote host even if the local one doesn't have spitbol executably installed, or wishes to distribute its load.

Sometimes, instead of starting an external command as the specification of a child process, it is convenient to create the child as a clone of the currently executing SETL program. The new nullary primitive pump creates a child which inherits a copy of the parent SETL program's data space in the manner of fork (see Section 2.17.2 [Processes]). If successful, it returns -1 in the child and returns a bidirectional file descriptor in the parent, connected to the standard input and output of the child. If unsuccessful due to resource exhaustion, it returns om, as does the `pump' mode of open above:

fd := pump();   -- the optional ``()'' suggests more than a mere fetch
This fragmentary code template shows how to use pump:
fd := pump();   -- spawn clone
if fd = -1 then
  -- Child:  I/O on stdin and stdout, which are connected to parent's fd
  stop;   -- normal exit
end if;
if fd /= om then
  -- Parent:  I/O on fd until EOF tells us the child has completed
  close (fd);   -- clear child from process table
else
  -- Child process could not be created--handle or ignore failure
end if;
The reason it usually works best to put the child code first is that it is a program in miniature, exiting just before the end of the block that contains it, whereas the parent is most likely to save the new fd and carry on. It would be clumsy to have the parent's code section end with a branch around the child's code. An exception to this rule is where that branch is really a return, as in this generic launcher:
proc start_helper (helper);   -- launch helper and return its pump fd
  fd := pump();               -- spawn clone
  if fd = om then
    -- No child created
    return om;                -- failure return
  elseif fd /= -1 then
    -- Parent process, with fd connected to child
    return fd;              -- caller will use and then close fd
  end if;
  -- Child process, with stdin and stdout connected to parent
  call (helper);          -- indirect call to the helper procedure
  stop;                   -- guard against helper neglecting to exit
end proc;
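To make the clone-and-connect idea concrete, here is a Python sketch of a pump-like primitive built from fork and a socket pair. It rests on assumptions the text does not state: the use of socketpair for the bidirectional channel is a guess at one possible mechanism, None stands in for om, and the helper names pump_like and shout are hypothetical.

```python
import os
import socket


def pump_like():
    """Fork a clone; return -1 in the child, a socket in the parent, None on failure."""
    parent_end, child_end = socket.socketpair()
    try:
        pid = os.fork()
    except OSError:
        parent_end.close()
        child_end.close()
        return None                       # assumption: plays the role of om
    if pid == 0:
        parent_end.close()
        os.dup2(child_end.fileno(), 0)    # child's stdin  <- channel
        os.dup2(child_end.fileno(), 1)    # child's stdout -> channel
        child_end.close()
        return -1
    child_end.close()
    return parent_end


def shout(message):
    """Have a forked child upper-case `message` (bytes) and send it back."""
    fd = pump_like()
    if fd == -1:
        # Child: a program in miniature, doing I/O on stdin and stdout.
        data = b""
        while chunk := os.read(0, 4096):
            data += chunk
        os.write(1, data.upper())
        os._exit(0)                       # exit without flushing inherited buffers
    if fd is None:
        return None                       # no child could be created
    with fd:
        fd.sendall(message)
        fd.shutdown(socket.SHUT_WR)       # end-of-file for the child's input
        reply = b""
        while chunk := fd.recv(4096):     # read until the child has finished
            reply += chunk
    os.wait()                             # clear the child from the process table
    return reply
```

As in the SETL template, the child's code comes first and ends by exiting, while the parent carries on with the returned handle.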
One program which uses pump is the second version of impatient.setl in Section 3.3.1 [Time-Monitoring Server].

2.2.4 Buffering

For output to files, print spoolers, and one-shot filters, the SETL programmer may never need to be aware of buffering, but when processes are interconnected through pipes and pumps, there are times when it will be necessary to tell the I/O system explicitly to move all data currently accumulated in a stream buffer out to the receiver. This is done by the following call:

flush (fd);        -- get the kernel caught up
One feature of the pump stream, whether created by pump or by open specifying mode `pump', is that its output side is automatically flushed whenever a read from its input side is attempted. In fact, this is true for all bidirectional and direct-access streams created by SETL, such as the ones listed in Section 2.3 [Sequential and Direct-Access I/O] and the socket streams introduced in Chapter 3 [Internet Sockets].

This automatic flushing association between the input and output sides of a bidirectional stream is called tying, and can also be requested between any otherwise independent pair of streams where one is open for input and the other for output, using the call:

tie (fd_in, fd_out);   -- autoflush fd_out on each fd_in input try
Thus it is common to see the statement
tie (stdin, stdout);   -- autoflush stdout on each stdin input try
near the beginning of SETL programs intended to be invoked through the `pump' mode of open. This association is made automatically in the child process arising from a successful pump call. A program intended merely as a filter, by contrast, will not tie stdin to stdout if it wishes to operate line by line internally and yet remain buffer-efficient.
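The effect of tying can be imitated in a few lines of Python. The sketch below is an analogue only: the class TiedStream and the helper echo_once are hypothetical names, and cat -u is used as a child that is assumed to copy each line back without delay.

```python
import subprocess


class TiedStream:
    """Pair a reader with a writer so every read first flushes the writer."""

    def __init__(self, reader, writer):
        self.reader = reader
        self.writer = writer

    def write(self, data):
        self.writer.write(data)     # data may sit in the user-space buffer

    def readline(self):
        self.writer.flush()         # autoflush output on each input attempt
        return self.reader.readline()


def echo_once(message):
    """Send one newline-terminated line (bytes) to `cat -u` and read it back."""
    child = subprocess.Popen(
        ["cat", "-u"], stdin=subprocess.PIPE, stdout=subprocess.PIPE
    )
    tied = TiedStream(child.stdout, child.stdin)
    tied.write(message)
    reply = tied.readline()         # would block forever without the flush
    child.stdin.close()
    child.wait()
    child.stdout.close()
    return reply
```

Without the flush inside readline, the message would remain in the parent's buffer while the parent waited for a reply that the child could never send, which is exactly the stall that tying is meant to prevent.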

Buffering is not part of the SETL language specification, but implementations are expected to make the behavior as much like that of the FILE type in the C stdio library as possible. By default, stderr should be flushed after every character, and other output streams should be ``block buffered'' (meaning only automatically flushed when the buffer fills up) except when connected to a terminal-like device, in which case they should be ``line buffered'' (flushed at least after each output line).

2.2.5 Line-Pumps

There is one more variant of the versatile pump stream, available through the I/O mode `line-pump':

fd := open (cmd, `line-pump');         -- or `tty-pump'

The difference between a line pump and a regular pump is that the environment provided to the child process in the case of the line pump is as much as possible like a line-by-line virtual terminal. Many programs, including the usual Unix shells, govern their behavior according to whether the output model is another program or a user at such a terminal.

Most significantly, the standard C stdio library uses line buffering (in the sense described in Section 2.2.4 [Buffering]) instead of block buffering on the standard output stream when it is connected to a line-by-line terminal, and programs rarely change this default. Hence it is possible to use many ``off-the-shelf'' programs as coöperating child processes even when they were intended as filters, at the cost of assuming something about each such program's implementation, specifically its output flushing policy.
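One way to give a child a terminal-like environment is a pseudo-terminal. The Python sketch below shows that a child started this way believes its standard output is a terminal. Whether SETL's line pump is built on pseudo-terminals is an assumption here, not something stated in the text, and the helper names run_on_pty and child_sees_terminal are hypothetical.

```python
import os
import pty
import subprocess
import sys


def run_on_pty(argv):
    """Run argv with stdin and stdout on a pseudo-terminal; return its output bytes."""
    master, slave = pty.openpty()
    child = subprocess.Popen(argv, stdin=slave, stdout=slave)
    os.close(slave)                 # only the child keeps the terminal side open
    chunks = []
    while True:
        try:
            chunk = os.read(master, 1024)
        except OSError:             # Linux reports an error once the child is gone
            break
        if not chunk:               # other systems report plain end-of-file
            break
        chunks.append(chunk)
    os.close(master)
    child.wait()
    return b"".join(chunks)


def child_sees_terminal():
    """Ask a child Python process whether its standard output is a terminal."""
    probe = "import sys; print(sys.stdout.isatty())"
    return b"True" in run_on_pty([sys.executable, "-c", probe])
```

A program using C stdio in the child's position would make the same observation and choose line buffering for its standard output, which is what lets the parent converse with it line by line.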

The line pump is a rather specialized feature and probably best avoided in code intended to be ported easily outside the Unix world, but it can be very handy for setting up automated interactions with programs such as mail clients. For example, I was easily able to expunge several thousand unwanted mail messages that an out-of-control robot recently sent me, by the simple expedient of having a small SETL program invoke the Unix Mail client program and ``type'' the deletion command in response to each message it recognized as being from the robot.

David Bacon